An example of repurposed data

The Enron Email Corpus is an excellent example of 'repurposed' data: generated for one purpose, published, and then re-used for another.

After the Enron affair, the FERC (US Federal Energy Regulatory Commission) published many of the company's documents, including a collection of emails, the Enron Email Corpus.

Of course this was in itself a 'repurposing' - the originators had not intended these emails to be public or to be used to build a case against them.

But what is interesting is that they are now being used by an AI researcher (see the IEEE Intelligent Systems journal; full text of article here). Jennifer Golbeck's paper on "AI and social networks" says: "I research the dynamics of social networks found in online communities and email networks.
I believe that we can analyze these networks to compute useful data about each user's social environment and that we can use the result to develop intelligent user interfaces and inform an understanding of communication patterns.... I'm also looking at networks built from the Enron Email Corpus, a public collection of mailboxes from 150 Enron executives comprising over 500,000 messages and 20,000 unique users....With these data sets, I'm ... investigating trust in social networks. As part of that project, I developed algorithms for computing personalized inferences of trust relationships between individuals in the network, based on trust values on the paths that connect them." Note that Ms Golbeck has no interest in the content of the emails as such: just the relationships they demonstrate.
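
To make the idea concrete, here is a rough sketch in Python of what "just the relationships" might look like in practice. It is my own illustration and not Ms Golbeck's published algorithm: the function names build_graph and infer_trust are invented, the trust values are made up, and the rule used (multiply the trust values along a path and take the best path) is simply an easy stand-in for path-based inference. The first function reads only the From, To and Cc headers and never looks at the message body.

```python
import heapq
from collections import defaultdict
from email import message_from_string
from email.utils import getaddresses


def build_graph(raw_messages):
    """Count messages per (sender, recipient) pair, using headers only."""
    graph = defaultdict(lambda: defaultdict(int))
    for raw in raw_messages:
        msg = message_from_string(raw)
        senders = getaddresses(msg.get_all("From", []))
        recipients = getaddresses(
            msg.get_all("To", []) + msg.get_all("Cc", [])
        )
        for _, sender in senders:
            for _, recipient in recipients:
                if sender and recipient and sender.lower() != recipient.lower():
                    graph[sender.lower()][recipient.lower()] += 1
    return graph


def infer_trust(trust, source, sink):
    """Toy inference rule (an assumption, not the paper's method).

    `trust` maps person -> {neighbour: value between 0 and 1}.
    Trust along a path is the product of its edge values, and the
    inferred trust is that of the best path from source to sink.
    Returns None when no path connects the two people.
    """
    best = {source: 1.0}
    heap = [(-1.0, source)]  # max-heap via negated values
    while heap:
        neg_value, node = heapq.heappop(heap)
        value = -neg_value
        if node == sink:
            return value
        if value < best.get(node, 0.0):
            continue  # stale heap entry
        for neighbour, edge_value in trust.get(node, {}).items():
            candidate = value * edge_value
            if candidate > best.get(neighbour, 0.0):
                best[neighbour] = candidate
                heapq.heappush(heap, (-candidate, neighbour))
    return None


if __name__ == "__main__":
    # Hypothetical hand-assigned trust values, for illustration only.
    trust = {"a": {"b": 0.9, "c": 0.5}, "b": {"d": 0.8}, "c": {"d": 0.9}}
    print(infer_trust(trust, "a", "d"))
```

In the made-up example, a has never rated d directly, so the estimate comes from the people in between: the route through b wins over the route through c. How real trust values would be obtained from a corpus like Enron's is a separate question that this sketch does not try to answer.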

Ms Golbeck has researched the subject thoroughly. Her blog refers to a list of 'every social network on the web': she has looked at 145 networks with 254m members.

See my earlier posts on massive databases and the alternative uses they might have. We are collecting enormous datasets, which offer great opportunities for repurposing, whether for research, for simulation, or for research using simulation. The fascinating thing is that these datasets are being collected almost accidentally, often for some limited short-term purpose: we need intelligent researchers to find new uses for them.

(I don't approve of mass surveillance, but since we seem to be stuck with it, we might as well use the data sensibly.)

