What is the largest database in existence, and what might you do with it?

According to USA Today, a database collected by the NSA of most telephone calls made in the USA since 2001 is the largest database in the world. A recent article on simulation demonstrates one more technique for getting useful information out of such huge piles of data.

Wikipedia speculates that the database contains details of over 1.9 trillion calls.

According to a 2005 survey by Winter Corporation:
1. Yahoo! has the first system to surpass the 100 TB mark.
2. The largest OLTP system is the Land Registry for England and Wales, at 23.1 TB.
3. UPS achieved a peak workload of 1.1 billion SQL statements per hour.

I'm not sure where this leaves the NSA database. (On a scale from 0 to 10, pretty big.) This sort of power, which would have been unthinkable only a few years ago, is going to revolutionise simulation and modelling, as well as the analysis of reality.

Wikipedia identifies several data mining possibilities for it:
- "using the data to connect phone numbers with names and links to persons of interest.... to organize and view links that are demonstrated through such information as telephone and financial records
- Neural network software is used to detect patterns, classify and cluster data as well as forecast future events.
- Using relational mathematics it is possible to find out if someone changes their telephone number depending on calling patterns."
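To make the last of these a little more concrete, here is a minimal sketch of how calling patterns alone might link a new number to an old one. It is my own illustration, not anything Wikipedia or the NSA describes: it assumes each number can be reduced to the set of numbers it calls, and it uses Jaccard similarity (overlap divided by union) as the measure. The function names, the 0.5 threshold and the sample numbers are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def likely_same_caller(old_contacts, candidates, threshold=0.5):
    """Rank candidate numbers by how closely their contacts match old_contacts.

    old_contacts: numbers the abandoned phone used to call.
    candidates:   dict mapping a candidate number to the numbers it calls.
    threshold:    hypothetical cut-off; a real system would have to tune it.
    Returns (number, score) pairs at or above the threshold, best first.
    """
    scored = [(jaccard(old_contacts, contacts), number)
              for number, contacts in candidates.items()]
    scored.sort(reverse=True)
    return [(number, score) for score, number in scored if score >= threshold]


if __name__ == "__main__":
    old = {"A", "B", "C", "D"}
    new_numbers = {
        "555-0101": {"A", "B", "C", "E"},  # calls mostly the same people
        "555-0102": {"X", "Y"},            # calls nobody in common
    }
    print(likely_same_caller(old, new_numbers))
```

In practice you would presumably weight by call frequency and time of day as well, but even this crude version shows why changing your number does not hide you if you keep calling the same people.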

I'm sure there are far more data mining possibilities than this; imagine what Steven Levitt would make of all that data. I've just been reading an article in JASSS by Derek Gatherer, who uses Monte Carlo techniques to look at data about the Eurovision Song Contest ("Comparison of simulated histories with the actual history of the contest allows the identification of statistically significant changes in patterns...") and confirm that the voting is rigged along national and regional lines. (For US readers: you don't even want to know what the Eurovision Song Contest is. It is trivial and stupid. But Gatherer's article is a good illustration of the use of simulation techniques to analyse data.)
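For anyone who hasn't met the Monte Carlo idea, here is a rough sketch of the general approach. This is not Gatherer's actual method, just a simplified illustration under my own assumptions: each year a country awards 12, 10, 8, 7, 6, 5, 4, 3, 2 and 1 points to ten of the other entrants, and under the "no collusion" hypothesis those ten are picked at random. We simulate many such histories and ask how often chance alone gives one country at least as many points as it actually received from its supposed ally. The function names and the figures in the example are hypothetical.

```python
import random

# Assumed points scale: one voter hands these out to ten other countries.
POINTS = [12, 10, 8, 7, 6, 5, 4, 3, 2, 1]


def simulate_total(n_years, n_other_countries, rng):
    """Points one country gives another over n_years of purely random voting."""
    if n_other_countries < len(POINTS):
        raise ValueError("need at least ten other countries to vote for")
    # Under random voting the recipient is equally likely to land on any
    # slot: one of the ten scoring places, or one of the zero-point places.
    slots = POINTS + [0] * (n_other_countries - len(POINTS))
    return sum(rng.choice(slots) for _ in range(n_years))


def monte_carlo_p_value(observed_total, n_years, n_other_countries,
                        n_sims=10000, seed=0):
    """Fraction of simulated histories scoring at least observed_total."""
    rng = random.Random(seed)
    hits = sum(
        simulate_total(n_years, n_other_countries, rng) >= observed_total
        for _ in range(n_sims)
    )
    return hits / n_sims


if __name__ == "__main__":
    # Hypothetical case: 50 points received over 5 contests, 20 other entrants.
    print(monte_carlo_p_value(observed_total=50, n_years=5, n_other_countries=20))
```

A small result means random voting almost never produces that much generosity, which is the sort of evidence that lets you say a voting pattern is statistically significant rather than luck.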

This blog makes no comment about civil liberties, government actions, etc. But it is ironical that we live in a two-part society as far as individual freedom goes. If you walk, write a letter, or visit, then the intelligence agencies find it physically difficult and expensive to monitor you, and are often obliged to get special authority to do so. If you use email, a phone (mobile or landline), or a car, or if you travel by air or book a ticket with a credit card, they can track you easily and don't need a warrant to do so. The only question is whether they can mine through all that data to make any sense of it. So far the record in the UK at least suggests they can't.



Seems to me that the "largest database in existence" would be an awesome boon to corpus linguistics. CL is using "bodies" of examples from a given language to study the whole language, to find patterns, discern grammars, etc. Listening to phone conversations could be a great exercise for language learners -- if the database could be sorted into different language and dialect categories and then made available to language learners and teachers.

I hadn't heard of corpus linguistics, despite having an English degree (some years ago...), but I have now read the Wikipedia article.

This idea that we might collect up all we actually do (or in this case say) is an interesting one. I posted earlier about a British police database of car movements, and about the Recording Angel. Rather tongue in cheek, I think. But Woody's comment makes me return to the subject.

The fact is that we do now have the technology to see very large areas of human activity in aggregate. In other words, rather than just looking at what I do (or think I do), or what 150 graduate students at the University of Boringsville did during an experiment, we now have the possibility to look at everything everyone did, albeit only in a particular field or location.

The implications for 'social engineering' are large, and also for simulation. On the other hand, we risk flooding ourselves with so much data and metadata that we can't interpret it properly. (Nothing will stop us from interpreting it improperly - see any British government's use of econometrics.)

Suppose - as a 'thought experiment' - we have the 'tapes' of every telephone conversation made in the USA last year, in a form which can easily be analysed by 'audio mining' tools. I can use this to
(a) trace individual actions/words (who mentioned al Qaeda and plastic explosive, so I can send DHS to their door)
(b) trace 'themes' (who talked about a product or a political issue and where, so I can send the advertisers or the political agents round)
(c) trace linguistic techniques (who used which words in which contexts; how long were sentences; how correct was grammar, etc.)
(d) make psychological deductions (after a major disaster, how worried do people become? Did Hurricane Katrina cause more distress in Idaho than Iowa?)

Any other ideas?

