
A taxonomy of data mining?

The US Office of the Director of National Intelligence has just published a 'Data Mining Report', which outlines its current data mining research activities. This is a fascinating if incomplete survey of the field and hints at a taxonomy.

The report defines data mining as "pattern-based queries, searches or other analyses of one or more databases in order to discover or locate a predictive pattern or anomaly indicative of terrorist or criminal activity..." and adds "the limitation to pattern-based, predictive mining is significant because analysis is often performed using various types of link analysis tools...These tools start with a known or suspected terrorist ... and use various methods to uncover links between that known subject and potential associates....[but] ... such analyses are not pattern-based." (This distinction matters for the DNI, because data mining is defined in this way in the US Data Mining Act, in pursuance of which the report has been published.)

So that's the first distinction: not so much pattern-based, as based solely on a pattern, without any initial specified starting point. Seems rather a big distinction, actually: most of the MI5 analysis in recent cases seems to have revolved around making links to known or suspected individuals. Even here there's a pattern: you're not interested when a terrorist goes to the corner shop for a pint of milk, but you are interested if (say) he hires a car but just drives it to a motorway service station and then back again, especially if another suspect makes a mobile phone call from that service station at that time.
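The service-station example can be sketched as a simple temporal co-location check. All the event records and field names below are invented for illustration; real link analysis would of course run over far messier data:

```python
from datetime import datetime, timedelta

# Hypothetical event records (subject, place, time) -- entirely invented.
events = [
    ("suspect_A", "M1 services", datetime(2008, 3, 1, 14, 5)),   # car stops
    ("suspect_B", "M1 services", datetime(2008, 3, 1, 14, 12)),  # mobile call
    ("suspect_A", "corner shop", datetime(2008, 3, 1, 18, 0)),   # pint of milk
]

def co_located(events, window=timedelta(minutes=15)):
    """Flag pairs of different subjects at the same place within the window."""
    hits = []
    for i, (s1, p1, t1) in enumerate(events):
        for s2, p2, t2 in events[i + 1:]:
            if s1 != s2 and p1 == p2 and abs(t1 - t2) <= window:
                hits.append((s1, s2, p1))
    return hits

print(co_located(events))  # -> [('suspect_A', 'suspect_B', 'M1 services')]
```

Note that this still starts from named subjects, which is exactly why the report would not call it data mining.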

Undoubtedly patterns do exist and can be revealing: see the work of Phillips and Levitt.

The report describes projects funded by the Intelligence Advanced Research Projects Activity (IARPA). IARPA is pursuing 'high-risk, high-payoff' solutions which aim to create "never before seen" capabilities. (I envy them, actually... Just as DARPA built the internet, heaven only knows what IARPA may do for us when their research can be applied in non-paranoid environments. There's a good interview here.) Its 'Incisive Analysis' portfolio includes:

- Knowledge Discovery and Dissemination (KDD): "some tools have been developed to discover patterns associated with deceptive behaviour in groups using an analytic technique called network tomography. This tool looks for deception patterns in large databases. Other tools have been designed for predictive analysis, attempting to identify the next step in an emerging pattern, and hypothesis generation, seeking to provide possible explanations for observed anomalous patterns."

- Tangram: "seeking to demonstrate the ...value... of a semi-autonomous terrorist threat assessment system...a warning system must also provide warnings where 1. the data are sparse, incomplete or erroneous, and 2. the threats are assessed across multiple lines of enquiry that individually would not reveal an entity's threat likelihood. Pattern-based data mining methods have proven effective at compensating for common data issues, and fusing multi-sensor data...a significant aspect of Tangram's research is discovering highly reliable threat patterns and statistics that will provide reliable warnings..." Interestingly, Tangram has so far only worked with "fabricated simulations of real intelligence data", though it is intended to give it real data in the future.

- Video Analysis and Content Extraction (VACE): "not a data mining project per se, but...developing advanced video searching capabilities that could involve looking for particular patterns that might indicate a broadcast of terrorist events (eg bombings, beheadings)....[Also]... a video event manager, that allows analysts to find a particular event within a video, such as an event which has a possible security significance, for example a person is observed entering a restricted area..."

- Proactive Intelligence (PAINT): "studies the dynamics of complex intelligence targets, such as terrorist organisations, and employs models of causal relationships that are designed to increase analyst efficiency... does not specifically aim to uncover patterns... "

- Reynard: small seedling project to "identify the emerging social, behavioural and cultural norms in virtual worlds and gaming environments...the project would then apply the lessons learned to determine the feasibility of automatically detecting suspicious behaviour and actions in the virtual world...". (Does this mean that real-world terrorists would betray themselves by the way they behaved in 'Second Life'? Or that the decay of a society might be identified from shifting national behaviours? To be fair, it is only a research project.)

Most of the programmes draw data from the Research and Development Experimental Collaboration (RDEC) Network, which includes "a number of classified databases containing lawfully collected foreign intelligence information". Not much has been written about this (only 28 entries on Google) and some of it is in the form of CV entries, and acknowledgements of funding or collaboration, which illuminate odd corners of the project.

Does this amount to a data mining taxonomy? Well,
1. it distinguishes arbitrarily between data mining techniques that start from a given point (a known suspect) and those that look purely for a pattern, existing anywhere on the database.
2. deception analysis: a subtle concept, because what you are saying is that the facts are there, but they are misleading or wrong. Too much of this and you have the Galileo effect.
3. predictive analysis: using patterns to look ahead (see my earlier posting about Tangram and Bayes.)
4. hypothesis generation: another subtle one: this pattern did not work out as we expected, so let's try to guess why. (The 'Silver Blaze' approach.)
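Point 3 can be illustrated with a toy Bayesian update of the kind a Tangram-style warning system presumably performs when fusing weak indicators. The prior and the likelihood numbers below are entirely invented:

```python
def bayes_update(prior, likelihood_given_threat, likelihood_given_benign):
    """Posterior probability of 'threat' after observing one indicator."""
    joint_threat = prior * likelihood_given_threat
    joint_benign = (1 - prior) * likelihood_given_benign
    return joint_threat / (joint_threat + joint_benign)

# Invented numbers: a weak prior, sharpened by two sparse indicators.
p = 0.01                       # prior probability of threat
p = bayes_update(p, 0.8, 0.1)  # indicator 1: common in threats, rare otherwise
p = bayes_update(p, 0.6, 0.2)  # indicator 2: weaker evidence
print(round(p, 3))             # -> 0.195
```

The point of the sketch: neither indicator alone "would reveal an entity's threat likelihood", but chained together they move the posterior a long way from the prior.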

No reference to 'sparse relation spaces' - which would seem ideal for analysing a collection of datasets like RDEC - but this may be because these start from a known subject.

If you look at Toby Segaran's Programming Collective Intelligence book there is an interesting implicit comparison with the uses made of data mining in the commercial world, eg
- spotting similarities between your views and someone else's, eg film critic recommendations. This works by comparing each critic's pattern of likes and dislikes to your own. (It would be interesting if a US analyst could model the search methods of a different analyst, say a British or Soviet analyst, and have the option to say 'I can't see a pattern here, but what might they make of this data?'.)
- optimisation algorithms: can you look for sets of people who are well-optimised for certain purposes?
- matching people in on-line dating sites: look for people who don't appear to know each other, but ought to!
- building decision trees to model the way decisions are made: might be nice to 'grow' patterns of terrorist activity rather than trying to deduce them
- evolving programmes to solve problems.
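The first item, critic matching, is the opening example of Segaran's book: at its core it is just a similarity score between two people's rating vectors, typically Pearson correlation. A minimal sketch, with invented ratings:

```python
from math import sqrt

# Invented ratings on a 1-5 scale, keyed by film title.
ratings = {
    "you":    {"Film A": 5, "Film B": 1, "Film C": 4},
    "critic": {"Film A": 4, "Film B": 2, "Film C": 5, "Film D": 3},
}

def pearson(p1, p2, prefs):
    """Pearson correlation over the films both people have rated."""
    shared = [f for f in prefs[p1] if f in prefs[p2]]
    n = len(shared)
    if n == 0:
        return 0.0
    x = [prefs[p1][f] for f in shared]
    y = [prefs[p2][f] for f in shared]
    sx, sy, sxy = sum(x), sum(y), sum(a * b for a, b in zip(x, y))
    sxx = sum(v * v for v in x)
    syy = sum(v * v for v in y)
    den = sqrt((sxx - sx * sx / n) * (syy - sy * sy / n))
    return (sxy - sx * sy / n) / den if den else 0.0

print(pearson("you", "critic", ratings))  # close to 1: similar tastes
```

Nothing here is secret: swap 'films' for 'search targets' and you have the analyst-modelling idea above.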

In addition, no reference that I can see to simulation.

Firstly, I would have thought that there might be value in creating simulated terrorist agents and asking them what they would do if let loose in the RDEC world: ie what sort of people would they contact, what would their psychological needs be, what would they do for money, etc. Perhaps this is being done elsewhere. JASSS includes a data-simple if computationally sophisticated example.
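A cartoon of this first idea, with entirely invented agent rules: each simulated agent has a set of needs (money, contacts, materials), acts greedily on the most pressing one, and leaves a trace of actions. It is that trace, not the agent, that you would then go looking for in the real data:

```python
import random

class Agent:
    """Toy needs-driven agent -- the rules are invented for illustration."""
    def __init__(self, seed):
        self.rng = random.Random(seed)  # seeded for repeatable runs
        self.needs = {"money": 5, "contacts": 3, "materials": 4}
        self.trace = []

    def step(self):
        need = max(self.needs, key=self.needs.get)  # most pressing need
        action = {"money": "odd job / transfer",
                  "contacts": "meet associate",
                  "materials": "purchase"}[need]
        self.trace.append(action)
        # Acting on a need partially satisfies it.
        self.needs[need] = max(0, self.needs[need] - self.rng.randint(1, 3))

agent = Agent(seed=42)
for _ in range(5):
    agent.step()
print(agent.trace)
```

Real agent-based work of this sort (eg in JASSS) uses far richer behavioural models, but the output has the same shape: a synthetic behavioural signature to match against.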

Secondly, JASSS also has a Levitt-like paper showing how a simulation model can reveal underlying changes in real events. ("Comparison of simulated histories [..of the Eurovision Song Contest..] with the actual history of the contest allows the identification of statistically significant changes in patterns of voting behaviour.."). Eurovision is of course an activity with simple known and finite rules, but I like this paper.


