Take a look at this line in the code:

> mapreduce::plugin('anonymizer/tor/plugin/onion') =

So they are using mapreduce to analyze this traffic on a cluster. Interesting. This gives me an idea of what they might be doing here. Suppose they gather all the data up and measure term frequencies coming from a given IP address, say:

IP: 1.2.3.4
Usage of term 'linux': 5
Usage of term 'tor': 10
Usage of term 'secure desktop': 20

They should be able to feed this into machine learning algorithms. That is generally the ultimate goal of datasets this large, and generally what mapreduce is used for on large clusters: getting the data into a shape that can be fed to machine learning. So they might collect data on you, but not classify you as an extremist until that data has passed through a machine learning process.

I'm not sure what they would use for training data. Maybe during the first few years of operation they collected IP addresses of people who were convicted of cybercrime and/or terrorism, and fed those through the system. Then the machine learning algorithm they chose can parse all the keywords collected on the internet trunk lines and classify people who have a high probability of being criminals.

People are probably going to scream "skynet"/"minority report"/etc, but in reality machine learning isn't at the point of being self-aware and probably never will be. If they have the skills and intelligence required to implement such a system, they are aware of the false positive risk. What they probably do, after someone is identified as a likely risk, is investigate that person manually to see whether it is a valid match and, if not, possibly whether that person is a good candidate for recruitment.

So ultimately, just searching for or visiting Linux Journal isn't going to get you targeted immediately; it is simply one input into a larger targeting system. It probably analyzes people's long-term usage patterns, so if you're looking to be "in the system", clicking the link is probably not the first time you'll get in there. You likely don't want to be identified for targeting, either, so people should stop saying "oooo I want to get on the blacklist!!! z0mg!".

To be honest, I think this idea is brilliant. Invasive and illegal, yes, but brilliant nonetheless.
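
For what it's worth, the per-IP counting step is easy to sketch. This is only my guess at the general shape of a map/reduce term count, not anything taken from the leaked code: the watch list, the function names (`map_record`, `reduce_counts`, `term_frequencies`) and the (ip, text) record format are all made up for illustration, and it runs in one process rather than on a cluster.

```python
import re
from collections import Counter
from itertools import chain

# Hypothetical watch list, using the terms from the example above.
WATCHED_TERMS = ("linux", "tor", "secure desktop")

# One whole-word, case-insensitive pattern per term, so "tor" does not match "store".
PATTERNS = {
    term: re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    for term in WATCHED_TERMS
}


def map_record(record):
    """Map step: emit ((ip, term), 1) for every occurrence of a watched term."""
    ip, text = record  # assumed record format: (source IP, captured text)
    for term, pattern in PATTERNS.items():
        for _ in pattern.findall(text):
            yield (ip, term), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each (ip, term) key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def term_frequencies(records):
    """Run the map step over all records, then reduce the combined output."""
    return reduce_counts(chain.from_iterable(map_record(r) for r in records))
```

Calling `term_frequencies` on a list of (ip, text) pairs gives back a `Counter` keyed by (ip, term), which is exactly the kind of per-address feature table you could then hand to a classifier.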