As more organizations look to leverage so-called big data, there is a growing risk that the datasets they create could be misused by internal staff or by hackers.
To combat this, many companies try to “anonymize” the data by making its values vaguer. In addition to deleting names, they apply “binning,” which creates discrete bins that each correspond to a range of values and assigns the records to those bins. That might change the time of a retail purchase from a day into a week, and a store location into a general region.
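As a rough illustration, binning amounts to coarsening each field before the data is released. The sketch below is hypothetical: the field names (timestamp, store_region and so on) and the week/region granularity are assumptions made for illustration, not the schema used by the bank in the study.

```python
from datetime import datetime

def bin_transaction(record):
    """Coarsen one raw transaction into bins (hypothetical field names).

    The exact purchase timestamp collapses to an ISO year/week bucket, and
    the specific store is replaced by its broader region.
    """
    ts = datetime.fromisoformat(record["timestamp"])
    year, week, _ = ts.isocalendar()              # day -> week bin
    return {
        "user_code": record["user_code"],         # pseudonymous code; the name is already gone
        "week": f"{year}-W{week:02d}",
        "region": record["store_region"],         # store address -> general region
        "amount": record["amount"],
    }

# Example: a purchase on a specific Wednesday at a specific store becomes
# "some week, some region" -- vaguer, but (per the study) not vague enough.
raw = {"user_code": "u1", "timestamp": "2014-03-05T14:30:00",
       "store_region": "Boston metro", "amount": 12.50}
print(bin_transaction(raw))
```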
But according to a report from researchers at the Massachusetts Institute of Technology (MIT), who examined three months of credit card transactions from an unnamed source, “binning” may not be enough to hide the identity of people in the data.
The researchers, who published their results in the latest issue of Science magazine, found that four dates and locations of recent purchases are all that is needed to identify 90 per cent of people making the purchases. If price information is included, then only three transactions are necessary.
The study used anonymized data on 1.1 million people and transactions at 10,000 stores. The bank had stripped away names, credit card numbers, shop addresses, and even the exact times of the transactions, said the magazine’s synopsis. All that was left were the metadata: amounts spent, shop type — a restaurant, gym, or grocery store, for example — and a code representing each person. More than 40 per cent of the people could be identified with just two data points, it says, while five purchases identified nearly everyone.
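The headline figures describe how few of those coarse points it takes to make a person unique in the dataset. The sketch below shows one simplified way to estimate that kind of uniqueness figure; it is not the researchers’ code, and the tuple layout (user code, week, region) is an assumption made for illustration.

```python
import random
from collections import defaultdict

def fraction_unique(transactions, k):
    """Estimate the share of people pinned down by k of their own binned points.

    transactions: iterable of (user_code, week, region) tuples, already binned.
    A person counts as identified when nobody else's records also cover the
    k sampled (week, region) points.
    """
    by_user = defaultdict(set)
    for user, week, region in transactions:
        by_user[user].add((week, region))

    users = list(by_user)
    unique = 0
    for user in users:
        points = by_user[user]
        if len(points) < k:                      # too few records to sample from
            continue
        sample = random.sample(sorted(points), k)
        matches = [u for u in users if all(p in by_user[u] for p in sample)]
        if matches == [user]:                    # exactly one candidate remains
            unique += 1
    return unique / len(users)
```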
How? By correlating the data with outside information. First, researchers pulled random observations about each individual in the data: information equivalent to a single time-stamped photo. These clues were simulated, the report says, but people generate the real-world equivalent of this information day in and day out, through geo-located tweets or mobile phone apps that log location, for example. A computer then used those clues to identify some of the anonymous spenders. The researchers then fed a different piece of outside information into the algorithm and tried again, until every person was de-anonymized.
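In spirit, each outside clue shrinks the pool of pseudonymous codes that could belong to the person being traced. The snippet below is an illustrative sketch of that narrowing loop, reusing the by_user index from the previous sketch; it is not the study’s actual algorithm.

```python
def deanonymize(target_clues, by_user):
    """Narrow the candidate set with successive outside clues (illustrative only).

    target_clues: observed (week, region) points about a real person, e.g.
    drawn from geo-tagged posts and coarsened the same way as the bank data.
    by_user: dict mapping pseudonymous code -> set of that code's binned points.
    """
    candidates = set(by_user)
    for clue in target_clues:
        candidates = {u for u in candidates if clue in by_user[u]}
        if len(candidates) <= 1:                 # stop once a single code (or none) is left
            break
    return candidates
```

Because purchase records this sparse rarely overlap between people, even a handful of such clues typically leaves only one matching code, which is why so few purchases are needed.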
The report is a caution not only to organizations that try to de-personalize data they hold themselves, but also to companies that collect data and resell it to other parties.
“In light of the results, data custodians should carefully limit access to data,” Science quotes Arvind Narayanan, a computer scientist at Princeton University who was not involved with the study.