As many learned for the first time earlier this year, when popular outrage forced Facebook and Google to reveal publicly just how much valuable personal data they harvest from their users, tech companies know almost everything about us, including the establishments we frequent, the stuff we buy and the people we know. And in the latest example of just how much detail is embedded in our social media profiles without our knowledge, researchers at University College London and the Alan Turing Institute have demonstrated that they can identify a Twitter user with a staggering 96.7% accuracy using only their tweets and publicly available metadata run through a machine-learning algorithm.
Users who occasionally tweet anonymously should take note. In their study, the researchers found that even their most basic algorithm could correctly pick out an individual user from a group of 10,000, using just 14 pieces of metadata from their posts on Twitter, 96.7% of the time. Furthermore, attempts to obscure users’ identities by tampering with the data were remarkably ineffective: the researchers could still identify users with more than 95% accuracy when 60% of their metadata had been tampered with. When the researchers broadened their scope to the 10 most likely candidates, the algorithm’s accuracy rose to 99.2%. A single tweet contains 144 fields of metadata, according to RT.
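The difference between the 96.7% and 99.2% figures comes down to how a guess is scored: either the algorithm's single best guess must be right, or the true user only has to appear somewhere among its top 10 candidates. The short Python sketch below illustrates that scoring rule on a made-up example. The function name `top_k_accuracy`, the user names and the rankings are all hypothetical and are not taken from the study.

```python
def top_k_accuracy(ranked_candidates, true_users, k):
    """Fraction of cases where the true user is among the first k candidates."""
    hits = sum(
        1
        for ranking, truth in zip(ranked_candidates, true_users)
        if truth in ranking[:k]
    )
    return hits / len(true_users)


# Hypothetical output of some classifier: for each tweet, candidate
# authors ordered from most likely to least likely.
rankings = [
    ["alice", "bob", "carol"],
    ["bob", "alice", "carol"],
    ["carol", "alice", "bob"],
    ["bob", "carol", "alice"],
]
# Hypothetical ground truth: who actually wrote each tweet.
truth = ["alice", "alice", "carol", "alice"]

top1 = top_k_accuracy(rankings, truth, 1)  # only the best guess counts
top2 = top_k_accuracy(rankings, truth, 2)  # best two guesses count
top3 = top_k_accuracy(rankings, truth, 3)  # any of the three counts

print(top1, top2, top3)
```

Widening the candidate list can only raise the score, which is why the top-10 figure is higher than the single-guess figure.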
“That’s the mentality with metadata,” the study’s lead co-author Beatrice Perez of University College London told Wired. “People think it’s not a big deal.”
The study’s findings have major implications for data privacy, as the researchers explain in their introduction:
Previous work shows that the content of a message posted on an OSN platform reveals a wealth of information about its author. Through text analysis, it is possible to derive age, gender, and political orientation of individuals (Rao et al. 2010); the general mood of groups (Bollen, Mao, and Pepe 2011) and the mood of individuals (Tang et al. 2012). Image analysis reveals, for example, the place a photo was taken (Hays and Efros 2008), the place of residence of the photographer (Jahanbakhsh, King, and Shoja 2012), or even the relationship status of two individuals (Shoshitaishvili, Kruegel, and Vigna 2015). If we look at mobility data from location-based social networks, the check-in behavior of users can tell us their cultural background (Silva et al. 2014) or identify users uniquely in a crowd (Rossi and Musolesi 2014). Finally, even if an attacker only had access to anonymized datasets, by looking at the structure of the network someone may be able to re-identify users (Narayanan and Shmatikov 2009).
The study’s goal was “to determine if the information contained in users’ metadata is sufficient to fingerprint an account,” and it showed that even rudimentary algorithms had high success rates when it came to correctly identifying users. During the study, the researchers took metadata such as the date the account was created, its followers, the accounts it follows and the tweets it likes, and ran them through three different machine-learning algorithms. This method, according to RT, could be used to identify an account if a user changes their name or creates multiple accounts, or to tell if a legitimate account has been hacked.
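To give a rough sense of what fingerprinting an account from metadata means in practice, here is a toy Python sketch. It is not the researchers' code and does not use their algorithms or data: it assumes four invented numeric fields per account, generates synthetic values, and uses a simple nearest-neighbour rule as a stand-in for the three unnamed algorithms. The helper names `make_profile`, `observe` and `identify` are hypothetical.

```python
import math
import random


def make_profile(rng):
    # Hypothetical metadata for one account, e.g. follower count,
    # following count, likes and account age. Values are synthetic.
    return [rng.uniform(0, 1000) for _ in range(4)]


def observe(profile, rng, noise=5.0):
    # A later snapshot of the same account: each value drifts slightly.
    return [value + rng.gauss(0, noise) for value in profile]


def identify(sample, profiles):
    # Nearest-neighbour rule: pick the known account whose stored
    # metadata is closest to the observed sample.
    return min(profiles, key=lambda user: math.dist(sample, profiles[user]))


rng = random.Random(0)  # fixed seed so the run is repeatable
profiles = {f"user{i}": make_profile(rng) for i in range(200)}

# Try to re-identify every account from a noisy snapshot of its metadata.
correct = sum(
    identify(observe(profile, rng), profiles) == user
    for user, profile in profiles.items()
)
accuracy = correct / len(profiles)
print(f"re-identified {accuracy:.1%} of {len(profiles)} synthetic accounts")
```

Even this crude rule does well on the synthetic data, because a handful of numeric fields is usually enough to separate a few hundred accounts. The real study worked with real Twitter metadata and a far larger pool of users.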
While the researchers used Twitter for their data, they warned that “the methods presented in this work are generic and can be applied to a variety of social media platforms.”