When Trump tweeted last week « … FLOTUS [the first lady] and I tested positive for COVID-19… », the message spread like wildfire. His followers retweeted it nearly a million times and commented on it another half a million times, setting a new personal record for Trump. Interestingly, such events open up big opportunities for social scientists.

Comments on platforms like Twitter are anything but calm words woven into multi-faceted paragraphs. Most of the time, they are short statements. Examples, in this case, included: « Please get well soon! Our country and the world need you Mr. President! », « So what, NOT Such an ‘HOAX’ after all?? » and, ignominiously, « HAHAHAHAHA eat shit ». Leaving aside the debate about the appropriateness of such statements, and the broader debate on hate speech in online forums, one can easily guess the general sentiment of each tweet.

This is an opportunity for social scientists, in particular in political science. As people’s reactions to major events are publicly available, one can investigate them in detail. Years ago, it took costly surveys to gain insights into public opinion, but having learned a thing or two from their friends in the computer science departments, social scientists are now well equipped to investigate it quickly on their laptops. Analysing the sentiment in half a million Twitter comments sounds complicated, but in fact rather simple techniques are used to do so. First, keywords can simply be counted and compared over time. Second, unsupervised machine learning can be used to cluster similar tweets. Last, the words can be looked up in dictionaries like AFINN, which assigns each word a value indicating, for example, how positive or negative it is. Beyond these basic steps, bag-of-words models and a thousand other techniques can be used where tweets are more complicated, for example when they contain negation or irony.
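To make the dictionary approach concrete, here is a minimal sketch in Python. It reuses the three example comments quoted above; the tiny lexicon is made up for illustration and merely stands in for the real AFINN word list, which scores thousands of words from -5 to +5.

```python
# A minimal sketch of keyword counting and AFINN-style dictionary scoring.
# The lexicon below is an illustrative stand-in, not the real AFINN list.
from collections import Counter
import re

SAMPLE_LEXICON = {"well": 2, "need": 1, "hoax": -2, "shit": -4}

tweets = [
    "Please get well soon! Our country and the world need you Mr. President!",
    "So what, NOT Such an 'HOAX' after all??",
    "HAHAHAHAHA eat shit",
]

def tokenize(text):
    """Lower-case the text and split it into plain word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def sentiment_score(text):
    """Sum the lexicon values of every word found in the text."""
    return sum(SAMPLE_LEXICON.get(word, 0) for word in tokenize(text))

# Step one: count keywords across all comments (and compare them over time).
keyword_counts = Counter(word for t in tweets for word in tokenize(t))
print(keyword_counts.most_common(5))

# Step three: score each comment against the dictionary.
for t in tweets:
    print(sentiment_score(t), t)
```

On these three comments, the scores come out clearly positive, negative, and negative, which matches the sentiment one would guess by simply reading them.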

In recent years, there has been a wave of research using data offered by platforms like Facebook, Google, Weibo, WeChat and, in particular, Twitter. Still, there has been a lot of criticism, too. What about data privacy? How can we be sure that the data on Twitter is statistically representative? And how should researchers deal with the fact that platforms grant them only limited access to the underlying data? These are all important questions, but they leave a big one aside, one that seems quite obvious yet has somehow not been part of the academic critique.

As Angela Xiao Wu, assistant professor at NYU, and Harsh Taneja were able to show in a recent paper, data from platforms are not unobtrusive recordings of human behaviour. « Rather, they are direct records of how we behave under platforms’ influence. » The data generated by platforms, for example the number of reactions to a tweet, is used by the company itself to measure its own success. User engagement is important for Twitter’s income, and the company employs hundreds of people to tweak the design and functioning of the platform so that users are nudged to interact more. When user interaction falls, Twitter is likely to change its recommendation algorithm, its search query autocompletion, its personalized recommendations or its social feed curation. In other words, the company constantly changes the measurement conditions.

This is an old and well-known problem. In 1968, Andrew Ehrenberg investigated the difference between which programs people watched at a given time of day and the program preferences they reported in surveys. The interesting finding was that the program schedule and people’s socially situated availability mattered more than their content preferences. Analogous to the Twitter data, it was largely the behaviour of the TV companies that determined viewer behaviour.

Having seen these underlying problems with data generated by social media platforms, social scientists should be more careful when conducting research. Whether we are data journalists trying to predict the next election or academics studying people’s reactions to political events, we should bear in mind that the data also shows how effective these platforms are at nudging our behaviour.

But we should not abandon research based on platform data altogether, at least not according to Angela Wu. To improve the current situation, she proposes that platform data be collected by independent third-party measurement firms, just as it was in the TV age. Such data would not double as a measure of the company’s own success and would therefore be a more unobtrusive record of human behaviour. Still, for this to become reality, big tech would have to agree voluntarily, and given its current market power, the chances are low.

Jakob Kampik