Many governments impose traditional censorship methods on social media platforms. Instead of removing content completely, many social media companies, including Twitter, only withhold it from the requesting country, so the content remains accessible outside the censored region. This provides an excellent setting in which to study government censorship on social media. We mine such content using the Internet Archive's Twitter Stream Grab. We release a dataset of 583,437 tweets by 155,715 users that were censored between 2012 and July 2020. We also release 4,301 accounts that were censored in their entirety. Additionally, we release a set of 22,083,759 supplemental tweets, comprising all tweets by users with at least one censored tweet as well as instances of other users retweeting the censored users. We provide an exploratory analysis of this dataset. Our dataset will aid not only the study of government censorship but also research on hate speech detection and the effects of censorship on social media users. The dataset is publicly available at https://doi.org/10.5281/zenodo.4439509
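As an illustration of how withheld content can be mined from an archive like the Internet Archive's Twitter Stream Grab, the Python sketch below scans a compressed stream file for tweets carrying the v1.1 API's withheld_in_countries field. This is only a minimal sketch under stated assumptions: the file name is hypothetical, and the actual dataset construction may differ (for example, it may rely on the stream's withheld-content notices instead).

# Minimal sketch: scan a Twitter Stream Grab file for withheld tweets.
# Assumes newline-delimited v1.1 tweet JSON compressed with bz2; the file
# name below is hypothetical and this is not the paper's exact pipeline.
import bz2
import json

def find_withheld_tweets(path):
    # Yield (tweet_id, user_id, countries) for tweets marked as withheld.
    with bz2.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                tweet = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed or truncated records
            countries = tweet.get("withheld_in_countries")
            if countries:  # non-empty list means the tweet is withheld somewhere
                yield tweet["id_str"], tweet["user"]["id_str"], countries

if __name__ == "__main__":
    for tweet_id, user_id, countries in find_withheld_tweets("stream-sample.json.bz2"):
        print(tweet_id, user_id, countries)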
We uncover a previously unknown, ongoing astroturfing attack on the popularity mechanisms of social media platforms: ephemeral astroturfing attacks. In this attack, a chosen keyword or topic is artificially promoted by coordinated and inauthentic activity to appear popular, and, crucially, this activity is removed as part of the attack. We observe such attacks on Twitter trends and find that these attacks are not only successful but also pervasive. We detected over 19,000 unique fake trends promoted by over 108,000 accounts, including not only fake but also compromised accounts, many of which remained active and continued participating in the attacks. Trends astroturfed by these attacks account for at least 20% of the top 10 global trends. Ephemeral astroturfing threatens the integrity of popularity mechanisms on social media platforms and by extension the integrity of the platforms.
We uncover and study ongoing ephemeral astroturfing attacks, in which many automatically generated tweets are posted by a collection of fake and compromised accounts and then deleted immediately to artificially propel a chosen keyword to the top of Twitter trends. We observe such attacks in the wild and determine that they are not only quite successful at pushing a keyword into the trends but also extremely prevalent. Using four years of the Internet Archive's Twitter Stream Grab, we detected over 19,000 unique keywords pushed to trends by over 108,000 bots, 55% of which still existed on the platform as of July 2020. Trends astroturfed by these attacks account for at least 20% of the top 10 world trends. Ephemeral astroturfing pollutes trends; allows for the manipulation of users' opinions; and permits content that could otherwise be filtered by the platform, such as illicit advertisements, political disinformation, and hate speech targeting vulnerable populations. Our results aid in understanding user manipulation on social media and, more generally, shed light on the types of adversarial behavior that arise to evade detection.
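To make the attack pattern concrete, the sketch below implements a simple, purely illustrative heuristic: a candidate trend looks like ephemeral astroturfing if the tweets that promoted the keyword were overwhelmingly deleted within seconds of being posted. The function names and thresholds are hypothetical and do not reproduce the detection method used in the study.

# Illustrative heuristic only (not the study's detection method): flag a trend
# whose promoting tweets were mostly deleted within a short window after posting.
from datetime import datetime, timedelta

def deletion_ratio(tweets, max_lifetime=timedelta(seconds=60)):
    # tweets: list of dicts with 'created_at' and optional 'deleted_at' datetimes.
    if not tweets:
        return 0.0
    quickly_deleted = sum(
        1 for t in tweets
        if t.get("deleted_at") is not None
        and t["deleted_at"] - t["created_at"] <= max_lifetime
    )
    return quickly_deleted / len(tweets)

def looks_astroturfed(tweets, ratio_threshold=0.8):
    # Hypothetical threshold: most promoting tweets were short-lived.
    return deletion_ratio(tweets) >= ratio_threshold

# Toy example with two synthetic tweets deleted seconds after posting.
now = datetime(2020, 7, 1, 12, 0, 0)
sample = [
    {"created_at": now, "deleted_at": now + timedelta(seconds=5)},
    {"created_at": now, "deleted_at": now + timedelta(seconds=8)},
]
print(looks_astroturfed(sample))  # True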
This is the dataset associated with the paper of the same name. You can find it here: https://arxiv.org/abs/2101.05919

Files:
tweets.csv : All 583k censored tweets
tweets_debiased.csv : Debiased sample of tweets (Section 6.1)
all_users.csv : All users who were censored at least once
users.csv : All 4,301 users whose entire profile is censored
users_inferred.csv : 1,931 additional users inferred to be censored by the procedure described in Section 3.3
supplement.csv : The supplementary tweet data (Section 3.5)

Please refer to this GitHub repo for detailed documentation and the code for reproduction: https://github.com/tugrulz/CensoredTweets
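For readers who want to start from the released files, a minimal Python sketch for loading them with pandas is shown below. The exact column names are documented in the GitHub repo, so the sketch only loads the files and inspects their shapes and columns.

# Minimal sketch: load the released CSV files and inspect them with pandas.
# Column names are documented in the GitHub repo linked above.
import pandas as pd

tweets = pd.read_csv("tweets.csv")          # all 583k censored tweets
users = pd.read_csv("users.csv")            # 4,301 fully censored accounts
supplement = pd.read_csv("supplement.csv")  # supplementary tweets (Section 3.5)

for name, df in [("tweets", tweets), ("users", users), ("supplement", supplement)]:
    print(name, df.shape, list(df.columns))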
We present the first in-depth and large-scale study of misleading repurposing, in which a malicious user changes the identity of their social media account via, among other things, changes to the profile attributes, in order to use the account for a new purpose while retaining their followers. We propose a definition for the behavior and a methodology that uses supervised learning on data mined from the Internet Archive's Twitter Stream Grab to flag repurposed accounts. We found over 100,000 accounts that may have been repurposed. Of those, 28% were removed from the platform after 2 years, thereby confirming their inauthenticity. We also characterize repurposed accounts and find that they are more likely to be repurposed after a period of inactivity and after deleting old tweets. We also provide evidence that adversaries target accounts with high follower counts for repurposing, and that some inflate accounts' follower counts by participating in follow-back schemes. The results we present have implications for the security and integrity of social media platforms, for data science studies in how historical data is treated, and for society at large in how users can be deceived about the popularity of an opinion. The data and the code are available at https://github.com/tugrulz/MisleadingRepurposing.
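As a rough illustration of what flagging a repurposed account can look like, the sketch below compares two profile snapshots of the same account taken at different times and scores how much the name, screen name, and description changed. The features, weights, and threshold are hypothetical; the study itself uses a supervised classifier trained on data from the Twitter Stream Grab.

# Illustrative sketch only: score how much an account's profile changed between
# two snapshots. The weights and threshold are hypothetical, not the paper's model.
from difflib import SequenceMatcher

def text_change(old, new):
    # 0.0 means identical strings, 1.0 means completely different.
    return 1.0 - SequenceMatcher(None, old or "", new or "").ratio()

def repurposing_score(old_profile, new_profile):
    # Profiles are dicts with 'name', 'screen_name', and 'description'.
    return (
        0.4 * text_change(old_profile["name"], new_profile["name"])
        + 0.3 * text_change(old_profile["screen_name"], new_profile["screen_name"])
        + 0.3 * text_change(old_profile["description"], new_profile["description"])
    )

# Toy example: a sports news account turned into a crypto giveaway account.
old = {"name": "Daily Football News", "screen_name": "footynews",
       "description": "Scores and transfer rumours"}
new = {"name": "Crypto Giveaways", "screen_name": "cryptodrop_x",
       "description": "Free tokens every day"}
print(repurposing_score(old, new) > 0.5)  # True: candidate repurposed account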
Malicious Twitter bots are detrimental to public discourse on social media. Past studies have looked at spammers, fake followers, and astroturfing bots, but retweet bots, which artificially amplify content, are not well understood. In this study, we characterize retweet bots that we uncovered by purchasing retweets on the black market. We detect whether they are fake or genuine accounts involved in inauthentic activities, and examine what they do in order to appear legitimate. We also analyze how they differ from human-controlled accounts. Based on our findings on the nature and life-cycle of retweet bots, we point out several inconsistencies between the retweet bots studied in this work and the bots studied in prior work. Our findings challenge some fundamental assumptions about bots and, in particular, how to detect them.