
Blackmarket-Driven Collusion on Online Media: A Survey

Published: 17 May 2022

Abstract

Online media platforms have enabled users to connect with individuals and organizations and share their thoughts. Beyond connectivity, these platforms also serve multiple purposes, such as education, promotion, updates, and awareness. Increasing the reputation of individuals in online media (aka social reputation) has thus become essential, particularly for business owners and event managers looking to improve their publicity and sales. The natural way of gaining social reputation is a tedious task, which has led to the creation of unfair ways to boost the reputation of individuals artificially. Several online blackmarket services have developed a thriving ecosystem with lucrative offers to attract content promoters to publicize their content online. These services operate in such a way that most of their inorganic activities go unnoticed by the media authorities, and their customers are less likely to be spotted. We refer to such unfair ways of bolstering social reputation in online media as collusion. This survey is the first attempt to provide readers with a comprehensive outline of the latest studies dealing with the identification and analysis of blackmarket-driven collusion in online media. We present a broad overview of the problem, definitions of the related problems and concepts, a taxonomy of the proposed approaches, and a description of the publicly available datasets and online tools, and we discuss the outstanding issues. We believe that collusive entity detection is a newly emerging topic in anomaly detection and cyber-security research in general, and the current survey will provide readers with an easy-to-access and comprehensive list of methods, tools, and resources proposed so far for detecting and analyzing collusive entities on online media.

1 Introduction

The prosperity of online media has attracted people and organizations to join these platforms and use them for several purposes, such as creating networks around common interests, building and broadening their business, and promoting/demoting e-commerce products. This has led users to choose artificial ways of gaining social reputation to obtain benefits within a short time. The main reason behind choosing artificial boosting is that the legitimate effort of gaining appraisals (followers, retweets, likes, shares, etc.) takes a significant amount of time and may not meet the actual needs of users. Such activities pose a massive threat to social media platforms, as they mostly violate the platforms' Terms of Service (ToS) (see Section 2.1.2 for more details). Such artificial boosting of social reputation/growth is often known as collusion [29, 48]. In the following, we present the formal definition of “collusive users”:
Collusive users are those online users who are clients of various blackmarket services to inflate the popularity of their account and get appraisals for their content.
According to a recent survey by HitSearch,1 \(98\%\) of content creators admitted to having spotted follower fraud among online influencers on Instagram. The adversarial impact of collusive entities poses a massive threat to online media. These entities create an atmosphere where people start trusting their information due to the popularity it receives. For example, in the 2019 UK general election, politicians approached the blackmarket services2 for online political campaigning to reach out to their potential voters. It is also reported that most of the politicians in contention or conversation for the 2020 U.S. presidential election have a very high percentage or volume of non-human followers linked to their Twitter accounts. The preceding examples show how collusive entities boost the believability of information during events. Moreover, the limited ability of humans to distinguish between collusive and genuine entities, given the vast amount of available information, is an important concern that motivates the design of methods for automatic identification of these entities. Collusive entities not only deceive people but also pollute the entire social space. Using the blackmarket services, collusive entities can perform appraisals such as improving the credibility of rumors, propagating fake news, inflating/deflating the ratings of products on e-commerce platforms, and gaining popularity on video sharing platforms. Figure 1 illustrates how collusive entities gain artificial appraisals on various online media platforms. We refer the reader to Section 4 for a thorough explanation of Figure 1. Recent studies that tackle the problem of collusive entity detection report that these entities are hard to discern as they exhibit a mixture of organic and inorganic activities [48].
Fig. 1.
Fig. 1. Illustration of blackmarket services with types, platforms, and appraisals.
Although artificial boosting has become a regular practice with the increasing popularity of different online media platforms, there is a lack of coherent and collective attempts from different research communities to explore the micro-dynamics controlling such malpractice and the extent of its effect in manipulating social reputation. Most of the existing studies aim to detect spam [14, 31, 91, 94, 109, 123, 129, 130, 134, 135, 136, 145, 150], fake accounts [21, 27, 32, 33, 51, 60, 61, 63, 73, 95, 107, 116, 151], and bots [22, 25, 26, 30, 31, 39, 44, 55, 58, 76, 78, 97, 128, 131, 144, 149], and how these accounts are used for information propagation [13, 28, 41, 83, 87, 101, 114, 133] in online media. Some studies [47, 48] have shown that collusive users are neither fake users nor bots but normal human beings who exhibit a mixture of organic and inorganic activities. Unlike bots, these users show no synchronicity across their behaviors [48], which makes it difficult to design automated techniques to detect them. In this article, we present a comprehensive survey of the existing literature on topics related to collusion in different online media platforms. Recently, Kumar and Shah [82] surveyed three aspects of false information on the web and social media: fake reviews, hoaxes, and false news. They also mentioned the lack of publicly available datasets related to false information and social media rumors. We outline our contributions relative to Kumar and Shah [82] and other related surveys in the remainder of this section.

1.1 Related Surveys and Our Contributions

To the best of our knowledge, this is the first survey to provide a detailed overview of collusive activities in online media. The aim of this survey is to provide readers with a comprehensive overview of the major studies conducted in the field of social network analysis to detect and analyze collusive activities across different online media platforms. The following four aspects make our survey unique and different from other related surveys:
(1)
Existing surveys do not directly focus on the problem of “collusive activities” and are more centered around the detection and analysis of fake users, fraudsters, and spammers on the web. Previous studies [8, 47, 48] noted that collusive activities differ substantially from these problems due to the mixture of organic and inorganic activities involved. In this article, we conduct a thorough analysis of past studies dealing with collusive activities in different online platforms.
(2)
Existing surveys mention only fraud- and spam-related datasets. Here, we describe the datasets on collusion from multiple aspects: the type of dataset and the entities present in them. This will largely contribute to increasing the performance and applicability of state-of-the-art collusion detection approaches.
(3)
We also outline the annotation guidelines and evaluation metrics used for collusive entity detection, as mentioned in the related studies.
(4)
We conclude the article by highlighting a set of key challenges and open problems in the area of collusion in online media.
We use the terms collusion and artificial boosting of social reputation interchangeably throughout the article. We use action/appraisal to refer to an online media activity such as a retweet, follow, review, view, or subscription, and entity to refer to a social entity such as a user, tweet, post, or review.

1.2 Survey Methodology

1.2.1 Survey Scope.

Since our scope is the investigation of collusive entities in online media platforms, we systematize the studies related to the analysis and detection of collusive entities. As collusion is related to a few other concepts such as fake accounts, bots, and spam, we also partly cover previous work in those domains.

1.2.2 Survey Organization.

In this article, we survey the existing algorithms, methodologies, and applications in detecting and analyzing collusive activities in online media platforms. In Section 2, we provide a broad overview and background of collusion in online media. We also analyze particular cases of collusion, show how collusion relates to other closely related concepts, and discuss how collusion has been evolving and who its main targets are. Next, in Section 3, we provide some preliminary concepts on blackmarket services. In Section 4, we show how collusion happens across multiple online media platforms and overview the state-of-the-art techniques for each platform. Section 5 explains two broad categories of collusive activities. To identify what has been done so far in the literature, in Section 6, we conduct a systematic literature review of the existing techniques. In Section 7, we present the annotation guidelines, related datasets, and evaluation metrics that can be used for studies in collusive entity detection. Being an underexplored research problem, there are plenty of important issues that need further attention; pointers to such issues are given in Section 8. Section 9 concludes this survey with a summary of the main contributions and open problems.

2 Background and Preliminaries

2.1 Overview of Online Media

2.1.1 Online Media Platforms.

Online media refers to the technologies present on the Internet that connect people or organizations to exchange information [42, 65]. Online media has become popular because of the abundant sources of information offered by the Internet, where a user can access the same news from several places. To keep users engaged, online media platforms also enable them to share their opinions on the information. In today’s world, social media is considered a fast, inexpensive, and effective way to reach a target audience. The history of online media started at the end of the 20th century with the arrival of Arpanet, email, blogging, and bulletin boards. Prior to this, services such as the telegraph, radio, and telephone were used to exchange information. However, their contributions were limited, as information exchange in these services did not take place “online” and was mostly used to send individual messages between two people. The rapid growth of the Internet in the early 2000s set the real stage for the emergence of online media with the arrival of knowledge sharing and social networking sites. Figure 2 depicts the history of online media, starting from the postal service in the early 1900s to live streaming (Periscope, Meerkat) and messaging platforms (Discord) in 2020. Examples of online media platforms are social networking sites (e.g., Facebook, LinkedIn), microblogs (e.g., Twitter, Tumblr), wiki-based knowledge sharing sites (e.g., Wikipedia), social news sites and websites of news media (e.g., Huffington Post), forums, mailing lists, and newsgroups, community media sites (e.g., YouTube, Flickr, Instagram), social Q&A sites (e.g., Quora, Yahoo Answers), user review sites (e.g., Yelp, Amazon.com), social curation sites (e.g., Reddit, Pinterest), and location-based social networks (e.g., Foursquare).
Fig. 2.
Fig. 2. Evolution of online media platforms.

2.1.2 ToS in Online Media Platforms.

ToS in online media platforms refer to the rules and regulations to be agreed upon by the users of the services. Each service has its own policy against fake and spam engagements. For instance, Twitter has declared its own platform manipulation and spam policy,3 and YouTube has a Fake Engagement Policy.4 These policies state that violations of the rules may result in permanent suspension of an account and its content. The ToS followed by online media platforms forbid artificially inflating one’s own or others’ appraisals (followers/retweets/views/likes/subscriptions). This includes selling or purchasing engagements using premium services, using or promoting third-party apps by posting content that helps in gaining engagements, trading to exchange engagements using freemium services, and so forth. Some of the terms defined by Twitter for collusion-related activities are engagement churn (first following a large number of unrelated Twitter accounts and then unfollowing them), indiscriminate engagement (using third-party APIs or automated software to follow a large number of unrelated accounts in a short time period), and aggressive engagement (aggressively engaging with tweets to drive traffic or attention to accounts).

2.2 Collusion in Online Media

2.2.1 Definition.

In general, collusion is defined as a covert and secret conspiracy or collaboration to deceive others. Collusion in online media is a process by which users artificially gain social reputation in violation of the ToS of the online media platform. It causes online media entities to appear credible and legitimate to end users, thus leading to activities such as fake promotions, campaigns, and misinformation, thereby polluting the social space. In this article, we refer to collusive users as those accounts that are customers of blackmarket services and have submitted their account/content to get artificial appraisals. The blackmarket services cover a wide range of online media platforms, from online social networks to rating/review platforms, video sharing platforms, and even recruitment platforms. In Section 4, we discuss in detail how collusion happens in all of these platforms. For example, a boost in YouTube views can transform a small event into a big campaign or a promotional event. Despite its apparent presence in the real world, collusion has remained an underexplored concept. We go deeper into the notion of collusion by discussing particular cases and examples in the next section.

2.2.2 Examples of Collusive Activities.

In this section, we show two examples of how collusion happens in online media. Figure 3 shows an example of collusive activities in online media. Both images are taken from the profile information of collusive entities in the ABOME dataset [46]. ABOME is a dataset of tweets and users on Twitter, as well as YouTube videos and channels, collected from two blackmarket services (YouLikeHits and Like4Like) offering artificial appraisals. The entire dataset is publicly available on the Zenodo5 platform. Note that we redacted the name of the Twitter user/YouTube channel to maintain anonymity. Figure 3(a) shows the official Twitter account of an organization registered in blackmarket services for collusive follower appraisals. Figure 3(b) shows a video posted by a verified YouTube channel registered in blackmarket services for collusive “like” requests. In both examples, the accounts are marked as verified by Twitter/YouTube. The presence of verified online media entities in the blackmarket services clearly shows that the in-house algorithms deployed by these platforms have been unsuccessful in detecting such entities. This further motivates the problem and necessitates the development of automated techniques to detect these entities. In Section 4.1, we discuss in detail how online media users can request a verification badge through the blackmarket services.
Fig. 3.
Fig. 3. Example of collusive activities in online media. (a) An official Twitter account of an organization registered in blackmarket services for collusive follower appraisals. (b) A video posted by a verified YouTube channel registered in blackmarket services for collusive “like” requests. Both images are taken from the profile information of collusive entities in the ABOME dataset [46]. Some profile information are blurred for the purpose of anonymity.

2.3 Reasons Behind Collusion

To keep up with the pace of today’s turbo-charged world, a large number of users like to go against the stream [48, 113]. Shortcuts to success conflict with the enormous effort genuinely needed to succeed and may lead to costly mistakes that eventually undermine the true goals. Online media is considered the perfect convergence of communication and information. It has already become the most important source of public information and the primary medium for organizations and individuals to promote their ideologies, products, and services. Manipulating such an important source of information can result in a significant gain in fame and finance for individuals/organizations. This is clearly evident in online social networking platforms like Twitter in the form of attaining fake followers [3, 24], rating platforms like Amazon [99] in the form of posting fake reviews, video streaming platforms like YouTube/Twitch [111] in the form of synthetic gains in viewership counts, and many more. With huge competition in online media, users may spend months creating new content and still fail to reach their target audience. Blackmarket services serve as a complete social media toolkit that helps an online media entity gain a stronger social media presence among its competitors. They are very effective for (i) companies that want to hype a campaign; (ii) musicians who want to promote new projects, albums, and releases; (iii) entertainment companies that want to publicize their shows; and (iv) online media moguls who need a quick online media presence.

2.4 Challenges in Collusion Detection

Detecting collusive activities is a challenging task. Dutta et al. [48] showed how collusive users exhibit a mix of organic and inorganic activities. In the case of Twitter, these users, on the one hand, retweet (follow) genuine tweets (genuine users), and on the other hand, they retweet (follow) tweets (users) submitted to the blackmarket services. This kind of mixed behavior makes them difficult to detect with traditional fake user detection methods [113]. Moreover, collusive users are not bots; these accounts are mostly controlled by normal human beings. This makes them difficult to flag with bot detection methods, as shown in previous studies [8, 29, 48]. A recent study by Dutta et al. [50] investigated how collusion happens on YouTube. They designed web scrapers to collect a large set of YouTube videos and channels submitted to the blackmarket services for collusive appraisals. They further designed CollATe, an end-to-end neural framework that combines representations from three feature extractors (metadata, anomaly, and comment) to detect videos submitted to blackmarket services for collusive comment appraisals. In the case of rating platforms like Amazon, methods to detect collusive reviewers have not been successful so far due to the lack of enough labeled data. The dynamics of the propagation of collusive activities are also quite complicated: collusive activities can easily propagate and impact a large number of users in a short time by spreading misinformation. Kim et al. [77] developed CURB, an online algorithm that leverages information from the crowd to prevent the spread of misinformation. They also reported that fact-checking organizations like Snopes and Politifact are not able to properly limit the spread of misinformation, as doing so requires significant human effort. Moreover, as collusion happens in multiple online media platforms, it is also difficult to understand the complete adversarial intent of the collusive entities. This raises the need for systems that detect collusive entities early to limit their artificial social reputation. Finally, due to the restrictions of the online media platforms on collecting public data, the research community has limited training data, which does not include all of the necessary information about collusive entities. Most of the previous studies [8, 50] had to design their own scrapers due to the limitations and restrictions of the APIs provided by the online media platforms.
Some concepts related to collusion are fake, bot, sockpuppetry, malicious promotion, spam, content polluters, relationship infiltrators, and slanderous users. We distinguish between these concepts and collusion in Table 1. Even though these concepts differ from collusion, they are highly related. Therefore, studying the literature that focuses on these topics will give us better knowledge of how to analyze and detect collusive entities in online media platforms.
Table 1.
Concept | Definition of the Concept | Difference from Collusion
Fake | Fake is fabricated information of original content or appraisals [85]. | Collusion is artificial inflation/deflation of appraisals using blackmarket services.
Bot | Bots are computer programs and may have time-synchronous as well as time-asynchronous fraudulent activities. | Collusive users are mostly humans and exhibit asynchronous fraudulent activities.
Sockpuppetry | Sockpuppets are operated by a puppetmaster who controls other user accounts [79]. | In collusion, every account is controlled by the real owner of the account.
Malicious promotion | Promotes a specific product/topic to a target audience [90]. | Collusion is more general and not necessarily focused on a specific product/topic.
Spam | Consistently performs similar operations across multiple accounts to manipulate or undermine current trends [56]. | A collusive entity may contain spammy content, but not necessarily.
Content polluters | Polluters post nearly identical content, sometimes by randomly adding mentions of unrelated legitimate users [86]. | In collusion, content pollution never occurs, as every action happens through the blackmarket services.
Relationship infiltrators | They exploit reciprocity in relationships (follow/retweet/subscribe) to engage in spam activities [86]. | In collusion, reciprocity happens in freemium services to gain credits.
Slanderous users | They give a false low rating with a positive review to confuse recommender systems [143]. | Collusive users always give a high rating and a positive review.
Table 1. Comparison between Collusion and Other Related Concepts

3 A Note On Blackmarket Services

Blackmarket services are the major controlling authorities of collusive activities. Customers join these services and contribute directly or indirectly to boost their online profiles artificially. The blackmarket services are divided into two types based on the mode of service [113]: premium and freemium.
Premium services. These services require the customers to pay a certain amount of money to obtain the facilities (e.g., SocialShop, RedSocial). Most of the premium services provide a comprehensive range of social media enhancement services for all purposes. These services also ensure strategic social media promotion to maintain an edge over the competitors in the following ways: (i) location-specific actions; (ii) control over the time during which users want to gain social reputation; (iii) users gained from the services have eye-catching display pictures and filled-out bios in their profiles; and (iv) domain-specific actions (users gained from the services can be from various domains like fashion, music, and blogging). These services offer actions in tiers (100, 1K, 10K, etc.) and lucrative offers to their customers, along with additional facilities such as a 100% money-back guarantee, a retention guarantee, and complete privacy. A few of these services ensure replenishment if the customer experiences any drop in the actions. We divide this type of service into two categories:
(1)
General premium: Here, customers have to choose a plan that suits their budget.
(2)
Auto premium: These services provide daily/weekly/monthly delivery service of actions. Customers need to select one of the auto action packages in advance for a time duration, such as 100 Twitter auto retweets (2 to 4 days). The main advantage of these services is to automate the process of social reputation to an extent.
The basic principle of auto premium services remains the same as that of general premium services, but they can be considered a faster and more effective way to boost social reputation. The main difference is the recurrence of delivery (auto is recurrent, general usually is not).
Freemium services. These services are free to use, but they also have premium subscription plans (e.g., YouLikeHits, Like4Like). The main idea is to get customers familiar with the workflow of the services and motivate them to opt for the premium plans. Freemium services operate in one of three ways: social-share services, auto-time freemium services, and credit-based services [48]. More details on these types can be found in the work of Dutta et al. [47, 48]:
(1)
Social-share services: These services require customers to perform social actions on multiple platforms to get appraisals for their content. Some of the possible actions are share/like/follow on Facebook, follow/like/view/comment on Instagram, and like/view/share on YouTube, among others. As an example, customers in FreeFollowers6 have to perform five social actions or complete a survey in 15 to 60 seconds to get appraisals.
(2)
Auto-time freemium services: These services require customers to get access tokens from the services, after which they can request a fixed number of actions for a time duration (e.g., 10 to 50 retweets in a 10-minute window).
(3)
Credit-based services: These services are operated based on a “give and take” relationship. Customers of these services lose credits when other customers perform actions on their submitted content. Similarly, customers gain credits when they perform actions on the content of other customers.
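To make the credit-based mechanism concrete, the following minimal sketch simulates the “give and take” accounting described above. All names and numbers (e.g., CreditLedger, credits_per_action) are illustrative and are not taken from any specific blackmarket service.

```python
# Minimal sketch of the "give and take" credit accounting used by
# credit-based freemium services. All names and numbers are illustrative.

class CreditLedger:
    def __init__(self, credits_per_action=5, signup_bonus=50):
        self.credits = {}                 # customer -> credit balance
        self.credits_per_action = credits_per_action
        self.signup_bonus = signup_bonus

    def join(self, customer):
        # New customers receive a small bonus so they can request appraisals.
        self.credits[customer] = self.signup_bonus

    def perform_action(self, worker, requester):
        # `worker` appraises (e.g., retweets/likes) content submitted by
        # `requester`: the worker gains credits, the requester spends them.
        if self.credits.get(requester, 0) < self.credits_per_action:
            return False                  # requester cannot afford the appraisal
        self.credits[requester] -= self.credits_per_action
        self.credits[worker] = self.credits.get(worker, 0) + self.credits_per_action
        return True

ledger = CreditLedger()
ledger.join("alice"); ledger.join("bob")
ledger.perform_action(worker="alice", requester="bob")   # alice retweets bob's tweet
print(ledger.credits)   # {'alice': 55, 'bob': 45}
```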
Figure 5(a) shows the working of premium services. Here, two types of entities are involved: customers who ask for appraisals and suppliers who provide those appraisals. Figure 5(b) shows the working of freemium services. Here, customers are involved in a credit-based ecosystem, hence the customers are also the suppliers. To understand the common keywords used by the users of these services, we conducted an experiment [47] by visualizing the wordclouds of the descriptions/bios present in the profiles of Twitter users from premium and freemium services (cf. Figure 4). Notice that premium users are associated with high-profile accounts having keywords such as “CEO,” “official,” “speaker,” and “Founder.” However, freemium users use advertisement-based keywords such as “like,” “agency,” “SocialMedia,” and “YouTubeMarketing.” This suggests that collusive users participating in these services use targeted keywords in their profiles to gain quick popularity. We encourage the reader to refer to the work of Stringhini et al. [124, 125, 127] for a detailed study of Twitter follower markets.
Fig. 4.
Fig. 4. Wordclouds generated from the profile description of customers in premium (a) and freemium (b) services. Reprinted with permission from Dutta and Chakraborty [47].
Fig. 5.
Fig. 5. Illustration showing the working principle of premium and freemium services.

4 Compromised Online Platforms

As discussed in previous sections, collusion happens across multiple online media platforms. In this section, we discuss in detail how appraisals on these platforms are artificially manipulated by collusive entities. We look into seven types of platforms: social networks, rating/review platforms, video streaming platforms, recruitment platforms, discussion platforms, music sharing platforms, and development platforms. In addition to these platforms, we look into the artificial manipulation of website traffic through blackmarket services. Figure 1 illustrates how different types of blackmarket services provide collusive appraisals to various online media platforms.

4.1 Social Networks

Social networks serve as a platform to build social relations among users of common interests (personal or professional). Platforms like Twitter, Facebook, and Instagram allow their users to perform actions such as post, like, retweet, follow, and share, among others. Today, these platforms also serve as real-time news delivery services and a medium for the business owners to connect with their customers and thus expand their outreach. However, the natural way to attract users is usually a tedious task and takes significant time. This motivates users to choose an artificial way of gaining appraisals.
Facebook. Facebook is the most popular social network to connect and share with family and friends online. Facebook has four types of appraisals: shares, likes, comments, and followers. The large number of appraisals on Facebook acts as a form of social proof for the sign of popularity and importance.
Twitter. Twitter is a microblogging service where users write tweets about topics such as politics, sport, cooking, and fashion. Twitter has three types of appraisals: retweets, likes, and followers. Acquiring more appraisals on Twitter helps increase the user’s social signals and attract more visitors to the profile.
Instagram. Instagram is a social networking platform that enables users to share images or videos with their audience. Users can upload photos and videos, then share them with their followers or with a group of friends. Instagram has three types of appraisals: likes, followers, and views. Higher appraisals are the key to visibility on Instagram. The more appraisals a post receives on Instagram, the higher is the post rank in search results and on the Explore page.
Pinterest. Pinterest is an image sharing platform that is designed as a visual discovery engine for finding ideas like recipes, home and style inspiration, and so forth. A message on Pinterest is known as a Pin. Pinterest has two types of appraisals: likes and followers.
Other than the common appraisals, we found a few blackmarket services where social networking users can request a verification badge. By verification badge, we refer to the Twitter white/blue verified badge7 and the Instagram blue verified badge.8 Verifiedbadge,9 StaticKing,10 Prime badges,11 and SocialKing12 are the most popular platforms providing such appraisals. Verification badges are coveted checkmark badges that enhance the user’s social media presence and improve credibility on the platform. The minimum requirements to request verification badges through blackmarket services are as follows:
The user/organization should be a celebrity, journalist, popular brand, government official, or sports company.
The user/organization should have a Wikipedia page or media coverage in leading online news portals.
For social media platforms where a subscription is an appraisal, the user should have a minimum of 100K subscribers.
There is a vast literature on the detection and analysis of collusive entities in social networks. Shah et al. [113] and Dutta et al. [48] presented two of the first few studies investigating collusive entities on Twitter registered in blackmarket services. Dutta et al. [48] trained multiple state-of-the-art supervised classifiers using a set of 64 features to distinguish collusive users from genuine users. They also divided the set of collusive users into three categories: bots, promotional customers, and normal customers. Arora et al. [8] obtained better classification performance than Dutta et al. [48] by incorporating the content-level and network-level properties of Twitter users in a multitask setting. De Cristofaro et al. [40] performed the first such work on Facebook, presenting a comparative measurement study of page promotion methods. Sen et al. [110] conducted the first such work on Instagram, developing an automated mechanism to detect fake likes on Instagram. We discuss these studies in detail in Section 6.

4.2 Rating/Review Platforms

Rating/review platforms allow users to rate or share their opinion about entities, such as products, applications, food items, restaurants, and movies. Examples of these platforms are e-commerce platforms (e.g., Amazon), travel platforms (e.g., TripAdvisor), and business rating platforms (e.g., Yelp, Google reviews), among others.
Amazon. Amazon is the world’s largest e-commerce platform and is considered the ultimate hub for selling merchandise on the web. It allows two types of appraisals: reviews and ratings. High ratings and reviews for a product on Amazon attract customers and make the product more trustworthy.
Google. Google provides valuable information to businesses and its customers using two types of appraisals: reviews and ratings. Higher ratings and reviews on Google help improve the business and enhance local search rankings.
TripAdvisor. TripAdvisor is an online travel platform that offers multiple services like online hotel reservations, travel experiences, restaurant reviews, and so forth. Similar to other platforms, TripAdvisor has two types of appraisals: reviews and ratings. Positive reviews from previous occupants help the business improve its reputation and increase the customer base.
Yelp. Yelp is a local business review platform that collects crowdsourced data from its users. Yelp has two types of appraisals: reviews and ratings. Getting more reviews in turn improves business reputation and gets a lot more customers with free traffic.
Today, online reviews/ratings are highly relevant for customers making purchase-related decisions. As these appraisals play a significant role in deciding the sentiment/popularity of a product/business, there is massive scope for collusion among sellers/buyers to manipulate them artificially. The works of Jindal et al. [74], Li et al. [88], and Wang et al. [137] represent a few initial studies on detecting fake reviews from review patterns that capture unusual reviewer behaviors. Another early work by Li et al. [89] identified fake reviews on Dianping, the largest Chinese review hosting site. The authors proposed a supervised learning algorithm to identify fake reviews in a heterogeneous network of users, reviews, and IP addresses. Mukherjee et al. [99] performed one of the first attempts to detect fraudulent reviewer groups in e-commerce platforms. Recently, Kumar et al. [81] identified users in rating platforms who give fraudulent ratings for excessive monetary gains. Most of the studies on review platforms are focused on e-commerce platforms. We believe that newer platforms such as Google reviews, Yelp, and TripAdvisor would open more research questions, as these platforms contain reviews of millions of hotels, restaurants, attractions, and other tourist-related businesses.

4.3 Video Streaming Platforms

Video streaming platforms are mostly used for sharing videos and streaming live videos. These platforms allow users to upload, view, rate, share, add to playlists, report, and comment on videos, and subscribe to other users. Here, we discuss how appraisals on video streaming platforms are inflated artificially. With the increase in the popularity of live streaming came the concept of astroturfing, a broader and more sophisticated term referring to the synthetic increase of appraisals in an online social network by means of blackmarket services. The consequences of such synthetic inflation are not restricted to increased monetary benefits, directory listings, and partnership benefits; they also extend to better recommendation rankings, thereby doctoring the experience of viewers, who are recommended these boosted materials instead of genuine material produced by honest broadcasters.
YouTube. YouTube is a video sharing platform where users can create their own profile; upload videos; and watch, like, and comment on other videos. YouTube has four types of appraisals: likes, comments, subscribers, and views. Likes, comments, and views are for videos, and subscribers are for channels. The more YouTube users interact with a video/channel, the more prominently that video/channel is listed to other users. These appraisals are considered a measure of engagement and not only make the entity popular but also help the channel with sponsorship opportunities and monetization options.
Twitch. Twitch is the most popular live streaming platform on the web. It is mostly used by gamers to stream their games while other users watch them. Twitch has two types of appraisals: followers and views. The more views a channel has, the higher its popularity on Twitch and the more likely it is to be ranked on the featured list. Twitch is usually considered one of the most difficult platforms on which to earn quick popularity due to the sheer number of streamers using the platform.
Tiktok. Tiktok is a video sharing social network that is primarily used to create short videos of 3 to 15 seconds. Tiktok has two types of appraisals: likes and followers. Getting more likes on Tiktok posts increases the chance of maintaining a presence and becoming famous among the audience.
Vimeo. Vimeo is a video hosting and sharing platform that allows users to upload and promote their videos with a high degree of customization that is not available on other competing platforms. Vimeo has two types of appraisals: likes and followers. A higher number of appraisals helps show the video in the suggested videos list created by Vimeo’s algorithm.
A few studies have addressed astroturfing in video streaming platforms. Shah [111] made the pioneering attempt to combat astroturfing on live streaming platforms. He proposed Flock, an unsupervised method to identify botted broadcasts and their constituent botted views. Note that he did not disclose the name of the live streaming corporation on which the study was performed. In a recent study, Dutta et al. [50] proposed CollATe, an unsupervised method to detect collusive entities on YouTube. The method utilizes the metadata features, temporal features, and textual features of a video to detect whether it is collusive or not.

4.4 Recruitment Platforms

Recruitment platforms are employment-oriented platforms that also provide professional networking among users. These platforms offer the opportunity to discover new professionals either locally or internationally and help users with their professional endeavors. Hiring new employees is a crucial part of every organization; it starts with posting new job ads and ends with recruitment. It can be thought of as a multistep process, which is normally quite time consuming and prone to human errors. With the advent and rise of automated systems, the hiring process of an organization is increasingly done in the cloud with the help of tools such as the Applicant-Tracking-System (ATS).13 An ATS makes the hiring process faster and more accurate by preparing job ads, posting them online, collecting applicant resumes, communicating efficiently with applicants, and finding the best-fit resumes for the organization. However, the increasing use of ATS also introduces various risks, such as spammers compromising job seekers' privacy, slandering the reputation of organizations, and hurting them financially by manipulating the normal functioning of the system, most frequently the job ad publishing process (often recognized as employment scam).
LinkedIn. LinkedIn is the most popular online recruitment platform where employers can post jobs and job seekers can post their profiles. There are four types of appraisals on LinkedIn: followers, recommendations, endorsements, and connections. A higher number of connections and followers on LinkedIn helps the user gain attention. LinkedIn endorsements help add validity to the user’s profile by backing up his or her work experience.
Adikari and Dutta [1] identified fake profiles on LinkedIn. The authors applied state-of-the-art supervised classifiers designed on a set of profile-based features to detect fake profiles. Prieto et al. [102] detected spammers on LinkedIn based on a set of heuristics and their combinations using a supervised classifier. Another work by Vidros et al. [132] tackled the problem of Online Recruitment Fraud (ORF) (see Section 6 for more details). The authors also released a publicly available dataset of 17,880 annotated job ads (17,014 legitimate and 866 fraudulent) from various recruitment platforms.

4.5 Discussion Platforms

Discussion platforms are content sharing platforms primarily used to host community-based discussions. The discussions can be in the form of questions and answers (Q&A) or around content that other users have submitted as links, text posts, or images. Next, we explain how collusion happens on discussion platforms.
Quora. Quora is a Q&A-based discussion forum that empowers the user to ask questions on any subject and connect with like-minded people who contribute unique insights and high-quality answers. Quora has four types of appraisals: followers, upvotes, downvotes, and comments. The higher the number of followers and upvotes a user gains for his or her answers, the higher is the ranking factor and the more the answers are displayed to other users. The aim of a user on Quora is to be named as a top writer in the long run. Answers written by top writers on Quora are considered as expert opinions and are also displayed in the featured list.
ASKfm. ASKfm is another Q&A-based discussion forum and is mostly used by users to post questions anonymously. The platform has two types of appraisals: followers and likes. More likes on answers grow their rating on ASKfm at a faster rate.
Reddit. Reddit is a social news-based discussion forum that allows users to discuss and vote on content that other users have submitted. Reddit has five types of appraisals: subscribers, upvotes, downvotes, karma, and comments. Having more karma on Reddit allows the user to post more often on the platform and gives him or her more reputation. Having more upvotes on posts helps users gain more exposure, which eventually pushes the posts higher up in the targeted subreddit.
Studies on discussion platforms have examined the manipulation of the visibility of political threads on Reddit [23]. The authors measured the effect of manipulating upvotes and downvotes on article visibility and user engagement by comparing Reddit threads whose visibility was artificially increased. Another work by Shen and Rose [115] investigated polarized user responses to an update to Reddit’s quarantine policy.

4.6 Music Sharing Platforms

Music sharing platforms enable users to upload, promote, and share audio. Here, we explain how collusion happens on music sharing platforms.
SoundCloud. SoundCloud is an audio sharing platform that connects the community of music creators, listeners, and curators. SoundCloud has three types of appraisals: plays, followers, and likes. A higher number of followers and likes creates a massive fan base for creators and gets more attention from the community. The popularity of a soundtrack on the platform is driven by the number of plays it receives. Plays allow users to popularize their music and grow their brand on SoundCloud.
ReverbNation. ReverbNation is an independent music sharing platform where musicians, producers, and venues collaborate with each other. The platform has two types of appraisals: plays and fans. The higher the number of plays in an audio or video, the higher is the rank of the artist on the platform.
Some studies have been conducted on fraudulent entity detection in music sharing platforms. Bruns et al. [20] investigated Twitter bots that help in promoting SoundCloud tracks. The authors also proposed a number of social media metrics that help identify bot-like behavior in the sharing of such content. Another work by Ross et al. [106] proposed a method to distinguish between two groups of SoundCloud accounts: bots and humans.

4.7 Development Platforms

Interestingly, we found a few popular development platforms where collusion happens.
GitHub. GitHub is a repository hosting service that provides distributed version control and source code management (SCM) functionality. GitHub has three types of appraisals: followers, stars, and forks. User profiles with more followers make the account more popular. Similarly, stars and forks are the metrics to show the popularity of a repository. Blackmarket services help deliver GitHub followers, stars, and forks from real and active people.
Hacker News. Hacker News is the most popular discussion platform for developers. It has three types of appraisals: upvotes, karma, and comments. The higher the number of comments and upvotes on a post, the higher its popularity. User profiles with high karma can perform additional appraisals on a post, such as downvoting and creating polls. Blackmarket services help deliver upvotes and karma on Hacker News simply by adding the post link and making the payment.
Medium. Medium is an online publishing platform that is commonly used by developers to share ideas, knowledge, and perspectives. It has two types of appraisals: followers and claps. A higher number of claps in a post ranks it higher in feed and search results. Similarly, a higher number of followers increases the post’s reach and view. Blackmarket services help deliver followers and claps by adding the post link and making the payment.
Most of the studies on development platforms are on measuring user influence and identifying unusual commit behavior by analyzing the attributes of the platforms. Hu et al. [67] measured user influence on GitHub using the network structure of the follow, star, and fork relations, along with user activities. The authors also introduced an author-level H-index (also known as H-factor) to measure users’ capability. Goyal et al. [59] identified unusual changes in the commit history of GitHub repositories. The authors designed an anomaly detection model based on the commit characteristics of a repository. One potential research direction is to examine how collusive appraisals on these platforms help inflate the reputation of repositories/users. Another potential research direction is to develop collusive entity detection techniques by considering the hidden relations between the entities of the platform.

4.8 Other Platforms

Other than the online media platforms discussed previously, artificial boosting is also observed elsewhere. Website owners can avail themselves of premium/freemium blackmarket services to get traffic on their websites. The idea is to outsource the responsibility of generating the necessary amount of web traffic to some other company. In blackmarket terms, the appraisal is called hits or web traffic. Gaining artificial hits helps a website gain popularity faster compared to the slow process of growing organic visitors via Search Engine Optimization. Blackmarket services offer two types of web traffic: regional traffic, where customers can opt for traffic from a specific country or region, and niche traffic, where customers can opt for targeted traffic that focuses on a specific type of business, such as music based or e-commerce based. To activate traffic to a website, customers have to enter the URL of the website, the number of visitors required, and the preferred timespan for delivery.
Overall, it can clearly be seen that collusion happens in multiple online media platforms, and several studies have been conducted to detect collusive entities in these platforms. The most commonly studied platform is the social network, where collusive entities attempt to spread misinformation. One of the major challenges in conducting studies on other platforms is collecting data under the restrictions and limitations imposed by the APIs. Moreover, once the data are collected, labeling them by experts is a time-consuming task and requires significant manual effort. We now dive deeper into the studies that target the identification of collusive entities based on their mode of operation.

5 Types of Collusive Activities

Collusive activities can be categorized based on the mode of collaboration: individual collusion and group collusion. In this section, we discuss these two types of collusion and how individual collusion differs from group collusion in providing collusive appraisals.

5.1 Individual Collusion

Individual collusion happens when the collusive activities of individuals are independent of each other; however, they are guided by centralized blackmarket authorities. Individual collusion has been studied to some extent in the literature. Most of the existing studies use supervised models based on behavioral and profile features [48]. Some network-based approaches infer anomaly scores for nodes/edges in a network (tweet-user network, product-user network, etc.) and rank them to spot suspicious users [29, 66, 81, 112, 137]. Existing studies mostly focus on the behavioral dynamics of individuals. However, these approaches fail when it comes to detecting group collusion. Group collusion can be more damaging, as it can take total control of the appraisals for an entity due to its size.
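As a toy illustration of the node-scoring-and-ranking idea (not the actual scoring used by any of the cited methods [29, 66, 81, 112, 137]), the sketch below ranks users in a user-item bipartite graph by the fraction of their actions that target items flagged as blackmarket-submitted; all data and names are made up for illustration.

```python
# Toy anomaly-scoring sketch on a user-item bipartite graph:
# rank users by the fraction of their actions that target items known to be
# submitted to blackmarket services. Purely illustrative; not the scoring
# used by any of the cited methods.
actions = {                       # user -> items they acted on (toy data)
    "u1": {"t1", "t2", "t3"},
    "u2": {"t1", "t4"},
    "u3": {"t5"},
}
blackmarket_items = {"t1", "t2"}  # items flagged as blackmarket-submitted

def anomaly_score(items):
    return len(items & blackmarket_items) / len(items)

ranking = sorted(actions, key=lambda u: anomaly_score(actions[u]), reverse=True)
print([(u, round(anomaly_score(actions[u]), 2)) for u in ranking])
# [('u1', 0.67), ('u2', 0.5), ('u3', 0.0)]
```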

5.2 Group Collusion

Group-level collusion takes place when a set of individuals collaborate as a group to perform collusive activities. Such collective behavior is more subtle than individual behavior: at an individual level, activities might appear normal, whereas at the group level, they might differ substantially from normal behavior. Moreover, it may not be possible to understand the actual dynamics of a group by aggregating the behavior of its members due to the complicated, multifaceted, and evolving nature of inter-personal dynamics [43]. Some properties of collusive groups are as follows: (i) members of a collusive group work in shorter time frames and create maximum impact (e.g., flooding deceptive opinions); (ii) members of the group may or may not know each other (agreement by a contracting agency); (iii) multiple accounts within a group can be controlled by a single master account (existence of sockpuppets [80]); (iv) the larger the size of the group, the more damaging the group is; and (v) group members have a high chance of performing similar activities (e.g., posting similar reviews or writing the same comments). Past studies in this direction mostly detected groups using Frequent Itemset Mining (FIM) and ranked groups based on different group-level spam indicators [62, 138]. Wang et al. [139] pointed out several limitations of FIM for group detection: high computational complexity at low minimum support, absence of temporal information, inability to capture overlapping groups, a tendency to detect small and tight groups, and so on. Liu et al. [92] proposed HoloScope to find groups of suspicious users in rating platforms based on the contrasting behavior of fraudsters and honest users in terms of topology, temporal spikes, and rating deviation. Nilforoshan and Shah [100] discussed group collusion and collusive entity detection via graph modeling. Cresci et al. [35] showed how groups of spambots exhibit common patterns compared to groups of genuine accounts. The group behaviors are matched in a DNA-inspired fashion, and extensive experiments are performed to detect spambot groups. Cresci et al. [36] also highlighted approaches that leverage characteristics of groups of accounts performing fraudulent activities. Cresci et al. [37] further focused on the collective behavior of spam accounts and how they are engineered to resemble genuine online media accounts. They proposed the social fingerprinting technique, based on digital DNA modeling of account behaviors, to identify spam and genuine accounts using supervised and unsupervised techniques. Stringhini et al. [126] proposed EVILCOHORT to identify communities of online media accounts that are accessed by a common set of computers, using IP addresses.
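To make the FIM-based group extraction used in the works cited above more concrete, the following simplified sketch groups reviewers who co-review at least a minimum number of products. It is a toy stand-in for a full Apriori-style miner, not the actual procedure of [62, 99, 138]; data and thresholds are illustrative.

```python
# Simplified stand-in for FIM-based reviewer-group extraction:
# reviewer pairs become group candidates when they co-review at least
# `min_support` products. Thresholds and data are illustrative only.
from itertools import combinations
from collections import defaultdict

reviews = {                      # product -> set of reviewers (toy data)
    "p1": {"u1", "u2", "u3"},
    "p2": {"u1", "u2", "u4"},
    "p3": {"u1", "u2", "u3"},
    "p4": {"u5"},
}

def candidate_groups(reviews, min_support=2):
    co_reviews = defaultdict(int)
    for reviewers in reviews.values():
        for pair in combinations(sorted(reviewers), 2):
            co_reviews[pair] += 1
    # Keep pairs whose co-review count meets the support threshold;
    # a full FIM run would extend these pairs to larger itemsets.
    return {pair: c for pair, c in co_reviews.items() if c >= min_support}

print(candidate_groups(reviews))   # {('u1', 'u2'): 3, ('u1', 'u3'): 2, ('u2', 'u3'): 2}
```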

6 Progress in Collusive Entity Detection

Despite the increasing interest in analyzing and detecting anomalous activities on the web, there has been limited work on collusive entity detection. Most of the previous studies on collusive entity detection are limited to social networks and rating platforms. In this section, we provide a summary of the reviewed papers, covering their central ideas and the basics of their models. To structure the descriptions, the relevant papers are grouped by the type of approach proposed in the paper. We categorize the methodologies followed by the previous studies into one of the following types: feature based, graph based, and advanced models.

6.1 Feature-Based Detection

The majority of the papers model the behavioral properties of users on online media platforms. Feature-based methods can be used to distinguish collusive entities from genuine entities. The aim is to design a set of features that can well represent and capture various behavioral characteristics of the entities. Recently, researchers have also focused on the linguistic behavior of collusive entities, such as deceptive information [7, 57], lexical analysis [69], and sentiment analysis [44].
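As an illustration of the feature-based paradigm, the sketch below trains an off-the-shelf classifier on a hand-crafted feature matrix. The feature names are hypothetical placeholders, and the sketch is not the implementation of SCoRe [48] or any other published system; it only shows the general supervised workflow.

```python
# Illustrative feature-based collusive-user classifier.
# Feature names are hypothetical placeholders, not the exact feature set
# of SCoRe [48] or any other published system.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
n = 200
# Toy feature matrix: [follower/followee ratio, retweets per day,
# fraction of retweets on blackmarket-submitted tweets, account age (days)]
X = rng.random((n, 4))
y = rng.integers(0, 2, n)        # 1 = collusive, 0 = genuine (synthetic labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```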
Dutta et al. [48] reported that collusive users show an amalgamation of organic and inorganic activities and that there is no synchronicity among their behaviors. They substantiated this finding by comparing collusive retweeting behavior with that of normal retweet fraudsters, where normal retweet fraudsters are retweeters who are abnormally synchronized in some patterns. The authors then collected data from four freemium blackmarket services (credit-based services). Three human annotators were asked to label the blackmarket customers into bots, promotional customers, and normal customers based on a set of annotation rules. The data were used to tackle two problems: a four-class classification problem (genuine, bot, promotional, normal) and a binary classification problem (genuine versus all types of customers combined into a single class). To detect collusive retweeters, the authors developed SCoRe, a supervised model based on five sets of features: profile features, social network features, user activity features, likelihood features, and fluctuation features. The authors also developed a Chrome browser extension to detect collusive retweeters in real time. To study the underlying behavior of the blackmarket services, Dutta and Chakraborty [47] extended their previous work [48] to provide an in-depth analysis of collusive users involved in premium and freemium blackmarket services. The authors collected the dataset from four premium and four freemium blackmarket services. They analyzed the activities of collusive users based on retweet-centric, network-centric, profile-centric, and timeline-centric properties. They showed that, unlike in premium services, the credit system in freemium services is the primary reason behind the unusual functioning of freemium collusive users. They also developed an updated version of the Chrome extension by incorporating the features calculated from both premium and freemium users. Figure 4 shows the wordclouds of the text present in the descriptions of premium and freemium retweeters collected from the blackmarket services. It can be seen that premium retweeters are associated with high-profile accounts having keywords such as “CEO,” “official,” “speaker,” and “founder,” whereas freemium retweeters are associated with advertising agencies having keywords such as “like,” “agency,” “SocialMedia,” and “YouTubeMarketing.”
Dutta et al. [49] proposed HawkesEye, a classifier based on the Hawkes process and topic modeling to detect collusive retweeters on Twitter. They collected tweets using three hashtags—#deletefacebook, #cambridgeanalytica, and #facebookdataleaks—and manually annotated the retweeters as “fake” or “genuine.” The HawkesEye model takes the temporal information of retweet objects of a user and topical information of the retweet texts as input. The temporal information is modeled using the Hawkes process due to its self-excitation property, and the topical information is modeled using latent Dirichlet allocation (LDA). The authors reported highly accurate results on the classification task. Comparisons were made with respect to well-known bot detection approaches [25] and collusive retweeter detection approaches [48].
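For reference, the standard self-exciting intensity used in Hawkes process modeling can be written as follows (our notation; the exact parameterization in HawkesEye [49] may differ):
\[ \lambda(t) = \mu + \sum_{t_i < t} \alpha\, e^{-\beta (t - t_i)}, \]
where \(\mu\) is the base rate of retweet events, \(t_i\) are the timestamps of past retweets by a user, and \(\alpha\) and \(\beta\) control how strongly and for how long past retweets excite future ones. Bursty retweeting driven by blackmarket requests is expected to leave a characteristic imprint on these parameters.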
More recently, Arora et al. [8] improved the collusive retweeter detection task by considering the multifaceted characteristics of a collusive user. They created multiple views for a user by deriving representations from the content, attributes, and social network. They then proposed a multiview learning approach based on WGCCA (Weighted Generalized Canonical Correlation Analysis) to combine the individual representations (views) of a user into final user embeddings. As each view contributes differently to the classification task (c.f. the t-SNE visualizations in Figure 6), the authors assigned a different weight to each view rather than treating all views equally. Overall, this analysis shows that the collusive entity detection task can be improved by incorporating multiple aspects of an online media user. It also calls for more research on fraudulent entity detection in online media using multiview mechanisms that learn a low-dimensional vector per user capturing information from each view. Moreover, the views represent different modalities and should be modeled individually rather than simply concatenated.
Fig. 6.
Fig. 6. Visualization (using t-SNE) of representations of collusive (red) and genuine (green) users created using Tweet2Vec (a), SCoRe [48] (b), Retweet network (c), Quote network (d), Follower network (e), and Followee network (f). Reprinted with permission from Arora et al. [8].
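A simplified way to combine such views is the MAX-VAR formulation of generalized CCA, where the shared user embeddings are the top eigenvectors of a weighted sum of per-view projection matrices. The sketch below is a toy version under that assumption and is not the exact WGCCA implementation of Arora et al. [8]; the weights and dimensions are illustrative.

```python
import numpy as np


def weighted_gcca(views, weights, dim, reg=1e-3):
    """Simplified MAX-VAR GCCA: shared embeddings G that maximize weighted
    agreement with per-view linear projections (views: list of n x d_i arrays)."""
    n = views[0].shape[0]
    M = np.zeros((n, n))
    for X, w in zip(views, weights):
        Xc = X - X.mean(axis=0)
        # ridge-regularized projection onto the column space of this view
        P = Xc @ np.linalg.inv(Xc.T @ Xc + reg * np.eye(Xc.shape[1])) @ Xc.T
        M += w * P
    vals, vecs = np.linalg.eigh(M)              # M is symmetric
    return vecs[:, np.argsort(vals)[::-1][:dim]]  # top eigenvectors = shared space


# toy example: three hypothetical views (content, profile, network) for 50 users
rng = np.random.default_rng(0)
views = [rng.normal(size=(50, d)) for d in (20, 8, 12)]
G = weighted_gcca(views, weights=[0.5, 0.2, 0.3], dim=5)   # final user embeddings
```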
Several research studies have been conducted on the automatic detection of collusive followers (follower fraud) on Twitter. In an earlier study, Liu et al. [93] proposed DetectVC to detect voluntary followers (volowers) and showed how volowers profit by following enough users for self- and product promotion. However, do these users act as a group to perform malicious activities? To answer this, Gupta et al. [62] proposed approaches to detect malicious retweeter groups. They found that the activities appear normal at the individual level; at the group level, however, they look suspicious, and group activities vary across groups. The work of Jang et al. [71] is the first to detect collusive followers using geographical distance. The authors reported that as the distance between legitimate users and collusive users increases, the proportion of legitimate followers among the total number of followers decreases. Aggarwal et al. [2] detected users on Twitter with a manipulated follower count using an unsupervised local neighborhood detection method.
Other studies have looked at how collusion happens on Facebook and e-commerce platforms. De Cristofaro et al. [40] conducted an in-depth analysis of Facebook pages for which they collected likes via Facebook ads and other like farms. The authors monitored the “liking” activity by creating honeypot pages and crawling them every 2 hours to check for new likes from the blackmarket services. Badri Satya et al. [12] conducted an in-depth analysis of fake likers on Facebook collected from the blackmarket services and reported that fake likers behave quite differently from genuine likers in terms of their liking behaviors, longevity, and so on. Mukherjee et al. [99] proposed an unsupervised approach to detect and rank collusive spammer groups on Amazon. They used frequent itemset mining (FIM) to extract the set of reviewer groups, which were then labeled by domain experts into spammer and genuine groups. Their proposed method, GSRank, considers various group-level and individual spam behavioral indicators and then models the inter-relationships between reviewer groups, members of those groups, and the products they reviewed to find the spammer groups.
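The FIM step can be approximated with a toy frequent itemset miner over "transactions" formed by the reviewer sets of each product; reviewer groups that co-occur in at least a minimum number of transactions become candidate spammer groups. This is a simplified sketch (the thresholds and data are assumptions), not the actual GSRank implementation, and its brute-force enumeration only scales to small reviewer sets per product.

```python
from collections import defaultdict
from itertools import combinations


def candidate_groups(product_reviewers, min_products=3, max_group=3):
    """Toy frequent-itemset miner: reviewer sets that co-reviewed at least
    `min_products` products (each transaction = the reviewer set of one product)."""
    counts = defaultdict(int)
    for reviewers in product_reviewers.values():
        for size in range(2, max_group + 1):
            for group in combinations(sorted(reviewers), size):
                counts[group] += 1
    return {g: c for g, c in counts.items() if c >= min_products}


# hypothetical product -> reviewer mapping
product_reviewers = {
    "p1": {"u1", "u2", "u3"},
    "p2": {"u1", "u2", "u3", "u9"},
    "p3": {"u1", "u2", "u3"},
    "p4": {"u4", "u5"},
}
print(candidate_groups(product_reviewers))
# {('u1', 'u2'): 3, ('u1', 'u3'): 3, ('u2', 'u3'): 3, ('u1', 'u2', 'u3'): 3}
```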
Studies have also looked at artificial manipulation of viewership/comment/subscription counts on video streaming platforms such as YouTube and Twitch. Shah [111] introduced FLOCK, an unsupervised multistep process for identifying botted views and botted broadcasts. FLOCK performs the following steps: (i) it models a normal broadcast as a collection of multiple views, (ii) it classifies whether a given broadcast has inflated popularity by looking at the collective behavior of all the views in a broadcast, and (iii) it classifies whether individual views of the broadcast are botted. For modeling normal broadcast behavior, the authors included temporal features of the constituent views of a broadcast, such as the start and end time of a particular view. It is also worth noting that they avoided using descriptive features (such as the browser used or country of origin) and engagement-based features to keep the proposed solution simple. In a more recent study, Dutta et al. [50] proposed three models to detect three types of collusive entities on YouTube: (i) videos submitted for collusive like requests, (ii) videos submitted for collusive comment requests, and (iii) channels submitted for collusive subscription requests. They analyzed the videos submitted to blackmarket services for collusive appraisals (likes and comments) from two perspectives: propagation dynamics and video metadata. They also analyzed the collusive YouTube channels based on location, channel metadata, and network properties. Their analysis of the structural properties of the giant component present in the collusive channel network shows that it is a small world. To detect the collusive videos and collusive channels, they designed one-class classifiers based on features calculated from video metadata and temporal information. The models were evaluated in terms of true positive rate, achieving state-of-the-art results. The authors also proposed CollATe, a denoising autoencoder model that leverages the power of three components—metadata feature extractor, anomaly feature extractor, and comment feature extractor—to learn feature representations of videos. Note that CollATe can only detect videos submitted to blackmarket services for collusive comment requests because, unlike the temporal information of like activity, the temporal information of comments is publicly available through the YouTube API.14 Figure 7 shows snapshots of collusive entities detected using their models. Figure 7(a) shows a video where the number of likes is much higher than the number of views; the authors claimed that the blackmarket services use the YouTube API with the credentials of their users to help these videos gain likes. Similarly, Figure 7(c) shows a video submitted to blackmarket services for collusive comment requests with a higher number of comments as compared to views and likes. Figure 7(b) shows a YouTube channel that was submitted for collusive subscriptions; the authors claimed that collusive YouTube channels make proper use of promotional keywords to attract more subscribers to their channel.
Fig. 7.
Fig. 7. Example of collusive video (for likes) (a), collusive channel (for subscriptions) (b), and collusive video (for comments) (c) detected by our models. Sensitive information is blurred. Reprinted with permission from Dutta et al. [50].
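The one-class setup used for collusive video detection can be sketched with off-the-shelf components: train a one-class classifier on metadata features of known collusive videos only, then check how many held-out collusive videos it recovers (the true positive rate). The features and numbers below are illustrative assumptions, not the exact feature set of Dutta et al. [50].

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM


def ratio_features(meta):
    """meta columns (assumed): views, likes, comments, age_days."""
    views, likes, comments, age = meta.T
    return np.c_[likes / (views + 1), comments / (views + 1),
                 np.log1p(views), np.log1p(age)]


# hypothetical metadata of videos known to be submitted for collusive likes
collusive_videos = np.array([[300, 900, 40, 10],
                             [150, 620, 25, 7],
                             [500, 1400, 80, 20],
                             [90, 410, 15, 4]], dtype=float)

model = make_pipeline(StandardScaler(), OneClassSVM(nu=0.1, kernel="rbf"))
model.fit(ratio_features(collusive_videos))        # trained on the collusive class only

unseen = np.array([[20000, 350, 60, 30],             # organic-looking video
                   [200, 750, 30, 6]], dtype=float)  # more likes than views
print(model.predict(ratio_features(unseen)))       # +1 = collusive-like, -1 = outlier
```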
The work by Vidros et al. [132] is the only one that deals with recruitment-based fraud detection. The authors introduced this problem as online recruitment fraud (ORF), relating more specifically to employment scams. The dataset used in this work contains numerous job ads and their corresponding annotations (legitimate or fraudulent). They performed an in-depth analysis of the dataset by applying standard state-of-the-art machine learning classifiers. In such scams, the trapped users unknowingly become a medium for scammers to complete their jobs and are often even driven to make direct wire transfers under the careful and believable disguise of work visa or travel expenses. It is worth noting that although employment scams share some characteristics with problems such as email phishing, online opinion fraud, trolling, and cyber-bullying, they differ from them in many respects.
Limitation. The main limitation is that user features are often specific to a particular online media platform and may not generalize to others. Moreover, if collusive activities are performed by new online media accounts, it is difficult to collect behavioral history for such accounts. The drawback of using linguistic features is the lack of generalizability across multiple languages and circumstances. It has been reported in the literature [6, 84] that linguistic features designed for one circumstance may not work well for others.

6.2 Graph-Based Models

In this section, we present and discuss graph-based models for collusive entity detection. Graph-based models have recently been developed particularly for spotting outliers and anomalies in networks. The advantage of graph techniques is that they efficiently capture long-range correlations among interdependent data objects [5].
Chetan et al. [29] proposed CoReRank, an unsupervised approach that captures the interdependency between the credibility of users and the merit of tweets using a recurrence formulation based on the network and behavioral information of users as well as the topical diversity of tweets. The authors created a directed bipartite graph by modeling the interactions between users and tweets in terms of retweets/quotes. They then computed suspiciousness scores for both users and tweets in the graph based on this recurrence formulation. By incorporating prior knowledge about collusive users, they proposed a semi-supervised version of CoReRank, called CoReRank+. Both CoReRank and CoReRank+ outperformed the respective state-of-the-art methods by a significant margin.
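The mutual recurrence between user credibility and tweet merit can be illustrated with a toy fixed-point iteration on the user-tweet retweet matrix; the sketch below captures the spirit of such a recurrence but not the exact CoReRank scoring equations.

```python
import numpy as np


def bipartite_scores(R, iters=50, tol=1e-8):
    """Toy mutual recursion on a user-tweet retweet matrix R (n_users x n_tweets):
    a tweet's score depends on the scores of the users who retweeted it, and
    a user's score depends on the scores of the tweets they retweeted."""
    u = np.ones(R.shape[0]) / R.shape[0]
    t = np.ones(R.shape[1]) / R.shape[1]
    for _ in range(iters):
        t_new = R.T @ u
        t_new /= t_new.sum() + 1e-12
        u_new = R @ t_new
        u_new /= u_new.sum() + 1e-12
        converged = np.abs(u_new - u).max() < tol
        u, t = u_new, t_new
        if converged:
            break
    return u, t


# toy retweet matrix: rows = users, columns = tweets, entry = # of retweets/quotes
R = np.array([[3, 3, 0], [3, 3, 0], [0, 1, 1]], dtype=float)
user_scores, tweet_scores = bipartite_scores(R)
```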
Dutta et al. [45] proposed DECIFE, a framework to detect collusive users involved in blackmarket-based following activities. The authors proposed a novel heterogeneous graph attention network to leverage the follower/followee relationships and linguistic properties of users. The authors also showed which categories of tweets are submitted by collusive users to blackmarket services.
Rayana and Akoglu [104] proposed SpEagle, a framework that considers metadata of reviews (review content, rating, timestamp) along with relational data (reviewer-review-product graph) for detecting spam users, fake reviews, and the products targeted by spammers. SpEagle used metadata for extracting the spam features, which were then transformed into a spam score to decide the genuineness of users and reviews. It was also extended to a semi-supervised setting (called SpEagle+) to improve its performance. The authors also implemented a lighter version (SpLite) that leverages a small subset of review features for the detection task to achieve a boost in runtime. Wang et al. [139] proposed a bipartite graph projection based approach called GSBP to detect review spammer groups in e-commerce websites. Unlike FIM-based methods [99], their proposed approach detects loosely connected spammer groups—each group member may not review every target product of the group.
Kumar et al. [81] developed a fraudulent reviewer detection system called REV2 for rating platforms. They proposed three interdependent metrics: fairness of a user in rating an item, reliability of a specific rating, and goodness of a product. By combining network and behavioral properties of users, products, and ratings, REV2 calculates fairness, reliability, and goodness scores in both supervised and unsupervised settings. They evaluated the computed scores by using five real-world datasets: Flipkart, Bitcoin OTC, Bitcoin Alpha, Epinions, and Amazon. The current version of REV2 is deployed at Flipkart (an Indian e-commerce company).
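A heavily simplified, unsupervised version of the REV2-style fixed point (ignoring the original's priors and weighting hyperparameters, which are assumptions dropped here) can be written as follows; ratings are assumed to be rescaled to [-1, 1].

```python
def rev2_simplified(ratings, iters=100):
    """Simplified REV2-style fixed point over (user, product, score) triples."""
    users = sorted({u for u, _, _ in ratings})
    prods = sorted({p for _, p, _ in ratings})
    F = {u: 1.0 for u in users}                 # fairness of users
    G = {p: 0.0 for p in prods}                 # goodness of products
    R = {(u, p): 1.0 for u, p, _ in ratings}    # reliability of ratings
    for _ in range(iters):
        for p in prods:
            rs = [(u, s) for u, q, s in ratings if q == p]
            G[p] = sum(R[(u, p)] * s for u, s in rs) / len(rs)
        for u, p, s in ratings:
            # a rating is reliable if its user is fair and it agrees with the product's goodness
            R[(u, p)] = 0.5 * (F[u] + (1.0 - abs(s - G[p]) / 2.0))
        for u in users:
            rs = [q for v, q, _ in ratings if v == u]
            F[u] = sum(R[(u, q)] for q in rs) / len(rs)
    return F, G, R


# toy data: u3 always disagrees with the consensus and ends up with low fairness
ratings = [("u1", "p1", 1.0), ("u2", "p1", 0.8), ("u3", "p1", -1.0),
           ("u1", "p2", -0.9), ("u2", "p2", -1.0), ("u3", "p2", 1.0)]
fairness, goodness, reliability = rev2_simplified(ratings)
print(fairness)
```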
Wang et al. [138] proposed a graph-based framework called GGSpam to detect review spammer groups. It recursively splits the entire reviewer graph into small groups. GSBC, the key component of GGSpam, finds the spammer groups by identifying all the bi-connected components within the split reviewer graphs whose spamicity score exceeds the given threshold. They proposed several group spam indicators to compute the spamicity score of a given reviewer group. They conducted experiments on two real-world datasets: Amazon and Yelp.
Dhawan et al. [43] proposed DeFrauder, an unsupervised approach to detect online fraud reviewer groups. This approach initially detects the candidate fraud groups by leveraging the underlying product review graph and incorporating several behavioral signals that model multifaceted collaboration among reviewers. It then maps the reviewers into an embedding space and assigns a spam score to each group such that groups consisting of spammers with highly similar behavioral traits attain a high spam score. They conducted experiments on four real-world datasets: Amazon, Play Store, YelpNYC [104], and YelpZip [105]. Figure 8(a) shows the schematic diagram of DeFrauder to detect online fraud reviewer groups. Figure 8(b) shows the coherent review patterns of four spammers in a fraudulent reviewer group.
Fig. 8.
Fig. 8. (a) Schematic diagram of DeFrauder to detect online fraud reviewer groups. (b) Coherent review patterns of four spammers in a fraudulent reviewer group. Reprinted with permission from Dhawan et al. [43].
Liu et al. [92] formulated a novel suspiciousness metric from graph topology and spikes to detect fraudulent users in multiple rating platforms (BookAdvocate, Amazon). The authors proposed HoloScope, an unsupervised approach that combines suspicious signals from graph topology, temporal bursts and drops, and rating deviation. The primary goal behind the approach is to obstruct the fraudsters by increasing the time they need to perform an attack. The evaluation was done on multiple semi-real (synthetic) and real datasets.
Shin et al. [120] proposed two algorithms, M-Zoom (Multidimensional Zoom) and M-Biz (Multidimensional Bi-directional Zoom), to detect suspicious lockstep behavior on online review platforms. The task of detecting lockstep behavior is to discover a set of accounts that gives fake reviews to the same set of products/restaurants. The objective of M-Zoom and M-Biz is to detect dense subtensors in a greedy way until a local optimum is reached. Although the overall structure of M-Biz and M-Zoom is the same, the two algorithms differ significantly in how they find each dense subtensor. Using these algorithms, the authors detected three dense subtensors indicating the activities of bots, which changed the same pages hundreds of thousands of times.
Mehrotra et al. [95] detected fake followers on Twitter using graph centrality measures. The authors created a graph using the follower and friend network from a set of Twitter users. Next, six centrality-based features were used for the classification experiment. Using this feature set with a Random Forest classifier gave an accuracy of \(95\%\). Zhang and Lu [151] proposed a graph-based approach using near-duplicates to detect fake followers in Weibo, a Chinese counterpart of Twitter. They were able to discover 11.90 million users in Weibo who bought followers from blackmarket services.
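A sketch of such a centrality-feature pipeline using NetworkX and scikit-learn follows; the specific centrality measures, the toy graph, and the toy labels are assumptions and not necessarily the six features used in the paper.

```python
import networkx as nx
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def centrality_features(G, nodes):
    """Six graph-centrality features per node (illustrative choice)."""
    deg = nx.degree_centrality(G)
    bet = nx.betweenness_centrality(G)
    clo = nx.closeness_centrality(G)
    eig = nx.eigenvector_centrality(G, max_iter=1000)
    pr = nx.pagerank(G)
    clu = nx.clustering(G)
    return np.array([[deg[n], bet[n], clo[n], eig[n], pr[n], clu[n]] for n in nodes])


# toy follower/friend graph with assumed labels (1 = fake follower)
G = nx.barabasi_albert_graph(60, 2, seed=0)
nodes = list(G.nodes())
labels = np.array([1 if G.degree(n) <= 2 else 0 for n in nodes])  # toy labels only

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(centrality_features(G, nodes), labels)
```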
Shin et al. [117] identified three empirical patterns of core users in real-world networks across diverse domains. The first pattern is the MIRROR PATTERN, which states that the coreness of a vertex is strongly related to its degree. The second pattern, the CORE-TRIANGLE PATTERN, states that the degeneracy (the maximum k for which the k-core is non-empty) and the triangle count obey a power law with slope 1/3. The third pattern, the STRUCTURED CORE PATTERN, states that the degeneracy cores are not cliques. Using these patterns, the authors analyzed the lockstep behavior of collusive followers on Twitter. It was reported that at least 78% of the vertices with high coreness values use blackmarket services to boost their followers.
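The MIRROR PATTERN can be checked directly with a k-core decomposition: compute each node's coreness and degree and measure their correlation; nodes that deviate strongly from this relationship are candidates for closer inspection. A minimal sketch on a synthetic graph:

```python
import networkx as nx
import numpy as np

# toy graph; in practice this would be the follower network
G = nx.barabasi_albert_graph(500, 3, seed=1)

core = nx.core_number(G)          # coreness (k-core index) per node
deg = dict(G.degree())
nodes = list(G.nodes())
corr = np.corrcoef([core[n] for n in nodes], [deg[n] for n in nodes])[0, 1]
print(f"degree-coreness correlation: {corr:.3f}")
# Nodes whose coreness is unusually high relative to their degree would be
# candidates for lockstep (blackmarket-boosted) follower behavior.
```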
Limitation. Graph-based methods are memory intensive and computationally expensive, which can restrict their use in real-time tools for detecting collusive entities on online media platforms. This is primarily due to the nature of graph computations: most graph algorithms traverse every edge and node in the network, and as the network keeps adding new edges/nodes, the computation occupies more memory and incurs higher latency.

6.3 Advanced Models

The advent of techniques such as deep learning has alleviated the shortcomings of the previous models through their ability to learn more complex representations that are difficult to capture with traditional machine learning models.
Arora et al. [9] proposed a multitask learning approach to detect collusive tweets. The authors collected the data from two blackmarket services—YouLikeHits and Like4Like—after creating honeypot accounts in these services. The multitask model takes two inputs—the feature representation extracted from the tweet metadata and Tweet2vec representation of the tweets—which are passed through a cross-stitch neural network to learn the optimal combination of the inputs via soft parameter sharing. Ahsan et al. [4] tackled the problem of fake reviewers using an active learning approach with the TF-IDF features of the review content.
Studies have also looked at the detection of social spammers in online media platforms. Wu et al. [141] proposed an end-to-end deep learning model based on Graph Convolution Networks (GCNs) that operates on directed social graphs to detect social spammers. The authors built a classifier by performing graph convolution on the social graph with different types of neighbors.
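The core propagation rule of a GCN layer (symmetric normalization with self-loops followed by a non-linearity) can be sketched in a few lines; this is the generic rule rather than the directed, multi-neighbor-type variant of Wu et al. [141], and all data below are synthetic.

```python
import numpy as np


def gcn_layer(A, X, W):
    """One graph-convolution layer: add self-loops, symmetrically normalize the
    adjacency matrix, propagate node features, and apply a ReLU."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)


rng = np.random.default_rng(0)
A = (rng.random((30, 30)) < 0.1).astype(float)
A = np.maximum(A, A.T)                      # undirected toy social graph
X = rng.normal(size=(30, 8))                # node features
H = gcn_layer(A, X, rng.normal(size=(8, 4)))  # hidden node representations
```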
In a study to identify collusive entities on YouTube, Dutta et al. [50] proposed CollATe, a deep unsupervised learning method using denoising autoencoders. The CollATe architecture consists of three components—metadata feature extractor, anomaly feature extractor, and comment feature extractor—to learn video feature representations. The final representation is then fed to a collusive video detector module that returns the label (either collusive or non-collusive) for the input video.
To detect collusive followers on Twitter, Castellini et al. [24] proposed a deep learning technique using a denoising autoencoder. The denoising autoencoder acts as an anomaly extractor and is trained only on features generated from real profiles. The authors built a test set with both real and fake profiles. The reconstruction error that the autoencoder produces for a data point is compared to a threshold value computed from the training set: if the error is higher, the profile is marked as collusive (i.e., the autoencoder is not able to properly reconstruct the record); otherwise, it is marked as real.
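A simplified stand-in for this pipeline, using a small scikit-learn MLP as the denoising autoencoder rather than the authors' exact architecture (the profile features, noise level, and 95th-percentile threshold are assumptions), could look like this:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# synthetic "real" profiles: [followers, followees, posts_per_day, url_ratio]
rng = np.random.default_rng(0)
real_profiles = rng.normal(loc=[1000, 800, 5, 0.1],
                           scale=[200, 150, 2, 0.05], size=(300, 4))
scaler = StandardScaler().fit(real_profiles)
X = scaler.transform(real_profiles)
X_noisy = X + rng.normal(scale=0.1, size=X.shape)   # denoising: corrupt the input

# bottleneck MLP trained to reconstruct the clean input from the noisy one
ae = MLPRegressor(hidden_layer_sizes=(2,), max_iter=3000, random_state=0)
ae.fit(X_noisy, X)


def recon_error(model, Z):
    return np.mean((model.predict(Z) - Z) ** 2, axis=1)


threshold = np.percentile(recon_error(ae, X), 95)    # assumed thresholding rule
test = scaler.transform(np.array([[950, 780, 4, 0.12],     # genuine-looking
                                  [50, 9000, 0, 0.9]]))    # anomalous-looking
print(recon_error(ae, test) > threshold)             # True = flagged as fake
```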
Limitation. The main limitation of the advanced models is the requirement of vast amounts of data. Moreover, due to the unusual behavior of collusive entities, which mixes organic and inorganic activities, it is difficult to find correlations between sets of features, which makes it hard for these methods to perform well.

6.4 Other Methods

In contrast to the preceding approaches, which were largely focused on collusive activities in online media platforms, a series of works have been conducted on detecting collusive activities in other platforms.
Asavoae et al. [10] developed a fully automated and effective system to detect colluding Android apps that abuse permissions to cause data leaks or distribute malware across multiple apps. The authors first proposed a rule-based approach to identify colluding apps. As defining rules requires expert knowledge and, in some cases, explicit rules may not exist, they also proposed a probabilistic model that calculates two likelihood components for an app: (i) the likelihood of carrying out a threat and (ii) the likelihood of performing an inter-app communication.
Blasco and Chen [17] proposed the Application Collusion Engine (ACE), a system to automatically generate a variety of colluding apps depending on the configuration of app component templates and code snippets. The ACE system has two main components: a colluding set engine and an application engine. The first component tells the second how it should create apps in order to collude; the second is responsible for creating fully working Android apps that are ready to be installed on a device. The primary aim of this work is to create substantial colluding app sets for experimentation, as representative datasets of colluding apps do not presently exist. Kalutarage et al. [75] proposed a probabilistic model toward an automated threat intelligence system for app collusion. Readers are encouraged to refer to the work of Bhandari et al. [16] for a detailed survey of app detection techniques that focus on collusive activities by investigating inter-app communication threats.
Limitation. As these methods are mostly targeted at detecting colluding apps in the Android operating system, detecting covert channels in inter-app communication for real-life apps can become a bottleneck. This is mainly due to the rapid growth of the mobile app ecosystem and third-party marketplaces.
Table 2 provides a summary of the approaches for collusive entity identification in the area of individual and group collusion.
Table 2.
Name | Detected Entities | Model Used | Type | Platform
ScoRe [48] | Collusive retweeters registered in freemium blackmarket services | Feature-based classification using supervised models | Individual | Twitter
CoReRank [29] | Collusive users and tweets registered in blackmarket services | Unsupervised model incorporating network properties, behavioral properties, and topical similarity of tweets | Individual | Twitter
HawkesEye [49] | Collusive retweeters promoting popular hashtags | Temporal and textual information modeled using topic modeling and Hawkes processes | Individual | Twitter
MTLCollu [9] | Collusive tweets registered in freemium blackmarket services | Multitask learning framework to identify collusive tweets | Individual | Twitter
ScoRe+ [47] | Collusive users registered in freemium/premium blackmarket services | Feature-based classification using supervised models | Individual | Twitter
Fame4Sale [34] | Users involved in blackmarket-based following activities | Feature-based classification using machine learning classifiers | Individual | Twitter
View_WGCCA [8] | Collusive users registered in freemium blackmarket services | Individual user representations/views combined using WGCCA | Individual | Twitter
MalReg [62] | Malicious retweeter groups | Group-based features using supervised models | Group | Twitter
DetectVC [93] | Voluntary following activities | Captures structural information in user following activities and knowledge from follower markets | Individual | Sina Weibo
FakeLikers [12] | Fake likers on Facebook | Feature-based classification using supervised learning models | Individual | Facebook
GSRank [99] | Review spammer groups | FIM to find spammer groups | Individual | Amazon
GSBP [139] | Loosely connected review spammer groups | FIM and bipartite graph projection to find spammer groups | Individual | Amazon
REV2 [81] | Fraudulent users in rating platforms | Graph-based algorithm combining network and behavioral properties of an account | Individual | Amazon
GGSpam [138] | Review spammer groups modeled as bi-connected components | Modeling the topological structure of the reviewer graph | Group | Amazon
SpEagle [104] | Spam users, fake reviews, and the targeted products | Unified framework combining metadata (text, timestamp, rating) and relational data (network) | Individual | Yelp
DeFrauder [43] | Fraud reviewer groups | Unsupervised method using product review graph and reviewer behavioral signals | Group | Play Store
DECIFE [45] | Collusive followers registered in freemium blackmarket services | Heterogeneous user-tweet-topic network to leverage the follower/followee relationships and linguistic properties of a user | Individual | Twitter
EMSCAD [132] | Online recruitment frauds | Bag-of-words modeling using supervised models | Individual | LinkedIn
CollATe [50] | Collusive videos and channels on YouTube | Neural architecture that leverages time-series information of posted comments along with static metadata of videos | Individual | YouTube
Table 2. Summary of the Proposed Approaches to Identify Collusive Entities from Blackmarket Services

7 Annotation Guidelines, Datasets, Applications, and Evaluation Metrics

Most of the previous studies in collusive entity detection suggested a few guidelines that may be helpful to label these users. In this section, we first review the annotation guidelines for creating annotated datasets for collusive entities. We then provide pointers to the previous studies that have released various datasets for collusive entity detection. Next, we discuss the tools and interfaces in the analysis and detection of collusive entities in online media. Finally, we describe the common evaluation metrics used for collusive entity detection.

7.1 Annotation Guidelines

Annotation guidelines are usually seen in studies that create a labeled set of entities for various experiments, as mentioned in the following.
The dataset of Dutta et al. [48] is one of the most widely used datasets [8, 47] for collusive retweeter detection. The dataset is created from the blackmarket services, and the annotators were asked to label each retweeter with one of three classes: bots, promotional customers, and normal customers. For instance, the bot class can be selected if the account is controlled by software. Promotional customers can be identified if the accounts are involved in promoting brands using keywords such as “win,” “ad,” and “Giveaway.” The normal customer class can be selected if the account does not fall under any of the preceding categories. The annotators were also given complete freedom to search for any information related to collusive entities on the web and to apply their own intuition.
In the work of Wu et al. [142], a corpus of collusive spammers on a Chinese review website is described. The task of the annotators was to label each reviewer in a reviewer group as either a colluder or a non-colluder. The annotators in this work were computer science graduate students with extensive online shopping experience. The annotators were given a set of three instructions about the colluder class and how they should reason to arrive at the colluder class when the data are not straightforward: (i) if the reviewer has too many reviews giving opinions opposite to other reviews about the same stores; (ii) if the reviewer has too many reviews giving opinions about the same stores opposite to the ratings from the Better Business Bureau,15 an organization that focuses on advancing marketplace trust; and (iii) if the reviewer has too many reviews giving opinions about the same stores opposite to evidence from general web search results.
In the work of Mukherjee et al. [98], a corpus of collusive spammers on an e-commerce website is described. Three annotators were given the task of examining the e-commerce users and providing a label of spammer or non-spammer based on three opinion spam signals. Note that each of the annotators was a domain expert (an employee of e-commerce platforms such as Rediff Shopping, eBay.in, etc.). As the annotators were domain experts, they were also given access to detailed information, such as the entire profile history and demographic information.
Li et al. [88] built a review spam corpus. They crawled data from a consumer review site named Epinions.16 The task of the annotators was to label each review in the dataset as spam or non-spam. The annotators used the rules specified on a platform named Consumerist,17 an independent source of consumer news and information published by Consumer Reports, to label the reviews. The platform suggests a set of 30 rules that help annotators spot fake online reviews. The authors employed 10 college students to annotate the reviews. Each review was annotated by two students, and conflicts were resolved by a third student.
In most of the preceding studies, the Fleiss/Cohen kappa value was reported to show the reliability of agreement between the annotators.

7.2 Datasets

In this section, we present a list of publicly available datasets used for the detection of collusive activities in online media.
The Fame4Sale dataset [34] is a corpus of fake followers collected from various blackmarket services: InterTwitter (INT), FastFollowerz (FSF), and TwitterTechnology (TWT). The dataset also contains the relationships between the user accounts (followers/friends). The ScoRe dataset [47, 48] is a medium-sized corpus of collusive users collected from four freemium blackmarket retweeting services: YouLikeHits, Like4Like, Traffup, and JustRetweet. It contains an anonymized version of 1,941 users who were manually annotated into four categories (bots, promotional customers, normal customers, and genuine users). The authors also reported the values of 64 features for each user, which were used in their approach to distinguish between collusive users and genuine users. The CoReRank dataset [29] is another corpus of retweeters collected from two blackmarket services: YouLikeHits and Like4Like. It contains the details of 4,732 collusive users and 2,719 genuine users. Liu et al. [93] released the DetectVC dataset of voluntary followers containing 3,250 volower IDs and their following relations. The MalReg dataset [62] contains annotated groups of users that collude to retweet together maliciously. It consists of 1,017 malicious retweeter groups collected from three political events: the UK general election 2017, the Indian banknote demonetization 2016, and the Delhi legislative assembly election 2013. The FakeLikers dataset contains 13,147 likers, 4.66M pages, 0.99M posts, and 5.47M friend relations from Facebook. There is only one available dataset for recruitment fraud (EMSCAD) [132]. The dataset consists of 17,880 annotated job ads (17,014 legitimate and 866 fraudulent) published between 2012 and 2014 and annotated by people with domain expertise. Each job ad is described by a set of fields, and a label indicates whether the entry is fraudulent or not. There are a few publicly available datasets for rating platforms. Dhawan et al. [43] released a large-scale dataset of reviews from different applications available on the Google Play Store. The dataset contains 325,424 reviews made by 321,436 reviewers on 192 apps. YelpNYC [104] and YelpZip [105] are datasets collected from Yelp and used for detecting collusive users/groups. The YelpNYC dataset consists of review data related to 923 restaurants in New York City; it includes 359k reviews and 160k reviewers. YelpZip is relatively large in comparison; it consists of 608k reviews and 260k reviewers for 5k restaurants located in multiple states of the United States. Dutta et al. [46] released ABOME, a dataset of collusive Twitter and YouTube entities collected from two blackmarket services: YouLikeHits and Like4Like. ABOME consists of 23,522 collusive retweets and 18,368 collusive follower requests for Twitter, and 58,091 collusive likes, 25,106 comments, and 7,847 subscription requests for YouTube. ABOME also contains time-series data for 2,350 collusive Twitter users and 4,989 collusive tweets.
Issues regarding datasets. An issue with most online media datasets for anomaly detection is that they are shared with only the details of the entities and not the identifiers (tweet ID, YouTube video ID, etc.). Thus, for any new study carried out on such datasets, no reference is available to obtain new information about those entities. Another related issue is the need for large annotated datasets to carry out studies that represent a substantial portion of the entire population of an online media platform.

7.3 Interfaces and Applications

In this section, we discuss the tools and interfaces developed to detect collusion in online media. A list of interfaces and applications is provided in the following with references to the corresponding studies; in addition, Figure 9 shows the working of two such applications:
Fig. 9.
Fig. 9. (a) Working of TweetCred to assess credibility of content on Twitter. (b) Working of Analisa to find the authenticity of followers on Instagram.
Dutta et al. [48] developed a Chrome extension called ScoRe to detect collusive users involved in blackmarket-based retweeting activities. ScoRe also supports online learning by incorporating user feedback.
TwitterAudit18 and FollowerWonk19 analyze Twitter accounts to check for fake followers.
Botometer20 assigns a score (from 0 to 5) to an account based on its bot activities. The higher the score, the higher the chance that the account is being controlled by a bot.
TweetCred [60] measures the credibility of a tweet based on user-centric and tweet-centric properties.
FakeCheck, FakeLikes, and IGAudit21 evaluate Instagram accounts for fake followers.
TwitterPatrol [121] is a real-time tool to detect spammers and fake and compromised accounts on Twitter.
Modash22 analyzes brands to find influencers on Instagram. It also has the functionality to detect fake followers on a public user account.
Analisa23 is a tool for bloggers and agencies to check follower authenticity for an Instagram or TikTok account.
Cresci et al. [38] surveyed three Twitter analytics tools (“fakers,” “Fake Follower Check,” and “Twitter Audit”) that aim to detect fake followers and showed the lack of reliability of these tools. The authors showed that the tools fail to meet the basic assumption of unbiased sampling and are unable to produce the same label for the same set of target accounts. It would be interesting to see a series of comparative experiments that summarize the limitations of the newer tools mentioned in this section. Finally, some of the previous works have publicly released their resources and code. Interested readers can refer to studies found elsewhere [8, 9, 43, 47, 48, 62], where public links to the resources and code for collusive entity detection and analysis are available. We believe that such studies are worthy of attention, as they encourage the reproducibility of experiments and enable follow-up studies to use the models as baselines to improve collusive entity detection.

7.4 Evaluation Metrics

For any classification task, metrics such as precision, recall, and F-score are commonly used. However, in most of the studies, we observe the F-score to be more popular, as it is a combined metric that conveys the balance between precision and recall. In general, the F-score is calculated as the weighted harmonic mean of precision and recall, with a beta parameter (\(\beta\)) that determines the weight given to recall. In collusive entity detection, the most widely used version of the F-score is calculated as the macro-average of the F-scores for the collusive (col) and genuine (gen) classes as follows:
\begin{align*} F &= \dfrac{2 \times Precision \times Recall}{Precision + Recall} & F &= \dfrac{F_{col} + F_{gen}}{2}, \end{align*}
\begin{align*} F_{col} &= \dfrac{2 \times Precision_{col} \times Recall_{col}}{Precision_{col} + Recall_{col}} & F_{gen} &= \dfrac{2 \times Precision_{gen} \times Recall_{gen}}{Precision_{gen} + Recall_{gen}}, \end{align*}
\begin{align*} Precision_{col} &= \dfrac{Correct_{col}}{Correct_{col} + Spurious_{col}} & Precision_{gen} &= \dfrac{Correct_{gen}}{Correct_{gen} + Spurious_{gen}}, \end{align*}
\begin{align*} Recall_{col} &= \dfrac{Correct_{col}}{Correct_{col} + Missing_{col}} & Recall_{gen} &= \dfrac{Correct_{gen}}{Correct_{gen} + Missing_{gen}}, \end{align*}
where \(Correct_{col}\) and \(Correct_{gen}\) indicate correctly classified collusive and genuine users, respectively. Similarly, \(Spurious_{col}\) (respectively, \(Missing_{col}\)) and \(Spurious_{gen}\) (respectively, \(Missing_{gen}\)) indicate false positive (respectively, false negative) for collusive and genuine classes, respectively.
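For concreteness, the macro-averaged F-score above can be computed directly from the correct/spurious/missing counts; the toy counts below are assumptions for illustration.

```python
def macro_f1(correct_col, spurious_col, missing_col,
             correct_gen, spurious_gen, missing_gen):
    """Macro-averaged F-score over the collusive and genuine classes,
    following the definitions given above."""
    def f1(correct, spurious, missing):
        p = correct / (correct + spurious) if correct + spurious else 0.0
        r = correct / (correct + missing) if correct + missing else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0
    return 0.5 * (f1(correct_col, spurious_col, missing_col)
                  + f1(correct_gen, spurious_gen, missing_gen))


# toy counts: 80 collusive users correctly found, 20 missed, 10 genuine users wrongly flagged
print(macro_f1(80, 10, 20, 90, 20, 10))
```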
Precision@k and Recall@k are other metrics used for collusive entity detection [29] and are calculated as follows:
\begin{align*} Precision@k &= \frac{\text{\# of recommended items @k that are collusive}}{\text{k}} \\ Recall@k &= \frac{\text{\# of recommended items @k that are collusive}}{\text{total \# of collusive items}}. \end{align*}
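Similarly, Precision@k and Recall@k can be computed from a ranked list of items (most suspicious first) and the set of known collusive items; the toy ranking below is an assumption.

```python
def precision_recall_at_k(ranked_items, collusive_set, k):
    """Precision@k and Recall@k for a suspiciousness ranking."""
    hits = sum(1 for item in ranked_items[:k] if item in collusive_set)
    return hits / k, hits / len(collusive_set)


ranked = ["u7", "u2", "u9", "u1", "u5"]
collusive = {"u2", "u9", "u4"}
print(precision_recall_at_k(ranked, collusive, k=3))   # (0.666..., 0.666...)
```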
Dutta et al. [50] also reported the true positive rate. The reason behind choosing the true positive rate as the evaluation metric is that all of their models are trained on a single class, and the authors are only interested in the proportion of actual positives (data in the collusive class) that are correctly identified by the models. This evaluation strategy can be considered when the objective is to measure the effectiveness on a single class.

8 Open Problems

Collusion is an emerging topic in the area of anomaly detection that opens up a number of future research opportunities. Some of these topics include (i) collective collusion detection, (ii) understanding connectivity patterns in collusive networks, (iii) event-specific studies, (iv) temporal modeling of collusive entities, (v) cross-lingual and multilingual studies, (vi) core collusive user detection, (vii) cross-platform spread of collusive content, (viii) multimodal analysis of collusive entities, and (ix) how collusion promotes offensive content online such as fake news and hate speech. We discuss each of these topics next.

8.1 Collective Collusion Detection

Recently, collusive entity detection has gained a lot of attention in the literature. It is often seen that anomalous phenomena happen in groups, with a set of accounts dedicated to performing fraudulent activities. The primary aim of the collusive group detection task is to find a set of users who jointly exhibit anomalous behavior. Detecting anomalous groups is more difficult than detecting individuals due to inter-group dynamics. We request that the reader review Section 5 for details of previous studies on individual and group collusive entity detection. We believe that the advent of new datasets (see Section 7 for more details) will foster fraudulent user/entity detection (at both the individual and group levels) with the advantage of adding the topical dimension as well as the temporal dimension.

8.2 Understanding Connectivity Patterns in Collusive Networks

Understanding connectivity patterns of an underlying network is a well-studied problem in the literature. It includes tasks such as inferring lockstep behavior [15], dense block detection [119], detecting core users [117, 118], identifying the most relevant actors in a network [18], sudden appearance/disappearance of links [54], and so forth. One potential research direction is to create various networks among the entities to investigate the network’s structural patterns. Recently, several studies have modeled information diffusion for collusive entities from a topological point of view. It includes tasks such as influence maximization (selecting a seed set to maximize the influence spread) [72, 96], predicting information cascade [64, 103], measuring message propagation and social influence [19, 146], and so on.

8.3 Event-Specific Studies

Event-specific studies can be performed by deeply investigating large-scale datasets. Some of the publicly available datasets mentioned in Section 7 are obtained from multiple sources and span a long period of time. Therefore, they may contain information about many major events [11], which can easily be extracted for event-centric studies. Researchers can also check how collusive users/entities were involved in manipulating the popularity of events by artificially inflating/deflating the social reputation of entities on online media [152].

8.4 Temporal Modeling of Collusive Entities

A topic closely related to the previous one is the temporal modeling of collusive entities. The tasks in temporal modeling include detecting time periods containing unusual activity [57], identifying repetitive patterns in time-evolving graphs [148], and so on. Recently, detecting anomalies in streams [53, 54] has gained a lot of attention in the research community due to the time-evolving (or dynamic) setting in which consecutive snapshots of activity in a time window are monitored. Related studies on temporal modeling can be found elsewhere [52, 140, 147].

8.5 Cross-lingual and Multilingual Studies

The datasets in collusive entity detection are collected from various online media platforms where the entities usually have texts written in several languages. Although we observe the lack of annotated datasets in languages other than English, cross-lingual studies can be done by converting the English texts into the target language using automated translation tools. The converted texts can then be used to create the training and test datasets. For multilingual studies, we can consider it as a future research topic that can be performed only when sufficient content is available in multiple languages.

8.6 Core Collusive User Detection

The underlying blackmarket collusive network includes two types of users: (i) core users, i.e., fake accounts or sockpuppets that are fully controlled by the blackmarkets (puppet masters), and (ii) compromised accounts that are temporarily hired to support the core users. These two types of users are together called collusive users. Core users are the spine of any collusive blackmarket; they monitor and intelligently control all fraudulent activities in such a way that none of their hired compromised accounts are suspended. Therefore, detecting and removing core blackmarket users is of utmost importance to decentralize the collusive network and keep the online media ecosystem healthy and trustworthy.
According to the literature [68], detecting core nodes in a network amounts to finding the influential nodes. However, in the case of collusion, core nodes might not be influential, as these are the accounts that are fully controlled by the blackmarket authorities. The work by Shin et al. [117] is the only study that explores empirical patterns of core users in real-world networks, and the works of Huang et al. [68] and Zhang and Zhang [153] are among the studies on influential node detection in a network. However, none of the existing studies has attempted to detect core collusive users. One reason may be the lack of ground truth for training and evaluation. We believe that core blackmarket user detection is highly important to understand how these services operate and to flag their behavior.

8.7 Cross-platform Spread of Collusive Content

Another future work could be to investigate the cross-platform spread of collusive entities and study its impact. Online media platforms differ from each other in multiple ways: (i) different platforms have significantly different language characteristics, (ii) some platforms have restrictions on the length of posts allowing users to express themselves within less space, (iii) some platforms only allow images/videos as posts, and (iv) different platforms have different types of appraisals. Very few studies considered analyzing cross-platform data [70, 108] in online media platforms. Moreover, no work to date examined the cross-platform study of collusive entities. An important related issue to conduct the cross-platform study is the need for ground-truth datasets collected from different online media platforms.

8.8 Multimodal Analysis of Collusive Entities

A recent trend in social computing research is to conduct multimodal analysis of online media entities. The multimodal investigation of collusive entities is important for the following reasons: (i) different modalities may exhibit different but important information about an entity (e.g., they may help in detecting the authenticity of the information), and (ii) different modalities can be manipulated differently by the blackmarket services. Note that multimodal analysis is expected to introduce additional computational cost due to the operations on image and video data. Singhal et al. [122] proposed a multimodal framework called SpotFake for fake news detection by leveraging the textual and image modalities. Similarly, new models can be designed to incorporate different modalities for collusive entity detection.

8.9 How Collusion Promotes Fake News

Online social networks are a popular way of disseminating and consuming information. However, due to their decentralized nature, they also come with limited liability for disinformation and collusive persuasion. Online users are often subjected to a barrage of misinformation within a short span of time, all of which is targeted at persuading them to develop a particular sentiment against a person, a political party, a system of medicine, or the cause of an event. This “echo chamber” effect influences users’ thoughts in a manner that is unfavorable to the spirit of free speech and debate. This direction calls for research into the detection of fake news within online social networks, with a specific focus on understanding the nature of collusive orchestration of misinformation. In particular, one may consider (i) collusion in fake news (i.e., identifying telltale patterns of collusion within the context of fake news) and (ii) collusive fake news detection (i.e., leveraging such collusive patterns to enhance fake news detection). Beyond improving detection accuracy, such research would make fundamental advances toward a novel direction in fake news analytics—that of characterizing and exploiting collusive behavior. We also encourage the reader to review the work of Kumar and Shah [82] for detailed information about diverse aspects of false information on the web.

9 Conclusion and Open Challenges

In this survey article, we presented a detailed overview of the fundamentals of collusion and detection of artificial boosting of social reputation and manipulation observed in online media. We categorized existing studies based on the platforms that are compromised by the collusive users. The approaches were mostly developed for the detection of collusive entities using techniques such as feature-based, graph-based, and advanced models in both supervised and unsupervised settings. In addition, the survey also includes the related annotation guidelines, datasets, available applications, evaluation metrics, and future research opportunities for collusive entity detection.
This area is still in its infancy and has various open challenges. The major challenge is the collection of large-scale datasets, as crawling blackmarket services is extremely challenging and may require ethical consent. Moreover, there are restrictions and limitations of the APIs provided by the online media platforms. Furthermore, human annotation to label an action (retweet, share, like, etc.) as collusive is confusing and often incurs low inter-annotator agreement [29], as collusive activities bear high resemblance to genuine activities. Therefore, it is difficult to design efficient supervised methods. There are limited studies on the detection of such collusive manipulation in online media platforms other than social networks, which requires detailed inspection. Detection and evaluation of self-collusion (creating multiple-account deception, sockpuppets, etc.) are extremely challenging due to the difficulty in collecting ground-truth data. We believe that this survey will motivate researchers to dig deeper into exploring the dynamics of collusion in online media platforms.

Footnotes

References

[1]
Shalinda Adikari and Kaushik Dutta. 2020. Identifying fake profiles in LinkedIn. arXiv preprint arXiv:2006.01381 (2020).
[2]
Anupama Aggarwal, Saravana Kumar, Kushagra Bhargava, and Ponnurangam Kumaraguru. 2018. The follower count fallacy: Detecting Twitter users with manipulated follower count. arXiv preprint arXiv:1802.03625 (2018).
[3]
Anupama Aggarwal and Ponnurangam Kumaraguru. 2014. Followers or phantoms? An anatomy of purchased Twitter followers. arXiv preprint arXiv:1408.1534 (2014).
[4]
M. N. Istiaq Ahsan, Tamzid Nahian, Abdullah All Kafi, Md. Ismail Hossain, and Faisal Muhammad Shah. 2016. Review spam detection using active learning. In Proceedings of the 2016 IEEE 7th Annual Information Technology, Electronics, and Mobile Communication Conference (IEMCON’16). IEEE, Los Alamitos, CA, 1–7.
[5]
Leman Akoglu, Hanghang Tong, and Danai Koutra. 2015. Graph based anomaly detection and description: A survey. Data Mining and Knowledge Discovery 29, 3 (2015), 626–688.
[6]
Mohammed Ali and Timothy Levine. 2008. The language of truthful and deceptive denials and confessions. Communication Reports 21, 2 (2008), 82–91.
[7]
Jalal S. Alowibdi, Ugo A. Buy, S. Yu Philip, Sohaib Ghani, and Mohamed Mokbel. 2015. Deception detection in Twitter. Social Network Analysis and Mining 5, 1 (2015), 32.
[8]
Udit Arora, Hridoy Sankar Dutta, Brihi Joshi, Aditya Chetan, and Tanmoy Chakraborty. 2020. Analyzing and detecting collusive users involved in blackmarket retweeting activities. ACM Transactions on Intelligent Systems and Technology 11, 3 (April 2020), Article 35, 24 pages.
[9]
Udit Arora, William Scott Paka, and Tanmoy Chakraborty. 2019. Multitask learning for blackmarket tweet detection. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. 127–130.
[10]
Irina Mariuca Asavoae, Jorge Blasco, Thomas M. Chen, Harsha Kumara Kalutarage, Igor Muttik, Hoang Nga Nguyen, Markus Roggenbach, and Siraj Ahmed Shaikh. 2016. Towards automated Android app collusion detection. arXiv preprint arXiv:1603.02308 (2016).
[11]
Farzindar Atefeh and Wael Khreich. 2015. A survey of techniques for event detection in Twitter. Computational Intelligence 31, 1 (2015), 132–164.
[12]
Prudhvi Ratna Badri Satya, Kyumin Lee, Dongwon Lee, Thanh Tran, and Jason Jiasheng Zhang. 2016. Uncovering fake likers in online social networks. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM’16). 2365–2370.
[13]
Eytan Bakshy, Jake M. Hofman, Winter A. Mason, and Duncan J. Watts. 2011. Everyone’s an influencer: Quantifying influence on Twitter. In Proceedings of the 4th ACM International Conference on Web Search and Data Mining. ACM, New York, NY, 65–74.
[14]
Fabricio Benevenuto, Gabriel Magno, Tiago Rodrigues, and Virgilio Almeida. 2010. Detecting spammers on Twitter. In Proceedings of the Collaboration, Electronic Messaging, Anti-Abuse, and Spam Conference (CEAS’10), Vol. 6. 12.
[15]
Alex Beutel, Wanhong Xu, Venkatesan Guruswami, Christopher Palow, and Christos Faloutsos. 2013. CopyCatch: Stopping group attacks by spotting lockstep behavior in social networks. In Proceedings of the 22nd International Conference on World Wide Web (WWW’13). ACM, New York, NY, 119–130.
[16]
Shweta Bhandari, Wafa Ben Jaballah, Vineeta Jain, Vijay Laxmi, Akka Zemmari, Manoj Singh Gaur, Mohamed Mosbah, and Mauro Conti. 2017. Android inter-app communication threats and detection techniques. Computers & Security 70 (2017), 392–421.
[17]
Jorge Blasco and Thomas M. Chen. 2018. Automated generation of colluding apps for experimental research. Journal of Computer Virology and Hacking Techniques 14, 2 (2018), 127–138.
[18]
Stephen P. Borgatti. 2006. Identifying sets of key players in a social network. Computational & Mathematical Organization Theory 12, 1 (2006), 21–34.
[19]
Phil E. Brown and Junlan Feng. 2011. Measuring user influence on Twitter using modified k-shell decomposition. In Proceedings of the 5th International AAAI Conference on Weblogs and Social Media. 18–23.
[20]
Axel Bruns, Brenda Moon, Felix Victor Münch, Patrik Wikström, Stefan Stieglitz, Florian Brachten, and Björn Ross. 2018. Detecting Twitter bots that share SoundCloud tracks. In Proceedings of the 9th International Conference on Social Media and Society. 251–255.
[21]
Cody Buntain and Jennifer Golbeck. 2017. Automatically identifying fake news in popular Twitter threads. In Proceedings of the 2017 IEEE International Conference on Smart Cloud (SmartCloud’17). IEEE, Los Alamitos, CA, 208–215.
[22]
Chiyu Cai, Linjing Li, and Daniel Zengi. 2017. Behavior enhanced deep bot detection in social media. In Proceedings of the 2017 IEEE International Conference on Intelligence and Security Informatics (ISI’17). IEEE, Los Alamitos, CA, 128–130.
[23]
Mark Carman, Mark Koerber, Jiuyong Li, Kim-Kwang Raymond Choo, and Helen Ashman. 2018. Manipulating visibility of political and apolitical threads on Reddit via score boosting. In Proceedings of the 2018 17th IEEE International Conference on Trust, Security, and Privacy in Computing and Communications and the 12th IEEE International Conference on Big Data Science and Engineering (TrustCom/BigDataSE’18). IEEE, Los Alamitos, CA, 184–190.
[24]
Jacopo Castellini, Valentina Poggioni, and Giulia Sorbi. 2017. Fake Twitter followers detection by denoising autoencoder. In Proceedings of the International Conference on Web Intelligence. 195–202.
[25]
Nikan Chavoshi, Hossein Hamooni, and Abdullah Mueen. 2016. DeBot: Twitter bot detection via warped correlation. In Proceedings of the 2016 IEEE 16th International Conference on Data Mining (ICDM’16). 817–822.
[26]
Nikan Chavoshi, Hossein Hamooni, and Abdullah Mueen. 2017. Temporal patterns in bot activities. In Proceedings of the 26th International Conference on World Wide Web Companion. 1601–1606.
[27]
Liang Chen, Yipeng Zhou, and Dah Ming Chiu. 2015. Analysis and detection of fake views in online video services. ACM Transactions on Multimedia Computing, Communications, and Applications 11, 2s (2015), 44.
[28]
Justin Cheng, Lada Adamic, P. Alex Dow, Jon Michael Kleinberg, and Jure Leskovec. 2014. Can cascades be predicted? In Proceedings of the 23rd International Conference on World Wide Web. ACM, New York, NY, 925–936.
[29]
Aditya Chetan, Brihi Joshi, Hridoy Sankar Dutta, and Tanmoy Chakraborty. 2019. CoReRank: Ranking to detect users involved in blackmarket-based collusive retweeting activities. In Proceedings of the 12th ACM International Conference on Web Search and Data Mining (WSDM’19). 330–338.
[30]
Zi Chu, Steven Gianvecchio, Haining Wang, and Sushil Jajodia. 2010. Who is tweeting on Twitter: Human, bot, or cyborg? In Proceedings of the 26th Annual Computer Security Applications Conference. ACM, New York, NY, 21–30.
[31]
Zi Chu, Steven Gianvecchio, Haining Wang, and Sushil Jajodia. 2012. Detecting automation of Twitter accounts: Are you a human, bot, or cyborg? IEEE Transactions on Dependable and Secure Computing 9, 6 (2012), 811–824.
[32]
Mauro Conti, Radha Poovendran, and Marco Secchiero. 2012. Fakebook: Detecting fake profiles in on-line social networks. In Proceedings of the 2012 International Conference on Advances in Social Networks Analysis, and Mining (ASONAM’12). IEEE, Los Alamitos, CA, 1071–1078.
[33]
David M. Cook, Benjamin Waugh, Maldini Abdipanah, Omid Hashemi, and Shaquille Abdul Rahman. 2014. Twitter deception and influence: Issues of identity, slacktivism, and puppetry. Journal of Information Warfare 13, 1 (2014), 58–71.
[34]
Stefano Cresci, Roberto Di Pietro, Marinella Petrocchi, Angelo Spognardi, and Maurizio Tesconi. 2015. Fame for sale: Efficient detection of fake Twitter followers. Decision Support Systems 80 (2015), 56–71.
[35]
Stefano Cresci, Roberto Di Pietro, Marinella Petrocchi, Angelo Spognardi, and Maurizio Tesconi. 2016. DNA-inspired online behavioral modeling and its application to spambot detection. IEEE Intelligent Systems 31, 5 (2016), 58–64.
[36]
Stefano Cresci, Roberto Di Pietro, Marinella Petrocchi, Angelo Spognardi, and Maurizio Tesconi. 2017. The paradigm-shift of social spambots: Evidence, theories, and tools for the arms race. In Proceedings of the 26th International Conference on World Wide Web Companion. 963–972.
[37]
Stefano Cresci, Roberto Di Pietro, Marinella Petrocchi, Angelo Spognardi, and Maurizio Tesconi. 2017. Social fingerprinting: Detection of spambot groups through DNA-inspired behavioral modeling. IEEE Transactions on Dependable and Secure Computing 15, 4 (2017), 561–576.
[38]
Stefano Cresci, Marinella Petrocchi, Angelo Spognardi, Maurizio Tesconi, and Roberto Di Pietro. 2014. A criticism to society (as seen by Twitter analytics). In Proceedings of the 2014 IEEE 34th International Conference on Distributed Computing Systems Workshops (ICDCSW’14). IEEE, Los Alamitos, CA, 194–200.
[39]
Clayton Allen Davis, Onur Varol, Emilio Ferrara, Alessandro Flammini, and Filippo Menczer. 2016. BotOrNot: A system to evaluate social bots. In Proceedings of the 25th International Conference Companion on World Wide Web. 273–274.
[40]
Emiliano De Cristofaro, Arik Friedman, Guillaume Jourjon, Mohamed Ali Kaafar, and M. Zubair Shafiq. 2014. Paying for likes? Understanding Facebook like fraud using honeypots. In Proceedings of the 2014 Internet Measurement Conference (IMC’14). 129–136.
[41]
Michela Del Vicario, Alessandro Bessi, Fabiana Zollo, Fabio Petroni, Antonio Scala, Guido Caldarelli, H. Eugene Stanley, and Walter Quattrociocchi. 2016. The spreading of misinformation online. Proceedings of the National Academy of Sciences 113, 3 (2016), 554–559.
[42]
Michael Dewing. 2010. Social Media: An Introduction. Vol. 1. Library of Parliament, Ottawa, Canada.
[43]
Sarthika Dhawan, Siva Charan Reddy Gangireddy, Shiv Kumar, and Tanmoy Chakraborty. 2019. Spotting collective behaviour of online fraud groups in customer reviews. arXiv preprint arXiv:1905.13649 (2019).
[44]
John P. Dickerson, Vadim Kagan, and V. S. Subrahmanian. 2014. Using sentiment to detect bots on Twitter: Are humans more opinionated than bots? In Proceedings of the 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. IEEE, Los Alamitos, CA, 620–627.
[45]
Hridoy Sankar Dutta, Kartik Aggarwal, and Tanmoy Chakraborty. 2021. DECIFE: Detecting collusive users involved in blackmarket following services on Twitter. In Proceedings of the 32nd ACM Conference on Hypertext and Social Media. 91–100.
[46]
Hridoy Sankar Dutta, Udit Arora, and Tanmoy Chakraborty. 2021. ABOME: A multi-platform data repository of artificially boosted online media entities. In Proceedings of the International AAAI Conference on Web and Social Media (ICWSM’21). 1000–1008. https://ojs.aaai.org/index.php/ICWSM/article/view/18123.
[47]
Hridoy Sankar Dutta and Tanmoy Chakraborty. 2019. Blackmarket-driven collusion among retweeters—Analysis, detection and characterization. IEEE Transactions on Information Forensics and Security PP, 99 (2019), 1.
[48]
Hridoy Sankar Dutta, Aditya Chetan, Brihi Joshi, and Tanmoy Chakraborty. 2018. Retweet us, we will retweet you: Spotting collusive retweeters involved in blackmarket services. In Proceedings of the International Conference on Advances in Social Networks Analysis and Mining (ASONAM’18). 242–249.
[49]
Hridoy Sankar Dutta, Vishal Raj Dutta, Aditya Adhikary, and Tanmoy Chakraborty. 2020. HawkesEye: Detecting fake retweeters using Hawkes process and topic modeling. IEEE Transactions on Information Forensics and Security 15 (2020), 2667–2678.
[50]
Hridoy Sankar Dutta, Mayank Jobanputra, Himani Negi, and Tanmoy Chakraborty. 2020. Detecting and analyzing collusive entities on YouTube. arXiv preprint arXiv:2005.06243 (2020).
[51]
Ahmed El Azab, Amira M. Idrees, Mahmoud A. Mahmoud, and Hesham Hefny. 2016. Fake account detection in Twitter based on minimum weighted feature set. International Journal of Innovation Science and Research 10, 1 (2016), 13–18.
[52]
Dhivya Eswaran. 2020. Mining Anomalies Using Static and Dynamic Graphs. Ph.D. Dissertation. Carnegie Mellon University, Pittsburgh, PA.
[53]
Dhivya Eswaran and Christos Faloutsos. 2018. SedanSpot: Detecting anomalies in edge streams. In Proceedings of the 2018 IEEE International Conference on Data Mining (ICDM’18). IEEE, Los Alamitos, CA, 953–958.
[54]
Dhivya Eswaran, Christos Faloutsos, Sudipto Guha, and Nina Mishra. 2018. Spotlight: Detecting anomalies in streaming graphs. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1378–1386.
[55]
Emilio Ferrara, Onur Varol, Clayton Davis, Filippo Menczer, and Alessandro Flammini. 2016. The rise of social bots. Communications of the ACM 59, 7 (2016), 96–104.
[56]
Grace Gee and Hakson Teh. 2010. Twitter Spammer Profile Detection. Retrieved March 8, 2022 from https://cs229.stanford.edu/proj2010/GeeTeh-TwitterSpammerProfileDetection.pdf.
[57]
Maria Giatsoglou, Despoina Chatzakou, Neil Shah, Christos Faloutsos, and Athena Vakali. 2015. Retweeting activity on Twitter: Signs of deception. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. 122–134.
[58]
Zafar Gilani, Liang Wang, Jon Crowcroft, Mario Almeida, and Reza Farahbakhsh. 2016. Stweeler: A framework for Twitter bot analysis. In Proceedings of the 25th International Conference Companion on World Wide Web. 37–38.
[59]
Raman Goyal, Gabriel Ferreira, Christian Kästner, and James Herbsleb. 2018. Identifying unusual commits on GitHub. Journal of Software: Evolution and Process 30, 1 (2018), e1893.
[60]
Aditi Gupta, Ponnurangam Kumaraguru, Carlos Castillo, and Patrick Meier. 2014. TweetCred: Real-time credibility assessment of content on Twitter. In Proceedings of the International Conference on Social Informatics. 228–243.
[61]
Aditi Gupta, Hemank Lamba, Ponnurangam Kumaraguru, and Anupam Joshi. 2013. Faking Sandy: Characterizing and identifying fake images on Twitter during Hurricane Sandy. In Proceedings of the 22nd International Conference on World Wide Web. ACM, New York, NY, 729–736.
[62]
Sonu Gupta, Ponnurangam Kumaraguru, and Tanmoy Chakraborty. 2019. MalReG: Detecting and analyzing malicious retweeter groups. In Proceedings of the ACM India Joint International Conference on Data Science and Management of Data (CODS-COMAD). ACM, New York, NY, 61–69.
[63]
Supraja Gurajala, Joshua S. White, Brian Hudson, and Jeanna N. Matthews. 2015. Fake Twitter accounts: Profile characteristics obtained using an activity-based pattern detection approach. In Proceedings of the 2015 International Conference on Social Media and Society. ACM, New York, NY, 9.
[64]
Man Aris Nur Hakim and Masayu Leylia Khodra. 2014. Predicting information cascade on Twitter using support vector regression. In Proceedings of the 2014 International Conference on Data and Software Engineering (ICODSE’14). IEEE, Los Alamitos, CA, 1–6.
[65]
Larissa Hjorth and Sam Hinton. 2019. Understanding Social Media. SAGE Publications.
[66]
Bryan Hooi, Neil Shah, Alex Beutel, Stephan Günnemann, Leman Akoglu, Mohit Kumar, Disha Makhija, and Christos Faloutsos. 2016. BIRDNEST: Bayesian inference for ratings-fraud detection. In Proceedings of the 2016 SIAM International Conference on Data Mining. 495–503.
[67]
Yan Hu, Shanshan Wang, Yizhi Ren, and Kim-Kwang Raymond Choo. 2018. User influence analysis for GitHub developer social networks. Expert Systems with Applications 108 (2018), 108–118.
[68]
Xinyu Huang, Dongming Chen, Dongqi Wang, and Tao Ren. 2020. Identifying influencers in social networks. Entropy 22, 4 (2020), 450.
[69]
Isa Inuwa-Dutse, Bello Shehu Bello, and Ioannis Korkontzelos. 2018. Lexical analysis of automated accounts on Twitter. arXiv preprint arXiv:1812.07947 (2018).
[70]
Kokil Jaidka, Sharath Chandra Guntuku, Anneke Buffone, H. Andrew Schwartz, and Lyle H. Ungar. 2018. Facebook vs. Twitter: Cross-platform differences in self-disclosure and trait prediction. In Proceedings of the 12th International AAAI Conference on Web and Social Media. 141–150.
[71]
Boyeon Jang, Sihyun Jeong, and Chong-Kwon Kim. 2019. Distance-based customer detection in fake follower markets. Information Systems 81 (2019), 104–116.
[72]
Siwar Jendoubi, Arnaud Martin, Ludovic Liétard, Hend Ben Hadji, and Boutheina Ben Yaghlane. 2017. Two evidential data based models for influence maximization in Twitter. Knowledge-Based Systems 121 (2017), 58–70.
[73]
Fang Jin, Edward Dougherty, Parang Saraf, Yang Cao, and Naren Ramakrishnan. 2013. Epidemiological modeling of news and rumors on Twitter. In Proceedings of the 7th Workshop on Social Network Mining and Analysis. ACM, New York, NY, 8.
[74]
Nitin Jindal, Bing Liu, and Ee-Peng Lim. 2010. Finding unusual review patterns using unexpected rules. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management. 1549–1552.
[75]
Harsha Kumara Kalutarage, Hoang Nga Nguyen, and Siraj Ahmed Shaikh. 2017. Towards a threat assessment framework for apps collusion. Telecommunication Systems 66, 3 (2017), 417–430.
[76]
Mücahit Kantepe and Murat Can Ganiz. 2017. Preprocessing framework for Twitter bot detection. In Proceedings of the 2017 International Conference on Computer Science and Engineering (UBMK’17). IEEE, Los Alamitos, CA, 630–634.
[77]
Jooyeon Kim, Behzad Tabibian, Alice Oh, Bernhard Schölkopf, and Manuel Gomez-Rodriguez. 2018. Leveraging the crowd to detect and reduce the spread of fake news and misinformation. In Proceedings of the 11th ACM International Conference on Web Search and Data Mining. 324–332.
[78]
Sneha Kudugunta and Emilio Ferrara. 2018. Deep neural networks for bot detection. Information Sciences 467 (2018), 312–322.
[79]
Srijan Kumar, Justin Cheng, Jure Leskovec, and V. S. Subrahmanian. 2017. An army of me: Sockpuppets in online discussion communities. In Proceedings of the 26th International Conference on World Wide Web. 857–866.
[81]
Srijan Kumar, Bryan Hooi, Disha Makhija, Mohit Kumar, Christos Faloutsos, and V. S. Subrahmanian. 2018. REV2: Fraudulent user prediction in rating platforms. In Proceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM’18). 333–341.
[82]
Srijan Kumar and Neil Shah. 2018. False information on web and social media: A survey. arXiv preprint arXiv:1804.08559 (2018).
[83]
Andrey Kupavskii, Liudmila Ostroumova, Alexey Umnov, Svyatoslav Usachev, Pavel Serdyukov, Gleb Gusev, and Andrey Kustarev. 2012. Prediction of retweet cascade size over time. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management. ACM, New York, NY, 2335–2338.
[84]
David F. Larcker and Anastasia A. Zakolyukina. 2012. Detecting deceptive discussions in conference calls. Journal of Accounting Research 50, 2 (2012), 495–540.
[85]
David M. J. Lazer, Matthew A. Baum, Yochai Benkler, Adam J. Berinsky, Kelly M. Greenhill, Filippo Menczer, Miriam J. Metzger, et al. 2018. The science of fake news. Science 359, 6380 (2018), 1094–1096.
[86]
Kyumin Lee, Brian David Eoff, and James Caverlee. 2011. Seven months with the devils: A long-term study of content polluters on Twitter. In Proceedings of the 5th International AAAI Conference on Weblogs and Social Media. 185–192.
[87]
Kristina Lerman and Rumi Ghosh. 2010. Information contagion: An empirical study of the spread of news on Digg and Twitter social networks. In Proceedings of the 4th International AAAI Conference on Weblogs and Social Media. 90–97.
[88]
Fangtao Li, Minlie Huang, Yi Yang, and Xiaoyan Zhu. 2011. Learning to identify review spam. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence. 2488–2493.
[89]
Huayi Li, Zhiyuan Chen, Bing Liu, Xiaokai Wei, and Jidong Shao. 2014. Spotting fake reviews via collective positive-unlabeled learning. In Proceedings of the 2014 IEEE International Conference on Data Mining. IEEE, Los Alamitos, CA, 899–904.
[90]
Huayi Li, Arjun Mukherjee, Bing Liu, Rachel Kornfield, and Sherry Emery. 2014. Detecting campaign promoters on Twitter using Markov random fields. In Proceedings of the 2014 IEEE International Conference on Data Mining. IEEE, Los Alamitos, CA, 290–299.
[91]
Po-Ching Lin and Po-Min Huang. 2013. A study of effective features for detecting long-surviving Twitter spam accounts. In Proceedings of the 2013 15th International Conference on Advanced Communications Technology (ICACT’13). IEEE, Los Alamitos, CA, 841–846.
[92]
Shenghua Liu, Bryan Hooi, and Christos Faloutsos. 2017. HoloScope: Topology-and-spike aware fraud detection. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. 1539–1548.
[93]
Yuli Liu, Yiqun Liu, Min Zhang, and Shaoping Ma. 2016. Pay me and I’ll follow you: Detection of crowdturfing following activities in microblog environment. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI’16). 3789–3796.
[94]
Michael McCord and M. Chuah. 2011. Spam detection on Twitter using traditional classifiers. In Autonomic and Trusted Computing. Lecture Notes in Computer Science, Vol. 6906. Springer, 175–186.
[95]
Ashish Mehrotra, Mallidi Sarreddy, and Sanjay Singh. 2016. Detection of fake Twitter followers using graph centrality measures. In Proceedings of the 2016 2nd International Conference on Contemporary Computing and Informatics (IC3I’16). IEEE, Los Alamitos, CA, 499–504.
[96]
Yan Mei, Weiliang Zhao, and Jian Yang. 2017. Influence maximization on Twitter: A mechanism for effective marketing campaign. In Proceedings of the 2017 IEEE International Conference on Communications (ICC’17). IEEE, Los Alamitos, CA, 1–6.
[97]
Fred Morstatter, Liang Wu, Tahora H. Nazer, Kathleen M. Carley, and Huan Liu. 2016. A new approach to bot detection: Striking the balance between precision and recall. In Proceedings of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM’16). IEEE, Los Alamitos, CA, 533–540.
[98]
Arjun Mukherjee, Abhinav Kumar, Bing Liu, Junhui Wang, Meichun Hsu, Malu Castellanos, and Riddhiman Ghosh. 2013. Spotting opinion spammers using behavioral footprints. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 632–640.
[99]
Arjun Mukherjee, Bing Liu, and Natalie Glance. 2012. Spotting fake reviewer groups in consumer reviews. In Proceedings of the 21st International Conference on World Wide Web (WWW’12). ACM, New York, NY, 191–200.
[100]
Hamed Nilforoshan and Neil Shah. 2019. SliceNDice: Mining suspicious multi-attribute entity groups with multi-view graphs. In Proceedings of the 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA’19). IEEE, Los Alamitos, CA, 351–363.
[101]
Yiangos Papanastasiou. 2018. Fake news propagation and detection: A sequential model. SSRN. Retrieved March 8, 2022 from https://ssrn.com/abstract=3028354.
[103]
Geerajit Rattanaritnont, Masashi Toyoda, and Masaru Kitsuregawa. 2011. A study on characteristics of topic-specific information cascade in Twitter. In Proceedings of the Forum on Data Engineering (DE’11). 65–70.
[104]
Shebuti Rayana and Leman Akoglu. 2015. Collective opinion spam detection: Bridging review networks and metadata. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’15). 985–994.
[105]
Shebuti Rayana and Leman Akoglu. 2016. Collective opinion spam detection using active inference. In Proceedings of the 2016 SIAM International Conference on Data Mining (SDM’16). 630–638.
[106]
Björn Ross, Florian Brachten, Stefan Stieglitz, Patrik Wikstrom, Brenda Moon, Felix Victor Münch, and Axel Bruns. 2018. Social bots in a commercial context—A case study on SoundCloud. In Proceedings of the 26th European Conference on Information Systems (ECIS’18). 1–10.
[107]
Diego Saez-Trumper. 2014. Fake tweet buster: A webtool to identify users promoting fake news on Twitter. In Proceedings of the 25th ACM Conference on Hypertext and Social Media. ACM, New York, NY, 316–317.
[108]
Zahra Riahi Samani, Sharath Chandra Guntuku, Mohsen Ebrahimi Moghaddam, Daniel Preoţiuc-Pietro, and Lyle H. Ungar. 2018. Cross-platform and cross-interaction study of user personality based on images on Twitter and Flickr. PLoS One 13, 7 (2018), e0198660.
[109]
Igor Santos, Igor Miñambres-Marcos, Carlos Laorden, Patxi Galán-García, Aitor Santamaría-Ibirika, and Pablo García Bringas. 2014. Twitter content-based spam filtering. In International Joint Conference SOCO’13-CISIS’13-ICEUTE’13. Springer, 449–458.
[110]
Indira Sen, Anupama Aggarwal, Shiven Mian, Siddharth Singh, Ponnurangam Kumaraguru, and Anwitaman Datta. 2018. Worth its weight in likes: Towards detecting fake likes on Instagram. In Proceedings of the 10th ACM Conference on Web Science (WebSci’18). 205–209.
[111]
Neil Shah. 2017. FLOCK: Combating astroturfing on livestreaming platforms. In Proceedings of the 26th International Conference on World Wide Web (WWW’17). 1083–1091.
[112]
Neil Shah, Alex Beutel, Bryan Hooi, Leman Akoglu, Stephan Gunnemann, Disha Makhija, Mohit Kumar, and Christos Faloutsos. 2016. EdgeCentric: Anomaly detection in edge-attributed networks. In Proceedings of the 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW’16). IEEE, Los Alamitos, CA, 327–334.
[113]
Neil Shah, Hemank Lamba, Alex Beutel, and Christos Faloutsos. 2017. The many faces of link fraud. In Proceedings of the 2017 IEEE International Conference on Data Mining (ICDM’17). IEEE, Los Alamitos, CA, 1069–1074.
[114]
Chengcheng Shao, Giovanni Luca Ciampaglia, Onur Varol, Kai-Cheng Yang, Alessandro Flammini, and Filippo Menczer. 2018. The spread of low-credibility content by social bots. Nature Communications 9, 1 (2018), 4787.
[115]
Qinlan Shen and Carolyn Rose. 2019. The discourse of online content moderation: Investigating polarized user responses to changes in Reddit’s quarantine policy. In Proceedings of the 3rd Workshop on Abusive Language Online. 58–69.
[116]
Yi Shen, Jianjun Yu, Kejun Dong, and Kai Nan. 2014. Automatic fake followers detection in Chinese micro-blogging system. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. 596–607.
[117]
Kijung Shin, Tina Eliassi-Rad, and Christos Faloutsos. 2016. CoreScope: Graph mining using k-core analysis—Patterns, anomalies and algorithms. In Proceedings of the 2016 IEEE 16th International Conference on Data Mining (ICDM’16). IEEE, Los Alamitos, CA, 469–478.
[118]
Kijung Shin, Tina Eliassi-Rad, and Christos Faloutsos. 2018. Patterns and anomalies in k-cores of real-world graphs with applications. Knowledge and Information Systems 54, 3 (2018), 677–710.
[119]
Kijung Shin, Bryan Hooi, and Christos Faloutsos. 2016. M-Zoom: Fast dense-block detection in tensors with quality guarantees. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases. 264–280.
[120]
Kijung Shin, Bryan Hooi, and Christos Faloutsos. 2018. Fast, accurate, and flexible algorithms for dense subtensor mining. ACM Transactions on Knowledge Discovery from Data 12, 3 (2018), 1–30.
[121]
Monika Singh, Divya Bansal, and Sanjeev Sofat. 2018. Who is who on Twitter—Spammer, fake or compromised account? A tool to reveal true identity in real-time. Cybernetics and Systems 49, 1 (2018), 1–25.
[122]
Shivangi Singhal, Rajiv Ratn Shah, Tanmoy Chakraborty, Ponnurangam Kumaraguru, and Shin’ichi Satoh. 2019. SpotFake: A multi-modal framework for fake news detection. In Proceedings of the 2019 IEEE 5th International Conference on Multimedia Big Data (BigMM’19). IEEE, Los Alamitos, CA, 39–47.
[123]
Jonghyuk Song, Sangho Lee, and Jong Kim. 2011. Spam filtering in Twitter using sender-receiver relationship. In Proceedings of the International Workshop on Recent Advances in Intrusion Detection. 301–317.
[124]
Gianluca Stringhini. 2014. Stepping Up the Cybersecurity Game: Protecting Online Services from Malicious Activity. Ph.D. Dissertation. University of California, Santa Barbara.
[125]
Gianluca Stringhini, Manuel Egele, Christopher Kruegel, and Giovanni Vigna. 2012. Poultry markets: On the underground economy of Twitter followers. ACM SIGCOMM Computer Communication Review 42, 4 (2012), 527–532.
[126]
Gianluca Stringhini, Pierre Mourlanne, Gregoire Jacob, Manuel Egele, Christopher Kruegel, and Giovanni Vigna. 2015. EVILCOHORT: Detecting communities of malicious accounts on online services. In Proceedings of the 24th USENIX Security Symposium (USENIX Security’15). 563–578.
[127]
Gianluca Stringhini, Gang Wang, Manuel Egele, Christopher Kruegel, Giovanni Vigna, Haitao Zheng, and Ben Y. Zhao. 2013. Follow the green: Growth and dynamics in Twitter follower markets. In Proceedings of the 2013 Internet Measurement Conference (IMC’13). 163–176.
[128]
V. S. Subrahmanian, Amos Azaria, Skylar Durst, Vadim Kagan, Aram Galstyan, Kristina Lerman, Linhong Zhu, Emilio Ferrara, Alessandro Flammini, and Filippo Menczer. 2016. The DARPA Twitter bot challenge. Computer 49, 6 (2016), 38–46.
[129]
Kurt Thomas, Chris Grier, Dawn Song, and Vern Paxson. 2011. Suspended accounts in retrospect: An analysis of Twitter spam. In Proceedings of the 2011 ACM SIGCOMM Internet Measurement Conference (IMC’11). ACM, New York, NY, 243–258.
[130]
Kurt Thomas, Damon McCoy, Chris Grier, Alek Kolcz, and Vern Paxson. 2013. Trafficking fraudulent accounts: The role of the underground market in Twitter spam and abuse. In Proceedings of the 22nd USENIX Conference on Security. 195–210.
[131]
Onur Varol, Emilio Ferrara, Clayton A. Davis, Filippo Menczer, and Alessandro Flammini. 2017. Online human-bot interactions: Detection, estimation, and characterization. In Proceedings of the 11th International AAAI Conference on Web and Social Media. 280–289.
[132]
Sokratis Vidros, Constantinos Kolias, Georgios Kambourakis, and Leman Akoglu. 2017. Automatic detection of online recruitment frauds: Characteristics, methods, and a public dataset. Future Internet 9, 1 (2017), 6.
[133]
Soroush Vosoughi, Deb Roy, and Sinan Aral. 2018. The spread of true and false news online. Science 359, 6380 (2018), 1146–1151.
[134]
Alex Hai Wang. 2010. Don’t follow me: Spam detection in Twitter. In Proceedings of the 2010 International Conference on Security and Cryptography (SECRYPT’10). 1–10.
[135]
Bo Wang, Arkaitz Zubiaga, Maria Liakata, and Rob Procter. 2015. Making the most of tweet-inherent features for social spam detection on Twitter. arXiv preprint arXiv:1503.07405 (2015).
[136]
De Wang, Shamkant B. Navathe, Ling Liu, Danesh Irani, Acar Tamersoy, and Calton Pu. 2013. Click traffic analysis of short URL spam on Twitter. In Proceedings of the 9th IEEE International Conference on Collaborative Computing: Networking, Applications, and Worksharing. IEEE, Los Alamitos, CA, 250–259.
[137]
Guan Wang, Sihong Xie, Bing Liu, and Philip S. Yu. 2011. Review graph based online store review spammer detection. In Proceedings of the 11th IEEE International Conference on Data Mining (ICDM’11). 1242–1247.
[138]
Zhuo Wang, Songmin Gu, Xiangnan Zhao, and Xiaowei Xu. 2018. Graph-based review spammer group detection. Knowledge and Information Systems 55, 3 (2018), 1–27.
[139]
Zhuo Wang, Tingting Hou, Dawei Song, Zhun Li, and Tianqi Kong. 2016. Detecting review spammer groups via bipartite graph projection. Computer Journal 59, 6 (2016), 861–874.
[140]
Audrey Wilmet, Tiphaine Viard, Matthieu Latapy, and Robin Lamarche-Perrin. 2018. Degree-based outliers detection within IP traffic modelled as a link stream. In Proceedings of the 2018 Network Traffic Measurement and Analysis Conference (TMA’18). IEEE, Los Alamitos, CA, 1–8.
[141]
Yongji Wu, Defu Lian, Yiheng Xu, Le Wu, and Enhong Chen. 2020. Graph convolutional networks with Markov random field reasoning for social spammer detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 1054–1061.
[142]
Chang Xu, Jie Zhang, Kuiyu Chang, and Chong Long. 2013. Uncovering collusive spammers in Chinese review websites. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM’13). ACM, New York, NY, 979–988.
[143]
Yuanbo Xu, Yongjian Yang, Jiayu Han, En Wang, Jingci Ming, and Hui Xiong. 2019. Slanderous user detection with modified recurrent neural networks in recommender system. Information Sciences 505 (2019), 265–281.
[144]
Kai-Cheng Yang, Onur Varol, Clayton A. Davis, Emilio Ferrara, Alessandro Flammini, and Filippo Menczer. 2019. Arming the public with AI to counter social bots. arXiv preprint arXiv:1901.00912 (2019).
[146]
Shaozhi Ye and S. Felix Wu. 2010. Measuring message propagation and social influence on Twitter.com. In Proceedings of the International Conference on Social Informatics. 216–231.
[147]
Minji Yoon, Bryan Hooi, Kijung Shin, and Christos Faloutsos. 2019. Fast and accurate anomaly detection in dynamic graphs with a two-pronged approach. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 647–657.
[148]
Hossein Rouhani Zeidanloo, Azizah Bt Manaf, Payam Vahdani, Farzaneh Tabatabaei, and Mazdak Zamani. 2010. Botnet detection based on traffic monitoring. In Proceedings of the 2010 International Conference on Networking and Information Technology. IEEE, Los Alamitos, CA, 97–101.
[149]
Chao Michael Zhang and Vern Paxson. 2011. Detecting and analyzing automated activity on Twitter. In Proceedings of the International Conference on Passive and Active Network Measurement. 102–111.
[150]
Xianchao Zhang, Shaoping Zhu, and Wenxin Liang. 2012. Detecting spam and promoting campaigns in the Twitter social network. In Proceedings of the 2012 IEEE 12th International Conference on Data Mining. IEEE, Los Alamitos, CA, 1194–1199.
[151]
Yi Zhang and Jianguo Lu. 2016. Discover millions of fake followers in Weibo. Social Network Analysis and Mining 6, 1 (2016), 16.
[152]
Yubao Zhang, Xin Ruan, Haining Wang, Hui Wang, and Su He. 2016. Twitter trends manipulation: A first look inside the security of Twitter trending. IEEE Transactions on Information Forensics and Security 12, 1 (2016), 144–156.
[153]
Yu Zhang and Yan Zhang. 2017. Top-K influential nodes in social networks: A game perspective. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1029–1032.

      Published In

      ACM/IMS Transactions on Data Science, Volume 2, Issue 4 (November 2021), 439 pages. ISSN: 2691-1922. DOI: 10.1145/3485158.

      Publisher

      Association for Computing Machinery, New York, NY, United States

      Publication History

      Published: 17 May 2022
      Accepted: 01 February 2022
      Revised: 01 December 2021
      Received: 01 January 2021
      Published in TDS Volume 2, Issue 4

      Author Tags

      1. Collusion
      2. blackmarket
      3. Twitter
      4. YouTube
      5. social media analysis

      Qualifiers

      • Survey
      • Refereed

      Funding Sources

      • Ramanujan Fellowship
