Twitter’s war against fake accounts: Here’s how to win it, as Google did

Farokh Shahabi
11 min read · Oct 6, 2022

If there’s only one thing that Elon Musk and Jack Dorsey, the co-founder of Twitter, agree on, it’s that Twitter suffers from a dangerous virus that’s killing it. That virus is fake accounts.

At first, fake accounts were mostly bots (programs that send tweets automatically), used mainly for advertising and spam. However, as Twitter grew, these bots found a new purpose: to spread misinformation, fake news, and propaganda.

These fake accounts are now operated by so-called “cyber armies”. Almost all governments, political parties, insurgent groups, and even cults have their own cyber armies, and they manage these accounts with one goal in mind: to influence real people according to their agenda.

The bad news is that this tactic is highly successful, and the majority of the “real people” on Twitter are not aware that they are interacting with (and therefore being influenced by) fake or bot accounts.

Right now, based on official estimates, there are between 20 million and 80 million fake accounts in the world! Most experts believe that Twitter played a decisive role in both the 2016 US presidential election and the UK’s Brexit referendum.

Why should we care about this on Twitter?

People around the world use Twitter more than any other social network to follow the news, get updates, and stay informed. After all these years, Twitter has shown that it’s unable to battle fake accounts and misinformation on its own. (Check this study from the University of Iowa to see the depth of the problem.)

Even when Elon Musk backed out of buying Twitter this year, he blamed the high number of fake accounts. That might or might not be the real reason, but it shows how serious this problem is.

Today, the biggest problem with Twitter is fake information and fake accounts. There are millions of bot/fake accounts on Twitter that are tweeting and retweeting fake news and irrelevant data. This situation is causing the real news and the real information to be buried under the avalanche of fake information.

Online Human-Bot Interactions: Detection, Estimation, and Characterization (ICWSM 2017)

For example, this study in 2017 reveals that a significant fraction of Twitter accounts, between 9% and 15%, are bots. This translates into nearly 50 million accounts, according to 2017 estimates that put the Twitter userbase at above 320 million, and today it’s more than 500 million accounts. Although not all bots are dangerous, many are used for malicious purposes. You can see the analyzed interactions between humans & bots in the picture above.

So is it hopeless? Should we leave Twitter? No, the good news is that we solved the same problem before! We can use the same solution for Twitter and save it.

What can we learn from Google’s ranking algorithm to fix this on Twitter

A little more than two decades ago, the young web had a very big problem: Finding the “right” information and what you needed was very hard. In other words, back then, search engines sucked.

And then came Google, with a single purpose: to show people what they need, when they need it. It was not an easy task, as it involved two major parts.

First, you have to truly understand what people want when they use search tools. That means focusing not on what they type in the search box, but on what they “desire” to see as the result. Second, you have to decide which results should come first, or in other words, “who” among the many relatively similar sources has the right answer.

Google had a big problem: the web was growing rapidly, and for every question that people asked, there seemed to be thousands of answers from thousands of different sources. Which one should come first? How do you rank every piece of content in the world? And worst of all, most of the content on the internet was clickbait, spam, and fake content.

Google’s success is all due to the effectiveness of its search algorithm. This algorithm made the web much more usable for the public, and even a safer place than before, because Google’s mission always was and still is to show the best results first.

“The best place to hide a dead body is page two of Google search results.”

It’s not an easy task to try to “understand” every piece of content on the internet and then rank it for every query. That’s why Google’s search algorithm changes almost 500 times every year. However, its core has rarely changed.

For years, it even had a name, PageRank, after both the web page and Google’s co-founder, Larry Page. Even today, though the PageRank name has been dropped and its patents are no longer in effect, its idea is still at the core of Google’s search algorithm.

Have you ever asked yourself why the first result of many search queries is Wikipedia? Or how news articles that were posted seconds ago appear on the first page of Google? The answer is the PageRank idea.

The first versions of PageRank were quite simple. Basically, it was a system to estimate the “authority and quality” of every webpage. It was based on links, which served as votes of trust given to a page. According to the logic of that mechanism, the more external resources link to a page, the more valuable information it has for users. And PageRank (a score from 0 to 10 calculated based on the quantity and quality of incoming links) showed the relative authority of a page on the Internet.
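The "links as votes" idea can be sketched with a small power iteration. This is the textbook simplified PageRank with the commonly cited damping factor of 0.85, not Google's production algorithm; the toy graph and all names below are hypothetical. The scores it returns are probabilities that sum to 1, not the 0 to 10 scale mentioned above.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. `links` maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal authority

    for _ in range(iterations):
        # Every page keeps a small baseline share of authority.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote over everyone.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical toy web: three pages link to "c", and "c" links to "a".
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_web)
```

In this toy graph, "c" ends up with the highest score because it collects the most votes, and "a" outranks "b" and "d" because its single incoming link comes from the highly ranked "c".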

Over the years, other voting factors were added to PageRank: how often a page is updated, how important a page is in the sitemap, whether the page’s content is copied from another website, and most important of all, whether the page is trying to cheat Google SEO.

All these factors and thousands more decided the PageRank of each page, from 0 to 10. For each search query, the page with the highest PageRank is shown first.

Because of this ranking system, you’ll get relevant results from the best sources on the first page, and millions of pages with all the spammy results, copied content, and clickbait ads are all left for dead in the last pages.

As you might have already guessed, Google was at war with spam content on the web and created PageRank as a weapon. Every day, companies tried to hijack their desired keywords with spam content, fake websites, and duplicated content to drive up their traffic.

In simple terms, PageRank allowed Google to find out who is a qualified source and who is most likely a fake or irrelevant source.

Twitter can use a modified version of the same ranking system to evaluate every Twitter account and rank them. This rank can give real people a good idea of the accounts that they’re interacting with. This rank will destroy fake accounts, spam bots, and all cyber armies on social networks.

I worked on this algorithm for a long time and created a modified version of PageRank for Twitter. This algorithm borrows heavily from the original Google PageRank and uses many of the same practices, only modified for social networks. Same as PageRank, this algorithm’s factors should be updated frequently so nobody can cheat it.

The goal of this rank is not to evaluate the content of a Twitter account or say which account is better than the others. The only goal with this rank is to distinguish “Real accounts” from fake, spam, and inactive accounts and weaken the cyber armies that use these fake accounts.

How to create the Twitter Rank

To evaluate an account, we should first analyze two factors: the actions of the account and its interactions with other accounts.

Actions & Interactions

This method focuses on “what’s happening” rather than analyzing the whole dataset. It is especially useful in social networks, or in scenarios where the whole dataset isn’t available or is too big to analyze.

The actions of every Twitter account create a pattern of activity. These patterns, when compared in large numbers, create various segments. In simple words, we can understand what kind of account each one is. Is it a business account? A normal user? A radio news listener? A dog person? A Democrat? A foodie? You get the picture.

This isn’t enough. Segmentation produces different groups, overlapping groups, and anomalies. More important than the actions are the interactions.

Analyzing interactions of Twitter accounts shows the “quality” & “impact” of every account. Is this an automated account? Is this account influential to others? Is it a fake account? Is this account working in an organized group of accounts?

Twitter Rank works by counting the “number” and “quality” of interactions with a Twitter account to produce a rough estimate of how important the account is. The underlying assumption is that more important accounts are likely to receive more “real” interactions from other Twitter accounts.

How to calculate the Twitter Rank

Again I want to emphasize that this rank is a borrowed version of Google PageRank and we should test its effectiveness in action and update it accordingly.

The goal is to determine the authority and quality of each Twitter account and rank it from 0 to 10. A 10 would mark the most trusted and qualified accounts, and a 0 would mark the accounts that are most likely fakes or bots. The rank of each account should be updated periodically, for example monthly.

First, we will allocate 5 points to each account’s actions and 5 points to each account’s interactions. The sum of these two will give us a number from 0 to 10.

Action Score

Each account’s actions should resemble the behavior of real Twitter users. For example, if an account is 2 years old and already has 50,000 tweets, it has posted around 70 tweets every day, which is bot behavior.

Another example is an account that follows thousands of accounts and is followed back by a similar number. No real person can read tweets from 10,000 people, so this is most likely a follow/follow-back scheme to generate fake authority and relationships.

For this score, we need these factors:

  • A = Number of tweets
  • B = Creation year of the account
  • C = Number of followers
  • D = Number of following accounts
  • E = Number of likes
  • F = Number of retweets
  • G = Verified account status (True/False)
  • H = Type of website link they use in their bio (Personal page/website, Link shortener or suspicious links, No link)

Action Score = 5 if: (current year) - B > 10 AND A < 10000 AND D < 2000 AND C/D > 1.2

Action Score = 4 if: (current year) - B > 5 AND A < 15000 AND D < 2000 AND C/D > 1.2

Action Score = 3 if: (current year) - B > 3 AND A < 30000 AND D < 3000 AND C/D > 1

Action Score = 2 if: (current year) - B > 2 AND A < 7000 AND D < 2000

Action Score = 1 if: (current year) - B > 1 AND A < 3000 AND D < 2000

Otherwise, Action Score = 0

Now let’s close the loopholes for those who want to cheat the system:

  • If G = True (verified account for celebrities, etc.) then add 3 to the Action Score.
  • If H = Personal page/website (personal website, company website, LinkedIn, social media pages, Github, Medium, etc.), add 1 to the Action Score
  • If H = Link shortener or suspicious links, deduct 1 from the Action Score
  • If E/A > 100, deduct 2 from the Action Score
  • If F/A > 20, deduct 2 from the Action Score
  • If 0.5 < C/D < 1.5 AND D > 1000, deduct 2 from the Action Score
  • If (A + F) / ((current year) - B) > 5000, deduct 2 from the Action Score
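The rules above can be sketched as a single function. This is a minimal sketch, not a tested implementation: the `Account` class and its field names are hypothetical, and three behaviors are my own assumptions because the rules do not cover them: a ratio with a zero denominator is treated as infinite (or 0 when both numbers are zero), accounts younger than one year are treated as one year old for the yearly rate, and the final score is clamped to the 0 to 5 range allocated to actions.

```python
from dataclasses import dataclass


@dataclass
class Account:
    tweets: int        # A
    created_year: int  # B
    followers: int     # C
    following: int     # D
    likes: int         # E
    retweets: int      # F
    verified: bool     # G
    link_type: str     # H: "personal", "suspicious", or "none"


def _ratio(num, den):
    # Assumption: a zero denominator means "infinitely high" unless both are zero.
    if den:
        return num / den
    return float("inf") if num else 0.0


def action_score(acc, current_year):
    age = current_year - acc.created_year
    a, d = acc.tweets, acc.following
    follow_ratio = _ratio(acc.followers, d)  # C/D

    # Base tiers, checked from the strictest to the loosest.
    if age > 10 and a < 10000 and d < 2000 and follow_ratio > 1.2:
        score = 5
    elif age > 5 and a < 15000 and d < 2000 and follow_ratio > 1.2:
        score = 4
    elif age > 3 and a < 30000 and d < 3000 and follow_ratio > 1:
        score = 3
    elif age > 2 and a < 7000 and d < 2000:
        score = 2
    elif age > 1 and a < 3000 and d < 2000:
        score = 1
    else:
        score = 0

    # Bonuses and anti-cheating penalties.
    if acc.verified:
        score += 3
    if acc.link_type == "personal":
        score += 1
    elif acc.link_type == "suspicious":
        score -= 1
    if _ratio(acc.likes, a) > 100:      # E/A
        score -= 2
    if _ratio(acc.retweets, a) > 20:    # F/A
        score -= 2
    if 0.5 < follow_ratio < 1.5 and d > 1000:
        score -= 2
    # Assumption: accounts younger than one year count as one year old here.
    if (a + acc.retweets) / max(age, 1) > 5000:
        score -= 2

    # Assumption: keep the result inside the 0..5 range allocated to actions.
    return max(0, min(5, score))


# Hypothetical accounts for illustration.
regular = Account(5000, 2018, 600, 500, 1000, 100, False, "none")
bot_like = Account(50000, 2020, 10000, 10000, 0, 40000, False, "suspicious")
```

With 2022 as the current year, `regular` lands in the third tier with no bonuses or penalties, while `bot_like` matches no tier and collects several penalties, so it is clamped to 0.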

Interaction Score

The most important thing about PageRank was that if thousands of websites with low ranks (0–2) had links to your website, it was literally worth nothing. But if only a couple of high-ranked (7–8) websites had links to your website, it boosted your rank tremendously.

Better still, PageRank had a penalty to prevent schemes like “I will put your link on my site and you do the same for me”. Basically, if you exchanged links, they became worthless again, even if one or both of you had high ranks. We can apply the same practice in social networks as well.

Now that we have the Action Score of each Twitter user, we need to analyze the users they are interacting with as well.

The more high-ranked users follow or interact with a user, the higher that user’s rank becomes.

Unfortunately, the current Twitter API is very limited and will not give out enough information on interactions between accounts, but we can work on something rather than nothing. So here’s how you calculate the interaction score:

Interaction Score = average Action Score of the user’s mutual followers (the accounts that the user follows and that follow the user back)

  • If at least one verified account follows me, add 1 to the Interaction Score
  • If more than 10 verified accounts follow me, add 3 to the Interaction Score
  • If a user has fewer than 10 mutual followers/following, the Interaction Score should be set to 2.
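Here is a minimal sketch of the Interaction Score rules above. The function and parameter names are hypothetical, and the rules leave a few points open, so I make three assumptions: the fallback value of 2 is applied before the verified-follower bonuses, the two bonuses are alternatives (only the larger one applies), and the result is clamped to the 0 to 5 range allocated to interactions.

```python
def interaction_score(mutual_action_scores, verified_followers):
    """Compute the Interaction Score.

    mutual_action_scores: Action Scores of the accounts that the user follows
        and that follow the user back.
    verified_followers: number of verified accounts following the user.
    """
    if len(mutual_action_scores) < 10:
        # Too few mutual connections to average, so use the fixed value.
        score = 2.0
    else:
        score = sum(mutual_action_scores) / len(mutual_action_scores)

    # Assumption: the bonuses are alternatives, not cumulative.
    if verified_followers > 10:
        score += 3
    elif verified_followers >= 1:
        score += 1

    # Assumption: keep the result inside the 0..5 range allocated to interactions.
    return max(0.0, min(5.0, score))
```

For example, a user with twelve mutual followers that each have an Action Score of 3 and one verified follower would get 3 + 1 = 4 under these assumptions.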

So, for example, if I follow Bill Gates, it doesn’t count. But if Bill Gates follows me, it will boost my Interaction Score greatly, because he has a great Action Score.

Also, I can’t cheat the system with a follow/follow-back scheme, since the Action Score depends on the number of accounts I follow. Not only will I lose my Rank, but all of the people who participate in the scheme will lose their Action Score, and therefore their Interaction Score as well.

The final Twitter Rank is calculated like this:

Twitter Rank = Action Score + Interaction Score

So for example, if I got an Action Score of 3 and an Interaction Score of 2, my Twitter Rank would be 5/10.

As you can see, if I try to cheat the system, as cyber armies are doing on Twitter, and fake the relationships between accounts, I will lose my Twitter Rank and all of my accomplices will lose their Rank as well.

Twitter accounts with scores of 0 to 2 are most likely fake accounts, and if this rank is visible to people, real people will not reshare their content or follow them. It’s as if they didn’t exist.

This algorithm helps Twitter and Twitter users identify which accounts are real and impactful and which accounts are inactive, fake, automated, or acting suspiciously. With this rank, you can see the real value behind every account that you follow and choose your sources of information with insight.
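Putting the final formula and the "0 to 2" rule of thumb into code is straightforward. This is only a sketch with hypothetical function names; clamping the sum to the 0 to 10 range is my assumption, since the bonuses could otherwise push a score past 10.

```python
def twitter_rank(action_score, interaction_score):
    # Twitter Rank = Action Score + Interaction Score
    # Assumption: the result is kept inside the 0..10 scale.
    return max(0, min(10, action_score + interaction_score))


def is_likely_fake(rank):
    # Ranks of 0 to 2 are treated as "most likely fake".
    return rank <= 2


example_rank = twitter_rank(3, 2)  # the 5/10 example from the text
```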

This proposed algorithm is very young and full of inaccuracies but with enough testing and the power of crowdsourcing, it can provide the most reliable ranking system that Twitter needs.

The social networks we deserve

Social networks today are not the social networks we deserve. Just like with search engines, quality is way more important than quantity. It’s not enough to create better and better recommendation systems in the search functions of social networks. We need to implement the right ranking system for the search functions as well.

It’s not enough to focus on engagement, which leads to addiction and later on, loathing of the same network. People have different desires, one after another, that should be answered & satisfied, every time.

I sincerely hope to see this or a similar Rank implemented in Twitter. If not, the open source community can create an app or even a browser extension to bring this safety to Twitter.

Farokh Shahabi

3x Entrepreneur | Co-founder & CEO at Formaloo | TEDx Speaker