One-year monitoring of the machine learning community on Twitter
Published on May 4th, 2021.

1. Introduction
In February 2020, I developed a bot that monitors the Twitter accounts of some machine learning (ML) researchers. The main reason I developed this bot was to enable non-Twitter users to find popular ML resources shared on Twitter. The bot collects tweets with links, classifies them, and ranks them by their popularity (i.e., number of RTs and favs). Currently, the links are classified into Arxiv, Blog, Colab, Github, News, Twitter, Paper, University, Youtube, Wikipedia, and Other. I also categorize BioRxiv links as "Arxiv", and I did the same for MedRxiv until I noticed that most of those articles were unrelated to ML. I then set up an online ranking of popular links/resources that displays them, favoring recent tweets with a high number of RTs. A few weeks later, I also implemented a daily Arxiv ranking: the bot analyzes the Arxiv links gathered in the last 24 hours and tweets the most popular one.
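To make the classification and ranking steps concrete, here is a minimal sketch. The category names follow the list above, but the URL patterns and function names are my own assumptions, since the bot's actual code is not public:

```python
import re

# Map URL patterns to the bot's categories (the patterns are assumptions,
# not the bot's actual rules). BioRxiv is folded into "Arxiv", as in the post.
CATEGORY_PATTERNS = [
    ("Arxiv", r"(arxiv|biorxiv)\.org"),
    ("Colab", r"colab\.research\.google\.com"),
    ("Github", r"github\.com"),
    ("Twitter", r"twitter\.com"),
    ("Youtube", r"(youtube\.com|youtu\.be)"),
    ("Wikipedia", r"wikipedia\.org"),
]

def classify_link(url: str) -> str:
    """Return the category of a shared link, or 'Other' if nothing matches."""
    for category, pattern in CATEGORY_PATTERNS:
        if re.search(pattern, url):
            return category
    return "Other"

def popularity(favs: int, rts: int) -> int:
    """Popularity used for ranking: the number of RTs and favs."""
    return favs + rts
```

The real bot also handles categories such as Blog, News, Paper, and University, which are harder to detect from the URL alone.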
Recently, David Holz suggested some ideas to me for improving this bot (thanks!). Considering these ideas required analyzing the data, and since this data was gathered from (and for) the ML community, I wanted to share the visualizations derived from this analysis. At the end of the post, I also share some thoughts about the data, the code, and the ideas I'm considering to extend the bot's functionality.
2. Visualizing one year of popular ML resources
As of today (May 4th, 2021), the database contains over 42000 links potentially related to ML. Considering the time this bot has been running (approx. 95% uptime), it has collected an average of about 100 links/resources per day. The categories with the largest number of links are: Other/Uncategorized (53%), Arxiv (8%), News (7%), Youtube (6%), and Github (5.5%).
Figure 1 shows the distribution of the tweets with links (left) and tweets with Arxiv links (right) per month. The number of tweets per month (left) is given as a reference. The distribution of Arxiv links (right) shows a peak in July, which might be explained by the NeurIPS rebuttal period.
Figure 2 (left) illustrates the distribution of links by weekday, broken down by type. Figure 2 (right) shows that the number of Arxiv links posted over the weekend is significantly lower (is being active on Twitter part of the job?). Interestingly, there is a peak on Tuesdays, which agrees with the analysis presented in "Best times to post on Twitter for tech".
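The per-weekday counts behind plots like Figure 2 can be computed directly from the tweets' timestamps. A small sketch (the function name and the ISO timestamp format are assumptions on my part):

```python
from collections import Counter
from datetime import datetime

def links_per_weekday(timestamps):
    """Count collected links per weekday, returned as a list of seven
    counts indexed 0=Monday ... 6=Sunday (Python's weekday() convention)."""
    counts = Counter(datetime.fromisoformat(ts).weekday() for ts in timestamps)
    return [counts.get(day, 0) for day in range(7)]
```

For example, two links collected on Tuesday, May 4th, 2021 and one on Saturday, May 8th would yield `[0, 2, 0, 0, 0, 1, 0]`.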
Next, I wanted to figure out who the "ML influencers of Twitter" are, so I gathered the Twitter accounts that achieved 3 or more "Most popular Arxiv link of the day" awards (Figure 3, blue bars). I also wanted to visualize the correlation with their number of followers, so I took their follower counts and normalized them to fit into the same bar chart (Figure 3, orange bars). I was a bit surprised to find almost no apparent correlation.
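The normalization used to overlay follower counts on the award counts can be as simple as a linear rescale; this is my guess at the approach, not necessarily what the plot actually uses:

```python
def scale_to_range(values, new_max):
    """Linearly rescale a list of positive values so the maximum equals
    new_max, e.g., to overlay follower counts on a bar chart whose other
    series tops out at new_max."""
    peak = max(values)
    return [v * new_max / peak for v in values]
```

For instance, `scale_to_range([100, 50, 200], 10)` returns `[5.0, 2.5, 10.0]`, so both series share the same vertical axis.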
I also investigated a possible correlation between the number of citations and tweet popularity, although I had low expectations since the data is very noisy. For this, I took the tweets published in June (Fig. 4) and July (Fig. 5) that contained Arxiv links to preprints uploaded at that time. I chose this period because these months coincide with NeurIPS decisions, and I expected authors to advertise their own research on social media. Additionally, June and July were almost a year ago, so I expected the papers to have already gathered some citations. To obtain the current number of citations, I used the Semantic Scholar API. I also accounted for different tweets posting the same preprint by summing their popularity, and I classified the preprints based on whether they were published at NeurIPS 2020. Figures 4 and 5 show that although there is a lot of noise (i.e., non-authors posting articles and getting a moderate number of favs + RTs), popular tweets (right side of each panel) correspond, on average, to articles with more citations than less popular tweets. Furthermore, there were twice as many preprints eventually published at NeurIPS 2020 in June as in July.
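The deduplication step, summing the popularity of all tweets that share the same preprint, can be sketched as follows (the function name and tuple layout are assumptions; fetching `citationCount` per arXiv ID from the Semantic Scholar API would then be a separate step, not shown here):

```python
from collections import defaultdict

def aggregate_by_preprint(tweets):
    """Sum the popularity (favs + RTs) of all tweets sharing the same
    arXiv preprint, before comparing popularity against citation counts.

    `tweets` is a list of (arxiv_id, favs, rts) tuples.
    """
    totals = defaultdict(int)
    for arxiv_id, favs, rts in tweets:
        totals[arxiv_id] += favs + rts
    return dict(totals)
```

With this, two tweets sharing the same preprint count as a single data point whose popularity is the sum of both tweets' favs and RTs.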
Finally, Tables 1 and 2 list the top 10 ML-related Arxiv preprints by favs and by RTs, respectively.
| Title | Account | Favs |
| --- | --- | --- |
| Fooling automated surveillance cameras: adversarial patches to attack person detection | hardmaru | 4438 |
| On the Measure of Intelligence | fchollet | 4245 |
| Flow-edge Guided Video Completion | jbhuang0604 | 3988 |
| Unsupervised Translation of Programming Languages | GuillaumeLample | 3544 |
| How to represent part-whole hierarchies in a neural network | geoffreyhinton | 3035 |
| DropEdge: Towards Deep Graph Convolutional Networks on Node Classification | lorenlugosch | 2931 |
| Deep learning-based model for detecting 2019 novel coronavirus pneumonia on high-resolution computed tomography: a prospective study in 27 patients (medRxiv) | fchollet | 2878 |
| Learning to Simulate Complex Physics with Graph Networks | PeterWBattaglia | 2639 |
| At the Interface of Algebra and Statistics | math3ma | 2458 |
| Finite Versus Infinite Neural Networks: an Empirical Study | jaschasd | 2268 |
3. Ideas to improve the bot
I still have to think these through more deeply, but I'm considering the following ideas.

3.1 Functionality
- New category: Twitter threads that explain papers.
- Instead of posting a single Arxiv link per day, posting the top 3 (in a small thread).
- Posting the most popular Github repository of the day.
- Online ranking of influencers. However, this probably wouldn't be too different from Figure 3.
3.2 More representation
- Currently, the bot monitors about 270 accounts, and I would like to follow more researchers from underrepresented backgrounds.
- I'm not sure what the best way would be, but I would like to give more visibility to researchers with fewer followers. Maybe posting a random preprint from time to time?
4. Final thoughts
Will you share the data?

The collected data is highly biased since it is collected from a few researchers who probably already have enough representation. In fact, this imbalance motivated the "more representation" ideas above. Therefore, I think that sharing this data would have no benefit.
Will you share the code?

Currently, the code is not available. The main "body" of this bot consists of fewer than 50 lines, and for collecting Twitter data I simply use an external library. The other hundreds of lines are calls to the database where I keep the data. So, I think this code cannot be easily reused, and I see no benefit in sharing it. However, I can answer any questions related to this bot, and I can share some snippets if someone is particularly interested.