One-year monitoring of the machine learning community on Twitter


1. Introduction

In February 2020, I developed a bot that monitors the Twitter accounts of some machine learning (ML) researchers. The main reason I developed this bot was to enable non-Twitter users to find popular ML resources shared on Twitter. The bot collects tweets with links, classifies them, and ranks them according to their popularity (i.e., the number of RTs and favs). Currently, the links are classified into Arxiv, Blog, Colab, Github, News, Twitter, Paper, University, Youtube, Wikipedia, and Other. I also categorize BioRxiv links as "Arxiv", and I did the same for MedRxiv until I noticed that most of those articles were unrelated to ML. I then set up an online ranking that displays these popular links/resources, favoring recent tweets with a high number of RTs. A few weeks later, I also implemented a daily Arxiv ranking: the bot analyzes the Arxiv links gathered in the last 24h and tweets the most popular one.
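For illustration, here is a minimal sketch of how the link classification and the recency-favoring ranking can be approached (the domain map, helper names, and the exponential decay are my own illustrative choices, not the bot's exact code):

```python
from datetime import datetime, timezone
from urllib.parse import urlparse

# Illustrative domain-to-category map; the real bot may use different rules.
DOMAIN_CATEGORIES = {
    "arxiv.org": "Arxiv",
    "biorxiv.org": "Arxiv",  # BioRxiv links are also filed under "Arxiv"
    "github.com": "Github",
    "colab.research.google.com": "Colab",
    "youtube.com": "Youtube",
    "youtu.be": "Youtube",
    "twitter.com": "Twitter",
    "wikipedia.org": "Wikipedia",
}

def classify_link(url):
    """Map a URL to one of the bot's categories, defaulting to 'Other'."""
    host = urlparse(url).netloc.lower()
    for domain, category in DOMAIN_CATEGORIES.items():
        if host == domain or host.endswith("." + domain):
            return category
    return "Other"

def popularity_score(favs, rts, posted_at, half_life_days=7.0):
    """Popularity (favs + RTs) discounted by tweet age, so that the
    online ranking favors recent tweets with a high number of RTs."""
    age_days = (datetime.now(timezone.utc) - posted_at).total_seconds() / 86400
    return (favs + rts) * 0.5 ** (age_days / half_life_days)
```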

Recently, David Holz suggested some ideas to improve this bot (thanks!). Considering these ideas required analyzing the data, and since the data was gathered from (and for) the ML community, I wanted to share the visualizations derived from this analysis. At the end of the post, I also share some thoughts about the data, the code, and the ideas I'm considering to extend the bot's functionality.

2. Visualizing one year of popular ML resources

To this day (May 4th, 2021), the database contains over 42,000 links potentially related to ML. Considering the time the bot has been running (uptime: approx. 95%), it has collected an average of about 100 links/resources per day. The categories with the largest number of links are: Other/Uncategorized (53%), Arxiv (8%), News (7%), Youtube (6%), and Github (5.5%).

Figure 1: Distribution of links (all) and Arxiv links per month.

Figure 1 shows the distribution of the tweets with links (left) and tweets with Arxiv links (right) per month. The number of tweets per month (left) is given as a reference. The distribution of Arxiv links (right) shows a peak in July, which might be explained by the NeurIPS rebuttal period.

Figure 2: Distribution of the gathered links by weekday.

Figure 2 (left) illustrates the distribution of the links by weekday, broken down by type. Figure 2 (right) shows that the number of Arxiv links posted over the weekend is significantly lower (is being active on Twitter part of the job?). Interestingly, there is a peak on Tuesdays, which agrees with the analysis presented in "Best times to post on Twitter for tech".
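As a rough sketch, the weekday breakdown in Figure 2 can be reproduced with pandas along these lines (the file and column names are assumptions for illustration, not the database's actual schema):

```python
import matplotlib.pyplot as plt
import pandas as pd

# One row per collected link, with the tweet timestamp and the link category.
links = pd.read_csv("links.csv", parse_dates=["created_at"])

# Count links per weekday, split by category (Figure 2, left).
by_weekday = (
    links.assign(weekday=links["created_at"].dt.day_name())
         .groupby(["weekday", "category"])
         .size()
         .unstack(fill_value=0)
)

# Reindex so the bars follow calendar order instead of alphabetical order.
order = ["Monday", "Tuesday", "Wednesday", "Thursday",
         "Friday", "Saturday", "Sunday"]
by_weekday.reindex(order).plot(kind="bar", stacked=True)
plt.show()
```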

Figure 3: To avoid the long tail, I only considered accounts with 3 or more "most popular Arxiv tweets".

Next, I wanted to figure out who the "ML influencers of Twitter" are, so I gathered the Twitter accounts that achieved 3 or more "Most popular Arxiv link of the day" wins (Figure 3, blue bars). I also wanted to visualize the correlation with their number of followers, so I took their follower counts and normalized them to fit into the same bar chart (Figure 3, orange bars). I was a bit surprised to find almost no apparent correlation.
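A normalization like the following would fit the follower counts into the same chart (a linear rescale is just one option; the numbers below are made up, and the rank-correlation check at the end is one way to quantify the lack of correlation):

```python
from scipy.stats import spearmanr

# Made-up illustrative data: "most popular Arxiv link of the day" wins
# and follower counts for five hypothetical accounts.
top_counts = [12, 9, 7, 5, 3]
followers = [310_000, 45_000, 120_000, 80_000, 950_000]

# Rescale followers linearly so the tallest orange bar matches
# the tallest blue bar in Figure 3.
scale = max(top_counts) / max(followers)
followers_scaled = [f * scale for f in followers]

# Spearman rank correlation: close to 0 means "almost no correlation".
rho, p_value = spearmanr(top_counts, followers)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
```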

Figure 4: Popularity vs. Citations in Arxiv preprints uploaded in June. Outliers were discarded.

Figure 5: Popularity vs. Citations in Arxiv preprints uploaded in July. Outliers were discarded.

I also investigated the possible correlation between the number of citations and tweet popularity, although I had low expectations since the data is very noisy. For this, I took the tweets posted in June (Fig. 4) and July (Fig. 5) that contained links to Arxiv preprints uploaded during those months. I chose this period because these months coincide with NeurIPS decisions, and I expected authors to advertise their own research on social media. Additionally, June and July happened almost a year ago, so I expected that the papers had already accumulated some citations. To obtain the current number of citations, I used the Semantic Scholar API. I also accounted for different tweets posting the same preprint by summing their popularity, and I classified the preprints based on whether they were published at NeurIPS 2020. Figures 4 and 5 show that although there is a lot of noise (i.e., non-authors posting articles and getting a moderate number of favs + RTs), popular tweets (right side of each panel) correspond, on average, to articles with more citations than less popular tweets. Furthermore, there were twice as many preprints eventually published at NeurIPS 2020 in June as in July.
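As a hedged sketch, this is roughly how the citation lookup and the per-preprint popularity aggregation can be done with the Semantic Scholar API (the regex, the requested field, and the `tweets` structure with its placeholder Arxiv ID are my assumptions for illustration):

```python
import re
from collections import defaultdict

import requests

S2_URL = "https://api.semanticscholar.org/graph/v1/paper/arXiv:{}"

def arxiv_id(url):
    """Extract the Arxiv identifier from an abs/pdf link, if present."""
    m = re.search(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})", url)
    return m.group(1) if m else None

def citation_count(paper_id):
    """Query Semantic Scholar for the paper's current citation count."""
    resp = requests.get(S2_URL.format(paper_id),
                        params={"fields": "citationCount"})
    if resp.status_code != 200:
        return None
    return resp.json().get("citationCount")

# Hypothetical input: an iterable of (url, favs, rts) tuples.
tweets = [
    ("https://arxiv.org/abs/2006.00001", 120, 40),  # placeholder ID
    ("https://arxiv.org/abs/2006.00001", 30, 10),
]

# Sum favs + RTs over all tweets that point to the same preprint.
popularity = defaultdict(int)
for url, favs, rts in tweets:
    paper_id = arxiv_id(url)
    if paper_id is not None:
        popularity[paper_id] += favs + rts
```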

Finally, Tables 1 and 2 list the top 10 ML-related Arxiv preprints by favs and by RTs, respectively.

Table 1: Top 10 ML-related Arxiv preprints by favs.

Title | Account | Favs
Fooling automated surveillance cameras: adversarial patches to attack person detection | hardmaru | 4438
On the Measure of Intelligence | fchollet | 4245
Flow-edge Guided Video Completion | jbhuang0604 | 3988
Unsupervised Translation of Programming Languages | GuillaumeLample | 3544
How to represent part-whole hierarchies in a neural network | geoffreyhinton | 3035
DropEdge: Towards Deep Graph Convolutional Networks on Node Classification | lorenlugosch | 2931
Deep learning-based model for detecting 2019 novel coronavirus pneumonia on high-resolution computed tomography: a prospective study in 27 patients (medRxiv) | fchollet | 2878
Learning to Simulate Complex Physics with Graph Networks | PeterWBattaglia | 2639
At the Interface of Algebra and Statistics | math3ma | 2458
Finite Versus Infinite Neural Networks: an Empirical Study | jaschasd | 2268
Table 2: Top 10 ML-related Arxiv preprints by RTs.

Title | Account | RTs
Fooling automated surveillance cameras: adversarial patches to attack person detection | hardmaru | 2080
On the Measure of Intelligence | fchollet | 1390
Unsupervised Translation of Programming Languages | GuillaumeLample | 1109
Flow-edge Guided Video Completion | jbhuang0604 | 1042
Deep learning-based model for detecting 2019 novel coronavirus pneumonia on high-resolution computed tomography: a prospective study in 27 patients (medRxiv) | fchollet | 949
How to represent part-whole hierarchies in a neural network | geoffreyhinton | 634
Learning to Simulate Complex Physics with Graph Networks | PeterWBattaglia | 612
AutoML-Zero: Evolving Machine Learning Algorithms From Scratch | quocleix | 610
DropEdge: Towards Deep Graph Convolutional Networks on Node Classification | lorenlugosch | 580
Single Headed Attention RNN: Stop Thinking With Your Head | Smerity | 562

3. Ideas to improve the bot

I still need to think about these more deeply, but I'm considering the following ideas.
3.1 Functionality
3.2 More representation

4. Final thoughts

Will you share the data?

The collected data is highly biased, since it is collected from a few researchers who probably already have enough representation. In fact, this imbalance motivated the "more representation" ideas. Therefore, I think that sharing this data would provide no benefit.

Will you share the code?

Currently, the code is not available. The main "body" of this bot consists of fewer than 50 lines, and for collecting Twitter data I simply use an external library. The other hundreds of lines are calls to the database where I keep the data. So, I think that this code cannot be easily reused, and I see no benefit in sharing it. However, I can answer any questions related to this bot, and I can share some snippets if someone is particularly interested.