One-year monitoring of the machine learning community on Twitter
Published on May 4th, 2021.

1. Introduction
In February 2020, I developed a bot that monitors the Twitter accounts of some machine learning (ML) researchers. The main reason I developed this bot was to enable non-Twitter users to find popular ML resources shared on Twitter. The bot collects tweets with links, classifies them, and ranks them by their popularity (i.e., number of RTs and favs). Currently, the links are classified into Arxiv, Blog, Colab, Github, News, Twitter, Paper, University, Youtube, Wikipedia, and Other. I also categorize BioRxiv links as "Arxiv", and I did the same for MedRxiv until I noticed that most of those articles were unrelated to ML. I then set up an online ranking of popular links/resources that displays them, favoring recent tweets with a high number of RTs. A few weeks later, I also implemented a daily Arxiv ranking: the bot analyzes the Arxiv links gathered in the last 24 hours and tweets the most popular one.
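To make the classification and ranking steps concrete, here is a minimal sketch. The category names follow the list above, but the URL patterns and function names are my own assumptions, since the bot's actual code is not public:

```python
import re

# Map URL patterns to the bot's categories (the patterns are assumptions,
# not the bot's actual rules). BioRxiv is folded into "Arxiv", as in the post.
CATEGORY_PATTERNS = [
    ("Arxiv", r"(arxiv|biorxiv)\.org"),
    ("Colab", r"colab\.research\.google\.com"),
    ("Github", r"github\.com"),
    ("Twitter", r"twitter\.com"),
    ("Youtube", r"(youtube\.com|youtu\.be)"),
    ("Wikipedia", r"wikipedia\.org"),
]

def classify_link(url: str) -> str:
    """Return the category of a shared link, or 'Other' if nothing matches."""
    for category, pattern in CATEGORY_PATTERNS:
        if re.search(pattern, url):
            return category
    return "Other"

def popularity(favs: int, rts: int) -> int:
    """Popularity used for ranking: the number of RTs and favs."""
    return favs + rts
```

The real bot also handles categories such as Blog, News, Paper, and University, which are harder to detect from the URL alone.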
Recently, David Holz suggested some ideas to me for improving this bot (thanks!). Considering these ideas required analyzing the data, and since this data was gathered from (and for) the ML community, I wanted to share the visualizations derived from this analysis. At the end of the post, I also share some thoughts about the data, the code, and the ideas I'm considering to extend the bot's functionality.
2. Visualizing one year of popular ML resources
As of today (May 4th, 2021), the database contains over 42000 links potentially related to ML. Considering the time this bot has been running (approx. 95% uptime), it has collected an average of about 100 links/resources per day. The categories with the largest number of links are: Other/Uncategorized (53%), Arxiv (8%), News (7%), Youtube (6%), and Github (5.5%).
Figure 1 shows the distribution of the tweets with links (left) and tweets with Arxiv links (right) per month. The number of tweets per month (left) is given as a reference. The distribution of Arxiv links (right) shows a peak in July, which might be explained by the NeurIPS rebuttal period.
Figure 2 (left) illustrates the distribution of links by weekday, broken down by type. Figure 2 (right) shows that the number of Arxiv links posted over the weekend is significantly lower (is being active on Twitter part of the job?). Interestingly, there is a peak on Tuesdays, which agrees with the analysis presented in "Best times to post on Twitter for tech".
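The per-weekday counts behind plots like Figure 2 can be computed directly from the tweets' timestamps. A small sketch (the function name and the ISO timestamp format are assumptions on my part):

```python
from collections import Counter
from datetime import datetime

def links_per_weekday(timestamps):
    """Count collected links per weekday, returned as a list of seven
    counts indexed 0=Monday ... 6=Sunday (Python's weekday() convention)."""
    counts = Counter(datetime.fromisoformat(ts).weekday() for ts in timestamps)
    return [counts.get(day, 0) for day in range(7)]
```

For example, two links collected on Tuesday, May 4th, 2021 and one on Saturday, May 8th would yield `[0, 2, 0, 0, 0, 1, 0]`.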
Next, I wanted to figure out who the "ML influencers of Twitter" are, so I gathered the Twitter accounts that achieved 3 or more "Most popular Arxiv link of the day" awards (Figure 3, blue bars). I also wanted to visualize the correlation with their number of followers, so I took their follower counts and normalized them to fit into the same bar chart (Figure 3, orange bars). I was a bit surprised to find almost no apparent correlation.
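The normalization used to overlay follower counts on the award counts can be as simple as a linear rescale; this is my guess at the approach, not necessarily what the plot actually uses:

```python
def scale_to_range(values, new_max):
    """Linearly rescale a list of positive values so the maximum equals
    new_max, e.g., to overlay follower counts on a bar chart whose other
    series tops out at new_max."""
    peak = max(values)
    return [v * new_max / peak for v in values]
```

For instance, `scale_to_range([100, 50, 200], 10)` returns `[5.0, 2.5, 10.0]`, so both series share the same vertical axis.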
I also investigated a possible correlation between the number of citations and tweet popularity, although I had low expectations since the data is very noisy. For this, I took the tweets published in June (Fig. 4) and July (Fig. 5) that contained Arxiv links to preprints uploaded at that time. I chose this period because these months coincide with NeurIPS decisions, and I expected authors to advertise their own research on social media. Additionally, June and July were almost a year ago, so I expected the papers to have already gathered some citations. To obtain the current number of citations, I used the Semantic Scholar API. I also accounted for different tweets posting the same preprint by summing their popularity, and I classified the preprints based on whether they were published at NeurIPS 2020. Figures 4 and 5 show that although there is a lot of noise (i.e., non-authors posting articles and getting a moderate number of favs + RTs), popular tweets (right side of each panel) correspond, on average, to articles with more citations than less popular tweets. Furthermore, there were twice as many preprints eventually published at NeurIPS 2020 in June as in July.
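The deduplication step, summing the popularity of all tweets that share the same preprint, can be sketched as follows (the function name and tuple layout are assumptions; fetching `citationCount` per arXiv ID from the Semantic Scholar API would then be a separate step, not shown here):

```python
from collections import defaultdict

def aggregate_by_preprint(tweets):
    """Sum the popularity (favs + RTs) of all tweets sharing the same
    arXiv preprint, before comparing popularity against citation counts.

    `tweets` is a list of (arxiv_id, favs, rts) tuples.
    """
    totals = defaultdict(int)
    for arxiv_id, favs, rts in tweets:
        totals[arxiv_id] += favs + rts
    return dict(totals)
```

With this, two tweets sharing the same preprint count as a single data point whose popularity is the sum of both tweets' favs and RTs.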
Finally, Tables 1 and 2 list the top 10 ML-related Arxiv preprints by favs and by RTs, respectively.
| Title | Account | Favs |
| --- | --- | --- |
| Fooling automated surveillance cameras: adversarial patches to attack person detection | hardmaru | 4438 |
| On the Measure of Intelligence | fchollet | 4245 |
| Flow-edge Guided Video Completion | jbhuang0604 | 3988 |
| Unsupervised Translation of Programming Languages | GuillaumeLample | 3544 |
| How to represent part-whole hierarchies in a neural network | geoffreyhinton | 3035 |
| DropEdge: Towards Deep Graph Convolutional Networks on Node Classification | lorenlugosch | 2931 |
| Deep learning-based model for detecting 2019 novel coronavirus pneumonia on high-resolution computed tomography: a prospective study in 27 patients (medRxiv) | fchollet | 2878 |
| Learning to Simulate Complex Physics with Graph Networks | PeterWBattaglia | 2639 |
| At the Interface of Algebra and Statistics | math3ma | 2458 |
| Finite Versus Infinite Neural Networks: an Empirical Study | jaschasd | 2268 |
3. Ideas to improve the bot
I still have to think these through more deeply, but I'm considering the following ideas.

3.1 Functionality
- New category: Twitter threads that explain papers.
- Instead of posting a single Arxiv link per day, posting the top 3 (in a small thread).
- Posting the most popular Github repository of the day.
- Online ranking of influencers. However, this probably wouldn't be too different from Figure 3.
3.2 More representation
- Currently, the bot monitors about 270 accounts, and I would like to follow more researchers from underrepresented backgrounds.
- I'm not sure what the best way would be, but I would like to give more visibility to researchers with fewer followers. Maybe posting a random preprint from time to time?
4. Final thoughts
Will you share the data?

The collected data is highly biased since it is collected from a few researchers who probably already have enough representation. In fact, this imbalance motivated the "more representation" ideas above. Therefore, I think that sharing this data would have no benefit.
Will you share the code?

Currently, the code is not available. The main "body" of this bot consists of fewer than 50 lines, and for collecting Twitter data I simply use an external library. The other hundreds of lines are calls to the database where I keep the data. So, I think this code cannot be easily reused, and I see no benefit in sharing it. However, I can answer any questions related to this bot, and I can share some snippets if someone is particularly interested.