Paper Title

Exploratory Analysis of Covid-19 Tweets using Topic Modeling, UMAP, and DiGraphs

Authors

Catherine Ordun, Sanjay Purushotham, Edward Raff

Abstract

This paper illustrates five techniques for assessing the distinctiveness of topics, key terms and features, the speed of information dissemination, and network behaviors in Covid19 tweets. First, we use pattern matching and, second, topic modeling through Latent Dirichlet Allocation (LDA) to generate twenty different topics that discuss case spread, healthcare workers, and personal protective equipment (PPE). One topic specific to U.S. cases would start to uptick immediately after live White House Coronavirus Task Force briefings, implying that many Twitter users are paying attention to government announcements. We contribute machine learning methods not previously reported in the Covid19 Twitter literature. This includes our third method, Uniform Manifold Approximation and Projection (UMAP), which identifies the unique clustering behavior of distinct topics to improve our understanding of important themes in the corpus and to help assess the quality of the generated topics. Fourth, we calculated retweeting times to understand how fast information about Covid19 propagates on Twitter. Our analysis indicates that the median retweeting time of Covid19 for a sample corpus in March 2020 was 2.87 hours, approximately 50 minutes faster than repostings from Chinese social media about H7N9 in March 2013. Lastly, we sought to understand retweet cascades by visualizing the connections of users over time, from fast to slow retweeting. As the time to retweet increases, the density of connections also increases, and in our sample we found distinct users dominating the attention of Covid19 retweeters. One of the simplest highlights of this analysis is that early-stage descriptive methods like regular expressions can successfully identify high-level themes, which were consistently verified as important through every subsequent analysis.
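The fourth and fifth techniques in the abstract (median retweeting time and a directed graph of retweet connections) can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the sample records, the function names `retweet_delay_hours`, `median_retweet_hours`, and `digraph_density`, the edge direction (original author to retweeter), and the density formula m / (n(n-1)) are all assumptions made for the example.

```python
from datetime import datetime
from statistics import median

# Hypothetical sample records (not data from the paper):
# (original_author, retweeter, original_time, retweet_time)
RETWEETS = [
    ("user_a", "user_b", "2020-03-24 10:00:00", "2020-03-24 11:00:00"),
    ("user_a", "user_c", "2020-03-24 10:00:00", "2020-03-24 13:00:00"),
    ("user_d", "user_b", "2020-03-24 09:00:00", "2020-03-24 14:00:00"),
]

FMT = "%Y-%m-%d %H:%M:%S"


def retweet_delay_hours(original_time, retweet_time):
    """Hours elapsed between the original tweet and its retweet."""
    delta = datetime.strptime(retweet_time, FMT) - datetime.strptime(original_time, FMT)
    return delta.total_seconds() / 3600.0


def median_retweet_hours(records):
    """Median retweet delay, in hours, over all records."""
    return median(retweet_delay_hours(o, r) for _, _, o, r in records)


def digraph_density(records, max_hours):
    """Density of the directed graph built from retweets no slower than max_hours.

    Assumption: an edge points from the original author to the retweeter,
    and density is edges / (n * (n - 1)) for n distinct users.
    """
    edges = {
        (src, dst)
        for src, dst, o, r in records
        if retweet_delay_hours(o, r) <= max_hours
    }
    nodes = {user for edge in edges for user in edge}
    n = len(nodes)
    if n < 2:
        return 0.0
    return len(edges) / (n * (n - 1))


if __name__ == "__main__":
    print(median_retweet_hours(RETWEETS))
    print(digraph_density(RETWEETS, max_hours=5))
```

Raising `max_hours` step by step and recomputing the density gives a rough numeric counterpart to the fast-to-slow visualization the abstract describes. The toy sample is far too small to reproduce any trend reported in the paper.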
