Extracting Semantic Topics from Microblogs

Tracking #: 1656-2868

This paper is currently under review
Ahmet Yildirim
Suzan Uskudarli

Responsible editor: Guest Editors LD4IE 2017

Submission type: Full Paper
Microblogging systems are domain-independent platforms where vast numbers of people frequently post short messages about anything. As such, they are valuable resources for extracting information about public interest in a wide range of topics, such as news, politics, brands, and entertainment. Current approaches to detecting topics within microblogs typically rely on natural language processing (NLP) and machine learning (ML) techniques. The unstructured, informal, messy, and noisy nature of microblog posts presents challenges to conventional approaches, which are better suited to longer, well-formulated texts. Topic detection approaches have been applied to single posts or to collections of posts. Topics are represented in a variety of ways, such as sets of elements (terms or microblog posts), a summary, or external resources from Wikipedia and WordNet. This work proposes an approach that identifies topics within collections of microblog posts and produces a set of semantic topics. Working with post collections is preferred over single posts, since individual posts about a topic complement and reinforce each other. Furthermore, microblogging platforms are designed for participatory contributions, so topics based on numerous posts indicate an issue of collective significance. Topics are represented semantically because they can then be processed further, which is useful for searching and for revealing latent information. An ontology tailored to microblogging characteristics, named Topico, is introduced to represent semantic topics. An approach based on entity linking and co-occurrence graph processing is proposed to yield semantic topics. To extract topics from a post set, the entities within each post are first linked to relevant resources: temporal entities are linked to topico:TemporalExpression instances, and the remaining entities are linked to DBpedia resources.
A co-occurrence graph is created from the entities that co-occur in the posts. This graph is processed to group entities into sets using a maximal clique algorithm. The resulting cliques are subjected to selection criteria to choose those that represent topics, which are mapped to topico:Topic instances. This paper presents an approach to semantic topic identification, the Topico ontology, a prototype implementation, and experiments with various datasets. The prototype uses posts from the microblogging platform Twitter. The TagMe entity linker is used during the entity linking phase, and DBpedia is used for detecting types, for querying, and for reaching other resources such as Wikidata. The prototype was tested on more than one million tweets from 11 datasets gathered during various events, such as the 2016 US election, the death of Carrie Fisher, and the North Dakota pipeline demonstrations. More than 9K topics were identified from these datasets. The characteristics of the topics are described in detail, and their utility is examined with various SPARQL queries. The results are promising, since the semantically structured and linked topics enable access to information based on: (1) the internal structure of topics (who was mentioned with Hillary Clinton), (2) joins across topics (when were the issues most often mentioned with Hillary Clinton or Donald Trump posted), and (3) information beyond what is found within topics (what are the locations of the concerts of the most mentioned rock musicians). These capabilities follow from the semantic structure within topics and their relation to linked data. The ontology, the data, and the topics are made accessible online.
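The graph-processing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes entity linking has already been done, uses made-up entity labels, introduces a hypothetical `min_weight` threshold for edges, and enumerates maximal cliques with the Bron-Kerbosch algorithm as one possible choice. The selection criteria that the paper applies to cliques are not stated in the abstract, so none are applied here.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(posts, min_weight=2):
    """Build an undirected graph joining entities that co-occur in at least
    `min_weight` posts. The threshold is an assumption of this sketch."""
    weights = Counter()
    for entities in posts:
        # Count each unordered entity pair once per post.
        for a, b in combinations(sorted(set(entities)), 2):
            weights[(a, b)] += 1
    graph = {}
    for (a, b), weight in weights.items():
        if weight >= min_weight:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph


def maximal_cliques(graph):
    """Enumerate maximal cliques using Bron-Kerbosch with pivoting."""
    cliques = []
    if not graph:
        return cliques

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))  # r cannot be extended further
            return
        # Pick the pivot with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(graph[v] & p))
        for v in list(p - graph[pivot]):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return cliques


# Hypothetical input: each post is the list of entities linked within it.
posts = [
    ["Hillary_Clinton", "Donald_Trump", "Election"],
    ["Hillary_Clinton", "Donald_Trump", "Election"],
    ["Carrie_Fisher", "Star_Wars"],
    ["Carrie_Fisher", "Star_Wars"],
    ["Donald_Trump", "Star_Wars"],  # co-occurs only once, so no edge
]

candidate_topics = maximal_cliques(cooccurrence_graph(posts))
for clique in candidate_topics:
    print(sorted(clique))
```

With this toy input, the two candidate topics are the election-related triple and the Carrie Fisher pair. In the approach described above, such cliques would then be filtered and mapped to topico:Topic instances.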