A consumer’s look on Facebook and Twitter – What do people read and where?

Paper Title: 
A consumer’s look on Facebook and Twitter – What do people read and where?
Authors: 
Thomas Steiner, Ruben Verborgh, Arnaud Brousseau, Raphael Troncy, Rik Van de Walle, Joaquim Gabarro Valles
Abstract: 
With the ever-growing influence of social networks, social media mining becomes more and more important as a source for responses to all sorts of questions. “Do people like product X?”; “What do people think of a new law proposal Y?”; “Will candidate A or candidate B win the elections?”. These are just some sample questions where social networks can substantially contribute to answers. In this paper, we propose a paradigm shift in order to find responses. Where traditional social media mining focuses exclusively on the producer side of microposts, we focus on the consumer side, that is, on the readers of microposts. Traditional social media mining retrieves its data through official Application Programming Interfaces (APIs). In contrast, our approach works through accessing its data via browser extensions directly from the social network users’ timeline when they visit their social network of choice via a Web browser. In comparison to social data retrieved via APIs, the social data retrieved via our approach is more sparse, however, we argue in the paper that it is of higher quality. We have implemented browser extensions for the popular social networks Facebook and Twitter. These extensions perform named entity disambiguation on microposts and, via Web analytics software, enabled us to collect social data over the course of six months. In the first part of the paper, we present global statistics and a comparison of what topics people are interested in on the two examined social networks. In the second part, using concrete examples from recent history, we show how additional data gathered through Web analytics software can be used to get fine-grained information on geolocations of centers of interest. This allows for interesting new kinds of questions to be addressed. “Does an event X cause more reader interest in country A than in country B?”; “Which continent cares most about a catastrophe Y?”; “Do people in city Z read about product P?”. 
Finally, as our approach allows for cross-network ambiguity-free social media mining, we can even propose answers for a question like the following: “Is my brand B read more about in region R on social network A, or social network B?”. We see our approach not as a replacement of traditional social media mining, but more as an additional perspective that makes sense in certain scenarios, some of which we present in this paper.
Submission type: 
Full Paper
Responsible editor: 
Guest Editors
Decision/Status: 
Reject and Resubmit
Reviews: 

Solicited review by Harald Sack:

Although the authors have improved their previous manuscript (esp. in the state-of-the-art section), some of the major points of criticism still remain:
- The authors simply combine existing technology (various entity mapping applications and a web tracking service) into a consumer-side monitoring tool and claim that this approach, i.e. consumer-side monitoring, has benefits over traditional author-side monitoring. From my point of view, this is the only original scientific contribution, and it is very marginal.
- The consumer-side monitoring approach presented by the authors provides neither representative nor significant samples of the data to be examined. In particular, a direct comparison to author-side monitoring is missing.

ad 2.2) Although the authors now give a reference for the reconciliation process of the NER results of the different tools being used, it would be helpful to sketch the underlying process in more detail.

ad 4.4.2) The authors state that, even though the Egyptian government blocked social networks and people started to circumvent the internet barriers by using proxies located in other countries, this results in only minimally skewed statistics, because the statistics take "various factors" into account to determine the geo-position of the user. Please explain this in more detail.

ad 6.) The proposal of metering trends by micropost audience measurement studies is not directly comparable to the broadcast counterpart. While television and radio are solely passive, microposts are also actively created by the user, which would be measured too and would probably affect the (aware) participants' behaviour. This was already mentioned in the previous review and should be discussed further.

Moreover, I also support the view of my fellow reviewer that what appears in the timeline of the reader is not necessarily what the reader really reads, and therefore not necessarily what really interests him. This should at least be discussed by the authors, given that it was a major point of criticism.

Conclusion:
The approach of regarding the consumer's view is very interesting and the applied methodology seems reasonable and straightforward. But the evaluation performed by the authors lacks a comparison to the established author-based approaches. The revealed insights are based on singular examples that cannot be generalized.
The initial aim of providing arguments for a paradigm shift to the reader's perspective was not justified with a solid comparison to existing author-focused practices. It would have been beneficial to contrast the outcomes of both approaches. Technically, the contribution is merely a mash-up of the two browser plugins with existing services.

Solicited review by Diana Maynard:

While I applaud the authors for substantially modifying the original paper and addressing some of the issues (such as the related work section), almost all the points raised in my original review still remain.
Essentially the paper assumes that (1) people read everything on their Facebook/Twitter wall or timeline (which is definitely not the case, and there is no way to prove which posts they are actually interested in) and (2) that the set of people who took part in this experiment is an accurate representative sample of the social media population as a whole (which I very much doubt). There is no evidence anywhere that there is a difference between consumers and producers, which forms the whole crux of the novelty of this work, and of the conclusions. The authors are trying to retro-fit the conclusions they draw, e.g. by claiming that because cats are more popular than dogs in their experiment, they must be studying consumers and not producers, since studies involving producers show that dogs are more popular than cats. This is not a valid conclusion because there are too many variables. The authors also make claims such as "the quality of data collected via our method is higher than collecting data via APIs". How is the quality better? It's just different data (if their claims about the differences between the methods are really valid). They make other claims such as that they use a random population of social network users, but since users had to opt in, this is really not the case. I still find the distinction between unique and total named entities unclear and I don't understand the rationale behind the distinction. Even minor claims such as "Twitter users spend less than half the time on that Facebook users do on social networking" are invalid for various reasons, for example that they don't take into account how many times a day the users access the sites. Similarly, the claim that "the reading experience per social networking session on Twitter is more versatile than on Facebook" makes no sense either and is not justified.
Also, several claims comparing Facebook and Twitter users make no sense because the test set of subjects is not the same for Facebook and Twitter. Most of the "revealed insights" are not justifiable either, or obvious anyway, as mentioned in the previous review.

In short, while some of this work could be interesting, it seems the authors are not prepared to change their fundamental viewpoints in this paper, and the majority of conclusions drawn are invalid or not justified properly, so I do not recommend accepting this paper.

Solicited review by John Breslin:

Focus on Desktop and only the Web version - It is hard to create a study of any significant size using the web version of Twitter and maybe even Facebook. Most content is read and also created using mobile devices, and in Twitter's case a lot of the rest would come from Desktop clients, as can be seen in the number of people using the plugin for the Web version of Facebook vs Twitter.
Reader vs Consumer - From this point of view it is very hard to tell whether a user has read any post that he/she loads on either platform: a user could load 20+ posts and read only the ones they are interested in, or none at all.
Twitter vs Facebook - A direct comparison of Twitter and Facebook, especially with respect to microposts, seems unbalanced, as a Twitter post can only be 140 characters vs 63206? characters for a Facebook post. I think this deserves some discussion within the paper. From a social network point of view this paper naively takes Twitter and Facebook as being similar in structure; other research has discussed this and shown that this may not be the case, and the results in this paper may even reflect this.
Spam - There is no mention of spam in the paper. It seems odd that SEO was one of the biggest topics on Twitter, unless the sample of users was heavily biased, which might also explain Linked Data being in the top 5.
There are no statistics on disambiguated named entities or on how combining the different services aided in improving accuracy, if it aided at all, and no discussion of how they cleaned the top 200 entities and removed false positives.
The approach is moderately novel in how it collects the data, but it is hard to see how the system would ever get enough data to be statistically significant. It is also hard to agree with many of the conclusions drawn from their data: even though only roughly 1-2% of Tweets are geolocated, with enough volume regarding events this may be enough to draw some conclusions, which the authors ignore.

Revised submission after a "reject and resubmit". Reviews for the original version are below.

Solicited review by Harald Sack:

Review

Title:
A consumer's look on Facebook and Twitter – What do people read and where?

Authors:
Thomas Steiner, Ruben Verborgh, Arnaud Brousseau, Raphaël Troncy, Rik Van de Walle, Joaquim Gabarró Vallés

Summary:
The paper deals with evaluating microposts from Twitter and Facebook exclusively from the consumer's perspective.

The objectives of the paper are the implementation of analyses on named entities from microposts, focusing on the reader's perspective.

The authors present two browser plugins for the Google Chrome browser covering microposts that a user consumes (presumably reads) on Twitter and Facebook respectively. With these plugins installed, the authors extract named entities from the posts a user reads and writes and highlight them to the user, while tracking the time they are shown using the Google Analytics web tracking service. Google Analytics allows tracking users by their IP address, which enables geolocation, and sets a cookie to identify a user.
For named entity extraction four third-party tools are used, i.e. OpenCalais, DBpedia Spotlight, Alchemy API and Zemanta. Their results are combined by the authors' own wrapper API.
The analysis has been carried out over a time period of eight months, including 858 unique Facebook users and 86 Twitter users, although the majority (almost 85 %) of Facebook users uninstalled the plugin after a short while.
For the further discussion (Section 4) a six-month dataset has been selected. The authors note that this data is not statistically significant. They collected various insights about the users' demographics, behaviour and interests on the two social networks. In part, they refer to events from recent history to exemplify their insights.

The collected distinct DBpedia entities from Twitter (18,207) and Facebook (54,331) have been ranked by their number of occurrences. The rankings follow a Zipf distribution, showing that a few entities are referred to very frequently. The top entities differ between the two networks, with Twitter leaning towards technical topics and Facebook towards more personal ones.
RDF type segmentation has been performed based on three different type systems, i.e. the DBpedia ontology, UMBEL and schema.org. The presented statistics highlight their varying granularity.
In subsection 4.4 the authors give examples of recent events to demonstrate their insights acquired by focusing on the reader's perspective.
They conclude from posts read by Norwegian people in the days after the Norway attacks in 2011 that readers consume news from traditional news media, shared via social networks.
Based on posts during the Arab Spring revolution, the authors show that micropost consumption takes place elsewhere than the locations the posts refer to.
The differences between producer and consumer sides are shown on an example about cats and dogs on the internet. The reference was taken from the bitly blog, which gives no further information about the origin of this data. This comparison is not able to support their concluded insight.

The related work section (Section 5) lists publications in the field of semantic annotations of microposts, trend or popularity detection and commercialization of social data. For semantic annotation of microposts the authors (do not differentiate their own work from previous work, but) pick two problems they avoid with their approach (the need for meaningful hashtags, and the overload caused by the amount of microposts). Section 5.2 merely lists two trend detection approaches; the distinction from Section 5.3 seems to be monetization, which is not clear from a scientific point of view. Section 5.4 delivers the explicit motivation for focusing on the reader's perspective, which is mainly the unavailability of an extensive amount of microposts via APIs, which is due to privacy reasons or the intentional absence of APIs.

Criticism:
(I) in general
- The authors simply combine existing technology (entity mapping applications and a web tracking service) into a consumer-side monitoring tool and claim that this approach, i.e. consumer-side monitoring, has benefits over traditional author-side monitoring. From my point of view, this is the only original scientific contribution, and it is very marginal.
- Consumer-side monitoring is not new. The authors refer to existing TV monitoring services to examine TV consumption, which also takes place at the consumer side. But there is one major difference between traditional TV monitoring and the browser-based monitoring of microposts proposed by the authors: TV consumption monitoring only monitors passive reception of a broadcast program, while micropost monitoring also covers active and presumably also private information exchange. The possibility of also monitoring private Facebook conversations among friends that are not supposed to be public is mentioned as an advantage by the authors. But here I see difficulties concerning general privacy issues.
- To be applicable on a large scale, the consumer must perceive some incentive or benefit in using the browser extension. How do you convince the user to use your monitoring tool and to contribute to your data gathering?
- The consumer-side monitoring approach presented by the authors provides neither representative nor significant samples of the data to be examined, especially when compared to author-side monitoring, where a complete data stream can be monitored (at least the publicly available parts).

ad 2.1) The applied APIs for entity mapping do not map entities to the same ontologies. While the authors later on refer to DBpedia, only DBpedia Spotlight maps directly there. Zemanta (among others) maps to Wikipedia, which is simple to redirect, whereas OpenCalais and AlchemyAPI rather categorize entities and do not directly map to any knowledge base. This fact should at least have been mentioned here, along with a discussion of how the authors solved this.

ad 2.2) It is not said whether the wrapper API runs client or server side, so the question remains open: Did the authors have access to the original posts or solely to the named entities?
Combining multiple APIs for named entity recognition creates the problem of varying results. How do you deal with the issue of contradicting or ambiguous results? Since the entity mappings of OpenCalais and AlchemyAPI are unclear, how have they been integrated? To combine different mapping services in general is a good idea. But the authors have not explained (nor justified) how the results are combined to achieve higher recall/precision or reliability. From my point of view, this would have been a valuable contribution.
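
To make the question concrete, one simple way of reconciling contradicting results is a majority vote per surface form. The following is only a minimal sketch under stated assumptions, not the authors' method: the function name `reconcile`, the `min_votes` threshold, the placeholder service names and the example mappings are all hypothetical, and it assumes every service's output has already been normalised to one URI per surface form.

```python
from collections import Counter


def reconcile(candidates, min_votes=2):
    """Pick one URI per surface form by majority vote across services.

    candidates: dict mapping a service name to a dict of surface form -> URI
    (hypothetical, already-normalised output of each extraction service).
    Returns a dict of lower-cased surface form -> URI, keeping only forms
    whose winning URI has at least min_votes votes and is not tied.
    """
    votes = {}
    for mappings in candidates.values():
        for surface, uri in mappings.items():
            votes.setdefault(surface.lower(), Counter())[uri] += 1

    agreed = {}
    for surface, counter in votes.items():
        ranked = counter.most_common(2)
        top_uri, top_votes = ranked[0]
        tied = len(ranked) > 1 and ranked[1][1] == top_votes
        if top_votes >= min_votes and not tied:
            agreed[surface] = top_uri
    return agreed


# Made-up example: three services agree on "Paris" but disagree on "Jaguar".
example = {
    "service_a": {"Paris": "http://dbpedia.org/resource/Paris",
                  "Jaguar": "http://dbpedia.org/resource/Jaguar"},
    "service_b": {"paris": "http://dbpedia.org/resource/Paris",
                  "Jaguar": "http://dbpedia.org/resource/Jaguar_Cars"},
    "service_c": {"Paris": "http://dbpedia.org/resource/Paris"},
}
print(reconcile(example))
```

A vote of this kind trades recall for precision: ambiguous mentions such as "Jaguar" are dropped rather than guessed, which is exactly the kind of design decision the paper would need to state and justify.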

ad 4.2) The authors state that they manually cleaned the list of the top-200 entities. How was the removal of false positives achieved (and justified)? Did the authors manually inspect the original posts to check all named entity mappings? What was the failure rate of each of the NEE APIs used? Since this sounds like a labour-intensive task that demands tool support, further explanation and description is mandatory.

ad 4.3) Have the top-500 entities been cleaned manually as in Section 4.2? The benefit of printing one full page with the statistics about RDF type segmentation using three different namespaces does not become clear. How does the distribution of the retrieved RDF types compare to the actual numbers of all entities having these RDF types, do they correlate?
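
The correlation asked about here could, for instance, be checked with a rank correlation between the per-type counts observed in the microposts and the per-type counts in the knowledge base. The sketch below is an illustration only: the helper names are hypothetical, the counts are made up, and Spearman's coefficient is just one reasonable choice of measure.

```python
def average_ranks(values):
    """Return 1-based ranks of values, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; assumes neither sequence is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    std_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (std_x * std_y)


def spearman(observed, reference):
    """Spearman rank correlation over the RDF types present in both dicts."""
    types = sorted(set(observed) & set(reference))
    xs = [observed[t] for t in types]
    ys = [reference[t] for t in types]
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up counts: entities per RDF type seen in microposts vs. in the knowledge base.
observed = {"Person": 50, "Place": 30, "Organisation": 10}
reference = {"Person": 1000, "Place": 700, "Organisation": 200}
print(spearman(observed, reference))
```

A coefficient close to 1 would suggest that the observed type distribution merely mirrors the make-up of the knowledge base, whereas a low value would indicate that the distribution says something specific about what readers see.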

ad 4.4.2) This showcase was chosen somewhat clumsily: since the Egyptian government blocked social networks and people started to circumvent the internet barriers by using proxies located in other countries, it is questionable how precisely IP-based geolocation can work in such a case.

ad 5.2/5.3) The authors' contribution to the state-of-the-art in the field of trend analysis has not been named explicitly.

ad 6.) The proposal of metering trends by micropost audience measurement studies is not directly comparable to the broadcast counterpart. While television and radio are solely passive, microposts are also actively created by the user, which would be measured too and would probably affect the (aware) participants' behaviour.

Conclusion:
The approach of regarding the consumer's view is very interesting and the applied methodology seems reasonable and straightforward. But the evaluation performed by the authors is insufficient: it lacks a comparison to the established author-based approaches. The revealed insights are based on singular examples that cannot be generalized.
The initial aim of providing arguments for a paradigm shift to the reader's perspective was not justified with a solid comparison to existing author-focused practices. It would have been beneficial to contrast the outcomes of both approaches. The paper could not show to what extent the suggested approach improves the analysis of microposts.
Technically, the contribution is merely a mash-up of the two browser plugins with existing services. Some details of the implementation and setup remain unclear.

Solicited review by Diana Maynard:

While this paper promises some interesting insights into what Facebook and Twitter users are interested in, which would be a useful piece of research, it sadly fails to deliver in some key areas. First, I find there is very little real research here: the authors do little more than use some existing Named Entity recognition software coupled with some browser extensions to collect the relevant information, and then analyse the frequency of various kinds of entity in order to make some (very vague) generalisations and predictions. Second, they do not situate their work with respect to the quite substantial state of the art in this area - neither in terms of the annotation of social media, nor in terms of the analysis. Furthermore, I find the main novelty premise - that this approach analyses not what people post, but what people read - fundamentally flawed in that (as far as I can tell) it actually performs no analysis of what people read, and therefore the conclusions drawn about what people are interested in are simply not valid. Just because someone posts some information on a user's Facebook wall does not mean that the person is interested in that information, and this is even more true of Twitter feeds, where one has no say over what topics the people one is following post about, and there is no guarantee that just because something appears in one's Twitter feed, it will be read by the user, or that even if read, it will be interesting to the user. There is in any case a big difference between what a user sees and what they are interested in. I don't believe that one can draw conclusions such as "which continent cares most about a catastrophe" from this kind of very simple analysis, other than the obvious conclusions that one could guess anyway (for example, people will be more interested in catastrophes in their own country than elsewhere).
In order for these kinds of conclusions to be drawn, one would need a far larger dataset than this, especially in terms of number of users. Indeed, the authors add the disclaimer that the results are not statistically significant, while still attempting to draw a number of conclusions about user behaviour from them. I find that almost all the comments from the previous reviewers (from the initial submission to MSM workshop) are still valid in this longer paper and have not been dealt with sufficiently: for example, the concerns about potential benefits, significance of the evaluation, the lack of novelty, the analysis of retweets and so on. While the authors have attempted to rebut these issues, their arguments are mostly not convincing enough for me. For example, the "new insights" that one can draw from the results are mostly very obvious things, such as that performing city-level analysis would give insights into brand popularity over time. If the authors had carried out such analyses, and found interesting conclusions, this would be useful, but suggesting such analyses is not really a sufficient "insight" gleaned from the actual analysis done. What I miss in this paper, in particular, is a clear explanation of how this work improves on existing approaches to trend analysis of social media, of which there are many. I am also not entirely clear what constitutes a "named entity" as clearly the authors do not use this in the traditional sense (Person, Organisation, Location etc) but in a much wider sense. I have no problem with named entities being used as the pivot around which the analysis is performed (and indeed, such an approach is not new), but it would be useful to see how this method compares with other techniques. Finally, I would suggest a more grammatically correct title of "A consumer's look at Facebook and Twitter".

In summary, this work could be interesting if it could be clearly shown to go beyond the existing state of the art, both in terms of methodology and in terms of results.

Solicited review by anonymous reviewer:

The authors describe browser extensions for Facebook and Twitter that combine existing named entity extraction tools (Zemanta, …) with Google Analytics. They aim to show how the combination makes it possible to answer new kinds of questions about the consumers of microposts: what topics are they interested in, and where is this interest concentrated, i.e. the geolocations of centers of interest (country, region ..)? Some scenarios are presented with carefully collected data. So, the paper is focused on the consumer (reader) side: a fairly original approach.
Technically, the approach seems efficient, even if it essentially relies on existing tools. The results suit the use cases, though they are often raw, even where the authors propose interpretations.

Some remarks:
The state of the art is maybe too general and should be more focused on similar papers for each of the different aspects.
A subsidiary question: what is a named entity for the authors? The example of cats (4.1) seems strange. But this also relates to the dependence on the extraction tools.
In fact, the consequences of the errors produced by these tools should be evaluated, at least briefly.
The paper is clearly structured and easy to read and follow.
What I am not really convinced of is the importance and scientific novelty of the work presented.
