Review Comment:
The authors investigate the use of crowdsourcing for the task of link prediction and schema mapping. In particular, they examine qualitative differences between annotations made by trained experts and those made by lay people in the crowd.
The paper is well written and clearly communicates goals and methods.
I strongly urge the authors to rephrase the manuscript's title. It is currently much too general in scope given what was actually done: the manuscript focuses exclusively on a single annotation task type and makes no statement about the multitude of other possible applications (e.g., classification, surveys, etc.). The title should reflect the manuscript's true scope rather than leading the reader to believe this is a general investigation of experts vs. the crowd.
Similarly, I strongly disagree with the authors' claims of novelty. Over the years, we have seen many papers that investigate both the expert/crowd comparison and the influence of different interfaces. Here are some concrete examples:
-[5], Eickhoff 2013
-Alonso, Omar, and Stefano Mizzaro. 2009. "Can we get rid of TREC assessors? Using Mechanical Turk for relevance assessment."
-Kazai, Gabriella, et al. 2011. "Crowdsourcing for book search evaluation: impact of hit design on comparative system ranking."
-Kazai, Gabriella, et al. 2012. "The face of quality in crowdsourcing relevance labels: Demographics, personality and labeling accuracy."
As a consequence, none of the proposed research questions struck me as particularly novel:
RQ#1/RQ#2:
-Zhang, Chen Jason, et al. 2013. "Reducing uncertainty of schema matching via crowdsourcing."
-Wang, Jingjing, et al. 2014. "Learning an accurate entity resolution model from crowdsourced labels."
-Sarasua et al. 2012 [20]
-Mortensen, Jonathan M. 2013. "Crowdsourcing ontology verification."
-Eickhoff, Carsten. 2014. "Crowd-powered experts: Helping surgeons interpret breast cancer images."
RQ#3:
-[5], Eickhoff 2013
-Kazai, Gabriella, et al. 2011. "Crowdsourcing for book search evaluation: impact of hit design on comparative system ranking."
RQ#4:
-Alonso, Omar, and Ricardo Baeza-Yates. 2011. "Design and implementation of relevance assessments using crowdsourcing."
-Kazai, Gabriella. 2011. "In search of quality in crowdsourcing for search engine evaluation."
As far as I can see, this manuscript only ever compares the performance of single crowd workers with that of single experts. In practice, however, we typically find that, while a single crowd worker's performance does not compete with that of a single expert, solutions aggregated across multiple crowd workers do meet the performance of a single expert. In order to obtain realistic results that are comparable to the state of the art in crowdsourcing, I urge the authors to consider and evaluate aggregation schemes such as majority voting as well (see the sketch below).
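As a minimal illustration of what such an aggregation could look like, the following Python sketch uses hypothetical labels and is not tied to the authors' data or interfaces:

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label among several workers' annotations
    for one item; ties are broken arbitrarily by Counter.most_common."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical example: three crowd workers judge the same candidate link.
worker_labels = ["correct", "incorrect", "correct"]
print(majority_vote(worker_labels))  # -> correct
```

More elaborate schemes, such as worker-weighted voting or EM-based label inference (e.g., Dawid-Skene), are also commonly evaluated in the crowdsourcing literature and could be considered.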
The payment of 0.1 cent per annotation is extremely low even by crowdsourcing standards. It would be interesting to discuss how long people spent on each judgement task (experts as well as crowd workers) given this pay rate. Previous work often finds time-on-task to be a good proxy for quality.
There is another, undocumented difference between Interfaces 1 & 2. Whereas in the first condition the annotator is immediately presented with the full range of choices as a radio group, in the second condition the interaction appears more tedious, requiring the annotator to open a drop-down menu. Was this a conscious choice? If so, what was the intended/observed effect?
In the overview of the authors' own previous work in Section 2, it would be helpful to briefly describe the concrete task that was addressed in [8].
It is interesting that 'people' and 'society' show such fundamentally different difficulties. Are there any explanations for this rather counter-intuitive observation?
=== Presentation Remarks ===
-Erred-on and Erred-as could be labeled as 'false negatives' and 'false positives', respectively, to improve intuitive interpretability.
-Significance of results should be indicated in tables and figures rather than only being mentioned in the text.
-Sec. 2: "pattern" -> "patterns"
In summary, this manuscript gives an interesting overview of observations relating to the use of crowdsourcing for the task of link prediction and schema matching. Unfortunately, the authors significantly overstate both the novelty of their contribution and the generality of their findings (see the title discussion above). Additionally, a standard practice, the aggregation of individual crowd workers' answers, appears to have been ignored, which makes the reported results unrealistic. As a consequence, I cannot recommend acceptance of this manuscript in its current state.