Review Comment:
The paper presents a study of the quality of literals in Semantic Web data and proposes a number of measures and tools to improve that quality, e.g., by canonicalizing literals and inferring datatypes.
My general impression is that the paper, although conveying many interesting ideas, still lacks coherence and clarity in many places. I will elaborate on these below.
Section 3 describes some benefits of improving the quality of literals, and section 4 defines quality criteria. In my opinion, the order should be reversed: section 3 describes the benefits of improving literals along the criteria defined in section 4, so the criteria should be introduced first. Moreover, other criteria might lead to other benefits (for example, eliminating outliers in numerical literals may lead to better consumption in data mining toolchains, eliminating redundant literals may lead to more efficient storage and transmission, etc.).
Section 4.3 mixes quality criteria that lie in the data itself (e.g., using undefined datatype IRIs) with quality criteria that depend on tools (e.g., using a datatype not implemented by some tools). The authors should separate these more clearly (although section 4.4 hints at a distinction in that direction; see the sketch after this paragraph). In the same section, the notion of "underspecified" is, well, underspecified. As XSD datatypes form a hierarchy, the authors should define a level in the hierarchy which they deem specific enough, and justify that decision. The same holds for language tags: is "de-DE" really better than "de", and, in particular, would you consider it better to repeat a literal that is the same in German, Austrian, and Swiss German with three language tags (e.g., de-DE, de-AT, de-CH)? One could probably argue in both directions here.
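To make the data-vs-tool distinction concrete, here is a rough sketch using rdflib (my own choice of library for the illustration, not necessarily part of the authors' toolchain; the EX namespace is hypothetical). From the perspective of a single tool, a datatype IRI that is undefined and a lexical form that is merely ill-typed can look identical, which is exactly why the criteria should be kept apart conceptually:

    from rdflib import Literal, Namespace
    from rdflib.namespace import XSD

    EX = Namespace("http://example.org/vocab/")  # hypothetical, undefined vocabulary

    # quality problem in the data itself: the datatype IRI is not defined anywhere,
    # so no tool could ever validate the lexical form against it
    undefined_dt = Literal("42", datatype=EX.someUndefinedType)

    # a problem whose detection depends on the tool: the datatype is fine, but the
    # lexical form is ill-typed; rdflib notices only because it implements xsd:integer
    ill_typed = Literal("fourty-two", datatype=XSD.integer)

    print(undefined_dt.value)  # None - rdflib has no value mapping for the unknown datatype
    print(ill_typed.value)     # None - the conversion fails and rdflib logs a warning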
One thing I wonder about in the context of missing or incoherent language tags is proper names, e.g., of persons. Should a literal like "Albert Einstein" really carry a language tag, and if so, which one would be the proper choice?
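Automatic language detection does not resolve this question either: for short proper names, detectors have very little evidence to work with. A quick illustration with the langdetect library (my own example, not necessarily the detector used in the paper):

    from langdetect import DetectorFactory, detect_langs

    DetectorFactory.seed = 0  # langdetect is otherwise non-deterministic

    # a short proper name gives the detector almost nothing to work with,
    # so the guessed language and its probability tend to be unstable
    print(detect_langs("Albert Einstein"))

    # a full sentence provides enough evidence for a reliable detection
    print(detect_langs("Die Relativitätstheorie veränderte unser Verständnis von Raum und Zeit."))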
For the sake of coherence, the wording of section 4 and the following sections should be harmonized. For example, there is the definition of "valid literals (category 'Valid') T_{correctLiterals}" - why not simply call it T_{valid}? There are quite a few such cases where different terms are used in sections 4 and 5, which complicates readability and reduces the coherence of the paper. At the same time, section 4.5 defines metrics, but those are not used later in section 6 to report numbers. So why define them in the first place?
In definition 2, would it not make more sense to divide T_{correctLiterals} by the number of datatyped literals, instead of by all literals (including, e.g., language-tagged strings)?
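Concretely, I have something like the following in mind (writing T_{typed} as a placeholder for the set of datatyped literals; the paper may use a different symbol):

    d_{validity} = |T_{correctLiterals}| / |T_{typed}|

rather than dividing by the number of all literals, which also counts language-tagged strings that are not subject to datatype validation in the first place.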
In general, the numbers reported in section 6.1 are quite shallow. For each datatype, I would also like to see the number of datasets it originates from (e.g., is the error a commonly made mistake, or one made by a single dominant data provider such as DBpedia?). Some more statistics about the dataset at hand would be interesting in general, as well as some statements about the representativeness of the sample compared to, e.g., the Billion Triples Challenge Dataset or the LOD Cloud dataset. I would also appreciate a distribution of the datasets the triples/documents come from, to reveal a potential bias or skew towards a few large/major datasets. Furthermore, not only absolute but also relative numbers would be appropriate for tables 2 and 3, i.e., what portion of xsd:int is invalid/non-canonical. The authors should show the top datatypes both by absolute and by relative numbers.
Sections 6.2 and 6.3 are generally weak. First, it should be defined what it means that a "lexical form contains natural language expressions" (the same holds for "linguistic content" in section 6.3). Furthermore, it is hard to see any relation to the quality metrics in section 6.2; it only reports the distribution of language tags. In both sections, I miss quantitative results. In section 6.3, it is not clear why some of the reported problems (e.g., an underscore) should actually count as quality problems. Furthermore, the "flow cytometer sorter" example opens up the entirely new field of typos and grammatical mistakes, which I assume would be a paper of its own. However, it seems to be addressed only because correctly guessing a language tag from such grammatically incorrect strings (the term "syntax" should be avoided here for the sake of clarity) is problematic for the tools used, not because it is a quality issue in itself.
The selection in section 6.3 is a bit arbitrary. The authors state that they use a sample of documents with a score of less than 40% to identify quality issues that are not detectable by the tool at hand. Why? This implicitly assumes that the distributions of detectable and non-detectable errors are similar, but no evidence whatsoever is given for that.
Section 7.1 states that fixing the DBpedia datatype IRIs would resolve the majority of the problems. As stated above, without a clear profile of the dataset at hand, it is impossible to decide between two hypotheses: (1) most other datasets do not suffer from that problem, or (2) the evaluation data collection has a heavy bias towards DBpedia.
In section 7.2, while it is intuitively clear that the language of a longer literal is easier to determine than that of a shorter one, it is not clear why the F1 score decreases for very long literals; I would appreciate a discussion here. In line with the motivation of a data-driven quality study, it would also be interesting to report which language tags are most often inferred for untyped literals, i.e., which language tags are most often omitted. Is there any notable deviation from the overall distribution of language tags?
In summary, the paper addresses an interesting field, but it lacks clarity and coherence in too many places for me to recommend acceptance.
Minor issues:
* p.1: "Abstract Quality" (in the abstract) is a strange term. Rather use "quality in general" or something like that
* p.1: "First, we create a toolchain" - I expect a "second" in the subsequent text, which never comes
* p.1: "a toolchain that allows billions of literals to be analyzed" - actually, any toolchain can do that, given enough computing power, memory, and time. You should specify some constraints here.
* p.2: missing comma after "Unique Names Assumption"
* p.2: missing comma after "domain violations"
* section 4.1: add an example of a language-tagged string for the sake of completeness
* Fig. 1 can be improved. Inheritance (specialization) arrows usually run from the specific to the general class (not vice versa) and use a non-filled arrowhead. The semantics of dashed and solid rectangles should be explained in the caption. Also, from my understanding, "Well-formed" should probably be solid, not dashed.
* p.8: "For the first metric Luzzu taken..." - sentence is awkward
* p.9: don't use [0], [1], etc. to refer to lines in the listing, as this is easily confused with bibliographic references.
* p.11: "This section present" -> "presents"
* p.15: the example does not make sense, since zh-cn and zn-tw differ in the first two characters anyway