Topic profiling benchmarks: issues and lessons learned

Tracking #: 1604-2816

Authors: 
Blerina Spahiu
Andrea Maurino
Robert Meusel

Responsible editor: 
Guest Editors Benchmarking Linked Data 2017

Submission type: 
Full Paper
Abstract: 
Topical profiling of the datasets contained in the Linking Open Data cloud diagram (LOD cloud) has been of interest for a longer time. Different automatic classification approaches have been presented, in order to overcome the manual task of assigning topics for each and every individual new dataset. Although the quality of those automated approaches is comparable sufficient, it has been shown, that in most cases, a single topical label for one datasets does not reflect the variety of topics covered by the contained content. Therefore, within the following study, we present a machine-learning based approach in order to assign a single, as well as multiple topics for one LOD dataset and evaluate the results. As part of this work, we present the first multi-topic classification benchmark for the LOD cloud, which is freely accessible and discuss the challenges and obstacles which needs to be addressed when building such benchmark datasets.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Michael Röder submitted on 02/Aug/2017
Suggestion:
Major Revision
Review Comment:

The paper discusses the task of assigning topical labels to RDF datasets. The authors look at both variants of this task - the single- and the multi-label classification task. In Section 1, the authors motivate the task before they describe, in Section 2, the LOD cloud dataset they rely on. Section 3 describes the development of their multi-topic benchmark. Section 4 describes the different feature vectors, classification algorithms, sampling techniques and normalization techniques the benchmarked approaches are relying on. In Section 5, the approaches are benchmarked with both available benchmarks - the already existing single-topic benchmark and the newly developed multi-topic benchmark. In Section 6, the authors discuss the results of Section 5. Section 7 presents related work and Section 8 summarizes the paper.

=== Positive aspects

+ This task is very important when working with a growing LOD cloud.
+ The creation of a benchmark is a very important step for this research field since it a) eases the comparison of approaches and b) makes it easier for other researchers to enter the area (since they don't have to invest a lot of time into creating their own ground truth).
+ The novelty of the work is good since there does not seem to be another benchmark like this.
+ The usage of majority-class classification as a baseline is a good choice.

=== Major issues

- The paper seems to be an extension of [45] but the authors tried to change the focus of the paper to fit the special issue. While [45] describes the classification approaches and their evaluation, the authors try to focus this extended version on the benchmark dataset they created for the multi-topic classification task. However, the authors haven't applied this strategy to the complete paper and created a paper that "pretends" to describe a benchmark but still focusses a lot on the approaches and their evaluation. The following points describe why I came to this conclusion.
- Section 4 is called "Benchmark Settings" but describes "4.1. Feature Vectors", "4.2. Classification Approaches", "4.3. Sampling techniques" and "4.4. Normalization techniques", which describe the benchmarked approaches but are not part of the benchmark. The main components of the benchmark should be a) the set of RDF datasets, b) the ground truth (i.e., the topical labels for the individual datasets) and c) a metric to measure the success of a benchmarked system based on the system output. For me, only 4.1 is related to the benchmark itself, since it is interesting to see the statistics about datasets that have values for the single features. The other parts of Section 4 are clearly parts of a system that tries to tackle the task. For example, it is up to the system whether it uses sampling or normalization techniques, while this choice has no influence on the benchmark itself.
- In the introduction, the authors state that they want to "[...] discuss the choke points which influence the performance of [multi-topic profiling] systems". Although they find some interesting choke points, they could present them in a better way. For example, the bbc.co.uk/music example discussed in Section 6.2 shows an important choke point: vocabularies that are only used in a single dataset. Why do the authors not combine that with the statistics that they have about the datasets? They could report, for each vocabulary, how many datasets use it. Another choke point could be the size of the datasets, for which the authors provide a statistic for the complete crawl but not for the datasets that are part of the benchmark.
- Another step to transform this paper more into a benchmark paper (and move it away from simply benchmarking the approaches published in [45]) would be the benchmarking of other approaches. I am aware of the problem that some approaches are highly specialized or might not be available as open-source programs. However, [R1] presents a simple "topical aspect" for their search engine that could be used as a similarity measure for two datasets. Based on that, an easy baseline could be defined that was not part of [45].
- Another hint that the paper still focusses a lot on the approaches instead of the benchmark can be found in the description of the related work. The authors compare nearly all other approaches with their approaches described in Section 4 regarding the features that they are using. If the paper focused on the benchmark itself, the authors might have compared their benchmark dataset with the data used for the evaluation of the other approaches, e.g., the data from [8], or explained why they chose precision, recall and F-measure instead of the normalised discounted cumulative gain used in [8].

Together with the long list of minor issues, the mistakes in writing and the problems in the references, I think that this paper needs a major revision before it can be published in the Semantic Web Journal.

=== Minor issues

- In the abstract, the authors state "it has been shown, that in most cases, a single topical label for one datasets does not reflect the variety of topics covered by the contained content". However, the authors neither prove this statement nor cite a source for it.
- The definitions on page 4 are confusing. Why is a topic T a set of labels {l_1, ..., l_k} when the single-topic classification chooses a single label l_j from the set of labels {l_1, ..., l_p}? Why is L defined in Definition 3 and not in Definition 1, so that it could have been reused in the other definitions? It also looks like l_k has two different roles in Definitions 1 and 3, which should be avoided (a consistent formulation is sketched after this list).
- The description of the gold standard presented in [35] is wrong (pages 5 and 17). The authors state that it would be a gold standard for multi-topic classification. This is wrong because the gold standard from [35] has been created for finding topically similar RDF datasets and does not contain any topical labels or classifications.
- On page 6, the authors write "rdfs:classes and owl:classes". Shouldn't this be rdfs:Class and owl:Class?
- On page 7, the authors write "... as described in the VOC feature vector there are 1 453 different vocabularies. From 1 438 vocabularies in LOD, ..." Why are these two numbers different?
- Footnote 13 on page 7 is not helpful at all since nobody knows when the authors have executed their experiments.
- The description of "overfitting" on page 7 is wrong.
- On page 9, the authors state that "Classification models based on the attributes of the LAB feature vector perform on average (without sampling) around 20% above the majority baseline, but predict still in half of all cases the wrong category". Taking into account Table 3, this sentence seems to be wrong. If LAB achieves 51.85% + 20% (or 51.85% * 1.2), this leads to an accuracy higher than 50%, i.e., more than every second prediction would be correct (a short arithmetic check is given after this list).
- On page 15, the authors cite [34] but I assume that they wanted to cite [35] because [34] does not fit in there.
- On page 17, the authors start a paragraph with "Some approaches propose to model the documents ...". For a reader who is not familiar with [35], it is hard to understand what a "document" is, since it has not been defined before. While I understand why the authors cite [32], it is not clear to me why they cite [33] and [34] in their paper. Neither Pachinko Allocation [33] nor Probabilistic Latent Semantic Analysis [34] is related to their work or the related work they are describing in this paragraph.
- On page 17, the authors state that "approaches that use LDA are very challenging to adapt in cases when a dataset has many topics". This is neither proven by the authors nor do they cite a publication that contains a proof.
- On page 17, the authors write "These approaches are very hard to be applied in LOD datasets because of the lack of the description in natural language of the content of the dataset" when discussing the application of LDA. However, they contradict this statement by citing [35], which is not bound to natural-language descriptions of datasets (LAB or COM) but can also make use of LPN or CPN.
- At the end of Section 7, they briefly repeat the description of [35], but with a faulty reference to [34].
- The authors state that the benchmark will be made publicly available. However, I couldn't find a link in the paper to the benchmark (there is only a link to the LOD cloud data crawled with LDSpider). Since this is not a blind submission, I do not see a reason why the authors do not allow the reviewers to have a look at the benchmark itself.
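
Regarding the point about the definitions on page 4: one way to make them consistent is sketched below (a minimal sketch with illustrative symbols; this is not the paper's own notation).

```latex
% Illustrative sketch of one consistent notation (not the authors' definitions):
\begin{align*}
  \mathcal{L} &= \{\, l_1, \dots, l_p \,\}
    && \text{label set, defined once and reused in all definitions} \\
  f_{\mathrm{single}}(d) &= l_j \in \mathcal{L}
    && \text{single-topic: exactly one label per dataset } d \\
  f_{\mathrm{multi}}(d) &= T \subseteq \mathcal{L},\ T \neq \emptyset
    && \text{multi-topic: a non-empty subset of labels}
\end{align*}
```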
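Regarding the point about the accuracy on page 9: a short check of both readings of "around 20% above the majority baseline", using the 51.85% figure quoted above, shows that either reading exceeds 50% accuracy and therefore contradicts "wrong in half of all cases".

```latex
% Both readings of "around 20% above the majority baseline" exceed 50% accuracy:
\[
  51.85\% + 20\% \;(\text{absolute}) = 71.85\% > 50\%,
  \qquad
  51.85\% \times 1.2 \;(\text{relative}) \approx 62.2\% > 50\%
\]
```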

=== Writing Style

The paper has a high number of grammatical errors and typos, making some parts of the paper hard to read. In the following, I will list some of the errors (I gave up collecting all of them on page 7). However, it is not sufficient to fix only the errors listed here. A check of the complete paper (maybe by somebody who is not one of the authors) is highly recommended.

- Page 1 "for one datasets" --> "for one dataset"
- Page 6 "We extracted ten feature vectors because want to" --> "... because we want ..."
- Page 6 "We lowercase all values and tokenize them at space characters and filtered out all values shorter than 3 characters and longer that 25 characters" --> "We lowercase all values, tokenize them at space characters and filtered out all values shorter than 3 characters or longer that 25 characters"
- Page 6 "This because"
- Page 6 "In the LOV website, there exist 581 different vocabularies." In this sentence, "in" seems to be the wrong preposition. There are a lot of discussions, whether "on" or "at" are correct when talking about things that can be found on (or at) a website (e.g., https://english.stackexchange.com/questions/8226/on-website-or-at-website).
- Page 7 "Among different metadata, it is also given the description in natural language for each vocabulary." --> "Among different metadata, the description in natural language for each vocabulary is given."
- Page 7 "581^13" --> a footnote shouldn't be added to a number. Otherwise the number of the footnote can be confusing.
- Page 7 "While in LOD as described in the VOC feature vector there are 1 453 different vocabularies" ?

- The paper shows some minor formatting problems that need to be fixed before publishing it.
- Several words are written into the margin (i.e., hyphenation rules should have been applied). This can be seen on pages 3, 4, 6, 9 and 16.
- Tables 3 and 8 are too wide.
- While two feature sets are called PURI and CURI they are called "PUri" and "CUri" in the tables.

=== Paper References

- The paper has 45 references. However, it seems like the authors don't have a good strategy to handle these references because several references are listed twice ([9] = [34], [21] = [26], [24] = [35], [25] = [44], [36] = [38]) and [16] simply seems to be a newer version of [17].
- [28] has a character encoding problem in the title.
- [30] has only authors and title. It is missing additional data, e.g., the conference, publisher or year. At least I couldn't find it with the given information.
- [45] looks like the title has not been defined correctly, since it is not formatted as a title.

=== Comments

- It might be better to use the F-measure for the inter-rater agreement, see [R2] (the corresponding formula is given after these comments).
- On page 5, "but this work was done before" should be replaced with "but our work was done before" since "this" could refer to different papers

[R1] Kunze, S., Auer, S.: "Dataset retrieval". IEEE Seventh International Conference on Semantic Computing (ICSC), 2013.
[R2] George Hripcsak and Adam S Rothschild: "Agreement, the f-measure, and reliability in information retrieval". Journal of the American Medical Informatics Association, 12(3):296–298, 2005.

Review #2
By Antonis Koukourikos submitted on 02/Oct/2017
Suggestion:
Major Revision
Review Comment:

The paper discusses the application of various sampling and classification techniques for carrying out multi-topic classification of datasets in the Linked Open Data Cloud.
A significant drawback of the paper is that the reported experiments operate over an obsolete version of the LOD cloud, observed in 2014, while the most recent, significantly extended version dates from August 2017. The substantial growth of the Linked Data universe in this intermediate period makes the presented results somewhat limited, so I think that an update with more recent data is needed.
Another confusing aspect is the fact that, while the authors express the need for obtaining and circulating high quality benchmarks for the problem, their own contribution is not sufficiently emphasised and analysed. On the other hand, it is commendable that they make their benchmark dataset directly available to the community.
One of the strongest aspects of the paper is that the experimental methodology is sound and complete, with multiple sampling and classification approaches used. The presentation of the results for all discussed algorithms helps the reader to acquire a clear view on the impact of selecting different approaches for solving the problem.
Another drawback of the manuscript is the fact that a large part of the paper refers to already published work, namely experimentation on the single-topic classification problem that preceded the current multi-topic classification experiments. While the inclusion of previous experiments is useful and showcases the continuity of the authors' work, it constitutes a disproportionately large part of the paper. It would be preferable to limit the relevant section and dedicate the additional space to a more thorough analysis of the results and processes for the multi-topic classification experiments.
Overall, the presented work is of high quality in terms of experiment design and presentation of the core technical and algorithmic selections. However, the paper is lacking on the following aspects:
- The used datasets are not up-to-date;
- While the dire need for benchmarks on the field is emphasised, the analysis of the constructed benchmark is brief and shallow;
- Much space is dedicated to the presentation of previous work;
- More discussion / qualitative analysis of the results should be included;
- There are multiple syntax and grammar errors.
Based on the previous remarks, I would suggest that the paper be accepted after major revisions, as an iteration of the experiments over the current state of the LOD cloud is required, and an extensive rewriting will be needed in order to (a) further discuss the properties of the provided benchmark, (b) analyse the results of the multi-topic classification experiments and (c) correct, with the help of a native speaker, the language used throughout the manuscript.

Review #3
By Nikolay Nikolov submitted on 08/Nov/2017
Suggestion:
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

The paper presents a Linked Data benchmark for topic profiling of generic datasets. The authors present two approaches for single- and multi-topic profiling. They further provide an analysis of the performance of the approaches and explain the quality of the results. Topic profiling in the web of data is an important challenge that needs to be addressed in order to increase the adoption of linked data. The paper presents a practical approach to addressing this issue (with certain limitations that are described in the paper) and provides novel insights about the challenges that need to be addressed in this area. Given the aforementioned aspects, I conclude that the paper's topic is highly relevant to the journal issue on Benchmarking Linked Data.

Formatting and structure:
The paper generally uses language and grammar at a fair level, its structure is logical and it is mostly well-formatted (with some notable exceptions that need to be addressed). I have also noticed a large number of minor grammatical/phrasing errors that I list below.
* Formatting issues:
- Fig. 1 is impossible to read (both the legend and the labels of the nodes of the LOD graph)
- Fig. 3 is very difficult to read
- text outside of the column border in several places - in section 2, page 3; section 2, page 4; section 4.1, page 6; section 7, page 16
- Table 3 goes outside the print area
- Table 7 is badly formatted
* (minor) duplication in introductions to sections/sub-sections - the last paragraph of the Introduction section states the topics of the individual sections, so it is not necessary to include introductory paragraphs at each section to repeat it. The same goes for sub-sub-section 5.2.1 - in my opinion it is not necessary to repeat introductory sentences given the above paragraph (describing sub-section 5.2).
* Grammatical/phrasing errors:
- (Abstract) "quality of those automated approaches is comparable sufficient, it has been shown, that in most cases, a single topical label for one datasets does" - "comparably" instead of "comparable" and "one dataset" instead of "one datasets"
- (Section 1) "Up till now, topical categories were" - should be "Up until now, topical categories have been"
- (Section 1) "In database community the benchmark series" - should be "In the database community, the benchmark series"
- (Section 1) "Although the importance of such needs" should be "Despite the importance of such needs"
- (Section 2) in the descriptions of each of the topical categories, there should be a full article in front of the name of each category - e.g., "THE government category contains Linked Data published by [...]"
- (Section 3) "To assign more than one topical category to each dataset the researchers cloud access the descriptive metadata" - the word "could" has been misspelled
- (Section 3) "Also, the work presented in [35], build a gold standard" - "builds" instead of "build"
- (Sub-section 4.1) "if people in a dataset are annotated with foaf:knows statements or if her professional affiliation is provided" - should be "their" instead of "her"
- (Sub-section 4.1) "reduces the diversity of features, but on the other side might increase the number of attributes" would be better phrased as "reduces the diversity of features, but, on the other hand, might increase the number of attributes"
- (Sub-section 4.1) "Among different metadata, it is also given the description in natural language" would be better phrased as "Among other metadata, LOV also provides the description in natural language"
- (Sub-section 4.1) "In the LOV website, there exist 58113 different vocabularies. While in LOD as described in the VOC feature vector there are 1 453 different vocabularies." - the sentence starting with "While" is grammatically incorrect. I suggest to merge the sentences: "In the LOV website, there exist 581 different vocabularies, while in LOD, as described in the VOC feature vector, there are 1 453 different vocabularies."
- (Sub-section 4.2) "Classification problem has been widely studied" - "Classification" should have the full article - "The classification problem has been widely studied"
- (Sub-section 4.2) "While Jaccard distance is a good measure when the data in input are of different types" (if I understood correctly) should have a different connecting word - e.g., "On the other hand, Jaccard distance is a good measure when the data in input are of different types"
- (Sub-section 4.3) "In Table 2 it is given an example how" could rather be "Table 2 shows an example how"
- (Section 5) "in order to show the goodness of" should be rephrased - e.g., "in order to show the advantages of", or "in order to show the applicability of"
- (Sub-section 6.2) "At a second moment we" should be rephrased - e.g., "Later, we" or "In addition, we"
- (Sub-section 6.2) "In table 9 we summaries" the authors seem to have misspelled "summarise"
- (Section 7) "description of the topically" should be "description of the topical"
- (Section 8) "The multitopic benchmark is heavy imbalance" should be "The multitopic benchmark is heavily imbalanced"
- (Section 8) "not using such vocabulary" should be "not using such a vocabulary"

Content:

* In my opinion, since the major contribution of the paper is about topic profiling benchmarks specifically in LOD (other domains are discussed in related work), the authors should consider changing the title accordingly - e.g., "Topic profiling benchmarks in the Linked Open Data Cloud: issues and lessons learned"
* (Sub-section 4.2) "In our experiments, based on some preliminary experiments on a comparable but disjunct set of data, we found that a k equal to 5 performs best." - the authors should include a statement on what exactly was the premise of their preliminary experiments, since the value of the 'k' coefficient is critical to the output of the algorithm
* (Sub-section 4.2) In their description of the Naive Bayes classification algorithm, the authors should include some more details on why it is appropriate in the domain of LOD. The only argument given is that it is "easy to build" and therefore good for large datasets, which is not enough to justify its use, especially when they further state that it is built on "mostly a rather poor assumption" (see the note after this list).
* (Sub-section 5.2.1) The authors state that "While the second challenge is related to the independence of the labels and also some datasets might belong to an infinite number of labels." - I could not understand how some datasets might belong to an "infinite" number of labels - did the authors mean to say "very large" instead?
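
For the Naive Bayes point above: the "rather poor assumption" presumably refers to the conditional independence of the features given the class, which is also what makes the classifier cheap to train on large, sparse feature vectors. A standard statement of the decision rule, for reference (not taken from the paper):

```latex
% Standard Naive Bayes decision rule: features x_1, ..., x_n are assumed
% conditionally independent given the class c (the "naive" assumption):
\[
  \hat{c} = \arg\max_{c \in \mathcal{C}} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
\]
```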

Overall, I think the paper provides novel insights and is highly relevant for the journal, and it should be accepted once the minor issues described above have been addressed.