MTab4DBpedia: Semantic Annotation for Tabular Data with DBpedia

Tracking #: 2609-3823

Authors: 
Phuc Nguyen
Natthawut Kertkeidkachorn
Ryutaro Ichise
Hideaki Takeda

Responsible editor: 
Jens Lehmann

Submission type: 
Full Paper

Abstract: 
Semantic annotation for tabular data with knowledge graphs is a process of matching table elements to knowledge graph concepts, then annotated tables could be useful for other downstream tasks such as data analytic, management, and data science applications. Nevertheless, the semantic annotations are complicated due to the lack of table metadata or description, ambiguous or noisy table headers, and table content. In this paper, we present an automatic semantic annotation system designed for the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2019), called MTab4DBpedia, to generate semantic annotations for table elements with DBpedia concepts. In particular, our system could generate Cell-Entity Annotation (CEA), Column-Type Annotation (CTA), Column Relation-Property Annotation (CPA). MTab4DBpedia combines joint probability signals from different table elements and majority voting to solve the matching challenges on data noisiness, schema heterogeneity, and ambiguity. Results on SemTab 2019 show that our system consistently obtains the best performance for the three matching tasks: the 1st rank all rounds (the four rounds), and all tasks (the three tasks) of SemTab 2019. Additionally, this paper also provides our reflections from a participant’s perspective and insightful analysis and discussion on the general benchmark for tabular data matching.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Vasilis Efthymiou submitted on 27/Nov/2020
Suggestion:
Major Revision
Review Comment:

Summary:
The paper presents MTab4DBpedia, a system that performs semantic annotation of tables, using DBpedia as the target knowledge graph. The tasks performed by MTab are i) annotating each cell of a table with an entity from DBpedia, ii) annotating each column of a table with a DBpedia class (aka type), and iii) annotating pairs of table columns with DBpedia properties. MTab consists of 4 steps, while the last 3 are repeated to provide better / more refined results. The first step is probably the most important one, since the authors perform pre-processing (text decoding, language detection, and data type detection), as well as a first assessment of the entity types, and a thorough lookup over several endpoints (DBpedia, Wikidata, Wikipedia). Using score and rank aggregation from the lookup results of Step 1, in Step 2 the system estimates the candidates for cell annotations. Subsequently, Step 3 produces the column annotations using majority voting over the cell annotations and data type detection, while numeric values are handled in a special way (using the EmbNum+ model). Then, in Step 4, relationships between columns are detected by querying DBpedia's SPARQL endpoint and aggregating over the cell entity annotations of every row of the table for a given pair of columns. Using the results of Steps 2-5, the authors re-estimate their annotations for each task and return the final results. MTab participated in the SemTab 2019 challenge and won in every round and every dataset with very impressive results.
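As a rough illustration of the majority-voting idea used in Step 3 (this is not the authors' code, and the candidate types below are invented for the example), a column type can be chosen by counting the DBpedia types attached to each cell's top candidate:

```python
from collections import Counter

def column_type_by_majority(cell_candidate_types):
    """Pick a column type by majority vote over the DBpedia types
    of the top-ranked entity candidate of each cell in the column."""
    votes = Counter(t for cell in cell_candidate_types for t in cell)
    return votes.most_common(1)[0][0] if votes else None

# Toy column: three cells whose top candidates carry these (invented) types.
column = [
    ["dbo:Philosopher", "dbo:Person"],
    ["dbo:Writer", "dbo:Person"],
    ["dbo:Person"],
]
print(column_type_by_majority(column))  # -> dbo:Person
```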

The contribution of MTab to the research community is considered very important. However, there are some serious problems with the submitted paper that make me ask for major revisions. I believe that the authors will put enough effort to significantly improve the presentation of the paper, which is currently of low quality. Please find my detailed comments regarding originality, significance of the results, and quality of writing, below.

(1) Originality: OK
This paper is an extended version of the MTab4DBpedia system paper that was published in the SemTab challenge proceedings, which is also available on arXiv. In the submitted paper, the authors have included an error analysis section and a discussion about the SemTab challenge.
In Section 1, you should be more detailed about the exact differences to your arXiv paper (reference [5]). You should also cite your paper in the SemTab 2019 proceedings and explicitly state the differences to this submitted paper. If the SemTab proceedings paper is the same as the one on arXiv, you can just cite both previous papers.

(2) Significance of the results: Great
This is the strongest part of the paper.

- The authors have done a remarkable job in constantly showing the best performance among all rounds and all tasks of the SemTab challenge. As one of the challenge co-organizers, I know that this was not at all easy and the competition was quite strong.

- The insights provided in Section 4.5 were a true delight to read. Very informative, very helpful, and showing that the authors have a very good understanding of the domain. One thing that could be improved, though, is that the authors focus too much (or sometimes entirely) on the errors of the benchmark data provided by the challenge, and say nothing about the errors of the MTab system. I would like it even more if the authors also described the errors of MTab; as written, the analysis reads as if the scores would have been perfect had there been no errors in the benchmark data. What are the cases where the benchmark data and ground truth were correct but MTab failed?

(3) Quality of writing: Poor
This is the weakest point of the paper. In its present state, the paper seems quite sloppy, with too many typos and grammar mistakes that should have been caught even in a single proof-reading cycle. I would advise the authors to seek the assistance of a native English speaker, if needed, or to spend many more iterations on proof-reading. At this point, I don't think it even makes sense for me to list all the errors. Perhaps a very small indicative list of errors is the following:
- Section 1: "dbr:Authur_Drews" and "dbo:deadYear" should be "dbr:Arthur_Drews" and "dbo:deathYear", respectively.
- "the 1st rank all rounds (the four rounds), and all tasks (the three tasks)" should be something like "the first rank in all (four) rounds, and all (three) tasks"
- Section 2.1 starts with "We denotes"
- In Section 2.1, the sentence starting "The intersection between a row (...)" is describing something rather straightforward, in a rather incomprehensible way.
- Section 2.1: "The tabular to KG matching problems could be formalized the three tasks as follows" --> "(...) could be formalized as the three following tasks:"
- Section 2.1, CEA definition (and throughout the paper): "relevance entity" --> "relevant entity". Also, "match (...) to" is preferred to "match (...) into".
- Section 2.1, CTA definition: what does "and its ancestors" mean? Is the goal to match a column to a class hierarchy?
- Assumption 1: where does the name "error matching samples" come from?
- Assumption 4 could be more clearly written as "All the cell values of the same column have the same data type, and the entities related to those cell values are of the same type."
- Assumption 5: why is this an assumption? your solution also works without this assumption; it does not seem to be a requirement for you.
- Section 3.1: "(...) as 7 steps pipeline (...)" --> "(..) as a seven-step pipeline (...)"
- Section 3.3. The sentence starting with "q is a list of ranking of entities (...)" is incomprehensible. I don't understand what q is exactly. This makes me unable to understand Q, as well as E. q is a list of ranking...? I would expect q to be a query. Are the entities in E only the correct (relevant) entities, or all candidate entities? Then why do you use the term "relevant" both as the goal and the candidates? I am very confused with the notation used.
- before Equation 6: "In our setting, we select \alpha_e = 100" This has been already stated as footnote 5
- Equation 6: what is the intuition behind this formula? Can it ever take the value 1?
- Section 3.4.1: "Next, we use DBpedia endpoint to infer the classes (types) from those relations as Figure 4." Please also describe this process in words.
- Figure 5: The text does not mention a re-ranking. Please elaborate: what are the blue and yellow columns? Where did they come from? Why do you re-rank? Where is this ranking used?
- Section 3.5.1 "Figure 5 illustrate (...)" --> by the way "illustrates", but more importantly, how is this paragraph / Figure related to the rest of Section 3.5.1?
- Section 3.5.2: The first sentence is a secondary sentence; the main sentence is missing, i.e., given that (...), we have what?
- Section 3.5.2: "We select those pairs have ratio larger than \beta." --> by the way "(...) pairs that have a ratio (...)" To which ratio are you referring? Between what?
- Section 3.5.2 (for textual values): high Levenshtein distance -> high ratio? so you select pairs which are more than \beta distant (less than \beta similar)?
- Equation 12: Is it ever possible for two integer numbers to satisfy both conditions of the first "if"? For the max of their absolute values to be 0, both numbers must be equal to 0, which means that their difference is also 0. What am I missing? (See the short sketch after this list.)
- Section 3.5.2: "We aggregate all relevance ratio with respect to relations." what does this mean?
- Equation 13 uses a rather unfortunate notation, looking like you mean "probability of r, given m_j1, m_j2, m_j2 (...)"
- Section 3.6: how do you learn all those 10 "learnable parameters"?
- Table 1: you should at least mention in text, if not in table headers, that the numbers shown are the number of target cells (in CEA), columns (in CTA), and column pairs (in CPA) that were annotated.
- Section 4.2: "The challenge attached many research teams' attentions" --> "The challenge attracted the attention of many research teams"
- Section 4.3: "The Precision scores was also used" --> "Precision was used"
- Section 4.5.1: "None decoded", "No decode" are not correct terms. Consider using "encoded" or maybe "un-decoded"
- Section 5.2: "MTab4Wikidata" --> "MTab4DBpedia"
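The Equation 12 item above questions the corner case where the maximum of the two absolute values is zero. One common relative-difference formulation consistent with that reading is sketched below (a reconstruction for illustration, not necessarily the exact formula in the paper):

```python
def relative_difference(a, b):
    """Relative difference between two numbers: |a - b| / max(|a|, |b|),
    with the zero denominator handled as a separate case."""
    denom = max(abs(a), abs(b))
    if denom == 0:  # only possible when a == b == 0, so a - b == 0 as well
        return 0.0
    return abs(a - b) / denom

print(relative_difference(0, 0))     # 0.0 -- the degenerate case from the review
print(relative_difference(98, 100))  # 0.02
```

Under this reading, the first branch can only fire when both values are zero, in which case their difference is necessarily zero as well, which is exactly the reviewer's point.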

There are many, many more typos and grammar/syntax mistakes that I will leave to the authors to detect and fix.

Review #2
By Ivan Ermilov submitted on 06/Dec/2020
Suggestion:
Major Revision
Review Comment:

The paper describes the MTab4DBpedia system, which performed the best in the SemTab 2019 challenge.

The paper is based on a workshop/system description paper written by the authors in 2019 - MTab: Matching Tabular Data to Knowledge Graph using Probability Models (short paper).

The described system is not available on GitHub at the moment (see the empty repository located at https://github.com/phucty/MTab). As this was originally a system paper, and there is a comparison of the MTab system with others using an openly available benchmark, I will find it difficult to accept this paper if the GitHub repository does not contain the source code and instructions on how to repeat the experiments.

The work presented in the paper is definitely interesting for the Semantic Web community and is a considerable contribution to tabular knowledge disambiguation (i.e., matching tables to graphs). Unfortunately, at the moment the paper is hard to read. In particular, the part of the paper dealing with the formalisms requires additional work. See my detailed comments below:

Page 3, Assumption 4. Column cell values have the same entity types and data types.

There is no explanation for this assumption.
For example, a cell "A. Drews" refers to dbr:Arthur_Drews, who is a dbo:Philosopher.
The entity type for the cell is dbo:Philosopher.
However, the data type in my opinion is a string.

Page 4, paragraph 3.3: "Step 2: Entity Candidate Estimation"

It is not clear what the authors tried to convey with the paragraph.

The set Q is a set of ranking result lists from lookup services.
So, if there are four services, there will be four lists.
However, the limit alpha(e) for lookup results is set over the whole set of lookup results.
I don't understand why the formalism of Q as a set of lists was introduced, because the separation between the lookup services (i.e., lists) is not used anywhere in the paragraph.
The confidence score is defined over each q, but then aggregated using max function over the whole Q set.
Also, formula 5 uses the global limit alpha(e) on the result set from all the lookup services with a local ranking index rank(e) for a particular lookup service.

On page 5 line 1, the set E is a set of relevant entities (not relevance entities).
On page 5 line 4, reference to "these specific relevance functions" is not clear.

On page 5 line 18, the conditional probability has to be explained.
I read it as "Given that set Q exists, there is a set of relevant entities E in it"

Formula 6 is missing a distinction between the current entity (denoted simply e) and the other entities inside the set E.
The way it is right now, it will always be equal to 1.
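To make the structure described above concrete, the sketch below follows the review's reading: a per-service confidence based on the rank of an entity within a lookup result list limited to alpha_e = 100 entries, aggregated over the set Q with a max function. The linear rank-decay inside rank_confidence is an assumption chosen purely for illustration and is not necessarily the paper's exact Equation 5.

```python
ALPHA_E = 100  # lookup-result limit mentioned in the paper

def rank_confidence(rank, alpha_e=ALPHA_E):
    """Assumed per-service score: decays linearly with the 1-based rank."""
    return max(0.0, (alpha_e - rank + 1) / alpha_e)

def aggregate_over_services(ranked_lists):
    """ranked_lists: one ranked list of entity IDs per lookup service (the set Q).
    Per-entity scores are aggregated across services with max, as in the review."""
    scores = {}
    for q in ranked_lists:                       # each q comes from one service
        for rank, entity in enumerate(q[:ALPHA_E], start=1):
            scores[entity] = max(scores.get(entity, 0.0), rank_confidence(rank))
    return scores

# Toy example with two services returning overlapping candidates.
q_dbpedia = ["dbr:Arthur_Drews", "dbr:Drews"]
q_wikidata = ["dbr:Arthur_Drews"]
print(aggregate_over_services([q_dbpedia, q_wikidata]))
# {'dbr:Arthur_Drews': 1.0, 'dbr:Drews': 0.99}
```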

Page 5, paragraph 3.4.1. Numerical Column.

The paragraph discusses identification of numerical columns.
On line 18, relation r is introduced: "The confidence score of a relation r is calculates as the following equation."
It is not clear what this relation is (between what and what?).
None of the further explanation is understandable because of this.
It also includes phrases such as "the probability of potential of relation" and "those confidence scores are associated as the probabilities of type potential Pr(t|Mnum) given Mnum", which are hard to understand (what is a probability of potential?).

Other remarks:

- Page 2, line 32: In section 4 --> In Section 4 (inconsistent capitalization)
- Page 2, line 38: the related works --> the related work
- Page 5, line 34: the columns are an entity column --> the column is an entity column (mixing singular/plural inside one sentence)
- Page 5, line 37: for numerical columns_6 --> for numerical columns6 (remove space before footnote number)
- Page 5, line 46: labeled attributes_7 --> labeled attributes7 (remove space before footnote number)
- Page 5, line 47: ranking of relevance numberical attributes --> ranking of relevance of numerical attributes (has to be rephrased maybe in a different way, the original sentence is not proper English)
- Figure 4. DBpeida --> DBpedia
- Page 5, line 28: from those relations as Figure 4 --> from those relations as depicted in Figure 4 (for example)
- Page 5, line 44: the probabilities of type potential from numerical columns --> what did the authors mean by this? the same on the next line. What is the "probabilities of type potential"?
- Page 6, line 44: symbol r (used previously for relation) is used for a row

Suggestions:

- the authors might consider using the microtype LaTeX package to have aligned and nicer-looking columns
- avoid using "those", "these" inside the sentences as in Page 5, line 4 "these specific relevance functions"

Review #3
Anonymous submitted on 11/Dec/2020
Suggestion:
Minor Revision
Review Comment:

This paper proposes MTab4DBpedia - an automatic semantic annotation system to generate semantic annotations for table elements with DBpedia concepts. The authors address the three matching tasks (Cell-Entity Annotation (CEA), Column-Type Annotation (CTA), and Column Relation-Property Annotation (CPA)) described in the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2019).
MTab4DBpedia combines joint probability signals from different table elements and majority voting to solve the matching challenges of data noisiness, schema heterogeneity, and ambiguity.

The problem they tackle is important and the idea is attractive to pursue.
I have several issues with this paper, and I would like to see the authors polish this study by addressing them.

- The authors mentioned that MTab4DBpedia is an extension of MTab, but they didn't differentiate between them in the paper. In Section 4.4, they describe the results of MTab4DBpedia, but according to SemTab 2019, these are the results for MTab. So what is the difference between the original system and its extension? It would be better to compare the results of MTab and MTab4DBpedia (the extension) and describe the improvements.
- Limitations and future work for MTab (ref. 5) and MTab4DBpedia are the same!
- “however, our system improves matching performance with the novel solutions as follows.” → This could be presented as a list of contributions.
- Page 2, right column, line 17, “In the end, results on SemTab 2019 show that MTab4DBpedia” → Up to this paragraph, it is better to mention MTab (not MTab4DBpedia), since MTab is the system that participated in SemTab 2019 and MTab4DBpedia is an extension of it.
- Page 4, left column, line 39, remove space between “duckling” and superscript (2)

Review #4
Anonymous submitted on 23/Dec/2020
Suggestion:
Major Revision
Review Comment:

This paper describes the "MTab4DBpedia" system, implemented to take part in the SemTab 2019 Challenge. The system has won the challenge, and a detailed publication on it beyond what is already available can certainly benefit the community. However, the paper in its current form has several shortcomings as a 'full paper' for this journal. My overall recommendation is either a major revision addressing the shortcomings as outlined below, or a less major revision but submission as "Reports on tools and systems".

Since this manuscript was submitted as 'full paper', I first start with a summary of the paper's contributions along the usual dimensions for research contributions:
(1) originality
The paper is a very minor extension of a currently available arXiv paper. Even if we treat this paper as an original manuscript and treat the arXiv paper as unpublished, the core technical contributions are adaptations of existing ideas, so this paper does not get a high score on originality. A potential original contribution could be a more formal description of the methods used, and a detailed analysis of the effect of the various design decisions made in building the system.
(2) significance of the results
Although the described system has won the challenge and so is a significant contribution to the research in this area, the results described in this paper do not add much to what is already available and known, and could therefore be viewed as insignificant.
(3) quality of writing.
This is another major shortcoming of the paper. There are many English language and grammar issues, in addition to notation issues. The paper needs a serious proof-reading before publication.

More detailed comments:
1) The paper has numerous presentation issues, starting from the title. Here are some examples from the first few pages:
Page 1
- Title: Semantic Annotation for Tabular Data -> Semantic Annotation of Tabular Data
- line 17: a process -> the process
- line 18: data analytic -> data analytics
- line 19: semantic annotations are complicated -> it's the process that is complex, not the annotations!
- line 23: Relation-Property: Relation/Property?
challenges on ... -> do you mean challenges that arise from ...
- line 46: matching table elements into knowledge graphs -> matching table elements WITH knowledge graphs?
- line 49: ; therefore, it is easy to use in -> to be used ?
Page 2
- line 13: dbo:deadYear -> dbo:deathYear ?
- lines 42-50: you repeat this twice: "We introduce a scoring function to estimate the uncertainty from the relevance ranking list returned from a lookup service"
Page 3:
- denotes -> denote
- Your definition of a graph is not right. A graph is not just a "set" of entities and types and relations, but also the associations between them. A typical graph is defined as G=(V,E), where V is a set of vertices and E a set of edges connecting the vertices (and not just the edges!). You can expand this to include types/relations, but don't forget to define what is in each component and how the components relate.
- CEA is not mapping c_{i,j} to E but to an e \in E. Similarly for CTA and CPA.
- Assumption 2: Tables are horizontal, not vertical; I am not sure why you say vertical. Read a classic web table paper to learn about the different kinds of tables and how they are referred to.
Other places:
- You use "relevance" instead of "relevant" in a number of sentences.
- Table 10: Dbpeida Spotlight -> DBpedia Spotlight

2) I was expecting a more formal and detailed description of the system in Section 3, but you instead list the libraries you used, cite URLs, etc. The fastText models, duckling, SpaCy, etc. are all implementation details. You need to define the tasks and the required solution, and then, in the implementation details, say which model or library you used.

3) Related to the above, some details are not clear. You have a link for DBpedia Lookup, but what is Wikidata lookup and Wikipedia lookup? What are Duckling types?

4) Perhaps most importantly, you need a "baseline" method and some form of ablation study to measure the effectiveness of the various steps and design decisions made in your solution. The fact that MTab has outperformed the other participants is already known. What you could add is a clear account of how each component/step and design decision has contributed to this outstanding performance.

5) I really liked Table 10, but it is again purely based on implementation details. Whether or not Elastic is used indicates whether a text IR solution is used; the particular source and index used are secondary. Can you refine the table to make it about the methods and not only the choice of source/index?

Review #5
Anonymous submitted on 05/Feb/2021
Suggestion:
Major Revision
Review Comment:

1. Introduction
MTab4DBpedia is a Semantic Annotation approach for tabular data (Semantic Table Interpretation or STI). MTab4DBpedia covers all tasks of STI, i.e., Cell-Entity Annotation (CEA), Column-Type Annotation (CTA), Column Relation-Property Annotation (CPA).
The described system obtains the best performance for the three matching tasks of the SemTab 2019 challenge. The technique combines joint probability signals from different table elements and majority voting to deal with data noisiness, schema heterogeneity, and ambiguity.
The system is inspired by the graphical probability model-based approach described by Limaye and the signal propagation in the T2K system of Ritze.

2. Definitions and assumptions
2.1 Problem definitions
The objectives of the STI and the related tasks are well defined.

2.2 Assumptions
The assumptions are well described and reasonable if we consider the objectives of SemTab 2019.
However, they are too restrictive for using the technique outside the challenge. This issue is discussed further below.

3 MTab4DBpedia approach
3.1 Framework
The MTab4DBpedia approach is composed of a seven-step pipeline. Step 1 pre-processes the table data S by decoding textual data, predicting the language, predicting data types and entity types, and performing entity lookup. Step 2 estimates entity candidates for each cell. Step 3 assesses type candidates for columns. Step 4 evaluates the relationships between pairs of columns. Step 5 re-estimates entity candidates by aggregating confidences from Steps 2, 3, and 4. Steps 6 and 7 re-estimate type and relation candidates with the results from Step 5.

3.2 Step 1: Pre-processing
– Text Decoding: considering the nature of the input data, we find the choice reasonable.
– Language Prediction: the authors should clarify how language identification affects the lookup task, in order to justify this step (as on page 11, right column, line 44).
– Data Type Prediction: what are the 13 data types used?
– Entity Type Prediction: also in this case, the authors should clarify what the 18 entity types are. (typo page 4, left column, line 39: remove the space before the footnote number.)
– Entity Lookup: in this step, a threshold is defined to limit the lookup results. We ask the authors to justify the threshold being set at 100.

3.3. Step 2: Entity Candidate Estimation
The approach obtains a set of candidate entities from four different services (i.e., DBpedia lookup, DBpedia endpoint, Wikidata lookup, and Wikipedia lookup). This is a sensible choice, but it implies a strong dependence of the approach on these services. It would be desirable for the authors to consider the use of their own lookup service.
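For illustration, the sketch below shows what two of these lookup services can look like when queried through public APIs (the Wikidata wbsearchentities and Wikipedia OpenSearch endpoints). It is not claimed that these are the exact endpoints or parameters used by the authors.

```python
import requests

def wikidata_lookup(label, limit=10):
    """Search Wikidata entities by label via the public wbsearchentities API."""
    resp = requests.get("https://www.wikidata.org/w/api.php", params={
        "action": "wbsearchentities", "search": label,
        "language": "en", "format": "json", "limit": limit,
    })
    return [hit["id"] for hit in resp.json().get("search", [])]

def wikipedia_lookup(label, limit=10):
    """Search Wikipedia page titles via the public OpenSearch API."""
    resp = requests.get("https://en.wikipedia.org/w/api.php", params={
        "action": "opensearch", "search": label,
        "limit": limit, "format": "json",
    })
    return resp.json()[1]  # second element holds the matching page titles

print(wikidata_lookup("Arthur Drews"))
print(wikipedia_lookup("Arthur Drews"))
```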

3.4. Step 3: Type Candidate Estimation
(typo page 5, left column, line 37: remove the space before the footnote number.)
3.4.1. Numerical Column
The EmbNum+ approach is interesting. However, it is necessary to expand the description of this approach using real examples, possibly extracted from the SemTab 2019 datasets. Also in this step, the authors should clarify how the threshold alpha has been defined. It is suggested to do this empirically. (typo page 5, left column, line 46: remove the space before the footnote number.)

3.4.2 Entity Column
(typo page 5, right column, from line 42: correct punctuation in the bulleted list.)
The Section is clear and well described. Again it is necessary to justify the threshold beta.

3.5. Step 4: Relation Candidate Estimation
3.5.1. Entity - Entity columns
The Section is clear and well described.

3.5.2. Entity - Non-Entity columns
The Section is clear and well described, but could you clarify the wording "w5, w6 are learnable parameters".

3.6. Step 5: Entity candidate Re-Estimation
It is necessary to clarify the meaning of the parameters w7,w8,w9,w10.

3.7. Step 6, 7: Re-Estimate Types and Relations
The description of the process is acceptable, but we suggest including a concrete example to guide the reader's understanding.

4. Evaluation
4.1. Benchmark Datasets
(typo page 7, right column, from line 39: i.e.)

4.4. Experimental Results
A comparative analysis with the other approaches of the challenge would be useful.

4.5.1. CEA: Entity Matching
The authors should also provide their version of the CEA GT (EDCEA_GT).
It could be an excellent resource.

4.5.3. CPA: Relation Matching
The authors should also provide their version of the CPA GT (DECPA_GT). It could be an excellent resource.

5. Related Work
5.1. SemTab 2019 systems
We suggest better identifying the strengths of the other approaches presented during the challenge.

5.2. Other Tabular Data Annotation Tasks
Semantic table interpretation is a long-studied "problem"; the first works date back to 2007. Therefore, it would be useful to extend this Section to contextualize the proposed approach with respect to the other works in the state of the art. See the general comments.

6. Conclusion
6.1. Limitations
As indicated by the authors, since MTab4DBpedia is built on top of lookup services, the upper bound of its accuracy strongly depends on the lookup results. We think this is a significant limitation of the proposed approach. Therefore, the authors should specify how they could adapt their approach in case it is not possible to use external services (e.g., by building a local index service).

GENERAL COMMENTS
The paper clearly describes the proposed approach. The approach is characterized by a fair degree of innovation and has been validated in an international challenge, which certifies its quality.
However, the discussion remains too tied to what was necessary to achieve within the competition, thus losing some generality. Indeed, the challenge addresses the main issues related to the Semantic Table Interpretation task, but it requires some assumptions that differ from real scenarios.
In particular, an approach should be able to identify the subject column and the header.
Another weak element of MTab4DBpedia is the close link with external services, as specified previously. The authors may also include the changes made to MTab4 for participation in SemTab 2020.
There are several formulas, perhaps even too many; a small example would undoubtedly help to give the formulas a logical sense.
We suggest that the authors insert the repository link containing the implementation of the proposed approach.