Review Comment:
This article proposes a fully automated approach, called KINDEX, that uses different knowledge graph services for subject indexing. The motivation behind this new approach is twofold: the difficulty of manually indexing large-scale collections of machine-readable information, and the burden of training new ML models for automatic indexing. The originality of the approach lies in using identity links between knowledge graphs, published in the Web of Data, together with lexical matching to improve the indexing process.
General review:
This work addresses an important research problem and proposes a "rather" ambitious approach. However, the latter suffers from two major weaknesses: (1) it depends on many parameters (especially sameAs links) and seems difficult to generalize, and (2) it lacks originality, as it only consists of reusing existing techniques. The unpredictable results of the two use cases confirm how difficult it would be to apply the approach to other use cases. Besides, only general indications are provided regarding the comparison with existing ML approaches.
Detailed review:
The motivations and contributions of the approach are "generally" well presented. However, there are some gray areas that should be clarified.
Section 1
It is indicated that one of the benefits of the proposed system is that "it does not need to be trained for a particular controlled vocabulary". However, "the KOS system from which keywords are generate" must be published in the Linked Open Data cloud. This is a strong condition: if the KOS is not available in the LOD, publishing it is a complex, very time-consuming and costly task.
Section 2
- "The automatic strategy has proven to be feasible with mostly high quality assignments of index terms [10, 11]": given the chosen classification of related work (ML-based and associative indexing), to which category does this work belong?
- "Empirical investigations into the potential and effect … high quality data sources." This idea is repeated several times.
Section 4
- "the most suitable strategy is to first match the surface form as was determined by the spotlight index with the STW thesaurus' preferred or alternative labels": which technique is used to perform this matching? Syntactic matching? Semantic matching?
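To make the question concrete: if the matching is purely syntactic, it could look like the following sketch. The toy thesaurus entries and the normalisation rules are illustrative assumptions, not taken from the article; a semantic matcher would behave differently on synonyms and inflections.

```python
# Hypothetical sketch: syntactic matching of a spotted surface form
# against a SKOS thesaurus' preferred/alternative labels.
# THESAURUS entries and normalisation steps are illustrative assumptions.

def normalize(label: str) -> str:
    """Lowercase and collapse whitespace -- a purely syntactic normalisation."""
    return " ".join(label.lower().split())

# Toy stand-in for STW-style entries: concept URI -> (prefLabel, [altLabels])
THESAURUS = {
    "stw:18012-3": ("Monetary policy", ["Monetary policies"]),
    "stw:10497-6": ("Inflation", ["Price inflation"]),
}

def match_surface_form(surface_form: str):
    """Return concept URIs whose prefLabel or an altLabel equals the surface form."""
    needle = normalize(surface_form)
    hits = []
    for uri, (pref, alts) in THESAURUS.items():
        if needle == normalize(pref) or any(needle == normalize(a) for a in alts):
            hits.append(uri)
    return hits

print(match_surface_form("monetary  Policy"))  # -> ['stw:18012-3']
```

Note that such a matcher misses semantically equivalent but lexically different labels, which is exactly why the distinction matters for the evaluation.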
- "FlexiFusion approach presented by Frey et al." reference is missing!
- "For instance, the mappings to the GND descriptors are accessible via the property , whereas mappings to VIAF can be determined from ": how is the property to use determined in each case?
- "Thus the matching of descriptors is enabled through traversing identity paths": traversing the Web of Data through sameAs links is a very complex problem. How does one decide when to stop the process?
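To illustrate why a stopping criterion matters: any traversal of identity links needs at least a depth bound and cycle detection to terminate, as in this sketch. The link graph, URIs, and the depth bound of 2 are illustrative assumptions; real data would come from dereferencing URIs or querying SPARQL endpoints.

```python
from collections import deque

# Toy identity-link graph: URI -> URIs linked via owl:sameAs / skos:exactMatch.
# All entries below are made-up examples, not real mappings.
SAME_AS = {
    "dbpedia:Inflation": ["gnd:4026887-1", "wikidata:Q35865"],
    "wikidata:Q35865": ["dbpedia:Inflation", "viaf:12345"],  # cycles are common
}

def identity_closure(start: str, max_depth: int = 2) -> set:
    """Collect URIs reachable via identity links, using a depth bound
    and a visited set so the traversal terminates despite cycles."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        uri, depth = frontier.popleft()
        if depth >= max_depth:
            continue  # stopping criterion: do not expand beyond the bound
        for target in SAME_AS.get(uri, []):
            if target not in seen:
                seen.add(target)
                frontier.append((target, depth + 1))
    return seen
```

Even with these safeguards, the result set depends heavily on the chosen bound and on the quality of the sameAs links, which reinforces the generalization concern raised above.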
- "Hence, link-enabled … Afterward, the cross-concordance …": these two paragraphs clearly show that the approach is based on a complex and context-dependent matching process.
Section 5
- "For the LIMBO catalogue, it also seems to be the case that more input data (title+desc) leads to better F1 scores": in Figure 3, it seems that (desc), not (title+desc), leads to better F1 scores.
- "Figures 5.2 and 5.2": maybe Figures 3 and 4?
- "Tabs. 5.2-2 show the final evaluation": there is a problem with this reference.
- "Please note that these findings give only an indication, since the evaluations could not be run on the same sample." This is a real problem, as it is not possible to make a judgment based solely on indications.
The language of the article is acceptable, but it contains some typos such as:
two real world usage scenario => two real world usage scenarios
has to be learnt => has to be learned
growing amount of publications => growing number of publications
and can not be easily => and cannot be easily
have been rarely taken => have rarely been taken
there are cross-domain indexing tools available that can be => to reformulate
There might be cases in which there neither => to reformulate
is even higher then directly => is even higher than directly
the cross-concordance that exist between => the cross-concordance that exists between
can be obained by querying => can be obtained by querying
are then send to => are then sent to
a few preliminary parameter => a few preliminary parameters
This might be explained with the fact that => This might be explained by the fact that