Review Comment:
This is a full research paper on ADEL, an adaptable Entity Linking framework that is independent of the text to be processed and of the knowledge base used.
Overall the paper is well-written, the approach seems interesting, and the evaluation results are generally well-discussed. However, the paper has certain drawbacks in the way it describes the approach, which make me think that it should not be accepted as it is. At a high level:
- The introduction of the proposed approach is poor and disappointing rather than engaging, when it should intrigue readers with the contributions and make them want to continue reading. Instead of introducing all the innovative aspects of the approach, the list of contributions (sec 1) is limited, and I think the items mentioned there do not reflect the contributions one finds spread throughout the text. Similarly, when the approach is described (sec 3), its introduction is limited to contrasting existing approaches on two aspects; the contributions are neither clearly stated nor justified there, and it is only claimed that the architecture is designed in a way that enables such changes. Therefore, I would strongly suggest improving the introduction of the approach, clearly stating the contributions, and justifying why they are significant.
- The focus of the approach is currently on the (new) architecture (of the implementation). If this is the case, I would expect this to be a system paper and not a full research paper. An approach normally goes beyond the proposed architecture. However, what is currently presented as the approach is nothing but a description of the current implementation, whereas what is presented as the implementation (current section 4) is just a description of the config file that accompanies it. Therefore, I would strongly suggest distinguishing the contribution from the implementation.
- There are often statements that are not supported by references or otherwise proven (especially in sections 1 and 2). I mention a few below in detail, but there are more cases. I would suggest adding more references or showing evidence in most cases.
- I would request clarification on whether the pipeline (or which parts of the components/contributions) is open source, and whether the experimental settings and results are made publicly available via permanent URLs (e.g. figshare) in order to enable reproducibility.
- (minor) Different acronyms are mentioned in different places within the text; it would be best if the full name were provided the first time each is mentioned. Moreover, proof-reading is required to correct the grammar and syntax errors.
In more detail:
Introduction (section 1):
- "textual content represents the biggest part of content available on the Web" --> I would suggest that either a reference is provided for this statement or the argument is softened, as it is not self-evident.
- in the task description (sec 1.1) several definitions are given. Are all these notions new? I would suggest providing corresponding references wherever possible, or clearly stating that this is the definition of the term adopted in this paper.
- "the two main problems when processing natural language text are ambiguity and synonymy." --> I would suggest providing a reference to support this argument. Is it only entity linking that addresses the problems of ambiguity and synonymy, and not entity recognition as well?
- sec 1.2: 1st challenge, why are newspapers, magazines, and encyclopedias trusted sources? Are all newspapers trusted sources?
- sec 1.2: where do these challenges come from? Are they defined by the authors? If so, based on what evidence? Is this list complete? Is it the result of surveying the state of the art? Then I would suggest providing references to publications that refer to or address each of the challenges. Or is there a publication that lists these challenges? Then I would suggest referring to it. Of course, it is mentioned that these are the main challenges, but again, why these and not others? I think this can be addressed by showing that several past papers proposed alternative solutions to this problem, by arguing that these challenges are relevant to the problem addressed in this paper, or by any other means of supporting the argument.
- sec 1.2: "formal texts, usually well-written and coming from trusted sources such a newspaper, magazine, or encyclopedia;" How is the "well-written" aspect determined? And why are all newspapers and magazines trusted?
- sec 1.2: I think the difference between formal and informal texts lies merely in how they are written (the genre, as it is later called) and not in their trustworthiness. Namely, why is a magazine more trustworthy than a tweet? Couldn't the magazine have a Twitter account? Why would its tweets be less trustworthy than its articles?
- sec 1.2/1.3: could it be explicitly elaborated which challenge each contribution addresses? E.g., which challenge does the third contribution address?
- all contributions apart from contribution 3 have a reference to a corresponding section, could contribution 3 also have such a reference?
- sec 1.3: Why is the 4th item a contribution? It reads more like evaluation results than a distinct contribution.
- sec 1.4: "numerous" --> I would suggest rephrasing this!
section 2:
sec 2.1:
- "We identify two external entries for an entity linking system: the text to process and the knowledge base to use for disambiguating the extracted mentions. We extend the definition of what is an external entry for an entity linking system defined in [43]" --> This reads more like an assumption made within the frame of the proposed solution than like related work. I would propose moving this part of the related work to another section, or rephrasing the paragraph. Moreover, I think that going from 3 entries (text, knowledge base, and entity) to 2 (text and knowledge base) is not really an extension.
- "This definition is often extended by including other categories such as Event or Role" --> I would suggest providing a couple of examples of where this happens.
- sec 2.1.1: "We propose a different orthogonal categorization where textual content is divided between formal text and informal text." --> This is not related work but rather part of the paper's assumptions for the proposed approach. I would suggest moving this text to the corresponding section and limiting the related work section to presenting existing works, so that readers can get an overview of the state of the art there.
- sec 2.1.1: Why are subtitles trusted? I would suggest backing up the argument with a reference. The same holds for ASR: I would suggest providing a reference to an example/publication stating that subtitles are generated by such a system.
- sec 2.1.2: there is an outline of certain knowledge bases, but none of the sub-challenges of the knowledge base challenge (coverage, data model, and freshness) are covered in detail. Moreover, the section does not present evidence from the state of the art showing that these challenges indeed exist. I would suggest clarifying both remarks in the text.
sec 2.2:
- Where does this classification come from? I would suggest either providing a reference to a source, or examples per case showing that such cases indeed exist.
- Table 1 says on top that it is about mention extraction, but the Entity Recognition column refers to whether the entity is recognized during the mention extraction or the linking process (and similarly for Entity Candidate Generation), while Table 2 is dedicated to Entity Linking. Besides this issue, what do Yes and No mean? I would suggest representing this better. I would even recommend a table dedicated to these two aspects where, for each case, one of the two alternatives is ticked. Then Table 2 would be directly comparable to Table 3, and Table 1 almost comparable.
- a minor comment on these tables: the same terms are written with different upper/lower casing, e.g. "lexical similarity" and "Lexical Similarity".
- another minor/optional comment with regard to Table 1: I would suggest putting the columns Main Features and Method first, followed by the external tools and language resources, so that all tables have the same structure (at least at the beginning).
- while sec 2.2 provides a clear comparison among the different technologies, this is not the case for sec 2.1 which provides a plain outline of different alternatives.
- I would suggest to provide a reference for the definition of "overlap resolution"
- "In Table 2, we observe three approaches" --> 3 approaches for doing what? I would suggest saying explicitly what these approaches do.
section 3:
- Figure 2 should be closer to its reference (the same occurs with other images and tables too so please adjust overall)
- "ADEL comes with a new architecture" --> new compared to an older one, or new compared to the others? If the former, please add a reference to the older one and explain the difference; but I guess it is the latter, so I would suggest choosing a more adequate term, perhaps innovative or alternative. Getting back to the contributions: it is mentioned that a modular architecture is proposed, but should readers assume that the approaches so far were not modular/adaptable? So, is this the contribution, or does the innovation lie in other aspects, such as static vs. dynamic, or flexibility? The contribution text should be updated accordingly to show the actual contribution.
- "little flexibility" --> This is too vague. I would suggest clarifying it further (perhaps best within the related work section).
- "cannot be extended without ... spending a lot of time in terms of integration" --> Why would an extension require integration? I understand from the example that replacing a module, or complementing it with another module, is considered an extension, and I assume integration is required to add the new module to the pipeline; but this is best explained explicitly.
- "the knowledge base being used is often fixed as well" --> Were there knowledge bases that required fixing? This sentence needs to be rephrased.
sec 3.1 (all minor comments)
- it is mentioned what the Gazetteer Tagger relies on, but not what it does. I would suggest mentioning what it does too.
- "to handle tweets, we use the model proposed in [10]." --> How is this relevant to the POS tagger, and why does it need to be mentioned there? If tweets are handled according to a methodology proposed in [10], I would expect it to be a different extractor.
- "While using a dictionary as extractor, it gives the possibility to be very flexible in terms of entities to extract and their corresponding type" --> does "it" refer to GATE or to ADEL in this case? Please state this explicitly.
- "If we only apply the 4 classes model" --> this refers to one of the Stanford NLP models, but I would suggest stating so explicitly, because as it is now, it remains vague.
- was the mapping among the different sources manually defined? Was a methodology followed?
sec 3.2
- How could we know in advance which columns to search? How is that determined in advance?
- "This optimization reduces the time of the query to generate the entity candidates from around 4 seconds to less than one second" --> This reads more like a result produced by a certain evaluation; there is no context in sec 3.2 to make it relevant. Readers may assume that 4 seconds is the system's time before the optimization, but that is never stated. I would suggest making the optimization part of the contributions (I assume it is a significant improvement) and reporting the concrete results in the evaluation section together with the comparisons to other systems.
section 4:
- The configuration file is mentioned to be written in YAML, but it is not clearly stated that it consists of 3 parts before these are further detailed. I would suggest doing so.
- "In case of an Elasticsearch index, the properties query and name are mandatory. In case of Lucene, these properties are replaced by two other mandatory properties that are fields and size" --> This does not read as a very generic, modular, and configurable solution. I would suggest clarifying this aspect.
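To make this point concrete, here is a minimal sketch (hypothetical Python, not ADEL's actual code) of the backend-dependent validation implied by the quoted description: each index backend demands a different set of mandatory properties, so a consumer of the configuration must special-case every backend. The property names are taken from the paper's description; the real ADEL configuration schema may differ.

```python
# Hypothetical sketch of backend-specific mandatory properties, as
# described in the paper: an Elasticsearch index section requires
# "query" and "name", while a Lucene section requires "fields" and
# "size" instead. Names are assumptions based on the quoted text.
MANDATORY = {
    "elasticsearch": {"query", "name"},
    "lucene": {"fields", "size"},
}

def missing_properties(index_conf):
    """Return the mandatory properties absent from an index section."""
    backend = index_conf.get("type")
    required = MANDATORY.get(backend, set())
    # Set difference: which required keys are not present in the config?
    return required - index_conf.keys()

# The same conceptual setting needs different keys per backend:
es_conf = {"type": "elasticsearch", "query": "...", "name": "dbpedia"}
lucene_conf = {"type": "lucene", "size": 100}

print(missing_properties(es_conf))      # valid: nothing missing
print(missing_properties(lucene_conf))  # invalid: "fields" is missing
```

A truly backend-agnostic configuration would expose one uniform set of properties and hide the backend-specific mapping inside the implementation, which is the clarification this comment asks for.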
section 5:
- "the best configuration for the NEEL2015 dataset is not the same than for the NEEL2016 dataset despite the fact that both datasets are made of tweets." --> Could you explain why this happens? And do you have any idea of what could be done to avoid it?
- The evaluation is well-discussed, but I miss a comparison with the best approaches in each case, e.g. OKE2015, OKE2016, etc. Namely, besides which configuration is the best, it would be good to know how the tool compares to other tools that used the same evaluation datasets. Of course, the text mentions that ADEL outperforms the state of the art, but I would suggest making this clearer in the corresponding tables; currently it is not obvious.
- For the index optimization, it would be best if the results both with and without the optimization were presented.
sec 5.1:
- "We evaluate our approach at different level: extraction (Tables 6, 5, 7 and 8)," --> I would suggest being a bit more detailed in the text about what each table presents. Note also that the order should be correct: currently Table 6 is mentioned first and then Table 5.
- Could you provide a table with all the configuration information together, so it is comparable?
- "We tackle this problem by developing a novel hashtag segmentation method inspired by [51,24]." --> I think this segmentation method should be mentioned when the solution is presented.
- The experimental settings should be made publicly available via permanent URLs (e.g. figshare) in order to enable reproducibility.
Minors:
"there is currently no agreed upon definition of what is an entity." --> "of what an entity is."
- sec 2.1:
"We extend the definition of what is an external entry" --> "what an external entry is" (more syntax errors like this one)
"The current entity linking systems tends to adopt" --> "tend"
"this generally consists in mention detection and entity typing" --> consists of
- sec 2.2:
"since these methods aims primarily to" --> "aim"
- sec 3.1:
"it is then possible to jump from one source to another" --> I would suggest replacing jump with another verb, such as switch.