State of the art in Turkish Named Entity Recognition

Tracking #: 1334-2546

Authors: 
Gökhan Şeker
Gülşen Eryiğit

Responsible editor: 
Guest Editors Social Semantics 2016

Submission type: 
Full Paper
Abstract: 
Named entity recognition (NER), which provides useful information for many high-level NLP applications and semantic web technologies, is a well-studied topic for most languages, and especially for English. However, studies for Turkish, a morphologically rich and less-studied language, have lagged behind for a long while. In recent years, Turkish NER has intrigued researchers due to its scarce data resources and the unavailability of high-performing systems. In particular, the need to discover named entities occurring in Web datasets has initiated many studies in this field. This article presents the state of the art in Turkish named entity recognition both on well formed texts and user generated content, and introduces the details of the best-performing system so far. The introduced approach uses conditional random fields and obtains the highest results in the literature for Turkish NER, with a 92% CoNLL score on a dataset collected from Turkish news articles and 65% on different datasets collected from Web 2.0. The article additionally introduces the re-annotation of the available datasets to extend the covered named entity types, as well as a brand new dataset from Web 2.0.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Genevieve Gorrell submitted on 16/Mar/2016
Suggestion:
Minor Revision
Review Comment:

It's a nice paper and makes a valuable contribution by reviewing Turkish NER and gathering together the work that has been done, comparing the results of different systems on different corpora.

It's reassuring that some attention is given to the role of the gazetteers in the final result obtained. Nonetheless I'd be glad of more discussion of the steps taken to ensure that the gazetteers were compiled without knowledge of the test corpora, as this is one way that results can be artificially boosted.

At a glance, it seems odd to include positional features with a sequence learner (previous token, previous but one, etc.). Perhaps more explanation could be given of the rationale for this?

It might be nicer to let the reader decide for themselves that the system presented is superior to others in the literature, rather than telling them so several times and emphasizing it in the title. It can be more helpful to focus on reasons why some systems perform better than others rather than just numerical performance differences, which don't necessarily prove a great deal, nor contribute to the dialogue. For example, the title might be reworded to focus on the key way(s) in which this system differs from previous ones?

The article is written in fair English, with no difficulty discerning the authors' meaning. However I think for final publication it would be better to have the language checked.

As the authors note, a full comparison between systems is not possible if others don't share their systems. Unless these authors share their system, we're only in the same position going forward! Also, the authors introduce three new annotated corpora, but do they share them?

Review #2
Anonymous submitted on 29/Apr/2016
Suggestion:
Major Revision
Review Comment:

This paper presents a named-entity recognition (NER) approach for Turkish based on conditional random fields (CRF). In their experiments, the authors re-annotate some of the existing datasets with additional entity types, and compare their performance to existing systems in the literature based on the published results.

The positive aspects of the paper can be summarized as follows:
- The paper is related to the SWJ special issue.

- As stated by the authors, this paper is an extension of an earlier version presented at COLING 2012. While the main contribution is still the CRF-based NER method, I think the extension has a fair amount of additional content, essentially in the experiments part. In particular, the authors extended some of the existing datasets with TIMEX and NUMEX annotations to repeat the experiments in their previous work. Furthermore, they evaluate their approach using User Generated Content (UGC), which is typically considered a different and more challenging setup for evaluating NER.

- The authors put solid effort into the implementation and experimentation, and show that their system performs well.

On the other hand, there are still some major issues that need to be resolved:

1) First, I believe that, starting from the title, the general attitude of the paper is somewhat misleading: this is not a paper aiming to survey existing NER work in Turkish; rather, it proposes a new method (at least in the context of Turkish) and evaluates its performance against earlier approaches (also see below for some concerns on this point), as a typical research paper would do. So, I really don’t agree with the sentence in the abstract: “This article presents the state of the art in Turkish named entity recognition both on well formed texts and user generated content, and introduces the details of the best-performing system so far.” What this paper really does is propose a method and evaluate it (using well formed texts and UGC), which is actually a reasonably adequate contribution. I strongly recommend that the authors frame and present their work along these lines.

2) Secondly, I would like to see at least a brief discussion of the use of CRFs for the NER task in other languages and their success as reported in the literature, and of whether there is anything specific to Turkish when applying CRFs. It would also be nice to discuss previous results for other languages that are similar to Turkish, i.e., in the family of agglutinative languages.

3) Third and most crucially, I think the comparison to earlier studies is rather superficial: apparently, the authors have not implemented any of the earlier approaches and simply make comparisons based on published results. While I agree that it is not reasonable to expect all previous approaches to be implemented, at least a couple of methods should be implemented as baselines. Furthermore, in the current state of the paper, it looks like most of the comparisons to published results involve differences in the setup, dataset, annotations, etc.; it is therefore hard to draw a strong conclusion and suggest the proposed method as the “best performing one”, as this paper does.

Here are some specific examples of such comparisons that seem unconvincing or at least incomplete to me:
- Comparison to [18] (page 12): The paper states that “They work on ENAMEX, TIMEX and NUMEX entity types but they do not provide the scores for each of these. In order to be able to make a fair comparison between the two studies, we measure the performance of their system on our test data and calculate the overall ENAMEX performance (F-Measure) as 69.78% in CoNLL metrics and 74.59% in MUC TYPE metrics.” Why do you only measure the performance for the ENAMEX type, when you also have annotated data for the other NE types?

- Comparison to [40]: “We use the same training and test data, so our results given in CoNLL metrics are fully comparable with this work.” But you should compare to the results on the test set WFS3, right (since [40] did not use your newly introduced dataset WFS7)? Given the next sentence, “One should note that our performance before adding the gazetteers (89.55%) is still higher than her best result (88.94%)”, I suppose this finding is on WFS3 (based on the previous COLING paper), but this should be clarified.

- Comparisons for UGC datasets: First and foremost, there is a big confusion in the description of the training and test sets. In Section 6, the paper states that “Following the previous work [40,4], in all of the provided experiments, we used 440K tokens of the news articles [39] (Table 2) as the training set and the remaining 47K tokens as the test set (WFS) for well formed text domain.” And then the test sets are listed, which include both WFS3 and WFS7, as well as all the other UGC sets! So, is the training for the UGC case also done using the WFS set, and if so, is it WFS3 or WFS7 (assuming that you extract additional NE types, it should be the latter)? Otherwise, on your own UGC data (Tables 8 and 9) and on Tweets-2 and -3 (Table 11), which training datasets are you using?

Nevertheless, if you have used the newly annotated WFS7 dataset for training, or some newly annotated UGC dataset, then your training set is *always* different from that of the other works, say [16] and [9], so how reliable are the comparisons in Section 6.2.2?

4) Last but not least, several experimental details are not clear, which may also be the source of some of the confusion discussed above. In particular:

a) For Table 4: To what extent are the gazetteer lists different from those in your previous work [4]? Please state this clearly.

b) Sec 5.1.6: “We provided our atomic features within a window of {-3,+3} and some selected combinations of these as feature templates to CRF++.” Please specify what exactly these combinations are (an illustrative sketch of what I mean is given after this list).

c) In Tables 5 and 6, what is the evaluation metric, is it CoNLL? Please specify. Also, briefly discuss the relationship between the F-measure and the CoNLL score, as they are compared to each other in certain cases. For instance, page 10 states “We also executed the same experiments with 10 fold cross validation and obtained an average F-measure of 91.53 with a standard error of 0.50.”, which is confusing, as a totally different setup is described along with results reported using the F-measure metric. Please elaborate (see also my note after this list).

d) Why are the comparisons in Tables 8 and 9 not uniform? Table 8 has the base model and the feature analysis, but Table 9 does not; at least the base model should be there.

e) Section 5.1.4: The paper states that “In this format, we use the labels such as “PERSON”, “ORGANIZATION”, “LOCATION” and “O” (other - for the words which do not belong to a NE) without any position information.” Is this sentence up-to-date, given that you are now also annotating TIMEX and NUMEX types? In general, I very strongly recommend specifying the training and test sets explicitly for each experiment reported in Section 6.

f) For the UGC dataset used in testing, is it guaranteed that re-tweets or other sorts of duplicates are removed?
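
To illustrate point b) above, here is a minimal sketch of the kind of CRF++ feature templates I have in mind; the column indices and template IDs are purely hypothetical and not taken from the paper. Unigram templates over a {-3,+3} window of the token column plus a few combined templates might look like:

  U00:%x[-3,0]
  U01:%x[-2,0]
  U02:%x[-1,0]
  U03:%x[0,0]
  U04:%x[1,0]
  U05:%x[2,0]
  U06:%x[3,0]
  U07:%x[-1,0]/%x[0,0]
  U08:%x[0,0]/%x[0,1]
  B

where U07 would combine the previous and current token, U08 would combine the current token with, say, its POS tag (assuming column 1 holds the POS), and B enables bigram label features. Listing the actual templates (or an excerpt of the template file) in an appendix would make the feature set reproducible.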
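
And regarding point c): to my understanding, the CoNLL score is itself an entity-level F-measure computed over exact matches of both span and type, i.e., with P and R taken over complete entities:

  P  = correctly predicted entities / all predicted entities
  R  = correctly predicted entities / all gold-standard entities
  F1 = 2 * P * R / (P + R)

whereas the MUC scheme gives credit for TYPE and TEXT matches separately. If this is indeed what is reported in Tables 5 and 6 and in the cross-validation experiment, please state it explicitly so the numbers are directly interpretable.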

Overall I believe that this work has some merit. However, I recommend a major revision of the paper to a) frame the paper’s contribution better, as proposing and evaluating a new approach rather than a survey, b) clarify the experimental setup and results, and c) provide a more detailed comparison to the literature (i.e., either implementing some key methods, or providing a more careful comparison to published results that takes into account the differences in their setups and avoids excessive claims, etc.).

Review #3
By Giuseppe Rizzo submitted on 16/May/2016
Suggestion:
Major Revision
Review Comment:

This paper presents a three-fold contribution: a review of the state of the art and the challenges in the field of named entity recognition (NER) applied to Turkish text, new sets of labeled data, and an approach to perform NER based on a conventional CRF with a deep and remarkable engineering effort for feature optimization. Such a summary does not match the title of the paper, which is misleading and should be updated.

The paper is well-written and reads easily. Another positive aspect of the paper is the extensive knowledge of the challenges being addressed when processing Turkish content. From the paper it emerges that the main challenge is the limited labeled data at the authors' disposal, which has kept the performance of previous approaches at a middling level. Sections 2, 3, and 4 set the ground for the paper and the experimental setup. In Section 5, the authors present the CRF-based approach. Overall it looks well-structured and well-motivated. The engineering effort of fine-tuning features looks terrific, and the achieved results show the effectiveness of the approach and well motivate the claims on the two types of textual data: news articles and tweets.

On the negative side: it is hard to grasp the scientific added value of the NER approach. The take-home message looks too skewed toward engineering rather than science. Neither new algorithms nor new uses of semantics are discussed. Nowadays, the use of a CRF is a de facto baseline. I tried to question whether this approach can actually be generalized, but it looks too specific to these settings. This might be better emphasized in the follow-up of the paper to ease the reviewing task.
The classification introduced by the authors might be revised: well-formed text -> news articles and tweets -> user generated content. Isn't a news article also user generated content? Although producing gold standards is a gold mine and fuels numerous future research endeavors, I must admit that without any guidelines this effort looks unbalanced and, potentially and arguably, done to optimize the performance of the proposed approach. I encourage the authors to make use of the annex to add the guidelines.
Lastly, but most importantly, the presence of Social Semantics/Social Web is rather small, and this paper would need a strong revision and refocusing to better fit the scope of the special issue.

Finally, a few typos/things to better phrase:
- MUC/CoNLL score. What is this? Do you mean that using the X scorer you got Y? Please be more rigorous in defining the metrics instead.
- 55K dataset -> 55K sentences (?)
- the annotations ... was -> were
- of re-annotated .. dataset -> of a re-annotated ...
- to make a fair comparison -> to make a comparison
- Muc-6 -> MUC-6
- couldn't -> could not
- CoNLL and MUC are not metrics (page 9)
- wasn't -> was not
- don't -> do not
- in all tables, please explicitly define the metric used to rank the settings/approaches (which I reckon is F1)
- does the column "Best Result" (page 13) report F1?