Subject Classification of Academic Journals via Knowledge Graph Embedding

Tracking #: 1937-3150

Soumya George
M. Sudheep Elayidom
T. Santhanakrishnan

Responsible editor: 
Guest Editors Knowledge Graphs 2018

Submission type: 
Full Paper
Subject classification of scholarly articles is a pertinent area of research. Proper classification of journal articles is an essential criterion for academic search engines, facilitating easier search and retrieval of journal papers based on users' preferred research areas. The widely used approach is to classify articles using metadata such as title, abstract, and paper keywords. This paper proposes an efficient graph-based subject classification of journal articles using a pre-indexed classifier model built on a full-text indexing approach. Journal contents are indexed using a sequence word graph model to classify any journal article into its relevant research areas and sub-areas, based on the actual keywords or key phrases embedded in the journal contents. This automatic classification approach enables efficient search of scholarly articles by subject category or sub-area, and also relieves journals of the need to ask users to classify their papers in order to find suitable reviewers. An attempt to identify authors' main research areas based on the subject classification of their papers is also made. Subject classification accuracy is tested on a set of 1307 arXiv subject-classified papers, yielding 91% accuracy. Classification accuracy for authors' research areas of interest is also tested manually for 52 authors with two or more papers indexed in the database, by comparison with their Google Scholar profiles; 100% accuracy is obtained for 28 authors. Of the remaining 24 authors, 14 have only one field missing and the rest have only two or three fields missing. A comparison of full-text-based subject classification with metadata-based classification is also performed, and the results show that full-text-indexing-based subject classification yields higher accuracy than metadata-based classification.
Solicited Reviews:
Review #1
Anonymous submitted on 03/Aug/2018
Review Comment:

(1) originality
-- The authors do not make a good case that we need a new subject-based classification scheme for academic journals. The entire work is essentially motivated by the problems with the current state of the art, as they are mentioned in three bullets in an unacceptably small related work section. These bullets provide the motivation for the study but are not explained or justified in any way. Use of only basic journal information to predict the subject area will not be fully accurate... why? Use of interrelationship analysis to classify journals is not reliable and accurate... this is an extremely broad statement. What do you mean by reliability and accuracy, and why are they not reliable or accurate, despite past subject classification research that suggests the methods actually are reliable and accurate?

++ The methodology does seem novel and interesting. The proposed work does a merging of subject information from a number of online indexing sites to produce a universal hierarchical model of paper subjects to be referred to later. It seems quite comprehensive, with 81k+ entries. Publishing this hierarchy would be a great contribution.

-- I really cannot understand the methodology or experimental results presented as verbose pseudocode. The written explanation for the algorithms is far shorter than the pseudocode itself -- and is usually just one long run-on sentence. The code and explanation for the classification algorithm are something I could not understand well.

(2) Significance
The significance of the paper is hindered by poor methodological discussion. The paper mentions "key" and "file" nodes at the start of the methods section without explaining what such nodes are or mean. I'm also a bit lost when the article says: "the resulting graph will be a graph of keys for each sentence in each document". So there is a graph for every sentence in the document? I'm not sure what the resulting graph is.

I am also confused about the Fig 3 GSC classifier screenshot. Is this from a tool that the authors made to implement their methodology? Is this tool available and open source for us to take a look at?

The paper may also be improved by comparing its results against methods from the literature, a typical practice in IR research. The paper compares its results against a baseline involving metadata (but again, I cannot understand this algorithm because of its presentation); comparison against state-of-the-art approaches, particularly those that incorporate knowledge graphs and semantic web technologies, would greatly improve the significance of this method and its suitability for SWJ.

(3) Quality of writing.
There are many sentences that need revision in the manuscript. A common problem is compound sentences in which many topics are squished into one sentence. Some examples:

-- Lots and lots of works are going on in the field of graph based representation of scholarly articles in an efficient way by creating a network of academic papers using many dependency relationships like citations, references etc.
==> What are "lots and lots," and what do you mean by "going on"? "etc." implies networks using relations besides those mentioned; what are they? The sentence needs citations as well.

-- The central assumption behind all classification algorithms is that the objects in the same class or field have similar properties and subject classification algorithms is based on the central fact that all documents or articles under the same subject category have related or same set of keywords or key phrases associated with it.

-- There are many types of classification schemes available to classify journals based on type like review paper, letter or original research paper etc. or based on research level as basic or applied research or based on published journal name etc.
==> Please do not use "etc."; it is very informal.

-- The main advantage of the proposed system relies on the subject classification of each journal article in addition to index free adjacency feature of graph database facilitates faster search on keywords or key phrase.

-- But this often fails because of the existence of interdisciplinary journals and several other reasons [6]
==> ... such as?

-- This is done by a performing a recursive traversal for each key of each categorization type by traversing through its child key nodes present in the article, their child nodes in the article and so on thereby traversing through each level of classification in its lower level till level 6, i.e. keywords category type, which is the last level by aggregating count calculated by primary key labeling for each key traversed and select the key with highest aggregate count for each category type.
==> this explains the classification algorithm, but I don't really understand this long single sentence. This is so critical!
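To illustrate my reading of the quoted step, here is a minimal Python sketch of a recursive count-aggregating traversal over a category hierarchy. The hierarchy, key names, and occurrence counts below are invented for illustration only and are not the authors' actual data structures:

```python
def aggregate_count(key, hierarchy, counts):
    """Sum a key's own occurrence count in the article with the
    aggregate counts of all of its descendant keys."""
    total = counts.get(key, 0)
    for child in hierarchy.get(key, []):
        total += aggregate_count(child, hierarchy, counts)
    return total

def classify(category_roots, hierarchy, counts):
    """Pick the top-level key whose subtree has the highest
    aggregate count in the article."""
    return max(category_roots,
               key=lambda k: aggregate_count(k, hierarchy, counts))

# Hypothetical toy hierarchy: two research areas with sub-area keywords,
# and per-key occurrence counts found in one article.
hierarchy = {
    "machine learning": ["neural network", "svm"],
    "databases": ["sql", "indexing"],
}
counts = {"neural network": 3, "svm": 1, "databases": 1, "sql": 1}

print(classify(["machine learning", "databases"], hierarchy, counts))
# -> machine learning (aggregate count 4 vs. 2)
```

If this sketch matches the authors' intent, writing the algorithm out this way (a recursive helper plus a one-line selection) would be far clearer than the single run-on sentence quoted above.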

Review #2
Anonymous submitted on 06/Aug/2018
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

The paper presents an approach for classification of academic journals using a knowledge graph. The approach is based on previous work by the authors, the word sequence graph (WSG) model, which builds on the graph-of-word model.
The paper addresses an interesting problem; however, there are several major drawbacks:
* The paper cannot be considered a full paper, because it does not describe a novel approach; rather, it presents the use of an existing approach for the application of classifying academic journals.
* While the authors claim that the WSG approach is novel, it is based on an existing graph-of-word model [1], which is not even referenced in this paper. With that being said, it is unclear what the novelty and contributions of this paper are, so the paper cannot be considered a full research paper.
* The language and structure of the paper are not at a decent scientific level. The paper should be proof-read by a native/fluent English speaker, as it contains many grammatical mistakes, e.g., missing definite/indefinite articles, grammatical mismatches, mixed tenses, and various typos. This makes the paper very difficult to read and follow.

* Even if the paper is considered as an application paper, there are several major flaws:
- A lot of details are left out. The authors provide only the pseudo-algorithms, without explanation. For example, it is not clear how the classification of the articles is done. In Section 3.3 the authors mention that the two keys with the highest aggregate count will be selected, but it is not clear what this count represents or how it is calculated. The authors need to describe the algorithms, as the provided pseudocode is not sufficient to understand how the classification is done.
- The authors do not provide a comparison to any related approaches or baselines. It is important to see what the advantage of the knowledge graph approach is compared to baseline methods, e.g., a simple bag of words, TF-IDF, BM25, and BM25+. Comparison to recent related approaches also needs to be performed to show the value of the system, e.g., the approaches already listed in the related work section and some more recent ones [2], as well as approaches that address the classification of interdisciplinary articles [3,4], since the authors claim that their approach can address this issue as well.
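To make concrete what even the weakest of these baselines would look like, here is a self-contained sketch of a TF-IDF nearest-neighbour classifier using only the standard library. The labels, tokens, and smoothed-IDF formula are my own illustrative choices, not anything from the paper under review:

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF weight vectors (smoothed IDF) for tokenized documents."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    return [{t: c * idf[t] for t, c in Counter(d).items()} for d in docs]

def cosine(u, v):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical labeled corpus and an unlabeled query document.
train = [
    ("cs.LG", "neural network training gradient".split()),
    ("cs.DB", "sql query index transaction".split()),
]
query = "gradient descent neural training".split()

vecs = tfidf([d for _, d in train] + [query])
qvec = vecs[-1]
label = max(zip((l for l, _ in train), vecs[:-1]),
            key=lambda lv: cosine(qvec, lv[1]))[0]
print(label)  # cs.LG
```

A baseline of roughly this complexity (or BM25, which replaces the TF term with a saturated, length-normalized variant) is what the classification results should at minimum be compared against.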

[1] Rousseau, François, and Michalis Vazirgiannis. "Graph-of-word and TW-IDF: new approach to ad hoc IR." Proceedings of the 22nd ACM international conference on Information & Knowledge Management. ACM, 2013.
[2] Li, Keqian, et al. "Unsupervised Neural Categorization for Scientific Publications." Proceedings of the 2018 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, 2018.
[3] Katz, J. Sylvan, and Diana Hicks. "The classification of interdisciplinary journals: a new approach." Proceedings of the fifth international conference of the international society for scientometrics and informetrics. Learned Information, Melford. 1995.
[4] Nanni, Federico, et al. "Capturing interdisciplinarity in academic abstracts." D-lib magazine 22.9/10 (2016).

Review #3
By Paul Groth submitted on 29/Aug/2018
Review Comment:

This paper describes the construction of a classifier for adding subject classifications to academic articles based on a knowledge graph semi-automatically constructed from a set of existing computer science related subject classifications.

Overall, significant work is needed for this paper to be ready for publication. Specifically, I would begin with three areas where the paper needs work:

1) Details on the construction of the pre-index
2) The evaluation setup and analysis
3) Related work

My detailed comments on these areas are below.

I would also look at the overall presentation of the paper as in its current form it is difficult to understand the overall approach.

Lastly, given that this is a systems paper I would expect the code and data to be made available unless there is a reason given where this is not possible.

# Detailed Comments

## Construction of the pre-index

I had difficulty understanding what was meant by a "pre-indexing graph". Is this the construction of a graph prior to the use of the graph for classification? If this is the case, it's difficult to understand how that construction occurred. It seems to be a manual merger of various existing subject classifications from different sources. But the manual approach isn't defined. Stating that the end result of the merger was "properly filtered" by comparing to a manual list is not enough detail. Importantly, the subject classification itself is not provided.

I would have also expected a justification as to why this approach was used rather than a more bottom-up approach as has been done in works like:

Osborne, F. and Motta, E. (2015) Klink-2: Integrating Multiple Web Sources to Generate Semantic Topic Networks, International Semantic Web Conference 2015, Bethlehem, Pennsylvania, USA

## Evaluation
In terms of evaluation, I think I understand that the existing arXiv classifications were used as a gold standard? Is that correct? It should be made a bit more apparent. If my understanding is correct, then I was surprised to see only accuracy reported. One could report precision and recall measures, as well as F1 or some other combined metric. It would be interesting to know whether your classifier is potentially adding missing classifications or failing to find certain classifications.
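For concreteness, per-class precision, recall, and F1 against a gold standard are a few lines of code; the category labels below are hypothetical examples, not the paper's data:

```python
def prf(gold, pred, label):
    """Per-class precision, recall, and F1 from gold and predicted labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical gold vs. predicted arXiv categories.
gold = ["cs.AI", "cs.AI", "cs.DB", "cs.DB"]
pred = ["cs.AI", "cs.DB", "cs.DB", "cs.DB"]
print(prf(gold, pred, "cs.DB"))  # (0.666..., 1.0, 0.8)
```

Reporting these per category (and macro-averaged) would reveal exactly the over- and under-prediction behaviour that a single accuracy number hides.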

Also, I would expect at least a justification as to why only one dataset was used as an evaluation corpus.

In terms of the research-area-of-interest classification, I find the comparison underpowered, as it uses only on the order of tens of authors. Additionally, the large table of authors and their classifications would be better published as a dataset, with only a couple of examples given in the paper.

There is also no baseline comparison to another state of the art or even a strong baseline (e.g. a dictionary based annotator or just a search system) classification system.

Note, I did like the comparison between metadata and full text performance.

## Related Work
The related work is not properly up to date and is not contextualized:

* "Lots and lots of works are going on in the field of graph based representation of scholarly articles in an efficient way by creating a network of academic papers using many dependency relationships like citations, references etc." - Please provide references and examples. I agree with the statement but please give an up-to-date view to help guide the reader in the current landscape of work.

* You need a citation for the claim "The early used approach for subject classification of journals is based on journal name". For the history of subject classification, Wikipedia is a good jumping-off point in this instance.

* Evidence supporting the claims that existing solutions are not reliable or accurate are not given. Why do the authors think that using dictionaries or interrelationship analysis are not accurate or reliable enough? Provide justifications.

# Minor comments

* "Graph based search systems are gaining more importance nowadays because of its reliability and efficiency to index and search data and wide variety of graph databases" - I don't think the citation to the Predictive Analytics Today "top databases" webpage supports this claim.

* The pseudocode would benefit from introducing notation first, followed by explanation in the text itself, so that the reader can understand the goal of each algorithm and generally how it works; the algorithms could then be presented as a more convenient, closer-to-implementable set of code.

* Have a careful check of the use of articles (i.e., "the" and "a") throughout the article. For example, "Sequence word graph maintains the sequence" should be "A sequence word graph..."

* It would be good to provide an organizing paragraph in the introduction to help the reader understand what you will subsequently present.

* The authors mention embeddings in the title but it's not addressed specifically anywhere else in the article as far as I can tell.