LOPDF: A Framework for Extracting and Producing Linked Open Data of Scientific Documents

Tracking #: 1669-2881

Ahtisham Aslam
Naif Radi Aljohani
Rabeeh Ayaz Abbasi
Ali Daud
Saeed-Ul Hassan

Responsible editor: 
Guest Editors LD4IE 2017

Submission type: 
Full Paper

Abstract:
The results of scientific experiments and research conducted by both individuals and organizations are published and shared with the scientific community in various types of scientific documents such as books, journals and reference works. The metadata of these documents describe important properties, such as has_Author, has_Affiliation, has_Keyword and has_Reference. These can be used to find potential collaborators, discover people with common research interests and research work, and explore scientific documents in matching domains. The major issue in obtaining these benefits from the metadata of scientific documents is the lack of availability of this data in a well-structured and semantically enriched format. This limits the ability to pose smart queries that can help to perform various types of analysis on scientific publication data. To address this problem, we have developed a generic framework named Linked Open Publications Data Framework (LOPDF). The LOPDF framework can be used to crawl, process, extract and produce machine understandable data (in RDF format) about scientific publications from various publisher-specific sources such as web portals, XML exports and digital libraries. In this paper, we present the architecture, process and algorithm used to develop this LOPDF framework. The RDF datasets produced can be used to answer semantically enriched queries by employing the SPARQL protocol. We present quantitative as well as qualitative analyses of the LOPDF framework. Finally, we present the potential usage of semantically enriched RDF data and SPARQL queries for various types of analyses of scientific documents.
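The crawl, extract and triplify process the abstract outlines can be illustrated with a minimal triplification sketch. The property names (has_Author, has_Keyword) are taken from the abstract, but the namespace URIs, function names and record layout below are placeholders, not SPedia's actual schema or code.

```python
# Hypothetical sketch: turn one document's extracted metadata into N-Triples.
# BASE and PROP are placeholder namespaces, not SPedia's real URIs.

BASE = "http://example.org/resource/"
PROP = "http://example.org/property/"

def escape_literal(text: str) -> str:
    """Escape the characters N-Triples requires to be escaped in literals."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

def triplify(doc_id: str, metadata: dict) -> list:
    """Convert one metadata record into a list of N-Triples lines."""
    subject = f"<{BASE}{doc_id}>"
    triples = []
    for prop, values in metadata.items():
        for value in values:
            obj = f'"{escape_literal(value)}"'
            triples.append(f"{subject} <{PROP}{prop}> {obj} .")
    return triples

lines = triplify("Doc_123", {
    "has_Author": ["A. Aslam", "N. R. Aljohani"],
    "has_Keyword": ["Linked Open Data"],
})
```

Each call yields plain N-Triples lines that can be appended directly to a dataset dump file.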


Solicited Reviews:
Review #1
By Anca Dumitrache submitted on 13/Jul/2017
Major Revision
Review Comment:

This paper describes the LOPDF framework for extracting and producing linked open data for scientific publications. The authors present a case study of mining the Springer website for metadata on publications and publishing it as linked data. The paper is well written and the methodology of the framework is clearly explained. The dataset created with this method is referred to as SPedia, and its evaluation is also presented in the paper.

However, the motivation for this work needs to be better put into context. The authors claim that “only limited work has been done in publishing the LOD of scientific documents”. In fact, many scientific publishers now publish LOD metadata, including Springer (http://lod.springer.com/data/search). The DOI service also provides linked data for publications (https://www.doi.org/doi_handbook/5_Applications.html#5.4). I would like to see a comparison between the SPedia data and these datasets in terms of completeness.

SPedia could also benefit from some ontology alignment. Instead of using new predicates, SPedia should reuse vocabularies that are already established and that deal with publications. I would like to see at least an alignment with Dublin Core (http://dublincore.org/) - as this is one of the most cited ontologies, it should improve the reusability of the data.
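As a concrete illustration of the alignment this review asks for, the sketch below rewrites SPedia-style custom predicates onto Dublin Core terms. The source namespace and alignment table are hypothetical; the dcterms URIs are the real Dublin Core properties.

```python
# Sketch of predicate alignment: map custom SPedia-style predicates onto
# established Dublin Core terms so the data reuses a known vocabulary.

DCTERMS = "http://purl.org/dc/terms/"

# Hypothetical alignment table; in practice it would be curated per source.
ALIGNMENT = {
    "has_Author": DCTERMS + "creator",
    "has_Keyword": DCTERMS + "subject",
    "has_Reference": DCTERMS + "references",
}

def align_predicate(predicate: str) -> str:
    """Return the Dublin Core equivalent of a custom predicate, if known;
    otherwise pass the predicate through unchanged."""
    local_name = predicate.rsplit("/", 1)[-1]
    return ALIGNMENT.get(local_name, predicate)
```

Unmapped predicates survive untouched, so alignment can be introduced incrementally without breaking existing triples.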

A discussion of how to generalize the LOPDF extraction algorithms to multiple data sources is also missing. In the current setup, it is not clear how this could be done concretely. The authors mention that it can be accomplished by “changing the endpoint triggers”. How are these endpoint triggers collected for each publisher? And also, how is data alignment performed? For instance, the authors mention the inconsistency of publication categories across publishers - how does the LOPDF framework handle this?

The evaluation also needs more detail. How many SPARQL queries were used in the evaluation of the quality? How many articles were randomly chosen from the results? Is this a representative sample for the data?

Finally, is the SPedia data available somewhere online? I would like to see at least a link to a data dump, if not a SPARQL endpoint for it.

Review #2
Anonymous submitted on 13/Oct/2017
Major Revision
Review Comment:

The authors describe work to extract publication metadata to a structured, uniform, machine-readable format, as linked data, to aid querying across datasets and discovery of information such as expertise and relationships due to collaboration or common affiliation.

The paper addresses a current challenge. However, I am not convinced about novelty - what does this approach provide that isn't already available? Among others, dblp provides a very similar service (admittedly XML, not RDF - but still machine-readable). Further, how is the framework extended to other publication portals? There is not enough information about the actual implementation to guess this.
S2 concludes by saying "there are no frameworks for producing LOD about scientific documents." - assuming this is true, why not reuse existing work to do the same? What makes scientific documents so different that a new framework needs to be built specifically for this?

The paper gives a good bit of detail/long explanations about contributions with relatively lower importance, but does not detail the model on which the data extraction is based. There are hints about this - in Figs 3 & 4 and the algorithms, but it is never explicitly presented. It would be useful to provide at least a high-level description or snapshot.

How is "qualitative" defined in this paper? From the text, S6.2 appears to be discussing QUALITY - it is not a qualitative analysis as would normally be carried out.
Overall, both sections 6.2 and 6.3 do not provide particularly remarkable information.
What was the "client semantic web browser" used in S6.3 - was it Gruff or something else? Importantly, WHICH browser(s) and on what basis selected need to be made clear.

100% accuracy is unusual, except maybe for manual extraction of a relatively small sample, and especially so for "interlinking different datasets". What are examples of the "obstacles and inconsistencies" encountered, and how were they resolved? How easy would it be to replicate this set-up for new portals?
What "external" datasets were used and what and how many were the links? In the follow-on with other publication portals were the results replicated?

The literature review is a little difficult to follow - it focuses predominantly on government data conversion to LOD, but then includes other work that seems in part to be randomly placed. E.g., is the discussion about SILK specific to the Singapore data?
I'd suggest restructuring to group similar topics and showing clearly, where relevant, where one feeds into another.

That recursion/iteration is used in the data extraction/triplification process is repeated several times. Is there a reason for stressing this? - in which case it should be expanded on.

Algorithm2 is a bit tedious to read - each iteration repeats nearly the same thing with minor differences for type and document part. For readability a simple switch between types could be used.
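The switch-style refactoring suggested above could look like the following sketch: one loop over a dispatch table replaces Algorithm 2's near-identical per-type branches. The document-type names follow the paper (books, journals, reference works), but the extractor functions and record layout are hypothetical stand-ins.

```python
# Placeholder extractors, one per document part; each stands in for one of
# Algorithm 2's repeated branches.
def extract_chapters(doc):
    return doc.get("chapters", [])

def extract_articles(doc):
    return doc.get("articles", [])

def extract_entries(doc):
    return doc.get("entries", [])

# The dispatch table replaces the repeated type-specific iterations.
EXTRACTORS = {
    "book": extract_chapters,
    "journal": extract_articles,
    "reference_work": extract_entries,
}

def extract_parts(doc_type: str, doc: dict) -> list:
    """Single code path for all document types."""
    try:
        return EXTRACTORS[doc_type](doc)
    except KeyError:
        raise ValueError(f"unknown document type: {doc_type}")
```

Adding a new document type then means adding one table entry rather than duplicating a whole iteration.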

****** Other detail

p.4 - what do you mean by "the implantation of unformatted and unstructured unique multilingual datasets"? - emphasis on "implantation". Further, what is it in the study that needs to be improved?

p.5 - first paragraph - don't understand what this is saying - what are these vis methods and in what way are they better than the existing - where the existing is what? How does generalisation allow use of SPARQL - or is this meant to say whatever method this is allows the use of SPARQL?

The last sentence on p.5 says the same thing as the last sentence of the following paragraph (on p.7).

P.7 - why would you link an organisation to a dataset using the property "has_coordinates"? Yes, this might be appropriate for geonames specifically, but the text gives geonames as an example that stands apart from the property in question. Should the description not provide a more abstract connection property, from which a refinement per specific data type (such as this) could be used? Considering the article is arguing for machine-readable linking of data this IS important - as a human I can do this manually for a relatively small number of instances before the tedium becomes an issue. A machine can deal with the tedium but cannot carry out this level of specification without a properly defined method for categorisation.

It isn't completely obvious which parts of Fig.3 map to the four parts as described in the text. I'd suggest annotating the figure to distinguish between the first two.

Fig.4 - for the three "crawl books | journal | reference works|" - the (process) arrows pointing downwards to increasing level of refinement make sense, but not the reverse arrows. How do you move from, e.g., "extract book chapters details" to "crawl book chapters" - why would you need to repeat the latter, and what in the former is needed to repeat the latter? In the same vein, why does this feed back into the data source box? Are you pushing information back to your source? What does this arrow represent?
Essentially, this section provides a closed loop - why is not obvious - I'd expect the data source to be the input point only, not part of a closed loop.
Further, S5 says the crawling and extraction processes "Both work in sequence and are interdependent." - this is actually not possible - sequence means you start from one and go to the next. Going back to the arrows the flow is in both directions. Logically and practically, also, you cannot extract the detail before you've parsed the data.

S6.4 - why would you encode a date as a string? Especially after noting such instances in the source data contributed to challenges in parsing?
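For illustration, a typed xsd:date literal can be produced instead of a plain string. The input formats tried below are guesses at typical publisher date formats, not the ones LOPDF actually encounters.

```python
from datetime import datetime

XSD_DATE = "http://www.w3.org/2001/XMLSchema#date"

def date_literal(raw: str) -> str:
    """Parse a source date string and emit an xsd:date-typed RDF literal.
    The candidate input formats are illustrative assumptions."""
    for fmt in ("%d %B %Y", "%Y-%m-%d", "%B %Y"):
        try:
            parsed = datetime.strptime(raw, fmt)
            return f'"{parsed.date().isoformat()}"^^<{XSD_DATE}>'
        except ValueError:
            continue
    # Fall back to an untyped literal rather than failing the whole record.
    return f'"{raw}"'
```

A typed literal lets SPARQL range filters and date arithmetic work, which a plain string does not.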

S7 - "Performing queries such as finding multi-author trends in writing scientific documents in various disciplines may help management to find out which disciplines need to establish policies for more collaborative work." - why? is collaboration always a requirement for research? Further, to what aim?


Please help your reader correctly interpret your diagrams. Especially as they are necessary to get a good grasp of the work done.

Figure 1 is uncomfortable to read - there is a lot of variation in resolution between the different snapshots - which requires the eye to attempt to refocus moving through the diagram. The overlays only make this worse - as the picture gets busier with yet another level of resolution and lines crossing over each other. This renders what should be an informative figure not so very useful.
Am I correct in assuming that each column as you go down is for one of the three journals? If so place a bit of space between them, some of the snapshots lie on the column line. Also, move the descriptions a-d to the top of each row - it took me rereading to figure out that d was not missing, but that you had to read from bottom to top.
I'd suggest ALL the snapshots use the same resolution level and font size - manually adjust if need be and note this in the caption. Fine to retain different family as this is a good way of distinguishing between the three types, but if so use the same for each type - this varies within each column.

Fig 2 places very light text on a background that is almost the same shade - the text is barely discernible, and only after you read the text indicating it exists. I'd suggest using a white or transparent background. FYI, I'm using a monochrome printout - double-checking on-screen I can see two different colours, but only slightly better, text still illegible because the contrast is too low. Same suggestion stands for the background.

Fig 5b - I would suggest ordering the labels on the horizontal axis by some meaningful property - it appears to be random, so reading and especially comparison across variables, is unnecessarily difficult. How can the reader tell which bar maps to which discipline? Or is the caption incorrect - what is the grouping - discipline or property?

Figs 5b and c aren't very useful as presented - they compare apples and pears. If certain attribute types are only found in certain document types then comparing them as whole numbers to others found across all doesn't provide a representative picture of content. In this case proportions might be more useful.

I would suggest splitting 7a into two parts or at least updating the caption to say it's both query and results list. Further, because the caption is incomplete it is not immediately obvious that 7b is the detail for one result. Especially as it talks about authors only when the detail is much more than authors. For readability I would suggest merging the first column - all four rows are the same - or show one and use the "as above" symbol (") to highlight this. Ditto for column 4 - context. Then use the abbreviated form for the predicate and object so the property/value of interest stands out - there is so much (redundant) text in this snapshot that the interesting detail is lost in the noise.


Reference format and presentation is inconsistent.

Additional errors include:

repeated/redundant information, e.g., in [9, 33, 37] - "Springer Berlin Heidelberg, Berlin, Heidelberg"

misspelled names, e.g., both authors, at least one editor in [12], one author in [24]

errors in 'chapter in book' references, including 10, 26.

there are several instances of incorrect capitalisation of acronyms and proper nouns, e.g., "emblematica" in [8], "singapore" in [32], "uk" in [34]. If you're using BibTeX surround any such with curly braces to preserve capitalisation.

[31, 32] should include an "available/published at"


doi is an acronym - please be consistent and use only correct capitalisation.

A handful of errors that a proofread or auto grammar check will pick up. Some of the more difficult to pick up below.

Missing apostrophes, e.g., p.3 -
"The British government, for instance, publishes government LOD [34] to facilitate citizens['] easy access …"
"The Albanian government has … so that they can participate in governance and decision making as a part of the country[']s modern democracy. "

A few examples of dropped definite/indefinite articles (a/an/the), e.g., p.3
"The Albanian government …as part of [AN or THE] Open Government Partnership, so that they …"
"Here, [THE] authors present their data retrieval algorithm. "
p.7 - "… while the organization can be linked to [AN] external dataset (e.g. geonames) …"

Review #3
By Andrea Giovanni Nuzzolese submitted on 16/Oct/2017
Review Comment:

The paper presents LOPDF, which is a framework for extracting and producing Linked Open Data of Scientific Documents. LOPDF implements a pipeline-like architecture that consists of four sequential modules. Namely, those modules are (i) the Information Crawler, which takes the URL of a source portal as input and starts crawling it, from the first to the last discipline; (ii) the Data Parser and Extractor, which enables metadata extraction from scientific documents; (iii) the Triplifier, which converts extracted metadata into RDF triples; and (iv) the RDF Datasets Generator, which finally generates the resultant RDF datasets serialised in the N-Triples format. The authors applied LOPDF to SpringerLink as data source in order to generate a dataset, i.e. SPedia, which consists of around 300 million RDF triples describing information on about 9 million scientific documents in a machine processable format.

==== Overall comments ====
The paper is well written and structured in all its parts.
The problem of generating LOD from scholarly objects is a relevant topic to the SWJ.
As a matter of fact, in recent years the number of solutions and initiatives (e.g. the Semantic Web Dog Food, Scholarly Data, Open Citations, etc.) in the domain is steadily rising.

=== Strengths ===
The description of the general architecture of LOPDF is fair.
Accordingly, Section 4 provides the description of the four modules that make up the architecture, i.e. the crawler, the parser/extractor, the triplifier, and the dataset generator.
Similarly, the paper provides a good high-level description of the algorithms for generating LOD from a given data source about publications.
The related work section provides a general overview of state-of-the-art solutions for LOD generation, though one not focused on the scholarly domain (see weaknesses).

=== Weaknesses ===
Nevertheless, the paper shows significant weaknesses that, in my opinion, prevent it from being published in its current form.
Those weaknesses are:

+++ Not focused SOA +++
Although a general overview (as provided in Section 2) of existing solutions is useful for drawing the boundaries around the problem of generating LOD from non-RDF sources in a broader sense, the authors never narrow the narrative to specific solutions for modelling ontologies and generating LOD in the scholarly domain. In fact, the authors should provide a more comprehensive state of the art by including recent solutions, such as [1], [2], [3], and [4]. These works are relevant with respect to LOPDF and cannot be omitted.
Additionally, the paper fails to systematically compare LOPDF with state-of-the-art solutions. This does not allow the reader to fully understand to what extent the solution proposed by the authors is original.

+++ Lack of details +++
Sections 4 and 5 provide the description of the architecture and the algorithm for generating LOD. However, the description is kept very general, causing many details to be omitted. Those details are needed to provide formal support for the proposed solution. In fact, the following points remain unclear:
- How does the Information Crawler communicate with the original sources?
- What protocols are used, if any (e.g. OAI-PMH)?
- How is it possible, and how much effort is required, to adapt the Crawler to custom sources for crawling (e.g. specific APIs based on HTTP REST services) and parsing (e.g. the extraction of specific pieces of information available in heterogeneous formats, such as HTML, JSON, CSV, etc.)? I think that the customisation of LOPDF is a non-trivial task. Hence, this part has to be significantly extended.
- The metadata and the data schemata might vary a lot from one source to another. What is the solution adopted for homogenising those metadata and schemata during the triplification? Is there any target ontology/vocabulary to use for the triplification?
- Is LOPDF available as a software platform?

+++ Evaluation +++
The authors perform a quantitative and qualitative evaluation by applying the LOPDF framework to SpringerLink and generating SPedia.
However, both are trivial and imprecise for a research article. Firstly, how many journals, papers, etc. from SpringerLink have been processed by LOPDF, exactly? Without this information the numbers reported in Fig. 5 are useless.
Similarly, the authors argue that they used "around a thousand SPARQL queries (on average) to test and analyse results from various aspects". How many are they? What are the various aspects used as reference, exactly?
Moreover, the qualitative evaluation does not show any effectiveness of LOPDF for generating LOD. What is the objective of the user-based study? The user-based study involving ten researchers with expertise in the fields of LOD does not provide any result about the quality of generated LOD. Additionally, the authors do not report any data about the inter-rater agreement and the reliability of subjects, which are typically recorded during user-based studies.
The query-based evaluation provides only a naïve assessment, or initial hints, of the effectiveness of LOPDF. In principle, the query-based evaluation can be used for measuring accuracy, precision, recall and F-measure. However, the authors should carry out the experiment using a more rigorous approach. In fact, the authors have to clarify how they selected the SPARQL queries and provide them to the readers along with the subset of SPedia used.
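The metrics this review names can be computed per query as in the following sketch. The relevant and retrieved sets of document identifiers are hypothetical inputs that would come from a gold standard and the SPARQL results, respectively.

```python
def evaluation_metrics(relevant: set, retrieved: set) -> dict:
    """Precision, recall and F-measure for one query's result set.
    `relevant` is the gold-standard set of correct documents;
    `retrieved` is the set returned by the SPARQL query."""
    true_positives = len(relevant & retrieved)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f_measure": f_measure}
```

Averaging these per-query values over a disclosed set of queries would give the rigorous figures the evaluation currently lacks.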

1. K. Möller, T. Heath, S. Handschuh, and J. Domingue. Recipes for Semantic Web Dog Food: The ESWC and ISWC metadata projects. In Proc. of ISWC'07/ASWC'07, pages 802-815, Berlin, Heidelberg, 2007. Springer.
2. A. G. Nuzzolese, A. L. Gentile, V. Presutti, and A. Gangemi. Conference Linked Data: the ScholarlyData project. In Proc. of ISWC 2016 - Resource Track, Lecture Notes in Computer Science 9982:150-158. Springer, 2016. DOI: 10.1007/978-3-319-46547-0_16
3. S. Peroni, A. Dutton, T. Gray, and D. Shotton. Setting our bibliographic references free: towards open citation data. Journal of Documentation, 71(2):253-277, 2015. DOI: 10.1108/JD-12-2013-0166, OA at http://speroni.web.cs.unibo.it/publications/peroni-2015-setting-bibliogr...
4. D. Shotton. Semantic publishing: the coming revolution in scientific journal publishing. Learned Publishing, 22(2):85-94, 2009.