Linked Data Representation of the Nomenclature of Territorial Units for Statistics

Paper Title: 
Linked Data Representation of the Nomenclature of Territorial Units for Statistics
Authors: 
Gianluca Correndo, Nigel Shadbolt
Abstract: 
The publication of public sector information (PSI) data sets has brought to the attention of the scientific community the redundant presence of location based context. At the same time it stresses the inadequacy of current Linked Data services for exploiting the semantics of such contextual dimensions for easing entity retrieval and browsing. In this paper we describe our Linked Data representation of the NUTS European statistical subdivision, created to support the e-government and public sector in publishing their data sets. The topological knowledge published in the Linked NUTS can be reused in order to enrich the geographical context of other data sets, in particular in a scenario where statistical data sets describe information that have strong ties with the territory, and therefore with its geography.
Submission type: 
Dataset Description
Responsible editor: 
Pascal Hitzler
Decision/Status: 
Accept
Reviews: 

Submission in response to http://www.semantic-web-journal.net/blog/semantic-web-journal-special-ca...

Resubmission after "accept with major revisions"; now accepted for publication. The second-round reviews come first, followed by the first-round reviews.

Solicited review by Jesse Weaver:

In short, my initial concerns with this paper have been sufficiently addressed, but there are some minor problems that need to be addressed prior to final publication.

In section 3, discussion is included about the expected dereferencing behavior of URIs in NUTS. Specifically, the URIs are slash URIs that 303 redirect to documents in accordance with httpRange-14. This discussion addresses my first, initial concern. However, having actually tried to access the documents, I had no success (unlike when I initially reviewed the paper). Following the example, "curl -I -H 'Accept: application/x-turtle' http://nuts.psi.enakting.org/id/UKG32" simply returned with 500 INTERNAL SERVER ERROR. I may have an outdated version of curl, so I attempted it in my browser. When I entered http://nuts.psi.enakting.org/id/UKG32 into Firefox, viewing the web console, I was 303 redirected to http://nuts.psi.enakting.org/id/UKG32/doc , which is correct, but then the document could not be retrieved (browser seems stuck). So, although the description in the paper is complete, it no longer appears to agree with the actual behavior. However, I have opted not to let this affect my scoring of the paper since it worked before, and technical problems can be fixed.
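
The dereferencing check described above can be sketched in Python. This is a minimal illustration, not the reviewer's or the authors' tooling: the helper names `build_head_request` and `is_303_to_document` are hypothetical, and the only condition tested is that the server answers 303 with a Location header naming a URI other than the one requested.

```python
import urllib.request
from typing import Optional


def build_head_request(uri: str, accept: str = "application/x-turtle") -> urllib.request.Request:
    """Build a HEAD request with an Accept header.

    Similar in spirit to: curl -I -H 'Accept: application/x-turtle' <uri>
    """
    return urllib.request.Request(uri, headers={"Accept": accept}, method="HEAD")


def is_303_to_document(status: int, location: Optional[str], resource_uri: str) -> bool:
    """Return True if the response is a 303 whose Location names a different URI.

    Assumption: "different URI" stands in for "a document describing the
    resource"; the sketch does not fetch or inspect that document.
    """
    return status == 303 and bool(location) and location != resource_uri


resource = "http://nuts.psi.enakting.org/id/UKG32"
request = build_head_request(resource)

# The redirect observed in the browser would pass the check...
print(is_303_to_document(303, resource + "/doc", resource))
# ...while the 500 response returned to curl would not.
print(is_303_to_document(500, None, resource))
```

Note that `urllib.request.urlopen` follows redirects by default, so observing the 303 response itself requires a client configured not to follow them.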

Also, there is a URI in section 3 http:\nuts.psi.enakting.org/id/UKG32/ttl ; the slashes following the scheme are in the wrong direction. There is also a sentence that says: "When asking for a document describing an NUTS region therefore, clients are redirected via HTTP 303 to the correct URI; ...." This seems to imply that the first URI was incorrect, which it is not. Consider rephrasing to something like "redirected via HTTP 303 to a document describing the resource identified by the initial URI" (or possibly a similar, more concise phrasing).

An example is given in Figure 1 in the Turtle syntax. This addresses my second, initial concern. There are some errors in the example, though. The last two prefix declarations do not end with periods, but should. "nutsd:UKG32" should presumably be "nuts:UKG32" since there is no nutsd prefix declared. (Just a note, fb:guid.9202... is syntactically invalid under the Team Submission http://www.w3.org/TeamSubmission/2011/SUBM-turtle-20110328/ because the local name contains a period, but it is syntactically valid according to the RDF 1.1 Working Draft http://www.w3.org/TR/2012/WD-turtle-20120710/ . This is not so terrible, but it is something of which the authors should probably be aware.)
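
Both kinds of error noted above (prefix declarations without a terminating period, and a prefix used without being declared) can be caught mechanically. The following Python sketch is a rough regex-based lint under stated assumptions, not a Turtle parser: the function name `lint_turtle_prefixes` is hypothetical, the namespace IRI and the label in the sample are made up for illustration, and the empty prefix and multi-line literals are not handled.

```python
import re
from typing import List

# Matches "@prefix name: <iri>" with an optional trailing period.
PREFIX_DECL = re.compile(r"^\s*@prefix\s+([A-Za-z][\w-]*)?:\s*<[^>]*>\s*(\.?)\s*$")
# Matches the prefix part of a prefixed name such as "nuts:UKG32".
PREFIX_USE = re.compile(r"(?<![\w:])([A-Za-z][\w-]*):[A-Za-z_]")


def lint_turtle_prefixes(turtle: str) -> List[str]:
    """Report unterminated @prefix lines and undeclared prefixes."""
    declared = set()
    problems = []
    body_lines = []
    for line in turtle.splitlines():
        match = PREFIX_DECL.match(line)
        if match:
            name = match.group(1) or ""
            declared.add(name)
            if match.group(2) != ".":
                problems.append(f"prefix '{name}' declaration lacks a terminating period")
        else:
            body_lines.append(line)
    # Drop IRIs and string literals so colons inside them are not counted.
    body = re.sub(r"<[^>]*>", " ", "\n".join(body_lines))
    body = re.sub(r'"[^"]*"', " ", body)
    for name in sorted(set(PREFIX_USE.findall(body))):
        if name not in declared:
            problems.append(f"prefix '{name}' is used but not declared")
    return problems


# Illustrative input only: the namespace IRI and label are assumptions.
sample = """@prefix nuts: <http://nuts.psi.enakting.org/id/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
nutsd:UKG32 rdfs:label "example region" .
"""
for problem in lint_turtle_prefixes(sample):
    print(problem)
```

A full Turtle parser would give more reliable diagnostics; the sketch only targets the two issues raised here.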

The paper also needs to be revised for grammar errors.

These issues are really very minor. The content of the paper otherwise seems appropriate for acceptance.

Solicited review by Oscar Corcho:

This version of the dataset description is considerably improved over the previous one and addresses the main comments made in the initial review. The descriptions are now clearer, and some small issues about how time is represented have been well explained.

It still remains unclear how many sites, datasets or apps are using this dataset. Also, in my opinion, the dataset would be more useful if it were also available through a SPARQL endpoint.

There is only one typo in the Turtle code: it uses the prefix nutsd instead of nuts for UKG32.

Solicited review by Michael Hausenblas:

The issues I've raised in the previous round have been addressed, happy to see this article being published now. Good work!

First round reviews:

Solicited review by Michael Hausenblas:

The article is overall in good shape; however, there are some issues that need to be addressed before it can be accepted, as I detail in the following.

In the paper, the author describes a Linked Data version of the EU territorial units nomenclature (NUTS), available at http://nuts.psi.enakting.org/, covering the modelling, interlinking and publishing process in varying detail.
The quality of the dataset and its usefulness are convincing. Concerning the clarity and completeness of the descriptions, certain improvements should be made (see below).

## Core DSD
I was unable to find a description of the original license or the one used to re-publish the LD NUTS dataset in the paper and a discussion of its implication. Further, with regards to access methods from the description in the paper it is unclear if only the RDF documents are published or alternative means (SPARQL endpoint, dumps) have been made available.

## Publishing and metrics
The coverage and the original dataset are clearly described; however, I wonder if the author is aware of similar efforts such as NUTS-RDF (http://nuts.geovocab.org/) - as the related work section is missing, I assume not. This also raises the question of the motivation for providing this dataset in the first place. As it is unclear to me when the work was performed, it could be the case that at that time simply nothing else existed, but if not, I'd be interested to learn why the effort was made (other NUTS datasets don't provide enough detail, or other reasons?). Concerning the interlinking, this section (3.) can be improved; for example, why link only with Freebase and not with other datasets (GeoNames, LinkedGeoData, NUTS-RDF, etc.)?

A number of metrics have been provided (Table 1 and Fig1 and Fig2) but the basic metrics (number of triples, interlinks, etc.) are not mentioned in the paper AFAICT. The author describes the use of established vocabularies (re-use of Ordnance Survey ontology, OWL time) but doesn't provide the design decisions explicitly (why those? alternatives?)

## Examples, modeling patterns and shortcomings
An example (UKG32) is mentioned in the paper but it would perhaps be good to also provide the RDF representation directly (preferably in Turtle) in the paper, or alternatively a figure that shows the graph of a typical example. The weakest part of the paper is in my understanding the lack of a critical discussion of the modeling patterns. I was unable to find a discussion of the shortcomings of the dataset.

## What is missing
I would expect a related work section that compares this dataset with other NUTS datasets (in Linked Data and in other formats, where it makes sense).

As the author rightly points out in Section 2: "Since the NUTS nomenclature encodes a subdivision of a territory that is subject to frequent changes, it is expected to change accordingly." and in Section 4. "The version of NUTS currently covered by the data set does not include the new version, released the 1st of January 2012.", I strongly suggest adding a paragraph about the update policy (will new data be added, and if so, how?), maybe in Section 3.

## Editorial comments

* Section 1: contains quite some general discussion which can be cut down - IMO, no need to go into the history of PSI or motivate why government Open Data is relevant. The sentence "With such a directive, the EU has laid down the legislative foundations that member countries should follow in order to ensure a healthy secondary market built upon such data sources (i.e. texts, databases, audio, and video) with transparency and fair competition." doesn't parse in my brain - rephrase?

* Section 2: the first two paragraphs of this section could serve as the introduction section as this is not really about the Linked Data NUTS version but a general description of the NUTS datasets. In addition I don't understand the sentence "Each region at the same level is either the expression of a political will or meant to provide comparable features at statistical level (e.g. similar geographical or socio-economic requirements) in order to make comparison and analysis." It is also unclear to me, from the description, if the shape files are available for the entire EU or only for the UK - I checked a few and apparently it covers the entire EU, please clarify this.

* Section 3: the interlinking process with Google Refine could be fleshed out a bit more (how did you do it, some more quantitative measures re success rate, etc.)

* Section 4: the last paragraph is a collection of generic yada yada, for example "Such reuse of knowledge is potentially innovative but poses many questions about the management of the quality of the knowledge and the entity alignments used" - here would be a good place to come up with concrete examples - where is the innovation? what are the quality issues?

Solicited review by Oscar Corcho:

This paper describes the dataset that results from the transformation of the NUTS dataset into Linked Data. Although statistics about the dataset itself are not given, it is clearly a medium-sized one, and it is actually linked mainly to Freebase, even though there could be many other links that could be considered in this context (to geonames, linkedgeodata, dbpedia, etc.).

Considering the main criteria that are the focus of this special issue, we will first talk about data quality. In fact, the quality of the dataset depends heavily on the quality of the NUTS one, which is a well-curated data source, and hence the data in this Linked dataset is of good quality as well. The dataset has been transformed according to some well-known and well-worked vocabularies, although the description of how time intervals are handled is not quite clear. For instance, it is not clear why some specific consecutive intervals are considered for the available data instead of a unique interval (e.g., in http://nuts.psi.enakting.org/id/ES12/doc it is not clear why the validity corresponds to four intervals rather than a single unified one; without this explanation, the discussion on time representation is not well addressed). The metadata about the dataset is poor, as acknowledged on the home page of the whole dataset, and should be improved as well (adding some metadata would be enough to have a complete description).
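
To make the point about consecutive intervals concrete, the sketch below shows how validity intervals that touch or overlap could be coalesced into one. It is only an illustration of the alternative raised above: the function name `coalesce` is hypothetical, the years are invented and are not taken from the ES12 document, and intervals are assumed to be (start, end) pairs with an exclusive end.

```python
from typing import List, Tuple

# (start_year, end_year) with an exclusive end; an assumption of this sketch.
Interval = Tuple[int, int]


def coalesce(intervals: List[Interval]) -> List[Interval]:
    """Merge intervals that overlap or meet end-to-start."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Four consecutive, made-up validity intervals collapse into a single one.
print(coalesce([(1999, 2003), (2003, 2006), (2006, 2008), (2008, 2010)]))
```

Keeping the intervals separate can still be a deliberate modeling choice, for example when each interval corresponds to a distinct version of the nomenclature.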

As for the usefulness of this dataset, it is clear that geographical information is one of those data sources with a high potential of being useful. However, the author does not say how many sites or applications are currently using it or, in cases where other sources of geographical information are being used instead, why this is happening. This should be further explained and discussed.

Finally, the dataset is quite complete and the descriptions are clear. I have some concerns about the fact that section 3 talks about linking, but this description is probably not so important in the context of this special issue and would be better placed, in summarised form, in the discussion section. How many links have you actually generated? This is also related to the last paragraph of section 4, which is not clear enough.

I would suggest revising the paper with the comments that have been made, and focusing on providing the corresponding metadata of the dataset, as pointed out, and fixing some of the aspects that have been described above.

Solicited review by Jesse Weaver:

This article describes a Linked Data representation of the NUTS European statistical subdivision, assigning a resolvable URI for each NUTS region. The Linked Data representation includes a containment hierarchy of regions, in which there are five levels, and also a version represented as a time interval. Such data seems to be a useful addition to Linked Data on the Web.

Having perused the dataset, the quality of the dataset seems sufficient, and the content of the dataset seems implicitly useful. However, the description is brief, although nearly sufficient since pointers to the dataset are given to the reader. The only apparent deficiency of the article is that it does not discuss the (intended) dereferencing behavior of URIs in the dataset. Admittedly, given the current state of Linked Data standards, it should be expected that such behavior complies with the current resolution of httpRange-14, but it is surprising how frequently that is not the case in practice. The URIs mentioned in this article are slash URIs, and using a browser, they seem to 303 redirect in compliance with httpRange-14. A brief mention of this intended behavior would make the overall Linked Dataset description complete.

Additionally, the article includes no examples of the data which would help the reader get a sense of the dataset beyond the mere text description. Since the article is four pages (two pages short of the maximum allowable length), there is room to add such examples.

Resolving these minor issues would improve the article as a Linked Dataset description.

Comments

I have changed the paper according to the comments provided by the reviewers. The following is a point-by-point response:

* license used
The NUTS data is published by Eurostat as open data with attribution. We adopted the same licensing.

* methods of publishing
The data is available only as Linked Data (i.e. no dumps, no SPARQL endpoint). Further access to the data is provided by the geoservice.

* data sets relevant to NUTS
Included all relevant data sets retrieved by CKAN in the relevant work section.

* motivation to publish the data set
Included in the introduction the motivation and reference to a previous work that provides an example of the value of published authoritative geographies.

* why linking only to Freebase
Described how Freebase is only the entry point to the LOD cloud. Via the sameAs service one can access the equivalence set for that entity. The stats described in Section 3.2 are based on sameAs results.

* basic metrics for data set
Added metrics in Section 3.

* design decisions behind ontology
Paper reorganised. Section 3.2 now describes the vocabularies used and the rationale behind the choices made.

* example in Turtle
Added.

* modelling patterns
Section 3.1 should cover this. There is unfortunately not enough space for a more detailed description of the patterns used.

* update policy
Since changes to the NUTS are infrequent, any change in the Linked Data version is decided on a per-request basis. Added to Section 3.

* reduce introduction
Introduction reduced and paper reorganised in order to make it more readable.

* specify the coverage of the shape files
Specified. Shapes are provided only for the 2009 version of the nomenclature.

* linking process with google refine
Specified in more detail the amount of work done in aligning the entities. Stats provided.

* time intervals management
Specified in Section 3.2, which covers the vocabularies. The Turtle example should make it clearer how NUTS versions are Time Intervals (OWL Time) and how regions refer to versions.

* metadata
Added both to CKAN and to the paper. Every data set mentioned has a link to the CKAN package now (when available).

* http-range-14
303 redirection of entity URIs is now described, and httpRange-14 is mentioned.