Linked SDMX Data

Tracking #: 454-1631

Authors: 
Sarven Capadisli
Sören Auer
Axel-Cyrille Ngonga Ngomo

Responsible editor: 
Oscar Corcho

Submission type: 
Dataset Description
Abstract: 
As statistical data is inherently highly structured and comes with rich metadata (in the form of code lists, data cubes, etc.), it would be a missed opportunity not to tap into it from the Linked Data angle. At the time of this writing, there exists no simple way to transform statistical data into Linked Data, since the raw data comes in different shapes and forms. Given that SDMX (Statistical Data and Metadata eXchange) is arguably the most widely used standard for statistical data exchange, a great amount of statistical data about our societies is yet to be discoverable and identifiable in a uniform way. In this article, we present the design and implementation of SDMX-ML to RDF/XML XSL transformations, as well as the publication of OECD, BFS, FAO, ECB, and IMF datasets with that tooling.
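
As a rough illustration of the transformation step described above, the following Python sketch drives the Saxon XSLT 2.0 processor over a single SDMX-ML file to produce RDF/XML. The file names, stylesheet name, and jar path are assumptions for illustration, not the authors' exact setup.

    import subprocess

    SDMX_INPUT = "oecd-dataset.sdmx.xml"   # assumed SDMX-ML source file
    STYLESHEET = "generic.xsl"             # assumed name of a Linked SDMX template
    RDF_OUTPUT = "oecd-dataset.rdf"        # RDF/XML result

    # Saxon-HE is an XSLT 2.0 processor; -s, -xsl and -o are its standard options.
    subprocess.run(
        ["java", "-jar", "saxon9he.jar",
         "-s:" + SDMX_INPUT, "-xsl:" + STYLESHEET, "-o:" + RDF_OUTPUT],
        check=True,
    )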

Decision/Status: 
Accept

Solicited Reviews:
Review #1
By Francois Scharffe submitted on 03/May/2013
Suggestion:
Accept
Review Comment:

This paper presents a transformation from the ISO-standard statistical data format SDMX-ML to RDF. URI patterns, interlinking, and publication are proposed for several statistical datasets. The paper is clearly written and the tools can be used for publishing other datasets, although some extensions or specific configuration might be necessary.
The only comments are that the paper is two pages too long according to the call, and that it would be better if the datasets were published by the providers themselves.

Review #2
Anonymous submitted on 08/May/2013
Suggestion:
Minor Revision
Review Comment:

This paper describes a practical workflow for transforming SDMX collections into Linked Data. The authors focus on four relevant statistical datasets:

- OECD, whose mission is to promote policies that will improve the economic and social well-being of people around the world.
- BFS Swiss Statistics, whose Federal Statistical Office web portal offers a wide range of statistical information including population, health, economy, employment and education.
- FAO, which works on achieving food security for all, to make sure people have regular access to enough high-quality food.
- ECB, whose main task is to maintain the euro's purchasing power and thus price stability in the euro area.

Nevertheless, the tool proposed in the paper can be easily used for transforming any other SDMX dataset into Linked Data.

On the one hand, statistical data are rich sources of knowledge that are currently underexploited. Any new approach is welcome, and this paper describes an effective workflow for transforming SDMX collections into Linked Data. On the other hand, the approach is technically sound. It describes a simple but effective solution based on well-known tools, guaranteeing robustness and making its integration into existing environments easy. Thus, the workflow is a contribution by itself, and each stage describes how it impacts the final dataset configuration.

With respect to the obtained datasets, these are clearly described in Sections 7 and 8. They reuse well-known vocabularies and provide interesting interlinkage among themselves, but also with DBpedia, World Bank, Transparency International and EUNIS. Apache Jena TDB is used to load the RDF and Apache Jena Fuseki is used to run the SPARQL endpoint. The datasets are also released as RDF dumps (referenced from the Data Hub).
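
As a rough sketch of that publication setup (the TDB directory, dump file name, and service name are assumptions, not the authors' exact commands), loading a dump into Jena TDB and serving it with Fuseki could be scripted from Python as follows:

    import subprocess

    TDB_DIR = "tdb-oecd"   # assumed TDB database directory
    DUMP = "oecd.nt"       # assumed RDF dump produced by the transformation

    # Load the dump into a TDB store (tdbloader ships with Apache Jena).
    subprocess.run(["tdbloader", "--loc=" + TDB_DIR, DUMP], check=True)

    # Serve the store over SPARQL with Fuseki; the /oecd service name is made up.
    subprocess.run(["fuseki-server", "--loc=" + TDB_DIR, "/oecd"], check=True)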

Finally, it is relevant for me how scalability problems are addressed, because I think that 12 GB is an excessive amount of memory for processing these datasets (the largest one outputs fewer than 250 million triples). Do you have an alternative for processing larger datasets? Maybe you could partition the original dataset into some fragments: is the tool flexible enough to support this? Please explain how scalability issues will be addressed to guarantee that big datasets can be transformed effectively.

Review #3
Anonymous submitted on 16/Aug/2013
Suggestion:
Accept
Review Comment:

This paper describes the Linked Data datasets, and the process used to generate them, for a set of SDMX-enabled datasets coming from relevant international and country-focused statistical organisations (some work has even been done during the review process on an additional IMF dataset).

The paper clearly fits the special issue call, the datasets are of good quality, and they are made available under open licenses in Linked Data format. I particularly like the extensive use of the Data Cube vocabulary and PROV-O, and the approach taken for versioning of code lists and concept schemes.

The comments provided in this review are mostly curiosities about some design decisions or requests to make things more clear:
- First of all, a large part of the paper is about the process used to generate the Linked Data datasets, but not so much about the datasets themselves or about problems in the original SDMX data, which certainly still exist in current SDMX implementations.
- One example is related to the handling of some NL descriptions of codes in code lists. These are simply ignored. Could they be handled differently? How many of the datasets present this situation?
- URI patterns: I find them nicely designed and sensible. However, I have some questions: how do you order dimensions in observations? Are you using the order of the dimensions in the data cube structures? Why, in the owl:Class part, do you only use the codelistID instead of also considering conceptIDs?
- The part on the interlinking of SDMX annotations should be better explained, probably with an example that illustrates how it works.
- The interlinking of datasets should be better explained as well. Are all found links correct? I have found many cases of compatible code lists that are produced as separate code lists by statistical offices and hence in principle are not linked, but could be. Have you found those cases?


Comments

We would like to thank all reviewers for their time and valuable feedback. It is much appreciated.

I've tried to address some of your questions and comments below, and have made changes for the camera-ready version.

  • Reviewer 3:
    Problems with the original SDMX data are explained in terms of what the Linked SDMX templates do to handle or work around the shortcomings, as opposed to correcting them - which would be more appropriate to take care of at the source. For instance, the configuration option "omitComponents" allows the administrator to skip over any cube components in the datasets, e.g., malformed codes or erroneous data that they are able to identify, or simply to leave out a component for their own reasons.

    Some normalization is done, e.g., whitespace is removed from code values so that they are safe to use in URIs, and reference period values are converted to British reference period URIs.

    Missing values, e.g., a human-readable dataset title, are left as such, falling back to the dataset code. If license information is not provided, it can be explicitly set in the configuration file.

    In order to use agency identifiers consistently, the configuration provides a way to add aliases for the primary agency identifier.

    In SDMX 2.0, extensive descriptions for code lists are not provided. All of the datasets that were transformed used SDMX 2.0. AFAIK, SDMX 2.1 resolves this shortcoming, and some of the SDMX publishers are adapting to 2.1.

    The order of dimension values (e.g., as a path) in the observation URI is based on the order in the dataset, i.e., it does not reflect the DSD's order. While the DSD makes the call on the order of the dimensions (qb:order is specified in the DSD in any case), the order of the terms in the URI is not too important. I do, however, think that it may be "okay" to reorder based on what the DSD says (overriding the order in the dataset); I'll revisit this for the templates. A sketch at the end of this response illustrates how such an observation URI is composed.

    For code list classes, only the codelistID is used, simply to follow the same pattern as in sdmx-code.ttl, i.e., a code list has a seeAlso link to a class (which is of the same nature, i.e., referring to the code list as opposed to the code). But yes, the code list could be a super-class of the class that is used for the code.

    The annotations example was omitted because it is slightly too extensive for the paper (too bad that there are artificial limits set for "papers" to share knowledge). If you are interested, there is an example at https://github.com/csarven/linked-sdmx/wiki#interlinking-sdmx-annotations.

    The interlinks that are retained at the end have passed through a human review process. They are correct in the sense that a concept like a reference area (e.g., a country) from two different datasets uses the same code notation and label. In other words, temporal aspects are ignored - not to mention that they are not available in the SDMX sources (AFAICT). This of course means that they are subject to being wrong, since the country concept that is used in source A may be a "former" country that is mentioned in source B. While such interlinking is critical, it requires better interlinking methods and tools.
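
    To make the URI pattern discussion above concrete, here is a minimal Python sketch of composing an observation URI from dimension values in dataset order, with whitespace stripped from code values. The base URI, dataset code, and dimension values are made-up illustrations; the actual templates implement this in XSLT.

        import re

        BASE = "http://example.org/dataset/"  # assumed base URI for illustration

        def safe_code(value):
            # Remove whitespace from code values so they are URI-safe.
            return re.sub(r"\s+", "", value)

        def observation_uri(dataset_code, dimension_values):
            # Dimension values are joined in the order they appear in the dataset,
            # not in the order declared in the DSD (cf. the discussion above).
            path = "/".join(safe_code(v) for v in dimension_values)
            return BASE + dataset_code + "/" + path

        # Hypothetical example: dataset "UN_DEN" with dimensions (reference area, period)
        print(observation_uri("UN_DEN", ["AUS", "2010"]))
        # -> http://example.org/dataset/UN_DEN/AUS/2010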

  • Reviewer 1:
    The current state of the paper is 8 pages (which is within the call's requirement of 5-8 pages).
  • Reviewer 2:
    The amount of memory (12 GB) that was used for the transformation was merely what was dedicated to the process itself. Based on tests, the minimum amount of memory that was actually required was 4 GB.

    The triple counts of the datasets reflect the total number of triples after all the transformations for each dataset. Basically, multiple SDMX-ML files are transformed from each source (a batch loop over such source files is sketched below). The largest dataset (SDMX-ML) was in fact 6.6 GB, which is plenty for the Linked SDMX XSLT to handle - http://csarven.ca/linked-sdmx-data provides a bit more information on the data retrieval process.
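
    For illustration only (the directory layout, stylesheet name, jar path, and 4 GB heap are assumptions based on the figures above, not the authors' exact setup), transforming a directory of SDMX-ML files one by one could be scripted as follows:

        import pathlib
        import subprocess

        SOURCE_DIR = pathlib.Path("sdmx-ml")   # assumed directory of retrieved SDMX-ML files
        OUTPUT_DIR = pathlib.Path("rdf")
        OUTPUT_DIR.mkdir(exist_ok=True)

        for source in sorted(SOURCE_DIR.glob("*.xml")):
            output = OUTPUT_DIR / (source.stem + ".rdf")
            # -Xmx4g reflects the ~4 GB minimum mentioned above; adjust per file size.
            subprocess.run(
                ["java", "-Xmx4g", "-jar", "saxon9he.jar",
                 "-s:" + str(source), "-xsl:generic.xsl", "-o:" + str(output)],
                check=True,
            )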

  • Thank you all again.