Towards high quality data catalogues: addressing exploitability.

Tracking #: 1258-2470

Authors: 
Enrico Daga
Alessandro Adamou
Mathieu d’Aquin
Enrico Motta

Responsible editor: 
Guest Editors Quality Management of Semantic Web Assets

Submission type: 
Full Paper
Abstract: 
Data quality is broadly defined as "fitness for use", and the growing number of systems generally categorised as "Data Catalogues" are expected to support this notion by enabling data discoverability. As such, one aspect that strongly relates to the quality and completeness of a Data Catalogue is the one we refer to as exploitability: the compatibility between the policies of the provided datasets and the task at hand. However, the current practice in Data Hubs and Data Stores is for their Data Catalogues to simply provide a link to the text of the licence associated with the original data, in its original data source. This is insufficient to effectively support exploitability, since it requires the data consumer to trace back the processing that might have been applied to the data, to manually assess how much it might have affected the policies described in the licences, and to finally check that these policies match the intended use. In this article we argue that a high quality data catalogue can better address exploitability by also considering the way policies propagate across the data flows applied in the system. We propose a methodology to deploy an end-to-end solution centred on a Data Catalogue to support the machine-processable representation of data policies and of the data flows in the system, to enable the propagation and validation of these data policies so as to deliver them as exploitability information alongside the data itself.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
Anonymous submitted on 30/Dec/2015
Suggestion:
Reject
Review Comment:

The paper argues for the use of machine-processable data policy information as a way to make it easier for data consumers to assess the fitness for use of the underlying data. The work has been carried out in the context of the MK Data Hub, a smart cities data sharing platform. Parts of the paper are very good and could be the starting point for a new submission. However, as it stands, the paper has several severe limitations which make it as yet unsuitable for publication.

First, there is the general narrative: the authors start from data quality, move to data repositories and their features, then go back to a particular quality dimension around licences and policies, which can be facilitated by a particular machine-driven technique for processing policy information, which is in turn part of a broader methodology for data governance, whose design and application in MK:Smart occupy the largest part of the manuscript. The authors need to simplify the narrative, decide what they are 'selling' and adjust the paper accordingly. If their main contribution is a methodology, then the next question they need to answer is who their paper is helping: other researchers working on data governance; MK:Smart data providers; MK:Smart data consumers; others? The application of the methodology in the MK:Smart context allows the authors to reflect upon the assumptions made at design time. The summary in Section 5 is the most interesting part of the paper, but it is still difficult to see the implications of the work and to assess its value.
Second, I have concerns about the actual contribution. The methodology is rather prescriptive in terms of the technologies used. While this is not a fault in itself, it is not entirely clear how the choice of technologies would affect the application of the general framework in MK:Smart and elsewhere. The related work has the same technical bias, focusing on policy languages, provenance ontologies, and reasoning over policies, while only briefly touching on data governance as an area, which the authors refer to as Metadata Supply Chain Management, and without surveying industrial proposals in this space (such proposals are available from big IT vendors, specialised vendors, and sector-specific players).

The methodology has not been evaluated. Its application in the smart cities project gives the authors the opportunity to assess the validity of their design assumptions, though Section 5 often seems to imply that the extent to which a design feature applies is more a matter of the current status of the platform than a principled matter. The usability of the different components of the data governance framework should be evaluated, and a discussion is needed in the broader context of data catalogues (open, shared, closed, etc.).

One aspect which I believe would be interesting to discuss further is the quality of the metadata, including licensing information. The article seems to suggest that by following the methodology this problem will disappear. Yet experience from almost any data portal, from portal aggregators (e.g., http://www.europeandataportal.eu/ or the Open Data Monitor project http://opendatamonitor.eu/), and from surveys (e.g., Open Data Index, Open Data Barometer) tells a different story. Licence information is often missing, among other metadata attributes. It is not at all clear that people would follow the proposed methodology.
Third, I am not convinced the special issue (or the journal) are the best venue for the work, especially with the current narrative.
Finally, the content needs to be proof-read and polished. The paper reads as if different sections had been written by different people, possibly for different audiences. This is most obvious in Sections 1 and 2; after that the text flows better.

Review #2
By Rob Brennan submitted on 26/Jan/2016
Suggestion:
Reject
Review Comment:

Overall assessment
This paper has a number of interesting features, such as a real deployment, the application of the Datanode ontology, and licence policy reasoning. Unfortunately, most of these aspects are discussed more fully in referenced papers rather than here. This paper focuses on an outline methodology to support quality improvement of datasets by ensuring that the appropriate provenance metadata is collected. It is not clear to me how this improves on the simple method of collecting provenance on all data processing activities. There is also a stated goal of enabling automated exploitability assessment for users of the datasets, but, as the conclusion itself describes, applying this to actual legal decision making is unlikely: "However, while this [exploitability] assessment is part of an early analysis, when the user wants to assess whether a given dataset is eligible to be adopted, we expect this assessment to be performed manually, on a case by case basis.".

Hence, despite some good work, the twin problems of this paper are that it sits at the edge of the planned scope for the special issue (metadata quality enhancement) and that the contribution of the work described here is limited (as opposed to the wider scope of the project it describes, which is very interesting). This is compounded by the patchy presentation of the paper, with many typos and a lack of a clear focus (data catalogue vs. data hub) or logical flow in some sections.

(1) originality

The specific deployment scenario described in this paper is original. However, the methodology contains many elements common to data quality/data lifecycle systems, policy-based management systems and automated licence processing. Only the last of these is adequately covered in the paper's related work section, and I provide some links below on the other topics. Nonetheless, these are all open areas of research, and so more work like that described here is welcome. The advances in this paper seem incremental compared with the other papers being published by the team of authors.

(2) significance of the results

The research questions tackled by the paper are problematic, or at least their true value has not been made clear to me. Automatically making exploitability decisions seems to be a focus, but, as the conclusion makes clear, if this is a legal decision made by a consumer it is more realistic that the techniques will support human decision-making rather than supplant it. The earlier parts of the paper would be stronger if they emphasised this rather than the idealised case of automated decisions, especially given the lack of a trust infrastructure between the consumer and the data hub, which would seem to be a basic requirement for any distributed decision-making.

Deciding what provenance metadata to capture or present to the consumer in order to support their exploitability decisions is another open question: it is not demonstrated how this is superior to simply capturing all data processing steps and making them available for the end user to query.

Finally, the metadata value chain architecture/methodology has a lot in common with many data quality lifecycle models, which exhibit huge diversity (see for example Data Life Cycle Models and Concepts, CEOS.WGISS.DSIG.TN01, Issue 1.0, September 2011), but it is not directly motivated. When the MK Data Hub is used as an example of the methodology, it is not compelling as a validation, since it seems that the two were developed by the same team for the same use case.

It is hard to evaluate the significance of the results when no real evaluation of the methodology is provided. In Section 4 the MK Data Hub is presented as a use case for the methodology, but very little analysis or explanation of the technical underpinnings is provided; instead, the section reads at a use-case level. In Section 5 good work is done to identify a number of assumptions built into the methodology (although at times the analysis again drifts towards the MK Data Hub rather than the methodology), but in many cases I have issues with the conclusions of even this lightweight evaluation. For example:

==Assumption 1.1==
"While we do not support complex policies at the moment, we could deal with it by user profiling (with a commercial or non commercial account), or by including a taxonomy of usage contexts to consider separately, thus obtaining multiple policy sets depending on the usage context."

Would this not increase the complexity of the system, and has the trade-off been analysed? An implication is that not all policy information is then included in the policy model (instead it is split between the policy model and the usage contexts); would the PPP Reasoner have access to this? I think this makes it likely that implications of rules that cross contexts would be missed by the reasoner.

Assumptions 1.2 and 3.3 make a similar assumption about the ease of splitting the problem this way (and implicitly depriving the reasoner of knowledge). Of course, the exact implications are hard to evaluate, since the specification of the reasoner is out of the scope of the paper. In the end this makes it hard for me to be convinced that Assumption 4.1 is satisfied without some evidence.

==Assumption 1.4==
Your discussion here is confusing to me because, on the one hand, you say that ODRL can support non-binary relations (violating your assumption) and, on the other, you say that as far as you have seen the binary assumption is sufficient. Since your PPR are based on ODRL, why is this the correct assumption to make?

(3) quality of writing.

The overall structure and presentation of the paper is good.

There is a recurring confusion in the text as to whether the paper is about data catalogues or data sharing platforms (i.e., the MK Data Hub) that have a broader scope. This leads to statements like (Section 1):
"It is clear however that, as the number and diversity of the datasets they need to handle is growing, there is a need for these systems to play a further role in fully supporting the delivery and reuse of datasets."
Is this a bit strong for a simple data catalogue? Doesn't this depend on the resources available and the role of the catalogue provider? What about the end-to-end principle of the Internet, which favours placing service intelligence at the edge rather than in the middle of the network?

Unfortunately there are a large number of typographic errors, see below for details.

There are a large number of RDF fragments included. I am not sure how much most of these add to the paper, as they use space that could otherwise be used for discussion of the details of the system. For example, in Section 4.1, Listing 1 would be much easier to read if prefixes were used rather than full URIs for all the terms.
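For illustration, a prefixed version of such a description might look like the following (a minimal sketch; the resource names are invented here, not taken from the paper):

    @prefix dcat: <http://www.w3.org/ns/dcat#> .
    @prefix dct:  <http://purl.org/dc/terms/> .
    @prefix ex:   <http://example.org/datahub/> .

    # One dataset, one distribution, with a link to the licence text
    ex:weather-dataset a dcat:Dataset ;
        dct:title "Weather observations" ;
        dcat:distribution ex:weather-csv .

    ex:weather-csv a dcat:Distribution ;
        dct:license <http://creativecommons.org/licenses/by/4.0/> .

Three prefix declarations buy compact, scannable listings; the same applies to the other RDF fragments in the paper.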

The paper is not specific in a number of places where further elaboration would be useful. For example, Section 4.1 states "Moreover, the policies include a peculiar attribution requirement (Listing 4).". It would be useful to explain how it is peculiar rather than just stating this as a fact and expecting the reader to interpret the listing. E.g., how have you quantified how unusual it is? Do you just mean that it doesn't map well to ODRL? Areas where further clarification would be desirable are listed below under "minor comments".

Section 4.4 needs more detail on how policy propagation is actually performed: limitations, advantages, etc. If this were done in the context of the specific use case, it would be a strong addition to the paper.

(4) Relevance to Call

The paper argues (Section 1) that "maximising the exploitability of data is an issue of quality of the catalogue itself". In my opinion, this makes the paper fairly peripheral to the SI call, in that it does not deal with data quality itself, but rather with the quality of the metadata, or, even more removed, the service that provides the data.

=Minor Comments=

= Section 2 =

Policy Reasoning -
"specific forms of policy compatibility assessment are also found in fields whose primary focus is tasks rather than data, as in workflow modelling for task delegation"

What about network management, where data access is a prime concern? See for example: S. Davy, Harnessing Information Models and Ontologies for Policy Conflict Analysis, http://repository.wit.ie/1059/1/2008_SDavy_Thesis_final_v2.pdf

= Section 3 =
"Our methodology follows the Data life-cycle, which comprises four phases"
Surely it is only one of many possible life-cycles? Is this life-cycle central to your methodology?
Does this limit the applicability of your results? You should discuss these points.
It also implies, to me, a confusion between a catalogue (which does not necessarily care about data lifecycles and a data hub which does.

"Processing: data are processed, manipulated and analysed in order to generate a new dataset, targeted to support some data-relying task"

Q: Why does the data hub do this processing; couldn't it happen at the client or the provider? How does the hub know what the client wants? It seems like a very centralised approach. This should be documented as a limitation.

= Section 3.2 =

"This activity can be rather complex, including automatic and supervised methods, and going into
the details of it is out of scope for this article. What is important for us is that this phase should provide a sufficient amount of metadata in order to support data processing."

Q: This seems like a hard requirement to meet, since it is very lightly specified? I think more detail is needed to scope things here.

= Section 3.4 =

"The exploitability task is indeed reduced to the assessment
of the compatibility between the actions performed by
the user’s application and the policies attached to the
datasets, with an approach similar to the one presented
in [16], for example using the SPIN-DLE reasoner described in [21]"

Q: But without trust between the consumer and provider how can this be done?

= Section 4 =

"Our hypothesis is that an end-to-end solution for exploitablity assessment can be developed by using stateof-the-art Semantic Web technologies."

Typically, the possibility of developing a system is not a strong hypothesis, since, given sufficient time and resources, the flexibility of IT systems means that "something" can be developed. Hence it would be better to reformulate your hypothesis in terms of the limits, extents or desirable properties of such a system.

= Section 5 =
==Assumption 2.1==
Q: Should this assumption be changed to state that "Content metadata appropriate for ETL generated from the data source is available"? This seems to be what you actually need, rather than access to the data itself.

Assumption 3.3: It is not clear how licence changes are handled, i.e., what happens when a dataset's licence changes but the dataset itself does not: does the ETL need to be run again? This needs to be made clearer.

==Assumption 4.2==
"The user's task need to be expressible in terms of ODRL policies, thus enabling reasoning on policies compatibility"
Q: Should this not be documented as a new assumption?
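For what it is worth, expressing a user's task in ODRL could plausibly look like the following sketch (hypothetical; built on the ODRL 2 vocabulary, with invented resource names):

    @prefix odrl: <http://www.w3.org/ns/odrl/2/> .
    @prefix ex:   <http://example.org/app/> .

    # The consumer's task restated as a set of requested actions
    ex:task-policy a odrl:Set ;
        odrl:permission [
            odrl:target ex:weather-dataset ;
            odrl:action odrl:reproduce , odrl:distribute
        ] .

The exploitability check would then reduce to verifying that every requested action is permitted, and none prohibited, by the policy set propagated to the target dataset. Spelling this out in the paper would make Assumption 4.2 much easier to assess.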

=Typos and English Improvements=

= Sec 1 =
Typo: " applyed" -> " applied"

= Sec 3 =

typo: "to what extend" -> "to what extent"
typo: "In this Section" -> "in this section"
typos: "including Air quality and Soil moisture" -> "including _a_ir quality and _s_oil moisture"
typo: "given geospatials coordinates" -> "given geospatial_ coordinates"

Example of poor readability, re-phrase: "The aforementioned ward (see Figure 3 for some example data) and museum in Milton Keynes are examples of named entities the ECAPI may be queried for; but also, an arbitrary geographical area within a fixed radius of given geospatials coordinates (e.g. 51.998,-0.7436 in decimal degrees) could be an entity for an application to try to get information about (see Figure 4 for example data)."

= Sec 4.1 =
typos: "Air Quality and Moisture Sensors" -> "_a_ir _q_uality and _m_oisture _s_ensors"

= Sec 4.2 =
typo: "to supporting the data processing" -> "to support_ the data processing"
Figs 3 + 4: Not readable, due to both the font sizes and the colour schemes used

= Sec 4.3 =
typo:@ "b) a description on the process capable of" -> "b) a description o_f_ the process capable of"
typo: "that, as SPARQL query remodels" -> "that, a_ SPARQL query remodels"

= Sec 4.1 =
Typo: "UK Food Estanblishments" -> "UK Food Esta_blishments"

= Sec 5 =
==Assumption 2.2==
typo: "multiple datasets, these cases" -> "multiple datasets, th_i_s_ case_"
typo: "The need of setting up Dataflow" -> "The need _t_o_ set____ up Dataflow"

==Assumption 3.3==
typo: "Process executions do not influence policies propagation." -> "Process executions do not influence polic_y_ propagation."
typo: "hipothetical" -> "h_y_pothetical"

= Conclusions =
typo: "b) a description on the process" -> "b) a description o_f_ the process"
typo: "metadata-relying tasks" -> "metadata-rel_iant_ tasks"

End

Review #3
By Ciro Baron submitted on 26/Jan/2016
Suggestion:
Reject
Review Comment:

The paper describes a framework to support the exploitability of data while respecting rights and policies. The paper is well structured and the formalisation of the problem is clear.

However, I believe the authors submitted the paper to the wrong track. It is not possible to evaluate the approach as a full paper, since there is no "results" section. I would suggest submitting it to the application or system track. That said, the paper has good potential and explores an issue which has credibility: any kind of data hub should provide licence metadata in a uniform way.

Nevertheless, I have some minor questions about the approach.
DCAT provides descriptions for datasets with multiple distributions, yet Listing 1 shows a single distribution with a single licence. Assumption 1.1 holds that a single licence is provided by the licensor; however, this might not always be true. Different distributions do not necessarily contain the same data, so the licence might not be the same.
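Concretely, a description along the following lines is perfectly legal DCAT (a hypothetical sketch; the resources and licence choices are invented for illustration):

    @prefix dcat: <http://www.w3.org/ns/dcat#> .
    @prefix dct:  <http://purl.org/dc/terms/> .
    @prefix ex:   <http://example.org/datahub/> .

    # One dataset, two distributions, each under a different licence
    ex:sensor-dataset a dcat:Dataset ;
        dcat:distribution ex:sensor-sample-csv , ex:sensor-full-api .

    ex:sensor-sample-csv a dcat:Distribution ;
        dct:license <http://creativecommons.org/licenses/by/4.0/> .

    ex:sensor-full-api a dcat:Distribution ;
        dct:license <http://creativecommons.org/licenses/by-nc/4.0/> .

The authors should clarify how their model behaves in this case.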
It would be of great value to add more details of the schema or the ontology/vocabulary used to create the metadata.
I suggest the authors look at the DBpedia DataID Unit vocabulary, which reuses most of the standards presented in the paper.
About the paper structure: before submitting to the right track, I suggest removing Fig 3 and Fig 4; they are not relevant.
Maybe an online demo would be relevant, although I understand that it is a specialised framework (mainly used by data catalogue administrators).

In conclusion, the paper has potential, but I definitely believe that it was submitted to the wrong track.

Review #4
Anonymous submitted on 28/Jan/2016
Suggestion:
Reject
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

The paper describes a methodology towards achieving high exploitability of data catalogues. The authors define exploitability as "the ability to determine which policies a unit of data is subject to, and their compatibility with the intended usage by a consumer", and regard this as a key feature of data catalogue quality. The idea is quite interesting and has the potential of adding value to existing data catalogues, allowing data consumers to be more aware of data licence issues when making use of public datasets. However, the paper is very confusingly presented, with very limited visible research contributions. With regret, the reviewer thinks the submission is premature as a journal publication, for the following reasons:
- The research contribution is very weak. A methodology was proposed, but it was poorly presented and hardly positioned within the appropriate range of related work, i.e., other methodologies.
- There was no formal evaluation of the work, apart from a case study.
- The paper is badly organised and presented.

Major issues

1. The definition of exploitability

It is a term that can be fuzzy depending on who it is presented to, and it reflects a very contextual and subjective feature of a dataset. There are several intertwined conceptualisations right at the beginning of the manuscript, which put under debate both the relevance of this submission to this special issue and its core objectives.

1. If the authors are trying to study the "quality of the terms/licences" that can be provided in their data catalogue, then this is well within the scope of the special issue, and it can be realistically evaluated, building upon the existing practical work.
2. Or do the authors indeed try to maximise “the ability to determine which policies a unit of data is subject to, and their compatibility with the intended usage by a consumer”?

Two different goals emerge from the introductory sections, and neither of them was supported by sufficient evidence showing how well the proposed approach achieves it, nor clearly reflected in the methodology actually described.

For a future resubmission, the authors should consider 1) providing an explicit definition of exploitability much earlier in the manuscript, accompanied by some simple examples; 2) stating more clearly the objectives and research contributions of the work, which would help people understand what exactly is achieved; and 3) providing some concrete quantitative or qualitative evaluations.

2. The methodology

It is very hard to figure out the unique features of the methodology for enabling so-called exploitability.

The first page of the methodology section was quite confusing to read. How is it related to exploitability? It is supposed to give an overview of what the methodology is about or able to achieve, and it fails to do that.

Reading through each sub-section did not reveal much more, apart from the fact that some licence metadata can now be attached. These sections also lack clear and strong scientific justification of why things were designed this way. How do we expect the exploitability metadata to be acquired (Section 3.1)? What are PPR (Policy Propagation Rules), and what is their role in all of this?
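My own guess, based on the referenced Datanode work rather than on this paper (so this encoding is an assumption, with invented rule vocabulary), is that a PPR simply pairs a policy with a dataset-to-dataset relation along which the policy survives:

    @prefix odrl: <http://www.w3.org/ns/odrl/2/> .
    @prefix dn:   <http://purl.org/datanode/ns/> .
    @prefix ex:   <http://example.org/ppr/> .

    # Hypothetical encoding: the attribution duty propagates from a
    # dataset to any of its copies. ex:PolicyPropagationRule and its
    # two properties are invented names, not taken from the paper.
    ex:rule-1 a ex:PolicyPropagationRule ;
        ex:policy odrl:attribute ;
        ex:propagatesAlong dn:hasCopy .

If that reading is correct, it should be stated explicitly, with a worked example, in Section 3.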

The datasets and the diversity of licences/policies shown in the case study are quite limited, which may explain why no evaluations were provided.

Generally speaking, there has clearly been some interesting practical work, as shown in Section 4. However, the methodology has not been sufficiently justified, either by scientific literature or by a known design methodology, and it is supported only by a very limited case study.

The work would require more than a major revision. The authors need to think more carefully about the objective of the work and the actual contributions that they can draw from the practical work. The upcoming special issue on "The Semantic Web and Linked Data: Security, Privacy and Policy" might be a more appropriate venue, if I have understood the work properly.