Findable and Reusable Workflow Data Products: A Genomic Workflow Case Study

Tracking #: 2135-3348

Alban Gaignard
Hala Skaf-Molli
Khalid Belhajjame

Responsible editor: 
Guest Editors Semantic E-Science 2018

Submission type: 
Full Paper

Abstract:
While workflow systems have improved the repeatability of scientific experiments, the value of the processed (intermediate) data has been overlooked so far. In this paper, we argue that the intermediate data products of workflow executions should be seen as first-class objects that need to be curated and published. Not only can this save the time and resources needed to re-execute workflows, but, more importantly, it can improve the reuse of data products by the same or peer scientists in the context of new hypotheses and experiments. To assist curators in annotating (intermediate) workflow data, we exploit in this work multiple sources of information, namely: i) the provenance information captured by the workflow system, and ii) domain annotations that are provided by tool registries, such as Bio.Tools. Furthermore, we show, on a concrete bioinformatics scenario, how summarisation techniques can be used to reduce the machine-generated provenance information of such data products into concise human- and machine-readable annotations.
Decision: Minor Revision

Solicited Reviews:
Review #1
By Tomi Kauppinen submitted on 15/Apr/2019
Review Comment:

This is a very good piece of work, and in my opinion publishable as it is. In other words, and by using the dimensions for research contributions:

(1) originality

The authors explain in detail the difference between their work and the state of the art. From that discussion, from the paper itself, and from what I have seen in other projects, this work provides an original and unique contribution.

(2) significance of the results

The significance of this work lies in its rather straightforward, yet powerful, method to feed other systems (e.g. the mentioned [10]) with provenance summaries.

(3) quality of writing.

The paper is very well written and easy to follow. The authors explain their approach in detail and provide good examples.

Review #2
By Monika Akbar submitted on 14/May/2019
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

The paper presents a new approach to making intermediate workflow data findable and reusable. The authors augment PROV-O with EDAM information to identify which tools were used to generate the data. The information about the tool used also adds another layer of metadata, namely what the tool does. This information is used to automatically generate short summaries of the generated data. In terms of originality, the proposed framework is an extension of existing standards and tools. However, the potential impact of this work can be significant, given that the scientific community is looking for ways to reuse the vast amount of scientific data already available. Autogenerated sentence-based summaries and diagrams are going to be helpful in finding and reusing such scientific data. This is a very well-written paper.
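To make the idea concrete for readers of this review: combining a PROV-style generation triple with an EDAM operation annotation is enough to derive a one-line, human-readable summary. The following is a minimal, dependency-free sketch (the `ex:` entities are hypothetical, and the EDAM label lookup is a hand-picked subset assumed here, not the paper's actual implementation):

```python
# Minimal sketch: derive sentence summaries from PROV-style provenance
# triples enriched with EDAM operation annotations (hypothetical data).
PROV = "http://www.w3.org/ns/prov#"
EDAM = "http://edamontology.org/"

triples = {
    # The intermediate data product and the activity that produced it
    ("ex:vcf_file", PROV + "wasGeneratedBy", "ex:variant_calling"),
    # Domain annotation: the activity is typed with an EDAM operation
    ("ex:variant_calling", "rdf:type", EDAM + "operation_3227"),
}

# Human-readable labels for EDAM operations (assumed subset)
edam_labels = {EDAM + "operation_3227": "Variant calling"}

def summarise(triples, edam_labels):
    """For each generated entity, name the EDAM operation of the
    activity that produced it, as a short natural-language sentence."""
    types = {s: o for s, p, o in triples if p == "rdf:type"}
    lines = []
    for s, p, o in triples:
        if p == PROV + "wasGeneratedBy":
            op = edam_labels.get(types.get(o, ""), "an unknown operation")
            lines.append(f"{s} was generated by a '{op}' step")
    return lines

print(summarise(triples, edam_labels))
```

The same joined information could just as well be serialised as RDF for the machine-readable summaries and nanopublications the paper describes; this sketch only illustrates the sentence-generation side.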

Some comments to improve the paper:
Even though the literature review is nicely done, it is not clear why current approaches for summarizing scientific workflows [33,34] would not be sufficient for summarizing intermediate workflow data. It would be particularly interesting to see how the proposed approach performs compared to the existing summarizing approaches in Table 2.

Table 2 adds valuable information in terms of the performance of the proposed approach. If a few additional datasets can be used to run a similar experiment, it would be helpful to determine if the system consistently performs well.

Some discussion and examples of how the approach supports findability would be helpful.

In page 6, section 4.1, the sentence 'Figure 1 shows a summary diagram automatically …' should be 'Figure 4'.

Page 6, Section 4: the two tables with PROV classes and properties need captions.

Page 8, 'The results shown in section 4 where obtained' should be '…were obtained'

Page 8, 'This is is demonstrated' should be 'This is demonstrated'.

Review #3
Anonymous submitted on 04/Jun/2019
Minor Revision
Review Comment:

In this paper, the authors present an approach that focuses on annotating intermediate workflow data, tackling the findability and reusability of genomic workflow data. The rationale behind this work is that intermediate data produced within a workflow can be shared and reused by other researchers to test further hypotheses, instead of wasting computational time in reproducing such data.

The paper is very well structured: it starts from the problem, describes the approach, and discusses its results. However, to me it seems that the paper currently lacks a list of requirements for running the FRESH approach:
• Availability of workflows
• Semantic tool catalogues
and so on. This list would help the reader understand what is needed to run your approach. I like the fact that you considered how this approach can be exported and used in other domains.

Regarding the results of your approach, FRESH returns both human- and machine-readable summaries, respectively as text and diagrams, and nanopublications. Have you thought of evaluating your approach? Perhaps running a user study to assess the comprehensibility of the text and diagrams? This evaluation could definitely help you understand how easy it is for researchers to find and reuse such intermediate data.
How feasible would it be for you to create a gold standard against which to compare the nanopublications returned by the algorithm? This would help estimate the quality of your approach.

I believe that the related work should be Section 3, after the motivation and problem statement sections, and before the description of the approach. The related work section gives more context and explains why such an approach is needed.

Overall, looking at the review criteria, I can say that it has:
• High originality
• Fair significance of the results
• High quality of writing

• Page 6, second column, row 24: I believe it should be figure 4 (instead of 1)
• Page 6, second column, row 34: “which highlights”
• Page 6, second column, row 35: “The diagrams show”
• There are duplicate footnotes, e.g. 2 and 6, 5 and 13. Footnote 14 contains a link that does not work.