Data journeys: knowledge representation and extraction

Tracking #: 2988-4202

Authors: 
Enrico Daga
Paul Groth

Responsible editor: 
Guest Editors Ontologies in XAI

Submission type: 
Full Paper
Abstract: 
Artificial intelligence applications are not built on single simple datasets or trained models. Instead, they are complex data science workflows involving multiple datasets, models, preparation scripts and algorithms. As these workflows increasingly underpin applications, it has become apparent that we need to be able to understand workflows comprehensively and provide explanations at higher levels of abstraction. To tackle this problem, we focus on the extraction and representation of data journeys specifically from data science code. A data journey is a multi-layered semantic representation of data processing activity linked to data science code and assets. We propose an ontology to capture the essential elements of a data journey and an approach to extract such data journeys. Using a corpus of Python notebooks from Kaggle, we show that we are able to capture high-level semantic data flow that is more compact than using the code structure itself. Furthermore, we show that introducing an intermediate knowledge graph representation outperforms models that rely only on the code itself. Finally, we reflect on the challenges and opportunities presented by computational data journeys.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 03/Apr/2022
Suggestion:
Major Revision
Review Comment:

The paper discusses an approach concerning the automatic extraction and building of data journeys.
A data journey is defined as a multi-layered semantic representation of data processing activity linked to data science code and assets.

Overall, the paper is well written and easy to follow.
However, I have several concerns about this manuscript.

First, the focus with respect to the special issue.
The aim of the special issue is to gather contributions related to the Explainable Artificial Intelligence (XAI) area.
It was quite hard to contextualize the approach presented in the paper with respect to the XAI domain.
It was also hard to keep track of this throughout the manuscript.
For example, the word "explanation" (or similar) is mentioned only twice: once at the beginning of the Introduction and once at the end of the Conclusions.
This could be a symptom that the authors themselves struggled to fit their idea into the XAI context.
How can the abstract representation actually support XAI?
What kind of support for transparency and interpretability is this approach able to provide?
Do the authors plan to test it?
What are the challenges they want to address?

Second, the evaluation.
The evaluation relates to a knowledge extraction and classification exercise rather than to the evaluation of an XAI approach.
This is quite clear from the introduction of Section 6.
Again, how does the evaluation procedure shown in the paper relate to the XAI research area?
What are the insights for the reader? (The obtained results are commented on, but there is no discussion of what a reader could take from them.)
I warmly invite the authors to clarify these points.

Finally, the link with the state of the art.
Among the content of Section 2, only the works mentioned in the "Workflow abstractions" paragraph are directly relevant to the aim of the manuscript.
Indeed, the proposed approach, as presented, falls squarely within the knowledge extraction and representation area.
I am aware this may be considered a harsh statement, but given that no links to the XAI research area are provided, I consider this a further limitation of the contribution.
There is also a strong focus on extracting these abstract representations from the code structure.
Is there a reason for that?
What challenges does this scenario present compared to others?

Since this is the first round of review, I would recommend a Major Revision.
However, I would invite the authors to think deeply about the suitability of this work for this special issue.
There are many open questions that should be properly resolved before this contribution can be considered for publication.

Review #2
Anonymous submitted on 09/May/2022
Suggestion:
Major Revision
Review Comment:

In this paper, the authors extract data journeys and represent them using an ontology. The authors test their claims on a corpus of notebooks from Kaggle, on which the proposed approach outperforms existing ones.

There are a few terms that are used but not concretely explained, such as "data science code" or "high-level activities".

Section 3 gives the definitions; however, it would be helpful to have concrete examples alongside those definitions.

Section 4 discusses the ontology, but there is no diagrammatic representation of it.

The authors talk about succession properties: are these properties sequences? Were they treated differently?
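To make the question concrete: is succession modeled as a plain precedence property between activities, or as an ordered collection? A minimal rdflib sketch of the two readings (all names are illustrative and not taken from the paper's ontology):

from rdflib import BNode, Graph, Namespace
from rdflib.collection import Collection

EX = Namespace("http://example.org/dj/")  # illustrative namespace, not the paper's
g = Graph()

# Reading 1: succession as a plain precedence property between activities.
g.add((EX.cleaning, EX.precedes, EX.analysis))

# Reading 2: succession as an ordered sequence (an RDF collection).
steps = BNode()
Collection(g, steps, [EX.retrieval, EX.cleaning, EX.analysis])
g.add((EX.journey1, EX.hasSteps, steps))

print(g.serialize(format="turtle"))

Stating which reading the ontology intends, and whether consumers must treat the two differently, would resolve the ambiguity.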

Section 5:

- Again, each of the subsections in Section 5 could have an example.
- Listing 1 is not in the appendix; it seems to be in the main paper.
- The pseudo-code in Listings 1 and 2 should be explained line by line.
- Page 9, line 45: can the graph be shown as an image?

Figure 1 is not clear.

In Figure 2, could the authors provide a legend?

Overall, the paper could use proofreading, as there are many typos; it feels as if it was written in a hurry. Moreover, some information is repeated: for example, the fact that the authors are extracting and representing data journeys is stated twice in the abstract alone. Also, the impact of this research work, i.e., where it could be used, is a bit hard to grasp.

Review #3
By Agnieszka Lawrynowicz submitted on 13/May/2022
Suggestion:
Major Revision
Review Comment:

The paper proposes the term "data journeys" in the service of open science, offering the possibility of explaining data science workflows at different levels of abstraction.
Developing more sophisticated abstraction models that represent code as graphs is a potentially impactful and promising area of research, especially for the transparency and interoperability of machine learning.

The paper covers several topics, such as an ontology, extraction, and machine learning to classify activity types.
However, the overall impression is that the work does not go deep into any of these topics.
Notably, the paper neglects some previous related work.
This mostly concerns:
a) data science ontologies
b) previous efforts for extracting knowledge graphs from source code

The current paper should clearly state how it differs from related efforts and be more precise about the envisaged usage scenarios and its particular focus with regard to such scenarios.

In particular,
1) regarding the formulation of the task:
What is a data journey, and what is not a data journey? How do data journeys differ from efforts to model data science workflows, such as Research Objects [21]? Mainly in the schemas/ontologies used to represent a data flow?
What problems can data journeys solve that are not solved by related past efforts? Conversely, what problems can data journeys solve better? The paper should focus more on explainability issues than on generic data science workflow representation, which has already been covered in previous works.
There is a need for better motivation of how the proposed method helps to explain the data flow or activity workflow.

2) Regarding ontologies:
Similar abstract representations of a data flow have been used for various tasks [6-10].
Could the ontologies associated with those efforts not be used to represent data journeys?

Activities include: Analysis, Cleaning, Movement, Preparation, Retrieval, Reuse, and Visualization.
Is the above list exhaustive? Can there be more activities beyond those from the Workflow Motifs Ontology (reference needed)?
Some seem to overlap (e.g., Cleaning could be considered a type of Preparation).

There is a body of work on ontologies for representing data science experiments, such as:
Ilin Tolovski, Saso Dzeroski, Pance Panov: Semantic Annotation of Predictive Modelling Experiments. DS 2020: 124-139

Gustavo Correa Publio, Diego Esteves, Agnieszka Lawrynowicz, Pance Panov, Larisa N. Soldatova, Tommaso Soru, Joaquin Vanschoren, Hamid Zafar: ML-Schema: Exposing the Semantics of Machine Learning with Schemas and Ontologies. CoRR abs/1807.05351 (2018)

Pance Panov, Larisa N. Soldatova, Saso Dzeroski: Ontology of core data mining entities. Data Min. Knowl. Discov. 28(5-6): 1222-1265 (2014)

C. Maria Keet, Agnieszka Lawrynowicz, Claudia d'Amato, Alexandros Kalousis, Phong Nguyen, Raúl Palma, Robert Stevens, Melanie Hilario: The Data Mining OPtimization Ontology. J. Web Semant. 32: 43-53 (2015)

There is even an overview:
Larisa N. Soldatova, Pance Panov, Saso Dzeroski: Ontology Engineering: From an Art to a Craft - The Case of the Data Mining Ontologies. OWLED 2015: 174-181

There also exists a code ontology:
Mattia Atzeni and Maurizio Atzori. 2017. CodeOntology: RDF-ization of source code. In International Semantic Web Conference. Springer, 20–28.

3) Regarding graph extraction:
"the feasibility of automatically generating a graph representation, anchored to the source code" evaluation objective seems to address a task, where there have been some works in the past that have already proved the feasibility of this task, e.g.:
Kun Cao and James Fairbanks. 2019. Unsupervised Construction of Knowledge Graphs From Text and Code. arXiv preprint arXiv:1908.09354 (2019).

Azanzi Jiomekong, Gaoussou Camara, and Maurice Tchuente. 2019. Extracting ontological knowledge from Java source code using Hidden Markov Models. Open Computer Science 9, 1 (2019), 181–199.

and similar recent work:
Ibrahim Abdelaziz, Julian Dolby, Jamie P. McCusker, Kavitha Srinivas: A Toolkit for Generating Code Knowledge Graphs. K-CAP 2021: 137-144

There are also machine learning approaches to summarizing code:
Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2020. A Transformer-based Approach for Source Code Summarization. arXiv preprint arXiv:2005.00653 (2020).

Therefore, the research question or hypothesis might be how the proposed approach for extracting a graph from code differs from previous efforts and how it is better, or perhaps how its purpose differs (explainability, transparency).
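To make concrete what "anchoring a graph to the source code" can look like, here is a minimal Python sketch that derives variable-level data-flow edges from a snippet using the standard ast module; it is purely illustrative and is neither the authors' method nor one of the cited toolkits:

import ast

# Illustrative sketch: derive (read variable, written variable) data-flow
# edges from a code snippet by walking its AST.
source = """
raw = load("train.csv")
clean = dropna(raw)
model = fit(clean, target)
"""

edges = []
for node in ast.walk(ast.parse(source)):
    if isinstance(node, ast.Assign) and isinstance(node.value, ast.Call):
        # variables read as call arguments on the right-hand side
        reads = {n.id for arg in node.value.args
                 for n in ast.walk(arg) if isinstance(n, ast.Name)}
        # variables written on the left-hand side
        writes = {t.id for t in node.targets if isinstance(t, ast.Name)}
        edges.extend((r, w) for r in reads for w in writes)

print(edges)  # e.g. [('raw', 'clean'), ('clean', 'model'), ('target', 'model')]

Clarifying how the paper's extraction goes beyond such straightforward AST walks would strengthen the contribution.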

4) Also, from a methodological point of view, the paper reuses the generic Workflow Motifs ontology, while the authors specifically apply it to data science workflows.
Data science workflows have some common structures, such as evaluation protocols (cross-validation, leave-one-out, etc.), and represent ML phases.
The chosen activity model may be too abstract. For instance, considering Fig. 3: how can the proposed graph be used? What are the scenarios in which such a graph adds value? It looks like a generic data science workflow, starting from data loading, then pre-processing, analysis, and finally visualization. That is correct, but what is the added value of this particular example?
Additionally, the paper (and its title) speaks of a data journey (which I imagine as a semantically annotated data flow), but in the end, is what we achieve not simply an activity flow, as in Fig. 3?

5) Other remarks:

a) Missing references when names of artefacts or methods are first mentioned:
Workflow Motifs Ontology
CodeBERTa
BERTcode

b) Question regarding activity categorization:
Why is computing tanh in :Analysis, while computing mean is in :Preparation? What was the exact criterion here?

c) Fig. 1 is unreadable.

d)
"Parameter: any data node which is not supposed to be modified by the program but is needed to tune the 22 behaviour of the process. For example, the process splits the data source into two parts, 20% for the test set 23 and 80% for the training set. 2, 20%, and 80% are all parameters."
This naming convention may be misleading in the cited domain of machine learning. In particular, this is the definition of a hyper-parameter. In contrast, parameters in machine learning are those whose values are actually changed (optimized while training on the training set).
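To illustrate the distinction in code (a minimal scikit-learn sketch; the dataset and model are arbitrary examples, not taken from the paper):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hyper-parameters: fixed by the user to tune the process, not learned.
# test_size=0.2 is the "20% for the test set" from the quoted definition.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LogisticRegression(C=1.0, max_iter=200)  # C is also a hyper-parameter

# Parameters: values the training procedure actually optimizes.
model.fit(X_train, y_train)
print(model.coef_, model.intercept_)  # learned parameters, not hyper-parameters

In ML terminology, the 2, 20%, and 80% of the quoted definition are hyper-parameters, while coef_ and intercept_ are the parameters.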

e) Typo:
"modesl"

f) Overclaim: the proposed ontology is described as rich (compared to pre-existing data science ontologies, it is relatively not).