Semantic Modeling for Engineering Data Analytic Solutions

Tracking #: 1823-3036

Madhushi Bandara
Fethi A. Rabhi

Responsible editor: 
Oscar Corcho

Submission type: 
Survey Article
Data analytic solutions are often a composition of multiple tasks, from data exploration to result presentation, that are applied in various contexts and on different data sets. Semantic modeling, based on the open-world assumption, supports flexible modeling of linked knowledge and may in turn help to tackle heterogeneity and continuously changing requirements in data analytic solutions. Hence, the objective of this paper is to review existing techniques that leverage semantic web technologies to facilitate data analytic solution engineering. We explore the application scope of those techniques, the different classes of semantic concepts they use and the role these concepts play during the analytic solution development process. To gather evidence for the study, we performed a systematic mapping study by identifying and reviewing 49 papers that incorporate semantic models in engineering data analytic solutions. One of the paper's findings is that existing models represent four classes of knowledge: domain knowledge, analytics knowledge, services and user intentions. Another finding is how this knowledge is used to enhance different tasks within the analytics process. We conclude our study by discussing limitations of the existing body of research, showcasing the potential of semantic modeling to enhance data analytic systems and discussing the possibility of leveraging ontologies for effective end-to-end data analytic solution engineering.

Major Revision

Solicited Reviews:
Review #1
By Luca Costabello submitted on 27/Apr/2018
Major Revision
Review Comment:

The authors present a survey of a large body of literature on semantic-based data engineering. The scope of the survey is ambitious, as the goal is to cover works that leverage semantic web technologies to model data analytics pipelines. The authors queried a number of academic publication repositories and retrieved a set of candidates that they manually filtered for relevance. The surveyed papers are classified along two axes: the family of concepts included in the ontologies, and the data analytics task for which each paper has been designed.
The authors describe each work according to the aforementioned axes, and derive some final remarks. The survey covers 49 studies, chosen with a structured third-party protocol. Adaptations of this protocol to the purpose of the paper are well described and detailed.
The paper is clear enough, and sufficiently well structured.

Nevertheless, there exist some shortcomings:

- Unlike other surveys listed in section 2, the authors do not narrow down the scope of the survey to a specific sub-topic. Instead, they opt for a broad topic. It is a remarkable ambition, but it becomes hard to condense 15+ years of semantic web research on improving data analytics pipelines in a journal paper.
- The survey does not focus enough on describing the research problems, and it does not clearly analyze whether the research community has delivered (e.g. which problems do these papers claim to solve? Are there any recurrent patterns? Have semantic web technologies delivered in specific niches? How?).
- I was expecting a dedicated section in the survey to support the statement "enhancements [of semantic technologies] can reduce the cost of data analytics solutions". The introduction of the paper points in that direction, which is an interesting topic to address in surveys in this domain (especially for readers that want to get started on the topic), but unfortunately the paper fails to give an overall picture of how the selected studies impact legacy pipelines, established practices, friction of adoption, cost reduction, etc.
- The authors claim as main finding the categorization based on analytics concepts and tasks. Nevertheless, such categories are chosen by the authors beforehand, hence this does not seem like an actual discovery.
- A considerable shortcoming of the survey is the lack of temporal dimension: prior art is categorized and described according to different dimensions, but the reader is not given a chronological evolution of this body of work (e.g. it would be interesting analyzing research trends over time, to see whether certain topics faded away or instead are receiving recent attention).
- In section 3.1.1 a number of other sub-questions would have added depth to the survey, e.g.: what are the added benefits of the studies over traditional techniques? What are real-world examples and proof of adoption in the surveyed papers? Are any of these tools/pipelines/frameworks publicly available on the web?
- Data sources: the authors acknowledge potential limitations of their approach, yet the bibliography does not contain articles from ISWC or ESWC main tracks (international/extended semantic web conference), which seems odd. Were ISWC|ESWC proceedings included in the data sources? The authors may also want to consider the DBLP search API [1], which offers comprehensive coverage with a single interface.

- Adding examples would help clarify: e.g. section 4.3.7: "validation of the analytics process involves storing, verifying and managing related artifacts"; sec 5.1.1: "hence to a large extent [...] who performs these tasks.", "less effective data analytics [...] degrade with time.".
- 5.1.1, description of table 2: the list of tasks where intent concepts are not used also includes "model selection" and "code generation".
- 5.1.2: why not mention also "Business understanding" and "Data extraction and transformation"? They are not associated with "analytic concept" either.
- 5.1.3: the list of ontologies available on the web should have been put into a table, with prefixes and URIs of the vocabularies.
- The paper contains typos. It requires proofreading.


Review #2
Anonymous submitted on 15/May/2018
Major Revision
Review Comment:

The submitted manuscript analyses a number of works in the area of semantic modelling for facilitating data analysis. It employs a holistic approach in considering the whole pipeline of data analysis. This approach enables authors to identify existing gaps in semantic modelling and to propose possible ways to fill in those gaps.

Overall, I liked this paper, but it has some limitations. The strength of the reported work is in considering the data analysis process as a whole, starting from understanding user tasks and data extraction and finishing with presentation and interpretation of the results. Taking user tasks into consideration is of particular interest and importance. It is good to see the breadth of analysed works, but it still seems to be insufficient to draw more useful and accurate conclusions. The methodology looks reasonable, but unfortunately it did not lead to credible conclusions. Therefore, I suggest reconsidering the choice of papers to analyse.

The major limitation is missing key works on the representation of data analysis processes. I assume it is due to the selection of the three online databases and the focus on computer science papers. Unfortunately, computer science journals generally have lower impact factors than, for example, biomedical journals. As a result, data scientists tend to publish semantic modelling and data analytic works in application journals (relevant to the areas where such techniques were applied) and not in computer science ones. Unfortunately, some of these works published in application journals are not freely available, and are therefore excluded from the analysis. But there are works highly relevant to the one reported in this paper that were published as open-access publications. I strongly recommend considering the following highly relevant works:

- EXPOSE, an ontology for recording ML studies and comparing them within OpenML;
- DMOP, this ontology is focusing on meta-learning and has an excellent classification of data mining learners;
- OntoDM is a very generic ontology of data mining;
- OntoKDD models the whole process of knowledge discovery from databases;
- OntoDT models data types and recommends what data mining algorithms are suitable for analysis of given datasets,
- And some others.

I have to declare that I am a co-author of some of these works. Normally I would not refer in reviews to my own works. But in this case, it is unavoidable.

Another problem for me is the inclusion of outdated works. For example, GALEN is superseded by OGMS, and OpenCube is superseded by OntoDT. Given these gaps in the analysis reported in the submitted manuscript, I presume that there are other gaps in the analysis of areas where I am not an expert, like business analytics. The authors need to carefully re-think the process of selecting papers for their analysis. I believe that the wrong choice of papers led to dubious conclusions and recommendations:

- “it is essential to promote the reuse of ontologies”. I completely agree with the authors. But the paper has to reflect the ongoing efforts in this. For example, the OBO (Open Biomedical Ontologies) Foundry recommends re-using classes already defined in other ontologies.
“No clear separation between concepts classes”

This subsection shows a lack of understanding of best practices in the design of ontologies and how they can be used. “One ontology with a unique URI may contain concepts related to one or more concept classes”. No true ontologist would use such expressions as ‘concept classes’ and ‘concept categories’. There are classes (= categories, types, etc.) and instances. Each of them should have a unique URI.
There is no need to have separate ontologies, as the authors suggest. Many top-quality ontologies employ a modular approach, and many use branches. This fully solves the problem.

“4 out of 29 made their semantic models available”.

This is outdated. Nowadays, the majority of ontologies are available, as it is no longer acceptable to publish ontology work without making it publicly available.

Minor issues:

p. 2. “Ontologies provide a representation of knowledge and the relationship between concepts”. In ontology engineering, ‘relationships between concepts’ are considered to be knowledge (along with rules and axioms), and therefore the usage of ‘and’ is wrong.

p. 2. “solution composed of multiple tasks”. Normally solutions are not composed of tasks. Unless a solution required is to identify tasks.

p. 5. “sorted papers into classification schema..” It makes no sense. “sorted papers in accordance with a classification schema...”?

p. 5. “Analytics concepts can be used as a vocabulary for analytic operations and attributes” – Does not make sense to me.

p. 10. “Validation of the analytic process involves storing, verifying and managing related artifacts”

It is a very strange definition. Where did it come from?

Review #3
By Ilaria Tiddi submitted on 04/Jun/2018
Minor Revision
Review Comment:

The paper presents a systematic survey on semantic-based approaches for engineering data analytics solutions (intended as combinations of tasks ranging from data exploration to visualisation and interpretation), with the idea that the flexibility of semantic technologies can support the dynamic requirements of the various tasks in the data analytics process. After the selection of approx. 50 papers, the authors present their main findings, including (1) four main types of knowledge represented in the works (domain knowledge, analytics knowledge, services and user intentions); and (2) the roles these types play in the different data analytics tasks. As a conclusion, the authors give some views on possible research areas to underpin, given the current lack of work.

In general, this is an interesting paper and it is clear that the authors put a lot of effort into it. The survey methodology is thoroughly described and motivated, which is certainly a +1. It is difficult to say whether the authors miss something in terms of comprehensiveness; however, the selection and search criteria were well defined and seem to be respected. Also, many ideas are provided to the SW community as future directions to take. In terms of clarity, there are a few typos and misformulations (see bottom), but in general the work is clear and easy to follow.

My main concern is that I am missing a bit of the main motivation. What makes "(engineering) data analytics" so particular for it to be surveyed? In other words, is "(engineering) data analytics" any different from other data processing workflows (e.g.: Knowledge Discovery or, a bit more recently, Data Science)? Is it the type of datasets, is it the type of tasks? Or is data analytics a subprocess of these? This should be clarified at the beginning of the paper.
Related to this:
- (page 2, col 1) "studies that are focusing on semantic technology applications related to data analytics" : a bit vague, maybe? (as many works not included in the survey might also be considered "data analytics solutions based on semantic technologies")
- (page 2, col 2) what are examples of application "supporting the (data analytics) process itself"? This could be rephrased
- (page 5, section 4.2.2) The Analytics Concepts include "data preprocessing" and "data mining" processes, which are 2 of the 3 core processes in knowledge discovery. Why is the third one (data post-processing/interpretation) replaced by control&data flow? (Again, explaining how data analytics is different from KD)
- (page 14) w.r.t. the identified limitations, are the authors taking any specific direction?
Is "Semantic model" equivalent to "ontology", or is it a more specific term? If so, this should be clarified too.

Second thing, re the findings. The work mentions that the four classes (domain knowledge, analytics knowledge, services and user intentions) are the main findings, but these are introduced in Section 3 as the classification scheme for the survey (also, the way these classes are derived is unclear: how do they map to Nigro’s classification exactly? And why did that one not fit?). I would not claim them as main findings anyway, but rather mention that they are used in a top-down fashion for your analysis, and that they helped in assessing the limitations of the domain analysed in your survey. An example or a clarifying picture of the six different tasks you identified might also be helpful. I would also suggest that the authors summarise the answers to their questions/sub-questions (4.1/4.2/4.3) at the beginning of the respective subsections, mostly for readability.

- (page 1, abstract) " Semantic modeling based on open world assumption support " >> "Semantic modeling based on open world assumption supports"
- (page 2, col 1) "we identify unresolved challenges exist in ..." >> "we identify unresolved challenges in …" (I would actually rephrase the whole sentence, which is odd)
- (page 3, col 1) Citation 27 is "Ristoski", not "Ristiski"
- (page 4, col 1) the synthesis of evidence >> the synthesis of evidence.
- (page 4, col 2) "Ontologies for Data Mining Process" >> "Ontologies for Data Mining Process"
- (page 6, table 1) add "." at the end of the caption