Review Comment:
The submitted manuscript analyses a number of works in the area of semantic modelling for facilitating data analysis. It takes a holistic approach, considering the whole pipeline of data analysis. This approach enables the authors to identify existing gaps in semantic modelling and to propose possible ways to fill those gaps.
Overall, I liked this paper, but it has some limitations. The strength of the reported work lies in considering the data analysis process as a whole, starting from understanding user tasks and data extraction and finishing with the presentation and interpretation of the results. Taking user tasks into consideration is of particular interest and importance. It is good to see the breadth of the analysed works, but it still seems insufficient for drawing useful and accurate conclusions. The methodology looks reasonable, but unfortunately it did not lead to credible conclusions. Therefore, I suggest reconsidering the choice of papers to analyse.
The major limitation is that key works on the representation of data analysis processes are missing. I assume this is due to the selection of the three online databases and the focus on computer science papers. Unfortunately, computer science journals generally have lower impact factors than, for example, biomedical journals. As a result, data scientists tend to publish semantic modelling and data analytics works in application journals (relevant to the areas where such techniques were applied) and not in computer science ones. Unfortunately, some of the works published in application journals are not freely available and were therefore excluded from the analysis. But there are works highly relevant to the analysis reported in this paper that are published as open access. I strongly recommend considering the following highly relevant works:
- EXPOSE, an ontology for recording ML studies and comparing them within OpenML;
- DMOP, an ontology that focuses on meta-learning and has an excellent classification of data mining learners;
- OntoDM, a very generic ontology of data mining;
- OntoKDD, which models the whole process of knowledge discovery from databases;
- OntoDT, which models data types and recommends which data mining algorithms are suitable for the analysis of given datasets;
- and some others.
I have to declare that I am a co-author of some of these works. Normally I would not refer to my own works in reviews, but in this case it is unavoidable.
Another problem for me is the inclusion of outdated works. For example, GALEN is superseded by OGMS, and OpenCube is superseded by OntoDT. Given these gaps in the analysis reported in the submitted manuscript, I presume that there are other gaps in the analysis of areas where I am not an expert, such as business analytics. The authors need to carefully rethink the process of selecting papers for their analysis. I believe that the wrong choice of papers led to dubious conclusions and recommendations:
- “it is essential to promote the reuse of ontologies”. I completely agree with the authors. But the paper has to reflect the ongoing efforts in this direction. For example, the OBO (Open Biomedical Ontologies) Foundry (www.obofoundry.org/) recommends re-using classes already defined in other ontologies.
- ‘No clear separation between concepts classes’
This subsection shows a lack of understanding of best practices in the design of ontologies and of how they can be used. “One ontology with a unique URI may contain concepts related to one or more concept classes”. No true ontologist would use such expressions as ‘concept classes’ and ‘concept categories’. There are classes (= categories, types, etc.) and instances. Each of them should have a unique URI.
There is no need to have separate ontologies, as the authors suggest. Many top-quality ontologies employ a modular approach, and many use branches. This fully solves the problem.
- “4 out of 29 made their semantic models available”.
This is outdated. Nowadays, the majority of ontologies are available, as it is no longer acceptable to publish ontology works without making the ontologies publicly available.
Minor issues:
p. 2. “Ontologies provide a representation of knowledge and the relationship between concepts” In ontology engineering, ‘relationship between concepts’ is considered to be knowledge (along with rules and axioms), and therefore the usage of ‘and’ is wrong.
p. 2. “solution composed of multiple tasks”. Normally solutions are not composed of tasks, unless the required solution is to identify tasks.
p. 5. “sorted papers into classification schema..” This makes no sense. Perhaps “sorted papers in accordance with a classification schema...”?
p. 5. “Analytics concepts can be used as a vocabulary for analytic operations and attributes” – Does not make sense to me.
p. 10. “Validation of the analytic process involves storing, verifying and managing related artifacts”
This is a very strange definition. Where did it come from?