Review Comment:
## Page 1
[number] (column)
37 (left): First sentence is confusing: how is introducing changes necessary for reproduction? I also recommend clarifying the usage of "reproducibility" early on, since "consistent with the original one" does not mean "equal to the original result", does it? The term is used inconsistently across disciplines, and for readers a clear statement of how the authors understand the term is valuable, cf. https://arxiv.org/abs/1802.03311.
I applaud the stressing of re-use and extension as the ultimate goal of reproducibility.
48 (left): The connection between workflows and large-scale computations is unclear. Why is reproducibility not relevant for small workflows running on a scientist's laptop?
39 (right): Do you include hardware in "computational resources"? It becomes clear later; I suggest explaining "computational resources" early on or using simpler words.
47 (right): Use of the term "replication" is unclear (see above) - it is commonly understood as coming to the same conclusions without having any information from the author.
47 (right): "additional information" - additional compared to what? Probably explained later.
## Page 2
9 (left): Does this refer to "the" Research Objects?
21 (left): How do _virtual_ machines help with _physical_ conservation? Storage demands of virtual machines are not a strong argument when datasets grow larger and larger every year. The arguments used in favour of containerisation (lines 36 ff.) are not specific to containers but also hold for VMs. I suggest clarifying the advantages of containers over VMs (challenges of UI-based workflows? Dockerfile as recipe?).
1 (right): Others have proposed Docker images for reproducibility (cf. https://doi.org/10.1145/2723872.2723882), as also detailed in section 2. Please clarify the new contributions of the presented work, which is afaict at this point the annotation of images.
6 (right): "Containers are lightweight" is an often used argument, but needs clarification. Light on what scale? Is a few GB saved storage and quicker boot duration really relevant in scientific reproduction of workflows done only by a handful of readers?
## Page 3
19 (left): Data storage costs are high - please provide support for this argument. Docker Hub, for example, which is used for storing the images, is free of charge.
31ff. (left): While I support the conclusion of VMs not being suitable, I think the discussion takes some shortcuts here. Why can I not know what is inside a VM? Why do I not know what is in a container, when it was created from a Dockerfile (line 46)? Please take a look at ReproZip (http://reprozip.org/) as a packaging format that abstracts from VMs and containers.
49 (left): Please re-check your statement "these works ... only express desiderata". At least Marwick provides a case study (= with a solution).
10 (right): I cannot follow the argument leading to scalability and security issues.
12 (right): It is unclear what role software vulnerabilities play for the presented work. The authorship of containers in a scientific setting should be clear, thus trust is usually not an issue. Please clarify the relevance of security for reproductions in scholarly settings.
30 (right): Please clarify the differences between the set of ontologies presented by the authors and the existing ones. What are their shortcomings?
46ff. (right): (and also other places) Containers can also be black boxes, right? Why is a VM "fixed" - a user can start it and make changes to it. Why do manual annotations reduce quality and trust?
Also, I suggest not hiding the two main problems tackled at the end of the related work section (see comment above about the main contribution).
## Page 4
44 (left): Introduce abbreviation "WMS"
50 (left): "they archival" - "the" or "their" ?
3 (right): I still have problems seeing containers as a solution for _physical_ preservation. Physical preservation of software or data in my understanding would be a (brick and mortar) library that has several hard disks of data stored in different locations.
6 (right): "audit features" mentioned here, and only here. Please expand on that use case or stick to "annotation".
13 (right): "images, containing the Operating System and dependencies" - it is my understanding that containers do not include the "operating system", namely not the kernel. That is the whole point. Please recheck. Also, if size should remain an important argument (questionable), what about Docker's image layers role for storage size?
41 (right): How does the system identify layers (to be reused) - by the layer ID I assume? Could be clarified for non-Docker experts.
This section might also be a good place to introduce the files that store the image/layer metadata, or to mention the Docker API?
## Page 5
12 (left): Extend an existing Dockerfile - this can be confusing; I think it is worth introducing the concept of parent images here (FROM ...). Also, the Dockerfile does not "finally uploads it to Docker Hub"; please clarify (Docker CLI vs. Dockerfile).
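To make the distinction concrete (a hypothetical sketch, not taken from the manuscript): the Dockerfile only declares the parent image via FROM plus the build steps, while building and uploading are separate Docker CLI commands.

```shell
# Hypothetical sketch: the Dockerfile declares a parent image (FROM) and
# build steps; it does not itself build or upload anything.
cat > Dockerfile <<'EOF'
FROM ubuntu:18.04
RUN apt-get update && apt-get install -y python3
EOF
# Building and pushing are separate CLI steps (shown as comments, since
# they require a running Docker daemon):
#   docker build -t someuser/someimage:1.0 .
#   docker push someuser/someimage:1.0
```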
13 (left): "the software packages needed to build the Docker image" - the only software needed to build the image is Docker; you probably mean all software packages installed into the container?
27 (left): "which components are installed..." - please consider that the statement about the intuitiveness of Dockerfiles strongly depends on a person's background. You might make the argument here that this is the case for scientists, but then please connect this to the other parts of your manuscript where you solve that problem.
30 (left): "Also, some components might exist in the container that are not specified by the Dockerfile itself" - which ones? Do you mean dependencies of the installed software?
52 (left): What does "light-weighted" mean in this context? I suggest rephrasing.
1 (right): Please elaborate on the expected process of reproduction - why does it affect "production infrastructures"? Because you only cover HPC workflows?
5ff (right): versions and rollback: This is quite short, and I can imagine what you mean, but it would be better to strike this or explain thoroughly, or find a reference: Do you mean Dockerfiles under version control, or tagged images (with time or releases) ?
47 (right): Are DeploymentPlan, DeploymentStep etc. not also classes and should be typeset in the same font as SoftwarePackage on the following page?
## Page 6
4 (left): I suggest mentioning Singularity earlier, when you introduce containerisation.
9 (left): First mention of "dockerpedia" - please explain! Also, what is the relation to the "docker:" namespace in Figure 1? Shouldn't they be the same?
14 (left): "In summary, we annotate every installed software package on the container file system." Please clarify if you annotate the packages that are mentioned in the Dockerfile, or also that package's dependencies. I assume not the latter. Have you considered running a command like `dpkg --list` to get a list of all installed software?
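To illustrate what such a listing could provide (the dpkg-query flags are standard Debian tooling; the output below is simulated and the formatting step is only a sketch):

```shell
# Sketch: inside a Debian-based container, something like
#   dpkg-query -W -f '${Package}\t${Version}\n'
# lists every installed package, including dependencies never named in the
# Dockerfile. Simulated output piped through a trivial formatter:
printf 'libc6\t2.27-3ubuntu1\npython3\t3.6.7-1~18.04\n' | \
  awk -F'\t' '{ printf "package=%s version=%s\n", $1, $2 }'
```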
27 (left): "annotation service implements a REST interface" - please provide a link to the specification of said interface.
10 (right): Please clarify how you handle base images (FROM ubuntu) and image stacks - or is the metadata for all base images given?
39 (right): How can you model source-based installation, e.g. wget-ting an archive and installing with make? I think that is a common approach, especially since Docker is often used to provide a software stack that is not easy to install on all platforms (i.e. software that is not available via package managers).
36 (left): Right - you use Clair to capture all dependencies. Very good; please consider my above comments as requests to mention that earlier. Can you provide an example link to how the result of a Clair analysis looks? I think it could illustrate your integration of the different used tools well.
41 (left): Can you provide a citation for Clair, or just the GitHub link? In general, I would kindly ask you to double check if you cite each software the way they want to be cited, to give proper credit (which does not work with just a URL).
22 (right): "we extend Clair in our system" - can you provide a link to a pull request or commit with your extensions? Are your changes in https://github.com/dockerpedia/clair/commits/conda ?
## Page 7
Listing 1: I don't see a tensor-flow package installed - maybe these are just the dependencies needed by tensorflow?
33 (right): "To do that, we create the Docker image again using the previous annotations." Can you clarify or give an example for a whole "round trip": Dockerfile > Docker image > annotations parsed from Docker image > Docker image (or do you generate a Dockerfile from the annotations?)
37 (right): "just repeating the package manager install command": I think the "just" in that sentence does not do justice to the complex system you built. Can you provide an illustrative example, e.g. what information does Clair extract, and what install command do you create for a specific package and version? Or does it not differ at all? Do you use the specific version in the apt commands?
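For instance (hypothetical package name and version, standing in for a Clair-extracted name/version pair, just to show what a version-pinned reconstruction could look like):

```shell
# Hypothetical values standing in for a name/version pair extracted by Clair:
pkg="curl"
ver="7.58.0-2ubuntu3"
# A regenerated, version-pinned apt install command would then be:
echo "apt-get install -y ${pkg}=${ver}"
# → apt-get install -y curl=7.58.0-2ubuntu3
```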
Listing 2: It is unclear why you would query for Pegasus software packages. Please better introduce the example, considering that Pegasus is properly introduced only on page 10.
## Page 8
Fig. 2: Please provide links to the source code of the Annotator (d3.js-based, Go). Also, consider adding numbers to structure the interaction - the order in which requests are made is not obvious.
49 (left): You say "five different experiments" but only have 4 names in the brackets after that. Please clarify.
24 (right): "guarantee that our approach is platform independent" - Please clarify if you actually run 45 workflows (five experiments * three workflow systems * three execution environments). Later it does not become clear which platform was used to generate which output (e.g. extractBudged-reproduced.csv - which platform does it come from?)
48 (right): I don't think "imports" is a proper term to use with one image being based on another one. As said above, properly explaining the FROM command is probably worth it for readers without Docker expertise.
## Page 9
17 (left): "We rely on .. to" - this seems odd; I suggest rephrasing, maybe "We rely on Docker Images stored on DockerHub for the physical conservation."? I still suggest rethinking this, as DockerHub might disappear any day, while a proper data repository (Zenodo, OSF, figshare, b2share) is more likely to actually "conserve" data.
24 (left): "so that any user inspect and improve them" - add "can" ?
20 (right): "similar enough": Please clarify your criteria! What is "enough" - should the workflow be executed, or be executed and have the same results? You do say that later, so I suggest striking this unfortunate phrasing.
25ff (left): Please reconsider "storable" as an evaluation criterion, same as lightweight. If you compare the disk usage: where is that data?
29 (right): "With the SPARQL query in Listing 4 is easy to spot the differences between both execution environments." - I disagree, with the _result_ of that query a user could spot the differences. I suggest to also link to or include the result.
## Page 10
21ff (left): The relevance of the software requirements for Pegasus is unclear. You do not mention that for the other workflow software. Also, I suggest making clear that you created the Docker images for Pegasus, while you could use existing ones, for example, for dispel4py.
31ff (left): "The workflow ..." - wow, that's a tough sentence to digest for a non-genomics expert. Please consider either explaining/adding references (Wikipedia?) for what SNP, GATK, and haplotype are, or rephrasing in more general words. When a reader wants the details, there is [22, 23]. Please clarify why you use that workflow (was it readily available? published under an open license? typical?)
22 (right): "in yellow" "in green" - sentence is probably missing a reference to Fig. 3 ?
24 (right): It is unclear what "baw, gatk and picard" are. They are not in Fig 3.
28 (right): "some of its steps being probabilistic" - can you clarify why you are not able to set a seed and thus come to the same results? This is common for reproductions of randomised workflows.
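As a generic illustration of the seeding point (not the actual GATK tooling; awk's RNG is used only as a stand-in): with a fixed seed, repeated runs of a randomised computation yield identical output.

```shell
# Generic sketch (not specific to the manuscript's tools): seeding the RNG
# makes two independent runs produce the same "random" result.
run1=$(awk 'BEGIN { srand(42); print rand() }')
run2=$(awk 'BEGIN { srand(42); print rand() }')
[ "$run1" = "$run2" ] && echo "seeded runs match"
```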
34 (right): I think it is a bit risky to go from "similar outputs" to "successful reproduction". Please transparently define your criteria (which might not require bitwise equality), don't just use "similar".
## Page 11
34 (left): Cool that you use perceptual hashes for the comparison! The Zenodo links in the footnote do not seem to contain any images though.
39 (left): Footnote 25 is for "DockerHub" but the links in the footnote actually point to Zenodo. Please fix/clarify. It is very good that you put snapshots of your code on Zenodo! Please consider doing that for the other software you developed, too (like the annotator).
## Page 12
Footnotes 30 and 31 are supposed to go to DockerHub and GitHub judging by the text, but actually are Zenodo DOIs. Please clarify/fix.
25 (right): "obtained the same results" - please clarify how you checked that. It would be great if you, for example in the README of the results repository, could provide the commands for a reader to re-run the experiments, i.e. how you generated the files in results repository.
45 (right): "include the complete list of installed packages on the Dockerpedia GitHub" - which repository precisely?
31 (right): WINGS section does not report on results of a reproduction. Is WINGS used for MODFLOW-NWT (which is a subsection)?
## Page 13
17 (left): Why do you not run a line-by-line comparison of the CSV files for modflow (extractBudged.csv)? The images are just a visualisation, you can check the actual data. As such Fig. 9 does not add real value to the manuscript.
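A plain line-by-line diff on the data would be both simpler and more precise. A sketch with stand-in files (the manuscript's actual extractBudged.csv and its reproduced counterpart would be used instead):

```shell
# Stand-in data illustrating a direct, line-by-line comparison of the
# underlying CSV outputs instead of their rendered visualisations:
printf 'zone,budget\n1,42\n2,17\n' > original.csv
printf 'zone,budget\n1,42\n2,17\n' > reproduced.csv
diff -q original.csv reproduced.csv && echo "files identical"
```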
28 (left): "different Docker version" - conflicts with page 9 "The Docker version (37) tested for this experimentation", doesn't it? If Docker versions differ, you should extend Table 1 to include that, and to include the precise architecture ("64" probably means bits, right? Could still be ARMv8-A or RISC-V, but probably isn't either.)
32 (left): "predefined VM image": It is unclear how you executed the workflow in the VM (I can guess that you started it, logged in, then executed it), please describe that in the previous sections though for transparency.
48 (left): Please re-consider using other means for comparing probabilistic outputs. Could you not introduce error margins? I guess the results might be a little bit different but should not contradict each other. Also this does not fit the later sentence "equivalent in terms of size and content". If the content is equivalent, then why not compare it directly?
34 (right): "produces a histogram by zone": As stated above, I suggest comparing the data underlying the histogram; it should be more precise (and possible with a diff tool)
## Page 14
38 (left): "the graph does not have a conflict" - please explain: to me Java == 1.7 and Java >= 1.8 are a conflict, you can only have one of those.
35 (right): "noiseless." - I suggest re-wording. What does noise have to do with containers and annotations? Why does a correct execution of an experiment lead to security?
39 (right): "We can detect the similarities and differences between two versions of a image" - just to be clear: for your reproductions, did you perfectly recreate the environments, i.e. did you not have any differences, in versions of underlying libraries etc. ? I strongly suggest to provide the output of the comparison queries in the results repositories on GitHub.
42 (right): "The Docker images takes less disk space compare to Virtual Machine images" - while this is true in absolute numbers, for the argument to hold I would like to see the size of the data of the workflows. Or is the full data included in each image?
Also, the Container for dispel4py is larger than the Virtual machine for Pegasus - if you can store the dispel4py container, why can you not store the Pegasus VM ?
Table 2: VM and image sizes for MODFLOW-NWT, WINGS are missing.
## Page 15
Fig 10.: Not all orange nodes are actually different, at least not in the figure. Only the text of the "Pegasus 4.8/4.9" node changes. If the SoyKB version changed too, I suggest adding the version in the node. Also, the "Java == 1.7" node did not change. I might be missing the point here, so a more extensive figure caption might help.
22 (right): Regarding perf events, please check ReproZip and other tracing-based tools, such as Parrot (http://ccl.cse.nd.edu/research/papers/techniques-ipres-2015.pdf), for possible overlap with this idea.
## Page 16
Reference 25: Please use one reference for each of the Zenodo repositories. It is really great (!) that you put your results on Zenodo, but the metadata there is really insufficient. "Commit cited in the master's thesis" is not helpful.
## git repositories
- I suggest adding a short introductory paragraph to each README so that readers understand _what_ the included analysis is about, ideally with a reference to the source/original paper.
- Please make sure all your repositories have a useful README, and include a LICENSE (https://github.com/dockerpedia/modflow_results/tree/thesis for example does not)
- https://github.com/dockerpedia/soykb/issues/1
- Is it possible that I run the annotator myself? Could you add instructions to the README of https://github.com/dockerpedia/annotator ?
- Consider turning your GitHub projects into binders. It will allow readers (and reviewers..) to easily follow your steps! (Applies also to montage_results and internal_extinction_results)
- https://github.com/dockerpedia/soykb_results/issues/1
- https://github.com/dockerpedia/modflow_results/issues/1
## Final comments
Overall I found the manuscript well written and understandable, referencing relevant literature and related work. The results are original, but they could be reported more thoroughly and clearly.
The manuscript needs some edits, not least because I apply high standards for reproducibility, which I assume are in line with the authors' intentions, as their topic is strongly connected with computational reproducibility. I do think that all the information that might be missing exists, and there is no need to re-run any experiments to fulfil my suggestions.
Specifically:
- Some arguments need fleshing out and critical review, especially in the introduction and related work. I understand some things seem obvious to developers, but I think (as a researcher also relying heavily on containers for reproducibility!) the paper should be clearer/more realistic on the problems solved and unsolved, and also about the relevance of some challenges.
- I am not an expert in semantic modelling, and since this topic and the presented solutions are surely relevant for other communities, I suggest to accommodate readers from other domains where possible.
- The possibility to detect changes between two environments seems like a "hidden gem" in the article and could be expanded upon.
- The results section reports on the results of the reproductions, but not on the results of the recreation of the environment. It would be helpful to see a Dockerfile example generated from the annotations (if that is how it works), and if the READMEs for the results repositories (or any repository where you see fit) contained the `docker run` commands to execute the workflow.
- The diverse workflows you use support the stability of your approach, but it is really hard for readers to understand what happens in each workflow specifically, because a lot of domain expertise is needed. I suggest rephrasing the workflow descriptions in more generic terms to provide a glimpse, and then referencing the literature for details.
- "dockerpedia:" and "docker:" as namespaces are mixed, but probably only one is actually used?
- The extension of the Clair tool: Do you consider contributing them back to the original codebase?
- The conclusions could be balanced a bit with identified shortcomings of the approach.
- WINGS vs. MODFLOW-NWT relation is unclear (the latter is a subsection of the former, whose section does not report on any reproduction)
- Provide more examples: For me browsing on https://dockerpedia.inf.utfsm.cl/examples helped a lot.
- Source code of Annotator is missing in the text AFAICT, but it is a core part of the work. Should be more prominent, and ideally published in a repository with a DOI.
- One further comment on formatting: I am unfamiliar with the used template, but it would be helpful if all references contain a DOI, currently not all do. Suggest to use the prefix https://doi.org (not http://dx.doi...).