Facilitating Scientometrics in Learning Analytics and Educational Data Mining - the LAK Dataset

Tracking #: 955-2166

Authors: 
Stefan Dietze
Davide Taibi
Mathieu d’Aquin

Responsible editor: 
Claudia d'Amato

Submission type: 
Dataset Description
Abstract: 
The Learning Analytics and Knowledge (LAK) Dataset represents an unprecedented corpus which exposes a near-complete collection of bibliographic resources for a specific research discipline, namely the connected areas of Learning Analytics and Educational Data Mining. Covering over five years of scientific literature from the most relevant conferences and journals, the dataset provides Linked Data about bibliographic metadata as well as the full text of the paper body. The latter was enabled through special licensing agreements with ACM for publications not yet available through open access. The dataset has been designed following established Linked Data patterns, reusing established vocabularies and providing links to established schemas and entity coreferences in related datasets. Given its temporal and topic coverage, being a near-complete corpus of research publications of a particular discipline, the dataset facilitates scientometric investigations, for instance, about the evolution of a scientific field over time or correlations with other disciplines, which is documented through its usage in a wide range of scientific studies and applications.
Tags: 
Reviewed

Decision/Status: 
Minor revision

Solicited Reviews:
Review #1
By Vojtěch Svátek submitted on 19/Jan/2015
Suggestion:
Minor Revision
Review Comment:

The authors' reaction to my comments was mostly adequate. I appreciate the cleaner logical structure and addition of licencing info, as well as correction of a number of typos.

I would still like to see more properties in Table 5. The list has been pruned but not extended. Even less frequent properties may sometimes be of interest.

Fig. 4 should be commented on a bit more: what the user can actually see in the screenshot.

Minor issues:
- In the list of licence models for graphs you should make clear that the first two only apply to *metadata*. The statement "contains... publications" might mislead.
- In Fig. 1, "swc:Proceedings" should probably be "swrc:Proceedings"

I also still noticed some typos, e.g., "the authors already are aware of 16 scientific publication" (word order; missing plural).

Review #2
By Maria Keet submitted on 27/Jan/2015
Suggestion:
Minor Revision
Review Comment:

While there is at least some related work now, the paper doesn’t mention any best practices that were reused, or any methodological approach to developing an LD data set (is there none?). The discussion section states as its second contribution “a set of practices and vocabulary choices for reuse in similar efforts”, and technically one can extract such a set from the various paragraphs, and they are practices and choices, but they’re not best practices: they are not justified with respect to related literature, nor is a clear sequence of steps presented in the paper. The onus ought not to be on the reader to extract the practices and choices from the different paragraphs when the paper claims to offer them; they should instead be presented on a plate (spoon-fed), which they are not.

I also see some improvements on the schema, but the main problem remains: a mishmash [that word in the previous review was a polite rendering of ‘unmotivated messy cherrypicked cocktail of vocabulary terms with a range of issues’] of classes and properties from a plethora of sources. While I disapprove of such an approach, and I don’t think it was ever the intention to cherry-pick URIs to copy-and-paste one’s schema together, perhaps this could work if the sources were aligned properly (which is what the paper claims to have done) and motivated properly, which is asserted in section 7 (“vocabulary choices”). On the latter, the paper mentions only a ‘we started with swrc and added stuff to that along the way’ in section 4.1, but that doesn’t count as a description of trade-offs among vocabularies. More problematic is that, despite some corrections made following my pointers in the previous review, there are still problems. To illustrate, here are a few selected from Tables 4 and 5:
- Paper: swrc:inProceedings really concerns papers in proceedings, which essentially have an associated (scientific) event whose proceedings they appear in, whereas schema:Article is only the informal notion of an article, and bibo:Article’s description lies somewhere in between, so these are not equivalent.
- Journal Issue in the concept column with ofType bibo:Journal: no, in bibo, Journal hasPart only Issue, so a Journal Issue cannot be ofType Journal.
- Some mappings are missing: npg:Citation and schema:Citation seem to be the same, and npg:hasCitation could be related to isReferencedBy in bibo (though that is actually from purl.org/dc/terms, not really bibo).
No doubt there are more such shortcomings regarding the schema, but I consider it the responsibility of the authors to find them all, not mine, in particular because the integrated graph is promoted in the paper as a useful contribution. At present, it still doesn’t instil confidence, but instead gives the impression of a poorly executed copy-and-paste job of vocabulary items, which, while it does provide a “set of practices” in the strict sense, is not one that is advisable. Given that the schema is still unstable, I wonder about its knock-on effects on the data and on doing analyses with that data, especially regarding querying the data with the kind of limitations the schema currently still has, and its proneness to change until a stable version has been developed.

While it is good to see that the LD data set is being used and endorsed by various institutions, neither best practices nor an immediately reusable good schema is available. This being the case, the authors should either tone down their claims of contribution, or improve the work to meet the claims asserted in the paper.

section 6
16 scientific publication -> 16 scientific publications

section 7
short-comings -> shortcomings
draw-backs -> drawbacks

Review #3
By Agnieszka Lawrynowicz submitted on 02/Feb/2015
Suggestion:
Minor Revision
Review Comment:

The authors made efforts to address the comments raised in all the reviews.
This resulted in a better dataset and a better paper (with regard to quality and clarity, which in turn increases the potential reusability of the results).
With regard to the usefulness of the dataset, there are two important points in favour of this work:
1) The authors have now expanded the list of publications on the LAK dataset website, which includes several works using the dataset within the LAK challenge (co-authored by people other than the authors of the paper).
2) There is a community around this work: for instance SoLAR (potential usefulness).
However, I would appreciate it if the authors elaborated more on the dataset’s usefulness up to now (not only its potential usefulness). In particular, it is important that the authors discuss the third-party uses of the dataset in more detail to provide more evidence of its usefulness (I am referring here to the SWJ category description of “Linked Dataset Descriptions” and the requirements listed there).

I have a few further, minor remarks.
1) Schema:
ConferenceEvent may not always be an AcademicEvent (there might be a business conference, for instance)?
Though both are in the ontology describing research communities.
Article is not equivalent to InProceedings? There are also articles in journals.
2) The labels on Fig. 2 could be of better quality.