OpeNER Accommodations as Linked Data

Tracking #: 453-1630

Authors: 
Clara Bacciu
Angelica Lo Duca
Andrea Marchetti
Maurizio Tesconi

Responsible editor: 
Oscar Corcho

Submission type: 
Dataset Description
Abstract: 
The OpeNER Linked Dataset contains information about accommodation for three locations: Amsterdam, Tuscany (Italy) and Spain. For each accommodation, it provides the type (e.g. hotel, bed and breakfast, hostel, etc.) and other useful information, such as a short description, the location, the number of rooms and the features it provides. The dataset has been built starting from two Web sites that give information about accommodation: Booking.com and Google+ Local. Furthermore, it exploits three common ontologies for the accommodation domain: Acco, Hontology and GoodRelations. The dataset contains 19,973 entries: 1,043 entries for Amsterdam, 15,371 for Tuscany and 3,559 for some localities of Spain.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
By Aidan Hogan submitted on 11/Apr/2013
Suggestion:
Reject
Review Comment:

This paper discusses Linked Data representing various forms of accommodation in Amsterdam, Tuscany and Spain, containing 19,973 instances. The dataset has been sourced from booking.com and Google+ Local using scrapers. The paper gives an overview of related (Linked) datasets, the existing vocabularies used, the mechanisms by which the raw data are extracted and integrated, and what the final RDF looks like. Links are provided as entry points to the dataset. Licensing and maintenance/updates to the dataset are briefly discussed. Finally, the authors outline potential use-cases for the dataset; primarily, as its name suggests, the OpeNER dataset was originally intended to provide Named Entity Recognition for entities referring to accommodation in the given locales.

Based on the review criteria for Linked Dataset Descriptions [1], I think that the description of the dataset is mostly adequate and the quality of the dataset is adequate (aside from a lack of links). My main concern is about the legality of publishing screen-scraped data as Linked Data: this completely undermines the usefulness of the dataset.

(3) Clarity and completeness of the descriptions.

The paper does a reasonable job of describing the dataset. There are a few minor issues here and there with formatting and with English (an incomplete list of minor comments is given below), but otherwise the description is quite concise and clear.

(1) Quality of the dataset.

The paper describes the dataset and provides links to the dataset. The example URI for a hotel is incorrect, and should presumably be:

http://wafi.iit.cnr.it/opener/resource/acco-1

This successfully returns a D2R-style HTML rendering of the RDF data, or Turtle if requested. From a quick look, the data seem fine. I dislike that the referenced Hontology is returned in OWL XML presentation syntax, since most applications will not have a parser for this (we are talking about consumption by RDF tools, so an RDF representation of the ontology would make more sense than an OWL syntax supported by a handful of OWL tools). Lower-case property names are also more conventional in Linked Data. Finally, I think it should be clarified in the paper that Hontology was created by the authors.

http://wafi.iit.cnr.it/angelica/Hontology.owl#

I am also quite concerned that the level of interlinkage is quite low: about 543 links to DBpedia. (Also, the SPARQL endpoint and the dataset itself seem a bit unstable: I encountered various temporary errors while browsing through.)

Personally, I would also recommend against VCard wherever possible. The vocabulary is not tailored for RDF; it over-uses literals and is not particularly intuitive. Use alternatives where they exist. In particular, for latitude and longitude, wgs84 is much more commonly used:

http://www.w3.org/2003/01/geo/wgs84_pos#

Also, don't use gr:name. There are about twenty name/title/label properties in Linked Data and only three of them are needed. Please just use rdfs:label (and skos:prefLabel or skos:altLabel if there are aliases).
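
To make the two vocabulary suggestions above concrete, here is a minimal sketch in Python (standard library only) that writes out the Turtle for one resource using rdfs:label and the wgs84 properties geo:lat and geo:long. The function name describe_accommodation is hypothetical, and the label and coordinates in the usage example are made-up placeholders rather than values from the dataset; only the wgs84 namespace and the resource URI come from this review.

```python
# Namespaces: the wgs84 URI is the one quoted in the review;
# rdfs is the standard RDF Schema namespace.
GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def describe_accommodation(uri, label, lat, lon):
    """Return Turtle text for one resource with rdfs:label and wgs84 coordinates.

    Hypothetical helper for illustration; it is not part of the OpeNER tooling.
    """
    # Escape backslashes and double quotes so the label is a valid Turtle string.
    safe_label = label.replace("\\", "\\\\").replace('"', '\\"')
    lines = [
        f"@prefix geo: <{GEO}> .",
        f"@prefix rdfs: <{RDFS}> .",
        "",
        f'<{uri}> rdfs:label "{safe_label}" ;',
        # Coordinates are emitted as plain literals to keep the sketch simple.
        f'    geo:lat "{lat}" ;',
        f'    geo:long "{lon}" .',
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder label and coordinates, assumed for the example only.
    print(describe_accommodation(
        "http://wafi.iit.cnr.it/opener/resource/acco-1",
        "Example Hotel",
        43.7,
        10.4,
    ))
```

Aliases, if any, would go in additional skos:prefLabel or skos:altLabel triples on the same subject.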

The lack of links is perhaps the biggest issue in terms of evaluating OpeNER as a Linked Dataset. The other issues are admittedly minor and the data seem to be formatted quite well as RDF.

(2) Usefulness (or potential usefulness) of the dataset.

My main concerns lie in this point. Aside from the fact that the coverage of the data is somewhat localised (and thus applications are limited to those locales), and that the highlighted NER application is very specific, licensing is a major issue since the dataset is screen-scraped from two commercial sites that expressly forbid such extraction. The authors acknowledge such issues in Section 4.5 (discussion should probably be earlier as it's an obvious concern), but they leave the situation ambiguous whereas the situation seems rather clear-cut. For example, for booking.com, aside from the copyright notice on all pages, here's the relevant quote from the T&C's [2]:

"""
Our services are made available for personal and non-commercial use only. Therefore, you are not allowed to re-sell, deep-link, use, copy, monitor (e.g. spider, scrape), display, download or reproduce any content or information, software, products or services available on our website for any commercial or competitive activity or purpose.
"""

And these are the relevant Google T&C's [3]:

"""
2. Restrictions on use. Unless you have received prior written authorisation from Google (or, as applicable, from the provider of particular Content), you must not: (a) copy, translate, modify or make derivative works of the Content or any part thereof; (b) redistribute, sub-license, rent, publish, sell, assign, lease, market, transfer or otherwise make the Products or Content available to third parties; (c) reverse engineer, decompile or otherwise attempt to extract the source code of the Service or any part thereof, unless this is expressly permitted or required by applicable law; (d) use the Products in a manner that gives you or any other person access to mass downloads or bulk feeds of any Content, including but not limited to numerical latitude or longitude coordinates, imagery and visible map data; (e) delete, obscure or in any manner alter any warning or link that appears in the Products or the Content; (f) use the Service or Content with any products, systems or applications for or in connection with (i) real-time navigation or route guidance, including but not limited to turn-by-turn route guidance that is synchronised to the position of a user's sensor-enabled device; (ii) any systems or functions for automatic or autonomous control of vehicle behaviour; (g) use the Products to create a database of places or other local listings information.
"""

When the authors used the service to screen-scrape the site, they obviously broke the T&C's. They simply should not have the data. I thus don't see how the authors can "get a specific licence in order to expose all the information, including accomodations description and other sensitive data." It would probably be okay to use the data for personal use or for offline research purposes, but replicating the data online is obviously a different matter. Hence, since the legality and/or availability of the dataset are fundamentally compromised, I think the usefulness of the dataset is completely undermined.

MINOR COMMENTS:
* Throughout: use "," or a thin space as a thousands separator, not "."
* "Booking.com is [an] online booking"
* Fix line spacing in right hand column of first page.
* "100 textsearch" -> "100 text searches"
* "Note that not all the categories are defined in both the ontologies" Rephrase
* "using both [of] the"
* "Turtle code[ ]is"
* "rectangles [are] literal values"
* Fix formatting of links in Section 4.4.
* "to search for <> accommodation providing"
* Fix formatting of references.

[1] http://www.semantic-web-journal.net/reviewers
[2] http://www.booking.com/general.en-gb.html?dcid=1&sid=6ad949028f7510383b8...
[3] http://www.google.com/intl/en_uk/help/terms_maps.html

Review #2
By Boris Villazon-Terrazas submitted on 15/May/2013
Suggestion:
Minor Revision
Review Comment:

This paper presents a linked open dataset about accommodations for Amsterdam, Tuscany, and Spain. It also offers a brief description of related datasets, reused vocabularies, and internal structure.

However, the added value of Linked Data in this paper is limited. The authors should try to show the benefits of having that dataset exposed as Linked Data. The dataset is currently linked to only two other datasets. From the paper it is not clear which concrete opportunities would be provided by linking the OpeNER dataset with other datasets available on the web.

Moreover, the paper is missing usage statistics, i.e., is anyone using the data from the dataset, and how?

It would be good to create links to related datasets, for example within Spain to the Santillana Guide from the Webenemasuno project.

I'm missing the VoID description of the dataset, and a datahub.io entry for the dataset.

Minor comments
- Google+ Local[6] -> Google+ Local [6]
- How often do the scrapers check whether there are new entries on Booking.com and Google+ Local?
- please add the following reference for El Viajero tourism dataset
Daniel Garijo, Boris Villazón-Terrazas, Óscar Corcho: A provenance-aware Linked Data application for trip management and organization. I-SEMANTICS 2011: 224-226
- I think section 3 should be renamed to Vocabulary description.
- Also within section 3, I would like to know if there is something within schema.org for representing accommodations.
- Finally in section 3, the authors should provide a link to check the description of the vocabulary they created, by reusing the existing ones in this domain.
- It would be good to have access to the ad-hoc extraction/scraper scripts. How easy would it be to adapt those for other resources, e.g., TripAdvisor?
- Figure 1 can be reduced; it is currently a bit big.
- Can we have access to the dataset's outgoing links? For example, the AEMET dataset has those available here
http://aemet.linkeddata.es/links/
- Section 4: "The following Turtle codeis its representation ..." there is a typo and I think some grammar issue as well.

Review #3
By Emanuele Della Valle submitted on 06/Jun/2013
Suggestion:
Major Revision
Review Comment:

The paper is well written and well organised. The dataset is small, but of great practical value. The idea of off-loading the data linking task to the Google Places APIs is convincing. Apart from minor changes (see below), I would recommend accepting the paper with minor revisions.

However, I'm very concerned with the ownership of the data. I'm worried that the authors have no right to publish this dataset and, thus, that this paper is describing an "illegal" dataset.

The authors in Section 4.5 say that "[they] have taken data from two different contributions, which provide free access to their databases", but I checked Google's [1] and Booking.com's [2] terms and conditions (T&C), and I understand that the authors had no right to dump the content of Google + Places and Booking.com in their dataset. Of course I can be wrong.

Google's T&C says: Using our Services does not give you ownership of any intellectual property rights in our Services *or the content you access*. *You may not use content from our Services unless you obtain permission from its owner* or are otherwise permitted by law.

So, unless using the content is "permitted by law", the authors may not publish Google+ Location data in their dataset.

Booking.com's T&C says: Unless stated otherwise, the software required for our services or available at or used by our website and the intellectual property rights *(including the copyrights) of the contents and information of and material on our website are owned by Booking.com B.V., its suppliers or providers*.

Here, I don't even see space for interpretation. It appears that Booking.com forbids republishing its content.

So, my request for a major revision is not related to the paper, but to the legal issues I raised w.r.t. the dataset. I hope I'm very wrong.

Minor comments:
- page 2
- ontologies: the GoodRelations [5], Acco [2] and Hontology [7] [10] ontologies. -> ontologies: GoodRelations [5], Acco [2] and Hontology [7] [10].
- page 3
- I would move Table 1 to page 2
- page 4
- the Acco example appears broken; is it missing the subject?
- page 5
- the Hontology example appears broken; is it missing the subject?
- how were links to DBpedia established?
- at the bottom of the left column a URL runs into the margin
- page 6
- the references are badly formatted

[1] http://www.google.com/intl/en/policies/terms/
[2] http://www.booking.com/general.html?sid=ad73863306327688a6ac74fea6043f81...

