Review Comment:
In the full paper “FarsBase: The Persian Knowledge Graph” the authors describe a knowledge base system extracting RDF data from Wikipedia infoboxes, tables and raw text from the web in order to create a Farsi knowledge graph tailored to support Persian search engines.
The work fits the special issue of Knowledge Graphs and covers multiple topics in the field of Knowledge Graph Construction.
The paper is well structured. After evaluating available Farsi knowledge sources with respect to information richness (extracted triples per 1000 words) and the costs of extraction and supervision of correct facts, the architecture of FarsBase is presented. The system is based on three types of extractors (Wikipedia, tables, text) and uses a 2-phase storage layer. Mapped knowledge from the extractors is first loaded into the Candidate Fact and Metadata Store and then transferred to the Belief Store based on filtering and human supervision techniques. Section 5 describes the extraction methods of the different types of parsers, and Section 6 presents the knowledge and ontology mapping approaches. An evaluation of FarsBase follows in Section 7. Finally, related work is presented.
However, the evaluation of FarsBase is missing important details, and both the description and the decisions in the conceptualization of the evaluation are hard to comprehend (see comments). Moreover, the paper contains a few scientifically vague claims, including the positioning and novelty of the work with respect to previous work, which are discussed below. Unfortunately, the language contains many errors (too many to enumerate within this review), such as missing noun markers or pronouns, incorrect singular/plural usage (also in combination with the simple present), missing verbs and spelling errors; proofreading by an editorial office is therefore required.
Although there appear to be open GitHub projects for the codebase of FarsBase, no reference is provided in the paper. It is not stated whether and where the KG, or (data) parts of it, are (freely) available. The evaluation is not replicable due to the missing depth and detail of the description, and especially because no pointer is given to an experiment website providing the data necessary to reproduce the results.
Although the novelty and innovation of the FarsBase system seem limited, there is inherent value in the verbosity and in showing the entire construction process of a KG in the paper. Considering the big challenges in processing the Persian language due to badly optimized or underperforming algorithms and toolchains, but also the limited effort in the research community to optimize NLP techniques for languages other than the major ones, I encourage the authors to improve the current version of the paper and suggest a major revision.
====
--- “FarsBase is the only multi-source knowledge base that supports timeliness[4] by handling different versions of data from multiple sources” ===> Wikidata supports multiple key-value pairs (qualifiers) per statement as metadata and a dedicated set of key-value pairs for references / provenance of a statement. These qualifiers are prominently used to specify (validity) time dimensions of triples. References can be used to trace the information back to its sources. Contradicting information can be handled by providing references, a time dimension or other context information, e.g. the determination method. Moreover, Wikidata keeps track of every change to a resource with modification date and author. Although this information is not queryable via the SPARQL endpoint, it is accessible via Wikidata API calls and in monthly dumps. Given the referenced definition of timeliness, Wikidata supports all criteria of timeliness and is therefore a knowledge base that supports timeliness and the handling of different versions of data from multiple sources.
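For illustration, the Wikidata RDF model expresses such a qualified and referenced statement roughly as follows (using the well-known Douglas Adams example; blank nodes stand in here for Wikidata's actual statement and reference IRIs):

```turtle
@prefix wd:   <http://www.wikidata.org/entity/> .
@prefix p:    <http://www.wikidata.org/prop/> .
@prefix ps:   <http://www.wikidata.org/prop/statement/> .
@prefix pq:   <http://www.wikidata.org/prop/qualifier/> .
@prefix pr:   <http://www.wikidata.org/prop/reference/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Q42 (Douglas Adams) educated at (P69) Q691283 (St John's College),
# with a validity-time qualifier and a provenance reference.
wd:Q42 p:P69 _:st .
_:st  ps:P69  wd:Q691283 ;
      pq:P580 "1971-01-01T00:00:00Z"^^xsd:dateTime ;   # start time qualifier
      prov:wasDerivedFrom _:ref .
_:ref pr:P143 wd:Q328 .                                # imported from English Wikipedia
```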
--- “DBpedia uses rules. Rules are hard to maintain and write. We used tables. This is much more straightforward.” ===> While it is true that DBpedia uses rules, the argument that tables are more straightforward is questionable. DBpedia also uses tables to display the rules in the MappingsWiki, and the tables in FarsBase are rules as well, even denoted as such in the paper. The difference lies in the way these rules are presented.
--- “Above method [tabular mapping rules] is friendly and flexible enough to handle all complicated cases” ===> From the presentation of the mapping language/table it is not clear how the simple case of aggregating two values from different infobox keys into one literal value (e.g. long & lat coordinates as one string; year, month and day as one W3C datetime stamp; etc.) and the slightly more complex case of combining these values in a dedicated resource object that carries the infobox values as property values can be handled.
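To make the two cases concrete, the desired output shapes might look as follows in Turtle (all prefixes and IRIs here are hypothetical placeholders, not the paper's actual namespace):

```turtle
@prefix fkg:  <http://example.org/fkg/resource/> .
@prefix fkgo: <http://example.org/fkg/ontology/> .
@prefix geo:  <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# (a) Simple case: two infobox keys |lat= and |long= aggregated
#     into a single literal value.
fkg:Tehran fkgo:coordinates "35.6892 51.3890" .

# (b) More complex case: the same infobox values combined into a
#     dedicated resource object with one property per value.
fkg:Tehran fkgo:location fkg:Tehran__location .
fkg:Tehran__location a fkgo:GeoCoordinate ;
    geo:lat  "35.6892"^^xsd:float ;
    geo:long "51.3890"^^xsd:float .
```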
--- “To support above cases, a new approach has been implemented in FarsBase” ===> The description in its current form does not seem to present any new technique. DBpedia supports (data)type mappings, unit normalization as well as transformation functions. In DBpedia the mapping community usually does not need to take care of unit conversion or of defining the expected literal datatype. This is handled in a consistent manner in the ontology by specifying the range of the properties. By specifying the target property for an infobox key, the datatype and base unit are implicitly defined for its values. Moreover, the extraction framework of DBpedia is aware of several measurement dimensions and units and automatically converts them to base units where applicable. This makes it possible to define that the property dbo:temperature expects a temperature, and the framework automatically transforms °F and °C from the infoboxes into Kelvin in the extracted triples. For complicated cases where it is not easy to guess correct units (e.g. 6ft 3”), it is possible to use transformation functions in a mapping rule. The FarsBase approach of specifying the datatype for every template individually involves redundant effort and is a potential cause of inconsistent mappings. It would be possible to specify the length of movies in hours but that of albums in minutes (both float), and to also use the property length to map spatial dimensions. This would return invalid results during querying.
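A sketch of the resulting inconsistency (hypothetical triples, placeholder namespace):

```turtle
@prefix fkg:  <http://example.org/fkg/resource/> .
@prefix fkgo: <http://example.org/fkg/ontology/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# If each template mapping fixes its own unit, one property accumulates
# incomparable values:
fkg:Some_Movie fkgo:length "1.5"^^xsd:float .   # template A mapped hours
fkg:Some_Album fkgo:length "42.0"^^xsd:float .  # template B mapped minutes
fkg:Some_Ship  fkgo:length "180.0"^^xsd:float . # template C mapped metres

# Any query filtering or ordering on fkgo:length now silently mixes units.
```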
--- “we can’t disambiguate any entities from raw text” ===> Wikipedia provides disambiguation pages. In combination with the context of the named entity in question from the raw text and the content of the Wikipedia pages of the list of ambiguous entities, text-based similarity or distance functions could be defined in order to determine a ranking of the ambiguous entities. It is not clear to the reader why it is not possible to perform some entity disambiguation strategy in FarsBase at this stage.
-------------------------
Further Comments:
Section 2.2 can be shortened or removed, since RDF can be assumed to be well known by readers of SWJ.
Section 7.2. The evaluation with respect to the precision of the KG is imprecise. There is no explanation of how the precision is calculated or how the correctness of an answer to a query is evaluated. The nature of the queries themselves (especially w.r.t. the description of class-centric queries) is neither comprehensible nor replicable, and the average precision and weighted-average precision values as well as the odd query instance numbers add to the confusion.
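At a minimum, I would expect the paper to state definitions along these lines (my assumption of the standard formulation, not what the paper says):

$$\operatorname{prec}(q)=\frac{|\text{correct answers to } q|}{|\text{returned answers to } q|},\qquad \overline{P}=\frac{1}{|Q|}\sum_{q\in Q}\operatorname{prec}(q),\qquad \overline{P}_{w}=\frac{\sum_{q\in Q} n_q\,\operatorname{prec}(q)}{\sum_{q\in Q} n_q}$$

where $Q$ is the query set and $n_q$ the number of instances of query $q$, together with an explanation of how "correct" is judged and by whom.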
Section 7.4. The “Wikipedia coverage analysis” is trivial and should achieve a coverage of 100% by design of the system (at the very least, every page has a label and a DBpedia IRI, which already amounts to 2 triples per page). It should either be removed from the evaluation section or extended to show valuable information, e.g. type statement completeness w.r.t. the ontology (an artist who is songwriter, singer and actor needs to be a member of all three classes but typically has only one infobox, or no infobox at all) or property completeness (checking whether all infobox properties are mapped and the transformers succeeded in parsing and extracting the information).
In my opinion it would make more sense to compare the data extracted by DBpedia (for random Farsi instances having a template which is mapped in DBpedia) with the results from FarsBase to get an impression of the improvements. In order to achieve a fair comparison, the coverage of the mappings should be as identical as possible.
There is no link provided to the gazetteers and queries used.
7.4.1. Is this section evaluating the coverage of a query in combination with its correctness?
7.6. This section raises several questions. Is the interlinking of the Farsi DBpedia and Wikidata shown? According to Section 6.7, no dedicated linking approaches are utilized. What is the purpose of Table 13? Furthermore, Table 13 lists a Dublin Core dataset; I am not aware of such a dataset, only of the vocabulary they provide.
p 7 l 42: CFM is not yet introduced at this point
p 11 l 33: why not reuse existing ontologies and vocabularies (e.g. birthplace)?
p 15 l 12: better use indentation and Turtle syntax
p 16 l 42: reference missing
Listing 6.3 is really hard to read => compact Turtle syntax would be better
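E.g., with prefixes, semicolons and indentation (illustrative placeholder IRIs, not the paper's actual namespace), a resource description condenses to:

```turtle
@prefix fkg:  <http://example.org/fkg/resource/> .
@prefix fkgo: <http://example.org/fkg/ontology/> .

# One subject stated once; ';' repeats the subject, aligned objects aid scanning.
fkg:Tehran a fkgo:City ;
    fkgo:country    fkg:Iran ;
    fkgo:population "8693706" .
```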
Tables 6, 7, 8 and 10: use digit grouping and potentially right-aligned digits