FarsBase: The Persian Knowledge Graph

Tracking #: 1958-3171

Authors: 
Majid Asgari Bidhendi
Ali Hadian
Behrouz Minaei-Bidgoli

Responsible editor: 
Guest Editors Knowledge Graphs 2018

Submission type: 
Full Paper
Abstract: 
Over the last decade, extensive research has been done on automatically constructing knowledge graphs from web resources, resulting in a number of large-scale knowledge graphs such as YAGO, DBpedia, BabelNet, and Wikidata. Although some of these knowledge graphs are multilingual, they contain little or no linked data in Persian and provide no tools for extracting knowledge from Persian information sources. FarsBase is a multi-source knowledge graph specifically designed for semantic search engines in the Persian language. FarsBase uses hybrid and flexible techniques to extract and integrate knowledge from various sources, such as Wikipedia, web tables and unstructured texts. It also supports entity linking, which allows it to be integrated with other knowledge bases. To maintain high accuracy for the triples, we adopt a low-cost mechanism for verifying candidate knowledge by human experts, who are assisted by automated heuristics. FarsBase is being used as the semantic-search system of a Persian search engine and efficiently answers hundreds of semantic queries per day.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Mozhdeh Gheini submitted on 19/Aug/2018
Suggestion:
Minor Revision
Review Comment:

In this paper, the authors introduce FarsBase, a knowledge graph constructed in Persian. Given that Persian is a low-resource language, this is a valuable contribution. They explain their challenges and approaches and provide their evaluation results in detail. That said, there are a number of things I wish were revised:
- First and foremost, the quality of the writing needs to be improved. A large number of grammatical and writing mistakes exist throughout the paper.
- It is said that 'FarsBase is being used as the semantic-search system of a Persian search engine.' However, the name of the search engine is not provided.
- There is no discussion of whether there have been any efforts to construct a knowledge graph for another low-resource language. If there have been, it would be good to include some comparisons in the 'Related Work' section.

Review #2
By Johannes Frey submitted on 22/Oct/2018
Suggestion:
Major Revision
Review Comment:

In the full paper with the title “FarsBase: The Persian Knowledge Graph”, the authors describe a knowledge base system that extracts RDF data from Wikipedia infoboxes, tables and raw text from the web in order to create a Farsi knowledge graph tailored to support Persian search engines.
The work fits the special issue of Knowledge Graphs and covers multiple topics in the field of Knowledge Graph Construction.
The paper is well structured. After evaluating available Farsi knowledge sources with respect to information richness (extracted triples per 1000 words) and the costs of extraction and supervision of correct facts, the architecture of FarsBase is presented. The system is based on three types of extractors (Wikipedia, tables, text) and uses a 2-phase storage layer. Mapped knowledge from the extractors is first loaded into the Candidate Fact and Metadata Store and then transferred to the Belief Store based on filtering and human supervision techniques. In Section 5 the extraction methods of the different types of parsers are shown, and Section 6 presents the knowledge and ontology mapping approaches. An evaluation of FarsBase follows in Section 7. Finally, related work is presented.
However, the evaluation of FarsBase is missing important details, and both its description and the decisions behind its conceptualization are hard to comprehend (see comments). Moreover, the paper contains a few scientifically vague claims, including ones concerning the positioning and novelty of the authors' own work with respect to previous work, which are discussed below. Unfortunately, the language has many errors (too many to enumerate within this review), such as missing noun markers or pronouns, incorrect singular/plural usage (also in combination with the simple present), missing verbs and spelling errors, and therefore proofreading by an editorial office is required.
Although there appear to be open GitHub projects for the FarsBase codebase, no reference to them is provided in the paper. It is not stated whether and where the KG, or (data) parts of it, are (freely) available. The evaluation is not replicable due to the missing depth and detail of the description, and especially because no pointer is given to an experiment website providing the data necessary to reproduce the results.
Although the novelty and innovation of the FarsBase system seem limited, there is inherent value in thoroughly documenting the entire construction process of a KG in the paper. Considering the big challenges in processing the Persian language, due to poorly optimized or underperforming algorithms and toolchains as well as the limited effort in the research community to optimize NLP techniques for languages other than the major ones, I encourage the authors to improve the current version of the paper and suggest a major revision.

====

--- “FarsBase is the only multi-source knowledge base that supports timeliness[4] by handling different versions of data from multiple sources” ===> Wikidata supports multiple key-value pairs (qualifiers) per statement as metadata and a dedicated set of key-value pairs for the references / provenance of a statement. These qualifiers are prominently used to specify (validity) time dimensions of triples. References can be used to trace the information down to its sources. Contradicting information can be handled by providing references, a time dimension or other context information, e.g. the determination method. Moreover, Wikidata keeps track of every change to a resource with modification date and author. Although this information is not queryable via the SPARQL endpoint, it is accessible via Wikidata API calls and in monthly dumps. Given the referenced definition of timeliness, Wikidata supports all criteria of timeliness and is therefore a knowledge base that supports timeliness and handling of different versions of data from multiple sources.
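For illustration, here is a minimal sketch (assuming the public Wikidata SPARQL endpoint, the standard p/ps/pq/pr statement property paths, and the SPARQLWrapper Python library; the entity and property IDs are merely examples) of how a statement's value, its validity time and its reference can be queried:

```python
# Sketch only: retrieve Wikidata population statements for Berlin (Q64)
# together with their point-in-time qualifier (P585) and "stated in"
# reference (P248). Assumes the public endpoint and the SPARQLWrapper library.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
SELECT ?population ?pointInTime ?source WHERE {
  wd:Q64 p:P1082 ?statement .                                    # population statements
  ?statement ps:P1082 ?population .                              # statement value
  OPTIONAL { ?statement pq:P585 ?pointInTime . }                 # validity time qualifier
  OPTIONAL { ?statement prov:wasDerivedFrom/pr:P248 ?source . }  # provenance reference
}
"""

endpoint = SPARQLWrapper("https://query.wikidata.org/sparql")
endpoint.setQuery(QUERY)
endpoint.setReturnFormat(JSON)
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row.get("population"), row.get("pointInTime"), row.get("source"))
```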

--- “DBpedia uses rules. Rules are hard to maintain and write. We used tables. This is much more straightforward.” ===> While it is true that DBpedia uses rules, the argument that tables are more straightforward is questionable. DBpedia also uses tables to display the rules in the MappingsWiki, and the tables in FarsBase are rules as well, even denoted as such in the paper. The difference lies only in the way these rules are presented.

--- “Above method [tabular mapping rules] is friendly and flexible enough to handle all complicated cases” ===> From the presentation of the mapping language/table it is not clear how two cases can be handled: the simple case of aggregating two values from different infobox keys into one literal value (e.g. longitude & latitude as one string; year, month and day as one W3C datetimestamp, etc.), and the slightly more complex case of combining these values into a dedicated resource object that has these infobox values as property values.
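To make the first case concrete, here is a small hypothetical sketch (the rule fields and the combine() helper are invented for illustration and are not FarsBase's actual mapping schema) of what a rule aggregating several infobox keys into one typed literal could look like:

```python
# Hypothetical sketch: aggregating several infobox keys into one typed literal.
# The rule layout and helper function are illustrative, not FarsBase's schema.
from datetime import date

infobox = {"birth_year": "1963", "birth_month": "4", "birth_day": "22"}

rule = {
    "source_keys": ["birth_year", "birth_month", "birth_day"],  # several keys at once
    "target_property": "ex:birthDate",
    "datatype": "xsd:date",
}

def combine(infobox, rule):
    """Build a single typed literal from multiple infobox values."""
    y, m, d = (int(infobox[k]) for k in rule["source_keys"])
    return f'"{date(y, m, d).isoformat()}"^^{rule["datatype"]}'

print(rule["target_property"], combine(infobox, rule))
# ex:birthDate "1963-04-22"^^xsd:date
```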

--- “To support above cases, a new approach has been implemented in FarsBase” ===> The description in its current form does not seem to present any new technique. DBpedia supports (data)type mappings, unit normalization as well as transformation functions. In DBpedia the mapping community usually does not need to take care of unit conversion or of defining the expected literal datatype. This is handled in a consistent manner in the ontology by specifying the range of the properties. By specifying the target property for an infobox key, the datatype and base unit are implicitly defined for its values. Moreover, the extraction framework of DBpedia is aware of several measurement dimensions and units and automatically converts them to base units where applicable. This makes it possible to define that the property dbo:temperature expects a temperature, and the framework automatically transforms °F and °C values from the infoboxes into Kelvin in the extracted triples. For complicated cases where it is not easy to guess correct units (e.g. 6ft 3”), it is possible to use transformation functions in a mapping rule. The FarsBase approach of specifying the datatype for every template individually is redundant effort and a potential cause of inconsistent mappings. It would be possible to specify the length of movies in hours but that of albums in minutes (both float), and to also use the property length to map spatial dimensions. This would return invalid results during querying.
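The range-driven normalization described above can be sketched as follows (property IRIs, dimensions and conversion factors are illustrative examples; DBpedia's actual extraction framework is a separate Scala codebase):

```python
# Illustrative sketch of range-driven unit normalization: the ontology fixes
# the expected dimension per property, so individual mappings never state units.
ONTOLOGY_RANGE = {"dbo:temperature": "Temperature", "dbo:runtime": "Time"}

TO_BASE_UNIT = {
    # dimension -> {unit symbol: conversion to the base unit}
    "Temperature": {"K": lambda v: v,
                    "C": lambda v: v + 273.15,
                    "F": lambda v: (v - 32) * 5 / 9 + 273.15},   # base unit: Kelvin
    "Time": {"s": lambda v: v, "min": lambda v: v * 60, "h": lambda v: v * 3600},
}

def normalize(prop, value, unit):
    """Convert a parsed infobox value to the base unit implied by the property's range."""
    return TO_BASE_UNIT[ONTOLOGY_RANGE[prop]][unit](value)

print(normalize("dbo:temperature", 98.6, "F"))   # 310.15 (Kelvin)
print(normalize("dbo:runtime", 120, "min"))      # 7200 (seconds)
```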

--- “we can’t disambiguate any entities from raw text” ===> Wikipedia provides disambiguation pages. Using the context of the named entity in question from the raw text and the content of the Wikipedia pages of the ambiguous candidate entities, text-based similarity or distance functions could be defined in order to rank the ambiguous entities. It is not clear to the reader why it is not possible to perform some entity disambiguation strategy in FarsBase at this stage.
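As a minimal sketch of the kind of ranking such a strategy could use (plain TF-IDF cosine similarity between the mention context and the candidates' page texts; the strings and candidate labels below are dummy data):

```python
# Sketch: rank candidates from a disambiguation page by the textual similarity
# of their page content to the mention's context. Dummy data for illustration;
# any text-based similarity or distance function could be substituted.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

mention_context = "the striker scored twice in the derby last night"
candidates = {
    "EntityA (footballer)": "professional footballer and striker playing for the club",
    "EntityB (physicist)": "physicist known for work on condensed matter theory",
}

matrix = TfidfVectorizer().fit_transform([mention_context] + list(candidates.values()))
scores = cosine_similarity(matrix[0:1], matrix[1:]).ravel()

for name, score in sorted(zip(candidates, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {name}")   # best-matching candidate first
```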

-------------------------

Further Comments:

Section 2.2 can be shortened or removed, since RDF can be assumed to be well known by readers of the SWJ.

Section 7.2. The evaluation of the precision of the KG is itself imprecise. There is no explanation of how the precision is calculated or how the correctness of an answer to a query is evaluated. The nature of the queries (especially with respect to the description of class-centric queries) is neither comprehensible nor replicable, and the average precision and weighted-average precision values, as well as the odd query instance numbers, add to the confusion.
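As an illustration of what an explicit definition could look like, one plausible reading is per-query precision plus a plain and an answer-weighted average (this is only a guess at what was intended; the paper must state the actual formulas and data):

```python
# Hypothetical sketch of metrics the evaluation should define explicitly:
# per-query precision, the plain average, and an average weighted by the
# number of returned answers. The numbers are placeholders, not the paper's data.
queries = [
    (18, 20),   # (correct answers, returned answers) for query 1
    (45, 50),   # query 2
    (7, 10),    # query 3
]

precisions = [correct / returned for correct, returned in queries]
avg_precision = sum(precisions) / len(precisions)
weighted_avg = sum(c for c, _ in queries) / sum(r for _, r in queries)

print(f"average precision:          {avg_precision:.3f}")   # 0.833
print(f"weighted-average precision: {weighted_avg:.3f}")    # 0.875
```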

Section 7.4. The “Wikipedia coverage analysis” is trivial and should achieve a coverage of 100% by design of the system (at the least, every page has a label and a DBpedia IRI, which is already 2 triples per page). It should be removed from the evaluation section or extended to show valuable information, e.g. type-statement completeness w.r.t. the ontology (an artist who is a songwriter, singer and actor needs to be a member of all these classes but typically has only one infobox or no infobox at all) or property completeness (check whether all infobox properties are mapped and the transformers succeeded in parsing and extracting the information).
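As a sketch of the first suggestion, type-statement completeness could be measured along these lines (the expected-class table and sample entities are invented for illustration):

```python
# Sketch: type-statement completeness. For each entity, compare the classes it
# actually carries against the classes the ontology implies for its roles.
# Expected-class sets and sample entities are invented for illustration.
EXPECTED = {
    "Songwriter": {"Songwriter", "MusicalArtist", "Artist", "Person"},
    "Actor": {"Actor", "Artist", "Person"},
}

entities = {
    "e1": {"roles": ["Songwriter", "Actor"], "types": {"Person", "Artist", "Actor"}},
    "e2": {"roles": ["Actor"], "types": {"Person", "Artist", "Actor"}},
}

def completeness(entity):
    expected = set().union(*(EXPECTED[role] for role in entity["roles"]))
    return len(entity["types"] & expected) / len(expected)

scores = {name: completeness(e) for name, e in entities.items()}
print(scores)                                 # {'e1': 0.6, 'e2': 1.0}
print(sum(scores.values()) / len(scores))     # dataset-level completeness: 0.8
```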
In my opinion it would make more sense to compare the data extracted by DBpedia (for random Farsi instances having a template which is mapped in DBpedia) with the results from FarsBase, to get an impression of the improvements. In order to achieve a fair comparison, the coverage of the mappings should be as identical as possible.
There is no link provided to the used gazetteers and queries.

7.4.1. Is this section evaluating the coverage of a query in combination with its correctness?

7.6. This raises several questions. Is the interlinking of the Farsi DBpedia and Wikidata shown? According to Section 6.7, no own linking approaches are utilized. What is the purpose of Table 13? Furthermore, Table 13 lists a Dublin Core dataset; I’m not aware of such a dataset, only of the vocabulary they provide.

p 7 l 42: CFM is not yet introduced at this point
p 11 l 33: why not reuse existing ontologies and vocabularies (e.g. birthplace)?
p 15 l 12: better use indentation and Turtle syntax
p 16 l 42: reference missing
Listing 6.3 is really hard to read => compact Turtle syntax would be better
Tables 6, 7, 8, 10: use digit grouping and potentially right-aligned digits (see the example below)
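For example (plain Python string formatting; the numbers are placeholders):

```python
# Thousands separators plus right alignment for numeric table cells.
counts = [527114, 4086, 19230577]
for n in counts:
    print(f"{n:>12,}")
#      527,114
#        4,086
#   19,230,577
```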

Review #3
Anonymous submitted on 14/Nov/2018
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along with the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

The topic of the paper, creating a Persian knowledge graph, is quite interesting, given that Persian is a language with poor resources (i.e., NLP tools and corpora are not well developed). It is therefore a valuable effort to build up a knowledge graph in Persian.

However, the paper's language and presentation are very poor. Both require heavy and serious revision. I would normally prefer that such a paper be rejected immediately, but since it is one of the first attempts to build a Persian KG, I recommend a major revision. However, I do not guarantee acceptance later on, since the revision has to reach a satisfactory level.

I list several of my objections below, although this is not the full list:

1. Where is this Persian knowledge graph? Why is there no URI to access FarsBase? If it is a closed or local KG, there is no benefit for the community.
2. The language used in the paper is quite imprecise; for example, the authors use the terms knowledge graph and knowledge base interchangeably. Even in Section 2.1, where they try to describe and differentiate them, the presentation is very poor: e.g., "a knowledge base contains a set of facts and rules that allows storing in a computer system"??? Where did you get this definition?
Another example: RDF is the Resource Description Framework, not a Knowledge Description Format (Section 2.2).
3. The paper tries to do many things in a generic and immature manner; for example, it wants to extract triples from text, extract triples from infoboxes, and so on.
Please note that extracting knowledge from unstructured data (text) requires substantial research, contribution, and evaluation; you could make a separate publication for that. I would recommend following the initial version of DBpedia and just creating a Persian KG from infoboxes, along with a proper evaluation and a discussion of the challenges and quality issues.

4. The choice of the various approaches is arbitrary and imprecise (page 11, line 49): what is the rationale behind this formula? If it is from the literature, where is the citation?
Another prominent example is ontology creation: it is fine to borrow the DBpedia ontology, but how exactly did you specify it? Did you add labels in Persian? How did you decide which new concepts or relations to add and which to remove? You have to be very precise.

5. The related work section is quite long; please prune it. For example, you have no contribution with respect to data quality, so why is it included? And NELL is quite different: it learns knowledge from unstructured data.

Suggestion: shrink the scope of your contribution, and for that narrower contribution provide an extensive and accurate evaluation.

Finally, the language!

It requires serious revision.

I mention just a few issues:

Page 1, line 28, "structured data such as Wikipedia": Wikipedia is semi-structured
Line 42, "are to be done when the ...." ---> "are necessary tasks for KG construction"
Line 40: "tailor-made"??? What does that mean?
Where is the Persian search engine?
Why is "persia" in []?

Page 3 line 41, "Yago and ..."???
"Thanks to tons" is a very awkward term
Page 3 line 3: "knowledge form" ---> "knowledge from"
Page 6 line 10: "Wikipedia is a reach" ---> "rich"

web ---> Web