HDTcrypt: Efficient Compression and Encryption of RDF Datasets

Tracking #: 1733-2945

Authors: 
Javier D. Fernandez
Sabrina Kirrane
Axel Polleres
Simon Steyskal

Responsible editor: 
Ruben Verborgh

Submission type: 
Full Paper
Abstract: 
The publication and interchange of RDF datasets online has experienced significant growth in recent years, promoted by different but complementary efforts, such as Linked Open Data, the Web of Things and RDF stream processing systems. However, the current Linked Data infrastructure does not cater for the storage and exchange of sensitive or private data. On the one hand, data publishers need means to limit access to confidential data (e.g. health, financial, personal, or other sensitive data). On the other hand, the infrastructure needs to compress RDF graphs in a manner that minimises the amount of data that is both stored and transferred over the wire. In this paper, we demonstrate how HDT – a compressed serialization format for RDF – can be extended to cater for supporting encryption. We propose a number of different graph partitioning strategies and discuss the benefits and tradeoffs of each approach.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Miel Vander Sande submitted on 10/Oct/2017
Suggestion:
Minor Revision
Review Comment:

The authors propose an extension to the compressed but queryable RDF format HDT. Three different methods to compress the Dictionary and the Triples are presented and evaluated. The paper is very well written, the research is well conducted, and it makes an interesting contribution to the continuing trend of publishing Linked Data effectively.

I do suggest some minor changes before it can be accepted.
First, Section 4.2 is a bit hard to grasp. I guess this is mainly because of the numerous inline definitions that result in a lack of visual cues.
Second, it surprises me that no more advanced approach to querying was considered. This should at least be explained more elaborately in the conclusion.
Third, the evaluation could be stronger. I wonder, for example, why DBpedia 3.8 was chosen (it is small and old); it seems to obscure size scalability issues. Also, I would have liked to see a comparison with an unencrypted HDT file to see what the introduced overhead is, especially when integrating dictionary and triples in a new HDT for querying, or at least an explanation of why it was not included.

Minor remarks

- S1;L5: exchange is a bit more appropriate than interchange
- S2;L16: could use a line break
- S2;L17-21: the limitations or relevancy of previous work by authors is not well explained
- S4.1;L9: watch horizontal text overflow
- S4.1;L14: the footnote is a strange choice, I suggest brackets or hyphenation.
- S4.1;Fig. 3: mentioning the access rights would clarify the figure
- S4.1;L52: a one-to-one
- S5;L3: this sentence needs rephrasing

Review #2
Anonymous submitted on 24/Dec/2017
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

HDT (Header, Dictionary, Triples) is a well known compact data format for RDF data. The main idea of HDT is to represent the labels of an RDF graph by unique integers, and to create a dictionary that maps these integers to their corresponding labels. Since labels can be long and can occur many times in one RDF graph, this simple dictionary mapping can achieve reasonable compression performance. In the present paper, the authors extend HDT by encryption. The idea is that different user groups have access to only certain subgraphs of an RDF graph. These subgraphs are encrypted (using standard public/private-key encryption) so that only those users which have the right to access a certain subgraph are able to decrypt that subgraph.
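
To make the dictionary idea concrete, the following minimal Python sketch (an illustration of the description above, not the authors' implementation; all names are made up) maps each distinct RDF term to an integer ID and rewrites the triples as ID tuples. Real HDT additionally splits terms into shared, subject-only, object-only and predicate sections and bit-compresses the resulting ID triples.

```python
# Illustrative sketch (not the authors' code): dictionary-based encoding of
# RDF terms as integers, the core idea behind HDT's Dictionary + Triples split.
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

def build_dictionary(triples: List[Triple]) -> Dict[str, int]:
    """Assign a unique integer ID to every distinct term (label)."""
    dictionary: Dict[str, int] = {}
    for s, p, o in triples:
        for term in (s, p, o):
            if term not in dictionary:
                dictionary[term] = len(dictionary) + 1
    return dictionary

def encode(triples: List[Triple], dictionary: Dict[str, int]) -> List[Tuple[int, int, int]]:
    """Replace each (long, often repeated) label by its compact integer ID."""
    return [(dictionary[s], dictionary[p], dictionary[o]) for s, p, o in triples]

graph = [
    ("ex:alice", "ex:knows",    "ex:bob"),
    ("ex:alice", "ex:worksFor", "ex:acme"),
    ("ex:bob",   "ex:worksFor", "ex:acme"),
]
d = build_dictionary(graph)
print(encode(graph, d))   # [(1, 2, 3), (1, 4, 5), (3, 4, 5)]
```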

I find this research important and practically relevant. The paper is well written. Especially the first 10 pages that explain the framework are exceptionally well written and are a true joy to read. My concern with the present version of this paper is the experimental section. First, it is unclear how realistic the two datasets and their subgraphs are; I find it awkward to randomly select subgraphs. The paper would greatly benefit from a real dataset with real access restrictions as they would actually be used in an industrial application. Second, the experimental section is way too detailed and overly verbose; an essential task has not been carried out yet, namely, to select and properly summarize the important data only and to draw (compact) relevant conclusions. The current version feels like a “full dump” of all the produced experimental data. My recommendation is to accept the paper for publication in The Semantic Web Journal, subject to a major revision of the experimental section.

Main Comments
----------------

(1) Is the “Efficient Compression” in the title of the paper justified? As far as I understand, there is nothing new in this paper concerning efficient compression. Also, I am a bit worried by the actual compression and decompression times. For certain graphs Table 4 contains compression times of over 3 hours. That seems quite long, if the only task of the compressor is building a dictionary. One would expect running times that are similar to those of general purpose dictionary compressors (e.g., gzip).

(2) About the Experimental section: Why is it realistic to experiment with 6, 9, and 12 subgraphs only, and to have only 4 different LUB sizes that increase linearly? Why not scale up to 100 or 1000 subgraphs, and have LUB sizes at 1K, 2K, 4K, 8K, and 16K to obtain a more complete picture?
As mentioned before, I do not find it realistic to choose subgraphs randomly. I find it important to experiment with realistic subgraph selections as they would appear in an actual application.

I think many of the tables and graphs should be removed, and merely the summary of the (essential) observations should be given. There is simply too much data here. The graphs about querying (Figures 10–13) do not look very interesting, please remove them.

Table 2: can you explain why the sizes can increase, when going from 9 to 12 subgraphs (e.g., for G4)?

Table 3: I think this table can be removed, the data looks rather similar and predictable. It can be summarized in a few (clear) sentences.

Table 4: please remove.

Table 5: please explain these times in the text and remove this table.

Typos
------

page 1, right column, line 2: change “are allowed access to” into “are allowed to access”.

page 1, right column, line -9: spell out what “HDT” stands for.

page 2, left column, line 6: wrong opening quote mark.

page 2, left column, line 9: the reader can at this moment not understand where the “redundancy” comes from. Please improve.

page 2, left column, line -3: “minimise” is a strong word. I would be careful to use it.

page 2, right column, line 4–9: does this represent a good state-of-the-art approach?

page 2, right column, line -12: one wonders how the final binary encoding works, e.g., whether a Huffman Code or similar is applied?

page 2, right column, bottom: it might look nicer to have k2 (i.e., the k in italic). How much space saving do the approaches [40], [34], and [28] give in comparison? As far as I understand, [34] gives the strongest compression (stronger than k2), is that correct? On the other hand, on page 3, left column, 10 lines above Section 3, it is written that “k2 triples ...is the most effective compressor”. Does “most effective” refer to “space saving”?

page 3, right column, line 4 of Section 3.1: check the typesetting of SO. (should be SO).

page 4, left column, line 5: it might be more intuitive to simply “apply” the dictionary as a function, and to write (D(s), D(p), D(o)). In Figure 2, can you please explain how the non-binary numbers, such as 3, 4, 6, 2, 7 etc. in So are represented in binary? Do you use a fixed-length encoding for these? How does the “alignment” (line 3 above Section 3.3) work? Also the reader wonders at this moment how blank nodes are represented (later you say something about it).

page 4, left column, line -5: change “triples” into “triple”. Two lines below, change “dictionary dictionary” into “dictionary”.

page 5, right column: “It is worth noting” used twice in this paragraph. Maybe rephrase at least one of them.

page 6, line 2 of Section 4.2: please spell out AES.

page 7, line 4 of Section 5: remove “will”. Line -5: what does “no standardizing apart” mean? By the way, this is US American spelling, while for other words (“minimising”) you use British English spelling. Please decide which one you want and then make the paper consistent.

page 8, left column, line 3 above Section 5.3: this superscript in HDT′ is not easily readable.

page 9, left column: I am not sure this observation is important, but counting the number of ones in a binary string is known as “rank”; there are many results and data structures for efficiently determining the rank of a position in a bit string. How is it implemented in your system?
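
As a point of reference for the rank remark, here is a minimal, hypothetical sketch of such a structure (precomputed per-block counts plus a local scan); it is not the system's actual implementation, and all names are invented.

```python
# Illustrative sketch: rank1(i) = number of 1-bits in bits[0..i), using
# precomputed block counts so each query scans at most one block.
BLOCK = 64

class RankBitVector:
    def __init__(self, bits: list):
        self.bits = bits
        self.block_ranks = [0]              # ones before each block start
        for start in range(0, len(bits), BLOCK):
            self.block_ranks.append(self.block_ranks[-1] + sum(bits[start:start + BLOCK]))

    def rank1(self, i: int) -> int:
        """Number of 1s among the first i bits."""
        block, offset = divmod(i, BLOCK)
        return self.block_ranks[block] + sum(self.bits[block * BLOCK: block * BLOCK + offset])

bv = RankBitVector([1, 0, 1, 1, 0, 0, 1, 0] * 20)
print(bv.rank1(10))   # 5 ones in the first 10 bits
```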

page 9, right column: n:m-relation, better write n:m.

page 12, right column: please draw a clear conclusion (instead of reading out the numbers).

pages 13 and 14: if encryption and decryption times are so negligible, then why even mention and show them?

page 14, line 3 of Section 6.3: it is unclear to me what “fair” means here.

page 15, left column, line 2 of Section 6.4.1: insert “the” before “i.e.”.

page 15, right column: the first paragraph is as expected, and also the bottom paragraph can be removed, I think.

page 16: it is not clear to me whether random queries give a realistic picture as to how the data structures would behave in a real setting. I think the paper tried to do too much here and loses focus. The paper would benefit if the querying part were reduced to a few short paragraphs that summarize the outcomes. The presentation of the various figures is not justified.

pages 20,21: some of these papers have been published in the meantime, please check.

- End of comments.

Review #3
By Wouter Beek submitted on 12/Feb/2018
Suggestion:
Minor Revision
Review Comment:

This paper extends the existing HDT compression format for RDF with
several partitioning strategies that allow a given graph to be
automatically subdivided into subgraphs that reduce terminological
(dictionary) and/or assertional (triples) duplication. Since
subgraphs can themselves be annotated with different metadata, one use
case is to couple named graph access rights to encryption keys.
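
To illustrate the coupling of access rights to keys described above, the following hypothetical sketch (using the Python cryptography package, not the paper's tooling; all names are invented) encrypts a serialized subgraph with its own symmetric key and wraps that key with the public key of each authorized group.

```python
# Hypothetical sketch of per-subgraph hybrid encryption: one AES key per
# subgraph, wrapped with the RSA public key of each group allowed to read it.
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_subgraph(serialized_subgraph: bytes, group_public_keys: dict):
    """Encrypt one subgraph; only the listed groups can recover the AES key."""
    key = AESGCM.generate_key(bit_length=256)        # fresh key per subgraph
    nonce = os.urandom(12)
    ciphertext = AESGCM(key).encrypt(nonce, serialized_subgraph, None)
    wrapped_keys = {
        group: pub.encrypt(
            key,
            padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                         algorithm=hashes.SHA256(), label=None),
        )
        for group, pub in group_public_keys.items()
    }
    return nonce, ciphertext, wrapped_keys

# Usage with freshly generated dummy keys for two access groups:
groups = {name: rsa.generate_private_key(public_exponent=65537, key_size=2048)
          for name in ("doctors", "admins")}
nonce, ct, wrapped = encrypt_subgraph(
    b"<http://ex.org/a> <http://ex.org/p> <http://ex.org/b> .",
    {name: k.public_key() for name, k in groups.items()})
```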

# Originality

The big contribution of this paper is not that it performs automatic
partitioning, nor that it provides encryption of sensitive data, but that
it does these things in combination with the ability to query the
data. This combination is very novel and has many concrete
applications, both in Semantic Web research and in practice.

# Significance of the results

The significance of the results is not immediately clear, because
there are subtle benefits and downsides to the various evaluated
approaches. Tables 6 and 7 make a fair attempt at giving an overview
of the various pros and cons. It is unfortunate that approach
crypt-D cannot really be compared to the other three approaches, since
it uses a very different encoding for assertions/triples to begin
with. It is not entirely clear to me why crypt-D is implemented in
such a different way, and/or whether the performance hit that is due
to the different representation can be calculated independently
(thereby making the performance results presented here more
comparable).

# Approach

Overall the approach is easy to follow. Specifically, the use of
examples helps in keeping track of the specifics of the described
approach. There are some things that are a bit unclear to me, mainly
WRT the notion of dictionary identifiers:

- p3: If S and O both start at index |SO|+1, the latter cannot end at
index |SO|+|S|+|O|. Maybe O starts at index |SO|+|S|+1? (A small sketch
of this numbering appears after this list.)
- p3. Role is mentioned but not explained. The roles seem to be node
and edge (which implies that an IRI that appears as a node and as an
edge receives two IDs).
- p3. If $G = \{\langle x,y,y \rangle\}$, id(y,D) does not seem to be
functional (receiving ID 1 and 2). Maybe functions id and ids
should return pairs that contain a role (node or edge) and an ID?
- p5. It is unclear why $\{G_1,\ldots,G_n\}$ must be a cover of G.
- p5. I do not understand the definition of the canonical partition.
S is drawn from a set of Boolean _functions_. Is S' drawn from the
same set? But then i and j cannot be members of S and S', because
they are not functions.
- p6. The two admissibility conditions on R also require i and j to be
pairs i.o. integers (otherwise an edge i in ids(T) could be related,
through R, to a node i in ids(D)).
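
As referenced in the first item above, here is a small hypothetical sketch of the suggested numbering (shared SO terms first, then subject-only, then object-only terms in one contiguous ID space); it only makes the ranges explicit and does not claim to reproduce the paper's exact definition.

```python
# Hypothetical numbering sketch: shared subject-object terms get IDs
# 1..|SO|, subject-only terms |SO|+1..|SO|+|S|, and object-only terms
# |SO|+|S|+1..|SO|+|S|+|O|, so the three ranges do not overlap.

def assign_ids(subjects: set, objects: set) -> dict:
    so = sorted(subjects & objects)        # shared terms
    s_only = sorted(subjects - objects)
    o_only = sorted(objects - subjects)
    ids, next_id = {}, 1
    for section in (so, s_only, o_only):
        for term in section:
            ids[term] = next_id
            next_id += 1
    return ids

print(assign_ids({"a", "b", "c"}, {"c", "d"}))
# {'c': 1, 'a': 2, 'b': 3, 'd': 4}  ->  |SO|=1, S-only span 2..3, O-only span 4..4
```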

# Evaluation

- Why was `lzo' chosen to compress N-Triples and `gz' to compress HDT?
Is this a fair comparison?
- Is it correct/expected that crypt-C compresses _faster_ than regular
HDT?
- To what extent does it make sense to compare performance results of
crypt-D to results of crypt-A, crypt-B, and crypt-C, given that the
triple representation is entirely different?
- While it makes sense that decryption time increases when there are
more subgraphs (due to more duplication), I was expecting a tradeoff
for certain types of queries that only require a subset of D's and
T's to be decrypted in the calculation of the query result set.

# Discussion

The paper has focused on uninformed partitioning of a given dataset
into graphs. What about informed approaches that alter the D/T
partitioning based on use / execution times measured while querying?

# Typos

Overall, the paper is well written; here are only a couple of typos
that I came across:

- LaTeX quotes are sometimes the wrong way around, e.g., on p2 ‘access
restricted’
- Text sometimes appears outside columns, e.g., on p5
‘access-restricted’ and the ‘=’-sign; on p8 \union; on p9
HDT_{crypt-B}.
- p2: “investigate [on]”
- p2: “each of our partitioning strategies [is] able”
- p5: the double quotes in the example triple appear in code, so they
should be the ASCII double quote character.
- p7: “the notation of T(G,D) from above”, but T(G,D) has not been
used or defined before.
- p11: ‘N-Triples’ i.o. ‘NTriples’
- p13: “for the only matter of” → “for the sole purpose of”
- p14: “but not the total graph”; should this not be ‘dataset’?
- p15: “all approaches show[s]”
- p19: “and i[s] not competitive”
- p19: “such [as] in DBpedia”