Continuous Top-k Approximated Join of Streaming and Evolving Distributed Data

Tracking #: 2055-3268

Shima Zahmatkesh
Emanuele Della Valle

Responsible editor: 
Oscar Corcho

Submission type: 
Full Paper
Continuously finding the most relevant (shortly, top-k) answer of a query that joins streaming and distributed data is getting a growing attention. In recent years, this is, in particular, happening in Social Media and IoT. It is well known that, in those settings, remaining reactive can be challenging, because accessing the distributed data can be highly time-consuming as well as rate-limited. In this paper, we consider even a more extreme situation: the distributed data slowly evolves. The state of the art proposes two families of partial solutions to this problem: i) the database community studied continuous top-k queries on a single data stream ignoring the join with distributed datasets, ii) the Semantic Web community studied approximate continuous joining of an RDF stream and a dynamic linked dataset that guarantees reactiveness but ignores the specificity of top-k queries. In this paper, extending the state-of-the-art approaches, we investigate continuous top-k query evaluation over a data stream joined with a slowly evolving distributed dataset. We extend the state-of-the-art data structure proposed for continuous top-k query evaluation and introduce Super-MTK+N list. Such a list handles changes in the distributed dataset while minimizing the memory usage. To address the query evaluation problem, first, we propose Topk+N algorithm. This is an extension of the state-of-the-art algorithm that handles the changed objects in the distributed dataset and manages them as new arrivals. Then, adopting the architectural approach presented by the Semantic Web community, we propose AcquaTop framework. This keeps a local replica of the distributed dataset and guarantees reactiveness by construction, but it may need to approximate the result. Therefore, we propose two maintenance policies to update the replica. We contribute a theoretical proof of the correctness of the proposed approach. We perform a complexity study. And we provide empirical evidence that the proposed policies provide more relevant and accurate results than the state of the art.
Full PDF Version: 

Minor Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
Anonymous submitted on 10/Dec/2018
Minor Revision
Review Comment:

I thank the authors for the clarification regarding my previous review. While some improvements have been made, I believe a minor revision is necessary before the paper can be accepted (of course, subject to other reviews).

Please find below my comments for the revised version:

- It's worth mentioning (briefly) what are the two maintenance policies and their main difference.
- "We perform a complexity study. And we provide empirical evidence that the proposed policies provide more relevant and accurate results than the state of the art."

The grammar here can be further improved.

- I appreciate the use of examples in the introduction. However, the first example about Twitter is somewhat misleading. The presented problem is more about resource allocation policy (by the Twitter company). I am not really sure how the proposed work may be useful in this scenario as long as the Twitter policy does not change.
- The introduction can be organized better by using subsectioning (or paragraph titling).
- Pg. 4: For the sake of display consistency, the maximum value of y-axis of Fig. 1(a) could be set to 10. This ensures easier comparison to the subfigures on the right.
- Pg. 4: Briefly mention how Super-MTK+N list is used in Topk+N algorithm.
- Typo: "but it may not
on the Web, which is decentralize" -> decentralized

- "An RSP-QL query is defined by a quadruple
" -> The spacing of the symbols (e.g., SDS and SE) can be further optimized, by perhaps, using \textit{} (e.g., \textit{SDS} and \textit{SE}). This also applies for all symbols used throughout the paper.
- Definition 2.5: I think the brackets/parentheses are used incorrectly. To say whether the border value is inclusive, one usually makes use of "[" or "]", and to say exclusive, one uses "(" or ")".
- "this thesis" -> this paper. This mistake occurs quite often in the paper.
- Pg. 6: RStream, IStream, and DStream are not described sufficiently. If the details do not matter, perhaps they could just be treated as a single Stream.
- Def. 2.7: There should be a specific notation for a top-k query instead of only narrating it.
- Pg. 6: "The high relevant items are more
valuable comparing to others." -> highly
- Pg. 6: "The lower the ranked
position of a relevant items" -> a relevant item
- To be more precise, the definition of precision@k is ill-defined, since the notations of tp and fp do not incorporate k.
- "which shows that
the result are more relevant and less accurate" -> is
Do please proofread the paper for other typos (I actually found some more -- more than 10, but the idea is to check more thoroughly for typos I did not mention).
- Sec. 2.2: The 1:1 join can be motivated more.
- Up to this point, there is still no clear attempt to at least explain what it means by "slow" in the term "slowly evolving dataset".
- "and any linear combination of these two variable can
guarantee to have a monotonic function" -> It is not always the case depending on the coefficients (positive vs. negative).
- Pg. 8: The error equation is ill-defined, particularly since 1 is a number while M is a set.
- Pg. 8: Moreover, if the goal of the paper is only to minimize the error, why don't just use the slow, Oracle approach? Please further detail the goal (e.g., minimizing the error under what constraints and which settings?). I believe the goal is more than just minimizing the error, but also to take into account reactiveness in the setting of streaming data combined with evolving distributed data.

- Pg. 9: "When an object arrives in
a the current window, it will also participate in all w/\beta
future windows" -> I believe that flooring (i.e., floor(w/\beta)) is necessary here.
- Fig. 3(a): As per the definition of LBP (lower bound pointer), shouldn't the LBP of W3 be pointing to A?
- Fig. 3(b): This subfigure is never mentioned in the text. Which part of the text refers to this subfigure? (Ah, I realized that the subfigure is mentioned much later in the text; this situation could be improved).
- Sec. 3.2: Again, the 1:1 join is mentioned, but is not motivated well.
- Here, I believe that the notion of slowly distributed dataset is tightly coupled with the refresh budget (as per "how slow is slow"). Can this be further elaborated in Sec. 3.2?
- "The normalized
renewed best before time (Vi(t)) of mappings
A^R; B^R; E^R at time 6 are respectively 3, 2 and 3." -> How are these values computed as per Eq. (2)?

- Pg. 12: In the intro paragraph, the sections are mentioned not in the proper order.
- Sec. 4.1: I believe that adding a similar illustration as in Fig. 3 would be suitable here, and may greatly enhance the readability.
- Sec. 4.1: "As the scoring variable
of the changed object cannot be greater than
min.score_S, " -> cannot be lower?
- Sec. 4.1: Perhaps I am missing something here, but if the score computation is based on a lower bound of streaming score, would it risk the non-inclusion of some mapping in top-k results (since its score will become lower than what it should be), which again would impact the precision@k score?
- Sec. 4.4: A small example in the style of Fig. 3 for each of the functionalities described in Sec. 4.4.1, Sec. 4.4.2, and Sec. 4.4.3 would greatly enhance readability here. I acknowledge that on Pg. 17, the authors refer to Fig. 3(b) for an example (which is not ideal due to the far positioning), yet this is lacking for the following two reasons: this does not rely on the Super-MTK+N list structure; and this does not show each of the functionalities in Sec. 4.4.1, Sec. 4.4.2, and Sec. 4.4.3.

- Again, though this part is not as dense as the Topk+N solution part, a little example in the style of Fig. 3 demonstrating the main difference between AT-TSM and AT-BSM would improve readability. This is particularly important since the explanation of AT-BSM at the current version is still rather broad.
- My feedback above should also help in distinguishing between the AT-TSM policy and AT-WBM policy since I believe they both look quite similar.
- Sec. 5.3: "Moreover, the
state-of-the-art shows that using materialization-thensort
approach (like ACQUA) has higher computational
overhead comparing to the incremental approaches
(MinTopk and AcquaTop)." -> Missing a citation here, which state-of-the-art?

- Sec. 6.2: This part contains a valuable insight as to a strategy to set how big is the refresh budget and K+N. Perhaps, this can be mentioned again in the post-experiment discussion part, if the strategy works out.
- The cumulative formulas in Pg. 23 are not displayed optimally, since the equation uses a line break.
- Fig. 9(b) still uses the terms in the original paper (MTKN-A, etc).
- Fig. 9 needs a bit more explanation: What main insight can be gained from Fig. 9 wrt. the different policies?
- Sec. 6.3: WST lacks a sufficient explanation.
- The experiment discussion does not point out the drawbacks of using AT-ASM, particularly since AT-ASM is included anyways in the experiment, meaning it can be done in practice. Please provide sufficient information why AT-ASM should not be used, based on your experiment results (and perhaps, other observations missing in the paper, such as reactiveness, etc).
- Fig. 12(b) and 12(d): Perhaps the authors can explain as stated in the Response to Reviews letter as to why the performance degrades in the middle area, and then gets better to the end; that it relates more to the dataset characteristics.
- Fig. 13(a): I wonder why small N gives the best performance for AT-BSM?

Review #2
Anonymous submitted on 21/Jan/2019
Minor Revision
Review Comment:

The paper addresses the problem of efficiently resolving top-k join queries in the context of streaming data joining evolving distributed data. To that end, authors first propose an extension of the existing Mintopk algorithm and Super-MTK structure in order to support new arrivals of data coming from the changes in the evolving distributed data. The new structure, Super-MTK+N adds N positions to the original structure, acting as a buffer for current and future evaluations. Given that the streaming data out of the top list are discarded, authors discuss the cases in which the new data can also serve to resolve future evaluations in an exact or approximate way. In this first step, authors assume that data changes are notified by the distributed source. Then, the paper proposes AcquaTop, an architecture that integrates Super-MTK+N in the existing Acqua architecture, which tackles the challenge of querying evolving distributed data providing a local replica that is updated based on some refresh policies. The main idea of the integration is that the candidates for the refreshment are taken from the Super-MTK+N list, which will be re-fetched from the distributed sources and could lead to the new changes that need to be considered in the topk computation. Authors propose several policies to select the elements to be refetched and perform an evaluation over user mentions and followers in twitter with varying number of changes, refresh budget, number of top-k results and number of N extra elements. Results show that the two defined policies generally outperform standard policies.

Overall, the paper is interesting and the approach is technically sound. The paper has significantly improved in the last revision, in particular it is better structured and more clear.

I still have some comments, but they can be solved in a minor review:

The new two examples added in the introduction seem useful but (i) the explanations are too long, postpone the important text in the introduction and make it too large and (ii) the twitter example is a bit unclear: is it presented to justify the top-k join instead of a full join, or to present a case where the data is slowly evolving “artifically” given the restrictions of the API? I would suggest to clarify the motivation of twitter example and short the excessive details of both examples (in particular, limit the number details of the twitter APIs) or even consider to include only one of them. Another alternative is to move them to a motivation example after the introduction.

In the revised version, authors changed the title to limit the expectations, but still the limitation to 1:1 joins should be prominent in the introduction. In addition, the focus and limit on joins is not clearly stated in the introduction e.g., “In this paper, we extend the state-of-the-art approach for top-k query evaluation”.

Regarding the blurry definition of “slowly changes”, in spite of the explanations of authors in the rebuttal letter, readers would still have similar concerns while reading the paper (a proof of that is that reviewers coincide in pointing out the problem). I would strongly suggest to solve this problem in the paper (a) avoiding the term and just referring to evolving data and, at some point mention that in general one can assume that the change ratio of the distributed data is slower than the joined streams as otherwise it could be seen as another data stream, (b) use the term but explicitly state your understanding of “slowly change”.

Regarding Table 1, I cannot find in the paper the comment “We assume in the current window we have the exact result up to now, and if we have approximated result from previous evaluations, in all the condition we have approximated result.”.

The new result table 6 is a bit difficult to grasp. Including bold font for the winning figures and vertically splitting the 4 columns of the experiments would help.

- The abstract is a bit verbose and contains some typos: “We perform a complexity study. And we provide”
- The explanation of subsections in the beginning of Section 4 is out of order.
- The caption of Algorithm 1 can be more descriptive.
- A pass of grammar mistakes and typos would be still required. For example, in page 11, “the elected” –> “the selected”.