Review Comment:
I thank the authors for the clarification regarding my previous review. While some improvements have been made, I believe a minor revision is necessary before the paper can be accepted (of course, subject to other reviews).
Please find below my comments for the revised version:
ABSTRACT
- It's worth mentioning (briefly) what are the two maintenance policies and their main difference.
- "We perform a complexity study. And we provide empirical evidence that the proposed policies provide more relevant and accurate results than the state of the art."
The grammar here can be further improved.
INTRODUCTION
- I appreciate the use of examples in the introduction. However, the first example about Twitter is somewhat misleading. The presented problem is more about resource allocation policy (by the Twitter company). I am not really sure how the proposed work may be useful in this scenario as long as the Twitter policy does not change.
- The introduction can be organized better by using subsectioning (or paragraph titling).
- Pg. 4: For the sake of display consistency, the maximum value of y-axis of Fig. 1(a) could be set to 10. This ensures easier comparison to the subfigures on the right.
- Pg. 4: Briefly mention how Super-MTK+N list is used in Topk+N algorithm.
- Typo: "but it may not
on the Web, which is decentralize" -> decentralized
PROBLEM DEFINITION
- "An RSP-QL query is defined by a quadruple
" -> The spacing of the symbols (e.g., SDS and SE) can be further optimized, by perhaps, using \textit{} (e.g., \textit{SDS} and \textit{SE}). This also applies for all symbols used throughout the paper.
- Definition 2.5: I think the brackets/parentheses are used incorrectly. To say whether the border value is inclusive, one usually makes use of "[" or "]", and to say exclusive, one uses "(" or ")".
- "this thesis" -> this paper. This mistake occurs quite often in the paper.
- Pg. 6: RStream, IStream, and DStream are not described sufficiently. If the details do not matter, perhaps they could just be treated as a single Stream.
- Def. 2.7: There should be a specific notation for a top-k query instead of only narrating it.
- Pg. 6: "The high relevant items are more
valuable comparing to others." -> highly
- Pg. 6: "The lower the ranked
position of a relevant items" -> a relevant item
- To be more precise, the definition of precision@k is ill-defined, since the notations of tp and fp do not incorporate k.
- "which shows that
the result are more relevant and less accurate" -> is
Do please proofread the paper for other typos (I actually found some more -- more than 10, but the idea is to check more thoroughly for typos I did not mention).
- Sec. 2.2: The 1:1 join can be motivated more.
- Up to this point, there is still no clear attempt to at least explain what it means by "slow" in the term "slowly evolving dataset".
- "and any linear combination of these two variable can
guarantee to have a monotonic function" -> It is not always the case depending on the coefficients (positive vs. negative).
- Pg. 8: The error equation is ill-defined, particularly since 1 is a number while M is a set.
- Pg. 8: Moreover, if the goal of the paper is only to minimize the error, why don't just use the slow, Oracle approach? Please further detail the goal (e.g., minimizing the error under what constraints and which settings?). I believe the goal is more than just minimizing the error, but also to take into account reactiveness in the setting of streaming data combined with evolving distributed data.
BACKGROUND
- Pg. 9: "When an object arrives in
a the current window, it will also participate in all w/\beta
future windows" -> I believe that flooring (i.e., floor(w/\beta)) is necessary here.
- Fig. 3(a): As per the definition of LBP (lower bound pointer), shouldn't the LBP of W3 be pointing to A?
- Fig. 3(b): This subfigure is never mentioned in the text. Which part of the text refers to this subfigure? (Ah, I realized that the subfigure is mentioned much later in the text; this situation could be improved).
- Sec. 3.2: Again, the 1:1 join is mentioned, but is not motivated well.
- Here, I believe that the notion of slowly distributed dataset is tightly coupled with the refresh budget (as per "how slow is slow"). Can this be further elaborated in Sec. 3.2?
- "The normalized
renewed best before time (Vi(t)) of mappings
A^R; B^R; E^R at time 6 are respectively 3, 2 and 3." -> How are these values computed as per Eq. (2)?
TOPK+N SOLUTION
- Pg. 12: In the intro paragraph, the sections are mentioned not in the proper order.
- Sec. 4.1: I believe that adding a similar illustration as in Fig. 3 would be suitable here, and may greatly enhance the readability.
- Sec. 4.1: "As the scoring variable
of the changed object cannot be greater than
min.score_S, " -> cannot be lower?
- Sec. 4.1: Perhaps I am missing something here, but if the score computation is based on a lower bound of streaming score, would it risk the non-inclusion of some mapping in top-k results (since its score will become lower than what it should be), which again would impact the precision@k score?
- Sec. 4.4: A small example in the style of Fig. 3 for each of the functionalities described in Sec. 4.4.1, Sec. 4.4.2, and Sec. 4.4.3 would greatly enhance readability here. I acknowledge that on Pg. 17, the authors refer to Fig. 3(b) for an example (which is not ideal due to the far positioning), yet this is lacking for the following two reasons: this does not rely on the Super-MTK+N list structure; and this does not show each of the functionalities in Sec. 4.4.1, Sec. 4.4.2, and Sec. 4.4.3.
ACQUATOP SOLUTION
- Again, though this part is not as dense as the Topk+N solution part, a little example in the style of Fig. 3 demonstrating the main difference between AT-TSM and AT-BSM would improve readability. This is particularly important since the explanation of AT-BSM at the current version is still rather broad.
- My feedback above should also help in distinguishing between the AT-TSM policy and AT-WBM policy since I believe they both look quite similar.
- Sec. 5.3: "Moreover, the
state-of-the-art shows that using materialization-thensort
approach (like ACQUA) has higher computational
overhead comparing to the incremental approaches
(MinTopk and AcquaTop)." -> Missing a citation here, which state-of-the-art?
EVALUATION
- Sec. 6.2: This part contains a valuable insight as to a strategy to set how big is the refresh budget and K+N. Perhaps, this can be mentioned again in the post-experiment discussion part, if the strategy works out.
- The cumulative formulas in Pg. 23 are not displayed optimally, since the equation uses a line break.
- Fig. 9(b) still uses the terms in the original paper (MTKN-A, etc).
- Fig. 9 needs a bit more explanation: What main insight can be gained from Fig. 9 wrt. the different policies?
- Sec. 6.3: WST lacks a sufficient explanation.
- The experiment discussion does not point out the drawbacks of using AT-ASM, particularly since AT-ASM is included anyways in the experiment, meaning it can be done in practice. Please provide sufficient information why AT-ASM should not be used, based on your experiment results (and perhaps, other observations missing in the paper, such as reactiveness, etc).
- Fig. 12(b) and 12(d): Perhaps the authors can explain as stated in the Response to Reviews letter as to why the performance degrades in the middle area, and then gets better to the end; that it relates more to the dataset characteristics.
- Fig. 13(a): I wonder why small N gives the best performance for AT-BSM?
|