How to get a fixed order of same-score query results?

Dennis Dam
Hi,

At our client we are running several site instances with embedded repositories. On one particular page, we perform a query that simply gets all items of a certain type, sorted by date (descending). What we are observing now is that the order of the returned hits is random when hits have the same date (and the same JCR score)! For example, on server 1 the order of the results could be:

document A
document B
..

while on server 2 the order is:

document B
document A
..

To solve it, we could add a second ordering to the query on some arbitrary field, for example the UUID of the document. This would force the same ordering on every server. BUT this involves quite some work, because we'd have to change a lot of queries to add this 'fallback' / secondary ordering. Can we solve this in an integral way, e.g. in the repository?

tia
Dennis

_______________________________________________
Hippo-cms7-user mailing list and forums
http://www.onehippo.org/cms7/support/forums.html
Re: How to get a fixed order of same-score query results?

Ard
Hello Dennis,

On Fri, Jul 16, 2010 at 9:09 AM, Dennis Dam <[hidden email]> wrote:

> Hi,
> At our client we are running several site instances with embedded
> repositories.  On one particular page, we perform a query where we simply
> get all items of a certain type, sorted by date (descending). What we are
> observing now is that the order of the returned hits is random, if these
> hits have the same date (and the same JCR score)! For example, on server 1
> the order of the results could be:
> document A
> document B
> ..
> while on server 2 the order is :
> document B
> document A
> ..
> To solve it, we could add a second ordering to the query on some random
> field, for example the UUID of the document. This would force the same

Off the top of my head, you cannot sort on UUID. Sorting on nodename
is possible, iirc.

> ordering on every server. BUT,  this involves quite some work, because we'd
> have to change a lot of queries to add this 'fallback' / secondary ordering.
> Can we solve this in an integral way , e.g. in the repository ?

Interesting issue. This seems to me to be a Jackrabbit/Lucene issue.
It should be solvable, but we really have to be in the Jackrabbit core
for this.

Regards Ard

Re: How to get a fixed order of same-score query results?

Dennis Dam


On Fri, Jul 16, 2010 at 9:19 AM, Ard Schrijvers <[hidden email]> wrote:
> Hello Dennis,
>
> On Fri, Jul 16, 2010 at 9:09 AM, Dennis Dam <[hidden email]> wrote:
>> Hi,
>> At our client we are running several site instances with embedded
>> repositories.  On one particular page, we perform a query where we simply
>> get all items of a certain type, sorted by date (descending). What we are
>> observing now is that the order of the returned hits is random, if these
>> hits have the same date (and the same JCR score)! For example, on server 1
>> the order of the results could be:
>> document A
>> document B
>> ..
>> while on server 2 the order is :
>> document B
>> document A
>> ..
>> To solve it, we could add a second ordering to the query on some random
>> field, for example the UUID of the document. This would force the same
>
> you cannot sort on UUID from the top of my head. Sorting on nodename
> is possible iirc.


The problem with sorting on nodename is that the nodename is often equal to the document's title. The website visitor would then see alphabetical ordering for same-score hits, which would look kind of strange. I need a field with random contents to order on ;)

>> ordering on every server. BUT,  this involves quite some work, because we'd
>> have to change a lot of queries to add this 'fallback' / secondary ordering.
>> Can we solve this in an integral way , e.g. in the repository ?
>
> Interesting issue. This seems to me to be a Jackrabbit/Lucene issue.
> It should be solvable, but we really have be in the Jackrabbit core
> for this.


Shall I create a Jira issue for this?



Re: How to get a fixed order of same-score query results?

b.vanderschans@onehippo.com
In reply to this post by Ard
On Fri, Jul 16, 2010 at 9:19 AM, Ard Schrijvers
<[hidden email]> wrote:

> Hello Dennis,
>
> On Fri, Jul 16, 2010 at 9:09 AM, Dennis Dam <[hidden email]> wrote:
>> Hi,
>> At our client we are running several site instances with embedded
>> repositories.  On one particular page, we perform a query where we simply
>> get all items of a certain type, sorted by date (descending). What we are
>> observing now is that the order of the returned hits is random, if these
>> hits have the same date (and the same JCR score)! For example, on server 1
>> the order of the results could be:
>> document A
>> document B
>> ..
>> while on server 2 the order is :
>> document B
>> document A
>> ..
>> To solve it, we could add a second ordering to the query on some random
>> field, for example the UUID of the document. This would force the same
>
> you cannot sort on UUID from the top of my head. Sorting on nodename
> is possible iirc.
>
>> ordering on every server. BUT,  this involves quite some work, because we'd
>> have to change a lot of queries to add this 'fallback' / secondary ordering.
>> Can we solve this in an integral way , e.g. in the repository ?
>
> Interesting issue. This seems to me to be a Jackrabbit/Lucene issue.
> It should be solvable, but we really have be in the Jackrabbit core
> for this.
Can't you use the respectDocumentOrder option? It could have a bad
performance impact though aiui.

Bart
Re: How to get a fixed order of same-score query results?

Ard
In reply to this post by Dennis Dam
On Fri, Jul 16, 2010 at 9:29 AM, Dennis Dam <[hidden email]> wrote:
>>
>> you cannot sort on UUID from the top of my head. Sorting on nodename
>> is possible iirc.
>>
> the problem with solving on nodename is that the nodename is often equal to
> the documents title. The website visitor would then see alphabetic ordering
> for same-score hits, which would look kind of strange. I need a field with
> random contents to order on ;)

Ok. But as explained, I don't think you can sort on UUID.

>>
>> > ordering on every server. BUT,  this involves quite some work, because
>> > we'd
>> > have to change a lot of queries to add this 'fallback' / secondary
>> > ordering.
>> > Can we solve this in an integral way , e.g. in the repository ?
>>
>> Interesting issue. This seems to me to be a Jackrabbit/Lucene issue.
>> It should be solvable, but we really have be in the Jackrabbit core
>> for this.

> Shall I create a Jira issue for this ?

Of course you can. Realize, however, that the fix will have to be
made somewhere in Jackrabbit. I hope it is not too hard, but it
might actually be a really complex issue to fix properly: every
cluster node has its own Lucene index. These indexes return the
same results, but internally they can be quite different from each
other with respect to the internal Lucene doc ids.
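
To illustrate the point about internal doc ids, here is a minimal Python sketch (plain Python, not Jackrabbit or Lucene code). It assumes that hits with equal sort values fall back to ascending internal doc id; the documents, dates and doc ids are made up for the illustration.

```python
from datetime import date

# Hypothetical hits: (title, publication date). A and B share the same date.
hits = [
    ("document A", date(2010, 7, 15)),
    ("document B", date(2010, 7, 15)),
    ("document C", date(2010, 7, 14)),
]

# Assumed internal doc ids per cluster node. Each node builds its own index,
# so the same document can end up with a different id on each node.
doc_ids_server1 = {"document A": 0, "document B": 1, "document C": 2}
doc_ids_server2 = {"document B": 0, "document C": 1, "document A": 2}


def sorted_titles(doc_ids):
    """Sort by date descending; equal dates fall back to ascending doc id."""
    ranked = sorted(hits, key=lambda h: (-h[1].toordinal(), doc_ids[h[0]]))
    return [title for title, _ in ranked]


print(sorted_titles(doc_ids_server1))
print(sorted_titles(doc_ids_server2))
```

Both nodes return the same set of hits, but A and B swap places because the only thing left to break the tie is node-local.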

Regards Ard

Re: How to get a fixed order of same-score query results?

Ard
In reply to this post by b.vanderschans@onehippo.com
On Fri, Jul 16, 2010 at 9:34 AM, Bart van der Schans
<[hidden email]> wrote:
>> Interesting issue. This seems to me to be a Jackrabbit/Lucene issue.
>> It should be solvable, but we really have be in the Jackrabbit core
>> for this.
> Can't you use the respectDocumentOrder option? It could have a bad
> performance impact though aiui.

No, don't use respectDocumentOrder. First of all, it is only taken
into account when you are not sorting (they already are sorting, it is
just not deterministic enough). Secondly, saying it 'could' have a bad
performance impact is an understatement. It is not an option for
repositories containing serious amounts of data.

Regards Ard

Re: How to get a fixed order of same-score query results?

Dennis Dam
Alright, clear. Thanks! I will make an issue for it anyway ;)


--
Hippo B.V.  -  Amsterdam
Oosteinde 11, 1017 WT, Amsterdam, +31(0)20-5224466

Hippo USA Inc.  -  San Francisco
101 H Street, Suite Q, Petaluma CA, 94952-3329, +1 (707) 773-4646
-----------------------------------------------------------------
http://www.onehippo.com   -  [hidden email]
-----------------------------------------------------------------


Re: How to get a fixed order of same-score query results?

Gerrit Berkouwer
In reply to this post by Dennis Dam
Why do we not use minutes and seconds in the 'date' to order the results? This would certainly give much more accurate results, would it not? There would only be a problem with documents with the exact same timestamp, but roughly it would be much better than just using the date...
--
Greetz, Gerrit
Re: How to get a fixed order of same-score query results?

Ard
Hello Gerrit,

On Fri, Jul 16, 2010 at 5:43 PM, Gerrit Berkouwer
<[hidden email]> wrote:
>
> Why do we not use minutes and seconds in the 'date' to order the results?

I think by default, the cms interface for the date plugin stores with
a granularity of day. Most likely, this is also the granularity that
makes sense for, let's say, a news article. The hour and minute aren't
important.

> This would certainly give much more accurate results, would it not? There

I don't think it is really about accuracy here. It's about a
different ordering on different cluster nodes when multiple results
are equal with respect to the sort. The order is then not
deterministic, which results in a different ordering on different
nodes. So your suggestion will certainly take away the problem in
99.99% of the cases. However, it does not solve the real issue. It is
a workaround. And why would you store some random hour, minute and
second if it adds no real information, but only solves a corner-case
ordering issue in a clustered environment?

That said, I am also worried about the suggested workaround. This is
something you cannot know, and it is quite low-level Lucene, but I'll
try to explain it anyhow:

When doing range queries, Lucene does query expansion. So "give me all
articles of 2009" is internally translated into an OR query over all
unique dates present in 2009. With a granularity of day, we have an OR
query with at most 365 terms. With a granularity of hour it can be
24 * 365, with minutes 60 times that, with seconds another 60 times,
and so on for ms. Obviously, I am stating the maximum: the number of
OR terms will never be larger than the number of articles. But I think
you already see my point: your suggestion deteriorates the performance
of range queries enormously (don't be surprised if range queries
become a factor 1000 slower). Lucene has recently improved heavily on
this with TrieRange queries (based upon the large similarity between
dates, you can expand much more efficiently). Unfortunately,
Jackrabbit still has to take quite some hurdles to be able to benefit
from this. Hopefully I'll be able to work on this in the future, as it
is one of my favorite areas...
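
As a rough worked example of the arithmetic above, the following Python sketch computes the upper bound on the number of OR terms for a one-year range query per stored granularity. The names `terms_per_year` and `max_or_terms` are only illustrative, and a non-leap year is assumed.

```python
# Maximum number of distinct date values in one (non-leap) year,
# depending on the granularity the dates are stored with.
DAYS = 365
terms_per_year = {
    "day": DAYS,
    "hour": DAYS * 24,
    "minute": DAYS * 24 * 60,
    "second": DAYS * 24 * 60 * 60,
    "millisecond": DAYS * 24 * 60 * 60 * 1000,
}


def max_or_terms(granularity: str, article_count: int) -> int:
    """Upper bound on OR terms for a range query spanning one year.

    There can never be more unique dates than there are articles.
    """
    return min(terms_per_year[granularity], article_count)


for name, count in terms_per_year.items():
    print(f"{name:>11}: {count}")
```

With 100,000 articles in the range, day granularity still caps the expansion at 365 terms, while second granularity can expand to one term per article.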

> would only be a problem with documents with the exact same timestamp, but
> roughly it would be much better than just using the date...

timestamps can be on ms: having this as a cornercase is quite
acceptable as it is really unlikely to happen, and the worst thing
possible is a different order between two articles on a clustered
environment...once every 6 billion years...

Another disadvantage of storing dates with a much finer granularity
is that memory consumption becomes much higher. This is also a very
common Lucene issue...perhaps even quite a common problem of inverted
indexes in general.

So, yes, your solution does solve this corner case, but I am afraid it
opens up a whole world of far more serious problems.

Regards Ard

Re: How to get a fixed order of same-score query results?

Gerrit Berkouwer
Ok, I totally get what you are saying. As you maybe know I am a fan of speed, just as you :-).

Only, I think using minutes and seconds for a series of documents is not strange at all, and not a corner case. Think of weblog functionality where the blogger wants to publish several posts a day and wants them in exact order, newest on top... The same goes for news documents: if there are several a day, you really want them ordered by minute, newest on top.

How would we order those then in a clustered environment?
--
Greetz, Gerrit
Re: How to get a fixed order of same-score query results?

Ard
Hello Gerrit,

On Sat, Jul 17, 2010 at 5:58 PM, Gerrit Berkouwer
<[hidden email]> wrote:
>
> Ok, I totally get what you are saying. As you maybe know I am a fan of speed,
> just as you :-).

yes, need for speed!

>
> Only, I think using minutes and seconds in a row of documents is not strange
> at all and not a corner-case. Think about using a weblog functionality where
> the blogger wants to blog several blogs a day, and wants these blogs to be
> in the exact order, newest on top... And also news-documents: if there are
> several a day, you really want those to be ordered by minute, newest on top.

In this case, I'd store them with a granularity of minutes...of course
:-). And this won't be a problem. I was merely trying to explain
that putting random hours/minutes/seconds on news articles, where only
the day really matters, does not add value: in that case, you are
proposing a workaround for the clustered environment instead of
tackling the issue. But if you want articles to have minute/second
dates, just do so! Yes, it will consume more memory and will be
slower...once you grow well beyond, let's say, 100,000 articles. And
then it is specifically range queries that suffer. Just sorting will
be good enough for a couple of million.

Also, to be sure, I'd really like to address the Jackrabbit issue,
where we need to improve the Lucene implementation to be able to use
the latest and greatest Lucene features, like fast range queries on
unique dates. I think it is a three-month job, however, as it means a
large restructuring of the Jackrabbit indexing code, which contains
hundreds of classes.

>
> How would we order those then in a clustered environment?

As explained, if the second in the date is important, then store the
second...but rather not the millisecond if that one does not matter.
I think you agree on this one, right?
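
A minimal sketch of storing only the granularity that matters, in plain Python. `truncate` is a hypothetical helper for illustration, not something the CMS or Jackrabbit provides; where you would apply it before saving a date is left open.

```python
from datetime import datetime

# For each granularity, the finer-grained fields that get reset to zero.
_RESET = {
    "day": dict(hour=0, minute=0, second=0, microsecond=0),
    "minute": dict(second=0, microsecond=0),
    "second": dict(microsecond=0),
}


def truncate(ts: datetime, granularity: str) -> datetime:
    """Drop all precision finer than the requested granularity."""
    return ts.replace(**_RESET[granularity])


published = datetime(2010, 7, 17, 17, 58, 42, 123456)
print(truncate(published, "day"))     # news article: only the day matters
print(truncate(published, "minute"))  # blog post: ordered by minute
```

The coarser the stored value, the fewer unique terms end up in the index for the same number of documents.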

Regards Ard

Re: How to get a fixed order of same-score query results?

Arje Cahn
Administrator
In reply to this post by Gerrit Berkouwer
Which date field are you sorting on?


--
Regards,

Arjé Cahn

Re: How to get a fixed order of same-score query results?

Gerrit Berkouwer
I really do not know, I'm not a techie, Dennis will know for sure! :-)
--
Greetz, Gerrit
Re: How to get a fixed order of same-score query results?

Ard
In reply to this post by Dennis Dam
Hello Dennis,

On Fri, Jul 16, 2010 at 9:51 AM, Dennis Dam <[hidden email]> wrote:
> Alright, clear.. thanks ! I will make an issue for it anyway ;)

To get back to you with a really easy, 100% foolproof, and even quite
sensible solution: extend your order by clause to also order on
@hippostdpubwf:lastModificationDate descending. So where you had:

order by  @myproject:date descending

you change it to:

order by  @myproject:date descending,
@hippostdpubwf:lastModificationDate descending

That will make sorting in clusters correct, as I assume the documents
have a unique lastModificationDate (you can also use creationDate of
course)
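
Since the original concern was having to touch a lot of queries, here is a minimal Python sketch of appending the secondary sort key in one place. `with_tiebreak` is a hypothetical helper and `myproject:news` is a made-up node type. The regex match on `order by` is naive (it would also match the words inside a string literal), so treat this as an illustration rather than a robust query rewriter.

```python
import re

# The property suggested above as secondary sort key.
TIEBREAK = "@hippostdpubwf:lastModificationDate"

_ORDER_BY = re.compile(r"\border\s+by\b", re.IGNORECASE)


def with_tiebreak(xpath_query: str, prop: str = TIEBREAK) -> str:
    """Return the query with `prop descending` as the last sort key."""
    query = xpath_query.rstrip()
    match = _ORDER_BY.search(query)
    if match is None:
        # No ordering at all yet: order by the tiebreaker alone.
        return f"{query} order by {prop} descending"
    if prop in query[match.end():]:
        # Already sorted on this property: leave the query untouched.
        return query
    return f"{query}, {prop} descending"


query = "//element(*, myproject:news) order by @myproject:date descending"
print(with_tiebreak(query))
```

Applying the helper twice gives the same result as applying it once, so it is safe to call on queries that already carry the secondary ordering.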

Regards Ard

Re: How to get a fixed order of same-score query results?

Dennis Dam
@ard sounds like a sensible solution. As a temporary solution, I sorted on title, but could change that to the last modified date.

@arje/gerrit we are sorting on a field called "publication date", which is a standard date field.

Re: How to get a fixed order of same-score query results?

Arje Cahn
Administrator
In reply to this post by Ard
> you change it in:
>
> order by  @myproject:date descending,
> @hippostdpubwf:lastModificationDate descending

yup. I like! :)