In Apache Lucene, queries are responsible for creating sorted streams of matching doc IDs. Implementing a disjunctive query boils down to taking N input queries that produce sorted streams of doc IDs and combining them into a merged sorted stream of doc IDs. The textbook approach to this problem consists of putting input streams into a min-heap data structure ordered by their current doc ID. This approach has been referred to as BooleanScorer2 (BS2) in Lucene.
While BS2 works nicely, it gets a bit of overhead from having to rebalance the heap every time that it needs to move to the next match. BS1 tries to reduce this overhead by splitting the doc ID space into windows of 2,048 documents. In every window, BS1 iterates through all matching doc IDs, one clause at a time. On every doc ID, it computes the index of this doc ID in the window, sets the corresponding bit in a bitset, and adds the current score to the corresponding index in a double[2048]. Iterating matches within the window, then consists of iterating bits of the bitset and looking up the score at the corresponding index in the double[2048]. This approach often runs faster with queries that have many clauses or high-frequency clauses.
These two approaches have been described in a 1997 paper called “Space Optimizations for Total Ranking” by Doug Cutting, the creator of Lucene. BS2 is called “Parallel Merge” in this paper and described in section 4.1, while BS1 is called “Block Merge” and described in section 4.2. These are arguably more descriptive names than BS1 and BS2. Note that the description of “Block Merge” in the paper is quite different from what it looks like in Lucene today, but the underlying idea is the same.
Leave a Reply