Last week, a colleague asked me a really simple question about segments merging in Solr. After discussing the answer for some minutes while playing around with Solr, I realized that there are a lot of subtleties in this subject. So I started to look at the code and discovered some interesting things, which I’m going to summarize in this post.
What is a merge policy?
The process of choosing which segments are going to be merged, and in which way, is carried out by an abstract class named
MergePolicy. This class builds a
MergeSpecification, which is an object composed of a list of
OneMerge objects. Each one of them represents a single merge operation, defined by the segments that will be merged into a new one.
After a change in the index, the
IndexWriter invokes the
MergePolicy to obtain a
MergeSpecification. Next, it invokes the
MergeScheduler, who is in charge of determining when the merges will be executed. There are mainly two implementations of
ConcurrentMergeScheduler, which runs each merge in a separate thread, and the
SerialMergeScheduler, which runs all the merges sequentially in the current thread. Finally, when the time to do a merge comes, the
IndexWriter does part of the job, and delegates the other part to a
So, if we want to know when a set of segments is going to be merged, why a segment is being merged and another one isn’t, and other things like these, we should take a look at the
There are many implementations of
MergePolicy, but I’m going to focus in one of them (
LogByteSizeMergePolicy), because it’s Solr’s default, and I believe that it’s the one used by most people.
MergePolicy defines three abstract methods to construct a
findMergesis invoked whenever there is a change in the index. This is the method that I’ll study in this post.
findMergesForOptimizeis invoked whenever an optimize operation is called.
findMergesToExpungeDeletesis invoked whenever an expunge deletes operation is called.
Step by step
I’ll start by giving a brief and conceptual description of how the merge policy works, that you can follow by looking at the figure below:
- Sort the segments by name.
- Group the existing segments into levels. Each level is a contiguous set of segments.
- For each level, determine the interval of segments that are going to be merged.
Lets define a couple of parameters involved in the process of merging the segments that compose an index:
mergeFactor: this parameter determines many things, like how many segments are going to be merged into a new one, the maximum number of segments that can be in a level and the span of each level. Can be set in
minMergeSize: all segments whose size is less than this parameter’s value will belong to the same level. This value is fixed.
maxMergeSize: all segments whose size is greater than this parameter’s value won’t be ever merged. This value is fixed.
maxMergeDocs: all segments containing more documents than this parameter’s value won’t be merged. Can be set in
Constructing the levels
Let’s see how levels are constructed. To define the first level, the algorithm searches the maximum segment’ size. Let’s call this value
levelMaxSize. If this value is less than
minMergeSize, then all the segments are considered to be at the same level. Otherwise, let’s define
That’s quite an ugly arithmetic formula, but it means something like this:
levelMinSize will be almost
mergeFactor times less than
levelMaxSize (it would have been
mergeFactor times if 1 had been used as exponent instead of 0.75), but if it goes below
minMergeSize, make it equal to
After this calculation, the algorithm will choose which segments belong to the current level. To do this, it will find the newest segment whose size is greater or equal than
levelMinSize, and will consider this segment, and all the segments older than it, to be part of this level.
The next levels will be constructed using the same algorithm, but constraining the set of available segments to the ones newer than the newest of the previous level.
In the next figure you can see a simple example with
minMergeSize=1.6MiB. The intention behind using a
mergeFactor of 10 is to obtain levels with span of decreasing order of magnitude.
But in some cases, this algorithm can result in a structure of levels that you won’t expect if you don’t know the internals. Take, for example, the segments of the following table, assuming
How many levels are in this case? Let’s see: the maximum segment size is 200 MiB, thus,
levelMinSize is 35 MiB. The newest segment with size greater than
levelMinSize is x, so the first level will include x and all the segments older than it. This means that, in this case, there’s only one level!
Choosing which segments to merge
After defining the levels, the
MergePolicy will choose which segments to merge. To do this, it’ll analyze each level separately. If a level has less than
mergeFactor segments, then it’ll be skipped. Otherwise, each group of
mergeFactor contiguous segments will be merged into a new one. If at least one of the segments in a group is bigger than
maxMergeFactor, or has more than
maxMergeDocs documents, then the group is skipped.
Resuming the second example of the previous section, where only one level is present, the result of the merge will be:
Through the post, I’ve talked about these two parameters, which are involved in the process of merging. It’s worth noting that the value of them is hard coded in the source code of Lucene, and their values are the following:
minMergeSizehas a fixed value of 1.6 MiB. This means that any segment whose size is less than 1.6 MiB will be included in the last level.
maxMergeSizehas a fixed value of 2 GiB. This means that any segment whose size is greater than 2 GiB will never be merged.
While the algorithm itself isn’t extremely complex, you need to understand the internals if you want to predict what will happen to your index when more documents are added. Also, knowing how the merge policy works will help you in determining the value of some parameters like