Last week, a colleague asked me a really simple question about segments merging in Solr. After discussing the answer for some minutes while playing around with Solr, I realized that there are a lot of subtleties in this subject. So I started to look at the code and discovered some interesting things, which I’m going to summarize in this post.

# What is a merge policy?

The process of choosing which segments are going to be merged, and in which way, is carried out by an abstract class named `MergePolicy`

. This class builds a `MergeSpecification`

, which is an object composed of a list of `OneMerge`

objects. Each one of them represents a single merge operation, defined by the segments that will be merged into a new one.

After a change in the index, the `IndexWriter`

invokes the `MergePolicy`

to obtain a `MergeSpecification`

. Next, it invokes the `MergeScheduler`

, who is in charge of determining when the merges will be executed. There are mainly two implementations of `MergeScheduler`

: the `ConcurrentMergeScheduler`

, which runs each merge in a separate thread, and the `SerialMergeScheduler`

, which runs all the merges sequentially in the current thread. Finally, when the time to do a merge comes, the `IndexWriter`

does part of the job, and delegates the other part to a `SegmentMerger`

.

So, if we want to know when a set of segments is going to be merged, why a segment is being merged and another one isn’t, and other things like these, we should take a look at the `MergePolicy`

.

There are many implementations of `MergePolicy`

, but I’m going to focus in one of them (`LogByteSizeMergePolicy`

), because it’s Solr’s default, and I believe that it’s the one used by most people. `MergePolicy`

defines three abstract methods to construct a `MergeSpecification`

:

`findMerges`

is invoked whenever there is a change in the index. This is the method that I’ll study in this post.`findMergesForOptimize`

is invoked whenever an optimize operation is called.`findMergesToExpungeDeletes`

is invoked whenever an expunge deletes operation is called.

# Step by step

I’ll start by giving a brief and conceptual description of how the merge policy works, that you can follow by looking at the figure below:

- Sort the segments by name.
- Group the existing segments into levels. Each level is a contiguous set of segments.
- For each level, determine the interval of segments that are going to be merged.

# Parameters

Lets define a couple of parameters involved in the process of merging the segments that compose an index:

`mergeFactor`

: this parameter determines many things, like how many segments are going to be merged into a new one, the maximum number of segments that can be in a level and the span of each level. Can be set in`solrconfig.xml`

.`minMergeSize`

: all segments whose size is less than this parameter’s value will belong to the same level. This value is fixed.`maxMergeSize`

: all segments whose size is greater than this parameter’s value won’t be ever merged. This value is fixed.`maxMergeDocs`

: all segments containing more documents than this parameter’s value won’t be merged. Can be set in`solrconfig.xml`

.

# Constructing the levels

Let’s see how levels are constructed. To define the first level, the algorithm searches the maximum segment’ size. Let’s call this value `levelMaxSize`

. If this value is less than `minMergeSize`

, then all the segments are considered to be at the same level. Otherwise, let’s define `levelMinSize`

as:

That’s quite an ugly arithmetic formula, but it means something like this: `levelMinSize`

will be almost `mergeFactor`

times less than `levelMaxSize`

(it would have been `mergeFactor`

times if 1 had been used as exponent instead of 0.75), but if it goes below `minMergeSize`

, make it equal to `minMergeSize`

.

After this calculation, the algorithm will choose which segments belong to the current level. To do this, it will find the newest segment whose size is greater or equal than `levelMinSize`

, and will consider this segment, and all the segments older than it, to be part of this level.

The next levels will be constructed using the same algorithm, but constraining the set of available segments to the ones newer than the newest of the previous level.

In the next figure you can see a simple example with `mergeFactor=10`

and `minMergeSize=1.6MiB`

. The intention behind using a `mergeFactor`

of 10 is to obtain levels with span of decreasing order of magnitude.

But in some cases, this algorithm can result in a structure of levels that you won’t expect if you don’t know the internals. Take, for example, the segments of the following table, assuming `mergeFactor=10`

and `minMergeSize=1.6MiB`

:

Segment |
Size |

a | 200 MiB |

l | 88 MiB |

m | 8.9 MiB |

n | 6.5 MiB |

o | 1.4 MiB |

p | 842 KiB |

q | 842 KiB |

r | 842 KiB |

s | 842 KiB |

t | 842 KiB |

u | 842 KiB |

v | 842 KiB |

w | 842 KiB |

x | 160 MiB |

How many levels are in this case? Let’s see: the maximum segment size is 200 MiB, thus, `levelMinSize`

is 35 MiB. The newest segment with size greater than `levelMinSize`

is *x*, so the first level will include *x* and all the segments older than it. This means that, in this case, there’s only one level!

# Choosing which segments to merge

After defining the levels, the `MergePolicy`

will choose which segments to merge. To do this, it’ll analyze each level separately. If a level has less than `mergeFactor`

segments, then it’ll be skipped. Otherwise, each group of `mergeFactor`

contiguous segments will be merged into a new one. If at least one of the segments in a group is bigger than `maxMergeFactor`

, or has more than `maxMergeDocs`

documents, then the group is skipped.

Resuming the second example of the previous section, where only one level is present, the result of the merge will be:

Segment |
Size |

u | 842 KiB |

v | 842 KiB |

w | 842 KiB |

x | 160 MiB |

y | 311 MiB |

`minMergeSize`

and `maxMergeSize`

Through the post, I’ve talked about these two parameters, which are involved in the process of merging. It’s worth noting that the value of them is hard coded in the source code of Lucene, and their values are the following:

`minMergeSize`

has a fixed value of 1.6 MiB. This means that any segment whose size is less than 1.6 MiB will be included in the last level.`maxMergeSize`

has a fixed value of 2 GiB. This means that any segment whose size is greater than 2 GiB will never be merged.

# Conclusion

While the algorithm itself isn’t extremely complex, you need to understand the internals if you want to predict what will happen to your index when more documents are added. Also, knowing how the merge policy works will help you in determining the value of some parameters like `mergeFactor`

and `maxMergeDocs`

.