## Merge Policy Internals

Last week, a colleague asked me a really simple question about segments merging in Solr. After discussing the answer for some minutes while playing around with Solr, I realized that there are a lot of subtleties in this subject. So I started to look at the code and discovered some interesting things, which I’m going to summarize in this post.

# What is a merge policy?

The process of choosing which segments are going to be merged, and in which way, is carried out by an abstract class named MergePolicy. This class builds a MergeSpecification, which is an object composed of a list of OneMerge objects. Each one of them represents a single merge operation, defined by the segments that will be merged into a new one.

After a change in the index, the IndexWriter invokes the MergePolicy to obtain a MergeSpecification. Next, it invokes the MergeScheduler, who is in charge of determining when the merges will be executed. There are mainly two implementations of MergeScheduler: the ConcurrentMergeScheduler, which runs each merge in a separate thread, and the SerialMergeScheduler, which runs all the merges sequentially in the current thread. Finally, when the time to do a merge comes, the IndexWriter does part of the job, and delegates the other part to a SegmentMerger.

So, if we want to know when a set of segments is going to be merged, why a segment is being merged and another one isn’t, and other things like these, we should take a look at the MergePolicy.

There are many implementations of MergePolicy, but I’m going to focus in one of them (LogByteSizeMergePolicy), because it’s Solr’s default, and I believe that it’s the one used by most people. MergePolicy defines three abstract methods to construct a MergeSpecification:

• findMerges is invoked whenever there is a change in the index. This is the method that I’ll study in this post.
• findMergesForOptimize is invoked whenever an optimize operation is called.
• findMergesToExpungeDeletes is invoked whenever an expunge deletes operation is called.

# Step by step

I’ll start by giving a brief and conceptual description of how the merge policy works, that you can follow by looking at the figure below:

1. Sort the segments by name.
2. Group the existing segments into levels. Each level is a contiguous set of segments.
3. For each level, determine the interval of segments that are going to be merged.

# Parameters

Lets define a couple of parameters involved in the process of merging the segments that compose an index:

• mergeFactor: this parameter determines many things, like how many segments are going to be merged into a new one, the maximum number of segments that can be in a level and the span of each level. Can be set in solrconfig.xml.
• minMergeSize: all segments whose size is less than this parameter’s value will belong to the same level. This value is fixed.
• maxMergeSize: all segments whose size is greater than this parameter’s value won’t be ever merged. This value is fixed.
• maxMergeDocs: all segments containing more documents than this parameter’s value won’t be merged. Can be set in solrconfig.xml.

# Constructing the levels

Let’s see how levels are constructed. To define the first level, the algorithm searches the maximum segment’ size. Let’s call this value levelMaxSize. If this value is less than minMergeSize, then all the segments are considered to be at the same level. Otherwise, let’s define levelMinSize as:

$max\left(\frac{levelMaxSize}{mergeFactor^{0.75}}, minMergeSize\right)$

That’s quite an ugly arithmetic formula, but it means something like this: levelMinSize will be almost mergeFactor times less than levelMaxSize (it would have been mergeFactor times if 1 had been used as exponent instead of 0.75), but if it goes below minMergeSize, make it equal to minMergeSize.

After this calculation, the algorithm will choose which segments belong to the current level. To do this, it will find the newest segment whose size is greater or equal than levelMinSize, and will consider this segment, and all the segments older than it, to be part of this level.
The next levels will be constructed using the same algorithm, but constraining the set of available segments to the ones newer than the newest of the previous level.
In the next figure you can see a simple example with mergeFactor=10 and minMergeSize=1.6MiB. The intention behind using a mergeFactor of 10 is to obtain levels with span of decreasing order of magnitude.

But in some cases, this algorithm can result in a structure of levels that you won’t expect if you don’t know the internals. Take, for example, the segments of the following table, assuming mergeFactor=10 and minMergeSize=1.6MiB:

 Segment Size a 200 MiB l 88 MiB m 8.9 MiB n 6.5 MiB o 1.4 MiB p 842 KiB q 842 KiB r 842 KiB s 842 KiB t 842 KiB u 842 KiB v 842 KiB w 842 KiB x 160 MiB

How many levels are in this case? Let’s see: the maximum segment size is 200 MiB, thus, levelMinSize is 35 MiB. The newest segment with size greater than levelMinSize is x, so the first level will include x and all the segments older than it. This means that, in this case, there’s only one level!

# Choosing which segments to merge

After defining the levels, the MergePolicy will choose which segments to merge. To do this, it’ll analyze each level separately. If a level has less than mergeFactor segments, then it’ll be skipped. Otherwise, each group of mergeFactor contiguous segments will be merged into a new one. If at least one of the segments in a group is bigger than maxMergeFactor, or has more than maxMergeDocs documents, then the group is skipped.

Resuming the second example of the previous section, where only one level is present, the result of the merge will be:

 Segment Size u 842 KiB v 842 KiB w 842 KiB x 160 MiB y 311 MiB

# minMergeSize and maxMergeSize

Through the post, I’ve talked about these two parameters, which are involved in the process of merging. It’s worth noting that the value of them is hard coded in the source code of Lucene, and their values are the following:

• minMergeSize has a fixed value of 1.6 MiB. This means that any segment whose size is less than 1.6 MiB will be included in the last level.
• maxMergeSize has a fixed value of 2 GiB. This means that any segment whose size is greater than 2 GiB will never be merged.

# Conclusion

While the algorithm itself isn’t extremely complex, you need to understand the internals if you want to predict what will happen to your index when more documents are added. Also, knowing how the merge policy works will help you in determining the value of some parameters like mergeFactor and maxMergeDocs.

## OS’s cache does matter

These days I’ve been working on the performance benchmarks that I mentioned in my previous post. They consist in running a big number of queries against a Solr server, and verifying how each feature (sharding, highlighting, faceting, etc) affects the average query time.

To generate this great number of queries, I’ve made a tool that takes the terms from the TermsComponent, combines them using the operators AND, OR and NOT in a random manner, and then writes a file with an arbitrary number of queries. To run the tests I use Solrmeter (http://code.google.com/p/solrmeter). This program allows me to run all the queries in the file easily, and also displays some beautiful plots, which help me to check if everything is working OK. The Solr server is running on Ubuntu 10.04 Server Edition, in a machine with 8 GB of RAM and a quad core processor. The client is running on a laptop with Windows XP.

By default, Solrmeter runs the queries from the file in random order. I was not looking for that kind of feature, because I wanted to repeat exactly the same test several times. That’s why I wrote a little class that loads all the queries from a file and runs them in sequential order.

After running some tests, I found that the results weren’t always the same. In fact, I found something totally different from what I was expecting. As shown in the next figure, when running the first 1000 queries, the average query time was 104 ms. The first queries were slower, but I assumed that it was because the Solr caches were filling.

After restarting Solr and running the first 2000 queries, the result obtained is the one shown in the second figure: the first 1000 queries run really fast (near 20 ms), whereas the time for the next 1000 queries increases abruptly to about 60 ms.

So my question was: why is this happening? I can’t understand it! How is Solr remembering something about those first 1000 queries, if I’m completely restarting the server? Well, the answer was easier than I thought.  It’s because in the first case, the operating system needs to read all the blocks from disk, which are then stored in the cache. In the second case, there is virtually no reading time for the blocks needed to answer the first 1000 queries, because they were previously cached, but the next 1000 queries are new, so the operating system has to load a lot of new blocks. This justifies this abrupt increase.

Why the new queries need some previously unused blocks is something that I don’t know for sure, but I think that maybe it has to do with the documents that are returned only by the new queries. As a side note, the OS’s cache can be really big. In this example, it reached almost 2 GB.

After discovering this, I started to run the benchmarks, bearing in mind to clean the OS’s cache and restarting Solr before every run. But this time another unexpected thing happened! The first feature that I tried to test was the FastVectorHighlighter. I wanted to measure how much faster than the default highlighter (from now on, simply called Highlighter) it was. But the first one was running slower! After a series of tests, I figured out what was going on: the time lost by the OS loading from disk the term vectors was greater than the time saved by using the FastVectorHighlighter.

So, to confirm my suspicions, I repeated all the tests again, but this time I executed each one of them twice. The first time, in order to get the OS’s cache filled. The second time to measure the real query time. Now everything is working as expected! The FastVectorHighlighter is 30% faster than the Highlighter!

Therefore, we may conclude that:

1. If you are doing benchmarks, look at the initial conditions of the OS’s cache, and of other caches that you may have.
2. The OS’s cache is really useful, it decreases significantly the time required to answer a query (even after completely restarting the server!), so always remember to keep some memory free for the OS.

I hope to show the results of the benchmarks in the next post, so stay tuned!

## Solr Index Size Analysis

In this post I’m going to talk about a set of benchmarks that I’ve done with Solr. The goal behind it is to see how each parameter defined in the schema affects the size of the index and the performance of the system.

The first step was to fetch the set of documents that I was going to use in the tests. I wanted the documents to be composed of real text, so I started to look for sources in Internet. The first one that I really liked was Twitter. They provide a REST API that allows you to read a continuous stream of tweets, composed of approximately 1% of all the public tweets. Each tweet is expressed as a JSON Object, and carries meta-data about the message and the author. While this source allowed me to get a good number of documents in a short time (about 1.7 million tweets in 2 days), they were really small, so I started to look for a source of bigger documents, finally choosing Wikipedia. I downloaded the documents through HTTP using the “Random Article” feature in their site, obtaining about 160,000 articles in a couple of days. At the time of writting, the site download.wikipedia.org, which provides an easy way of downloading a bunch of articles, was out of service.

The next step was to design the schema. Because one of the objectives is to see how each change in the schema affects the size of the index, I used many different combination of parameters, as to measure the influence of each one of them. On each case, the database of stop-words was populated using the top 100 terms of each set of documents, obtained from the administration panel of Solr. For both datasets, the “omitNorms”, “termVectors” and “stopWords” parameters are referred to the “text” field. In all cases, the value of the parameters “termOffsets” and “termPositions” is the same as “termVectors”.

In the first figure you can see the size of the index for each schema for the Twitter data-set, and which proportion of the index corresponds to each parameter. Remember that this data-set has lots of documents (about 1.7 million) but each one is small (240 bytes on average). There are many remarkable things here. The first one is that the space occupied by the term vectors (~280 MiB when not using stop words) is almost equal to the space occupied by the inverted index itself (~240 MiB).  In second place, the space saved by omitting norms is almost negligible (~2 MiB). Third, the space saved by using stop word is doubled when storing term vectors, going from about 4% of the index to about 10%. Finally, the space occupied by the stored fields (~340 MiB) is considerably bigger than the space occupied by the inverted index itself.

In the second figure you can see the same information for the Wikipedia data-set. The size occupied by the norms is still negligible (< 1MiB), however, the size occupied by the stop words has increased to about 22% of the index size when not storing term vectors, and about 25% when storing them. This time, the size occupied by the term vectors (~1067 MiB) is almost three times the space occupied by the inverted index itself (~380 MiB). Finally, the size of the stored documents (~6330 MiB) is more than four times the size of the index with term vectors stored.

At this point, we can state some conclusions concerning the size of the index:

• When the number of fields is small, the size of the norms is negligible, independently of the size and number of documents.
• When the documents are large, the stop words help reducing the size of the index significantly. Maybe here is important to note two things. In first place, the documents fetched from Wikipedia are writen using traditional language, and are all writen in English, while the documents fetched from Twitter are writen using modern language, and in many different languages. In second place, I didn’t measure the precision and recall of the system when using stop words, so it is possible that the findability in a real scenario won’t be good.
• If you’re storing the documents, and they are big enough, it’s not so important if you store the term vectors or not, so if you’re using a feature such as highlighting and you are looking for good performance, you should store them. If you’re not storing documents, or your documents are small, you should think twice before storing the term vectors, because they’re going to increase significantly your index’s size.

I hope you find this post useful. Currently I’m working on a set of benchmarks to measure the influence of each one of these parameters in the performance of the system, so if you liked this post, stay tuned!