It is very fitting that my first blog post in a long while is about Elasticsearch and some highlights of what I learned at ElasticON 2015. I have been using Elasticsearch professionally to varying degrees for almost two years now (note, I’ve also used another Lucene-backed product called Solr). My latest technical adventure has been implementing a customer-facing, Elasticsearch-backed feature with the hope of providing search as a service to other products in the future. This has been a pretty demanding project, so free time has largely been dedicated to getting it off the ground. Hence my five loyal readers have been waiting for months for new material... well, wait no more.

This has been a very fun and challenging project from a technical perspective. Though I am quite knowledgeable about search as a problem space, and about Elasticsearch as a developer, I have never been responsible for capacity and infrastructure planning for an Elasticsearch cluster or any other search engine I’ve worked with. I have also never had to provide operations support for an ES cluster, though I have provided production support for my applications in the past. As the technical lead, one of my many responsibilities is actively working with our platform team to do capacity planning and testing for our production ES cluster. Needless to say, I have learned a lot in the past few months.

Okay, so that’s nice, but what about the conference? My intro above was a very long-winded way of explaining why I even wanted to attend ElasticON 2015. I thought it would be invaluable to get knowledge from the front lines, from engineers who have built and managed ES clusters at scale, and to learn directly from those who created Elasticsearch (as well as folks who worked on Lucene, Logstash, etc.). Luckily my awesome boss agreed it was a good idea to attend.

Major Theme of The Conference: How To Plan, Build and Scale ES For Megashark-Sized Data

I have said many times in the past that capacity planning for ES clusters is a dark art. There is no one-size-fits-all solution for any search engine, no matter what the underlying technology is. Your data and your use case (how you plan on using the data, how often you index, how fresh the data needs to be, etc.) all matter when you start thinking about capacity planning, scaling, and even support. This stuff isn’t easy, and you will likely get it wrong at first.

With that said, it is no wonder that one of the main themes of ElasticON 2015 was the need for tips and tricks regarding capacity planning for Elasticsearch clusters, particularly for large data sets. The short answer I received to “What’s the best approach to capacity planning and scaling an ES cluster for lots of data?”... It depends. On what? Everything. This echoed months of research on the topic and confirms our team’s experience as we go through this exercise.

Below are some nuggets on scaling and planning that came up throughout many talks, as well as from our own experience so far.

Use Dedicated Master Nodes

By default, when you set up an Elasticsearch cluster, every node operates in mixed mode, meaning a single node can act as both a master and a data node. This can lead to performance issues, as more resources are required for a node to pull double duty if it is elected master. Dedicated master nodes avoid this. In our initial performance testing we haven’t yet seen the need for them, but I assume this may change in the future.
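As a minimal sketch, a dedicated master is just a node whose elasticsearch.yml (ES 1.x settings shown) opts out of holding data; the values here are illustrative:

```
# elasticsearch.yml for a dedicated master node (ES 1.x style settings)
node.master: true    # eligible to be elected master
node.data: false     # holds no shard data, so it only coordinates cluster state
```

A data-only node flips both values.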

Use Aliases To Minimize Down Time

You will likely want to reindex the data stored in an index from time to time, for example to add a new field or update an existing field mapping. This can be done by creating a new index under the covers and repointing an alias from the old index to the new one... think of it as a way to hot swap an index, or as a pointer.
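A sketch of what the swap looks like, building the request body you’d POST to the _aliases endpoint (the index and alias names here are hypothetical). Both actions are applied atomically, so clients searching the alias never see a gap:

```python
def alias_swap_body(alias, old_index, new_index):
    """Build an _aliases request body that atomically repoints an alias."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

# Clients always search "products"; only the alias target changes.
body = alias_swap_body("products", "products_v1", "products_v2")
```

Because your application only ever talks to the alias, the reindex into products_v2 can take as long as it needs before the swap.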

Know Your Data

Which means knowing things like…

  • The approximate size of your documents.
  • The approximate size of your bulk indexing payloads... Elasticsearch recommends they be no larger than 5 to 15 MB.
  • What fields you are planning to index.
  • The types of search matching you want to support as this will affect your choices in how your fields are mapped.
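Since the bulk payload guideline is in bytes rather than document counts, a simple helper can cap each payload by serialized size. This is a sketch under the assumption that each action is a JSON-serializable dict destined for the bulk API:

```python
import json

def chunk_bulk_actions(actions, max_bytes=5 * 1024 * 1024):
    """Yield newline-delimited bulk payloads, each under max_bytes (~5 MB)."""
    batch, size = [], 0
    for action in actions:
        line = json.dumps(action).encode("utf-8") + b"\n"
        if batch and size + len(line) > max_bytes:
            yield b"".join(batch)
            batch, size = [], 0
        batch.append(line)
        size += len(line)
    if batch:
        yield b"".join(batch)

docs = [{"index": {"_id": i}} for i in range(3)]
payloads = list(chunk_bulk_actions(docs, max_bytes=1024))
```

Knowing your approximate document size tells you roughly how many documents fit per payload, which in turn drives your indexing throughput math.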

Know Your Use Cases

This will greatly affect capacity planning, as your use cases will dictate what your cluster should be optimized for. Some clusters, for example, are optimized for heavy searching with very little indexing, while others are optimized for mostly indexing and not much searching.

How do you plan on searching for this data (keywords, aggregations, substring matching)? Why is this important?

Things like supporting substring matching and highlighting can greatly increase your index size, which of course affects the disk space required to index current and future data. Things like aggregation and sorting have an effect on how much memory you will need on your nodes. By default, Elasticsearch (as of version 1.5) uses the field data cache to store the field data used for sorting and aggregations.

The field data cache lives on the JVM heap, and once the heap is under pressure, without the proper configuration settings to perform cache eviction you are looking at a world of circuit breaker exceptions (or out-of-memory exceptions, if you are on an older version of ES) and a dead cluster.
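The relevant knobs in ES 1.x are the field data cache size and the fielddata circuit breaker; the percentages below are illustrative, not recommendations for your cluster:

```
# elasticsearch.yml (ES 1.x): bound the field data cache so evictions
# happen before the heap gives out.
indices.fielddata.cache.size: 40%       # unbounded by default in 1.x
indices.breaker.fielddata.limit: 60%    # breaker trips before an OOM
```

The cache size must stay below the breaker limit, otherwise the breaker can trip on data that is already cached.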

How “fresh” does your data need to be?

Because the underlying Lucene data structures (segments) are immutable, updating a document really means deleting the old document and indexing a new one. This can have some impact on query performance, so determine how often you actually need to reindex your data.

Turn Off Refresh Interval During Large Bulk Indexing Jobs

Turn the refresh interval off (set it to -1) for the index while the bulk job runs, then restore it once the job completes.
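A sketch of the two settings bodies you’d PUT to the index settings API (the index name is whatever you are loading; “1s” is the ES 1.x default you’d restore afterwards):

```python
# Body for: PUT /my_index/_settings  (disable refresh during bulk load)
disable_refresh = {"index": {"refresh_interval": "-1"}}

# Body for: PUT /my_index/_settings  (restore the default once done)
restore_refresh = {"index": {"refresh_interval": "1s"}}
```

Until the interval is restored (or a manual refresh is issued), newly indexed documents won’t be visible to search, which is exactly the trade you want during a large load.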

Do Load Testing!

Test, fix, tune... rinse and repeat. This sounds obvious, but many teams skip this step. If you have existing search request logs, this is a great time to leverage them to drive load against your cluster. If you don’t, you’ll need to generate your own data and have some idea of your expected user load.

When load testing, start with one node, one index, and one shard. Index your data and search as you would in a production setting. Under this load you’ll learn the maximum capacity of a single node and a single shard, which is valuable when you are trying to decide how many nodes, shards, etc. are needed.

Doing Lots Of Sorting & Aggregations? Consider Using Doc Values

If you plan on doing lots of aggregations or sorting, consider turning on doc values for the fields you know you need to sort or aggregate on. Instead of leveraging the field data cache and watching your heap usage like a hawk, you can enable doc values on those fields, which stores the field data on disk instead of in memory. Doc values also make use of the file system cache, which is often larger than the memory allotted to the field data cache.

Doc Values, Strings & You
Doc values are not currently supported on analyzed strings (as of Elasticsearch version 1.5). For string fields, using doc values greatly reduces use of the field data cache but doesn’t completely eliminate it. Also note that doc values aren’t a magic bullet: you trade memory pressure for disk space (and some disk I/O), so you will be paying for it in some way. Like many things in tech, there are trade-offs.
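A sketch of an ES 1.x mapping that opts fields into doc values; the type and field names are hypothetical, and note the string field must be not_analyzed:

```python
# Mapping body for: PUT /my_index/_mapping/product (names are illustrative)
mapping = {
    "product": {
        "properties": {
            "category": {
                "type": "string",
                "index": "not_analyzed",   # doc values need not_analyzed strings
                "doc_values": True,        # sort/aggregate from disk, not heap
            },
            "price": {
                "type": "double",
                "doc_values": True,
            },
        }
    }
}
```

Because mappings of existing fields can’t be changed in place, enabling doc values on a live field is one of those reindex-behind-an-alias moments described earlier.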

Use Filters To Get The Best Query Performance

Make use of filters to minimize your search space, which minimizes the number of documents ES has to calculate a relevance score for. Use filters (instead of queries) whenever you don’t need a relevance score for the results; filters are also cacheable, so repeated filters get cheaper over time.
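In ES 1.x this combination is expressed with the filtered query: the filter narrows the candidate set without scoring, and only the remaining documents are scored by the query clause. Field names and values here are hypothetical:

```python
# Search body for: POST /my_index/_search (names are illustrative)
search_body = {
    "query": {
        "filtered": {
            # Scored: only runs over documents that pass the filter.
            "query": {"match": {"title": "espresso machine"}},
            # Not scored, cacheable: cheap exact-match pre-filter.
            "filter": {"term": {"status": "published"}},
        }
    }
}
```

If you need no scoring at all (say, “all published documents”), drop the query clause entirely and keep just the filter.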

Monitoring & Alerts

Have good monitors and alerts in place for critical cluster metrics like memory and CPU usage. There are some awesome ES plugins for monitoring and tending to your cluster, and Elastic also recently announced its own alerting and notification product. We are currently using Marvel, Kopf, and Head, as well as Datadog, to collect system, cluster, and application metrics.

Automation Is Your Friend

One of the most noteworthy bits I took away from the conference, something our DevOps friends will testify to, is that automation is your friend. There are very small teams implementing and managing very large numbers of clusters, and the one thing they all seemed to have in common (besides being very talented, smart people) was that they heavily leveraged automation for building, managing, and monitoring their clusters. The Netflix team seems to be masters of this: they managed thousands of nodes across many clusters with just two people (at the time of the conference in March). If you are planning on running a cluster in AWS, I’d suggest checking out their tool for monitoring and cluster management called Raigad.

I <3 Elasticsearch & Search

I think it’s a well-executed product, and given the level of dedication from the team, I am sure it will get better and even more fun to use over time. It does require some effort and an understanding of the technical weeds under the covers of both Elasticsearch and Lucene, but knowing a bit more and doing some due diligence will hopefully minimize the pains of turning your ES cluster loose in the wild.

So, no, we haven’t implemented every single thing we’ve learned, but we are working on it, and we have done a lot of work to get the important bits right... well okay, right-ish. Hopefully this will mean fewer sleepless nights for us and a happy search experience for our users.