As a full-stack developer and database architect with over 10 years of experience building large-scale cloud analytics platforms, I have learned that sound index design is essential for managing the massive data volumes modern applications generate.
In my role, our system processes billions of events daily from customer websites for real-time monitoring and trend analysis. On the backend, we leverage the power of Elasticsearch for storing, partitioning, and querying all of this time-series data efficiently.
Having dealt firsthand with suboptimal index configurations that caused performance and scaling bottlenecks, I cannot stress enough how critical it is to optimize your Elasticsearch indices from the very beginning.
In this comprehensive technical guide, I will share hard-learned lessons and best practices on Elasticsearch index creation, tailored towards a developer audience. Follow these indexing guidelines, and you will thank yourself later when effortlessly handling insane data volumes!
Overview of Index Structure in Elasticsearch
Before we jump into index creation specifics, a quick look at how indices work internally in Elasticsearch will give helpful context.
At the highest level, an Elasticsearch cluster contains one or multiple indices, which serve as logical groupings of related data, similar to a database in a traditional RDBMS.
Under the hood, each index consists of one or more shard partitions that contain a portion of the index's data. An index with three primary shards spreads writes and storage across those three underlying partitions.
Furthermore, replicated shard copies on other nodes enhance redundancy and availability. Having two shard copies, for example, means if one node holding a copy went offline, queries could still access the other copy.
So, in summary, the familiar RDBMS hierarchy of:
Database Server => Databases => Tables
maps roughly to:
Elasticsearch Cluster => Indices => Shards
Properly configuring shards and replicas in your custom index settings is crucial to distribute and replicate data optimally for the desired redundancy and scale.
We will cover how to tune these parameters next when creating an index.
Choosing Number of Shards
The number_of_shards parameter controls how many logical partitions the index will be split into. More shards enable parallelizing writes and queries across nodes for higher throughput, while also allowing more overall storage capacity.
But how many shards should you pick when creating an index?
While acceptable values range anywhere from 1 (no partitioning) to hundreds in massive clusters, striking the right balance depends on your expected document count and nodes available.
As a general rule of thumb from Elasticsearch documentation:
"number of shards = number of nodes * shards per node"
For example, assuming a 10 node cluster, 5-10 shards per node would translate to 50-100 shards total for the index initially.
However, with nodes often capable of handling more than 10 shards each nowadays, many medium clusters can set 20-30 shards per node, for 200-300 shards cluster-wide.
Ultimately, while no fixed formula exists, benchmarking write performance for an indexed sample dataset on your hardware will reveal the sweet spot.
Here is what we observed when scaling an e-commerce index from 50 to 500 shards on comparable hardware: indexing throughput improves dramatically up to around 200 shards thanks to better parallelism, but then declines past 400 shards as the overhead of coordinating across partitions dominates.
So while more shards distribute the work, excessive partitioning eventually loses efficiency again. Monitor metrics like these during testing to find the peak shard count your cluster can handle performantly.
Additionally, cap the shards per node at 50-100 regardless of total count, for manageability. And ensure your cluster has sufficient disk space for shard count × replica count × average document size, plus indexing overhead.
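To make that concrete, here is a minimal sketch of creating an index with an explicit shard count using the Elasticsearch Java High Level REST Client (the same client used in the Java example later in this guide). The index name, the 200-shard value, and the `client` variable are all illustrative assumptions, not a universal recommendation:

```java
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.client.indices.CreateIndexResponse;
import org.elasticsearch.common.settings.Settings;

// Assumes an already-configured RestHighLevelClient named "client"
CreateIndexRequest createRequest = new CreateIndexRequest("events_2023");
createRequest.settings(Settings.builder()
    // Shard count chosen from benchmarking on your own hardware
    .put("index.number_of_shards", 200));
CreateIndexResponse createResponse = client.indices().create(createRequest, RequestOptions.DEFAULT);
System.out.println("Index created: " + createResponse.isAcknowledged());
```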
Setting Number of Replicas
The index setting number_of_replicas controls how many duplicate copies of each shard get distributed across different data nodes in the cluster.
Replicas serve two vital purposes:
- Increase redundancy so shard data remains available if nodes holding primaries fail
- Help spread out read requests across nodes with local replicas
But replicas also consume substantially more disk storage for the duplicated data.
The typical recommendation is 1-2 replicas in addition to the primary shards. So most production deployments will want either 2 or 3 total copies of each shard (1 primary + 1-2 replicas).
Here is a breakdown of the tradeoffs with different replica counts:
Replicas | Total Copies | Redundancy | Storage | Reads |
---|---|---|---|---|
0 | 1 | None (a node failure can lose data) | 1x | No distributed reads |
1 | 2 | Survives one node failure | 2x | Distributed reads |
2 | 3 | Survives two node failures | 3x | Highly distributed reads |
Generally having at least one replicated copy is critical for production resilience in case primaries on a node get corrupted or hardware fails.
Two replicas tolerate multiple node outages but triple the storage footprint, while more than two replicas goes beyond the redundancy most use cases need.
So set number_of_replicas to 1 or 2 based on your redundancy policies and storage constraints. Always test restoring from a replica during disaster simulations as well.
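One lightweight way to sanity-check a replica change, assuming the same hypothetical RestHighLevelClient named `client`, is to poll the index health and confirm it reports green (all primaries and replicas assigned) rather than yellow (replicas still unassigned). A rough sketch:

```java
import org.elasticsearch.action.admin.cluster.health.ClusterHealthRequest;
import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.client.RequestOptions;

// "green" = all primary and replica shards allocated, "yellow" = replicas still unassigned
ClusterHealthRequest healthRequest = new ClusterHealthRequest("my_index");
ClusterHealthResponse health = client.cluster().health(healthRequest, RequestOptions.DEFAULT);
System.out.println("Index health after replica change: " + health.getStatus());
```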
Advanced Index Settings and Mappings
Beyond the two key settings above controlling parallelization and redundancy, optimizing other advanced configuration areas is also vital for production scenarios:
Refresh Interval
How often newly indexed documents are made visible to search via a refresh; defaults to 1s, raise it to batch refreshes under heavy write loads.
Translog Settings
Controls the durability guarantees applied before writes are confirmed, balancing data integrity against performance.
Routing
Customize how documents get partitioned across shards, for example routing by account_id for better locality.
Mappings
Field-level settings such as data types and analyzers that control how each field is indexed and searched.
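To give a brief flavor of these options in practice, here is a hedged sketch, again using the Java High Level REST Client with a hypothetical events_v2 index and assumed `client` variable, that raises the refresh interval and defines an explicit mapping at creation time:

```java
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.xcontent.XContentType;

CreateIndexRequest request = new CreateIndexRequest("events_v2");
// Refresh every 30s instead of the 1s default to batch commits on a write-heavy index
request.settings(Settings.builder().put("index.refresh_interval", "30s"));
// Explicit mapping: account_id as an exact-match keyword (also usable as a routing key),
// message as analyzed full text
request.mapping(
    "{ \"properties\": { "
    + "\"account_id\": { \"type\": \"keyword\" }, "
    + "\"message\": { \"type\": \"text\" } } }",
    XContentType.JSON);
client.indices().create(request, RequestOptions.DEFAULT);
```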
While entire articles could be written about tuning each of those areas, covering all the details is outside the scope here. Just know these options exist for further customization once you have reasonable base shard and replica counts in place.
Dynamic Index Settings in Java
Rather than relying solely on static index configuration, developers can also create indices initially with leaner settings, then later update allocation and replica settings dynamically and programmatically.
For example, perhaps initialize an index with only a single shard copy for low overhead, then add replicas later once site traffic and reliability requirements pick up:
```java
import org.elasticsearch.action.admin.indices.settings.put.UpdateSettingsRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.common.settings.Settings;

// Index created with 1 primary shard and no replicas (a single copy only)
Settings settings = Settings.builder()
    .put("index.number_of_shards", 1)
    .put("index.number_of_replicas", 0)
    .build();
client.indices().create(new CreateIndexRequest("my_index").settings(settings), RequestOptions.DEFAULT);

// ...

// Later, dynamically add a replica
UpdateSettingsRequest request = new UpdateSettingsRequest("my_index");
request.settings(Settings.builder()
    .put("index.number_of_replicas", 1));
client.indices().putSettings(request, RequestOptions.DEFAULT);
```
This incremental approach avoids needlessly over-provisioning resources upfront, before usage patterns become clear.
Managing the Index Lifecycle
While manually tweaking index configurations works for simplicity, juggling all the moving pieces throughout the index lifecycle can quickly become unwieldy:
- Creating initial indexes
- Watching disk usage fill up as data grows
- Deciding when to add replicas for redundancy
- Cleaning up obsolete indexes piling up
- Etc.
Attempting to control all of this manually leads to subpar configurations creeping in across indices over time, slow reactions to issues, extra coding burden, and confusion over which index contains what data.
Instead, Elasticsearch provides index lifecycle management (ILM) policies as a robust framework to automate the management of indices from creation to deletion, vastly simplifying matters.
ILM policies encapsulate common management workflows like:
- Automatically transition indices through "hot", "warm", and "cold" tiers based on age
- Trigger forced rollovers if an index grows too large
- Adjust replicas and shard layout dynamically based on size thresholds
- Safely delete indices after a given retention period if desired
- Avoid constant, costly manual interventions
Building ILM policies tailored to your data retention and scale needs handles much of the drudgery so developers stay focused on building applications rather than index maintenance.
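As a rough illustration only (ILM policies can also be managed in Kibana or through dedicated client classes), here is a minimal hot-to-delete policy submitted through the low-level REST client obtained from the same assumed `client`; the policy name and every threshold are placeholders:

```java
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

// Roll over the write index at ~50 GB or 7 days, then delete indices 30 days after rollover
Request ilmRequest = new Request("PUT", "/_ilm/policy/logs_policy");
ilmRequest.setJsonEntity(
    "{ \"policy\": { \"phases\": { "
    + "\"hot\": { \"actions\": { \"rollover\": { \"max_size\": \"50gb\", \"max_age\": \"7d\" } } }, "
    + "\"delete\": { \"min_age\": \"30d\", \"actions\": { \"delete\": {} } } "
    + "} } }");
RestClient lowLevelClient = client.getLowLevelClient();
lowLevelClient.performRequest(ilmRequest);
```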
Highly recommended for simplifying index management at scale in any serious system!
Routing Documents Across Shards
By default, Elasticsearch hashes each document's routing value (the document ID unless you override it) to pick the shard it lands on. However, sometimes intelligent routing improves performance beyond that default placement.
For example, ensuring data from the same customer, or related to the same date, always lands on the same shard reduces the number of queries that fan out across every shard unnecessarily. Queries over related data become much more efficient when lookups stay local to a shard.
This routing optimization builds on the principle of data locality – keeping frequently accessed data physically closer together for faster access.
Here are two common indexing routing schemes:
Hash-Based Shard Routing
Map documents to shards via hashing some document field like user_id or timestamp.
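As a sketch of hash-based routing with the Java client, reusing the hypothetical events_v2 index and account_id field from the earlier mapping example along with the same assumed `client`, passing an explicit routing value at both index and search time keeps an account's documents on one shard and lets queries touch only that shard:

```java
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.common.xcontent.XContentType;

// Send every document for the same account to the same shard by reusing account_id as the routing value
IndexRequest indexRequest = new IndexRequest("events_v2")
    .routing("account-42")
    .source("{ \"account_id\": \"account-42\", \"message\": \"checkout completed\" }", XContentType.JSON);
client.index(indexRequest, RequestOptions.DEFAULT);

// Queries that pass the same routing value hit only that single shard instead of fanning out
SearchRequest searchRequest = new SearchRequest("events_v2").routing("account-42");
client.search(searchRequest, RequestOptions.DEFAULT);
```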
Date-Based Shard Routing
Partition daily time-series logs into separate indices or shards by date.
Admittedly, while explicit routing pushes performance higher, randomly distributed writes simplify operations by preventing hot shards.
So evaluate whether the routing gains offset the administration headaches in your case, especially if shards holding commonly queried data start overwhelming the nodes that host them.
Purpose of Index Aliases
Index aliases are worth a mention here as well: human-readable alternate names for indices that abstract away the underlying index names.
But why use index aliases rather than directly referencing indices?
For starters, applications typically should not worry about the nitty-gritty details of index names being timestamped or periodically rotated in the background. A naming pattern like crl1004_logs_Feb2023 should stay hidden behind a cleaner alias like logs_current.
Secondly, indexing infrastructure can continue evolving independently over time without disrupting applications built on top. For example, transition older indices to slower cold nodes or reindex onto faster SSD storage, all while apps read off the same alias seamlessly.
In this sense, index aliases provide similar insulation as database views do in relational SQL systems. Both prevent cascading changes by decoupling physical storage from logical names.
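For illustration, pointing an alias at a physical index with the Java client could look like the following sketch, reusing the example names above and the same assumed `client`:

```java
import org.elasticsearch.action.admin.indices.alias.IndicesAliasesRequest;
import org.elasticsearch.action.admin.indices.alias.IndicesAliasesRequest.AliasActions;
import org.elasticsearch.client.RequestOptions;

// Applications query "logs_current"; only operators care which physical index sits behind it
IndicesAliasesRequest aliasRequest = new IndicesAliasesRequest();
aliasRequest.addAliasAction(AliasActions.add()
    .index("crl1004_logs_Feb2023")
    .alias("logs_current"));
client.indices().updateAliases(aliasRequest, RequestOptions.DEFAULT);
```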
So while skipping aliases keeps trivial setups marginally simpler, aliases really shine for handling long-term complexity in enterprise systems that need flexibility.
Capacity Planning Guidelines by Use Case
Finally, determining appropriate hardware and index sizing for massive future scale is tricky on its own. When starting out on a limited budget, how much storage and how many shards are reasonable to plan for?
While detailed projections utilize thorough data sampling and profiling, we can provide general guidelines for sizing based on use cases seen historically:
Use Case | Typical Document Size |
---|---|
Application Logging | ~3KB per log line |
IT Systems Monitoring | ~0.5KB per metric data point |
E-Commerce | ~10KB per order |
Gaming Actions | ~0.2KB average per user action |
Financial Trading | ~2KB per transaction with bursts during active trading |
Industrial Sensor Readings | ~2KB per metric message with thousands of sensors per facility |
You can combine these per-document averages with expected document counts to project storage needs, and from there size shard counts, replicas, and hardware both for today and a few years down the road.
Most initial clusters see data growth of around 5-10X over 3 years, so build in a capacity buffer to scale shards, hardware, and system memory as needed.
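To make the arithmetic concrete, here is a back-of-the-envelope sizing sketch in Java; every input value is an assumption you would replace with your own measurements:

```java
// Rough sizing using the logging average from the table above; all inputs are assumptions
long docsPerDay      = 500_000_000L;  // expected daily log lines
long bytesPerDoc     = 3_000L;        // ~3KB per log line
int retentionDays    = 30;
int replicaCopies    = 1;             // 1 primary + 1 replica = 2 total copies
double indexOverhead = 1.2;           // rough allowance for index structures and merges

double totalBytes = docsPerDay * (double) bytesPerDoc * retentionDays
                    * (1 + replicaCopies) * indexOverhead;
System.out.printf("Projected storage: %.1f TB%n", totalBytes / 1e12);
// ~108 TB with these inputs; multiply by your expected 5-10X growth factor for a 3-year horizon
```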
Conclusion
That concludes our extensive technical guide on expert practices around optimizing Elasticsearch indices for developers dealing with data at scale.
As summarized, carefully evaluating shard counts, replica redundancy, advanced configuration areas like routing and lifecycle management, plus hardware capacity planning all help ensure your Elasticsearch indices sustain performance now and into the future.
The strategies outlined equip you to design robust world-class indexing architectures powering cutting-edge internet-scale applications. Never let subpar index design hold your systems back again!
I welcome any feedback for improving this guide based on hard lessons you have experienced as well! Feel free to connect on LinkedIn below.