We are implementing Elasticsearch as a search solution in our organization. For the POC, we set up a 3-node cluster (each node with 16 vCores, 60 GB of RAM and 6 x 375 GB of SSD), with every node acting as master, data and coordinating node. Since this was a POC, we were not looking at indexing speed, only at whether the solution would work at all.
Note: we tried to index 20 million documents on our POC cluster and it took about 23-24 hours, which pushed us to take the time to design the production cluster with appropriate sizing and settings.
We are now planning a production cluster (on Google Cloud Platform) with an emphasis on both indexing speed and search speed.
Our use case is as follows:
- We will bulk index 7 million to 20 million documents per index (we have 1 index per client, and there will be only one cluster). This bulk indexing is a weekly process: we index all the data once and query it for a whole week before refreshing it.
We aim for an indexing rate of 0.5 million documents per second.
We are also looking for a strategy to scale out horizontally as we onboard more customers; I describe it in the sections below.
- Our data model has a nested document structure and a lot of queries on nested documents, which I think will require significant CPU, memory and I/O resources (a minimal sketch of such a mapping and query follows this list).
We are aiming for sub-second response times at the 95th percentile of queries.
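To make the nested data model concrete, here is a minimal sketch in Python using the requests library. The node address, index name and the `orders` field are purely illustrative, not our real schema:

```python
import requests

ES = "http://localhost:9200"  # coordinating node address (placeholder)

# Illustrative index with one nested field; our real mapping is much larger.
mapping = {
    "mappings": {
        "properties": {
            "customer_id": {"type": "keyword"},
            "orders": {                      # nested documents
                "type": "nested",
                "properties": {
                    "status": {"type": "keyword"},
                    "amount": {"type": "double"}
                }
            }
        }
    }
}
requests.put(f"{ES}/client1_index", json=mapping)

# A typical nested query: match parent documents whose nested orders satisfy both conditions.
query = {
    "query": {
        "nested": {
            "path": "orders",
            "query": {
                "bool": {
                    "must": [
                        {"term": {"orders.status": "OPEN"}},
                        {"range": {"orders.amount": {"gte": 100}}}
                    ]
                }
            }
        }
    }
}
print(requests.get(f"{ES}/client1_index/_search", json=query).json())
```

Queries of this shape are what we expect to dominate the workload, which is why I call out the cost of nested queries above.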
I've read quite a few posts on this forum and on other blogs about companies successfully running high-performance Elasticsearch clusters.
Here are my learnings:
Have dedicated master nodes (always an odd number to avoid split brain). These machines can be modestly sized (16 vCores and 60 GB of RAM).
Give 50% of RAM to the ES heap, as long as it does not exceed 31 GB so the JVM can keep using compressed object pointers. We plan to set it to 28 GB on each node (see the quick check sketched after this list).
Data nodes are the workhorses of the cluster and should therefore be generously sized in CPU, RAM and I/O. We plan to use 64 vCores, 240 GB of RAM and 6 x 375 GB SSD per data node.
Also have dedicated coordinating nodes to front the bulk indexing and search requests.
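As a sanity check once the 28 GB heap is set, something along these lines (a rough sketch against the nodes info API; the node address is a placeholder) should confirm that every node still runs with compressed object pointers:

```python
import requests

ES = "http://localhost:9200"  # any node address (placeholder)

# The nodes info API reports whether each JVM still uses compressed ordinary object pointers.
for node_id, info in requests.get(f"{ES}/_nodes/jvm").json()["nodes"].items():
    jvm = info["jvm"]
    print(info["name"],
          "heap_max_bytes:", jvm["mem"]["heap_max_in_bytes"],
          "compressed_oops:", jvm.get("using_compressed_ordinary_object_pointers"))
```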
We now plan to start with the following configuration:
3 master nodes - 16 vCores, 60 GB of RAM and 1 x 375 GB SSD
3 coordinating nodes - 64 vCores, 60 GB of RAM and 1 x 375 GB SSD (high-CPU machine type)
6 data nodes - 64 vCores, 240 GB of RAM and 6 x 375 GB SSD
We plan to add 1 data node for each incoming customer.
Now that the hardware is out of the way, let's focus on the indexing strategy.
The best practices I have gathered are:
- A lower number of shards per node is good in most scenarios, but the shard count should still give good data distribution across all nodes so the load is balanced. Since we plan to start with 6 data nodes, I am inclined to use 6 shards for the first client's index so that the whole cluster is used.
- Have 1 replica to survive the loss of a node (index settings are sketched after this list).
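Putting those two points together, each per-client index would be created with something like this (a sketch; the index name and node address are placeholders, and the numbers would be revisited per client):

```python
import requests

ES = "http://localhost:9200"  # coordinating node address (placeholder)

settings = {
    "settings": {
        "number_of_shards": 6,    # one primary per data node so the first client's data spreads across the cluster
        "number_of_replicas": 1   # survive the loss of a node
    }
}
print(requests.put(f"{ES}/client1_index", json=settings).json())
```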
Next is the bulk indexing process. We have a full Spark installation and plan to use the elasticsearch-hadoop connector to stream data from Spark into our cluster. During indexing we set the refresh_interval to 1m so that refreshes happen less frequently. We use 100 parallel Spark tasks, each sending bulk requests of about 2 MB, so at any point there is roughly 2 * 100 = 200 MB of in-flight bulk requests, which I believe is within what ES can handle. We can certainly tune these settings based on feedback or trial and error.
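As a sketch of what we have in mind (PySpark with the elasticsearch-hadoop connector on the classpath; the host, bucket path, index name and exact option values are assumptions we would tune by trial and error):

```python
import requests
from pyspark.sql import SparkSession

ES = "http://coordinator-node:9200"  # placeholder address

# Make refreshes less frequent while the weekly bulk load runs.
requests.put(f"{ES}/client1_index/_settings",
             json={"index": {"refresh_interval": "1m"}})

spark = SparkSession.builder.appName("weekly-bulk-index").getOrCreate()
df = spark.read.parquet("gs://some-bucket/weekly_dump/")  # placeholder source data

(df.repartition(100)                          # ~100 parallel bulk-writing tasks
   .write
   .format("org.elasticsearch.spark.sql")
   .option("es.nodes", "coordinator-node")    # placeholder host
   .option("es.port", "9200")
   .option("es.batch.size.bytes", "2mb")      # ~2 MB bulk requests per task
   .mode("append")
   .save("client1_index"))

# Bring the refresh interval back down once the bulk load is done.
requests.put(f"{ES}/client1_index/_settings",
             json={"index": {"refresh_interval": "1s"}})
```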
I've also read about tuning cache percentages, thread pool sizes and queue sizes, but we plan to stick with the smart defaults at first.
We are open to using either Concurrent Mark Sweep (CMS) or G1GC for garbage collection, but would welcome advice on this. I've read the pros and cons of both and am still undecided.
Let's move on to my current questions:
Is sending bulk indexing requests to the coordinating nodes a good design choice, or should we send them directly to the data nodes?
We will send search queries via the coordinating nodes. Now my question: since my data nodes have 64 cores, each node would have a search thread pool of size 64 and a queue size of 200. Suppose the search thread pool and queue on the data nodes are completely exhausted - will the coordinating nodes still accept and buffer search requests until their own queues fill up as well, or will one thread on the coordinating node also be tied up for each in-flight request?
My assumption is that a search request reaching a coordinating node occupies 1 thread there and then fans out to the data nodes, which in turn occupies threads on the data nodes holding the relevant shards. Is this assumption correct?
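Whatever the answer, we plan to watch the search pools while load testing; here is a small sketch of what we would poll (the _cat thread pool API; the node address is a placeholder):

```python
import requests

ES = "http://coordinator-node:9200"  # placeholder address

# Active threads, queued requests and rejections per node for the search thread pool.
resp = requests.get(f"{ES}/_cat/thread_pool/search",
                    params={"v": "true", "h": "node_name,name,active,queue,rejected"})
print(resp.text)
```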
- While bulk indexing is in progress (assuming we do not run indexing for all clients in parallel but instead schedule them sequentially), how can we best design things so that query times are not affected during this bulk indexing?