Elasticsearch – when a node left and the cluster crashed

January 16, 2024
Tomasz Dzierżanowski

1. Introduction

When you build a cluster of Elasticsearch nodes, you have to make it failure resistant, so that when one node goes down the rest of the cluster stays intact. There are advanced options that allocate shards based on geo-location or IP address, and one very obvious rule that keeps primary and replica shards on separate nodes. This rule is the default, so when a node holding a primary shard fails, its replica exists on another node and is instantly promoted to primary. Later, a new replica is created from it.

Unfortunately, there is a scenario where one node fails and the “node left cluster” event triggers the master node to allocate new replicas on the other nodes in the cluster. This can kill your cluster if the total disk space available on the other nodes is not enough to hold the replica data.

Let me show you a simulation of that incident.

2. Start Elasticsearch

The commands below start a three-node Elasticsearch cluster with tiny disks for the sake of testing.

The first set of commands creates the volumes:

docker volume create --opt type=tmpfs --opt device=tmpfs --opt o=size=2m europe01data
docker volume create --opt type=tmpfs --opt device=tmpfs --opt o=size=2m africa01data
docker volume create --opt type=tmpfs --opt device=tmpfs --opt o=size=2m arctica01data

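Then start the cluster:

# create the network the containers attach to (skip if it already exists)
docker network create elknodes

docker run --rm \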
--name europe01 \
--net elknodes \
-d \
-e ES_JAVA_OPTS="-Xms2g -Xmx2g" \
-e node.name="europe01" \
-p 9200:9200 \
-v europe01data:/usr/share/elasticsearch/data \
docker.elastic.co/elasticsearch/elasticsearch:8.11.1

# wait until first node is started

# reset password
docker exec -it europe01 bash -c "(mkfifo pipe1); ( (elasticsearch-reset-password -u elastic -i < pipe1) & ( echo $'y\n123456\n123456' > pipe1) );sleep 5;rm pipe1"

# get enrollment token
token=`docker exec -it europe01 elasticsearch-create-enrollment-token -s node | tr -d '\r\n'`

docker run --rm \
-e ENROLLMENT_TOKEN="$token" \
-e node.name="africa01" \
-v africa01data:/usr/share/elasticsearch/data \
-p 9201:9200 \
--name africa01 \
--net elknodes \
-d \
-m 2GB docker.elastic.co/elasticsearch/elasticsearch:8.11.1


docker run --rm \
-e ENROLLMENT_TOKEN="$token" \
-e node.name="arctica01" \
-v arctica01data:/usr/share/elasticsearch/data \
--name arctica01 \
--net elknodes \
-d \
-m 2GB docker.elastic.co/elasticsearch/elasticsearch:8.11.1
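The "wait until first node is started" step above can be scripted instead of watched by hand. Here is a minimal sketch, assuming the node answers on https://localhost:9200 as in the rest of this post. The function name wait_for_es and the retry numbers are my own choices for illustration. Before the password is reset the node answers with 401, which still proves the HTTP layer is up, so the loop only checks that curl receives any HTTP response.

```shell
# wait_for_es URL [RETRIES] [DELAY_SECONDS]   (hypothetical helper)
# Polls URL until curl gets any HTTP response (even 401) or retries run out.
wait_for_es() {
  local url="$1" retries="${2:-60}" delay="${3:-5}" i=1
  while [ "$i" -le "$retries" ]; do
    # -k: accept the self-signed certificate, -s: quiet, -o /dev/null: drop the body
    if curl -k -s -o /dev/null "$url"; then
      echo "node is up after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "node did not start in time" >&2
  return 1
}

# Usage:
# wait_for_es "https://localhost:9200"
```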

3. Load test data

Elasticsearch compresses stored data, so when I loaded plain text it kept toying with me by shrinking the disk usage. I decided to store a kind of blob instead, in the form of base64-encoded strings. This makes disk usage easy to control.

3.1. Mapping definition

No explicit mapping is needed; create six indices, each with one primary shard and one replica:

curl -k -u elastic:123456 -XPUT "https://localhost:9200/customerdata" \
-H 'content-type: application/json' -d'
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}'

curl -k -u elastic:123456 -XPUT "https://localhost:9200/customerdata2" \
-H 'content-type: application/json' -d'
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}'

curl -k -u elastic:123456 -XPUT "https://localhost:9200/customerdata3" \
-H 'content-type: application/json' -d'
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}'

curl -k -u elastic:123456 -XPUT "https://localhost:9200/customerdata4" \
-H 'content-type: application/json' -d'
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}'

curl -k -u elastic:123456 -XPUT "https://localhost:9200/customerdata5" \
-H 'content-type: application/json' -d'
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}'

curl -k -u elastic:123456 -XPUT "https://localhost:9200/customerdata6" \
-H 'content-type: application/json' -d'
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}'
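The six requests above differ only in the index name, so they can be collapsed into a loop. This is a minimal sketch using the same credentials and settings as above; the helper name create_indices is hypothetical.

```shell
# create_indices NAME...   (hypothetical helper)
# Creates every index passed as an argument with 1 primary shard and 1 replica.
create_indices() {
  local index
  for index in "$@"; do
    curl -k -u elastic:123456 -XPUT "https://localhost:9200/$index" \
      -H 'content-type: application/json' \
      -d '{"settings": {"number_of_shards": 1, "number_of_replicas": 1}}'
    echo   # newline after each JSON response
  done
}

# Usage:
# create_indices customerdata customerdata2 customerdata3 customerdata4 customerdata5 customerdata6
```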

3.2. Blob from IPFS

I have prepared test data for you in IPFS storage. If you want to store your own, you can register an account HERE.

Download the data with curl:

docker run --rm -it \
-v "$PWD:/tmp" \
-e IPFS_GATEWAY="https://ipfs.filebase.io/" \
curlimages/curl:8.5.0 --parallel --output "/tmp/#1.png" "ipfs://{QmeFtCsJSNQjWxFnhM548m5NQuBfhPthiJJRzm1JJ2u4XA,QmPC2FFxBp9Lh97rdSo3wj4pqmpnZ7LsZCCud8QpU8ukwK,QmaKvBe8jFeWw7Sa6ULPbJ3J4FwUCdJYuotfG6JXS4A9sP}"

3.3. Load to Elasticsearch

Once you have the images on disk, you can convert them to JSON and bulk load them into Elasticsearch. Run the commands one by one, slowly; otherwise you will kill the cluster too early.

someData=`cat QmPC2FFxBp9Lh97rdSo3wj4pqmpnZ7LsZCCud8QpU8ukwK.png |base64`

echo -n '' > QmPC2FFxBp9Lh97rdSo3wj4pqmpnZ7LsZCCud8QpU8ukwK.json
echo '{"index": {"_id": 1}}' >> QmPC2FFxBp9Lh97rdSo3wj4pqmpnZ7LsZCCud8QpU8ukwK.json
echo  '{"customer_name": "'$someData'"}' >> QmPC2FFxBp9Lh97rdSo3wj4pqmpnZ7LsZCCud8QpU8ukwK.json

curl -k -u elastic:123456 -XPOST "https://localhost:9200/customerdata/_bulk" -H 'Content-Type: application/x-ndjson' --data-binary @QmPC2FFxBp9Lh97rdSo3wj4pqmpnZ7LsZCCud8QpU8ukwK.json
curl -k -u elastic:123456 -XPOST "https://localhost:9200/customerdata2/_bulk" -H 'Content-Type: application/x-ndjson' --data-binary @QmPC2FFxBp9Lh97rdSo3wj4pqmpnZ7LsZCCud8QpU8ukwK.json
curl -k -u elastic:123456 -XPOST "https://localhost:9200/customerdata3/_bulk" -H 'Content-Type: application/x-ndjson' --data-binary @QmPC2FFxBp9Lh97rdSo3wj4pqmpnZ7LsZCCud8QpU8ukwK.json

someData2=`cat QmaKvBe8jFeWw7Sa6ULPbJ3J4FwUCdJYuotfG6JXS4A9sP.png |base64`
echo -n '' > QmaKvBe8jFeWw7Sa6ULPbJ3J4FwUCdJYuotfG6JXS4A9sP.json
echo '{"index": {"_id": 1}}' >> QmaKvBe8jFeWw7Sa6ULPbJ3J4FwUCdJYuotfG6JXS4A9sP.json
echo  '{"customer_name": "'$someData2'"}' >> QmaKvBe8jFeWw7Sa6ULPbJ3J4FwUCdJYuotfG6JXS4A9sP.json

curl -k -u elastic:123456 -XPOST "https://localhost:9200/customerdata4/_bulk" -H 'Content-Type: application/x-ndjson' --data-binary @QmaKvBe8jFeWw7Sa6ULPbJ3J4FwUCdJYuotfG6JXS4A9sP.json


someData3=`cat QmeFtCsJSNQjWxFnhM548m5NQuBfhPthiJJRzm1JJ2u4XA.png |base64`
echo -n '' > QmeFtCsJSNQjWxFnhM548m5NQuBfhPthiJJRzm1JJ2u4XA.json
echo '{"index": {"_id": 1}}' >> QmeFtCsJSNQjWxFnhM548m5NQuBfhPthiJJRzm1JJ2u4XA.json
echo  '{"customer_name": "'$someData3'"}' >> QmeFtCsJSNQjWxFnhM548m5NQuBfhPthiJJRzm1JJ2u4XA.json


curl -k -u elastic:123456 -XPOST "https://localhost:9200/customerdata5/_bulk" -H 'Content-Type: application/x-ndjson' --data-binary @QmeFtCsJSNQjWxFnhM548m5NQuBfhPthiJJRzm1JJ2u4XA.json


curl -k -u elastic:123456 -XPOST "https://localhost:9200/customerdata6/_bulk" -H 'Content-Type: application/x-ndjson' --data-binary @QmeFtCsJSNQjWxFnhM548m5NQuBfhPthiJJRzm1JJ2u4XA.json
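The three file-building snippets above follow one pattern: base64-encode an image and wrap it in a two-line bulk request. Below is a sketch of that pattern as a reusable function, under a name of my own (make_bulk_file). It assumes a base64 command that reads from standard input, and it strips the line breaks that GNU base64 inserts every 76 characters, so the document stays on a single line as the bulk format requires.

```shell
# make_bulk_file IMAGE OUTPUT   (hypothetical helper)
# Writes a two-line _bulk request: an action line and one document
# holding the base64-encoded image in the customer_name field.
make_bulk_file() {
  local image="$1" output="$2"
  {
    printf '%s\n' '{"index": {"_id": 1}}'
    printf '{"customer_name": "'
    base64 < "$image" | tr -d '\n'   # drop the line wrapping added by GNU base64
    printf '"}\n'
  } > "$output"
}

# Usage:
# make_bulk_file QmPC2FFxBp9Lh97rdSo3wj4pqmpnZ7LsZCCud8QpU8ukwK.png QmPC2FFxBp9Lh97rdSo3wj4pqmpnZ7LsZCCud8QpU8ukwK.json
```

Streaming the encoded data straight into the file avoids holding the whole blob in a shell variable.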

4. Check data allocation

4.1. Cluster health

				
curl -k -u elastic:123456 -XGET "https://localhost:9200/_cluster/health?pretty"

{
  "cluster_name" : "docker-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 7,
  "active_shards" : 14,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

4.2. Data distribution

Run the command below to see how the shards are allocated and to confirm that all of them are STARTED, as expected in a green cluster.

curl -k -u elastic:123456 -XGET "https://localhost:9200/_cat/shards?v&s=state:asc&index=customerdata*"

example response:

index         shard prirep state   docs   store dataset ip         node
customerdata3 0     p      STARTED    0 134.3kb 134.3kb 172.26.0.3 africa01
customerdata3 0     r      STARTED    0 134.3kb 134.3kb 172.26.0.4 arctica01
customerdata5 0     p      STARTED    1 114.9kb 114.9kb 172.26.0.3 africa01
customerdata5 0     r      STARTED    1 114.9kb 114.9kb 172.26.0.4 arctica01
customerdata6 0     p      STARTED    1 114.9kb 114.9kb 172.26.0.3 africa01
customerdata6 0     r      STARTED    1 114.9kb 114.9kb 172.26.0.2 europe01
customerdata4 0     p      STARTED    0 118.3kb 118.3kb 172.26.0.3 africa01
customerdata4 0     r      STARTED    0 118.3kb 118.3kb 172.26.0.2 europe01
customerdata  0     p      STARTED    1 296.1kb 296.1kb 172.26.0.4 arctica01
customerdata  0     r      STARTED    1 296.1kb 296.1kb 172.26.0.2 europe01
customerdata2 0     p      STARTED    1 296.1kb 296.1kb 172.26.0.4 arctica01
customerdata2 0     r      STARTED    1 296.1kb 296.1kb 172.26.0.2 europe01

4.3. Disk usage

To see disk usage, run:

curl -k -u elastic:123456 -XGET "https://localhost:9200/_cat/nodes?v=true&h=name,master,disk.used_percent,disk.avail,disk.used,disk.total&s=name"

example response:

name      master disk.used_percent disk.avail disk.used disk.total
africa01  -                  53.52      952kb       1mb        2mb
arctica01 -                  75.78      496kb     1.5mb        2mb
europe01  *                  75.98      492kb     1.5mb        2mb

The output above shows that the largest amount of free space any two nodes can offer is 952kb + 496kb = 1448kb, while the smallest disk usage of a single node is 1mb. However, the data is distributed in such a way that the failure of one node leaves either 1.5mb (1536kb) to be reassigned within 1448kb of free space at most, or 1mb (1024kb) to be reassigned within 496kb + 492kb = 988kb. In both cases there will not be enough disk space to keep the cluster healthy.
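The arithmetic above can be checked mechanically. This is a rough sketch using the figures from the example response converted to kb (1mb = 1024kb). The function name can_absorb is mine, and it deliberately keeps the same simplification as the text: the whole disk usage of the failed node has to fit into the free space of the survivors.

```shell
# can_absorb USED_KB AVAIL_KB...   (hypothetical helper)
# USED_KB: disk used on the failed node; AVAIL_KB...: free space on each survivor.
can_absorb() {
  local used="$1" total=0 avail
  shift
  for avail in "$@"; do
    total=$((total + avail))
  done
  if [ "$used" -le "$total" ]; then
    echo "fits: ${used}kb needed, ${total}kb free"
  else
    echo "does not fit: ${used}kb needed, ${total}kb free"
  fi
}

can_absorb 1024 496 492   # africa01 fails  (1mb used)
can_absorb 1536 952 492   # arctica01 fails (1.5mb used)
can_absorb 1536 952 496   # europe01 fails  (1.5mb used)
```

All three calls report "does not fit", which matches the outcome of the simulation.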

5. One node killed

5.1. Postpone relocation

Killing one container will show how it takes down the whole cluster. The only option that keeps the cluster alive for some time is the setting

"index.unassigned.node_left.delayed_timeout"

which can be modified to give you more or less time. To start allocation instantly:

curl -k -u elastic:123456 -XPUT "https://localhost:9200/_all/_settings?pretty" \
-H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "0"
  }
}
'

and to start it after 5 minutes (the default is 1 minute):

curl -k -u elastic:123456 -XPUT "https://localhost:9200/_all/_settings?pretty" \
-H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "5m"
  }
}
'
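Both requests differ only in the timeout value, so a small wrapper makes it easier to repeat the experiment with different delays. This is a sketch with a hypothetical helper name, set_node_left_timeout; it sends the same request body as the commands above.

```shell
# set_node_left_timeout VALUE   (hypothetical helper; VALUE e.g. 0, 1m, 5m)
# Applies index.unassigned.node_left.delayed_timeout to all indices.
set_node_left_timeout() {
  local value="$1"
  curl -k -u elastic:123456 -XPUT "https://localhost:9200/_all/_settings?pretty" \
    -H 'Content-Type: application/json' \
    -d "{\"settings\": {\"index.unassigned.node_left.delayed_timeout\": \"$value\"}}"
}

# Usage:
# set_node_left_timeout 5m
```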

5.2. Kill container

Run the command below to kill the container and observe how the cluster becomes unresponsive within 1 minute (with the default timeout):

docker kill africa01

5.3. Check shards situation – Last minute

This is your last chance to call the API before the timeout expires, and you can see the unassigned shards:

curl -k -u elastic:123456 -XGET "https://localhost:9200/_cat/shards?v&s=state:asc&index=customerdata*"

response:

				
index         shard prirep state      docs   store dataset ip         node
customerdata4 0     r      UNASSIGNED                                 
customerdata6 0     r      UNASSIGNED                                 
customerdata3 0     r      UNASSIGNED                                 
customerdata5 0     r      UNASSIGNED                                 
customerdata4 0     p      STARTED       1 268.2kb 268.2kb 172.26.0.2 europe01
customerdata6 0     p      STARTED       1 114.9kb 114.9kb 172.26.0.2 europe01
customerdata  0     p      STARTED       1 296.2kb 296.2kb 172.26.0.4 arctica01
customerdata  0     r      STARTED       1 296.2kb 296.2kb 172.26.0.2 europe01
customerdata2 0     p      STARTED       1 296.2kb 296.2kb 172.26.0.4 arctica01
customerdata2 0     r      STARTED       1 296.2kb 296.2kb 172.26.0.2 europe01
customerdata3 0     p      STARTED       1 296.2kb 296.2kb 172.26.0.4 arctica01
customerdata5 0     p      STARTED       1 114.9kb 114.9kb 172.26.0.4 arctica01
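To follow the countdown without reading the table by eye, the output can be piped through a small filter. This is a sketch that assumes the default column order shown above, where state is the fourth column; count_unassigned is a name I made up.

```shell
# count_unassigned   (hypothetical helper)
# Reads `_cat/shards?v` output on stdin and prints the number of UNASSIGNED shards.
count_unassigned() {
  # NR > 1 skips the header line; n + 0 prints 0 when nothing matched
  awk 'NR > 1 && $4 == "UNASSIGNED" { n++ } END { print n + 0 }'
}

# Usage:
# curl -k -s -u elastic:123456 "https://localhost:9200/_cat/shards?v&index=customerdata*" | count_unassigned
```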

				
curl -k -u elastic:123456 -XGET "https://localhost:9200/_cat/nodes?v=true&h=name,master,disk.used_percent,disk.avail,disk.used,disk.total&s=name"

response:

name      master disk.used_percent disk.avail disk.used disk.total
arctica01 -                  62.30      772kb     1.2mb        2mb
europe01  *                  62.50      768kb     1.2mb        2mb

5.4. Cluster is down

The health check should no longer work:

curl -k -u elastic:123456 -XGET "https://localhost:9200/_cluster/health?pretty"

and should return the error below:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "master_not_discovered_exception",
        "reason" : null
      }
    ],
    "type" : "master_not_discovered_exception",
    "reason" : null
  },
  "status" : 503
}

In some rare cases the cluster might still be green, but most of the time it will not be, and so the simulation has succeeded.

6. Conclusion

The verdict is simple: when a node leaves the cluster, it triggers replica shard allocation, and you cannot stop it. To avoid cluster failure, keep enough free disk space to handle that situation.
