Table of Contents
1. Introduction
Sometimes you want to test queries on Elasticsearch but you are missing data. Instead of looking for samples over the web, you can generate your own subset. In this article I will show you how to quickly populate a cluster with 1M records using Python and curl.
2. Start Elasticsearch
In order to practice the steps written here, first start a one-node Elasticsearch cluster. Note the http.max_content_length property set to 1000mb: the default value is 100mb, and if you plan to load data into Elasticsearch using the _bulk API with a file larger than that limit, you will encounter an error. It is also handy to create a Docker volume and keep the configuration data there. When Docker starts the container, the empty volume is populated with the container data under '/usr/share/elasticsearch/config/'; afterwards, any container that mounts this volume can access the generated configs and certificates at that location.
docker network create logstash
docker volume create elkconfig
docker run --rm \
--name elk \
--net logstash \
-v elkconfig:/usr/share/elasticsearch/config/ \
-e "http.max_content_length=1000mb" \
-e "node.name=elk" \
-d \
-p 9200:9200 \
docker.elastic.co/elasticsearch/elasticsearch:8.10.4
# reset password
docker exec -it elk /usr/share/elasticsearch/bin/elasticsearch-reset-password -i -u elastic
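Once the password is set, you can verify the node responds by querying the cluster health endpoint. A quick check as a sketch: the --cacert variant assumes the image generated http_ca.crt under the config directory (the default behaviour in 8.x); with -k you can skip certificate validation entirely.
# optional: verify the cluster responds (you will be prompted for the elastic password)
curl -k -u elastic "https://localhost:9200/_cluster/health?pretty"
# or, instead of -k, copy the generated CA out of the container and validate against it
docker cp elk:/usr/share/elasticsearch/config/certs/http_ca.crt .
curl --cacert http_ca.crt -u elastic "https://localhost:9200/_cluster/health?pretty"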
3. Generate test data
Once Elasticsearch is up and running, it has no data. To populate it, you can use a Python script to generate JSON input records.
# prepare data
cat <<'EOF' > generate_json.py
import json
from datetime import datetime, timedelta

# Set the file path where you want to save the JSON file
file_path = "filePY2.json"

# Initialize start date
start_date = datetime.strptime("2001-01-01 00:00:00", "%Y-%m-%d %H:%M:%S")

# Generate data for 1,000,000 pairs of rows
data = []
for i in range(1, 1000001):
    # Index row
    index_row = {"index": {"_id": i}}
    # Data row
    data_row = {
        "connection_name": f"RandomText{i}",
        "start_connection": start_date.strftime("%Y-%m-%d %H:%M:%S"),
    }
    data.extend([index_row, data_row])
    # Increment start date by 15 minutes
    start_date += timedelta(minutes=15)

# Write the data to a JSON file
with open(file_path, "w") as json_file:
    for entry in data:
        json.dump(entry, json_file)
        json_file.write("\n")

print("JSON file generated successfully!")
EOF
python3 generate_json.py
After execution, the content of the output file will be as below:
{"index": {"_id": 1}}
{"connection_name": "RandomText1", "start_connection": "2001-01-01 00:00:00"}
{"index": {"_id": 2}}
{"connection_name": "RandomText2", "start_connection": "2001-01-01 00:15:00"}
{"index": {"_id": 3}}
...
{"index": {"_id": 999998}}
{"connection_name": "RandomText999998", "start_connection": "2029-07-09 15:15:00"}
{"index": {"_id": 999999}}
{"connection_name": "RandomText999999", "start_connection": "2029-07-09 15:30:00"}
{"index": {"_id": 1000000}}
{"connection_name": "RandomText1000000", "start_connection": "2029-07-09 15:45:00"}
If you don't have Python set up on your laptop, you can use a container.
touch filePY2.json
docker run -it \
--rm --name jsonelkinput \
-v ./generate_json.py:/generate_json.py \
-v ./filePY2.json:/filePY2.json \
-w / python:3 python generate_json.py
# Unable to find image 'python:3' locally
# 3: Pulling from library/python
# e720f94321d6: Pull complete
# 5b7541d83e7b: Pull complete
# b1a653d69b7b: Pull complete
# c02d5d50082e: Pull complete
# 161c9929f4a4: Pull complete
# 9e06f1dd1ec0: Pull complete
# 58a24f1a0320: Pull complete
# 094d40e3f29f: Pull complete
# Digest: sha256:2586dd7abe015eeb6673bc66d18f0a628a997c293b41268bc981e826bc0b5a92
# Status: Downloaded newer image for python:3
# JSON file generated successfully!
head filePY2.json
{"index": {"_id": 1}}
{"connection_name": "RandomText1", "start_connection": "2001-01-01 00:00:00"}
{"index": {"_id": 2}}
{"connection_name": "RandomText2", "start_connection": "2001-01-01 00:15:00"}
{"index": {"_id": 3}}
{"connection_name": "RandomText3", "start_connection": "2001-01-01 00:30:00"}
{"index": {"_id": 4}}
{"connection_name": "RandomText4", "start_connection": "2001-01-01 00:45:00"}
{"index": {"_id": 5}}
{"connection_name": "RandomText5", "start_connection": "2001-01-01 01:00:00"}
4. Load test data
The filePY2.json file you prepared in the previous step has 2M rows.
wc -l filePY2.json
2000000 filePY2.json
These rows describe 1M documents that can be loaded into Elasticsearch using the _bulk API. In order to load them, prepare the index mapping definition first.
curl -k -u elastic -XPUT "https://localhost:9200/connections" -H 'content-type: application/json' -d'
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "properties": {
      "connection_name": {
        "type": "keyword"
      },
      "start_connection": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss"
      }
    }
  }
}'
Then run a curl command to load the records from the file into Elasticsearch.
curl -k -u elastic -XPOST "https://localhost:9200/connections/_bulk" -H 'Content-Type: application/x-ndjson' --data-binary @filePY2.json
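If you would rather keep the default 100mb http.max_content_length, you can also send the data in several smaller requests. A sketch using split's default output naming: splitting at an even line count keeps each index/data pair together.
# split into chunks of 200,000 lines (100,000 documents each) and load them one by one
split -l 200000 filePY2.json bulk_chunk_
for f in bulk_chunk_*; do
  curl -k -u elastic -XPOST "https://localhost:9200/connections/_bulk" \
       -H 'Content-Type: application/x-ndjson' --data-binary @"$f"
done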
Confirm the data was loaded properly by checking the shard status. Here there is one index with one shard on one node: simple.
curl -k -u elastic -XGET "https://localhost:9200/_cat/shards?v&index=connections"
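You can also confirm the number of indexed documents; once the index refreshes (by default within about a second), it should report 1000000.
curl -k -u elastic -XGET "https://localhost:9200/connections/_count?pretty"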
5. Final thoughts
In this knowledge article you have learned how to load sample data into Elasticsearch. If you would like to know how to dump data from Elasticsearch, check out another article, Export Data from Elasticsearch – Logstash, and more from that series.