Table of Contents
1. Introduction
2. Hadoop plugin
3. Hadoop Repository Configuration
3.1. Custom Service User
4. Hadoop without plugin?
5. Summary
1. Introduction
There are two kinds of people: those who do backups and those who will.
So in Elasticsearch you should back up your data to stay on the safe side. There are multiple backup options, such as the S3 interface or a shared file system, but in this knowledge article I want to present the famous Hadoop integration.
Hadoop, like Elasticsearch, is a distributed system that keeps replicas of data blocks (the equivalent of shards). By default, three copies of each block exist on three different nodes, which makes data loss statistically very unlikely. That makes Hadoop a reasonable backup target.
If you have a Linux-based installation you can simply call the plugin tool to install an additional plugin, but with Docker it is different because Docker images are immutable. They are built from multiple read-only layers, with a last layer that is writable but not persisted. So to have the Hadoop plugin on board you have to prepare your own Docker image.
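If you prefer to build such an image yourself, a minimal Dockerfile sketch could look like this (the version tag is an example matching the image used later in this article):

FROM docker.elastic.co/elasticsearch/elasticsearch:8.4.0
# Install the official HDFS repository plugin at image build time;
# --batch skips the interactive permission confirmation.
RUN bin/elasticsearch-plugin install --batch repository-hdfs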
Fortunately, if you prefer not to build it yourself, I have already done it for you. Let me show you how to configure it.
2. Hadoop plugin
All nodes in the cluster should have the plugin installed. You can verify it with:
GET https://localhost:9200/_cat/plugins?v=true&s=component&h=name,component,version,description
It should return something like:
name component version description
dc896d273048 repository-hdfs 8.4.0 The HDFS repository plugin adds support for Hadoop Distributed File-System (HDFS) repositories.
The plugin does not come out of the box with the official Elasticsearch Docker image, but you can get a prebuilt image from Docker Hub:
toughcoding/elasticsearch-hadoop-plugin:8.4.0-arm64
The container starts as usual, and the plugin is configured via the Elasticsearch API.
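For a quick local test you can start it like the official image. The port mapping and discovery.type=single-node below are assumptions for a development setup, not production:

# Single-node development container; security setup follows the official image defaults
docker run -d --name es-hdfs -p 9200:9200 \
  -e discovery.type=single-node \
  toughcoding/elasticsearch-hadoop-plugin:8.4.0-arm64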
3. Hadoop Repository Configuration
For a secured (Kerberized) Hadoop cluster, which is the standard case, each node needs its own keytab containing a unique service principal.
The principal will be referenced during HDFS repository creation.
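The repository-hdfs plugin looks for the keytab at repository-hdfs/krb5.keytab inside the Elasticsearch config directory of every node. A hypothetical way to put a per-node keytab into a running container (the container and keytab names are examples only):

# Create the directory the plugin reads the keytab from
docker exec es-node1 mkdir -p /usr/share/elasticsearch/config/repository-hdfs
# Copy this node's keytab into place
docker cp elastic-node1.keytab es-node1:/usr/share/elasticsearch/config/repository-hdfs/krb5.keytab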
Here is an example call that you can adapt:
curl -u admin:secretPassword --insecure -X PUT \
  "https://elasticnode2.somedomain.domain:9200/_snapshot/my_hdfs_repository_ha?verify=true" \
  -H 'Content-Type: application/json' -d'
{
  "type": "hdfs",
  "settings": {
    "uri": "hdfs://nameservice1/",
    "path": "/tmp/elastic_test_20220413_backup/",
    "security.principal": "elastic/_HOST@somedomain.domain",
    "conf.hadoop.rpc.protection": "privacy",
    "conf.dfs.nameservices": "nameservice1",
    "conf.dfs.ha.namenodes.nameservice1": "namenode1,namenode2",
    "conf.dfs.namenode.rpc-address.nameservice1.namenode1": "hadoop-node1.local.domain:8020",
    "conf.dfs.namenode.rpc-address.nameservice1.namenode2": "hadoop-node2.local.domain:8020",
    "conf.dfs.client.failover.proxy.provider.nameservice1": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
  }
}
'
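Once the repository is registered, taking a snapshot is a single call. The snapshot name snapshot_1 below is just an example, and wait_for_completion=true makes the call block until the snapshot finishes:

curl -u admin:secretPassword --insecure -X PUT \
  "https://elasticnode2.somedomain.domain:9200/_snapshot/my_hdfs_repository_ha/snapshot_1?wait_for_completion=true"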
3.1. Custom Service User
If you look at the security principal name, you can see that it contains a _HOST placeholder that is variable, while the remaining parts are hard-coded. If you want the user part to be different for each Elasticsearch node, you have to prepare a custom plugin. I did that in the past, although I haven't found time to test it properly. You can try it yourself before I write an article about how to use it:
docker pull toughcoding/elasticsearch-hadoop-plugin:7.16.2-arm64
4. Hadoop without plugin?
There is one exceptional Hadoop distribution that does not require a plugin on the Elasticsearch side: MapR.
In that case, the mapr-posix-client-basic service runs on the host and uses a maprticket to authenticate to the Hadoop cluster. It exposes a local filesystem mount point for reading and writing files. From the perspective of a client such as an Elasticsearch node there is no difference from a shared file system, and shared file system support is included in the official Docker images, so no additional action is needed.
PUT _snapshot/my_fs_backup
{
  "type": "fs",
  "settings": {
    "location": "my_fs_backup_location"
  }
}
In the above example, my_fs_backup_location is a directory that gets bind-mounted into the Docker container. Remember that a shared file system location must also be covered by the path.repo setting in elasticsearch.yml, otherwise Elasticsearch will refuse to create the repository.
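A sketch of how the container could be started, assuming a hypothetical MapR mount point on the host (the /mapr path and cluster name are examples):

# Bind-mount the MapR POSIX mount point into the container and
# allow snapshots there via path.repo; the relative "location" from
# the repository above resolves against path.repo.
docker run -d --name elasticsearch \
  -v /mapr/my.cluster.com/elastic_backup:/usr/share/elasticsearch/snapshots/my_fs_backup_location \
  -e path.repo=/usr/share/elasticsearch/snapshots \
  docker.elastic.co/elasticsearch/elasticsearch:8.4.0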
In my experience, MapR makes this much simpler, as you do not need to maintain the configuration described above. And if the Docker host is not part of your job, even ticket rotation is not your headache.
5. Summary
In this knowledge article you have learned how to use the Hadoop file system as a backup destination for Elasticsearch. You got an example repository configuration that you can now adjust to your needs. Finally, you saw the difference between MapR and other distributions. This can be your starting point to explore further. I am planning to publish a sample project with a secure Hadoop distribution in the form of labs so you can get your hands dirty, so stay tuned for upcoming articles.
Happy coding!