Etcd is one of the most critical pieces of a Kubernetes cluster, since the entire cluster state is stored in etcd. Consequently, proper planning for etcd is required when setting up a production Kubernetes cluster.
Here is a list of important considerations for etcd, based on my experience architecting Kubernetes clusters. Please share your own experiences in the comment section.
ETCD Data Storage
The etcd data directory consists of two sub-directories:
- wal: The write ahead log (wal) files are stored here.
- snap: The snapshots are stored here.
The location of the wal directory can be specified separately; otherwise both wal and snap are stored under the location specified by the data directory.
Ensure the etcd data directory (default: /var/lib/etcd) is hosted on reliable and fast storage, preferably a separate disk or dedicated partition. External SAN storage can also be used.
If you don’t want to place the entire data directory on a separate disk or partition, then at least the wal directory should be on a dedicated disk/partition. This is a must for production deployments, as it directly affects cluster throughput and stability.
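For example, the wal directory can be split out onto its own disk with etcd's --wal-dir flag. A minimal sketch, assuming a dedicated disk is mounted at /var/lib/etcd-wal (a hypothetical mount point):

```shell
# Place the write ahead log on its own disk; --data-dir and --wal-dir
# are standard etcd flags. Mount points here are placeholders.
etcd --name infra0 \
  --data-dir /var/lib/etcd \
  --wal-dir /var/lib/etcd-wal
```

With this layout, snapshots still go under the data directory while the latency-sensitive wal writes get the dedicated disk.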
ETCD Data Backup
It’s important to backup the data directory (and wal directory, if stored separately) at regular intervals.
The etcd CLI tool, etcdctl, provides a backup function that can be used:
# etcdctl backup --data-dir <data_dir> [--wal-dir <wal_dir>] --backup-dir <backup_data_dir> [--backup-wal-dir <backup_wal_dir>]
Here is an example from a setup where there is no separate wal directory. The data directory is stored on a separate disk.
# etcdctl backup --data-dir=/etcd-data --backup-dir=/etcd-backup
This copies the etcd data to a backup dir. No compression or other post-processing is done by etcdctl.
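Since etcdctl does no compression or rotation itself, you may want to schedule the backup and archive it yourself. A rough sketch using the paths from the example above; the schedule, dated directories, and compression step are illustrative, not part of the original setup:

```shell
# Hypothetical crontab entries: nightly etcdctl backup into a dated
# directory, then a compressed archive of it half an hour later.
# Note: % must be escaped as \% inside a crontab.
0 2 * * * etcdctl backup --data-dir=/etcd-data --backup-dir=/etcd-backup/$(date +\%F)
30 2 * * * tar -czf /etcd-backup/$(date +\%F).tar.gz -C /etcd-backup $(date +\%F)
```

Prune old archives according to your retention policy, and periodically test that a restore from one of these archives actually works.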
Before using etcdctl to back up the data, ensure the cluster state is ‘healthy’ and take the backup from a cluster member with ‘healthy’ status.
# etcdctl -C "https://pkb-rhel71-1.kube.com:2379,https://pkb-rhel71-2.kube.com:2379,https://pkb-rhel71-3.kube.com:2379" --cert-file /etc/ssl/etcd/pkb-rhel71-1.kube.com-worker.pem --key-file /etc/ssl/etcd/pkb-rhel71-1.kube.com-worker-key.pem --ca-file /etc/ssl/etcd/ca.pem cluster-health
member 8211f1d0f64f3269 is healthy: got healthy result from https://pkb-rhel71-1.kube.com:2379
member 91bc3c398fb3c146 is healthy: got healthy result from https://pkb-rhel71-2.kube.com:2379
member fd422379fda50e48 is unhealthy: got unhealthy result from https://pkb-rhel71-3.kube.com:2379
cluster is healthy
Take the backup from either of the first two members, whose status is healthy.
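If the backup is scripted, the cluster-health output can be parsed to pick a healthy endpoint first. A rough sketch using the same endpoints and certificates as above; the awk filter is illustrative, not an etcdctl feature:

```shell
# Extract the client URL of the first member reported healthy.
# The member lines end with the member's client URL, so $NF picks it up.
HEALTHY=$(etcdctl -C "https://pkb-rhel71-1.kube.com:2379,https://pkb-rhel71-2.kube.com:2379,https://pkb-rhel71-3.kube.com:2379" \
    --cert-file /etc/ssl/etcd/pkb-rhel71-1.kube.com-worker.pem \
    --key-file /etc/ssl/etcd/pkb-rhel71-1.kube.com-worker-key.pem \
    --ca-file /etc/ssl/etcd/ca.pem cluster-health \
  | awk '/^member .* is healthy/ {print $NF; exit}')
echo "Taking backup from healthy member: $HEALTHY"
```

The script should abort if $HEALTHY comes back empty, since that means no member reported healthy.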
Another option for backup is to use an external tool like etcdtool – https://github.com/mickep76/etcdtool
etcdtool is a version-independent backup and restore utility for etcd. The following is an example invocation to export (back up) data using etcdtool:
# etcdtool -p https://pkb-rhel71-1.kube.com:2379 --cert /etc/ssl/etcd/pkb-rhel71-1.kube.com-worker.pem --key /etc/ssl/etcd/pkb-rhel71-1.kube.com-worker-key.pem --ca /etc/ssl/etcd/ca.pem export / -o /etcd-backup/k8s.json
The following is an example invocation to import (restore) data:
# etcdtool -p https://pkb-rhel71-1.kube.com:2379 --cert /etc/ssl/etcd/pkb-rhel71-1.kube.com-worker.pem --key /etc/ssl/etcd/pkb-rhel71-1.kube.com-worker-key.pem --ca /etc/ssl/etcd/ca.pem import / /etcd-backup/k8s.json
Ensure etcd is clustered. A detailed configuration guide for setting up etcd clustering is available here – https://coreos.com/etcd/docs/latest/clustering.html
For Kubernetes you can use the ‘static’ bootstrap method, sizing the cluster based on your failure tolerance requirement.
Have a look at the following fault tolerance table for guidance. The failure tolerance number indicates the number of permanent server (cluster member) failures that the cluster can tolerate. A cluster of N members needs a majority of (N/2)+1 members to operate and can tolerate up to (N-1)/2 permanent failures; beyond that the cluster must be restored from backup.
Cluster Size | Majority | Failure Tolerance
1 | 1 | 0
3 | 2 | 1
5 | 3 | 2
7 | 4 | 3
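As an illustration, a static bootstrap of a 3-member cluster (tolerating one permanent failure) looks roughly like the following. The member names and IP addresses are placeholders for your environment; the flags themselves are etcd's standard static-bootstrap flags:

```shell
# Run on the first member (infra0); repeat on infra1 and infra2 with
# their own --name and URLs. --initial-cluster lists all three peers.
etcd --name infra0 \
  --initial-advertise-peer-urls https://10.0.0.1:2380 \
  --listen-peer-urls https://10.0.0.1:2380 \
  --listen-client-urls https://10.0.0.1:2379,https://127.0.0.1:2379 \
  --advertise-client-urls https://10.0.0.1:2379 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-cluster infra0=https://10.0.0.1:2380,infra1=https://10.0.0.2:2380,infra2=https://10.0.0.3:2380 \
  --initial-cluster-state new
```

The --initial-cluster value must be identical on all three members; only --name and the advertise/listen URLs change per node.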
Also keep in mind that the data (and wal) directories are specific to each member of the cluster and must not be shared among cluster members. This is a common misconception among people coming from traditional database backgrounds.
ETCD TLS Setup
It’s good practice to secure all etcd communication using SSL/TLS. Ensure certificate-based authentication is enabled as well.
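In practice this means passing both client-facing and peer-facing TLS flags to etcd. A sketch with placeholder certificate paths; --client-cert-auth and --peer-client-cert-auth are the flags that turn on certificate-based authentication:

```shell
# Client-to-server TLS with client certificate auth, plus
# peer-to-peer TLS between cluster members. Paths are placeholders.
etcd --name infra0 \
  --cert-file=/etc/ssl/etcd/server.pem \
  --key-file=/etc/ssl/etcd/server-key.pem \
  --client-cert-auth \
  --trusted-ca-file=/etc/ssl/etcd/ca.pem \
  --peer-cert-file=/etc/ssl/etcd/peer.pem \
  --peer-key-file=/etc/ssl/etcd/peer-key.pem \
  --peer-client-cert-auth \
  --peer-trusted-ca-file=/etc/ssl/etcd/ca.pem
```

With --client-cert-auth set, any client (including etcdctl, as in the cluster-health example above) must present a certificate signed by the trusted CA.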
Refer to the following link which describes the configuration settings: https://coreos.com/etcd/docs/latest/security.html
While the default settings should be fine for the majority of cases, you may still want to review the following tuning guide for applicability to your environment – https://coreos.com/etcd/docs/latest/tuning.html
If you would like to see a working configuration of a 3-node etcd cluster to replicate in your own setup, please take a look at the following article.