Ceph is a distributed open source storage solution that supports Object Storage, Block Storage and File Storage.
Other open source distributed storage systems are GlusterFS and HDFS.
In this guide, we describe how to set up a basic Ceph cluster for Block Storage. We have 25 nodes in our setup. The masternode is a MAAS Region and Rack controller. The rest of the nodes run Ubuntu 16.04 deployed through MAAS. The recommended filesystem for Ceph is XFS and this is what is used on the nodes.
This guide is based on the Quick Installation guide from the Ceph Documentation. It uses the ceph-deploy tool, which is a relatively quick way to set up Ceph, especially for newcomers. There are also the Manual Installation, and deployments through Ansible and Juju.
Prerequisites
Topology
- 1 deploy node (masternode). MAAS region and rack controller is installed, plus Ansible
- 3 monitor nodes (node01,node11,node24). Ubuntu 16.04 on XFS deployed through MAAS
- 20 OSD nodes (node02-10, node12-14, node16-23).
Create an Ubuntu user on masternode
It is convenient to create an ubuntu user on the masternode with passwordless sudo access:
$ sudo useradd -m -s /bin/bash ubuntu
Run visudo and give passwordless sudo access to the ubuntu user:
ubuntu ALL=NOPASSWD:ALL
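Alternatively, you can put the same rule in a drop-in file instead of editing the main sudoers file:
$ echo 'ubuntu ALL=NOPASSWD:ALL' | sudo tee /etc/sudoers.d/ubuntu
$ sudo chmod 0440 /etc/sudoers.d/ubuntu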
Generate an SSH key pair for the ubuntu user:
$ ssh-keygen -b 4096
Generating public/private rsa key pair.
Enter file in which to save the key (/home/ubuntu/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/ubuntu/.ssh/id_rsa.
Your public key has been saved in /home/ubuntu/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:t1zWURVk7j6wJPkA3VmbcHtAKh3EB0kyanORVbiiBkU ubuntu@masternode
The key's randomart image is:
+---[RSA 4096]----+
| .E +**B=*=|
| ..o==oOo+|
| .+.o.o=.=.|
| .. oo.o....|
| .S..=oo.. |
| oo += + |
| . o o o |
| .|
| |
+----[SHA256]-----+
Deploy the /home/ubuntu/.ssh/id_rsa.pub pubkey on all the nodes (append it to /home/ubuntu/.ssh/authorized_keys). You could also add this pubkey to the MAAS user before deploying Ubuntu 16.04 on the nodes.
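If the ubuntu user on the nodes is already reachable over SSH (for example with the MAAS key or a password), one way to push the new key is a loop like the following. This is only a sketch and assumes the node*.maas names resolve through the MAAS DNS:
$ for ID in {01..24}; do ssh-copy-id -i /home/ubuntu/.ssh/id_rsa.pub ubuntu@node${ID}.maas; done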
Set /etc/hosts
$ for ID in {01..24}; do echo "$(dig +short node${ID}.maas @127.0.0.1) node${ID}.maas node${ID}"; done > nodes.txt
It should look like this:
192.168.10.28 node01.maas node01
192.168.10.29 node02.maas node02
192.168.10.30 node03.maas node03
192.168.10.31 node04.maas node04
192.168.10.32 node05.maas node05
192.168.10.33 node06.maas node06
192.168.10.34 node07.maas node07
192.168.10.35 node08.maas node08
192.168.10.36 node09.maas node09
192.168.10.37 node10.maas node10
192.168.10.38 node11.maas node11
192.168.10.39 node12.maas node12
192.168.10.40 node13.maas node13
192.168.10.41 node14.maas node14
192.168.10.42 node16.maas node16
192.168.10.43 node17.maas node17
192.168.10.44 node18.maas node18
192.168.10.45 node19.maas node19
192.168.10.46 node20.maas node20
192.168.10.47 node21.maas node21
192.168.10.48 node22.maas node22
192.168.10.49 node23.maas node23
192.168.10.50 node24.maas node24
Now you can append the result to /etc/hosts:
$ cat nodes.txt | sudo tee -a /etc/hosts
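A quick check that the short names now resolve to the addresses from nodes.txt:
$ getent hosts node01 node24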
Ansible setup
Use this setup in /etc/ansible/hosts on masternode:
[masternode]
masternode
[nodes]
node01
node02
node03
node04
node05
node06
node07
node08
node09
node10
node11
node12
node13
node14
node15
node16
node17
node18
node19
node20
node21
node22
node23
node24
[ceph-mon]
node01
node11
node24
[ceph-osd]
node02
node03
node04
node05
node06
node07
node08
node09
node10
node12
node13
node14
node15
node16
node17
node18
node19
node20
node21
node22
node23
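A quick connectivity check, using the raw module since Python is not installed on the nodes yet (this assumes the ubuntu user and the SSH key deployed earlier):
$ ansible nodes -u ubuntu -m raw -a "hostname"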
Install python on all the nodes
$ for ID in {01..24}
> do
> ssh node${ID} "sudo apt -y install python-minimal"
> done
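Once Python is in place the regular Ansible modules work, so you can verify with the ping module:
$ ansible nodes -u ubuntu -m ping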
Ensure time synchronization of the nodes
Install the theodotos/debian-ntp role from Ansible Galaxy:
$ sudo ansible-galaxy install theodotos.debian-ntp
Create a basic playbook ntp-init.yml:
---
- hosts: nodes
  remote_user: ubuntu
  become: yes
  roles:
    - { role: theodotos.debian-ntp, ntp.server: masternode }
Apply the playbook:
$ ansible-playbook ntp-init.yml
Verify that the monitor nodes are time synchronized:
$ ansible ceph-mon -a 'timedatectl'
node11 | SUCCESS | rc=0 >>
Local time: Fri 2017-04-28 08:06:30 UTC
Universal time: Fri 2017-04-28 08:06:30 UTC
RTC time: Fri 2017-04-28 08:06:30
Time zone: Etc/UTC (UTC, +0000)
Network time on: yes
NTP synchronized: yes
RTC in local TZ: no
node24 | SUCCESS | rc=0 >>
Local time: Fri 2017-04-28 08:06:30 UTC
Universal time: Fri 2017-04-28 08:06:30 UTC
RTC time: Fri 2017-04-28 08:06:30
Time zone: Etc/UTC (UTC, +0000)
Network time on: yes
NTP synchronized: yes
RTC in local TZ: no
node01 | SUCCESS | rc=0 >>
Local time: Fri 2017-04-28 08:06:30 UTC
Universal time: Fri 2017-04-28 08:06:30 UTC
RTC time: Fri 2017-04-28 08:06:30
Time zone: Etc/UTC (UTC, +0000)
Network time on: yes
NTP synchronized: yes
RTC in local TZ: no
Check also the OSD nodes:
$ ansible ceph-osd -a 'timedatectl'
Install Ceph
Install ceph-deploy
On masternode:
$ sudo apt install ceph-deploy
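The ceph-deploy package from the Ubuntu archive is sufficient for this guide. If you prefer the upstream packages instead, the Ceph Quick Installation guide (see References) describes adding the download.ceph.com repository, roughly like this:
$ wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -
$ echo deb https://download.ceph.com/debian-jewel/ $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph.list
$ sudo apt update
$ sudo apt install ceph-deploy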
Create a new cluster and set the monitor nodes (there must be an odd number of monitors):
$ ceph-deploy new node01 node11 node24
Install Ceph on the masternode and all the other nodes:
$ ceph-deploy install masternode node{01..24}
Deploy the monitors and gather the keys:
$ ceph-deploy mon create-initial
Prepare the OSD nodes
Create the OSD directories
Create the OSD directories on the OSD nodes:
$ I=0;
$ for ID in {02..10} {12..14} {16..23}
> do
> ssh -l ubuntu node${ID} "sudo mkdir /var/local/osd${I}"
> I=$((${I}+1))
> done;
Verify that the OSD directories are created:
$ ansible ceph-osd -a "ls /var/local" | cut -d\| -f1 | xargs -n2 | sort
node02 osd0
node03 osd1
node04 osd2
node05 osd3
node06 osd4
node07 osd5
node08 osd6
node09 osd7
node10 osd8
node12 osd9
node13 osd10
node14 osd11
node16 osd12
node17 osd13
node18 osd14
node19 osd15
node20 osd16
node21 osd17
node22 osd18
node23 osd19
Nodes 01, 11 and 24 are excluded because those are the monitor nodes.
Fix OSD permissions
Because of a known bug we need to change the ownership of the OSD directories to ceph:ceph. Otherwise you will get this error:
** ERROR: error creating empty object store in /var/local/osd0: (13) Permission denied
Change the ownership of the OSD directories on the OSD nodes:
$ I=0;
$ for ID in {02..10} {12..14} {16..23}
> do
> ssh -l ubuntu node${ID} "sudo chown ceph:ceph /var/local/osd${I}"
> I=$((${I}+1))
> done;
Prepare the OSDs
$ I=0
$ for ID in {02..10} {12..14} {16..23}
> do
> ceph-deploy --username ubuntu osd prepare node${ID}:/var/local/osd${I}
> I=$((${I}+1))
> done
Activate the OSDs
$ I=0
$ for ID in {02..10} {12..14} {16..23}
> do
> ceph-deploy --username ubuntu osd activate node${ID}:/var/local/osd${I}
> I=$((${I}+1))
> done
Deploy the configuration file and admin key
Now we need to deploy the configuration file and admin key to the admin node and our Ceph nodes. This saves us from having to specify the monitor address and keyring every time we execute a Ceph CLI command.
$ ceph-deploy admin masternode node{01..24}
Set the keyring to be world readable:
$ sudo chmod +r /etc/ceph/ceph.client.admin.keyring
Test and verify
$ ceph health
HEALTH_WARN too few PGs per OSD (9 < min 30)
HEALTH_ERR clock skew detected on mon.node11, mon.node24; 64 pgs are stuck inactive for more than 300 seconds; 64 pgs stuck inactive; 64 pgs stuck unclean; Monitor clock skew detected
Our newly built cluster is not healthy. The clock skew warning should clear on its own once NTP has fully synchronized the monitors, but the Placement Groups (PGs) we need to fix ourselves. The formula is the minimum number of expected PGs per OSD (30) times the number of OSDs (20), rounded to the nearest power of 2:
30 x 20 = 600 => pg_num = 512
Increase PGs:
$ ceph osd pool set rbd pg_num 512
Now we run ceph health again:
$ ceph health
HEALTH_WARN pool rbd pg_num 512 > pgp_num 64
Still some tweaking needs to be done. We need to adjust pgp_num to 512:
$ ceph osd pool set rbd pgp_num 512
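You can confirm the new values with:
$ ceph osd pool get rbd pg_num
pg_num: 512
$ ceph osd pool get rbd pgp_num
pgp_num: 512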
And we are there at last:
$ ceph health
HEALTH_OK
Create a Ceph Block Device
Check the available storage:
$ ceph df
GLOBAL:
    SIZE       AVAIL      RAW USED     %RAW USED
    11151G     10858G         293G          2.63
POOLS:
    NAME     ID     USED     %USED     MAX AVAIL     OBJECTS
    rbd      0        306         0        3619G           4
Now we need to create a RADOS Block Device (RBD) to hold our data.
$ rbd create clusterdata --size 4T --image-feature layering
Check the new block device:
$ rbd ls -l
NAME SIZE PARENT FMT PROT LOCK
clusterdata 4096G 2
Map the block device:
$ sudo rbd map clusterdata --name client.admin
/dev/rbd0
Format the clusterdata device:
$ sudo mkfs -t ext4 /dev/rbd0
Mount the block device:
$ sudo mkdir /srv/clusterdata
$ sudo mount /dev/rbd0 /srv/clusterdata
Now we have a block device for data that is distributed among the 20 storage nodes.
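Note that the mapping and mount above will not persist across a reboot. The ceph-common package ships an rbdmap service for mapping images at boot; a minimal sketch (see the rbdmap(8) man page for the matching /etc/fstab entry and options) could look like this:
$ echo "rbd/clusterdata id=admin,keyring=/etc/ceph/ceph.client.admin.keyring" | sudo tee -a /etc/ceph/rbdmap
$ sudo systemctl enable rbdmap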
Here is a summary of some useful monitoring and troubleshooting commands for Ceph:
$ ceph health
$ ceph health detail
$ ceph status (ceph -s)
$ ceph osd stat
$ ceph osd tree
$ ceph mon dump
$ ceph mon stat
$ ceph -w
$ ceph quorum_status --format json-pretty
$ ceph mon_status --format json-pretty
$ ceph df
If you run into trouble, contact the awesome folks at the #ceph IRC channel, hosted on the Open and Free Technology Community (OFTC) IRC network.
Start over
In case you messed up the procedure and you need to start over you can use the following commands:
$ ceph-deploy purge masternode node{01..24}
$ ceph-deploy purgedata masternode node{01..24}
$ ceph-deploy forgetkeys
$ for ID in {02..10} {12..14} {16..23}; do ssh node${ID} "sudo rm -fr /var/local/osd*"; done
$ rm ceph.conf ceph-deploy-ceph.log .cephdeploy.conf
NOTE: this procedure will destroy your Ceph cluster along with all the data!
Conclusions
Using ceph-deploy may be an easy way to get started with Ceph, but it does not provide much room for customization. For a more fine-tuned setup you may be better off with the Manual Installation, even though it has a steeper learning curve.
References
- http://docs.ceph.com/docs/master/start/
- https://bugs.launchpad.net/ubuntu/+source/tzdata/+bug/1554806
- http://docs.ceph.com/docs/jewel/rados/configuration/filesystem-recommendations/
- http://www.virtualtothecore.com/en/adventures-ceph-storage-part-1-introduction/