Installing Ceph using ceph-deploy

Ceph is a distributed open source storage solution that supports Object Storage, Block Storage and File Storage.

Other open source distributed storage systems are GlusterFS and HDFS.

In this guide, we describe how to set up a basic Ceph cluster for Block Storage. Our setup has 25 nodes. The masternode is a MAAS Region and Rack controller; the rest of the nodes run Ubuntu 16.04 deployed through MAAS. The recommended filesystem for Ceph is XFS, and this is what is used on the nodes.

This guide is based on the Quick Installation guide from the Ceph documentation. It uses the ceph-deploy tool, which is a relatively quick way to set up Ceph, especially for newcomers. Alternatives are the Manual Installation, and deployment through Ansible or Juju.

Prerequisites

Topology

  • 1 deploy node (masternode). The MAAS Region and Rack controller is installed here, plus Ansible
  • 3 monitor nodes (node01, node11, node24). Ubuntu 16.04 on XFS deployed through MAAS
  • 20 OSD nodes (node02-10, node12-14, node16-23). Ubuntu 16.04 on XFS deployed through MAAS

Create an Ubuntu user on masternode

It is convenient to create an ubuntu user on the masternode with passwordless sudo access:

$ sudo useradd -m -s /bin/bash ubuntu

Run visudo and give passwordless sudo access to the ubuntu user:

ubuntu  ALL=NOPASSWD:ALL
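
Alternatively, instead of editing the main sudoers file with visudo, the same rule can be dropped into a file under /etc/sudoers.d (a sketch; the file name is arbitrary):

$ echo 'ubuntu  ALL=NOPASSWD:ALL' | sudo tee /etc/sudoers.d/ubuntu
$ sudo chmod 440 /etc/sudoers.d/ubuntu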

Generate an SSH key pair for the ubuntu user:

$ ssh-keygen -b 4096
Generating public/private rsa key pair.
Enter file in which to save the key (/home/ubuntu/.ssh/id_rsa): 
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/ubuntu/.ssh/id_rsa.
Your public key has been saved in /home/ubuntu/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:t1zWURVk7j6wJPkA3VmbcHtAKh3EB0kyanORVbiiBkU ubuntu@masternode
The key's randomart image is:
+---[RSA 4096]----+
|       .E +**B=*=|
|        ..o==oOo+|
|       .+.o.o=.=.|
|      .. oo.o....|
|       .S..=oo.. |
|        oo += +  |
|       .  o  o o |
|                .|
|                 |
+----[SHA256]-----+

Deploy the /home/ubuntu/.ssh/id_rsa.pub pubkey on all the nodes (append it to /home/ubuntu/.ssh/authorized_keys). Alternatively, you could add this pubkey to the MAAS user before deploying Ubuntu 16.04 on the nodes.
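
One way to push the key to all the nodes, assuming the node*.maas names already resolve through the MAAS DNS, is an ssh-copy-id loop (a sketch):

$ for ID in {01..24}
> do
>   ssh-copy-id ubuntu@node${ID}.maas
> done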

Set /etc/hosts

$ for ID in {01..24}; do echo "$(dig +short node${ID}.maas @127.0.0.1) node${ID}.maas node${ID}"; done > nodes.txt

It should look like this:

192.168.10.28 node01.maas node01
192.168.10.29 node02.maas node02
192.168.10.30 node03.maas node03
192.168.10.31 node04.maas node04
192.168.10.32 node05.maas node05
192.168.10.33 node06.maas node06
192.168.10.34 node07.maas node07
192.168.10.35 node08.maas node08
192.168.10.36 node09.maas node09
192.168.10.37 node10.maas node10
192.168.10.38 node11.maas node11
192.168.10.39 node12.maas node12
192.168.10.40 node13.maas node13
192.168.10.41 node14.maas node14
192.168.10.42 node16.maas node16
192.168.10.43 node17.maas node17
192.168.10.44 node18.maas node18
192.168.10.45 node19.maas node19
192.168.10.46 node20.maas node20
192.168.10.47 node21.maas node21
192.168.10.48 node22.maas node22
192.168.10.49 node23.maas node23
192.168.10.50 node24.maas node24

Now you can append the result in /etc/hosts:

$ cat nodes.txt | sudo tee -a /etc/hosts
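
As a quick sanity check that every node name now resolves locally (a sketch):

$ for ID in {01..24}; do getent hosts node${ID} || echo "node${ID} is missing"; done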

Ansible setup

Use this setup in /etc/ansible/hosts on masternode:

[masternode]
masternode

[nodes]
node01
node02
node03
node04
node05
node06
node07
node08
node09
node10
node11
node12
node13
node14
node15
node16
node17
node18
node19
node20
node21
node22
node23
node24

[ceph-mon]
node01
node11
node24

[ceph-osd]
node02
node03
node04
node05
node06
node07
node08
node09
node10
node12
node13
node14
node15
node16
node17
node18
node19
node20
node21
node22
node23

Install python on all the nodes

$ for ID in {01..24}
> do
>  ssh node${ID} "sudo apt -y install python-minimal"
> done
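
With Python in place, Ansible can reach the nodes over its normal SSH transport; a quick connectivity check (a sketch):

$ ansible nodes -m ping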

Ensure time synchronization of the nodes

Install the theodotos/debian-ntp role from Ansible Galaxy:

$ sudo ansible-galaxy install theodotos.debian-ntp

Create a basic playbook ntp-init.yml:

---
- hosts: nodes
  remote_user: ubuntu
  become: yes
  roles:
     - { role: theodotos.debian-ntp, ntp.server: masternode }

Apply the playbook:

$ ansible-playbook ntp-init.yml

Verify that the monitor nodes are time synchronized:

$ ansible ceph-mon -a 'timedatectl'
node11 | SUCCESS | rc=0 >>
      Local time: Fri 2017-04-28 08:06:30 UTC
  Universal time: Fri 2017-04-28 08:06:30 UTC
        RTC time: Fri 2017-04-28 08:06:30
       Time zone: Etc/UTC (UTC, +0000)
 Network time on: yes
NTP synchronized: yes
 RTC in local TZ: no

node24 | SUCCESS | rc=0 >>
      Local time: Fri 2017-04-28 08:06:30 UTC
  Universal time: Fri 2017-04-28 08:06:30 UTC
        RTC time: Fri 2017-04-28 08:06:30
       Time zone: Etc/UTC (UTC, +0000)
 Network time on: yes
NTP synchronized: yes
 RTC in local TZ: no

node01 | SUCCESS | rc=0 >>
      Local time: Fri 2017-04-28 08:06:30 UTC
  Universal time: Fri 2017-04-28 08:06:30 UTC
        RTC time: Fri 2017-04-28 08:06:30
       Time zone: Etc/UTC (UTC, +0000)
 Network time on: yes
NTP synchronized: yes
 RTC in local TZ: no

Check also the OSD nodes:

$ ansible ceph-osd -a 'timedatectl'

Install Ceph

Install ceph-deploy

On masternode:

$ sudo apt install ceph-deploy
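
ceph-deploy keeps its generated configuration, log and keyrings in the current working directory (the ceph.conf, ceph-deploy-ceph.log and related files that the Start over section removes), so it is convenient to run all the ceph-deploy commands below from a dedicated directory. A sketch; the directory name is our own choice:

$ mkdir ceph-cluster
$ cd ceph-cluster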

Create a new cluster and set the monitor nodes (the number of monitors must be odd):

$ ceph-deploy new node01 node11 node24
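
ceph-deploy new writes an initial ceph.conf into the working directory. If the nodes have more than one network, it is common to pin the public network in that file before installing; a sketch, assuming the 192.168.10.0/24 subnet from the /etc/hosts listing above:

$ echo "public network = 192.168.10.0/24" >> ceph.conf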

Install Ceph on the masternode and all the other nodes:

$ ceph-deploy install masternode node{01..24}

Deploy the monitors and gather the keys:

$ ceph-deploy mon create-initial
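
If the monitors form a quorum, ceph-deploy gathers the admin and bootstrap keyrings into the working directory. A quick way to confirm (a sketch):

$ ls -1 ceph*.keyring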

Prepare the OSD nodes

Create the OSD directories

Create the OSD directories on the OSD nodes:

$ I=0;
$ for ID in {02..10} {12..14} {16..23}
> do 
>  ssh -l ubuntu node${ID} "sudo mkdir /var/local/osd${I}"
>  I=$((${I}+1))
> done;

Verify that the OSD directories are created:

$ ansible ceph-osd -a "ls /var/local" | cut -d\| -f1 | xargs -n2 | sort
node02 osd0
node03 osd1
node04 osd2
node05 osd3
node06 osd4
node07 osd5
node08 osd6
node09 osd7
node10 osd8
node12 osd9
node13 osd10
node14 osd11
node16 osd12
node17 osd13
node18 osd14
node19 osd15
node20 osd16
node21 osd17
node22 osd18
node23 osd19

Nodes 01, 11 and 24 are excluded because those are the monitor nodes.

Fix OSD permissions

Because of a bug, we need to change the ownership of the OSD directories to ceph:ceph. Otherwise you will get this error:

** ERROR: error creating empty object store in /var/local/osd0: (13) Permission denied

Change the ownership of the OSD directories on the OSD nodes:

$ I=0;
$ for ID in {02..10} {12..14} {16..23}
> do 
>   ssh -l ubuntu node${ID} "sudo chown ceph:ceph /var/local/osd${I}"
>   I=$((${I}+1))
> done;
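
The ownership change can be verified across all the OSD nodes with an ad-hoc Ansible command (a sketch):

$ ansible ceph-osd -m shell -a "ls -ld /var/local/osd*"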

Prepare the OSDs

$ I=0
$ for ID in {02..10} {12..14} {16..23}
> do
>   ceph-deploy --username ubuntu osd prepare node${ID}:/var/local/osd${I}
>   I=$((${I}+1))
> done

Activate the OSDs

Activate all the OSDs:

$ I=0
$ for ID in {02..10} {12..14} {16..23}
> do
>   ceph-deploy --username ubuntu osd activate node${ID}:/var/local/osd${I}
>   I=$((${I}+1))
> done

Deploy the configuration file and admin key

Now we need to deploy the configuration file and admin key to the admin node and our Ceph nodes. This will save us from having to specify the monitor address and keyring every time we execute a Ceph CLI command.

$ ceph-deploy admin masternode node{01..24}

Set the keyring to be world readable:

$ sudo chmod +r /etc/ceph/ceph.client.admin.keyring

Test and verify

$ ceph health
HEALTH_WARN too few PGs per OSD (9 < min 30)
HEALTH_ERR clock skew detected on mon.node11, mon.node24; 64 pgs are stuck inactive for more than 300 seconds; 64 pgs stuck inactive; 64 pgs stuck unclean; Monitor clock skew detected 

Our newly built cluster is not healthy. We need to increase the number of Placement Groups (PGs). The formula used here is the minimum expected number of PGs per OSD (30) times the number of OSDs (20), rounded to the closest power of 2:

30 x 20 = 600 => pg_num = 512
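
If you prefer to script the rounding, a small awk one-liner will do (plain arithmetic, not a Ceph command):

$ awk 'BEGIN { n = 30 * 20; p = 1; while (p < n) p *= 2; if (p - n < n - p/2) print p; else print p/2 }'
512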

Increase PGs:

$ ceph osd pool set rbd pg_num 512

Now we run ceph health again:

$ ceph health
HEALTH_WARN pool rbd pg_num 512 > pgp_num 64

Still some tweaking needs to be done. We need to adjust pgp_num to 512:

$ ceph osd pool set rbd pgp_num 512

And we are there at last:

$ ceph health
HEALTH_OK

Create a Ceph Block Device

Check the available storage:

$ ceph df
GLOBAL:
    SIZE       AVAIL      RAW USED     %RAW USED 
    11151G     10858G         293G          2.63 
POOLS:
    NAME     ID     USED     %USED     MAX AVAIL     OBJECTS 
    rbd      0       306         0         3619G           4 

Now we need to create a RADOS Block Device (RBD) to hold our data.

$ rbd create clusterdata --size 4T --image-feature layering

Check the new block device:

$ rbd ls -l
NAME         SIZE PARENT FMT PROT LOCK 
clusterdata 4096G          2

Map the block device:

$ sudo rbd map clusterdata --name client.admin
/dev/rbd0

Format the clusterdata device:

$ sudo mkfs -t ext4 /dev/rbd0

Mount the block device:

$ sudo mkdir /srv/clusterdata
$ sudo mount /dev/rbd0 /srv/clusterdata

Now we have a block device whose data is distributed among the 20 OSD nodes.
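
Note that neither the rbd mapping nor the mount persists across reboots. One way to handle the mapping, assuming the rbdmap service shipped with ceph-common, is sketched below:

$ echo "rbd/clusterdata id=admin,keyring=/etc/ceph/ceph.client.admin.keyring" | sudo tee -a /etc/ceph/rbdmap
$ sudo systemctl enable rbdmap

Once the image is mapped at boot, the filesystem can be mounted through an /etc/fstab entry that references the /dev/rbd/rbd/clusterdata symlink created by udev.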

Here is a summary of some useful monitoring and troubleshooting commands for Ceph:

$ ceph health
$ ceph health detail
$ ceph status (ceph -s)
$ ceph osd stat
$ ceph osd tree
$ ceph mon dump
$ ceph mon stat
$ ceph -w
$ ceph quorum_status --format json-pretty
$ ceph mon_status --format json-pretty
$ ceph df

If you run into trouble, contact the awesome folks in the #ceph IRC channel, hosted on the Open and Free Technology Community (OFTC) IRC network.

Start over

In case you messed up the procedure and you need to start over you can use the following commands:

$ ceph-deploy purge masternode node{01..24}
$ ceph-deploy purgedata masternode node{01..24}
$ ceph-deploy forgetkeys
$ for ID in {02..10} {12..14} {16..23}; do ssh node${ID} "sudo rm -fr /var/local/osd*"; done
$ rm ceph.conf ceph-deploy-ceph.log .cephdeploy.conf

NOTE: this procedure will destroy your Ceph cluster along with all the data!

Conclusions

Using ceph-deploy may be an easy way to get started with Ceph, but it does not provide much customization. For a more fine-tuned setup you may be better off with the Manual Installation, even though it has a steeper learning curve.

References

  • http://docs.ceph.com/docs/master/start/
  • https://bugs.launchpad.net/ubuntu/+source/tzdata/+bug/1554806
  • http://docs.ceph.com/docs/jewel/rados/configuration/filesystem-recommendations/
  • http://www.virtualtothecore.com/en/adventures-ceph-storage-part-1-introduction/