Docker Swarm HA Cluster Setup with Ansible, Keepalived & Automated Container Updates
How I built a resilient Docker Swarm cluster using Ansible and Keepalived, with automated container updates
I've been using Docker for a few years now, but last year I wanted to improve my existing setup. I was running two Raspberry Pi 4s, each booting from a USB SSD and each running its own specific set of containers.
My aims were:
- Allow containers to run on any node to provide High Availability and make maintenance easier
- Provide enough capacity for the future whilst still using low powered hosts
- Upgrade to 3x Raspberry Pi 5s with NVMe storage
- Keep key services such as DNS and reverse proxy available all the time
- Take steps to improve Docker security by using internal networks, AppArmor, sudo for docker commands and trying to run containers as a non-root account
- Automate as much as possible with Ansible to allow quick rebuilding of a Pi for future OS upgrades
To achieve this I decided to go with Docker Swarm instead of going the Kubernetes route (maybe one day!).
This post goes over how I set all this up and everything that cropped up during the process.
Raspberry Pi OS
Since I'm running my Swarm cluster on Raspberry Pis, I'm also using Raspberry Pi OS, which required some changes to the defaults.
I'll highlight a few of the changes I made here and go into depth in the Ansible section.
- Remove the default Pi user
- Create new user account without access to the docker group so sudo must be used. Set a sudo timeout of 1 hour
- Enable AppArmor
- Set additional options required for cAdvisor and graylog
- Configure unattended upgrades and email notifications
- Configure syslog forwarding to graylog
- Configure prometheus node exporter
Docker Swarm
Creating the swarm is the easy part... it's all the unknowns that come after that take the time!
For my setup I have 3 nodes, each configured as a manager. Three managers is the minimum that lets the swarm keep quorum when one of them goes down, and some containers need the Swarm API through the Docker socket, which is only served by manager nodes, so a worker wouldn't be able to run them.
docker swarm init --advertise-addr 10.0.0.1
The output will provide another command to use on the other nodes. If you ever lose the command you can find the tokens with
docker swarm join-token manager or docker swarm join-token worker
Then use docker node ls to confirm each node is showing correctly.
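The choice of three managers comes down to simple Raft majority arithmetic. A minimal sketch of that rule, where quorum and fault_tolerance are illustrative helper names of mine and not Docker commands:

```shell
#!/bin/sh
# Raft needs a majority of managers to agree before the swarm state can change.
# quorum and fault_tolerance are illustrative helper names, not Docker commands.

# Smallest majority of N managers.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# How many managers can be lost while a majority still remains.
fault_tolerance() {
  echo $(( ($1 - 1) / 2 ))
}

for n in 1 2 3 4 5; do
  echo "$n manager(s): quorum $(quorum "$n"), tolerates $(fault_tolerance "$n") failure(s)"
done
```

With 3 managers the quorum is 2, so one node can be down for maintenance. Two managers tolerate no failures at all, and a fourth manager adds nothing over three.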
Networks
I decided to create my networks up front so they are all referenced as external in the compose configurations. As part of this I also wanted to ensure that each network was created as internal unless it required external access.
docker network create --driver overlay --subnet 10.0.1.0/24 proxy
docker network create --driver overlay --internal --subnet 10.0.2.0/24 socket-proxy-traefik
Users
Some containers run as the root user, so where possible I changed this to a non-root user and created an account on the host for them.
sudo useradd -r -u 201 -s /usr/sbin/nologin -M traefik
sudo useradd -r -u 202 -s /usr/sbin/nologin -M authelia
Then set the permissions on the container directory
sudo chown -R 201:201 ../traefik
Secrets
Since Docker Swarm fully supports secrets, I could move away from my previous approach of creating a file for each secret and create them directly in Swarm
echo -n '[email protected]' | docker secret create traefik_cf_api_email -
echo -n 'apikey' | docker secret create traefik_cf_api_key -
Useful Commands
Frequent commands I find myself using
List Services: docker service ls
Check logs: docker service logs -f crowdsec
Remove a stack: docker stack rm traefik
Deploy a stack: docker stack deploy traefik -c /mnt/containers/_swarm/prod/traefik/docker-compose.yml
Re-deploy a stack: docker stack rm traefik && docker stack deploy traefik -c /mnt/containers/_swarm/prod/traefik/docker-compose.yml
Find active node of a running service: docker service ps traefik or docker stack ps traefik
Other commands...
List Stacks: docker stack ls
Inspect Service: docker service inspect --pretty traefik
Scale to 5 instances: docker service scale helloworld=5 (Scale to 0 would effectively stop the container)
Remove Service: docker service rm traefik
Remove all services: docker service rm $(docker service ls -q)
Update a service: docker service update --force bookstack
Most of the time I've found that after making a change, docker service update didn't have the required effect and the stack needed to be removed and recreated for the change to be picked up.
Draining
Draining a node prevents new containers being placed on it, stops any replicas running there and moves them to another node. If you also run containers outside of Swarm, as I do, draining won't do anything with those.
docker node ls
docker node update --availability drain node1
And to make the node available again...
docker node update --availability active node1
Making a node active again does not rebalance existing containers onto it. It will only receive tasks:
- during a service update to scale up
- during a rolling update
- when you set another node to drain
- when a task fails on another active node
Deployment Issues
If a container fails to start and no logs are generated then it's useful to check the error with the following command
docker service ps --no-trunc portainer_portainer
If your stack has multiple services then I found it useful to set the replicas for a service to 0 in the compose file to prevent that container from starting and make troubleshooting a bit easier.
Constraints
Another useful feature is using constraints to control which nodes a container can run on
Prevent running on node3:
deploy:
  placement:
    constraints:
      - node.hostname != node3
Force to only run on node1:
deploy:
  placement:
    constraints:
      - node.hostname == node1
Socket Proxy
I've only got a couple of containers that require access to the Docker socket, but I have been using a socket proxy for a while so that each one only gets the access it needs.
As part of this project I started to create separate socket proxy configurations for the necessary containers
services:
  socket-proxy-traefik:
    image: lscr.io/linuxserver/socket-proxy:latest
    environment:
      - LOG_LEVEL=info # debug,info,notice,warning,err,crit,alert,emerg
      ## Variables match the URL prefix (i.e. AUTH blocks access to /auth/* parts of the API, etc.).
      # 0 to revoke access.
      # 1 to grant access.
      ## Granted by Default
      - EVENTS=1
      - PING=1
      ## Revoked by Default
      # Security critical
      - AUTH=0
      - SECRETS=0
      - POST=0
      # Other
      - NETWORKS=1
      - SERVICES=1
      - TASKS=1
      - VERSION=1
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks:
      - socket-proxy-traefik
    tmpfs:
      - /run
    deploy:
      mode: replicated
      replicas: 1
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 60s
      placement:
        constraints:
          - "node.role==manager" # Ensures it only runs on a manager node
      labels:
        - "gantry.services.excluded=true"
networks:
  socket-proxy-traefik:
    external: true
socket-proxy/docker-compose-traefik.yml
services:
  socket-proxy-gantry:
    image: lscr.io/linuxserver/socket-proxy:latest
    environment:
      - LOG_LEVEL=info # debug,info,notice,warning,err,crit,alert,emerg
      ## Variables match the URL prefix (i.e. AUTH blocks access to /auth/* parts of the API, etc.).
      # 0 to revoke access.
      # 1 to grant access.
      ## Granted by Default
      - EVENTS=1
      - PING=1
      ## Revoked by Default
      # Security critical
      - AUTH=0
      - SECRETS=0
      - POST=1
      # Other
      - ALLOW_START=1
      - ALLOW_STOP=1
      - ALLOW_RESTARTS=1
      - CONTAINERS=1
      - DISTRIBUTION=1
      - IMAGES=1
      - INFO=1
      - NETWORKS=1
      - NODES=1
      - SERVICES=1
      - SWARM=1
      - TASKS=1
      - VERSION=1
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks:
      - socket-proxy-gantry
    tmpfs:
      - /run
    deploy:
      mode: replicated
      replicas: 1
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 60s
      placement:
        constraints:
          - "node.role==manager"
networks:
  socket-proxy-gantry:
    external: true
socket-proxy/docker-compose-gantry.yml
Gantry + Apprise
I previously used Watchtower for automatic container updates but had to find an alternative for Swarm, which is where I came across Gantry.
There is also the nice ability to roll back an update if it fails.
Since there is no native alerting built in to Gantry, this is where Apprise comes in...
I signed up for Pushover years ago and have never looked at an alternative, so here is the config to make that work. I did try to have the Pushover URL in a Docker secret but couldn't get it to work; that's a job for another day.
Create notifications network
docker network create --driver overlay --subnet x.x.x.x/24 notifications
Swarm Configuration
services:
  gantry:
    image: shizunge/gantry:latest
    networks:
      - notifications
      - socket-proxy-gantry
    environment:
      - "DOCKER_HOST=tcp://socket-proxy-gantry:2375"
      - "GANTRY_NODE_NAME={{.Node.Hostname}}"
      - "GANTRY_SLEEP_SECONDS=86400"
      - "GANTRY_NOTIFICATION_CONDITION=on-change"
      - "GANTRY_NOTIFICATION_APPRISE_URL=http://apprise:8000/notify"
      - "TZ=Europe/London"
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.role==manager
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 60s
  # Refer to https://github.com/caronc/apprise-api for all configurations of the API service.
  apprise:
    image: caronc/apprise:latest
    networks:
      - notifications
    environment:
      # Apprise supports almost all of the most popular notification services.
      # Refer to https://github.com/caronc/apprise for all supported notification services.
      - "APPRISE_STATELESS_URLS=pover://userkey@token/?sound=spacealarm"
    volumes:
      - "/mnt/containers/_swarm/prod/apprise/config:/config"
      - "/mnt/containers/_swarm/prod/apprise/plugin:/plugin"
      - "/mnt/containers/_swarm/prod/apprise/attach:/attach"
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 60s
networks:
  notifications:
    external: true
  socket-proxy-gantry:
    external: true
gantry/docker-compose.yml
Send Test Notification
apprise -vv -t "Test Message Title" -b "Test Message Body" "pover://userkey@token/?sound=spacealarm"
Excluding Containers
For some containers I allow auto updates by leaving the image tag set to latest, and for those that I don't want to auto update I set the following label
deploy:
  mode: replicated
  replicas: 1
  restart_policy:
    condition: on-failure
    delay: 5s
    max_attempts: 3
    window: 60s
  labels:
    - "gantry.services.excluded=true"
This does exclude the containers from being updated by Gantry, but I also pin the image tag to a specific version to prevent any accidental updates and to ensure each node runs the same version.
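To illustrate the difference between a pinned image and one that follows latest, here is a rough sketch of a check that could be run over image references. is_pinned is a hypothetical helper of mine and example/app:1.2.3 is a made-up image; neither is part of Gantry or Docker.

```shell
#!/bin/sh
# is_pinned is a hypothetical helper: it succeeds if an image reference is
# pinned to a digest or to a tag other than "latest".
is_pinned() {
  ref=$1
  case $ref in
    *@*) return 0 ;;            # digest reference, e.g. name@sha256:...
  esac
  last=${ref##*/}               # drop registry/namespace, which may contain a port
  case $last in
    *:*) tag=${last##*:} ;;
    *)   tag=latest ;;          # no tag means Docker pulls :latest
  esac
  [ "$tag" != "latest" ]
}

# example/app:1.2.3 is a made-up image used only for illustration
for image in shizunge/gantry:latest lscr.io/linuxserver/socket-proxy example/app:1.2.3; do
  if is_pinned "$image"; then
    echo "$image is pinned"
  else
    echo "$image follows latest"
  fi
done
```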
keepalived
I have a few services running outside of Docker Swarm, such as Pi-hole. I wanted to move away from assigning two DNS servers via DHCP and instead use a single IP address. This ensures traffic always goes to the expected server while still providing high availability if the primary node is rebooted.
For other services that required referencing a specific Swarm node IP, I wanted to guarantee continued availability even if that node went down. Although the routing mesh allows any Swarm node to handle requests, I didn’t want traffic directed to a node that was offline.
To address this, I configured Virtual IPs (VIPs) for the containers. This ensures the service remains accessible even if a node becomes unavailable, as the VIP automatically moves to another healthy node during maintenance or failure scenarios.
On each node I have Ansible install keepalived, configure keepalived.conf and copy over the relevant scripts. A Jinja2 template is used to set the state and priority.
My keepalived.conf looks like the following. Every other node has the state set to BACKUP and a lower priority, both of which are defined in the host variables.
#global_defs {
# enable_script_security
#}
vrrp_script chk_pihole {
script "/etc/keepalived/scripts/check_pihole.sh"
interval 5
timeout 3
fall 2
rise 2
user root
}
vrrp_script chk_snmpexporter {
script "/etc/keepalived/scripts/check_snmpexporter.sh"
interval 5
timeout 3
fall 2
rise 2
user root
}
vrrp_script chk_mqtt {
script "/etc/keepalived/scripts/check_mqtt.sh"
interval 5
timeout 3
fall 2
rise 2
user root
}
vrrp_script chk_traefik {
script "/etc/keepalived/scripts/check_traefik.sh"
interval 5
timeout 3
fall 2
rise 2
user root
}
vrrp_instance pihole {
state MASTER
interface eth0
virtual_router_id 51
priority 100
advert_int 1
authentication {
auth_type PASS
auth_pass yourpassword
}
virtual_ipaddress {
10.0.0.101
}
track_script {
chk_pihole
}
}
vrrp_instance snmpexporter {
state MASTER
interface eth0
virtual_router_id 53
priority 100
advert_int 1
authentication {
auth_type PASS
auth_pass yourpassword
}
virtual_ipaddress {
10.0.0.103
}
track_script {
chk_snmpexporter
}
}
vrrp_instance mqtt {
state MASTER
interface eth0
virtual_router_id 54
priority 100
advert_int 1
authentication {
auth_type PASS
auth_pass yourpassword
}
virtual_ipaddress {
10.0.0.104
}
track_script {
chk_mqtt
}
}
vrrp_instance traefik {
state MASTER
interface eth0
virtual_router_id 55
priority 100
advert_int 1
authentication {
auth_type PASS
auth_pass yourpassword
}
virtual_ipaddress {
10.0.0.105
}
track_script {
chk_traefik
}
}
/etc/keepalived/keepalived.conf
Then the necessary scripts should exist under /etc/keepalived/scripts and be owned by root
sudo mkdir /etc/keepalived/scripts
sudo chmod 700 /etc/keepalived/scripts
#!/bin/bash
# Check if traefik container is running
if docker ps --filter "name=traefik" --filter "status=running" | grep -q traefik_traefik; then
exit 0
else
exit 1
fi
/etc/keepalived/scripts/check_traefik.sh
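Since the four checks only differ by container name, one parameterised script could in principle cover them all. A minimal sketch, assuming a hypothetical script name check_container.sh and that the name passed in matches the container name of the Swarm task (for example traefik_traefik):

```shell
#!/bin/bash
# Hypothetical generic check: succeed only if a running container matches the name.
container_running() {
  # -q prints only container IDs, so any output at all means a match was found
  [ -n "$(docker ps -q --filter "name=$1" --filter "status=running")" ]
}

# When called with an argument, the function's result becomes the exit status
if [ "$#" -gt 0 ]; then
  container_running "$1"
fi
```

In keepalived.conf this should then be referenced with the container name as an argument, e.g. script "/etc/keepalived/scripts/check_container.sh traefik_traefik".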
Restart the keepalived service for changes to take effect
sudo systemctl restart keepalived.service
Ansible
I was already using Ansible prior to migrating to Swarm to allow quick rebuilds of a host when a new version of Raspberry Pi OS is released. Migrating to Swarm has brought some additional complexities but Ansible has helped make rebuilds easy.
This is still a bit of a work in progress but all my existing playbooks are below. My original ones were all under a playbooks directory, but when I revisited this I ended up using some Jinja2 templates, so I set up the relevant folder structure and used roles.
Running a playbook would be done via ansible-playbook /mnt/nas/ansible/playbook_docker.yml -i /mnt/nas/ansible/inventory_new.yml
To avoid having to specify the inventory or type the vault password each time, I set some defaults in ansible.cfg under my home directory. Note that Ansible only picks the file up automatically when it is named ansible.cfg in the directory you run from, or ~/.ansible.cfg in the home directory.
[defaults]
inventory = /mnt/nas/ansible/inventory.yml
vault_password_file = ~/.ansible/vault_pass.txt
roles_path = /mnt/nas/ansible/roles
~/ansible.cfg
pibuild:
  hosts:
    sn3:
      ansible_host: 10.0.0.3
  vars:
    ansible_user: nick
/mnt/nas/ansible/inventory.yml
Some useful commands...
Install pipx: sudo apt install pipx
Install ansible: pipx install --include-deps ansible
Add ~/.local/bin to PATH: pipx ensurepath
Update ansible: pipx upgrade ansible
Verify inventory: ansible-inventory -i inventory.yml --list
Run in check mode: ansible-playbook --check playbook.yaml
Start at Task: ansible-playbook playbook.yml --start-at-task="install packages"
Run interactively / step mode: ansible-playbook playbook.yml --step
Ask for become password: --ask-become-pass
Info: ansible --version and ansible-config dump
Ansible Vault
I started using Resend as an SMTP relay so I thought this was a good opportunity to try out the Vault.
ansible-vault create /mnt/nas/ansible/group_vars/all/vault.yml
ansible-vault edit /mnt/nas/ansible/group_vars/all/vault.yml
Add the line smtp_pass: "apikey" so the variable can be referenced in the template
View the entry with ansible-vault view /mnt/nas/ansible/group_vars/all/vault.yml
I seem to be missing notes on this, but it seems I had some issues with the vault and needed to use the following commands
pipx inject ansible-playbook passlib
pipx inject ansible passlib
pipx inject ansible-core passlib
To prevent asking for the Vault password store it in a file
nano .ansible/vault_pass.txt
chmod 600 .ansible/vault_pass.txt
Playbooks
Install Packages
Install required packages and only install apcupsd on sn3
- name: Raspberry Pi Updates Playbook
  hosts: pibuild
  become: true
  tasks:
    - name: Update apt repo and cache
      apt:
        update_cache: yes
    - name: Upgrade all packages to the latest version
      apt:
        name: "*"
        state: latest
    - name: Install a list of packages
      apt:
        pkg:
          - tldr
          - tmux
          - tcpdump
          - stress
          - sshfs
          - snmp
          - smartmontools
          - nmap
          - iperf3
          - iotop
          - htop
          - hdparm
          - dnsutils
          - apticron
          - msmtp-mta
          - nethogs
          - nload
          - screen
          - ca-certificates
          - gnupg
          - unattended-upgrades
          - bat
          - sg3-utils
          - keepalived
          - fish
          - tree
          - rsyslog
          - duf
    - name: Install apcupsd only on sn3
      apt:
        name: apcupsd
        state: present
      when: inventory_hostname == "sn3"
ansible/playbooks/playbook_apt.yml
Boot Config
Ensure apparmor is enabled and set required options for cAdvisor monitoring
- name: Playbook to update cmdline boot options
  hosts: pibuild
  become: true
  tasks:
    - name: Read current cmdline.txt
      ansible.builtin.command: cat /boot/firmware/cmdline.txt
      register: current_cmdline
      changed_when: false
      tags: bootflags
    # cgroup_enable=memory, swapaccount=1 and cgroup_memory=1 are the flags needed for cAdvisor.
    # Existing copies of each flag are stripped first so each one is only ever appended once.
    - name: Ensure AppArmor and memory cgroup flags are present in /boot/firmware/cmdline.txt
      ansible.builtin.lineinfile:
        path: /boot/firmware/cmdline.txt
        regexp: '^.*$'
        line: >-
          {{ current_cmdline.stdout
          | regex_replace('apparmor=1', '')
          | regex_replace('security=apparmor', '')
          | regex_replace('cgroup_enable=memory', '')
          | regex_replace('swapaccount=1', '')
          | regex_replace('cgroup_memory=1', '')
          | regex_replace(' +', ' ')
          | trim
          }} apparmor=1 security=apparmor cgroup_enable=memory swapaccount=1 cgroup_memory=1
        backrefs: true
      register: cmdline_update
      tags: bootflags
ansible/playbooks/playbook_boot.yml
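The same strip-then-append idea can be tried out in plain shell before touching a real cmdline.txt. This is only a sketch: ensure_flags is a hypothetical helper, the sample command line is made up, and unlike the regex_replace calls in the playbook it matches whole words only.

```shell
#!/bin/sh
# ensure_flags CMDLINE FLAG...
# Hypothetical helper: drops any existing copy of each flag from CMDLINE,
# then appends every flag once, so repeated runs give the same result.
ensure_flags() {
  cmdline=$1
  shift
  out=""
  set -f                        # no glob expansion while word-splitting
  for word in $cmdline; do
    keep=1
    for flag in "$@"; do
      if [ "$word" = "$flag" ]; then
        keep=0
      fi
    done
    if [ "$keep" -eq 1 ]; then
      out="$out $word"
    fi
  done
  set +f
  for flag in "$@"; do
    out="$out $flag"
  done
  printf '%s\n' "${out# }"
}

# Made-up example line; a real cmdline.txt will differ
current="console=serial0,115200 console=tty1 rootwait cgroup_memory=1"
ensure_flags "$current" apparmor=1 security=apparmor cgroup_enable=memory swapaccount=1 cgroup_memory=1
```

Feeding the output back in with the same flags returns an identical line, which is the property that makes the playbook safe to re-run.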
.bashrc
Create bash aliases and change timestamp for history command
- name: Update .bashrc
  hosts: pibuild
  tasks:
    - name: Append Aliases
      lineinfile:
        dest: "/home/nick/.bashrc"
        line: "{{ item }}"
        state: present
        insertafter: EOF
      loop:
        - "alias updates='sudo apt update; apt list --upgradeable'"
        - "alias running_services='systemctl list-units --type=service --state=running'"
        - "alias dockerstop-all='docker stop $(docker ps -q)'"
        - "alias dockerstart='/home/nick/scripts/dockerstart.sh'"
        - "export HISTTIMEFORMAT='%d-%m-%Y %T '"
ansible/playbooks/playbook_bashrc.yml
Update Config Files
Configure various changes for updates and msmtp, and set up syslog forwarding to Graylog.
Ensure only the Raspberry Pi devices have the 51unattended-upgrades file so that Raspberry Pi updates get installed
Unattended-Upgrade::Origins-Pattern {
    "o=Raspberry Pi Foundation,a=stable";
};
51unattended-upgrades
- name: Raspberry Pi Configs Playbook
  hosts: pibuild
  become: true
  tasks:
    - name: Copy over 51unattended config to add Raspberry Pi updates
      ansible.builtin.copy:
        src: ../configs/51unattended-upgrades
        dest: /etc/apt/apt.conf.d/51unattended-upgrades
      when: inventory_hostname != "minipc"
    - name: Deploy msmtp config to /etc
      ansible.builtin.template:
        src: ../roles/msmtprc/templates/msmtprc.j2
        dest: /etc/msmtprc
        owner: root
        group: root
        mode: '0600'
    - name: Deploy msmtp config to home dir
      ansible.builtin.template:
        src: ../roles/msmtprc/templates/msmtprc.j2
        dest: /home/{{ ansible_user }}/.msmtprc
        owner: "{{ ansible_user }}"
        group: "{{ ansible_user }}"
        mode: '0600'
    - name: Update 50unattended-upgrades
      template:
        src: ../roles/unattended_upgrades/templates/50unattended-upgrades.j2
        dest: /etc/apt/apt.conf.d/50unattended-upgrades
    - name: Copy apcupsd config to SN3
      ansible.builtin.copy:
        src: ../roles/apcupsd/files/sn3/apcupsd.conf
        dest: /etc/apcupsd/
      when: inventory_hostname == "sn3"
    # apcupsd is only installed on sn3, so only restart it there
    - name: Restart apcupsd
      ansible.builtin.service:
        name: apcupsd
        state: restarted
      when: inventory_hostname == "sn3"
    - name: Set fish as the default shell for nick
      ansible.builtin.user:
        name: nick
        shell: /usr/bin/fish
    - name: Copy fish aliases
      ansible.builtin.copy:
        src: "{{ item }}"
        dest: /home/nick/.config/fish/functions/
        owner: nick
        group: nick
        mode: '0644'
      loop: "{{ lookup('fileglob', '../configs/aliases/*.fish', wantlist=True) }}"
    - name: Configure syslog forwarding to Graylog
      ansible.builtin.lineinfile:
        path: /etc/rsyslog.conf
        line: '*.* @10.0.0.1:5140;RSYSLOG_SyslogProtocol23Format'
        state: present
        #create: yes
        backup: yes
    - name: Restart rsyslog
      ansible.builtin.service:
        name: rsyslog
        state: restarted
    - name: Ensure vm.max_map_count is 262144 for Graylog
      sysctl:
        name: vm.max_map_count
        value: 262144
        state: present
        reload: yes
ansible/playbooks/playbook_configs.yml
Templates
The template for msmtprc configures the file in /etc and also inserts the SMTP password from Ansible Vault. The from address is built from the hostname of the respective node
defaults
auth on
tls on
logfile ~/.msmtp.log
account default
host smtp.resend.com
port 587
from {{ ansible_hostname }}@example.com
user resend
password {{ smtp_pass }}
ansible/roles/msmtprc/templates/msmtprc.j2
For apticron email alerts, use the template to set the email address using the hostname
- name: Update apticron.conf
  template:
    src: ../roles/apticron/templates/apticron.conf.j2
    dest: /etc/apticron/apticron.conf
    owner: root
    group: root
    mode: '0644'
- name: Restart apticron service
  ansible.builtin.service:
    name: apticron
    state: restarted
ansible/roles/apticron/tasks/main.yml
CUSTOM_FROM="{{ inventory_hostname | lower }}@example.com"
ansible/roles/apticron/templates/apticron.conf.j2
- hosts: pibuild
  become: true
  roles:
    - apticron
ansible/playbooks/apticron.yml
Run the playbook
ansible-playbook ../ansible/playbooks/apticron.yml
For unattended-upgrades, configure what should be updated and configure the sender address
"origin=Debian,codename=${distro_codename}-updates";
"origin=Debian,codename=${distro_codename}-proposed-updates";
"origin=Debian,codename=${distro_codename},label=Debian";
"origin=Debian,codename=${distro_codename},label=Debian-Security";
"origin=Debian,codename=${distro_codename}-security,label=Debian-Security";
Unattended-Upgrade::Mail "[email protected]";
Unattended-Upgrade::Sender "{{ inventory_hostname | lower }}@example.co.uk";
ansible/roles/unattended_upgrades/templates/50unattended-upgrades.j2
Scripts
Create required directories for scripts to be copied in to and set permissions while only copying the relevant scripts depending on hostname.
- name: Ensure target script directories exist
  file:
    path: "{{ item.path }}"
    state: directory
    owner: "{{ item.owner }}"
    group: "{{ item.group }}"
    mode: "{{ item.mode }}"
  loop:
    - { path: "/root/scripts", owner: "root", group: "root", mode: "0700" }
    - { path: "/home/nick/scripts", owner: "nick", group: "nick", mode: "0755" }
- name: Copy backup script to /root/scripts
  copy:
    src: "{{ item }}"
    dest: "/root/scripts/{{ item | basename }}"
    mode: '0755'
  with_fileglob:
    - "../roles/scripts/files/common/root/*"
- name: Copy scripts to /home/nick/scripts
  copy:
    src: "{{ item }}"
    dest: "/home/nick/scripts/{{ item | basename }}"
    owner: nick
    group: nick
    mode: '0755'
  with_fileglob:
    - "../roles/scripts/files/common/nick/*"
  when: inventory_hostname | lower != "sn1"
- name: Copy container backup scripts to SN1 /root/scripts
  copy:
    src: "{{ item }}"
    dest: "/root/scripts/{{ item | basename }}"
    mode: '0755'
  with_fileglob:
    - "../roles/scripts/files/sn1/root/*"
  when: inventory_hostname | lower == "sn1"
- name: Copy SN1 specific scripts to /home/nick/scripts
  copy:
    src: "{{ item }}"
    dest: "/home/nick/scripts/{{ item | basename }}"
    owner: nick
    group: nick
    mode: '0755'
  with_fileglob:
    - "../roles/scripts/files/sn1/nick/*"
  when: inventory_hostname | lower == "sn1"
ansible/roles/scripts/tasks/main.yml
Draining
Find the local hostname and drain the node
sudo docker node update --availability drain $(hostname)
sudo docker stop cadvisor
ansible/roles/scripts/files/common/nick/dockerstop.sh
Make the node available for tasks
sudo docker node update --availability active $(hostname)
sudo docker compose -f /mnt/containers/_swarm/prod/cadvisor/docker-compose.yml up -d
ansible/roles/scripts/files/common/nick/dockerstart.sh
Some containers I run outside of Swarm, so this script only needs to be copied to that node
sudo docker node update --availability drain $(hostname)
sudo docker stop pi-hole stubby homeassistant govee2mqtt cadvisor qbittorrent
ansible/roles/scripts/files/sn1/nick/dockerstop.sh
sudo docker node update --availability active $(hostname)
sudo docker compose -f /home/nick/containers/pihole/docker-compose.yml up -d
sudo docker compose -f /mnt/containers/_swarm/prod/homeassistant/docker-compose.yml up -d --force-recreate
sudo docker compose -f /mnt/containers/_swarm/prod/govee2mqtt/docker-compose.yml up -d
sudo docker compose -f /mnt/containers/_swarm/prod/qbittorrent/docker-compose.yml up -d
sudo docker compose -f /mnt/containers/_swarm/prod/cadvisor/docker-compose.yml up -d
ansible/roles/scripts/files/sn1/nick/dockerstart.sh
Backups
I only care about backing up the etc and home directories, just in case. This script is only copied to the root user on all nodes
#!/bin/bash
# Backup /etc and the home directories
# Note: the archives are gzip-compressed tarballs despite the .zip extension
HOST=$(hostname -s)
tar -pzcvf /mnt/backups/etc/$(date +%Y%m%d)_${HOST}-etc.zip /etc
tar -pzcvf /mnt/backups/homes/$(date +%Y%m%d)_${HOST}-home-nick.zip /home/nick
tar -pzcvf /mnt/backups/homes/$(date +%Y%m%d)_${HOST}-home-root.zip /root
ansible/roles/scripts/files/common/root/backup_host.sh
Cron
Set the email address and create the cron jobs, both those defined for all hosts and the host-specific ones. In my case, all database backup scripts get copied to sn1 only and the cron jobs for them are created from the variables in ansible/roles/cron/vars/main.yml
- name: Set MAILTO for root
  ansible.builtin.cron:
    user: root
    name: "MAILTO"
    env: yes
    job: "[email protected]"
- name: Set MAILTO for nick
  ansible.builtin.cron:
    user: nick
    name: "MAILTO"
    env: yes
    job: "[email protected]"
- name: Add common cron jobs
  cron:
    name: "{{ item.name }}"
    user: "{{ item.user }}"
    job: "{{ item.job }}"
    minute: "{{ item.minute }}"
    hour: "{{ item.hour }}"
    weekday: "{{ item.weekday }}"
  loop: "{{ common_cron_jobs }}"
- name: Add host-specific cron jobs
  cron:
    name: "{{ item.name }}"
    user: "{{ item.user }}"
    job: "{{ item.job }}"
    minute: "{{ item.minute }}"
    hour: "{{ item.hour }}"
    weekday: "{{ item.weekday }}"
  loop: "{{ host_cron_jobs[inventory_hostname] | default([]) }}"
  when: host_cron_jobs[inventory_hostname] is defined
ansible/roles/cron/tasks/main.yml
The only common cron job I have for all my hosts is the above backup script, so this is created on each node
common_cron_jobs:
  - { name: "Backup /etc and /home", user: root, job: "/root/scripts/backup_host.sh", minute: "0", hour: "3", weekday: "5" }
ansible/roles/cron/defaults/main.yml
I run the rest of my scripts from one node
host_cron_jobs:
  sn1:
    - { name: "Bookstack DB", user: root, job: "/root/scripts/backup_bookstackdb.sh", minute: "0", hour: "4", weekday: "5" }
    - { name: "Joplin DB", user: root, job: "/root/scripts/backup_joplindb.sh", minute: "5", hour: "4", weekday: "5" }
    - { name: "Firefly3 DB", user: root, job: "/root/scripts/backup_fireflydb.sh", minute: "10", hour: "4", weekday: "5" }
    - { name: "Ghost DB", user: root, job: "/root/scripts/backup_ghostdb.sh", minute: "15", hour: "4", weekday: "5" }
    - { name: "Cleanup", user: root, job: "/root/scripts/cleanup.sh", minute: "0", hour: "3", weekday: "7" }
ansible/roles/cron/vars/main.yml
- hosts: pibuild
  become: true
  roles:
    - cron
ansible/playbooks/cron.yml
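To make the schedule fields easier to read, here is a rough sketch of the crontab entry each job turns into. It assumes the cron module's defaults of * for the day and month fields, which the role above doesn't set; cron_line is a hypothetical helper, not part of Ansible.

```shell
#!/bin/sh
# cron_line MINUTE HOUR WEEKDAY JOB
# Hypothetical helper that prints the five schedule fields followed by the job,
# with day-of-month and month left at "*".
cron_line() {
  printf '%s %s * * %s %s\n' "$1" "$2" "$3" "$4"
}

cron_line 0 3 5 /root/scripts/backup_host.sh         # 03:00 every Friday
cron_line 0 4 5 /root/scripts/backup_bookstackdb.sh  # 04:00 every Friday
cron_line 0 3 7 /root/scripts/cleanup.sh             # 03:00 every Sunday (7 and 0 both mean Sunday)
```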
Docker Setup
Download and install docker packages and enable metrics
- name: Raspberry Pi Docker Playbook
  hosts: pibuild
  become: true
  tasks:
    # Key URL matches the Debian repository added in the next task
    - name: Add Docker GPG apt Key
      apt_key:
        url: https://download.docker.com/linux/debian/gpg
        state: present
    - name: Add Docker Repository
      apt_repository:
        repo: deb https://download.docker.com/linux/debian bookworm stable
        state: present
    - name: Update apt repo and cache
      apt:
        update_cache: yes
    - name: Install docker packages
      apt:
        pkg:
          - docker-ce
          - docker-ce-cli
          - containerd.io
          - docker-buildx-plugin
          - docker-compose-plugin
    - name: Ensure group "docker" exists
      ansible.builtin.group:
        name: docker
        state: present
    - name: Ensure Docker daemon.json for metrics
      copy:
        dest: /etc/docker/daemon.json
        content: |
          {
            "metrics-addr": "0.0.0.0:9323"
          }
        owner: root
        group: root
        mode: '0644'
    - name: Restart docker service
      ansible.builtin.service:
        name: docker
        state: restarted
ansible/playbooks/playbook_docker.yml
Container Users
Create users for the containers being run with different users
- name: Ensure container groups exist
  group:
    name: "{{ item.name }}"
    gid: "{{ item.gid }}"
    system: yes
  loop: "{{ container_users }}"
- name: Ensure container service accounts exist
  user:
    name: "{{ item.name }}"
    uid: "{{ item.uid }}"
    group: "{{ item.name }}"
    shell: /usr/sbin/nologin
    system: yes
    create_home: no
  loop: "{{ container_users }}"
ansible/roles/container_users/tasks/main.yml
container_users:
  - { name: "traefik", uid: 201, gid: 201 }
  - { name: "authelia", uid: 202, gid: 202 }
ansible/group_vars/all/main.yml
- hosts: pibuild
  become: true
  roles:
    - container_users
ansible/playbooks/container_users.yml
Prometheus
I do need to revisit this one. Initially I had issues with downloading the file directly, so instead I just place it in the installs directory.
Installs Prometheus node exporter and configures it as a service
- name: Prometheus Node Exporter
  hosts: pibuild
  become: true
  tasks:
    - name: Download and extract prometheus node exporter
      ansible.builtin.unarchive:
        src: "../installs/node_exporter-{{ node_exporter_version }}.linux-arm64.tar.gz"
        dest: "/tmp/"
      vars:
        node_exporter_version: 1.9.1
    - name: Copy Extracted File to /usr/local/bin
      copy:
        src: "/tmp/node_exporter-{{ node_exporter_version }}.linux-arm64/node_exporter"
        dest: "/usr/local/bin/"
        mode: preserve
        remote_src: true
      vars:
        node_exporter_version: 1.9.1
    - name: Create Node Exporter Service Unit File
      copy:
        content: |
          [Unit]
          Description=Prometheus Node Exporter
          After=network.target
          [Service]
          ExecStart=/usr/local/bin/node_exporter
          Restart=always
          [Install]
          WantedBy=multi-user.target
        dest: /etc/systemd/system/node_exporter.service
    - name: Enable and Start Node Exporter Service
      systemd:
        name: node_exporter
        enabled: yes
        state: started
ansible/playbooks/playbook_nodeexporter.yml
Swap
Allocate swap and add to fstab
- name: Playbook to configure swap
  hosts: pibuild
  become: true
  tasks:
    - name: set swap_file variable
      set_fact:
        swap_file: /swapfile
    - name: check if swap file exists
      stat:
        path: "{{ swap_file }}"
      register: swap_file_check
    - name: create swap file
      command: fallocate -l 4G {{ swap_file }}
      when: not swap_file_check.stat.exists
    - name: set permissions on swap file
      file:
        path: "{{ swap_file }}"
        mode: '0600'
    - name: Mark as swap
      command: mkswap {{ swap_file }}
      when: not swap_file_check.stat.exists
    # Only on first creation: swapon fails if the file is already active as swap
    - name: Enable swap
      command: swapon {{ swap_file }}
      when: not swap_file_check.stat.exists
    - name: Add to fstab
      lineinfile:
        dest: /etc/fstab
        regexp: "{{ swap_file }}"
        line: "{{ swap_file }} none swap sw 0 0"
ansible/playbooks/playbook_swap.yml
Networking
- name: Raspberry Pi Networking Playbook
  hosts: pibuild
  become: true
  tasks:
    - name: Update Ethernet connection profile
      become: true
      nmcli:
        conn_name: eth0
        ifname: eth0
        type: ethernet
        ip4: 10.0.0.3/24
        gw4: 10.0.0.254
        dns4:
          - 10.0.0.101
        dns4_search: local.lan
        method6: disabled
        state: present
    - name: Reboot
      ansible.builtin.reboot:
ansible/playbooks/playbook_networking.yml
NFS
- name: Setup NFS mounts
  hosts: pibuild
  become: true
  tasks:
    - name: Create and mount NFS Containers share
      ansible.posix.mount:
        src: 10.0.0.10:/volume1/containers
        path: /mnt/containers
        opts: defaults
        state: mounted
        fstype: nfs
ansible/playbooks/playbook_nfs.yml
SSHFS
I also have an SSHFS mountpoint... it will appear not to work until you've manually connected once and accepted the host key
sudo sshfs -o allow_other,default_permissions,IdentityFile=/home/nick/.ssh/id_ed25519 nick@synology:/ /mnt/nas
- name: Setup sshfs mountpoint
  hosts: pibuild
  become: true
  tasks:
    - name: Create nas dir for sshfs mount
      ansible.builtin.file:
        path: /mnt/nas
        state: directory
        mode: '770'
        owner: nick
        group: nick
    - name: Add sshfs mountpoint to fstab
      lineinfile:
        dest: /etc/fstab
        line: "sshfs#nick@synology:/ /mnt/nas/ fuse delay_connect,defaults,idmap=user,IdentityFile=/home/nick/.ssh/id_ed25519,port=22,uid=1000,gid=1000,allow_other 0 0"
    - name: Update fuse.conf
      lineinfile:
        dest: /etc/fuse.conf
        line: "allow_other"
    # Uncomment user_allow_other if it is commented out, otherwise add it
    - name: Update fuse.conf part 2
      ansible.builtin.lineinfile:
        path: /etc/fuse.conf
        regexp: '^#?\s*user_allow_other\s*$'
        line: user_allow_other
    - name: Mount /mnt/nas
      ansible.builtin.command: mount /mnt/nas
Keepalived
Update the keepalived.conf file with the correct values based on hostname and copy over the scripts for ensuring the VIP comes up on whichever node is running the container
- name: Copy keepalived.conf
  template:
    src: ../roles/keepalived/templates/keepalived.conf.j2
    dest: /etc/keepalived/keepalived.conf
    owner: root
    group: root
    mode: '0644'
- name: Copy keepalived scripts
  copy:
    src: "{{ item }}"
    dest: /etc/keepalived/
    owner: root
    group: root
    mode: '0700'
  loop: "{{ lookup('fileglob', '../roles/keepalived/files/scripts/*.sh', wantlist=True) }}"
- name: Restart keepalived
  ansible.builtin.service:
    name: keepalived
    state: restarted
ansible/roles/keepalived/tasks/main.yml
#global_defs {
# enable_script_security
#}
vrrp_script chk_pihole {
script "/etc/keepalived/check_pihole.sh"
interval 5
timeout 3
fall 2
rise 2
user root
}
vrrp_script chk_snmpexporter {
script "/etc/keepalived/check_snmpexporter.sh"
interval 5
timeout 3
fall 2
rise 2
user root
}
vrrp_script chk_mqtt {
script "/etc/keepalived/check_mqtt.sh"
interval 5
timeout 3
fall 2
rise 2
user root
}
vrrp_script chk_traefik {
script "/etc/keepalived/check_traefik.sh"
interval 5
timeout 3
fall 2
rise 2
user root
}
vrrp_instance pihole {
state {{ keepalived_state }}
interface eth0
virtual_router_id 51
priority {{ keepalived_priority }}
advert_int 1
authentication {
auth_type PASS
auth_pass MySecret
}
virtual_ipaddress {
10.0.0.101
}
track_script {
chk_pihole
}
}
vrrp_instance snmpexporter {
state {{ keepalived_state }}
interface eth0
virtual_router_id 53
priority {{ keepalived_priority }}
advert_int 1
authentication {
auth_type PASS
auth_pass MySecret
}
virtual_ipaddress {
10.0.0.103
}
track_script {
chk_snmpexporter
}
}
vrrp_instance mqtt {
state {{ keepalived_state }}
interface eth0
virtual_router_id 54
priority {{ keepalived_priority }}
advert_int 1
authentication {
auth_type PASS
auth_pass MySecret
}
virtual_ipaddress {
10.0.0.104
}
track_script {
chk_mqtt
}
}
vrrp_instance traefik {
state {{ keepalived_state }}
interface eth0
virtual_router_id 55
priority {{ keepalived_priority }}
advert_int 1
authentication {
auth_type PASS
auth_pass MySecret
}
virtual_ipaddress {
10.0.0.105
}
track_script {
chk_traefik
}
}
ansible/roles/keepalived/templates/keepalived.conf.j2
If this script is successful then traefik is running on this node and should own the VIP
#!/bin/bash
# Check if traefik container is running
if docker ps --filter "name=traefik" --filter "status=running" | grep -q traefik_traefik; then
exit 0
else
exit 1
fi
ansible/roles/keepalived/files/scripts/check_traefik.sh
Set the state and priority for the nodes
keepalived_state: MASTER
keepalived_priority: 100
ansible/host_vars/sn1.yml
keepalived_state: BACKUP
keepalived_priority: 90
ansible/host_vars/sn2.yml
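How state, priority and the check scripts interact can be pictured with a much simplified model of the election: a node whose tracked script fails drops out, and the highest priority among the remaining nodes owns the VIP. The sketch below is only that model, not keepalived itself. elect_master is a hypothetical helper, the sn3 priority of 80 is an assumed value, and real VRRP details such as preemption and tie-breaks are ignored.

```shell
#!/bin/sh
# elect_master NAME:PRIORITY:STATE...
# Hypothetical helper: prints the node that would hold the VIP, i.e. the
# highest priority node whose check script is passing (STATE is "up").
elect_master() {
  best_name=""
  best_prio=-1
  for node in "$@"; do
    name=${node%%:*}            # text before the first colon
    rest=${node#*:}
    prio=${rest%%:*}            # text between the colons
    state=${rest#*:}            # text after the second colon
    if [ "$state" = "up" ] && [ "$prio" -gt "$best_prio" ]; then
      best_name=$name
      best_prio=$prio
    fi
  done
  printf '%s\n' "$best_name"
}

# sn3's priority of 80 is an assumption for illustration only
elect_master sn1:100:up sn2:90:up sn3:80:up     # prints sn1
elect_master sn1:100:down sn2:90:up sn3:80:up   # prints sn2
```

In the second call traefik is no longer running on sn1, so its check fails and the VIP moves to sn2 even though sn1 is configured as MASTER.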
Sudo
Add my user to sudo and set a timeout of 1 hour
- name: Raspberry Pi Updates Playbook
  hosts: pibuild
  become: true
  tasks:
    - name: Ensure 'nick' user exists with password and sudo group
      user:
        name: nick
        groups: sudo
        append: yes
    - name: Configure sudo access for 'nick' with password required + 1hr timeout
      copy:
        dest: /etc/sudoers.d/nick
        content: |
          nick ALL=(ALL) ALL
          Defaults:nick timestamp_timeout=60
        owner: root
        group: root
        mode: '0440'
        validate: 'visudo -cf %s'
    - name: Remove any existing NOPASSWD rule for 'nick'
      file:
        path: /etc/sudoers.d/nick_nopasswd
        state: absent
ansible/playbooks/sudo.yml
After running this playbook, Ansible will ask for the become password going forward, so you need to pass --ask-become-pass
ansible-playbook /mnt/nas/ansible/playbooks/playbook_boot.yml --ask-become-pass