Ephemeral Servers on Hetzner

Dec 9, 2024

I use servers on Hetzner Cloud to do most of my client development. A good server type for this use case, CPX51, costs €54.90/mo (at time of writing). While this is pretty affordable, the server also sits idle most of the time, wasting money. This applies even if the server is powered off; it has to be deleted to stop the billing. I developed a solution that involves booting from a detachable volume, which allowed me to reduce costs to €20.68/mo, a 62% reduction! ¹
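The arithmetic behind that figure can be sketched in shell. The prices are the ones from the footnote (€6.60/mo for a 150 GB volume, €0.088/hr for a CPX51), the 160 hours of use per month is the footnote's assumption, and the function names are made up for illustration:

```shell
#!/bin/sh
# Monthly cost of an ephemeral setup: a flat volume fee plus hourly
# server billing for the hours the server actually exists.
monthly_cost() {
    # $1 = volume EUR/mo, $2 = server EUR/hr, $3 = hours used per month
    awk -v vol="$1" -v rate="$2" -v hours="$3" \
        'BEGIN { printf "%.2f\n", vol + rate * hours }'
}

# Percentage saved relative to an always-on server.
savings_percent() {
    # $1 = always-on EUR/mo, $2 = ephemeral EUR/mo
    awk -v full="$1" -v eph="$2" \
        'BEGIN { printf "%.0f\n", (full - eph) / full * 100 }'
}

ephemeral=$(monthly_cost 6.60 0.088 160)
echo "ephemeral: EUR $ephemeral/mo"
echo "savings:   $(savings_percent 54.90 "$ephemeral")%"
```

Plugging in your own hours shows where the break-even lies: the more hours the server exists, the smaller the saving.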

My goals for this project are:

A server that boots from a Hetzner Volume and can be deleted when not in use
A daemon, running on the server, that automatically deletes the server when it detects that the server is idle

Booting from a Cloud Volume

Aside: Ideas I tried that didn't work

I went through several iterations of this before arriving at the current solution. Here are some of the ideas that I tried, and why they didn’t work for me.

Configuring the bootloader to just boot from the detachable volume. For some reason, GRUB is unable to detect the detachable volume. I was unable to work around this, and Hetzner support declined to offer any assistance.
Creating a minimal Linux image that can be used to kexec into the detachable volume’s kernel. This actually worked, but I found that creating a server from a custom image is dramatically slower than using a stock image (a few seconds for a stock image versus multiple minutes for a custom image).
Using rescue mode and a script to kexec into the detachable volume’s kernel. This also worked, but it requires a complex script to boot the server (scripting SSH commands) and it was quite a bit slower than a stock image boot.

The first challenge is just booting from a cloud volume. I ended up creating a cloud-init script that does the kexec. So, the boot process looks like this:

On my laptop, I run hcloud server create, passing a script in the user data that performs the kexec.
The stock image boots and runs the user data script, which performs a kexec.
The volume boots, and reformats the server’s fixed disk for use as ephemeral storage.
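The kexec in the second step amounts to loading the volume's kernel and jumping into it. The real /boot/hbv-kexec script is not reproduced in this post, so the following is only a sketch of the general technique. It assumes the kexec-tools CLI, Ubuntu's /boot/vmlinuz and /boot/initrd.img symlinks, and a hypothetical kernel command line; the run and kexec_into helpers are invented for illustration.

```shell
#!/bin/sh
# Run a command, or just print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Load a kernel and initrd, then jump into it without going back
# through the firmware and bootloader.
kexec_into() {
    # $1 = kernel image, $2 = initrd, $3 = kernel command line
    run kexec --load "$1" --initrd="$2" --append="$3" &&
        run kexec --exec
}

# Example (assumed paths and command line), printed instead of executed:
#   DRY_RUN=1 kexec_into /boot/vmlinuz /boot/initrd.img "root=LABEL=root ro"
```

Note that kexec --exec jumps immediately, without shutting down running services, which is acceptable this early in the stock image's boot.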

In order to build this, we need to create a volume to boot from. This process is actually quite simple: we just clone a standard Hetzner image onto a detachable volume, then make a few modifications.

Using hetzner-bootable-volume

If you are using the script from my dotfiles, you can just do this:

hcloud volume create --size=50 --location=nbg1 --name=my-server
./hetzner-bootable-volume prepare-volume my-server
./hetzner-bootable-volume boot my-server

If you would rather prepare the volume by hand, start by creating it along with a temporary server booted into rescue mode:

# Create a detachable volume to use
hcloud volume create --size 50 --location nbg1 --name my-server
# Create a powered-off temporary server to use for creation
hcloud server create --location nbg1 --volume my-server --name my-server \
  --image ubuntu-22.04 --type cpx11 --ssh-key my-ssh-key \
  --start-after-create false
# Boot the server into rescue mode
hcloud server enable-rescue --ssh-key my-ssh-key my-server
hcloud server poweron my-server
# Configure the volume, see below
ssh -l root $(hcloud server ip my-server)
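
Right after poweron, the rescue system may not accept SSH connections yet. A small retry helper can paper over that gap; retry is a hypothetical convenience function, not part of hetzner-bootable-volume:

```shell
#!/bin/sh
# Retry a command until it succeeds or the attempts run out.
retry() {
    # $1 = max attempts, $2 = seconds between attempts, rest = command
    local attempts=$1 delay=$2 n=1
    shift 2
    until "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# Example: wait up to about a minute for the rescue system's SSH server.
#   retry 30 2 ssh -o ConnectTimeout=5 -l root "$(hcloud server ip my-server)" true
```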

Once you’re in rescue mode, there’s a lengthy procedure to follow. You can see the full script by running hetzner-bootable-volume show-prepare-script VOLUME_ID, and I will outline the steps here. The line numbers below refer to the script’s source.

(Lines 253-268) Mount the stock image’s root partition and chroot into it. Note that the chroot command actually applies to the “current shell”, instead of starting a subprocess. This wouldn’t work in a typical shell script, but works here because we are running this script directly from stdin. We install kexec-tools, then exit the chroot and unmount the filesystem.
(Lines 270-276) Set up the partitions on the detachable volume. The entire drive contains a single partition, which is the root. We use cp to clone the filesystem from the stock image to the detachable volume, then randomize its permanent ID and resize it to fill the new disk size.
(Lines 292-301) Create a script /boot/hbv-kexec that does the actual kexec. We store this script on the bootable volume in case it needs to be customized in the future.
(Lines 303-323) Create a script /usr/local/sbin/hbv-ephemeral-drive that reformats the fixed drive. We run this script on every boot. The fixed drive will contain a swap partition, and a usable “ephemeral” partition.
(Lines 325-335) Create the hbv-ephemeral-drive systemd unit, which runs the script we just added.
(Lines 337-343) Prepare the new filesystem for initial boot, by telling cloud-init that it has never run before, and enabling the systemd unit we created.
(Lines 345-367) Perform the initial boot into the new filesystem.
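
To make the hbv-ephemeral-drive step more concrete, here is a rough sketch of how the fixed drive's layout (a swap partition followed by an ephemeral partition) could be described for sfdisk. This is not the script from the dotfiles: the ephemeral_layout function, the partition names, the 4GiB swap size, and the /dev/sda device are all assumptions.

```shell
#!/bin/sh
# Print an sfdisk script describing a GPT disk with a swap partition of
# the given size followed by a Linux partition that fills the rest.
ephemeral_layout() {
    # $1 = swap size in a form sfdisk accepts, e.g. 4GiB
    printf 'label: gpt\n'
    printf 'size=%s, type=S, name=swap\n' "$1"
    printf 'type=L, name=ephemeral\n'
}

# Example (DESTROYS everything on the target disk; device name assumed):
#   ephemeral_layout 4GiB | sfdisk /dev/sda
#   mkswap /dev/sda1 && swapon /dev/sda1
#   mkfs.ext4 -F /dev/sda2
```

Generating the layout as text keeps the destructive part to a single sfdisk call, which makes the script easy to inspect before it runs on every boot.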

After the initial boot has completed, we perform one last task (lines 190-243): disabling most of cloud-init. Most of what cloud-init does (like running commands on “first” boot) doesn’t make sense when the volume is used on multiple servers.

Once that whole process is completed, the script to create a server from the volume is pretty simple:

hcloud server create --location nbg1 --volume my-server --name my-server \
  --image ubuntu-22.04 --type cpx51 --ssh-key my-ssh-key \
  --user-data-from-file user-data.yml

The user data necessary to do the kexec is simple (replace ${volume_id} with the numeric ID of your volume):

#cloud-config
bootcmd:
 - |
  set -uex
  grub-editenv /boot/grub/grubenv unset recordfail
  mount /dev/disk/by-id/scsi-0HC_Volume_${volume_id}-part1 /mnt -o ro
  for i in dev sys tmp run proc; do mount --rbind /$i /mnt/$i; done
  chroot /mnt /boot/hbv-kexec

The grub-editenv line is necessary to fix an issue with rebooting the detachable volume. Without it, GRUB thinks that the last boot failed, and so it will drop into an interactive menu for you to repair the system. The other lines simply set up a chroot, then execute the hbv-kexec script we put on the detachable volume.

Remember that we disabled user data on the detachable volume, so this doesn’t conflict or cause boot loops.
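
The ${volume_id} placeholder suggests that the user data is generated from a template. One way to do that, sketched here with a hypothetical render_user_data function, is a shell here-document. Inside the here-document, $i is escaped so that it reaches the server unexpanded, while ${volume_id} is filled in locally.

```shell
#!/bin/sh
# Print cloud-config user data that kexecs into the given volume.
render_user_data() {
    # $1 = numeric Hetzner volume ID
    local volume_id=$1
    cat <<EOF
#cloud-config
bootcmd:
 - |
  set -uex
  grub-editenv /boot/grub/grubenv unset recordfail
  mount /dev/disk/by-id/scsi-0HC_Volume_${volume_id}-part1 /mnt -o ro
  for i in dev sys tmp run proc; do mount --rbind /\$i /mnt/\$i; done
  chroot /mnt /boot/hbv-kexec
EOF
}

# Example (assumes hcloud's "-o format" output option for describe):
#   render_user_data "$(hcloud volume describe my-server -o format='{{.ID}}')" > user-data.yml
```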

When you are done using your server, you can use hcloud server delete to delete it.

Automatically shutting down idle servers

The next goal is to build a daemon that will detect when the system is not in use and shut it down. This is done in two parts:

A script that monitors the system and runs poweroff when it is idle for long enough.
A script that calls hcloud server delete on itself when the system is shutting down.

Before we can do this, we need to define what “idle” means. There isn’t a mouse to jiggle or a screen saver like on a laptop, but instead we can list out a few signals and monitor them:

Logged in users over SSH. This can be monitored by looking at the access time (atime) on all active PTS devices (/dev/pts/*).
Active GUI sessions (I wrote support for this, but never use graphical sessions). This is done by running xprintidle for each user that has an X server running.
A special marker file to signal that the server should be kept on. This allows you to run hbv-auto-shutdown caffeinate command and leave for the weekend; the server will just shut down after the command finishes.
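
As an illustration of the first signal, the idle time of SSH sessions can be estimated from the access times of the PTS devices. This is a simplified sketch, not the hbv-auto-shutdown implementation; the function names are invented, and it assumes GNU stat as found on Ubuntu.

```shell
#!/bin/sh
# Print the seconds since the most recent access to any terminal under
# the given directory (default /dev/pts). Prints nothing if none exist.
pts_idle_seconds() {
    local dir=${1:-/dev/pts} now newest= atime tty
    now=$(date +%s)
    for tty in "$dir"/[0-9]*; do
        [ -e "$tty" ] || continue
        atime=$(stat -c %X "$tty") || continue
        if [ -z "$newest" ] || [ "$atime" -gt "$newest" ]; then
            newest=$atime
        fi
    done
    if [ -n "$newest" ]; then
        echo $((now - newest))
    fi
}

# Succeed when no terminal exists or all have been idle long enough.
is_idle() {
    # $1 = threshold in seconds, $2 = optional directory to scan
    local idle
    idle=$(pts_idle_seconds "${2:-/dev/pts}")
    [ -z "$idle" ] || [ "$idle" -ge "$1" ]
}

# Example: is_idle 3600 && echo "no terminal activity for an hour"
```

Taking the most recent access time across all terminals means a single active session is enough to keep the server alive.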

The result of this is hbv-auto-shutdown. Install that script to /usr/local/bin/, then add a systemd unit like this:

[Unit]
Description=Automatic Shutdown
Documentation=https://gitlab.com/CGamesPlay/dotfiles
Requires=network-online.target
After=network-online.target

[Service]
ExecStart=/usr/local/bin/hbv-auto-shutdown daemon
KillMode=process
KillSignal=SIGTERM
Restart=on-failure

[Install]
WantedBy=multi-user.target

Activating this daemon will automatically shut down the system when it goes idle, but this will leave the server in Hetzner and it will still be billed normally. To actually delete the server, we use another simple script:

#!/bin/sh
# Immediately destroy this server
set -e
if [ "$(systemctl show --property=Job poweroff.target)" = "Job=" ]; then
    echo "Aborting self destruct because system is not powering off" >&2
    exit 0
fi

# Just allow the system to cool down
sleep 5
sync
hcloud server delete "$(cat /var/run/cloud-init/.instance-id)"

This script first checks that we are actually powering off the server (as a safety measure), and then simply calls hcloud server delete. The systemd unit that we use merits some explanation:

# When this service *stops*, it tells hcloud to delete this machine.
[Unit]
Description=self destruct on poweroff

# We want to stop this service pretty late in the shutdown process,
# but before the network goes down. By setting
# Before=network.target, our self destruct will only happen after
# everything which is After=network.target.
Before=network.target user.slice machine.slice
# But the self destruct requires the network to actually be active.
After=systemd-networkd.service nss-lookup.target

[Service]
EnvironmentFile=-/etc/hbv-self-destruct.env
ExecStop=/usr/local/sbin/hbv-self-destruct
Type=oneshot
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

This is a bit backward. We actually bring this unit up early in the boot, immediately after the network is available (the After= line). Since there is no ExecStart but RemainAfterExit is set, the service is marked started without doing anything else. Shutdown runs in the reverse order, so the service comes down after network.target, user.slice, and machine.slice have stopped (the Before= line), and at that point it invokes hbv-self-destruct through ExecStop.

The last thing to do is provide your server with an API key to use for hcloud:

echo 'HCLOUD_TOKEN=YOUR_HCLOUD_TOKEN' > /etc/hbv-self-destruct.env
chmod 600 /etc/hbv-self-destruct.env
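
If you prefer the file never to be world-readable, even briefly, the same result can be had by setting the umask before writing. The write_token_env function below is a hypothetical helper, shown only as a sketch:

```shell
#!/bin/sh
# Write an environment file containing the hcloud token, readable by
# its owner only.
write_token_env() {
    # $1 = destination path, $2 = API token
    (
        # The umask only affects this subshell; new files get mode 600.
        umask 077
        printf 'HCLOUD_TOKEN=%s\n' "$2" > "$1"
    ) && chmod 600 "$1"  # also tighten a file that already existed
}

# Example:
#   write_token_env /etc/hbv-self-destruct.env YOUR_HCLOUD_TOKEN
```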

Activate these with systemctl enable --now hbv-auto-shutdown and systemctl enable --now hbv-self-destruct.

Improving performance with RAID1

Benchmarks suggest that Hetzner volumes are able to sustain about 300 MiB/s of throughput, while local disks can achieve 700 MiB/s or more. By leveraging RAID1, we can use the server’s fixed drive as a RAID mirror, which improves the read throughput to the same level as a local disk (the write throughput will remain limited).

This is actually very simple to do: on every boot, we reformat the local drive, then configure a RAID array with the two drives, where the detachable volume is “write mostly”, leaving the local drive to be used as a fast cache for reads. The key commands are:

mdadm --build /dev/md0 --level=1 --force --raid-devices=1 \
  --write-mostly /dev/disk/by-label/root
mdadm --grow /dev/md0 --raid-devices=2 --add /dev/disk/by-partlabel/mirror

The second command treats the added drive as a blank slate, so Linux will immediately begin mirroring the main drive onto it, and will automatically use it for faster reads once that finishes. If you want, you can check the progress:

$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sda3[1] sdb1[0](W)
      157285359 blocks super non-persistent [2/1] [U_]
      [===>.................]  recovery = 17.7% (27984768/157285359) finish=10.5min speed=203862K/sec

unused devices: <none>
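
If you want that number in a script, the recovery percentage can be pulled out of the mdstat text. This is a small sketch with an invented function name; it prints one line per array that is currently recovering.

```shell
#!/bin/sh
# Read /proc/mdstat-style text on stdin and print the recovery
# percentage, or nothing when no recovery is in progress.
md_recovery_percent() {
    sed -n 's/.*recovery = *\([0-9.]*\)%.*/\1/p'
}

# Example:
#   md_recovery_percent < /proc/mdstat
```

To simply block until the mirror is fully synced, mdadm --wait /dev/md0 should do the job.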

Using this RAID1 setup on the root partition is a little bit more complicated; we need to modify the initramfs to run these key commands. You can use the enable-raid1 script to do the whole process.

Conclusion

I’ve been using this iteration of hetzner-bootable-volume for 2 months now, and earlier versions of this concept for over a year. It allows me to have a completely separate development environment for each of my clients, which is especially useful when they have complex VPN requirements. By doing my development on a server in a datacenter, my Docker pulls and other bandwidth-heavy activities are always fast, even when I am working over a mobile hotspot.

¹ €6.60 (150 GB cloud volume) + €0.088/hr (CPX51) × 160 hr/mo = €20.68/mo. Current prices can be found at Hetzner Pricing.