Creating my personal cloud with HashiCorp

I maintain a VPS that I use to self-host a variety of services, such as Joplin Server and Seafile, and to host websites that I develop. For the past few years, that VPS has been managed with Docker Compose, but recently I decided to build something a bit bigger. My goals with the project:

I want to be able to deploy closed-source projects in a completely self-hosted manner. This means self-hosting a Docker registry.
I want to run 1 node, but with the option to scale to multiple without changing my infrastructure. This means I'm going to be using a cluster manager scaled to one node.
I want a highly secure way to store secrets on the server (for example, TOTP secrets with an audit trail). I decided early on that Vault was the right way to handle this.
I want the setup process to be fully automatic. Basically, I want Infrastructure-as-Code, so if I have to recreate the server for any reason, I can do so consistently.

I decided to use Nomad to implement these features, and the fruits of my labor are open sourced here: CGamesPlay/infra. If you are looking for a home-lab setup for this stack, I recommend checking it out. The software stack is certainly much more complicated than my old Docker Compose setup, but the HashiCorp stack is much simpler than the Kubernetes stack, and I'm very happy with how easy everything turned out to be once I had spent the time to properly acquaint myself with the docs.

How does it work?

1. Configure services locally

I started by configuring the services locally, so that I could easily make adjustments and learn how everything works. Specifically, I run Vault, Consul, and Nomad on my workstation and configure them as I like. Making changes locally is many times faster than making changes to remote cloud infrastructure, and this allowed me to learn about the HashiCorp stack quickly. Once I had the services working locally the way I wanted, I wrote a script that could configure them that way again from scratch. Again, developing this locally was many times faster than developing an equivalent script on remote infrastructure. The final product of this is in bootstrap/prepare.sh. This script sets up these 3 services (and also does some preparation for WireGuard), and stores all of the configuration in a local directory, ready to be transferred to a VM.
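For readers who want to try the same kind of local experimentation, here is a minimal sketch. It assumes the vault, consul, and nomad binaries are on your PATH and uses each tool's built-in dev mode, which keeps state in memory. The function names and log layout are hypothetical, and this is not what bootstrap/prepare.sh does (that script produces a persistent configuration).

```shell
# start_dev_stack LOG_DIR: launch throwaway dev-mode agents in the background.
# Hypothetical helper for local experiments only; dev mode is not meant for
# production and loses all state when the agents stop.
start_dev_stack() {
  local log_dir="$1"
  mkdir -p "$log_dir"

  # Each agent writes to its own log file so startup errors are easy to find.
  vault server -dev >"$log_dir/vault.log"  2>&1 &
  consul agent -dev >"$log_dir/consul.log" 2>&1 &
  nomad agent -dev  >"$log_dir/nomad.log"  2>&1 &
}

# stop_dev_stack: terminate every background job started from this shell.
stop_dev_stack() {
  local pids
  pids="$(jobs -p)"
  # $pids is intentionally unquoted so each PID becomes its own argument.
  [ -z "$pids" ] || kill $pids 2>/dev/null || true
}
```

Calling start_dev_stack ./logs brings all three up; stop_dev_stack tears them down again, which makes it cheap to start over after a bad configuration change.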

2. Create a new VM with Terraform

To actually create the infrastructure, I use Terraform. I went with Terraform because I needed cross-cloud capability (AWS and Hetzner Cloud in my case). I'm also a big fan of Pulumi, but the infrastructure requirements I have are so simple that Terraform made more sense. This step generates a blank VM with software dependencies installed but not configured.

3. Generate and run the installation script

In previous cloud provisioning projects I have written scripts that would perform remote installations of software. In this project, I opted for a different approach which made things much easier: generating an installation script and allowing the user to run it themselves. Basically, I have a bash script that outputs a second bash script that configures the software. This massively sped up development for me because I could inspect the generated script without running it, and then manually copy and paste commands into an SSH session, fixing any errors incrementally. Of course, once I was sure everything was working, I simply piped the output of the generator script into an SSH session, and the process became completely automatic. The generator script is in bootstrap/generate_installer.sh.
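The "script that writes a script" pattern can be sketched in a few lines. This is an illustration under assumptions, not the contents of bootstrap/generate_installer.sh: the function name, its arguments, and the two installation steps are all hypothetical.

```shell
# generate_installer HOSTNAME CONFIG_DIR: print an installer script to stdout.
# Hypothetical example; the real generator configures far more than this.
generate_installer() {
  local hostname config_dir
  # printf %q shell-quotes the values so they are safe to embed in a script.
  hostname="$(printf '%q' "$1")"
  config_dir="$(printf '%q' "$2")"

  # Unquoted heredoc delimiter: ${hostname} and ${config_dir} expand now, on
  # the workstation. Escaped \$ sequences are emitted literally and only
  # expand later, when the generated script runs on the server.
  cat <<EOF
#!/usr/bin/env bash
set -euo pipefail
hostnamectl set-hostname ${hostname}
install -d -m 0755 ${config_dir}
echo "Configured \$(hostname) at \$(date -u +%FT%TZ)"
EOF
}
```

Running generate_installer node1 /etc/nomad.d | less shows exactly what would run, and piping the same command into ssh root@server bash -s (with a placeholder host) executes it remotely.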

4. Upload jobs to Nomad

After those steps are completed, the server is online and ready to accept jobs. At this point I use plain Nomad jobs to set up Traefik, the private Docker registry, and whatever else I want to run. I have a script (scripts/levant.sh) that uses Levant and iterates over all of my configured jobs, applying any changes automatically.
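A loop of this kind might look like the sketch below. It assumes job templates live in one directory with a .nomad extension, which may not match the repository's layout. The deploy_all name and the LEVANT override (there so the loop can be dry-run with echo) are hypothetical, and this is not the contents of scripts/levant.sh.

```shell
# deploy_all JOBS_DIR [LEVANT_FLAGS...]: run `levant deploy` on every
# *.nomad file in JOBS_DIR. Hypothetical helper; extra arguments are passed
# through to Levant unchanged.
deploy_all() {
  local jobs_dir="$1"
  shift
  local levant="${LEVANT:-levant}"
  local job failed=0

  for job in "$jobs_dir"/*.nomad; do
    [ -e "$job" ] || continue   # the glob matched nothing
    echo "==> $job"
    # Keep going after a failure so one bad job does not block the rest.
    if ! "$levant" deploy "$@" "$job"; then
      echo "deploy failed: $job" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

LEVANT=echo deploy_all ./jobs prints the commands without contacting Nomad. Check levant deploy -h for flags that control how unchanged jobs are reported, since the exit status for a no-op deployment affects what this loop counts as a failure.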

How does it feel, 6 months in?

To me, the best test of a personal project is whether it still feels usable after 6 months of use. By this time, most of the context that I learned while developing the project has faded away and I have to rely on the documentation that I left myself. In this case, I think this project has been a great success. It's easy for me to set up new services using the existing Nomad jobspecs as guidance, and I feel much more confident doing so than I did with docker-compose files. One of my favorite features of the new system is nomad job plan. This allows me to see what the system will do with my jobspec without actually doing it, and this gives me a lot of confidence that the change I am making right now doesn't accidentally undo any manual changes I made before but forgot to properly add to the jobspec. Also, I feel much more confident editing HCL files than YAML files. In YAML, I am constantly worrying that a string will be interpreted as a different data type, or that an array-of-objects will be interpreted as an array-of-arrays, etc.
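The dry-run check is also easy to script, because nomad job plan reports its result through the exit code: 0 when no allocations would be created or destroyed, 1 when some would, and 255 on error. The wrapper below is a minimal sketch. The plan_job name and the NOMAD override (used only to make the wrapper testable) are hypothetical.

```shell
# plan_job JOBFILE: run `nomad job plan` and summarize its exit code.
# Hypothetical helper; prints a one-line verdict after Nomad's own output.
plan_job() {
  local jobfile="$1"
  local nomad="${NOMAD:-nomad}"
  local status=0

  # Capture the exit code instead of letting a non-zero status abort callers.
  "$nomad" job plan "$jobfile" || status=$?

  case "$status" in
    0) echo "no-op: $jobfile would not create or destroy allocations" ;;
    1) echo "changes: $jobfile would create or destroy allocations" ;;
    *) echo "error: could not plan $jobfile (exit $status)" >&2
       return 1 ;;
  esac
}
```

A call such as plan_job jobs/traefik.nomad (the path is a placeholder) then answers the "will this touch anything?" question in one line.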

What are my next steps?

I have a few ideas for features I'd like to add to this project:

Scaling to multiple machines is the obvious one. I'd like to be able to automatically scale up different machine configurations based on the jobspecs I'm running. For example, I want to be able to deploy a batch job that requires a GPU instance, and have the instance created and destroyed automatically.
Scaling services to zero is another project I'd like to try. Basically, I want to automatically shut down certain jobs that aren't receiving web traffic, and automatically start them up when they get a request, similar to how Heroku works. I haven't seen any self-hosted implementations of this, but I have some ideas about how it can be accomplished.

But for now I'm happy using the system that I've built. I hope the open-sourced code is useful to others. If you find yourself working on something like what I've done here, feel free to reach out to me and let me know!