As a small tech company with 20–30 people, we’ve gone through the natural evolution of infrastructure. From the days when one server and a few LXC containers were enough, to Docker and Docker Swarm, and finally to Kubernetes, which we now use not only in production but also for development and testing.
In this article, I’d like to share why we migrated, the challenges we faced, and how we successfully moved from Docker Swarm to Kubernetes.
Our beginnings: from LXC to Docker Swarm
At the start, our infrastructure was simple: one server, a few VMs, and containers. As the number of projects grew, we moved to Docker and then to Docker Swarm.
We had three physical servers running VMs with Swarm on top. Storage was handled with NFS so that we could run services flexibly on different nodes. Load balancing was done by a shared setup distributing traffic across the three Swarm nodes.
Our core services included:
- self-hosted GitLab for code and CI/CD,
- Nexus as an artifact repository,
- several small Java services sharing a common database,
- internal systems (attendance, document management, etc.).
Our Swarm setup actually served us well for quite some time — it was simple to manage and had very low complexity compared to other orchestration tools. That was one of the main reasons we stuck with it for years.
However, as our projects and team grew, several limitations became clear:
- single point of failure on NFS,
- limited security and multi-team separation — it was harder to isolate deployments per team,
- lack of integration with tools we used elsewhere (like GitOps deployments with ArgoCD),
- shrinking community and ecosystem support.
Swarm’s simplicity and low operational overhead were strong advantages, but in the long run they were outweighed by the need for more features, stability, and integration.
(Figure: simple schema of our Swarm setup)
Why Kubernetes
As both our projects and team size grew, we started using Kubernetes in production for customer projects. It made sense to unify environments, so that development and testing would also run on Kubernetes.
Our goals were:
- higher stability and availability, even if a physical server goes down,
- eliminate SPOFs in storage,
- improve control and security (secrets, certificates),
- adopt GitOps for deployments,
- keep infrastructure management simple with tools we already knew.
The new architecture
We chose MicroK8s (because we already use Ubuntu extensively). Kubernetes runs directly on physical machines, while KVM is only used separately outside of the cluster.
Key components:
- Longhorn — distributed storage replacing NFS,
- Vault (HashiCorp) — for secrets and ACME endpoint for internal certificates,
- ArgoCD — GitOps orchestration integrated with GitLab,
- Traefik — separate instances for internal and external traffic,
- MetalLB — provides LoadBalancer IPs for services.
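MetalLB has to be told which address ranges it may hand out to LoadBalancer services. A minimal sketch of that configuration (pool and advertisement names, and the exact range, are assumptions; only the 192.168.100.x network matches our actual setup):

```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: internal-pool          # hypothetical pool name
  namespace: metallb-system
spec:
  addresses:
    - 192.168.100.20-192.168.100.40  # assumed range on the internal network
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: internal-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - internal-pool
```

With L2 advertisement, MetalLB answers ARP for these addresses from one of the nodes, so no BGP-capable router is required.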
(Figure: simple schema of the new setup)
🛠️ How we migrated
Experiments and first tests
We first took one server out of the Swarm cluster, reinstalled it, and ran MicroK8s for testing. The first components deployed were Longhorn, GitLab, ArgoCD, LDAP, and Vault. Nexus was also included in the base services, although we later decided to keep it on a local disk (see below).
Data migration
One of the most challenging parts of the migration was transferring existing data from Swarm into Kubernetes.
At first, we experimented with NFS, but large transfers turned out to be unreliable. In addition, we couldn’t find a straightforward way to copy data directly into Longhorn volumes. Because of that, we decided to use temporary sync pods that would upload the data once into pre-provisioned volumes.
In the end, rsync over SSH proved to be the most consistent approach. To support this, we prepared a lightweight custom image:
```dockerfile
FROM ubuntu:latest

# rsync for the data transfer, openssh-client for the SSH transport
RUN apt-get update && apt-get install -y \
    rsync \
    openssh-client \
    vim \
    && rm -rf /var/lib/apt/lists/*

# keep the container alive so one-off syncs can run inside it
ENTRYPOINT ["sleep", "infinity"]
```
Then we used sync pods to pull data from legacy VMs into new PVCs in Longhorn:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sync-mongodb-data
  namespace: core-databases
spec:
  containers:
    - name: sync-mongodb-data
      image: registry.example.com/project/sync-image:latest
      command: ["/bin/sh", "-c"]
      args:
        [
          "rsync -azh --delete --stats -e 'ssh -i /key -o StrictHostKeyChecking=no' user@virt2:/media/data/mongodb/ /data/db/",
        ]
      volumeMounts:
        - mountPath: /data/db
          name: mongodb-data
        - mountPath: /key
          subPath: key
          readOnly: false
          name: sync-key
  volumes:
    - name: mongodb-data
      persistentVolumeClaim:
        claimName: core-mongodb-pvc
    - name: sync-key
      secret:
        secretName: sync-key
        defaultMode: 0600
  restartPolicy: Never
```
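The sync pod pulls data into a pre-provisioned Longhorn volume, so the PVC has to exist before the pod starts. A minimal claim might look like this (the 20Gi size is an assumption; the claim name and namespace match the sync pod):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: core-mongodb-pvc
  namespace: core-databases
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn   # Longhorn's default StorageClass name
  resources:
    requests:
      storage: 20Gi            # assumed size; match the source data set
```

Once the rsync finishes, the pod's logs show the transfer statistics and the pod can simply be deleted; the data stays in the volume.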
Nexus — a special case
We found that Nexus wasn’t stable when running on Longhorn. Instead, we placed its data on a local disk of one server. Yes, this introduces a new SPOF, but we mitigate it with regular backups, and the performance is stable.
There are certainly ways to tune Nexus to run reliably on Longhorn, but for now we decided not to invest time into this optimization. The local storage solution is sufficient for our current needs.
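For the curious, pinning data to one node's disk can be done with a local PersistentVolume and node affinity. This is only a sketch; the path, capacity, and node name are all hypothetical:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nexus-local-pv          # hypothetical name
spec:
  capacity:
    storage: 200Gi              # assumed size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /media/nexus-data     # hypothetical path on the chosen server
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node1         # the one server holding the data
```

The node affinity ensures the Nexus pod is always scheduled on the server that actually holds the disk.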
Separating internal and external traffic
We wanted a clear split:
- internal services (only inside the company network, SSL certificates from Vault),
- external services (public-facing, certificates from Let’s Encrypt).
We run two separate Traefik instances, each with its own IP provided by MetalLB.
Configuration — internal Traefik
```yaml
kind: Deployment
apiVersion: apps/v1
metadata:
  namespace: kube-system
  name: traefik-internal-deployment
  labels:
    app: traefik-internal
spec:
  replicas: 1
  selector:
    matchLabels:
      app: traefik-internal
  template:
    metadata:
      labels:
        app: traefik-internal
    spec:
      serviceAccountName: traefik-account
      containers:
        - name: traefik-internal
          image: traefik:v3.3
          imagePullPolicy: Always
          args:
            - --entryPoints.http.address=:80
            - --entryPoints.https.address=:443
            - --entryPoints.metrics.address=:8082
            - --metrics.prometheus=true
            - --providers.kubernetesingress=true
            - --providers.kubernetesingress.ingressclass=traefik-internal
            - --providers.kubernetescrd
            - --providers.kubernetescrd.allowEmptyServices=true
            - --certificatesresolvers.internal.acme.email=admin@example.com
            - --certificatesresolvers.internal.acme.storage=/internal/acme.json
            - --certificatesresolvers.internal.acme.caServer=https://vault.example.lan/v1/pki_internal/acme/directory
            - --certificatesresolvers.internal.acme.httpChallenge.entryPoint=http
          ports:
            - name: web
              containerPort: 80
            - name: https
              containerPort: 443
            - name: dashboard
              containerPort: 8080
            - name: metrics
              containerPort: 8082
          volumeMounts:
            - mountPath: /internal
              name: traefik-internal-data
            - mountPath: /etc/ssl/certs/root_ca.crt
              subPath: root-ca-crt
              readOnly: true
              name: traefik-internal-root-ca
      volumes:
        - name: traefik-internal-data
          persistentVolumeClaim:
            claimName: traefik-internal-data
        - name: traefik-internal-root-ca
          secret:
            secretName: root-ca
---
apiVersion: v1
kind: Service
metadata:
  name: traefik-internal-web
  annotations:
    metallb.universe.tf/allow-shared-ip: "internal-ip"
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: web
    - port: 443
      targetPort: https
  selector:
    app: traefik-internal
  loadBalancerIP: 192.168.100.20
```
Explanation:
- The Deployment runs Traefik internal, serving only company services.
- Certificates are obtained via Vault ACME endpoint.
- The Service is of type LoadBalancer, with the IP (192.168.100.20) provided by MetalLB.
- The annotation allow-shared-ip allows multiple services to share the same IP if needed (e.g., developer databases).
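An internal service then selects this instance through its ingress class. A sketch of what that looks like (the GitLab host, service name, and port are illustrative; the certresolver name `internal` matches the Traefik args above):

```yaml
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: traefik-internal
spec:
  controller: traefik.io/ingress-controller
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: gitlab-internal          # hypothetical example
  annotations:
    # ask the internal Traefik to obtain a cert via the Vault ACME resolver
    traefik.ingress.kubernetes.io/router.tls.certresolver: internal
spec:
  ingressClassName: traefik-internal
  rules:
    - host: gitlab.example.lan   # illustrative internal hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: gitlab     # illustrative service name
                port:
                  number: 80
```

Ingresses without this class are ignored by the internal instance, which is what keeps the two Traefiks cleanly separated.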
External Traefik
The external instance is very similar but uses Let’s Encrypt with DNS challenge.
Originally, we tried the standard HTTP challenge, but this caused issues with our geoIP filtering. To obtain certificates, we had to temporarily disable geoIP protection, which was both inconvenient and insecure.
Switching to DNS challenge solved this problem completely. Certificates are now issued directly through the DNS provider’s API, without exposing a .well-known/acme-challenge endpoint to the internet.
```yaml
- --certificatesresolvers.letsencrypt.acme.dnschallenge=true
- --certificatesresolvers.letsencrypt.acme.dnschallenge.provider=websupport
- --certificatesresolvers.letsencrypt.acme.storage=/letsencrypt/acme.json
- --certificatesresolvers.letsencrypt.acme.dnschallenge.delaybeforecheck=0
```
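The DNS provider's API credentials are passed to Traefik as environment variables, which we keep in a Secret rather than in the manifest. A sketch of the container-level addition (the Secret name is our choice; the exact variable names the websupport provider expects are listed in Traefik's dnsChallenge provider documentation):

```yaml
          envFrom:
            - secretRef:
                # holds the DNS provider API credentials as env variables;
                # see Traefik's dnsChallenge docs for the required names
                name: websupport-dns-credentials
```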
Results and benefits
After migration, our infrastructure is:
- more stable — services survive even if a whole physical server fails,
- more secure — Vault handles secrets and certificates,
- more flexible — ArgoCD + GitLab give us GitOps workflows,
- unified — Kubernetes is used consistently from dev to production,
- experimental-friendly — with Longhorn we can use snapshots for individual Persistent Volumes, which makes experimenting with services much easier.
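The snapshot workflow mentioned in the last point can be as small as one manifest, assuming a CSI VolumeSnapshotClass for Longhorn has been set up (the snapshot and class names here are illustrative):

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: mongodb-before-upgrade   # hypothetical name
  namespace: core-databases
spec:
  volumeSnapshotClassName: longhorn  # assumed VolumeSnapshotClass name
  source:
    persistentVolumeClaimName: core-mongodb-pvc
```

Restoring is just creating a new PVC with this snapshot as its dataSource, which makes "try it, and roll back if it breaks" experiments cheap.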
Lessons learned
A few takeaways that might help other small teams:
- DNS challenge instead of HTTP challenge. At first, we used the HTTP challenge for Let’s Encrypt, but ran into issues with geoIP filtering. We had to temporarily disable it, which wasn’t safe or convenient. Switching to the DNS challenge was a game-changer — certificates worked reliably without compromises.
- If we could, we’d adopt Kubernetes earlier. We postponed moving to Kubernetes, thinking it was overkill for a small team. In reality, MicroK8s was the perfect fit — easy to get started with, yet fully Kubernetes. If we could go back, we’d use it from day one. Operations are much simpler now.
- Storage is critical. Our original NFS was a weakness and a single point of failure. Longhorn was a huge improvement, but we also learned that some services (like Nexus) have special requirements. In those cases, local disk plus proper backups made more sense than distributed storage.
Conclusion
Migrating from Swarm to Kubernetes wasn’t trivial — we had to solve many details, from storage to data migration. But the outcome was worth it. Even a small company can benefit from Kubernetes if priorities are clear and the right tools are chosen.


