Talos deployment (4 nodes)
This directory contains a talhelper cluster definition for a 4-node Talos
cluster:
- 3 hybrid control-plane/worker nodes: noble-cp-1..3
- 1 worker-only node: noble-worker-1
- allowSchedulingOnControlPlanes: true
- CNI: none (for Cilium via GitOps)
1) Update values for your environment
Edit talconfig.yaml:
- endpoint (Kubernetes API VIP or LB IP)
- each node ipAddress
- each node installDisk (for example /dev/sda, /dev/nvme0n1)
- talosVersion / kubernetesVersion if desired
2) Generate cluster secrets and machine configs
From this directory:
talhelper gensecret > talsecret.sops.yaml
talhelper genconfig
Generated machine configs are written to clusterconfig/.
3) Apply Talos configs
Apply each node file to the matching node IP from talconfig.yaml:
talosctl apply-config --insecure -n 192.168.50.20 -f clusterconfig/noble-noble-cp-1.yaml
talosctl apply-config --insecure -n 192.168.50.30 -f clusterconfig/noble-noble-cp-2.yaml
talosctl apply-config --insecure -n 192.168.50.40 -f clusterconfig/noble-noble-cp-3.yaml
talosctl apply-config --insecure -n 192.168.50.10 -f clusterconfig/noble-noble-worker-1.yaml
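The four commands above follow one pattern, so they can be generated from a single node list. The following is a minimal dry-run sketch: it only prints the commands, and the NODES variable and print_apply_commands function are illustrative names, not part of this repo. The IPs and file names are copied from the commands above and should be checked against talconfig.yaml.

```shell
#!/bin/sh
# Node file suffixes and IPs, copied from the apply-config commands above.
# NODES and print_apply_commands are hypothetical names used for illustration.
NODES="cp-1=192.168.50.20 cp-2=192.168.50.30 cp-3=192.168.50.40 worker-1=192.168.50.10"

# Print (do not run) one apply-config command per node.
print_apply_commands() {
  for pair in $NODES; do
    name=${pair%%=*}   # text before '='
    ip=${pair#*=}      # text after '='
    echo "talosctl apply-config --insecure -n $ip -f clusterconfig/noble-noble-$name.yaml"
  done
}

print_apply_commands
```

Review the printed lines, then run them one at a time.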
4) Bootstrap the cluster
After all nodes are up (bootstrap once, from any control-plane node):
talosctl bootstrap -n 192.168.50.20 -e 192.168.50.230
talosctl kubeconfig -n 192.168.50.20 -e 192.168.50.230 .
5) Validate
talosctl -n 192.168.50.20 -e 192.168.50.230 health
kubectl get nodes -o wide
kubectl errors: lookup https: no such host or https://https/...
That means the active kubeconfig has a broken cluster.server URL (often a
double https:// or duplicate :6443). Kubernetes then tries to resolve
the hostname https, which fails.
Inspect what you are using:
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}{"\n"}'
It must be a single valid URL, for example:
- https://192.168.50.230:6443 (API VIP from talconfig.yaml), or
- https://kube.noble.lab.pcenicni.dev:6443 (if DNS points at that VIP)
Fix the cluster entry (replace noble with your context’s cluster name if
different):
kubectl config set-cluster noble --server=https://192.168.50.230:6443
Or point kubectl at this repo’s kubeconfig (known-good server line):
export KUBECONFIG="$(pwd)/kubeconfig"
kubectl cluster-info
Avoid pasting https:// twice when running kubectl config set-cluster ... --server=....
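A quick shape check can catch a doubled scheme or a duplicated port before the value goes into the kubeconfig. This is a minimal sketch, assuming the server is an IPv4 address or DNS name with an explicit port (it rejects IPv6 literals and URLs without a port). The function name is_valid_server_url is hypothetical.

```shell
#!/bin/sh
# Return 0 if the argument has the form https://<host>:<port> exactly once.
# Assumption: host is an IPv4 address or DNS name; an explicit port is required.
is_valid_server_url() {
  printf '%s\n' "$1" | grep -Eq '^https://[A-Za-z0-9.-]+:[0-9]+$'
}

# Example use against the active kubeconfig (needs kubectl, so left commented):
#   server=$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')
#   is_valid_server_url "$server" || echo "suspicious server URL: $server" >&2
```

With this pattern, https://https://192.168.50.230:6443 and https://192.168.50.230:6443:6443 are both rejected, while the two example URLs above pass.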
6) GitOps-pinned Cilium values
The Cilium settings that worked for this Talos cluster are now persisted in:
clusters/noble/apps/cilium/application.yaml
That Argo CD Application pins chart 1.16.6 and includes the required Helm
values for this environment (API host/port, cgroup settings, IPAM CIDR, and
security capabilities), so future reconciles do not drift back to defaults.
7) Argo CD app-of-apps bootstrap
This repo includes an app-of-apps structure for cluster apps:
- Root app: clusters/noble/root-application.yaml
- Child apps index: clusters/noble/apps/kustomization.yaml
- Argo CD app: clusters/noble/apps/argocd/application.yaml
- Cilium app: clusters/noble/apps/cilium/application.yaml
Bootstrap once from your workstation:
kubectl apply -k clusters/noble/bootstrap/argocd
kubectl apply -f clusters/noble/root-application.yaml
After this, Argo CD continuously reconciles all applications under
clusters/noble/apps/.
8) kube-vip API VIP (192.168.50.230)
HAProxy has been removed in favor of kube-vip running on control-plane nodes.
Manifests are in:
- clusters/noble/apps/kube-vip/application.yaml
- clusters/noble/apps/kube-vip/vip-rbac.yaml
- clusters/noble/apps/kube-vip/vip-daemonset.yaml
The DaemonSet advertises 192.168.50.230 in ARP mode and fronts the Kubernetes
API on port 6443.
Apply manually (or let Argo CD sync from root app):
kubectl apply -k clusters/noble/apps/kube-vip
Validate:
kubectl -n kube-system get pods -l app.kubernetes.io/name=kube-vip-ds -o wide
nc -vz 192.168.50.230 6443
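If the port check is run right after a sync, the VIP may not answer on the first attempt (an assumption, since timing depends on the environment). A small retry wrapper avoids rerunning the check by hand. The retry function below is a hypothetical helper, not something shipped in this repo.

```shell
#!/bin/sh
# retry <attempts> <delay-seconds> <command...>
# Runs the command until it succeeds or the attempts are used up.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    # Sleep only if another attempt follows.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Example (needs network access to the VIP, so left commented):
#   retry 30 2 nc -vz 192.168.50.230 6443
```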
9) Argo CD via DNS host (no port)
Argo CD is exposed through a kube-vip managed LoadBalancer Service:
argo.noble.lab.pcenicni.dev
Manifests:
- clusters/noble/bootstrap/argocd/argocd-server-lb.yaml
- clusters/noble/apps/kube-vip/vip-daemonset.yaml (svc_enable: "true")
After syncing manifests, create a Pi-hole DNS A record:
argo.noble.lab.pcenicni.dev -> 192.168.50.231
10) Longhorn storage and extra disks
Longhorn is deployed from:
clusters/noble/apps/longhorn/application.yaml
Monitoring apps are configured to use storageClassName: longhorn, so you can
persist Prometheus/Alertmanager/Loki data once Longhorn is healthy.
Argo CD: longhorn OutOfSync, Health Missing, no longhorn-role
Missing means nothing has been applied yet, or a sync never completed. The
Helm chart creates ClusterRole/longhorn-role on a successful install.
- See the failure reason:
kubectl describe application longhorn -n argocd
Check Status → Conditions and Status → Operation State for the error
(for example Helm render error, CRD apply failure, or repo-server cannot reach
https://charts.longhorn.io).
- Trigger a sync (Argo CD UI Sync, or CLI):
argocd app sync longhorn
- After a good sync, confirm:
kubectl get clusterrole longhorn-role
kubectl get pods -n longhorn-system
Extra drive layout (this cluster)
Each node uses:
- /dev/sda: Talos install disk (installDisk in talconfig.yaml)
- /dev/sdb: dedicated Longhorn data disk
talconfig.yaml includes a global patch that partitions /dev/sdb and mounts it
at /var/mnt/longhorn, which matches Longhorn defaultDataPath in the Argo
Helm values.
After editing talconfig.yaml, regenerate and apply configs:
cd talos
talhelper genconfig
# apply each node’s YAML from clusterconfig/ with talosctl apply-config
Then reboot each node once so the new disk layout is applied.
talosctl TLS errors (unknown authority, Ed25519 verification failure)
talosctl does not automatically use talos/clusterconfig/talosconfig. If you
do not point it at that file, the client falls back to ~/.talos/config, which
usually holds the CA of a different cluster, so you get TLS handshake failures
against the noble nodes.
Always set this in the shell where you run talosctl (use an absolute path
if you change directories):
cd talos
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
export ENDPOINT=192.168.50.230
Sanity check (should print Talos and Kubernetes versions, not TLS errors):
talosctl -e "${ENDPOINT}" -n 192.168.50.20 version
Then use the same shell for apply-config, reboot, and health.
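A guard at the top of any wrapper script can catch a missing or stale TALOSCONFIG before talosctl produces a TLS error. This is a minimal sketch, and check_talosconfig is a hypothetical helper name. It only checks that the variable is set and the file is readable, not that the file matches the cluster.

```shell
#!/bin/sh
# Return 0 only if TALOSCONFIG is set and points at a readable file.
check_talosconfig() {
  if [ -z "${TALOSCONFIG:-}" ]; then
    echo "TALOSCONFIG is not set; talosctl would fall back to ~/.talos/config" >&2
    return 1
  fi
  if [ ! -r "$TALOSCONFIG" ]; then
    echo "TALOSCONFIG points at an unreadable file: $TALOSCONFIG" >&2
    return 1
  fi
  return 0
}

# Example (left commented because it needs the generated talosconfig):
#   check_talosconfig && talosctl -e "${ENDPOINT}" -n 192.168.50.20 version
```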
If it still fails after TALOSCONFIG is set, the running cluster was likely
bootstrapped with different secrets than the ones in your current
talsecret.sops.yaml / regenerated clusterconfig/. In that case you need the
original talosconfig that matched the cluster when it was created, or you
must align secrets and cluster state (recovery / rebuild is a larger topic).
Keep talosctl roughly aligned with the node Talos version (for example
v1.12.x clients for v1.12.5 nodes).
Paste tip: run one command per line. Pasting ...cp-3.yaml and
talosctl on the same line breaks the filename and can confuse the shell.
More than one extra disk per node
If you add a third disk later, extend machine.disks in talconfig.yaml (for
example /dev/sdc → /var/mnt/longhorn-disk2) and register that path in
Longhorn as an additional disk for that node.
Recommended:
- use one dedicated filesystem per Longhorn disk path
- avoid using the Talos system disk for heavy Longhorn data
- spread replicas across nodes for resiliency
11) Upgrade Talos to v1.12.x
This repo now pins:
talosVersion: v1.12.5 in talconfig.yaml
Regenerate configs
From talos/:
talhelper genconfig
Rolling upgrade order
Upgrade one node at a time, waiting for it to return healthy before moving on.
- Control plane nodes (noble-cp-1, then noble-cp-2, then noble-cp-3)
- Worker node (noble-worker-1)
Example commands (adjust node IP per step):
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 upgrade --image ghcr.io/siderolabs/installer:v1.12.5
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 reboot
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 health
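The per-node sequence can be written out for the whole cluster in the rolling order described above. The following is a dry-run sketch that only prints the commands. The node IPs are taken from the apply-config step (cp-1, cp-2, cp-3, then worker-1), and the variable and function names are illustrative.

```shell
#!/bin/sh
# Values copied from the example commands above; names are hypothetical.
TALOSCONFIG_PATH=./clusterconfig/talosconfig
IMAGE=ghcr.io/siderolabs/installer:v1.12.5
# Control-plane nodes first, then the worker.
UPGRADE_ORDER="192.168.50.20 192.168.50.30 192.168.50.40 192.168.50.10"

# Print (do not run) the upgrade, reboot, and health commands for each node.
print_upgrade_plan() {
  for ip in $UPGRADE_ORDER; do
    for action in "upgrade --image $IMAGE" reboot health; do
      echo "talosctl --talosconfig $TALOSCONFIG_PATH -n $ip $action"
    done
  done
}

print_upgrade_plan
```

Run the printed commands for one node, wait for health to pass, and only then move on to the next node.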
After all nodes are upgraded, verify:
talosctl --talosconfig ./clusterconfig/talosconfig version
kubectl get nodes -o wide