Talos deployment (4 nodes)
This directory contains a talhelper cluster definition for a 4-node Talos
cluster:
- 3 hybrid control-plane/worker nodes: noble-cp-1..3
- 1 worker-only node: noble-worker-1
- allowSchedulingOnControlPlanes: true
- CNI: none (Cilium is installed via GitOps)
1) Update values for your environment
Edit talconfig.yaml:
- endpoint (Kubernetes API VIP or LB IP)
- ipAddress for each node
- installDisk for each node (for example /dev/sda, /dev/nvme0n1)
- talosVersion / kubernetesVersion if desired
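For orientation, a minimal talconfig.yaml covering those fields might look like this. This is a sketch only: the Kubernetes version is a placeholder and the IPs are taken from the examples later in this README, not from the committed file.

```yaml
clusterName: noble
talosVersion: v1.12.5
kubernetesVersion: v1.31.0        # placeholder: pin the version you actually run
endpoint: https://192.168.50.230:6443
allowSchedulingOnControlPlanes: true
cniConfig:
  name: none                      # Cilium is installed via GitOps instead
nodes:
  - hostname: noble-cp-1
    ipAddress: 192.168.50.20
    controlPlane: true
    installDisk: /dev/sda
  # noble-cp-2, noble-cp-3, and noble-worker-1 follow the same shape
```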
2) Generate cluster secrets and machine configs
From this directory:
talhelper gensecret > talsecret.sops.yaml
talhelper genconfig
Generated machine configs are written to clusterconfig/.
3) Apply Talos configs
Apply each node file to the matching node IP from talconfig.yaml:
talosctl apply-config --insecure -n 192.168.50.20 -f clusterconfig/noble-noble-cp-1.yaml
talosctl apply-config --insecure -n 192.168.50.30 -f clusterconfig/noble-noble-cp-2.yaml
talosctl apply-config --insecure -n 192.168.50.40 -f clusterconfig/noble-noble-cp-3.yaml
talosctl apply-config --insecure -n 192.168.50.10 -f clusterconfig/noble-noble-worker-1.yaml
4) Bootstrap the cluster
After all nodes are up (bootstrap once, from any control-plane node):
talosctl bootstrap -n 192.168.50.20 -e 192.168.50.230
talosctl kubeconfig -n 192.168.50.20 -e 192.168.50.230 .
5) Validate
talosctl -n 192.168.50.20 -e 192.168.50.230 health
kubectl get nodes -o wide
6) GitOps-pinned Cilium values
The Cilium settings that worked for this Talos cluster are now persisted in:
clusters/noble/apps/cilium/application.yaml
That Argo CD Application pins chart 1.16.6 and includes the required Helm
values for this environment (API host/port, cgroup settings, IPAM CIDR, and
security capabilities), so future reconciles do not drift back to defaults.
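The shape of those pinned Helm values is roughly as follows. This is a sketch of the Talos-specific settings Cilium documents for this platform; the pod CIDR is a placeholder and the excerpt is illustrative, not copied from the committed Application.

```yaml
# excerpt of the Helm values inside the Cilium Application (illustrative)
k8sServiceHost: 192.168.50.230       # the kube-vip API VIP
k8sServicePort: 6443
ipam:
  mode: cluster-pool
  operator:
    clusterPoolIPv4PodCIDRList:
      - 10.244.0.0/16                # placeholder CIDR
cgroup:
  autoMount:
    enabled: false                   # Talos manages the cgroup filesystem itself
  hostRoot: /sys/fs/cgroup
securityContext:
  capabilities:
    ciliumAgent:
      [CHOWN, KILL, NET_ADMIN, NET_RAW, IPC_LOCK, SYS_ADMIN, SYS_RESOURCE, DAC_OVERRIDE, FOWNER, SETGID, SETUID]
    cleanCiliumState:
      [NET_ADMIN, SYS_ADMIN, SYS_RESOURCE]
```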
7) Argo CD app-of-apps bootstrap
This repo includes an app-of-apps structure for cluster apps:
- Root app: clusters/noble/root-application.yaml
- Child apps index: clusters/noble/apps/kustomization.yaml
- Argo CD app: clusters/noble/apps/argocd/application.yaml
- Cilium app: clusters/noble/apps/cilium/application.yaml
Bootstrap once from your workstation:
kubectl apply -k clusters/noble/bootstrap/argocd
kubectl apply -f clusters/noble/root-application.yaml
After this, Argo CD continuously reconciles all applications under
clusters/noble/apps/.
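For reference, an app-of-apps root Application typically has this shape. This is a sketch: the repoURL and sync policy here are assumptions, not the committed manifest.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/noble-cluster   # hypothetical repo URL
    targetRevision: main
    path: clusters/noble/apps       # the child apps index lives here
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```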
8) kube-vip API VIP (192.168.50.230)
HAProxy has been removed in favor of kube-vip running on control-plane nodes.
Manifests are in:
- clusters/noble/apps/kube-vip/application.yaml
- clusters/noble/apps/kube-vip/vip-rbac.yaml
- clusters/noble/apps/kube-vip/vip-daemonset.yaml
The DaemonSet advertises 192.168.50.230 in ARP mode and fronts the Kubernetes
API on port 6443.
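The relevant part of that DaemonSet is the kube-vip container's environment. A sketch using kube-vip's documented env vars follows; the interface name is an assumption, and the excerpt is illustrative rather than a copy of vip-daemonset.yaml.

```yaml
# excerpt of the kube-vip container spec (illustrative)
env:
  - name: vip_arp
    value: "true"              # ARP mode
  - name: address
    value: "192.168.50.230"    # the advertised API VIP
  - name: port
    value: "6443"
  - name: cp_enable
    value: "true"              # front the Kubernetes API
  - name: svc_enable
    value: "true"              # also handle LoadBalancer Services
  - name: vip_interface
    value: "eth0"              # assumption: set to the node's actual NIC
```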
Apply manually (or let Argo CD sync from root app):
kubectl apply -k clusters/noble/apps/kube-vip
Validate:
kubectl -n kube-system get pods -l app.kubernetes.io/name=kube-vip-ds -o wide
nc -vz 192.168.50.230 6443
9) Argo CD via DNS host (no port)
Argo CD is exposed through a kube-vip managed LoadBalancer Service:
argo.noble.lab.pcenicni.dev
Manifests:
- clusters/noble/bootstrap/argocd/argocd-server-lb.yaml
- clusters/noble/apps/kube-vip/vip-daemonset.yaml (svc_enable: "true")
After syncing manifests, create a Pi-hole DNS A record:
argo.noble.lab.pcenicni.dev -> 192.168.50.231
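A LoadBalancer Service of that shape might look like the following sketch. The Service name and port mapping are assumptions; the address is the one the DNS record points at, which kube-vip advertises.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: argocd-server-lb             # assumed name
  namespace: argocd
spec:
  type: LoadBalancer
  loadBalancerIP: 192.168.50.231     # kube-vip advertises this address
  selector:
    app.kubernetes.io/name: argocd-server
  ports:
    - name: https
      port: 443                      # no port needed in the URL
      targetPort: 8080               # argocd-server's default listen port
```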
10) Longhorn storage and extra disks
Longhorn is deployed from:
clusters/noble/apps/longhorn/application.yaml
Monitoring apps are configured to use storageClassName: longhorn, so you can
persist Prometheus/Alertmanager/Loki data once Longhorn is healthy.
Extra drive layout (this cluster)
Each node uses:
- /dev/sda: Talos install disk (installDisk in talconfig.yaml)
- /dev/sdb: dedicated Longhorn data disk
talconfig.yaml includes a global patch that partitions /dev/sdb and mounts it
at /var/mnt/longhorn, matching the Longhorn defaultDataPath in the Argo CD
Helm values.
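In talhelper's global patches form, such a patch is roughly the following. This is a sketch of the Talos machine.disks schema, not a copy of the committed patch.

```yaml
# global patch in talconfig.yaml (illustrative)
patches:
  - |-
    machine:
      disks:
        - device: /dev/sdb
          partitions:
            - mountpoint: /var/mnt/longhorn
```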
After editing talconfig.yaml, regenerate and apply configs:
cd talos
talhelper genconfig
# apply each node’s YAML from clusterconfig/ with talosctl apply-config
Then reboot each node once so the new disk layout is applied.
talosctl TLS errors (unknown authority, Ed25519 verification failure)
talosctl does not automatically use talos/clusterconfig/talosconfig. If you
omit it, the client falls back to ~/.talos/config, which is usually a
different cluster CA — you then get TLS handshake failures against the noble
nodes.
Always set this in the shell where you run talosctl (use an absolute path
if you change directories):
cd talos
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
export ENDPOINT=192.168.50.230
Sanity check (should print Talos and Kubernetes versions, not TLS errors):
talosctl -e "${ENDPOINT}" -n 192.168.50.20 version
Then use the same shell for apply-config, reboot, and health.
If it still fails after TALOSCONFIG is set, the running cluster was likely
bootstrapped with different secrets than the ones in your current
talsecret.sops.yaml / regenerated clusterconfig/. In that case you need the
original talosconfig that matched the cluster when it was created, or you
must align secrets and cluster state (recovery / rebuild is a larger topic).
Keep talosctl roughly aligned with the node Talos version (for example
v1.12.x clients for v1.12.5 nodes).
Paste tip: run one command per line. Pasting a path ending in ...cp-3.yaml
and the next talosctl command onto the same line corrupts the filename and
can confuse the shell.
More than one extra disk per node
If you add a third disk later, extend machine.disks in talconfig.yaml (for
example /dev/sdc → /var/mnt/longhorn-disk2) and register that path in
Longhorn as an additional disk for that node.
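Registering that path can be done through the Longhorn UI or by editing the Longhorn Node custom resource. A sketch follows; the disk key name is arbitrary and the exact committed form may differ.

```yaml
apiVersion: longhorn.io/v1beta2
kind: Node
metadata:
  name: noble-worker-1               # the node gaining the extra disk
  namespace: longhorn-system
spec:
  disks:
    disk-2:                          # arbitrary key for the new disk
      path: /var/mnt/longhorn-disk2
      allowScheduling: true
```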
Recommended:
- use one dedicated filesystem per Longhorn disk path
- avoid using the Talos system disk for heavy Longhorn data
- spread replicas across nodes for resiliency
11) Upgrade Talos to v1.12.x
This repo now pins:
talosVersion: v1.12.5 in talconfig.yaml
Regenerate configs
From talos/:
talhelper genconfig
Rolling upgrade order
Upgrade one node at a time, waiting for it to return healthy before moving on.
- Control-plane nodes (noble-cp-1, then noble-cp-2, then noble-cp-3)
- Worker node (noble-worker-1)
Example commands (adjust node IP per step):
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 upgrade --image ghcr.io/siderolabs/installer:v1.12.5
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 reboot
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 health
After all nodes are upgraded, verify:
talosctl --talosconfig ./clusterconfig/talosconfig version
kubectl get nodes -o wide