Refactor noble cluster configurations: remove deprecated Argo CD application-management files and move to a streamlined Ansible-driven installation. Update kustomization.yaml files to reflect the new structure and clarify resource management. Introduce new namespaces and configurations for cert-manager, external-secrets, and logging components, and add detailed README.md documentation for each component to guide setup and management of the noble lab environment.

This commit is contained in:
Nikholas Pcenicni
2026-03-28 17:02:50 -04:00
parent 41841abc84
commit 90fd8fb8a6
59 changed files with 28 additions and 38 deletions


@@ -0,0 +1,34 @@
# Cilium — noble (Talos)
Talos uses **`cluster.network.cni.name: none`**; you must install Cilium (or another CNI) before nodes become **Ready** and before **MetalLB** and most other workloads can schedule. See `talos/CLUSTER-BUILD.md` for ordering.
## 1. Install (phase 1 — required)
Uses **`values.yaml`**: IPAM **kubernetes**, **`k8sServiceHost` / `k8sServicePort`** pointing at **KubePrism** (`127.0.0.1:7445`, Talos default), Talos cgroup paths, **`SYS_MODULE` dropped** from agent capabilities, and **`bpf.masquerade: false`** ([Talos Cilium](https://www.talos.dev/latest/kubernetes-guides/network/deploying-cilium/), [KubePrism](https://www.talos.dev/latest/kubernetes-guides/configuration/kubeprism/)). Without KubePrism, host-network clients dial the API VIP directly (`dial tcp <VIP>:6443`) and fail whenever the VIP path is unhealthy.
From **repository root**:
```bash
helm repo add cilium https://helm.cilium.io/
helm repo update
helm upgrade --install cilium cilium/cilium \
--namespace kube-system \
--version 1.16.6 \
-f clusters/noble/apps/cilium/values.yaml \
--wait
```
Verify:
```bash
kubectl -n kube-system rollout status ds/cilium
kubectl get nodes
```
When nodes are **Ready**, continue with **MetalLB** (`clusters/noble/apps/metallb/README.md`) and other Phase B items. **kube-vip** for the Kubernetes API VIP is separate (L2 ARP); it can run after the API is reachable.
## 2. Optional: kube-proxy replacement (phase 2)
To replace **`kube-proxy`** with Cilium entirely, use **`values-kpr.yaml`** and **`cluster.proxy.disabled: true`** in Talos on every node (see comments inside `values-kpr.yaml`). Follow the upstream [Deploy Cilium CNI](https://www.talos.dev/latest/kubernetes-guides/network/deploying-cilium/) section *without kube-proxy*.
Do **not** skip phase 1 unless you already know your cluster matches the “bootstrap window” flow from the Talos docs.
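
The Talos side of phase 2 (setting `cluster.proxy.disabled: true` on every node) can be applied as a machine-config patch. A sketch, where the node IP and the patch filename are placeholders:

```yaml
# kube-proxy-disable.patch.yaml — hypothetical filename; apply to EVERY node, e.g.:
#   talosctl --nodes <node-ip> patch machineconfig --patch @kube-proxy-disable.patch.yaml
cluster:
  proxy:
    disabled: true          # stop Talos from deploying kube-proxy
  network:
    cni:
      name: none            # keep CNI disabled; Cilium stays Helm-managed
```

Reboot or re-apply per node as the Talos docs describe, then delete any leftover kube-proxy objects before switching to `values-kpr.yaml`.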


@@ -0,0 +1,49 @@
# Optional phase 2: kube-proxy replacement via Cilium + KubePrism (Talos node-local load balancer on :7445 → healthy apiservers on :6443).
# Prerequisites:
# 1. Phase 1 Cilium installed and healthy; nodes Ready.
# 2. Add to Talos machine config on ALL nodes:
# cluster:
# proxy:
# disabled: true
# (keep cluster.network.cni.name: none). Regenerate, apply-config, reboot as needed.
# 3. Remove legacy kube-proxy objects if still present:
# kubectl delete ds -n kube-system kube-proxy --ignore-not-found
# kubectl delete cm -n kube-system kube-proxy --ignore-not-found
# 4. helm upgrade cilium ... -f values-kpr.yaml
#
# Ref: https://www.talos.dev/latest/kubernetes-guides/network/deploying-cilium/
ipam:
mode: kubernetes
kubeProxyReplacement: "true"
k8sServiceHost: localhost
k8sServicePort: "7445"
securityContext:
capabilities:
ciliumAgent:
- CHOWN
- KILL
- NET_ADMIN
- NET_RAW
- IPC_LOCK
- SYS_ADMIN
- SYS_RESOURCE
- DAC_OVERRIDE
- FOWNER
- SETGID
- SETUID
cleanCiliumState:
- NET_ADMIN
- SYS_ADMIN
- SYS_RESOURCE
cgroup:
autoMount:
enabled: false
hostRoot: /sys/fs/cgroup
bpf:
masquerade: false
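
After the phase-2 `helm upgrade`, the switch can be verified from a workstation. A sketch; it assumes the `cilium` CLI is installed locally and that the in-pod debug binary is named `cilium-dbg`, as in recent Cilium releases:

```shell
# kube-proxy DaemonSet should be gone after step 3 above
kubectl -n kube-system get ds kube-proxy --ignore-not-found

# Overall agent/operator health via cilium-cli
cilium status --wait

# The agent should report kube-proxy replacement as enabled
kubectl -n kube-system exec ds/cilium -- cilium-dbg status | grep KubeProxyReplacement
```

If the last command reports replacement disabled, re-check that `kubeProxyReplacement: "true"` actually landed in the release values.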


@@ -0,0 +1,44 @@
# Cilium on Talos — phase 1: bring up CNI while kube-proxy still runs.
# See README.md for install order (before MetalLB scheduling) and optional kube-proxy replacement.
#
# Chart: cilium/cilium — pin version in helm command (e.g. 1.16.6).
# Ref: https://www.talos.dev/latest/kubernetes-guides/network/deploying-cilium/
ipam:
mode: kubernetes
kubeProxyReplacement: "false"
# Host-network components cannot use kubernetes.default ClusterIP; Talos KubePrism (enabled by default)
# on 127.0.0.1:7445 proxies to healthy apiservers and avoids flaky dials to cluster.controlPlane.endpoint (VIP).
# Ref: https://www.talos.dev/latest/kubernetes-guides/configuration/kubeprism/
k8sServiceHost: "127.0.0.1"
k8sServicePort: "7445"
securityContext:
capabilities:
ciliumAgent:
- CHOWN
- KILL
- NET_ADMIN
- NET_RAW
- IPC_LOCK
- SYS_ADMIN
- SYS_RESOURCE
- DAC_OVERRIDE
- FOWNER
- SETGID
- SETUID
cleanCiliumState:
- NET_ADMIN
- SYS_ADMIN
- SYS_RESOURCE
cgroup:
autoMount:
enabled: false
hostRoot: /sys/fs/cgroup
# Workaround: Talos host DNS forwarding + bpf masquerade can break CoreDNS; see Talos Cilium guide "Known issues".
bpf:
masquerade: false
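
Before installing, the rendered manifests can be sanity-checked offline with `helm template`. A sketch; the grep pattern is illustrative, since the chart's rendered key spellings can vary by version:

```shell
helm repo add cilium https://helm.cilium.io/ && helm repo update
helm template cilium cilium/cilium \
  --namespace kube-system \
  --version 1.16.6 \
  -f clusters/noble/apps/cilium/values.yaml \
  | grep -iE 'service[-_]?host|7445|masquerade'
```

The output should show the KubePrism endpoint (`127.0.0.1` / `7445`) and masquerade settings wired through; if not, the values file path or chart version is wrong.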