201 lines
26 KiB
Markdown
201 lines
26 KiB
Markdown
# Noble lab — Talos cluster build checklist
|
||
|
||
This document is the **exported TODO** for the **noble** Talos cluster (4 nodes). Commands and troubleshooting live in [`README.md`](./README.md).
|
||
|
||
## Current state (2026-03-28)
|
||
|
||
Lab stack is **up** on-cluster through **Phase D**–**F** and **Phase G** (**`talos/runbooks/`**, **SOPS**-encrypted secrets in **`clusters/noble/secrets/`**). **Next focus:** optional **Alertmanager** receivers (Slack/PagerDuty); tighten **RBAC** (Headlamp / cluster-admin); **Cilium** policies for other namespaces as needed; enable **Mend Renovate** for PRs; Pangolin/sample Ingress; **Velero** backup/restore drill after S3 credentials are set (**`noble_velero_install`**).
|
||
|
||
- **Talos** v1.12.6 (target) / **Kubernetes** as bundled — four nodes **Ready** unless upgrading; **`talosctl health`**; **`talos/kubeconfig`** is **local only** (gitignored — never commit; regenerate with `talosctl kubeconfig` per `talos/README.md`). **Image Factory (nocloud installer):** `factory.talos.dev/nocloud-installer/249d9135de54962744e917cfe654117000cba369f9152fbab9d055a00aa3664f:v1.12.6`
|
||
- **Cilium** Helm **1.16.6** / app **1.16.6** (`clusters/noble/bootstrap/cilium/`, phase 1 values).
|
||
- **CSI Volume Snapshot** — **external-snapshotter** **v8.5.0** CRDs + **`registry.k8s.io/sig-storage/snapshot-controller`** (`clusters/noble/bootstrap/csi-snapshot-controller/`, Ansible **`noble_csi_snapshot_controller`**).
|
||
- **MetalLB** Helm **0.15.3** / app **v0.15.3**; **IPAddressPool** `noble-l2` + **L2Advertisement** — pool **`192.168.50.210`–`192.168.50.229`**.
|
||
- **kube-vip** DaemonSet **3/3** on control planes; VIP **`192.168.50.230`** on **`ens18`** (`vip_subnet` **`/32`** required — bare **`32`** breaks parsing). **Verified from workstation:** `kubectl config set-cluster noble --server=https://192.168.50.230:6443` then **`kubectl get --raw /healthz`** → **`ok`** (`talos/kubeconfig`; see `talos/README.md`).
|
||
- **metrics-server** Helm **3.13.0** / app **v0.8.0** — `clusters/noble/bootstrap/metrics-server/values.yaml` (`--kubelet-insecure-tls` for Talos); **`kubectl top nodes`** works.
|
||
- **Longhorn** Helm **1.11.1** / app **v1.11.1** — `clusters/noble/bootstrap/longhorn/` (PSA **privileged** namespace, `defaultDataPath` `/var/mnt/longhorn`, `preUpgradeChecker` enabled); **StorageClass** `longhorn` (default); **`nodes.longhorn.io`** all **Ready**; test **PVC** `Bound` on `longhorn`.
|
||
- **Traefik** Helm **39.0.6** / app **v3.6.11** — `clusters/noble/bootstrap/traefik/`; **`Service`** **`LoadBalancer`** **`EXTERNAL-IP` `192.168.50.211`**; **`IngressClass`** **`traefik`** (default). Point **`*.apps.noble.lab.pcenicni.dev`** at **`192.168.50.211`**. MetalLB pool verification was done before replacing the temporary nginx test with Traefik.
|
||
- **cert-manager** Helm **v1.20.0** / app **v1.20.0** — `clusters/noble/bootstrap/cert-manager/`; **`ClusterIssuer`** **`letsencrypt-staging`** and **`letsencrypt-prod`** (**DNS-01** via **Cloudflare** for **`pcenicni.dev`**, Secret **`cloudflare-dns-api-token`** in **`cert-manager`**); ACME email **`certificates@noble.lab.pcenicni.dev`** (edit in manifests if you want a different mailbox).
|
||
- **Newt** Helm **1.2.0** / app **1.10.1** — `clusters/noble/bootstrap/newt/` (**fossorial/newt**); Pangolin site tunnel — **`newt-pangolin-auth`** Secret (**`PANGOLIN_ENDPOINT`**, **`NEWT_ID`**, **`NEWT_SECRET`**). Store credentials in git with **SOPS** (`clusters/noble/secrets/newt-pangolin-auth.secret.yaml`, **`age-key.txt`**, **`.sops.yaml`**) — see **`clusters/noble/secrets/README.md`**. **Public DNS** is **not** automated with ExternalDNS: **CNAME** records at your DNS host per Pangolin’s domain instructions, plus **Integration API** for HTTP resources/targets — see **`clusters/noble/bootstrap/newt/README.md`**. LAN access to Traefik can still use **`*.apps.noble.lab.pcenicni.dev`** → **`192.168.50.211`** (split horizon / local resolver).
|
||
- **Argo CD** Helm **9.4.17** / app **v3.3.6** — `clusters/noble/bootstrap/argocd/`; **`argocd-server`** **`LoadBalancer`** **`192.168.50.210`**; **`noble-root`** → **`clusters/noble/apps/`**; **`noble-bootstrap-root`** → **`clusters/noble/bootstrap`** (manual sync until **`argocd/README.md`** §5 after **`noble.yml`**). Edit **`repoURL`** in both root **`Application`** files before applying.
|
||
- **kube-prometheus-stack** — Helm chart **82.15.1** — `clusters/noble/bootstrap/kube-prometheus-stack/` (**namespace** `monitoring`, PSA **privileged** — **node-exporter** needs host mounts); **Longhorn** PVCs for Prometheus, Grafana, Alertmanager; **node-exporter** DaemonSet **4/4**. **Grafana Ingress:** **`https://grafana.apps.noble.lab.pcenicni.dev`** (Traefik **`ingressClassName: traefik`**, **`cert-manager.io/cluster-issuer: letsencrypt-prod`**). **Loki** datasource in Grafana: ConfigMap **`clusters/noble/bootstrap/grafana-loki-datasource/loki-datasource.yaml`** (sidecar label **`grafana_datasource: "1"`**) — not via **`grafana.additionalDataSources`** in the chart. **`helm upgrade --install` with `--wait` is silent until done** — use **`--timeout 30m`**; Grafana admin: Secret **`kube-prometheus-grafana`**, keys **`admin-user`** / **`admin-password`**.
|
||
- **Loki** + **Fluent Bit** — **`grafana/loki` 6.55.0** SingleBinary + **filesystem** on **Longhorn** (`clusters/noble/bootstrap/loki/`); **`loki.auth_enabled: false`**; **`chunksCache.enabled: false`** (no memcached chunk cache). **`fluent/fluent-bit` 0.56.0** → **`loki-gateway.loki.svc:80`** (`clusters/noble/bootstrap/fluent-bit/`); **`logging`** PSA **privileged**. **Grafana Explore:** **`kubectl apply -f clusters/noble/bootstrap/grafana-loki-datasource/loki-datasource.yaml`** then **Explore → Loki** (e.g. `{job="fluent-bit"}`).
|
||
- **SOPS** — cluster **`Secret`** manifests under **`clusters/noble/secrets/`** encrypted with **age** (see **`.sops.yaml`**, **`age-key.txt`** gitignored); **`noble.yml`** decrypt-applies when the private key is present.
|
||
- **Velero** Helm **12.0.0** / app **v1.18.0** — `clusters/noble/bootstrap/velero/` (**Ansible** **`noble_velero`**, not Argo); **S3-compatible** backup location + **CSI** snapshots (**`EnableCSI`**); enable with **`noble_velero_install`** per **`velero/README.md`**.
|
||
- **Still open:** **Renovate** — install **[Mend Renovate](https://github.com/apps/renovate)** (or self-host) so PRs run; optional **Alertmanager** notification channels; optional **sample Ingress + cert + Pangolin** end-to-end; **Argo CD SSO**.
|
||
|
||
## Inventory
|
||
|
||
| Host | Role | IP |
|
||
|------|------|-----|
|
||
| helium | worker | `192.168.50.10` |
|
||
| neon | control-plane + worker | `192.168.50.20` |
|
||
| argon | control-plane + worker | `192.168.50.30` |
|
||
| krypton | control-plane + worker | `192.168.50.40` |
|
||
|
||
## Network reservations
|
||
|
||
| Use | Value |
|
||
|-----|--------|
|
||
| Kubernetes API VIP (kube-vip) | `192.168.50.230` (see `talos/README.md`; align with `talos/talconfig.yaml` `additionalApiServerCertSans`) |
|
||
| MetalLB L2 pool | `192.168.50.210`–`192.168.50.229` |
|
||
| Argo CD `LoadBalancer` | **Pick one IP** in the MetalLB pool (e.g. `192.168.50.210`) |
|
||
| Traefik (apps ingress) | `192.168.50.211` — **`metallb.io/loadBalancerIPs`** in `clusters/noble/bootstrap/traefik/values.yaml` |
|
||
| Apps ingress (LAN / split horizon) | `*.apps.noble.lab.pcenicni.dev` → Traefik LB |
|
||
| Grafana (Ingress + TLS) | **`grafana.apps.noble.lab.pcenicni.dev`** — `grafana.ingress` in `clusters/noble/bootstrap/kube-prometheus-stack/values.yaml` (**`letsencrypt-prod`**) |
|
||
| Headlamp (Ingress + TLS) | **`headlamp.apps.noble.lab.pcenicni.dev`** — chart `ingress` in `clusters/noble/bootstrap/headlamp/` (**`letsencrypt-prod`**, **`ingressClassName: traefik`**) |
|
||
| Public DNS (Pangolin) | **Newt** tunnel + **CNAME** at registrar + **Integration API** — `clusters/noble/bootstrap/newt/` |
|
||
| Velero | S3-compatible endpoint + bucket — **`clusters/noble/bootstrap/velero/`**, **`ansible/playbooks/noble.yml`** (**`noble_velero_install`**) |
|
||
|
||
## Versions
|
||
|
||
- Talos: **v1.12.6** — align `talosctl` client with node image
|
||
- Talos **Image Factory** (iscsi-tools + util-linux-tools): **`factory.talos.dev/nocloud-installer/249d9135de54962744e917cfe654117000cba369f9152fbab9d055a00aa3664f:v1.12.6`** — same schematic must appear in **`machine.install.image`** after `talhelper genconfig` (bare metal may use `metal-installer/` instead of `nocloud-installer/`)
|
||
- Kubernetes: **1.35.2** on current nodes (bundled with Talos; not pinned in repo)
|
||
- Cilium: **1.16.6** (Helm chart; see `clusters/noble/bootstrap/cilium/README.md`)
|
||
- MetalLB: **0.15.3** (Helm chart; app **v0.15.3**)
|
||
- metrics-server: **3.13.0** (Helm chart; app **v0.8.0**)
|
||
- Longhorn: **1.11.1** (Helm chart; app **v1.11.1**)
|
||
- Traefik: **39.0.6** (Helm chart; app **v3.6.11**)
|
||
- cert-manager: **v1.20.0** (Helm chart; app **v1.20.0**)
|
||
- Newt (Fossorial): **1.2.0** (Helm chart; app **1.10.1**)
|
||
- Argo CD: **9.4.17** (Helm chart `argo/argo-cd`; app **v3.3.6**)
|
||
- kube-prometheus-stack: **82.15.1** (Helm chart `prometheus-community/kube-prometheus-stack`; app **v0.89.x** bundle)
|
||
- Loki: **6.55.0** (Helm chart `grafana/loki`; app **3.6.7**)
|
||
- Fluent Bit: **0.56.0** (Helm chart `fluent/fluent-bit`; app **4.2.3**)
|
||
- Kyverno: **3.7.1** (Helm chart `kyverno/kyverno`; app **v1.17.1**); **kyverno-policies** **3.7.1** — **baseline** PSS, **Audit** (`clusters/noble/bootstrap/kyverno/`)
|
||
- Headlamp: **0.40.1** (Helm chart `headlamp/headlamp`; app matches chart — see [Artifact Hub](https://artifacthub.io/packages/helm/headlamp/headlamp))
|
||
- Velero: **12.0.0** (Helm chart `vmware-tanzu/velero`; app **v1.18.0**) — **`clusters/noble/bootstrap/velero/`**; AWS plugin **v1.14.0**; Ansible **`noble_velero`**
|
||
- Renovate: **hosted** (Mend **Renovate** GitHub/GitLab app — no cluster chart) **or** **self-hosted** — pin chart when added ([Helm charts](https://docs.renovatebot.com/helm-charts/), OCI `ghcr.io/renovatebot/charts/renovate`); pair **`renovate.json`** with this repo’s Helm paths under **`clusters/noble/`**
|
||
|
||
## Repo paths (this workspace)
|
||
|
||
| Artifact | Path |
|
||
|----------|------|
|
||
| This checklist | `talos/CLUSTER-BUILD.md` |
|
||
| Operational runbooks (API VIP, etcd, Longhorn, SOPS) | `talos/runbooks/` |
|
||
| Talos quick start + networking + kubeconfig | `talos/README.md` |
|
||
| talhelper source (active) | `talos/talconfig.yaml` — may be **wipe-phase** (no Longhorn volume) during disk recovery |
|
||
| Longhorn volume restore | `talos/talconfig.with-longhorn.yaml` — copy to `talconfig.yaml` after GPT wipe (see `talos/README.md` §5) |
|
||
| Longhorn GPT wipe automation | `talos/scripts/longhorn-gpt-recovery.sh` |
|
||
| kube-vip (kustomize) | `clusters/noble/bootstrap/kube-vip/` (`vip_interface` e.g. `ens18`) |
|
||
| Cilium (Helm values) | `clusters/noble/bootstrap/cilium/` — `values.yaml` (phase 1), optional `values-kpr.yaml`, `README.md` |
|
||
| CSI Volume Snapshot (CRDs + controller) | `clusters/noble/bootstrap/csi-snapshot-controller/` — `crd/`, `controller/` kustomize; **`ansible/roles/noble_csi_snapshot_controller`** |
|
||
| MetalLB | `clusters/noble/bootstrap/metallb/` — `namespace.yaml` (PSA **privileged**), `ip-address-pool.yaml`, `kustomization.yaml`, `README.md` |
|
||
| Longhorn | `clusters/noble/bootstrap/longhorn/` — `values.yaml`, `namespace.yaml` (PSA **privileged**), `kustomization.yaml` |
|
||
| metrics-server (Helm values) | `clusters/noble/bootstrap/metrics-server/values.yaml` |
|
||
| Traefik (Helm values) | `clusters/noble/bootstrap/traefik/` — `values.yaml`, `namespace.yaml`, `README.md` |
|
||
| cert-manager (Helm + ClusterIssuers) | `clusters/noble/bootstrap/cert-manager/` — `values.yaml`, `namespace.yaml`, `kustomization.yaml`, `README.md` |
|
||
| Newt / Pangolin tunnel (Helm) | `clusters/noble/bootstrap/newt/` — `values.yaml`, `namespace.yaml`, `README.md` |
|
||
| Argo CD (Helm) + app-of-apps | `clusters/noble/bootstrap/argocd/` — `values.yaml`, `root-application.yaml`, `bootstrap-root-application.yaml`, `app-of-apps/`, `README.md`; **`noble-root`** syncs **`clusters/noble/apps/`**; **`noble-bootstrap-root`** syncs **`clusters/noble/bootstrap`** (enable automation after **`noble.yml`**) |
|
||
| kube-prometheus-stack (Helm values) | `clusters/noble/bootstrap/kube-prometheus-stack/` — `values.yaml`, `namespace.yaml` |
|
||
| Grafana Loki datasource (ConfigMap; no chart change) | `clusters/noble/bootstrap/grafana-loki-datasource/loki-datasource.yaml` |
|
||
| Loki (Helm values) | `clusters/noble/bootstrap/loki/` — `values.yaml`, `namespace.yaml` |
|
||
| Fluent Bit → Loki (Helm values) | `clusters/noble/bootstrap/fluent-bit/` — `values.yaml`, `namespace.yaml` |
|
||
| SOPS-encrypted cluster Secrets | `clusters/noble/secrets/` — `README.md`, `*.secret.yaml`; **`.sops.yaml`**, **`age-key.txt`** (gitignored) at repo root |
|
||
| Kyverno + PSS baseline policies | `clusters/noble/bootstrap/kyverno/` — `values.yaml`, `policies-values.yaml`, `namespace.yaml`, `README.md` |
|
||
| Headlamp (Helm + Ingress) | `clusters/noble/bootstrap/headlamp/` — `values.yaml`, `namespace.yaml`, `README.md` |
|
||
| Velero (Helm + S3 BSL; CSI snapshots) | `clusters/noble/bootstrap/velero/` — `values.yaml`, `namespace.yaml`, `README.md`; **`ansible/roles/noble_velero`** |
|
||
| Renovate (repo config + optional self-hosted Helm) | **`renovate.json`** at repo root; optional self-hosted chart under **`clusters/noble/apps/`** (Argo) + token Secret (SOPS under **`clusters/noble/secrets/`** or imperative **`kubectl create secret`**) |
|
||
|
||
**Git vs cluster:** manifests and `talconfig` live in git; **`talhelper genconfig -o out`**, bootstrap, Helm, and `kubectl` run on your LAN. See **`talos/README.md`** for workstation reachability (lab LAN/VPN), **`talosctl kubeconfig`** vs Kubernetes `server:` (VIP vs node IP), and **`--insecure`** only in maintenance.
|
||
|
||
## Ordering (do not skip)
|
||
|
||
1. **Talos** installed; **Cilium** (or chosen CNI) **before** most workloads — with `cni: none`, nodes stay **NotReady** / **network-unavailable** taint until CNI is up.
|
||
2. **MetalLB Helm chart** (CRDs + controller) **before** `kubectl apply -k` on the pool manifests.
|
||
3. **`clusters/noble/bootstrap/metallb/namespace.yaml`** before or merged onto `metallb-system` so Pod Security does not block speaker (see `bootstrap/metallb/README.md`).
|
||
4. **CSI Volume snapshots:** **`kubernetes-csi/external-snapshotter`** CRDs + **`snapshot-controller`** (`clusters/noble/bootstrap/csi-snapshot-controller/`) before relying on **Longhorn** / **Velero** volume snapshots.
|
||
5. **Longhorn:** Talos user volume + extensions in `talconfig.with-longhorn.yaml` (when restored); Helm **`defaultDataPath`** in `clusters/noble/bootstrap/longhorn/values.yaml`.
|
||
6. **Loki → Fluent Bit → Grafana datasource:** deploy **Loki** (`loki-gateway` Service) before **Fluent Bit**; apply **`clusters/noble/bootstrap/grafana-loki-datasource/loki-datasource.yaml`** after **Loki** (sidecar picks up the ConfigMap — no kube-prometheus values change for Loki).
|
||
7. **Headlamp:** **Traefik** + **cert-manager** (**`letsencrypt-prod`**) before exposing **`headlamp.apps.noble.lab.pcenicni.dev`**; treat as **cluster-admin** UI — protect with network policy / SSO when hardening (**Phase G**).
|
||
8. **Renovate:** **Git remote** + platform access (**hosted app** needs org/repo install; **self-hosted** needs **`RENOVATE_TOKEN`** and chart **`renovate.config`**). If the bot runs **in-cluster**, store the token with **SOPS** or an imperative Secret — no ingress required for the bot itself.
|
||
9. **Velero:** **S3-compatible** endpoint + bucket + **`velero/velero-cloud-credentials`** before **`ansible/playbooks/noble.yml`** with **`noble_velero_install: true`**; for **CSI** volume snapshots, label a **VolumeSnapshotClass** per **`clusters/noble/bootstrap/velero/README.md`** (e.g. Longhorn).
|
||
|
||
## Prerequisites (before phases)
|
||
|
||
- [x] `talos/talconfig.yaml` checked in (VIP, API SANs, `cni: none`, `iscsi-tools` / `util-linux-tools` in schematic) — run `talhelper validate talconfig talconfig.yaml` after edits
|
||
- [x] Workstation on a **routable path** to node IPs or VIP (same LAN / VPN); `talos/README.md` §3 if `kubectl` hits wrong `server:` or `network is unreachable`
|
||
- [x] `talosctl` client matches node Talos version; `talhelper` for `genconfig`
|
||
- [x] Node static IPs (helium, neon, argon, krypton)
|
||
- [x] DHCP does not lease `192.168.50.210`–`229`, `230`, or node IPs
|
||
- [x] DNS for API and apps as in `talos/README.md`
|
||
- [x] Git remote ready for Argo CD (argo-cd)
|
||
- [x] **`talos/kubeconfig`** from `talosctl kubeconfig` — root repo `kubeconfig` is a stub until populated
|
||
|
||
## Phase A — Talos bootstrap + API VIP
|
||
|
||
- Optional: **Ansible** runs the same steps — [`ansible/playbooks/talos_phase_a.yml`](../ansible/playbooks/talos_phase_a.yml) (genconfig → apply → bootstrap → kubeconfig) or [`ansible/playbooks/deploy.yml`](../ansible/playbooks/deploy.yml) (Phase A + **`noble.yml`**); see [`ansible/README.md`](../ansible/README.md).
|
||
- [x] `talhelper gensecret` → `talhelper genconfig -o out` (re-run `genconfig` after every `talconfig` edit)
|
||
- [x] `apply-config` all nodes (`talos/README.md` §2 — **no** `--insecure` after nodes join; use `TALOSCONFIG`)
|
||
- [x] `talosctl bootstrap` once; other control planes and worker join
|
||
- [x] `talosctl kubeconfig` → working `kubectl` (`talos/README.md` §3 — override `server:` if VIP not reachable from workstation)
|
||
- [x] **kube-vip manifests** in `clusters/noble/bootstrap/kube-vip`
|
||
- [x] kube-vip healthy; `vip_interface` matches uplink (`talosctl get links`); VIP reachable where needed
|
||
- [x] `talosctl health` (e.g. `talosctl health -n 192.168.50.20` with `TALOSCONFIG` set)
|
||
|
||
## Phase B — Core platform
|
||
|
||
**Install order:** **Cilium** → **Volume Snapshot CRDs + snapshot-controller** (`clusters/noble/bootstrap/csi-snapshot-controller/`, Ansible **`noble_csi_snapshot_controller`**) → **metrics-server** → **Longhorn** (Talos disk + Helm) → **MetalLB** (Helm → pool manifests) → ingress / certs / DNS as planned.
|
||
|
||
- [x] **Cilium** (Helm **1.16.6**) — **required** before MetalLB if `cni: none` (`clusters/noble/bootstrap/cilium/`)
|
||
- [x] **CSI Volume Snapshot** — CRDs + **`snapshot-controller`** in **`kube-system`** (`clusters/noble/bootstrap/csi-snapshot-controller/`); Ansible **`noble_csi_snapshot_controller`**; verify `kubectl api-resources | grep VolumeSnapshot`
|
||
- [x] **metrics-server** — Helm **3.13.0**; values in `clusters/noble/bootstrap/metrics-server/values.yaml`; verify `kubectl top nodes`
|
||
- [x] **Longhorn** — Talos: user volume + kubelet mounts + extensions (`talos/README.md` §5); Helm **1.11.1**; `kubectl apply -k clusters/noble/bootstrap/longhorn`; verify **`nodes.longhorn.io`** and test PVC **`Bound`**
|
||
- [x] **MetalLB** — chart installed; **pool + L2** from `clusters/noble/bootstrap/metallb/` applied (`192.168.50.210`–`229`)
|
||
- [x] **`Service` `LoadBalancer`** / pool check — MetalLB assigns from `210`–`229` (validated before Traefik; temporary nginx test removed in favor of Traefik)
|
||
- [x] **Traefik** `LoadBalancer` for `*.apps.noble.lab.pcenicni.dev` — `clusters/noble/bootstrap/traefik/`; **`192.168.50.211`**
|
||
- [x] **cert-manager** + ClusterIssuer (**`letsencrypt-staging`** / **`letsencrypt-prod`**) — `clusters/noble/bootstrap/cert-manager/`
|
||
- [x] **Newt** (Pangolin tunnel; replaces ExternalDNS for public DNS) — `clusters/noble/bootstrap/newt/` — **`newt-pangolin-auth`**; CNAME + **Integration API** per **`newt/README.md`**
|
||
|
||
## Phase C — GitOps
|
||
|
||
- [x] **Argo CD** bootstrap — `clusters/noble/bootstrap/argocd/` (`helm upgrade --install argocd …`) — also covered by **`ansible/playbooks/noble.yml`** (role **`noble_argocd`**)
|
||
- [x] Argo CD server **LoadBalancer** — **`192.168.50.210`** (see `values.yaml`)
|
||
- [x] **App-of-apps** — optional; **`clusters/noble/apps/kustomization.yaml`** is **empty** (core stack is **Ansible**-managed from **`clusters/noble/bootstrap/`**, not Argo). Set **`repoURL`** in **`root-application.yaml`** and add **`Application`** manifests only for optional GitOps workloads — see **`clusters/noble/apps/README.md`**
|
||
- [x] **Renovate** — **`renovate.json`** at repo root ([Renovate](https://docs.renovatebot.com/) — **Kubernetes** manager for **`clusters/noble/**/*.yaml`** image pins; grouped minor/patch PRs). **Activate PRs:** install **[Mend Renovate](https://github.com/apps/renovate)** on the Git repo (**Option A**), or **Option B:** self-hosted chart per [Helm charts](https://docs.renovatebot.com/helm-charts/) + token from **SOPS** or a one-off Secret. Helm **chart** versions pinned only in comments still need manual bumps or extra **regex** `customManagers` — extend **`renovate.json`** as needed.
|
||
- [ ] SSO — later
|
||
|
||
## Phase D — Observability
|
||
|
||
- [x] **kube-prometheus-stack** — `kubectl apply -f clusters/noble/bootstrap/kube-prometheus-stack/namespace.yaml` then **`helm upgrade --install`** as in `clusters/noble/bootstrap/kube-prometheus-stack/values.yaml` (chart **82.15.1**); PVCs **`longhorn`**; **`--wait --timeout 30m`** recommended; verify **`kubectl -n monitoring get pods,pvc`**
|
||
- [x] **Loki** + **Fluent Bit** + **Grafana Loki datasource** — **order:** **`kubectl apply -f clusters/noble/bootstrap/loki/namespace.yaml`** → **`helm upgrade --install loki`** `grafana/loki` **6.55.0** `-f clusters/noble/bootstrap/loki/values.yaml` → **`kubectl apply -f clusters/noble/bootstrap/fluent-bit/namespace.yaml`** → **`helm upgrade --install fluent-bit`** `fluent/fluent-bit` **0.56.0** `-f clusters/noble/bootstrap/fluent-bit/values.yaml` → **`kubectl apply -f clusters/noble/bootstrap/grafana-loki-datasource/loki-datasource.yaml`**. Verify **Explore → Loki** in Grafana; **`kubectl -n loki get pods,pvc`**, **`kubectl -n logging get pods`**
|
||
- [x] **Headlamp** — Kubernetes web UI ([Headlamp](https://headlamp.dev/)); **`helm repo add headlamp https://kubernetes-sigs.github.io/headlamp/`**; **`kubectl apply -f clusters/noble/bootstrap/headlamp/namespace.yaml`** → **`helm upgrade --install headlamp headlamp/headlamp --version 0.40.1 -n headlamp -f clusters/noble/bootstrap/headlamp/values.yaml`**; **Ingress** **`https://headlamp.apps.noble.lab.pcenicni.dev`** (**`ingressClassName: traefik`**, **`cert-manager.io/cluster-issuer: letsencrypt-prod`**). **`values.yaml`:** **`config.sessionTTL: null`** works around chart **0.40.1** / binary mismatch ([headlamp#4883](https://github.com/kubernetes-sigs/headlamp/issues/4883)). **RBAC:** chart defaults are permissive — tighten before LAN-wide exposure; align with **Phase G** hardening.
|
||
|
||
## Phase E — Secrets
|
||
|
||
- [x] **SOPS** — encrypt **`Secret`** YAML under **`clusters/noble/secrets/`** with **age** (see **`.sops.yaml`**, **`clusters/noble/secrets/README.md`**); keep **`age-key.txt`** private (gitignored). **`ansible/playbooks/noble.yml`** decrypt-applies **`*.yaml`** when **`age-key.txt`** exists.
|
||
|
||
## Phase F — Policy + backups
|
||
|
||
- [x] **Kyverno** baseline policies — `clusters/noble/bootstrap/kyverno/` (Helm **kyverno** **3.7.1** + **kyverno-policies** **3.7.1**, **baseline** / **Audit** — see **`README.md`**)
|
||
- [ ] **Velero** — manifests + Ansible **`noble_velero`** (`clusters/noble/bootstrap/velero/`); enable with **`noble_velero_install: true`** + S3 bucket/URL + **`velero/velero-cloud-credentials`** (see **`velero/README.md`**); optional backup/restore drill
|
||
|
||
## Phase G — Hardening
|
||
|
||
- [x] **Runbooks** — **`talos/runbooks/`** (API VIP / kube-vip, etcd–Talos, Longhorn, SOPS)
|
||
- [x] **RBAC** — **Headlamp** **`ClusterRoleBinding`** uses built-in **`edit`** (not **`cluster-admin`**); **Argo CD** **`policy.default: role:readonly`** with **`g, admin, role:admin`** — see **`clusters/noble/bootstrap/headlamp/values.yaml`**, **`clusters/noble/bootstrap/argocd/values.yaml`**, **`talos/runbooks/rbac.md`**
|
||
- [ ] **Alertmanager** — add **`slack_configs`**, **`pagerduty_configs`**, or other receivers under **`kube-prometheus-stack`** `alertmanager.config` (chart defaults use **`null`** receiver)
|
||
|
||
## Quick validation
|
||
|
||
- [x] `kubectl get nodes` — all **Ready**
|
||
- [x] API via VIP `:6443` — **`kubectl get --raw /healthz`** → **`ok`** with kubeconfig **`server:`** `https://192.168.50.230:6443`
|
||
- [x] Ingress **`LoadBalancer`** in pool `210`–`229` (**Traefik** → **`192.168.50.211`**)
|
||
- [x] **Argo CD** UI — **`argocd-server`** **`LoadBalancer`** **`192.168.50.210`** (initial **`admin`** password from **`argocd-initial-admin-secret`**)
|
||
- [x] **Renovate** — **`renovate.json`** committed; **enable** [Mend Renovate](https://github.com/apps/renovate) **or** self-hosted bot for PRs
|
||
- [ ] Sample Ingress + cert (cert-manager ready) + Pangolin resource + CNAME
|
||
- [x] PVC **`Bound`** on **Longhorn** (`storageClassName: longhorn`); Prometheus/Loki durable when configured
|
||
- [x] **`monitoring`** — **kube-prometheus-stack** core workloads **Running** (Prometheus, Grafana, Alertmanager, operator, kube-state-metrics, node-exporter); PVCs **Bound** on **longhorn**
|
||
- [x] **`loki`** — **Loki** SingleBinary + **gateway** **Running**; **`loki`** PVC **Bound** on **longhorn** (no chunks-cache by design)
|
||
- [x] **`logging`** — **Fluent Bit** DaemonSet **Running** on all nodes (logs → **Loki**)
|
||
- [x] **Grafana** — **Loki** datasource from **`grafana-loki-datasource`** ConfigMap (**Explore** works after apply + sidecar sync)
|
||
- [x] **Headlamp** — Deployment **Running** in **`headlamp`**; UI at **`https://headlamp.apps.noble.lab.pcenicni.dev`** (TLS via **`letsencrypt-prod`**)
|
||
- [x] **SOPS secrets** — **`clusters/noble/secrets/*.yaml`** encrypted in git; **`noble.yml`** applies decrypted manifests when **`age-key.txt`** is present
|
||
- [x] **`kyverno`** — admission / background / cleanup / reports controllers **Running** in **`kyverno`**; **ClusterPolicies** for **PSS baseline** **Ready** (**Audit**)
|
||
- [ ] **`velero`** — when enabled: Deployment **Running** in **`velero`**; **`BackupStorageLocation`** / **`VolumeSnapshotLocation`** **Available**; test backup per **`velero/README.md`**
|
||
- [x] **Phase G (partial)** — **`talos/runbooks/`** (incl. **RBAC**); **Headlamp**/**Argo CD** RBAC tightened — **Alertmanager** receivers still optional
|
||
|
||
---
|
||
|
||
*Keep in sync with `talos/README.md` and manifests under `clusters/noble/`.*
|