From d2b52f351860f9f97d187e3ee8f5f03870e67734 Mon Sep 17 00:00:00 2001
From: Nikholas Pcenicni <82239765+nikpcenicni@users.noreply.github.com>
Date: Sat, 28 Mar 2026 00:59:08 -0400
Subject: [PATCH] CLUSTER-BUILD.md: mark Phase D (observability) complete;
 next up Phase E (secrets)

Update the noble cluster's current-state notes: clarify the
kube-prometheus-stack and Loki configurations, update the deployment
instructions for Grafana and Fluent Bit integration, mark the
observability tasks as completed, and outline the remaining phases.

---
 talos/CLUSTER-BUILD.md | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/talos/CLUSTER-BUILD.md b/talos/CLUSTER-BUILD.md
index c0a2279..ac4c8c9 100644
--- a/talos/CLUSTER-BUILD.md
+++ b/talos/CLUSTER-BUILD.md
@@ -4,6 +4,8 @@ This document is the **exported TODO** for the **noble** Talos cluster (4 nodes)
 
 ## Current state (2026-03-28)
 
+Lab stack is **up** on-cluster, bootstrapped through **Phase D** (observability), with manifests matching this repo. **Next focus:** Phase E (secrets), optional Pangolin/sample Ingress validation, Velero when S3 exists.
+
 - **Talos** v1.12.6 (target) / **Kubernetes** as bundled — four nodes **Ready** unless upgrading; **`talosctl health`**; **`talos/kubeconfig`** for `kubectl` (root `kubeconfig` may still be a stub). **Image Factory (nocloud installer):** `factory.talos.dev/nocloud-installer/249d9135de54962744e917cfe654117000cba369f9152fbab9d055a00aa3664f:v1.12.6`
 - **Cilium** Helm **1.16.6** / app **1.16.6** (`clusters/noble/apps/cilium/`, phase 1 values).
 - **MetalLB** Helm **0.15.3** / app **v0.15.3**; **IPAddressPool** `noble-l2` + **L2Advertisement** — pool **`192.168.50.210`–`192.168.50.229`**.
@@ -14,9 +16,9 @@ This document is the **exported TODO** for the **noble** Talos cluster (4 nodes)
 - **cert-manager** Helm **v1.20.0** / app **v1.20.0** — `clusters/noble/apps/cert-manager/`; **`ClusterIssuer`** **`letsencrypt-staging`** and **`letsencrypt-prod`** (HTTP-01, ingress class **`traefik`**); ACME email **`certificates@noble.lab.pcenicni.dev`** (edit in manifests if you want a different mailbox).
 - **Newt** Helm **1.2.0** / app **1.10.1** — `clusters/noble/apps/newt/` (**fossorial/newt**); Pangolin site tunnel — **`newt-pangolin-auth`** Secret (**`PANGOLIN_ENDPOINT`**, **`NEWT_ID`**, **`NEWT_SECRET`**). **Public DNS** is **not** automated with ExternalDNS: **CNAME** records at your DNS host per Pangolin’s domain instructions, plus **Integration API** for HTTP resources/targets — see **`clusters/noble/apps/newt/README.md`**. LAN access to Traefik can still use **`*.apps.noble.lab.pcenicni.dev`** → **`192.168.50.211`** (split horizon / local resolver).
 - **Argo CD** Helm **9.4.17** / app **v3.3.6** — `clusters/noble/bootstrap/argocd/`; **`argocd-server`** **`LoadBalancer`** **`192.168.50.210`**; app-of-apps scaffold under **`bootstrap/argocd/apps/`** (edit **`root-application.yaml`** `repoURL` before applying).
-- **kube-prometheus-stack** — Helm chart **82.15.1** — `clusters/noble/apps/kube-prometheus-stack/` (**namespace** `monitoring`, PSA **privileged** — **node-exporter** needs host mounts); **Longhorn** PVCs for Prometheus, Grafana, Alertmanager. **Grafana Ingress:** **`https://grafana.apps.noble.lab.pcenicni.dev`** (Traefik **`ingressClassName: traefik`**, **`cert-manager.io/cluster-issuer: letsencrypt-prod`**). **Loki** in Grafana: ConfigMap **`clusters/noble/apps/grafana-loki-datasource/loki-datasource.yaml`** (sidecar label **`grafana_datasource`**) — apply after **Loki** is running; does not use **`grafana.additionalDataSources`** on the chart. **`helm upgrade --install` with `--wait` is silent until done** — use **`--timeout 30m`** (not `5m`) and watch **`kubectl -n monitoring get pods -w`** in another terminal. Grafana admin password: Secret **`kube-prometheus-grafana`**, keys **`admin-user`** / **`admin-password`**.
-- **Loki** + **Fluent Bit** (manifests in repo) — **`grafana/loki` 6.55.0** SingleBinary + **filesystem** on **Longhorn** (`clusters/noble/apps/loki/`); **`loki.auth_enabled: false`** (single-tenant lab — avoids **`X-Scope-OrgID`** on Grafana/Fluent Bit); **`chunksCache.enabled: false`** (default memcached cache is heavy / often Pending on small nodes). **`fluent/fluent-bit` 0.56.0** tails **`/var/log/containers`** only → **`loki-gateway.loki.svc:80`** (`clusters/noble/apps/fluent-bit/`). **`logging`** namespace PSA **privileged** (hostPath).
-- **Still open:** deploy **Loki** → **Fluent Bit** → **`helm upgrade kube-prometheus`** (Phase D checklist).
+- **kube-prometheus-stack** — Helm chart **82.15.1** — `clusters/noble/apps/kube-prometheus-stack/` (**namespace** `monitoring`, PSA **privileged** — **node-exporter** needs host mounts); **Longhorn** PVCs for Prometheus, Grafana, Alertmanager; **node-exporter** DaemonSet **4/4**. **Grafana Ingress:** **`https://grafana.apps.noble.lab.pcenicni.dev`** (Traefik **`ingressClassName: traefik`**, **`cert-manager.io/cluster-issuer: letsencrypt-prod`**). **Loki** datasource in Grafana: ConfigMap **`clusters/noble/apps/grafana-loki-datasource/loki-datasource.yaml`** (sidecar label **`grafana_datasource: "1"`**) — not via **`grafana.additionalDataSources`** in the chart. **`helm upgrade --install` with `--wait` is silent until done** — use **`--timeout 30m`**; Grafana admin: Secret **`kube-prometheus-grafana`**, keys **`admin-user`** / **`admin-password`**.
+- **Loki** + **Fluent Bit** — **`grafana/loki` 6.55.0** SingleBinary + **filesystem** on **Longhorn** (`clusters/noble/apps/loki/`); **`loki.auth_enabled: false`**; **`chunksCache.enabled: false`** (no memcached chunk cache). **`fluent/fluent-bit` 0.56.0** → **`loki-gateway.loki.svc:80`** (`clusters/noble/apps/fluent-bit/`); **`logging`** PSA **privileged**. **Grafana Explore:** **`kubectl apply -f clusters/noble/apps/grafana-loki-datasource/loki-datasource.yaml`** then **Explore → Loki** (e.g. `{job="fluent-bit"}`).
+- **Still open:** **Phase E** (Sealed Secrets / Vault / ESO); **Phase F–G**; optional **sample Ingress + cert + Pangolin** end-to-end; **Velero** when S3 is ready; **Argo CD SSO**.
 
 ## Inventory
 
@@ -88,7 +90,7 @@ This document is the **exported TODO** for the **noble** Talos cluster (4 nodes)
 2. **MetalLB Helm chart** (CRDs + controller) **before** `kubectl apply -k` on the pool manifests.
 3. **`clusters/noble/apps/metallb/namespace.yaml`** before or merged onto `metallb-system` so Pod Security does not block speaker (see `apps/metallb/README.md`).
 4. **Longhorn:** Talos user volume + extensions in `talconfig.with-longhorn.yaml` (when restored); Helm **`defaultDataPath`** in `clusters/noble/apps/longhorn/values.yaml`.
-5. **Loki → Fluent Bit → Grafana:** deploy **Loki** (`loki-gateway` Service) before **Fluent Bit**; run **`helm upgrade`** on **kube-prometheus-stack** after **Loki** so Grafana provisions the **Loki** datasource.
+5. **Loki → Fluent Bit → Grafana datasource:** deploy **Loki** (`loki-gateway` Service) before **Fluent Bit**; apply **`clusters/noble/apps/grafana-loki-datasource/loki-datasource.yaml`** after **Loki** (sidecar picks up the ConfigMap — no kube-prometheus values change for Loki).
 
 ## Prerequisites (before phases)
 
@@ -134,7 +136,7 @@ This document is the **exported TODO** for the **noble** Talos cluster (4 nodes)
 ## Phase D — Observability
 
 - [x] **kube-prometheus-stack** — `kubectl apply -f clusters/noble/apps/kube-prometheus-stack/namespace.yaml` then **`helm upgrade --install`** as in `clusters/noble/apps/kube-prometheus-stack/values.yaml` (chart **82.15.1**); PVCs **`longhorn`**; **`--wait --timeout 30m`** recommended; verify **`kubectl -n monitoring get pods,pvc`**
-- [ ] **Loki** + **Fluent Bit** + **Grafana Loki datasource** — **order:** **`kubectl apply -f clusters/noble/apps/loki/namespace.yaml`** → **`helm upgrade --install loki`** `grafana/loki` **6.55.0** `-f clusters/noble/apps/loki/values.yaml` → **`kubectl apply -f clusters/noble/apps/fluent-bit/namespace.yaml`** → **`helm upgrade --install fluent-bit`** `fluent/fluent-bit` **0.56.0** `-f clusters/noble/apps/fluent-bit/values.yaml` → **`kubectl apply -f clusters/noble/apps/grafana-loki-datasource/loki-datasource.yaml`**. Verify **Explore → Loki** in Grafana; **`kubectl -n loki get pods,pvc`**, **`kubectl -n logging get pods`**
+- [x] **Loki** + **Fluent Bit** + **Grafana Loki datasource** — **order:** **`kubectl apply -f clusters/noble/apps/loki/namespace.yaml`** → **`helm upgrade --install loki`** `grafana/loki` **6.55.0** `-f clusters/noble/apps/loki/values.yaml` → **`kubectl apply -f clusters/noble/apps/fluent-bit/namespace.yaml`** → **`helm upgrade --install fluent-bit`** `fluent/fluent-bit` **0.56.0** `-f clusters/noble/apps/fluent-bit/values.yaml` → **`kubectl apply -f clusters/noble/apps/grafana-loki-datasource/loki-datasource.yaml`**. Verify **Explore → Loki** in Grafana; **`kubectl -n loki get pods,pvc`**, **`kubectl -n logging get pods`**
 
 ## Phase E — Secrets
 
@@ -160,7 +162,10 @@ This document is the **exported TODO** for the **noble** Talos cluster (4 nodes)
 - [x] **Argo CD** UI — **`argocd-server`** **`LoadBalancer`** **`192.168.50.210`** (initial **`admin`** password from **`argocd-initial-admin-secret`**)
 - [ ] Sample Ingress + cert (cert-manager ready) + Pangolin resource + CNAME
 - [x] PVC **`Bound`** on **Longhorn** (`storageClassName: longhorn`); Prometheus/Loki durable when configured
-- [x] **`monitoring`** — **kube-prometheus-stack** core workloads **Running** (Prometheus, Grafana, Alertmanager, operator, kube-state-metrics); PVCs **Bound** on **longhorn**
+- [x] **`monitoring`** — **kube-prometheus-stack** core workloads **Running** (Prometheus, Grafana, Alertmanager, operator, kube-state-metrics, node-exporter); PVCs **Bound** on **longhorn**
+- [x] **`loki`** — **Loki** SingleBinary + **gateway** **Running**; **`loki`** PVC **Bound** on **longhorn** (no chunks-cache by design)
+- [x] **`logging`** — **Fluent Bit** DaemonSet **Running** on all nodes (logs → **Loki**)
+- [x] **Grafana** — **Loki** datasource from **`grafana-loki-datasource`** ConfigMap (**Explore** works after apply + sidecar sync)
 
 ---
 