Update Headlamp and Vault documentation; enhance RBAC configurations in Argo CD. Modify Headlamp README to clarify sessionTTL handling and ServiceAccount permissions. Add Cilium network policy instructions to Vault README. Update Argo CD values.yaml for default RBAC settings, ensuring local admin retains full access while new users start with read-only permissions. Reflect these changes in CLUSTER-BUILD.md.

This commit is contained in:
Nikholas Pcenicni
2026-03-28 02:02:17 -04:00
parent 906c24b1d5
commit 445a1ac211
15 changed files with 188 additions and 13 deletions

View File

@@ -2,7 +2,7 @@
[Headlamp](https://headlamp.dev/) web UI for the cluster. Exposed on **`https://headlamp.apps.noble.lab.pcenicni.dev`** via **Traefik** + **cert-manager** (`letsencrypt-prod`), same pattern as Grafana.
- **Chart:** `headlamp/headlamp` **0.40.1**
- **Chart:** `headlamp/headlamp` **0.40.1** (`config.sessionTTL: null` avoids chart/binary mismatch — [issue #4883](https://github.com/kubernetes-sigs/headlamp/issues/4883))
- **Namespace:** `headlamp`
## Install
@@ -15,4 +15,4 @@ helm upgrade --install headlamp headlamp/headlamp -n headlamp \
--version 0.40.1 -f clusters/noble/apps/headlamp/values.yaml --wait --timeout 10m
```
Sign-in uses a **ServiceAccount token** (Headlamp docs: create a limited SA for day-to-day use). The charts default **ClusterRole** is powerful — tighten RBAC and/or add **OIDC** in **`values.yaml`** under **`config.oidc`** when hardening (**Phase G**).
Sign-in uses a **ServiceAccount token** (Headlamp docs: create a limited SA for day-to-day use). This repo binds the Headlamp workload SA to the built-in **`edit`** ClusterRole (**`clusterRoleBinding.clusterRoleName: edit`** in **`values.yaml`**) — not **`cluster-admin`**. For cluster-scoped admin work, use **`kubectl`** with your admin kubeconfig. Optional **OIDC** in **`config.oidc`** replaces token login for SSO.

View File

@@ -8,6 +8,18 @@
#
# DNS: headlamp.apps.noble.lab.pcenicni.dev → Traefik LB (see talos/CLUSTER-BUILD.md).
# Default chart RBAC is broad — restrict for production (Phase G).
# Bind Headlamps ServiceAccount to the built-in **edit** ClusterRole (not **cluster-admin**).
# For break-glass cluster-admin, use kubectl with your admin kubeconfig — not Headlamp.
# If changing **clusterRoleName** on an existing install, Kubernetes forbids mutating **roleRef**:
# kubectl delete clusterrolebinding headlamp-admin
# helm upgrade … (same command as in the header comments)
clusterRoleBinding:
clusterRoleName: edit
#
# Chart 0.40.1 passes -session-ttl but the v0.40.1 binary does not define it — omit the flag:
# https://github.com/kubernetes-sigs/headlamp/issues/4883
config:
sessionTTL: null
ingress:
enabled: true

View File

@@ -22,6 +22,16 @@ kubectl -n vault get pods,pvc,svc
kubectl -n vault exec -i sts/vault -- vault status
```
## Cilium network policy (Phase G)
After **Cilium** is up, optionally restrict HTTP access to the Vault server pods (**TCP 8200**) to **`external-secrets`** and same-namespace clients:
```bash
kubectl apply -f clusters/noble/apps/vault/cilium-network-policy.yaml
```
If you add workloads in other namespaces that call Vault, extend **`ingress`** in that manifest.
## Initialize and unseal (first time)
From a workstation with `kubectl` (or `kubectl exec` into any pod with `vault` CLI):

View File

@@ -0,0 +1,33 @@
# CiliumNetworkPolicy — restrict who may reach Vault HTTP listener (8200).
# Apply after Cilium is healthy: kubectl apply -f clusters/noble/apps/vault/cilium-network-policy.yaml
#
# Ingress-only policy: egress from Vault is unchanged (Kubernetes auth needs API + DNS).
# Extend ingress rules if other namespaces must call Vault (e.g. app workloads).
#
# Ref: https://docs.cilium.io/en/stable/security/policy/language/
---
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
name: vault-http-ingress
namespace: vault
spec:
endpointSelector:
matchLabels:
app.kubernetes.io/name: vault
component: server
ingress:
- fromEndpoints:
- matchLabels:
"k8s:io.kubernetes.pod.namespace": external-secrets
toPorts:
- ports:
- port: "8200"
protocol: TCP
- fromEndpoints:
- matchLabels:
"k8s:io.kubernetes.pod.namespace": vault
toPorts:
- ports:
- port: "8200"
protocol: TCP

View File

@@ -15,6 +15,8 @@ helm upgrade --install argocd argo/argo-cd \
--wait
```
**RBAC:** `values.yaml` sets **`policy.default: role:readonly`** and **`g, admin, role:admin`** so the local **`admin`** user keeps full access while future OIDC users default to read-only until you add **`policy.csv`** mappings.
## 2. UI / CLI address
**HTTPS:** `https://argo.apps.noble.lab.pcenicni.dev` (Ingress via Traefik; cert from **`values.yaml`**).

View File

@@ -21,6 +21,13 @@ configs:
# TLS terminates at Traefik / cert-manager; Argo CD serves HTTP behind the Ingress.
server.insecure: true
# RBAC: default authenticated users to read-only; keep local **admin** as full admin.
# Ref: https://argo-cd.readthedocs.io/en/stable/operator-manual/rbac/
rbac:
policy.default: role:readonly
policy.csv: |
g, admin, role:admin
server:
certificate:
enabled: true

20
renovate.json Normal file
View File

@@ -0,0 +1,20 @@
{
"$schema": "https://docs.renovatebot.com/renovate-schema.json",
"extends": ["config:recommended"],
"dependencyDashboard": true,
"timezone": "America/New_York",
"schedule": ["before 4am on Monday"],
"prConcurrentLimit": 5,
"kubernetes": {
"fileMatch": ["^clusters/noble/.+\\.yaml$"]
},
"packageRules": [
{
"description": "Group minor/patch image bumps for noble Kubernetes manifests",
"matchManagers": ["kubernetes"],
"matchFileNames": ["clusters/noble/**"],
"matchUpdateTypes": ["minor", "patch"],
"groupName": "noble container images (minor/patch)"
}
]
}

View File

@@ -4,7 +4,7 @@ This document is the **exported TODO** for the **noble** Talos cluster (4 nodes)
## Current state (2026-03-28)
Lab stack is **up** on-cluster through **Phase D** (observability), **Phase E** (Sealed Secrets, External Secrets, **Vault** + **`ClusterSecretStore`**), and **Phase F** (**Kyverno** **baseline** PSS **Audit**), with manifests matching this repo. **Next focus:** optional **Headlamp** (Ingress + TLS), **Renovate** (dependency PRs for Helm/manifests), Pangolin/sample Ingress validation, **Phase G**, **Velero** when S3 exists.
Lab stack is **up** on-cluster through **Phase D****F** and **Phase G** (Vault **CiliumNetworkPolicy**, **`talos/runbooks/`**). **Next focus:** optional **Alertmanager** receivers (Slack/PagerDuty); tighten **RBAC** (Headlamp / cluster-admin); **Cilium** policies for other namespaces as needed; enable **Mend Renovate** for PRs; Pangolin/sample Ingress; **Velero** when S3 exists.
- **Talos** v1.12.6 (target) / **Kubernetes** as bundled — four nodes **Ready** unless upgrading; **`talosctl health`**; **`talos/kubeconfig`** is **local only** (gitignored — never commit; regenerate with `talosctl kubeconfig` per `talos/README.md`). **Image Factory (nocloud installer):** `factory.talos.dev/nocloud-installer/249d9135de54962744e917cfe654117000cba369f9152fbab9d055a00aa3664f:v1.12.6`
- **Cilium** Helm **1.16.6** / app **1.16.6** (`clusters/noble/apps/cilium/`, phase 1 values).
@@ -21,7 +21,7 @@ Lab stack is **up** on-cluster through **Phase D** (observability), **Phase E**
- **Sealed Secrets** Helm **2.18.4** / app **0.36.1**`clusters/noble/apps/sealed-secrets/` (namespace **`sealed-secrets`**); **`kubeseal`** on client should match controller minor (**README**); back up **`sealed-secrets-key`** (see README).
- **External Secrets Operator** Helm **2.2.0** / app **v2.2.0**`clusters/noble/apps/external-secrets/`; Vault **`ClusterSecretStore`** in **`examples/vault-cluster-secret-store.yaml`** (**`http://`** to match Vault listener — apply after Vault **Kubernetes auth**).
- **Vault** Helm **0.32.0** / app **1.21.2**`clusters/noble/apps/vault/` — standalone **file** storage, **Longhorn** PVC; **HTTP** listener (`global.tlsDisable`); optional **CronJob** lab unseal **`unseal-cronjob.yaml`**; **not** initialized in git — run **`vault operator init`** per **`README.md`**.
- **Still open:** **Headlamp** (Helm + Traefik Ingress + **`letsencrypt-prod`**); **Renovate** ([Renovate](https://docs.renovatebot.com/) — dependency bot; hosted app **or** self-hosted on-cluster); **Phase G**; optional **sample Ingress + cert + Pangolin** end-to-end; **Velero** when S3 is ready; **Argo CD SSO**.
- **Still open:** **Renovate** — install **[Mend Renovate](https://github.com/apps/renovate)** (or self-host) so PRs run; optional **Alertmanager** notification channels; optional **sample Ingress + cert + Pangolin** end-to-end; **Velero** when S3 is ready; **Argo CD SSO**.
## Inventory
@@ -74,6 +74,7 @@ Lab stack is **up** on-cluster through **Phase D** (observability), **Phase E**
| Artifact | Path |
|----------|------|
| This checklist | `talos/CLUSTER-BUILD.md` |
| Operational runbooks (API VIP, etcd, Longhorn, Vault) | `talos/runbooks/` |
| Talos quick start + networking + kubeconfig | `talos/README.md` |
| talhelper source (active) | `talos/talconfig.yaml` — may be **wipe-phase** (no Longhorn volume) during disk recovery |
| Longhorn volume restore | `talos/talconfig.with-longhorn.yaml` — copy to `talconfig.yaml` after GPT wipe (see `talos/README.md` §5) |
@@ -93,10 +94,10 @@ Lab stack is **up** on-cluster through **Phase D** (observability), **Phase E**
| Fluent Bit → Loki (Helm values) | `clusters/noble/apps/fluent-bit/``values.yaml`, `namespace.yaml` |
| Sealed Secrets (Helm) | `clusters/noble/apps/sealed-secrets/``values.yaml`, `namespace.yaml`, `README.md` |
| External Secrets Operator (Helm + Vault store example) | `clusters/noble/apps/external-secrets/``values.yaml`, `namespace.yaml`, `README.md`, `examples/vault-cluster-secret-store.yaml` |
| Vault (Helm + optional unseal CronJob) | `clusters/noble/apps/vault/``values.yaml`, `namespace.yaml`, `unseal-cronjob.yaml`, `configure-kubernetes-auth.sh`, `README.md` |
| Vault (Helm + optional unseal CronJob) | `clusters/noble/apps/vault/``values.yaml`, `namespace.yaml`, `unseal-cronjob.yaml`, `cilium-network-policy.yaml`, `configure-kubernetes-auth.sh`, `README.md` |
| Kyverno + PSS baseline policies | `clusters/noble/apps/kyverno/``values.yaml`, `policies-values.yaml`, `namespace.yaml`, `README.md` |
| Headlamp (Helm + Ingress) | `clusters/noble/apps/headlamp/``values.yaml`, `namespace.yaml` (planned — `helm repo add headlamp https://kubernetes-sigs.github.io/headlamp/`) |
| Renovate (repo config + optional self-hosted Helm) | `renovate.json` or `renovate.json5` at repo root (see [Renovate docs](https://docs.renovatebot.com/)); optional `clusters/noble/apps/renovate/` for self-hosted chart + token Secret (**Sealed Secrets** / **ESO** after **Phase E**) |
| Headlamp (Helm + Ingress) | `clusters/noble/apps/headlamp/``values.yaml`, `namespace.yaml`, `README.md` |
| Renovate (repo config + optional self-hosted Helm) | **`renovate.json`** at repo root; optional `clusters/noble/apps/renovate/` for self-hosted chart + token Secret (**Sealed Secrets** / **ESO** after **Phase E**) |
**Git vs cluster:** manifests and `talconfig` live in git; **`talhelper genconfig -o out`**, bootstrap, Helm, and `kubectl` run on your LAN. See **`talos/README.md`** for workstation reachability (lab LAN/VPN), **`talosctl kubeconfig`** vs Kubernetes `server:` (VIP vs node IP), and **`--insecure`** only in maintenance.
@@ -150,14 +151,14 @@ Lab stack is **up** on-cluster through **Phase D** (observability), **Phase E**
- [x] **Argo CD** bootstrap — `clusters/noble/bootstrap/argocd/` (`helm upgrade --install argocd …`)
- [x] Argo CD server **LoadBalancer****`192.168.50.210`** (see `values.yaml`)
- [X] **App-of-apps** — set **`repoURL`** in **`root-application.yaml`**, add **`Application`** manifests under **`bootstrap/argocd/apps/`**, apply **`root-application.yaml`**
- [ ] **Renovate** — [Renovate](https://docs.renovatebot.com/) opens PRs for Helm charts, Docker tags, and related bumps. **Option A:** install the **Mend Renovate** app on **GitHub** / **GitLab** for this repo (no cluster). **Option B:** self-hosted — **`helm repo add renovate https://docs.renovatebot.com/helm-charts`** or OCI per [Helm charts](https://docs.renovatebot.com/helm-charts/); **`renovate.config`** with token from **Sealed Secrets** / **ESO** (**`clusters/noble/apps/renovate/`** when added). Add **`renovate.json`** (or **`renovate.json5`**) at repo root with **`packageRules`**, **`kubernetes`** / **`helm-values`** file patterns covering **`clusters/noble/`** (Helm **`values.yaml`**, manifests). Verify a dry run or first dependency PR.
- [x] **Renovate****`renovate.json`** at repo root ([Renovate](https://docs.renovatebot.com/) **Kubernetes** manager for **`clusters/noble/**/*.yaml`** image pins; grouped minor/patch PRs). **Activate PRs:** install **[Mend Renovate](https://github.com/apps/renovate)** on the Git repo (**Option A**), or **Option B:** self-hosted chart per [Helm charts](https://docs.renovatebot.com/helm-charts/) + token from **Sealed Secrets** / **ESO**. Helm **chart** versions pinned only in comments still need manual bumps or extra **regex** `customManagers` — extend **`renovate.json`** as needed.
- [ ] SSO — later
## Phase D — Observability
- [x] **kube-prometheus-stack**`kubectl apply -f clusters/noble/apps/kube-prometheus-stack/namespace.yaml` then **`helm upgrade --install`** as in `clusters/noble/apps/kube-prometheus-stack/values.yaml` (chart **82.15.1**); PVCs **`longhorn`**; **`--wait --timeout 30m`** recommended; verify **`kubectl -n monitoring get pods,pvc`**
- [x] **Loki** + **Fluent Bit** + **Grafana Loki datasource****order:** **`kubectl apply -f clusters/noble/apps/loki/namespace.yaml`** → **`helm upgrade --install loki`** `grafana/loki` **6.55.0** `-f clusters/noble/apps/loki/values.yaml`**`kubectl apply -f clusters/noble/apps/fluent-bit/namespace.yaml`** → **`helm upgrade --install fluent-bit`** `fluent/fluent-bit` **0.56.0** `-f clusters/noble/apps/fluent-bit/values.yaml`**`kubectl apply -f clusters/noble/apps/grafana-loki-datasource/loki-datasource.yaml`**. Verify **Explore → Loki** in Grafana; **`kubectl -n loki get pods,pvc`**, **`kubectl -n logging get pods`**
- [ ] **Headlamp** — Kubernetes web UI ([Headlamp](https://headlamp.dev/)); **`helm repo add headlamp https://kubernetes-sigs.github.io/headlamp/`**; **`kubectl apply -f clusters/noble/apps/headlamp/namespace.yaml`** → **`helm upgrade --install headlamp headlamp/headlamp --version 0.40.1 -n headlamp -f clusters/noble/apps/headlamp/values.yaml`**; **Ingress** **`https://headlamp.apps.noble.lab.pcenicni.dev`** (**`ingressClassName: traefik`**, **`cert-manager.io/cluster-issuer: letsencrypt-prod`**). **RBAC:** chart defaults are permissive — tighten before LAN-wide exposure; align with **Phase G** hardening.
- [x] **Headlamp** — Kubernetes web UI ([Headlamp](https://headlamp.dev/)); **`helm repo add headlamp https://kubernetes-sigs.github.io/headlamp/`**; **`kubectl apply -f clusters/noble/apps/headlamp/namespace.yaml`** → **`helm upgrade --install headlamp headlamp/headlamp --version 0.40.1 -n headlamp -f clusters/noble/apps/headlamp/values.yaml`**; **Ingress** **`https://headlamp.apps.noble.lab.pcenicni.dev`** (**`ingressClassName: traefik`**, **`cert-manager.io/cluster-issuer: letsencrypt-prod`**). **`values.yaml`:** **`config.sessionTTL: null`** works around chart **0.40.1** / binary mismatch ([headlamp#4883](https://github.com/kubernetes-sigs/headlamp/issues/4883)). **RBAC:** chart defaults are permissive — tighten before LAN-wide exposure; align with **Phase G** hardening.
## Phase E — Secrets
@@ -172,8 +173,10 @@ Lab stack is **up** on-cluster through **Phase D** (observability), **Phase E**
## Phase G — Hardening
- [ ] RBAC, network policies (Cilium), Alertmanager routes
- [ ] Runbooks: API VIP, etcd, Longhorn, Vault
- [x] **Cilium** — Vault **`CiliumNetworkPolicy`** (`clusters/noble/apps/vault/cilium-network-policy.yaml`) — HTTP **8200** from **`external-secrets`** + **`vault`**; extend for other clients as needed
- [x] **Runbooks****`talos/runbooks/`** (API VIP / kube-vip, etcdTalos, Longhorn, Vault)
- [x] **RBAC****Headlamp** **`ClusterRoleBinding`** uses built-in **`edit`** (not **`cluster-admin`**); **Argo CD** **`policy.default: role:readonly`** with **`g, admin, role:admin`** — see **`clusters/noble/apps/headlamp/values.yaml`**, **`clusters/noble/bootstrap/argocd/values.yaml`**, **`talos/runbooks/rbac.md`**
- [ ] **Alertmanager** — add **`slack_configs`**, **`pagerduty_configs`**, or other receivers under **`kube-prometheus-stack`** `alertmanager.config` (chart defaults use **`null`** receiver)
## Quick validation
@@ -181,18 +184,19 @@ Lab stack is **up** on-cluster through **Phase D** (observability), **Phase E**
- [x] API via VIP `:6443`**`kubectl get --raw /healthz`** → **`ok`** with kubeconfig **`server:`** `https://192.168.50.230:6443`
- [x] Ingress **`LoadBalancer`** in pool `210``229` (**Traefik** → **`192.168.50.211`**)
- [x] **Argo CD** UI — **`argocd-server`** **`LoadBalancer`** **`192.168.50.210`** (initial **`admin`** password from **`argocd-initial-admin-secret`**)
- [ ] **Renovate**hosted app enabled for this repo **or** self-hosted workload **Running** + PRs updating **`clusters/noble/`** manifests as configured
- [x] **Renovate****`renovate.json`** committed; **enable** [Mend Renovate](https://github.com/apps/renovate) **or** self-hosted bot for PRs
- [ ] Sample Ingress + cert (cert-manager ready) + Pangolin resource + CNAME
- [x] PVC **`Bound`** on **Longhorn** (`storageClassName: longhorn`); Prometheus/Loki durable when configured
- [x] **`monitoring`** — **kube-prometheus-stack** core workloads **Running** (Prometheus, Grafana, Alertmanager, operator, kube-state-metrics, node-exporter); PVCs **Bound** on **longhorn**
- [x] **`loki`** — **Loki** SingleBinary + **gateway** **Running**; **`loki`** PVC **Bound** on **longhorn** (no chunks-cache by design)
- [x] **`logging`** — **Fluent Bit** DaemonSet **Running** on all nodes (logs → **Loki**)
- [x] **Grafana****Loki** datasource from **`grafana-loki-datasource`** ConfigMap (**Explore** works after apply + sidecar sync)
- [ ] **Headlamp** — Deployment **Running** in **`headlamp`**; UI at **`https://headlamp.apps.noble.lab.pcenicni.dev`** (TLS via **`letsencrypt-prod`**)
- [x] **Headlamp** — Deployment **Running** in **`headlamp`**; UI at **`https://headlamp.apps.noble.lab.pcenicni.dev`** (TLS via **`letsencrypt-prod`**)
- [x] **`sealed-secrets`** — controller **Deployment** **Running** in **`sealed-secrets`** (install + **`kubeseal`** per **`apps/sealed-secrets/README.md`**)
- [x] **`external-secrets`** — controller + webhook + cert-controller **Running** in **`external-secrets`**; apply **`ClusterSecretStore`** after Vault **Kubernetes auth**
- [x] **`vault`** — **StatefulSet** **Running**, **`data-vault-0`** PVC **Bound** on **longhorn**; **`vault operator init`** + unseal per **`apps/vault/README.md`**
- [x] **`kyverno`** — admission / background / cleanup / reports controllers **Running** in **`kyverno`**; **ClusterPolicies** for **PSS baseline** **Ready** (**Audit**)
- [x] **Phase G (partial)** — Vault **`CiliumNetworkPolicy`**; **`talos/runbooks/`** (incl. **RBAC**); **Headlamp**/**Argo CD** RBAC tightened — **Alertmanager** receivers still optional
---

View File

@@ -1,6 +1,7 @@
# Talos — noble lab
- **Cluster build checklist (exported TODO):** [CLUSTER-BUILD.md](./CLUSTER-BUILD.md)
- **Operational runbooks (API VIP, etcd, Longhorn, Vault):** [runbooks/README.md](./runbooks/README.md)
## Versions

11
talos/runbooks/README.md Normal file
View File

@@ -0,0 +1,11 @@
# Noble lab — operational runbooks
Short recovery / triage notes for the **noble** Talos cluster. Deep procedures live in [`talos/README.md`](../README.md) and [`talos/CLUSTER-BUILD.md`](../CLUSTER-BUILD.md).
| Topic | Runbook |
|-------|---------|
| Kubernetes API VIP (kube-vip) | [`api-vip-kube-vip.md`](./api-vip-kube-vip.md) |
| etcd / Talos control plane | [`etcd-talos.md`](./etcd-talos.md) |
| Longhorn storage | [`longhorn.md`](./longhorn.md) |
| Vault (unseal, auth, ESO) | [`vault.md`](./vault.md) |
| RBAC (Headlamp, Argo CD) | [`rbac.md`](./rbac.md) |

View File

@@ -0,0 +1,15 @@
# Runbook: Kubernetes API VIP (kube-vip)
**Symptoms:** `kubectl` timeouts, `connection refused` to `https://192.168.50.230:6443`, or nodes `NotReady` while apiserver on a node IP still works.
**Checks**
1. VIP and interface align with [`talos/talconfig.yaml`](../talconfig.yaml) (`cluster.network`, `additionalApiServerCertSans`) and [`clusters/noble/apps/kube-vip/`](../../clusters/noble/apps/kube-vip/).
2. `kubectl -n kube-system get pods -l app.kubernetes.io/name=kube-vip -o wide` — DaemonSet should be **Running** on control-plane nodes.
3. From a workstation: `ping 192.168.50.230` (if ICMP allowed) and `curl -k https://192.168.50.230:6443/healthz` or `kubectl get --raw /healthz` with kubeconfig `server:` set to the VIP.
4. `talosctl health` with `TALOSCONFIG` (see [`talos/README.md`](../README.md) §3).
**Common fixes**
- Wrong uplink name in kube-vip (`ens18` vs actual): fix manifest, re-apply, verify on node with `talosctl get links`.
- Workstation routing/DNS: use VIP only when reachable; otherwise temporarily point kubeconfig `server:` at a control-plane IP (see README §3).

View File

@@ -0,0 +1,16 @@
# Runbook: etcd / Talos control plane
**Symptoms:** API flaps, `etcd` alarms, multiple control planes `NotReady`, upgrades stuck.
**Checks**
1. `talosctl health` and `talosctl etcd status` (with `TALOSCONFIG`; target a control-plane node if needed).
2. `kubectl get nodes` — control planes **Ready**; look for disk/memory pressure.
3. Talos version skew: `talosctl version` vs node image in [`talos/talconfig.yaml`](../talconfig.yaml) / Image Factory schematic.
**Common fixes**
- One bad control plane: cordon/drain workloads only after confirming quorum; follow Talos maintenance docs for replace/remove.
- Disk full on etcd volume: resolve host disk / system partition (Talos ephemeral vs user volumes per machine config).
**References:** [Talos etcd](https://www.talos.dev/latest/advanced/etcd-maintenance/), [`talos/README.md`](../README.md).

View File

@@ -0,0 +1,16 @@
# Runbook: Longhorn
**Symptoms:** PVCs stuck **Pending**, volumes **Faulted**, workloads I/O errors, Longhorn UI/alerts.
**Checks**
1. `kubectl -n longhorn-system get pods` and `kubectl get nodes.longhorn.io -o wide`.
2. Talos user disk + extensions for Longhorn (see [`talos/README.md`](../README.md) §5 and `talconfig.with-longhorn.yaml`).
3. `kubectl get sc`**longhorn** default as expected; PVC events: `kubectl describe pvc -n <ns> <name>`.
**Common fixes**
- Node disk pressure / mount missing: fix Talos machine config, reboot node per Talos docs.
- Recovery / GPT wipe scripts: [`talos/scripts/longhorn-gpt-recovery.sh`](../scripts/longhorn-gpt-recovery.sh) and CLUSTER-BUILD notes.
**References:** [`clusters/noble/apps/longhorn/`](../../clusters/noble/apps/longhorn/), [Longhorn docs](https://longhorn.io/docs/).

13
talos/runbooks/rbac.md Normal file
View File

@@ -0,0 +1,13 @@
# Runbook: Kubernetes RBAC (noble)
**Headlamp** (`clusters/noble/apps/headlamp/values.yaml`): the charts **ClusterRoleBinding** uses the built-in **`edit`** ClusterRole — not **`cluster-admin`**. Break-glass changes use **`kubectl`** with an admin kubeconfig.
**Argo CD** (`clusters/noble/bootstrap/argocd/values.yaml`): **`policy.default: role:readonly`** — new OIDC/Git users get read-only unless you add **`g, <user-or-group>, role:admin`** (or another role) in **`configs.rbac.policy.csv`**. Local user **`admin`** stays **`role:admin`** via **`g, admin, role:admin`**.
**Audits**
```bash
kubectl get clusterrolebindings -o custom-columns='NAME:.metadata.name,ROLE:.roleRef.name,SA:.subjects[?(@.kind=="ServiceAccount")].name,NS:.subjects[?(@.kind=="ServiceAccount")].namespace' | grep -E 'NAME|cluster-admin|headlamp|argocd'
```
**References:** [Headlamp chart RBAC](https://github.com/kubernetes-sigs/headlamp/tree/main/charts/headlamp), [Argo CD RBAC](https://argo-cd.readthedocs.io/en/stable/operator-manual/rbac/).

15
talos/runbooks/vault.md Normal file
View File

@@ -0,0 +1,15 @@
# Runbook: Vault (in-cluster)
**Symptoms:** External Secrets **not syncing**, `ClusterSecretStore` **InvalidProviderConfig**, Vault UI/API **503 sealed**, pods **CrashLoop** on auth.
**Checks**
1. `kubectl -n vault exec -i sts/vault -- vault status`**Sealed** / **Initialized**.
2. Unseal key Secret + optional CronJob: [`clusters/noble/apps/vault/README.md`](../../clusters/noble/apps/vault/README.md), `unseal-cronjob.yaml`.
3. Kubernetes auth for ESO: [`clusters/noble/apps/vault/configure-kubernetes-auth.sh`](../../clusters/noble/apps/vault/configure-kubernetes-auth.sh) and `kubectl describe clustersecretstore vault`.
4. **Cilium** policy: if Vault is unreachable from `external-secrets`, check [`clusters/noble/apps/vault/cilium-network-policy.yaml`](../../clusters/noble/apps/vault/cilium-network-policy.yaml) and extend `ingress` for new client namespaces.
**Common fixes**
- Sealed: `vault operator unseal` or fix auto-unseal CronJob + `vault-unseal-key` Secret.
- **403/invalid role** on ESO: re-run Kubernetes auth setup (issuer/CA/reviewer JWT) per README.