diff --git a/clusters/noble/apps/headlamp/README.md b/clusters/noble/apps/headlamp/README.md index 76840ef..f5e6bea 100644 --- a/clusters/noble/apps/headlamp/README.md +++ b/clusters/noble/apps/headlamp/README.md @@ -2,7 +2,7 @@ [Headlamp](https://headlamp.dev/) web UI for the cluster. Exposed on **`https://headlamp.apps.noble.lab.pcenicni.dev`** via **Traefik** + **cert-manager** (`letsencrypt-prod`), same pattern as Grafana. -- **Chart:** `headlamp/headlamp` **0.40.1** +- **Chart:** `headlamp/headlamp` **0.40.1** (`config.sessionTTL: null` avoids chart/binary mismatch — [issue #4883](https://github.com/kubernetes-sigs/headlamp/issues/4883)) - **Namespace:** `headlamp` ## Install @@ -15,4 +15,4 @@ helm upgrade --install headlamp headlamp/headlamp -n headlamp \ --version 0.40.1 -f clusters/noble/apps/headlamp/values.yaml --wait --timeout 10m ``` -Sign-in uses a **ServiceAccount token** (Headlamp docs: create a limited SA for day-to-day use). The chart’s default **ClusterRole** is powerful — tighten RBAC and/or add **OIDC** in **`values.yaml`** under **`config.oidc`** when hardening (**Phase G**). +Sign-in uses a **ServiceAccount token** (Headlamp docs: create a limited SA for day-to-day use). This repo binds the Headlamp workload SA to the built-in **`edit`** ClusterRole (**`clusterRoleBinding.clusterRoleName: edit`** in **`values.yaml`**) — not **`cluster-admin`**. For cluster-scoped admin work, use **`kubectl`** with your admin kubeconfig. Optional **OIDC** in **`config.oidc`** replaces token login for SSO. diff --git a/clusters/noble/apps/headlamp/values.yaml b/clusters/noble/apps/headlamp/values.yaml index 695dcf3..916b58a 100644 --- a/clusters/noble/apps/headlamp/values.yaml +++ b/clusters/noble/apps/headlamp/values.yaml @@ -8,6 +8,18 @@ # # DNS: headlamp.apps.noble.lab.pcenicni.dev → Traefik LB (see talos/CLUSTER-BUILD.md). # Default chart RBAC is broad — restrict for production (Phase G). 
+# Bind Headlamp’s ServiceAccount to the built-in **edit** ClusterRole (not **cluster-admin**). +# For break-glass cluster-admin, use kubectl with your admin kubeconfig — not Headlamp. +# If changing **clusterRoleName** on an existing install, Kubernetes forbids mutating **roleRef**: +# kubectl delete clusterrolebinding headlamp-admin +# helm upgrade … (same command as in the header comments) +clusterRoleBinding: + clusterRoleName: edit +# +# Chart 0.40.1 passes -session-ttl but the v0.40.1 binary does not define it — omit the flag: +# https://github.com/kubernetes-sigs/headlamp/issues/4883 +config: + sessionTTL: null ingress: enabled: true diff --git a/clusters/noble/apps/vault/README.md b/clusters/noble/apps/vault/README.md index 2f94e6c..04d90e5 100644 --- a/clusters/noble/apps/vault/README.md +++ b/clusters/noble/apps/vault/README.md @@ -22,6 +22,16 @@ kubectl -n vault get pods,pvc,svc kubectl -n vault exec -i sts/vault -- vault status ``` +## Cilium network policy (Phase G) + +After **Cilium** is up, optionally restrict HTTP access to the Vault server pods (**TCP 8200**) to **`external-secrets`** and same-namespace clients: + +```bash +kubectl apply -f clusters/noble/apps/vault/cilium-network-policy.yaml +``` + +If you add workloads in other namespaces that call Vault, extend **`ingress`** in that manifest. + ## Initialize and unseal (first time) From a workstation with `kubectl` (or `kubectl exec` into any pod with `vault` CLI): diff --git a/clusters/noble/apps/vault/cilium-network-policy.yaml b/clusters/noble/apps/vault/cilium-network-policy.yaml new file mode 100644 index 0000000..9a910eb --- /dev/null +++ b/clusters/noble/apps/vault/cilium-network-policy.yaml @@ -0,0 +1,33 @@ +# CiliumNetworkPolicy — restrict who may reach Vault HTTP listener (8200). +# Apply after Cilium is healthy: kubectl apply -f clusters/noble/apps/vault/cilium-network-policy.yaml +# +# Ingress-only policy: egress from Vault is unchanged (Kubernetes auth needs API + DNS). 
+# Extend ingress rules if other namespaces must call Vault (e.g. app workloads). +# +# Ref: https://docs.cilium.io/en/stable/security/policy/language/ +--- +apiVersion: cilium.io/v2 +kind: CiliumNetworkPolicy +metadata: + name: vault-http-ingress + namespace: vault +spec: + endpointSelector: + matchLabels: + app.kubernetes.io/name: vault + component: server + ingress: + - fromEndpoints: + - matchLabels: + "k8s:io.kubernetes.pod.namespace": external-secrets + toPorts: + - ports: + - port: "8200" + protocol: TCP + - fromEndpoints: + - matchLabels: + "k8s:io.kubernetes.pod.namespace": vault + toPorts: + - ports: + - port: "8200" + protocol: TCP diff --git a/clusters/noble/bootstrap/argocd/README.md b/clusters/noble/bootstrap/argocd/README.md index 475eef6..f3801f9 100644 --- a/clusters/noble/bootstrap/argocd/README.md +++ b/clusters/noble/bootstrap/argocd/README.md @@ -15,6 +15,8 @@ helm upgrade --install argocd argo/argo-cd \ --wait ``` +**RBAC:** `values.yaml` sets **`policy.default: role:readonly`** and **`g, admin, role:admin`** so the local **`admin`** user keeps full access while future OIDC users default to read-only until you add **`policy.csv`** mappings. + ## 2. UI / CLI address **HTTPS:** `https://argo.apps.noble.lab.pcenicni.dev` (Ingress via Traefik; cert from **`values.yaml`**). diff --git a/clusters/noble/bootstrap/argocd/values.yaml b/clusters/noble/bootstrap/argocd/values.yaml index f76e5dd..b606dab 100644 --- a/clusters/noble/bootstrap/argocd/values.yaml +++ b/clusters/noble/bootstrap/argocd/values.yaml @@ -21,6 +21,13 @@ configs: # TLS terminates at Traefik / cert-manager; Argo CD serves HTTP behind the Ingress. server.insecure: true + # RBAC: default authenticated users to read-only; keep local **admin** as full admin. 
+ # Ref: https://argo-cd.readthedocs.io/en/stable/operator-manual/rbac/ + rbac: + policy.default: role:readonly + policy.csv: | + g, admin, role:admin + server: certificate: enabled: true diff --git a/renovate.json b/renovate.json new file mode 100644 index 0000000..49de23e --- /dev/null +++ b/renovate.json @@ -0,0 +1,20 @@ +{ + "$schema": "https://docs.renovatebot.com/renovate-schema.json", + "extends": ["config:recommended"], + "dependencyDashboard": true, + "timezone": "America/New_York", + "schedule": ["before 4am on Monday"], + "prConcurrentLimit": 5, + "kubernetes": { + "fileMatch": ["^clusters/noble/.+\\.yaml$"] + }, + "packageRules": [ + { + "description": "Group minor/patch image bumps for noble Kubernetes manifests", + "matchManagers": ["kubernetes"], + "matchFileNames": ["clusters/noble/**"], + "matchUpdateTypes": ["minor", "patch"], + "groupName": "noble container images (minor/patch)" + } + ] +} diff --git a/talos/CLUSTER-BUILD.md b/talos/CLUSTER-BUILD.md index a2282e6..102281b 100644 --- a/talos/CLUSTER-BUILD.md +++ b/talos/CLUSTER-BUILD.md @@ -4,7 +4,7 @@ This document is the **exported TODO** for the **noble** Talos cluster (4 nodes) ## Current state (2026-03-28) -Lab stack is **up** on-cluster through **Phase D** (observability), **Phase E** (Sealed Secrets, External Secrets, **Vault** + **`ClusterSecretStore`**), and **Phase F** (**Kyverno** **baseline** PSS **Audit**), with manifests matching this repo. **Next focus:** optional **Headlamp** (Ingress + TLS), **Renovate** (dependency PRs for Helm/manifests), Pangolin/sample Ingress validation, **Phase G**, **Velero** when S3 exists. +Lab stack is **up** on-cluster through **Phase D**–**F** and **Phase G** (Vault **CiliumNetworkPolicy**, **`talos/runbooks/`**). 
**Next focus:** optional **Alertmanager** receivers (Slack/PagerDuty); tighten **RBAC** (Headlamp / cluster-admin); **Cilium** policies for other namespaces as needed; enable **Mend Renovate** for PRs; Pangolin/sample Ingress; **Velero** when S3 exists. - **Talos** v1.12.6 (target) / **Kubernetes** as bundled — four nodes **Ready** unless upgrading; **`talosctl health`**; **`talos/kubeconfig`** is **local only** (gitignored — never commit; regenerate with `talosctl kubeconfig` per `talos/README.md`). **Image Factory (nocloud installer):** `factory.talos.dev/nocloud-installer/249d9135de54962744e917cfe654117000cba369f9152fbab9d055a00aa3664f:v1.12.6` - **Cilium** Helm **1.16.6** / app **1.16.6** (`clusters/noble/apps/cilium/`, phase 1 values). @@ -21,7 +21,7 @@ Lab stack is **up** on-cluster through **Phase D** (observability), **Phase E** - **Sealed Secrets** Helm **2.18.4** / app **0.36.1** — `clusters/noble/apps/sealed-secrets/` (namespace **`sealed-secrets`**); **`kubeseal`** on client should match controller minor (**README**); back up **`sealed-secrets-key`** (see README). - **External Secrets Operator** Helm **2.2.0** / app **v2.2.0** — `clusters/noble/apps/external-secrets/`; Vault **`ClusterSecretStore`** in **`examples/vault-cluster-secret-store.yaml`** (**`http://`** to match Vault listener — apply after Vault **Kubernetes auth**). - **Vault** Helm **0.32.0** / app **1.21.2** — `clusters/noble/apps/vault/` — standalone **file** storage, **Longhorn** PVC; **HTTP** listener (`global.tlsDisable`); optional **CronJob** lab unseal **`unseal-cronjob.yaml`**; **not** initialized in git — run **`vault operator init`** per **`README.md`**. -- **Still open:** **Headlamp** (Helm + Traefik Ingress + **`letsencrypt-prod`**); **Renovate** ([Renovate](https://docs.renovatebot.com/) — dependency bot; hosted app **or** self-hosted on-cluster); **Phase G**; optional **sample Ingress + cert + Pangolin** end-to-end; **Velero** when S3 is ready; **Argo CD SSO**. 
+- **Still open:** **Renovate** — install **[Mend Renovate](https://github.com/apps/renovate)** (or self-host) so PRs run; optional **Alertmanager** notification channels; optional **sample Ingress + cert + Pangolin** end-to-end; **Velero** when S3 is ready; **Argo CD SSO**. ## Inventory @@ -74,6 +74,7 @@ Lab stack is **up** on-cluster through **Phase D** (observability), **Phase E** | Artifact | Path | |----------|------| | This checklist | `talos/CLUSTER-BUILD.md` | +| Operational runbooks (API VIP, etcd, Longhorn, Vault) | `talos/runbooks/` | | Talos quick start + networking + kubeconfig | `talos/README.md` | | talhelper source (active) | `talos/talconfig.yaml` — may be **wipe-phase** (no Longhorn volume) during disk recovery | | Longhorn volume restore | `talos/talconfig.with-longhorn.yaml` — copy to `talconfig.yaml` after GPT wipe (see `talos/README.md` §5) | @@ -93,10 +94,10 @@ Lab stack is **up** on-cluster through **Phase D** (observability), **Phase E** | Fluent Bit → Loki (Helm values) | `clusters/noble/apps/fluent-bit/` — `values.yaml`, `namespace.yaml` | | Sealed Secrets (Helm) | `clusters/noble/apps/sealed-secrets/` — `values.yaml`, `namespace.yaml`, `README.md` | | External Secrets Operator (Helm + Vault store example) | `clusters/noble/apps/external-secrets/` — `values.yaml`, `namespace.yaml`, `README.md`, `examples/vault-cluster-secret-store.yaml` | -| Vault (Helm + optional unseal CronJob) | `clusters/noble/apps/vault/` — `values.yaml`, `namespace.yaml`, `unseal-cronjob.yaml`, `configure-kubernetes-auth.sh`, `README.md` | +| Vault (Helm + optional unseal CronJob) | `clusters/noble/apps/vault/` — `values.yaml`, `namespace.yaml`, `unseal-cronjob.yaml`, `cilium-network-policy.yaml`, `configure-kubernetes-auth.sh`, `README.md` | | Kyverno + PSS baseline policies | `clusters/noble/apps/kyverno/` — `values.yaml`, `policies-values.yaml`, `namespace.yaml`, `README.md` | -| Headlamp (Helm + Ingress) | `clusters/noble/apps/headlamp/` — `values.yaml`, 
`namespace.yaml` (planned — `helm repo add headlamp https://kubernetes-sigs.github.io/headlamp/`) | -| Renovate (repo config + optional self-hosted Helm) | `renovate.json` or `renovate.json5` at repo root (see [Renovate docs](https://docs.renovatebot.com/)); optional `clusters/noble/apps/renovate/` for self-hosted chart + token Secret (**Sealed Secrets** / **ESO** after **Phase E**) | +| Headlamp (Helm + Ingress) | `clusters/noble/apps/headlamp/` — `values.yaml`, `namespace.yaml`, `README.md` | +| Renovate (repo config + optional self-hosted Helm) | **`renovate.json`** at repo root; optional `clusters/noble/apps/renovate/` for self-hosted chart + token Secret (**Sealed Secrets** / **ESO** after **Phase E**) | **Git vs cluster:** manifests and `talconfig` live in git; **`talhelper genconfig -o out`**, bootstrap, Helm, and `kubectl` run on your LAN. See **`talos/README.md`** for workstation reachability (lab LAN/VPN), **`talosctl kubeconfig`** vs Kubernetes `server:` (VIP vs node IP), and **`--insecure`** only in maintenance. @@ -150,14 +151,14 @@ Lab stack is **up** on-cluster through **Phase D** (observability), **Phase E** - [x] **Argo CD** bootstrap — `clusters/noble/bootstrap/argocd/` (`helm upgrade --install argocd …`) - [x] Argo CD server **LoadBalancer** — **`192.168.50.210`** (see `values.yaml`) - [X] **App-of-apps** — set **`repoURL`** in **`root-application.yaml`**, add **`Application`** manifests under **`bootstrap/argocd/apps/`**, apply **`root-application.yaml`** -- [ ] **Renovate** — [Renovate](https://docs.renovatebot.com/) opens PRs for Helm charts, Docker tags, and related bumps. **Option A:** install the **Mend Renovate** app on **GitHub** / **GitLab** for this repo (no cluster). 
**Option B:** self-hosted — **`helm repo add renovate https://docs.renovatebot.com/helm-charts`** or OCI per [Helm charts](https://docs.renovatebot.com/helm-charts/); **`renovate.config`** with token from **Sealed Secrets** / **ESO** (**`clusters/noble/apps/renovate/`** when added). Add **`renovate.json`** (or **`renovate.json5`**) at repo root with **`packageRules`**, **`kubernetes`** / **`helm-values`** file patterns covering **`clusters/noble/`** (Helm **`values.yaml`**, manifests). Verify a dry run or first dependency PR. +- [x] **Renovate** — **`renovate.json`** at repo root ([Renovate](https://docs.renovatebot.com/) — **Kubernetes** manager for **`clusters/noble/**/*.yaml`** image pins; grouped minor/patch PRs). **Activate PRs:** install **[Mend Renovate](https://github.com/apps/renovate)** on the Git repo (**Option A**), or **Option B:** self-hosted chart per [Helm charts](https://docs.renovatebot.com/helm-charts/) + token from **Sealed Secrets** / **ESO**. Helm **chart** versions pinned only in comments still need manual bumps or extra **regex** `customManagers` — extend **`renovate.json`** as needed. 
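The regex `customManagers` escape hatch mentioned above could be sketched as a `renovate.json` fragment — the `# renovate:` hint-comment convention and the capture-group layout below are assumptions (this repo doesn't annotate its values files yet), so treat it as a starting point, not a drop-in config:

```json
{
  "customManagers": [
    {
      "customType": "regex",
      "description": "Hypothetical: bump chart/app versions marked with a `# renovate:` hint comment in noble values files",
      "fileMatch": ["^clusters/noble/.+\\.ya?ml$"],
      "matchStrings": [
        "# renovate: datasource=(?<datasource>\\S+) depName=(?<depName>\\S+)\\n.*?version: (?<currentValue>\\S+)"
      ]
    }
  ]
}
```

Each matched comment must then sit directly above the pinned `version:` line for the named groups (`datasource`, `depName`, `currentValue`) to resolve; verify with a Renovate dry run before merging.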
- [ ] SSO — later ## Phase D — Observability - [x] **kube-prometheus-stack** — `kubectl apply -f clusters/noble/apps/kube-prometheus-stack/namespace.yaml` then **`helm upgrade --install`** as in `clusters/noble/apps/kube-prometheus-stack/values.yaml` (chart **82.15.1**); PVCs **`longhorn`**; **`--wait --timeout 30m`** recommended; verify **`kubectl -n monitoring get pods,pvc`** - [x] **Loki** + **Fluent Bit** + **Grafana Loki datasource** — **order:** **`kubectl apply -f clusters/noble/apps/loki/namespace.yaml`** → **`helm upgrade --install loki`** `grafana/loki` **6.55.0** `-f clusters/noble/apps/loki/values.yaml` → **`kubectl apply -f clusters/noble/apps/fluent-bit/namespace.yaml`** → **`helm upgrade --install fluent-bit`** `fluent/fluent-bit` **0.56.0** `-f clusters/noble/apps/fluent-bit/values.yaml` → **`kubectl apply -f clusters/noble/apps/grafana-loki-datasource/loki-datasource.yaml`**. Verify **Explore → Loki** in Grafana; **`kubectl -n loki get pods,pvc`**, **`kubectl -n logging get pods`** -- [ ] **Headlamp** — Kubernetes web UI ([Headlamp](https://headlamp.dev/)); **`helm repo add headlamp https://kubernetes-sigs.github.io/headlamp/`**; **`kubectl apply -f clusters/noble/apps/headlamp/namespace.yaml`** → **`helm upgrade --install headlamp headlamp/headlamp --version 0.40.1 -n headlamp -f clusters/noble/apps/headlamp/values.yaml`**; **Ingress** **`https://headlamp.apps.noble.lab.pcenicni.dev`** (**`ingressClassName: traefik`**, **`cert-manager.io/cluster-issuer: letsencrypt-prod`**). **RBAC:** chart defaults are permissive — tighten before LAN-wide exposure; align with **Phase G** hardening. 
+- [x] **Headlamp** — Kubernetes web UI ([Headlamp](https://headlamp.dev/)); **`helm repo add headlamp https://kubernetes-sigs.github.io/headlamp/`**; **`kubectl apply -f clusters/noble/apps/headlamp/namespace.yaml`** → **`helm upgrade --install headlamp headlamp/headlamp --version 0.40.1 -n headlamp -f clusters/noble/apps/headlamp/values.yaml`**; **Ingress** **`https://headlamp.apps.noble.lab.pcenicni.dev`** (**`ingressClassName: traefik`**, **`cert-manager.io/cluster-issuer: letsencrypt-prod`**). **`values.yaml`:** **`config.sessionTTL: null`** works around chart **0.40.1** / binary mismatch ([headlamp#4883](https://github.com/kubernetes-sigs/headlamp/issues/4883)). **RBAC:** chart defaults are permissive — tighten before LAN-wide exposure; align with **Phase G** hardening. ## Phase E — Secrets @@ -172,8 +173,10 @@ Lab stack is **up** on-cluster through **Phase D** (observability), **Phase E** ## Phase G — Hardening -- [ ] RBAC, network policies (Cilium), Alertmanager routes -- [ ] Runbooks: API VIP, etcd, Longhorn, Vault +- [x] **Cilium** — Vault **`CiliumNetworkPolicy`** (`clusters/noble/apps/vault/cilium-network-policy.yaml`) — HTTP **8200** from **`external-secrets`** + **`vault`**; extend for other clients as needed +- [x] **Runbooks** — **`talos/runbooks/`** (API VIP / kube-vip, etcd–Talos, Longhorn, Vault) +- [x] **RBAC** — **Headlamp** **`ClusterRoleBinding`** uses built-in **`edit`** (not **`cluster-admin`**); **Argo CD** **`policy.default: role:readonly`** with **`g, admin, role:admin`** — see **`clusters/noble/apps/headlamp/values.yaml`**, **`clusters/noble/bootstrap/argocd/values.yaml`**, **`talos/runbooks/rbac.md`** +- [ ] **Alertmanager** — add **`slack_configs`**, **`pagerduty_configs`**, or other receivers under **`kube-prometheus-stack`** `alertmanager.config` (chart defaults use **`null`** receiver) ## Quick validation @@ -181,18 +184,19 @@ Lab stack is **up** on-cluster through **Phase D** (observability), **Phase E** - [x] API via VIP `:6443` 
— **`kubectl get --raw /healthz`** → **`ok`** with kubeconfig **`server:`** `https://192.168.50.230:6443` - [x] Ingress **`LoadBalancer`** in pool `210`–`229` (**Traefik** → **`192.168.50.211`**) - [x] **Argo CD** UI — **`argocd-server`** **`LoadBalancer`** **`192.168.50.210`** (initial **`admin`** password from **`argocd-initial-admin-secret`**) -- [ ] **Renovate** — hosted app enabled for this repo **or** self-hosted workload **Running** + PRs updating **`clusters/noble/`** manifests as configured +- [x] **Renovate** — **`renovate.json`** committed; **enable** [Mend Renovate](https://github.com/apps/renovate) **or** self-hosted bot for PRs - [ ] Sample Ingress + cert (cert-manager ready) + Pangolin resource + CNAME - [x] PVC **`Bound`** on **Longhorn** (`storageClassName: longhorn`); Prometheus/Loki durable when configured - [x] **`monitoring`** — **kube-prometheus-stack** core workloads **Running** (Prometheus, Grafana, Alertmanager, operator, kube-state-metrics, node-exporter); PVCs **Bound** on **longhorn** - [x] **`loki`** — **Loki** SingleBinary + **gateway** **Running**; **`loki`** PVC **Bound** on **longhorn** (no chunks-cache by design) - [x] **`logging`** — **Fluent Bit** DaemonSet **Running** on all nodes (logs → **Loki**) - [x] **Grafana** — **Loki** datasource from **`grafana-loki-datasource`** ConfigMap (**Explore** works after apply + sidecar sync) -- [ ] **Headlamp** — Deployment **Running** in **`headlamp`**; UI at **`https://headlamp.apps.noble.lab.pcenicni.dev`** (TLS via **`letsencrypt-prod`**) +- [x] **Headlamp** — Deployment **Running** in **`headlamp`**; UI at **`https://headlamp.apps.noble.lab.pcenicni.dev`** (TLS via **`letsencrypt-prod`**) - [x] **`sealed-secrets`** — controller **Deployment** **Running** in **`sealed-secrets`** (install + **`kubeseal`** per **`apps/sealed-secrets/README.md`**) - [x] **`external-secrets`** — controller + webhook + cert-controller **Running** in **`external-secrets`**; apply **`ClusterSecretStore`** after 
Vault **Kubernetes auth** - [x] **`vault`** — **StatefulSet** **Running**, **`data-vault-0`** PVC **Bound** on **longhorn**; **`vault operator init`** + unseal per **`apps/vault/README.md`** - [x] **`kyverno`** — admission / background / cleanup / reports controllers **Running** in **`kyverno`**; **ClusterPolicies** for **PSS baseline** **Ready** (**Audit**) +- [x] **Phase G (partial)** — Vault **`CiliumNetworkPolicy`**; **`talos/runbooks/`** (incl. **RBAC**); **Headlamp**/**Argo CD** RBAC tightened — **Alertmanager** receivers still optional --- diff --git a/talos/README.md b/talos/README.md index 2f62d14..e46c723 100644 --- a/talos/README.md +++ b/talos/README.md @@ -1,6 +1,7 @@ # Talos — noble lab - **Cluster build checklist (exported TODO):** [CLUSTER-BUILD.md](./CLUSTER-BUILD.md) +- **Operational runbooks (API VIP, etcd, Longhorn, Vault):** [runbooks/README.md](./runbooks/README.md) ## Versions diff --git a/talos/runbooks/README.md b/talos/runbooks/README.md new file mode 100644 index 0000000..422fd21 --- /dev/null +++ b/talos/runbooks/README.md @@ -0,0 +1,11 @@ +# Noble lab — operational runbooks + +Short recovery / triage notes for the **noble** Talos cluster. Deep procedures live in [`talos/README.md`](../README.md) and [`talos/CLUSTER-BUILD.md`](../CLUSTER-BUILD.md). 
+ +| Topic | Runbook | +|-------|---------| +| Kubernetes API VIP (kube-vip) | [`api-vip-kube-vip.md`](./api-vip-kube-vip.md) | +| etcd / Talos control plane | [`etcd-talos.md`](./etcd-talos.md) | +| Longhorn storage | [`longhorn.md`](./longhorn.md) | +| Vault (unseal, auth, ESO) | [`vault.md`](./vault.md) | +| RBAC (Headlamp, Argo CD) | [`rbac.md`](./rbac.md) | diff --git a/talos/runbooks/api-vip-kube-vip.md b/talos/runbooks/api-vip-kube-vip.md new file mode 100644 index 0000000..57b72c4 --- /dev/null +++ b/talos/runbooks/api-vip-kube-vip.md @@ -0,0 +1,15 @@ +# Runbook: Kubernetes API VIP (kube-vip) + +**Symptoms:** `kubectl` timeouts, `connection refused` to `https://192.168.50.230:6443`, or nodes `NotReady` while apiserver on a node IP still works. + +**Checks** + +1. VIP and interface align with [`talos/talconfig.yaml`](../talconfig.yaml) (`cluster.network`, `additionalApiServerCertSans`) and [`clusters/noble/apps/kube-vip/`](../../clusters/noble/apps/kube-vip/). +2. `kubectl -n kube-system get pods -l app.kubernetes.io/name=kube-vip -o wide` — DaemonSet should be **Running** on control-plane nodes. +3. From a workstation: `ping 192.168.50.230` (if ICMP allowed) and `curl -k https://192.168.50.230:6443/healthz` or `kubectl get --raw /healthz` with kubeconfig `server:` set to the VIP. +4. `talosctl health` with `TALOSCONFIG` (see [`talos/README.md`](../README.md) §3). + +**Common fixes** + +- Wrong uplink name in kube-vip (`ens18` vs actual): fix manifest, re-apply, verify on node with `talosctl get links`. +- Workstation routing/DNS: use VIP only when reachable; otherwise temporarily point kubeconfig `server:` at a control-plane IP (see README §3). diff --git a/talos/runbooks/etcd-talos.md b/talos/runbooks/etcd-talos.md new file mode 100644 index 0000000..aef8654 --- /dev/null +++ b/talos/runbooks/etcd-talos.md @@ -0,0 +1,16 @@ +# Runbook: etcd / Talos control plane + +**Symptoms:** API flaps, `etcd` alarms, multiple control planes `NotReady`, upgrades stuck. 
+**Checks** + +1. `talosctl health` and `talosctl etcd status` (with `TALOSCONFIG`; target a control-plane node if needed). +2. `kubectl get nodes` — control planes **Ready**; look for disk/memory pressure. +3. Talos version skew: `talosctl version` vs node image in [`talos/talconfig.yaml`](../talconfig.yaml) / Image Factory schematic. + +**Common fixes** + +- One bad control plane: cordon/drain workloads only after confirming quorum; follow Talos maintenance docs for replace/remove. +- Disk full on etcd volume: resolve host disk / system partition (Talos ephemeral vs user volumes per machine config). + +**References:** [Talos etcd](https://www.talos.dev/latest/advanced/etcd-maintenance/), [`talos/README.md`](../README.md). diff --git a/talos/runbooks/longhorn.md b/talos/runbooks/longhorn.md new file mode 100644 index 0000000..3f16feb --- /dev/null +++ b/talos/runbooks/longhorn.md @@ -0,0 +1,16 @@ +# Runbook: Longhorn + +**Symptoms:** PVCs stuck **Pending**, volumes **Faulted**, workloads I/O errors, Longhorn UI/alerts. + +**Checks** + +1. `kubectl -n longhorn-system get pods` and `kubectl get nodes.longhorn.io -o wide`. +2. Talos user disk + extensions for Longhorn (see [`talos/README.md`](../README.md) §5 and `talconfig.with-longhorn.yaml`). +3. `kubectl get sc` — **longhorn** default as expected; PVC events: `kubectl describe pvc <pvc> -n <namespace>`. + +**Common fixes** + +- Node disk pressure / mount missing: fix Talos machine config, reboot node per Talos docs. +- Recovery / GPT wipe scripts: [`talos/scripts/longhorn-gpt-recovery.sh`](../scripts/longhorn-gpt-recovery.sh) and CLUSTER-BUILD notes. + +**References:** [`clusters/noble/apps/longhorn/`](../../clusters/noble/apps/longhorn/), [Longhorn docs](https://longhorn.io/docs/). 
diff --git a/talos/runbooks/rbac.md b/talos/runbooks/rbac.md new file mode 100644 index 0000000..7115961 --- /dev/null +++ b/talos/runbooks/rbac.md @@ -0,0 +1,13 @@ +# Runbook: Kubernetes RBAC (noble) + +**Headlamp** (`clusters/noble/apps/headlamp/values.yaml`): the chart’s **ClusterRoleBinding** uses the built-in **`edit`** ClusterRole — not **`cluster-admin`**. Break-glass changes use **`kubectl`** with an admin kubeconfig. + +**Argo CD** (`clusters/noble/bootstrap/argocd/values.yaml`): **`policy.default: role:readonly`** — new OIDC/Git users get read-only unless you add **`g, <user>, role:admin`** (or another role) in **`configs.rbac.policy.csv`**. Local user **`admin`** stays **`role:admin`** via **`g, admin, role:admin`**. + +**Audits** + +```bash +kubectl get clusterrolebindings -o custom-columns='NAME:.metadata.name,ROLE:.roleRef.name,SA:.subjects[?(@.kind=="ServiceAccount")].name,NS:.subjects[?(@.kind=="ServiceAccount")].namespace' | grep -E 'NAME|cluster-admin|headlamp|argocd' +``` + +**References:** [Headlamp chart RBAC](https://github.com/kubernetes-sigs/headlamp/tree/main/charts/headlamp), [Argo CD RBAC](https://argo-cd.readthedocs.io/en/stable/operator-manual/rbac/). diff --git a/talos/runbooks/vault.md b/talos/runbooks/vault.md new file mode 100644 index 0000000..983b734 --- /dev/null +++ b/talos/runbooks/vault.md @@ -0,0 +1,15 @@ +# Runbook: Vault (in-cluster) + +**Symptoms:** External Secrets **not syncing**, `ClusterSecretStore` **InvalidProviderConfig**, Vault UI/API **503 sealed**, pods **CrashLoop** on auth. + +**Checks** + +1. `kubectl -n vault exec -i sts/vault -- vault status` — **Sealed** / **Initialized**. +2. Unseal key Secret + optional CronJob: [`clusters/noble/apps/vault/README.md`](../../clusters/noble/apps/vault/README.md), `unseal-cronjob.yaml`. +3. 
Kubernetes auth for ESO: [`clusters/noble/apps/vault/configure-kubernetes-auth.sh`](../../clusters/noble/apps/vault/configure-kubernetes-auth.sh) and `kubectl describe clustersecretstore vault`. +4. **Cilium** policy: if Vault is unreachable from `external-secrets`, check [`clusters/noble/apps/vault/cilium-network-policy.yaml`](../../clusters/noble/apps/vault/cilium-network-policy.yaml) and extend `ingress` for new client namespaces. + +**Common fixes** + +- Sealed: `vault operator unseal` or fix auto-unseal CronJob + `vault-unseal-key` Secret. +- **403/invalid role** on ESO: re-run Kubernetes auth setup (issuer/CA/reviewer JWT) per README.
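The `ingress` extension from check 4 can be sketched as one more `fromEndpoints` rule appended under `spec.ingress` in `cilium-network-policy.yaml` — the `my-app` namespace is a hypothetical example client, not something deployed in this repo:

```yaml
# Hypothetical extra ingress rule: allow pods in an example
# namespace "my-app" to reach the Vault HTTP listener on TCP 8200.
- fromEndpoints:
    - matchLabels:
        "k8s:io.kubernetes.pod.namespace": my-app
  toPorts:
    - ports:
        - port: "8200"
          protocol: TCP
```

After applying, confirm reachability from the new namespace (e.g. a `curl http://vault.vault.svc:8200/v1/sys/health` from a test pod) before pointing workloads at Vault.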