diff --git a/.sops.yaml b/.sops.yaml new file mode 100644 index 0000000..0f67da1 --- /dev/null +++ b/.sops.yaml @@ -0,0 +1,7 @@ +# Mozilla SOPS — encrypt/decrypt Kubernetes Secret manifests under clusters/noble/secrets/ +# Generate a key: age-keygen -o age-key.txt (age-key.txt is gitignored) +# Add the printed public key below (separate multiple recipients with commas, even when listing one per line). +creation_rules: + - path_regex: clusters/noble/secrets/.*\.yaml$ + age: >- + age1juym5p3ez3dkt0dxlznydgfgqvaujfnyk9hpdsssf50hsxeh3p4sjpf3gn diff --git a/ansible/README.md b/ansible/README.md index de4078e..8766a74 100644 --- a/ansible/README.md +++ b/ansible/README.md @@ -24,6 +24,7 @@ Copy **`.env.sample`** to **`.env`** at the repository root (`.env` is gitignore ## Prerequisites - `talosctl` (matches node Talos version), `talhelper`, `helm`, `kubectl`. +- **SOPS secrets:** `sops` and `age` on the control host if you use **`clusters/noble/secrets/`** with **`age-key.txt`** (see **`clusters/noble/secrets/README.md`**). - **Phase A:** same LAN/VPN as nodes so **Talos :50000** and **Kubernetes :6443** are reachable (see [`talos/README.md`](../talos/README.md) §3). - **noble.yml:** bootstrapped cluster and **`talos/kubeconfig`** (or `KUBECONFIG`). @@ -34,7 +35,7 @@ Copy **`.env.sample`** to **`.env`** at the repository root (`.env` is gitignore | [`playbooks/deploy.yml`](playbooks/deploy.yml) | **Talos Phase A** then **`noble.yml`** (full automation). | | [`playbooks/talos_phase_a.yml`](playbooks/talos_phase_a.yml) | `genconfig` → `apply-config` → `bootstrap` → `kubeconfig` only. | | [`playbooks/noble.yml`](playbooks/noble.yml) | Helm + `kubectl` platform (after Phase A). | -| [`playbooks/post_deploy.yml`](playbooks/post_deploy.yml) | Vault / ESO reminders (`noble_apply_vault_cluster_secret_store`). | +| [`playbooks/post_deploy.yml`](playbooks/post_deploy.yml) | SOPS key-backup reminder and optional Argo CD root Application note. 
| | [`playbooks/talos_bootstrap.yml`](playbooks/talos_bootstrap.yml) | **`talhelper genconfig` only** (legacy shortcut; prefer **`talos_phase_a.yml`**). | ```bash @@ -68,9 +69,10 @@ ansible-playbook playbooks/noble.yml --skip-tags newt ansible-playbook playbooks/noble.yml --tags velero -e noble_velero_install=true -e noble_velero_s3_bucket=... -e noble_velero_s3_url=... ``` -### Variables — `group_vars/all.yml` +### Variables — `group_vars/all.yml` and role defaults -- **`noble_newt_install`**, **`noble_velero_install`**, **`noble_cert_manager_require_cloudflare_secret`**, **`noble_apply_vault_cluster_secret_store`**, **`noble_k8s_api_server_override`**, **`noble_k8s_api_server_auto_fallback`**, **`noble_k8s_api_server_fallback`**, **`noble_skip_k8s_health_check`**. +- **`group_vars/all.yml`:** **`noble_newt_install`**, **`noble_velero_install`**, **`noble_cert_manager_require_cloudflare_secret`**, **`noble_k8s_api_server_override`**, **`noble_k8s_api_server_auto_fallback`**, **`noble_k8s_api_server_fallback`**, **`noble_skip_k8s_health_check`** +- **`roles/noble_platform/defaults/main.yml`:** **`noble_apply_sops_secrets`**, **`noble_sops_age_key_file`** (SOPS secrets under **`clusters/noble/secrets/`**) ## Roles diff --git a/ansible/group_vars/all.yml b/ansible/group_vars/all.yml index 6dff5ef..bf33f25 100644 --- a/ansible/group_vars/all.yml +++ b/ansible/group_vars/all.yml @@ -13,14 +13,11 @@ noble_k8s_api_server_fallback: "https://192.168.50.20:6443" # Only if you must skip the kubectl /healthz preflight (not recommended). 
noble_skip_k8s_health_check: false -# Pangolin / Newt — set true only after creating newt-pangolin-auth Secret (see clusters/noble/bootstrap/newt/README.md) +# Pangolin / Newt — set true only after newt-pangolin-auth Secret exists (SOPS: clusters/noble/secrets/ or imperative — see clusters/noble/bootstrap/newt/README.md) noble_newt_install: false # cert-manager needs Secret cloudflare-dns-api-token in cert-manager namespace before ClusterIssuers work noble_cert_manager_require_cloudflare_secret: true -# post_deploy.yml — apply Vault ClusterSecretStore only after Vault is initialized and K8s auth is configured -noble_apply_vault_cluster_secret_store: false - # Velero — set **noble_velero_install: true** plus S3 bucket/URL (and credentials — see clusters/noble/bootstrap/velero/README.md) noble_velero_install: false diff --git a/ansible/playbooks/post_deploy.yml b/ansible/playbooks/post_deploy.yml index 90d1a6c..b9450d2 100644 --- a/ansible/playbooks/post_deploy.yml +++ b/ansible/playbooks/post_deploy.yml @@ -1,12 +1,7 @@ --- -# Manual follow-ups after **noble.yml**: Vault init/unseal, Kubernetes auth for Vault, ESO ClusterSecretStore. -# Run: ansible-playbook playbooks/post_deploy.yml -- name: Noble cluster — post-install reminders - hosts: localhost +# Manual follow-ups after **noble.yml**: SOPS key backup, optional Argo root Application. 
+- hosts: localhost connection: local gather_facts: false - vars: - noble_repo_root: "{{ playbook_dir | dirname | dirname }}" - noble_kubeconfig: "{{ lookup('env', 'KUBECONFIG') | default(noble_repo_root + '/talos/kubeconfig', true) }}" roles: - - role: noble_post_deploy + - noble_post_deploy diff --git a/ansible/roles/helm_repos/defaults/main.yml b/ansible/roles/helm_repos/defaults/main.yml index d635baa..f543ed3 100644 --- a/ansible/roles/helm_repos/defaults/main.yml +++ b/ansible/roles/helm_repos/defaults/main.yml @@ -8,9 +8,6 @@ noble_helm_repos: - { name: fossorial, url: "https://charts.fossorial.io" } - { name: argo, url: "https://argoproj.github.io/argo-helm" } - { name: metrics-server, url: "https://kubernetes-sigs.github.io/metrics-server/" } - - { name: sealed-secrets, url: "https://bitnami-labs.github.io/sealed-secrets" } - - { name: external-secrets, url: "https://charts.external-secrets.io" } - - { name: hashicorp, url: "https://helm.releases.hashicorp.com" } - { name: prometheus-community, url: "https://prometheus-community.github.io/helm-charts" } - { name: grafana, url: "https://grafana.github.io/helm-charts" } - { name: fluent, url: "https://fluent.github.io/helm-charts" } diff --git a/ansible/roles/noble_landing_urls/defaults/main.yml b/ansible/roles/noble_landing_urls/defaults/main.yml index 313798d..1da1332 100644 --- a/ansible/roles/noble_landing_urls/defaults/main.yml +++ b/ansible/roles/noble_landing_urls/defaults/main.yml @@ -39,11 +39,6 @@ noble_lab_ui_entries: namespace: longhorn-system service: longhorn-frontend url: https://longhorn.apps.noble.lab.pcenicni.dev - - name: Vault - description: Secrets engine UI (after init/unseal) - namespace: vault - service: vault - url: https://vault.apps.noble.lab.pcenicni.dev - name: Velero description: Cluster backups — no web UI (velero CLI / kubectl CRDs) namespace: velero diff --git a/ansible/roles/noble_landing_urls/templates/noble-lab-ui-urls.md.j2 
b/ansible/roles/noble_landing_urls/templates/noble-lab-ui-urls.md.j2 index 78cd42c..777b95a 100644 --- a/ansible/roles/noble_landing_urls/templates/noble-lab-ui-urls.md.j2 +++ b/ansible/roles/noble_landing_urls/templates/noble-lab-ui-urls.md.j2 @@ -24,7 +24,6 @@ This file is **generated** by Ansible (`noble_landing_urls` role). Use it as a t | **Prometheus** | — | No auth in default install (lab). | | **Alertmanager** | — | No auth in default install (lab). | | **Longhorn** | — | No default login unless you enable access control in the UI settings. | -| **Vault** | Token | Root token is only from **`vault operator init`** (not stored in git). See `clusters/noble/bootstrap/vault/README.md`. | ### Commands to retrieve passwords (if not filled above) @@ -46,7 +45,7 @@ To generate this file **without** calling kubectl, run Ansible with **`-e noble_ - **Argo CD** `argocd-initial-admin-secret` disappears after you change the admin password. - **Grafana** password is random unless you set `grafana.adminPassword` in chart values. -- **Vault** UI needs **unsealed** Vault; tokens come from your chosen auth method. - **Prometheus / Alertmanager** UIs are unauthenticated by default — restrict when hardening (`talos/CLUSTER-BUILD.md` Phase G). +- **SOPS:** cluster secrets in git under **`clusters/noble/secrets/`** are encrypted; decrypt with **`age-key.txt`** (not in git). See **`clusters/noble/secrets/README.md`**. - **Headlamp** token above expires after the configured duration; re-run Ansible or `kubectl create token` to refresh. - **Velero** has **no web UI** — use **`velero`** CLI or **`kubectl -n velero get backup,schedule,backupstoragelocation`**. Metrics: **`velero`** Service in **`velero`** (Prometheus scrape). See `clusters/noble/bootstrap/velero/README.md`. 
diff --git a/ansible/roles/noble_platform/defaults/main.yml b/ansible/roles/noble_platform/defaults/main.yml index 0e72b05..a53fc0c 100644 --- a/ansible/roles/noble_platform/defaults/main.yml +++ b/ansible/roles/noble_platform/defaults/main.yml @@ -4,5 +4,6 @@ noble_platform_kubectl_request_timeout: 120s noble_platform_kustomize_retries: 5 noble_platform_kustomize_delay: 20 -# Vault: injector (vault-k8s) owns MutatingWebhookConfiguration.caBundle; Helm upgrade can SSA-conflict. Delete webhook so Helm can recreate it. -noble_vault_delete_injector_webhook_before_helm: true +# Decrypt **clusters/noble/secrets/*.yaml** with SOPS and kubectl apply (requires **sops**, **age**, and **age-key.txt**). +noble_apply_sops_secrets: true +noble_sops_age_key_file: "{{ noble_repo_root }}/age-key.txt" diff --git a/ansible/roles/noble_platform/tasks/main.yml b/ansible/roles/noble_platform/tasks/main.yml index fb856cb..f21545b 100644 --- a/ansible/roles/noble_platform/tasks/main.yml +++ b/ansible/roles/noble_platform/tasks/main.yml @@ -1,6 +1,6 @@ --- # Mirrors former **noble-platform** Argo Application: Helm releases + plain manifests under clusters/noble/bootstrap. 
-- name: Apply clusters/noble/bootstrap kustomize (namespaces, Grafana Loki datasource, Vault extras) +- name: Apply clusters/noble/bootstrap kustomize (namespaces, Grafana Loki datasource) ansible.builtin.command: argv: - kubectl @@ -16,77 +16,26 @@ until: noble_platform_kustomize.rc == 0 changed_when: true -- name: Install Sealed Secrets - ansible.builtin.command: - argv: - - helm - - upgrade - - --install - - sealed-secrets - - sealed-secrets/sealed-secrets - - --namespace - - sealed-secrets - - --version - - "2.18.4" - - -f - - "{{ noble_repo_root }}/clusters/noble/bootstrap/sealed-secrets/values.yaml" - - --wait - environment: - KUBECONFIG: "{{ noble_kubeconfig }}" - changed_when: true +- name: Stat SOPS age private key (age-key.txt) + ansible.builtin.stat: + path: "{{ noble_sops_age_key_file }}" + register: noble_sops_age_key_stat -- name: Install External Secrets Operator - ansible.builtin.command: - argv: - - helm - - upgrade - - --install - - external-secrets - - external-secrets/external-secrets - - --namespace - - external-secrets - - --version - - "2.2.0" - - -f - - "{{ noble_repo_root }}/clusters/noble/bootstrap/external-secrets/values.yaml" - - --wait +- name: Apply SOPS-encrypted cluster secrets (clusters/noble/secrets/*.yaml) + ansible.builtin.shell: | + set -euo pipefail + shopt -s nullglob + for f in "{{ noble_repo_root }}/clusters/noble/secrets"/*.yaml; do + sops -d "$f" | kubectl apply -f - + done + args: + executable: /bin/bash environment: KUBECONFIG: "{{ noble_kubeconfig }}" - changed_when: true - -# vault-k8s patches webhook CA after install; Helm 3/4 SSA then conflicts on upgrade. Removing the MWC lets Helm re-apply cleanly; injector repopulates caBundle. 
-- name: Delete Vault agent injector MutatingWebhookConfiguration before Helm (avoids caBundle field conflict) - ansible.builtin.command: - argv: - - kubectl - - delete - - mutatingwebhookconfiguration - - vault-agent-injector-cfg - - --ignore-not-found - environment: - KUBECONFIG: "{{ noble_kubeconfig }}" - register: noble_vault_mwc_delete - when: noble_vault_delete_injector_webhook_before_helm | default(true) | bool - changed_when: "'deleted' in (noble_vault_mwc_delete.stdout | default(''))" - -- name: Install Vault - ansible.builtin.command: - argv: - - helm - - upgrade - - --install - - vault - - hashicorp/vault - - --namespace - - vault - - --version - - "0.32.0" - - -f - - "{{ noble_repo_root }}/clusters/noble/bootstrap/vault/values.yaml" - - --wait - environment: - KUBECONFIG: "{{ noble_kubeconfig }}" - HELM_SERVER_SIDE_APPLY: "false" + SOPS_AGE_KEY_FILE: "{{ noble_sops_age_key_file }}" + when: + - noble_apply_sops_secrets | default(true) | bool + - noble_sops_age_key_stat.stat.exists changed_when: true - name: Install kube-prometheus-stack diff --git a/ansible/roles/noble_post_deploy/tasks/main.yml b/ansible/roles/noble_post_deploy/tasks/main.yml index ff08dba..a0b7808 100644 --- a/ansible/roles/noble_post_deploy/tasks/main.yml +++ b/ansible/roles/noble_post_deploy/tasks/main.yml @@ -1,24 +1,10 @@ --- -- name: Vault — manual steps (not automated) +- name: SOPS secrets (workstation) ansible.builtin.debug: msg: | - 1. kubectl -n vault get pods (wait for Running) - 2. kubectl -n vault exec -it vault-0 -- vault operator init (once; save keys) - 3. Unseal per clusters/noble/bootstrap/vault/README.md - 4. ./clusters/noble/bootstrap/vault/configure-kubernetes-auth.sh - 5. 
kubectl apply -f clusters/noble/bootstrap/external-secrets/examples/vault-cluster-secret-store.yaml - -- name: Optional — apply Vault ClusterSecretStore for External Secrets - ansible.builtin.command: - argv: - - kubectl - - apply - - -f - - "{{ noble_repo_root }}/clusters/noble/bootstrap/external-secrets/examples/vault-cluster-secret-store.yaml" - environment: - KUBECONFIG: "{{ noble_kubeconfig }}" - when: noble_apply_vault_cluster_secret_store | default(false) | bool - changed_when: true + Encrypted Kubernetes Secrets live under clusters/noble/secrets/ (Mozilla SOPS + age). + Private key: age-key.txt at repo root (gitignored). See clusters/noble/secrets/README.md + and .sops.yaml. noble.yml decrypt-applies these when age-key.txt exists. - name: Argo CD optional root Application (empty app-of-apps) ansible.builtin.debug: diff --git a/branding/nikflix/logo.png b/branding/nikflix/logo.png new file mode 100644 index 0000000..ad8e7a9 Binary files /dev/null and b/branding/nikflix/logo.png differ diff --git a/clusters/noble/apps/README.md b/clusters/noble/apps/README.md index 8a3583d..57b1370 100644 --- a/clusters/noble/apps/README.md +++ b/clusters/noble/apps/README.md @@ -1,6 +1,6 @@ # Argo CD — optional applications (non-bootstrap) -**Base cluster configuration** (CNI, MetalLB, ingress, cert-manager, storage, observability stack, policy, Vault, etc.) is installed by **`ansible/playbooks/noble.yml`** from **`clusters/noble/bootstrap/`** — not from here. +**Base cluster configuration** (CNI, MetalLB, ingress, cert-manager, storage, observability stack, policy, SOPS secrets path, etc.) is installed by **`ansible/playbooks/noble.yml`** from **`clusters/noble/bootstrap/`** — not from here. **`noble-root`** (`clusters/noble/bootstrap/argocd/root-application.yaml`) points at **`clusters/noble/apps`**. 
Add **`Application`** manifests (and optional **`AppProject`** definitions) under this directory only for workloads that are additive and do not subsume the Ansible-managed platform. diff --git a/clusters/noble/apps/homepage/values.yaml b/clusters/noble/apps/homepage/values.yaml index 8014409..af7b06d 100644 --- a/clusters/noble/apps/homepage/values.yaml +++ b/clusters/noble/apps/homepage/values.yaml @@ -79,12 +79,6 @@ config: href: https://longhorn.apps.noble.lab.pcenicni.dev siteMonitor: http://longhorn-frontend.longhorn-system.svc.cluster.local:80 description: Storage volumes, nodes, backups - - Vault: - icon: si-vault - href: https://vault.apps.noble.lab.pcenicni.dev - # Unauthenticated health (HEAD/GET) — not the redirecting UI root - siteMonitor: http://vault.vault.svc.cluster.local:8200/v1/sys/health?standbyok=true&sealedcode=204&uninitcode=204 - description: Secrets engine UI (after init/unseal) - Velero: icon: mdi-backup-restore href: https://velero.io/docs/ diff --git a/clusters/noble/bootstrap/argocd/README.md b/clusters/noble/bootstrap/argocd/README.md index f8c9759..aa6338f 100644 --- a/clusters/noble/bootstrap/argocd/README.md +++ b/clusters/noble/bootstrap/argocd/README.md @@ -52,7 +52,7 @@ Use **Settings → Repositories** in the UI, or `argocd repo add` / a `Secret` o ## 4. App-of-apps (optional GitOps only) -Bootstrap **platform** workloads (CNI, ingress, cert-manager, Kyverno, observability, Vault, etc.) are installed by +Bootstrap **platform** workloads (CNI, ingress, cert-manager, Kyverno, observability, etc.) are installed by **`ansible/playbooks/noble.yml`** from **`clusters/noble/bootstrap/`** — not by Argo. **`clusters/noble/apps/kustomization.yaml`** is empty by default. 1. Edit **`root-application.yaml`**: set **`repoURL`** and **`targetRevision`** to this repository. The **`resources-finalizer.argocd.argoproj.io/background`** finalizer uses Argo’s path-qualified form so **`kubectl apply`** does not warn about finalizer names. 
diff --git a/clusters/noble/bootstrap/external-secrets/README.md b/clusters/noble/bootstrap/external-secrets/README.md deleted file mode 100644 index 8a4848b..0000000 --- a/clusters/noble/bootstrap/external-secrets/README.md +++ /dev/null @@ -1,60 +0,0 @@ -# External Secrets Operator (noble) - -Syncs secrets from external systems into Kubernetes **Secret** objects via **ExternalSecret** / **ClusterExternalSecret** CRDs. - -- **Chart:** `external-secrets/external-secrets` **2.2.0** (app **v2.2.0**) -- **Namespace:** `external-secrets` -- **Helm release name:** `external-secrets` (matches the operator **ServiceAccount** name `external-secrets`) - -## Install - -```bash -helm repo add external-secrets https://charts.external-secrets.io -helm repo update -kubectl apply -f clusters/noble/bootstrap/external-secrets/namespace.yaml -helm upgrade --install external-secrets external-secrets/external-secrets -n external-secrets \ - --version 2.2.0 -f clusters/noble/bootstrap/external-secrets/values.yaml --wait -``` - -Verify: - -```bash -kubectl -n external-secrets get deploy,pods -kubectl get crd | grep external-secrets -``` - -## Vault `ClusterSecretStore` (after Vault is deployed) - -The checklist expects a **Vault**-backed store. Install Vault first (`talos/CLUSTER-BUILD.md` Phase E — Vault on Longhorn + auto-unseal), then: - -1. Enable **KV v2** secrets engine and **Kubernetes** auth in Vault; create a **role** (e.g. `external-secrets`) that maps the cluster’s **`external-secrets` / `external-secrets`** service account to a policy that can read the paths you need. -2. Copy **`examples/vault-cluster-secret-store.yaml`**, set **`spec.provider.vault.server`** to your Vault URL. This repo’s Vault Helm values use **HTTP** on port **8200** (`global.tlsDisable: true`): **`http://vault.vault.svc.cluster.local:8200`**. Use **`https://`** if you enable TLS on the Vault listener. -3. 
If Vault uses a **private TLS CA**, configure **`caProvider`** or **`caBundle`** on the Vault provider — see [HashiCorp Vault provider](https://external-secrets.io/latest/provider/hashicorp-vault/). Do not commit private CA material to public git unless intended. -4. Apply: **`kubectl apply -f …/vault-cluster-secret-store.yaml`** -5. Confirm the store is ready: **`kubectl describe clustersecretstore vault`** - -Example **ExternalSecret** (after the store is healthy): - -```yaml -apiVersion: external-secrets.io/v1 -kind: ExternalSecret -metadata: - name: demo - namespace: default -spec: - refreshInterval: 1h - secretStoreRef: - name: vault - kind: ClusterSecretStore - target: - name: demo-synced - data: - - secretKey: password - remoteRef: - key: secret/data/myapp - property: password -``` - -## Upgrades - -Pin the chart version in `values.yaml` header comments; run the same **`helm upgrade --install`** with the new **`--version`** after reviewing [release notes](https://github.com/external-secrets/external-secrets/releases). diff --git a/clusters/noble/bootstrap/external-secrets/examples/vault-cluster-secret-store.yaml b/clusters/noble/bootstrap/external-secrets/examples/vault-cluster-secret-store.yaml deleted file mode 100644 index 159bea0..0000000 --- a/clusters/noble/bootstrap/external-secrets/examples/vault-cluster-secret-store.yaml +++ /dev/null @@ -1,31 +0,0 @@ -# ClusterSecretStore for HashiCorp Vault (KV v2) using Kubernetes auth. -# -# Do not apply until Vault is running, reachable from the cluster, and configured with: -# - Kubernetes auth at mountPath (default: kubernetes) -# - A role (below: external-secrets) bound to this service account: -# name: external-secrets -# namespace: external-secrets -# - A policy allowing read on the KV path used below (e.g. secret/data/* for path "secret") -# -# Adjust server, mountPath, role, and path to match your Vault deployment. 
If Vault uses TLS -# with a private CA, set provider.vault.caProvider or caBundle (see README). -# -# kubectl apply -f clusters/noble/bootstrap/external-secrets/examples/vault-cluster-secret-store.yaml ---- -apiVersion: external-secrets.io/v1 -kind: ClusterSecretStore -metadata: - name: vault -spec: - provider: - vault: - server: "http://vault.vault.svc.cluster.local:8200" - path: secret - version: v2 - auth: - kubernetes: - mountPath: kubernetes - role: external-secrets - serviceAccountRef: - name: external-secrets - namespace: external-secrets diff --git a/clusters/noble/bootstrap/external-secrets/namespace.yaml b/clusters/noble/bootstrap/external-secrets/namespace.yaml deleted file mode 100644 index eab4215..0000000 --- a/clusters/noble/bootstrap/external-secrets/namespace.yaml +++ /dev/null @@ -1,5 +0,0 @@ -# External Secrets Operator — apply before Helm. -apiVersion: v1 -kind: Namespace -metadata: - name: external-secrets diff --git a/clusters/noble/bootstrap/external-secrets/values.yaml b/clusters/noble/bootstrap/external-secrets/values.yaml deleted file mode 100644 index a630c8b..0000000 --- a/clusters/noble/bootstrap/external-secrets/values.yaml +++ /dev/null @@ -1,10 +0,0 @@ -# External Secrets Operator — noble -# -# helm repo add external-secrets https://charts.external-secrets.io -# helm repo update -# kubectl apply -f clusters/noble/bootstrap/external-secrets/namespace.yaml -# helm upgrade --install external-secrets external-secrets/external-secrets -n external-secrets \ -# --version 2.2.0 -f clusters/noble/bootstrap/external-secrets/values.yaml --wait -# -# CRDs are installed by the chart (installCRDs: true). Vault ClusterSecretStore: see README + examples/. 
-commonLabels: {} diff --git a/clusters/noble/bootstrap/kustomization.yaml b/clusters/noble/bootstrap/kustomization.yaml index 0882590..bebf821 100644 --- a/clusters/noble/bootstrap/kustomization.yaml +++ b/clusters/noble/bootstrap/kustomization.yaml @@ -8,13 +8,9 @@ resources: - kube-prometheus-stack/namespace.yaml - loki/namespace.yaml - fluent-bit/namespace.yaml - - sealed-secrets/namespace.yaml - - external-secrets/namespace.yaml - - vault/namespace.yaml + - newt/namespace.yaml - kyverno/namespace.yaml - velero/namespace.yaml - velero/longhorn-volumesnapshotclass.yaml - headlamp/namespace.yaml - grafana-loki-datasource/loki-datasource.yaml - - vault/unseal-cronjob.yaml - - vault/cilium-network-policy.yaml diff --git a/clusters/noble/bootstrap/kyverno/policies-values.yaml b/clusters/noble/bootstrap/kyverno/policies-values.yaml index e148211..6a6fe09 100644 --- a/clusters/noble/bootstrap/kyverno/policies-values.yaml +++ b/clusters/noble/bootstrap/kyverno/policies-values.yaml @@ -35,7 +35,6 @@ x-kyverno-exclude-infra: &kyverno_exclude_infra - kube-node-lease - argocd - cert-manager - - external-secrets - headlamp - kyverno - logging @@ -44,9 +43,7 @@ x-kyverno-exclude-infra: &kyverno_exclude_infra - metallb-system - monitoring - newt - - sealed-secrets - traefik - - vault policyExclude: disallow-capabilities: *kyverno_exclude_infra diff --git a/clusters/noble/bootstrap/newt/README.md b/clusters/noble/bootstrap/newt/README.md index 0fce92d..5d9d937 100644 --- a/clusters/noble/bootstrap/newt/README.md +++ b/clusters/noble/bootstrap/newt/README.md @@ -2,26 +2,24 @@ This is the **primary** automation path for **public** hostnames to workloads in this cluster (it **replaces** in-cluster ExternalDNS). [Newt](https://github.com/fosrl/newt) is the on-prem agent that connects your cluster to a **Pangolin** site (WireGuard tunnel). The [Fossorial Helm chart](https://github.com/fosrl/helm-charts) deploys one or more instances. 
-**Secrets:** Never commit endpoint, Newt ID, or Newt secret. If credentials were pasted into chat or CI logs, **rotate them** in Pangolin and recreate the Kubernetes Secret. +**Secrets:** Never commit endpoint, Newt ID, or Newt secret in **plain** YAML. If credentials were pasted into chat or CI logs, **rotate them** in Pangolin and recreate the Kubernetes Secret. ## 1. Create the Secret Keys must match `values.yaml` (`PANGOLIN_ENDPOINT`, `NEWT_ID`, `NEWT_SECRET`). -### Option A — Sealed Secret (safe for GitOps) +### Option A — SOPS (safe for GitOps) -With the [Sealed Secrets](https://github.com/bitnami-labs/sealed-secrets) controller installed (`clusters/noble/bootstrap/sealed-secrets/`), generate a `SealedSecret` from your workstation (rotate credentials in Pangolin first if they were exposed): +Encrypt a normal **`Secret`** with [Mozilla SOPS](https://github.com/getsops/sops) and **age** (see **`clusters/noble/secrets/README.md`** and **`.sops.yaml`**). The repo includes an encrypted example at **`clusters/noble/secrets/newt-pangolin-auth.secret.yaml`** — edit with `sops` after exporting **`SOPS_AGE_KEY_FILE`** to your **`age-key.txt`**, or create a new file and encrypt it. ```bash -chmod +x clusters/noble/bootstrap/sealed-secrets/examples/kubeseal-newt-pangolin-auth.sh -export PANGOLIN_ENDPOINT='https://pangolin.pcenicni.dev' -export NEWT_ID='YOUR_NEWT_ID' -export NEWT_SECRET='YOUR_NEWT_SECRET' -./clusters/noble/bootstrap/sealed-secrets/examples/kubeseal-newt-pangolin-auth.sh > newt-pangolin-auth.sealedsecret.yaml -kubectl apply -f newt-pangolin-auth.sealedsecret.yaml +export SOPS_AGE_KEY_FILE=/absolute/path/to/home-server/age-key.txt +sops clusters/noble/secrets/newt-pangolin-auth.secret.yaml +# then: +sops -d clusters/noble/secrets/newt-pangolin-auth.secret.yaml | kubectl apply -f - ``` -Commit only the `.sealedsecret.yaml` file, not plain `Secret` YAML. 
+**Ansible** (`noble.yml`) applies all **`clusters/noble/secrets/*.yaml`** automatically when **`age-key.txt`** exists at the repo root. ### Option B — Imperative Secret (not in git) diff --git a/clusters/noble/bootstrap/sealed-secrets/README.md b/clusters/noble/bootstrap/sealed-secrets/README.md deleted file mode 100644 index 9e7cbdb..0000000 --- a/clusters/noble/bootstrap/sealed-secrets/README.md +++ /dev/null @@ -1,50 +0,0 @@ -# Sealed Secrets (noble) - -Encrypts `Secret` manifests so they can live in git; the controller decrypts **SealedSecret** resources into **Secret**s in-cluster. - -- **Chart:** `sealed-secrets/sealed-secrets` **2.18.4** (app **0.36.1**) -- **Namespace:** `sealed-secrets` - -## Install - -```bash -helm repo add sealed-secrets https://bitnami-labs.github.io/sealed-secrets -helm repo update -kubectl apply -f clusters/noble/bootstrap/sealed-secrets/namespace.yaml -helm upgrade --install sealed-secrets sealed-secrets/sealed-secrets -n sealed-secrets \ - --version 2.18.4 -f clusters/noble/bootstrap/sealed-secrets/values.yaml --wait -``` - -## Workstation: `kubeseal` - -Install a **kubeseal** build compatible with the controller (match **app** minor, e.g. **0.36.x** for **0.36.1**). Examples: - -- **Homebrew:** `brew install kubeseal` (check `kubeseal --version` against the chart’s `image.tag` in `helm show values`). -- **GitHub releases:** [bitnami-labs/sealed-secrets](https://github.com/bitnami-labs/sealed-secrets/releases) - -Fetch the cluster’s public seal cert (once per kube context): - -```bash -kubeseal --fetch-cert > /tmp/noble-sealed-secrets.pem -``` - -Create a sealed secret from a normal secret manifest: - -```bash -kubectl create secret generic example --from-literal=foo=bar --dry-run=client -o yaml \ - | kubeseal --cert /tmp/noble-sealed-secrets.pem -o yaml > example-sealedsecret.yaml -``` - -Commit `example-sealedsecret.yaml`; apply it with `kubectl apply -f`. 
The controller creates the **Secret** in the same namespace as the **SealedSecret**. - -**Noble example:** `examples/kubeseal-newt-pangolin-auth.sh` (Newt / Pangolin tunnel credentials). - -## Backup the sealing key - -If the controller’s private key is lost, existing sealed files cannot be decrypted on a new cluster. Back up the key secret after install: - -```bash -kubectl get secret -n sealed-secrets -l sealedsecrets.bitnami.com/sealed-secrets-key=active -o yaml > sealed-secrets-key-backup.yaml -``` - -Store `sealed-secrets-key-backup.yaml` in a safe offline location (not in public git). diff --git a/clusters/noble/bootstrap/sealed-secrets/examples/kubeseal-newt-pangolin-auth.sh b/clusters/noble/bootstrap/sealed-secrets/examples/kubeseal-newt-pangolin-auth.sh deleted file mode 100755 index c647ac8..0000000 --- a/clusters/noble/bootstrap/sealed-secrets/examples/kubeseal-newt-pangolin-auth.sh +++ /dev/null @@ -1,19 +0,0 @@ -#!/usr/bin/env bash -# Emit a SealedSecret for newt-pangolin-auth (namespace newt). -# Prerequisites: sealed-secrets controller running; kubeseal client (same minor as controller). -# Rotate Pangolin/Newt credentials in the UI first if they were exposed, then set env vars and run: -# -# export PANGOLIN_ENDPOINT='https://pangolin.example.com' -# export NEWT_ID='...' -# export NEWT_SECRET='...' 
-# ./kubeseal-newt-pangolin-auth.sh > newt-pangolin-auth.sealedsecret.yaml -# kubectl apply -f newt-pangolin-auth.sealedsecret.yaml -# -set -euo pipefail -kubectl apply -f "$(dirname "$0")/../../newt/namespace.yaml" >/dev/null 2>&1 || true -kubectl -n newt create secret generic newt-pangolin-auth \ - --dry-run=client \ - --from-literal=PANGOLIN_ENDPOINT="${PANGOLIN_ENDPOINT:?}" \ - --from-literal=NEWT_ID="${NEWT_ID:?}" \ - --from-literal=NEWT_SECRET="${NEWT_SECRET:?}" \ - -o yaml | kubeseal -o yaml diff --git a/clusters/noble/bootstrap/sealed-secrets/namespace.yaml b/clusters/noble/bootstrap/sealed-secrets/namespace.yaml deleted file mode 100644 index d2e9d85..0000000 --- a/clusters/noble/bootstrap/sealed-secrets/namespace.yaml +++ /dev/null @@ -1,5 +0,0 @@ -# Sealed Secrets controller — apply before Helm. -apiVersion: v1 -kind: Namespace -metadata: - name: sealed-secrets diff --git a/clusters/noble/bootstrap/sealed-secrets/values.yaml b/clusters/noble/bootstrap/sealed-secrets/values.yaml deleted file mode 100644 index 0f84be9..0000000 --- a/clusters/noble/bootstrap/sealed-secrets/values.yaml +++ /dev/null @@ -1,18 +0,0 @@ -# Sealed Secrets — noble (Git-encrypted Secret workflow) -# -# helm repo add sealed-secrets https://bitnami-labs.github.io/sealed-secrets -# helm repo update -# kubectl apply -f clusters/noble/bootstrap/sealed-secrets/namespace.yaml -# helm upgrade --install sealed-secrets sealed-secrets/sealed-secrets -n sealed-secrets \ -# --version 2.18.4 -f clusters/noble/bootstrap/sealed-secrets/values.yaml --wait -# -# Client: install kubeseal (same minor as controller — see README). -# Defaults are sufficient for the lab; override here if you need key renewal, resources, etc. -# -# GitOps pattern: create Secrets only via SealedSecret (or External Secrets + Vault). 
-# Example (Newt): clusters/noble/bootstrap/sealed-secrets/examples/kubeseal-newt-pangolin-auth.sh -# Backup the controller's sealing key: kubectl -n sealed-secrets get secret sealed-secrets-key -o yaml -# -# Talos cluster secrets (bootstrap token, cluster secret, certs) belong in talhelper talsecret / -# SOPS — not Sealed Secrets. See talos/README.md. -commonLabels: {} diff --git a/clusters/noble/bootstrap/vault/README.md b/clusters/noble/bootstrap/vault/README.md deleted file mode 100644 index c05250a..0000000 --- a/clusters/noble/bootstrap/vault/README.md +++ /dev/null @@ -1,162 +0,0 @@ -# HashiCorp Vault (noble) - -Standalone Vault with **file** storage on a **Longhorn** PVC (`server.dataStorage`). The listener uses **HTTP** (`global.tlsDisable: true`) for in-cluster use; add TLS at the listener when exposing outside the cluster. - -- **Chart:** `hashicorp/vault` **0.32.0** (Vault **1.21.2**) -- **Namespace:** `vault` - -## Install - -```bash -helm repo add hashicorp https://helm.releases.hashicorp.com -helm repo update -kubectl apply -f clusters/noble/bootstrap/vault/namespace.yaml -helm upgrade --install vault hashicorp/vault -n vault \ - --version 0.32.0 -f clusters/noble/bootstrap/vault/values.yaml --wait --timeout 15m -``` - -Verify: - -```bash -kubectl -n vault get pods,pvc,svc -kubectl -n vault exec -i sts/vault -- vault status -``` - -## Cilium network policy (Phase G) - -After **Cilium** is up, optionally restrict HTTP access to the Vault server pods (**TCP 8200**) to **`external-secrets`** and same-namespace clients: - -```bash -kubectl apply -f clusters/noble/bootstrap/vault/cilium-network-policy.yaml -``` - -If you add workloads in other namespaces that call Vault, extend **`ingress`** in that manifest. 
- -## Initialize and unseal (first time) - -From a workstation with `kubectl` (or `kubectl exec` into any pod with `vault` CLI): - -```bash -kubectl -n vault exec -i sts/vault -- vault operator init -key-shares=1 -key-threshold=1 -``` - -**Lab-only:** `-key-shares=1 -key-threshold=1` keeps a single unseal key. For stronger Shamir splits, use more shares and store them safely. - -Save the **Unseal Key** and **Root Token** offline. Then unseal once: - -```bash -kubectl -n vault exec -i sts/vault -- vault operator unseal -# paste unseal key -``` - -Or create the Secret used by the optional CronJob and apply it: - -```bash -kubectl -n vault create secret generic vault-unseal-key --from-literal=key='YOUR_UNSEAL_KEY' -kubectl apply -f clusters/noble/bootstrap/vault/unseal-cronjob.yaml -``` - -The CronJob runs every minute and unseals if Vault is sealed and the Secret is present. - -## Auto-unseal note - -Vault **OSS** auto-unseal uses cloud KMS (AWS, GCP, Azure, OCI), **Transit** (another Vault), etc. There is no first-class “Kubernetes Secret” seal. This repo uses an optional **CronJob** as a **lab** substitute. Production clusters should use a supported seal backend. - -## Kubernetes auth (External Secrets / ClusterSecretStore) - -**One-shot:** from the repo root, `export KUBECONFIG=talos/kubeconfig` and `export VAULT_TOKEN=…`, then run **`./clusters/noble/bootstrap/vault/configure-kubernetes-auth.sh`** (idempotent). Then **`kubectl apply -f clusters/noble/bootstrap/external-secrets/examples/vault-cluster-secret-store.yaml`** on its own line (shell comments **`# …`** on the same line are parsed as extra `kubectl` args and break `apply`). **`kubectl get clustersecretstore vault`** should show **READY=True** after a few seconds. - -Run these **from your workstation** (needs `kubectl`; no local `vault` binary required). Use a **short-lived admin token** or the root token **only in your shell** — do not paste tokens into logs or chat. - -**1. 
Enable the auth method** (skip if already done): - -```bash -kubectl -n vault exec -it sts/vault -- sh -c ' - export VAULT_ADDR=http://127.0.0.1:8200 - export VAULT_TOKEN="YOUR_ROOT_OR_ADMIN_TOKEN" - vault auth enable kubernetes -' -``` - -**2. Configure `auth/kubernetes`** — the API **issuer** must match the `iss` claim on service account JWTs. With **kube-vip** / a custom API URL, discover it from the cluster (do not assume `kubernetes.default`): - -```bash -ISSUER=$(kubectl get --raw /.well-known/openid-configuration | jq -r .issuer) -REVIEWER=$(kubectl -n vault create token vault --duration=8760h) -CA_B64=$(kubectl config view --raw --minify -o jsonpath='{.clusters[0].cluster.certificate-authority-data}') -``` - -Then apply config **inside** the Vault pod (environment variables are passed in with `env` so quoting stays correct): - -```bash -export VAULT_TOKEN="YOUR_ROOT_OR_ADMIN_TOKEN" -export ISSUER REVIEWER CA_B64 -kubectl -n vault exec -i sts/vault -- env \ - VAULT_ADDR=http://127.0.0.1:8200 \ - VAULT_TOKEN="$VAULT_TOKEN" \ - CA_B64="$CA_B64" \ - REVIEWER="$REVIEWER" \ - ISSUER="$ISSUER" \ - sh -ec ' - echo "$CA_B64" | base64 -d > /tmp/k8s-ca.crt - vault write auth/kubernetes/config \ - kubernetes_host="https://kubernetes.default.svc:443" \ - kubernetes_ca_cert=@/tmp/k8s-ca.crt \ - token_reviewer_jwt="$REVIEWER" \ - issuer="$ISSUER" -' -``` - -**3. KV v2** at path `secret` (skip if already enabled): - -```bash -kubectl -n vault exec -it sts/vault -- sh -c ' - export VAULT_ADDR=http://127.0.0.1:8200 - export VAULT_TOKEN="YOUR_ROOT_OR_ADMIN_TOKEN" - vault secrets enable -path=secret kv-v2 -' -``` - -**4. 
Policy + role** for the External Secrets operator SA (`external-secrets` / `external-secrets`): - -```bash -kubectl -n vault exec -it sts/vault -- sh -c ' - export VAULT_ADDR=http://127.0.0.1:8200 - export VAULT_TOKEN="YOUR_ROOT_OR_ADMIN_TOKEN" - vault policy write external-secrets - </tmp/vauth.txt - grep -q "^kubernetes/" /tmp/vauth.txt || vault auth enable kubernetes - ' - -kubectl -n vault exec -i sts/vault -- env \ - VAULT_ADDR=http://127.0.0.1:8200 \ - VAULT_TOKEN="$VAULT_TOKEN" \ - CA_B64="$CA_B64" \ - REVIEWER="$REVIEWER" \ - ISSUER="$ISSUER" \ - sh -ec ' - echo "$CA_B64" | base64 -d > /tmp/k8s-ca.crt - vault write auth/kubernetes/config \ - kubernetes_host="https://kubernetes.default.svc:443" \ - kubernetes_ca_cert=@/tmp/k8s-ca.crt \ - token_reviewer_jwt="$REVIEWER" \ - issuer="$ISSUER" - ' - -kubectl -n vault exec -i sts/vault -- env \ - VAULT_ADDR=http://127.0.0.1:8200 \ - VAULT_TOKEN="$VAULT_TOKEN" \ - sh -ec ' - set -e - vault secrets list >/tmp/vsec.txt - grep -q "^secret/" /tmp/vsec.txt || vault secrets enable -path=secret kv-v2 - ' - -kubectl -n vault exec -i sts/vault -- env \ - VAULT_ADDR=http://127.0.0.1:8200 \ - VAULT_TOKEN="$VAULT_TOKEN" \ - sh -ec ' - vault policy write external-secrets - <192.168.50.0/24 L2"] + PP["Patch / cable mgmt"] + SW --- PP + end + subgraph RACK_B["Rack B — 10\""] + N["neon :20"] + A["argon :30"] + K["krypton :40"] + end + subgraph RACK_C["Rack C — 10\""] + H["helium :10"] + end + subgraph LOGICAL["Logical (any node holding VIP)"] + VIP["API VIP 192.168.50.230
kube-vip → apiserver :6443"] + end + WAN["Internet / other LANs"] -.->|"router (out of scope)"| SW + SW <-->|"Ethernet"| N + SW <-->|"Ethernet"| A + SW <-->|"Ethernet"| K + SW <-->|"Ethernet"| H + N --- VIP + A --- VIP + K --- VIP + WK["Workstation / CI
kubectl, browser"] -->|"HTTPS :6443"| VIP + WK -->|"L2 (MetalLB .210–.211, any node)"| SW +``` + +**Ingress path (same LAN):** clients → **`192.168.50.211`** (Traefik) or **`192.168.50.210`** (Argo CD) via **MetalLB** — still **through the same switch** to whichever node advertises the service. + +--- + +## Related docs + +- Cluster topology and services: [`architecture.md`](architecture.md) +- Build state and versions: [`../talos/CLUSTER-BUILD.md`](../talos/CLUSTER-BUILD.md) diff --git a/docs/architecture.md b/docs/architecture.md index 4c5268a..59bb976 100644 --- a/docs/architecture.md +++ b/docs/architecture.md @@ -8,8 +8,8 @@ This document describes the **noble** Talos lab cluster: node topology, networki |---------------|---------| | **Subgraph “Cluster”** | Kubernetes cluster boundary (`noble`) | | **External / DNS / cloud** | Services outside the data plane (internet, registrar, Pangolin) | -| **Data store** | Durable data (etcd, Longhorn, Loki, Vault storage) | -| **Secrets / policy** | Secret material, Vault, admission policy | +| **Data store** | Durable data (etcd, Longhorn, Loki) | +| **Secrets / policy** | Secret material (SOPS in git), admission policy | | **LB / VIP** | Load balancer, MetalLB assignment, or API VIP | --- @@ -74,7 +74,7 @@ flowchart TB ## Platform stack (bootstrap → workloads) -Order: **Talos** → **Cilium** (cluster uses `cni: none` until CNI is installed) → **metrics-server**, **Longhorn**, **MetalLB** + pool manifests, **kube-vip** → **Traefik**, **cert-manager** → **Argo CD** (Helm only; optional empty app-of-apps). **Automated install:** `ansible/playbooks/noble.yml` (see `ansible/README.md`). Platform namespaces include `cert-manager`, `traefik`, `metallb-system`, `longhorn-system`, `monitoring`, `loki`, `logging`, `argocd`, `vault`, `external-secrets`, `sealed-secrets`, `kyverno`, `newt`, and others as deployed. 
+Order: **Talos** → **Cilium** (cluster uses `cni: none` until CNI is installed) → **metrics-server**, **Longhorn**, **MetalLB** + pool manifests, **kube-vip** → **Traefik**, **cert-manager** → **Argo CD** (Helm only; optional empty app-of-apps). **Automated install:** `ansible/playbooks/noble.yml` (see `ansible/README.md`). Platform namespaces include `cert-manager`, `traefik`, `metallb-system`, `longhorn-system`, `monitoring`, `loki`, `logging`, `argocd`, `kyverno`, `newt`, and others as deployed. ```mermaid flowchart TB @@ -98,7 +98,7 @@ flowchart TB Argo["Argo CD
(optional app-of-apps; platform via Ansible)"] end subgraph L5["Platform namespaces (examples)"] - NS["cert-manager, traefik, metallb-system,
longhorn-system, monitoring, loki, logging,
argocd, vault, external-secrets, sealed-secrets,
kyverno, newt, …"] + NS["cert-manager, traefik, metallb-system,
longhorn-system, monitoring, loki, logging,
argocd, kyverno, newt, …"] end Talos --> Cilium --> MS Cilium --> LH @@ -149,22 +149,20 @@ flowchart LR ## Secrets and policy -**Sealed Secrets** decrypts `SealedSecret` objects in-cluster. **External Secrets Operator** syncs from **Vault** using **`ClusterSecretStore`** (see [`examples/vault-cluster-secret-store.yaml`](../clusters/noble/bootstrap/external-secrets/examples/vault-cluster-secret-store.yaml)). Trust is **cluster → Vault** (ESO calls Vault; Vault does not initiate cluster trust). **Kyverno** with **kyverno-policies** enforces **PSS baseline** in **Audit**. +**Mozilla SOPS** with **age** encrypts plain Kubernetes **`Secret`** manifests under [`clusters/noble/secrets/`](../clusters/noble/secrets/); operators decrypt at apply time (`ansible/playbooks/noble.yml` or `sops -d … | kubectl apply`). The private key is **`age-key.txt`** at the repo root (gitignored). **Kyverno** with **kyverno-policies** enforces **PSS baseline** in **Audit**. ```mermaid flowchart LR subgraph Git["Git repo"] - SSman["SealedSecret manifests
(optional)"] + SM["SOPS-encrypted Secret YAML
clusters/noble/secrets/"] + end + subgraph ops["Apply path"] + SOPS["sops -d + kubectl apply
(or Ansible noble.yml)"] end subgraph cluster["Cluster"] - SSC["Sealed Secrets controller
sealed-secrets"] - ESO["External Secrets Operator
external-secrets"] - V["Vault
vault namespace
HTTP listener"] K["Kyverno + kyverno-policies
PSS baseline Audit"] end - SSman -->|"encrypted"| SSC -->|"decrypt to Secret"| workloads["Workload Secrets"] - ESO -->|"ClusterSecretStore →"| V - ESO -->|"sync ExternalSecret"| workloads + SM --> SOPS -->|"plain Secret"| workloads["Workload Secrets"] K -.->|"admission / audit
(PSS baseline)"| workloads ``` @@ -172,7 +170,7 @@ flowchart LR ## Data and storage -**StorageClass:** **`longhorn`** (default). Talos mounts **user volume** data at **`/var/mnt/longhorn`** (bind paths for Longhorn). Stateful consumers include **Vault**, **kube-prometheus-stack** PVCs, and **Loki**. +**StorageClass:** **`longhorn`** (default). Talos mounts **user volume** data at **`/var/mnt/longhorn`** (bind paths for Longhorn). Stateful consumers include **kube-prometheus-stack** PVCs and **Loki**. ```mermaid flowchart TB @@ -183,12 +181,10 @@ flowchart TB SC["StorageClass: longhorn (default)"] end subgraph consumers["Stateful / durable consumers"] - V["Vault PVC data-vault-0"] PGL["kube-prometheus-stack PVCs"] L["Loki PVC"] end UD --> SC - SC --> V SC --> PGL SC --> L ``` @@ -210,7 +206,7 @@ See [`talos/CLUSTER-BUILD.md`](../talos/CLUSTER-BUILD.md) for the authoritative | Argo CD | 9.4.17 / app v3.3.6 | | kube-prometheus-stack | 82.15.1 | | Loki / Fluent Bit | 6.55.0 / 0.56.0 | -| Sealed Secrets / ESO / Vault | 2.18.4 / 2.2.0 / 0.32.0 | +| SOPS (client tooling) | see `clusters/noble/secrets/README.md` | | Kyverno | 3.7.1 / policies 3.7.1 | | Newt | 1.2.0 / app 1.10.1 | @@ -218,7 +214,7 @@ See [`talos/CLUSTER-BUILD.md`](../talos/CLUSTER-BUILD.md) for the authoritative ## Narrative -The **noble** environment is a **Talos** lab cluster on **`192.168.50.0/24`** with **three control plane nodes and one worker**, schedulable workloads on control planes enabled, and the Kubernetes API exposed through **kube-vip** at **`192.168.50.230`**. **Cilium** provides the CNI after Talos bootstrap with **`cni: none`**; **MetalLB** advertises **`192.168.50.210`–`192.168.50.229`**, pinning **Argo CD** to **`192.168.50.210`** and **Traefik** to **`192.168.50.211`** for **`*.apps.noble.lab.pcenicni.dev`**. 
**cert-manager** issues certificates for Traefik Ingresses; **GitOps** is **Ansible-driven Helm** for the platform (**`clusters/noble/bootstrap/`**) plus optional **Argo CD** app-of-apps (**`clusters/noble/apps/`**, **`clusters/noble/bootstrap/argocd/`**). **Observability** uses **kube-prometheus-stack** in **`monitoring`**, **Loki** and **Fluent Bit** with Grafana wired via a **ConfigMap** datasource, with **Longhorn** PVCs for Prometheus, Grafana, Alertmanager, Loki, and **Vault**. **Secrets** combine **Sealed Secrets** for git-encrypted material, **Vault** with **External Secrets** for dynamic sync, and **Kyverno** enforces **Pod Security Standards baseline** in **Audit**. **Public** access uses **Newt** to **Pangolin** with **CNAME** and Integration API steps as documented—not generic in-cluster public DNS. +The **noble** environment is a **Talos** lab cluster on **`192.168.50.0/24`** with **three control plane nodes and one worker**, schedulable workloads on control planes enabled, and the Kubernetes API exposed through **kube-vip** at **`192.168.50.230`**. **Cilium** provides the CNI after Talos bootstrap with **`cni: none`**; **MetalLB** advertises **`192.168.50.210`–`192.168.50.229`**, pinning **Argo CD** to **`192.168.50.210`** and **Traefik** to **`192.168.50.211`** for **`*.apps.noble.lab.pcenicni.dev`**. **cert-manager** issues certificates for Traefik Ingresses; **GitOps** is **Ansible-driven Helm** for the platform (**`clusters/noble/bootstrap/`**) plus optional **Argo CD** app-of-apps (**`clusters/noble/apps/`**, **`clusters/noble/bootstrap/argocd/`**). **Observability** uses **kube-prometheus-stack** in **`monitoring`**, **Loki** and **Fluent Bit** with Grafana wired via a **ConfigMap** datasource, with **Longhorn** PVCs for Prometheus, Grafana, Alertmanager, and Loki. **Secrets** in git use **SOPS** + **age** under **`clusters/noble/secrets/`**; **Kyverno** enforces **Pod Security Standards baseline** in **Audit**. 
**Public** access uses **Newt** to **Pangolin** with **CNAME** and Integration API steps as documented—not generic in-cluster public DNS. --- diff --git a/docs/homelab-network.md b/docs/homelab-network.md new file mode 100644 index 0000000..535f770 --- /dev/null +++ b/docs/homelab-network.md @@ -0,0 +1,100 @@ +# Homelab network inventory + +Single place for **VLANs**, **static addressing**, and **hosts** beside the **noble** Talos cluster. **Proxmox** is the **hypervisor** for the VMs below; **all of those VMs are intended to run on `192.168.1.0/24`** (same broadcast domain as Pi-hole and typical home clients). **Noble** (Talos) stays on **`192.168.50.0/24`** per [`architecture.md`](architecture.md) and [`talos/CLUSTER-BUILD.md`](../talos/CLUSTER-BUILD.md) until you change that design. + +## VLANs (logical) + +| Network | Role | +|---------|------| +| **`192.168.1.0/24`** | **Homelab / Proxmox LAN** — **Proxmox host(s)**, **all Proxmox VMs**, **Pi-hole**, **Mac mini**, and other servers that share this VLAN. | +| **`192.168.50.0/24`** | **Noble Talos** cluster — physical nodes, **kube-vip**, **MetalLB**, Traefik; **not** the Proxmox VM subnet. | +| **`192.168.60.0/24`** | **DMZ / WAN-facing** — **NPM**, **WebDAV**, **other services** that need WAN access. | +| **`192.168.40.0/24`** | **Home Assistant** and IoT devices — isolated; record subnet and HA IP in DHCP/router. | + +**Routing / DNS:** Clients and VMs on **`192.168.1.0/24`** reach **noble** services on **`192.168.50.0/24`** via **L3** (router/firewall). **NFS** from OMV (`192.168.1.105`) to **noble** pods uses the **OMV data IP** as the NFS server address from the cluster’s perspective. + +Firewall rules between VLANs are **out of scope** here; document them where you keep runbooks. + +--- + +## `192.168.50.0/24` — reservations (noble only) + +Do not assign **unrelated** static services on **this** VLAN without checking overlap with MetalLB and kube-vip. 
+ +| Use | Addresses | +|-----|-----------| +| Talos nodes | `.10`–`.40` (see [`talos/talconfig.yaml`](../talos/talconfig.yaml)) | +| MetalLB L2 pool | `.210`–`.229` | +| Traefik (ingress) | `.211` (typical) | +| Argo CD | `.210` (typical) | +| Kubernetes API (kube-vip) | **`.230`** — **must not** be a VM | + +--- + +## Proxmox VMs (`192.168.1.0/24`) + +All run on **Proxmox**; addresses below use **`192.168.1.0/24`** (same host octet as your earlier `.50.x` / `.60.x` plan, moved into the homelab VLAN). Adjust if your router uses a different numbering scheme. + +Most are **Docker hosts** with multiple apps; treat the **IP** as the **host**, not individual containers. + +| VM ID | Name | IP | Notes | +|-------|------|-----|--------| +| 666 | nginxproxymanager | `192.168.1.20` | NPM (edge / WAN-facing role — firewall as you design). | +| 777 | nginxproxymanager-Lan | `192.168.1.60` | NPM on **internal** homelab LAN. | +| 100 | Openmediavault | `192.168.1.105` | **NFS** exports for *arr / media paths. | +| 110 | Monitor | `192.168.1.110` | Uptime Kuma, Peekaping, Tracearr → cluster candidates. | +| 120 | arr | `192.168.1.120` | *arr stack; media via **NFS** from OMV — see [migration](#arr-stack-nfs-and-kubernetes). | +| 130 | Automate | `192.168.1.130` | Low use — **candidate to remove** or consolidate. | +| 140 | general-purpose | `192.168.1.140` | IT tools, Mealie, Open WebUI, SparkyFitness, … | +| 150 | Media-server | `192.168.1.150` | Jellyfin (test, **NFS** media), ebook server. | +| 160 | s3 | `192.168.1.170` | Object storage; **merge** into **central S3** on noble per [`shared-data-services.md`](shared-data-services.md) when ready. | +| 190 | Auth | `192.168.1.190` | **Authentik** → **noble (K8s)** for HA. | +| 300 | gitea | `192.168.1.203` | On **`.1`**, no overlap with noble **MetalLB `.210`–`.229`** on **`.50`**. 
| +| 310 | gitea-nsfw | `192.168.1.204` | | +| 500 | AMP | `192.168.1.47` | | + +### Workload detail (what runs where) + +**Auth (190)** — **Authentik** is the main service; moving it to **Kubernetes (noble)** gives you **HA**, rolling upgrades, and backups via your cluster patterns (PVCs, Velero, etc.). Plan **OIDC redirect URLs** and **outposts** (if used) when the **ingress hostname** and paths to **`.50`** services change. + +**Monitor (110)** — **Uptime Kuma**, **Peekaping**, and **Tracearr** are a good fit for the cluster: small state (SQLite or small DBs), **Ingress** via Traefik, and **Longhorn** or a small DB PVC. Migrate **one app at a time** and keep the old VM until DNS and alerts are verified. + +**arr (120)** — **Lidarr, Sonarr, Radarr**, and related *arr* apps; libraries and download paths point at **NFS** from **Openmediavault (100)** at **`192.168.1.105`**. The hard part is **keeping paths, permissions (UID/GID), and download client** wiring while pods move. + +**Automate (130)** — Tools are **barely used**; **decommission**, merge into **general-purpose (140)**, or replace with a **CronJob** / one-shot on the cluster only if something still needs scheduling. + +**general-purpose (140)** — “Daily driver” stack: **IT tools**, **Mealie**, **Open WebUI**, **SparkyFitness**, and similar. **Candidates for gradual moves** to noble; group by **data sensitivity** and **persistence** (Postgres vs SQLite) when you pick order. + +**Media-server (150)** — **Jellyfin** (testing) with libraries on **NFS**; **ebook** server. Treat **Jellyfin** like *arr* for storage: same NFS export and **transcoding** needs (CPU on worker nodes or GPU if you add it). Ebook stack depends on what you run (e.g. Kavita, Audiobookshelf) — note **metadata paths** before moving. 
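Before migrating any of the NFS-backed apps above, it helps to verify the OMV export and ownership mapping from inside the cluster. A minimal sketch using a throwaway pod (the server `192.168.1.105` and export `/export/media` match this doc; UID/GID `1000` is an assumption — use whatever your *arr* containers run as today):

```shell
# Throwaway pod that mounts the OMV export directly (no PVC or CSI driver yet).
# kubelet performs the NFS mount on the node, so the pod needs no extra privileges.
kubectl run nfs-check --rm -it --image=busybox:1.36 --restart=Never \
  --overrides='{
    "apiVersion": "v1",
    "spec": {
      "securityContext": {"runAsUser": 1000, "runAsGroup": 1000, "fsGroup": 1000},
      "containers": [{
        "name": "nfs-check",
        "image": "busybox:1.36",
        "command": ["sh"],
        "stdin": true,
        "tty": true,
        "volumeMounts": [{"name": "media", "mountPath": "/media"}]
      }],
      "volumes": [{
        "name": "media",
        "nfs": {"server": "192.168.1.105", "path": "/export/media"}
      }]
    }
  }'
# Inside the pod: ls -ln /media, then: touch /media/.rw-test && rm /media/.rw-test
# to confirm UID/GID and squash settings behave like the current Docker setup.
```

If the write test fails here, fix the OMV export (or the pod `securityContext`) before investing in the provisioner or app migration.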
|
+
+### Arr stack, NFS, and Kubernetes
+
+You do **not** have to move NFS into the cluster: **Openmediavault** on **`192.168.1.105`** can stay the **NFS server** while the *arr* apps run as **Deployments** with **ReadWriteMany** volumes. Noble nodes on **`192.168.50.0/24`** mount NFS using **that IP** (ensure **firewall** allows **NFS** from node IPs to OMV).
+
+1. **Keep OMV as the single source of exports** — same **export path** (e.g. `/export/media`) from the cluster’s perspective as from the current VM.
+2. **Mount NFS in Kubernetes** — use an NFS provisioner or CSI driver (e.g. **nfs-subdir-external-provisioner**, which is a dynamic provisioner rather than a CSI driver, or the **csi-driver-nfs** CSI driver) so each app gets a **PVC** backed by a **subdirectory** of the export, **or** one shared RWX PVC for a common tree if your layout needs it.
+3. **Match POSIX ownership** — set **supplemental groups** or **fsGroup** / **runAsUser** on the pods so Sonarr/Radarr see the same **UID/GID** as today’s Docker setup; fix **squash** settings on OMV if you use `root_squash`.
+4. **Config and DB** — back up each app’s **config volume** (or SQLite files), redeploy with the same **environment**; point **download clients** and **NFS media roots** to the **same logical paths** inside the container.
+5. **Low-risk path** — run **one** *arr* app on the cluster while the rest stay on **VM 120** until imports and downloads behave; then cut DNS/NPM streams over.
+
+If you prefer **no** NFS from pods, the alternative is **large ReadWriteOnce** disks on Longhorn and **sync** from OMV — usually **more** moving parts than **RWX NFS** for this workload class.
+
+---
+
+## Other hosts
+
+| Host | IP | VLAN / network | Notes |
+|------|-----|----------------|--------|
+| **Pi-hole** | `192.168.1.127` | `192.168.1.0/24` | DNS; same VLAN as Proxmox VMs. |
+| **Home Assistant** | *TBD* | **IoT VLAN** | Add reservation when fixed. |
+| **Mac mini** | `192.168.1.155` | `192.168.1.0/24` | Align with **Storage B** in [`Racks.md`](Racks.md) if the same machine.
| + +--- + +## Related docs + +- **Shared Postgres + S3 (centralized):** [`shared-data-services.md`](shared-data-services.md) +- **VM → noble migration plan:** [`migration-vm-to-noble.md`](migration-vm-to-noble.md) +- Noble cluster topology and ingress: [`architecture.md`](architecture.md) +- Physical racks (Primary / Storage B / Rack C): [`Racks.md`](Racks.md) +- Cluster checklist: [`../talos/CLUSTER-BUILD.md`](../talos/CLUSTER-BUILD.md) diff --git a/docs/migration-vm-to-noble.md b/docs/migration-vm-to-noble.md new file mode 100644 index 0000000..c577bf1 --- /dev/null +++ b/docs/migration-vm-to-noble.md @@ -0,0 +1,121 @@ +# Migration plan: Proxmox VMs → noble (Kubernetes) + +This document is the **default playbook** for moving workloads from **Proxmox VMs** on **`192.168.1.0/24`** into the **noble** Talos cluster on **`192.168.50.0/24`**. Source inventory and per-VM notes: [`homelab-network.md`](homelab-network.md). Cluster facts: [`architecture.md`](architecture.md), [`talos/CLUSTER-BUILD.md`](../talos/CLUSTER-BUILD.md). + +--- + +## 1. Scope and principles + +| Principle | Detail | +|-----------|--------| +| **One service at a time** | Run the new workload on **noble** while the **VM** stays up; cut over **DNS / NPM** only after checks pass. | +| **Same container image** | Prefer the **same** upstream image and major version as Docker on the VM to reduce surprises. | +| **Data moves with a plan** | **Backup** VM volumes or export DB dumps **before** the first deploy to the cluster. | +| **Ingress on noble** | Internal apps use **Traefik** + **`*.apps.noble.lab.pcenicni.dev`** (or your chosen hostnames) and **MetalLB** (e.g. **`192.168.50.211`**) per [`architecture.md`](architecture.md). | +| **Cross-VLAN** | Clients on **`.1`** reach services on **`.50`** via **routing**; **firewall** must allow **NFS** from **Talos node IPs** to **OMV `192.168.1.105`** when pods mount NFS. 
| + +**Not everything must move.** Keep **Openmediavault** (and optionally **NPM**) on VMs if you prefer; the cluster consumes **NFS** and **HTTP** from them. + +--- + +## 2. Prerequisites (before wave 1) + +1. **Cluster healthy** — `kubectl get nodes`; [`talos/CLUSTER-BUILD.md`](../talos/CLUSTER-BUILD.md) checklist through ingress and cert-manager as needed. +2. **Ingress + TLS** — **Traefik** + **cert-manager** working; you can hit a **test Ingress** on the MetalLB IP. +3. **GitOps / deploy path** — Decide per app: **Helm** under `clusters/noble/apps/`, **Argo CD**, or **Ansible**-applied manifests (match how you manage the rest of noble). +4. **Secrets** — Plan **Kubernetes Secrets**; for git-stored material, align with **SOPS** (`clusters/noble/secrets/`, `.sops.yaml`). +5. **Storage** — **Longhorn** default for **ReadWriteOnce** state; for **NFS** (*arr*, Jellyfin), install a **CSI NFS** driver and test a **small RWX PVC** before migrating data-heavy apps. +6. **Shared data tier (recommended)** — Deploy **centralized PostgreSQL** and **S3-compatible storage** on noble so apps do not each ship their own DB/object store; see [`shared-data-services.md`](shared-data-services.md). +7. **Firewall** — Rules: **workstation → `192.168.50.230:6443`**; **nodes → OMV NFS ports**; **clients → `192.168.50.211`** (or split-horizon DNS) as you design. +8. **DNS** — Split-horizon or Pi-hole records for **`*.apps.noble.lab.pcenicni.dev`** → **Traefik** IP **`192.168.50.211`** for LAN clients. + +--- + +## 3. Standard migration procedure (repeat per app) + +Use this checklist for **each** application (or small group, e.g. one Helm release). + +| Step | Action | +|------|--------| +| **A. Discover** | Document **image:tag**, **ports**, **volumes** (host paths), **env vars**, **depends_on** (DB, Redis, NFS path). Export **docker inspect** / **compose** from the VM. | +| **B. Backup** | Snapshot **Proxmox VM** or backup **volume** / **SQLite** / **DB dump** to offline storage. 
| +| **C. Namespace** | Create a **dedicated namespace** (e.g. `monitoring-tools`, `authentik`) or use your house standard. | +| **D. Deploy** | Add **Deployment** (or **StatefulSet**), **Service**, **Ingress** (class **traefik**), **PVCs**; wire **secrets** from **Secrets** (not literals in git). | +| **E. Storage** | **Longhorn** PVC for local state; **NFS CSI** PVC for shared media/config paths that must match the VM (see [`homelab-network.md`](homelab-network.md) *arr* section). Prefer **shared Postgres** / **shared S3** per [`shared-data-services.md`](shared-data-services.md) instead of new embedded databases. Match **UID/GID** with `securityContext`. | +| **F. Smoke test** | `kubectl port-forward` or temporary **Ingress** hostname; log in, run one critical workflow (login, playback, sync). | +| **G. DNS cutover** | Point **internal DNS** or **NPM** upstream from the **VM IP** to the **new hostname** (Traefik) or **MetalLB IP** + Host header. | +| **H. Observe** | 24–72 hours: logs, alerts, **Uptime Kuma** (once migrated), backups. | +| **I. Decommission** | Stop the **container** on the VM (not the whole VM until the **whole** VM is empty). | +| **J. VM off** | When **no** services remain on that VM, **power off** and archive or delete the VM. | + +**Rollback:** Re-enable the VM service, revert **DNS/NPM** to the old IP, delete or scale the cluster deployment to zero. + +--- + +## 4. Recommended migration order (phases) + +Order balances **risk**, **dependencies**, and **learning curve**. + +| Phase | Target | Rationale | +|-------|--------|-----------| +| **0 — Optional** | **Automate (130)** | Low use: **retire** or replace with **CronJobs**; skip if nothing valuable runs. | +| **0b — Platform** | **Shared Postgres + S3** on noble | Run **before** or alongside early waves so new deploys use **one DSN** and **one object endpoint**; retire **VM 160** when empty. See [`shared-data-services.md`](shared-data-services.md). 
| +| **1 — Observability** | **Monitor (110)** — Uptime Kuma, Peekaping, Tracearr | Small state, validates **Ingress**, **PVCs**, and **alert paths** before auth and media. | +| **2 — Git** | **gitea (300)**, **gitea-nsfw (310)** | Point at **shared Postgres** + **S3** for attachments; move **repos** with **PVC** + backup restore if needed. | +| **3 — Object / misc** | **s3 (160)**, **AMP (500)** | **Migrate data** into **central** S3 on cluster, then **decommission** duplicate MinIO on VM **160** if applicable. | +| **4 — Auth** | **Auth (190)** — **Authentik** | Use **shared Postgres**; update **all OIDC clients** (Gitea, apps, NPM) with **new issuer URLs**; schedule a **maintenance window**. | +| **5 — Daily apps** | **general-purpose (140)** | Move **one app per release** (Mealie, Open WebUI, …); each app gets its **own database** (and bucket if needed) on the **shared** tiers — not a new Postgres pod per app. | +| **6 — Media / *arr*** | **arr (120)**, **Media-server (150)** | **NFS** from **OMV**, download clients, **transcoding** — migrate **one *arr*** then Jellyfin/ebook; see NFS bullets in [`homelab-network.md`](homelab-network.md). | +| **7 — Edge** | **NPM (666/777)** | Often **last**: either keep on Proxmox or replace with **Traefik** + **IngressRoutes** / **Gateway API**; many people keep a **dedicated** reverse proxy VM until parity is proven. | + +**Openmediavault (100)** — Typically **stays** as **NFS** (and maybe backup target) for the cluster; no need to “migrate” the whole NAS into Kubernetes. + +--- + +## 5. Ingress and reverse proxy + +| Approach | When to use | +|----------|-------------| +| **Traefik Ingress** on noble | Default for **internal** HTTPS apps; **cert-manager** for public names you control. | +| **NPM (VM)** as front door | Point **proxy host** → **Traefik MetalLB IP** or **service name** if you add internal DNS; reduces double-proxy if you **terminate TLS** in one place only. 
| +| **Newt / Pangolin** | Public reachability per [`clusters/noble/bootstrap/newt/README.md`](../clusters/noble/bootstrap/newt/README.md); not automatic ExternalDNS. | + +Avoid **two** TLS terminations for the same hostname unless you intend **SSL passthrough** end-to-end. + +--- + +## 6. Authentik-specific (Auth VM → cluster) + +1. **Backup** Authentik **PostgreSQL** (or embedded DB) and **media** volume from the VM. +2. Deploy **Helm** (official chart) with **same** Authentik version if possible. +3. **Restore** DB into **shared cluster Postgres** (recommended) or chart-managed DB — see [`shared-data-services.md`](shared-data-services.md). +4. Update **issuer URL** in every **OIDC/OAuth** client (Gitea, Grafana, etc.). +5. Re-test **outposts** (if any) and **redirect URIs** from both **`.1`** and **`.50`** client perspectives. +6. **Cut over DNS**; then **decommission** VM **190**. + +--- + +## 7. *arr* and Jellyfin-specific + +Follow the **numbered list** under **“Arr stack, NFS, and Kubernetes”** in [`homelab-network.md`](homelab-network.md). In short: **OMV stays**; **CSI NFS** + **RWX**; **match permissions**; migrate **one app** first; verify **download client** can reach the new pod **IP/DNS** from your download host. + +--- + +## 8. Validation checklist (per wave) + +- Pods **Ready**, **Ingress** returns **200** / login page. +- **TLS** valid for chosen hostname. +- **Persistent data** present (new uploads, DB writes survive pod restart). +- **Backups** (Velero or app-level) defined for the new location. +- **Monitoring** / alerts updated (targets, not old VM IP). +- **Documentation** in [`homelab-network.md`](homelab-network.md) updated (VM retired or marked migrated). 
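Most of the checklist below can be run as a quick per-wave script. A sketch — namespace, hostname, and deployment name are illustrative placeholders for whichever app you just moved:

```shell
# Smoke test for one migrated app; NS/HOST/APP are example values, not repo facts.
NS=monitoring-tools
HOST=uptime.apps.noble.lab.pcenicni.dev
APP=uptime-kuma

kubectl -n "$NS" get pods                    # everything Ready?
kubectl -n "$NS" get ingress                 # hostname present?
curl -fsSI "https://$HOST" | head -n 1       # 200 via Traefik, valid TLS
# Persistence: restart and confirm data written beforehand is still there.
kubectl -n "$NS" rollout restart "deploy/$APP"
kubectl -n "$NS" rollout status "deploy/$APP" --timeout=120s
```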
+ +--- + +## Related docs + +- **Shared Postgres + S3:** [`shared-data-services.md`](shared-data-services.md) +- VM inventory and NFS notes: [`homelab-network.md`](homelab-network.md) +- Noble topology, MetalLB, Traefik: [`architecture.md`](architecture.md) +- Bootstrap and versions: [`talos/CLUSTER-BUILD.md`](../talos/CLUSTER-BUILD.md) +- Apps layout: [`clusters/noble/apps/README.md`](../clusters/noble/apps/README.md) diff --git a/docs/shared-data-services.md b/docs/shared-data-services.md new file mode 100644 index 0000000..5b6d9ad --- /dev/null +++ b/docs/shared-data-services.md @@ -0,0 +1,90 @@ +# Centralized PostgreSQL and S3-compatible storage + +Goal: **one shared PostgreSQL** and **one S3-compatible object store** on **noble**, instead of every app bundling its own database or MinIO. Apps keep **logical isolation** via **per-app databases** / **users** and **per-app buckets** (or prefixes), not separate clusters. + +See also: [`migration-vm-to-noble.md`](migration-vm-to-noble.md), [`homelab-network.md`](homelab-network.md) (VM **160** `s3` today), [`talos/CLUSTER-BUILD.md`](../talos/CLUSTER-BUILD.md) (Velero + S3). + +--- + +## 1. Why centralize + +| Benefit | Detail | +|--------|--------| +| **Operations** | One backup/restore story, one upgrade cadence, one place to tune **IOPS** and **retention**. | +| **Security** | **Least privilege**: each app gets its own **DB user** and **S3 credentials** scoped to one database or bucket. | +| **Resources** | Fewer duplicate **Postgres** or **MinIO** sidecars; better use of **Longhorn** or dedicated PVCs for the shared tiers. | + +**Tradeoff:** Shared tiers are **blast-radius** targets — use **backups**, **PITR** where you care, and **NetworkPolicies** so only expected namespaces talk to Postgres/S3. + +--- + +## 2. PostgreSQL — recommended pattern + +1. 
**Run Postgres on noble** — use an operator such as **CloudNativePG** or the **Zalando Postgres operator**, or a well-maintained **Helm** chart, with **replicas** + **persistent volumes** (Longhorn). +2. **One cluster instance, many databases** — For each app: `CREATE DATABASE appname;` and a **dedicated role** with `CONNECT` on that database only (not superuser). +3. **Connection from apps** — Use a **Kubernetes Service** (e.g. `postgres-platform.platform.svc.cluster.local:5432`) and pass **credentials** via **Secrets** (ideally **SOPS**-encrypted in git). +4. **Migrations** — Run app **migration** jobs or init containers against the **same** DSN once the database exists. + +**Migrating off SQLite / embedded Postgres** + +- **SQLite → Postgres:** export/import per app (native tools, or **pgloader** where appropriate). +- **Docker Postgres volume:** `pg_dumpall` or per-DB `pg_dump` → restore into a **new** database on the shared server; **freeze writes** during cutover. + +--- + +## 3. S3-compatible object storage — recommended pattern + +1. **Run one S3 API on noble** — **MinIO** (common), **Garage**, or **SeaweedFS** S3 layer — with **PVC(s)** or host path for data; **erasure coding** / replicas if the chart supports it and you want durability across nodes. +2. **Buckets per concern** — e.g. `gitea-attachments`, `velero`, `loki-archive` — not one global bucket unless you enforce **prefix** IAM policies. +3. **Credentials** — **IAM-style** users limited to **one bucket** (or prefix); **Secrets** reference **access key** / **secret**; never commit keys in plain text. +4. **Endpoint for pods** — In-cluster: `http://minio.platform.svc.cluster.local:9000` (or TLS inside mesh). Apps use **virtual-hosted** or **path-style** addressing per SDK defaults.
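The two per-app isolation patterns above (one database + role, one bucket + user) can be sketched end-to-end. Every name, password, and the `central` alias here are placeholders, and `mc admin policy attach` assumes a recent MinIO `mc` (older releases use `mc admin policy set`):

```bash
# Hypothetical names throughout — adjust to your app.
# Postgres: one database, one non-superuser role.
kubectl -n platform exec -i sts/postgres-platform -- psql -U postgres <<'SQL'
CREATE ROLE appname_rw LOGIN PASSWORD 'change-me';
CREATE DATABASE appname OWNER appname_rw;
REVOKE CONNECT ON DATABASE appname FROM PUBLIC;
SQL

# MinIO: one bucket, one scoped user (swap the broad built-in "readwrite"
# for a bucket-scoped policy when you tighten things up).
mc alias set central http://minio.platform.svc.cluster.local:9000 "$ROOT_KEY" "$ROOT_SECRET"
mc mb central/appname-data
mc admin user add central appname-access "$APPNAME_SECRET"
mc admin policy attach central readwrite --user appname-access
```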
+ +### NFS as backing store for S3 on noble + +**Yes.** You can run MinIO (or another S3-compatible server) with its **data directory** on a **ReadWriteMany** volume that is **NFS** — for example the same **Openmediavault** export you already use, mounted via your **NFS CSI** driver (see [`homelab-network.md`](homelab-network.md)). + +| Consideration | Detail | +|---------------|--------| +| **Works for homelab** | MinIO stores objects as files under a path; **POSIX** on NFS is enough for many setups. | +| **Performance** | NFS adds **latency** and shared bandwidth; fine for moderate use, less ideal for heavy multi-tenant throughput. | +| **Availability** | The **NFS server** (OMV) becomes part of the availability story for object data — plan **backups** and **OMV** health like any dependency. | +| **Locking / semantics** | Prefer **NFSv4.x**; avoid mixing **NFS** and expectations of **local SSD** (e.g. very chatty small writes). If you see odd behavior, **Longhorn** (block) on a node is the usual next step. | +| **Layering** | You are stacking **S3 API → file layout → NFS → disk**; that is normal for a lab, just **monitor** space and exports on OMV. | + +**Summary:** NFS-backed PVC for MinIO is **valid** on noble; use **Longhorn** (or local disk) when you need **better IOPS** or want object data **inside** the cluster’s storage domain without depending on OMV for that tier. + +**Migrating off VM 160 (`s3`) or per-app MinIO** + +- **MinIO → MinIO:** `mc mirror` between aliases, or **replication** if you configure it. +- **Same API:** Any tool speaking **S3** can **sync** buckets before you point apps at the new endpoint. + +**Velero** — Point the **backup location** at the **central** bucket (see cluster Velero docs); avoid a second ad-hoc object store for backups if one cluster bucket is enough. + +--- + +## 4. 
Ordering relative to app migrations + +| When | What | +|------|------| +| **Early** | Stand up **Postgres** + **S3** with **empty** DBs/buckets; test with **one** non-critical app (e.g. a throwaway deployment). | +| **Before auth / Git** | **Gitea** and **Authentik** benefit from **managed Postgres** early — plan **DSN** and **bucket** for attachments **before** cutover. | +| **Ongoing** | New apps **must not** ship embedded **Postgres/MinIO** unless the workload truly requires it (e.g. vendor appliance). | + +--- + +## 5. Checklist (platform team) + +- [ ] Postgres **Service** DNS name and **TLS** (optional in-cluster) documented. +- [ ] S3 **endpoint**, **region** string (can be `us-east-1` for MinIO), **TLS** for Ingress if clients are outside the cluster. +- [ ] **Backup:** scheduled **logical dumps** (Postgres) and **bucket replication** or **object versioning** where needed. +- [ ] **SOPS** / **External Secrets** pattern for **rotation** without editing app manifests by hand. +- [ ] **homelab-network.md** updated when **VM 160** is retired or repurposed. 
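The “scheduled logical dumps” item in the checklist can start as a few commands (wrapped in cron or a Kubernetes CronJob later); pod, database, and bucket names are illustrative:

```bash
# Hypothetical names. Dump one app database from the shared Postgres
# pod and push it to a central dumps bucket, with simple retention.
STAMP=$(date +%F)
kubectl -n platform exec sts/postgres-platform -- \
  pg_dump -U postgres --format=custom appname > "appname-$STAMP.dump"
mc cp "appname-$STAMP.dump" central/postgres-dumps/
mc rm --recursive --force --older-than 30d central/postgres-dumps/
```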
+ +--- + +## Related docs + +- VM → cluster migration: [`migration-vm-to-noble.md`](migration-vm-to-noble.md) +- Inventory (s3 VM): [`homelab-network.md`](homelab-network.md) +- Longhorn / storage runbook: [`../talos/runbooks/longhorn.md`](../talos/runbooks/longhorn.md) +- Velero (S3 backup target): [`../clusters/noble/bootstrap/velero/`](../clusters/noble/bootstrap/velero/) (if present) diff --git a/komodo/monitor/tracearr/compose.yaml b/komodo/monitor/tracearr/compose.yaml index 1a3a936..e43f17c 100644 --- a/komodo/monitor/tracearr/compose.yaml +++ b/komodo/monitor/tracearr/compose.yaml @@ -7,7 +7,7 @@ services: tracearr: - image: ghcr.io/connorgallopo/tracearr:supervised-nightly + image: ghcr.io/connorgallopo/tracearr:latest shm_size: 256mb # Required for PostgreSQL shared memory ports: - "${PORT:-3000}:3000" diff --git a/talos/CLUSTER-BUILD.md b/talos/CLUSTER-BUILD.md index ff5f5b2..a8725bc 100644 --- a/talos/CLUSTER-BUILD.md +++ b/talos/CLUSTER-BUILD.md @@ -4,7 +4,7 @@ This document is the **exported TODO** for the **noble** Talos cluster (4 nodes) ## Current state (2026-03-28) -Lab stack is **up** on-cluster through **Phase D**–**F** and **Phase G** (Vault **CiliumNetworkPolicy**, **`talos/runbooks/`**). **Next focus:** optional **Alertmanager** receivers (Slack/PagerDuty); tighten **RBAC** (Headlamp / cluster-admin); **Cilium** policies for other namespaces as needed; enable **Mend Renovate** for PRs; Pangolin/sample Ingress; **Velero** backup/restore drill after S3 credentials are set (**`noble_velero_install`**). +Lab stack is **up** on-cluster through **Phase D**–**F** and **Phase G** (**`talos/runbooks/`**, **SOPS**-encrypted secrets in **`clusters/noble/secrets/`**). 
**Next focus:** optional **Alertmanager** receivers (Slack/PagerDuty); tighten **RBAC** (Headlamp / cluster-admin); **Cilium** policies for other namespaces as needed; enable **Mend Renovate** for PRs; Pangolin/sample Ingress; **Velero** backup/restore drill after S3 credentials are set (**`noble_velero_install`**). - **Talos** v1.12.6 (target) / **Kubernetes** as bundled — four nodes **Ready** unless upgrading; **`talosctl health`**; **`talos/kubeconfig`** is **local only** (gitignored — never commit; regenerate with `talosctl kubeconfig` per `talos/README.md`). **Image Factory (nocloud installer):** `factory.talos.dev/nocloud-installer/249d9135de54962744e917cfe654117000cba369f9152fbab9d055a00aa3664f:v1.12.6` - **Cilium** Helm **1.16.6** / app **1.16.6** (`clusters/noble/bootstrap/cilium/`, phase 1 values). @@ -15,13 +15,11 @@ Lab stack is **up** on-cluster through **Phase D**–**F** and **Phase G** (Vaul - **Longhorn** Helm **1.11.1** / app **v1.11.1** — `clusters/noble/bootstrap/longhorn/` (PSA **privileged** namespace, `defaultDataPath` `/var/mnt/longhorn`, `preUpgradeChecker` enabled); **StorageClass** `longhorn` (default); **`nodes.longhorn.io`** all **Ready**; test **PVC** `Bound` on `longhorn`. - **Traefik** Helm **39.0.6** / app **v3.6.11** — `clusters/noble/bootstrap/traefik/`; **`Service`** **`LoadBalancer`** **`EXTERNAL-IP` `192.168.50.211`**; **`IngressClass`** **`traefik`** (default). Point **`*.apps.noble.lab.pcenicni.dev`** at **`192.168.50.211`**. MetalLB pool verification was done before replacing the temporary nginx test with Traefik. - **cert-manager** Helm **v1.20.0** / app **v1.20.0** — `clusters/noble/bootstrap/cert-manager/`; **`ClusterIssuer`** **`letsencrypt-staging`** and **`letsencrypt-prod`** (**DNS-01** via **Cloudflare** for **`pcenicni.dev`**, Secret **`cloudflare-dns-api-token`** in **`cert-manager`**); ACME email **`certificates@noble.lab.pcenicni.dev`** (edit in manifests if you want a different mailbox). 
-- **Newt** Helm **1.2.0** / app **1.10.1** — `clusters/noble/bootstrap/newt/` (**fossorial/newt**); Pangolin site tunnel — **`newt-pangolin-auth`** Secret (**`PANGOLIN_ENDPOINT`**, **`NEWT_ID`**, **`NEWT_SECRET`**). Prefer a **SealedSecret** in git (`kubeseal` — see `clusters/noble/bootstrap/sealed-secrets/examples/`) after rotating credentials if they were exposed. **Public DNS** is **not** automated with ExternalDNS: **CNAME** records at your DNS host per Pangolin’s domain instructions, plus **Integration API** for HTTP resources/targets — see **`clusters/noble/bootstrap/newt/README.md`**. LAN access to Traefik can still use **`*.apps.noble.lab.pcenicni.dev`** → **`192.168.50.211`** (split horizon / local resolver). +- **Newt** Helm **1.2.0** / app **1.10.1** — `clusters/noble/bootstrap/newt/` (**fossorial/newt**); Pangolin site tunnel — **`newt-pangolin-auth`** Secret (**`PANGOLIN_ENDPOINT`**, **`NEWT_ID`**, **`NEWT_SECRET`**). Store credentials in git with **SOPS** (`clusters/noble/secrets/newt-pangolin-auth.secret.yaml`, **`age-key.txt`**, **`.sops.yaml`**) — see **`clusters/noble/secrets/README.md`**. **Public DNS** is **not** automated with ExternalDNS: **CNAME** records at your DNS host per Pangolin’s domain instructions, plus **Integration API** for HTTP resources/targets — see **`clusters/noble/bootstrap/newt/README.md`**. LAN access to Traefik can still use **`*.apps.noble.lab.pcenicni.dev`** → **`192.168.50.211`** (split horizon / local resolver). - **Argo CD** Helm **9.4.17** / app **v3.3.6** — `clusters/noble/bootstrap/argocd/`; **`argocd-server`** **`LoadBalancer`** **`192.168.50.210`**; app-of-apps root syncs **`clusters/noble/apps/`** (edit **`root-application.yaml`** `repoURL` before applying). 
- **kube-prometheus-stack** — Helm chart **82.15.1** — `clusters/noble/bootstrap/kube-prometheus-stack/` (**namespace** `monitoring`, PSA **privileged** — **node-exporter** needs host mounts); **Longhorn** PVCs for Prometheus, Grafana, Alertmanager; **node-exporter** DaemonSet **4/4**. **Grafana Ingress:** **`https://grafana.apps.noble.lab.pcenicni.dev`** (Traefik **`ingressClassName: traefik`**, **`cert-manager.io/cluster-issuer: letsencrypt-prod`**). **Loki** datasource in Grafana: ConfigMap **`clusters/noble/bootstrap/grafana-loki-datasource/loki-datasource.yaml`** (sidecar label **`grafana_datasource: "1"`**) — not via **`grafana.additionalDataSources`** in the chart. **`helm upgrade --install` with `--wait` is silent until done** — use **`--timeout 30m`**; Grafana admin: Secret **`kube-prometheus-grafana`**, keys **`admin-user`** / **`admin-password`**. - **Loki** + **Fluent Bit** — **`grafana/loki` 6.55.0** SingleBinary + **filesystem** on **Longhorn** (`clusters/noble/bootstrap/loki/`); **`loki.auth_enabled: false`**; **`chunksCache.enabled: false`** (no memcached chunk cache). **`fluent/fluent-bit` 0.56.0** → **`loki-gateway.loki.svc:80`** (`clusters/noble/bootstrap/fluent-bit/`); **`logging`** PSA **privileged**. **Grafana Explore:** **`kubectl apply -f clusters/noble/bootstrap/grafana-loki-datasource/loki-datasource.yaml`** then **Explore → Loki** (e.g. `{job="fluent-bit"}`). -- **Sealed Secrets** Helm **2.18.4** / app **0.36.1** — `clusters/noble/bootstrap/sealed-secrets/` (namespace **`sealed-secrets`**); **`kubeseal`** on client should match controller minor (**README**); back up **`sealed-secrets-key`** (see README). -- **External Secrets Operator** Helm **2.2.0** / app **v2.2.0** — `clusters/noble/bootstrap/external-secrets/`; Vault **`ClusterSecretStore`** in **`examples/vault-cluster-secret-store.yaml`** (**`http://`** to match Vault listener — apply after Vault **Kubernetes auth**). 
-- **Vault** Helm **0.32.0** / app **1.21.2** — `clusters/noble/bootstrap/vault/` — standalone **file** storage, **Longhorn** PVC; **HTTP** listener (`global.tlsDisable`); optional **CronJob** lab unseal **`unseal-cronjob.yaml`**; **not** initialized in git — run **`vault operator init`** per **`README.md`**. +- **SOPS** — cluster **`Secret`** manifests under **`clusters/noble/secrets/`** encrypted with **age** (see **`.sops.yaml`**, **`age-key.txt`** gitignored); **`noble.yml`** decrypt-applies when the private key is present. - **Velero** Helm **12.0.0** / app **v1.18.0** — `clusters/noble/bootstrap/velero/` (**Ansible** **`noble_velero`**, not Argo); **S3-compatible** backup location + **CSI** snapshots (**`EnableCSI`**); enable with **`noble_velero_install`** per **`velero/README.md`**. - **Still open:** **Renovate** — install **[Mend Renovate](https://github.com/apps/renovate)** (or self-host) so PRs run; optional **Alertmanager** notification channels; optional **sample Ingress + cert + Pangolin** end-to-end; **Argo CD SSO**. 
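Once the Velero **BackupStorageLocation** reports **Available**, the backup/restore drill called out under “Next focus” can be run roughly like this (the `demo` namespace is a throwaway example; fill in the generated backup/restore names):

```bash
# Drill sketch — lab only; never delete a namespace you care about.
velero backup create drill-$(date +%s) --include-namespaces demo --wait
velero backup get                          # PHASE should be Completed
kubectl delete namespace demo              # simulate loss
velero restore create --from-backup <backup-name> --wait
velero restore describe <restore-name>     # check for warnings/errors
```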
@@ -64,9 +62,6 @@ Lab stack is **up** on-cluster through **Phase D**–**F** and **Phase G** (Vaul - kube-prometheus-stack: **82.15.1** (Helm chart `prometheus-community/kube-prometheus-stack`; app **v0.89.x** bundle) - Loki: **6.55.0** (Helm chart `grafana/loki`; app **3.6.7**) - Fluent Bit: **0.56.0** (Helm chart `fluent/fluent-bit`; app **4.2.3**) -- Sealed Secrets: **2.18.4** (Helm chart `sealed-secrets/sealed-secrets`; app **0.36.1**) -- External Secrets Operator: **2.2.0** (Helm chart `external-secrets/external-secrets`; app **v2.2.0**) -- Vault: **0.32.0** (Helm chart `hashicorp/vault`; app **1.21.2**) - Kyverno: **3.7.1** (Helm chart `kyverno/kyverno`; app **v1.17.1**); **kyverno-policies** **3.7.1** — **baseline** PSS, **Audit** (`clusters/noble/bootstrap/kyverno/`) - Headlamp: **0.40.1** (Helm chart `headlamp/headlamp`; app matches chart — see [Artifact Hub](https://artifacthub.io/packages/helm/headlamp/headlamp)) - Velero: **12.0.0** (Helm chart `vmware-tanzu/velero`; app **v1.18.0**) — **`clusters/noble/bootstrap/velero/`**; AWS plugin **v1.14.0**; Ansible **`noble_velero`** @@ -77,7 +72,7 @@ Lab stack is **up** on-cluster through **Phase D**–**F** and **Phase G** (Vaul | Artifact | Path | |----------|------| | This checklist | `talos/CLUSTER-BUILD.md` | -| Operational runbooks (API VIP, etcd, Longhorn, Vault) | `talos/runbooks/` | +| Operational runbooks (API VIP, etcd, Longhorn, SOPS) | `talos/runbooks/` | | Talos quick start + networking + kubeconfig | `talos/README.md` | | talhelper source (active) | `talos/talconfig.yaml` — may be **wipe-phase** (no Longhorn volume) during disk recovery | | Longhorn volume restore | `talos/talconfig.with-longhorn.yaml` — copy to `talconfig.yaml` after GPT wipe (see `talos/README.md` §5) | @@ -96,13 +91,11 @@ Lab stack is **up** on-cluster through **Phase D**–**F** and **Phase G** (Vaul | Grafana Loki datasource (ConfigMap; no chart change) | `clusters/noble/bootstrap/grafana-loki-datasource/loki-datasource.yaml` | | 
Loki (Helm values) | `clusters/noble/bootstrap/loki/` — `values.yaml`, `namespace.yaml` | | Fluent Bit → Loki (Helm values) | `clusters/noble/bootstrap/fluent-bit/` — `values.yaml`, `namespace.yaml` | -| Sealed Secrets (Helm) | `clusters/noble/bootstrap/sealed-secrets/` — `values.yaml`, `namespace.yaml`, `README.md` | -| External Secrets Operator (Helm + Vault store example) | `clusters/noble/bootstrap/external-secrets/` — `values.yaml`, `namespace.yaml`, `README.md`, `examples/vault-cluster-secret-store.yaml` | -| Vault (Helm + optional unseal CronJob) | `clusters/noble/bootstrap/vault/` — `values.yaml`, `namespace.yaml`, `unseal-cronjob.yaml`, `cilium-network-policy.yaml`, `configure-kubernetes-auth.sh`, `README.md` | +| SOPS-encrypted cluster Secrets | `clusters/noble/secrets/` — `README.md`, `*.secret.yaml`; **`.sops.yaml`**, **`age-key.txt`** (gitignored) at repo root | | Kyverno + PSS baseline policies | `clusters/noble/bootstrap/kyverno/` — `values.yaml`, `policies-values.yaml`, `namespace.yaml`, `README.md` | | Headlamp (Helm + Ingress) | `clusters/noble/bootstrap/headlamp/` — `values.yaml`, `namespace.yaml`, `README.md` | | Velero (Helm + S3 BSL; CSI snapshots) | `clusters/noble/bootstrap/velero/` — `values.yaml`, `namespace.yaml`, `README.md`; **`ansible/roles/noble_velero`** | -| Renovate (repo config + optional self-hosted Helm) | **`renovate.json`** at repo root; optional self-hosted chart under **`clusters/noble/apps/`** (Argo) + token Secret (**Sealed Secrets** / **ESO** after **Phase E**) | +| Renovate (repo config + optional self-hosted Helm) | **`renovate.json`** at repo root; optional self-hosted chart under **`clusters/noble/apps/`** (Argo) + token Secret (SOPS under **`clusters/noble/secrets/`** or imperative **`kubectl create secret`**) | **Git vs cluster:** manifests and `talconfig` live in git; **`talhelper genconfig -o out`**, bootstrap, Helm, and `kubectl` run on your LAN. 
See **`talos/README.md`** for workstation reachability (lab LAN/VPN), **`talosctl kubeconfig`** vs Kubernetes `server:` (VIP vs node IP), and **`--insecure`** only in maintenance. @@ -114,10 +107,9 @@ Lab stack is **up** on-cluster through **Phase D**–**F** and **Phase G** (Vaul 4. **CSI Volume snapshots:** **`kubernetes-csi/external-snapshotter`** CRDs + **`snapshot-controller`** (`clusters/noble/bootstrap/csi-snapshot-controller/`) before relying on **Longhorn** / **Velero** volume snapshots. 5. **Longhorn:** Talos user volume + extensions in `talconfig.with-longhorn.yaml` (when restored); Helm **`defaultDataPath`** in `clusters/noble/bootstrap/longhorn/values.yaml`. 6. **Loki → Fluent Bit → Grafana datasource:** deploy **Loki** (`loki-gateway` Service) before **Fluent Bit**; apply **`clusters/noble/bootstrap/grafana-loki-datasource/loki-datasource.yaml`** after **Loki** (sidecar picks up the ConfigMap — no kube-prometheus values change for Loki). -7. **Vault:** **Longhorn** default **StorageClass** before **`clusters/noble/bootstrap/vault/`** Helm (PVC **`data-vault-0`**); **External Secrets** **`ClusterSecretStore`** after Vault is initialized, unsealed, and **Kubernetes auth** is configured. -8. **Headlamp:** **Traefik** + **cert-manager** (**`letsencrypt-prod`**) before exposing **`headlamp.apps.noble.lab.pcenicni.dev`**; treat as **cluster-admin** UI — protect with network policy / SSO when hardening (**Phase G**). -9. **Renovate:** **Git remote** + platform access (**hosted app** needs org/repo install; **self-hosted** needs **`RENOVATE_TOKEN`** and chart **`renovate.config`**). If the bot runs **in-cluster**, add the token **after** **Sealed Secrets** / **Vault** (**Phase E**) — no ingress required for the bot itself. -10. 
**Velero:** **S3-compatible** endpoint + bucket + **`velero/velero-cloud-credentials`** before **`ansible/playbooks/noble.yml`** with **`noble_velero_install: true`**; for **CSI** volume snapshots, label a **VolumeSnapshotClass** per **`clusters/noble/bootstrap/velero/README.md`** (e.g. Longhorn). +7. **Headlamp:** **Traefik** + **cert-manager** (**`letsencrypt-prod`**) before exposing **`headlamp.apps.noble.lab.pcenicni.dev`**; treat as **cluster-admin** UI — protect with network policy / SSO when hardening (**Phase G**). +8. **Renovate:** **Git remote** + platform access (**hosted app** needs org/repo install; **self-hosted** needs **`RENOVATE_TOKEN`** and chart **`renovate.config`**). If the bot runs **in-cluster**, store the token with **SOPS** or an imperative Secret — no ingress required for the bot itself. +9. **Velero:** **S3-compatible** endpoint + bucket + **`velero/velero-cloud-credentials`** before **`ansible/playbooks/noble.yml`** with **`noble_velero_install: true`**; for **CSI** volume snapshots, label a **VolumeSnapshotClass** per **`clusters/noble/bootstrap/velero/README.md`** (e.g. Longhorn). ## Prerequisites (before phases) @@ -160,7 +152,7 @@ Lab stack is **up** on-cluster through **Phase D**–**F** and **Phase G** (Vaul - [x] **Argo CD** bootstrap — `clusters/noble/bootstrap/argocd/` (`helm upgrade --install argocd …`) — also covered by **`ansible/playbooks/noble.yml`** (role **`noble_argocd`**) - [x] Argo CD server **LoadBalancer** — **`192.168.50.210`** (see `values.yaml`) - [x] **App-of-apps** — optional; **`clusters/noble/apps/kustomization.yaml`** is **empty** (core stack is **Ansible**-managed from **`clusters/noble/bootstrap/`**, not Argo). 
Set **`repoURL`** in **`root-application.yaml`** and add **`Application`** manifests only for optional GitOps workloads — see **`clusters/noble/apps/README.md`** -- [x] **Renovate** — **`renovate.json`** at repo root ([Renovate](https://docs.renovatebot.com/) — **Kubernetes** manager for **`clusters/noble/**/*.yaml`** image pins; grouped minor/patch PRs). **Activate PRs:** install **[Mend Renovate](https://github.com/apps/renovate)** on the Git repo (**Option A**), or **Option B:** self-hosted chart per [Helm charts](https://docs.renovatebot.com/helm-charts/) + token from **Sealed Secrets** / **ESO**. Helm **chart** versions pinned only in comments still need manual bumps or extra **regex** `customManagers` — extend **`renovate.json`** as needed. +- [x] **Renovate** — **`renovate.json`** at repo root ([Renovate](https://docs.renovatebot.com/) — **Kubernetes** manager for **`clusters/noble/**/*.yaml`** image pins; grouped minor/patch PRs). **Activate PRs:** install **[Mend Renovate](https://github.com/apps/renovate)** on the Git repo (**Option A**), or **Option B:** self-hosted chart per [Helm charts](https://docs.renovatebot.com/helm-charts/) + token from **SOPS** or a one-off Secret. Helm **chart** versions pinned only in comments still need manual bumps or extra **regex** `customManagers` — extend **`renovate.json`** as needed. 
- [ ] SSO — later ## Phase D — Observability @@ -171,9 +163,7 @@ Lab stack is **up** on-cluster through **Phase D**–**F** and **Phase G** (Vaul ## Phase E — Secrets -- [x] **Sealed Secrets** (optional Git workflow) — `clusters/noble/bootstrap/sealed-secrets/` (Helm **2.18.4**); **`kubeseal`** + key backup per **`README.md`** -- [x] **Vault** in-cluster on Longhorn + **auto-unseal** — `clusters/noble/bootstrap/vault/` (Helm **0.32.0**); **Longhorn** PVC; **OSS** “auto-unseal” = optional **`unseal-cronjob.yaml`** + Secret (**README**); **`configure-kubernetes-auth.sh`** for ESO (**Kubernetes auth** + KV + role) -- [x] **External Secrets Operator** + Vault `ClusterSecretStore` — operator **`clusters/noble/bootstrap/external-secrets/`** (Helm **2.2.0**); apply **`examples/vault-cluster-secret-store.yaml`** after Vault (**`README.md`**) +- [x] **SOPS** — encrypt **`Secret`** YAML under **`clusters/noble/secrets/`** with **age** (see **`.sops.yaml`**, **`clusters/noble/secrets/README.md`**); keep **`age-key.txt`** private (gitignored). **`ansible/playbooks/noble.yml`** decrypt-applies **`*.yaml`** when **`age-key.txt`** exists. 
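The Phase E flow end-to-end, assuming `sops` and `age` on the control host and the repo-root `.sops.yaml` from this change; the secret name and literal are placeholders:

```bash
# One-time: generate a key pair; the public key goes into .sops.yaml,
# the private half (age-key.txt) stays local and gitignored.
age-keygen -o age-key.txt
export SOPS_AGE_KEY_FILE="$PWD/age-key.txt"

# Per secret: render plaintext YAML, encrypt in place, commit the ciphertext.
kubectl create secret generic my-app-credentials -n my-app \
  --from-literal=API_TOKEN=change-me \
  --dry-run=client -o yaml > clusters/noble/secrets/my-app-credentials.secret.yaml
sops --encrypt --in-place clusters/noble/secrets/my-app-credentials.secret.yaml

# Apply one by hand (noble.yml does the same for all when age-key.txt exists):
sops --decrypt clusters/noble/secrets/my-app-credentials.secret.yaml | kubectl apply -f -
```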
## Phase F — Policy + backups @@ -182,8 +172,7 @@ Lab stack is **up** on-cluster through **Phase D**–**F** and **Phase G** (Vaul ## Phase G — Hardening -- [x] **Cilium** — Vault **`CiliumNetworkPolicy`** (`clusters/noble/bootstrap/vault/cilium-network-policy.yaml`) — HTTP **8200** from **`external-secrets`** + **`vault`**; extend for other clients as needed -- [x] **Runbooks** — **`talos/runbooks/`** (API VIP / kube-vip, etcd–Talos, Longhorn, Vault) +- [x] **Runbooks** — **`talos/runbooks/`** (API VIP / kube-vip, etcd–Talos, Longhorn, SOPS) - [x] **RBAC** — **Headlamp** **`ClusterRoleBinding`** uses built-in **`edit`** (not **`cluster-admin`**); **Argo CD** **`policy.default: role:readonly`** with **`g, admin, role:admin`** — see **`clusters/noble/bootstrap/headlamp/values.yaml`**, **`clusters/noble/bootstrap/argocd/values.yaml`**, **`talos/runbooks/rbac.md`** - [ ] **Alertmanager** — add **`slack_configs`**, **`pagerduty_configs`**, or other receivers under **`kube-prometheus-stack`** `alertmanager.config` (chart defaults use **`null`** receiver) @@ -201,12 +190,10 @@ Lab stack is **up** on-cluster through **Phase D**–**F** and **Phase G** (Vaul - [x] **`logging`** — **Fluent Bit** DaemonSet **Running** on all nodes (logs → **Loki**) - [x] **Grafana** — **Loki** datasource from **`grafana-loki-datasource`** ConfigMap (**Explore** works after apply + sidecar sync) - [x] **Headlamp** — Deployment **Running** in **`headlamp`**; UI at **`https://headlamp.apps.noble.lab.pcenicni.dev`** (TLS via **`letsencrypt-prod`**) -- [x] **`sealed-secrets`** — controller **Deployment** **Running** in **`sealed-secrets`** (install + **`kubeseal`** per **`apps/sealed-secrets/README.md`**) -- [x] **`external-secrets`** — controller + webhook + cert-controller **Running** in **`external-secrets`**; apply **`ClusterSecretStore`** after Vault **Kubernetes auth** -- [x] **`vault`** — **StatefulSet** **Running**, **`data-vault-0`** PVC **Bound** on **longhorn**; **`vault operator init`** + 
unseal per **`apps/vault/README.md`** +- [x] **SOPS secrets** — **`clusters/noble/secrets/*.yaml`** encrypted in git; **`noble.yml`** applies decrypted manifests when **`age-key.txt`** is present - [x] **`kyverno`** — admission / background / cleanup / reports controllers **Running** in **`kyverno`**; **ClusterPolicies** for **PSS baseline** **Ready** (**Audit**) - [ ] **`velero`** — when enabled: Deployment **Running** in **`velero`**; **`BackupStorageLocation`** / **`VolumeSnapshotLocation`** **Available**; test backup per **`velero/README.md`** -- [x] **Phase G (partial)** — Vault **`CiliumNetworkPolicy`**; **`talos/runbooks/`** (incl. **RBAC**); **Headlamp**/**Argo CD** RBAC tightened — **Alertmanager** receivers still optional +- [x] **Phase G (partial)** — **`talos/runbooks/`** (incl. **RBAC**); **Headlamp**/**Argo CD** RBAC tightened — **Alertmanager** receivers still optional --- diff --git a/talos/README.md b/talos/README.md index efc33e4..89564a4 100644 --- a/talos/README.md +++ b/talos/README.md @@ -1,7 +1,7 @@ # Talos — noble lab - **Cluster build checklist (exported TODO):** [CLUSTER-BUILD.md](./CLUSTER-BUILD.md) -- **Operational runbooks (API VIP, etcd, Longhorn, Vault):** [runbooks/README.md](./runbooks/README.md) +- **Operational runbooks (API VIP, etcd, Longhorn, SOPS):** [runbooks/README.md](./runbooks/README.md) ## Versions diff --git a/talos/runbooks/README.md b/talos/runbooks/README.md index 422fd21..f198c32 100644 --- a/talos/runbooks/README.md +++ b/talos/runbooks/README.md @@ -7,5 +7,5 @@ Short recovery / triage notes for the **noble** Talos cluster. 
Deep procedures live in the linked docs. | Kubernetes API VIP (kube-vip) | [`api-vip-kube-vip.md`](./api-vip-kube-vip.md) | | etcd / Talos control plane | [`etcd-talos.md`](./etcd-talos.md) | | Longhorn storage | [`longhorn.md`](./longhorn.md) | -| Vault (unseal, auth, ESO) | [`vault.md`](./vault.md) | +| SOPS (secrets in git) | [`sops.md`](./sops.md) | | RBAC (Headlamp, Argo CD) | [`rbac.md`](./rbac.md) | diff --git a/talos/runbooks/sops.md b/talos/runbooks/sops.md new file mode 100644 index 0000000..8c97efb --- /dev/null +++ b/talos/runbooks/sops.md @@ -0,0 +1,13 @@ +# Runbook: SOPS secrets (git-encrypted) + +**Symptoms:** `sops -d` fails; `kubectl apply` after Ansible shows no secret; `noble.yml` skips apply. + +**Checklist** + +1. **Private key:** `age-key.txt` at the repository root (gitignored). Create with `age-keygen -o age-key.txt` and add the **public** key to `.sops.yaml` (see `clusters/noble/secrets/README.md`). +2. **Environment:** `export SOPS_AGE_KEY_FILE=/absolute/path/to/home-server/age-key.txt` when editing or applying by hand. +3. **Edit encrypted file:** `sops clusters/noble/secrets/<name>.secret.yaml` +4. **Apply one file:** `sops -d clusters/noble/secrets/<name>.secret.yaml | kubectl apply -f -` +5. **Ansible:** `noble_apply_sops_secrets` is true by default; the platform role applies all `*.yaml` when `age-key.txt` exists. + +**References:** [`clusters/noble/secrets/README.md`](../../clusters/noble/secrets/README.md), [Mozilla SOPS](https://github.com/getsops/sops). diff --git a/talos/runbooks/vault.md b/talos/runbooks/vault.md deleted file mode 100644 index 4786df9..0000000 --- a/talos/runbooks/vault.md +++ /dev/null @@ -1,15 +0,0 @@ -# Runbook: Vault (in-cluster) - -**Symptoms:** External Secrets **not syncing**, `ClusterSecretStore` **InvalidProviderConfig**, Vault UI/API **503 sealed**, pods **CrashLoop** on auth. - -**Checks** - -1. `kubectl -n vault exec -i sts/vault -- vault status` — **Sealed** / **Initialized**. -2.
Unseal key Secret + optional CronJob: [`clusters/noble/bootstrap/vault/README.md`](../../clusters/noble/bootstrap/vault/README.md), `unseal-cronjob.yaml`. -3. Kubernetes auth for ESO: [`clusters/noble/bootstrap/vault/configure-kubernetes-auth.sh`](../../clusters/noble/bootstrap/vault/configure-kubernetes-auth.sh) and `kubectl describe clustersecretstore vault`. -4. **Cilium** policy: if Vault is unreachable from `external-secrets`, check [`clusters/noble/bootstrap/vault/cilium-network-policy.yaml`](../../clusters/noble/bootstrap/vault/cilium-network-policy.yaml) and extend `ingress` for new client namespaces. - -**Common fixes** - -- Sealed: `vault operator unseal` or fix auto-unseal CronJob + `vault-unseal-key` Secret. -- **403/invalid role** on ESO: re-run Kubernetes auth setup (issuer/CA/reviewer JWT) per README.