Update Ansible configuration to integrate SOPS for managing secrets. Enhance README.md with SOPS usage instructions and prerequisites. Remove External Secrets Operator references and related configurations from the bootstrap process, streamlining the deployment. Adjust playbooks and roles to apply SOPS-encrypted secrets automatically, improving security and clarity in secret management.

Nikholas Pcenicni
2026-03-30 22:42:52 -04:00
parent 023ebfee5d
commit 3a6e5dff5b
44 changed files with 644 additions and 809 deletions

docs/Racks.md Normal file

@@ -0,0 +1,169 @@
# Physical racks — Noble lab (10")
This page gives the **logical rack layout** for the **noble** Talos lab: **three 10" (half-width) racks**, how **rack units (U)** are allocated, and the **Ethernet** paths on **`192.168.50.0/24`**. Node names and IPs match [`talos/CLUSTER-BUILD.md`](../talos/CLUSTER-BUILD.md) and [`docs/architecture.md`](architecture.md).
## Legend
| Symbol | Meaning |
|--------|---------|
| `█` / filled cell | Equipment occupying that **1U** |
| `░` | Reserved / future use |
| `·` | Empty |
| `━━` | Copper to LAN switch |
**Rack unit numbering:** **U increases upward** (U1 = bottom of rack, like ANSI/EIA). **Slot** in the diagrams is **top → bottom** reading order for a quick visual scan.
### Three racks at a glance
Read **top → bottom** (first row = top of rack).
| Primary (10") | Storage B (10") | Rack C (10") |
|-----------------|-----------------|--------------|
| Fiber ONT | Mac Mini | *empty* |
| UniFi Fiber Gateway | NAS | *empty* |
| Patch panel | JBOD | *empty* |
| 2.5 GbE ×8 PoE switch | *empty* | *empty* |
| Raspberry Pi cluster | *empty* | *empty* |
| **helium** (Talos) | *empty* | *empty* |
| **neon** (Talos) | *empty* | *empty* |
| **argon** (Talos) | *empty* | *empty* |
| **krypton** (Talos) | *empty* | *empty* |
**Connectivity:** Primary rack gear shares **one L2** (`192.168.50.0/24`). Storage B and Rack C link the same way when cabled (e.g. **Ethernet** to the PoE switch, **VPN** or flat LAN per your design).
---
## Rack A — LAN aggregation (10" × 12U)
Dedicated to **Layer-2 access** and cable home runs. All cluster nodes plug into this switch (or into a downstream switch that uplinks here).
```
TOP OF RACK
┌────────────────────────────────────────┐
│ Slot 1 ········· empty ·············· │ 12U
│ Slot 2 ········· empty ·············· │ 11U
│ Slot 3 ········· empty ·············· │ 10U
│ Slot 4 ········· empty ·············· │ 9U
│ Slot 5 ········· empty ·············· │ 8U
│ Slot 6 ········· empty ·············· │ 7U
│ Slot 7 ░░░░░░░ optional PDU ░░░░░░░░ │ 6U
│ Slot 8 █████ 1U cable manager ██████ │ 5U
│ Slot 9 █████ 1U patch panel █████████ │ 4U
│ Slot10 ███ 8-port managed switch ████ │ 3U ← LAN L2 spine
│ Slot11 ········· empty ·············· │ 2U
│ Slot12 ········· empty ·············· │ 1U
└────────────────────────────────────────┘
BOTTOM
```
**Network role:** Every node NIC → **switch access port** → same **VLAN / flat LAN** as documented; **kube-vip** VIP **`192.168.50.230`**, **MetalLB** **`192.168.50.210`–`.229`**, **Traefik** **`192.168.50.211`** are **logical** on node IPs (no extra hardware).
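The logical addresses called out above are plain MetalLB objects; a minimal sketch under assumed names (the repo's real manifests live under `clusters/noble/bootstrap/`):

```yaml
# Sketch only — resource names are illustrative; addresses match this page.
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: noble-pool            # assumed name
  namespace: metallb-system
spec:
  addresses:
    - 192.168.50.210-192.168.50.229
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: noble-l2              # assumed name
  namespace: metallb-system
spec:
  ipAddressPools:
    - noble-pool
```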
---
## Rack B — Control planes (10" × 12U)
Three **Talos control-plane** nodes (**scheduling allowed** on CPs per `talconfig.yaml`).
```
TOP OF RACK
┌────────────────────────────────────────┐
│ Slot 1 ········· empty ·············· │ 12U
│ Slot 2 ········· empty ·············· │ 11U
│ Slot 3 ········· empty ·············· │ 10U
│ Slot 4 ········· empty ·············· │ 9U
│ Slot 5 ········· empty ·············· │ 8U
│ Slot 6 ········· empty ·············· │ 7U
│ Slot 7 ········· empty ·············· │ 6U
│ Slot 8 █ neon control-plane .20 ████ │ 5U
│ Slot 9 █ argon control-plane .30 ███ │ 4U
│ Slot10 █ krypton control-plane .40 ██ │ 3U (kube-vip VIP .230)
│ Slot11 ········· empty ·············· │ 2U
│ Slot12 ········· empty ·············· │ 1U
└────────────────────────────────────────┘
BOTTOM
```
---
## Rack C — Worker (10" × 12U)
Single **worker** node; **Longhorn** data disk is **local** to each node (see `talconfig.yaml`); no separate NAS in this diagram.
```
TOP OF RACK
┌────────────────────────────────────────┐
│ Slot 1 ········· empty ·············· │ 12U
│ Slot 2 ········· empty ·············· │ 11U
│ Slot 3 ········· empty ·············· │ 10U
│ Slot 4 ········· empty ·············· │ 9U
│ Slot 5 ········· empty ·············· │ 8U
│ Slot 6 ········· empty ·············· │ 7U
│ Slot 7 ░░░░░░░ spare / future ░░░░░░░░ │ 6U
│ Slot 8 ········· empty ·············· │ 5U
│ Slot 9 ········· empty ·············· │ 4U
│ Slot10 ███ helium worker .10 █████ │ 3U
│ Slot11 ········· empty ·············· │ 2U
│ Slot12 ········· empty ·············· │ 1U
└────────────────────────────────────────┘
BOTTOM
```
---
## Space summary
| System | Rack | Approx. U | IP | Role |
|--------|------|-----------|-----|------|
| LAN switch | A | 1U | — | All nodes on `192.168.50.0/24` |
| Patch / cable mgmt | A | 2× 1U | — | Physical plant |
| **neon** | B | 1U | `192.168.50.20` | control-plane + schedulable |
| **argon** | B | 1U | `192.168.50.30` | control-plane + schedulable |
| **krypton** | B | 1U | `192.168.50.40` | control-plane + schedulable |
| **helium** | C | 1U | `192.168.50.10` | worker |
Adjust **empty vs. future** rows if your chassis are **2U** or on **shelves** — scale the `█` blocks accordingly.
---
## Network connections
All cluster nodes are on **one flat LAN**. **kube-vip** floats **`192.168.50.230:6443`** across the three control-plane hosts on **`ens18`** (see cluster bootstrap docs).
```mermaid
flowchart TB
subgraph RACK_A["Rack A — 10\""]
SW["Managed switch<br/>192.168.50.0/24 L2"]
PP["Patch / cable mgmt"]
SW --- PP
end
subgraph RACK_B["Rack B — 10\""]
N["neon :20"]
A["argon :30"]
K["krypton :40"]
end
subgraph RACK_C["Rack C — 10\""]
H["helium :10"]
end
subgraph LOGICAL["Logical (any node holding VIP)"]
VIP["API VIP 192.168.50.230<br/>kube-vip → apiserver :6443"]
end
WAN["Internet / other LANs"] -.->|"router (out of scope)"| SW
SW <-->|"Ethernet"| N
SW <-->|"Ethernet"| A
SW <-->|"Ethernet"| K
SW <-->|"Ethernet"| H
N --- VIP
A --- VIP
K --- VIP
WK["Workstation / CI<br/>kubectl, browser"] -->|"HTTPS :6443"| VIP
WK -->|"L2 (MetalLB .210/.211, any node)"| SW
```
**Ingress path (same LAN):** clients → **`192.168.50.211`** (Traefik) or **`192.168.50.210`** (Argo CD) via **MetalLB** — still **through the same switch** to whichever node advertises the service.
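Pinning a Service to one pool address is a single MetalLB annotation; a hedged sketch (the Service shape and port numbers are illustrative, not the Traefik chart's actual rendered output):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: traefik
  namespace: traefik
  annotations:
    metallb.universe.tf/loadBalancerIPs: 192.168.50.211   # pin to the ingress IP
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: traefik    # assumed label
  ports:
    - name: websecure
      port: 443
      targetPort: 8443                 # assumed container port
```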
---
## Related docs
- Cluster topology and services: [`architecture.md`](architecture.md)
- Build state and versions: [`../talos/CLUSTER-BUILD.md`](../talos/CLUSTER-BUILD.md)


@@ -8,8 +8,8 @@ This document describes the **noble** Talos lab cluster: node topology, networki
|---------------|---------|
| **Subgraph “Cluster”** | Kubernetes cluster boundary (`noble`) |
| **External / DNS / cloud** | Services outside the data plane (internet, registrar, Pangolin) |
-| **Data store** | Durable data (etcd, Longhorn, Loki, Vault storage) |
-| **Secrets / policy** | Secret material, Vault, admission policy |
+| **Data store** | Durable data (etcd, Longhorn, Loki) |
+| **Secrets / policy** | Secret material (SOPS in git), admission policy |
| **LB / VIP** | Load balancer, MetalLB assignment, or API VIP |
---
@@ -74,7 +74,7 @@ flowchart TB
## Platform stack (bootstrap → workloads)
-Order: **Talos** → **Cilium** (cluster uses `cni: none` until CNI is installed) → **metrics-server**, **Longhorn**, **MetalLB** + pool manifests, **kube-vip** → **Traefik**, **cert-manager** → **Argo CD** (Helm only; optional empty app-of-apps). **Automated install:** `ansible/playbooks/noble.yml` (see `ansible/README.md`). Platform namespaces include `cert-manager`, `traefik`, `metallb-system`, `longhorn-system`, `monitoring`, `loki`, `logging`, `argocd`, `vault`, `external-secrets`, `sealed-secrets`, `kyverno`, `newt`, and others as deployed.
+Order: **Talos** → **Cilium** (cluster uses `cni: none` until CNI is installed) → **metrics-server**, **Longhorn**, **MetalLB** + pool manifests, **kube-vip** → **Traefik**, **cert-manager** → **Argo CD** (Helm only; optional empty app-of-apps). **Automated install:** `ansible/playbooks/noble.yml` (see `ansible/README.md`). Platform namespaces include `cert-manager`, `traefik`, `metallb-system`, `longhorn-system`, `monitoring`, `loki`, `logging`, `argocd`, `kyverno`, `newt`, and others as deployed.
```mermaid
flowchart TB
@@ -98,7 +98,7 @@ flowchart TB
Argo["Argo CD<br/>(optional app-of-apps; platform via Ansible)"]
end
subgraph L5["Platform namespaces (examples)"]
-NS["cert-manager, traefik, metallb-system,<br/>longhorn-system, monitoring, loki, logging,<br/>argocd, vault, external-secrets, sealed-secrets,<br/>kyverno, newt, …"]
+NS["cert-manager, traefik, metallb-system,<br/>longhorn-system, monitoring, loki, logging,<br/>argocd, kyverno, newt, …"]
end
Talos --> Cilium --> MS
Cilium --> LH
@@ -149,22 +149,20 @@ flowchart LR
## Secrets and policy
-**Sealed Secrets** decrypts `SealedSecret` objects in-cluster. **External Secrets Operator** syncs from **Vault** using **`ClusterSecretStore`** (see [`examples/vault-cluster-secret-store.yaml`](../clusters/noble/bootstrap/external-secrets/examples/vault-cluster-secret-store.yaml)). Trust is **cluster → Vault** (ESO calls Vault; Vault does not initiate cluster trust). **Kyverno** with **kyverno-policies** enforces **PSS baseline** in **Audit**.
+**Mozilla SOPS** with **age** encrypts plain Kubernetes **`Secret`** manifests under [`clusters/noble/secrets/`](../clusters/noble/secrets/); operators decrypt at apply time (`ansible/playbooks/noble.yml` or `sops -d … | kubectl apply`). The private key is **`age-key.txt`** at the repo root (gitignored). **Kyverno** with **kyverno-policies** enforces **PSS baseline** in **Audit**.
```mermaid
flowchart LR
subgraph Git["Git repo"]
-SSman["SealedSecret manifests<br/>(optional)"]
+SM["SOPS-encrypted Secret YAML<br/>clusters/noble/secrets/"]
end
+subgraph ops["Apply path"]
+SOPS["sops -d + kubectl apply<br/>(or Ansible noble.yml)"]
+end
subgraph cluster["Cluster"]
-SSC["Sealed Secrets controller<br/>sealed-secrets"]
-ESO["External Secrets Operator<br/>external-secrets"]
-V["Vault<br/>vault namespace<br/>HTTP listener"]
K["Kyverno + kyverno-policies<br/>PSS baseline Audit"]
end
-SSman -->|"encrypted"| SSC -->|"decrypt to Secret"| workloads["Workload Secrets"]
-ESO -->|"ClusterSecretStore →"| V
-ESO -->|"sync ExternalSecret"| workloads
+SM --> SOPS -->|"plain Secret"| workloads["Workload Secrets"]
K -.->|"admission / audit<br/>(PSS baseline)"| workloads
```
@@ -172,7 +170,7 @@ flowchart LR
## Data and storage
-**StorageClass:** **`longhorn`** (default). Talos mounts **user volume** data at **`/var/mnt/longhorn`** (bind paths for Longhorn). Stateful consumers include **Vault**, **kube-prometheus-stack** PVCs, and **Loki**.
+**StorageClass:** **`longhorn`** (default). Talos mounts **user volume** data at **`/var/mnt/longhorn`** (bind paths for Longhorn). Stateful consumers include **kube-prometheus-stack** PVCs and **Loki**.
```mermaid
flowchart TB
@@ -183,12 +181,10 @@ flowchart TB
SC["StorageClass: longhorn (default)"]
end
subgraph consumers["Stateful / durable consumers"]
-V["Vault PVC data-vault-0"]
PGL["kube-prometheus-stack PVCs"]
L["Loki PVC"]
end
UD --> SC
-SC --> V
SC --> PGL
SC --> L
```
@@ -210,7 +206,7 @@ See [`talos/CLUSTER-BUILD.md`](../talos/CLUSTER-BUILD.md) for the authoritative
| Argo CD | 9.4.17 / app v3.3.6 |
| kube-prometheus-stack | 82.15.1 |
| Loki / Fluent Bit | 6.55.0 / 0.56.0 |
-| Sealed Secrets / ESO / Vault | 2.18.4 / 2.2.0 / 0.32.0 |
+| SOPS (client tooling) | see `clusters/noble/secrets/README.md` |
| Kyverno | 3.7.1 / policies 3.7.1 |
| Newt | 1.2.0 / app 1.10.1 |
@@ -218,7 +214,7 @@ See [`talos/CLUSTER-BUILD.md`](../talos/CLUSTER-BUILD.md) for the authoritative
## Narrative
-The **noble** environment is a **Talos** lab cluster on **`192.168.50.0/24`** with **three control plane nodes and one worker**, schedulable workloads on control planes enabled, and the Kubernetes API exposed through **kube-vip** at **`192.168.50.230`**. **Cilium** provides the CNI after Talos bootstrap with **`cni: none`**; **MetalLB** advertises **`192.168.50.210`**–**`192.168.50.229`**, pinning **Argo CD** to **`192.168.50.210`** and **Traefik** to **`192.168.50.211`** for **`*.apps.noble.lab.pcenicni.dev`**. **cert-manager** issues certificates for Traefik Ingresses; **GitOps** is **Ansible-driven Helm** for the platform (**`clusters/noble/bootstrap/`**) plus optional **Argo CD** app-of-apps (**`clusters/noble/apps/`**, **`clusters/noble/bootstrap/argocd/`**). **Observability** uses **kube-prometheus-stack** in **`monitoring`**, **Loki** and **Fluent Bit** with Grafana wired via a **ConfigMap** datasource, with **Longhorn** PVCs for Prometheus, Grafana, Alertmanager, Loki, and **Vault**. **Secrets** combine **Sealed Secrets** for git-encrypted material, **Vault** with **External Secrets** for dynamic sync, and **Kyverno** enforces **Pod Security Standards baseline** in **Audit**. **Public** access uses **Newt** to **Pangolin** with **CNAME** and Integration API steps as documented—not generic in-cluster public DNS.
+The **noble** environment is a **Talos** lab cluster on **`192.168.50.0/24`** with **three control plane nodes and one worker**, schedulable workloads on control planes enabled, and the Kubernetes API exposed through **kube-vip** at **`192.168.50.230`**. **Cilium** provides the CNI after Talos bootstrap with **`cni: none`**; **MetalLB** advertises **`192.168.50.210`**–**`192.168.50.229`**, pinning **Argo CD** to **`192.168.50.210`** and **Traefik** to **`192.168.50.211`** for **`*.apps.noble.lab.pcenicni.dev`**. **cert-manager** issues certificates for Traefik Ingresses; **GitOps** is **Ansible-driven Helm** for the platform (**`clusters/noble/bootstrap/`**) plus optional **Argo CD** app-of-apps (**`clusters/noble/apps/`**, **`clusters/noble/bootstrap/argocd/`**). **Observability** uses **kube-prometheus-stack** in **`monitoring`**, **Loki** and **Fluent Bit** with Grafana wired via a **ConfigMap** datasource, with **Longhorn** PVCs for Prometheus, Grafana, Alertmanager, and Loki. **Secrets** in git use **SOPS** + **age** under **`clusters/noble/secrets/`**; **Kyverno** enforces **Pod Security Standards baseline** in **Audit**. **Public** access uses **Newt** to **Pangolin** with **CNAME** and Integration API steps as documented—not generic in-cluster public DNS.
---

docs/homelab-network.md Normal file

@@ -0,0 +1,100 @@
# Homelab network inventory
Single place for **VLANs**, **static addressing**, and **hosts** beside the **noble** Talos cluster. **Proxmox** is the **hypervisor** for the VMs below; **all of those VMs are intended to run on `192.168.1.0/24`** (same broadcast domain as Pi-hole and typical home clients). **Noble** (Talos) stays on **`192.168.50.0/24`** per [`architecture.md`](architecture.md) and [`talos/CLUSTER-BUILD.md`](../talos/CLUSTER-BUILD.md) until you change that design.
## VLANs (logical)
| Network | Role |
|---------|------|
| **`192.168.1.0/24`** | **Homelab / Proxmox LAN****Proxmox host(s)**, **all Proxmox VMs**, **Pi-hole**, **Mac mini**, and other servers that share this VLAN. |
| **`192.168.50.0/24`** | **Noble Talos** cluster — physical nodes, **kube-vip**, **MetalLB**, Traefik; **not** the Proxmox VM subnet. |
| **`192.168.60.0/24`** | **DMZ / WAN-facing****NPM**, **WebDAV**, **other services** that need WAN access. |
| **`192.168.40.0/24`** | **Home Assistant** and IoT devices — isolated; record subnet and HA IP in DHCP/router. |
**Routing / DNS:** Clients and VMs on **`192.168.1.0/24`** reach **noble** services on **`192.168.50.0/24`** via **L3** (router/firewall). **NFS** from OMV (`192.168.1.105`) to **noble** pods uses the **OMV data IP** as the NFS server address from the cluster's perspective.
Firewall rules between VLANs are **out of scope** here; document them where you keep runbooks.
---
## `192.168.50.0/24` — reservations (noble only)
Do not assign **unrelated** static services on **this** VLAN without checking overlap with MetalLB and kube-vip.
| Use | Addresses |
|-----|-----------|
| Talos nodes | `.10`–`.40` (see [`talos/talconfig.yaml`](../talos/talconfig.yaml)) |
| MetalLB L2 pool | `.210`–`.229` |
| Traefik (ingress) | `.211` (typical) |
| Argo CD | `.210` (typical) |
| Kubernetes API (kube-vip) | **`.230`** — **must not** be a VM |
---
## Proxmox VMs (`192.168.1.0/24`)
All run on **Proxmox**; addresses below use **`192.168.1.0/24`** (same host octet as your earlier `.50.x` / `.60.x` plan, moved into the homelab VLAN). Adjust if your router uses a different numbering scheme.
Most are **Docker hosts** with multiple apps; treat the **IP** as the **host**, not individual containers.
| VM ID | Name | IP | Notes |
|-------|------|-----|--------|
| 666 | nginxproxymanager | `192.168.1.20` | NPM (edge / WAN-facing role — firewall as you design). |
| 777 | nginxproxymanager-Lan | `192.168.1.60` | NPM on **internal** homelab LAN. |
| 100 | Openmediavault | `192.168.1.105` | **NFS** exports for *arr / media paths. |
| 110 | Monitor | `192.168.1.110` | Uptime Kuma, Peekaping, Tracearr → cluster candidates. |
| 120 | arr | `192.168.1.120` | *arr stack; media via **NFS** from OMV — see [migration](#arr-stack-nfs-and-kubernetes). |
| 130 | Automate | `192.168.1.130` | Low use — **candidate to remove** or consolidate. |
| 140 | general-purpose | `192.168.1.140` | IT tools, Mealie, Open WebUI, SparkyFitness, … |
| 150 | Media-server | `192.168.1.150` | Jellyfin (test, **NFS** media), ebook server. |
| 160 | s3 | `192.168.1.170` | Object storage; **merge** into **central S3** on noble per [`shared-data-services.md`](shared-data-services.md) when ready. |
| 190 | Auth | `192.168.1.190` | **Authentik** → **noble (K8s)** for HA. |
| 300 | gitea | `192.168.1.203` | On **`.1`**, no overlap with noble **MetalLB `.210`–`.229`** on **`.50`**. |
| 310 | gitea-nsfw | `192.168.1.204` | |
| 500 | AMP | `192.168.1.47` | |
### Workload detail (what runs where)
**Auth (190)** — **Authentik** is the main service; moving it to **Kubernetes (noble)** gives you **HA**, rolling upgrades, and backups via your cluster patterns (PVCs, Velero, etc.). Plan **OIDC redirect URLs** and **outposts** (if used) when the **ingress hostname** and paths to **`.50`** services change.
**Monitor (110)** — **Uptime Kuma**, **Peekaping**, and **Tracearr** are a good fit for the cluster: small state (SQLite or small DBs), **Ingress** via Traefik, and **Longhorn** or a small DB PVC. Migrate **one app at a time** and keep the old VM until DNS and alerts are verified.
**arr (120)** — **Lidarr, Sonarr, Radarr**, and related *arr* apps; libraries and download paths point at **NFS** from **Openmediavault (100)** at **`192.168.1.105`**. The hard part is **keeping paths, permissions (UID/GID), and download client** wiring while pods move.
**Automate (130)** — Tools are **barely used**; **decommission**, merge into **general-purpose (140)**, or replace with a **CronJob** / one-shot on the cluster only if something still needs scheduling.
**general-purpose (140)** — “Daily driver” stack: **IT tools**, **Mealie**, **Open WebUI**, **SparkyFitness**, and similar. **Candidates for gradual moves** to noble; group by **data sensitivity** and **persistence** (Postgres vs SQLite) when you pick order.
**Media-server (150)** — **Jellyfin** (testing) with libraries on **NFS**; **ebook** server. Treat **Jellyfin** like *arr* for storage: same NFS export and **transcoding** needs (CPU on worker nodes or GPU if you add it). Ebook stack depends on what you run (e.g. Kavita, Audiobookshelf) — note **metadata paths** before moving.
### Arr stack, NFS, and Kubernetes
You do **not** have to move NFS into the cluster: **Openmediavault** on **`192.168.1.105`** can stay the **NFS server** while the *arr* apps run as **Deployments** with **ReadWriteMany** volumes. Noble nodes on **`192.168.50.0/24`** mount NFS using **that IP** (ensure **firewall** allows **NFS** from node IPs to OMV).
1. **Keep OMV as the single source of exports** — same **export path** (e.g. `/export/media`) from the cluster's perspective as from the current VM.
2. **Mount NFS in Kubernetes** — use a **CSI NFS driver** (e.g. **nfs-subdir-external-provisioner** or **csi-driver-nfs**) so each app gets a **PVC** backed by a **subdirectory** of the export, **or** one shared RWX PVC for a common tree if your layout needs it.
3. **Match POSIX ownership** — set **supplemental groups** or **fsGroup** / **runAsUser** on the pods so Sonarr/Radarr see the same **UID/GID** as today's Docker setup; fix **squash** settings on OMV if you use `root_squash`.
4. **Config and DB** — back up each app's **config volume** (or SQLite files), redeploy with the same **environment**; point **download clients** and **NFS media roots** to the **same logical paths** inside the container.
5. **Low-risk path** — run **one** *arr* app on the cluster while the rest stay on **VM 120** until imports and downloads behave; then cut DNS/NPM streams over.
If you prefer **no** NFS from pods, the alternative is **large ReadWriteOnce** disks on Longhorn and **sync** from OMV — usually **more** moving parts than **RWX NFS** for this workload class.
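Steps 2–3 above, taken as static provisioning, come down to one PV/PVC pair per shared tree; a sketch (server and export path per this page; size, namespace, and IDs are placeholders):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: media-nfs
spec:
  capacity:
    storage: 1Ti                     # placeholder size
  accessModes: [ReadWriteMany]
  nfs:
    server: 192.168.1.105            # OMV
    path: /export/media
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: media-nfs
  namespace: arr                     # placeholder namespace
spec:
  accessModes: [ReadWriteMany]
  storageClassName: ""               # bind to the static PV above
  volumeName: media-nfs
  resources:
    requests:
      storage: 1Ti
# In the pod spec, match OMV ownership (step 3), e.g.:
#   securityContext: { runAsUser: 1000, runAsGroup: 1000, fsGroup: 1000 }
```

With a CSI NFS driver (step 2's dynamic option) the PV disappears and a StorageClass provisions per-app subdirectories instead.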
---
## Other hosts
| Host | IP | VLAN / network | Notes |
|------|-----|----------------|--------|
| **Pi-hole** | `192.168.1.127` | `192.168.1.0/24` | DNS; same VLAN as Proxmox VMs. |
| **Home Assistant** | *TBD* | **IoT VLAN** | Add reservation when fixed. |
| **Mac mini** | `192.168.1.155` | `192.168.1.0/24` | Align with **Storage B** in [`Racks.md`](Racks.md) if the same machine. |
---
## Related docs
- **Shared Postgres + S3 (centralized):** [`shared-data-services.md`](shared-data-services.md)
- **VM → noble migration plan:** [`migration-vm-to-noble.md`](migration-vm-to-noble.md)
- Noble cluster topology and ingress: [`architecture.md`](architecture.md)
- Physical racks (Primary / Storage B / Rack C): [`Racks.md`](Racks.md)
- Cluster checklist: [`../talos/CLUSTER-BUILD.md`](../talos/CLUSTER-BUILD.md)


@@ -0,0 +1,121 @@
# Migration plan: Proxmox VMs → noble (Kubernetes)
This document is the **default playbook** for moving workloads from **Proxmox VMs** on **`192.168.1.0/24`** into the **noble** Talos cluster on **`192.168.50.0/24`**. Source inventory and per-VM notes: [`homelab-network.md`](homelab-network.md). Cluster facts: [`architecture.md`](architecture.md), [`talos/CLUSTER-BUILD.md`](../talos/CLUSTER-BUILD.md).
---
## 1. Scope and principles
| Principle | Detail |
|-----------|--------|
| **One service at a time** | Run the new workload on **noble** while the **VM** stays up; cut over **DNS / NPM** only after checks pass. |
| **Same container image** | Prefer the **same** upstream image and major version as Docker on the VM to reduce surprises. |
| **Data moves with a plan** | **Backup** VM volumes or export DB dumps **before** the first deploy to the cluster. |
| **Ingress on noble** | Internal apps use **Traefik** + **`*.apps.noble.lab.pcenicni.dev`** (or your chosen hostnames) and **MetalLB** (e.g. **`192.168.50.211`**) per [`architecture.md`](architecture.md). |
| **Cross-VLAN** | Clients on **`.1`** reach services on **`.50`** via **routing**; **firewall** must allow **NFS** from **Talos node IPs** to **OMV `192.168.1.105`** when pods mount NFS. |
**Not everything must move.** Keep **Openmediavault** (and optionally **NPM**) on VMs if you prefer; the cluster consumes **NFS** and **HTTP** from them.
---
## 2. Prerequisites (before wave 1)
1. **Cluster healthy**`kubectl get nodes`; [`talos/CLUSTER-BUILD.md`](../talos/CLUSTER-BUILD.md) checklist through ingress and cert-manager as needed.
2. **Ingress + TLS****Traefik** + **cert-manager** working; you can hit a **test Ingress** on the MetalLB IP.
3. **GitOps / deploy path** — Decide per app: **Helm** under `clusters/noble/apps/`, **Argo CD**, or **Ansible**-applied manifests (match how you manage the rest of noble).
4. **Secrets** — Plan **Kubernetes Secrets**; for git-stored material, align with **SOPS** (`clusters/noble/secrets/`, `.sops.yaml`).
5. **Storage****Longhorn** default for **ReadWriteOnce** state; for **NFS** (*arr*, Jellyfin), install a **CSI NFS** driver and test a **small RWX PVC** before migrating data-heavy apps.
6. **Shared data tier (recommended)** — Deploy **centralized PostgreSQL** and **S3-compatible storage** on noble so apps do not each ship their own DB/object store; see [`shared-data-services.md`](shared-data-services.md).
7. **Firewall** — Rules: **workstation → `192.168.50.230:6443`**; **nodes → OMV NFS ports**; **clients → `192.168.50.211`** (or split-horizon DNS) as you design.
8. **DNS** — Split-horizon or Pi-hole records for **`*.apps.noble.lab.pcenicni.dev`** → **Traefik** IP **`192.168.50.211`** for LAN clients.
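For prerequisite 4, a minimal `.sops.yaml` rule keeps new secret manifests encrypting consistently; the age recipient below is a placeholder, not the repo's real key:

```yaml
creation_rules:
  - path_regex: clusters/noble/secrets/.*\.ya?ml$
    encrypted_regex: ^(data|stringData)$     # encrypt only the secret payload fields
    age: age1exampleplaceholderrecipient     # placeholder — use the repo's public key
```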
---
## 3. Standard migration procedure (repeat per app)
Use this checklist for **each** application (or small group, e.g. one Helm release).
| Step | Action |
|------|--------|
| **A. Discover** | Document **image:tag**, **ports**, **volumes** (host paths), **env vars**, **depends_on** (DB, Redis, NFS path). Export **docker inspect** / **compose** from the VM. |
| **B. Backup** | Snapshot **Proxmox VM** or backup **volume** / **SQLite** / **DB dump** to offline storage. |
| **C. Namespace** | Create a **dedicated namespace** (e.g. `monitoring-tools`, `authentik`) or use your house standard. |
| **D. Deploy** | Add **Deployment** (or **StatefulSet**), **Service**, **Ingress** (class **traefik**), **PVCs**; wire **secrets** from **Secrets** (not literals in git). |
| **E. Storage** | **Longhorn** PVC for local state; **NFS CSI** PVC for shared media/config paths that must match the VM (see [`homelab-network.md`](homelab-network.md) *arr* section). Prefer **shared Postgres** / **shared S3** per [`shared-data-services.md`](shared-data-services.md) instead of new embedded databases. Match **UID/GID** with `securityContext`. |
| **F. Smoke test** | `kubectl port-forward` or temporary **Ingress** hostname; log in, run one critical workflow (login, playback, sync). |
| **G. DNS cutover** | Point **internal DNS** or **NPM** upstream from the **VM IP** to the **new hostname** (Traefik) or **MetalLB IP** + Host header. |
| **H. Observe** | 24–72 hours: logs, alerts, **Uptime Kuma** (once migrated), backups. |
| **I. Decommission** | Stop the migrated **container** on the VM; keep the VM itself running until **every** service on it has moved. |
| **J. VM off** | When **no** services remain on that VM, **power off** and archive or delete the VM. |
**Rollback:** Re-enable the VM service, revert **DNS/NPM** to the old IP, delete or scale the cluster deployment to zero.
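Step D's manifest shape, sketched for one small app — image tag, names, and namespace are illustrative; the Service, Ingress, and PVC follow the same pattern:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: uptime-kuma
  namespace: monitoring-tools          # per step C
spec:
  replicas: 1
  selector:
    matchLabels: { app: uptime-kuma }
  template:
    metadata:
      labels: { app: uptime-kuma }
    spec:
      containers:
        - name: uptime-kuma
          image: louislam/uptime-kuma:1   # same upstream image as on the VM
          ports:
            - containerPort: 3001
          volumeMounts:
            - name: data
              mountPath: /app/data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: uptime-kuma-data   # Longhorn PVC, step E
```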
---
## 4. Recommended migration order (phases)
Order balances **risk**, **dependencies**, and **learning curve**.
| Phase | Target | Rationale |
|-------|--------|-----------|
| **0 — Optional** | **Automate (130)** | Low use: **retire** or replace with **CronJobs**; skip if nothing valuable runs. |
| **0b — Platform** | **Shared Postgres + S3** on noble | Run **before** or alongside early waves so new deploys use **one DSN** and **one object endpoint**; retire **VM 160** when empty. See [`shared-data-services.md`](shared-data-services.md). |
| **1 — Observability** | **Monitor (110)** — Uptime Kuma, Peekaping, Tracearr | Small state, validates **Ingress**, **PVCs**, and **alert paths** before auth and media. |
| **2 — Git** | **gitea (300)**, **gitea-nsfw (310)** | Point at **shared Postgres** + **S3** for attachments; move **repos** with **PVC** + backup restore if needed. |
| **3 — Object / misc** | **s3 (160)**, **AMP (500)** | **Migrate data** into **central** S3 on cluster, then **decommission** duplicate MinIO on VM **160** if applicable. |
| **4 — Auth** | **Auth (190)****Authentik** | Use **shared Postgres**; update **all OIDC clients** (Gitea, apps, NPM) with **new issuer URLs**; schedule a **maintenance window**. |
| **5 — Daily apps** | **general-purpose (140)** | Move **one app per release** (Mealie, Open WebUI, …); each app gets its **own database** (and bucket if needed) on the **shared** tiers — not a new Postgres pod per app. |
| **6 — Media / *arr*** | **arr (120)**, **Media-server (150)** | **NFS** from **OMV**, download clients, **transcoding** — migrate **one *arr*** then Jellyfin/ebook; see NFS bullets in [`homelab-network.md`](homelab-network.md). |
| **7 — Edge** | **NPM (666/777)** | Often **last**: either keep on Proxmox or replace with **Traefik** + **IngressRoutes** / **Gateway API**; many people keep a **dedicated** reverse proxy VM until parity is proven. |
**Openmediavault (100)** — Typically **stays** as **NFS** (and maybe backup target) for the cluster; no need to “migrate” the whole NAS into Kubernetes.
---
## 5. Ingress and reverse proxy
| Approach | When to use |
|----------|-------------|
| **Traefik Ingress** on noble | Default for **internal** HTTPS apps; **cert-manager** for public names you control. |
| **NPM (VM)** as front door | Point **proxy host** → **Traefik MetalLB IP** or **service name** if you add internal DNS; reduces double-proxy if you **terminate TLS** in one place only. |
| **Newt / Pangolin** | Public reachability per [`clusters/noble/bootstrap/newt/README.md`](../clusters/noble/bootstrap/newt/README.md); not automatic ExternalDNS. |
Avoid **two** TLS terminations for the same hostname unless you intend **SSL passthrough** end-to-end.
---
## 6. Authentik-specific (Auth VM → cluster)
1. **Backup** Authentik **PostgreSQL** (or embedded DB) and **media** volume from the VM.
2. Deploy **Helm** (official chart) with **same** Authentik version if possible.
3. **Restore** DB into **shared cluster Postgres** (recommended) or chart-managed DB — see [`shared-data-services.md`](shared-data-services.md).
4. Update **issuer URL** in every **OIDC/OAuth** client (Gitea, Grafana, etc.).
5. Re-test **outposts** (if any) and **redirect URIs** from both **`.1`** and **`.50`** client perspectives.
6. **Cut over DNS**; then **decommission** VM **190**.
---
## 7. *arr* and Jellyfin-specific
Follow the **numbered list** under **“Arr stack, NFS, and Kubernetes”** in [`homelab-network.md`](homelab-network.md). In short: **OMV stays**; **CSI NFS** + **RWX**; **match permissions**; migrate **one app** first; verify **download client** can reach the new pod **IP/DNS** from your download host.
---
## 8. Validation checklist (per wave)
- Pods **Ready**, **Ingress** returns **200** / login page.
- **TLS** valid for chosen hostname.
- **Persistent data** present (new uploads, DB writes survive pod restart).
- **Backups** (Velero or app-level) defined for the new location.
- **Monitoring** / alerts updated (targets, not old VM IP).
- **Documentation** in [`homelab-network.md`](homelab-network.md) updated (VM retired or marked migrated).
---
## Related docs
- **Shared Postgres + S3:** [`shared-data-services.md`](shared-data-services.md)
- VM inventory and NFS notes: [`homelab-network.md`](homelab-network.md)
- Noble topology, MetalLB, Traefik: [`architecture.md`](architecture.md)
- Bootstrap and versions: [`talos/CLUSTER-BUILD.md`](../talos/CLUSTER-BUILD.md)
- Apps layout: [`clusters/noble/apps/README.md`](../clusters/noble/apps/README.md)


@@ -0,0 +1,90 @@
# Centralized PostgreSQL and S3-compatible storage
Goal: **one shared PostgreSQL** and **one S3-compatible object store** on **noble**, instead of every app bundling its own database or MinIO. Apps keep **logical isolation** via **per-app databases** / **users** and **per-app buckets** (or prefixes), not separate clusters.
See also: [`migration-vm-to-noble.md`](migration-vm-to-noble.md), [`homelab-network.md`](homelab-network.md) (VM **160** `s3` today), [`talos/CLUSTER-BUILD.md`](../talos/CLUSTER-BUILD.md) (Velero + S3).
---
## 1. Why centralize
| Benefit | Detail |
|--------|--------|
| **Operations** | One backup/restore story, one upgrade cadence, one place to tune **IOPS** and **retention**. |
| **Security** | **Least privilege**: each app gets its own **DB user** and **S3 credentials** scoped to one database or bucket. |
| **Resources** | Fewer duplicate **Postgres** or **MinIO** sidecars; better use of **Longhorn** or dedicated PVCs for the shared tiers. |
**Tradeoff:** Shared tiers are **blast-radius** targets — use **backups**, **PITR** where you care, and **NetworkPolicies** so only expected namespaces talk to Postgres/S3.
---
## 2. PostgreSQL — recommended pattern
1. **Run Postgres on noble** — Operators such as **CloudNativePG**, **Zalando Postgres operator**, or a well-maintained **Helm** chart with **replicas** + **persistent volumes** (Longhorn).
2. **One cluster instance, many databases** — For each app: `CREATE DATABASE appname;` and a **dedicated role** with `CONNECT` on that database only (not superuser).
3. **Connection from apps** — Use a **Kubernetes Service** (e.g. `postgres-platform.platform.svc.cluster.local:5432`) and pass **credentials** via **Secrets** (ideally **SOPS**-encrypted in git).
4. **Migrations** — Run app **migration** jobs or init containers against the **same** DSN after DB exists.
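A minimal sketch of the per-app isolation in step 2, run as the superuser on the shared instance (the app name `gitea` and the password are illustrative placeholders):

```sql
-- One role + one database per app; no superuser rights.
CREATE ROLE gitea LOGIN PASSWORD 'change-me';
CREATE DATABASE gitea OWNER gitea;

-- Lock the database down so only its own role can connect.
REVOKE CONNECT ON DATABASE gitea FROM PUBLIC;
GRANT CONNECT ON DATABASE gitea TO gitea;
```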
**Migrating off SQLite / embedded Postgres**
- **SQLite → Postgres:** export/import per app (native tools, or **pgloader** where appropriate).
- **Docker Postgres volume:** `pg_dumpall` or per-DB `pg_dump` → restore into a **new** database on the shared server; **freeze writes** during cutover.
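One way to script the cutover for a single database, assuming the old Docker container is named `old-postgres` and the target database/role already exist on the shared server (all names here are illustrative):

```shell
# Freeze writes first (stop the app), then take a custom-format dump
# from the old Docker Postgres container.
docker exec old-postgres pg_dump -U postgres -Fc appname > appname.dump

# Restore into the pre-created database on the shared server.
pg_restore -h postgres-platform.platform.svc.cluster.local \
  -U appname -d appname --no-owner appname.dump
```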
---
## 3. S3-compatible object storage — recommended pattern
1. **Run one S3 API on noble**: **MinIO** (common), **Garage**, or **SeaweedFS** S3 layer — with **PVC(s)** or host path for data; **erasure coding** / replicas if the chart supports it and you want durability across nodes.
2. **Buckets per concern** — e.g. `gitea-attachments`, `velero`, `loki-archive` — not one global bucket unless you enforce **prefix** IAM policies.
3. **Credentials**: **IAM-style** users limited to **one bucket** (or prefix); **Secrets** reference **access key** / **secret**; never commit keys in plain text.
4. **Endpoint for pods** — In-cluster: `http://minio.platform.svc.cluster.local:9000` (or TLS inside mesh). Apps use **virtual-hosted** or **path-style** per SDK defaults.
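Steps 2–3 can be sketched with MinIO's `mc` client, assuming an alias `noble` already points at the in-cluster endpoint; bucket, user, and policy names are illustrative, and the exact `mc admin policy` subcommands vary by `mc` version:

```shell
mc mb noble/gitea-attachments

# IAM-style user scoped to one bucket via a policy document.
cat > gitea-policy.json <<'EOF'
{ "Version": "2012-10-17",
  "Statement": [{ "Effect": "Allow",
    "Action": ["s3:*"],
    "Resource": ["arn:aws:s3:::gitea-attachments",
                 "arn:aws:s3:::gitea-attachments/*"] }] }
EOF

mc admin user add noble gitea-app 'change-me-secret'
mc admin policy create noble gitea-rw gitea-policy.json
mc admin policy attach noble gitea-rw --user gitea-app
```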
### NFS as backing store for S3 on noble
**Yes.** You can run MinIO (or another S3-compatible server) with its **data directory** on a **ReadWriteMany** volume that is **NFS** — for example the same **Openmediavault** export you already use, mounted via your **NFS CSI** driver (see [`homelab-network.md`](homelab-network.md)).
| Consideration | Detail |
|---------------|--------|
| **Works for homelab** | MinIO stores objects as files under a path; **POSIX** on NFS is enough for many setups. |
| **Performance** | NFS adds **latency** and shared bandwidth; fine for moderate use, less ideal for heavy multi-tenant throughput. |
| **Availability** | The **NFS server** (OMV) becomes part of the availability story for object data — plan **backups** and **OMV** health like any dependency. |
| **Locking / semantics** | Prefer **NFSv4.x**; avoid mixing **NFS** and expectations of **local SSD** (e.g. very chatty small writes). If you see odd behavior, **Longhorn** (block) on a node is the usual next step. |
| **Layering** | You are stacking **S3 API → file layout → NFS → disk**; that is normal for a lab, just **monitor** space and exports on OMV. |
**Summary:** NFS-backed PVC for MinIO is **valid** on noble; use **Longhorn** (or local disk) when you need **better IOPS** or want object data **inside** the cluster's storage domain without depending on OMV for that tier.
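As a sketch, an NFS-backed **RWX** claim for MinIO's data directory might look like this, assuming an `nfs-csi` StorageClass from your NFS CSI driver (class name, namespace, and size are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: minio-data
  namespace: platform
spec:
  accessModes: ["ReadWriteMany"]   # NFS supports RWX
  storageClassName: nfs-csi        # illustrative class name
  resources:
    requests:
      storage: 500Gi
```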
**Migrating off VM 160 (`s3`) or per-app MinIO**
- **MinIO → MinIO:** `mc mirror` between aliases, or **replication** if you configure it.
- **Same API:** Any tool speaking **S3** can **sync** buckets before you point apps at the new endpoint.
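A hedged sketch of the `mc mirror` cutover, assuming aliases `oldvm` (VM 160) and `noble` are already configured:

```shell
# Copy each bucket's objects to the new endpoint; re-run until the
# delta is empty, then freeze writers and do one final pass.
mc mirror --preserve oldvm/gitea-attachments noble/gitea-attachments
mc mirror --preserve oldvm/velero noble/velero

# Sanity-check sizes roughly match before repointing apps.
mc du oldvm/gitea-attachments
mc du noble/gitea-attachments
```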
**Velero** — Point the **backup location** at the **central** bucket (see cluster Velero docs); avoid a second ad-hoc object store for backups if one cluster bucket is enough.
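Pointing Velero at the central bucket is a small manifest change; a sketch, assuming a `velero` bucket on the in-cluster MinIO endpoint (names are illustrative):

```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws               # MinIO speaks the AWS S3 API
  objectStorage:
    bucket: velero
  config:
    region: us-east-1          # arbitrary region string for MinIO
    s3ForcePathStyle: "true"
    s3Url: http://minio.platform.svc.cluster.local:9000
```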
---
## 4. Ordering relative to app migrations
| When | What |
|------|------|
| **Early** | Stand up **Postgres** + **S3** with **empty** DBs/buckets; test with **one** non-critical app (e.g. a throwaway deployment). |
| **Before auth / Git** | **Gitea** and **Authentik** benefit from **managed Postgres** early — plan **DSN** and **bucket** for attachments **before** cutover. |
| **Ongoing** | New apps **must not** ship embedded **Postgres/MinIO** unless the workload truly requires it (e.g. vendor appliance). |
---
## 5. Checklist (platform team)
- [ ] Postgres **Service** DNS name and **TLS** (optional in-cluster) documented.
- [ ] S3 **endpoint**, **region** string (can be `us-east-1` for MinIO), **TLS** for Ingress if clients are outside the cluster.
- [ ] **Backup:** scheduled **logical dumps** (Postgres) and **bucket replication** or **object versioning** where needed.
- [ ] **SOPS** pattern for **rotation** without editing app manifests by hand.
- [ ] **homelab-network.md** updated when **VM 160** is retired or repurposed.
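The SOPS rotation pattern from the checklist can be sketched as follows, assuming an **age** recipient key and a plain Secret manifest in git (paths are illustrative):

```shell
# Encrypt only the data/stringData fields of the Secret manifest
# so the rest of the YAML stays diffable in git.
sops --encrypt --age "$AGE_PUBLIC_KEY" \
  --encrypted-regex '^(data|stringData)$' \
  --in-place clusters/noble/apps/gitea/secret.yaml

# Rotation: open, edit, and re-encrypt in place, then commit.
sops clusters/noble/apps/gitea/secret.yaml
```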
---
## Related docs
- VM → cluster migration: [`migration-vm-to-noble.md`](migration-vm-to-noble.md)
- Inventory (s3 VM): [`homelab-network.md`](homelab-network.md)
- Longhorn / storage runbook: [`../talos/runbooks/longhorn.md`](../talos/runbooks/longhorn.md)
- Velero (S3 backup target): [`../clusters/noble/bootstrap/velero/`](../clusters/noble/bootstrap/velero/) (if present)