# Noble platform architecture
This document describes the **noble** Talos lab cluster: node topology, networking, platform stack, observability, secrets/policy, and storage. Facts align with [`talos/CLUSTER-BUILD.md`](../talos/CLUSTER-BUILD.md), [`talos/talconfig.yaml`](../talos/talconfig.yaml), and manifests under [`clusters/noble/`](../clusters/noble/).
## Legend
| Shape / style | Meaning |
|---------------|---------|
| **Subgraph “Cluster”** | Kubernetes cluster boundary (`noble`) |
| **External / DNS / cloud** | Services outside the data plane (internet, registrar, Pangolin) |
| **Data store** | Durable data (etcd, Longhorn, Loki) |
| **Secrets / policy** | Secret material (SOPS in git), admission policy |
| **LB / VIP** | Load balancer, MetalLB assignment, or API VIP |
---
## Physical / node topology
Four Talos nodes on **LAN `192.168.50.0/24`**: three control planes (**neon**, **argon**, **krypton**) and one worker (**helium**). `allowSchedulingOnControlPlanes: true` in `talconfig.yaml`. The Kubernetes API is fronted by **kube-vip** on **`192.168.50.230`** (not a separate hardware load balancer).
```mermaid
flowchart TB
subgraph LAN["LAN 192.168.50.0/24"]
subgraph CP["Control planes (kube-vip VIP 192.168.50.230:6443)"]
neon["neon<br/>192.168.50.20<br/>control-plane + schedulable"]
argon["argon<br/>192.168.50.30<br/>control-plane + schedulable"]
krypton["krypton<br/>192.168.50.40<br/>control-plane + schedulable"]
end
subgraph W["Worker"]
helium["helium<br/>192.168.50.10<br/>worker only"]
end
VIP["API VIP 192.168.50.230<br/>kube-vip on ens18<br/>→ apiserver :6443"]
end
neon --- VIP
argon --- VIP
krypton --- VIP
kubectl["kubectl / talosctl clients<br/>(workstation on LAN/VPN)"] -->|"HTTPS :6443"| VIP
```
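The three-control-plane layout can be read as a quorum calculation. The sketch below assumes each control-plane node runs one voting etcd member (the usual Talos arrangement, but not stated explicitly in this document) and uses the standard majority rule; the function names are illustrative only.

```python
# Majority-quorum arithmetic for the etcd members on the control-plane nodes.
# Assumption: one voting etcd member per control-plane node.
CONTROL_PLANES = ["neon", "argon", "krypton"]  # names from the topology above


def quorum(members: int) -> int:
    """Smallest majority of `members` voting members."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority still remains."""
    return members - quorum(members)


if __name__ == "__main__":
    n = len(CONTROL_PLANES)
    print(f"{n} members: quorum={quorum(n)}, tolerated failures={tolerated_failures(n)}")
```

Under that assumption the cluster keeps its API available with one control-plane node down, and a fourth member would not raise the number of tolerated failures.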
---
## Network and ingress
**North–south (apps on LAN):** DNS for **`*.apps.noble.lab.pcenicni.dev`** → **Traefik** **`LoadBalancer` `192.168.50.211`**. **MetalLB** L2 pool **`192.168.50.210`–`192.168.50.229`**; **Argo CD** uses **`192.168.50.210`**. **Public** access is not in-cluster ExternalDNS: **Newt** (Pangolin tunnel) plus **CNAME** and **Integration API** per [`clusters/noble/bootstrap/newt/README.md`](../clusters/noble/bootstrap/newt/README.md).
```mermaid
flowchart TB
user["User"]
subgraph DNS["DNS"]
pub["Public: CNAME → Pangolin<br/>(per Newt README; not ExternalDNS)"]
split["LAN / split horizon:<br/>*.apps.noble.lab.pcenicni.dev<br/>→ 192.168.50.211"]
end
subgraph LAN["LAN"]
ML["MetalLB L2<br/>pool 192.168.50.210–229<br/>IPAddressPool noble-l2"]
T["Traefik Service LoadBalancer<br/>192.168.50.211<br/>IngressClass: traefik"]
Argo["Argo CD server LoadBalancer<br/>192.168.50.210"]
Newt["Newt (Pangolin tunnel)<br/>outbound to Pangolin"]
end
subgraph Cluster["Cluster workloads"]
Ing["Ingress resources<br/>cert-manager HTTP-01"]
App["Apps / Grafana Ingress<br/>e.g. grafana.apps.noble.lab.pcenicni.dev"]
end
user --> pub
user --> split
split --> T
pub -.->|"tunnel path"| Newt
T --> Ing --> App
ML --- T
ML --- Argo
user -->|"optional direct to LB IP"| Argo
```
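The address plan above can be sanity-checked with a few lines of standard-library Python. This is a minimal sketch, not part of the repository: the addresses are copied from this document, while the dictionary keys and function names are hypothetical.

```python
import ipaddress

# Values copied from the diagrams and prose in this document.
LAN = ipaddress.ip_network("192.168.50.0/24")
POOL_FIRST = ipaddress.ip_address("192.168.50.210")
POOL_LAST = ipaddress.ip_address("192.168.50.229")
NODES = {
    "helium": "192.168.50.10",
    "neon": "192.168.50.20",
    "argon": "192.168.50.30",
    "krypton": "192.168.50.40",
}
API_VIP = "192.168.50.230"
LB_SERVICES = {"argocd": "192.168.50.210", "traefik": "192.168.50.211"}

# Number of addresses MetalLB can hand out (range is inclusive).
POOL_SIZE = int(POOL_LAST) - int(POOL_FIRST) + 1


def in_pool(addr: str) -> bool:
    """True when addr falls inside the MetalLB L2 range (inclusive)."""
    return POOL_FIRST <= ipaddress.ip_address(addr) <= POOL_LAST


def check_plan() -> list:
    """Return human-readable problems; an empty list means the plan is consistent."""
    problems = []
    fixed = {**NODES, "api-vip": API_VIP}
    for name, addr in {**fixed, **LB_SERVICES}.items():
        if ipaddress.ip_address(addr) not in LAN:
            problems.append(f"{name} ({addr}) is outside {LAN}")
    for name, addr in fixed.items():
        if in_pool(addr):
            problems.append(f"{name} ({addr}) collides with the MetalLB pool")
    for name, addr in LB_SERVICES.items():
        if not in_pool(addr):
            problems.append(f"{name} ({addr}) is not in the MetalLB pool")
    if len(set(LB_SERVICES.values())) != len(LB_SERVICES):
        problems.append("two LoadBalancer services share one address")
    return problems


if __name__ == "__main__":
    print(f"pool size: {POOL_SIZE}")
    print(check_plan() or "address plan is consistent")
```

The check encodes two properties the design relies on: the kube-vip API VIP and the node addresses sit outside the MetalLB range, and both pinned LoadBalancer addresses sit inside it.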
---
## Platform stack (bootstrap → workloads)
Order: **Talos** → **Cilium** (cluster uses `cni: none` until CNI is installed) → **metrics-server**, **Longhorn**, **MetalLB** + pool manifests, **kube-vip** → **Traefik**, **cert-manager** → **Argo CD** (Helm only; optional empty app-of-apps). **Automated install:** `ansible/playbooks/noble.yml` (see `ansible/README.md`). Platform namespaces include `cert-manager`, `traefik`, `metallb-system`, `longhorn-system`, `monitoring`, `loki`, `logging`, `argocd`, `kyverno`, `newt`, and others as deployed.
```mermaid
flowchart TB
subgraph L0["OS / bootstrap"]
Talos["Talos v1.12.6<br/>Image Factory schematic"]
end
subgraph L1["CNI"]
Cilium["Cilium<br/>(cni: none until installed)"]
end
subgraph L2["Core add-ons"]
MS["metrics-server"]
LH["Longhorn + default StorageClass"]
MB["MetalLB + pool manifests"]
KV["kube-vip (API VIP)"]
end
subgraph L3["Ingress and TLS"]
Traefik["Traefik"]
CM["cert-manager + ClusterIssuers"]
end
subgraph L4["GitOps"]
Argo["Argo CD<br/>(optional app-of-apps; platform via Ansible)"]
end
subgraph L5["Platform namespaces (examples)"]
NS["cert-manager, traefik, metallb-system,<br/>longhorn-system, monitoring, loki, logging,<br/>argocd, kyverno, newt, …"]
end
Talos --> Cilium --> MS
Cilium --> LH
Cilium --> MB
Cilium --> KV
MB --> Traefik
Traefik --> CM
CM --> Argo
Argo --> NS
```
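The arrows in the diagram can be treated as a dependency graph and turned into an install order. The sketch below uses only the edges drawn above; it illustrates the ordering constraint and does not reproduce what `ansible/playbooks/noble.yml` actually does. The dictionary and function names are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# component -> components that must be installed first (edges from the diagram).
DEPENDS_ON = {
    "Cilium": {"Talos"},
    "metrics-server": {"Cilium"},
    "Longhorn": {"Cilium"},
    "MetalLB": {"Cilium"},
    "kube-vip": {"Cilium"},
    "Traefik": {"MetalLB"},
    "cert-manager": {"Traefik"},
    "Argo CD": {"cert-manager"},
}


def install_order() -> list:
    """One valid order in which every component follows its prerequisites."""
    return list(TopologicalSorter(DEPENDS_ON).static_order())


if __name__ == "__main__":
    for step, component in enumerate(install_order(), start=1):
        print(f"{step}. {component}")
```

Components with no edge between them (for example Longhorn and MetalLB) may appear in either relative order; only the drawn dependencies are guaranteed.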
---
## Observability path
**kube-prometheus-stack** in **`monitoring`**: Prometheus, Grafana, Alertmanager, node-exporter, etc. **Loki** (SingleBinary) in **`loki`** with **Fluent Bit** in **`logging`** shipping to **`loki-gateway`**. Grafana Loki datasource is applied via **ConfigMap** [`clusters/noble/bootstrap/grafana-loki-datasource/loki-datasource.yaml`](../clusters/noble/bootstrap/grafana-loki-datasource/loki-datasource.yaml). Prometheus, Grafana, Alertmanager, and Loki use **Longhorn** PVCs where configured.
```mermaid
flowchart LR
subgraph Nodes["All nodes"]
NE["node-exporter DaemonSet"]
FB["Fluent Bit DaemonSet<br/>namespace: logging"]
end
subgraph mon["monitoring"]
PROM["Prometheus"]
AM["Alertmanager"]
GF["Grafana"]
SC["ServiceMonitors / kube-state-metrics / operator"]
end
subgraph lok["loki"]
LG["loki-gateway Service"]
LO["Loki SingleBinary"]
end
NE --> PROM
PROM --> GF
AM --> GF
FB -->|"to loki-gateway:80"| LG --> LO
GF -->|"Explore / datasource ConfigMap<br/>grafana-loki-datasource"| LO
subgraph PVC["Longhorn PVCs"]
P1["Prometheus / Grafana /<br/>Alertmanager PVCs"]
P2["Loki PVC"]
end
PROM --- P1
LO --- P2
```
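To make the Fluent Bit → `loki-gateway` hop concrete, the sketch below builds the JSON body that Loki's HTTP push endpoint accepts. It only constructs the payload and sends nothing. The in-cluster URL is an assumption derived from the Service name, namespace, and port in the diagram plus the default `cluster.local` domain; the label values and function name are hypothetical, and Fluent Bit's own output plugin does the real shipping.

```python
import json
import time
from typing import Dict, List, Optional

# Assumed URL: Service "loki-gateway" in namespace "loki", port 80,
# default cluster domain; the path is Loki's HTTP push endpoint.
LOKI_PUSH_URL = "http://loki-gateway.loki.svc.cluster.local:80/loki/api/v1/push"


def build_push_payload(labels: Dict[str, str], lines: List[str],
                       ts_ns: Optional[int] = None) -> str:
    """Serialize log lines as one Loki stream.

    Each entry is [timestamp, line], where the timestamp is a string holding
    nanoseconds since the Unix epoch.
    """
    if ts_ns is None:
        ts_ns = time.time_ns()
    values = [[str(ts_ns + i), line] for i, line in enumerate(lines)]
    return json.dumps({"streams": [{"stream": labels, "values": values}]})


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    body = build_push_payload({"namespace": "logging", "app": "demo"}, ["hello", "world"])
    print(LOKI_PUSH_URL)
    print(body)
```

Offsetting each line by one nanosecond keeps the entries of a batch in order within the stream.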
---
## Secrets and policy
**Mozilla SOPS** with **age** encrypts plain Kubernetes **`Secret`** manifests under [`clusters/noble/secrets/`](../clusters/noble/secrets/); operators decrypt at apply time (`ansible/playbooks/noble.yml` or `sops -d … | kubectl apply`). The private key is **`age-key.txt`** at the repo root (gitignored). **Kyverno** with **kyverno-policies** enforces **PSS baseline** in **Audit**.
```mermaid
flowchart LR
subgraph Git["Git repo"]
SM["SOPS-encrypted Secret YAML<br/>clusters/noble/secrets/"]
end
subgraph ops["Apply path"]
SOPS["sops -d + kubectl apply<br/>(or Ansible noble.yml)"]
end
subgraph cluster["Cluster"]
K["Kyverno + kyverno-policies<br/>PSS baseline Audit"]
end
SM --> SOPS -->|"plain Secret"| workloads["Workload Secrets"]
K -.->|"admission / audit<br/>(PSS baseline)"| workloads
```
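The manual apply path can be sketched as a small helper that assembles the shell pipeline and the environment SOPS needs to find the age key. This is an illustration under stated assumptions, not the logic in `noble.yml`: the secret file name is hypothetical, the helper names are invented here, and nothing is executed. `SOPS_AGE_KEY_FILE` is the environment variable SOPS reads for an age identity file.

```python
import shlex
from pathlib import PurePosixPath
from typing import Dict


def age_env(repo_root: str) -> Dict[str, str]:
    """Environment pointing SOPS at the gitignored key in the repo root."""
    return {"SOPS_AGE_KEY_FILE": str(PurePosixPath(repo_root) / "age-key.txt")}


def decrypt_apply_pipeline(secret_file: str) -> str:
    """Shell pipeline: decrypt one manifest and apply it from stdin."""
    return f"sops -d {shlex.quote(secret_file)} | kubectl apply -f -"


if __name__ == "__main__":
    # Hypothetical file name under the secrets directory described above.
    example = "clusters/noble/secrets/example.secret.yaml"
    print(age_env("."))
    print(decrypt_apply_pipeline(example))
```

Piping into `kubectl apply -f -` keeps the decrypted Secret off disk, which is the point of decrypting at apply time.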
---
## Data and storage
**StorageClass:** **`longhorn`** (default). Talos mounts **user volume** data at **`/var/mnt/longhorn`** (bind paths for Longhorn). Stateful consumers include **kube-prometheus-stack** PVCs and **Loki**.
```mermaid
flowchart TB
subgraph disks["Per-node Longhorn data path"]
UD["Talos user volume →<br/>/var/mnt/longhorn (bind to Longhorn paths)"]
end
subgraph LH["Longhorn"]
SC["StorageClass: longhorn (default)"]
end
subgraph consumers["Stateful / durable consumers"]
PGL["kube-prometheus-stack PVCs"]
L["Loki PVC"]
end
UD --> SC
SC --> PGL
SC --> L
```
---
## Component versions
See [`talos/CLUSTER-BUILD.md`](../talos/CLUSTER-BUILD.md) for the authoritative checklist. Summary:
| Component | Chart / app (from CLUSTER-BUILD.md) |
|-----------|-------------------------------------|
| Talos / Kubernetes | v1.12.6 / 1.35.2 bundled |
| Cilium | Helm 1.16.6 |
| MetalLB | 0.15.3 |
| Longhorn | 1.11.1 |
| Traefik | 39.0.6 / app v3.6.11 |
| cert-manager | v1.20.0 |
| Argo CD | 9.4.17 / app v3.3.6 |
| kube-prometheus-stack | 82.15.1 |
| Loki / Fluent Bit | 6.55.0 / 0.56.0 |
| SOPS (client tooling) | see `clusters/noble/secrets/README.md` |
| Kyverno | 3.7.1 / policies 3.7.1 |
| Newt | 1.2.0 / app 1.10.1 |
---
## Narrative
The **noble** environment is a **Talos** lab cluster on **`192.168.50.0/24`** with **three control plane nodes and one worker**, workload scheduling enabled on the control planes, and the Kubernetes API exposed through **kube-vip** at **`192.168.50.230`**. **Cilium** provides the CNI after Talos bootstrap with **`cni: none`**; **MetalLB** advertises **`192.168.50.210`–`192.168.50.229`**, pinning **Argo CD** to **`192.168.50.210`** and **Traefik** to **`192.168.50.211`** for **`*.apps.noble.lab.pcenicni.dev`**. **cert-manager** issues certificates for Traefik Ingresses. **GitOps** starts with **Ansible** for the **initial** platform install (**`clusters/noble/bootstrap/`**), then hands over to **Argo CD** for the kustomize tree (**`noble-bootstrap-root`** → **`clusters/noble/bootstrap`**) and optional apps (**`noble-root`** → **`clusters/noble/apps/`**) once automated sync is enabled after **`noble.yml`** (see **`clusters/noble/bootstrap/argocd/README.md`** §5). **Observability** uses **kube-prometheus-stack** in **`monitoring`**, plus **Loki** and **Fluent Bit** with Grafana wired via a **ConfigMap** datasource, with **Longhorn** PVCs for Prometheus, Grafana, Alertmanager, and Loki. **Secrets** in git use **SOPS** + **age** under **`clusters/noble/secrets/`**; **Kyverno** enforces **Pod Security Standards baseline** in **Audit**. **Public** access uses **Newt** to **Pangolin** with **CNAME** and Integration API steps as documented, not generic in-cluster public DNS.
---
## Assumptions and open questions
**Assumptions**
- **Hypervisor vs bare metal:** Not fixed in inventory tables; `talconfig.yaml` comments mention Proxmox virtio disk paths as examples—treat actual host platform as **TBD** unless confirmed.
- **Workstation path:** Operators reach the VIP and node IPs from the **LAN or VPN** per [`talos/README.md`](../talos/README.md).
- **Optional components** (Headlamp, Renovate, Velero, Phase G hardening) are described in CLUSTER-BUILD.md; they are not required for the diagrams above until deployed.
**Open questions**
- **Split horizon:** Confirm whether only LAN DNS resolves `*.apps.noble.lab.pcenicni.dev` to **`192.168.50.211`** or whether public resolvers also point at that address.
- **Velero / S3:** optional **Ansible** install (**`noble_velero_install`**) from **`clusters/noble/bootstrap/velero/`** once an S3-compatible backend and credentials exist (see **`talos/CLUSTER-BUILD.md`** Phase F).
- **Argo CD:** Confirm **`repoURL`** in `root-application.yaml` and what is actually applied on-cluster.
---
*Keep in sync with [`talos/CLUSTER-BUILD.md`](../talos/CLUSTER-BUILD.md) and manifests under [`clusters/noble/`](../clusters/noble/).*