# Noble platform architecture
This document describes the **noble** Talos lab cluster: node topology, networking, platform stack, observability, secrets/policy, and storage. Facts align with [`talos/CLUSTER-BUILD.md`](../talos/CLUSTER-BUILD.md), [`talos/talconfig.yaml`](../talos/talconfig.yaml), and manifests under [`clusters/noble/`](../clusters/noble/).
## Legend
| Shape / style | Meaning |
|---------------|---------|
| **Subgraph “Cluster”** | Kubernetes cluster boundary (`noble`) |
| **External / DNS / cloud** | Services outside the data plane (internet, registrar, Pangolin) |
| **Data store** | Durable data (etcd, Longhorn, Loki) |
| **Secrets / policy** | Secret material (SOPS in git), admission policy |
| **LB / VIP** | Load balancer, MetalLB assignment, or API VIP |

---
## Physical / node topology
Four Talos nodes on **LAN `192.168.50.0/24`**: three control planes (**neon**, **argon**, **krypton**) and one worker (**helium**). `allowSchedulingOnControlPlanes: true` is set in `talconfig.yaml`, so the control planes also run workloads. The Kubernetes API is fronted by **kube-vip** on **`192.168.50.230`** (not a separate hardware load balancer).
```mermaid
flowchart TB
    subgraph LAN["LAN 192.168.50.0/24"]
        subgraph CP["Control planes (kube-vip VIP 192.168.50.230:6443)"]
            neon["neon<br/>192.168.50.20<br/>control-plane + schedulable"]
            argon["argon<br/>192.168.50.30<br/>control-plane + schedulable"]
            krypton["krypton<br/>192.168.50.40<br/>control-plane + schedulable"]
        end
        subgraph W["Worker"]
            helium["helium<br/>192.168.50.10<br/>worker only"]
        end
        VIP["API VIP 192.168.50.230<br/>kube-vip on ens18<br/>→ apiserver :6443"]
    end
    neon --- VIP
    argon --- VIP
    krypton --- VIP
    kubectl["kubectl / talosctl clients<br/>(workstation on LAN/VPN)"] -->|"HTTPS :6443"| VIP
```
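The node layout above can be sketched as a talhelper-style `talconfig.yaml` fragment. This is an illustrative sketch only — field names follow talhelper conventions, and the authoritative values live in [`talos/talconfig.yaml`](../talos/talconfig.yaml):

```yaml
# Illustrative sketch — see talos/talconfig.yaml for the real file.
clusterName: noble
endpoint: https://192.168.50.230:6443   # kube-vip API VIP
allowSchedulingOnControlPlanes: true
nodes:
  - hostname: neon
    ipAddress: 192.168.50.20
    controlPlane: true
  - hostname: argon
    ipAddress: 192.168.50.30
    controlPlane: true
  - hostname: krypton
    ipAddress: 192.168.50.40
    controlPlane: true
  - hostname: helium
    ipAddress: 192.168.50.10
    controlPlane: false
```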
---
## Network and ingress
**North–south (apps on LAN):** DNS for **`*.apps.noble.lab.pcenicni.dev`** → **Traefik** **`LoadBalancer`** at **`192.168.50.211`**, drawn from the **MetalLB** L2 pool **`192.168.50.210`**–**`192.168.50.229`**; **Argo CD** uses **`192.168.50.210`**. **Public** access does not go through in-cluster ExternalDNS: it uses **Newt** (Pangolin tunnel) plus **CNAME** and **Integration API** steps per [`clusters/noble/bootstrap/newt/README.md`](../clusters/noble/bootstrap/newt/README.md).
```mermaid
flowchart TB
    user["User"]
    subgraph DNS["DNS"]
        pub["Public: CNAME → Pangolin<br/>(per Newt README; not ExternalDNS)"]
        split["LAN / split horizon:<br/>*.apps.noble.lab.pcenicni.dev<br/>→ 192.168.50.211"]
    end
    subgraph LAN["LAN"]
        ML["MetalLB L2<br/>pool 192.168.50.210–229<br/>IPAddressPool noble-l2"]
        T["Traefik Service LoadBalancer<br/>192.168.50.211<br/>IngressClass: traefik"]
        Argo["Argo CD server LoadBalancer<br/>192.168.50.210"]
        Newt["Newt (Pangolin tunnel)<br/>outbound to Pangolin"]
    end
    subgraph Cluster["Cluster workloads"]
        Ing["Ingress resources<br/>cert-manager HTTP-01"]
        App["Apps / Grafana Ingress<br/>e.g. grafana.apps.noble.lab.pcenicni.dev"]
    end
    user --> pub
    user --> split
    split --> T
    pub -.->|"tunnel path"| Newt
    T --> Ing --> App
    ML --- T
    ML --- Argo
    user -->|"optional direct to LB IP"| Argo
```
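A minimal sketch of the MetalLB objects implied above, assuming the `noble-l2` pool name from the diagram (the real manifests live under [`clusters/noble/`](../clusters/noble/)):

```yaml
# Illustrative sketch of the noble-l2 pool and its L2 advertisement.
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: noble-l2
  namespace: metallb-system
spec:
  addresses:
    - 192.168.50.210-192.168.50.229
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: noble-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - noble-l2
```

Traefik's Service is then pinned to `192.168.50.211` (for example via the `metallb.io/loadBalancerIPs` Service annotation), and Argo CD's to `192.168.50.210`; check the actual chart values for how the pinning is done here.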
---
## Platform stack (bootstrap → workloads)
Order: **Talos** → **Cilium** (cluster uses `cni: none` until CNI is installed) → **metrics-server**, **Longhorn**, **MetalLB** + pool manifests, **kube-vip** → **Traefik**, **cert-manager** → **Argo CD** (Helm only; optional empty app-of-apps). **Automated install:** `ansible/playbooks/noble.yml` (see `ansible/README.md`). Platform namespaces include `cert-manager`, `traefik`, `metallb-system`, `longhorn-system`, `monitoring`, `loki`, `logging`, `argocd`, `kyverno`, `newt`, and others as deployed.
```mermaid
flowchart TB
    subgraph L0["OS / bootstrap"]
        Talos["Talos v1.12.6<br/>Image Factory schematic"]
    end
    subgraph L1["CNI"]
        Cilium["Cilium<br/>(cni: none until installed)"]
    end
    subgraph L2["Core add-ons"]
        MS["metrics-server"]
        LH["Longhorn + default StorageClass"]
        MB["MetalLB + pool manifests"]
        KV["kube-vip (API VIP)"]
    end
    subgraph L3["Ingress and TLS"]
        Traefik["Traefik"]
        CM["cert-manager + ClusterIssuers"]
    end
    subgraph L4["GitOps"]
        Argo["Argo CD<br/>(optional app-of-apps; platform via Ansible)"]
    end
    subgraph L5["Platform namespaces (examples)"]
        NS["cert-manager, traefik, metallb-system,<br/>longhorn-system, monitoring, loki, logging,<br/>argocd, kyverno, newt, …"]
    end
    Talos --> Cilium --> MS
    Cilium --> LH
    Cilium --> MB
    Cilium --> KV
    MB --> Traefik
    Traefik --> CM
    CM --> Argo
    Argo --> NS
```
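As one example of the "cert-manager + ClusterIssuers" step, an HTTP-01 issuer fronted by Traefik might look like the following. The issuer name and email are placeholders — the real ClusterIssuers are applied by the Ansible playbook:

```yaml
# Illustrative ClusterIssuer; name and email are placeholders.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-http01   # placeholder name
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com   # placeholder contact
    privateKeySecretRef:
      name: letsencrypt-http01-account-key
    solvers:
      - http01:
          ingress:
            ingressClassName: traefik   # matches the IngressClass above
```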
---
## Observability path
**kube-prometheus-stack** in **`monitoring`**: Prometheus, Grafana, Alertmanager, node-exporter, etc. **Loki** (SingleBinary) in **`loki`** with **Fluent Bit** in **`logging`** shipping to **`loki-gateway`**. Grafana Loki datasource is applied via **ConfigMap** [`clusters/noble/bootstrap/grafana-loki-datasource/loki-datasource.yaml`](../clusters/noble/bootstrap/grafana-loki-datasource/loki-datasource.yaml). Prometheus, Grafana, Alertmanager, and Loki use **Longhorn** PVCs where configured.
```mermaid
flowchart LR
    subgraph Nodes["All nodes"]
        NE["node-exporter DaemonSet"]
        FB["Fluent Bit DaemonSet<br/>namespace: logging"]
    end
    subgraph mon["monitoring"]
        PROM["Prometheus"]
        AM["Alertmanager"]
        GF["Grafana"]
        SC["ServiceMonitors / kube-state-metrics / operator"]
    end
    subgraph lok["loki"]
        LG["loki-gateway Service"]
        LO["Loki SingleBinary"]
    end
    NE --> PROM
    PROM --> GF
    AM --> GF
    FB -->|"to loki-gateway:80"| LG --> LO
    GF -->|"Explore / datasource ConfigMap<br/>grafana-loki-datasource"| LO
    subgraph PVC["Longhorn PVCs"]
        P1["Prometheus / Grafana /<br/>Alertmanager PVCs"]
        P2["Loki PVC"]
    end
    PROM --- P1
    LO --- P2
```
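The Grafana→Loki wiring is a sidecar-discovered datasource ConfigMap. A sketch consistent with the referenced file might look like this — the label and the gateway URL follow common kube-prometheus-stack and Loki chart defaults, so verify both against the actual manifest:

```yaml
# Illustrative sketch of the Loki datasource ConfigMap.
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-loki-datasource
  namespace: monitoring
  labels:
    grafana_datasource: "1"   # label the Grafana sidecar watches for
data:
  loki-datasource.yaml: |
    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        access: proxy
        url: http://loki-gateway.loki.svc.cluster.local   # assumed Service DNS name
```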
---
## Secrets and policy
**Mozilla SOPS** with **age** encrypts plain Kubernetes **`Secret`** manifests under [`clusters/noble/secrets/`](../clusters/noble/secrets/); operators decrypt at apply time (`ansible/playbooks/noble.yml` or `sops -d … | kubectl apply`). The private key is **`age-key.txt`** at the repo root (gitignored). **Kyverno** with **kyverno-policies** applies the Pod Security Standards (**PSS**) **baseline** profile in **Audit** mode.
```mermaid
flowchart LR
    subgraph Git["Git repo"]
        SM["SOPS-encrypted Secret YAML<br/>clusters/noble/secrets/"]
    end
    subgraph ops["Apply path"]
        SOPS["sops -d + kubectl apply<br/>(or Ansible noble.yml)"]
    end
    subgraph cluster["Cluster"]
        K["Kyverno + kyverno-policies<br/>PSS baseline Audit"]
    end
    SM --> SOPS -->|"plain Secret"| workloads["Workload Secrets"]
    K -.->|"admission / audit<br/>(PSS baseline)"| workloads
```
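The SOPS flow above is typically driven by a `.sops.yaml` creation rule at the repo root. A hedged sketch — the age recipient is a placeholder that would pair with the gitignored `age-key.txt`:

```yaml
# .sops.yaml — illustrative; check the repo root for the actual rules.
creation_rules:
  - path_regex: clusters/noble/secrets/.*\.ya?ml
    encrypted_regex: ^(data|stringData)$   # encrypt only Secret payload fields
    age: age1qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq   # placeholder recipient
```

Decryption at apply time then looks like `SOPS_AGE_KEY_FILE=./age-key.txt sops -d <file> | kubectl apply -f -`.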
---
## Data and storage
**StorageClass:** **`longhorn`** (default). Talos mounts **user volume** data at **`/var/mnt/longhorn`** (bind paths for Longhorn). Stateful consumers include **kube-prometheus-stack** PVCs and **Loki**.
```mermaid
flowchart TB
    subgraph disks["Per-node Longhorn data path"]
        UD["Talos user volume →<br/>/var/mnt/longhorn (bind to Longhorn paths)"]
    end
    subgraph LH["Longhorn"]
        SC["StorageClass: longhorn (default)"]
    end
    subgraph consumers["Stateful / durable consumers"]
        PGL["kube-prometheus-stack PVCs"]
        L["Loki PVC"]
    end
    UD --> SC
    SC --> PGL
    SC --> L
```
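The `/var/mnt/longhorn` mount corresponds to a Talos user volume named `longhorn` (Talos mounts user volumes at `/var/mnt/<name>`). A sketch of the machine-config document — the disk selector and size are placeholders, since the actual host platform is TBD:

```yaml
# Illustrative Talos user volume; mounted at /var/mnt/longhorn.
apiVersion: v1alpha1
kind: UserVolumeConfig
name: longhorn
provisioning:
  diskSelector:
    match: disk.transport == "virtio"   # placeholder; matches a Proxmox virtio disk
  minSize: 100GiB                       # placeholder size
filesystem:
  type: xfs
```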
---
## Component versions
See [`talos/CLUSTER-BUILD.md`](../talos/CLUSTER-BUILD.md) for the authoritative checklist. Summary:
| Component | Chart / app (from CLUSTER-BUILD.md) |
|-----------|-------------------------------------|
| Talos / Kubernetes | v1.12.6 / 1.35.2 bundled |
| Cilium | Helm 1.16.6 |
| MetalLB | 0.15.3 |
| Longhorn | 1.11.1 |
| Traefik | 39.0.6 / app v3.6.11 |
| cert-manager | v1.20.0 |
| Argo CD | 9.4.17 / app v3.3.6 |
| kube-prometheus-stack | 82.15.1 |
| Loki / Fluent Bit | 6.55.0 / 0.56.0 |
| SOPS (client tooling) | see `clusters/noble/secrets/README.md` |
| Kyverno | 3.7.1 / policies 3.7.1 |
| Newt | 1.2.0 / app 1.10.1 |

---
## Narrative
The **noble** environment is a **Talos** lab cluster on **`192.168.50.0/24`** with **three control plane nodes and one worker**, schedulable workloads on control planes enabled, and the Kubernetes API exposed through **kube-vip** at **`192.168.50.230`**. **Cilium** provides the CNI after Talos bootstrap with **`cni: none`**; **MetalLB** advertises **`192.168.50.210`**–**`192.168.50.229`**, pinning **Argo CD** to **`192.168.50.210`** and **Traefik** to **`192.168.50.211`** for **`*.apps.noble.lab.pcenicni.dev`**. **cert-manager** issues certificates for Traefik Ingresses; **GitOps** is **Ansible-driven Helm** for the platform (**`clusters/noble/bootstrap/`**) plus an optional **Argo CD** app-of-apps (**`clusters/noble/apps/`**, **`clusters/noble/bootstrap/argocd/`**). **Observability** uses **kube-prometheus-stack** in **`monitoring`** plus **Loki** and **Fluent Bit** for logs, with Grafana wired to Loki via a **ConfigMap** datasource and **Longhorn** PVCs backing Prometheus, Grafana, Alertmanager, and Loki. **Secrets** in git use **SOPS** + **age** under **`clusters/noble/secrets/`**; **Kyverno** enforces the **Pod Security Standards baseline** profile in **Audit** mode. **Public** access uses **Newt** to **Pangolin** with **CNAME** and Integration API steps as documented, not generic in-cluster public DNS.
---
## Assumptions and open questions
**Assumptions**
- **Hypervisor vs bare metal:** Not fixed in inventory tables; `talconfig.yaml` comments mention Proxmox virtio disk paths as examples, so treat the actual host platform as **TBD** unless confirmed.
- **Workstation path:** Operators reach the VIP and node IPs from the **LAN or VPN** per [`talos/README.md`](../talos/README.md).
- **Optional components** (Headlamp, Renovate, Velero, Phase G hardening) are described in CLUSTER-BUILD.md; they are not required for the diagrams above until deployed.

**Open questions**
- **Split horizon:** Confirm whether only LAN DNS resolves `*.apps.noble.lab.pcenicni.dev` to **`192.168.50.211`** or whether public resolvers also point at that address.
- **Velero / S3:** Optional **Ansible** install (**`noble_velero_install`**) from **`clusters/noble/bootstrap/velero/`** once an S3-compatible backend and credentials exist (see **`talos/CLUSTER-BUILD.md`** Phase F).
- **Argo CD:** Confirm **`repoURL`** in `root-application.yaml` and what is actually applied on-cluster.

---
*Keep in sync with [`talos/CLUSTER-BUILD.md`](../talos/CLUSTER-BUILD.md) and manifests under [`clusters/noble/`](../clusters/noble/).*