Update CLUSTER-BUILD.md to include kube-prometheus-stack Helm chart details, enhance observability phase with Grafana ingress configuration, and clarify deployment instructions for monitoring components. Mark tasks as completed for kube-prometheus-stack installation and PVC binding on Longhorn.
This commit is contained in:
11
clusters/noble/apps/kube-prometheus-stack/namespace.yaml
Normal file
11
clusters/noble/apps/kube-prometheus-stack/namespace.yaml
Normal file
@@ -0,0 +1,11 @@
|
||||
# kube-prometheus-stack — apply before Helm (omit --create-namespace on install).
|
||||
# prometheus-node-exporter uses hostNetwork, hostPID, and hostPath (/proc, /sys, /) — incompatible
|
||||
# with PSA "baseline"; use "privileged" (same idea as longhorn-system / metallb-system).
|
||||
apiVersion: v1
|
||||
kind: Namespace
|
||||
metadata:
|
||||
name: monitoring
|
||||
labels:
|
||||
pod-security.kubernetes.io/enforce: privileged
|
||||
pod-security.kubernetes.io/audit: privileged
|
||||
pod-security.kubernetes.io/warn: privileged
|
||||
72
clusters/noble/apps/kube-prometheus-stack/values.yaml
Normal file
72
clusters/noble/apps/kube-prometheus-stack/values.yaml
Normal file
@@ -0,0 +1,72 @@
|
||||
# kube-prometheus-stack — noble lab (Prometheus Operator + Grafana + Alertmanager + exporters)
|
||||
#
|
||||
# Chart: prometheus-community/kube-prometheus-stack — pin version on install (e.g. 82.15.1).
|
||||
#
|
||||
# Install (use one terminal; chain with && so `helm upgrade` always runs after `helm repo update`):
|
||||
#
|
||||
# kubectl apply -f clusters/noble/apps/kube-prometheus-stack/namespace.yaml
|
||||
# helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
|
||||
# helm repo update && helm upgrade --install kube-prometheus prometheus-community/kube-prometheus-stack -n monitoring \
|
||||
# --version 82.15.1 -f clusters/noble/apps/kube-prometheus-stack/values.yaml --wait --timeout 30m
|
||||
#
|
||||
# Why it looks "stalled": with --wait, Helm prints almost nothing until the release finishes (can be many minutes).
|
||||
# Do not use --timeout 5m for first install — Longhorn PVCs + StatefulSets often need 15–30m. To watch progress,
|
||||
# open a second terminal: kubectl -n monitoring get pods,sts,ds -w
|
||||
# To apply manifest changes without blocking: omit --wait, then kubectl -n monitoring get pods -w
|
||||
#
|
||||
# Grafana admin password: Secret `kube-prometheus-grafana` keys `admin-user` / `admin-password` unless you set grafana.adminPassword.
|
||||
|
||||
# --- Longhorn-backed persistence (default chart storage is emptyDir) ---
|
||||
alertmanager:
|
||||
alertmanagerSpec:
|
||||
storage:
|
||||
volumeClaimTemplate:
|
||||
spec:
|
||||
storageClassName: longhorn
|
||||
accessModes: ["ReadWriteOnce"]
|
||||
resources:
|
||||
requests:
|
||||
storage: 5Gi
|
||||
|
||||
prometheus:
|
||||
prometheusSpec:
|
||||
retention: 15d
|
||||
retentionSize: 25GB
|
||||
storageSpec:
|
||||
volumeClaimTemplate:
|
||||
spec:
|
||||
storageClassName: longhorn
|
||||
accessModes: ["ReadWriteOnce"]
|
||||
resources:
|
||||
requests:
|
||||
storage: 30Gi
|
||||
|
||||
grafana:
|
||||
persistence:
|
||||
enabled: true
|
||||
type: sts
|
||||
storageClassName: longhorn
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
size: 10Gi
|
||||
|
||||
# HTTPS via Traefik + cert-manager (ClusterIssuer letsencrypt-prod; same pattern as other *.apps.noble.lab.pcenicni.dev hosts).
|
||||
# DNS: grafana.apps.noble.lab.pcenicni.dev → Traefik LoadBalancer (192.168.50.211) — see clusters/noble/apps/traefik/values.yaml
|
||||
ingress:
|
||||
enabled: true
|
||||
ingressClassName: traefik
|
||||
path: /
|
||||
pathType: Prefix
|
||||
annotations:
|
||||
cert-manager.io/cluster-issuer: letsencrypt-prod
|
||||
hosts:
|
||||
- grafana.apps.noble.lab.pcenicni.dev
|
||||
tls:
|
||||
- secretName: grafana-apps-noble-tls
|
||||
hosts:
|
||||
- grafana.apps.noble.lab.pcenicni.dev
|
||||
|
||||
grafana.ini:
|
||||
server:
|
||||
domain: grafana.apps.noble.lab.pcenicni.dev
|
||||
root_url: https://grafana.apps.noble.lab.pcenicni.dev/
|
||||
@@ -14,7 +14,8 @@ This document is the **exported TODO** for the **noble** Talos cluster (4 nodes)
|
||||
- **cert-manager** Helm **v1.20.0** / app **v1.20.0** — `clusters/noble/apps/cert-manager/`; **`ClusterIssuer`** **`letsencrypt-staging`** and **`letsencrypt-prod`** (HTTP-01, ingress class **`traefik`**); ACME email **`certificates@noble.lab.pcenicni.dev`** (edit in manifests if you want a different mailbox).
|
||||
- **Newt** Helm **1.2.0** / app **1.10.1** — `clusters/noble/apps/newt/` (**fossorial/newt**); Pangolin site tunnel — **`newt-pangolin-auth`** Secret (**`PANGOLIN_ENDPOINT`**, **`NEWT_ID`**, **`NEWT_SECRET`**). **Public DNS** is **not** automated with ExternalDNS: **CNAME** records at your DNS host per Pangolin’s domain instructions, plus **Integration API** for HTTP resources/targets — see **`clusters/noble/apps/newt/README.md`**. LAN access to Traefik can still use **`*.apps.noble.lab.pcenicni.dev`** → **`192.168.50.211`** (split horizon / local resolver).
|
||||
- **Argo CD** Helm **9.4.17** / app **v3.3.6** — `clusters/noble/bootstrap/argocd/`; **`argocd-server`** **`LoadBalancer`** **`192.168.50.210`**; app-of-apps scaffold under **`bootstrap/argocd/apps/`** (edit **`root-application.yaml`** `repoURL` before applying).
|
||||
- **Still open:** observability — checklist below.
|
||||
- **kube-prometheus-stack** — Helm chart **82.15.1** — `clusters/noble/apps/kube-prometheus-stack/` (**namespace** `monitoring`, PSA **privileged** — **node-exporter** needs host mounts); **Longhorn** PVCs for Prometheus, Grafana, Alertmanager. **Grafana Ingress:** **`https://grafana.apps.noble.lab.pcenicni.dev`** (Traefik **`ingressClassName: traefik`**, **`cert-manager.io/cluster-issuer: letsencrypt-prod`**). **`helm upgrade --install` with `--wait` is silent until done** — use **`--timeout 30m`** (not `5m`) and watch **`kubectl -n monitoring get pods -w`** in another terminal. Grafana admin password: Secret **`kube-prometheus-grafana`**, keys **`admin-user`** / **`admin-password`**.
|
||||
- **Still open:** **Loki** + **Fluent Bit** + Grafana datasource (Phase D).
|
||||
|
||||
## Inventory
|
||||
|
||||
@@ -34,6 +35,7 @@ This document is the **exported TODO** for the **noble** Talos cluster (4 nodes)
|
||||
| Argo CD `LoadBalancer` | **Pick one IP** in the MetalLB pool (e.g. `192.168.50.210`) |
|
||||
| Traefik (apps ingress) | `192.168.50.211` — **`metallb.io/loadBalancerIPs`** in `clusters/noble/apps/traefik/values.yaml` |
|
||||
| Apps ingress (LAN / split horizon) | `*.apps.noble.lab.pcenicni.dev` → Traefik LB |
|
||||
| Grafana (Ingress + TLS) | **`grafana.apps.noble.lab.pcenicni.dev`** — `grafana.ingress` in `clusters/noble/apps/kube-prometheus-stack/values.yaml` (**`letsencrypt-prod`**) |
|
||||
| Public DNS (Pangolin) | **Newt** tunnel + **CNAME** at registrar + **Integration API** — `clusters/noble/apps/newt/` |
|
||||
| Velero | S3-compatible URL — configure later |
|
||||
|
||||
@@ -50,6 +52,7 @@ This document is the **exported TODO** for the **noble** Talos cluster (4 nodes)
|
||||
- cert-manager: **v1.20.0** (Helm chart; app **v1.20.0**)
|
||||
- Newt (Fossorial): **1.2.0** (Helm chart; app **1.10.1**)
|
||||
- Argo CD: **9.4.17** (Helm chart `argo/argo-cd`; app **v3.3.6**)
|
||||
- kube-prometheus-stack: **82.15.1** (Helm chart `prometheus-community/kube-prometheus-stack`; app **v0.89.x** bundle)
|
||||
|
||||
## Repo paths (this workspace)
|
||||
|
||||
@@ -69,6 +72,7 @@ This document is the **exported TODO** for the **noble** Talos cluster (4 nodes)
|
||||
| cert-manager (Helm + ClusterIssuers) | `clusters/noble/apps/cert-manager/` — `values.yaml`, `namespace.yaml`, `kustomization.yaml`, `README.md` |
|
||||
| Newt / Pangolin tunnel (Helm) | `clusters/noble/apps/newt/` — `values.yaml`, `namespace.yaml`, `README.md` |
|
||||
| Argo CD (bootstrap + app-of-apps) | `clusters/noble/bootstrap/argocd/` — `values.yaml`, `root-application.yaml`, `apps/`, `README.md` |
|
||||
| kube-prometheus-stack (Helm values) | `clusters/noble/apps/kube-prometheus-stack/` — `values.yaml`, `namespace.yaml` |
|
||||
|
||||
**Git vs cluster:** manifests and `talconfig` live in git; **`talhelper genconfig -o out`**, bootstrap, Helm, and `kubectl` run on your LAN. See **`talos/README.md`** for workstation reachability (lab LAN/VPN), **`talosctl kubeconfig`** vs Kubernetes `server:` (VIP vs node IP), and **`--insecure`** only in maintenance.
|
||||
|
||||
@@ -117,12 +121,12 @@ This document is the **exported TODO** for the **noble** Talos cluster (4 nodes)
|
||||
|
||||
- [x] **Argo CD** bootstrap — `clusters/noble/bootstrap/argocd/` (`helm upgrade --install argocd …`)
|
||||
- [x] Argo CD server **LoadBalancer** — **`192.168.50.210`** (see `values.yaml`)
|
||||
- [ ] **App-of-apps** — set **`repoURL`** in **`root-application.yaml`**, add **`Application`** manifests under **`bootstrap/argocd/apps/`**, apply **`root-application.yaml`**
|
||||
- [X] **App-of-apps** — set **`repoURL`** in **`root-application.yaml`**, add **`Application`** manifests under **`bootstrap/argocd/apps/`**, apply **`root-application.yaml`**
|
||||
- [ ] SSO — later
|
||||
|
||||
## Phase D — Observability
|
||||
|
||||
- [ ] **kube-prometheus-stack** (PVCs on Longhorn)
|
||||
- [x] **kube-prometheus-stack** — `kubectl apply -f clusters/noble/apps/kube-prometheus-stack/namespace.yaml` then **`helm upgrade --install`** as in `clusters/noble/apps/kube-prometheus-stack/values.yaml` (chart **82.15.1**); PVCs **`longhorn`**; **`--wait --timeout 30m`** recommended; verify **`kubectl -n monitoring get pods,pvc`**
|
||||
- [ ] **Loki** + **Fluent Bit**; Grafana datasource
|
||||
|
||||
## Phase E — Secrets
|
||||
@@ -149,6 +153,7 @@ This document is the **exported TODO** for the **noble** Talos cluster (4 nodes)
|
||||
- [x] **Argo CD** UI — **`argocd-server`** **`LoadBalancer`** **`192.168.50.210`** (initial **`admin`** password from **`argocd-initial-admin-secret`**)
|
||||
- [ ] Sample Ingress + cert (cert-manager ready) + Pangolin resource + CNAME
|
||||
- [x] PVC **`Bound`** on **Longhorn** (`storageClassName: longhorn`); Prometheus/Loki durable when configured
|
||||
- [x] **`monitoring`** — **kube-prometheus-stack** core workloads **Running** (Prometheus, Grafana, Alertmanager, operator, kube-state-metrics); PVCs **Bound** on **longhorn**
|
||||
|
||||
---
|
||||
|
||||
|
||||
Reference in New Issue
Block a user