Future Improvements
This chapter covers what to add next — in priority order — to make the observability stack production-ready, secure, and easier to operate.
1. Alerting with Grafana Alerting (High Priority)
Set up alerts so the system notifies you before users report problems.
How Grafana Alerting Works
Grafana has a built-in alerting engine. You define alert rules on any dashboard panel query; when the condition holds (e.g., error rate above 5% for 5 minutes), Grafana fires an alert to a contact point such as Slack or email.
Set Up a Slack Notification Channel
- Create a Slack Incoming Webhook at api.slack.com/apps
- In Grafana: Alerting > Contact Points > Add Contact Point
- Choose Slack
- Paste your Webhook URL
- Save
Create Your First Alert Rule
- Open any Grafana panel with a PromQL query
- Click the panel menu (three dots) > Edit
- Click the Alert tab
- Click Create alert rule from this panel
- Set the condition:
- Query: your PromQL expression
- Condition: IS ABOVE 80 (for CPU %)
- For: 5m (must be above threshold for 5 minutes to avoid flapping)
- Set notification: select your Slack contact point
- Save
Recommended Alert Rules to Create
| Alert | Query | Threshold | Wait |
|---|---|---|---|
| High CPU | 100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 | > 85% | 5m |
| Low Memory | node_memory_MemAvailable_bytes / 1024 / 1024 / 1024 | < 1 GB | 2m |
| High Disk | 100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100) | > 85% | 1m |
| High Error Rate | rate(healthtune_http_request_duration_seconds_count{status=~"5.."}[5m]) | > 0.1 req/s | 2m |
| Service Down | up{job="healthtune-api"} | == 0 | 1m |
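The UI steps above work well for a single Grafana instance. If you later adopt Alertmanager (next section), the same conditions can live in a Prometheus rules file instead. A minimal sketch for the Service Down alert (the file path and group name are illustrative):

```yaml
# rules/healthtune.yml -- illustrative path; wire it up via rule_files in prometheus.yml
groups:
  - name: healthtune-alerts
    rules:
      - alert: ServiceDown
        expr: up{job="healthtune-api"} == 0
        for: 1m  # must hold for 1 minute before firing
        labels:
          severity: critical
        annotations:
          description: 'healthtune-api target has been down for 1 minute'
```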
2. Prometheus Alertmanager (Advanced Alert Routing)
For more control over alert routing, use Alertmanager as a standalone service.
Alertmanager lets you:
- Route different alerts to different channels (critical -> PagerDuty, warnings -> Slack)
- Silence alerts during maintenance windows
- Group related alerts to avoid alert floods
Add Alertmanager to docker-compose.yaml:
```yaml
alertmanager:
  image: prom/alertmanager:latest
  container_name: alertmanager
  restart: unless-stopped
  ports:
    - "9093:9093"
  volumes:
    - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
  command:
    - --config.file=/etc/alertmanager/alertmanager.yml
  networks:
    - observability
```
Sample alertmanager.yml:
```yaml
route:
  group_by: [alertname, app]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: slack-team

receivers:
  - name: slack-team
    slack_configs:
      - api_url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
        channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
```
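Alertmanager only routes and groups alerts; Prometheus still evaluates the rules and needs to know where to send them. Assuming the compose service name above, point prometheus.yml at it (the rule_files path is an assumption, adjust to your layout):

```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]  # compose service name from above

rule_files:
  - /etc/prometheus/rules/*.yml  # wherever you mount your alerting rules
```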
3. HTTPS / TLS for Grafana (Security)
Grafana currently runs on plain HTTP. For any team or public-facing URL, add HTTPS.
Option A - Nginx Reverse Proxy + Let's Encrypt (Recommended)
Install Nginx and Certbot:
```bash
sudo apt install -y nginx certbot python3-certbot-nginx
```
Create Nginx config for Grafana:
```bash
sudo nano /etc/nginx/sites-available/grafana
```

```nginx
server {
    listen 80;
    server_name grafana.yourcompany.com;

    location / {
        proxy_pass http://localhost:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
```
Enable and get SSL certificate:
```bash
sudo ln -s /etc/nginx/sites-available/grafana /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl reload nginx
sudo certbot --nginx -d grafana.yourcompany.com
```
Let's Encrypt certificates are valid for 90 days; Certbot installs a renewal timer that renews them automatically before they expire.
Option B - Cloudflare Proxy (Simplest for GCP)
- Point your domain DNS to your VM's public IP
- Enable Cloudflare orange-cloud (proxy mode)
- Cloudflare handles SSL automatically — no cert setup needed
- Your Grafana runs HTTP internally but users see HTTPS
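With either option, it helps to tell Grafana its public URL so generated links and redirects use the HTTPS address. A minimal sketch for docker-compose.yaml (the domain is a placeholder):

```yaml
grafana:
  environment:
    - GF_SERVER_ROOT_URL=https://grafana.yourcompany.com  # placeholder domain
```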
4. Data Retention Policies
Without retention limits, Prometheus TSDB and Loki storage grow forever and fill your disk.
Prometheus Retention
Already set to 30 days in the install. Adjust in docker-compose.yaml:
```yaml
command:
  - --storage.tsdb.retention.time=15d   # How long to keep metrics data
  - --storage.tsdb.retention.size=10GB  # Or limit by size
```
Recommended retention by environment:
- Development VM: 7d or 10d
- Production: 30d to 90d
- Long-term: Use Thanos or Grafana Mimir to extend to years
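To double-check which retention flags your Prometheus is actually running with, you can query its flags endpoint (assuming the default port mapping):

```bash
# Pretty-print runtime flags and pick out the retention settings
curl -s http://localhost:9090/api/v1/status/flags | python3 -m json.tool | grep retention
```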
Loki Retention
In loki-config.yaml:
```yaml
limits_config:
  retention_period: 168h  # 7 days (168 hours)

compactor:
  working_directory: /loki/compactor
  shared_store: filesystem
  compaction_interval: 10m
  retention_enabled: true  # Must enable this for retention to work
  retention_delete_delay: 2h
```
Tempo Retention
In tempo-config.yaml:
```yaml
compactor:
  compaction:
    block_retention: 72h  # Keep traces for 3 days
```
Increase for production where you need to debug incidents from last week:
```yaml
    block_retention: 336h  # 14 days
```
5. Kubernetes Integration (When Ready to Scale)
When your apps move to Kubernetes, the Grafana stack has first-class Helm support.
kube-prometheus-stack (Recommended)
This single Helm chart installs Prometheus + Grafana + Alertmanager + Node Exporter + all the Kubernetes-specific dashboards automatically.
```bash
# Add the Grafana and Prometheus community Helm repos
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install the full Prometheus stack
helm install kube-prom prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=admin123

# Install Loki
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set promtail.enabled=true \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=10Gi

# Install Tempo
helm install tempo grafana/tempo \
  --namespace monitoring
```
On Kubernetes, Promtail is deployed as a DaemonSet — one pod per node — and automatically collects logs from all pods via Docker/containerd log files.
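A quick sanity check after the installs (the Grafana service name follows the chart's <release>-grafana convention, so kube-prom-grafana here is an assumption based on the release name used above):

```bash
# All monitoring pods should reach Running within a few minutes
kubectl get pods -n monitoring

# Reach Grafana locally without exposing it
kubectl port-forward -n monitoring svc/kube-prom-grafana 3000:80
```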
6. Long-Term Metrics Storage with Thanos or Mimir
Prometheus stores data locally. The default Docker Compose setup retains 30 days. For longer retention and high-availability, use Thanos or Grafana Mimir.
Thanos (Open Source)
Thanos adds a sidecar to Prometheus that uploads TSDB blocks to object storage (GCS, S3, Azure Blob) continuously. You get:
- Years of metrics retention
- Query across multiple Prometheus instances
- Durable copies of your metrics: blocks already uploaded survive even if the Prometheus container or its volume is lost
Basic Thanos sidecar config:
```yaml
thanos-sidecar:
  image: quay.io/thanos/thanos:latest
  command:
    - sidecar
    - --tsdb.path=/prometheus
    - --prometheus.url=http://prometheus:9090
    - --objstore.config-file=/etc/thanos/bucket.yaml
  volumes:
    - prometheus_data:/prometheus:ro
    - ./thanos/bucket.yaml:/etc/thanos/bucket.yaml:ro
```
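The sidecar reads its object storage target from bucket.yaml. A minimal sketch for GCS (the bucket name is a placeholder; S3 and Azure Blob use the same file with a different type and config keys):

```yaml
type: GCS
config:
  bucket: my-thanos-metrics  # placeholder bucket name
```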
7. Log Parsing and Structured Logs
Currently Promtail ships raw log lines. Add log parsing so you can filter by structured fields.
If your app logs JSON:
{"level":"error","message":"DB timeout","duration":5432,"traceId":"abc123"}
Add a pipeline stage to promtail-config.yaml:
```yaml
scrape_configs:
  - job_name: healthtune
    pipeline_stages:
      - json:
          expressions:
            level: level
            message: message
            duration: duration
            traceId: traceId
      - labels:
          level:  # Promote level to a Loki label for fast filtering
    static_configs:
      - targets: [localhost]
        labels:
          app: healthtune_api
          __path__: /var/log/pm2/healthtune-dev-api-out.log
```
Now you can filter in Grafana by level=error or search by traceId directly in Loki.
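For example, two LogQL queries you could run in Grafana's Explore view, assuming the app label and JSON fields above: the first is fast because level was promoted to a label; the second parses JSON at query time to match on traceId, which is not a label.

```logql
{app="healthtune_api", level="error"}
{app="healthtune_api"} | json | traceId = "abc123"
```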
8. Backup Strategy
Grafana Dashboards
Export dashboards as JSON and commit to Git:
- Dashboard Settings > JSON Model > Download JSON
- Save to your Git repository at monitoring/dashboards/
- Document the Grafana.com dashboard IDs you used in the README
Or use Grafana's provisioning to auto-load dashboards from JSON files — they get restored automatically if you recreate the container.
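A minimal provider file for that approach, e.g. grafana/provisioning/dashboards/dashboards.yaml (the JSON folder path is an assumption; mount your exported dashboards there):

```yaml
apiVersion: 1
providers:
  - name: default
    type: file
    options:
      path: /var/lib/grafana/dashboards  # assumed mount point for your dashboard JSON files
```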
Prometheus Data
```bash
# Requires Prometheus to be started with --web.enable-admin-api
# Snapshot Prometheus TSDB (creates a consistent backup point)
curl -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot
# Result shows the snapshot name:
# {"status":"success","data":{"name":"20260325T100000Z-abc123"}}

# The snapshot lands in your prometheus_data volume under /prometheus/snapshots/.
# Capture the name (note: this POST creates a fresh snapshot) and copy it out:
SNAP=$(curl -s -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot | python3 -c "import sys,json; print(json.load(sys.stdin)['data']['name'])")
docker cp prometheus:/prometheus/snapshots/$SNAP ./prometheus-backup-$SNAP
```
Loki Data
```bash
# Back up Loki chunks (stop Loki first for consistency)
docker compose stop loki
docker cp loki:/loki ./loki-backup-$(date +%Y%m%d)
docker compose start loki
```
9. Grafana Provisioning (Config as Code)
Instead of clicking through the UI every time, define data sources and dashboards in YAML files. This means rebuilding the stack from scratch is one command.
Data sources provisioning file: ~/observability/grafana/provisioning/datasources/datasources.yaml
```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
    access: proxy

  - name: Loki
    type: loki
    uid: loki  # explicit uid so tracesToLogsV2 below can reference it
    url: http://loki:3100
    access: proxy

  - name: Tempo
    type: tempo
    url: http://tempo:3200
    access: proxy
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        customQuery: true
        query: '{app="${__span.tags.service.name}"}'
```
Mount it in the Grafana container:
```yaml
grafana:
  volumes:
    - grafana_data:/var/lib/grafana
    - ./grafana/provisioning:/etc/grafana/provisioning:ro
```
Now data sources are auto-configured on every fresh start — no manual clicking needed.
Suggested Roadmap
| Priority | Task | Estimated Effort |
|---|---|---|
| High | Slack alerting for CPU / errors / service down | 45 minutes |
| High | Set retention: Prometheus 15d, Loki 7d, Tempo 3d | 15 minutes |
| High | Export all dashboards to Git JSON | 30 minutes |
| Medium | Add HTTPS via Cloudflare or Nginx + Certbot | 1-2 hours |
| Medium | Structured JSON logging + Promtail pipeline | 1 hour |
| Medium | Grafana provisioning (config as code) | 1-2 hours |
| Low | Alertmanager for advanced routing | 2 hours |
| Low | Thanos sidecar for long-term metrics | 3-4 hours |
| Low | Kubernetes migration with Helm charts | When moving to K8s |