Future Improvements

This chapter covers what to add next — in priority order — to make the observability stack production-ready, secure, and easier to operate.


1. Alerting with Grafana Alerting (High Priority)

Set up alerts so the system notifies you before users report problems.

How Grafana Alerting Works

Grafana has a built-in alerting engine. You define alert rules on any dashboard panel query, and when the condition is met (e.g., error rate > 5% for 5 minutes), it fires an alert to a notification channel.

Set Up a Slack Notification Channel

  1. Create a Slack Incoming Webhook at api.slack.com/apps
  2. In Grafana: Alerting > Contact Points > Add Contact Point
  3. Choose Slack
  4. Paste your Webhook URL
  5. Save

Create Your First Alert Rule

  1. Open any Grafana panel with a PromQL query
  2. Click the panel menu (three dots) > Edit
  3. Click the Alert tab
  4. Click Create alert rule from this panel
  5. Set the condition:
    • Query: your PromQL expression
    • Condition: IS ABOVE 80 (for CPU %)
    • For: 5m (must be above threshold for 5 minutes to avoid flapping)
  6. Set notification: select your Slack contact point
  7. Save
| Alert | Query | Threshold | Wait |
|---|---|---|---|
| High CPU | 100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 | > 85% | 5m |
| Low Memory | node_memory_MemAvailable_bytes / 1024 / 1024 / 1024 | < 1 GB | 2m |
| High Disk | 100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100) | > 85% | 1m |
| High Error Rate | rate(healthtune_http_request_duration_seconds_count{status=~"5.."}[5m]) | > 0.1 req/s | 2m |
| Service Down | up{job="healthtune-api"} | == 0 | 1m |
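If you prefer to keep alert definitions in version control rather than clicking through the UI, the same conditions can also be expressed as Prometheus alerting rules. A sketch covering two of the alerts above (the file name, group name, and severity labels are illustrative — adapt them to your setup):

```yaml
# alert-rules.yml (illustrative) - referenced via rule_files: in prometheus.yml
groups:
  - name: healthtune-basics
    rules:
      - alert: ServiceDown
        expr: up{job="healthtune-api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          description: "healthtune-api target has been down for 1 minute"
      - alert: HighCPU
        expr: 100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          description: "CPU usage above 85% for 5 minutes"
```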

2. Prometheus Alertmanager (Advanced Alert Routing)

For more control over alert routing, use Alertmanager as a standalone service.

Alertmanager lets you:

  • Route different alerts to different channels (critical -> PagerDuty, warnings -> Slack)
  • Silence alerts during maintenance windows
  • Group related alerts to avoid alert floods
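Routing decisions are driven by alert labels such as severity. A sketch of a severity-based routing tree (the receiver names are placeholders — they must match entries under receivers:):

```yaml
route:
  receiver: slack-team            # default for anything unmatched
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-oncall  # page a human
    - matchers:
        - severity = "warning"
      receiver: slack-team        # post to Slack only
```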

Add Alertmanager to docker-compose.yaml:

alertmanager:
  image: prom/alertmanager:latest
  container_name: alertmanager
  restart: unless-stopped
  ports:
    - "9093:9093"
  volumes:
    - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
  command:
    - --config.file=/etc/alertmanager/alertmanager.yml
  networks:
    - observability

Sample alertmanager.yml:

route:
  group_by: [alertname, app]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: slack-team

receivers:
  - name: slack-team
    slack_configs:
      - api_url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
        channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

3. HTTPS / TLS for Grafana (Security)

Grafana currently runs on plain HTTP. For any team or public-facing URL, add HTTPS.

Option A - Nginx Reverse Proxy + Certbot

Install Nginx and Certbot:

sudo apt install -y nginx certbot python3-certbot-nginx

Create Nginx config for Grafana:

sudo nano /etc/nginx/sites-available/grafana

server {
    listen 80;
    server_name grafana.yourcompany.com;

    location / {
        proxy_pass http://localhost:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

Enable and get SSL certificate:

sudo ln -s /etc/nginx/sites-available/grafana /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl reload nginx
sudo certbot --nginx -d grafana.yourcompany.com

Let's Encrypt certificates are valid for 90 days; Certbot installs a timer that renews them automatically before they expire.

Option B - Cloudflare Proxy (Simplest for GCP)

  1. Point your domain DNS to your VM's public IP
  2. Enable Cloudflare orange-cloud (proxy mode)
  3. Cloudflare handles SSL automatically — no cert setup needed
  4. Your Grafana runs HTTP internally but users see HTTPS

4. Data Retention Policies

Without retention limits, Prometheus TSDB and Loki storage grow forever and fill your disk.

Prometheus Retention

Already set to 30 days in the install. Adjust in docker-compose.yaml:

command:
  - --storage.tsdb.retention.time=15d   # How long to keep metrics data
  - --storage.tsdb.retention.size=10GB  # Or limit by size

Recommended retention by environment:

  • Development VM: 7d or 10d
  • Production: 30d to 90d
  • Long-term: Use Thanos or Grafana Mimir to extend to years
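To sanity-check a retention setting against your available disk, you can estimate TSDB size from your ingestion rate. A rough back-of-the-envelope sketch (the 1.7 bytes/sample figure is the commonly cited Prometheus average after compression — your real ratio will vary):

```python
def estimate_tsdb_bytes(samples_per_second: float, retention_days: int,
                        bytes_per_sample: float = 1.7) -> float:
    """Rough Prometheus disk estimate: ingestion rate x retention x bytes/sample."""
    return samples_per_second * retention_days * 86_400 * bytes_per_sample

# e.g. ~1,000 active series scraped every 15s ~= 66.7 samples/s
gib = estimate_tsdb_bytes(1000 / 15, 30) / 2**30
print(f"~{gib:.2f} GiB for 30 days of metrics")
```

For a small VM-sized workload like this, retention is rarely the disk bottleneck — logs usually are.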

Loki Retention

In loki-config.yaml:

limits_config:
  retention_period: 168h  # 7 days (168 hours)

compactor:
  working_directory: /loki/compactor
  shared_store: filesystem
  compaction_interval: 10m
  retention_enabled: true  # Must enable this for retention to work
  retention_delete_delay: 2h
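Loki can also apply shorter retention to specific streams via retention_stream. A sketch that keeps debug logs for only one day while everything else follows the global 168h (this assumes your streams carry a level label, as set up in the log-parsing section below... er, as set up via pipeline stages):

```yaml
limits_config:
  retention_period: 168h
  retention_stream:
    - selector: '{level="debug"}'
      priority: 1
      period: 24h
```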

Tempo Retention

In tempo-config.yaml:

compactor:
  compaction:
    block_retention: 72h  # Keep traces for 3 days

Increase for production where you need to debug incidents from last week:

block_retention: 336h # 14 days

5. Kubernetes Integration (When Ready to Scale)

When your apps move to Kubernetes, the Grafana stack has first-class Helm support.

This single Helm chart installs Prometheus + Grafana + Alertmanager + Node Exporter + all the Kubernetes-specific dashboards automatically.

# Add Grafana Helm repo
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install full Prometheus stack
helm install kube-prom prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=admin123

# Install Loki
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set promtail.enabled=true \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=10Gi

# Install Tempo
helm install tempo grafana/tempo \
  --namespace monitoring

On Kubernetes, Promtail is deployed as a DaemonSet — one pod per node — and automatically collects logs from all pods via Docker/containerd log files.


6. Long-Term Metrics Storage with Thanos or Mimir

Prometheus stores data locally. The default Docker Compose setup retains 30 days. For longer retention and high-availability, use Thanos or Grafana Mimir.

Thanos (Open Source)

Thanos adds a sidecar to Prometheus that uploads TSDB blocks to object storage (GCS, S3, Azure Blob) continuously. You get:

  • Years of metrics retention
  • Query across multiple Prometheus instances
  • No data loss if your Prometheus container restarts

Basic Thanos sidecar config:

thanos-sidecar:
  image: quay.io/thanos/thanos:latest
  command:
    - sidecar
    - --tsdb.path=/prometheus
    - --prometheus.url=http://prometheus:9090
    - --objstore.config-file=/etc/thanos/bucket.yaml
  volumes:
    - prometheus_data:/prometheus:ro
    - ./thanos/bucket.yaml:/etc/thanos/bucket.yaml:ro
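The bucket.yaml referenced above tells the sidecar where to upload TSDB blocks. A minimal GCS example (the bucket name is a placeholder; on GCP the VM's default service account credentials can be used if it has storage access):

```yaml
# thanos/bucket.yaml
type: GCS
config:
  bucket: my-metrics-bucket
```

Swap type for S3 or AZURE with the corresponding config keys if you are not on GCP.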

7. Log Parsing and Structured Logs

Currently Promtail ships raw log lines. Add log parsing so you can filter by structured fields.

If your app logs JSON:

{"level":"error","message":"DB timeout","duration":5432,"traceId":"abc123"}
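If your app does not emit JSON yet, switching is usually a one-formatter change. A minimal sketch using only the Python standard library (field names chosen to match the example line above; adapt the idea to your app's language and logging framework):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname.lower(),
                   "message": record.getMessage()}
        # Pick up optional fields passed via logging's `extra=` mechanism
        for key in ("duration", "traceId"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("healthtune")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("DB timeout", extra={"duration": 5432, "traceId": "abc123"})
# emits: {"level": "error", "message": "DB timeout", "duration": 5432, "traceId": "abc123"}
```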

Add a pipeline stage to promtail-config.yaml:

scrape_configs:
  - job_name: healthtune
    pipeline_stages:
      - json:
          expressions:
            level: level
            message: message
            duration: duration
            traceId: traceId
      - labels:
          level:  # Promote level to a Loki label for fast filtering
    static_configs:
      - targets: [localhost]
        labels:
          app: healthtune_api
          __path__: /var/log/pm2/healthtune-dev-api-out.log

Now you can filter in Grafana by level=error or search by traceId directly in Loki.
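For example, in Grafana's Explore view (label and field names as defined in the Promtail config above):

```
# All error-level lines - uses the promoted label, so it is fast
{app="healthtune_api", level="error"}

# All lines for one trace - parses JSON at query time
{app="healthtune_api"} | json | traceId="abc123"
```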


8. Backup Strategy

Grafana Dashboards

Export dashboards as JSON and commit to Git:

  1. Dashboard Settings > JSON Model > Download JSON
  2. Save to your Git repository at monitoring/dashboards/
  3. Document the Grafana.com dashboard IDs you used in README

Or use Grafana's provisioning to auto-load dashboards from JSON files — they get restored automatically if you recreate the container.
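A minimal dashboard provider file makes that work — Grafana loads every JSON dashboard it finds under the configured path at startup. A sketch (the provider name, folder, and path are illustrative), e.g. grafana/provisioning/dashboards/default.yaml:

```yaml
apiVersion: 1
providers:
  - name: default
    type: file
    folder: HealthTune   # Grafana folder the dashboards appear in
    options:
      path: /etc/grafana/provisioning/dashboards
```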

Prometheus Data

# Snapshot Prometheus TSDB (creates a consistent backup point)
# Note: requires Prometheus to be started with the --web.enable-admin-api flag
curl -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot

# Result shows snapshot name:
# {"status":"success","data":{"name":"20260325T100000Z-abc123"}}

# Snapshot is in your prometheus_data volume under /prometheus/snapshots/
# Copy it out:
SNAP=$(curl -s -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot | python3 -c "import sys,json; print(json.load(sys.stdin)['data']['name'])")
docker cp prometheus:/prometheus/snapshots/$SNAP ./prometheus-backup-$SNAP

Loki Data

# Backup Loki chunks (stop Loki first for consistency)
docker compose stop loki
docker cp loki:/loki ./loki-backup-$(date +%Y%m%d)
docker compose start loki

9. Grafana Provisioning (Config as Code)

Instead of clicking through the UI every time, define data sources and dashboards in YAML files. This means rebuilding the stack from scratch is one command.

Data sources provisioning file: ~/observability/grafana/provisioning/datasources/datasources.yaml

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
    access: proxy

  - name: Loki
    type: loki
    uid: loki  # Explicit uid so Tempo's tracesToLogsV2 below can reference it
    url: http://loki:3100
    access: proxy

  - name: Tempo
    type: tempo
    url: http://tempo:3200
    access: proxy
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        customQuery: true
        query: '{app="${__span.tags.service.name}"}'

Mount it in the Grafana container:

grafana:
  volumes:
    - grafana_data:/var/lib/grafana
    - ./grafana/provisioning:/etc/grafana/provisioning:ro

Now data sources are auto-configured on every fresh start — no manual clicking needed.


Suggested Roadmap

| Priority | Task | Estimated Effort |
|---|---|---|
| High | Slack alerting for CPU / errors / service down | 45 minutes |
| High | Set retention: Prometheus 15d, Loki 7d, Tempo 3d | 15 minutes |
| High | Export all dashboards to Git JSON | 30 minutes |
| Medium | Add HTTPS via Cloudflare or Nginx + Certbot | 1-2 hours |
| Medium | Structured JSON logging + Promtail pipeline | 1 hour |
| Medium | Grafana provisioning (config as code) | 1-2 hours |
| Low | Alertmanager for advanced routing | 2 hours |
| Low | Thanos sidecar for long-term metrics | 3-4 hours |
| Low | Kubernetes migration with Helm charts | When moving to K8s |