Future Improvements
This chapter covers what to add next — in priority order — to make the observability stack production-ready, secure, and easier to operate.
1. Alerting with Grafana Alerting (High Priority)
Set up alerts so the system notifies you before users report problems.
How Grafana Alerting Works
Grafana has a built-in alerting engine. You define alert rules on any dashboard panel query; when the condition holds (e.g., error rate above 5% for 5 minutes), Grafana fires an alert to a contact point such as Slack or email.
Set Up a Slack Notification Channel
- Create a Slack Incoming Webhook at api.slack.com/apps
- In Grafana: Alerting > Contact Points > Add Contact Point
- Choose Slack
- Paste your Webhook URL
- Save
Create Your First Alert Rule
- Open any Grafana panel with a PromQL query
- Click the panel menu (three dots) > Edit
- Click the Alert tab
- Click Create alert rule from this panel
- Set the condition:
- Query: your PromQL expression
- Condition: IS ABOVE 80 (for CPU %)
- For: 5m (must be above threshold for 5 minutes to avoid flapping)
- Set notification: select your Slack contact point
- Save
Recommended Alert Rules to Create
| Alert | Query | Threshold | Wait |
|---|---|---|---|
| High CPU | 100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 | > 85% | 5m |
| Low Memory | node_memory_MemAvailable_bytes / 1024 / 1024 / 1024 | < 1 GB | 2m |
| High Disk | 100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100) | > 85% | 1m |
| High Error Rate | rate(healthtune_http_request_duration_seconds_count{status=~"5.."}[5m]) | > 0.1 req/s | 2m |
| Service Down | up{job="healthtune-api"} | == 0 | 1m |
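The UI steps above work well for a single Grafana instance. If you later adopt Alertmanager (next section), the same conditions can live in a Prometheus rules file instead. A minimal sketch for the Service Down alert (the file path and group name are illustrative):

```yaml
# rules/healthtune.yml -- illustrative path; wire it up via rule_files in prometheus.yml
groups:
  - name: healthtune-alerts
    rules:
      - alert: ServiceDown
        expr: up{job="healthtune-api"} == 0
        for: 1m  # must hold for 1 minute before firing
        labels:
          severity: critical
        annotations:
          description: 'healthtune-api target has been down for 1 minute'
```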
2. Prometheus Alertmanager (Advanced Alert Routing)
For more control over alert routing, use Alertmanager as a standalone service.
Alertmanager lets you:
- Route different alerts to different channels (critical -> PagerDuty, warnings -> Slack)
- Silence alerts during maintenance windows
- Group related alerts to avoid alert floods
Add Alertmanager to docker-compose.yaml:
```yaml
alertmanager:
  image: prom/alertmanager:latest
  container_name: alertmanager
  restart: unless-stopped
  ports:
    - "9093:9093"
  volumes:
    - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
  command:
    - --config.file=/etc/alertmanager/alertmanager.yml
  networks:
    - observability
```
Sample alertmanager.yml:
```yaml
route:
  group_by: [alertname, app]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: slack-team

receivers:
  - name: slack-team
    slack_configs:
      - api_url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
        channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
```
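Alertmanager only routes and groups alerts; Prometheus still evaluates the rules and needs to know where to send them. Assuming the compose service name above, point prometheus.yml at it (the rule_files path is an assumption, adjust to your layout):

```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]  # compose service name from above

rule_files:
  - /etc/prometheus/rules/*.yml  # wherever you mount your alerting rules
```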
3. HTTPS / TLS for Grafana (Security)
Grafana currently runs on plain HTTP. For any team or public-facing URL, add HTTPS.
Option A - Nginx Reverse Proxy + Let's Encrypt (Recommended)
Install Nginx and Certbot:
```bash
sudo apt install -y nginx certbot python3-certbot-nginx
```
Create Nginx config for Grafana:
```bash
sudo nano /etc/nginx/sites-available/grafana
```

```nginx
server {
    listen 80;
    server_name grafana.yourcompany.com;

    location / {
        proxy_pass http://localhost:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
```
Enable and get SSL certificate:
```bash
sudo ln -s /etc/nginx/sites-available/grafana /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl reload nginx
sudo certbot --nginx -d grafana.yourcompany.com
```
Let's Encrypt certificates are valid for 90 days; Certbot installs a renewal timer that renews them automatically before they expire.
Option B - Cloudflare Proxy (Simplest for GCP)
- Point your domain DNS to your VM's public IP
- Enable Cloudflare orange-cloud (proxy mode)
- Cloudflare handles SSL automatically — no cert setup needed
- Your Grafana runs HTTP internally but users see HTTPS
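With either option, it helps to tell Grafana its public URL so generated links and redirects use the HTTPS address. A minimal sketch for docker-compose.yaml (the domain is a placeholder):

```yaml
grafana:
  environment:
    - GF_SERVER_ROOT_URL=https://grafana.yourcompany.com  # placeholder domain
```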
4. Data Retention Policies
Without retention limits, Prometheus TSDB and Loki storage grow forever and fill your disk.
Prometheus Retention
Already set to 30 days in the install. Adjust in docker-compose.yaml:
```yaml
command:
  - --storage.tsdb.retention.time=15d   # How long to keep metrics data
  - --storage.tsdb.retention.size=10GB  # Or limit by size
```
Recommended retention by environment:
- Development VM: 7d or 10d
- Production: 30d to 90d
- Long-term: Use Thanos or Grafana Mimir to extend to years
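To double-check which retention flags your Prometheus is actually running with, you can query its flags endpoint (assuming the default port mapping):

```bash
# Pretty-print runtime flags and pick out the retention settings
curl -s http://localhost:9090/api/v1/status/flags | python3 -m json.tool | grep retention
```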
Loki Retention
In loki-config.yaml:
```yaml
limits_config:
  retention_period: 168h  # 7 days (168 hours)

compactor:
  working_directory: /loki/compactor
  shared_store: filesystem
  compaction_interval: 10m
  retention_enabled: true  # Must enable this for retention to work
  retention_delete_delay: 2h
```
Tempo Retention
In tempo-config.yaml:
```yaml
compactor:
  compaction:
    block_retention: 72h  # Keep traces for 3 days
```
Increase for production where you need to debug incidents from last week:
```yaml
    block_retention: 336h  # 14 days
```
5. Kubernetes Integration (When Ready to Scale)
When your apps move to Kubernetes, the Grafana stack has first-class Helm support.
kube-prometheus-stack (Recommended)
This single Helm chart installs Prometheus + Grafana + Alertmanager + Node Exporter + all the Kubernetes-specific dashboards automatically.
```bash
# Add the Grafana and Prometheus community Helm repos
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install the full Prometheus stack
helm install kube-prom prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=admin123

# Install Loki
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set promtail.enabled=true \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=10Gi

# Install Tempo
helm install tempo grafana/tempo \
  --namespace monitoring
```
On Kubernetes, Promtail is deployed as a DaemonSet — one pod per node — and automatically collects logs from all pods via Docker/containerd log files.
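A quick sanity check after the installs (the Grafana service name follows the chart's <release>-grafana convention, so kube-prom-grafana here is an assumption based on the release name used above):

```bash
# All monitoring pods should reach Running within a few minutes
kubectl get pods -n monitoring

# Reach Grafana locally without exposing it
kubectl port-forward -n monitoring svc/kube-prom-grafana 3000:80
```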
6. Long-Term Metrics Storage with Thanos or Mimir
Prometheus stores data locally. The default Docker Compose setup retains 30 days. For longer retention and high-availability, use Thanos or Grafana Mimir.
Thanos (Open Source)
Thanos adds a sidecar to Prometheus that uploads TSDB blocks to object storage (GCS, S3, Azure Blob) continuously. You get:
- Years of metrics retention
- Query across multiple Prometheus instances
- Durable copies of your metrics: blocks already uploaded survive even if the Prometheus container or its volume is lost
Basic Thanos sidecar config:
```yaml
thanos-sidecar:
  image: quay.io/thanos/thanos:latest
  command:
    - sidecar
    - --tsdb.path=/prometheus
    - --prometheus.url=http://prometheus:9090
    - --objstore.config-file=/etc/thanos/bucket.yaml
  volumes:
    - prometheus_data:/prometheus:ro
    - ./thanos/bucket.yaml:/etc/thanos/bucket.yaml:ro
```
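The sidecar reads its object storage target from bucket.yaml. A minimal sketch for GCS (the bucket name is a placeholder; S3 and Azure Blob use the same file with a different type and config keys):

```yaml
type: GCS
config:
  bucket: my-thanos-metrics  # placeholder bucket name
```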
7. Log Parsing and Structured Logs
Currently Promtail ships raw log lines. Add log parsing so you can filter by structured fields.
If your app logs JSON:
{"level":"error","message":"DB timeout","duration":5432,"traceId":"abc123"}
Add a pipeline stage to promtail-config.yaml:
```yaml
scrape_configs:
  - job_name: healthtune
    pipeline_stages:
      - json:
          expressions:
            level: level
            message: message
            duration: duration
            traceId: traceId
      - labels:
          level:  # Promote level to a Loki label for fast filtering
    static_configs:
      - targets: [localhost]
        labels:
          app: healthtune_api
          __path__: /var/log/pm2/healthtune-dev-api-out.log
```
Now you can filter in Grafana by level=error or search by traceId directly in Loki.
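For example, two LogQL queries you could run in Grafana's Explore view, assuming the app label and JSON fields above: the first is fast because level was promoted to a label; the second parses JSON at query time to match on traceId, which is not a label.

```logql
{app="healthtune_api", level="error"}
{app="healthtune_api"} | json | traceId = "abc123"
```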
8. Backup Strategy
Grafana Dashboards
Export dashboards as JSON and commit to Git:
- Dashboard Settings > JSON Model > Download JSON
- Save to your Git repository at monitoring/dashboards/
- Document the Grafana.com dashboard IDs you used in the README
Or use Grafana's provisioning to auto-load dashboards from JSON files — they get restored automatically if you recreate the container.
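A minimal provider file for that approach, e.g. grafana/provisioning/dashboards/dashboards.yaml (the JSON folder path is an assumption; mount your exported dashboards there):

```yaml
apiVersion: 1
providers:
  - name: default
    type: file
    options:
      path: /var/lib/grafana/dashboards  # assumed mount point for your dashboard JSON files
```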
Prometheus Data
```bash
# Requires Prometheus to be started with --web.enable-admin-api
# Snapshot Prometheus TSDB (creates a consistent backup point)
curl -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot
# Result shows the snapshot name:
# {"status":"success","data":{"name":"20260325T100000Z-abc123"}}

# The snapshot lands in your prometheus_data volume under /prometheus/snapshots/.
# Capture the name (note: this POST creates a fresh snapshot) and copy it out:
SNAP=$(curl -s -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot | python3 -c "import sys,json; print(json.load(sys.stdin)['data']['name'])")
docker cp prometheus:/prometheus/snapshots/$SNAP ./prometheus-backup-$SNAP
```
Loki Data
```bash
# Back up Loki chunks (stop Loki first for consistency)
docker compose stop loki
docker cp loki:/loki ./loki-backup-$(date +%Y%m%d)
docker compose start loki
```
9. Grafana Provisioning (Config as Code)
Instead of clicking through the UI every time, define data sources and dashboards in YAML files. This means rebuilding the stack from scratch is one command.
Data sources provisioning file: ~/observability/grafana/provisioning/datasources/datasources.yaml
```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
    access: proxy

  - name: Loki
    type: loki
    uid: loki  # explicit uid so tracesToLogsV2 below can reference it
    url: http://loki:3100
    access: proxy

  - name: Tempo
    type: tempo
    url: http://tempo:3200
    access: proxy
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        customQuery: true
        query: '{app="${__span.tags.service.name}"}'
```
Mount it in the Grafana container:
```yaml
grafana:
  volumes:
    - grafana_data:/var/lib/grafana
    - ./grafana/provisioning:/etc/grafana/provisioning:ro
```
Now data sources are auto-configured on every fresh start — no manual clicking needed.
Suggested Roadmap
| Priority | Task | Estimated Effort |
|---|---|---|
| High | Slack alerting for CPU / errors / service down | 45 minutes |
| High | Set retention: Prometheus 15d, Loki 7d, Tempo 3d | 15 minutes |
| High | Export all dashboards to Git JSON | 30 minutes |
| Medium | Add HTTPS via Cloudflare or Nginx + Certbot | 1-2 hours |
| Medium | Structured JSON logging + Promtail pipeline | 1 hour |
| Medium | Grafana provisioning (config as code) | 1-2 hours |
| Low | Alertmanager for advanced routing | 2 hours |
| Low | Thanos sidecar for long-term metrics | 3-4 hours |
| Low | Kubernetes migration with Helm charts | When moving to K8s |