Troubleshooting
This chapter covers the most common issues with the Grafana + Prometheus + Loki + Tempo stack, with diagnostic commands and a fix for each problem.
Quick Diagnostic - Run This First
# 1. All containers running?
docker compose ps
# 2. Ports listening?
sudo ss -ltnp | grep -E ':(3000|9090|3100|3200|4317|4318|9100)\b'
# 3. Health checks
curl http://localhost:9090/-/healthy # Prometheus
curl http://localhost:3100/ready # Loki
curl http://localhost:3200/ready # Tempo
curl -I http://localhost:3000 # Grafana
# 4. Apps running?
pm2 list
# 5. Recent errors from all containers
docker compose logs --tail=30 2>&1 | grep -i error
Problem: Grafana Not Opening in Browser
Symptom: You go to http://YOUR_SERVER_IP:3000 and get nothing.
Check 1 - Is Grafana running at all?
curl -I http://localhost:3000
- Returns 200 OK or 302 Found (a redirect to the login page) -> Grafana is running. Problem is the firewall.
- Returns connection refused -> Grafana container is down.
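The two outcomes above can be told apart in a script from the HTTP status code alone. The following is a minimal sketch, assuming curl is installed; the function name interpret_grafana_code is hypothetical, and redirects are treated as healthy on the assumption that Grafana redirects the root URL to its login page.

```shell
# Hypothetical helper: turn the status code from Check 1 into a next step.
# curl -w '%{http_code}' prints 000 when it cannot connect at all.
interpret_grafana_code() {
  case "$1" in
    200|301|302) echo "running: check firewall rules" ;;
    000)         echo "unreachable: check the grafana container" ;;
    *)           echo "unexpected status $1: check grafana logs" ;;
  esac
}

# Usage on the server (not run here):
#   code=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:3000)
#   interpret_grafana_code "$code"
```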
Check 2 - Container down? Restart it
docker compose ps grafana
docker compose logs grafana --tail=50
docker compose restart grafana
Check 3 - GCP Firewall Missing
The most common cause on cloud VMs. The service is fine locally but blocked externally.
GCP Console fix:
- VPC Network > Firewall > Create Firewall Rule
- Name: allow-grafana
- Direction: Ingress, Action: Allow
- Source IP: your team IP range (0.0.0.0/0 also works but exposes Grafana to the whole internet)
- Protocols and ports: TCP 3000
- Save and wait 30 seconds
AWS Security Group fix:
- EC2 > Security Groups > your instance group
- Inbound Rules > Add Rule
- Custom TCP, Port 3000, Source: your IP
Check 4 - UFW Local Firewall
sudo ufw status
sudo ufw allow 3000/tcp
sudo ufw reload
Check 5 - SSH Tunnel to Isolate Network vs App
Run this on your LAPTOP:
ssh -L 3000:localhost:3000 username@YOUR_SERVER_IP
Then open http://localhost:3000 on your laptop. If it works -> Grafana is fine and a firewall (cloud or UFW) is blocking port 3000. If not -> app issue.
Problem: No Metrics in Prometheus
Symptom: Prometheus UI is open but queries return no data, or targets show DOWN.
Check 1 - Open Prometheus Targets Page
In Prometheus UI: Status > Targets
Each scrape target shows its state:
- UP with green = working
- DOWN with red = problem
Click the target URL to see the specific error.
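The same UP/DOWN information is available from the Prometheus HTTP API at /api/v1/targets, which is handy over SSH. The following is a rough sketch that counts targets by health; the function name count_target_health is hypothetical, and it assumes the API returns compact JSON with no space after the colon. A real JSON parser such as jq is the safer choice for anything beyond a quick look.

```shell
# Hypothetical helper: count targets by health from /api/v1/targets JSON on stdin.
# Assumes compact JSON of the form "health":"up"; prints one "state count" pair per line.
count_target_health() {
  grep -o '"health":"[a-z]*"' | cut -d'"' -f4 | sort | uniq -c | awk '{print $2, $1}'
}

# Usage on the server (not run here):
#   curl -s http://localhost:9090/api/v1/targets | count_target_health
```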
Check 2 - Target Shows DOWN - Diagnose
# Test if the target is reachable from inside the Prometheus container
docker exec prometheus wget -qO- http://node-exporter:9100/metrics | head -5
# Test host-based targets
docker exec prometheus wget -qO- http://host.docker.internal:3001/metrics | head -5
If wget fails: the endpoint is unreachable. Check if the app is running and on the right port.
Check 3 - host.docker.internal Not Resolving (Linux Only)
On Linux, host.docker.internal requires extra_hosts in docker-compose.yaml. Verify this is in your Prometheus service block:
extra_hosts:
  - "host.docker.internal:host-gateway"
If missing, add it and restart:
docker compose up -d prometheus
Check 4 - Wrong Port in prometheus.yml
# Check what port your app is actually on
pm2 list
curl http://localhost:3001/metrics # Try different ports
curl http://localhost:3000/metrics
Update prometheus.yml with the correct port:
- job_name: healthtune-api
  static_configs:
    - targets: ["host.docker.internal:CORRECT_PORT_HERE"]
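Apply the change without restarting (this only works if Prometheus was started with --web.enable-lifecycle; otherwise run docker compose restart prometheus):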
curl -X POST http://localhost:9090/-/reload
Check 5 - App Does Not Expose /metrics
If your Node.js app does not have prom-client installed and a /metrics route, Prometheus cannot scrape it. Either:
Option A: Add prom-client (see Application Integration chapter)
Option B: Remove the app's job_name block from prometheus.yml so Prometheus stops trying
Problem: No Logs Appearing in Loki
Symptom: Grafana Explore with Loki shows no results, or labels dropdown is empty.
Check 1 - Is Promtail Running and Shipping?
docker compose logs promtail --tail=50
Look for lines like:
level=info msg="Tailing new file" path=/var/log/pm2/healthtune-dev-api-out.log
If you see permission denied: see Check 3 below. If you see no tailing lines: the file path in promtail-config.yaml is wrong.
Check 2 - Verify Log File Paths
# Check actual PM2 log paths on your system
ls -la ~/.pm2/logs/
Then compare with what is in promtail-config.yaml:
__path__: /var/log/pm2/healthtune-dev-api-out.log
The container sees /var/log/pm2 but the host has ~/.pm2/logs. The volume mount in docker-compose.yaml connects them:
volumes:
  - /home/wenawa/.pm2/logs:/var/log/pm2:ro
Make sure /home/wenawa matches your actual home directory. If it does not, replace it and recreate the container:
sed -i "s|/home/wenawa|/home/$USER|g" ~/observability/docker-compose.yaml
docker compose up -d promtail
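Because the mount rewrites the directory prefix, it is easy to put the host path into promtail-config.yaml by mistake. The following is a small sketch of the mapping; the function name host_to_container_path is hypothetical, and the two directories are simply the left and right sides of the volume entry.

```shell
# Hypothetical helper: translate a host log path into the path Promtail sees,
# given the host side and the container side of the volume mount.
host_to_container_path() {
  host_dir="${1%/}"
  container_dir="${2%/}"
  file="$3"
  case "$file" in
    "$host_dir"/*) echo "$container_dir/${file#"$host_dir"/}" ;;
    *) echo "not under $host_dir" >&2; return 1 ;;
  esac
}

# Usage:
#   host_to_container_path "$HOME/.pm2/logs" /var/log/pm2 "$HOME/.pm2/logs/healthtune-dev-api-out.log"
```

The printed path is the value that belongs in __path__.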
Check 3 - Permission Denied on Log Files
If the Promtail container runs as a non-root user, it cannot read PM2 logs that are readable only by your user.
Fix - make logs world-readable:
chmod o+r ~/.pm2/logs/*.log
chmod o+x ~/.pm2/logs/
# For future log files (set default permissions):
echo "umask 022" >> ~/.bashrc
source ~/.bashrc
pm2 restart all
Or use ACL:
sudo apt install -y acl
sudo setfacl -Rm u:65534:rX ~/.pm2/logs/ # rX = read files, traverse directories; 65534 = nobody, adjust if Promtail runs as a different UID
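To confirm the chmod fix took effect, or to spot new log files created with tighter permissions, the files that are still unreadable to other users can be listed. The following is a minimal sketch; the function name unreadable_logs is hypothetical, and it only inspects the classic permission bits, so it will not reflect access granted through ACLs.

```shell
# Hypothetical helper: list *.log files in a directory that "other" users cannot read.
# Prints nothing when every log file has the other-read bit set.
unreadable_logs() {
  find "${1:-$HOME/.pm2/logs}" -maxdepth 1 -type f -name '*.log' ! -perm -o=r
}

# Usage:
#   unreadable_logs ~/.pm2/logs
```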
Check 4 - Loki Not Ready
curl http://localhost:3100/ready
If not ready, check Loki logs:
docker compose logs loki --tail=50
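Common Loki startup error: loki-config.yaml has a syntax error. Validate it (docker exec only works while the container is running, and the path must match where your compose file mounts the config):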
docker exec loki loki -config.file=/etc/loki/local-config.yaml -verify-config 2>&1
Check 5 - Query Uses the Wrong Label
In Grafana Explore with Loki, double-check your labels:
- Use the Label filters UI to browse available labels before typing manually
- Label values are case-sensitive: healthtune_api not HealthTune_API
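Labels can also be checked from the command line through the Loki HTTP API, which rules out a Grafana-side problem. The following is a small sketch that builds a stream selector for such a query; the function name logql_selector is hypothetical, and job is an assumed label name, so substitute whatever your promtail-config.yaml actually sets.

```shell
# Hypothetical helper: build a LogQL stream selector from a label name and value.
logql_selector() {
  printf '{%s="%s"}\n' "$1" "$2"
}

# Usage on the server (not run here; "job" is an assumed label name):
#   curl -s http://localhost:3100/loki/api/v1/labels             # list label names
#   curl -s http://localhost:3100/loki/api/v1/label/job/values   # list values of one label
#   curl -s -G http://localhost:3100/loki/api/v1/query_range \
#     --data-urlencode "query=$(logql_selector job healthtune_api)" \
#     --data-urlencode 'limit=5'
```

An empty labels list means nothing has been ingested yet, which points back to Promtail rather than to the query.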
Problem: No Traces in Tempo
Symptom: Grafana Explore with Tempo returns no results, or Service Name dropdown is empty.
Check 1 - Is tracing.js Loading?
pm2 logs healthtune_dev_api --lines 30 --nostream
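Look for a startup message from tracing.js (the exact text depends on what your tracing.js logs), for example: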
@opentelemetry/sdk-node starting up
If not seen: tracing.js is not being required. Verify the PM2 start command includes -r:
pm2 info healthtune_dev_api | grep -E "script path|script args|interpreter args|exec cwd"
Delete and restart with correct command:
pm2 delete healthtune_dev_api
OTEL_SERVICE_NAME=healthtune_api pm2 start "node -r /home/wenawa/healthtune_api/tracing.js dist/main.js" --name healthtune_dev_api --cwd /home/wenawa/healthtune_api
Check 2 - Is OTEL Collector Receiving Data?
docker compose logs otel-collector --tail=30
Look for: batch processor flushed or exported spans.
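If the collector logs show connection refused, the collector cannot reach its export target (usually Tempo). If the application logs show connection refused, the app cannot reach port 4317. Check that the port is listening: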
sudo ss -ltnp | grep ':4317'
Check 3 - Wrong OTEL Endpoint in tracing.js
If your observability stack is on a DIFFERENT server than your apps, localhost:4317 will not work:
// WRONG - if observability is on a different server
url: 'http://localhost:4317'
// CORRECT - use the observability server IP
url: 'http://34.31.206.197:4317'
Check 4 - Generate Actual Traffic
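Traces only appear when your app handles real HTTP requests. Replace 3000 with your app's actual port (when the app and Grafana share a host, port 3000 is Grafana):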
curl http://localhost:3000/
curl http://localhost:3000/api/users
curl http://localhost:3000/api/health
Then wait 30-60 seconds and check Grafana > Explore > Tempo > Search.
Check 5 - Port Conflict Between Tempo and OTEL Collector
If both Tempo and otel-collector are trying to bind 4317 on the host, one will fail:
docker compose ps | grep -E "tempo|otel"
docker compose logs otel-collector | grep -iE "address already in use|port is already allocated"
Fix: Remove the port mappings from Tempo's service block in docker-compose.yaml so only otel-collector exposes 4317 to the host. Tempo still listens on 4317 internally (container-to-container via Docker network).
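Before editing the compose file, it can help to confirm that something really is bound to the host port. The following is a minimal sketch that reads ss output; the function name port_is_listening is hypothetical, and it assumes the usual ss -ltn column layout with the state in column 1 and the local address in column 4.

```shell
# Hypothetical helper: read `ss -ltn` output on stdin and succeed (exit 0)
# only if some socket is listening on the given TCP port.
port_is_listening() {
  awk -v port="$1" '$1 == "LISTEN" && $4 ~ (":" port "$") { found = 1 } END { exit !found }'
}

# Usage on the server (not run here):
#   ss -ltn | port_is_listening 4317 && echo "4317 is already bound on the host"
```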
Problem: Grafana Data Source Test Fails
Symptom: When testing Prometheus/Loki/Tempo data source, it shows a red error.
Prometheus Data Source Error
Error: "Get http://prometheus:9090/api/v1/query: dial tcp: no such host"
Cause: Grafana is not on the same Docker network as Prometheus.
Fix: Make sure all services are in the same Docker network in docker-compose.yaml:
networks:
- observability
Also make sure the network is defined at the top level of the file:
networks:
  observability:
    driver: bridge
Restart all services:
docker compose down
docker compose up -d
Loki Data Source Error
Error: "Failed to call resource" or "invalid character"
Often caused by a loki-config.yaml syntax error. Validate it (requires the PyYAML package for python3):
python3 -c "import yaml; yaml.safe_load(open('/home/$(whoami)/observability/loki/loki-config.yaml'))" && echo "YAML OK"
Tempo Data Source Error
Error: "Failed to get services"
Check Tempo is running and healthy:
curl http://localhost:3200/ready
docker compose logs tempo --tail=30
Problem: High Memory / CPU Usage
Symptom: Server slowing down after starting the stack.
Check what is using resources:
docker stats --no-stream
free -h
Reduce Prometheus memory:
# In docker-compose.yaml, under the prometheus service's command list, add:
  - --query.max-samples=5000000
  - --storage.tsdb.retention.time=15d # Reduce from 30d (mainly saves disk)
Reduce Loki memory:
# In loki-config.yaml:
limits_config:
  ingestion_rate_mb: 4 # Reduce from 16
  ingestion_burst_size_mb: 8 # Reduce from 32
  retention_period: 72h # Reduce from 168h
Restart affected services:
docker compose up -d prometheus loki
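To see which service to tune first, the docker stats output can be filtered down to the containers above a memory threshold. The following is a rough sketch; the function name over_mem_threshold and the 25 percent threshold are illustrative, and the input format is whatever the --format string in the usage line produces.

```shell
# Hypothetical helper: read "name percent%" lines on stdin and print the names
# of containers whose memory percentage is above the given threshold.
over_mem_threshold() {
  awk -v limit="$1" '{ pct = $2; sub(/%/, "", pct); if (pct + 0 > limit + 0) print $1 }'
}

# Usage on the server (not run here):
#   docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' | over_mem_threshold 25
```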
Full Diagnostic Script
Save as observability-check.sh and run any time:
#!/bin/bash
echo "=== Container Status ==="
docker compose -f ~/observability/docker-compose.yaml ps
echo ""
echo "=== Port Listeners ==="
sudo ss -ltnp | grep -E ':(3000|9090|3100|3200|4317|4318|9100)\b' || echo "No observability ports listening"
echo ""
echo "=== Health Checks ==="
# -f makes curl exit non-zero on HTTP errors such as 503, so "not ready" is reported as FAIL
curl -sf http://localhost:9090/-/healthy && echo " - Prometheus OK" || echo " - Prometheus FAIL"
curl -sf http://localhost:3100/ready && echo " - Loki OK" || echo " - Loki FAIL"
curl -sf http://localhost:3200/ready && echo " - Tempo OK" || echo " - Tempo FAIL"
curl -s -o /dev/null -w "%{http_code}" http://localhost:3000 | grep -q "200\|302" && echo " - Grafana OK" || echo " - Grafana FAIL"
echo ""
echo "=== PM2 Apps ==="
pm2 list 2>/dev/null || echo "PM2 not available"
echo ""
echo "=== Memory ==="
free -h
echo ""
echo "=== Disk ==="
df -h / | tail -1
Make the script executable and run it:
chmod +x observability-check.sh
./observability-check.sh