Troubleshooting

This chapter covers the most common issues with the Grafana + Prometheus + Loki + Tempo stack, with diagnostic commands and a fix for each problem.


Quick Diagnostic - Run This First

# 1. All containers running?
docker compose ps

# 2. Ports listening?
sudo ss -ltnp | grep -E ':(3000|9090|3100|3200|4317|4318|9100)\b'

# 3. Health checks
curl http://localhost:9090/-/healthy # Prometheus
curl http://localhost:3100/ready # Loki
curl http://localhost:3200/ready # Tempo
curl -I http://localhost:3000 # Grafana

# 4. Apps running?
pm2 list

# 5. Recent errors from all containers
docker compose logs --tail=30 2>&1 | grep -i error

Problem: Grafana Not Opening in Browser

Symptom: You go to http://YOUR_SERVER_IP:3000 and the page does not load.

Check 1 - Is Grafana running at all?

curl -I http://localhost:3000
  • Returns 200 OK or 302 Found (redirect to the login page) -> Grafana is running. The problem is most likely a firewall.
  • Returns connection refused -> Grafana container is down.
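
The two outcomes above can be folded into a small helper for scripts. This is a minimal sketch, not part of the original stack: the function name classify_http_code is hypothetical, and it assumes that curl prints 000 when no connection could be made and that Grafana answers an unauthenticated request with either 200 or a 302 redirect to its login page.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map an HTTP status code (as printed by
# curl -w '%{http_code}') to a short diagnosis.
classify_http_code() {
  case "$1" in
    200|302) echo "running" ;;        # Grafana answered: look at firewalls next
    000)     echo "unreachable" ;;    # curl could not connect: container likely down
    *)       echo "unexpected:$1" ;;  # something answered, but not with a normal status
  esac
}

# Usage (needs a reachable host, so it is left commented out):
# code=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:3000)
# classify_http_code "$code"
```

An "unexpected" result usually means something other than Grafana is listening on the port, which is worth checking before touching any firewall.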

Check 2 - Container down? Restart it

docker compose ps grafana
docker compose logs grafana --tail=50
docker compose restart grafana

Check 3 - Cloud Firewall Rule Missing

The most common cause on cloud VMs. The service is fine locally but blocked externally.

GCP Console fix:

  1. VPC Network > Firewall > Create Firewall Rule
  2. Name: allow-grafana
  3. Direction: Ingress, Action: Allow
  4. Source IP: your team IP range (0.0.0.0/0 also works, but it exposes Grafana to the whole internet)
  5. Protocols and ports: TCP 3000
  6. Save and wait 30 seconds

AWS Security Group fix:

  1. EC2 > Security Groups > your instance group
  2. Inbound Rules > Add Rule
  3. Custom TCP, Port 3000, Source: your IP

Check 4 - UFW Local Firewall

sudo ufw status
sudo ufw allow 3000/tcp
sudo ufw reload

Check 5 - SSH Tunnel to Isolate Network vs App

Run this on your LAPTOP:

ssh -L 3000:localhost:3000 username@YOUR_SERVER_IP

Then open http://localhost:3000 in the browser on your laptop. If it works -> Grafana is fine and a firewall (cloud rule or UFW) is blocking port 3000. If not -> the problem is in the app itself.


Problem: No Metrics in Prometheus

Symptom: Prometheus UI is open but queries return no data, or targets show DOWN.

Check 1 - Open Prometheus Targets Page

In Prometheus UI: Status > Targets

Each scrape target shows its state:

  • UP with green = working
  • DOWN with red = problem

Click the target URL to see the specific error.
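
The same target states are also available from the Prometheus HTTP API at /api/v1/targets, which helps on a server without a browser. A rough sketch, assuming the compact JSON that Prometheus normally emits (no spaces after colons); the function name summarize_target_health is made up for illustration, and jq would be more robust where it is installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read the JSON body of /api/v1/targets on stdin and
# print one "state=count" line per health state (for example up=3).
summarize_target_health() {
  grep -o '"health":"[a-z]*"' \
    | cut -d'"' -f4 \
    | sort \
    | uniq -c \
    | awk '{ print $2 "=" $1 }'
}

# Usage (needs a running Prometheus, so it is left commented out):
# curl -s http://localhost:9090/api/v1/targets | summarize_target_health
```

Any line starting with down= means at least one target needs the checks that follow.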

Check 2 - Target Shows DOWN - Diagnose

# Test if the target is reachable from inside the Prometheus container
docker exec prometheus wget -qO- http://node-exporter:9100/metrics | head -5

# Test host-based targets
docker exec prometheus wget -qO- http://host.docker.internal:3001/metrics | head -5

If wget fails: the endpoint is unreachable. Check if the app is running and on the right port.

Check 3 - host.docker.internal Not Resolving (Linux Only)

On Linux, host.docker.internal requires extra_hosts in docker-compose.yaml. Verify this is in your Prometheus service block:

extra_hosts:
  - "host.docker.internal:host-gateway"

If missing, add it and restart:

docker compose up -d prometheus

Check 4 - Wrong Port in prometheus.yml

# Check what port your app is actually on
pm2 list
curl http://localhost:3001/metrics # Try each candidate port
# On the observability server, port 3000 is Grafana, so a response there is not your app

Update prometheus.yml with the correct port:

- job_name: healthtune-api
  static_configs:
    - targets: ["host.docker.internal:CORRECT_PORT_HERE"]

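Apply the change without restarting (this works only if Prometheus was started with --web.enable-lifecycle; otherwise run docker compose restart prometheus):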

curl -X POST http://localhost:9090/-/reload

Check 5 - App Does Not Expose /metrics

If your Node.js app does not have prom-client installed and a /metrics route, Prometheus cannot scrape it. Either:

Option A: Add prom-client (see the Application Integration chapter).

Option B: Remove the app's job_name block from prometheus.yml so Prometheus stops trying to scrape it.


Problem: No Logs Appearing in Loki

Symptom: Grafana Explore with Loki shows no results, or labels dropdown is empty.

Check 1 - Is Promtail Running and Shipping?

docker compose logs promtail --tail=50

Look for lines like:

level=info msg="Tailing new file" path=/var/log/pm2/healthtune-dev-api-out.log

If you see permission denied, go to Check 3 below. If there are no tailing lines at all, the file path in promtail-config.yaml is probably wrong; go to Check 2.

Check 2 - Verify Log File Paths

# Check actual PM2 log paths on your system
ls -la ~/.pm2/logs/

Then compare with what is in promtail-config.yaml:

__path__: /var/log/pm2/healthtune-dev-api-out.log

The container sees /var/log/pm2 but the host has ~/.pm2/logs. The volume mount in docker-compose.yaml connects them:

volumes:
  - /home/wenawa/.pm2/logs:/var/log/pm2:ro

Make sure /home/wenawa matches your actual home directory. If it does not, fix it:

sed -i "s|/home/wenawa|/home/$USER|g" ~/observability/docker-compose.yaml
docker compose up -d promtail
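
The host-to-container mapping described above is just a prefix swap, which the sketch below spells out. The function name host_to_container_path is hypothetical; the three arguments are the host side of the bind mount, the container side, and a log file path on the host.

```shell
#!/usr/bin/env bash
# Hypothetical helper: translate a host path into the path the container sees,
# given the two sides of a bind mount (HOST_DIR:CONTAINER_DIR).
host_to_container_path() {
  local host_dir="${1%/}" container_dir="${2%/}" path="$3"
  case "$path" in
    "$host_dir"/*)
      # Strip the host prefix and put the container prefix in its place
      echo "${container_dir}${path#"$host_dir"}"
      ;;
    *)
      echo "not under $host_dir: $path" >&2
      return 1
      ;;
  esac
}

# Usage:
# host_to_container_path "$HOME/.pm2/logs" /var/log/pm2 "$HOME/.pm2/logs/healthtune-dev-api-out.log"
```

The printed path is the value that __path__ in promtail-config.yaml has to match.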

Check 3 - Permission Denied on Log Files

If the Promtail container runs as a non-root user, it cannot read PM2 log files that only your user is allowed to read.

Fix - make logs world-readable:

chmod o+r ~/.pm2/logs/*.log
chmod o+rx ~/.pm2/logs/

# For future log files (set default permissions):
echo "umask 022" >> ~/.bashrc
source ~/.bashrc
pm2 restart all

Or use an ACL:

sudo apt install -y acl
sudo setfacl -Rm u:65534:rX ~/.pm2/logs/ # 65534 = nobody; use the UID your Promtail container runs as. Capital X adds execute on directories only
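
To see which files still need the world-readable fix, a quick listing can help. This is a sketch under the assumption that plain mode bits are in use; it does not take ACL entries into account, and the function name list_unreadable_logs is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every *.log file directly inside DIR that lacks
# the "other" read bit (octal 004), i.e. files a different UID cannot read
# through mode bits alone.
list_unreadable_logs() {
  find "$1" -maxdepth 1 -type f -name '*.log' ! -perm -004 | sort
}

# Usage:
# list_unreadable_logs ~/.pm2/logs
```

Empty output means every log file in the directory already has the read bit set for other users.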

Check 4 - Loki Not Ready

curl http://localhost:3100/ready

If not ready, check Loki logs:

docker compose logs loki --tail=50

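Common Loki startup error: loki-config.yaml has a syntax error. Validate it (adjust the path if your config is mounted elsewhere inside the container):

docker exec loki loki -config.file=/etc/loki/local-config.yaml -verify-config 2>&1

# docker exec only works while the container is running. If Loki keeps crashing, try a one-off container instead:
docker compose run --rm loki -config.file=/etc/loki/local-config.yaml -verify-config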
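
Loki can report not ready for a short while after it starts, so a single failed /ready call right after docker compose up may be a false alarm. A minimal retry sketch follows; the function name retry and the attempt and delay values in the usage line are illustrative assumptions, not settings from this stack.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command up to ATTEMPTS times, sleeping DELAY
# seconds between failures. Returns 0 on the first success, 1 otherwise.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    # Only sleep if another attempt is still to come
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Usage (needs a running Loki, so it is left commented out):
# retry 10 3 curl -sf -o /dev/null http://localhost:3100/ready && echo "Loki ready"
```

The -f flag matters in the usage line: without it curl exits 0 even when the endpoint answers with an HTTP error status.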

Check 5 - Query Uses the Wrong Label

In Grafana Explore with Loki, double-check your labels:

  • Use the Label filters UI to browse available labels before typing manually
  • Label values are case-sensitive: healthtune_api not HealthTune_API
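
A case mismatch is easy to miss by eye. The sketch below takes a list of label values, one per line, and prints those that match the value you typed only when case is ignored. The function name find_case_mismatches is hypothetical; the commented usage line additionally assumes a label called job and that jq is installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read label values on stdin (one per line) and print
# those that equal WANTED when case is ignored but differ in exact spelling.
find_case_mismatches() {
  grep -ixF -- "$1" | grep -vxF -- "$1"
}

# Usage (needs a running Loki and jq, so it is left commented out):
# curl -s http://localhost:3100/loki/api/v1/label/job/values \
#   | jq -r '.data[]' \
#   | find_case_mismatches healthtune_api
```

Any value printed is the spelling Loki actually stores, which is the one to use in the query.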

Problem: No Traces in Tempo

Symptom: Grafana Explore with Tempo returns no results, or Service Name dropdown is empty.

Check 1 - Is tracing.js Loading?

pm2 logs healthtune_dev_api --lines 30 --nostream

Look for a tracing startup line such as:

@opentelemetry/sdk-node starting up

If not seen: tracing.js is not being required. Verify the PM2 start command includes -r:

pm2 info healthtune_dev_api | grep -iE "script|args"

Delete and restart with correct command:

pm2 delete healthtune_dev_api
OTEL_SERVICE_NAME=healthtune_api pm2 start "node -r /home/wenawa/healthtune_api/tracing.js dist/main.js" --name healthtune_dev_api --cwd /home/wenawa/healthtune_api

Check 2 - Is OTEL Collector Receiving Data?

docker compose logs otel-collector --tail=30

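Look for lines such as batch processor flushed or exported spans. Depending on the collector's exporters and log level there may be no per-batch output at all, so silence here is not proof of a failure.

If the app logs (pm2 logs) show connection refused, the app cannot reach port 4317. If the collector logs show it, the collector cannot reach Tempo. Check that the port is listening: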

sudo ss -ltnp | grep ':4317'

Check 3 - Wrong OTEL Endpoint in tracing.js

If your observability stack is on a DIFFERENT server than your apps, localhost:4317 will not work:

// WRONG - if observability is on a different server
url: 'http://localhost:4317'

// CORRECT - use the observability server IP
url: 'http://34.31.206.197:4317'

Check 4 - Generate Actual Traffic

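Traces only appear when your app handles real HTTP requests. Send a few to the app's own port (3001 in this guide's examples; port 3000 is Grafana when both run on the same server):

curl http://localhost:3001/
curl http://localhost:3001/api/users
curl http://localhost:3001/api/health

Then wait 30-60 seconds and check Grafana > Explore > Tempo > Search.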

Check 5 - Port Conflict Between Tempo and OTEL Collector

If both Tempo and otel-collector are trying to bind 4317 on the host, one will fail:

docker compose ps | grep -E "tempo|otel"
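docker compose logs otel-collector | grep -iE "address already in use|port is already allocated"
# If the container never started, the same message appears in the output of docker compose up -d instead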

Fix: Remove the port mappings from Tempo's service block in docker-compose.yaml so only otel-collector exposes 4317 to the host. Tempo still listens on 4317 internally (container-to-container via Docker network).


Problem: Grafana Data Source Test Fails

Symptom: Testing the Prometheus, Loki, or Tempo data source in Grafana shows a red error.

Prometheus Data Source Error

Error: "Get http://prometheus:9090/api/v1/query: dial tcp: no such host"

Cause: Grafana is not on the same Docker network as Prometheus.

Fix: Make sure every service lists the same Docker network in its service block in docker-compose.yaml:

networks:
  - observability

And the network is defined at the top level of the file:

networks:
  observability:
    driver: bridge

Restart all services:

docker compose down
docker compose up -d

Loki Data Source Error

Error: "Failed to call resource" or "invalid character"

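Often caused by a loki-config.yaml syntax error. Check that the file parses as YAML (this needs the PyYAML package and checks syntax only, not Loki-specific settings):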

python3 -c "import yaml; yaml.safe_load(open('/home/$(whoami)/observability/loki/loki-config.yaml'))" && echo "YAML OK"

Tempo Data Source Error

Error: "Failed to get services"

Check Tempo is running and healthy:

curl http://localhost:3200/ready
docker compose logs tempo --tail=30

Problem: High Memory / CPU Usage

Symptom: Server slowing down after starting the stack.

Check what is using resources:

docker stats --no-stream
free -h
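
To pick out the heavy containers from the stats output, a filter like the one below can be piped after docker stats. It is a sketch only: the function name flag_high_memory and the threshold of 30 percent are arbitrary choices for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "NAME PERCENT%" lines on stdin and print the
# containers whose memory share exceeds THRESHOLD (a plain number, in percent).
flag_high_memory() {
  awk -v limit="$1" '{
    pct = $2
    sub(/%$/, "", pct)                     # strip the trailing percent sign
    if (pct + 0 > limit + 0) print $1, $2  # compare numerically
  }'
}

# Usage (needs Docker, so it is left commented out):
# docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' | flag_high_memory 30
```

The containers it prints are the ones worth tuning first with the settings that follow.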

Reduce Prometheus memory and disk usage:

# In docker-compose.yaml prometheus command section, add:
  - --query.max-samples=5000000 # Cap samples loaded per query (default is 50000000)
  - --storage.tsdb.retention.time=15d # Reduce from 30d; this mainly frees disk

Reduce Loki memory:

# In loki-config.yaml:
limits_config:
  ingestion_rate_mb: 4 # Reduce from 16
  ingestion_burst_size_mb: 8 # Reduce from 32
  retention_period: 72h # Reduce from 168h

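Apply the changes. Prometheus is recreated because its compose definition changed; Loki needs an explicit restart because only its mounted config file changed:

docker compose up -d prometheus
docker compose restart loki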

Full Diagnostic Script

Save as observability-check.sh and run any time:

#!/bin/bash
echo "=== Container Status ==="
docker compose -f ~/observability/docker-compose.yaml ps

echo ""
echo "=== Port Listeners ==="
sudo ss -ltnp | grep -E ':(3000|9090|3100|3200|4317|4318|9100)\b' || echo "No observability ports listening"

echo ""
echo "=== Health Checks ==="
# -f makes curl fail on HTTP error codes such as 503, so a service that is up but not ready is reported as FAIL
curl -sf http://localhost:9090/-/healthy && echo " - Prometheus OK" || echo " - Prometheus FAIL"
curl -sf http://localhost:3100/ready && echo " - Loki OK" || echo " - Loki FAIL"
curl -sf http://localhost:3200/ready && echo " - Tempo OK" || echo " - Tempo FAIL"
curl -s -o /dev/null -w "%{http_code}" http://localhost:3000 | grep -q "200\|302" && echo " - Grafana OK" || echo " - Grafana FAIL"

echo ""
echo "=== PM2 Apps ==="
pm2 list 2>/dev/null || echo "PM2 not available"

echo ""
echo "=== Memory ==="
free -h

echo ""
echo "=== Disk ==="
df -h / | tail -1

Make the script executable and run it:

chmod +x observability-check.sh
./observability-check.sh