Production Monitoring: Prometheus + Grafana + Loki Stack Setup

The "Production down!" message should not reach you through Slack. It should come from your monitoring system. Observability stack: Metrics (Prometheus), Visualization (Grafana), Logs (Loki).

3 Pillars of Observability

  1. Metrics: Numerical data (CPU, memory, request rate)
  2. Logs: Event records (application logs, error traces)
  3. Traces: Distributed request tracking (microservices)

Stack Setup (Kubernetes)

# Install kube-prometheus-stack (Prometheus Operator, Prometheus, Alertmanager, Grafana)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace

# Install Loki; Grafana and Prometheus already come from kube-prometheus-stack
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set grafana.enabled=false \
  --set prometheus.enabled=false
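
Because the Loki install above disables its bundled Grafana, the Grafana instance that ships with kube-prometheus-stack still needs a Loki data source. A minimal sketch of Helm values for that, assuming the release names used above (so Loki should be reachable as the `loki` service on port 3100 in the `monitoring` namespace). The file name `values-grafana-loki.yaml` is hypothetical.

```yaml
# values-grafana-loki.yaml (hypothetical file name)
# Extra values for the kube-prometheus-stack chart.
grafana:
  additionalDataSources:
    - name: Loki
      type: loki
      access: proxy
      # Assumption: the Loki release is named "loki" and runs in "monitoring"
      url: http://loki.monitoring.svc.cluster.local:3100
```

It could then be applied with `helm upgrade prometheus prometheus-community/kube-prometheus-stack --namespace monitoring -f values-grafana-loki.yaml`. If the service name differs in your cluster, `kubectl get svc -n monitoring` shows the actual one.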

Critical Metrics to Track

Infrastructure Metrics

  • Node CPU/Memory utilization
  • Disk I/O, network bandwidth
  • Pod restart counts
  • PVC usage
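
The infrastructure metrics above can be turned into alerts along these lines. This is a sketch, not a tuned rule set: the thresholds, the alert names, and the file name are illustrative assumptions, and kube-prometheus-stack already ships default rules that cover similar conditions. The metric names are the ones exposed by node-exporter, kube-state-metrics, and the kubelet.

```yaml
# infrastructure-alerts.yaml (hypothetical file name)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: infrastructure-alerts
  namespace: monitoring
  labels:
    release: prometheus  # assumption: Helm release is named "prometheus"
spec:
  groups:
    - name: infrastructure
      rules:
        # CPU utilization per node = 1 - idle fraction
        - alert: NodeHighCpu
          expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "CPU on {{ $labels.instance }} is above 90%"
        # Less than 10% of memory still available
        - alert: NodeLowMemory
          expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Available memory on {{ $labels.instance }} is below 10%"
        # PVC usage as a fraction of capacity
        - alert: PvcAlmostFull
          expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.85
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "PVC {{ $labels.persistentvolumeclaim }} is more than 85% full"
```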

Application Metrics

  • Request rate (RPS)
  • Error rate (%)
  • Response time (p50, p95, p99)
  • Active connections
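
Application metrics only show up in Prometheus if the application exposes them and Prometheus is told to scrape them. With the Prometheus Operator this is usually done with a ServiceMonitor. The sketch below assumes a Service labeled `app: myapp` in the `default` namespace with a port named `http-metrics` serving `/metrics`; all of those names are hypothetical.

```yaml
# myapp-servicemonitor.yaml (hypothetical file name)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: monitoring
  labels:
    release: prometheus  # assumption: Helm release is named "prometheus"
spec:
  namespaceSelector:
    matchNames:
      - default          # assumption: namespace where the app's Service lives
  selector:
    matchLabels:
      app: myapp         # assumption: label on the app's Service
  endpoints:
    - port: http-metrics # assumption: name of the Service port exposing metrics
      path: /metrics
      interval: 30s
```

Once scraped, request rate could be queried as `sum(rate(http_requests_total[5m]))`, and p95 latency as `histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))`, assuming the application exports a histogram under that (hypothetical) name.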

Business Metrics

  • User registrations/hour
  • Transactions/minute
  • Revenue/day

Alerting Rules

# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
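  name: application-alerts
  namespace: monitoring
  labels:
    # Assumption: matches the Helm release name "prometheus" used above, so the
    # chart's default rule selector picks this rule up
    release: prometheus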
spec:
  groups:
  - name: application
    interval: 30s
    rules:
    - alert: HighErrorRate
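      # Aggregate both sides with sum(); without it the division only matches
      # series with identical labels and never yields the overall error ratio
      expr: |
        sum(rate(http_requests_total{status=~"5.."}[5m]))
        / sum(rate(http_requests_total[5m])) > 0.05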
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High error rate detected"
        description: "Error rate is {{ $value | humanizePercentage }}"
    
    - alert: PodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} is crash looping"

Grafana Dashboards

Must-have dashboards:

  • Kubernetes cluster overview
  • Node metrics
  • Pod resource usage
  • Application performance (RED: Rate, Errors, Duration)
  • Database metrics

Log Aggregation with Loki

# LogQL query examples
# All logs from app
{app="myapp"}

# Error logs only
{app="myapp"} |= "ERROR"

# HTTP 500 errors
{app="myapp"} | json | status="500"

# Request duration > 1s (assumes a numeric "duration" field in milliseconds)
{app="myapp"} | json | duration > 1000
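
LogQL queries like the ones above can also drive alerts through Loki's ruler, which uses a Prometheus-style rule group format. The sketch below assumes the ruler is enabled and configured to load rule files (check your Loki configuration, as this may not be on by default). The alert name and threshold are illustrative.

```yaml
# Loki ruler rule group (sketch)
groups:
  - name: myapp-logs
    rules:
      - alert: HighErrorLogRate
        # More than 1 log line per second containing "ERROR", averaged over 5 minutes
        expr: |
          sum(rate({app="myapp"} |= "ERROR" [5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "myapp is logging more than 1 ERROR line per second"
```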

Alert Channels

  • Slack: Team notifications
  • PagerDuty: On-call escalation
  • Email: Non-critical alerts
  • Webhook: Custom integrations
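
A sketch of how these channels might map onto an Alertmanager configuration, routing by the `severity` label used in the alert rules. Every URL, key, address, and receiver name here is a placeholder, not a working value.

```yaml
# alertmanager.yaml (sketch; all endpoints and keys are placeholders)
global:
  smtp_smarthost: "smtp.example.com:587"
  smtp_from: "alerts@example.com"
route:
  receiver: email-default            # fallback for non-critical alerts
  group_by: ["alertname", "namespace"]
  routes:
    - matchers:
        - 'severity = "critical"'
      receiver: pagerduty-oncall
      continue: true                 # also evaluate the Slack route below
    - matchers:
        - 'severity =~ "critical|warning"'
      receiver: slack-team
receivers:
  - name: email-default
    email_configs:
      - to: "team@example.com"
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "<pagerduty-integration-key>"
  - name: slack-team
    slack_configs:
      - api_url: "https://hooks.slack.com/services/<placeholder>"
        channel: "#alerts"
        send_resolved: true
  # Defined for custom integrations; not referenced by any route in this sketch
  - name: custom-webhook
    webhook_configs:
      - url: "http://alert-handler.example.com/hook"
```

With kube-prometheus-stack, a configuration like this would typically be supplied under the chart's `alertmanager.config` value rather than as a standalone file.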

Best Practices

  1. Avoid Alert Fatigue: Send only actionable alerts
  2. SLO-based Alerting: Focus on business impact
  3. Runbooks: A troubleshooting guide for every alert
  4. Dashboard Organization: Role-based (dev, ops, business)
  5. Retention Policy: Metrics 15 days, logs 7 days
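
Practices 2 and 3 can be combined in a single rule. The sketch below assumes a 99.9% availability SLO (an allowed error ratio of 0.001) and the `http_requests_total` metric used earlier. It fires when the error budget burns at 14.4 times the sustainable rate over both a 1h and a 5m window, a threshold commonly cited from the Google SRE Workbook. The SLO target, alert name, and runbook URL are assumptions for illustration.

```yaml
# slo-alerts.yaml (hypothetical file name)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-alerts
  namespace: monitoring
  labels:
    release: prometheus  # assumption: Helm release is named "prometheus"
spec:
  groups:
    - name: slo
      rules:
        - alert: ErrorBudgetFastBurn
          # Both windows must exceed 14.4 x 0.001: the long window shows the burn
          # is significant, the short window shows it is still happening
          expr: |
            (
              sum(rate(http_requests_total{status=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h])) > 14.4 * 0.001
            )
            and
            (
              sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 14.4 * 0.001
            )
          labels:
            severity: critical
          annotations:
            summary: "Error budget for the 99.9% SLO is burning fast"
            # Placeholder URL: link each alert to its troubleshooting guide
            runbook_url: "https://runbooks.example.com/error-budget-fast-burn"
```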

Conclusion

Proactive monitoring minimizes downtime. Devups offers a managed monitoring service.