Production Monitoring: Setting Up a Prometheus + Grafana + Loki Stack
The "Production down!" message should not come from someone on Slack. It should come from your monitoring system. The observability stack covered here: metrics (Prometheus), visualization (Grafana), and logs (Loki).
3 Pillars of Observability
- Metrics: Numeric data (CPU, memory, request rate)
- Logs: Event records (application logs, error traces)
- Traces: Distributed request tracking (microservices)
Stack Installation (Kubernetes)
# Prometheus Operator install (kube-prometheus-stack also ships Grafana)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace
# Loki install (Grafana and Prometheus are disabled here because
# the kube-prometheus-stack release above already provides them)
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set grafana.enabled=false \
  --set prometheus.enabled=false
Critical Metrics to Track
Infrastructure Metrics
- Node CPU/Memory utilization
- Disk I/O, network bandwidth
- Pod restart counts
- PVC usage
Application Metrics
- Request rate (RPS)
- Error rate (%)
- Response time (p50, p95, p99)
- Active connections
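As a rough illustration of what the application metrics above measure, the sketch below computes request rate, error rate, and latency percentiles from raw request samples in a single time window. The `RequestSample` type and `summarize` function are hypothetical names invented for this example; in a real setup Prometheus would derive these values from counters and histograms rather than from raw samples.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class RequestSample:
    status: int          # HTTP status code
    duration_ms: float   # response time in milliseconds


def summarize(samples, window_seconds):
    """Compute request rate, error rate, and latency percentiles for one window."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute percentiles")
    errors = sum(1 for s in samples if s.status >= 500)
    durations = [s.duration_ms for s in samples]
    # quantiles(n=100) returns 99 cut points; index i-1 holds the i-th percentile
    cuts = quantiles(durations, n=100, method="inclusive")
    return {
        "rps": len(samples) / window_seconds,
        "error_rate": errors / len(samples),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }


if __name__ == "__main__":
    # 101 synthetic requests over a 10 second window; the first five fail
    demo = [RequestSample(500 if i <= 5 else 200, float(i)) for i in range(1, 102)]
    print(summarize(demo, window_seconds=10))
```

The percentiles matter because an average hides tail latency: a healthy p50 can coexist with a p99 that is many times slower.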
Business Metrics
- User registrations/hour
- Transactions/minute
- Revenue/day
Alerting Rules
# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: application-alerts
  namespace: monitoring
  labels:
    # kube-prometheus-stack selects rules by its Helm release label by default
    release: prometheus
spec:
  groups:
    - name: application
      interval: 30s
      rules:
        - alert: HighErrorRate
          # Both sides are aggregated with sum(); without it each 5xx series
          # is divided by itself and the ratio is always 1
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate detected"
            description: "Error rate is {{ $value | humanizePercentage }}"
        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"
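The `for: 5m` clause in these rules means an alert does not fire the moment its expression becomes true. It first goes to pending and fires only if the condition has held continuously for the full duration. The sketch below simulates that behaviour for a series of error ratios evaluated every 30 seconds. The function name `alert_states` is hypothetical, and this is a simplified model of the rule lifecycle, not Prometheus code.

```python
def alert_states(ratios, threshold=0.05, interval_s=30, for_s=300):
    """Return the alert state after each evaluation of an error ratio."""
    states = []
    active_since = None  # time at which the condition first became true
    for i, ratio in enumerate(ratios):
        now = i * interval_s
        if ratio > threshold:
            if active_since is None:
                active_since = now
            held_for = now - active_since
            states.append("firing" if held_for >= for_s else "pending")
        else:
            # Any evaluation below the threshold resets the timer
            active_since = None
            states.append("inactive")
    return states


if __name__ == "__main__":
    # One dip below the threshold restarts the 5 minute wait
    series = [0.10] * 6 + [0.01] + [0.10] * 11
    for step, state in enumerate(alert_states(series)):
        print(f"t={step * 30:>3}s ratio={series[step]:.2f} -> {state}")
```

This is why a short error spike does not page anyone: the condition has to stay true across every evaluation in the window.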
Grafana Dashboards
Must-have dashboards:
- Kubernetes cluster overview
- Node metrics
- Pod resource usage
- Application performance (RED: Rate, Errors, Duration)
- Database metrics
Log Aggregation with Loki
# LogQL query examples
# All logs from app
{app="myapp"}
# Error logs only
{app="myapp"} |= "ERROR"
# HTTP 500 errors
{app="myapp"} | json | status="500"
# Request duration > 1s (assumes the duration field is logged in milliseconds)
{app="myapp"} | json | duration > 1000
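To make the query stages concrete, the sketch below mimics the last three queries on a handful of in-memory log lines: a substring filter, then JSON parsing followed by a field comparison. All function names are hypothetical, and the code only approximates LogQL semantics (for example, nested JSON is not flattened here). It is not how Loki is implemented.

```python
import json


def parse_json_line(line):
    """Return the line's JSON fields as strings, or None if it is not a JSON object."""
    try:
        fields = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(fields, dict):
        return None
    # Extracted fields are treated as strings, like labels
    return {key: str(value) for key, value in fields.items()}


def errors_only(lines):
    # Rough equivalent of: {app="myapp"} |= "ERROR"
    return [line for line in lines if "ERROR" in line]


def status_500(lines):
    # Rough equivalent of: {app="myapp"} | json | status="500"
    return [
        line for line in lines
        if (fields := parse_json_line(line)) and fields.get("status") == "500"
    ]


def slower_than(lines, threshold=1000.0):
    # Rough equivalent of: {app="myapp"} | json | duration > 1000
    slow = []
    for line in lines:
        fields = parse_json_line(line) or {}
        try:
            if float(fields["duration"]) > threshold:
                slow.append(line)
        except (KeyError, ValueError):
            continue  # missing or non-numeric duration never matches
    return slow
```

Lines that are not valid JSON still match the plain substring filter, but they can never match a filter on a parsed field.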
Alert Channels
- Slack: Team notifications
- PagerDuty: On-call escalation
- Email: Non-critical alerts
- Webhook: Custom integrations
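One way to tie these channels to the `severity` labels used in the alerting rules is a simple routing table. The mapping below is an assumption made for illustration (critical alerts page the on-call engineer and notify the team, warnings go to Slack, everything else goes to email). In practice this routing would live in the Alertmanager configuration rather than in application code.

```python
# Hypothetical severity-to-channel mapping; adjust to your own escalation policy
ROUTES = {
    "critical": ["pagerduty", "slack"],
    "warning": ["slack"],
}
DEFAULT_ROUTE = ["email"]  # non-critical or unlabeled alerts


def channels_for(alert_labels):
    """Pick notification channels from an alert's labels."""
    return ROUTES.get(alert_labels.get("severity"), DEFAULT_ROUTE)


if __name__ == "__main__":
    print(channels_for({"alertname": "HighErrorRate", "severity": "critical"}))
    print(channels_for({"alertname": "PodCrashLooping", "severity": "warning"}))
    print(channels_for({"alertname": "DiskUsageInfo", "severity": "info"}))
```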
Best Practices
- Avoid Alert Fatigue: Send only actionable alerts
- SLO-based Alerting: Focus on business impact
- Runbooks: A troubleshooting guide for every alert
- Dashboard Organization: Role-based (dev, ops, business)
- Retention Policy: Metrics 15 days, logs 7 days
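SLO-based alerting is often expressed as an error budget burn rate: how fast the current error rate consumes the failures the SLO allows. The sketch below works through that arithmetic. The 99.9% target and the 30 day window are assumptions chosen for the example, not values taken from this article, and the function names are hypothetical.

```python
def burn_rate(error_rate, slo_target):
    """How many times faster than allowed the error budget is being spent."""
    budget = 1.0 - slo_target  # fraction of requests allowed to fail
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / budget


def hours_to_exhaustion(rate, window_days=30):
    """Hours until the whole budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


if __name__ == "__main__":
    # A sustained 5% error rate against an assumed 99.9% availability SLO
    rate = burn_rate(error_rate=0.05, slo_target=0.999)
    print(f"burn rate: {rate:.1f}x")
    print(f"budget exhausted in about {hours_to_exhaustion(rate):.1f} hours")
```

Alerting on burn rate instead of a fixed error threshold ties the urgency of a page to how much of the budget is actually at risk.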
Conclusion
Proactive monitoring minimizes downtime. Devups offers a managed monitoring service.