Prometheus
Discover Prometheus monitoring infrastructure, metrics, alerts, and scrape targets, with full integration into your DevOps ecosystem.
- Prometheus Instances: Server information and configuration
- Scrape Targets: All monitored services with health status
- Active Alerts: Firing and pending alerts with full context
- Rules: Alerting and recording rules with PromQL queries
- Automatic K8s Linking: Connects targets to pods, alerts to deployments
Available MCP tools:

- `prom_query_metrics` - Execute PromQL queries
- `prom_query_range` - Query metrics over time ranges
- `prom_get_active_alerts` - List active alerts with filtering
- `prom_get_targets` - Get scrape target health status
- `prom_get_rules` - List alerting and recording rules
- `prom_get_metric_metadata` - Get metric documentation
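
These tools correspond to the standard Prometheus HTTP API. As a rough sketch of what a call like `prom_query_metrics` presumably involves under the hood (an assumption on our part, not documented behavior), here is a minimal Python example against the real `/api/v1/query` endpoint; the URL and basic-auth credentials mirror the example configuration below.

```python
import requests

# Assumption: URL and basic-auth credentials mirror the example discovery config below.
PROM_URL = "https://prometheus.example.com"

def query_metrics(promql: str) -> list:
    """Run an instant PromQL query via the Prometheus HTTP API (/api/v1/query)."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": promql},
        auth=("admin", "SECRET"),  # example basic auth, as in the config
        timeout=10,
    )
    resp.raise_for_status()
    body = resp.json()
    if body["status"] != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]

# Example: 5xx error rate for the API service
for series in query_metrics('rate(http_requests_total{service="api",status=~"5.."}[5m])'):
    print(series["metric"], series["value"])
```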
Discovery configuration:

```yaml
discovery:
  enabled: true
  settings:
    url: "https://prometheus.example.com"
    auth:
      type: "basic"  # or "bearer", "none"
      username: "admin"
      password: "SECRET"
    include_targets: true
    include_alerts: true
    include_rules: true
    # For K8s integration
    kubernetes_cluster: "production"
```
MCP configuration:

```yaml
mcp:
  enabled: true
  settings:
    url: "https://prometheus.example.com"
    auth:
      type: "basic"
      username: "admin"
      password: "SECRET"
    namespace: "default"
```
To discover multiple Prometheus instances:

```yaml
discovery:
  enabled: true
  settings:
    instances:
      - name: "production"
        url: "https://prom-prod.example.com"
        bearer_token: "SECRET"
      - name: "staging"
        url: "https://prom-staging.example.com"
        username: "admin"
        password: "SECRET"
    kubernetes_cluster: "production"
    namespace: "default"
```
Example discovered resources:

```yaml
kind: PrometheusInstance
metadata:
  name: "production"
spec:
  name: "production"
  url: "https://prometheus.example.com"
```
```yaml
kind: PrometheusTarget
metadata:
  name: "production/kubernetes-pods/abc123"
  labels:
    job: "kubernetes-pods"
    pod: "api-backend-xyz"
    namespace: "production"
    health: "up"
spec:
  instance: "production"
  job: "kubernetes-pods"
  scrape_url: "http://10.244.0.5:8080/metrics"
  health: "up"
  last_scrape: "2024-02-10T16:00:00Z"
  scrape_duration: 0.05
```
```yaml
kind: PrometheusAlert
metadata:
  name: "production/HighErrorRate/abc123"
  labels:
    alert_name: "HighErrorRate"
    state: "firing"
    severity: "critical"
spec:
  instance: "production"
  alert_name: "HighErrorRate"
  state: "firing"
  value: "15.2"
  active_at: "2024-02-10T16:00:00Z"
  labels:
    alertname: "HighErrorRate"
    service: "api"
    namespace: "production"
    severity: "critical"
  annotations:
    summary: "High error rate detected"
    description: "Error rate is 15.2% (threshold: 5%)"
```
```yaml
kind: PrometheusRule
metadata:
  name: "production/alerting/HighErrorRate"
  labels:
    type: "alerting"
    severity: "critical"
spec:
  instance: "production"
  name: "HighErrorRate"
  type: "alerting"
  query: "rate(http_requests_total{status=~\"5..\"}[5m]) > 0.05"
  health: "ok"
  labels:
    severity: "critical"
  annotations:
    summary: "High error rate detected"
```
| Relationship | From | To | How Detected |
|---|---|---|---|
| BELONGS_TO | Target/Alert/Rule | PrometheusInstance | Parent-child |
| SCRAPES | PrometheusTarget | K8sPod | Match pod label |
| SCRAPES | PrometheusTarget | K8sService | Match service label |
| MONITORS | PrometheusAlert | K8sDeployment | Parse deployment label |
| EVALUATES | PrometheusRule | K8sPod | Parse PromQL query |
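
The label matching behind SCRAPES is straightforward. Below is a purely illustrative Python sketch, assuming targets and pods are available as dicts shaped like the resource examples above; `find_scrapes_links` and the pod structure are hypothetical, not the molecule's actual implementation.

```python
# Hypothetical sketch: `targets` are PrometheusTarget resources shaped like the example
# above (metadata.labels.pod / metadata.labels.namespace); `pods` are assumed K8s pod
# resources with metadata.name / metadata.namespace.
def find_scrapes_links(targets: list[dict], pods: list[dict]) -> list[tuple[str, str]]:
    """Derive PrometheusTarget -> K8sPod SCRAPES edges by matching pod/namespace labels."""
    pod_index = {
        (p["metadata"]["namespace"], p["metadata"]["name"]): p["metadata"]["name"]
        for p in pods
    }
    links = []
    for target in targets:
        labels = target["metadata"].get("labels", {})
        key = (labels.get("namespace"), labels.get("pod"))
        if key in pod_index:
            links.append((target["metadata"]["name"], pod_index[key]))
    return links
```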
Run discovery once, or on a schedule:

```bash
degree discovery run prometheus@1.0.0 --config prometheus-config.yaml
degree discovery schedule prometheus@1.0.0 --config prometheus-config.yaml --interval 5m
```
"What's the error rate for the API service?"
AI: [Executes PromQL]
rate(http_requests_total{service="api",status=~"5.."}[5m])
"Current error rate: 2.3%
- 5xx errors: 23 req/min
- Total requests: 1000 req/min
- Status: ✅ Normal (threshold: 5%)"
"What alerts are firing in production?"
AI: "3 critical alerts firing:
1. HighErrorRate (API service)
- Error rate: 15.2%
- Active for: 12 minutes
- Affects: production/api deployment
2. HighMemoryUsage (worker pods)
- Memory: 92%
- Active for: 5 minutes
- Affects: 3 pods
3. PodCrashLooping (frontend)
- Restarts: 5 in 10 minutes"
"Are all pods being monitored?"
AI: [Cross-reference K8sPods with PrometheusTargets]
"Monitoring status:
✅ Production: 15/15 pods monitored
⚠️ Staging: 7/8 pods monitored
- Missing: worker-xyz-456
- Issue: No /metrics endpoint
Recommendation: Add metrics to worker service"
"Is the new deployment healthy?"
AI: "Deployment Health: api v1.2.4
Prometheus Metrics:
├─ Error rate: ✅ 0.1% (improved)
├─ Response time: ✅ 125ms (-14%)
├─ CPU: ✅ 58% (down from 65%)
└─ Memory: ✅ Stable
Alerts: ✅ No alerts firing
Conclusion: Deployment successful ✅"
```
PrometheusTarget → SCRAPES → K8sPod
PrometheusAlert → MONITORS → K8sDeployment
```
You: "Why is the api-backend pod not responding?"
AI: [Checks Prometheus targets]
"Pod api-backend-abc is not being scraped:
- Target health: down
- Last error: connection refused
- Pod status: Running but /metrics unreachable
Likely cause: Container not listening on port 8080"
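
A check like this can be reproduced directly against the real `/api/v1/targets` endpoint. A minimal Python sketch follows; the filtering and field selection are illustrative, not the molecule's internal logic, and the credentials are the example ones from the configuration above.

```python
import requests

def down_targets(prom_url: str, auth=None) -> list[dict]:
    """Return active scrape targets that Prometheus reports as down (/api/v1/targets)."""
    resp = requests.get(f"{prom_url}/api/v1/targets", auth=auth, timeout=10)
    resp.raise_for_status()
    active = resp.json()["data"]["activeTargets"]
    return [
        {"pod": t["labels"].get("pod"), "scrapeUrl": t["scrapeUrl"], "lastError": t["lastError"]}
        for t in active
        if t["health"] == "down"
    ]

for t in down_targets("https://prometheus.example.com", auth=("admin", "SECRET")):
    print(f'{t["pod"]}: {t["lastError"]} ({t["scrapeUrl"]})')
```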
```
PrometheusAlert fires
  ↓ traces to
K8sDeployment[production/api]
  ↓ deployed by
ArgoApplication[api]
  ↓ synced from
GitHubWorkflowRun[#456]
```
You: "Why is the API alerting?"
AI: "Alert root cause analysis:
Alert: HighErrorRate (firing 12 min)
├─ Deployment: v1.2.4 (deployed 15 min ago)
├─ CI Run: GitHub Actions #456
├─ Commit: abc123 by jane@example.com
└─ Change: Updated error handling logic
Metrics comparison:
- Before (v1.2.3): 0.2% errors
- After (v1.2.4): 15.2% errors
Recommendation: Rollback to v1.2.3"
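
The before/after comparison boils down to two range queries around the deployment time. A minimal Python sketch against the real `/api/v1/query_range` endpoint; the window sizes, credentials, and deployment timestamp are illustrative.

```python
import time
import requests

PROM_URL = "https://prometheus.example.com"
QUERY = 'rate(http_requests_total{service="api",status=~"5.."}[5m])'

def error_rate_series(start: float, end: float, step: str = "60s") -> list:
    """Fetch the 5xx error-rate series over [start, end] via /api/v1/query_range."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query_range",
        params={"query": QUERY, "start": start, "end": end, "step": step},
        auth=("admin", "SECRET"),  # example credentials, as in the config above
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

deploy_ts = time.time() - 15 * 60                        # deployment ~15 minutes ago (example)
before = error_rate_series(deploy_ts - 3600, deploy_ts)  # the hour before the deploy
after = error_rate_series(deploy_ts, time.time())        # everything since the deploy
```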
```
PrometheusAlert → NOTIFIES → SlackChannel
```

Alert fires → auto-notification to #oncall:

```
🚨 Critical Alert: HighErrorRate

Service: production/api
Error rate: 15.2% (threshold: 5%)
Duration: 12 minutes

Investigation:
- Deployment: v1.2.4 (15 min ago)
- GitHub Actions: #456
- Author: @jane

Actions:
- @oncall notified
- PagerDuty incident created
- Rollback recommended
```
```
Application
  ↓ exposes /metrics
K8sPod
  ↓ scraped by
PrometheusTarget
  ↓ metrics stored in
PrometheusInstance
  ↓ evaluated by
PrometheusRule
  ↓ fires
PrometheusAlert
  ↓ notifies
SlackChannel + PagerDuty
  ↓ investigated using
AI + Complete Graph
```
You: "Why is the API alerting?"
AI: Complete root cause analysis:
✓ When alert started
✓ What deployment changed
✓ Which commit caused it
✓ What metrics show
✓ Recommended action
You: "Is the API at capacity?"
AI: Capacity analysis:
✓ Current utilization (28%)
✓ Growth trends (+15% week-over-week)
✓ Projected capacity needs
✓ Scaling recommendations
✓ Cost optimization
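
The projection in an answer like this is simple compounding of the observed growth rate. A small sketch of the arithmetic, using the example numbers above (28% utilization, +15% week-over-week) and an assumed 80% scaling threshold, not any real data:

```python
def weeks_until_capacity(current_util: float, weekly_growth: float, limit: float = 0.8) -> int:
    """Weeks until utilization crosses `limit`, assuming compounding weekly growth."""
    weeks, util = 0, current_util
    while util < limit:
        util *= 1 + weekly_growth
        weeks += 1
    return weeks

# 28% utilization growing 15% week-over-week crosses an 80% threshold in ~8 weeks
print(weeks_until_capacity(0.28, 0.15))
```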
You: "Is the deployment healthy?"
AI: Post-deployment metrics:
✓ Error rate comparison
✓ Latency changes
✓ Resource usage
✓ Alert status
✓ Performance verdict
You: "What happened during the outage?"
AI: Complete incident reconstruction:
14:32 - Latency spike (pending alert)
14:35 - Degradation starts (firing alert)
14:37 - Full outage (95% errors)
14:55 - Root cause found
15:18 - Resolved
Including: metrics, correlations, actions
The molecule makes PromQL natural. Instead of writing:

```
rate(http_requests_total{job="api",status=~"5.."}[5m])
```

just ask: "What's the 5xx error rate for the API?"

Instead of:

```
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```

just ask: "What's the 95th percentile response time?"
Monitor multiple Prometheus instances:

```yaml
instances:
  - name: "prod-us"
    url: "https://prom-us.example.com"
  - name: "prod-eu"
    url: "https://prom-eu.example.com"
  - name: "staging"
    url: "https://prom-staging.example.com"
```
The AI can query across all of them:

```
You: "Show me all firing alerts across all environments"

AI: "Alerts across 3 Prometheus instances:

     prod-us: 1 critical
     - HighErrorRate in api service

     prod-eu: ✅ No alerts

     staging: 2 warnings
     - SlowQueries in database
     - HighMemory in worker"
```
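
Cross-instance queries are the same HTTP call repeated per configured instance. A minimal Python sketch using the real `/api/v1/alerts` endpoint and instance entries shaped like the multi-instance configuration above (authentication omitted for brevity):

```python
import requests

# Instance entries shaped like the multi-instance config shown earlier (auth omitted)
INSTANCES = [
    {"name": "prod-us", "url": "https://prom-us.example.com"},
    {"name": "prod-eu", "url": "https://prom-eu.example.com"},
    {"name": "staging", "url": "https://prom-staging.example.com"},
]

def firing_alerts(url: str) -> list[dict]:
    """List alerts in the 'firing' state from one Prometheus instance (/api/v1/alerts)."""
    resp = requests.get(f"{url}/api/v1/alerts", timeout=10)
    resp.raise_for_status()
    return [a for a in resp.json()["data"]["alerts"] if a["state"] == "firing"]

for inst in INSTANCES:
    names = [a["labels"]["alertname"] for a in firing_alerts(inst["url"])] or ["none"]
    print(f'{inst["name"]}: {", ".join(names)}')
```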
Ensure consistent labels for K8s integration:

```yaml
- job_name: "kubernetes-pods"
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_name]
      target_label: pod
    - source_labels: [__meta_kubernetes_namespace]
      target_label: namespace
```
Include identifying labels in alerting rules:

```yaml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(...) > 0.05
        labels:
          severity: critical
          service: api
          namespace: production
          deployment: api-backend
```
Schedule discovery to run at a regular interval:

```bash
degree discovery schedule prometheus@1.0.0 \
  --config prometheus-config.yaml \
  --interval 5m
```
```
Error: 401 Unauthorized
```

Solution: Check token or credentials:

```yaml
auth:
  type: "bearer"
  bearer_token: "YOUR_TOKEN"
```
```
Error: connection refused
```

Solution: Verify Prometheus URL and network access:

```bash
curl https://prometheus.example.com/-/healthy
```
If targets aren't linking to pods, verify labels:

```bash
curl https://prometheus.example.com/api/v1/targets | jq '.data.activeTargets[0].labels'
```
- Prometheus HTTP API: https://prometheus.io/docs/prometheus/latest/querying/api/
- PromQL: https://prometheus.io/docs/prometheus/latest/querying/basics/
MIT License - see LICENSE