Prometheus

Discover Prometheus monitoring infrastructure, metrics, alerts, and scrape targets, with full integration into your DevOps ecosystem.

Discovery provides:

  • Prometheus Instances: Server information and configuration

  • Scrape Targets: All monitored services with health status

  • Active Alerts: Firing and pending alerts with full context

  • Rules: Alerting and recording rules with PromQL queries

  • Automatic K8s Linking: Connects targets to pods, alerts to deployments

MCP tools:

  • prom_query_metrics - Execute PromQL queries

  • prom_query_range - Query metrics over time ranges

  • prom_get_active_alerts - List active alerts with filtering

  • prom_get_targets - Get scrape target health status

  • prom_get_rules - List alerting and recording rules

  • prom_get_metric_metadata - Get metric documentation
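
These tools presumably wrap Prometheus's standard HTTP API, so their output can be sanity-checked with plain curl. A minimal sketch, assuming an unauthenticated endpoint at the placeholder URL:

# Instant PromQL query, roughly what prom_query_metrics executes
curl -sG 'https://prometheus.example.com/api/v1/query' \
  --data-urlencode 'query=rate(http_requests_total{status=~"5.."}[5m])'

# Metric documentation, roughly what prom_get_metric_metadata returns
curl -s 'https://prometheus.example.com/api/v1/metadata?metric=http_requests_total'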

Discovery configuration:

discovery:
  enabled: true
  settings:
    url: "https://prometheus.example.com"
    auth:
      type: "basic"  # or "bearer", "none"
      username: "admin"
      password: "SECRET"

    include_targets: true
    include_alerts: true
    include_rules: true

    # For K8s integration
    kubernetes_cluster: "production"

MCP configuration:

mcp:
  enabled: true
  settings:
    url: "https://prometheus.example.com"
    auth:
      type: "basic"
      username: "admin"
      password: "SECRET"

namespace: "default"
Multi-instance discovery configuration:

discovery:
  enabled: true
  settings:
    instances:
      - name: "production"
        url: "https://prom-prod.example.com"
        bearer_token: "SECRET"

      - name: "staging"
        url: "https://prom-staging.example.com"
        username: "admin"
        password: "SECRET"

    kubernetes_cluster: "production"

namespace: "default"
Example discovered resources:

kind: PrometheusInstance
metadata:
  name: "production"
spec:
  name: "production"
  url: "https://prometheus.example.com"

kind: PrometheusTarget
metadata:
  name: "production/kubernetes-pods/abc123"
  labels:
    job: "kubernetes-pods"
    pod: "api-backend-xyz"
    namespace: "production"
    health: "up"
spec:
  instance: "production"
  job: "kubernetes-pods"
  scrape_url: "http://10.244.0.5:8080/metrics"
  health: "up"
  last_scrape: "2024-02-10T16:00:00Z"
  scrape_duration: 0.05
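
These fields appear to map directly from Prometheus's /api/v1/targets response (scrapeUrl, health, lastScrape, lastScrapeDuration); you can inspect the raw source data with curl and jq (URL is a placeholder):

curl -s https://prometheus.example.com/api/v1/targets \
  | jq '.data.activeTargets[] | {job: .labels.job, pod: .labels.pod, scrapeUrl, health, lastScrape, lastScrapeDuration}'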

kind: PrometheusAlert
metadata:
  name: "production/HighErrorRate/abc123"
  labels:
    alert_name: "HighErrorRate"
    state: "firing"
    severity: "critical"
spec:
  instance: "production"
  alert_name: "HighErrorRate"
  state: "firing"
  value: "15.2"
  active_at: "2024-02-10T16:00:00Z"
  labels:
    alertname: "HighErrorRate"
    service: "api"
    namespace: "production"
    severity: "critical"
  annotations:
    summary: "High error rate detected"
    description: "Error rate is 15.2% (threshold: 5%)"

kind: PrometheusRule
metadata:
  name: "production/alerting/HighErrorRate"
  labels:
    type: "alerting"
    severity: "critical"
spec:
  instance: "production"
  name: "HighErrorRate"
  type: "alerting"
  query: "rate(http_requests_total{status=~\"5..\"}[5m]) > 0.05"
  health: "ok"
  labels:
    severity: "critical"
  annotations:
    summary: "High error rate detected"

Discovered resources are linked with the following relationships:

Relationship  From               To                  How Detected
BELONGS_TO    Target/Alert/Rule  PrometheusInstance  Parent-child
SCRAPES       PrometheusTarget   K8sPod              Match pod label
SCRAPES       PrometheusTarget   K8sService          Match service label
MONITORS      PrometheusAlert    K8sDeployment       Parse deployment label
EVALUATES     PrometheusRule     K8sPod              Parse PromQL query
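
The SCRAPES link depends on targets carrying pod and namespace labels (see the relabel_configs example further down). A rough manual cross-check, assuming kubectl access to the same cluster:

# Pod names as Prometheus labels them
curl -s https://prometheus.example.com/api/v1/targets \
  | jq -r '.data.activeTargets[].labels.pod // empty' | sort -u

# Pod names as Kubernetes reports them
kubectl get pods -n production --no-headers -o custom-columns=NAME:.metadata.name | sort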

Run discovery once:

degree discovery run prometheus@1.0.0 --config prometheus-config.yaml

Or schedule it on an interval:

degree discovery schedule prometheus@1.0.0 --config prometheus-config.yaml --interval 5m
"What's the error rate for the API service?"

AI: [Executes PromQL]
rate(http_requests_total{service="api",status=~"5.."}[5m])

"Current error rate: 2.3%
- 5xx errors: 23 req/min
- Total requests: 1000 req/min
- Status: ✅ Normal (threshold: 5%)"
"What alerts are firing in production?"

AI: "3 critical alerts firing:

1. HighErrorRate (API service)
- Error rate: 15.2%
- Active for: 12 minutes
- Affects: production/api deployment

2. HighMemoryUsage (worker pods)
- Memory: 92%
- Active for: 5 minutes
- Affects: 3 pods

3. PodCrashLooping (frontend)
- Restarts: 5 in 10 minutes"
"Are all pods being monitored?"

AI: [Cross-reference K8sPods with PrometheusTargets]
"Monitoring status:

✅ Production: 15/15 pods monitored
⚠️ Staging: 7/8 pods monitored
- Missing: worker-xyz-456
- Issue: No /metrics endpoint

Recommendation: Add metrics to worker service"
"Is the new deployment healthy?"

AI: "Deployment Health: api v1.2.4

Prometheus Metrics:
├─ Error rate: ✅ 0.1% (improved)
├─ Response time: ✅ 125ms (-14%)
├─ CPU: ✅ 58% (down from 65%)
└─ Memory: ✅ Stable

Alerts: ✅ No alerts firing

Conclusion: Deployment successful ✅"
PrometheusTarget → SCRAPES → K8sPod
PrometheusAlert → MONITORS → K8sDeployment

You: "Why is the api-backend pod not responding?"

AI: [Checks Prometheus targets]
"Pod api-backend-abc is not being scraped:
- Target health: down
- Last error: connection refused
- Pod status: Running but /metrics unreachable

Likely cause: Container not listening on port 8080"
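
The same check can be done by hand by filtering the targets API for unhealthy entries and reading lastError (URL is a placeholder):

curl -s https://prometheus.example.com/api/v1/targets \
  | jq '.data.activeTargets[] | select(.health == "down") | {pod: .labels.pod, scrapeUrl, lastError}'
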
PrometheusAlert fires
↓ traces to
K8sDeployment[production/api]
↓ deployed by
ArgoApplication[api]
↓ synced from
GitHubWorkflowRun[#456]

You: "Why is the API alerting?"

AI: "Alert root cause analysis:

Alert: HighErrorRate (firing 12 min)
├─ Deployment: v1.2.4 (deployed 15 min ago)
├─ CI Run: GitHub Actions #456
├─ Commit: abc123 by jane@example.com
└─ Change: Updated error handling logic

Metrics comparison:
- Before (v1.2.3): 0.2% errors
- After (v1.2.4): 15.2% errors

Recommendation: Rollback to v1.2.3"
PrometheusAlert → NOTIFIES → SlackChannel

Alert fires → Auto-notification to #oncall:

"🚨 Critical Alert: HighErrorRate

Service: production/api
Error rate: 15.2% (threshold: 5%)
Duration: 12 minutes

Investigation:
- Deployment: v1.2.4 (15 min ago)
- GitHub Actions: #456
- Author: @jane

Actions:
- @oncall notified
- PagerDuty incident created
- Rollback recommended"
Application
↓ exposes /metrics
K8sPod
↓ scraped by
PrometheusTarget
↓ metrics stored in
PrometheusInstance
↓ evaluated by
PrometheusRule
↓ fires
PrometheusAlert
↓ notifies
SlackChannel + PagerDuty
↓ investigated using
AI + Complete Graph
You: "Why is the API alerting?"

AI: Complete root cause analysis:
✓ When alert started
✓ What deployment changed
✓ Which commit caused it
✓ What metrics show
✓ Recommended action
You: "Is the API at capacity?"

AI: Capacity analysis:
✓ Current utilization (28%)
✓ Growth trends (+15% week-over-week)
✓ Projected capacity needs
✓ Scaling recommendations
✓ Cost optimization
You: "Is the deployment healthy?"

AI: Post-deployment metrics:
✓ Error rate comparison
✓ Latency changes
✓ Resource usage
✓ Alert status
✓ Performance verdict
You: "What happened during the outage?"

AI: Complete incident reconstruction:
14:32 - Latency spike (pending alert)
14:35 - Degradation starts (firing alert)
14:37 - Full outage (95% errors)
14:55 - Root cause found
15:18 - Resolved

Including: metrics, correlations, actions

The molecule makes PromQL natural:

Instead of writing:
rate(http_requests_total{job="api",status=~"5.."}[5m])

Just ask:
"What's the 5xx error rate for the API?"

Instead of:
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

Just ask:
"What's the 95th percentile response time?"

Monitor multiple Prometheus instances:

instances:
  - name: "prod-us"
    url: "https://prom-us.example.com"
  - name: "prod-eu"
    url: "https://prom-eu.example.com"
  - name: "staging"
    url: "https://prom-staging.example.com"

AI can query across all:

"Show me all firing alerts across all environments"

AI: Alerts across 3 Prometheus instances:

prod-us: 1 critical
- HighErrorRate in api service

prod-eu: ✅ No alerts

staging: 2 warnings
- SlowQueries in database
- HighMemory in worker
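
Under the hood this amounts to asking each instance's alerts endpoint; a minimal shell sketch using the instance URLs from the config above:

for url in https://prom-us.example.com https://prom-eu.example.com https://prom-staging.example.com; do
  echo "== $url"
  curl -s "$url/api/v1/alerts" \
    | jq -r '.data.alerts[] | select(.state == "firing") | "\(.labels.severity)\t\(.labels.alertname)"'
done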

Ensure consistent labels for K8s integration:

- job: "kubernetes-pods"
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_name]
target_label: pod
- source_labels: [__meta_kubernetes_namespace]
target_label: namespace
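
After changing relabel_configs, it's worth validating the Prometheus configuration with promtool (shipped with Prometheus) before reloading; the file path here is the conventional default:

promtool check config prometheus.yml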

Include identifying labels in alerting rules:

groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(...) > 0.05
        labels:
          severity: critical
          service: api
          namespace: production
          deployment: api-backend
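
Rule files can be validated the same way before they are loaded (the filename here is just an example):

promtool check rules api-alerts.yml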

Schedule regular discovery to keep the graph current:

degree discovery schedule prometheus@1.0.0 \
  --config prometheus-config.yaml \
  --interval 5m
Error: 401 Unauthorized

Solution: Check the token or credentials:

auth:
  type: "bearer"
  bearer_token: "YOUR_TOKEN"
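
To test the credentials outside of discovery, call an API endpoint directly (URL, token, and credentials are placeholders):

# Bearer token
curl -s -H "Authorization: Bearer $TOKEN" https://prometheus.example.com/api/v1/status/buildinfo

# Basic auth
curl -s -u admin:SECRET https://prometheus.example.com/api/v1/status/buildinfo
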
Error: connection refused

Solution: Verify the Prometheus URL and network access:

curl https://prometheus.example.com/-/healthy

If targets aren't linking to pods, verify labels:


curl https://prometheus.example.com/api/v1/targets | jq '.data.activeTargets[0].labels'


MIT License - see LICENSE