Prometheus
Discover Prometheus monitoring infrastructure, metrics, alerts, and scrape targets, with full integration into your DevOps ecosystem.
- Prometheus Instances: Server information and configuration
- Scrape Targets: All monitored services with health status
- Active Alerts: Firing and pending alerts with full context
- Rules: Alerting and recording rules with PromQL queries
- Automatic K8s Linking: Connects targets to pods, alerts to deployments
Available MCP tools:

- `prom_query_metrics` - Execute PromQL queries
- `prom_query_range` - Query metrics over time ranges
- `prom_get_active_alerts` - List active alerts with filtering
- `prom_get_targets` - Get scrape target health status
- `prom_get_rules` - List alerting and recording rules
- `prom_get_metric_metadata` - Get metric documentation
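
These tools correspond to the standard Prometheus HTTP API. As a rough sketch of what a call like `prom_query_metrics` presumably involves under the hood (an assumption on our part, not documented behavior), here is a minimal Python example against the real `/api/v1/query` endpoint; the URL and basic-auth credentials mirror the example configuration below.

```python
import requests

# Assumption: URL and basic-auth credentials mirror the example discovery config below.
PROM_URL = "https://prometheus.example.com"

def query_metrics(promql: str) -> list:
    """Run an instant PromQL query via the Prometheus HTTP API (/api/v1/query)."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": promql},
        auth=("admin", "SECRET"),  # example basic auth, as in the config
        timeout=10,
    )
    resp.raise_for_status()
    body = resp.json()
    if body["status"] != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]

# Example: 5xx error rate for the API service
for series in query_metrics('rate(http_requests_total{service="api",status=~"5.."}[5m])'):
    print(series["metric"], series["value"])
```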
Discovery configuration:

```yaml
discovery:
  enabled: true
  settings:
    url: "https://prometheus.example.com"
    auth:
      type: "basic"  # or "bearer", "none"
      username: "admin"
      password: "SECRET"
    include_targets: true
    include_alerts: true
    include_rules: true
    # For K8s integration
    kubernetes_cluster: "production"
```
MCP configuration:

```yaml
mcp:
  enabled: true
  settings:
    url: "https://prometheus.example.com"
    auth:
      type: "basic"
      username: "admin"
      password: "SECRET"
    namespace: "default"
```
To discover multiple Prometheus instances:

```yaml
discovery:
  enabled: true
  settings:
    instances:
      - name: "production"
        url: "https://prom-prod.example.com"
        bearer_token: "SECRET"
      - name: "staging"
        url: "https://prom-staging.example.com"
        username: "admin"
        password: "SECRET"
    kubernetes_cluster: "production"
    namespace: "default"
```
Example discovered resources:

```yaml
kind: PrometheusInstance
metadata:
  name: "production"
spec:
  name: "production"
  url: "https://prometheus.example.com"
```
```yaml
kind: PrometheusTarget
metadata:
  name: "production/kubernetes-pods/abc123"
  labels:
    job: "kubernetes-pods"
    pod: "api-backend-xyz"
    namespace: "production"
    health: "up"
spec:
  instance: "production"
  job: "kubernetes-pods"
  scrape_url: "http://10.244.0.5:8080/metrics"
  health: "up"
  last_scrape: "2024-02-10T16:00:00Z"
  scrape_duration: 0.05
```
```yaml
kind: PrometheusAlert
metadata:
  name: "production/HighErrorRate/abc123"
  labels:
    alert_name: "HighErrorRate"
    state: "firing"
    severity: "critical"
spec:
  instance: "production"
  alert_name: "HighErrorRate"
  state: "firing"
  value: "15.2"
  active_at: "2024-02-10T16:00:00Z"
  labels:
    alertname: "HighErrorRate"
    service: "api"
    namespace: "production"
    severity: "critical"
  annotations:
    summary: "High error rate detected"
    description: "Error rate is 15.2% (threshold: 5%)"
```
```yaml
kind: PrometheusRule
metadata:
  name: "production/alerting/HighErrorRate"
  labels:
    type: "alerting"
    severity: "critical"
spec:
  instance: "production"
  name: "HighErrorRate"
  type: "alerting"
  query: "rate(http_requests_total{status=~\"5..\"}[5m]) > 0.05"
  health: "ok"
  labels:
    severity: "critical"
  annotations:
    summary: "High error rate detected"
```
| Relationship | From | To | How Detected |
|---|---|---|---|
| BELONGS_TO | Target/Alert/Rule | PrometheusInstance | Parent-child |
| SCRAPES | PrometheusTarget | K8sPod | Match pod label |
| SCRAPES | PrometheusTarget | K8sService | Match service label |
| MONITORS | PrometheusAlert | K8sDeployment | Parse deployment label |
| EVALUATES | PrometheusRule | K8sPod | Parse PromQL query |
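
The label matching behind SCRAPES is straightforward. Below is a purely illustrative Python sketch, assuming targets and pods are available as dicts shaped like the resource examples above; `find_scrapes_links` and the pod structure are hypothetical, not the molecule's actual implementation.

```python
# Hypothetical sketch: `targets` are PrometheusTarget resources shaped like the example
# above (metadata.labels.pod / metadata.labels.namespace); `pods` are assumed K8s pod
# resources with metadata.name / metadata.namespace.
def find_scrapes_links(targets: list[dict], pods: list[dict]) -> list[tuple[str, str]]:
    """Derive PrometheusTarget -> K8sPod SCRAPES edges by matching pod/namespace labels."""
    pod_index = {
        (p["metadata"]["namespace"], p["metadata"]["name"]): p["metadata"]["name"]
        for p in pods
    }
    links = []
    for target in targets:
        labels = target["metadata"].get("labels", {})
        key = (labels.get("namespace"), labels.get("pod"))
        if key in pod_index:
            links.append((target["metadata"]["name"], pod_index[key]))
    return links
```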
Run discovery once, or on a schedule:

```bash
degree discovery run prometheus@1.0.0 --config prometheus-config.yaml
degree discovery schedule prometheus@1.0.0 --config prometheus-config.yaml --interval 5m
```
"What's the error rate for the API service?"
AI: [Executes PromQL]
rate(http_requests_total{service="api",status=~"5.."}[5m])
"Current error rate: 2.3%
- 5xx errors: 23 req/min
- Total requests: 1000 req/min
- Status: ✅ Normal (threshold: 5%)"
"What alerts are firing in production?"
AI: "3 critical alerts firing:
1. HighErrorRate (API service)
- Error rate: 15.2%
- Active for: 12 minutes
- Affects: production/api deployment
2. HighMemoryUsage (worker pods)
- Memory: 92%
- Active for: 5 minutes
- Affects: 3 pods
3. PodCrashLooping (frontend)
- Restarts: 5 in 10 minutes"
"Are all pods being monitored?"
AI: [Cross-reference K8sPods with PrometheusTargets]
"Monitoring status:
✅ Production: 15/15 pods monitored
⚠️ Staging: 7/8 pods monitored
- Missing: worker-xyz-456
- Issue: No /metrics endpoint
Recommendation: Add metrics to worker service"
"Is the new deployment healthy?"
AI: "Deployment Health: api v1.2.4
Prometheus Metrics:
├─ Error rate: ✅ 0.1% (improved)
├─ Response time: ✅ 125ms (-14%)
├─ CPU: ✅ 58% (down from 65%)
└─ Memory: ✅ Stable
Alerts: ✅ No alerts firing
Conclusion: Deployment successful ✅"
```
PrometheusTarget → SCRAPES → K8sPod
PrometheusAlert → MONITORS → K8sDeployment
```
You: "Why is the api-backend pod not responding?"
AI: [Checks Prometheus targets]
"Pod api-backend-abc is not being scraped:
- Target health: down
- Last error: connection refused
- Pod status: Running but /metrics unreachable
Likely cause: Container not listening on port 8080"
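
A check like this can be reproduced directly against the real `/api/v1/targets` endpoint. A minimal Python sketch follows; the filtering and field selection are illustrative, not the molecule's internal logic, and the credentials are the example ones from the configuration above.

```python
import requests

def down_targets(prom_url: str, auth=None) -> list[dict]:
    """Return active scrape targets that Prometheus reports as down (/api/v1/targets)."""
    resp = requests.get(f"{prom_url}/api/v1/targets", auth=auth, timeout=10)
    resp.raise_for_status()
    active = resp.json()["data"]["activeTargets"]
    return [
        {"pod": t["labels"].get("pod"), "scrapeUrl": t["scrapeUrl"], "lastError": t["lastError"]}
        for t in active
        if t["health"] == "down"
    ]

for t in down_targets("https://prometheus.example.com", auth=("admin", "SECRET")):
    print(f'{t["pod"]}: {t["lastError"]} ({t["scrapeUrl"]})')
```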
```
PrometheusAlert fires
  ↓ traces to
K8sDeployment[production/api]
  ↓ deployed by
ArgoApplication[api]
  ↓ synced from
GitHubWorkflowRun[#456]
```
You: "Why is the API alerting?"
AI: "Alert root cause analysis:
Alert: HighErrorRate (firing 12 min)
├─ Deployment: v1.2.4 (deployed 15 min ago)
├─ CI Run: GitHub Actions #456
├─ Commit: abc123 by jane@example.com
└─ Change: Updated error handling logic
Metrics comparison:
- Before (v1.2.3): 0.2% errors
- After (v1.2.4): 15.2% errors
Recommendation: Rollback to v1.2.3"
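
The before/after comparison boils down to two range queries around the deployment time. A minimal Python sketch against the real `/api/v1/query_range` endpoint; the window sizes, credentials, and deployment timestamp are illustrative.

```python
import time
import requests

PROM_URL = "https://prometheus.example.com"
QUERY = 'rate(http_requests_total{service="api",status=~"5.."}[5m])'

def error_rate_series(start: float, end: float, step: str = "60s") -> list:
    """Fetch the 5xx error-rate series over [start, end] via /api/v1/query_range."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query_range",
        params={"query": QUERY, "start": start, "end": end, "step": step},
        auth=("admin", "SECRET"),  # example credentials, as in the config above
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

deploy_ts = time.time() - 15 * 60                        # deployment ~15 minutes ago (example)
before = error_rate_series(deploy_ts - 3600, deploy_ts)  # the hour before the deploy
after = error_rate_series(deploy_ts, time.time())        # everything since the deploy
```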
```
PrometheusAlert → NOTIFIES → SlackChannel
```

Alert fires → auto-notification to #oncall:

```
🚨 Critical Alert: HighErrorRate

Service: production/api
Error rate: 15.2% (threshold: 5%)
Duration: 12 minutes

Investigation:
- Deployment: v1.2.4 (15 min ago)
- GitHub Actions: #456
- Author: @jane

Actions:
- @oncall notified
- PagerDuty incident created
- Rollback recommended
```
```
Application
  ↓ exposes /metrics
K8sPod
  ↓ scraped by
PrometheusTarget
  ↓ metrics stored in
PrometheusInstance
  ↓ evaluated by
PrometheusRule
  ↓ fires
PrometheusAlert
  ↓ notifies
SlackChannel + PagerDuty
  ↓ investigated using
AI + Complete Graph
```
You: "Why is the API alerting?"
AI: Complete root cause analysis:
✓ When alert started
✓ What deployment changed
✓ Which commit caused it
✓ What metrics show
✓ Recommended action
You: "Is the API at capacity?"
AI: Capacity analysis:
✓ Current utilization (28%)
✓ Growth trends (+15% week-over-week)
✓ Projected capacity needs
✓ Scaling recommendations
✓ Cost optimization
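
The projection in an answer like this is simple compounding of the observed growth rate. A small sketch of the arithmetic, using the example numbers above (28% utilization, +15% week-over-week) and an assumed 80% scaling threshold, not any real data:

```python
def weeks_until_capacity(current_util: float, weekly_growth: float, limit: float = 0.8) -> int:
    """Weeks until utilization crosses `limit`, assuming compounding weekly growth."""
    weeks, util = 0, current_util
    while util < limit:
        util *= 1 + weekly_growth
        weeks += 1
    return weeks

# 28% utilization growing 15% week-over-week crosses an 80% threshold in ~8 weeks
print(weeks_until_capacity(0.28, 0.15))
```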
You: "Is the deployment healthy?"
AI: Post-deployment metrics:
✓ Error rate comparison
✓ Latency changes
✓ Resource usage
✓ Alert status
✓ Performance verdict
You: "What happened during the outage?"
AI: Complete incident reconstruction:
14:32 - Latency spike (pending alert)
14:35 - Degradation starts (firing alert)
14:37 - Full outage (95% errors)
14:55 - Root cause found
15:18 - Resolved
Including: metrics, correlations, actions
The molecule makes PromQL natural. Instead of writing:

```
rate(http_requests_total{job="api",status=~"5.."}[5m])
```

just ask: "What's the 5xx error rate for the API?"

Instead of:

```
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```

just ask: "What's the 95th percentile response time?"
Monitor multiple Prometheus instances:

```yaml
instances:
  - name: "prod-us"
    url: "https://prom-us.example.com"
  - name: "prod-eu"
    url: "https://prom-eu.example.com"
  - name: "staging"
    url: "https://prom-staging.example.com"
```
The AI can query across all of them:

```
You: "Show me all firing alerts across all environments"

AI: "Alerts across 3 Prometheus instances:

     prod-us: 1 critical
     - HighErrorRate in api service

     prod-eu: ✅ No alerts

     staging: 2 warnings
     - SlowQueries in database
     - HighMemory in worker"
```
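
Cross-instance queries are the same HTTP call repeated per configured instance. A minimal Python sketch using the real `/api/v1/alerts` endpoint and instance entries shaped like the multi-instance configuration above (authentication omitted for brevity):

```python
import requests

# Instance entries shaped like the multi-instance config shown earlier (auth omitted)
INSTANCES = [
    {"name": "prod-us", "url": "https://prom-us.example.com"},
    {"name": "prod-eu", "url": "https://prom-eu.example.com"},
    {"name": "staging", "url": "https://prom-staging.example.com"},
]

def firing_alerts(url: str) -> list[dict]:
    """List alerts in the 'firing' state from one Prometheus instance (/api/v1/alerts)."""
    resp = requests.get(f"{url}/api/v1/alerts", timeout=10)
    resp.raise_for_status()
    return [a for a in resp.json()["data"]["alerts"] if a["state"] == "firing"]

for inst in INSTANCES:
    names = [a["labels"]["alertname"] for a in firing_alerts(inst["url"])] or ["none"]
    print(f'{inst["name"]}: {", ".join(names)}')
```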
Ensure consistent labels for K8s integration:

```yaml
- job_name: "kubernetes-pods"
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_name]
      target_label: pod
    - source_labels: [__meta_kubernetes_namespace]
      target_label: namespace
```
Include identifying labels in alerting rules:

```yaml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(...) > 0.05
        labels:
          severity: critical
          service: api
          namespace: production
          deployment: api-backend
```
Schedule discovery to run at a regular interval:

```bash
degree discovery schedule prometheus@1.0.0 \
  --config prometheus-config.yaml \
  --interval 5m
```
```
Error: 401 Unauthorized
```

Solution: Check token or credentials:

```yaml
auth:
  type: "bearer"
  bearer_token: "YOUR_TOKEN"
```
```
Error: connection refused
```

Solution: Verify Prometheus URL and network access:

```bash
curl https://prometheus.example.com/-/healthy
```
If targets aren't linking to pods, verify labels:

```bash
curl https://prometheus.example.com/api/v1/targets | jq '.data.activeTargets[0].labels'
```
- Prometheus HTTP API: https://prometheus.io/docs/prometheus/latest/querying/api/
- PromQL: https://prometheus.io/docs/prometheus/latest/querying/basics/
MIT License - see LICENSE