PagerDuty
Discover PagerDuty incident management infrastructure, with full integration into Kubernetes, Prometheus, and Slack.
- Services: All PagerDuty services with configuration
- Incidents: Active and recent incidents with full context
- Users: User profiles and contact information
- Escalation Policies: Escalation chains and rules
- Schedules: On-call schedules and rotations
- Automatic K8s Linking: Connects services to deployments by name matching
MCP tools:

- `pd_create_incident` - Create new incidents
- `pd_acknowledge_incident` - Acknowledge incidents
- `pd_resolve_incident` - Resolve incidents
- `pd_list_incidents` - List incidents with filtering
- `pd_get_oncall` - Get on-call users for services
- `pd_add_note` - Add notes to incidents
- `pd_trigger_incident` - Trigger incidents via Events API v2
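Under the hood these tools map onto PagerDuty's REST API v2 (and the Events API v2 for `pd_trigger_incident`). As a hedged sketch of the call behind `pd_create_incident`, the raw REST equivalent looks roughly like this; the token, requester email, and service ID are placeholders:

```bash
# Sketch: create an incident via the PagerDuty REST API v2.
# The token, From email, and service ID below are placeholders.
curl -X POST "https://api.pagerduty.com/incidents" \
  -H "Authorization: Token token=u+ABCDEFGH" \
  -H "Accept: application/vnd.pagerduty+json;version=2" \
  -H "Content-Type: application/json" \
  -H "From: jane@example.com" \
  -d '{
    "incident": {
      "type": "incident",
      "title": "API high latency",
      "service": { "id": "PXYZ123", "type": "service_reference" },
      "urgency": "high"
    }
  }'
```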
Configuration (pagerduty-config.yaml):

discovery:
  enabled: true
  settings:
    api_token: "u+ABCDEFGH"  # PagerDuty REST API token
    # Optional filters
    service_ids: ["PXYZ123", "PXYZ456"]  # Specific services only
    team_ids: ["TXYZ789"]                # Filter by team
    # For K8s integration (service name matching)
    kubernetes_cluster: "production"

mcp:
  enabled: true
  settings:
    api_token: "u+ABCDEFGH"
    integration_key: "abc123"  # For triggering incidents via Events API
    namespace: "default"
Multiple accounts:

discovery:
  enabled: true
  settings:
    accounts:
      - name: "production"
        api_token: "u+PROD_TOKEN"
      - name: "staging"
        api_token: "u+STAGING_TOKEN"
    kubernetes_cluster: "production"
    namespace: "default"
Example discovered resources:

kind: PagerDutyService
metadata:
  name: "production/api-service"
  labels:
    status: "active"
    escalation_policy: "Default"
spec:
  account: "production"
  service_id: "PXYZ123"
  name: "API Service"
  status: "active"
  description: "Production API service"
  escalation_policy_id: "PXYZ789"
  alert_creation: "create_alerts_and_incidents"

kind: PagerDutyIncident
metadata:
  name: "production/QXYZ456"
  labels:
    status: "triggered"
    urgency: "high"
    service: "API Service"
spec:
  account: "production"
  incident_id: "QXYZ456"
  incident_number: 12345
  title: "High error rate on API service"
  status: "triggered"
  urgency: "high"
  created_at: "2024-02-10T16:00:00Z"
  service_id: "PXYZ123"
  service_name: "API Service"
  assigned_to_user_id: "UXYZ789"

kind: PagerDutyUser
metadata:
  name: "production/UXYZ789"
  labels:
    role: "user"
spec:
  account: "production"
  user_id: "UXYZ789"
  name: "Jane Doe"
  email: "jane@example.com"
  role: "user"

kind: PagerDutyEscalationPolicy
metadata:
  name: "production/PXYZ789"
spec:
  account: "production"
  policy_id: "PXYZ789"
  name: "Default"
  num_loops: 1
  on_call_handoff_notifications: "if_has_services"

kind: PagerDutySchedule
metadata:
  name: "production/SXYZ123"
spec:
  account: "production"
  schedule_id: "SXYZ123"
  name: "Primary On-Call"
  time_zone: "America/New_York"
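Once discovery has run, these resources can be queried from the graph. A minimal sketch, assuming the spec fields shown above (such as `name` and `status`) are exposed as node properties:

```bash
# List discovered PagerDuty services and their status
degree query "MATCH (s:PagerDutyService) RETURN s.name, s.status"
```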
| Relationship | From | To | How Detected |
|---|---|---|---|
| BELONGS_TO | Service/Incident/User | Account | Parent-child |
| ASSIGNED_TO | Incident | User | Incident assignment |
| ESCALATES_TO | Service | EscalationPolicy | Service configuration |
| AFFECTS | Incident | Service | Incident service reference |
| ON_CALL_FOR | User | Service | Schedule lookup |
| AFFECTS | Incident | K8sDeployment | Service name matching (future) |
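These relationships can be traversed directly. A hedged example that lists triggered incidents with their assignees, assuming the node labels and properties from the resource examples above:

```bash
# Triggered incidents and the users they are assigned to
degree query "MATCH (i:PagerDutyIncident)-[:ASSIGNED_TO]->(u:PagerDutyUser) WHERE i.status = 'triggered' RETURN i.title, u.name"
```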
degree discovery run pagerduty@1.0.0 --config pagerduty-config.yaml
degree discovery schedule pagerduty@1.0.0 --config pagerduty-config.yaml --interval 5m
"Who's on-call for the API service?"
AI: [Queries PagerDuty schedules]
"On-call for API Service:
Primary: Jane Doe (jane@example.com)
- Shift: Mon-Fri 9am-5pm EST
- Phone: +1-555-0100
Secondary: John Smith (john@example.com)
- Escalation: After 30 minutes"
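The data behind this answer comes from PagerDuty's `/oncalls` endpoint; a hedged curl equivalent, with the escalation policy ID and token as placeholders:

```bash
# Who is currently on call for escalation policy PXYZ789
curl -s -g "https://api.pagerduty.com/oncalls?escalation_policy_ids[]=PXYZ789" \
  -H "Authorization: Token token=u+ABCDEFGH" \
  -H "Accept: application/vnd.pagerduty+json;version=2"
```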
"What incidents happened during the last deployment?"
AI: [Cross-references K8s deployment time with PagerDuty incidents]
"3 incidents during api v1.2.4 deployment (15:40-16:00):
1. HighErrorRate (#12345)
- Status: Resolved
- Duration: 5 minutes
- Assigned to: Jane Doe
2. HighMemoryUsage (#12346)
- Status: Resolved
- Duration: 12 minutes
- Assigned to: Jane Doe
3. SlowResponseTime (#12347)
- Status: Acknowledged
- Duration: Ongoing (22 minutes)
- Assigned to: John Smith"
"Show me all critical incidents this week"
AI: "Critical incidents (last 7 days):
5 total incidents
├─ 4 resolved
└─ 1 ongoing
By service:
├─ API Service: 3 incidents
├─ Database: 1 incident
└─ Frontend: 1 incident
Average resolution time: 18 minutes
Current open:
- #12350: Database connection pool exhausted
Status: Acknowledged (8 min ago)
Assigned to: Jane Doe"
"Create an incident for API high latency"
AI: [Creates PagerDuty incident]
"✅ Incident created: #12351
Title: API high latency
Service: API Service
Urgency: High
Status: Triggered
Actions taken:
├─ Incident created in PagerDuty
├─ Jane Doe paged (on-call)
└─ Notification sent to #oncall Slack channel
Incident URL: https://example.pagerduty.com/incidents/QXYZ789"
PrometheusAlert → TRIGGERS → PagerDutyIncident
Prometheus alert fires
↓ auto-creates
PagerDutyIncident
↓ pages
PagerDutyUser (on-call)
↓ notifies
SlackChannel (#oncall)
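A common way to wire the first hop of this chain is an Alertmanager receiver pointed at the service's Events API v2 integration. A minimal sketch, reusing the integration key from the MCP settings above:

```yaml
# Alertmanager sketch: send firing alerts to PagerDuty via the Events API v2
route:
  receiver: pagerduty-oncall
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "abc123"   # Events API v2 integration key
        severity: critical
```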
You: "Did this Prometheus alert create a PagerDuty incident?"
AI: "Yes! Alert HighErrorRate created incident #12345:
Alert: HighErrorRate (fired 15:47)
├─ Triggered PagerDuty incident #12345
├─ Paged Jane Doe (on-call)
├─ Notified #oncall channel
└─ Status: Acknowledged (15:50)
Timeline:
15:47 - Alert fired
15:47 - Incident created
15:48 - Jane paged
15:50 - Incident acknowledged
Next steps: Jane is investigating"
PagerDutyService → AFFECTS → K8sDeployment
You: "Which team owns the API deployment?"
AI: [Links K8s → PagerDuty → Users]
"API deployment ownership:
PagerDuty Service: API Service
├─ Escalation: Default policy
├─ Primary on-call: Jane Doe
└─ Secondary on-call: John Smith
Contact:
- Slack: #api-team
- Email: api-team@example.com
- PagerDuty: API Service"
Complete incident root cause analysis:
PrometheusAlert fires
↓ creates
PagerDutyIncident
↓ affects
K8sDeployment
↓ deployed by
ArgoApplication
↓ synced from
GitHubWorkflowRun
You: "What caused incident #12345?"
AI: "Incident #12345 root cause:
Incident: HighErrorRate
├─ Created: 15:47 (Prometheus alert)
├─ Service: API Service
├─ Status: Resolved (15:55)
└─ Assigned to: Jane Doe
Root cause:
├─ Deployment: api v1.2.4 (deployed 15:40)
├─ CI Run: GitHub Actions #456
├─ Commit: abc123 by jane@example.com
└─ Change: Updated error handling
Timeline:
15:40 - Deployment started
15:42 - Pods rolled out
15:47 - Alert fired → Incident created
15:48 - Jane acknowledged
15:50 - Rollback initiated
15:55 - Incident resolved
Post-mortem:
- Error rate spiked from 0.2% to 15%
- All 3 pods affected
- Rollback to v1.2.3 resolved issue
- PR #457 fixes root cause"
PagerDutyIncident → NOTIFIES → SlackChannel
Incident lifecycle in Slack:
15:47 - Incident created:
"🚨 New Incident: #12345
HighErrorRate - API Service
Urgency: High
Assigned to: @jane"
15:50 - Incident acknowledged:
"✅ Incident #12345 acknowledged by @jane
Investigating..."
15:52 - Note added:
"💬 @jane: Root cause identified - rolling back deployment"
15:55 - Incident resolved:
"✅ Incident #12345 resolved
Duration: 8 minutes
Resolution: Rollback to v1.2.3"
Application Error
↓ detected by
PrometheusAlert (HighErrorRate)
↓ triggers
PagerDutyIncident (created)
↓ pages
PagerDutyUser (on-call: Jane)
↓ notifies
SlackChannel (#oncall)
↓ investigates
K8sDeployment (recent change?)
↓ traces to
ArgoApplication (deployment)
↓ synced from
GitHubWorkflowRun (#456)
↓ built
GitCommit (abc123)
↓ action
Rollback via Argo CD
↓ resolves
PagerDutyIncident (resolved)
↓ updates
SlackChannel (#oncall)
You: "Who's on-call this week?"
AI: On-call schedule (Feb 10-16, 2024):
API Service:
├─ Primary: Jane Doe
│ Mon-Fri: 9am-5pm EST
├─ Secondary: John Smith
│ Mon-Fri: 5pm-9am EST
└─ Weekend: Bob Johnson
Sat-Sun: All day
Database Service:
├─ Primary: Alice Williams
└─ Secondary: Charlie Brown
You: "Show me incident trends for the API service"
AI: API Service incidents (last 30 days):
Total: 45 incidents
├─ Critical: 8 (18%)
├─ High: 15 (33%)
└─ Low: 22 (49%)
Status:
├─ Resolved: 43 (96%)
└─ Ongoing: 2 (4%)
Average resolution time: 22 minutes
Top causes:
1. High error rate: 15 incidents
2. High memory usage: 12 incidents
3. Slow queries: 8 incidents
Trend: ↓ 23% reduction from last month
You: "What incidents were caused by deployments?"
AI: Deployment-related incidents (last 30 days):
18 incidents linked to deployments
By deployment:
v1.2.4 (Feb 10): 3 incidents
- HighErrorRate: Resolved (5 min)
- HighMemory: Resolved (12 min)
- SlowResponse: Resolved (22 min)
v1.2.3 (Feb 9): 0 incidents ✅
v1.2.2 (Feb 8): 1 incident
- DatabaseConnection: Resolved (45 min)
Recommendation: v1.2.3 is most stable
You: "Create an incident if error rate exceeds 5%"
AI: [Sets up Prometheus → PagerDuty integration]
"✅ Alert rule configured:
Prometheus Rule:
- Name: HighErrorRate
- Query: rate(errors) > 0.05
- Duration: 5 minutes
PagerDuty Integration:
- Service: API Service
- Urgency: High
- Auto-assign: On-call user
Actions on alert:
1. Create PagerDuty incident
2. Page on-call user
3. Post to #oncall Slack channel
4. Include runbook link
Test this rule? [Yes/No]"
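The Prometheus half of that rule could look like the sketch below. Only the 5% threshold and 5-minute duration come from the conversation above; the metric names, labels, and runbook URL are assumptions:

```yaml
groups:
  - name: api-service-alerts
    rules:
      - alert: HighErrorRate
        # Metric names are placeholders; use whatever your application exports
        expr: rate(http_request_errors_total[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
          service: api-service
        annotations:
          summary: "API error rate above 5%"
          runbook_url: "https://example.com/runbooks/high-error-rate"
```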
- PagerDuty REST API v2: https://developer.pagerduty.com/api-reference/
- Events API v2: https://developer.pagerduty.com/docs/events-api-v2/overview/
Get your API token:
- Log in to PagerDuty
- Navigate to Configuration → API Access
- Create a new API token with Full Access or Read-only permissions
- Tokens use the format: u+ABCDEFGHIJKLMNOPQRSTUVW
For Events API v2 (triggering incidents):
- Navigate to Service → Integrations
- Add Integration → Events API v2
- Copy the Integration Key
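To confirm the integration key works, you can send a test event straight to the Events API v2. A minimal curl sketch; the routing key is the integration key copied above:

```bash
curl -X POST "https://events.pagerduty.com/v2/enqueue" \
  -H "Content-Type: application/json" \
  -d '{
    "routing_key": "abc123",
    "event_action": "trigger",
    "payload": {
      "summary": "Test event from PagerDuty plugin setup",
      "source": "degree-setup",
      "severity": "info"
    }
  }'
```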
Error: 401 Unauthorized
Solution: Verify the API token in your config:
settings:
  api_token: "u+YOUR_TOKEN_HERE"
If services aren't linking to K8s deployments, compare the discovered names on both sides:
degree query "MATCH (s:PagerDutyService), (d:K8sDeployment) RETURN s.name, d.name"
Error: 429 Too Many Requests
Solution: Increase the discovery interval:
degree discovery schedule pagerduty@1.0.0 --interval 15m
Use consistent naming between PagerDuty and K8s:
PagerDuty Service: "api-service"
K8s Deployment: "api-service"
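On the Kubernetes side, that means the Deployment name should match the PagerDuty service name exactly. A minimal sketch of a matching Deployment; the image and labels are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service   # must match the PagerDuty service name for auto-linking
  labels:
    app: api-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      containers:
        - name: api-service
          image: registry.example.com/api-service:v1.2.4   # placeholder image
```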
Structure escalation policies clearly:
Level 1: Primary on-call (immediate)
Level 2: Secondary on-call (after 5 min)
Level 3: Team lead (after 15 min)
Level 4: Manager (after 30 min)
degree discovery schedule pagerduty@1.0.0 \
--config pagerduty-config.yaml \
--interval 5m
Test the complete flow:
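One hedged way to do that: fire a test event through the service's Events API v2 integration, re-run discovery, and confirm the incident shows up in the graph (the integration key and property names are assumptions):

```bash
# 1. Trigger a test incident via the Events API v2 (integration key is a placeholder)
curl -s -X POST "https://events.pagerduty.com/v2/enqueue" \
  -H "Content-Type: application/json" \
  -d '{"routing_key": "abc123", "event_action": "trigger", "payload": {"summary": "End-to-end test", "source": "degree-test", "severity": "info"}}'

# 2. Re-run discovery and confirm the incident appears
degree discovery run pagerduty@1.0.0 --config pagerduty-config.yaml
degree query "MATCH (i:PagerDutyIncident) WHERE i.status = 'triggered' RETURN i.title"

# 3. Resolve the test incident in PagerDuty once the flow is confirmed
```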
MIT License - see LICENSE