PagerDuty

Discover your PagerDuty incident management infrastructure, with full integration into Kubernetes, Prometheus, and Slack. Discovery covers:

  • Services: All PagerDuty services with configuration

  • Incidents: Active and recent incidents with full context

  • Users: User profiles and contact information

  • Escalation Policies: Escalation chains and rules

  • Schedules: On-call schedules and rotations

  • Automatic K8s Linking: Connects services to deployments by name matching

The following MCP tools are available:

  • pd_create_incident - Create new incidents

  • pd_acknowledge_incident - Acknowledge incidents

  • pd_resolve_incident - Resolve incidents

  • pd_list_incidents - List incidents with filtering

  • pd_get_oncall - Get on-call users for services

  • pd_add_note - Add notes to incidents

  • pd_trigger_incident - Trigger incidents via Events API v2

Configuration (pagerduty-config.yaml):

discovery:
  enabled: true
  settings:
    api_token: "u+ABCDEFGH"  # PagerDuty REST API token

    # Optional filters
    service_ids: ["PXYZ123", "PXYZ456"]  # Specific services only
    team_ids: ["TXYZ789"]                # Filter by team

    # For K8s integration (service name matching)
    kubernetes_cluster: "production"

mcp:
  enabled: true
  settings:
    api_token: "u+ABCDEFGH"
    integration_key: "abc123"  # For triggering incidents via Events API
namespace: "default"
discovery:
enabled: true
settings:
accounts:
- name: "production"
api_token: "u+PROD_TOKEN"

- name: "staging"
api_token: "u+STAGING_TOKEN"

kubernetes_cluster: "production"

Discovered resources look like this:

namespace: "default"
kind: PagerDutyService
metadata:
  name: "production/api-service"
  labels:
    status: "active"
    escalation_policy: "Default"
spec:
  account: "production"
  service_id: "PXYZ123"
  name: "API Service"
  status: "active"
  description: "Production API service"
  escalation_policy_id: "PXYZ789"
  alert_creation: "create_alerts_and_incidents"

kind: PagerDutyIncident
metadata:
  name: "production/QXYZ456"
  labels:
    status: "triggered"
    urgency: "high"
    service: "API Service"
spec:
  account: "production"
  incident_id: "QXYZ456"
  incident_number: 12345
  title: "High error rate on API service"
  status: "triggered"
  urgency: "high"
  created_at: "2024-02-10T16:00:00Z"
  service_id: "PXYZ123"
  service_name: "API Service"
  assigned_to_user_id: "UXYZ789"

kind: PagerDutyUser
metadata:
  name: "production/UXYZ789"
  labels:
    role: "user"
spec:
  account: "production"
  user_id: "UXYZ789"
  name: "Jane Doe"
  email: "jane@example.com"
  role: "user"

kind: PagerDutyEscalationPolicy
metadata:
  name: "production/PXYZ789"
spec:
  account: "production"
  policy_id: "PXYZ789"
  name: "Default"
  num_loops: 1
  on_call_handoff_notifications: "if_has_services"

kind: PagerDutySchedule
metadata:
  name: "production/SXYZ123"
spec:
  account: "production"
  schedule_id: "SXYZ123"
  name: "Primary On-Call"
  time_zone: "America/New_York"

Relationship   From                    To                 How Detected
BELONGS_TO     Service/Incident/User   Account            Parent-child
ASSIGNED_TO    Incident                User               Incident assignment
ESCALATES_TO   Service                 EscalationPolicy   Service configuration
AFFECTS        Incident                Service            Incident service reference
ON_CALL_FOR    User                    Service            Schedule lookup
AFFECTS        Incident                K8sDeployment      Service name matching (future)

Run discovery once:

degree discovery run pagerduty@1.0.0 --config pagerduty-config.yaml

Or schedule it to run continuously:

degree discovery schedule pagerduty@1.0.0 --config pagerduty-config.yaml --interval 5m
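Once discovery has run, the relationships above can be explored with graph queries. For example, to list incidents and their assignees (the edge and property names here follow the table and resource specs above and may need adjusting):

degree query "MATCH (i:PagerDutyIncident)-[:ASSIGNED_TO]->(u:PagerDutyUser) RETURN i.title, u.name"
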
"Who's on-call for the API service?"

AI: [Queries PagerDuty schedules]
"On-call for API Service:

Primary: Jane Doe (jane@example.com)
- Shift: Mon-Fri 9am-5pm EST
- Phone: +1-555-0100

Secondary: John Smith (john@example.com)
- Escalation: After 30 minutes"
"What incidents happened during the last deployment?"

AI: [Cross-references K8s deployment time with PagerDuty incidents]
"3 incidents during api v1.2.4 deployment (15:40-16:00):

1. HighErrorRate (#12345)
- Status: Resolved
- Duration: 5 minutes
- Assigned to: Jane Doe

2. HighMemoryUsage (#12346)
- Status: Resolved
- Duration: 12 minutes
- Assigned to: Jane Doe

3. SlowResponseTime (#12347)
- Status: Acknowledged
- Duration: Ongoing (22 minutes)
- Assigned to: John Smith"
"Show me all critical incidents this week"

AI: "Critical incidents (last 7 days):

5 total incidents
├─ 4 resolved
└─ 1 ongoing

By service:
├─ API Service: 3 incidents
├─ Database: 1 incident
└─ Frontend: 1 incident

Average resolution time: 18 minutes

Current open:
- #12350: Database connection pool exhausted
Status: Acknowledged (8 min ago)
Assigned to: Jane Doe"
"Create an incident for API high latency"

AI: [Creates PagerDuty incident]
"✅ Incident created: #12351

Title: API high latency
Service: API Service
Urgency: High
Status: Triggered

Actions taken:
├─ Incident created in PagerDuty
├─ Jane Doe paged (on-call)
└─ Notification sent to #oncall Slack channel

Incident URL: https://example.pagerduty.com/incidents/QXYZ789"
PrometheusAlert → TRIGGERS → PagerDutyIncident

Prometheus alert fires
↓ auto-creates
PagerDutyIncident
↓ pages
PagerDutyUser (on-call)
↓ notifies
SlackChannel (#oncall)
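With both Prometheus and PagerDuty discovery enabled, this link can be checked directly in the graph. A query along these lines should work, assuming the TRIGGERS edge is exposed as shown above (the alert's name property is an assumption):

degree query "MATCH (a:PrometheusAlert)-[:TRIGGERS]->(i:PagerDutyIncident) RETURN a.name, i.incident_number, i.status"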

You: "Did this Prometheus alert create a PagerDuty incident?"

AI: "Yes! Alert HighErrorRate created incident #12345:

Alert: HighErrorRate (fired 15:47)
├─ Triggered PagerDuty incident #12345
├─ Paged Jane Doe (on-call)
├─ Notified #oncall channel
└─ Status: Acknowledged (15:50)

Timeline:
15:47 - Alert fired
15:47 - Incident created
15:48 - Jane paged
15:50 - Incident acknowledged

Next steps: Jane is investigating"
PagerDutyService → AFFECTS → K8sDeployment

You: "Which team owns the API deployment?"

AI: [Links K8s → PagerDuty → Users]
"API deployment ownership:

PagerDuty Service: API Service
├─ Escalation: Default policy
├─ Primary on-call: Jane Doe
└─ Secondary on-call: John Smith

Contact:
- Slack: #api-team
- Email: api-team@example.com
- PagerDuty: API Service"
Complete incident root cause analysis:

PrometheusAlert fires
↓ creates
PagerDutyIncident
↓ affects
K8sDeployment
↓ deployed by
ArgoApplication
↓ synced from
GitHubWorkflowRun

You: "What caused incident #12345?"

AI: "Incident #12345 root cause:

Incident: HighErrorRate
├─ Created: 15:47 (Prometheus alert)
├─ Service: API Service
├─ Status: Resolved (15:55)
└─ Assigned to: Jane Doe

Root cause:
├─ Deployment: api v1.2.4 (deployed 15:40)
├─ CI Run: GitHub Actions #456
├─ Commit: abc123 by jane@example.com
└─ Change: Updated error handling

Timeline:
15:40 - Deployment started
15:42 - Pods rolled out
15:47 - Alert fired → Incident created
15:48 - Jane acknowledged
15:50 - Rollback initiated
15:55 - Incident resolved

Post-mortem:
- Error rate spiked from 0.2% to 15%
- All 3 pods affected
- Rollback to v1.2.3 resolved issue
- PR #457 fixes root cause"
PagerDutyIncident → NOTIFIES → SlackChannel

Incident lifecycle in Slack:

15:47 - Incident created:
"🚨 New Incident: #12345
HighErrorRate - API Service
Urgency: High
Assigned to: @jane"

15:50 - Incident acknowledged:
"✅ Incident #12345 acknowledged by @jane
Investigating..."

15:52 - Note added:
"💬 @jane: Root cause identified - rolling back deployment"

15:55 - Incident resolved:
"✅ Incident #12345 resolved
Duration: 8 minutes
Resolution: Rollback to v1.2.3"
Application Error
↓ detected by
PrometheusAlert (HighErrorRate)
↓ triggers
PagerDutyIncident (created)
↓ pages
PagerDutyUser (on-call: Jane)
↓ notifies
SlackChannel (#oncall)
↓ investigates
K8sDeployment (recent change?)
↓ traces to
ArgoApplication (deployment)
↓ synced from
GitHubWorkflowRun (#456)
↓ built
GitCommit (abc123)
↓ action
Rollback via Argo CD
↓ resolves
PagerDutyIncident (resolved)
↓ updates
SlackChannel (#oncall)
You: "Who's on-call this week?"

AI: On-call schedule (Feb 10-16, 2024):

API Service:
├─ Primary: Jane Doe
│ Mon-Fri: 9am-5pm EST
├─ Secondary: John Smith
│ Mon-Fri: 5pm-9am EST
└─ Weekend: Bob Johnson
Sat-Sun: All day

Database Service:
├─ Primary: Alice Williams
└─ Secondary: Charlie Brown
You: "Show me incident trends for the API service"

AI: API Service incidents (last 30 days):

Total: 45 incidents
├─ Critical: 8 (18%)
├─ High: 15 (33%)
└─ Low: 22 (49%)

Status:
├─ Resolved: 43 (96%)
└─ Ongoing: 2 (4%)

Average resolution time: 22 minutes

Top causes:
1. High error rate: 15 incidents
2. High memory usage: 12 incidents
3. Slow queries: 8 incidents

Trend: ↓ 23% reduction from last month
You: "What incidents were caused by deployments?"

AI: Deployment-related incidents (last 30 days):

18 incidents linked to deployments

By deployment:
v1.2.4 (Feb 10): 3 incidents
- HighErrorRate: Resolved (5 min)
- HighMemory: Resolved (12 min)
- SlowResponse: Resolved (22 min)

v1.2.3 (Feb 9): 0 incidents ✅

v1.2.2 (Feb 8): 1 incident
- DatabaseConnection: Resolved (45 min)

Recommendation: v1.2.3 is most stable
You: "Create an incident if error rate exceeds 5%"

AI: [Sets up Prometheus → PagerDuty integration]
"✅ Alert rule configured:

Prometheus Rule:
- Name: HighErrorRate
- Query: rate(errors) > 0.05
- Duration: 5 minutes

PagerDuty Integration:
- Service: API Service
- Urgency: High
- Auto-assign: On-call user

Actions on alert:
1. Create PagerDuty incident
2. Page on-call user
3. Post to #oncall Slack channel
4. Include runbook link

Test this rule? [Yes/No]"
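The wiring described above corresponds to a standard Prometheus alerting rule routed to PagerDuty through Alertmanager. A minimal sketch, using placeholder metric names and the integration key from the configuration above:

# Prometheus rule file: fire HighErrorRate when the error ratio exceeds 5% for 5 minutes
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_errors_total[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on API Service"
          runbook_url: "https://example.com/runbooks/high-error-rate"

# Alertmanager receiver: route the alert to the Events API v2 integration
receivers:
  - name: pagerduty-api-service
    pagerduty_configs:
      - routing_key: "abc123"  # Events API v2 integration key
        severity: critical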

Get your API token:

  1. Log in to PagerDuty
  2. Navigate to Configuration → API Access
  3. Create a new API token with Full Access or Read-only
  4. Copy the token; it has the format u+ABCDEFGHIJKLMNOPQRSTUVW

For Events API v2 (triggering incidents):

  1. Navigate to Service → Integrations
  2. Add Integration → Events API v2
  3. Copy the Integration Key
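With the Integration Key, incidents can be triggered directly against the Events API v2. For example (the routing_key below is the placeholder key from the configuration above):

curl -X POST https://events.pagerduty.com/v2/enqueue \
  -H "Content-Type: application/json" \
  -d '{
        "routing_key": "abc123",
        "event_action": "trigger",
        "payload": {
          "summary": "High error rate on API service",
          "source": "api-service",
          "severity": "critical"
        }
      }'
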
Error: 401 Unauthorized

Solution: Check that the API token in your configuration is valid:

settings:
  api_token: "u+YOUR_TOKEN_HERE"

If services aren't linking to K8s deployments, check that both resource types exist in the graph and that their names match:

degree query "MATCH (s:PagerDutyService), (d:K8sDeployment) RETURN s.name, d.name"


Error: 429 Too Many Requests

Solution: Increase discovery interval


degree discovery schedule pagerduty@1.0.0 --interval 15m

Use consistent naming between PagerDuty and K8s:

PagerDuty Service: "api-service"
K8s Deployment: "api-service"
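In Kubernetes terms, the Deployment's metadata.name should equal the PagerDuty service name. A minimal, illustrative manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service        # matches the PagerDuty service name "api-service"
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      containers:
        - name: api-service
          image: example.com/api-service:v1.2.4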

Structure escalation policies clearly:

Level 1: Primary on-call (immediate)
Level 2: Secondary on-call (after 5 min)
Level 3: Team lead (after 15 min)
Level 4: Manager (after 30 min)

Schedule discovery to keep the graph fresh:

degree discovery schedule pagerduty@1.0.0 \
  --config pagerduty-config.yaml \
  --interval 5m

Test the complete flow:
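For example (the event payload is illustrative; substitute your own integration key and service name):

# 1. Trigger a test event through the Events API v2 integration
curl -X POST https://events.pagerduty.com/v2/enqueue \
  -H "Content-Type: application/json" \
  -d '{"routing_key": "abc123", "event_action": "trigger", "payload": {"summary": "Test incident", "source": "api-service", "severity": "info"}}'

# 2. Re-run discovery so the new incident is picked up
degree discovery run pagerduty@1.0.0 --config pagerduty-config.yaml

# 3. Confirm the incident appears in the graph
degree query "MATCH (i:PagerDutyIncident) RETURN i.title, i.status"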

MIT License - see LICENSE