PagerDuty
Discover PagerDuty incident management infrastructure, with full integration into Kubernetes, Prometheus, and Slack.
- Services: All PagerDuty services with configuration
- Incidents: Active and recent incidents with full context
- Users: User profiles and contact information
- Escalation Policies: Escalation chains and rules
- Schedules: On-call schedules and rotations
- Automatic K8s Linking: Connects services to deployments by name matching
MCP tools:

- `pd_create_incident` - Create new incidents
- `pd_acknowledge_incident` - Acknowledge incidents
- `pd_resolve_incident` - Resolve incidents
- `pd_list_incidents` - List incidents with filtering
- `pd_get_oncall` - Get on-call users for services
- `pd_add_note` - Add notes to incidents
- `pd_trigger_incident` - Trigger incidents via Events API v2
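Under the hood these tools map onto PagerDuty's REST API v2 (and the Events API v2 for `pd_trigger_incident`). As a hedged sketch of the call behind `pd_create_incident`, the raw REST equivalent looks roughly like this; the token, requester email, and service ID are placeholders:

```bash
# Sketch: create an incident via the PagerDuty REST API v2.
# The token, From email, and service ID below are placeholders.
curl -X POST "https://api.pagerduty.com/incidents" \
  -H "Authorization: Token token=u+ABCDEFGH" \
  -H "Accept: application/vnd.pagerduty+json;version=2" \
  -H "Content-Type: application/json" \
  -H "From: jane@example.com" \
  -d '{
    "incident": {
      "type": "incident",
      "title": "API high latency",
      "service": { "id": "PXYZ123", "type": "service_reference" },
      "urgency": "high"
    }
  }'
```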
Configuration (pagerduty-config.yaml):

discovery:
  enabled: true
  settings:
    api_token: "u+ABCDEFGH"  # PagerDuty REST API token
    # Optional filters
    service_ids: ["PXYZ123", "PXYZ456"]  # Specific services only
    team_ids: ["TXYZ789"]                # Filter by team
    # For K8s integration (service name matching)
    kubernetes_cluster: "production"

mcp:
  enabled: true
  settings:
    api_token: "u+ABCDEFGH"
    integration_key: "abc123"  # For triggering incidents via Events API
    namespace: "default"
Multiple accounts:

discovery:
  enabled: true
  settings:
    accounts:
      - name: "production"
        api_token: "u+PROD_TOKEN"
      - name: "staging"
        api_token: "u+STAGING_TOKEN"
    kubernetes_cluster: "production"
    namespace: "default"
Example discovered resources:

kind: PagerDutyService
metadata:
  name: "production/api-service"
  labels:
    status: "active"
    escalation_policy: "Default"
spec:
  account: "production"
  service_id: "PXYZ123"
  name: "API Service"
  status: "active"
  description: "Production API service"
  escalation_policy_id: "PXYZ789"
  alert_creation: "create_alerts_and_incidents"

kind: PagerDutyIncident
metadata:
  name: "production/QXYZ456"
  labels:
    status: "triggered"
    urgency: "high"
    service: "API Service"
spec:
  account: "production"
  incident_id: "QXYZ456"
  incident_number: 12345
  title: "High error rate on API service"
  status: "triggered"
  urgency: "high"
  created_at: "2024-02-10T16:00:00Z"
  service_id: "PXYZ123"
  service_name: "API Service"
  assigned_to_user_id: "UXYZ789"

kind: PagerDutyUser
metadata:
  name: "production/UXYZ789"
  labels:
    role: "user"
spec:
  account: "production"
  user_id: "UXYZ789"
  name: "Jane Doe"
  email: "jane@example.com"
  role: "user"

kind: PagerDutyEscalationPolicy
metadata:
  name: "production/PXYZ789"
spec:
  account: "production"
  policy_id: "PXYZ789"
  name: "Default"
  num_loops: 1
  on_call_handoff_notifications: "if_has_services"

kind: PagerDutySchedule
metadata:
  name: "production/SXYZ123"
spec:
  account: "production"
  schedule_id: "SXYZ123"
  name: "Primary On-Call"
  time_zone: "America/New_York"
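Once discovery has run, these resources can be queried from the graph. A minimal sketch, assuming the spec fields shown above (such as `name` and `status`) are exposed as node properties:

```bash
# List discovered PagerDuty services and their status
degree query "MATCH (s:PagerDutyService) RETURN s.name, s.status"
```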
| Relationship | From | To | How Detected |
|---|---|---|---|
| BELONGS_TO | Service/Incident/User | Account | Parent-child |
| ASSIGNED_TO | Incident | User | Incident assignment |
| ESCALATES_TO | Service | EscalationPolicy | Service configuration |
| AFFECTS | Incident | Service | Incident service reference |
| ON_CALL_FOR | User | Service | Schedule lookup |
| AFFECTS | Incident | K8sDeployment | Service name matching (future) |
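These relationships can be traversed directly. A hedged example that lists triggered incidents with their assignees, assuming the node labels and properties from the resource examples above:

```bash
# Triggered incidents and the users they are assigned to
degree query "MATCH (i:PagerDutyIncident)-[:ASSIGNED_TO]->(u:PagerDutyUser) WHERE i.status = 'triggered' RETURN i.title, u.name"
```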
degree discovery run pagerduty@1.0.0 --config pagerduty-config.yaml
degree discovery schedule pagerduty@1.0.0 --config pagerduty-config.yaml --interval 5m
"Who's on-call for the API service?"
AI: [Queries PagerDuty schedules]
"On-call for API Service:
Primary: Jane Doe (jane@example.com)
- Shift: Mon-Fri 9am-5pm EST
- Phone: +1-555-0100
Secondary: John Smith (john@example.com)
- Escalation: After 30 minutes"
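The data behind this answer comes from PagerDuty's `/oncalls` endpoint; a hedged curl equivalent, with the escalation policy ID and token as placeholders:

```bash
# Who is currently on call for escalation policy PXYZ789
curl -s -g "https://api.pagerduty.com/oncalls?escalation_policy_ids[]=PXYZ789" \
  -H "Authorization: Token token=u+ABCDEFGH" \
  -H "Accept: application/vnd.pagerduty+json;version=2"
```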
"What incidents happened during the last deployment?"
AI: [Cross-references K8s deployment time with PagerDuty incidents]
"3 incidents during api v1.2.4 deployment (15:40-16:00):
1. HighErrorRate (#12345)
- Status: Resolved
- Duration: 5 minutes
- Assigned to: Jane Doe
2. HighMemoryUsage (#12346)
- Status: Resolved
- Duration: 12 minutes
- Assigned to: Jane Doe
3. SlowResponseTime (#12347)
- Status: Acknowledged
- Duration: Ongoing (22 minutes)
- Assigned to: John Smith"
"Show me all critical incidents this week"
AI: "Critical incidents (last 7 days):
5 total incidents
├─ 4 resolved
└─ 1 ongoing
By service:
├─ API Service: 3 incidents
├─ Database: 1 incident
└─ Frontend: 1 incident
Average resolution time: 18 minutes
Current open:
- #12350: Database connection pool exhausted
Status: Acknowledged (8 min ago)
Assigned to: Jane Doe"
"Create an incident for API high latency"
AI: [Creates PagerDuty incident]
"✅ Incident created: #12351
Title: API high latency
Service: API Service
Urgency: High
Status: Triggered
Actions taken:
├─ Incident created in PagerDuty
├─ Jane Doe paged (on-call)
└─ Notification sent to #oncall Slack channel
Incident URL: https://example.pagerduty.com/incidents/QXYZ789"
PrometheusAlert → TRIGGERS → PagerDutyIncident
Prometheus alert fires
↓ auto-creates
PagerDutyIncident
↓ pages
PagerDutyUser (on-call)
↓ notifies
SlackChannel (#oncall)
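A common way to wire the first hop of this chain is an Alertmanager receiver pointed at the service's Events API v2 integration. A minimal sketch, reusing the integration key from the MCP settings above:

```yaml
# Alertmanager sketch: send firing alerts to PagerDuty via the Events API v2
route:
  receiver: pagerduty-oncall
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "abc123"   # Events API v2 integration key
        severity: critical
```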
You: "Did this Prometheus alert create a PagerDuty incident?"
AI: "Yes! Alert HighErrorRate created incident #12345:
Alert: HighErrorRate (fired 15:47)
├─ Triggered PagerDuty incident #12345
├─ Paged Jane Doe (on-call)
├─ Notified #oncall channel
└─ Status: Acknowledged (15:50)
Timeline:
15:47 - Alert fired
15:47 - Incident created
15:48 - Jane paged
15:50 - Incident acknowledged
Next steps: Jane is investigating"
PagerDutyService → AFFECTS → K8sDeployment
You: "Which team owns the API deployment?"
AI: [Links K8s → PagerDuty → Users]
"API deployment ownership:
PagerDuty Service: API Service
├─ Escalation: Default policy
├─ Primary on-call: Jane Doe
└─ Secondary on-call: John Smith
Contact:
- Slack: #api-team
- Email: api-team@example.com
- PagerDuty: API Service"
Complete incident root cause analysis:
PrometheusAlert fires
↓ creates
PagerDutyIncident
↓ affects
K8sDeployment
↓ deployed by
ArgoApplication
↓ synced from
GitHubWorkflowRun
You: "What caused incident #12345?"
AI: "Incident #12345 root cause:
Incident: HighErrorRate
├─ Created: 15:47 (Prometheus alert)
├─ Service: API Service
├─ Status: Resolved (15:55)
└─ Assigned to: Jane Doe
Root cause:
├─ Deployment: api v1.2.4 (deployed 15:40)
├─ CI Run: GitHub Actions #456
├─ Commit: abc123 by jane@example.com
└─ Change: Updated error handling
Timeline:
15:40 - Deployment started
15:42 - Pods rolled out
15:47 - Alert fired → Incident created
15:48 - Jane acknowledged
15:50 - Rollback initiated
15:55 - Incident resolved
Post-mortem:
- Error rate spiked from 0.2% to 15%
- All 3 pods affected
- Rollback to v1.2.3 resolved issue
- PR #457 fixes root cause"
PagerDutyIncident → NOTIFIES → SlackChannel
Incident lifecycle in Slack:
15:47 - Incident created:
"🚨 New Incident: #12345
HighErrorRate - API Service
Urgency: High
Assigned to: @jane"
15:50 - Incident acknowledged:
"✅ Incident #12345 acknowledged by @jane
Investigating..."
15:52 - Note added:
"💬 @jane: Root cause identified - rolling back deployment"
15:55 - Incident resolved:
"✅ Incident #12345 resolved
Duration: 8 minutes
Resolution: Rollback to v1.2.3"
Application Error
↓ detected by
PrometheusAlert (HighErrorRate)
↓ triggers
PagerDutyIncident (created)
↓ pages
PagerDutyUser (on-call: Jane)
↓ notifies
SlackChannel (#oncall)
↓ investigates
K8sDeployment (recent change?)
↓ traces to
ArgoApplication (deployment)
↓ synced from
GitHubWorkflowRun (#456)
↓ built
GitCommit (abc123)
↓ action
Rollback via Argo CD
↓ resolves
PagerDutyIncident (resolved)
↓ updates
SlackChannel (#oncall)
You: "Who's on-call this week?"
AI: On-call schedule (Feb 10-16, 2024):
API Service:
├─ Primary: Jane Doe
│ Mon-Fri: 9am-5pm EST
├─ Secondary: John Smith
│ Mon-Fri: 5pm-9am EST
└─ Weekend: Bob Johnson
Sat-Sun: All day
Database Service:
├─ Primary: Alice Williams
└─ Secondary: Charlie Brown
You: "Show me incident trends for the API service"
AI: API Service incidents (last 30 days):
Total: 45 incidents
├─ Critical: 8 (18%)
├─ High: 15 (33%)
└─ Low: 22 (49%)
Status:
├─ Resolved: 43 (96%)
└─ Ongoing: 2 (4%)
Average resolution time: 22 minutes
Top causes:
1. High error rate: 15 incidents
2. High memory usage: 12 incidents
3. Slow queries: 8 incidents
Trend: ↓ 23% reduction from last month
You: "What incidents were caused by deployments?"
AI: Deployment-related incidents (last 30 days):
18 incidents linked to deployments
By deployment:
v1.2.4 (Feb 10): 3 incidents
- HighErrorRate: Resolved (5 min)
- HighMemory: Resolved (12 min)
- SlowResponse: Resolved (22 min)
v1.2.3 (Feb 9): 0 incidents ✅
v1.2.2 (Feb 8): 1 incident
- DatabaseConnection: Resolved (45 min)
Recommendation: v1.2.3 is most stable
You: "Create an incident if error rate exceeds 5%"
AI: [Sets up Prometheus → PagerDuty integration]
"✅ Alert rule configured:
Prometheus Rule:
- Name: HighErrorRate
- Query: rate(errors) > 0.05
- Duration: 5 minutes
PagerDuty Integration:
- Service: API Service
- Urgency: High
- Auto-assign: On-call user
Actions on alert:
1. Create PagerDuty incident
2. Page on-call user
3. Post to #oncall Slack channel
4. Include runbook link
Test this rule? [Yes/No]"
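The Prometheus half of that rule could look like the sketch below. Only the 5% threshold and 5-minute duration come from the conversation above; the metric names, labels, and runbook URL are assumptions:

```yaml
groups:
  - name: api-service-alerts
    rules:
      - alert: HighErrorRate
        # Metric names are placeholders; use whatever your application exports
        expr: rate(http_request_errors_total[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
          service: api-service
        annotations:
          summary: "API error rate above 5%"
          runbook_url: "https://example.com/runbooks/high-error-rate"
```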
- PagerDuty REST API v2: https://developer.pagerduty.com/api-reference/
- Events API v2: https://developer.pagerduty.com/docs/events-api-v2/overview/
Get your API token:
- Log in to PagerDuty
- Navigate to Configuration → API Access
- Create a new API token with Full Access or Read-only permissions
- Tokens use the format: u+ABCDEFGHIJKLMNOPQRSTUVW
For Events API v2 (triggering incidents):
- Navigate to Service → Integrations
- Add Integration → Events API v2
- Copy the Integration Key
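To confirm the integration key works, you can send a test event straight to the Events API v2. A minimal curl sketch; the routing key is the integration key copied above:

```bash
curl -X POST "https://events.pagerduty.com/v2/enqueue" \
  -H "Content-Type: application/json" \
  -d '{
    "routing_key": "abc123",
    "event_action": "trigger",
    "payload": {
      "summary": "Test event from PagerDuty plugin setup",
      "source": "degree-setup",
      "severity": "info"
    }
  }'
```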
Error: 401 Unauthorized
Solution: Verify the API token in your config:
settings:
  api_token: "u+YOUR_TOKEN_HERE"
If services aren't linking to K8s deployments, compare the discovered names on both sides:
degree query "MATCH (s:PagerDutyService), (d:K8sDeployment) RETURN s.name, d.name"
Error: 429 Too Many Requests
Solution: Increase the discovery interval:
degree discovery schedule pagerduty@1.0.0 --interval 15m
Use consistent naming between PagerDuty and K8s:
PagerDuty Service: "api-service"
K8s Deployment: "api-service"
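On the Kubernetes side, that means the Deployment name should match the PagerDuty service name exactly. A minimal sketch of a matching Deployment; the image and labels are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service   # must match the PagerDuty service name for auto-linking
  labels:
    app: api-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      containers:
        - name: api-service
          image: registry.example.com/api-service:v1.2.4   # placeholder image
```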
Structure escalation policies clearly:
Level 1: Primary on-call (immediate)
Level 2: Secondary on-call (after 5 min)
Level 3: Team lead (after 15 min)
Level 4: Manager (after 30 min)
degree discovery schedule pagerduty@1.0.0 \
--config pagerduty-config.yaml \
--interval 5m
Test the complete flow:
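One hedged way to do that: fire a test event through the service's Events API v2 integration, re-run discovery, and confirm the incident shows up in the graph (the integration key and property names are assumptions):

```bash
# 1. Trigger a test incident via the Events API v2 (integration key is a placeholder)
curl -s -X POST "https://events.pagerduty.com/v2/enqueue" \
  -H "Content-Type: application/json" \
  -d '{"routing_key": "abc123", "event_action": "trigger", "payload": {"summary": "End-to-end test", "source": "degree-test", "severity": "info"}}'

# 2. Re-run discovery and confirm the incident appears
degree discovery run pagerduty@1.0.0 --config pagerduty-config.yaml
degree query "MATCH (i:PagerDutyIncident) WHERE i.status = 'triggered' RETURN i.title"

# 3. Resolve the test incident in PagerDuty once the flow is confirmed
```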
MIT License - see LICENSE