
NeuroCognitive Architecture (NCA) Incident Response Runbook¶
This runbook provides a structured approach for responding to incidents that may occur in the NCA production environment. It outlines the steps for identifying, responding to, and resolving incidents while minimizing impact on users.
Incident Severity Levels¶
Incidents are classified by severity level to determine appropriate response procedures:
| Level | Description | Examples | Response Time | Escalation |
|---|---|---|---|---|
| P1 | Critical service outage | API completely down, data loss, security breach | Immediate | Leadership + on-call team |
| P2 | Partial service disruption | High latency, feature failure, degraded performance | < 30 minutes | On-call team |
| P3 | Minor service impact | Non-critical bugs, isolated errors, minor performance issues | < 4 hours | Primary on-call |
| P4 | No user impact | Warning signs, potential future issues | Next business day | Team aware |
Incident Response Workflow¶
1. Detection¶
- Automated Detection (see the example checks after this list)
    - System health alerts from Prometheus
    - Latency spikes detected by ServiceMonitor
    - Error rate increases in logs
    - Memory/CPU utilization alerts
- Manual Detection
    - User-reported issues
    - Regular system health checks
    - Deployment observations
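When an automated alert fires, a quick confirmation against the Prometheus API can look like the sketch below. The metric name http_requests_total and its labels are assumptions for illustration; substitute the actual NCA metric names.
# List currently firing alerts
curl -s http://prometheus:9090/api/v1/alerts | jq '.data.alerts[] | {labels, state}'
# Error-rate query over the last 5 minutes (metric and label names are assumptions)
curl -s -G http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(http_requests_total{namespace="neuroca",status=~"5.."}[5m]))' | jq '.data.result'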
2. Assessment & Classification¶
When an incident is detected:
- Determine the scope (see the commands after this list)
    - Which components are affected?
    - Is it impacting users?
    - Is it affecting all users or only specific segments?
- Classify severity based on the level definitions above
- Initial documentation in the incident management system
    - Incident ID
    - Severity level
    - Brief description
    - Affected components
    - Detection method
    - Initial responder(s)
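As a sketch, an initial scope check might combine pod status, recent events, and the readiness endpoint; the namespace and service names follow the ones used later in this runbook.
# Are any pods crash-looping or not ready?
kubectl get pods -n neuroca -o wide
# Recent cluster events, newest last
kubectl get events -n neuroca --sort-by='.lastTimestamp' | tail -20
# Is the service still answering health checks?
curl -s http://neuroca-service/health/readiness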
3. Response¶
For P1 (Critical) Incidents:¶
- Activate incident management
    - Notify on-call team via PagerDuty
    - Create incident channel in Slack
    - Designate Incident Commander (IC)
- Immediate mitigation
    - Consider emergency rollback to last known good version (see Rollback Procedure below)
    - Implement circuit breakers if applicable
    - Scale up resources if resource-related (see the example after this list)
- Client communication
    - Post to status page
    - Send initial notification to affected clients
    - Establish communication cadence
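If the incident is resource-related, an emergency scale-up is one possible stopgap; the replica count below is only an example and should be chosen based on current load and cluster capacity.
# Scale the deployment out as a temporary mitigation (replica count is an example)
kubectl scale deployment/neuroca -n neuroca --replicas=6
# Watch the new pods become ready
kubectl rollout status deployment/neuroca -n neuroca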
For P2 (Major) Incidents:¶
- Notify on-call team
    - Primary responder to lead
    - Escalate if necessary
- Implement mitigation
    - Apply fixes from playbooks if available
    - Isolate affected components if possible
- Client communication
    - Update status page if user-visible
    - Prepare client communication
For P3/P4 (Minor) Incidents:¶
- Assign to primary on-call or team
- Implement mitigation during business hours
- Document in tracking system
4. Investigation¶
- Gather diagnostic information (see the commands after this list)
- Examine logs and traces
    - Check application logs
    - Review Prometheus metrics
    - Analyze request traces in Jaeger
- Perform root cause analysis, checking common causes:
    - Memory leaks
    - API throttling issues
    - Database connection problems
    - External dependency failures
    - Recent changes or deployments
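A typical diagnostic sweep might look like the following; the label selector app=neuroca matches the one used in Helpful Commands, while the Jaeger query URL and service name are assumptions about the tracing setup.
# Pull recent logs and filter for errors
kubectl logs -n neuroca -l app=neuroca --tail=500 | grep -i error
# Inspect current resource usage (requires metrics-server)
kubectl top pods -n neuroca
# Fetch recent error traces from Jaeger (service name and endpoint are assumptions)
curl -s 'http://jaeger-query:16686/api/traces?service=neuroca&tags=%7B%22error%22%3A%22true%22%7D&limit=20' | jq '.data | length'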
5. Resolution¶
- Implement permanent fix
    - Deploy hotfix if needed
    - Validate fix in production
    - Verify monitoring confirms resolution (see the verification commands after this list)
- Document resolution
    - Update incident report
    - Note fixed version
    - Document workarounds used
- Client communication
    - Notify of resolution
    - Update status page
    - Provide explanation if appropriate
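Verification after a fix can be as simple as confirming the rollout finished, the health endpoint responds, and the triggering alert has cleared.
# Confirm the hotfix rollout completed
kubectl rollout status deployment/neuroca -n neuroca
# Re-check service health
curl -s http://neuroca-service/health/readiness
# Confirm no related alerts are still firing
curl -s http://prometheus:9090/api/v1/alerts | jq '.data.alerts[] | select(.state=="firing") | .labels.alertname'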
6. Post-incident Follow-up¶
- Conduct post-mortem
    - Schedule within 24-48 hours of resolution
    - Include all participants
    - Document timeline
    - Identify root causes
    - No-blame approach
- Generate action items
    - Preventative measures
    - Detection improvements
    - Response enhancements
    - Documentation updates
- Knowledge sharing
    - Update runbooks with new findings
    - Share lessons learned with team
    - Improve monitoring if gaps identified
Common Incident Scenarios¶
API Latency Spike¶
- Check CPU/Memory usage (see the command sketch after this list)
- Check database connection pool
    - Query the database metrics
    - Look for connection limits
- Check external API dependencies
    - Review Redis, OpenAI, etc.
- Examine recent deployments
    - Any recent code changes?
    - New dependencies?
- Actions
    - Scale horizontally if resource-bound
    - Increase connection pool if DB-related
    - Implement circuit breakers if dependency issues
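A command-level sketch of these checks, assuming direct psql access to the Postgres pod as shown in the Database section and metrics-server for kubectl top.
# Per-pod CPU/memory usage
kubectl top pods -n neuroca
# Connection pool pressure: connections by state and the configured limit
kubectl exec -it -n neuroca <postgres-pod> -- psql -U postgres -d neuroca \
  -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;" \
  -c "SHOW max_connections;"
# Horizontal scale-out if resource-bound (replica count is an example)
kubectl scale deployment/neuroca -n neuroca --replicas=6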
Memory Leak¶
- Verify with increasing memory trend (see the commands after this list)
    - Check Prometheus graphs for memory growth pattern
- Collect heap dumps
- Analyze memory usage
    - Look for large object allocations
    - Check for unbounded caches
- Actions
    - Rolling restart if immediate mitigation needed
    - Deploy fix addressing memory leak
    - Add memory bounds to caches
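To confirm the trend and apply the immediate mitigation, something like the following can be used; kubectl top requires metrics-server, and heap dump collection depends on the application runtime so it is not shown here.
# Watch per-pod memory, highest first; repeat over time to see growth
kubectl top pods -n neuroca --sort-by=memory
# Rolling restart as a stopgap while the fix is prepared
kubectl rollout restart deployment/neuroca -n neuroca
kubectl rollout status deployment/neuroca -n neuroca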
Database Performance Issues¶
- Check query performance (see the example queries after this list)
- Examine index usage
- Check connection pool
    - Look for maxed-out connections
    - Check for connection leaks
- Actions
    - Add needed indexes
    - Optimize slow queries
    - Increase connection timeouts if needed
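The queries below illustrate these checks; the first one assumes the pg_stat_statements extension is enabled, and the timing column is named mean_time on PostgreSQL 12 and earlier.
-- Slowest queries by average execution time (requires pg_stat_statements)
SELECT query, calls, mean_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
-- Tables with heavy sequential scans (possible missing indexes)
SELECT relname, seq_scan, idx_scan
FROM pg_stat_user_tables
ORDER BY seq_scan DESC
LIMIT 10;
-- Connections by state; many 'idle in transaction' sessions suggest a connection leak
SELECT count(*), state FROM pg_stat_activity GROUP BY state;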
Emergency Contacts¶
| Role | Primary | Secondary | Contact Method |
|---|---|---|---|
| Database Admin | [NAME] | [NAME] | Slack @dbadmin, Phone |
| Infrastructure Lead | [NAME] | [NAME] | Slack @infrateam, Phone |
| Security Officer | [NAME] | [NAME] | Slack @security, Phone |
| Engineering Lead | [NAME] | [NAME] | Slack @eng-lead, Phone |
Rollback Procedure¶
If a deployment needs to be rolled back:
# Check deployment history
kubectl rollout history deployment/neuroca -n neuroca
# Roll back to previous version
kubectl rollout undo deployment/neuroca -n neuroca
# Roll back to specific version
kubectl rollout undo deployment/neuroca -n neuroca --to-revision=<revision_number>
# Monitor rollback
kubectl rollout status deployment/neuroca -n neuroca
Helpful Commands¶
Kubernetes¶
# Get pod logs
kubectl logs -n neuroca <pod-name>
# Get pod logs for all containers in a pod
kubectl logs -n neuroca <pod-name> --all-containers
# Describe pod for detailed information
kubectl describe pod -n neuroca <pod-name>
# Get events
kubectl get events -n neuroca --sort-by='.lastTimestamp'
# Exec into container
kubectl exec -it -n neuroca <pod-name> -- /bin/bash
# Port forward to service
kubectl port-forward -n neuroca svc/neuroca 8000:80
Monitoring¶
# Check Prometheus alerts
curl -s http://prometheus:9090/api/v1/alerts | jq
# Check service health
curl -s http://neuroca-service/health/readiness
# Get recent logs
kubectl logs -n neuroca -l app=neuroca --tail=100
Database¶
# Connect to database
kubectl exec -it -n neuroca <postgres-pod> -- psql -U postgres -d neuroca
# Check connection count
SELECT count(*), state FROM pg_stat_activity GROUP BY state;
# Check table sizes
SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) AS size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;
Regular Drills¶
Schedule regular incident response drills to ensure the team is prepared:
- Quarterly Gameday exercises
    - Simulate P1 incidents
    - Practice coordination
    - Test communication channels
- Monthly Runbook reviews
    - Update with new information
    - Add newly discovered issues
    - Remove obsolete information
- On-call readiness checks
    - Verify access to all systems
    - Review escalation procedures
    - Update contact information