
NeuroCognitive Architecture (NCA) Incident Response Runbook¶
This runbook provides a structured approach for responding to incidents that may occur in the NCA production environment. It outlines the steps for identifying, responding to, and resolving incidents while minimizing impact on users.
Incident Severity Levels¶
Incidents are classified by severity level to determine appropriate response procedures:
| Level | Description | Examples | Response Time | Escalation |
|---|---|---|---|---|
| P1 | Critical service outage | API completely down, data loss, security breach | Immediate | Leadership + on-call team |
| P2 | Partial service disruption | High latency, feature failure, degraded performance | < 30 minutes | On-call team |
| P3 | Minor service impact | Non-critical bugs, isolated errors, minor performance issues | < 4 hours | Primary on-call |
| P4 | No user impact | Warning signs, potential future issues | Next business day | Team aware |
Incident Response Workflow¶
1. Detection¶
- Automated Detection (see the example checks after this list)
    - System health alerts from Prometheus
    - Latency spikes detected by ServiceMonitor
    - Error rate increases in logs
    - Memory/CPU utilization alerts
- Manual Detection
    - User-reported issues
    - Regular system health checks
    - Deployment observations
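When an automated alert fires, a quick confirmation against the Prometheus API can look like the sketch below. The metric name http_requests_total and its labels are assumptions for illustration; substitute the actual NCA metric names.
# List currently firing alerts
curl -s http://prometheus:9090/api/v1/alerts | jq '.data.alerts[] | {labels, state}'
# Error-rate query over the last 5 minutes (metric and label names are assumptions)
curl -s -G http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(http_requests_total{namespace="neuroca",status=~"5.."}[5m]))' | jq '.data.result'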
2. Assessment & Classification¶
When an incident is detected:
- Determine the scope (see the commands after this list)
    - Which components are affected?
    - Is it impacting users?
    - Is it affecting all users or only specific segments?
- Classify severity based on the level definitions above
- Initial documentation in the incident management system
    - Incident ID
    - Severity level
    - Brief description
    - Affected components
    - Detection method
    - Initial responder(s)
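As a sketch, an initial scope check might combine pod status, recent events, and the readiness endpoint; the namespace and service names follow the ones used later in this runbook.
# Are any pods crash-looping or not ready?
kubectl get pods -n neuroca -o wide
# Recent cluster events, newest last
kubectl get events -n neuroca --sort-by='.lastTimestamp' | tail -20
# Is the service still answering health checks?
curl -s http://neuroca-service/health/readiness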
3. Response¶
For P1 (Critical) Incidents:¶
- Activate incident management
    - Notify on-call team via PagerDuty
    - Create incident channel in Slack
    - Designate Incident Commander (IC)
- Immediate mitigation
    - Consider emergency rollback to last known good version (see Rollback Procedure below)
    - Implement circuit breakers if applicable
    - Scale up resources if resource-related (see the example after this list)
- Client communication
    - Post to status page
    - Send initial notification to affected clients
    - Establish communication cadence
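If the incident is resource-related, an emergency scale-up is one possible stopgap; the replica count below is only an example and should be chosen based on current load and cluster capacity.
# Scale the deployment out as a temporary mitigation (replica count is an example)
kubectl scale deployment/neuroca -n neuroca --replicas=6
# Watch the new pods become ready
kubectl rollout status deployment/neuroca -n neuroca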
For P2 (Major) Incidents:¶
- Notify on-call team
    - Primary responder to lead
    - Escalate if necessary
- Implement mitigation
    - Apply fixes from playbooks if available
    - Isolate affected components if possible
- Client communication
    - Update status page if user-visible
    - Prepare client communication
For P3/P4 (Minor) Incidents:¶
- Assign to primary on-call or team
- Implement mitigation during business hours
- Document in tracking system
4. Investigation¶
- Gather diagnostic information (see the commands after this list)
- Examine logs and traces
    - Check application logs
    - Review Prometheus metrics
    - Analyze request traces in Jaeger
- Perform root cause analysis, checking common causes:
    - Memory leaks
    - API throttling issues
    - Database connection problems
    - External dependency failures
    - Recent changes or deployments
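A typical diagnostic sweep might look like the following; the label selector app=neuroca matches the one used in Helpful Commands, while the Jaeger query URL and service name are assumptions about the tracing setup.
# Pull recent logs and filter for errors
kubectl logs -n neuroca -l app=neuroca --tail=500 | grep -i error
# Inspect current resource usage (requires metrics-server)
kubectl top pods -n neuroca
# Fetch recent error traces from Jaeger (service name and endpoint are assumptions)
curl -s 'http://jaeger-query:16686/api/traces?service=neuroca&tags=%7B%22error%22%3A%22true%22%7D&limit=20' | jq '.data | length'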
5. Resolution¶
- Implement permanent fix
    - Deploy hotfix if needed
    - Validate fix in production
    - Verify monitoring confirms resolution (see the verification commands after this list)
- Document resolution
    - Update incident report
    - Note fixed version
    - Document workarounds used
- Client communication
    - Notify of resolution
    - Update status page
    - Provide explanation if appropriate
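Verification after a fix can be as simple as confirming the rollout finished, the health endpoint responds, and the triggering alert has cleared.
# Confirm the hotfix rollout completed
kubectl rollout status deployment/neuroca -n neuroca
# Re-check service health
curl -s http://neuroca-service/health/readiness
# Confirm no related alerts are still firing
curl -s http://prometheus:9090/api/v1/alerts | jq '.data.alerts[] | select(.state=="firing") | .labels.alertname'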
6. Post-incident Follow-up¶
- Conduct post-mortem
    - Schedule within 24-48 hours of resolution
    - Include all participants
    - Document timeline
    - Identify root causes
    - No-blame approach
- Generate action items
    - Preventative measures
    - Detection improvements
    - Response enhancements
    - Documentation updates
- Knowledge sharing
    - Update runbooks with new findings
    - Share lessons learned with team
    - Improve monitoring if gaps identified
Common Incident Scenarios¶
API Latency Spike¶
- Check CPU/Memory usage (see the command sketch after this list)
- Check database connection pool
    - Query the database metrics
    - Look for connection limits
- Check external API dependencies
    - Review Redis, OpenAI, etc.
- Examine recent deployments
    - Any recent code changes?
    - New dependencies?
- Actions
    - Scale horizontally if resource-bound
    - Increase connection pool if DB-related
    - Implement circuit breakers if dependency issues
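A command-level sketch of these checks, assuming direct psql access to the Postgres pod as shown in the Database section and metrics-server for kubectl top.
# Per-pod CPU/memory usage
kubectl top pods -n neuroca
# Connection pool pressure: connections by state and the configured limit
kubectl exec -it -n neuroca <postgres-pod> -- psql -U postgres -d neuroca \
  -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;" \
  -c "SHOW max_connections;"
# Horizontal scale-out if resource-bound (replica count is an example)
kubectl scale deployment/neuroca -n neuroca --replicas=6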
Memory Leak¶
- Verify with increasing memory trend (see the commands after this list)
    - Check Prometheus graphs for memory growth pattern
- Collect heap dumps
- Analyze memory usage
    - Look for large object allocations
    - Check for unbounded caches
- Actions
    - Rolling restart if immediate mitigation needed
    - Deploy fix addressing memory leak
    - Add memory bounds to caches
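To confirm the trend and apply the immediate mitigation, something like the following can be used; kubectl top requires metrics-server, and heap dump collection depends on the application runtime so it is not shown here.
# Watch per-pod memory, highest first; repeat over time to see growth
kubectl top pods -n neuroca --sort-by=memory
# Rolling restart as a stopgap while the fix is prepared
kubectl rollout restart deployment/neuroca -n neuroca
kubectl rollout status deployment/neuroca -n neuroca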
Database Performance Issues¶
- Check query performance (see the example queries after this list)
- Examine index usage
- Check connection pool
    - Look for maxed-out connections
    - Check for connection leaks
- Actions
    - Add needed indexes
    - Optimize slow queries
    - Increase connection timeouts if needed
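The queries below illustrate these checks; the first one assumes the pg_stat_statements extension is enabled, and the timing column is named mean_time on PostgreSQL 12 and earlier.
-- Slowest queries by average execution time (requires pg_stat_statements)
SELECT query, calls, mean_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
-- Tables with heavy sequential scans (possible missing indexes)
SELECT relname, seq_scan, idx_scan
FROM pg_stat_user_tables
ORDER BY seq_scan DESC
LIMIT 10;
-- Connections by state; many 'idle in transaction' sessions suggest a connection leak
SELECT count(*), state FROM pg_stat_activity GROUP BY state;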
Emergency Contacts¶
| Role | Primary | Secondary | Contact Method |
|---|---|---|---|
| Database Admin | [NAME] | [NAME] | Slack @dbadmin, Phone |
| Infrastructure Lead | [NAME] | [NAME] | Slack @infrateam, Phone |
| Security Officer | [NAME] | [NAME] | Slack @security, Phone |
| Engineering Lead | [NAME] | [NAME] | Slack @eng-lead, Phone |
Rollback Procedure¶
If a deployment needs to be rolled back:
# Check deployment history
kubectl rollout history deployment/neuroca -n neuroca
# Roll back to previous version
kubectl rollout undo deployment/neuroca -n neuroca
# Roll back to specific version
kubectl rollout undo deployment/neuroca -n neuroca --to-revision=<revision_number>
# Monitor rollback
kubectl rollout status deployment/neuroca -n neuroca
Helpful Commands¶
Kubernetes¶
# Get pod logs
kubectl logs -n neuroca <pod-name>
# Get pod logs for all containers in a pod
kubectl logs -n neuroca <pod-name> --all-containers
# Describe pod for detailed information
kubectl describe pod -n neuroca <pod-name>
# Get events
kubectl get events -n neuroca --sort-by='.lastTimestamp'
# Exec into container
kubectl exec -it -n neuroca <pod-name> -- /bin/bash
# Port forward to service
kubectl port-forward -n neuroca svc/neuroca 8000:80
Monitoring¶
# Check Prometheus alerts
curl -s http://prometheus:9090/api/v1/alerts | jq
# Check service health
curl -s http://neuroca-service/health/readiness
# Get recent logs
kubectl logs -n neuroca -l app=neuroca --tail=100
Database¶
# Connect to database
kubectl exec -it -n neuroca <postgres-pod> -- psql -U postgres -d neuroca
# Check connection count
SELECT count(*), state FROM pg_stat_activity GROUP BY state;
# Check table sizes
SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) AS size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;
Regular Drills¶
Schedule regular incident response drills to ensure the team is prepared:
- Quarterly Gameday exercises
    - Simulate P1 incidents
    - Practice coordination
    - Test communication channels
- Monthly Runbook reviews
    - Update with new information
    - Add newly discovered issues
    - Remove obsolete information
- On-call readiness checks
    - Verify access to all systems
    - Review escalation procedures
    - Update contact information