NeuroCognitive Architecture (NCA) Incident Response Runbook

This runbook provides a structured approach to handling incidents in the NCA production environment. It outlines the steps for identifying, responding to, and resolving incidents while minimizing the impact on users.

Incident Severity Levels

Incidents are classified by severity level to determine appropriate response procedures:

| Level | Description | Examples | Response Time | Escalation |
|-------|-------------|----------|---------------|------------|
| P1 | Critical service outage | API completely down; data loss; security breach | Immediate | Leadership + on-call team |
| P2 | Partial service disruption | High latency; feature failure; degraded performance | < 30 minutes | On-call team |
| P3 | Minor service impact | Non-critical bugs; isolated errors; minor performance issues | < 4 hours | Primary on-call |
| P4 | No user impact | Warning signs; potential future issues | Next business day | Team aware |

Incident Response Workflow

1. Detection

  • Automated Detection
    • System health alerts from Prometheus
    • Latency spikes detected by ServiceMonitor
    • Error rate increases in logs (a sample query is sketched below)
    • Memory/CPU utilization alerts

  • Manual Detection
    • User-reported issues
    • Regular system health checks
    • Deployment observations
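
To confirm an automated signal quickly, you can query the Prometheus HTTP API directly. The sketch below is illustrative: the prometheus hostname matches the Monitoring commands later in this runbook, but the http_requests_total metric and its labels are assumptions that depend on how the service is instrumented.

# Check which alerts are currently firing
curl -s http://prometheus:9090/api/v1/alerts | jq '.data.alerts[] | {alert: .labels.alertname, state: .state}'

# Approximate 5xx error rate over the last 5 minutes (metric and label names are assumptions)
curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{namespace="neuroca", status=~"5.."}[5m]))' | jq '.data.result'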

2. Assessment & Classification

When an incident is detected:

  1. Determine the scope
     • Which components are affected?
     • Is it impacting users?
     • Is it affecting all users or only specific segments?

  2. Classify severity based on the level definitions above

  3. Create the initial record in the incident management system, capturing:
     • Incident ID
     • Severity level
     • Brief description
     • Affected components
     • Detection method
     • Initial responder(s)

3. Response

For P1 (Critical) Incidents:

  1. Activate incident management
     • Notify on-call team via PagerDuty
     • Create incident channel in Slack
     • Designate Incident Commander (IC)

  2. Immediate mitigation (example commands are sketched after this list)
     • Consider emergency rollback to last known good version
     • Implement circuit breakers if applicable
     • Scale up resources if resource-related

  3. Client communication
     • Post to status page
     • Send initial notification to affected clients
     • Establish communication cadence
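
As a starting point for immediate mitigation, the commands below are a minimal sketch. The deployment name and namespace follow the conventions used elsewhere in this runbook; the replica count is illustrative.

# Roll back to the last known good version (full details in the Rollback Procedure section)
kubectl rollout undo deployment/neuroca -n neuroca

# Scale out if the incident is resource-related (replica count is illustrative)
kubectl scale deployment/neuroca -n neuroca --replicas=6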

For P2 (Major) Incidents:

  1. Notify on-call team
     • Primary responder leads the response
     • Escalate if necessary

  2. Implement mitigation
     • Apply fixes from playbooks if available
     • Isolate affected components if possible (see the sketch after this list)

  3. Client communication
     • Update status page if user-visible
     • Prepare client communication
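
One way to isolate a misbehaving pod without killing it is to remove it from the Service's selector so it stops receiving traffic but stays running for debugging. This is a sketch that assumes the Service selects pods on the app=neuroca label, consistent with the label selectors used elsewhere in this runbook.

# Drop the selector label so the pod stops receiving traffic but remains available for debugging
kubectl label pod <pod-name> -n neuroca app-

# Confirm the pod has been removed from the Service endpoints
kubectl get endpoints neuroca -n neuroca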

For P3/P4 (Minor) Incidents:

  1. Assign to primary on-call or team
  2. Implement mitigation during business hours
  3. Document in tracking system

4. Investigation

  1. Gather diagnostic information

    # Get pod logs
    kubectl logs -n neuroca -l app=neuroca --tail=500
    
    # Check pod status
    kubectl get pods -n neuroca
    
    # Check memory usage
    kubectl top pod -n neuroca
    
    # Watch metrics
    kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
    

  2. Examine logs and traces
     • Check application logs
     • Review Prometheus metrics
     • Analyze request traces in Jaeger (port-forward sketch below)

  3. Perform root cause analysis, checking for:
     • Memory leaks
     • API throttling issues
     • Database connection problems
     • External dependency failures
     • Recent changes or deployments
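
Accessing the tracing and metrics UIs typically requires a port-forward. The Prometheus command mirrors the one used above; the Jaeger service name and namespace are assumptions and may differ in your cluster.

# Jaeger UI (service name and namespace are assumptions; adjust to your cluster)
kubectl port-forward -n monitoring svc/jaeger-query 16686:16686

# Prometheus UI, as used earlier in this runbook
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090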

5. Resolution

  1. Implement permanent fix
     • Deploy hotfix if needed
     • Validate fix in production (see the checks sketched below)
     • Verify monitoring confirms resolution

  2. Document resolution
     • Update incident report
     • Note fixed version
     • Document workarounds used

  3. Client communication
     • Notify of resolution
     • Update status page
     • Provide explanation if appropriate
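
A minimal post-fix validation sketch, using the deployment, namespace, and label conventions from this runbook:

# Confirm the fixed version rolled out cleanly
kubectl rollout status deployment/neuroca -n neuroca

# Spot-check that error logs have quieted down since the fix
kubectl logs -n neuroca -l app=neuroca --since=10m | grep -i error | tail -20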

6. Post-incident Follow-up

  1. Conduct post-mortem
     • Schedule within 24-48 hours of resolution
     • Include all participants
     • Document timeline
     • Identify root causes
     • Take a no-blame approach

  2. Generate action items
     • Preventative measures
     • Detection improvements
     • Response enhancements
     • Documentation updates

  3. Knowledge sharing
     • Update runbooks with new findings
     • Share lessons learned with the team
     • Improve monitoring if gaps are identified

Common Incident Scenarios

API Latency Spike

  1. Check CPU/Memory usage

    kubectl top pods -n neuroca
    

  2. Check database connection pool
     • Query the database metrics
     • Look for connection limits

  3. Check external API dependencies (quick checks are sketched below)
     • Review Redis, OpenAI, etc.

  4. Examine recent deployments
     • Any recent code changes?
     • New dependencies?

  5. Actions
     • Scale horizontally if resource-bound
     • Increase the connection pool if DB-related
     • Implement circuit breakers for dependency issues
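
Quick checks for the dependency and connection-pool angles are sketched below. The pod names are placeholders, and the database and user names mirror the Helpful Commands section.

# Check Redis responsiveness (pod name is a placeholder)
kubectl exec -it -n neuroca <redis-pod> -- redis-cli ping

# Count active database connections per state
kubectl exec -it -n neuroca <postgres-pod> -- psql -U postgres -d neuroca \
  -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"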

Memory Leak

  1. Verify the increasing memory trend
     • Check Prometheus graphs for a memory growth pattern

  2. Collect heap dumps

    # Get pod name
    POD=$(kubectl get pod -n neuroca -l app=neuroca -o jsonpath='{.items[0].metadata.name}')
    
    # Execute heap dump
    kubectl exec -n neuroca $POD -- python -m memory_profiler dump_mem.py > heap.dump
    

  3. Analyze memory usage
     • Look for large object allocations
     • Check for unbounded caches

  4. Actions
     • Rolling restart if immediate mitigation needed (command sketched below)
     • Deploy fix addressing memory leak
     • Add memory bounds to caches
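
For the rolling-restart mitigation, a minimal sketch using the deployment and namespace names from this runbook:

# Rolling restart to reclaim memory while the underlying leak is investigated
kubectl rollout restart deployment/neuroca -n neuroca

# Watch memory settle afterwards
kubectl top pods -n neuroca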

Database Performance Issues

  1. Check query performance

    SELECT query, calls, total_time, mean_time
    FROM pg_stat_statements
    ORDER BY total_time DESC
    LIMIT 10;
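    -- Note: on PostgreSQL 13 and later, these columns are named total_exec_time and mean_exec_time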
    

  2. Examine index usage (see the query sketched below)

  3. Check connection pool
     • Look for maxed-out connections
     • Connection leaks

  4. Actions
     • Add needed indexes
     • Optimize slow queries
     • Increase connection timeouts if needed
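
For the index-usage check, one quick heuristic is to look for tables with many sequential scans relative to index scans. The sketch reuses the psql connection pattern from the Helpful Commands section; the pod name is a placeholder.

# Tables with heavy sequential scans may be missing an index
kubectl exec -it -n neuroca <postgres-pod> -- psql -U postgres -d neuroca \
  -c "SELECT relname, seq_scan, idx_scan, n_live_tup FROM pg_stat_user_tables ORDER BY seq_scan DESC LIMIT 10;"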

Emergency Contacts

| Role | Primary | Secondary | Contact Method |
|------|---------|-----------|----------------|
| Database Admin | [NAME] | [NAME] | Slack @dbadmin, Phone |
| Infrastructure Lead | [NAME] | [NAME] | Slack @infrateam, Phone |
| Security Officer | [NAME] | [NAME] | Slack @security, Phone |
| Engineering Lead | [NAME] | [NAME] | Slack @eng-lead, Phone |

Rollback Procedure

If a deployment needs to be rolled back:

# Check deployment history
kubectl rollout history deployment/neuroca -n neuroca

# Roll back to previous version
kubectl rollout undo deployment/neuroca -n neuroca

# Roll back to specific version
kubectl rollout undo deployment/neuroca -n neuroca --to-revision=<revision_number>

# Monitor rollback
kubectl rollout status deployment/neuroca -n neuroca

Helpful Commands

Kubernetes

# Get pod logs
kubectl logs -n neuroca <pod-name>

# Get pod logs for all containers in a pod
kubectl logs -n neuroca <pod-name> --all-containers

# Describe pod for detailed information
kubectl describe pod -n neuroca <pod-name>

# Get events
kubectl get events -n neuroca --sort-by='.lastTimestamp'

# Exec into container
kubectl exec -it -n neuroca <pod-name> -- /bin/bash

# Port forward to service
kubectl port-forward -n neuroca svc/neuroca 8000:80

Monitoring

# Check Prometheus alerts
curl -s http://prometheus:9090/api/v1/alerts | jq

# Check service health
curl -s http://neuroca-service/health/readiness

# Get recent logs
kubectl logs -n neuroca -l app=neuroca --tail=100

Database

# Connect to database
kubectl exec -it -n neuroca <postgres-pod> -- psql -U postgres -d neuroca

-- Check connection count (run inside psql)
SELECT count(*), state FROM pg_stat_activity GROUP BY state;

-- Check table sizes (run inside psql)
SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) AS size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;

Regular Drills

Schedule regular incident response drills to ensure the team is prepared:

  1. Quarterly Gameday exercises
     • Simulate P1 incidents
     • Practice coordination
     • Test communication channels

  2. Monthly runbook reviews
     • Update with new information
     • Add newly discovered issues
     • Remove obsolete information

  3. On-call readiness checks
     • Verify access to all systems
     • Review escalation procedures
     • Update contact information