Errors Are Inevitable
Every production application encounters errors — network failures, database timeouts, external API outages, invalid user input, and bugs. The difference between a fragile and a resilient application is how it handles errors. At Nexis Limited, resilience engineering is built into the architecture of all SaaS products.
Error Classification
Not all errors are equal. Classify errors to determine the appropriate response:
- Transient errors: Temporary network glitches, database connection drops, rate limits. Retry after a delay.
- Permanent errors: Invalid input, missing resources, authentication failures. Do not retry — fix the cause.
- System errors: Out of memory, disk full, service crashes. Alert operations and fail gracefully.
- Business errors: Insufficient funds, duplicate registration, policy violation. Handle in business logic with appropriate user feedback.
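As a rough illustration of the classification above, the sketch below maps Python exceptions to the four categories. The `ErrorKind` enum, the `BusinessRuleError` class, and the specific mapping are assumptions made for this example; a real application would map its own exception types.

```python
from enum import Enum


class ErrorKind(Enum):
    TRANSIENT = "transient"  # retry after a delay
    PERMANENT = "permanent"  # do not retry; fix the cause
    SYSTEM = "system"        # alert operations and fail gracefully
    BUSINESS = "business"    # handle in business logic with user feedback


class BusinessRuleError(Exception):
    """Hypothetical base class for domain errors such as insufficient funds."""


def classify(exc: BaseException) -> ErrorKind:
    """Map an exception to a category (illustrative mapping, not exhaustive)."""
    if isinstance(exc, BusinessRuleError):
        return ErrorKind.BUSINESS
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return ErrorKind.TRANSIENT
    if isinstance(exc, MemoryError):
        return ErrorKind.SYSTEM
    if isinstance(exc, (ValueError, KeyError, PermissionError)):
        return ErrorKind.PERMANENT
    # Unknown errors are treated as system errors so that they get alerted on.
    return ErrorKind.SYSTEM
```

Routing every caught exception through one function like this keeps the retry, alerting, and user-feedback decisions in a single place.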
Retry Strategies
Exponential Backoff with Jitter
For transient errors, retry with exponentially increasing delays, for example 1s, 2s, 4s, 8s. Add random jitter (±30%) so that clients that failed at the same moment do not all retry at the same moment, which is the thundering herd problem. Set a maximum number of retries (3-5) and cap the delay (30-60 seconds).
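A minimal sketch of this retry loop follows. The `TransientError` marker class and the default parameter values are assumptions for illustration. The cap is applied before jitter, so a jittered delay can exceed the cap by up to the jitter fraction.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are safe to retry."""


def backoff_delay(attempt, base=1.0, cap=30.0, jitter=0.3, rng=random.random):
    """Delay before retry number `attempt` (0-based): base * 2**attempt, capped, then jittered."""
    delay = min(cap, base * (2 ** attempt))
    # rng() is in [0, 1); scale it to a factor in [1 - jitter, 1 + jitter).
    factor = 1.0 + jitter * (2.0 * rng() - 1.0)
    return delay * factor


def retry(operation, max_retries=4, sleep=time.sleep):
    """Call operation(); on TransientError, retry up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the error to the caller
            sleep(backoff_delay(attempt))
```

Only `TransientError` is retried here; permanent errors propagate immediately, which matches the classification rule of not retrying them.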
Retry Budget
Limit the total number of retries across all requests, not just per request. If 50% of requests are being retried, the downstream service is likely overwhelmed — continuing to retry makes it worse. Implement a retry budget that caps total retries at 10-20% of normal request volume.
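One way to sketch a retry budget is a pair of counters shared by all requests. The `RetryBudget` class and its parameters are hypothetical, and the counters here cover the whole process lifetime; a production version would more likely use a sliding time window or a token bucket.

```python
class RetryBudget:
    """Allow retries only while they stay under `ratio` of the requests seen."""

    def __init__(self, ratio=0.1, min_retries=1):
        self.ratio = ratio
        self.min_retries = min_retries  # small allowance so low traffic can still retry
        self.requests = 0
        self.retries = 0

    def record_request(self):
        """Count one original (non-retry) request."""
        self.requests += 1

    def try_acquire(self):
        """Return True and count the retry if the budget allows it."""
        allowed = max(self.min_retries, int(self.requests * self.ratio))
        if self.retries < allowed:
            self.retries += 1
            return True
        return False
```

A caller would invoke `record_request()` once per incoming request and check `try_acquire()` before each retry, giving up immediately when it returns False.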
Circuit Breaker Pattern
The circuit breaker prevents cascading failures when a downstream service is unavailable:
- Closed (normal): Requests pass through. Track failure rate.
- Open: When failures exceed a threshold (e.g., 50% of requests fail), the circuit opens. All requests fail immediately without calling the downstream service, preventing resource exhaustion and giving the failing service time to recover.
- Half-Open: After a timeout, allow a few test requests through. If they succeed, close the circuit. If they fail, re-open.
This pattern is essential for microservices architectures where one failing service can take down the entire system.
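The three states can be sketched as a small single-threaded class. This is a simplified illustration under stated assumptions: the class and parameter names are invented for the example, failure counts are not windowed over time, and a single successful probe closes the circuit, whereas the description above allows a few test requests.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the service."""


class CircuitBreaker:
    def __init__(self, failure_ratio=0.5, min_calls=10, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_ratio = failure_ratio
        self.min_calls = min_calls          # do not judge on too few samples
        self.reset_timeout = reset_timeout  # seconds to stay open before probing
        self.clock = clock
        self.state = "closed"
        self.calls = 0
        self.failures = 0
        self.opened_at = 0.0

    def call(self, operation):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("downstream unavailable; failing fast")
            self.state = "half_open"  # timeout elapsed: let a probe through
        try:
            result = operation()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state == "half_open":
            self._reset()  # probe succeeded: close the circuit
        else:
            self.calls += 1

    def _on_failure(self):
        if self.state == "half_open":
            self._open()  # probe failed: re-open
            return
        self.calls += 1
        self.failures += 1
        if (self.calls >= self.min_calls
                and self.failures / self.calls >= self.failure_ratio):
            self._open()

    def _open(self):
        self.state = "open"
        self.opened_at = self.clock()

    def _reset(self):
        self.state = "closed"
        self.calls = 0
        self.failures = 0
```

Callers wrap each downstream call as `breaker.call(lambda: client.fetch(...))`, where `client.fetch` stands in for whatever function reaches the downstream service, and treat `CircuitOpenError` as a signal to use a fallback.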
Graceful Degradation
When a non-critical service fails, degrade gracefully rather than failing the entire request:
- If the recommendation service is down, show popular items instead of personalized recommendations.
- If the notification service is down, queue notifications for later delivery.
- If analytics tracking fails, silently skip it — do not fail the user's action.
- Cache critical data so the application can serve stale data when the source is unavailable.
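Two of these fallbacks can be sketched briefly: serving the last good value when a source fails, and best-effort analytics. The `StaleCache` class and `track_event` helper are hypothetical names for this example. Note that the cache always tries the source first and only falls back on failure, so it is not a read-through cache.

```python
import time


class StaleCache:
    """Keep the last good value so it can be served when the source fails."""

    def __init__(self, max_stale_seconds=3600.0, clock=time.monotonic):
        self.max_stale = max_stale_seconds
        self.clock = clock
        self._entries = {}  # key -> (value, stored_at)

    def get_or_fetch(self, key, fetch, default=None):
        try:
            value = fetch()
        except Exception:
            entry = self._entries.get(key)
            if entry is not None and self.clock() - entry[1] <= self.max_stale:
                return entry[0]  # source is down: serve stale data
            return default       # nothing usable cached: fall back to a default
        self._entries[key] = (value, self.clock())
        return value


def track_event(send, event):
    """Best-effort analytics: a tracking failure never fails the user's action."""
    try:
        send(event)
    except Exception:
        pass  # real code would log this at low severity rather than discard it
```

For the recommendation case, `default` would hold the list of popular items, so the page still renders when personalization is unavailable.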
User-Facing Error Messages
- Be helpful: Tell the user what happened and what they can do. "Payment failed — please check your card details and try again" is better than "Error 500."
- Never expose technical details: Stack traces, database errors, and internal identifiers such as table names or hostnames should never reach the user. Log them server-side and give the user only an opaque request ID to quote to support.
- Provide next steps: "Try again," "Contact support," or "Use a different method."
- Distinguish user errors from system errors: Validation errors need specific field-level feedback. System errors need a general apology and retry option.
Structured Error Responses
Design consistent API error responses:
- Use appropriate HTTP status codes (4xx for client errors such as 400, 401, and 404; 5xx for server errors such as 500 and 503).
- Include an error code for programmatic handling (AUTH_TOKEN_EXPIRED, RESOURCE_NOT_FOUND).
- Include a human-readable message for display.
- Include a request ID for correlating errors in logs.
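A response shape with these four fields might look like the sketch below. The `ApiError` dataclass, the `error_response` helper, and the `{"error": ...}` envelope are assumptions for illustration, not a standard.

```python
import json
import uuid
from dataclasses import dataclass, asdict


@dataclass
class ApiError:
    status: int      # HTTP status code
    code: str        # stable, machine-readable error code
    message: str     # safe to display; no stack traces or internal details
    request_id: str  # correlates the response with server-side logs


def error_response(status, code, message, request_id=None):
    """Return (status, JSON body) in one consistent shape for every endpoint."""
    err = ApiError(status, code, message, request_id or uuid.uuid4().hex)
    return status, json.dumps({"error": asdict(err)})
```

For example, `error_response(401, "AUTH_TOKEN_EXPIRED", "Your session has expired. Please sign in again.")` gives clients a code to branch on and users a message they can act on.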
Logging and Alerting
- Log errors with full context: request ID, user ID, input parameters, and stack trace. Redact passwords, tokens, and card numbers before logging input.
- Implement error rate monitoring with alerts when rates exceed thresholds.
- Distinguish between expected errors (invalid input) and unexpected errors (null pointer exceptions). Alert only on unexpected errors.
- Use structured logging (JSON) for easy searching and aggregation.
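A minimal sketch of JSON logging with Python's standard `logging` module follows. The field names (`request_id`, `user_id`, `expected`) and the logger name `app.errors` are assumptions chosen for this example.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
            # Fields passed via `extra=` become attributes on the record.
            "request_id": getattr(record, "request_id", None),
            "user_id": getattr(record, "user_id", None),
            "expected": getattr(record, "expected", False),
        }
        if record.exc_info:
            payload["stack_trace"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(stream=None):
    """Create a logger that writes one JSON line per record."""
    logger = logging.getLogger("app.errors")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)  # defaults to stderr when stream is None
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Inside an `except` block, a call such as `logger.error("payment failed", exc_info=True, extra={"request_id": rid, "user_id": uid})` then emits one searchable line, and an alerting rule can filter on the `expected` field to ignore validation failures.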
Conclusion
Resilient applications do not prevent errors — they handle errors gracefully. Classify errors, retry transient failures with backoff, implement circuit breakers, degrade gracefully for non-critical services, and provide helpful user-facing messages. Build for failure, and your application will be reliable.