From Monitoring to Observability

Monitoring tells you when something is wrong. Observability helps you understand why. Modern production systems need both — dashboards that alert on symptoms and tools that help you diagnose root causes. At Nexis Limited, we operate a comprehensive monitoring stack across all our SaaS products and client infrastructure.

The Three Pillars of Observability

1. Metrics

Metrics are numerical measurements collected at regular intervals — CPU usage, request latency, error rates, queue depths. We use Prometheus as our metrics collection and alerting engine. Prometheus scrapes metrics from application endpoints and infrastructure exporters, stores them in a time-series database, and evaluates alerting rules.
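To make the scrape model concrete, here is a minimal, standard-library-only sketch of what an application endpoint hands back to Prometheus: a counter rendered in the text exposition format. The Counter class and the metric name http_requests_total are illustrative assumptions, not our production code; a real service would normally use an official Prometheus client library instead.

```python
from collections import defaultdict


class Counter:
    """A monotonically increasing metric, keyed by label values (illustrative)."""

    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        if amount < 0:
            raise ValueError("counters can only increase")
        key = tuple(labels[name] for name in self.label_names)
        self._values[key] += amount

    def render(self):
        """Return the metric in the Prometheus text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            pairs = ",".join(
                f'{name}="{val}"' for name, val in zip(self.label_names, key)
            )
            lines.append(f"{self.name}{{{pairs}}} {value:g}")
        return "\n".join(lines) + "\n"


# Hypothetical metric: one time series per (method, status) combination.
requests_total = Counter(
    "http_requests_total", "Total HTTP requests handled.", ["method", "status"]
)
requests_total.inc(method="GET", status="200")
requests_total.inc(method="GET", status="200")
requests_total.inc(method="POST", status="500")
exposition = requests_total.render()
```

Serving the exposition string over HTTP (conventionally at /metrics) is what lets Prometheus scrape it on each interval. Because counters only ever increase, rates are derived at query time rather than stored by the application.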

2. Logs

Logs provide detailed records of individual events in your system. Structured logging (JSON format) makes logs machine-readable and searchable. Key practices include a structured format with consistent field names, correlation IDs that trace a request across services, appropriate use of log levels (debug, info, warn, error), and centralized log aggregation for search and analysis.
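The practices above can be sketched with Python's standard logging module. This is a minimal example under stated assumptions: the field names, the logger name "checkout", and the JsonFormatter class are hypothetical choices for illustration, and the in-memory stream stands in for whatever ships logs to a central aggregator.

```python
import io
import json
import logging
import uuid
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with a fixed set of fields."""

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname.lower(),
            "logger": record.name,
            "message": record.getMessage(),
            # Set via `extra=`; None if the caller did not supply one.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)


def build_logger(name, stream):
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)  # debug messages are dropped
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()  # stand-in for stdout or a log shipper
log = build_logger("checkout", stream)

# In a real service the ID would arrive with the incoming request.
correlation_id = str(uuid.uuid4())
log.info("payment authorized", extra={"correlation_id": correlation_id})
log.debug("dropped: below the configured level")

record = json.loads(stream.getvalue().splitlines()[0])
```

Passing the same correlation_id to every service that handles a request is what makes it possible to search the aggregated logs for one request end to end.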

3. Traces

Distributed traces track a request as it flows through multiple services. Each service adds one or more spans to the trace, and each span records timing and metadata for a unit of work. This is essential for diagnosing latency issues in microservices architectures. We use OpenTelemetry for trace instrumentation across our Go and Python services.
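The span model is easier to see in a small example. The sketch below is a conceptual illustration using only the standard library; it is not the OpenTelemetry API, and the Span and Tracer classes, span names, and service attributes are all hypothetical. It shows the two things a span carries: identifiers that link it to its parent and trace, and timing for the work it wraps.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str             # shared by every span in one request
    span_id: str
    parent_id: Optional[str]  # None for the root span
    name: str
    start: float
    end: Optional[float] = None
    attributes: dict = field(default_factory=dict)

    @property
    def duration_ms(self):
        return (self.end - self.start) * 1000.0


class Tracer:
    """Toy in-process tracer: nests spans using a simple stack."""

    def __init__(self):
        self.finished = []
        self._stack = []

    @contextmanager
    def span(self, name, **attributes):
        parent = self._stack[-1] if self._stack else None
        current = Span(
            trace_id=parent.trace_id if parent else uuid.uuid4().hex,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent.span_id if parent else None,
            name=name,
            start=time.perf_counter(),
            attributes=dict(attributes),
        )
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end = time.perf_counter()
            self._stack.pop()
            self.finished.append(current)


tracer = Tracer()
with tracer.span("GET /orders", service="api-gateway") as root:
    with tracer.span("SELECT orders", service="orders-db") as child:
        time.sleep(0.01)  # simulated database latency
```

Comparing root.duration_ms with child.duration_ms shows how much of the request time was spent in the nested call. Across real service boundaries the trace and parent IDs travel in request headers rather than on an in-process stack, which is the part an instrumentation library such as OpenTelemetry handles.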

Grafana Dashboards

Grafana provides visualization for Prometheus metrics. We maintain dashboards for:

  • System overview: CPU, memory, disk, and network metrics across all nodes.
  • Application performance: Request rate, error rate, and latency (RED metrics).
  • Business metrics: Active users, transaction volume, and revenue-relevant KPIs.
  • Database performance: Query latency, connection pool utilization, and replication lag.
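For readers unfamiliar with the RED metrics on the application dashboard, the following sketch computes them from raw request samples. It is only an illustration of the arithmetic: the Request type, the 60-second window, the choice of 5xx as "error", and the nearest-rank percentile are assumptions, and in practice these values are computed by queries against Prometheus rather than in application code.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int
    duration_s: float


def percentile(values, q):
    """Nearest-rank percentile; q is in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[rank - 1]


def red_summary(requests, window_s):
    """Rate, Errors, Duration for the requests seen in one time window."""
    if not requests:
        raise ValueError("no requests in window")
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_per_s": len(requests) / window_s,
        "error_ratio": errors / len(requests),
        "p95_duration_s": percentile([r.duration_s for r in requests], 95),
    }


# Hypothetical window: 19 successful requests and one slow server error.
window = [Request(200, 0.1 * i) for i in range(1, 20)] + [Request(503, 2.0)]
summary = red_summary(window, window_s=60)
```

Here 20 requests in 60 seconds give a rate of about 0.33 requests per second, one failure in 20 gives an error ratio of 0.05, and the 95th-percentile duration is roughly 1.9 seconds.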

Alerting Strategy

Effective alerting follows these principles:

  • Alert on symptoms, not causes: Alert when the error rate exceeds 1%, not when a specific pod restarts.
  • Reduce noise: Every alert should be actionable. If an alert fires and the response is "ignore it," the alert should be removed or adjusted.
  • Use severity levels: Critical alerts page on-call engineers immediately. Warning alerts create tickets for next-business-day resolution.
  • Include runbooks: Each alert should link to a runbook describing the likely cause and resolution steps.
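These principles can be expressed as a small evaluation function. The sketch below is a hedged illustration, not our alerting configuration: the AlertRule fields, the action names, the three-interval requirement, and the runbook URL are all hypothetical. It alerts on a symptom (error ratio above 1%), suppresses one-off blips to reduce noise, routes by severity, and attaches a runbook link.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    threshold: float     # error ratio that counts as a breach
    for_intervals: int   # consecutive breaching samples required to fire
    severity: str        # "critical" or "warning"
    runbook_url: str


@dataclass(frozen=True)
class Alert:
    rule: str
    action: str
    runbook_url: str


# Severity decides the response: page immediately or open a ticket.
ACTIONS = {"critical": "page_on_call", "warning": "create_ticket"}


def evaluate(rule, error_ratios):
    """Fire only if the most recent `for_intervals` samples all breach."""
    recent = error_ratios[-rule.for_intervals:]
    if len(recent) < rule.for_intervals:
        return None
    if all(ratio > rule.threshold for ratio in recent):
        return Alert(rule.name, ACTIONS[rule.severity], rule.runbook_url)
    return None


rule = AlertRule(
    name="HighErrorRate",
    threshold=0.01,
    for_intervals=3,
    severity="critical",
    runbook_url="https://runbooks.example.com/high-error-rate",  # placeholder
)

blip = evaluate(rule, [0.002, 0.03, 0.004, 0.003])   # single spike
sustained = evaluate(rule, [0.002, 0.02, 0.03, 0.05])  # three breaches in a row
```

The single spike produces no alert, while the sustained breach produces one that pages on-call. Requiring several consecutive breaches is similar in spirit to the "for" clause in Prometheus alerting rules, which holds an alert in a pending state until the condition has been true for a set duration.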

DockWarden: Our Open-Source Contribution

We built DockWarden to provide Docker container monitoring with Prometheus metrics, health checks, and notifications via Discord, Slack, and Telegram. It integrates directly with our Grafana dashboards for container-level visibility.

Conclusion

Observability is not optional for production systems. Implement the three pillars — metrics, logs, and traces — with consistent instrumentation, actionable alerting, and clear dashboards. The investment pays off every time you need to diagnose and resolve a production incident quickly.

Need help setting up monitoring? Our DevOps team designs and operates monitoring infrastructure for production workloads.