Observability & Operations

VRP Billing exposes consistent logging, metrics and SLO instrumentation so that both sandbox and live environments can be monitored with the same tooling.

Structured logging

All API requests emit a single JSON log line via api.middleware.RequestLoggingMiddleware. Every entry contains the following keys:

| Field | Description |
| --- | --- |
| timestamp | UTC timestamp with millisecond precision. |
| level | Log severity (INFO, WARNING, ERROR). |
| logger | Originating logger name (request logs use vrp.request). |
| request_id | Unique hex identifier attached to the Django HttpRequest; also included on API error payloads. |
| method | HTTP verb. |
| path / route | Raw path plus the resolved Django route/view name. |
| status_code | HTTP status returned to the caller. |
| duration_ms | End-to-end processing time in milliseconds. |
| environment | sandbox, live or dev (falls back to ENVIRONMENT). |
| merchant_id | Active merchant id resolved from the API key or session, when available. |
| client_ip / user_agent | Caller metadata extracted from the request headers. |
| rate_limit | Aggregated rate-limit state (scope, bucket, remaining requests and whether the call was throttled). |
| idempotency | Idempotency key, lookup outcome (miss, stored, replay, conflict, locked, invalid) and stored record id when present. |
| response_bytes | Payload size where determinable. |

Logs are emitted via common.logging.JsonFormatter, so any additional context added with extra={"event": {...}} is preserved in the JSON structure.
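For example, application code can attach structured context to a log record; this is a minimal sketch in which the logger name comes from the table above, while the event keys and values are purely illustrative:

```python
import logging

logger = logging.getLogger("vrp.request")

# Anything supplied under extra={"event": ...} is serialised by
# common.logging.JsonFormatter alongside the standard request fields.
# The key names inside "event" below are illustrative only.
logger.info(
    "idempotent replay served",
    extra={"event": {"idempotency_key": "key_123", "outcome": "replay"}},
)
```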

Prometheus metrics

api.metrics registers the following Prometheus series (exported at /metrics/):

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| vrp_api_requests_total | Counter | method, route, status, environment | Total HTTP requests processed. |
| vrp_api_request_latency_seconds | Histogram | method, route, environment | Request latency distribution. |
| vrp_api_rate_limit_requests_total | Counter | scope, bucket_type | Rate limiter evaluations by scope/bucket. |
| vrp_api_rate_limit_blocked_total | Counter | scope, bucket_type | Requests rejected by the rate limiter. |
| vrp_api_throttle_drops_total | Counter | scope, bucket_type | Requests dropped because a throttle bucket was exhausted. |
| vrp_webhook_delivery_attempts_total | Counter | merchant_id, endpoint_id, status (delivered, retry, dead_letter) | Fan-out webhook delivery attempts. |
| vrp_webhook_delivery_failures_total | Counter | merchant_id, endpoint_id, reason | Categorised failure reasons (endpoint_disabled, network_error, http_4xx, http_5xx, max_retries, …). |
| vrp_webhook_delivery_latency_seconds | Histogram | merchant_id, endpoint_id | Delivery latency per endpoint. |

observe_request_metrics is called from the request logging middleware and feeds vrp_api_requests_total and vrp_api_request_latency_seconds. Throttled requests increment vrp_api_throttle_drops_total, while the existing rate-limit database roll-up remains unchanged. merchants.webhooks records webhook delivery attempts, distinguishing delivered events, retries and dead letters.
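
As an illustrative query (not shipped with the service), the throttle counter can be broken down by scope to spot which callers are exhausting their buckets:

```promql
# 5-minute throttle drop rate per scope and bucket type (illustrative query)
sum by (scope, bucket_type) (rate(vrp_api_throttle_drops_total[5m]))
```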

Service level objectives

Sandbox

  • Availability target: 99.0% successful responses over a rolling 30-day window.
  • Error budget: 1.0% (~7h 12m monthly). Consumed whenever vrp_api_requests_total{environment="sandbox",status=~"5.."} increments.
  • Alerts:
      • Early warning: fire when 50% of the monthly budget is consumed within 7 days.
      • Budget exhaustion: fire when 75% of the budget is consumed or the remaining budget will be depleted in less than 72 hours at the current burn rate.

These alerts can be expressed in Prometheus using recording rules on vrp_api_requests_total split by environment.
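A minimal sketch of such rules, assuming the metric names above (the vrp:* record names are illustrative, not part of the shipped configuration):

```yaml
groups:
  - name: vrp-sandbox-slo
    rules:
      # Ratio of 5xx responses to all responses over the rolling 30-day window (sandbox).
      - record: vrp:request_error_ratio_30d:sandbox
        expr: |
          sum(rate(vrp_api_requests_total{environment="sandbox",status=~"5.."}[30d]))
          /
          sum(rate(vrp_api_requests_total{environment="sandbox"}[30d]))
      # Fraction of the 1% sandbox error budget consumed so far in the window.
      - record: vrp:error_budget_consumed:sandbox
        expr: vrp:request_error_ratio_30d:sandbox / 0.01
```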

Live

  • Availability target: 99.9% successful responses over a rolling 30-day window.
  • Error budget: 0.1% (~43m monthly). Tracked via the same counters but filtered with environment="live".
  • Alerts:
      • Early warning: trigger when the burn rate exceeds 2× over both a 1-hour and a 6-hour window (multi-window, multi-burn-rate alerting) to catch fast failures; see the sketch below.
      • Budget exhaustion: trigger when projected depletion is under 24 hours.
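
A sketch of the early-warning rule under these assumptions (the alert name, for duration and severity label are illustrative, and the rule would sit in a group like the sandbox example above):

```yaml
- alert: VrpLiveErrorBudgetFastBurn
  # Burn rate 2x means the error ratio exceeds 2 * (1 - 0.999) = 0.002
  # in both the 1-hour and the 6-hour window.
  expr: |
    (
      sum(rate(vrp_api_requests_total{environment="live",status=~"5.."}[1h]))
      /
      sum(rate(vrp_api_requests_total{environment="live"}[1h]))
    ) > (2 * 0.001)
    and
    (
      sum(rate(vrp_api_requests_total{environment="live",status=~"5.."}[6h]))
      /
      sum(rate(vrp_api_requests_total{environment="live"}[6h]))
    ) > (2 * 0.001)
  for: 5m
  labels:
    severity: page
  annotations:
    summary: Live error budget is burning at more than 2x the sustainable rate.
```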

Use the latency histogram to define latency SLOs (e.g. P95 < 500 ms) by aggregating vrp_api_request_latency_seconds_bucket.
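
For example, an estimated P95 latency per environment can be derived with histogram_quantile (the query and window are illustrative):

```promql
# Estimated P95 request latency over the last 5 minutes, per environment
histogram_quantile(
  0.95,
  sum by (environment, le) (rate(vrp_api_request_latency_seconds_bucket[5m]))
)
```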

Both environments share the same instrumentation, so Grafana dashboards and alerting rules can be parameterised by environment. The additional webhook metrics provide visibility into downstream integrations, enabling alerts when retries or dead letters exceed healthy thresholds.
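
As a starting point, a dead-letter alert can watch the dead_letter status of the attempt counter; the alert name, window and zero-tolerance threshold below are illustrative and belong in a rule group like the one sketched earlier:

```yaml
- alert: VrpWebhookDeadLetters
  # Any dead-lettered delivery in the last 15 minutes, broken down by merchant.
  expr: sum by (merchant_id) (increase(vrp_webhook_delivery_attempts_total{status="dead_letter"}[15m])) > 0
  labels:
    severity: warning
  annotations:
    summary: Webhook deliveries for merchant {{ $labels.merchant_id }} are being dead-lettered.
```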