# Observability & Operations
VRP Billing exposes consistent logging, metrics and SLO instrumentation so that both sandbox and live environments can be monitored with the same tooling.
## Structured logging
All requests now emit a single JSON log line via `api.middleware.RequestLoggingMiddleware`. Every entry contains the following keys:
| Field | Description |
|---|---|
| `timestamp` | UTC timestamp with millisecond precision. |
| `level` | Log severity (`INFO`, `WARNING`, `ERROR`). |
| `logger` | Originating logger name (request logs use `vrp.request`). |
| `request_id` | Unique hex identifier attached to the Django `HttpRequest`. Included on API error payloads. |
| `method` | HTTP verb. |
| `path` / `route` | Raw path plus the resolved Django route/view name. |
| `status_code` | HTTP status returned to the caller. |
| `duration_ms` | End-to-end processing time in milliseconds. |
| `environment` | `sandbox`, `live` or `dev` (falls back to `ENVIRONMENT`). |
| `merchant_id` | Active merchant id resolved from the API key or session, when available. |
| `client_ip` / `user_agent` | Caller metadata extracted from the request headers. |
| `rate_limit` | Aggregated limit state (scope, bucket, remaining requests and whether the call was throttled). |
| `idempotency` | Idempotency key, lookup outcome (`miss`, `stored`, `replay`, `conflict`, `locked`, `invalid`) and stored record id when present. |
| `response_bytes` | Payload size, where determinable. |
Logs are emitted via `common.logging.JsonFormatter`, so any additional context added with `extra={"event": {...}}` is preserved in the JSON structure.
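For example, a view or task can attach structured context to a request log as in this minimal sketch; the `vrp.request` logger name comes from the table above, while the event payload shown is purely illustrative:

```python
import logging

logger = logging.getLogger("vrp.request")

# Everything under "event" is merged into the JSON line produced by
# common.logging.JsonFormatter, alongside the standard request fields.
# The values below are illustrative only.
logger.info(
    "idempotency key replayed",
    extra={"event": {"idempotency_outcome": "replay", "record_id": "rec_123"}},
)
```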
## Prometheus metrics
`api.metrics` registers the following Prometheus series (exported at `/metrics/`):
| Metric | Type | Labels | Description |
|---|---|---|---|
| `vrp_api_requests_total` | Counter | `method`, `route`, `status`, `environment` | Total HTTP requests processed. |
| `vrp_api_request_latency_seconds` | Histogram | `method`, `route`, `environment` | Request latency distribution. |
| `vrp_api_rate_limit_requests_total` | Counter | `scope`, `bucket_type` | Rate limiter evaluations by scope/bucket. |
| `vrp_api_rate_limit_blocked_total` | Counter | `scope`, `bucket_type` | Requests rejected by the rate limiter. |
| `vrp_api_throttle_drops_total` | Counter | `scope`, `bucket_type` | Requests dropped because a throttle bucket was exhausted. |
| `vrp_webhook_delivery_attempts_total` | Counter | `merchant_id`, `endpoint_id`, `status` (`delivered`, `retry`, `dead_letter`) | Fan-out webhook delivery attempts. |
| `vrp_webhook_delivery_failures_total` | Counter | `merchant_id`, `endpoint_id`, `reason` | Categorised failure reasons (`endpoint_disabled`, `network_error`, `http_4xx`, `http_5xx`, `max_retries`, …). |
| `vrp_webhook_delivery_latency_seconds` | Histogram | `merchant_id`, `endpoint_id` | Delivery latency per endpoint. |
`observe_request_metrics` is called from the request logging middleware and feeds `vrp_api_requests_total` and `vrp_api_request_latency_seconds`. Throttled requests add entries to `vrp_api_throttle_drops_total`, while the existing rate-limit database roll-up remains unchanged. `merchants.webhooks` records webhook attempts, distinguishing retries, dead letters and delivered events.
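A rough sketch of that wiring, assuming `prometheus_client`; the series and label names match the table above, but the signature and body of `observe_request_metrics` shown here are an assumption rather than the actual `api.metrics` implementation:

```python
from prometheus_client import Counter, Histogram

# Registered once at import time; exported via the /metrics/ endpoint.
REQUESTS_TOTAL = Counter(
    "vrp_api_requests_total",
    "Total HTTP requests processed.",
    ["method", "route", "status", "environment"],
)
REQUEST_LATENCY = Histogram(
    "vrp_api_request_latency_seconds",
    "Request latency distribution.",
    ["method", "route", "environment"],
)

def observe_request_metrics(method: str, route: str, status: int,
                            environment: str, duration_s: float) -> None:
    """Record one finished request; invoked from the request logging middleware."""
    REQUESTS_TOTAL.labels(
        method=method, route=route, status=str(status), environment=environment
    ).inc()
    REQUEST_LATENCY.labels(
        method=method, route=route, environment=environment
    ).observe(duration_s)
```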
## Service level objectives
### Sandbox
- Availability target: 99.0% successful responses over a rolling 30-day window.
- Error budget: 1.0% (~7h 12m monthly). Consumed whenever `vrp_api_requests_total{environment="sandbox",status=~"5.."}` increments.
- Alerts:
  - Early warning: fire when 50% of the monthly budget is consumed within 7 days.
  - Budget exhaustion: fire when 75% of the budget is consumed or the remaining budget will be depleted in less than 72 hours at the current burn rate.
These alerts can be expressed in Prometheus using recording rules on `vrp_api_requests_total` split by environment.
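A simplified sketch of the early-warning rule, assuming roughly uniform traffic; the group name, window and threshold arithmetic are illustrative rather than taken from the repository:

```yaml
groups:
  - name: vrp-slo-sandbox
    rules:
      # 7-day ratio of 5xx responses to all responses in the sandbox environment.
      - record: vrp:api_error_ratio_7d:sandbox
        expr: |
          sum(rate(vrp_api_requests_total{environment="sandbox",status=~"5.."}[7d]))
          /
          sum(rate(vrp_api_requests_total{environment="sandbox"}[7d]))
      # Early warning: 50% of the monthly 1% budget consumed within 7 days,
      # i.e. a 7-day error ratio above 0.5 * (30 / 7) * 1%.
      - alert: SandboxErrorBudgetEarlyWarning
        expr: vrp:api_error_ratio_7d:sandbox > 0.5 * (30 / 7) * 0.01
        labels:
          severity: warning
```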
### Live
- Availability target: 99.9% successful responses over a rolling 30-day window.
- Error budget: 0.1% (~43m monthly). Tracked via the same counters but filtered with `environment="live"`.
- Alerts:
  - Early warning: trigger when the burn rate exceeds 2× over both 1 hour and 6 hours (multi-window, multi-burn-rate alerting) to catch fast failures, as in the sketch below.
  - Budget exhaustion: trigger when projected depletion is under 24 hours.
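A minimal sketch of the fast-burn alert; the recording-rule names are hypothetical and the 2× factor comes from the bullet above:

```yaml
groups:
  - name: vrp-slo-live
    rules:
      # 1-hour and 6-hour ratios of 5xx responses in the live environment.
      - record: vrp:api_error_ratio_1h:live
        expr: |
          sum(rate(vrp_api_requests_total{environment="live",status=~"5.."}[1h]))
          /
          sum(rate(vrp_api_requests_total{environment="live"}[1h]))
      - record: vrp:api_error_ratio_6h:live
        expr: |
          sum(rate(vrp_api_requests_total{environment="live",status=~"5.."}[6h]))
          /
          sum(rate(vrp_api_requests_total{environment="live"}[6h]))
      # Page only when both windows burn faster than 2x the 0.1% budget rate.
      - alert: LiveErrorBudgetFastBurn
        expr: >
          vrp:api_error_ratio_1h:live > 2 * 0.001
          and
          vrp:api_error_ratio_6h:live > 2 * 0.001
        labels:
          severity: page
```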
Use the latency histogram to define latency SLOs (e.g. P95 < 500 ms) by aggregating `vrp_api_request_latency_seconds_bucket`.
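For example, a live P95 series suitable for such an SLO can be derived with `histogram_quantile` (the 5-minute window is an illustrative choice):

```promql
histogram_quantile(
  0.95,
  sum by (le) (rate(vrp_api_request_latency_seconds_bucket{environment="live"}[5m]))
)
```

An alert would then compare this value against the 0.5 s target for a sustained period.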
Both environments share the same instrumentation, so Grafana dashboards and alerting rules can be parameterised by `environment`. The additional webhook metrics provide visibility into downstream integrations, enabling alerts when retries or dead letters exceed healthy thresholds.
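As a starting point, dead-lettered deliveries could be alerted on with an expression like the following; the one-hour window and zero threshold are placeholders to tune per integration:

```promql
sum by (merchant_id, endpoint_id) (
  increase(vrp_webhook_delivery_attempts_total{status="dead_letter"}[1h])
) > 0
```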