# Observability & Operations
VRP Billing exposes consistent logging, metrics and SLO instrumentation so that both sandbox and live environments can be monitored with the same tooling.
## Structured logging
All requests now emit a single JSON log line via `api.middleware.RequestLoggingMiddleware`. Every entry contains the following keys:
| Field | Description |
|---|---|
| `timestamp` | UTC timestamp with millisecond precision. |
| `level` | Log severity (`INFO`, `WARNING`, `ERROR`). |
| `logger` | Originating logger name (request logs use `vrp.request`). |
| `request_id` | Unique hex identifier attached to the Django `HttpRequest`. Included on API error payloads. |
| `method` | HTTP verb. |
| `path` / `route` | Raw path plus the resolved Django route/view name. |
| `status_code` | HTTP status returned to the caller. |
| `duration_ms` | End-to-end processing time in milliseconds. |
| `environment` | `sandbox`, `live` or `dev` (falls back to `ENVIRONMENT`). |
| `merchant_id` | Active merchant id resolved from the API key or session, when available. |
| `client_ip` / `user_agent` | Caller metadata extracted from the request headers. |
| `rate_limit` | Aggregated limit state (scope, bucket, remaining requests and whether the call was throttled). |
| `idempotency` | Idempotency key, lookup outcome (`miss`, `stored`, `replay`, `conflict`, `locked`, `invalid`) and stored record id when present. |
| `response_bytes` | Payload size, where determinable. |
Logs are emitted via `common.logging.JsonFormatter`, so any additional context added with `extra={"event": {...}}` is preserved in the JSON structure.
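For example, a view or task can attach structured context to a request log as in this minimal sketch; the `vrp.request` logger name comes from the table above, while the event payload shown is purely illustrative:

```python
import logging

logger = logging.getLogger("vrp.request")

# Everything under "event" is merged into the JSON line produced by
# common.logging.JsonFormatter, alongside the standard request fields.
# The values below are illustrative only.
logger.info(
    "idempotency key replayed",
    extra={"event": {"idempotency_outcome": "replay", "record_id": "rec_123"}},
)
```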
## Prometheus metrics
`api.metrics` registers the following Prometheus series (exported at `/metrics/`):
| Metric | Type | Labels | Description |
|---|---|---|---|
| `vrp_api_requests_total` | Counter | `method`, `route`, `status`, `environment` | Total HTTP requests processed. |
| `vrp_api_request_latency_seconds` | Histogram | `method`, `route`, `environment` | Request latency distribution. |
| `vrp_api_rate_limit_requests_total` | Counter | `scope`, `bucket_type` | Rate limiter evaluations by scope/bucket. |
| `vrp_api_rate_limit_blocked_total` | Counter | `scope`, `bucket_type` | Requests rejected by the rate limiter. |
| `vrp_api_throttle_drops_total` | Counter | `scope`, `bucket_type` | Requests dropped because a throttle bucket was exhausted. |
| `vrp_webhook_delivery_attempts_total` | Counter | `merchant_id`, `endpoint_id`, `status` (`delivered`, `retry`, `dead_letter`) | Fan-out webhook delivery attempts. |
| `vrp_webhook_delivery_failures_total` | Counter | `merchant_id`, `endpoint_id`, `reason` | Categorised failure reasons (`endpoint_disabled`, `network_error`, `http_4xx`, `http_5xx`, `max_retries`, …). |
| `vrp_webhook_delivery_latency_seconds` | Histogram | `merchant_id`, `endpoint_id` | Delivery latency per endpoint. |
`observe_request_metrics` is called from the request logging middleware and feeds `vrp_api_requests_total` and `vrp_api_request_latency_seconds`. Throttled requests add entries to `vrp_api_throttle_drops_total`, while the existing rate-limit database roll-up remains unchanged. `merchants.webhooks` records webhook attempts, distinguishing retries, dead letters and delivered events.
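A rough sketch of that wiring, assuming `prometheus_client`; the series and label names match the table above, but the signature and body of `observe_request_metrics` shown here are an assumption rather than the actual `api.metrics` implementation:

```python
from prometheus_client import Counter, Histogram

# Registered once at import time; exported via the /metrics/ endpoint.
REQUESTS_TOTAL = Counter(
    "vrp_api_requests_total",
    "Total HTTP requests processed.",
    ["method", "route", "status", "environment"],
)
REQUEST_LATENCY = Histogram(
    "vrp_api_request_latency_seconds",
    "Request latency distribution.",
    ["method", "route", "environment"],
)

def observe_request_metrics(method: str, route: str, status: int,
                            environment: str, duration_s: float) -> None:
    """Record one finished request; invoked from the request logging middleware."""
    REQUESTS_TOTAL.labels(
        method=method, route=route, status=str(status), environment=environment
    ).inc()
    REQUEST_LATENCY.labels(
        method=method, route=route, environment=environment
    ).observe(duration_s)
```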
## Service level objectives
### Sandbox
- Availability target: 99.0% successful responses over a rolling 30-day window.
- Error budget: 1.0% (~7h 12m monthly). Consumed whenever `vrp_api_requests_total{environment="sandbox",status=~"5.."}` increments.
- Alerts:
  - Early warning: fire when 50% of the monthly budget is consumed within 7 days.
  - Budget exhaustion: fire when 75% of the budget is consumed or the remaining budget will be depleted in less than 72 hours at the current burn rate.
These alerts can be expressed in Prometheus using recording rules on `vrp_api_requests_total` split by environment.
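A simplified sketch of the early-warning rule, assuming roughly uniform traffic; the group name, window and threshold arithmetic are illustrative rather than taken from the repository:

```yaml
groups:
  - name: vrp-slo-sandbox
    rules:
      # 7-day ratio of 5xx responses to all responses in the sandbox environment.
      - record: vrp:api_error_ratio_7d:sandbox
        expr: |
          sum(rate(vrp_api_requests_total{environment="sandbox",status=~"5.."}[7d]))
          /
          sum(rate(vrp_api_requests_total{environment="sandbox"}[7d]))
      # Early warning: 50% of the monthly 1% budget consumed within 7 days,
      # i.e. a 7-day error ratio above 0.5 * (30 / 7) * 1%.
      - alert: SandboxErrorBudgetEarlyWarning
        expr: vrp:api_error_ratio_7d:sandbox > 0.5 * (30 / 7) * 0.01
        labels:
          severity: warning
```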
### Live
- Availability target: 99.9% successful responses over a rolling 30-day window.
- Error budget: 0.1% (~43m monthly). Tracked via the same counters but filtered with `environment="live"`.
- Alerts:
  - Early warning: trigger when the burn rate exceeds 2× over both 1 hour and 6 hours (multi-window, multi-burn-rate alerting) to catch fast failures, as in the sketch below.
  - Budget exhaustion: trigger when projected depletion is under 24 hours.
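A minimal sketch of the fast-burn alert; the recording-rule names are hypothetical and the 2× factor comes from the bullet above:

```yaml
groups:
  - name: vrp-slo-live
    rules:
      # 1-hour and 6-hour ratios of 5xx responses in the live environment.
      - record: vrp:api_error_ratio_1h:live
        expr: |
          sum(rate(vrp_api_requests_total{environment="live",status=~"5.."}[1h]))
          /
          sum(rate(vrp_api_requests_total{environment="live"}[1h]))
      - record: vrp:api_error_ratio_6h:live
        expr: |
          sum(rate(vrp_api_requests_total{environment="live",status=~"5.."}[6h]))
          /
          sum(rate(vrp_api_requests_total{environment="live"}[6h]))
      # Page only when both windows burn faster than 2x the 0.1% budget rate.
      - alert: LiveErrorBudgetFastBurn
        expr: >
          vrp:api_error_ratio_1h:live > 2 * 0.001
          and
          vrp:api_error_ratio_6h:live > 2 * 0.001
        labels:
          severity: page
```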
Use the latency histogram to define latency SLOs (e.g. P95 < 500 ms) by aggregating `vrp_api_request_latency_seconds_bucket`.
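For example, a live P95 series suitable for such an SLO can be derived with `histogram_quantile` (the 5-minute window is an illustrative choice):

```promql
histogram_quantile(
  0.95,
  sum by (le) (rate(vrp_api_request_latency_seconds_bucket{environment="live"}[5m]))
)
```

An alert would then compare this value against the 0.5 s target for a sustained period.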
Both environments share the same instrumentation, so Grafana dashboards and alerting rules can be parameterised by `environment`. The additional webhook metrics provide visibility into downstream integrations, enabling alerts when retries or dead letters exceed healthy thresholds.
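As a starting point, dead-lettered deliveries could be alerted on with an expression like the following; the one-hour window and zero threshold are placeholders to tune per integration:

```promql
sum by (merchant_id, endpoint_id) (
  increase(vrp_webhook_delivery_attempts_total{status="dead_letter"}[1h])
) > 0
```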