Your CaptchaAI callback endpoint is a critical dependency — if it goes down, solved CAPTCHAs don't reach your application. Built-in monitoring catches problems before they cascade.
What to Monitor
| Metric | Why It Matters | Healthy Range |
|---|---|---|
| Endpoint uptime | Callbacks fail during downtime | > 99.5% |
| Response latency | Slow responses may timeout | < 500 ms |
| Error rate (4xx/5xx) | Indicates handler bugs | < 1% |
| Callback delivery rate | Ratio of callbacks received vs tasks submitted | > 95% |
| Time between callbacks | Detects sudden stops | < 5× average interval |
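The healthy ranges above translate directly into code. A minimal sketch of such a check (the snapshot keys here are illustrative, not a fixed API; adapt them to whatever your metrics pipeline produces):

```python
# Evaluate a metrics snapshot against the healthy ranges from the table above.
# The dict keys are assumptions for this example, not required names.

def evaluate_metrics(snapshot):
    """Return a list of (metric, message) pairs for out-of-range values."""
    issues = []
    if snapshot.get("uptime", 1.0) < 0.995:
        issues.append(("uptime", "below 99.5%"))
    if snapshot.get("p95_latency_ms", 0) > 500:
        issues.append(("latency", "p95 above 500 ms"))
    if snapshot.get("error_rate", 0) > 0.01:
        issues.append(("error_rate", "above 1%"))
    if snapshot.get("delivery_rate", 1.0) < 0.95:
        issues.append(("delivery_rate", "below 95%"))
    return issues


healthy = {"uptime": 0.999, "p95_latency_ms": 120, "error_rate": 0.002, "delivery_rate": 0.98}
degraded = {"uptime": 0.97, "p95_latency_ms": 800, "error_rate": 0.03, "delivery_rate": 0.91}
print(evaluate_metrics(healthy))   # []
print(evaluate_metrics(degraded))  # all four metrics flagged
```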
Self-Monitoring Middleware
Add monitoring directly to your callback handler.
Python (Flask)
```python
import time
import threading
from collections import deque

from flask import Flask, jsonify, request

app = Flask(__name__)

# Rolling-window metrics (last 1000 callbacks)
metrics = {
    "total_received": 0,
    "total_errors": 0,
    "latencies": deque(maxlen=1000),
    "last_callback_at": 0,
    "error_counts": {},
}
metrics_lock = threading.Lock()


def store_result(task_id, solution):
    """Placeholder: persist the solved CAPTCHA in your own storage."""


@app.route("/callback")
def captcha_callback():
    start = time.time()
    task_id = request.args.get("id")
    solution = request.args.get("code")

    http_code = 200  # Always ACK to CaptchaAI, even on handler errors
    try:
        # Process the callback
        store_result(task_id, solution)
    except Exception as e:
        # Record the failure for the health endpoint, but still return 200
        # so CaptchaAI does not keep retrying a callback we already received.
        error_type = type(e).__name__
        with metrics_lock:
            metrics["total_errors"] += 1
            metrics["error_counts"][error_type] = \
                metrics["error_counts"].get(error_type, 0) + 1

    # Record metrics
    latency_ms = (time.time() - start) * 1000
    with metrics_lock:
        metrics["total_received"] += 1
        metrics["latencies"].append(latency_ms)
        metrics["last_callback_at"] = time.time()

    return "OK", http_code


@app.route("/health/callbacks")
def callback_health():
    """Health endpoint for monitoring."""
    with metrics_lock:
        latencies = sorted(metrics["latencies"])
        last_at = metrics["last_callback_at"]
        total_received = metrics["total_received"]
        total_errors = metrics["total_errors"]
        error_breakdown = dict(metrics["error_counts"])

    now = time.time()
    avg_latency = sum(latencies) / len(latencies) if latencies else 0
    p95_latency = latencies[int(len(latencies) * 0.95)] if latencies else 0
    seconds_since_last = now - last_at if last_at > 0 else -1

    health = {
        "status": "healthy" if seconds_since_last < 300 else "stale",
        "total_received": total_received,
        "total_errors": total_errors,
        "error_rate": total_errors / max(total_received, 1),
        "avg_latency_ms": round(avg_latency, 2),
        "p95_latency_ms": round(p95_latency, 2),
        "seconds_since_last_callback": round(seconds_since_last, 1),
        "error_breakdown": error_breakdown,
    }
    status_code = 200 if health["status"] == "healthy" else 503
    return jsonify(health), status_code
```
JavaScript (Express)
```javascript
const express = require("express");

const app = express();

// Rolling-window metrics (last 1000 callbacks)
const metrics = {
  totalReceived: 0,
  totalErrors: 0,
  latencies: [],
  lastCallbackAt: 0,
  errorCounts: {},
};
const MAX_LATENCIES = 1000;

// Placeholder: persist the solved CAPTCHA in your own storage
function storeResult(taskId, solution) {}

app.get("/callback", (req, res) => {
  const start = Date.now();
  const taskId = req.query.id;
  const solution = req.query.code;

  try {
    storeResult(taskId, solution);
  } catch (err) {
    // Record the failure, but still ACK with 200 below
    metrics.totalErrors++;
    const errType = err.constructor.name;
    metrics.errorCounts[errType] = (metrics.errorCounts[errType] || 0) + 1;
  }

  const latencyMs = Date.now() - start;
  metrics.totalReceived++;
  metrics.latencies.push(latencyMs);
  if (metrics.latencies.length > MAX_LATENCIES) metrics.latencies.shift();
  metrics.lastCallbackAt = Date.now();

  res.sendStatus(200);
});

app.get("/health/callbacks", (req, res) => {
  const latencies = [...metrics.latencies].sort((a, b) => a - b);
  const avgLatency =
    latencies.length > 0
      ? latencies.reduce((a, b) => a + b, 0) / latencies.length
      : 0;
  const p95Latency =
    latencies.length > 0 ? latencies[Math.floor(latencies.length * 0.95)] : 0;
  const secondsSinceLast =
    metrics.lastCallbackAt > 0
      ? (Date.now() - metrics.lastCallbackAt) / 1000
      : -1;

  const health = {
    status: secondsSinceLast < 300 ? "healthy" : "stale",
    totalReceived: metrics.totalReceived,
    totalErrors: metrics.totalErrors,
    errorRate: metrics.totalErrors / Math.max(metrics.totalReceived, 1),
    avgLatencyMs: Math.round(avgLatency * 100) / 100,
    p95LatencyMs: Math.round(p95Latency * 100) / 100,
    secondsSinceLastCallback: Math.round(secondsSinceLast * 10) / 10,
    errorBreakdown: metrics.errorCounts,
  };
  res.status(health.status === "healthy" ? 200 : 503).json(health);
});

app.listen(3000);
```
Delivery Rate Tracking
Compare tasks submitted with callbacks received to measure delivery success:
Python
```python
import time

submitted_tasks = {}  # task_id -> submitted_at
delivered_tasks = set()
delivery_timeout = 300  # 5 minutes


def on_submit(task_id):
    """Call after submitting to CaptchaAI with pingback."""
    submitted_tasks[task_id] = time.time()


def on_callback(task_id):
    """Call when the callback is received."""
    delivered_tasks.add(task_id)
    submitted_tasks.pop(task_id, None)


def get_delivery_stats():
    """Calculate delivery metrics."""
    now = time.time()
    # Expired tasks: submitted > 5 min ago, callback never received
    expired = [
        tid for tid, ts in submitted_tasks.items()
        if now - ts > delivery_timeout
    ]
    total = len(delivered_tasks) + len(expired)
    rate = len(delivered_tasks) / max(total, 1)
    return {
        "delivered": len(delivered_tasks),
        "missed": len(expired),
        "pending": len(submitted_tasks) - len(expired),
        "delivery_rate": round(rate, 4),
        "missed_task_ids": expired[:10],  # Sample for debugging
    }
```
Alert Conditions
Set up alerts for these conditions:
| Alert | Trigger | Severity |
|---|---|---|
| Stale endpoint | No callback received in 5+ minutes | Warning |
| High error rate | > 5% error rate over 100 requests | Critical |
| Slow responses | p95 latency > 1000 ms | Warning |
| Low delivery rate | < 90% delivery rate | Critical |
| Endpoint down | Health check returns 503 or timeout | Critical |
Simple Alert Script
```python
import time

import requests


def check_callback_health(health_url, alert_callback):
    """Periodic health checker; alert_callback takes (severity, message)."""
    while True:
        try:
            resp = requests.get(health_url, timeout=5)
            health = resp.json()
            if resp.status_code != 200:
                alert_callback("CRITICAL", f"Callback endpoint unhealthy: {health['status']}")
            if health.get("error_rate", 0) > 0.05:
                alert_callback("CRITICAL", f"High error rate: {health['error_rate']:.1%}")
            if health.get("p95_latency_ms", 0) > 1000:
                alert_callback("WARNING", f"Slow callbacks: p95={health['p95_latency_ms']}ms")
            if health.get("seconds_since_last_callback", -1) > 300:
                alert_callback("WARNING", f"No callbacks for {health['seconds_since_last_callback']:.0f}s")
        except (requests.RequestException, ValueError) as e:
            # ValueError covers a non-JSON body from a crashed endpoint
            alert_callback("CRITICAL", f"Health check failed: {e}")
        time.sleep(60)  # Check every minute
```
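The `alert_callback` parameter is just a function taking `(severity, message)`. A minimal sink that logs locally is enough to start; swap it for a Slack, PagerDuty, or email notifier later. This sketch is an assumption about how you might wire it, not part of the CaptchaAI API:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
logger = logging.getLogger("captcha-alerts")


def log_alert(severity, message):
    """Minimal alert sink: log locally and return the formatted line."""
    line = f"[{severity}] {message}"
    if severity == "CRITICAL":
        logger.error(line)
    else:
        logger.warning(line)
    return line


# Usage with the checker above (it loops forever, so run it in a thread):
# threading.Thread(target=check_callback_health,
#                  args=("http://localhost:5000/health/callbacks", log_alert),
#                  daemon=True).start()
log_alert("WARNING", "Slow callbacks: p95=1200ms")
```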
External Monitoring Integration
For production systems, pair self-monitoring with external uptime checks:
| Tool | Integration |
|---|---|
| UptimeRobot | Monitor /health/callbacks endpoint |
| Pingdom | HTTP check with response body validation |
| AWS CloudWatch | Synthetic canary on health endpoint |
| Self-hosted | Cron job calling health check script |
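For the self-hosted row, a small script suitable for cron is enough: return a non-zero exit code when the health endpoint is unhappy and let cron's `MAILTO` (or a wrapper) raise the alarm. The URL below is a placeholder for your own endpoint:

```python
import requests

HEALTH_URL = "http://localhost:5000/health/callbacks"  # placeholder


def main(url=HEALTH_URL):
    """Return 0 when healthy, 1 otherwise (cron-friendly exit codes)."""
    try:
        resp = requests.get(url, timeout=5)
    except requests.RequestException as e:
        print(f"health check failed: {e}")
        return 1
    if resp.status_code != 200:
        print(f"unhealthy: HTTP {resp.status_code}")
        return 1
    print("healthy")
    return 0


# In crontab, run it every minute and exit with the returned code:
#   * * * * * /usr/bin/python3 /opt/checks/callback_health.py
# by ending the script with:  sys.exit(main())
```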
Troubleshooting
| Issue | Cause | Fix |
|---|---|---|
| Health endpoint shows "stale" with no callbacks | No tasks submitted recently, or callbacks not reaching server | Check if tasks are being submitted with pingback; verify firewall rules |
| High latency on callback handler | Slow database writes in handler | Process async — accept callback, queue for background processing |
| Delivery rate dropping | Server restarts clearing in-memory task tracking | Use Redis or database to persist submitted task IDs |
| Error rate spikes | Downstream service (database) failing | Check error breakdown; fix underlying service |
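The "persist task IDs" fix from the table can be sketched with stdlib `sqlite3` (Redis works the same way; the table and column names here are made up for the example):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")  # use a file path in production so data survives restarts
conn.execute(
    "CREATE TABLE IF NOT EXISTS pending_tasks ("
    "  task_id TEXT PRIMARY KEY,"
    "  submitted_at REAL NOT NULL)"
)


def on_submit(task_id):
    conn.execute(
        "INSERT OR REPLACE INTO pending_tasks VALUES (?, ?)",
        (task_id, time.time()),
    )
    conn.commit()


def on_callback(task_id):
    conn.execute("DELETE FROM pending_tasks WHERE task_id = ?", (task_id,))
    conn.commit()


def pending_count():
    return conn.execute("SELECT COUNT(*) FROM pending_tasks").fetchone()[0]


on_submit("t1")
on_submit("t2")
on_callback("t1")
print(pending_count())  # 1
```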
FAQ
Should I use a separate service for monitoring?
For small setups, self-monitoring middleware is sufficient. For production systems with SLAs, add external monitoring (UptimeRobot, Pingdom) that checks from outside your infrastructure.
How long should I keep metrics in memory?
A rolling window of the last 1,000 events is usually enough for real-time dashboards. For historical analysis, export metrics to Prometheus, Datadog, or a time-series database.
What if my callback endpoint is behind a load balancer?
Each instance tracks its own metrics. Aggregate across instances in your monitoring platform, or expose a shared metrics store (Redis) that all instances write to.
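Aggregation can be as simple as merging each instance's `/health/callbacks` payload into one fleet-wide view. A sketch assuming the JSON shape produced by the health endpoint above:

```python
def aggregate_health(instances):
    """Merge per-instance health dicts into one fleet-wide summary."""
    total = sum(h["total_received"] for h in instances)
    errors = sum(h["total_errors"] for h in instances)
    # Freshest instance wins: if any instance is still receiving callbacks,
    # CaptchaAI can reach the fleet.
    last_seen = min(h["seconds_since_last_callback"] for h in instances)
    return {
        "instances": len(instances),
        "total_received": total,
        "total_errors": errors,
        "error_rate": errors / max(total, 1),
        "seconds_since_last_callback": last_seen,
        "status": "healthy" if last_seen < 300 else "stale",
    }


fleet = aggregate_health([
    {"total_received": 900, "total_errors": 9, "seconds_since_last_callback": 12.0},
    {"total_received": 850, "total_errors": 17, "seconds_since_last_callback": 650.0},
])
print(fleet["status"], fleet["total_received"])  # healthy 1750
```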
Next Steps
Monitor your callback endpoints — get your CaptchaAI API key and add health checks from day one.