When your revenue depends on automated data collection, a CAPTCHA solving outage means lost data and broken SLAs. High availability (HA) design ensures your pipeline keeps running through partial failures — worker crashes, network issues, or API hiccups.
HA Components
```
[Health Checker] ──── monitors ────→ [Worker Pool A]
        ↓                                   ↓ (primary)
[Circuit Breaker]                    [CaptchaAI API]
        ↓                                   ↑ (fallback)
[Failover Router] ─── redirects ───→ [Worker Pool B]
        ↓
[Dead Letter Queue] ← unrecoverable failures
```
Layer 1: Worker Redundancy
Run more workers than you need. When one fails, the remaining workers absorb the load.
Python — Supervised Worker Pool
```python
import os
import queue
import threading
import time

import requests

API_KEY = os.environ["CAPTCHAAI_API_KEY"]

task_queue = queue.Queue(maxsize=200)
results = {}


class SupervisedWorkerPool:
    def __init__(self, worker_count, min_workers=2):
        self.worker_count = worker_count
        self.min_workers = min_workers
        self.workers = {}
        # Reentrant lock: _supervise() holds it while calling
        # _launch_worker(), which acquires it again. A plain Lock
        # would deadlock on the first restart.
        self.lock = threading.RLock()

    def start(self):
        """Launch workers and supervisor."""
        for i in range(self.worker_count):
            self._launch_worker(i)

        # Supervisor thread monitors worker health
        supervisor = threading.Thread(target=self._supervise, daemon=True)
        supervisor.start()

    def _launch_worker(self, worker_id):
        t = threading.Thread(
            target=self._worker_loop,
            args=(worker_id,),
            daemon=True
        )
        with self.lock:
            # Register the worker before starting it so the loop
            # never looks up a missing entry.
            self.workers[worker_id] = {
                "thread": t,
                "alive": True,
                "last_heartbeat": time.time(),
                "tasks_completed": 0
            }
            t.start()

    def _worker_loop(self, worker_id):
        session = requests.Session()
        while True:
            try:
                task = task_queue.get(timeout=30)
                result = solve_captcha(session, task)
                results[task["task_id"]] = result
                with self.lock:
                    self.workers[worker_id]["last_heartbeat"] = time.time()
                    self.workers[worker_id]["tasks_completed"] += 1
                task_queue.task_done()
            except queue.Empty:
                # Heartbeat even when idle
                with self.lock:
                    self.workers[worker_id]["last_heartbeat"] = time.time()
            except Exception as e:
                print(f"Worker {worker_id} error: {e}")
                with self.lock:
                    self.workers[worker_id]["last_heartbeat"] = time.time()

    def _supervise(self):
        """Restart dead workers and replace stalled ones."""
        while True:
            time.sleep(15)
            with self.lock:
                now = time.time()
                for wid, info in list(self.workers.items()):
                    if not info["thread"].is_alive():
                        print(f"Worker {wid} died, restarting")
                        self._launch_worker(wid)
                    elif now - info["last_heartbeat"] > 120:
                        # Threads cannot be killed: the stalled thread keeps
                        # running until its current call returns.
                        print(f"Worker {wid} stalled, replacing")
                        self._launch_worker(wid)

    @property
    def status(self):
        with self.lock:
            alive = sum(1 for w in self.workers.values()
                        if w["thread"].is_alive())
            return {
                "alive": alive,
                "total": len(self.workers),
                "healthy": alive >= self.min_workers
            }


def solve_captcha(session, task):
    # Explicit timeouts keep a hung connection from stalling the worker.
    resp = session.post("https://ocr.captchaai.com/in.php", data={
        "key": API_KEY,
        "method": task.get("method", "userrecaptcha"),
        "googlekey": task["sitekey"],
        "pageurl": task["pageurl"],
        "json": 1
    }, timeout=30)
    data = resp.json()

    if data.get("status") != 1:
        return {"error": data.get("request")}

    captcha_id = data["request"]
    # Poll every 5 seconds, up to 60 times (about 5 minutes)
    for _ in range(60):
        time.sleep(5)
        result = session.get("https://ocr.captchaai.com/res.php", params={
            "key": API_KEY, "action": "get", "id": captcha_id, "json": 1
        }, timeout=30).json()
        if result.get("status") == 1:
            return {"solution": result["request"]}
        if result.get("request") != "CAPCHA_NOT_READY":
            return {"error": result.get("request")}

    return {"error": "TIMEOUT"}


# Start pool with 8 workers, minimum 3 healthy
pool = SupervisedWorkerPool(worker_count=8, min_workers=3)
pool.start()
```
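Because `task_queue` is bounded (`maxsize=200`), a producer that calls `put()` without a timeout blocks once the workers fall behind. One way to surface that backpressure to the caller is a submit helper that gives up after a short wait. This is a minimal sketch: `submit_task` and `demo_queue` are hypothetical names used only for illustration.

```python
import queue


def submit_task(q, task, timeout=1.0):
    """Try to enqueue a task; return False instead of blocking forever."""
    try:
        q.put(task, timeout=timeout)
        return True
    except queue.Full:
        return False


# Demo with a tiny queue so the third submit is rejected.
demo_queue = queue.Queue(maxsize=2)
accepted = [
    submit_task(demo_queue, {"task_id": i}, timeout=0.01)
    for i in range(3)
]
print(accepted)  # [True, True, False]
```

A rejected submit is a signal the caller can act on, for example by retrying later or routing the task to another instance.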
Layer 2: Circuit Breaker
Detect when CaptchaAI is having issues and stop sending requests to avoid wasting balance on timeouts:
JavaScript
```javascript
class CircuitBreaker {
  constructor(options = {}) {
    this.failureThreshold = options.failureThreshold || 5;
    this.resetTimeout = options.resetTimeout || 60000; // 1 minute
    this.failures = 0;
    this.lastFailure = 0;
    this.state = "closed"; // closed, open, half-open
    this.successesInHalfOpen = 0;
  }

  async execute(fn) {
    if (this.state === "open") {
      if (Date.now() - this.lastFailure > this.resetTimeout) {
        this.state = "half-open";
        this.successesInHalfOpen = 0;
      } else {
        throw new Error("Circuit breaker is OPEN: requests blocked");
      }
    }

    try {
      const result = await fn();
      this._onSuccess();
      return result;
    } catch (err) {
      this._onFailure();
      throw err;
    }
  }

  _onSuccess() {
    if (this.state === "half-open") {
      this.successesInHalfOpen++;
      if (this.successesInHalfOpen >= 3) {
        this.state = "closed";
        this.failures = 0;
        console.log("Circuit breaker CLOSED: service recovered");
      }
    } else {
      this.failures = 0;
    }
  }

  _onFailure() {
    this.failures++;
    this.lastFailure = Date.now();
    if (this.failures >= this.failureThreshold) {
      this.state = "open";
      console.log(
        `Circuit breaker OPEN: ${this.failures} consecutive failures`
      );
    }
  }
}

// Usage (solveCaptcha is assumed to be defined elsewhere)
const breaker = new CircuitBreaker({
  failureThreshold: 5,
  resetTimeout: 60000,
});

async function solveCaptchaWithBreaker(sitekey, pageurl) {
  return breaker.execute(() => solveCaptcha(sitekey, pageurl));
}
```
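Since the worker pool in Layer 1 is written in Python, a port of the same state machine may be useful there. The following is a sketch under assumptions, not a reference implementation: the class and exception names, the injectable `clock` parameter (added so the breaker can be tested without waiting), and the explicit "any failure in half-open reopens the circuit" rule are choices made for this example. It is not thread-safe as written; wrap calls in a lock if several workers share one breaker.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60.0,
                 half_open_successes=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout          # seconds
        self.half_open_successes = half_open_successes
        self.clock = clock                          # injectable for tests
        self.failures = 0
        self.last_failure = 0.0
        self.successes_in_half_open = 0
        self.state = "closed"                       # closed, open, half-open

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.last_failure > self.reset_timeout:
                # Timeout elapsed: let trial requests through.
                self.state = "half-open"
                self.successes_in_half_open = 0
            else:
                raise CircuitOpenError("circuit breaker is open")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state == "half-open":
            self.successes_in_half_open += 1
            if self.successes_in_half_open >= self.half_open_successes:
                self.state = "closed"
                self.failures = 0
        else:
            self.failures = 0

    def _on_failure(self):
        self.failures += 1
        self.last_failure = self.clock()
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
```

Note that the breaker only counts exceptions. A solver that reports problems as a return value, such as an `{"error": ...}` dictionary, would need to raise for service-level errors before the breaker can see them.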
Layer 3: Health Check Endpoint
Expose health status for load balancers and monitoring:
Python (Flask)
```python
from flask import Flask, jsonify

# `pool` and `task_queue` are the objects created in the Layer 1 example
app = Flask(__name__)


@app.route("/health")
def health_check():
    pool_status = pool.status
    queue_depth = task_queue.qsize()

    health = {
        "status": "healthy" if pool_status["healthy"] else "degraded",
        "workers_alive": pool_status["alive"],
        "workers_total": pool_status["total"],
        "queue_depth": queue_depth,
        "queue_capacity": task_queue.maxsize
    }

    code = 200 if health["status"] == "healthy" else 503
    return jsonify(health), code


@app.route("/health/ready")
def readiness_check():
    """Readiness probe: is this instance ready to receive tasks?"""
    if pool.status["alive"] > 0 and task_queue.qsize() < task_queue.maxsize:
        return "ready", 200
    return "not ready", 503
```
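The endpoints above report worker and queue state, but not whether solves are actually succeeding. One way to close that gap is to track a rolling window of recent outcomes and expose the error rate. This is a minimal sketch; `SolveStats` and its window size are assumptions made for illustration, not part of the original code.

```python
import threading
from collections import deque


class SolveStats:
    """Rolling window of recent solve outcomes (True = solved)."""

    def __init__(self, window=100):
        self._outcomes = deque(maxlen=window)  # old entries fall off the left
        self._lock = threading.Lock()

    def record(self, success):
        with self._lock:
            self._outcomes.append(bool(success))

    def error_rate(self):
        """Fraction of failed solves in the window; 0.0 when empty."""
        with self._lock:
            if not self._outcomes:
                return 0.0
            return self._outcomes.count(False) / len(self._outcomes)


# Demo: two successes and two failures give a 50% error rate.
stats = SolveStats(window=4)
for outcome in (True, True, False, False):
    stats.record(outcome)
print(stats.error_rate())  # 0.5
```

A worker could call `stats.record("solution" in result)` after each solve, and the value of `error_rate()` could then be added to the `/health` payload or passed to `GracefulDegradation.set_mode` in Layer 4.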
Layer 4: Graceful Degradation
When things go wrong, degrade gracefully instead of failing completely:
```python
class GracefulDegradation:
    def __init__(self):
        self.mode = "normal"  # normal, degraded, emergency

    def set_mode(self, error_rate, queue_depth, workers_alive):
        if workers_alive == 0 or error_rate > 0.5:
            self.mode = "emergency"
        elif error_rate > 0.2 or queue_depth > 150:
            self.mode = "degraded"
        else:
            self.mode = "normal"

    def should_accept_task(self, priority):
        if self.mode == "normal":
            return True
        if self.mode == "degraded":
            return priority in ("high", "critical")
        return priority == "critical"  # Emergency: critical only

    @property
    def status(self):
        return {
            "mode": self.mode,
            "accepting": {
                "normal": self.mode == "normal",
                "degraded": self.mode in ("normal", "degraded"),
                "emergency": True
            }
        }
```
HA Checklist
| Component | Implemented? | Notes |
|---|---|---|
| Multiple workers (N+1) | ☐ | At least 1 spare worker |
| Worker health monitoring | ☐ | Supervisor thread or process manager |
| Automatic worker restart | ☐ | On crash or stall |
| Circuit breaker | ☐ | Stop requests during API issues |
| Health check endpoint | ☐ | For load balancers |
| Graceful degradation | ☐ | Priority-based task acceptance |
| Dead-letter queue | ☐ | For unrecoverable failures |
| Fallback polling | ☐ | When callbacks fail |
| Alerting | ☐ | PagerDuty, Slack, email |
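The dead-letter queue in the checklist has no code in the sections above, so here is one way it might look: retry a failed task a bounded number of times, then park it with its last error for later inspection. This is a sketch under assumptions. `process_with_dlq`, the `attempts` field, the retry budget of 3, and the in-memory list used as the dead-letter store are all hypothetical; a production setup would more likely use a durable store.

```python
import queue

MAX_ATTEMPTS = 3  # hypothetical retry budget


def process_with_dlq(task, solve, retry_queue, dead_letters,
                     max_attempts=MAX_ATTEMPTS):
    """Run solve(task); requeue on failure, dead-letter when out of attempts."""
    task["attempts"] = task.get("attempts", 0) + 1
    try:
        result = solve(task)
        if "error" not in result:
            return result
        error = result["error"]
    except Exception as exc:
        error = str(exc)

    if task["attempts"] >= max_attempts:
        # Unrecoverable: keep the task and its last error for inspection.
        dead_letters.append({"task": task, "error": error})
    else:
        retry_queue.put(task)
    return None


# Demo with a stub solver that always fails.
def always_fails(task):
    return {"error": "SIMULATED_FAILURE"}


retry_queue = queue.Queue()
dead_letters = []
retry_queue.put({"task_id": "t1"})
while not retry_queue.empty():
    process_with_dlq(retry_queue.get(), always_fails, retry_queue, dead_letters)

print(len(dead_letters), dead_letters[0]["task"]["attempts"])  # 1 3
```

The solver is expected to return the same shape as `solve_captcha` in Layer 1: a dictionary with either a `solution` or an `error` key.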
Troubleshooting
| Issue | Cause | Fix |
|---|---|---|
| All workers crash simultaneously | Shared dependency failure (DNS, network) | Add retry with backoff; check infrastructure health |
| Circuit breaker stays open | Reset timeout too long, or issue persists | Reduce reset timeout; investigate root cause |
| Health check passes but tasks fail | Health check is too simple | Check actual solve success in health endpoint |
| Failover flapping | Unstable network causing rapid healthy/unhealthy switches | Add hysteresis (require N consecutive failures before failover) |
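The hysteresis fix in the last row can be sketched as a small state holder that changes its verdict only after several consecutive observations disagree with it. `HysteresisSwitch` and its default thresholds are assumptions made for illustration.

```python
class HysteresisSwitch:
    """Flip health state only after N consecutive contrary observations."""

    def __init__(self, fail_after=3, recover_after=3):
        self.fail_after = fail_after
        self.recover_after = recover_after
        self.healthy = True
        self._streak = 0  # consecutive observations that contradict the state

    def observe(self, ok):
        if ok == self.healthy:
            self._streak = 0
        else:
            self._streak += 1
            limit = self.recover_after if ok else self.fail_after
            if self._streak >= limit:
                self.healthy = ok
                self._streak = 0
        return self.healthy


# A single failed check does not trigger failover; three in a row do.
switch = HysteresisSwitch(fail_after=3, recover_after=3)
checks = [False, True, False, False, False, True, True, True]
states = [switch.observe(ok) for ok in checks]
print(states)
# [True, True, True, True, False, False, False, True]
```

Using a larger `recover_after` than `fail_after` makes the router slower to trust a provider again than to abandon it, which is often the safer bias on an unstable network.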
FAQ
What's the minimum HA setup?
Two workers with a supervisor process, a circuit breaker, and basic health monitoring. This handles single-worker failures and API hiccups.
Should I have a secondary CAPTCHA provider as failover?
For critical systems, yes. If CaptchaAI is unreachable, route to a backup provider. CaptchaAI's API is compatible with common formats, making dual-provider setup straightforward.
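A dual-provider setup can be sketched as an ordered list of solver functions tried in turn. Everything here is hypothetical: `solve_with_failover`, the provider names, and the stub solvers stand in for real clients, and each solver is assumed to return a dictionary with either a `solution` or an `error` key, as `solve_captcha` does in Layer 1.

```python
def solve_with_failover(task, providers):
    """Try each (name, solve_fn) pair in order; return the first solution."""
    errors = {}
    for name, solve in providers:
        try:
            result = solve(task)
        except Exception as exc:
            # Network-level failure: remember it and try the next provider.
            errors[name] = str(exc)
            continue
        if "solution" in result:
            return {"provider": name, **result}
        errors[name] = result.get("error", "UNKNOWN")
    return {"error": "ALL_PROVIDERS_FAILED", "details": errors}


# Stub solvers standing in for real provider clients.
def primary_down(task):
    raise ConnectionError("primary unreachable")


def backup_ok(task):
    return {"solution": "stub-token"}


outcome = solve_with_failover(
    {"sitekey": "SITE_KEY", "pageurl": "https://example.com"},
    [("captchaai", primary_down), ("backup", backup_ok)],
)
print(outcome)  # {'provider': 'backup', 'solution': 'stub-token'}
```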
How do I test HA without causing real outages?
Kill individual worker processes during load tests. Simulate network failures with tc netem (Linux) or add artificial delays. Use chaos engineering tools for automated failure injection.
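For failure injection at the application level, one lightweight option is to wrap the solver in a function that raises at a configurable rate. This is only a sketch: `with_fault_injection` and `stub_solver` are hypothetical names, and the wrapper simulates connection errors only, not slow responses or partial outages.

```python
import random


def with_fault_injection(solve, failure_rate, rng=None):
    """Wrap a solver so a fraction of calls raise ConnectionError."""
    rng = rng or random.Random()

    def wrapped(task):
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return solve(task)

    return wrapped


def stub_solver(task):
    return {"solution": "stub-token"}


# Seeded RNG so a test run is repeatable.
flaky = with_fault_injection(stub_solver, failure_rate=0.3,
                             rng=random.Random(42))
outcomes = []
for i in range(1000):
    try:
        flaky({"task_id": i})
        outcomes.append(True)
    except ConnectionError:
        outcomes.append(False)

observed = outcomes.count(False) / len(outcomes)
print(f"observed failure rate: {observed:.2f}")
```

Running the worker pool, circuit breaker, and health endpoint against a wrapped solver like this shows how each layer reacts to a given failure rate before a real outage does.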
Next Steps
Build resilient CAPTCHA solving — get your CaptchaAI API key and implement HA from the start.