High Availability CAPTCHA Solving: Failover and Redundancy

When your revenue depends on automated data collection, a CAPTCHA solving outage means lost data and broken SLAs. High availability (HA) design ensures your pipeline keeps running through partial failures — worker crashes, network issues, or API hiccups.

HA Components

[Health Checker] ──── monitors ────→ [Worker Pool A]
       ↓                                    ↓ (primary)
[Circuit Breaker]                    [CaptchaAI API]
       ↓                                    ↑ (fallback)
[Failover Router] ── redirects ──→ [Worker Pool B]
       ↓
[Dead Letter Queue] ← unrecoverable failures

Layer 1: Worker Redundancy

Run more workers than you need. When one fails, the remaining workers absorb the load.
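
How many workers count as "more than you need" depends on throughput and solve time. A rough sizing sketch based on Little's law (busy workers are roughly arrival rate times time per task) follows. The function name, the spare count, and the example load are illustrative assumptions, not CaptchaAI guidance:

```python
import math


def workers_needed(tasks_per_minute, avg_solve_seconds, spares=1):
    """Estimate worker count: busy workers under steady load, plus spares.

    Each worker in this guide solves one CAPTCHA at a time, so the number
    of busy workers is roughly arrival rate x average solve time.
    """
    if tasks_per_minute < 0 or avg_solve_seconds < 0 or spares < 0:
        raise ValueError("inputs must be non-negative")
    busy = tasks_per_minute * avg_solve_seconds / 60
    return math.ceil(busy) + spares


# Hypothetical load: 12 tasks per minute at 30 seconds each keeps
# 6 workers busy, so N+1 redundancy means 7.
print(workers_needed(12, 30))  # 7
```

Measure your own average solve time before relying on the result; the estimate ignores bursts, so a bounded queue is still needed to absorb spikes.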

Python — Supervised Worker Pool

import os
import time
import threading
import queue
import requests

API_KEY = os.environ["CAPTCHAAI_API_KEY"]
task_queue = queue.Queue(maxsize=200)
results = {}


class SupervisedWorkerPool:
    def __init__(self, worker_count, min_workers=2):
        self.worker_count = worker_count
        self.min_workers = min_workers
        self.workers = {}
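        # Re-entrant lock: _supervise calls _launch_worker while holding it,
        # which would deadlock with a plain threading.Lock
        self.lock = threading.RLock()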

    def start(self):
        """Launch workers and supervisor."""
        for i in range(self.worker_count):
            self._launch_worker(i)

        # Supervisor thread monitors worker health
        supervisor = threading.Thread(target=self._supervise, daemon=True)
        supervisor.start()

    def _launch_worker(self, worker_id):
        t = threading.Thread(
            target=self._worker_loop,
            args=(worker_id,),
            daemon=True
        )
        # Register the worker before starting it so the worker loop
        # never looks up an entry that does not exist yet
        with self.lock:
            self.workers[worker_id] = {
                "thread": t,
                "alive": True,
                "last_heartbeat": time.time(),
                "tasks_completed": 0
            }
        t.start()

    def _worker_loop(self, worker_id):
        session = requests.Session()
        while True:
            try:
                task = task_queue.get(timeout=30)
            except queue.Empty:
                task = None  # Idle: fall through to the heartbeat below

            completed = 0
            if task is not None:
                try:
                    results[task["task_id"]] = solve_captcha(session, task)
                    completed = 1
                except Exception as e:
                    print(f"Worker {worker_id} error: {e}")
                finally:
                    # Always mark the task done so task_queue.join() cannot hang
                    task_queue.task_done()

            # Heartbeat after every iteration, idle or not
            with self.lock:
                self.workers[worker_id]["last_heartbeat"] = time.time()
                self.workers[worker_id]["tasks_completed"] += completed

    def _supervise(self):
        """Restart dead workers."""
        while True:
            time.sleep(15)
            with self.lock:
                now = time.time()
                for wid, info in list(self.workers.items()):
                    if not info["thread"].is_alive():
                        print(f"Worker {wid} died — restarting")
                        self._launch_worker(wid)
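                    # The threshold must exceed the longest possible solve
                    # (60 polls x 5 s = 300 s); otherwise a busy worker is
                    # mistaken for a stalled one.
                    elif now - info["last_heartbeat"] > 360:
                        # Python threads cannot be killed: the old thread may
                        # still finish its task after the replacement starts.
                        print(f"Worker {wid} stalled, replacing")
                        self._launch_worker(wid)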

    @property
    def status(self):
        with self.lock:
            alive = sum(1 for w in self.workers.values()
                       if w["thread"].is_alive())
            return {
                "alive": alive,
                "total": len(self.workers),
                "healthy": alive >= self.min_workers
            }


def solve_captcha(session, task):
    # Explicit timeouts keep a hung connection from stalling the worker forever
    resp = session.post("https://ocr.captchaai.com/in.php", data={
        "key": API_KEY,
        "method": task.get("method", "userrecaptcha"),
        "googlekey": task["sitekey"],
        "pageurl": task["pageurl"],
        "json": 1
    }, timeout=30)
    data = resp.json()

    if data.get("status") != 1:
        return {"error": data.get("request")}

    captcha_id = data["request"]
    # Poll every 5 seconds, up to 60 times (about 5 minutes in total)
    for _ in range(60):
        time.sleep(5)
        result = session.get("https://ocr.captchaai.com/res.php", params={
            "key": API_KEY, "action": "get", "id": captcha_id, "json": 1
        }, timeout=30).json()
        if result.get("status") == 1:
            return {"solution": result["request"]}
        # Any response other than "not ready" is a terminal error
        if result.get("request") != "CAPCHA_NOT_READY":
            return {"error": result.get("request")}
    return {"error": "TIMEOUT"}


# Start pool with 8 workers, minimum 3 healthy
pool = SupervisedWorkerPool(worker_count=8, min_workers=3)
pool.start()
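
Producers still need a way to hand tasks to the pool. A minimal sketch, assuming tasks are dicts carrying the task_id, sitekey, and pageurl keys that the worker loop and solve_captcha read; submit_task is a hypothetical helper, and it rejects work instead of blocking when the bounded queue is full:

```python
import queue


def submit_task(q, task, wait_seconds=0.0):
    """Try to enqueue a task; return False instead of blocking forever."""
    missing = {"task_id", "sitekey", "pageurl"} - task.keys()
    if missing:
        raise ValueError(f"task is missing keys: {sorted(missing)}")
    try:
        if wait_seconds > 0:
            q.put(task, timeout=wait_seconds)
        else:
            q.put_nowait(task)
        return True
    except queue.Full:
        return False


# Demo with a tiny queue standing in for task_queue (maxsize=200 above)
demo_queue = queue.Queue(maxsize=1)
task = {"task_id": "t1", "sitekey": "SITE_KEY", "pageurl": "https://example.com"}
print(submit_task(demo_queue, task))                       # True
print(submit_task(demo_queue, {**task, "task_id": "t2"}))  # False, queue is full
```

Returning False lets the caller decide whether to retry later, drop the task, or route it elsewhere.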

Layer 2: Circuit Breaker

Detect when CaptchaAI is having issues and stop sending requests to avoid wasting balance on timeouts:

JavaScript

class CircuitBreaker {
  constructor(options = {}) {
    this.failureThreshold = options.failureThreshold || 5;
    this.resetTimeout = options.resetTimeout || 60000; // 1 minute
    this.failures = 0;
    this.lastFailure = 0;
    this.state = "closed"; // closed, open, half-open
    this.successesInHalfOpen = 0;
  }

  async execute(fn) {
    if (this.state === "open") {
      if (Date.now() - this.lastFailure > this.resetTimeout) {
        this.state = "half-open";
        this.successesInHalfOpen = 0;
      } else {
        throw new Error("Circuit breaker is OPEN — requests blocked");
      }
    }

    try {
      const result = await fn();
      this._onSuccess();
      return result;
    } catch (err) {
      this._onFailure();
      throw err;
    }
  }

  _onSuccess() {
    if (this.state === "half-open") {
      this.successesInHalfOpen++;
      if (this.successesInHalfOpen >= 3) {
        this.state = "closed";
        this.failures = 0;
        console.log("Circuit breaker CLOSED — service recovered");
      }
    } else {
      this.failures = 0;
    }
  }

  _onFailure() {
    this.failures++;
    this.lastFailure = Date.now();
    if (this.failures >= this.failureThreshold) {
      this.state = "open";
      console.log(
        `Circuit breaker OPEN — ${this.failures} consecutive failures`
      );
    }
  }
}

// Usage
const breaker = new CircuitBreaker({
  failureThreshold: 5,
  resetTimeout: 60000,
});

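// solveCaptcha is assumed to be defined elsewhere and to throw (reject)
// on failure; the breaker only counts thrown errors.
async function solveCaptchaWithBreaker(sitekey, pageurl) {
  return breaker.execute(() => solveCaptcha(sitekey, pageurl));
}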
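
The worker pool in Layer 1 is Python, so the same state machine may be useful there as well. The following is a sketch of a Python port, not an official client: the class and parameter names are assumptions, a lock is added because the workers are threads, and the clock is injectable so the breaker can be tested without sleeping:

```python
import threading
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60.0,
                 half_open_successes=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.half_open_successes = half_open_successes
        self.clock = clock
        self.failures = 0
        self.last_failure = 0.0
        self.state = "closed"  # closed, open, half-open
        self.successes_in_half_open = 0
        self._lock = threading.Lock()

    def execute(self, fn):
        with self._lock:
            if self.state == "open":
                if self.clock() - self.last_failure > self.reset_timeout:
                    self.state = "half-open"
                    self.successes_in_half_open = 0
                else:
                    raise CircuitOpenError("circuit breaker is open")
        try:
            result = fn()  # run outside the lock so calls stay concurrent
        except Exception:
            with self._lock:
                self._on_failure()
            raise
        with self._lock:
            self._on_success()
        return result

    def _on_success(self):
        if self.state == "half-open":
            self.successes_in_half_open += 1
            if self.successes_in_half_open >= self.half_open_successes:
                self.state = "closed"
                self.failures = 0
        else:
            self.failures = 0

    def _on_failure(self):
        self.failures += 1
        self.last_failure = self.clock()
        # Any failure while half-open reopens the circuit immediately
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"


def always_fails():
    raise RuntimeError("simulated API error")


fake_now = [0.0]  # fake clock so the demo does not wait 60 seconds
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=60,
                         clock=lambda: fake_now[0])
for _ in range(2):
    try:
        breaker.execute(always_fails)
    except RuntimeError:
        pass
print(breaker.state)  # open
```

Note that solve_captcha in this guide returns an error dict rather than raising, so a wrapper would have to raise on `"error" in result` for the breaker to count those failures.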

Layer 3: Health Check Endpoint

Expose health status for load balancers and monitoring:

Python (Flask)

from flask import Flask, jsonify

# pool and task_queue are the objects created in the Layer 1 example

app = Flask(__name__)


@app.route("/health")
def health_check():
    pool_status = pool.status
    queue_depth = task_queue.qsize()

    health = {
        "status": "healthy" if pool_status["healthy"] else "degraded",
        "workers_alive": pool_status["alive"],
        "workers_total": pool_status["total"],
        "queue_depth": queue_depth,
        "queue_capacity": task_queue.maxsize
    }

    code = 200 if health["status"] == "healthy" else 503
    return jsonify(health), code


@app.route("/health/ready")
def readiness_check():
    """Readiness probe — is this instance ready to receive tasks?"""
    if pool.status["alive"] > 0 and task_queue.qsize() < task_queue.maxsize:
        return "ready", 200
    return "not ready", 503
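
The endpoints above report thread liveness, which says nothing about whether solves actually succeed. A sketch of a sliding-window outcome tracker whose error rate could be added to the health response; SolveStats and its window size are hypothetical, and workers would need to call record() after each solve:

```python
import threading
from collections import deque


class SolveStats:
    """Thread-safe error rate over the most recent `window` solve attempts."""

    def __init__(self, window=100):
        self._outcomes = deque(maxlen=window)  # True = solved, False = failed
        self._lock = threading.Lock()

    def record(self, result):
        """Record a result dict shaped like solve_captcha's return value."""
        with self._lock:
            self._outcomes.append("solution" in result)

    def error_rate(self):
        with self._lock:
            if not self._outcomes:
                return 0.0  # no data yet: report healthy rather than guess
            failed = sum(1 for ok in self._outcomes if not ok)
            return failed / len(self._outcomes)


stats = SolveStats(window=4)
for result in ({"solution": "token"}, {"error": "TIMEOUT"},
               {"solution": "token"}, {"solution": "token"}):
    stats.record(result)
print(stats.error_rate())  # 0.25
```

The same error rate is a natural input for the error_rate argument of set_mode in Layer 4.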

Layer 4: Graceful Degradation

When error rates rise or workers die, keep serving high-priority tasks and reject the rest instead of failing completely:

Python

class GracefulDegradation:
    def __init__(self):
        self.mode = "normal"  # normal, degraded, emergency

    def set_mode(self, error_rate, queue_depth, workers_alive):
        if workers_alive == 0 or error_rate > 0.5:
            self.mode = "emergency"
        elif error_rate > 0.2 or queue_depth > 150:
            self.mode = "degraded"
        else:
            self.mode = "normal"

    def should_accept_task(self, priority):
        if self.mode == "normal":
            return True
        if self.mode == "degraded":
            return priority in ("high", "critical")
        return priority == "critical"  # Emergency: critical only

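    @property
    def status(self):
        # Keys are task priorities; each value says whether that priority
        # is currently accepted, mirroring should_accept_task()
        return {
            "mode": self.mode,
            "accepting": {
                "normal": self.mode == "normal",
                "high": self.mode in ("normal", "degraded"),
                "critical": True
            }
        }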

HA Checklist

Component	Implemented?	Notes
Multiple workers (N+1)	☐	At least 1 spare worker
Worker health monitoring	☐	Supervisor thread or process manager
Automatic worker restart	☐	On crash or stall
Circuit breaker	☐	Stop requests during API issues
Health check endpoint	☐	For load balancers
Graceful degradation	☐	Priority-based task acceptance
Dead-letter queue	☐	For unrecoverable failures
Fallback polling	☐	When callbacks fail
Alerting	☐	PagerDuty, Slack, email
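
The dead-letter queue appears in the component diagram and in the checklist but has no code in this guide. A minimal in-memory sketch, assuming a retry budget of three attempts; run_with_dead_letter, MAX_ATTEMPTS, and the record fields are hypothetical, and a production system would more likely persist dead letters to a durable store:

```python
import queue
import time

MAX_ATTEMPTS = 3  # assumed retry budget, tune for your workload
dead_letters = queue.Queue()


def run_with_dead_letter(task, solve, max_attempts=MAX_ATTEMPTS):
    """Call solve(task) up to max_attempts times, then park the task.

    `solve` is any callable returning {"solution": ...} or {"error": ...},
    the shape solve_captcha returns in this guide. Exceptions count as
    failed attempts.
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            result = solve(task)
        except Exception as exc:
            result = {"error": repr(exc)}
        if "solution" in result:
            return result
        last_error = result.get("error")
    # Out of attempts: keep the task and its last error for later inspection
    dead_letters.put({
        "task": task,
        "attempts": max_attempts,
        "last_error": last_error,
        "failed_at": time.time(),
    })
    return {"error": "DEAD_LETTERED"}


# Demo with a stand-in solver that always times out
outcome = run_with_dead_letter({"task_id": "t1"}, lambda t: {"error": "TIMEOUT"})
print(outcome)               # {'error': 'DEAD_LETTERED'}
print(dead_letters.qsize())  # 1
```

The sketch retries immediately; adding a delay between attempts would be a reasonable next step.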

Troubleshooting

Issue	Cause	Fix
All workers crash simultaneously	Shared dependency failure (DNS, network)	Add retry with backoff; check infrastructure health
Circuit breaker stays open	Reset timeout too long, or issue persists	Reduce reset timeout; investigate root cause
Health check passes but tasks fail	Health check is too simple	Check actual solve success in health endpoint
Failover flapping	Unstable network causing rapid healthy/unhealthy switches	Add hysteresis (require N consecutive failures before failover)
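
The hysteresis fix in the last row can be sketched as a small gate that changes state only after several consecutive observations disagree with it. HysteresisGate and its thresholds are illustrative assumptions:

```python
class HysteresisGate:
    """Flip to unhealthy after `fail_after` consecutive failures and back
    to healthy after `recover_after` consecutive successes."""

    def __init__(self, fail_after=3, recover_after=3):
        self.fail_after = fail_after
        self.recover_after = recover_after
        self.healthy = True
        self._streak = 0  # consecutive observations contradicting the state

    def observe(self, ok):
        """Feed one health probe result; return the current state."""
        ok = bool(ok)
        if ok == self.healthy:
            self._streak = 0  # observation agrees with state, reset the streak
            return self.healthy
        self._streak += 1
        needed = self.recover_after if ok else self.fail_after
        if self._streak >= needed:
            self.healthy = ok
            self._streak = 0
        return self.healthy


gate = HysteresisGate(fail_after=3, recover_after=2)
probes = [False, True, False, False, False, True, True]
print([gate.observe(p) for p in probes])
# [True, True, True, True, False, False, True]
```

The single failed probe at the start does not trigger failover; only the run of three does.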

FAQ

What's the minimum HA setup?

Two workers with a supervisor process, a circuit breaker, and basic health monitoring. This handles single-worker failures and API hiccups.

Should I have a secondary CAPTCHA provider as failover?

For critical systems, yes. If CaptchaAI is unreachable, route to a backup provider. CaptchaAI's API is compatible with common formats, making dual-provider setup straightforward.
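
A dual-provider router can be sketched as an ordered list of solver callables tried in turn. The provider names and stand-in solvers below are hypothetical; real entries would wrap solve_captcha and the backup provider's client:

```python
def solve_with_failover(task, providers):
    """Try each (name, solve) pair in order; return the first solution.

    Each `solve` callable takes the task and returns {"solution": ...} or
    {"error": ...}. Exceptions such as connection errors count as failures.
    """
    errors = {}
    for name, solve in providers:
        try:
            result = solve(task)
        except Exception as exc:
            errors[name] = repr(exc)
            continue
        if "solution" in result:
            return {"solution": result["solution"], "provider": name}
        errors[name] = result.get("error")
    return {"error": "ALL_PROVIDERS_FAILED", "details": errors}


def primary_down(task):
    raise ConnectionError("primary unreachable")  # simulated outage


def backup_ok(task):
    return {"solution": "token-from-backup"}


providers = [("captchaai", primary_down), ("backup", backup_ok)]
print(solve_with_failover({"task_id": "t1"}, providers))
# {'solution': 'token-from-backup', 'provider': 'backup'}
```

In practice you may want to fail over only on connectivity errors, since a task-level error such as an invalid sitekey would likely fail on the backup too.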

How do I test HA without causing real outages?

Kill individual worker processes during load tests. Simulate network failures with tc netem (Linux) or add artificial delays. Use chaos engineering tools for automated failure injection.
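
For failure injection at the code level, a solver can be wrapped so that a chosen fraction of calls fail. A sketch under assumptions: with_fault_injection is a hypothetical helper, the injected error type is arbitrary, and the stand-in solver makes no network call:

```python
import random


def with_fault_injection(solve, failure_rate, rng=None):
    """Wrap a solver so a fraction of calls raise ConnectionError."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    rng = rng or random.Random()

    def wrapper(task):
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return solve(task)

    return wrapper


def fake_solve(task):
    return {"solution": "test-token"}  # stand-in, makes no network call


# Seeded generator so a test run is repeatable
flaky = with_fault_injection(fake_solve, failure_rate=0.3, rng=random.Random(42))
failures = 0
for _ in range(1000):
    try:
        flaky({"task_id": "t"})
    except ConnectionError:
        failures += 1
print(failures)  # roughly 300 of the 1000 calls
```

Running the worker pool and circuit breaker against such a wrapper in a test environment exercises the failure paths without touching the real API.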

Next Steps

Build resilient CAPTCHA solving — get your CaptchaAI API key and implement HA from the start.
