Use Cases

Scraping Automation and CAPTCHA Handling

Production scraping pipelines need to handle CAPTCHAs automatically — no manual intervention. This guide shows how to build automated scrapers with CaptchaAI integrated for CAPTCHA solving, error recovery, and scheduling.

Architecture Overview

[Scheduler] → [URL Queue] → [Scraper Workers] → [CAPTCHA Solver] → [Data Store]
                                    ↕
                             [Proxy Rotator]

Each component:

  • Scheduler: Triggers scraping jobs (cron, task queue)
  • URL Queue: Manages URLs to scrape
  • Scraper Workers: Fetch pages, detect CAPTCHAs
  • CAPTCHA Solver: CaptchaAI API handles all CAPTCHA types
  • Proxy Rotator: Distributes requests across IPs
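The Proxy Rotator component is referenced above but not implemented in this guide. A minimal round-robin sketch is below; the proxy URLs are placeholders, and the returned dict matches the format `requests` expects for its `proxies` argument.

```python
# Minimal round-robin proxy rotator (proxy URLs are illustrative placeholders).
from itertools import cycle

class ProxyRotator:
    def __init__(self, proxies):
        self._pool = cycle(proxies)

    def next_proxy(self):
        # Shape expected by requests: session.get(url, proxies=rotator.next_proxy())
        url = next(self._pool)
        return {"http": url, "https": url}

rotator = ProxyRotator(["http://p1.example:8080", "http://p2.example:8080"])
```

For production you would typically extend this with health checks and removal of banned IPs, but round-robin cycling is the core of the pattern.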

Core Scraper with CAPTCHA Handling

import requests
import time
import logging
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

API_KEY = "YOUR_API_KEY"

class AutomatedScraper:
    def __init__(self, api_key, max_retries=3):
        self.api_key = api_key
        self.max_retries = max_retries
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        })
        self.stats = {"pages": 0, "captchas": 0, "errors": 0}

    def scrape(self, url):
        for attempt in range(self.max_retries):
            try:
                resp = self.session.get(url, timeout=30)

                if self._is_captcha(resp.text):
                    self.stats["captchas"] += 1
                    logger.info(f"CAPTCHA detected on {url}")
                    resp = self._solve_and_retry(resp.text, url)

                self.stats["pages"] += 1
                return resp.text

            except Exception as e:
                self.stats["errors"] += 1
                logger.error(f"Attempt {attempt + 1} failed for {url}: {e}")
                if attempt == self.max_retries - 1:
                    raise
                time.sleep(2 ** attempt)

    def _is_captcha(self, html):
        return any(m in html.lower() for m in
                   ["g-recaptcha", "cf-turnstile", "h-captcha", "captcha"])

    def _solve_and_retry(self, html, url):
        soup = BeautifulSoup(html, "html.parser")

        # Detect CAPTCHA type and solve
        rc = soup.find("div", class_="g-recaptcha")
        if rc:
            token = self._solve("userrecaptcha", {
                "googlekey": rc["data-sitekey"],
                "pageurl": url
            })
            return self.session.post(url, data={"g-recaptcha-response": token})

        ts = soup.find("div", class_="cf-turnstile")
        if ts:
            token = self._solve("turnstile", {
                "sitekey": ts["data-sitekey"],
                "pageurl": url
            })
            return self.session.post(url, data={"cf-turnstile-response": token})

        raise Exception("Unrecognized CAPTCHA type")

    def _solve(self, method, params):
        params["key"] = self.api_key
        params["method"] = method

        resp = requests.get("https://ocr.captchaai.com/in.php",
                            params=params, timeout=30)
        if not resp.text.startswith("OK|"):
            raise Exception(f"Submit error: {resp.text}")

        task_id = resp.text.split("|")[1]

        for _ in range(60):
            time.sleep(5)
            result = requests.get("https://ocr.captchaai.com/res.php", params={
                "key": self.api_key, "action": "get", "id": task_id
            }, timeout=30)
            # The API's pending status is the literal string "CAPCHA_NOT_READY"
            if result.text == "CAPCHA_NOT_READY":
                continue
            if result.text.startswith("OK|"):
                return result.text.split("|")[1]
            raise Exception(f"Solve error: {result.text}")

        raise TimeoutError("Solve timed out")

    def get_stats(self):
        return self.stats
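The substring heuristic in `_is_captcha` is deliberately broad (the bare `"captcha"` marker will also match pages that merely mention the word). A standalone copy of the heuristic, useful for sanity-checking it against sample markup before wiring it into the scraper:

```python
# Standalone copy of the detection heuristic from AutomatedScraper._is_captcha.
CAPTCHA_MARKERS = ["g-recaptcha", "cf-turnstile", "h-captcha", "captcha"]

def is_captcha_page(html):
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)

print(is_captcha_page('<div class="g-recaptcha" data-sitekey="x"></div>'))  # True
print(is_captcha_page("<h1>Product listing</h1>"))                          # False
```

If false positives become a problem on your target sites, narrow the marker list to the specific widget class names those sites actually use.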

Batch Processing with Queue

from queue import Queue, Empty
from threading import Thread

def worker(scraper, url_queue, results):
    # Checking empty() before get() is racy with multiple workers:
    # another thread can drain the queue in between. Use get_nowait()
    # and treat Empty as the termination signal instead.
    while True:
        try:
            url = url_queue.get_nowait()
        except Empty:
            break
        try:
            html = scraper.scrape(url)
            results.append({"url": url, "html": html, "status": "success"})
        except Exception as e:
            results.append({"url": url, "error": str(e), "status": "failed"})
        finally:
            url_queue.task_done()
            time.sleep(2)

def scrape_batch(urls, num_workers=3):
    scraper = AutomatedScraper(API_KEY)
    url_queue = Queue()
    results = []

    for url in urls:
        url_queue.put(url)

    threads = []
    for _ in range(num_workers):
        t = Thread(target=worker, args=(scraper, url_queue, results))
        t.start()
        threads.append(t)

    for t in threads:
        t.join()

    logger.info(f"Stats: {scraper.get_stats()}")
    return results

Scheduling with Cron

Create a script that runs on a schedule:

# scheduled_scrape.py
import json
import time  # used for the timestamped results filename

# scrape_batch is the batch helper defined above (or import it from your module)

def run_scheduled_scrape():
    urls = [
        "https://example.com/page/1",
        "https://example.com/page/2",
        "https://example.com/page/3",
    ]

    results = scrape_batch(urls)

    # Save results
    with open(f"results_{int(time.time())}.json", "w") as f:
        json.dump(results, f, indent=2)

    # Report stats
    success = sum(1 for r in results if r["status"] == "success")
    failed = sum(1 for r in results if r["status"] == "failed")
    print(f"Completed: {success} success, {failed} failed")

if __name__ == "__main__":
    run_scheduled_scrape()

Add to crontab:

0 */6 * * * cd /path/to/scraper && python scheduled_scrape.py

Error Recovery Patterns

import json
import os

def scrape_with_recovery(scraper, urls, checkpoint_file="checkpoint.json"):
    # Load checkpoint
    completed = set()
    if os.path.exists(checkpoint_file):
        with open(checkpoint_file) as f:
            completed = set(json.load(f))

    remaining = [u for u in urls if u not in completed]
    logger.info(f"Resuming: {len(remaining)} URLs remaining")

    for url in remaining:
        try:
            html = scraper.scrape(url)
            # Process html...
            completed.add(url)

            # Save checkpoint
            with open(checkpoint_file, "w") as f:
                json.dump(list(completed), f)

        except Exception as e:
            logger.error(f"Failed: {url} - {e}")
            continue
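The checkpoint load/save logic above can be factored into two small helpers, which makes the pattern easy to test in isolation. This is a sketch; the function names are illustrative, not part of any library.

```python
# Checkpoint helpers mirroring the recovery pattern above.
import json
import os

def load_checkpoint(path):
    # Returns the set of completed URLs, or an empty set on first run
    if os.path.exists(path):
        with open(path) as f:
            return set(json.load(f))
    return set()

def save_checkpoint(path, completed):
    # JSON has no set type, so persist as a sorted list for stable diffs
    with open(path, "w") as f:
        json.dump(sorted(completed), f)
```

Writing the checkpoint after every URL is the simplest safe choice; for very large runs, saving every N URLs reduces disk churn at the cost of re-scraping up to N-1 pages after a crash.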

FAQ

How do I handle different CAPTCHA types in one pipeline?

The AutomatedScraper class above detects the CAPTCHA type automatically and uses the correct CaptchaAI method. Add detection for each CAPTCHA type your target sites use.

What's the optimal number of concurrent workers?

Start with 3-5 workers. More workers mean more concurrent requests, which increases CAPTCHA frequency. Balance speed against CAPTCHA cost.

How do I monitor my scraping pipeline?

Track three metrics: pages scraped, CAPTCHAs solved, and errors. The stats dict in the scraper class provides this. For production, export to a monitoring system.
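One lightweight way to export those three metrics is to emit each run's stats dict as a timestamped JSON line, which most log aggregators can ingest directly. A minimal sketch (field names are illustrative):

```python
# Serialize the scraper's stats dict as a timestamped JSON log line.
import json
import time

def stats_log_line(stats):
    record = {"ts": int(time.time()), **stats}
    return json.dumps(record, sort_keys=True)

print(stats_log_line({"pages": 120, "captchas": 7, "errors": 3}))
```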
