Scrapy + CaptchaAI Integration Guide

Scrapy is one of the most widely used Python crawling frameworks. This guide shows how to add CaptchaAI CAPTCHA solving to your spiders with a custom downloader middleware.

Requirements

Requirement | Details
Python | 3.8+
Scrapy | 2.5+
requests | For CaptchaAI API calls
CaptchaAI API key | Available from your CaptchaAI account

Install the dependencies:

pip install scrapy requests

CaptchaAI Solver Module

Create captcha_solver.py in your Scrapy project root:

import time

import requests


class CaptchaAISolver:
    """Small client for the CaptchaAI in.php / res.php endpoints."""

    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = "https://ocr.captchaai.com"

    def solve_recaptcha(self, site_key, page_url, timeout=300):
        """Submit a reCAPTCHA task and return the solved token."""
        resp = requests.get(f"{self.base_url}/in.php", params={
            "key": self.api_key,
            "method": "userrecaptcha",
            "googlekey": site_key,
            "pageurl": page_url,
        }, timeout=30)
        return self._wait_for_result(self._task_id(resp), timeout)

    def solve_image(self, image_base64, timeout=120):
        """Submit a base64-encoded image and return the recognized text."""
        # POST the image: a base64 body can exceed URL length limits.
        resp = requests.post(f"{self.base_url}/in.php", data={
            "key": self.api_key,
            "method": "base64",
            "body": image_base64,
        }, timeout=30)
        return self._wait_for_result(self._task_id(resp), timeout)

    @staticmethod
    def _task_id(resp):
        """Extract the task ID from an "OK|<id>" submit response."""
        if not resp.text.startswith("OK|"):
            raise Exception(f"Submit failed: {resp.text}")
        return resp.text.split("|", 1)[1]

    def _wait_for_result(self, task_id, timeout):
        """Poll res.php every 5 seconds until solved, failed, or timed out."""
        deadline = time.time() + timeout

        while time.time() < deadline:
            time.sleep(5)
            result = requests.get(f"{self.base_url}/res.php", params={
                "key": self.api_key,
                "action": "get",
                "id": task_id,
            }, timeout=30)

            if result.text == "CAPCHA_NOT_READY":
                continue
            if result.text.startswith("OK|"):
                return result.text.split("|", 1)[1]
            raise Exception(f"Solve failed: {result.text}")

        raise TimeoutError(f"Task {task_id} timed out")
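
Both solver methods depend on the same plain-text reply format: `OK|<value>` on success, `CAPCHA_NOT_READY` while a task is pending, and anything else treated as an error. As a minimal sketch, the helper below isolates that parsing so it can be checked without network access. The name `parse_reply` is hypothetical and not part of the CaptchaAI API; it encodes only the three cases the solver code handles.

```python
from typing import Optional, Tuple


def parse_reply(text: str) -> Tuple[str, Optional[str]]:
    """Classify a reply body as ("ok", value), ("pending", None) or ("error", code).

    Hypothetical helper: it mirrors only the checks made in the solver code.
    """
    body = text.strip()
    if body == "CAPCHA_NOT_READY":
        return ("pending", None)
    if body.startswith("OK|"):
        # Split once so a value that itself contains "|" stays intact.
        return ("ok", body.split("|", 1)[1])
    return ("error", body)


print(parse_reply("OK|12345"))          # ('ok', '12345')
print(parse_reply("CAPCHA_NOT_READY"))  # ('pending', None)
print(parse_reply("ERROR_PAGEURL"))     # ('error', 'ERROR_PAGEURL')
```

Keeping the parsing in one pure function makes it easy to unit-test the polling logic with canned replies.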

Scrapy Middleware

Add the middleware to middlewares.py in your project package:

import re

from scrapy.http import TextResponse

from captcha_solver import CaptchaAISolver


class CaptchaAIMiddleware:
    """Scrapy downloader middleware that detects and solves CAPTCHAs."""

    def __init__(self, api_key):
        self.solver = CaptchaAISolver(api_key)

    @classmethod
    def from_crawler(cls, crawler):
        api_key = crawler.settings.get("CAPTCHAAI_API_KEY")
        if not api_key:
            raise ValueError("CAPTCHAAI_API_KEY setting is required")
        return cls(api_key)

    def process_response(self, request, response, spider):
        # Binary responses (images, PDFs) have no .text; pass them through.
        if not isinstance(response, TextResponse):
            return response

        # Check for reCAPTCHA on the page.
        # The solver call blocks until the CAPTCHA is solved or times out.
        site_key = self._find_recaptcha_key(response.text)
        if site_key:
            spider.logger.info(f"reCAPTCHA detected on {response.url}")
            token = self.solver.solve_recaptcha(site_key, response.url)
            # request.meta is exposed to the spider as response.meta.
            request.meta["captcha_token"] = token
            spider.logger.info("CAPTCHA solved successfully")

        # Check for image CAPTCHA
        captcha_img = self._find_image_captcha(response)
        if captcha_img:
            spider.logger.info(f"Image CAPTCHA detected on {response.url}")
            text = self.solver.solve_image(captcha_img)
            request.meta["captcha_text"] = text
            spider.logger.info(f"Image CAPTCHA solved: {text}")

        return response

    def _find_recaptcha_key(self, html):
        match = re.search(
            r'data-sitekey=["\']([A-Za-z0-9_-]+)["\']', html
        )
        return match.group(1) if match else None

    def _find_image_captcha(self, response):
        img = response.css("img#captcha-image::attr(src)").get()
        if img and img.startswith("data:image"):
            return img.split(",", 1)[1]
        return None
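
To see what the two detection helpers match, here is a standalone sketch that applies the same regular expression and data-URI split to sample markup. It uses only the standard library. The names `find_sitekey` and `data_uri_payload` and the sample values are illustrative and are not part of Scrapy or CaptchaAI.

```python
import base64
import re

# Same pattern as _find_recaptcha_key in the middleware.
SITEKEY_RE = re.compile(r'data-sitekey=["\']([A-Za-z0-9_-]+)["\']')


def find_sitekey(html):
    """Return the first data-sitekey value in the HTML, or None."""
    match = SITEKEY_RE.search(html)
    return match.group(1) if match else None


def data_uri_payload(src):
    """Return the base64 part of a data:image URI, or None for ordinary URLs."""
    if src and src.startswith("data:image") and "," in src:
        return src.split(",", 1)[1]
    return None


# Made-up sample inputs for illustration only.
sample_html = '<div class="g-recaptcha" data-sitekey="SAMPLE_site-key_123"></div>'
sample_src = "data:image/png;base64," + base64.b64encode(b"fake-bytes").decode()

print(find_sitekey(sample_html))                        # SAMPLE_site-key_123
print(base64.b64decode(data_uri_payload(sample_src)))   # b'fake-bytes'
print(data_uri_payload("/captcha.png"))                 # None
```

An image CAPTCHA served from an ordinary URL yields None here, just as in the middleware; supporting that case would mean downloading the image first and base64-encoding it.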

Settings Configuration

Add to settings.py:

import os

CAPTCHAAI_API_KEY = os.environ.get("CAPTCHAAI_API_KEY")

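DOWNLOADER_MIDDLEWARES = {
    # Replace "myproject" with your project's package name.
    # 560 is lower than the built-in HttpCompressionMiddleware (590),
    # so process_response() receives the already decompressed HTML.
    "myproject.middlewares.CaptchaAIMiddleware": 560,
}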

Spider Example

import scrapy


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        # If CAPTCHA was solved, the token is in meta
        token = response.meta.get("captcha_token")
        if token:
            # Resubmit the page with the token
            yield scrapy.FormRequest(
                url=response.url,
                formdata={"g-recaptcha-response": token},
                callback=self.parse_products,
            )
        else:
            yield from self.parse_products(response)

    def parse_products(self, response):
        for product in response.css(".product-item"):
            yield {
                "name": product.css("h2::text").get(),
                "price": product.css(".price::text").get(),
                "url": response.urljoin(
                    product.css("a::attr(href)").get()
                ),
            }

        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))

Retry on CAPTCHA Pages

Add automatic retry when CAPTCHAs appear:

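from scrapy.http import TextResponse


class CaptchaRetryMiddleware:
    """Retry requests that return CAPTCHA challenge pages."""

    max_retries = 3

    def process_response(self, request, response, spider):
        if self._is_captcha_page(response):
            retries = request.meta.get("captcha_retries", 0)
            if retries < self.max_retries:
                spider.logger.info(
                    f"CAPTCHA page detected, retry {retries + 1}"
                )
                # dont_filter=True keeps the duplicate filter from
                # silently dropping the repeated URL.
                retry = request.replace(dont_filter=True)
                retry.meta["captcha_retries"] = retries + 1
                return retry

        # Retries exhausted or no CAPTCHA: hand the response to the spider.
        return response

    def _is_captcha_page(self, response):
        # Only text responses can be searched for CAPTCHA markers.
        if not isinstance(response, TextResponse):
            return False
        indicators = [
            "g-recaptcha",
            "cf-turnstile",
            "captcha-image",
            "Please verify you are human",
        ]
        return any(ind in response.text for ind in indicators)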
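
The retry counter lives in request.meta, so it travels with each rescheduled request and caps the number of attempts per URL. The following standalone sketch reproduces that bookkeeping with a plain dict standing in for request.meta. The function `next_action` is a hypothetical name used only for this illustration.

```python
MAX_RETRIES = 3
INDICATORS = (
    "g-recaptcha",
    "cf-turnstile",
    "captcha-image",
    "Please verify you are human",
)


def is_captcha_page(html):
    """True if any known CAPTCHA marker appears in the page source."""
    return any(marker in html for marker in INDICATORS)


def next_action(meta, html):
    """Return "retry" or "pass", updating the counter the way the middleware does."""
    if is_captcha_page(html):
        retries = meta.get("captcha_retries", 0)
        if retries < MAX_RETRIES:
            meta["captcha_retries"] = retries + 1
            return "retry"
    return "pass"


# Simulate the same CAPTCHA page coming back five times in a row.
meta = {}
challenge = '<div class="g-recaptcha"></div>'
actions = [next_action(meta, challenge) for _ in range(5)]
print(actions)  # ['retry', 'retry', 'retry', 'pass', 'pass']
```

Once the budget is spent, the CAPTCHA page is passed through to the spider, so the callback should still be prepared to receive a challenge page.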

Running the Spider

export CAPTCHAAI_API_KEY="YOUR_API_KEY"
scrapy crawl products -o products.json

Troubleshooting

Issue | Cause | Fix
ValueError: CAPTCHAAI_API_KEY setting is required | Missing env var | Set CAPTCHAAI_API_KEY
CAPTCHA not detected | Different HTML structure | Update regex pattern in middleware
TimeoutError on solve | Slow solve or network | Increase timeout in solver
Spider gets blocked after solving | IP-based blocking | Add proxy rotation middleware
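
For the IP-based blocking case, a proxy rotation middleware can be quite small. The sketch below is illustrative only: `RotatingProxyMiddleware` and the `PROXY_LIST` setting are hypothetical names, and the proxy URLs are placeholders. The one Scrapy convention it relies on is that the proxy for a request is taken from request.meta["proxy"].

```python
import itertools
from types import SimpleNamespace


class RotatingProxyMiddleware:
    """Hypothetical downloader middleware that assigns proxies round-robin."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy URL is required")
        self._pool = itertools.cycle(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is an assumed custom setting, not a built-in Scrapy one.
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        # Overwrite on every pass so a retried request gets a fresh proxy.
        request.meta["proxy"] = next(self._pool)
        return None  # continue normal downloading


# Stand-alone demo: SimpleNamespace objects stand in for Scrapy requests.
middleware = RotatingProxyMiddleware(
    ["http://proxy1.example:8000", "http://proxy2.example:8000"]
)
fake_requests = [SimpleNamespace(meta={}) for _ in range(3)]
for fake in fake_requests:
    middleware.process_request(fake, spider=None)
assigned = [fake.meta["proxy"] for fake in fake_requests]
print(assigned)
```

If you adopt something like this, register it in DOWNLOADER_MIDDLEWARES with an order below 750 so it runs before the built-in HttpProxyMiddleware.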

FAQ

Can I use this with Scrapy-Splash or Scrapy-Playwright?

Yes. For JavaScript-rendered pages the middleware works the same way: it inspects the final HTML response for CAPTCHA elements.

Does the middleware slow down crawling?

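Solving typically takes 5-15 seconds per CAPTCHA, and only pages with CAPTCHAs incur that delay. Be aware that the solver above uses blocking requests calls and time.sleep, which run on Scrapy's single reactor thread, so other downloads pause while a solve is in progress and raising CONCURRENT_REQUESTS alone does not hide the wait. For larger crawls, run the solve in a worker thread or use an asynchronous HTTP client so other pages keep downloading.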
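
To illustrate how overlapping the waits helps, here is a minimal standard-library sketch that runs several blocking solves in worker threads. `fake_solve` is a stand-in for a blocking call such as `CaptchaAISolver.solve_recaptcha`, and `solve_all` is a hypothetical helper. In a Scrapy project the same idea is usually expressed through Twisted's thread pool (for example `twisted.internet.threads.deferToThread`), which is not wired up here.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_solve(page_url, delay=0.2):
    """Stand-in for a blocking solver call: waits, then returns a token."""
    time.sleep(delay)
    return f"token-for-{page_url}"


def solve_all(page_urls, max_workers=4):
    """Run blocking solves in worker threads so their waits overlap."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(fake_solve, page_urls))


urls = [f"https://example.com/page/{n}" for n in range(4)]
started = time.monotonic()
tokens = solve_all(urls)
elapsed = time.monotonic() - started
print(tokens[0])
print(f"{len(tokens)} solves in {elapsed:.2f}s")
```

With four workers, the four 0.2-second waits should overlap rather than add up to roughly 0.8 seconds, which is the effect you want for real solves that take several seconds each.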

How do I handle different CAPTCHA types per page?

Extend the middleware's process_response method to check for Turnstile, GeeTest, or other types and call the appropriate solver method.
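
One way to structure that extension is to detect the CAPTCHA type first and dispatch afterwards. The sketch below covers detection only. `detect_captcha` is a hypothetical helper, the sample markup is made up, and a Turnstile solver method would have to be added to `CaptchaAISolver` separately, since the class in this guide implements only reCAPTCHA and image solving.

```python
import re

SITEKEY_RE = re.compile(r'data-sitekey=["\']([A-Za-z0-9_-]+)["\']')


def detect_captcha(html):
    """Return (kind, sitekey) for the first widget recognized, else (None, None)."""
    match = SITEKEY_RE.search(html)
    sitekey = match.group(1) if match else None
    # Turnstile widgets also carry data-sitekey, so check the class name
    # first; otherwise a Turnstile key would be submitted as reCAPTCHA.
    if "cf-turnstile" in html:
        return ("turnstile", sitekey)
    if "g-recaptcha" in html:
        return ("recaptcha", sitekey)
    if 'id="captcha-image"' in html:
        return ("image", None)
    return (None, None)


print(detect_captcha('<div class="cf-turnstile" data-sitekey="TS_KEY"></div>'))
print(detect_captcha('<div class="g-recaptcha" data-sitekey="RC_KEY"></div>'))
print(detect_captcha('<img id="captcha-image" src="data:image/png;base64,AAAA">'))
```

Inside `process_response`, the returned kind can then select the matching solver method, with unknown kinds passed through untouched.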
