Scrapy + CaptchaAI Integration Guide

Scrapy is one of the most widely used Python crawling frameworks. This guide shows how to add CaptchaAI CAPTCHA solving to your spiders with a custom downloader middleware.

Requirements

Requirement	Details
Python	3.8+
Scrapy	2.5+
requests	For CaptchaAI API calls
CaptchaAI API key	Get one here

pip install scrapy requests

CaptchaAI Solver Module

Create captcha_solver.py in your Scrapy project root:

import requests
import time


class CaptchaAISolver:
    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = "https://ocr.captchaai.com"

    def solve_recaptcha(self, site_key, page_url, timeout=300):
        resp = requests.get(f"{self.base_url}/in.php", params={
            "key": self.api_key,
            "method": "userrecaptcha",
            "googlekey": site_key,
            "pageurl": page_url,
        })

        if not resp.text.startswith("OK|"):
            raise Exception(f"Submit failed: {resp.text}")

        task_id = resp.text.split("|")[1]
        deadline = time.time() + timeout

        while time.time() < deadline:
            time.sleep(5)
            result = requests.get(f"{self.base_url}/res.php", params={
                "key": self.api_key,
                "action": "get",
                "id": task_id,
            })

            if result.text == "CAPCHA_NOT_READY":
                continue
            if result.text.startswith("OK|"):
                return result.text.split("|", 1)[1]
            raise Exception(f"Solve failed: {result.text}")

        raise TimeoutError(f"Task {task_id} timed out")

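    def solve_image(self, image_base64, timeout=120):
        # Send the image in a POST body: base64 payloads are usually too
        # large to fit safely in a URL query string.
        resp = requests.post(f"{self.base_url}/in.php", data={
            "key": self.api_key,
            "method": "base64",
            "body": image_base64,
        })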

        if not resp.text.startswith("OK|"):
            raise Exception(f"Submit failed: {resp.text}")

        task_id = resp.text.split("|")[1]
        deadline = time.time() + timeout

        while time.time() < deadline:
            time.sleep(5)
            result = requests.get(f"{self.base_url}/res.php", params={
                "key": self.api_key,
                "action": "get",
                "id": task_id,
            })

            if result.text == "CAPCHA_NOT_READY":
                continue
            if result.text.startswith("OK|"):
                return result.text.split("|", 1)[1]
            raise Exception(f"Solve failed: {result.text}")

        raise TimeoutError(f"Task {task_id} timed out")
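
Both solver methods rely on the same plain-text reply convention: `OK|<payload>` on success, `CAPCHA_NOT_READY` while the task is pending, and anything else treated as an error. A minimal sketch of that parsing and polling logic, pulled into helpers so it can be exercised without network access, might look like this. The names `parse_reply` and `poll_for_result`, and the injectable `fetch`, `sleep`, and `clock` parameters, are illustrative assumptions and not part of the CaptchaAI API.

```python
import time


def parse_reply(text):
    """Classify a reply as ("ok", payload), ("pending", None) or ("error", text)."""
    if text == "CAPCHA_NOT_READY":
        return ("pending", None)
    if text.startswith("OK|"):
        # Split only once so a payload that contains "|" stays intact
        return ("ok", text.split("|", 1)[1])
    return ("error", text)


def poll_for_result(fetch, timeout=300, interval=5,
                    sleep=time.sleep, clock=time.time):
    """Call fetch() every `interval` seconds until a result arrives or `timeout` passes.

    `fetch` is a hypothetical zero-argument callable that returns the raw
    reply text, for example a small wrapper around the res.php request.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        sleep(interval)
        status, payload = parse_reply(fetch())
        if status == "ok":
            return payload
        if status == "error":
            raise RuntimeError(f"Solve failed: {payload}")
    raise TimeoutError("Polling timed out")


# Usage with canned replies instead of real HTTP calls
replies = iter(["CAPCHA_NOT_READY", "CAPCHA_NOT_READY", "OK|token|with|pipes"])
result = poll_for_result(lambda: next(replies), sleep=lambda seconds: None)
print(result)
```

Passing `sleep` and `clock` in as parameters keeps the loop testable: a unit test can skip the real delays and simulate a timeout without waiting for it.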

Scrapy Middleware

Add the middleware to your project's middlewares.py (myproject/middlewares.py in this guide):

import re

from scrapy.http import TextResponse

from captcha_solver import CaptchaAISolver


class CaptchaAIMiddleware:
    """Scrapy downloader middleware that detects and solves CAPTCHAs."""

    def __init__(self, api_key):
        self.solver = CaptchaAISolver(api_key)

    @classmethod
    def from_crawler(cls, crawler):
        api_key = crawler.settings.get("CAPTCHAAI_API_KEY")
        if not api_key:
            raise ValueError("CAPTCHAAI_API_KEY setting is required")
        return cls(api_key)

    def process_response(self, request, response, spider):
        # Binary responses (images, PDFs, ...) have no .text; pass them through
        if not isinstance(response, TextResponse):
            return response

        # Check for reCAPTCHA on the page
        site_key = self._find_recaptcha_key(response.text)
        if site_key:
            spider.logger.info(f"reCAPTCHA detected on {response.url}")
            # Blocks until the solver returns a token or times out
            token = self.solver.solve_recaptcha(site_key, response.url)
            # The spider reads this back through response.meta
            request.meta["captcha_token"] = token
            spider.logger.info("CAPTCHA solved successfully")

        # Check for image CAPTCHA
        captcha_img = self._find_image_captcha(response)
        if captcha_img:
            spider.logger.info(f"Image CAPTCHA detected on {response.url}")
            text = self.solver.solve_image(captcha_img)
            request.meta["captcha_text"] = text
            spider.logger.info(f"Image CAPTCHA solved: {text}")

        return response

    def _find_recaptcha_key(self, html):
        match = re.search(
            r'data-sitekey=["\']([A-Za-z0-9_-]+)["\']', html
        )
        return match.group(1) if match else None

    def _find_image_captcha(self, response):
        img = response.css("img#captcha-image::attr(src)").get()
        if img and img.startswith("data:image"):
            return img.split(",", 1)[1]
        return None
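
The two detection helpers can be tried outside Scrapy on sample markup. The sketch below mirrors their logic with the standard library only: the same `data-sitekey` regex, plus a slightly stricter data-URI check that also verifies the payload is valid base64. The function names and the sample site key are made up for illustration. Note that, like the middleware, this only handles CAPTCHA images embedded as `data:` URIs; an image served from a regular URL would have to be downloaded and encoded first.

```python
import base64
import re

SITEKEY_RE = re.compile(r'data-sitekey=["\']([A-Za-z0-9_-]+)["\']')


def find_recaptcha_key(html):
    """Return the first data-sitekey value found in the HTML, or None."""
    match = SITEKEY_RE.search(html)
    return match.group(1) if match else None


def data_uri_payload(src):
    """Return the base64 payload of a data:image URI, or None for anything else."""
    if not src or not src.startswith("data:image"):
        return None
    header, _, payload = src.partition(",")
    if not header.endswith(";base64") or not payload:
        return None
    try:
        # validate=True rejects characters outside the base64 alphabet
        base64.b64decode(payload, validate=True)
    except ValueError:
        return None
    return payload


# Hypothetical sample inputs
sample_html = '<div class="g-recaptcha" data-sitekey="6Lc_example-KEY123"></div>'
sample_src = "data:image/png;base64," + base64.b64encode(b"fake image bytes").decode()

print(find_recaptcha_key(sample_html))
print(data_uri_payload(sample_src) is not None)
print(data_uri_payload("/static/captcha.png"))
```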

Settings Configuration

Add to settings.py:

import os

CAPTCHAAI_API_KEY = os.environ.get("CAPTCHAAI_API_KEY")

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.CaptchaAIMiddleware": 560,
}

Spider Example

import scrapy


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        # If CAPTCHA was solved, the token is in meta
        token = response.meta.get("captcha_token")
        if token:
            # Resubmit the page with the token
            yield scrapy.FormRequest(
                url=response.url,
                formdata={"g-recaptcha-response": token},
                callback=self.parse_products,
            )
        else:
            yield from self.parse_products(response)

    def parse_products(self, response):
        for product in response.css(".product-item"):
            yield {
                "name": product.css("h2::text").get(),
                "price": product.css(".price::text").get(),
                "url": response.urljoin(
                    product.css("a::attr(href)").get()
                ),
            }

        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))

Retry on CAPTCHA Pages

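To retry automatically when a CAPTCHA challenge page comes back, add a second downloader middleware and register it in DOWNLOADER_MIDDLEWARES: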

from scrapy.http import TextResponse


class CaptchaRetryMiddleware:
    """Retry requests that return CAPTCHA challenge pages."""

    max_retries = 3

    def process_response(self, request, response, spider):
        if self._is_captcha_page(response):
            retries = request.meta.get("captcha_retries", 0)
            if retries < self.max_retries:
                spider.logger.info(
                    f"CAPTCHA page detected, retry {retries + 1}"
                )
                retry_request = request.copy()
                retry_request.meta["captcha_retries"] = retries + 1
                # Without dont_filter the duplicate filter drops the retry
                retry_request.dont_filter = True
                return retry_request

        return response

    def _is_captcha_page(self, response):
        # Only text responses can contain CAPTCHA markup
        if not isinstance(response, TextResponse):
            return False
        indicators = [
            "g-recaptcha",
            "cf-turnstile",
            "captcha-image",
            "Please verify you are human",
        ]
        return any(ind in response.text for ind in indicators)
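
The retry budget is simple to reason about in isolation. As a rough sketch, the counter logic can be modelled with a plain dict standing in for Scrapy's `request.meta`; the helper name `plan_retry` is hypothetical and only illustrates how the `captcha_retries` counter caps the number of attempts.

```python
def plan_retry(meta, max_retries=3):
    """Return updated meta for the next attempt, or None once the budget is spent.

    `meta` stands in for Scrapy's request.meta dict; the input is not mutated.
    """
    retries = meta.get("captcha_retries", 0)
    if retries >= max_retries:
        return None
    updated = dict(meta)
    updated["captcha_retries"] = retries + 1
    return updated


# Keep retrying until the budget runs out
meta = {}
attempts = 0
while (next_meta := plan_retry(meta)) is not None:
    meta = next_meta
    attempts += 1

print(attempts, meta)
```

With the default budget the loop stops after three retries, which matches `max_retries = 3` in the middleware.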

Running the Spider

export CAPTCHAAI_API_KEY="YOUR_API_KEY"
scrapy crawl products -o products.json

Troubleshooting

Issue	Cause	Fix
`ValueError: CAPTCHAAI_API_KEY setting is required`	Missing env var	Set `CAPTCHAAI_API_KEY`
CAPTCHA not detected	Different HTML structure	Update regex pattern in middleware
`TimeoutError` on solve	Slow solve or network	Increase timeout in solver
Spider gets blocked after solving	IP-based blocking	Add proxy rotation middleware

FAQ

Can I use this with Scrapy-Splash or Scrapy-Playwright?

Yes. For JavaScript-rendered pages the middleware works the same way: it inspects the final rendered HTML response for CAPTCHA elements.

Does the middleware slow down crawling?

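Solving typically takes 5-15 seconds per CAPTCHA, and only pages that contain a CAPTCHA incur that delay. Be aware that the solver shown in this guide uses blocking requests calls and time.sleep, which stall Scrapy's event loop, so other requests do not progress during a solve regardless of CONCURRENT_REQUESTS. To keep the crawl moving, run the solver in a worker thread (for example with Twisted's deferToThread) or switch to a non-blocking HTTP client.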
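
Whether other pages keep crawling during a solve depends on the solver not blocking the crawler's main thread. As a standard-library illustration of the idea, the sketch below runs several blocking "solves" in worker threads so that they overlap instead of queueing. `slow_solve` is a stand-in that sleeps rather than calling the API, and the URLs are placeholders; in a Scrapy project the analogous tool would be Twisted's `deferToThread` rather than `ThreadPoolExecutor`.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_solve(page_url, delay=0.2):
    """Stand-in for a blocking solver call; sleeps instead of calling the API."""
    time.sleep(delay)
    return f"token-for-{page_url}"


urls = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
]

start = time.monotonic()
with ThreadPoolExecutor(max_workers=3) as pool:
    # map() preserves input order even though the calls run concurrently
    tokens = list(pool.map(slow_solve, urls))
elapsed = time.monotonic() - start

print(tokens)
print(f"elapsed: {elapsed:.2f}s")
```

Run sequentially, the three calls would take about three times the single-call delay; with one worker per call they finish in roughly the time of one.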

How do I handle different CAPTCHA types per page?

Extend the middleware's process_response method to check for Turnstile, GeeTest, or other types and call the appropriate solver method.
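
One way to structure that extension is a small detection-and-dispatch table. The sketch below is a minimal, Scrapy-free version: the marker substrings come from the retry middleware's indicator list, while the type labels, `detect_captcha_type`, `handle_page`, and the handler registry are assumptions made for illustration. Real handlers would call the matching solver method; the parameters each CAPTCHA type needs are not covered here.

```python
# Marker substrings taken from the retry middleware's indicator list;
# the labels on the left are illustrative names, not API values.
CAPTCHA_MARKERS = [
    ("recaptcha", "g-recaptcha"),
    ("turnstile", "cf-turnstile"),
    ("image", "captcha-image"),
]


def detect_captcha_type(html):
    """Return the label of the first marker found in the HTML, or None."""
    for captcha_type, marker in CAPTCHA_MARKERS:
        if marker in html:
            return captcha_type
    return None


def handle_page(html, handlers):
    """Dispatch to the handler registered for the detected CAPTCHA type."""
    captcha_type = detect_captcha_type(html)
    if captcha_type is None:
        return None  # no CAPTCHA on this page
    handler = handlers.get(captcha_type)
    if handler is None:
        raise NotImplementedError(f"No solver registered for {captcha_type}")
    return handler(html)


# Hypothetical handlers; real ones would call the solver
handlers = {
    "recaptcha": lambda html: "solved-recaptcha",
    "image": lambda html: "solved-image",
}

print(handle_page('<div class="g-recaptcha"></div>', handlers))
print(handle_page("<p>plain page</p>", handlers))
```

Keeping detection separate from solving means a new CAPTCHA type needs only one new marker entry and one new handler.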

