Scrapy is one of the most widely used Python crawling frameworks. This guide shows how to add CaptchaAI CAPTCHA solving to your spiders using a custom downloader middleware.
Requirements
| Requirement | Details |
|---|---|
| Python | 3.8+ |
| Scrapy | 2.5+ |
| requests | For CaptchaAI API calls |
| CaptchaAI API key | Get one here |
```bash
pip install scrapy requests
```
CaptchaAI Solver Module
Create captcha_solver.py in your Scrapy project root:
```python
import time

import requests


class CaptchaAISolver:
    """Minimal client for the CaptchaAI in.php / res.php endpoints."""

    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = "https://ocr.captchaai.com"

    def solve_recaptcha(self, site_key, page_url, timeout=300):
        """Submit a reCAPTCHA task and poll until the token is ready."""
        resp = requests.get(f"{self.base_url}/in.php", params={
            "key": self.api_key,
            "method": "userrecaptcha",
            "googlekey": site_key,
            "pageurl": page_url,
        }, timeout=30)
        if not resp.text.startswith("OK|"):
            raise RuntimeError(f"Submit failed: {resp.text}")
        task_id = resp.text.split("|", 1)[1]

        deadline = time.time() + timeout
        while time.time() < deadline:
            time.sleep(5)  # wait between polls
            result = requests.get(f"{self.base_url}/res.php", params={
                "key": self.api_key,
                "action": "get",
                "id": task_id,
            }, timeout=30)
            # "CAPCHA_NOT_READY" (sic) is the literal status string.
            if result.text == "CAPCHA_NOT_READY":
                continue
            if result.text.startswith("OK|"):
                return result.text.split("|", 1)[1]
            raise RuntimeError(f"Solve failed: {result.text}")
        raise TimeoutError(f"Task {task_id} timed out")

    def solve_image(self, image_base64, timeout=120):
        """Submit a base64-encoded image CAPTCHA and poll for its text."""
        # POST keeps the potentially large base64 payload out of the URL.
        resp = requests.post(f"{self.base_url}/in.php", data={
            "key": self.api_key,
            "method": "base64",
            "body": image_base64,
        }, timeout=30)
        if not resp.text.startswith("OK|"):
            raise RuntimeError(f"Submit failed: {resp.text}")
        task_id = resp.text.split("|", 1)[1]

        deadline = time.time() + timeout
        while time.time() < deadline:
            time.sleep(5)  # wait between polls
            result = requests.get(f"{self.base_url}/res.php", params={
                "key": self.api_key,
                "action": "get",
                "id": task_id,
            }, timeout=30)
            if result.text == "CAPCHA_NOT_READY":
                continue
            if result.text.startswith("OK|"):
                return result.text.split("|", 1)[1]
            raise RuntimeError(f"Solve failed: {result.text}")
        raise TimeoutError(f"Task {task_id} timed out")
```
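Both solver methods repeat the same poll-until-ready loop. As a minimal sketch, assuming you would rather keep that loop in one place and exercise it without network access, the helper below takes the status-fetching call as an argument. The names poll_result, fetch, sleep and clock are illustrative and not part of the CaptchaAI API.

```python
import time


def poll_result(fetch, timeout=300, interval=5,
                sleep=time.sleep, clock=time.monotonic):
    """Call fetch() until it returns 'OK|<answer>' or the timeout expires.

    fetch is any zero-argument callable returning the raw res.php text.
    All names here are hypothetical helpers for this sketch.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        sleep(interval)
        text = fetch()
        if text == "CAPCHA_NOT_READY":  # same status string the solver checks
            continue
        if text.startswith("OK|"):
            return text.split("|", 1)[1]
        raise RuntimeError(f"Solve failed: {text}")
    raise TimeoutError("CAPTCHA task timed out")


# Canned replies stand in for real HTTP responses.
replies = iter(["CAPCHA_NOT_READY", "CAPCHA_NOT_READY", "OK|token-123"])
answer = poll_result(lambda: next(replies), interval=0, sleep=lambda s: None)
print(answer)  # token-123
```

In the solver, fetch would wrap the requests.get call to res.php, so each solve method only has to submit its task and hand the task ID to the shared loop.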
Scrapy Middleware
Create middlewares.py:
```python
import re

from scrapy.http import TextResponse

from captcha_solver import CaptchaAISolver


class CaptchaAIMiddleware:
    """Scrapy downloader middleware that detects and solves CAPTCHAs."""

    def __init__(self, api_key):
        self.solver = CaptchaAISolver(api_key)

    @classmethod
    def from_crawler(cls, crawler):
        api_key = crawler.settings.get("CAPTCHAAI_API_KEY")
        if not api_key:
            raise ValueError("CAPTCHAAI_API_KEY setting is required")
        return cls(api_key)

    def process_response(self, request, response, spider):
        # Binary responses (images, PDFs) have no text to inspect.
        if not isinstance(response, TextResponse):
            return response

        # Check for reCAPTCHA on the page.
        # Note: the solver call blocks until the CAPTCHA is solved.
        site_key = self._find_recaptcha_key(response.text)
        if site_key:
            spider.logger.info(f"reCAPTCHA detected on {response.url}")
            token = self.solver.solve_recaptcha(site_key, response.url)
            request.meta["captcha_token"] = token
            spider.logger.info("CAPTCHA solved successfully")

        # Check for image CAPTCHA
        captcha_img = self._find_image_captcha(response)
        if captcha_img:
            spider.logger.info(f"Image CAPTCHA detected on {response.url}")
            text = self.solver.solve_image(captcha_img)
            request.meta["captcha_text"] = text
            spider.logger.info(f"Image CAPTCHA solved: {text}")

        return response

    def _find_recaptcha_key(self, html):
        match = re.search(
            r'data-sitekey=["\']([A-Za-z0-9_-]+)["\']', html
        )
        return match.group(1) if match else None

    def _find_image_captcha(self, response):
        # Only inline data: URI images are handled here.
        img = response.css("img#captcha-image::attr(src)").get()
        if img and img.startswith("data:image"):
            return img.split(",", 1)[1]
        return None
```
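Before wiring the middleware into a crawl, it can help to run the detection logic against sample markup. The sketch below reuses the regular expression from _find_recaptcha_key and the data: URI check from _find_image_captcha as standalone functions. The HTML snippets, the site key and the function names are made up for illustration.

```python
import base64
import re

SITEKEY_RE = re.compile(r'data-sitekey=["\']([A-Za-z0-9_-]+)["\']')


def find_site_key(html):
    """Return the first data-sitekey value in html, or None."""
    match = SITEKEY_RE.search(html)
    return match.group(1) if match else None


def inline_image_payload(src):
    """Return the base64 payload of a data: URI image, or None for plain URLs."""
    if src and src.startswith("data:image"):
        return src.split(",", 1)[1]
    return None


# Made-up sample inputs.
with_captcha = '<div class="g-recaptcha" data-sitekey="6Lc_EXAMPLE-key"></div>'
without_captcha = "<html><body><h1>Products</h1></body></html>"
inline_src = "data:image/png;base64," + base64.b64encode(b"fake-image").decode("ascii")

print(find_site_key(with_captcha))      # 6Lc_EXAMPLE-key
print(find_site_key(without_captcha))   # None
print(base64.b64decode(inline_image_payload(inline_src)))  # b'fake-image'
print(inline_image_payload("/captcha.png"))                # None
```

The last line shows a limitation of the middleware as written: a CAPTCHA image referenced by an ordinary URL is not picked up, because only inline data: URIs are recognized.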
Settings Configuration
Add to settings.py:
```python
import os

CAPTCHAAI_API_KEY = os.environ.get("CAPTCHAAI_API_KEY")

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.CaptchaAIMiddleware": 560,
}
```
Spider Example
```python
import scrapy


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        # If CAPTCHA was solved, the token is in meta
        token = response.meta.get("captcha_token")
        if token:
            # Resubmit the page with the token
            yield scrapy.FormRequest(
                url=response.url,
                formdata={"g-recaptcha-response": token},
                callback=self.parse_products,
            )
        else:
            yield from self.parse_products(response)

    def parse_products(self, response):
        for product in response.css(".product-item"):
            yield {
                "name": product.css("h2::text").get(),
                "price": product.css(".price::text").get(),
                "url": response.urljoin(
                    product.css("a::attr(href)").get()
                ),
            }

        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))
```
Retry on CAPTCHA Pages
Add automatic retry when CAPTCHAs appear:
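```python
from scrapy.http import TextResponse


class CaptchaRetryMiddleware:
    """Retry requests that return CAPTCHA challenge pages."""

    max_retries = 3

    def process_response(self, request, response, spider):
        if self._is_captcha_page(response):
            retries = request.meta.get("captcha_retries", 0)
            if retries < self.max_retries:
                spider.logger.info(
                    f"CAPTCHA page detected, retry {retries + 1}"
                )
                # dont_filter=True keeps the duplicate filter from
                # silently dropping the repeated URL.
                retry_request = request.replace(dont_filter=True)
                retry_request.meta["captcha_retries"] = retries + 1
                return retry_request
        return response

    def _is_captcha_page(self, response):
        # Non-text responses (images, PDFs) cannot be CAPTCHA pages.
        if not isinstance(response, TextResponse):
            return False
        indicators = [
            "g-recaptcha",
            "cf-turnstile",
            "captcha-image",
            "Please verify you are human",
        ]
        return any(ind in response.text for ind in indicators)
```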
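The retry middleware only takes effect once it is registered in settings.py. One possible arrangement is sketched below, assuming both classes live in myproject/middlewares.py. The priority 570 is an illustrative choice, not a value from this guide. Scrapy calls process_response in decreasing priority order, so with these numbers a CAPTCHA page is retried first and only reaches the solving middleware once the retries are used up.

```python
# settings.py (fragment)
DOWNLOADER_MIDDLEWARES = {
    # Higher number = closer to the downloader = sees the response first.
    "myproject.middlewares.CaptchaRetryMiddleware": 570,  # assumed priority
    "myproject.middlewares.CaptchaAIMiddleware": 560,
}
```

Swapping the two priorities would solve the CAPTCHA first and then re-request the page anyway, which spends a solve on a response that is discarded.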
Running the Spider
```bash
export CAPTCHAAI_API_KEY="YOUR_API_KEY"
scrapy crawl products -o products.json
```
Troubleshooting
| Issue | Cause | Fix |
|---|---|---|
| ValueError: CAPTCHAAI_API_KEY setting is required | Missing env var | Set CAPTCHAAI_API_KEY |
| CAPTCHA not detected | Different HTML structure | Update regex pattern in middleware |
| TimeoutError on solve | Slow solve or network | Increase timeout in solver |
| Spider gets blocked after solving | IP-based blocking | Add proxy rotation middleware |
FAQ
Can I use this with Scrapy-Splash or Scrapy-Playwright?
Yes. For JavaScript-rendered pages, the middleware works the same way: it inspects the final HTML response for CAPTCHA elements.
Does the middleware slow down crawling?
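CAPTCHA solving typically takes 5-15 seconds per page, and only pages with CAPTCHAs incur that delay. As written, however, the solver uses blocking requests calls and time.sleep inside the middleware, which stalls Scrapy's event loop: no other requests make progress during a solve, whatever CONCURRENT_REQUESTS is set to. To keep crawling other pages while waiting, run the solver in a separate thread, for example with Twisted's deferToThread.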
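The solver relies on blocking HTTP calls and time.sleep, so each solve occupies the thread that runs it. The standalone sketch below is not a Scrapy middleware. It uses a stand-in fake_solve function (hypothetical: it sleeps instead of calling the API) and the standard library's ThreadPoolExecutor to show several solves overlapping instead of running back to back.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_solve(page_url, delay=0.2):
    """Stand-in for a blocking solver call; sleeps instead of calling the API."""
    time.sleep(delay)
    return f"token-for-{page_url}"


urls = [f"https://example.com/page/{n}" for n in range(4)]

start = time.monotonic()
with ThreadPoolExecutor(max_workers=4) as pool:
    # Each solve runs in its own worker thread, so the waits overlap.
    tokens = list(pool.map(fake_solve, urls))
elapsed = time.monotonic() - start

print(tokens[0])  # token-for-https://example.com/page/0
print(f"{len(tokens)} solves finished in {elapsed:.2f}s")
```

Run one after another, four 0.2-second solves would take about 0.8 seconds, while the threaded version should finish in roughly the time of a single solve. Inside Scrapy, the analogous tool is Twisted's deferToThread, which moves the blocking call off the reactor thread.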
How do I handle different CAPTCHA types per page?
Extend the middleware's process_response method to check for Turnstile, GeeTest, or other types and call the appropriate solver method.
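One thing to watch when extending detection: the data-sitekey attribute is used by more than one widget, so the middleware's regular expression alone would treat a Turnstile widget as reCAPTCHA. A minimal sketch of per-type detection follows, assuming the widget's class attribute appears before data-sitekey in the same tag. The function names are illustrative, and solve_turnstile is a hypothetical method that this guide does not implement.

```python
import re


def widget_pattern(css_class):
    """Build a pattern capturing data-sitekey from a tag with the given class."""
    return re.compile(
        r'class=["\'][^"\']*' + re.escape(css_class) + r'[^"\']*["\'][^>]*'
        r'data-sitekey=["\']([^"\']+)["\']'
    )


# Checked in order; the first match wins.
WIDGET_PATTERNS = [
    ("turnstile", widget_pattern("cf-turnstile")),
    ("recaptcha", widget_pattern("g-recaptcha")),
]

# Maps each detected type to a solver method name.
SOLVER_METHODS = {
    "recaptcha": "solve_recaptcha",   # defined in captcha_solver.py
    "turnstile": "solve_turnstile",   # hypothetical, would need to be written
}


def detect_captcha(html):
    """Return (captcha_type, site_key) for the first known widget, else None."""
    for captcha_type, pattern in WIDGET_PATTERNS:
        match = pattern.search(html)
        if match:
            return captcha_type, match.group(1)
    return None


sample = '<div class="cf-turnstile" data-sitekey="0x4AAA-example"></div>'
captcha_type, site_key = detect_captcha(sample)
print(captcha_type, site_key, SOLVER_METHODS[captcha_type])
# turnstile 0x4AAA-example solve_turnstile
```

In process_response, the returned type could then select the solver method, for example via getattr(self.solver, SOLVER_METHODS[captcha_type]).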