Most high-value websites use CAPTCHAs as part of their anti-bot defense. This guide covers strategies for scraping these sites reliably using CaptchaAI, including how to identify CAPTCHA types, solve them automatically, and build resilient scrapers.
Common CAPTCHA Implementations
| CAPTCHA | Where Used | CaptchaAI Method |
|---|---|---|
| reCAPTCHA v2 | Login forms, search pages | method=userrecaptcha |
| reCAPTCHA v3 | Background scoring on any page | method=userrecaptcha&version=v3 |
| Cloudflare Turnstile | Sites behind Cloudflare | method=turnstile |
| Cloudflare Challenge | Full-page Cloudflare block | method=cloudflare_challenge |
| Image/OCR CAPTCHA | Legacy sites, Amazon | method=base64 |
| hCaptcha | Privacy-focused sites | method=hcaptcha |
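Before choosing a method, you need to know which widget the page embeds. The class names each widget renders with make this detectable in raw HTML. A minimal regex-based sketch (`CAPTCHA_MARKERS` and `identify_captcha` are illustrative names, not part of CaptchaAI):

```python
import re

# Maps an HTML marker (the widget's container class) to the
# CaptchaAI method name from the table above.
CAPTCHA_MARKERS = {
    "g-recaptcha": "userrecaptcha",
    "cf-turnstile": "turnstile",
    "h-captcha": "hcaptcha",
}

def identify_captcha(html):
    """Return (method, sitekey) for the first recognized CAPTCHA, or None."""
    for marker, method in CAPTCHA_MARKERS.items():
        if marker in html:
            # The data-sitekey attribute sits on the widget's container div
            m = re.search(r'data-sitekey=["\']([^"\']+)["\']', html)
            return method, (m.group(1) if m else None)
    return None
```

This only covers the widget-based CAPTCHAs; full-page Cloudflare challenges and image CAPTCHAs need different detection (status code 403 and `<img>` inspection respectively).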
Strategy 1: Detect and Solve on Demand
The most reliable approach is to scrape normally and solve CAPTCHAs only when they actually appear:
```python
import requests
import time
from bs4 import BeautifulSoup

API_KEY = "YOUR_API_KEY"

class ProtectedScraper:
    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        })

    def scrape(self, url):
        resp = self.session.get(url)
        # Solve only if a CAPTCHA is actually present
        if self._has_captcha(resp.text):
            resp = self._handle_captcha(resp.text, url)
        return resp.text

    def _has_captcha(self, html):
        indicators = ["g-recaptcha", "cf-turnstile", "h-captcha", "captcha"]
        return any(ind in html.lower() for ind in indicators)

    def _handle_captcha(self, html, url):
        soup = BeautifulSoup(html, "html.parser")
        # reCAPTCHA v2
        rc = soup.find("div", class_="g-recaptcha")
        if rc:
            token = self._solve_recaptcha(rc["data-sitekey"], url)
            return self.session.post(url, data={"g-recaptcha-response": token})
        # Cloudflare Turnstile
        ts = soup.find("div", class_="cf-turnstile")
        if ts:
            token = self._solve_turnstile(ts["data-sitekey"], url)
            return self.session.post(url, data={"cf-turnstile-response": token})
        raise Exception("Unknown CAPTCHA type")

    def _solve_recaptcha(self, site_key, page_url):
        resp = requests.get("https://ocr.captchaai.com/in.php", params={
            "key": API_KEY, "method": "userrecaptcha",
            "googlekey": site_key, "pageurl": page_url
        })
        if not resp.text.startswith("OK|"):
            raise Exception(f"Submit failed: {resp.text}")
        return self._poll(resp.text.split("|")[1])

    def _solve_turnstile(self, site_key, page_url):
        resp = requests.get("https://ocr.captchaai.com/in.php", params={
            "key": API_KEY, "method": "turnstile",
            "sitekey": site_key, "pageurl": page_url
        })
        if not resp.text.startswith("OK|"):
            raise Exception(f"Submit failed: {resp.text}")
        return self._poll(resp.text.split("|")[1])

    def _poll(self, task_id):
        for _ in range(60):
            time.sleep(5)
            result = requests.get("https://ocr.captchaai.com/res.php", params={
                "key": API_KEY, "action": "get", "id": task_id
            })
            if result.text == "CAPCHA_NOT_READY":  # spelling matches the API
                continue
            if result.text.startswith("OK|"):
                return result.text.split("|", 1)[1]
            raise Exception(result.text)
        raise TimeoutError("CAPTCHA not solved within 5 minutes")

# Usage
scraper = ProtectedScraper()
html = scraper.scrape("https://example.com/data")
```
Strategy 2: Pre-Solve for Known CAPTCHA Pages
If you know which pages always have CAPTCHAs, solve preemptively:
```python
def scrape_known_captcha_page(url, site_key):
    # Solve before even loading the page
    # (solve_recaptcha wraps the same in.php/res.php flow as Strategy 1)
    token = solve_recaptcha(site_key, url)
    # Submit directly with the token
    resp = requests.post(url, data={
        "g-recaptcha-response": token,
        "query": "search term"
    })
    return resp.text
```
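The `solve_recaptcha` call above refers to a standalone version of the Strategy 1 method. Whichever way you wrap it, the response parsing is the same: `in.php` and `res.php` reply with `OK|<id-or-token>` on success or a bare error string on failure, as the Strategy 1 code shows. A small helper for that (name is illustrative):

```python
def parse_api_response(text):
    """Split a CaptchaAI in.php/res.php reply into its payload.

    Replies are 'OK|<id-or-token>' on success, or an error string
    such as 'ERROR_WRONG_USER_KEY' on failure.
    """
    if text.startswith("OK|"):
        # maxsplit=1 in case the token itself contains '|'
        return text.split("|", 1)[1]
    raise RuntimeError(f"CaptchaAI error: {text}")
```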
Strategy 3: Cloudflare-Protected Sites
Sites behind Cloudflare often require a cf_clearance cookie:
```python
def get_cloudflare_clearance(url, proxy):
    resp = requests.get("https://ocr.captchaai.com/in.php", params={
        "key": API_KEY,
        "method": "cloudflare_challenge",
        "pageurl": url,
        "proxy": proxy,       # the clearance cookie is bound to this proxy's IP
        "proxytype": "HTTP"
    })
    if not resp.text.startswith("OK|"):
        raise Exception(f"Submit failed: {resp.text}")
    task_id = resp.text.split("|")[1]
    for _ in range(60):
        time.sleep(5)
        result = requests.get("https://ocr.captchaai.com/res.php", params={
            "key": API_KEY, "action": "get", "id": task_id
        })
        if result.text == "CAPCHA_NOT_READY":
            continue
        if "cf_clearance" in result.text:
            # Parse cf_clearance and user_agent from the response
            return result.text
        raise Exception(result.text)
    raise TimeoutError("Cloudflare challenge not solved in time")
```
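Once you have parsed the cookie value and user agent out of the solver response, every follow-up request must replay both, through the same proxy the solver used. A minimal sketch, assuming those two values are already extracted (`clearance_headers` is an illustrative helper, not part of any library):

```python
def clearance_headers(cf_clearance, user_agent):
    """Build headers that replay a solved Cloudflare clearance.

    cf_clearance and user_agent come from the solver response; the
    request must also go through the same proxy the solver used,
    since the cookie is tied to that IP.
    """
    return {
        "Cookie": f"cf_clearance={cf_clearance}",
        "User-Agent": user_agent,
    }
```

Pass the result as `headers=` on subsequent `requests.get` calls (or merge it into a `requests.Session`).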
Multi-Page Scraping Pattern
```python
import random

def scrape_multiple_pages(base_url, pages):
    scraper = ProtectedScraper()
    results = []
    for page in pages:
        url = f"{base_url}?page={page}"
        try:
            html = scraper.scrape(url)
            soup = BeautifulSoup(html, "html.parser")
            items = soup.find_all("div", class_="item")
            results.extend([item.text.strip() for item in items])
            print(f"Page {page}: {len(items)} items")
        except Exception as e:
            print(f"Page {page} failed: {e}")
        # Randomized delay between pages to look less bot-like
        time.sleep(random.uniform(2, 5))
    return results
```
Troubleshooting
| Issue | Fix |
|---|---|
| CAPTCHA appears on every page | Use proxies; reduce request rate |
| Token rejected after solving | Token may have expired; use within 120s |
| Cloudflare blocks despite clearance | Use same proxy and user-agent for all requests |
| Site returns different page after solve | Check for additional redirects or cookies |
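For the "reduce request rate" fix, a fixed sleep is easy to fingerprint and wastes time when the site recovers quickly. One common alternative (a sketch, not a CaptchaAI feature) is exponential backoff with full jitter, growing the delay on each consecutive CAPTCHA or block:

```python
import random

def backoff_delay(attempt, base=2.0, cap=60.0):
    """Exponential backoff with full jitter: ~2s, ~4s, ~8s... capped at 60s.

    attempt is the number of consecutive failures so far (0-based).
    """
    delay = min(cap, base * (2 ** attempt))
    # Full jitter: pick uniformly in [0, delay] so concurrent
    # scrapers don't retry in lockstep
    return random.uniform(0, delay)
```

Call `time.sleep(backoff_delay(attempt))` after each detected CAPTCHA, and reset `attempt` to 0 after a clean page.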
FAQ
Which sites are hardest to scrape?
Sites using Cloudflare Enterprise, PerimeterX, or Akamai Bot Manager are the most challenging. CaptchaAI handles their CAPTCHA components; combine with stealth browsers and proxies for best results.
Can I scrape sites that require login?
Yes. Log in first (solving any login CAPTCHA), maintain the session cookies, then scrape authenticated pages. CaptchaAI handles CAPTCHAs at any stage.
How do I handle JavaScript-rendered pages?
Use Selenium, Puppeteer, or Playwright to render JavaScript, then extract CAPTCHA parameters and solve via CaptchaAI. See Selenium CAPTCHA Handling.
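In the browser-automation flow, the usual last step is writing the solved token into reCAPTCHA's hidden `g-recaptcha-response` textarea before triggering the form submit. A sketch of building that injection script (function name is illustrative; the element id is standard for reCAPTCHA v2):

```python
import json

def token_injection_js(token):
    """JavaScript that writes a solved token into reCAPTCHA's hidden textarea.

    Run it via driver.execute_script(...) in Selenium or
    page.evaluate(...) in Playwright after CaptchaAI returns the token.
    """
    # json.dumps quotes and escapes the token safely for embedding in JS
    return (
        'document.getElementById("g-recaptcha-response").value = '
        f'{json.dumps(token)};'
    )
```

Some sites also pass the token to a JS callback instead of reading the textarea; inspect the page's `data-callback` attribute in that case.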