CAPTCHAs are the most common blocker in web scraping workflows. When a target site serves a reCAPTCHA, Cloudflare Turnstile, or image CAPTCHA, your scraper stops dead. CaptchaAI's API solves these challenges automatically so your scraper keeps running.
How CAPTCHA Blocking Works in Scraping
Websites trigger CAPTCHAs based on behavioral signals:
| Signal | Trigger |
|---|---|
| Request rate | Too many requests from one IP |
| Missing cookies | No session or preference cookies |
| Bot-like headers | Missing Accept-Language, Referer |
| JavaScript fingerprint | No JS execution or headless browser detected |
| IP reputation | Datacenter or proxy IP flagged |
When triggered, the site returns a CAPTCHA challenge instead of the page content. Your scraper needs to solve it and submit the token to proceed.
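A scraper first has to notice that it received a challenge rather than content. The check below is a minimal sketch of that idea. The marker strings are taken from the embed snippets shown in Step 1; the name `looks_like_captcha` and the plain substring matching are assumptions for illustration, and real sites may use different markup.

```python
# Substrings that suggest the response is a CAPTCHA challenge page.
# Assumption: these mirror the widget markup shown in Step 1.
CAPTCHA_MARKERS = (
    'class="g-recaptcha"',          # reCAPTCHA v2 widget
    "google.com/recaptcha/api.js",  # reCAPTCHA script include
    'class="cf-turnstile"',         # Cloudflare Turnstile widget
)


def looks_like_captcha(html: str) -> bool:
    """Return True if the HTML contains any known CAPTCHA marker."""
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)
```

Calling this on every response lets the scraper request a solve only when a challenge actually appears, instead of on every page.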
Requirements
| Requirement | Details |
|---|---|
| CaptchaAI API key | From captchaai.com |
| Python 3.7+ or Node.js 16+ | For code examples |
| requests / axios | HTTP client library |
| Target site URL | The page serving the CAPTCHA |
| CAPTCHA site key | Extracted from the page source |
Step 1: Identify the CAPTCHA Type
Before solving, identify what CAPTCHA the site uses. Check the page source:
reCAPTCHA v2:
<div class="g-recaptcha" data-sitekey="6Le-wvkS..."></div>
reCAPTCHA v3:
<script src="https://www.google.com/recaptcha/api.js?render=6Le-wvkS..."></script>
Cloudflare Turnstile:
<div class="cf-turnstile" data-sitekey="0x4AAAAA..."></div>
Each type requires a different method parameter when submitting to CaptchaAI.
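The three snippets above can be turned into a rough detector. This is a sketch under stated assumptions: the function name `detect_captcha` and the regular expressions are hypothetical, they expect `class` to appear before `data-sitekey`, and only the `method` values that appear in this article (`userrecaptcha`, `turnstile`) are filled in. The value for reCAPTCHA v3 is left as `None` rather than guessed.

```python
import re
from typing import Optional, Tuple

# (captcha type, regex capturing the site key); patterns mirror the snippets above.
PATTERNS = [
    ("recaptcha_v2", re.compile(r'class="g-recaptcha"[^>]*data-sitekey="([^"]+)"')),
    # render=explicit is a v2 loading mode, not a site key, so it is excluded.
    ("recaptcha_v3", re.compile(r'recaptcha/api\.js\?render=(?!explicit)([^"&]+)')),
    ("turnstile", re.compile(r'class="cf-turnstile"[^>]*data-sitekey="([^"]+)"')),
]

# Only the method values named in this article; others are left unknown.
METHODS = {"recaptcha_v2": "userrecaptcha", "turnstile": "turnstile"}


def detect_captcha(html: str) -> Optional[Tuple[str, str, Optional[str]]]:
    """Return (captcha_type, site_key, method_or_None), or None if nothing matches."""
    for name, pattern in PATTERNS:
        match = pattern.search(html)
        if match:
            return name, match.group(1), METHODS.get(name)
    return None
```

Regex matching is brittle against attribute reordering, so a parser-based lookup like the one in Step 2 is the safer choice once the CAPTCHA type is known.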
Step 2: Extract the Site Key
Python (with requests + BeautifulSoup)
from bs4 import BeautifulSoup
import requests
page = requests.get("https://example.com/protected-page", headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
})
soup = BeautifulSoup(page.text, "html.parser")
# reCAPTCHA v2
recaptcha_div = soup.find("div", class_="g-recaptcha")
if recaptcha_div:
    site_key = recaptcha_div["data-sitekey"]
    print(f"reCAPTCHA v2 site key: {site_key}")
Node.js (with cheerio)
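const axios = require("axios");
const cheerio = require("cheerio");
// CommonJS modules have no top-level await, so wrap the calls in an async function
async function getSiteKey(url) {
  const { data } = await axios.get(url);
  const $ = cheerio.load(data);
  return $(".g-recaptcha").attr("data-sitekey");
}
getSiteKey("https://example.com/protected-page")
  .then((siteKey) => console.log("Site key:", siteKey))
  .catch((err) => console.error(err.message));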
Step 3: Submit the CAPTCHA to CaptchaAI
Python
import requests
import time
API_KEY = "YOUR_API_KEY"
SITE_KEY = "6Le-wvkS..."
PAGE_URL = "https://example.com/protected-page"
# Submit
resp = requests.get("https://ocr.captchaai.com/in.php", params={
    "key": API_KEY,
    "method": "userrecaptcha",
    "googlekey": SITE_KEY,
    "pageurl": PAGE_URL
})
if not resp.text.startswith("OK|"):
    raise Exception(f"Submit error: {resp.text}")
task_id = resp.text.split("|")[1]
print(f"Task submitted: {task_id}")
# Poll for result
while True:
    time.sleep(5)
    result = requests.get("https://ocr.captchaai.com/res.php", params={
        "key": API_KEY,
        "action": "get",
        "id": task_id
    })
    if result.text == "CAPCHA_NOT_READY":
        continue
    if result.text.startswith("OK|"):
        token = result.text.split("|")[1]
        print(f"Solved! Token: {token[:50]}...")
        break
    raise Exception(f"Solve error: {result.text}")
Node.js
const axios = require("axios");
const API_KEY = "YOUR_API_KEY";
const SITE_KEY = "6Le-wvkS...";
const PAGE_URL = "https://example.com/protected-page";
const sleep = (ms) => new Promise((r) => setTimeout(r, ms));
// Wrapped in an async function because CommonJS has no top-level await
async function solve() {
  // Submit
  const submitResp = await axios.get("https://ocr.captchaai.com/in.php", {
    params: {
      key: API_KEY,
      method: "userrecaptcha",
      googlekey: SITE_KEY,
      pageurl: PAGE_URL,
    },
  });
  const submitText = String(submitResp.data);
  // Fail early on submit errors instead of polling with an undefined task ID
  if (!submitText.startsWith("OK|")) {
    throw new Error(`Submit error: ${submitText}`);
  }
  const taskId = submitText.split("|")[1];
  // Poll
  while (true) {
    await sleep(5000);
    const result = await axios.get("https://ocr.captchaai.com/res.php", {
      params: { key: API_KEY, action: "get", id: taskId },
    });
    if (result.data === "CAPCHA_NOT_READY") continue;
    if (result.data.startsWith("OK|")) {
      const token = result.data.split("|")[1];
      console.log("Token:", token.substring(0, 50));
      return token;
    }
    throw new Error(`Error: ${result.data}`);
  }
}
solve().catch((err) => console.error(err.message));
Step 4: Submit the Token to the Target Site
Once you have the token, submit it with the form data the site expects:
Python
# Submit the solved token with the form
# (continues from Step 3: requests, token, and PAGE_URL are defined there)
form_data = {
    "g-recaptcha-response": token,
    "username": "user@example.com",
    "password": "password123"
}
response = requests.post(PAGE_URL, data=form_data, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
})
print(f"Status: {response.status_code}")
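Many forms also carry hidden fields (a CSRF token, for example) that must be posted back along with the CAPTCHA token. The sketch below shows one way to assemble "the form data the site expects" using only the standard library. The names `HiddenInputCollector` and `build_form_payload` are hypothetical, and the `g-recaptcha-response` field name is the one used in the example above; other CAPTCHA types may use a different field.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collects name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        # Attribute values can be None when written without a value.
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_form_payload(html, token, extra_fields=None):
    """Merge hidden inputs, caller-supplied fields, and the solved token."""
    collector = HiddenInputCollector()
    collector.feed(html)
    payload = dict(collector.fields)
    payload.update(extra_fields or {})
    payload["g-recaptcha-response"] = token
    return payload
```

The resulting dictionary can be passed as `data=` to `requests.post`. This sketch collects hidden inputs from the whole page, so on pages with several forms it would need to be narrowed to the right one.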
Step 5: Build a Reusable Scraper Function
Wrap the solve logic into a reusable function:
import requests
import time
API_KEY = "YOUR_API_KEY"
def solve_captcha(site_key, page_url, method="userrecaptcha"):
    # "googlekey" is the parameter used for reCAPTCHA above; other methods
    # may expect a different parameter name, so check the API docs first.
    resp = requests.get("https://ocr.captchaai.com/in.php", params={
        "key": API_KEY,
        "method": method,
        "googlekey": site_key,
        "pageurl": page_url
    })
    if not resp.text.startswith("OK|"):
        raise Exception(resp.text)
    task_id = resp.text.split("|")[1]
    # Poll up to 60 times, 5 seconds apart (about 5 minutes in total)
    for _ in range(60):
        time.sleep(5)
        result = requests.get("https://ocr.captchaai.com/res.php", params={
            "key": API_KEY, "action": "get", "id": task_id
        })
        if result.text == "CAPCHA_NOT_READY":
            continue
        if result.text.startswith("OK|"):
            return result.text.split("|")[1]
        raise Exception(result.text)
    raise TimeoutError("CAPTCHA solve timed out")
# Use in your scraper
def scrape_page(url, site_key):
    token = solve_captcha(site_key, url)
    response = requests.post(url, data={"g-recaptcha-response": token})
    return response.text
Troubleshooting
| Error | Cause | Fix |
|---|---|---|
| ERROR_WRONG_USER_KEY | Invalid API key | Check your key at captchaai.com dashboard |
| ERROR_ZERO_BALANCE | No funds | Add balance to your account |
| ERROR_CAPTCHA_UNSOLVABLE | Challenge couldn't be solved | Verify the site key and URL are correct |
| CAPCHA_NOT_READY (loops forever) | Slow solve or wrong parameters | Increase timeout; verify site key matches the page |
| Token rejected by site | Token expired or wrong site key | Use token within 120 seconds; confirm site key |
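The response strings in the table can be sorted into a few outcomes so the calling code knows whether to keep polling, resubmit, or stop. This is a sketch, not the API's own classification: the function name `classify_response` is hypothetical, and treating `ERROR_CAPTCHA_UNSOLVABLE` as retryable while the key and balance errors are fatal is an assumption.

```python
# Assumption: key and balance problems will not fix themselves, so stop.
FATAL_ERRORS = {"ERROR_WRONG_USER_KEY", "ERROR_ZERO_BALANCE"}
# Assumption: an unsolvable challenge may succeed if submitted again.
RETRYABLE_ERRORS = {"ERROR_CAPTCHA_UNSOLVABLE"}


def classify_response(text: str):
    """Return (status, value) where status is ready, pending, retry, fatal, or unknown."""
    text = text.strip()
    if text.startswith("OK|"):
        # Split once so a "|" inside the payload is preserved.
        return "ready", text.split("|", 1)[1]
    if text == "CAPCHA_NOT_READY":
        return "pending", None
    if text in RETRYABLE_ERRORS:
        return "retry", text
    if text in FATAL_ERRORS:
        return "fatal", text
    return "unknown", text
```

A polling loop can then branch on the status string instead of comparing raw response text in several places.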
Best Practices
- Rotate user agents — Use realistic browser User-Agent strings
- Add delays — Space requests 2-5 seconds apart to avoid rate limits
- Use proxies — Rotate residential proxies to distribute requests
- Handle cookies — Maintain session cookies across requests
- Cache tokens — Some tokens work for multiple requests within their validity window
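The first two practices can be combined in a small helper. The sketch below rotates through a list of User-Agent strings and waits a random 2 to 5 seconds between requests. The class name `RequestPacer` and its `sleep` parameter (injectable so the delay can be replaced in tests) are assumptions for illustration.

```python
import itertools
import random
import time


class RequestPacer:
    """Rotates User-Agent strings and spaces out requests with a random delay."""

    def __init__(self, user_agents, min_delay=2.0, max_delay=5.0, sleep=time.sleep):
        self._agents = itertools.cycle(user_agents)  # must be a non-empty list
        self._min = min_delay
        self._max = max_delay
        self._sleep = sleep
        self._first = True

    def next_headers(self):
        """Wait (except before the first request) and return headers for the next one."""
        if not self._first:
            self._sleep(random.uniform(self._min, self._max))
        self._first = False
        return {"User-Agent": next(self._agents)}
```

Passing `pacer.next_headers()` as `headers=` to calls made on a single `requests.Session` also covers the cookie practice, since a session keeps cookies across requests.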
FAQ
Does this work with Cloudflare-protected sites?
Yes. Use method=turnstile for Turnstile CAPTCHAs or method=cloudflare_challenge for full Cloudflare challenge pages. See How to Bypass Cloudflare Turnstile.
Do I need a headless browser?
Not always. For simple form submissions with reCAPTCHA, plain HTTP requests work. For JavaScript-heavy sites, combine CaptchaAI with Selenium or Puppeteer.
How much does it cost to scrape 10,000 pages?
At CaptchaAI's rates, solving 10,000 reCAPTCHA v2 challenges costs approximately $10. Image CAPTCHAs are even cheaper.
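In practice not every page triggers a CAPTCHA, so the bill depends on the trigger rate. The estimate below is a rough sketch that uses only the approximate figure quoted above ($10 per 10,000 reCAPTCHA v2 solves); the function name `estimate_cost` and the `captcha_rate` parameter are assumptions, and actual pricing should be checked against the provider's current rates.

```python
# Assumption: the article's approximate figure of $10 per 10,000 reCAPTCHA v2 solves.
USD_PER_10K_SOLVES = 10.0


def estimate_cost(pages: int, captcha_rate: float = 1.0) -> float:
    """Estimated cost in USD if a fraction `captcha_rate` of pages triggers a CAPTCHA."""
    solves = pages * captcha_rate
    return round(solves * USD_PER_10K_SOLVES / 10_000, 2)
```

For instance, 50,000 pages with a challenge on one page in five works out to the same 10,000 solves as the example in the answer above.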
Can I solve CAPTCHAs in parallel?
Yes. Submit multiple tasks simultaneously and poll for each result. See Solving Multiple CAPTCHAs in Parallel.
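One way to run solves concurrently is a thread pool, since each solve spends most of its time waiting on the network. The sketch below is an assumption-laden illustration: `solve_many` is a hypothetical name, and `solver` stands for any callable with the signature `solver(site_key, page_url)`, such as the `solve_captcha` function from Step 5.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def solve_many(jobs, solver, max_workers=5):
    """Solve several CAPTCHAs concurrently.

    jobs: iterable of (site_key, page_url) pairs.
    Returns a dict mapping page_url to a token, or to the exception that was raised.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(solver, key, url): url for key, url in jobs}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going if one solve fails
                results[url] = exc
    return results
```

Storing exceptions in the result instead of raising lets the caller retry only the failed URLs. Results are keyed by URL, so duplicate URLs in `jobs` would overwrite each other.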