Understanding how sites detect scrapers helps you build better automation. This guide covers the technical mechanisms behind CAPTCHA triggers — and how to handle them with CaptchaAI when they fire.
Detection Layers
Modern anti-bot systems use multiple detection layers. A CAPTCHA appears when enough signals combine to indicate automated traffic.
Layer 1: IP-Based Detection
The simplest and most common trigger:
| Signal | Threshold | Result |
|---|---|---|
| Requests per minute | >20-30 from one IP | Rate limit or CAPTCHA |
| Requests per hour | >200-500 from one IP | Temporary block |
| IP reputation | Known datacenter range | Immediate CAPTCHA |
| Geographic mismatch | VPN/proxy detected | Elevated scrutiny |
Mitigation: Proxy rotation distributes requests across IPs. See Proxy Rotation for CAPTCHA Scraping.
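To make the per-minute threshold in the table concrete, here is a minimal sketch of how a server might count requests per IP in a rolling window. The class name, the limit of 30, and the 60-second window are illustrative assumptions, not the configuration of any particular anti-bot product.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Counts requests per IP inside a rolling time window (hypothetical sketch)."""

    def __init__(self, limit=30, window_seconds=60.0):
        self.limit = limit
        self.window = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def over_limit(self, ip, now):
        """Record a request at time `now` (seconds) and report whether the IP exceeds the limit."""
        hits = self._hits[ip]
        hits.append(now)
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        return len(hits) > self.limit


counter = SlidingWindowCounter(limit=30, window_seconds=60.0)
# 31 requests, one per second, from a single (documentation-range) IP.
flags = [counter.over_limit("203.0.113.7", now=float(i)) for i in range(31)]
```

Only the last request in `flags` crosses the limit. Spreading the same 31 requests over several IPs keeps every counter under the threshold, which is why proxy rotation works against this layer.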
Layer 2: HTTP Header Analysis
Servers inspect request headers for bot indicators:
# Bot-like request (triggers CAPTCHA)
GET /page HTTP/1.1
User-Agent: python-requests/2.28.0
# Human-like request (less likely to trigger)
GET /page HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.9
Accept-Encoding: gzip, deflate, br
Referer: https://www.google.com/
Key headers that trigger detection:
- User-Agent: Default library UAs are instantly flagged
- Accept-Language: Missing = bot
- Referer: No referrer on deep pages = suspicious
- Cookie: No session cookies = new/bot visitor
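The header checks above can be sketched as a small function that lists what looks automated about a request. This is an illustrative approximation: the function name, the list of default client prefixes, and the cookie value are assumptions, and real systems weigh many more signals.

```python
def header_red_flags(headers):
    """Return the reasons a request's headers look automated (illustrative only)."""
    # HTTP header names are case-insensitive, so normalise them first.
    h = {name.lower(): value for name, value in headers.items()}
    flags = []
    user_agent = h.get("user-agent", "")
    default_clients = ("python-requests", "curl/", "wget/", "go-http-client")
    if not user_agent or user_agent.lower().startswith(default_clients):
        flags.append("default or missing User-Agent")
    if "accept-language" not in h:
        flags.append("missing Accept-Language")
    if "referer" not in h:
        flags.append("missing Referer")
    if "cookie" not in h:
        flags.append("no session cookies")
    return flags


bot_headers = {"User-Agent": "python-requests/2.28.0"}
browser_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.google.com/",
    "Cookie": "session=abc123",  # hypothetical session cookie
}
```

Running `header_red_flags` on your own scraper's outgoing headers is a quick way to see which of these four signals it is sending.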
Layer 3: JavaScript Fingerprinting
Anti-bot services run JavaScript to profile the browser:
// What fingerprinting scripts check:
navigator.webdriver;       // true in automated browsers
navigator.plugins.length;  // 0 in headless
window.chrome;             // undefined in non-Chrome
navigator.languages;       // unusual in headless
// WebGL renderer string: "SwiftShader" = headless
// Canvas fingerprint: consistent across headless instances
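reCAPTCHA v3 uses these signals to compute a trust score (0.0 = bot, 1.0 = human). reCAPTCHA v3 never shows a challenge itself; when the score is low, the site decides whether to serve a visible CAPTCHA or block the request.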
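On the server side, the values collected by such a script might be evaluated roughly as follows. This is a hedged sketch: the dictionary keys (`webdriver`, `plugins_length`, `languages`, `webgl_renderer`) and the sample renderer strings are made-up names for illustration, not the format any real fingerprinting service uses.

```python
def fingerprint_flags(fp):
    """Return the automation hints found in a client-reported fingerprint dict."""
    flags = []
    if fp.get("webdriver") is True:
        flags.append("navigator.webdriver is true")
    if fp.get("plugins_length", 0) == 0:
        flags.append("no plugins")
    if not fp.get("languages"):
        flags.append("empty navigator.languages")
    if "swiftshader" in (fp.get("webgl_renderer") or "").lower():
        flags.append("SwiftShader WebGL renderer")
    return flags


# Hypothetical fingerprints, as a collection script might report them.
headless = {
    "webdriver": True,
    "plugins_length": 0,
    "languages": [],
    "webgl_renderer": "Google SwiftShader",
}
desktop = {
    "webdriver": False,
    "plugins_length": 5,
    "languages": ["en-US", "en"],
    "webgl_renderer": "ANGLE (example GPU)",
}
```

The more flags a fingerprint collects, the lower the trust a scoring system is likely to assign.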
Layer 4: Behavioral Analysis
Advanced systems track user behavior over time:
| Human Behavior | Bot Behavior |
|---|---|
| Random navigation patterns | Sequential page access |
| Variable time on page | Consistent quick loads |
| Mouse movement and scrolling | No mouse/scroll events |
| Click variations | Exact coordinate clicks |
| Search then navigate | Direct URL access |
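One row of this table, variable versus consistent timing, is easy to quantify. The sketch below computes the coefficient of variation of the gaps between requests; the function name and the sample timestamps are assumptions for illustration, and real behavioral models combine many such features.

```python
from statistics import mean, pstdev


def timing_variation(timestamps):
    """Coefficient of variation of the gaps between consecutive request times."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    if len(gaps) < 2 or mean(gaps) == 0:
        return None  # not enough data to judge
    return pstdev(gaps) / mean(gaps)


bot_like = [0.0, 2.0, 4.0, 6.0, 8.0]       # one page every 2 seconds exactly
human_like = [0.0, 3.0, 21.0, 26.0, 70.0]  # irregular pauses between pages

bot_cv = timing_variation(bot_like)
human_cv = timing_variation(human_like)
```

A value near zero means the requests arrive on a fixed schedule, which is the pattern a scraper produces when it sleeps for a constant interval. Randomizing delays raises the value toward what human browsing looks like.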
Layer 5: Cookie and Session Tracking
Sites plant tracking cookies to identify return visitors:
# First visit — site sets tracking cookies
# Second visit — site checks:
# - Are the cookie values consistent?
# - Was the cookie modified?
# - Is this a fresh session?
Missing or inconsistent cookies elevate the suspicion score.
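A common way for a site to answer "was the cookie modified?" is to sign the cookie value with an HMAC. The following is a minimal sketch using Python's standard `hmac` module; the secret, the `value.signature` cookie layout, and the function names are assumptions, not a description of any specific site's scheme.

```python
import hashlib
import hmac

SECRET = b"server-side-secret"  # hypothetical key, known only to the site


def sign_cookie(value):
    """Return the cookie the site would set: the value plus its HMAC signature."""
    mac = hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()
    return f"{value}.{mac}"


def verify_cookie(cookie):
    """Return the original value if the signature matches, otherwise None."""
    if not cookie:
        return None  # missing cookie: fresh session or a client that drops cookies
    value, sep, mac = cookie.rpartition(".")
    if not sep:
        return None  # no signature part at all
    expected = hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()
    if hmac.compare_digest(mac.encode(), expected.encode()):
        return value
    return None  # value or signature was tampered with


cookie = sign_cookie("visitor-42")
tampered = "visitor-43." + cookie.rpartition(".")[2]
```

A scraper that discards cookies between requests always looks like a first-time visitor, and one that edits them fails verification; persisting the cookie jar unchanged across a session avoids both.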
How reCAPTCHA v3 Scoring Works
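reCAPTCHA v3 runs invisibly and assigns each request a score. The site owner chooses the thresholds and what happens at each one; the ranges below are illustrative: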
| Score Range | Classification | Action |
|---|---|---|
| 0.7 - 1.0 | Likely human | Allow through |
| 0.3 - 0.7 | Uncertain | May show CAPTCHA |
| 0.0 - 0.3 | Likely bot | Block or CAPTCHA |
Inputs to the score:
- Browser JavaScript environment
- Mouse/keyboard interaction patterns
- Historical Google cookie data
- IP reputation
- Page interaction time
When reCAPTCHA v3 assigns a low score, the site can choose to serve a reCAPTCHA v2 challenge. CaptchaAI solves both versions.
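The site-side decision can be sketched as a simple threshold function. The cut-offs mirror the score table in this section, and the action labels are hypothetical names; each site configures its own values, and the sketch assumes a score exactly on a boundary falls into the more lenient bucket.

```python
def action_for_score(score, allow_at=0.7, challenge_at=0.3):
    """Map a reCAPTCHA v3 score to a site-side action (illustrative thresholds)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0.0 and 1.0, got {score}")
    if score >= allow_at:
        return "allow"       # likely human
    if score >= challenge_at:
        return "challenge"   # uncertain: the site may show a reCAPTCHA v2
    return "block"           # likely bot: block or challenge


decisions = {s: action_for_score(s) for s in (0.9, 0.5, 0.1)}
```

From a scraper's point of view, the "challenge" and "block" branches are where a visible CAPTCHA or an error page shows up.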
How Cloudflare Detection Works
Cloudflare's Bot Management combines the following checks and responses:
- JavaScript challenge — Runs browser tests in an interstitial page
- Managed challenge — Shows Turnstile widget for borderline traffic
- Block — Rejects known malicious IPs
- IP reputation — Cloudflare sees ~20% of internet traffic, building massive IP profiles
CaptchaAI solves both Turnstile widgets (method=turnstile) and full challenge pages (method=cloudflare_challenge).
Handling Detection with CaptchaAI
When your scraper encounters a CAPTCHA, CaptchaAI solves it regardless of what triggered it:
import time

import requests

API_KEY = "YOUR_API_KEY"


def handle_captcha(captcha_type, site_key, page_url, **kwargs):
    """Submit a CAPTCHA to CaptchaAI and poll until the solution is ready."""
    params = {
        "key": API_KEY,
        "pageurl": page_url,
    }
    if captcha_type == "recaptcha_v2":
        params["method"] = "userrecaptcha"
        params["googlekey"] = site_key
    elif captcha_type == "recaptcha_v3":
        params["method"] = "userrecaptcha"
        params["googlekey"] = site_key
        params["version"] = "v3"
        params["action"] = kwargs.get("action", "verify")
    elif captcha_type == "turnstile":
        params["method"] = "turnstile"
        params["sitekey"] = site_key
    elif captcha_type == "cloudflare":
        params["method"] = "cloudflare_challenge"
        params["proxy"] = kwargs["proxy"]  # caller must pass proxy=...
        params["proxytype"] = "HTTP"
    else:
        raise ValueError(f"Unsupported captcha type: {captcha_type}")

    # Submit the task; a successful response looks like "OK|<task_id>".
    resp = requests.get("https://ocr.captchaai.com/in.php", params=params, timeout=30)
    if not resp.text.startswith("OK|"):
        raise RuntimeError(f"Submit failed: {resp.text}")
    task_id = resp.text.split("|", 1)[1]

    # Poll every 5 seconds, for up to 5 minutes.
    for _ in range(60):
        time.sleep(5)
        result = requests.get("https://ocr.captchaai.com/res.php", params={
            "key": API_KEY, "action": "get", "id": task_id
        }, timeout=30)
        if result.text == "CAPCHA_NOT_READY":
            continue
        if result.text.startswith("OK|"):
            return result.text.split("|", 1)[1]
        raise RuntimeError(result.text)  # any other response is an API error
    raise TimeoutError("CAPTCHA was not solved within 5 minutes")
FAQ
Can I avoid all CAPTCHAs with stealth techniques?
At low volumes, yes — stealth headers, proxies, and realistic behavior patterns avoid most triggers. At scale, CAPTCHAs become inevitable. CaptchaAI handles them when they appear.
Why do I get CAPTCHAs with residential proxies?
Residential IPs aren't immune. High request rates, missing cookies, or bot-like headers can still trigger CAPTCHAs. Proxies reduce frequency but don't eliminate detection.
How does reCAPTCHA know I'm a bot if I'm in a real browser?
reCAPTCHA checks dozens of signals including cookie history, mouse movement patterns, and Google account activity. Automated browsers lack the organic interaction patterns of real users.