How CAPTCHA Detection Works in Web Scraping

Understanding how sites detect scrapers helps you build better automation. This guide covers the technical mechanisms behind CAPTCHA triggers — and how to handle them with CaptchaAI when they fire.

Detection Layers

Modern anti-bot systems use multiple detection layers. A CAPTCHA appears when enough signals combine to indicate automated traffic.

Layer 1: IP-Based Detection

The simplest and most common trigger:

| Signal | Threshold | Result |
| --- | --- | --- |
| Requests per minute | >20-30 from one IP | Rate limit or CAPTCHA |
| Requests per hour | >200-500 from one IP | Temporary block |
| IP reputation | Known datacenter range | Immediate CAPTCHA |
| Geographic mismatch | VPN/proxy detected | Elevated scrutiny |

Mitigation: Proxy rotation distributes requests across IPs. See Proxy Rotation for CAPTCHA Scraping.
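As a minimal sketch of that mitigation, the class below rotates through a proxy list and keeps each proxy under a per-minute budget. The `ProxyRotator` name, the default of 20 requests per minute (taken from the low end of the table above), and the injectable `clock` are illustrative assumptions, not part of any library.

```python
import itertools
import time
from collections import defaultdict, deque


class ProxyRotator:
    """Round-robin over proxies while capping requests per proxy per minute."""

    def __init__(self, proxies, max_per_minute=20, clock=time.monotonic):
        self._proxies = list(proxies)
        self._cycle = itertools.cycle(self._proxies)
        self._max = max_per_minute
        self._clock = clock  # injectable so the logic can be tested without waiting
        self._history = defaultdict(deque)  # proxy -> timestamps of recent requests

    def next_proxy(self):
        """Return a proxy with budget left in the last 60 s, or None if all are saturated."""
        now = self._clock()
        for _ in range(len(self._proxies)):
            proxy = next(self._cycle)
            recent = self._history[proxy]
            # Drop timestamps that have left the 60-second window
            while recent and now - recent[0] >= 60:
                recent.popleft()
            if len(recent) < self._max:
                recent.append(now)
                return proxy
        return None
```

With the `requests` library, the returned value can be passed as `proxies={"http": proxy, "https": proxy}`. When `next_proxy()` returns `None`, every proxy has used its budget and the scraper should pause rather than push one IP past the threshold.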

Layer 2: HTTP Header Analysis

Servers inspect request headers for bot indicators:

# Bot-like request (triggers CAPTCHA)
GET /page HTTP/1.1
User-Agent: python-requests/2.28.0

# Human-like request (less likely to trigger)
GET /page HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.9
Accept-Encoding: gzip, deflate, br
Referer: https://www.google.com/

Key headers that trigger detection:

  • User-Agent: default library UAs (such as python-requests) are flagged immediately
  • Accept-Language: a missing header is a strong bot signal
  • Referer: no referrer on deep pages looks suspicious
  • Cookie: no session cookies suggests a new or automated visitor
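A small sketch of how a scraper might assemble the human-like headers shown above and check which of the key headers are still missing. The helper names `build_headers` and `missing_signals` are hypothetical. `br` is left out of `Accept-Encoding` on the assumption that the HTTP client cannot decode Brotli without an extra package.

```python
def build_headers(referer=None, language="en-US,en;q=0.9"):
    """Return browser-like request headers modeled on the human-like example above."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": language,
        # Advertise only encodings the client can actually decode
        "Accept-Encoding": "gzip, deflate",
    }
    if referer:
        headers["Referer"] = referer
    return headers


def missing_signals(headers):
    """List which of the four key headers are absent (case-insensitive)."""
    expected = ("User-Agent", "Accept-Language", "Referer", "Cookie")
    present = {name.lower() for name in headers}
    return [name for name in expected if name.lower() not in present]
```

The `Cookie` header is deliberately not set here: it should come from a persistent session that stores whatever the site sets, not from a hard-coded value.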

Layer 3: JavaScript Fingerprinting

Anti-bot services run JavaScript to profile the browser:

// What fingerprinting scripts check:
navigator.webdriver        // true in automated browsers
navigator.plugins.length   // often 0 in headless browsers
window.chrome              // undefined in non-Chrome browsers
navigator.languages        // empty or unusual in headless browsers
// WebGL renderer string: "SwiftShader" suggests headless rendering
// Canvas fingerprint: identical across headless instances

reCAPTCHA v3 uses these signals to compute a trust score (0.0 = bot, 1.0 = human). v3 itself never shows a challenge; when the score is low, the site may respond by serving a visible CAPTCHA.
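To make the checks above concrete, here is a toy Python model of how a detector might flag a collected fingerprint. The `fp` dictionary layout and the function name are assumptions made for illustration. Real anti-bot services weigh many more signals and do not publish their logic.

```python
def headless_indicators(fp):
    """Return the names of signals in `fp` that match the headless patterns listed above.

    `fp` is a hypothetical dict of values a fingerprinting script might collect.
    A missing key is treated like an empty value, so it also counts as a hit.
    """
    hits = []
    if fp.get("webdriver") is True:
        hits.append("navigator.webdriver")
    if fp.get("plugins_length", 0) == 0:
        hits.append("navigator.plugins")
    if not fp.get("languages"):
        hits.append("navigator.languages")
    if "swiftshader" in fp.get("webgl_renderer", "").lower():
        hits.append("webgl_renderer")
    return hits
```

Running this against the values your own automated browser reports is a quick way to see which signals it leaks before a site does.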

Layer 4: Behavioral Analysis

Advanced systems track user behavior over time:

| Human Behavior | Bot Behavior |
| --- | --- |
| Random navigation patterns | Sequential page access |
| Variable time on page | Consistent quick loads |
| Mouse movement and scrolling | No mouse/scroll events |
| Click variations | Exact coordinate clicks |
| Search then navigate | Direct URL access |
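Two of the bot patterns in the table, sequential page access and consistent load timing, can be softened with a few lines of code. The sketch below is one possible approach. The `visit_plan` name, the 4-second base delay, and the 50% jitter are arbitrary assumptions, not recommended values.

```python
import random


def visit_plan(urls, base=4.0, jitter=0.5, rng=None):
    """Return (url, delay_seconds) pairs in shuffled order with randomized delays.

    Each delay is `base` scaled by a random factor in [1 - jitter, 1 + jitter].
    """
    rng = rng or random.Random()
    order = list(urls)
    rng.shuffle(order)  # avoid strictly sequential page access
    return [(url, base * rng.uniform(1 - jitter, 1 + jitter)) for url in order]
```

A scraper would sleep for each delay before requesting the paired URL. This addresses timing and ordering only; it does nothing about missing mouse or scroll events.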

Sites plant tracking cookies to identify return visitors:

# First visit — site sets tracking cookies
# Second visit — site checks:
# - Are the cookie values consistent?
# - Was the cookie modified?
# - Is this a fresh session?

Missing or inconsistent cookies elevate the suspicion score.
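One way to keep cookies consistent between visits is to persist them to disk and reload them on the next run. The sketch below uses only the Python standard library. The function names and the cookie file path are assumptions, and `requests.Session` offers similar in-memory persistence if you already use that library.

```python
import os
import urllib.request
from http.cookiejar import MozillaCookieJar


def open_cookie_session(cookie_path):
    """Return (opener, jar); the jar reloads cookies saved by an earlier run."""
    jar = MozillaCookieJar(cookie_path)
    if os.path.exists(cookie_path):
        # ignore_discard keeps session cookies that carry no expiry date
        jar.load(ignore_discard=True)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def close_cookie_session(jar):
    """Write the jar back to its file so the next visit presents the same cookies."""
    jar.save(ignore_discard=True)
```

Requests made through `opener.open(url)` send and store cookies automatically, so a second visit looks like a returning session instead of a fresh one.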

How reCAPTCHA v3 Scoring Works

reCAPTCHA v3 runs invisibly and assigns a score:

| Score Range | Classification | Action |
| --- | --- | --- |
| 0.7 - 1.0 | Likely human | Allow through |
| 0.3 - 0.7 | Uncertain | May show CAPTCHA |
| 0.0 - 0.3 | Likely bot | Block or CAPTCHA |

Inputs to the score:

  • Browser JavaScript environment
  • Mouse/keyboard interaction patterns
  • Historical Google cookie data
  • IP reputation
  • Page interaction time

When reCAPTCHA v3 assigns a low score, the site can choose to serve a reCAPTCHA v2 challenge. CaptchaAI solves both versions.
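From the site's side, the score bands in the table above amount to a simple threshold check. The sketch below assumes the 0.7 and 0.3 cut-offs from the table and assigns the boundary values 0.7 and 0.3 to the upper band. Actual thresholds are chosen by each site owner, and the function name and return labels are illustrative.

```python
def classify_score(score, human_min=0.7, bot_max=0.3):
    """Map a reCAPTCHA v3 score to the action a site might take."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("reCAPTCHA v3 scores range from 0.0 to 1.0")
    if score >= human_min:
        return "allow"
    if score < bot_max:
        return "block_or_captcha"
    return "maybe_captcha"
```

Because each site sets its own thresholds, the same score can pass on one site and trigger a v2 challenge on another.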

How Cloudflare Detection Works

Cloudflare's Bot Management combines several responses and signals:

  1. JavaScript challenge: runs browser tests in an interstitial page
  2. Managed challenge: shows a Turnstile widget for borderline traffic
  3. Block: rejects known malicious IPs
  4. IP reputation: Cloudflare sits in front of roughly 20% of websites, which lets it build extensive IP profiles

CaptchaAI solves both Turnstile widgets (method=turnstile) and full challenge pages (method=cloudflare_challenge).
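Before choosing a method, a scraper has to tell these cases apart. The sketch below is a heuristic, not an official detection routine. It relies on the `cf-mitigated: challenge` response header that Cloudflare documents for challenge pages and on the `cf-turnstile` class used by Turnstile widgets. Treating a 403 from a `cloudflare` server as a block is an assumption.

```python
def classify_cloudflare_response(status, headers, body):
    """Return 'challenge_page', 'turnstile_widget', 'blocked', or 'ok'."""
    lowered = {k.lower(): v for k, v in headers.items()}
    # Cloudflare marks challenge responses with this header
    if lowered.get("cf-mitigated", "").lower() == "challenge":
        return "challenge_page"
    # Embedded Turnstile widgets render into an element with this class
    if "cf-turnstile" in body:
        return "turnstile_widget"
    # Heuristic: a 403 served by Cloudflare without a challenge is treated as a block
    if status == 403 and "cloudflare" in lowered.get("server", "").lower():
        return "blocked"
    return "ok"
```

A `challenge_page` result maps to `method=cloudflare_challenge` and a `turnstile_widget` result to `method=turnstile`. A `blocked` result has nothing to solve and usually calls for a different IP.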

Handling Detection with CaptchaAI

When your scraper encounters a CAPTCHA, CaptchaAI solves it regardless of what triggered it:

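import time

import requests

API_KEY = "YOUR_API_KEY"
API_BASE = "https://ocr.captchaai.com"

def handle_captcha(captcha_type, site_key, page_url, **kwargs):
    """Submit a CAPTCHA to CaptchaAI and poll until a token is returned."""
    params = {
        "key": API_KEY,
        "pageurl": page_url
    }

    if captcha_type == "recaptcha_v2":
        params["method"] = "userrecaptcha"
        params["googlekey"] = site_key
    elif captcha_type == "recaptcha_v3":
        params["method"] = "userrecaptcha"
        params["googlekey"] = site_key
        params["version"] = "v3"
        params["action"] = kwargs.get("action", "verify")
    elif captcha_type == "turnstile":
        params["method"] = "turnstile"
        params["sitekey"] = site_key
    elif captcha_type == "cloudflare":
        # Full challenge pages need a proxy; site_key is not sent for this method
        params["method"] = "cloudflare_challenge"
        params["proxy"] = kwargs["proxy"]
        params["proxytype"] = "HTTP"
    else:
        raise ValueError(f"Unsupported captcha_type: {captcha_type}")

    # A successful submit returns "OK|<task_id>"; anything else is an error code
    resp = requests.get(f"{API_BASE}/in.php", params=params, timeout=30)
    if not resp.text.startswith("OK|"):
        raise RuntimeError(f"Submit failed: {resp.text}")
    task_id = resp.text.split("|", 1)[1]

    # Poll every 5 seconds, up to 60 times (about 5 minutes)
    for _ in range(60):
        time.sleep(5)
        result = requests.get(f"{API_BASE}/res.php", params={
            "key": API_KEY, "action": "get", "id": task_id
        }, timeout=30)
        if result.text == "CAPCHA_NOT_READY":
            continue
        if result.text.startswith("OK|"):
            return result.text.split("|", 1)[1]
        raise RuntimeError(f"Solve failed: {result.text}")

    raise TimeoutError("CAPTCHA was not solved within the polling window")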
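`handle_captcha` expects a CAPTCHA type and a sitekey, which usually have to be read from the page first. The sketch below is one possible way to do that with a regular expression. It assumes the widget is embedded as an element carrying a `g-recaptcha` or `cf-turnstile` class and a `data-sitekey` attribute. The `find_widget` name is hypothetical, and reCAPTCHA v3 or widgets rendered from JavaScript will not be found this way.

```python
import re

# CSS class on the widget element -> captcha_type string used by handle_captcha
_WIDGET_CLASSES = {"g-recaptcha": "recaptcha_v2", "cf-turnstile": "turnstile"}
_TAG_RE = re.compile(r"<[^>]+>")
_SITEKEY_RE = re.compile(r'data-sitekey=["\']([^"\']+)["\']')


def find_widget(html):
    """Return (captcha_type, site_key) for the first widget tag found, else None."""
    for tag in _TAG_RE.findall(html):
        key = _SITEKEY_RE.search(tag)
        if not key:
            continue
        for css_class, captcha_type in _WIDGET_CLASSES.items():
            if css_class in tag:
                return captcha_type, key.group(1)
    return None
```

If `find_widget(page_html)` returns a pair, its two values can be passed as the first two arguments to `handle_captcha`, followed by the page URL.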

FAQ

Can I avoid all CAPTCHAs with stealth techniques?

At low volumes, yes — stealth headers, proxies, and realistic behavior patterns avoid most triggers. At scale, CAPTCHAs become inevitable. CaptchaAI handles them when they appear.

Why do I get CAPTCHAs with residential proxies?

Residential IPs aren't immune. High request rates, missing cookies, or bot-like headers can still trigger CAPTCHAs. Proxies reduce frequency but don't eliminate detection.

How does reCAPTCHA know I'm a bot if I'm in a real browser?

reCAPTCHA checks dozens of signals including cookie history, mouse movement patterns, and Google account activity. Automated browsers lack the organic interaction patterns of real users.
