Scraping Reliability Guide: Managing Blocks and CAPTCHA Challenges

Getting blocked during scraping wastes time and resources. This guide covers techniques to minimize detection and CAPTCHA triggers, and shows how CaptchaAI handles the CAPTCHAs that still appear.

How Sites Detect Scrapers

| Layer | Detection Method | Difficulty |
| --- | --- | --- |
| IP | Rate limiting, reputation, geolocation | Easy to circumvent |
| Headers | User-Agent, Accept-Language, Referer | Easy |
| Cookies | Session tracking, fingerprinting cookies | Medium |
| JavaScript | Browser fingerprinting, behavior analysis | Medium |
| CAPTCHA | reCAPTCHA, Turnstile, hCaptcha | Solvable via CaptchaAI |
| Behavioral | Mouse movement, scroll patterns, timing | Hard |
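
Before reacting to a block, a scraper has to work out which layer it ran into. The sketch below is one rough way to do that, assuming that HTTP 429 signals rate limiting, HTTP 403 signals an IP or header block, and that CAPTCHA widgets can be spotted by their usual class names. The function name `classify_response` and the label strings are made up for illustration.

```python
# Hypothetical helper: the labels are illustrative, not part of any library.
CAPTCHA_MARKERS = ("g-recaptcha", "cf-turnstile", "h-captcha")

def classify_response(status_code, body):
    """Return a coarse label for the kind of block (if any) a response shows."""
    text = body.lower()
    # A CAPTCHA widget in the body wins over the status code
    if any(marker in text for marker in CAPTCHA_MARKERS):
        return "captcha"
    if status_code == 429:
        return "rate_limited"
    if status_code == 403:
        return "blocked"
    return "ok"
```

A scraper could slow down on "rate_limited", switch proxy on "blocked", and hand "captcha" pages to a solver.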

Technique 1: Realistic HTTP Headers

import requests
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15",
]

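def get_headers():
    """Build browser-like headers with a randomly chosen User-Agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        # requests decodes Brotli ("br") only when a Brotli package is installed,
        # so advertise just the encodings it always handles
        "Accept-Encoding": "gzip, deflate",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
    }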
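
A randomly chosen User-Agent paired with headers in another browser's style can itself look inconsistent. One possible refinement, sketched here under assumptions, is to pick a whole header profile at once and keep it for the life of a session. `HEADER_PROFILES` and `pick_profile` are hypothetical names, and the header values are illustrative rather than captured from real browsers.

```python
import random

# Illustrative profiles: each bundles a User-Agent with headers in that
# browser's style. The exact values are assumptions, not captured traffic.
HEADER_PROFILES = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    },
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
    },
]

def pick_profile(rng=random):
    """Choose one whole profile so the headers stay mutually consistent."""
    # Return a copy so callers can add headers without touching the template
    return dict(rng.choice(HEADER_PROFILES))
```

Calling `pick_profile()` once per session, rather than once per request, also avoids a single cookie jar appearing to switch browsers mid-visit.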

Technique 2: Request Timing and Rate Limiting

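import random
import time

import requests

def polite_delay():
    """Sleep for a random 2-7 seconds between requests."""
    time.sleep(random.uniform(2, 7))

def scrape_pages(urls):
    """Fetch each URL in turn with fresh headers and a polite pause."""
    session = requests.Session()
    results = []

    for url in urls:
        # update() keeps the session's default headers; plain assignment drops them
        session.headers.update(get_headers())
        resp = session.get(url, timeout=30)
        results.append(resp.text)
        polite_delay()

    return results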
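
When one scraper touches several sites, a single global delay is slower than it needs to be, since rate limits are typically applied per host. The following sketch tracks the last request time per host instead. `HostThrottle` is a hypothetical class, and the 2-second minimum interval is an assumption chosen to match the lower bound used in `polite_delay`.

```python
import time
from urllib.parse import urlparse

class HostThrottle:
    """Track the last request time per host and report how long to wait."""

    def __init__(self, min_interval=2.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.last_seen = {}

    def wait_time(self, url):
        """Seconds to sleep before requesting url (0.0 if none needed)."""
        host = urlparse(url).netloc
        last = self.last_seen.get(host)
        if last is None:
            return 0.0
        return max(0.0, self.min_interval - (self.clock() - last))

    def mark(self, url):
        """Record that a request to url's host was just sent."""
        self.last_seen[urlparse(url).netloc] = self.clock()
```

In a fetch loop this would be used as `time.sleep(throttle.wait_time(url))` before the request and `throttle.mark(url)` right after it.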

Technique 3: Proxy Rotation

PROXIES = [
    "http://user:pass@proxy1:8080",
    "http://user:pass@proxy2:8080",
    "http://user:pass@proxy3:8080",
]

def get_proxy():
    proxy = random.choice(PROXIES)
    return {"http": proxy, "https": proxy}

session = requests.Session()
session.proxies = get_proxy()
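
Random choice keeps handing out a proxy even after it has been blocked. A possible alternative, sketched below, rotates round-robin and rests any proxy that just failed. `ProxyPool`, its method names, and the 300-second cooldown are all assumptions for illustration.

```python
import time

class ProxyPool:
    """Round-robin over proxies, skipping any that failed recently."""

    def __init__(self, proxies, cooldown=300.0, clock=time.monotonic):
        self.proxies = list(proxies)
        self.cooldown = cooldown
        self.clock = clock  # injectable for testing
        self.failed_at = {}
        self.index = 0

    def mark_failed(self, proxy):
        """Record a failure so the proxy is rested for the cooldown period."""
        self.failed_at[proxy] = self.clock()

    def next_proxy(self):
        """Return the next usable proxy URL, or None if all are cooling down."""
        for _ in range(len(self.proxies)):
            proxy = self.proxies[self.index % len(self.proxies)]
            self.index += 1
            failed = self.failed_at.get(proxy)
            if failed is None or self.clock() - failed >= self.cooldown:
                return proxy
        return None
```

The returned URL can be wrapped as `{"http": proxy, "https": proxy}` in the same way `get_proxy` does, with `mark_failed` called whenever a request through that proxy is blocked or times out.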

Technique 4: Session Cookie Management

def create_warm_session(base_url):
    """Create a session with realistic cookie history."""
    session = requests.Session()
    session.headers.update(get_headers())

    # Visit the homepage first to get cookies
    session.get(base_url)
    time.sleep(random.uniform(1, 3))

    # Visit a few pages to build cookie history
    session.get(f"{base_url}/about")
    time.sleep(random.uniform(1, 3))

    return session
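
Warming a session on every run repeats the same visits each time. One option is to persist cookies between runs, sketched here with the standard library's `MozillaCookieJar`. This assumes the jar is assigned to `session.cookies`, which requests documents as accepting any `http.cookiejar`-compatible object. The helper names and the file path are hypothetical.

```python
import os
from http.cookiejar import MozillaCookieJar

def open_cookie_jar(path):
    """Load cookies from path if the file exists; otherwise start empty."""
    jar = MozillaCookieJar(path)
    if os.path.exists(path):
        # ignore_discard keeps session cookies that carry no expiry date
        jar.load(ignore_discard=True)
    return jar

def save_cookie_jar(jar):
    """Write the jar back to the file it was opened with."""
    jar.save(ignore_discard=True)
```

A run would then start with `session.cookies = open_cookie_jar("cookies.txt")` and end with `save_cookie_jar(session.cookies)`.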

Technique 5: Referrer Chain

def scrape_with_referrer(session, urls):
    """Fetch URLs in order, sending the previous URL as the Referer."""
    prev_url = None
    results = []

    for url in urls:
        headers = get_headers()
        if prev_url:
            headers["Referer"] = prev_url

        resp = session.get(url, headers=headers, timeout=30)
        results.append(resp.text)
        prev_url = url
        polite_delay()

    return results
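
Sending the full previous URL is only realistic within one site. Modern browsers default to the `strict-origin-when-cross-origin` referrer policy, which trims cross-origin referrers to the origin and drops them on an HTTPS to HTTP downgrade. The sketch below approximates that behavior; `referer_value` is a hypothetical helper and ignores any per-page policy a site may set.

```python
from urllib.parse import urlsplit

def referer_value(prev_url, next_url):
    """Approximate the browser default policy strict-origin-when-cross-origin."""
    if not prev_url:
        return None
    prev, nxt = urlsplit(prev_url), urlsplit(next_url)
    if prev.scheme == "https" and nxt.scheme != "https":
        return None  # no Referer on an HTTPS -> HTTP downgrade
    if (prev.scheme, prev.netloc) == (nxt.scheme, nxt.netloc):
        # Same origin: full URL without the fragment
        return prev_url.split("#", 1)[0]
    return f"{prev.scheme}://{prev.netloc}/"  # cross-origin: origin only
```

The header would be set only when the function returns a value, leaving `Referer` out of the request entirely otherwise.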

When CAPTCHAs Still Appear: CaptchaAI Integration

Even with a careful stealth configuration, CAPTCHAs will eventually appear at scale. Add CaptchaAI as a fallback:

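import time

import requests
from bs4 import BeautifulSoup

API_KEY = "YOUR_API_KEY"

def solve_if_captcha(session, resp, url):
    """Return resp unchanged, or solve a reCAPTCHA on the page and resubmit."""
    # Cheap substring check first so clean pages skip HTML parsing
    captcha_indicators = ["g-recaptcha", "cf-turnstile", "captcha"]
    if not any(ind in resp.text.lower() for ind in captcha_indicators):
        return resp  # No CAPTCHA, return original response

    soup = BeautifulSoup(resp.text, "html.parser")

    # reCAPTCHA widget: <div class="g-recaptcha" data-sitekey="...">
    rc = soup.find("div", class_="g-recaptcha")
    site_key = rc.get("data-sitekey") if rc else None
    if not site_key:
        return resp  # Other CAPTCHA types are not handled by this function

    submit = requests.get("https://ocr.captchaai.com/in.php", params={
        "key": API_KEY, "method": "userrecaptcha",
        "googlekey": site_key, "pageurl": url
    }, timeout=30)
    if not submit.text.startswith("OK|"):
        raise RuntimeError(f"CaptchaAI submit failed: {submit.text}")
    task_id = submit.text.split("|", 1)[1]

    # Poll every 5 seconds, for up to 5 minutes
    for _ in range(60):
        time.sleep(5)
        result = requests.get("https://ocr.captchaai.com/res.php", params={
            "key": API_KEY, "action": "get", "id": task_id
        }, timeout=30)
        if result.text == "CAPCHA_NOT_READY":
            continue
        if result.text.startswith("OK|"):
            token = result.text.split("|", 1)[1]
            # Assumes the page accepts the token as a POST to the same URL;
            # real forms usually need their other fields submitted as well
            return session.post(url, data={"g-recaptcha-response": token}, timeout=30)
        raise RuntimeError(f"CaptchaAI solve failed: {result.text}")

    return resp  # Timed out: hand back the original (CAPTCHA) page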

# Usage in scraper
session = create_warm_session("https://example.com")
resp = session.get("https://example.com/data")
resp = solve_if_captcha(session, resp, "https://example.com/data")
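
The submit and poll calls both return pipe-delimited plain text, so the string handling can be pulled into one small function. This is a sketch based only on the reply shapes used above (`OK|<value>` and `CAPCHA_NOT_READY`); `parse_api_reply` is a hypothetical name, and treating every other reply as an error is an assumption.

```python
def parse_api_reply(text):
    """Split a pipe-delimited reply into (status, value).

    "OK|abc"            -> ("ok", "abc")
    "CAPCHA_NOT_READY"  -> ("pending", None)
    anything else       -> ("error", <raw text>)
    """
    text = text.strip()
    if text.startswith("OK|"):
        # Split once so a value containing "|" stays intact
        return "ok", text.split("|", 1)[1]
    if text == "CAPCHA_NOT_READY":
        return "pending", None
    return "error", text
```

Centralizing the parsing means a failed submit surfaces as an explicit error status instead of an `IndexError` from splitting an unexpected reply.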

Complete Stealth Scraper

import requests
import time
import random
from bs4 import BeautifulSoup

API_KEY = "YOUR_API_KEY"

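class StealthScraper:
    """Combines the techniques above.

    Relies on get_headers, polite_delay, and solve_if_captcha
    from the earlier sections.
    """

    def __init__(self, proxies=None):
        self.session = requests.Session()
        self.proxies = proxies or []

    def scrape(self, url):
        self.session.headers.update(get_headers())
        if self.proxies:
            # Pick from this instance's proxy list, not the module-level PROXIES
            proxy = random.choice(self.proxies)
            self.session.proxies = {"http": proxy, "https": proxy}

        resp = self.session.get(url, timeout=30)
        resp = solve_if_captcha(self.session, resp, url)

        polite_delay()
        return resp.text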

    def scrape_batch(self, urls):
        results = []
        for url in urls:
            try:
                html = self.scrape(url)
                results.append({"url": url, "html": html, "success": True})
            except Exception as e:
                results.append({"url": url, "error": str(e), "success": False})
        return results
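
`scrape_batch` records a failure and moves on, so a transient error costs the page. A retry wrapper is one way to cover the "error handling and retries" item; the sketch below uses exponential backoff with jitter. `with_retries` and its parameters are hypothetical, and three attempts with a 1-second base delay are arbitrary starting values.

```python
import random
import time

def with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out.

    Waits base_delay * 2**n plus up to 1 second of jitter between tries
    and re-raises the last exception if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
```

Inside `scrape_batch`, the call would become `with_retries(lambda: self.scrape(url))`, keeping the existing `try`/`except` as the final safety net.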

Detection Checklist

- Rotating User-Agent strings
- Realistic Accept/Language headers
- Random delays between requests (2-7s)
- Proxy rotation (residential preferred)
- Session cookie management
- Referrer headers
- CaptchaAI integration for CAPTCHA fallback
- Error handling and retries

FAQ

What's the most important stealth technique?

Proxy rotation usually has the highest impact, because many blocking decisions are IP-based. Combining residential proxies with CaptchaAI covers both IP-level and CAPTCHA-level protection.

Can I scrape at high speed without getting blocked?

Not from a single IP. Distribute requests across many proxies and accept that CAPTCHAs will appear. CaptchaAI solves them in seconds, so the overhead is minimal.
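
As a rough illustration of why proxy count matters, the following back-of-the-envelope sketch estimates throughput when each proxy works in parallel and waits the 2-7 second delay from `polite_delay` between requests. It ignores response time and CAPTCHA solve time, so it is an upper bound under those assumptions, and the function name is hypothetical.

```python
def estimated_requests_per_minute(n_proxies, min_delay=2.0, max_delay=7.0):
    """Rough ceiling: each proxy sends one request per average delay."""
    average_delay = (min_delay + max_delay) / 2
    return n_proxies * 60.0 / average_delay
```

With the default delays, one IP tops out at about 13 requests per minute, and nine proxies running in parallel reach roughly 120.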

Does CaptchaAI work with all anti-bot systems?

CaptchaAI solves the CAPTCHA component of anti-bot systems (reCAPTCHA, Turnstile, hCaptcha, Cloudflare Challenge). Other detection layers (JavaScript fingerprinting, behavioral analysis) require browser-level solutions.
