Getting blocked during scraping wastes time and resources. This guide covers techniques to minimize detection and CAPTCHA triggers — and how CaptchaAI handles the CAPTCHAs that still appear.
## How Sites Detect Scrapers
| Layer | Detection Method | Difficulty |
|---|---|---|
| IP | Rate limiting, reputation, geolocation | Easy to circumvent |
| Headers | User-Agent, Accept-Language, Referer | Easy |
| Cookies | Session tracking, fingerprinting cookies | Medium |
| JavaScript | Browser fingerprinting, behavior analysis | Medium |
| CAPTCHA | reCAPTCHA, Turnstile, hCaptcha | Solvable via CaptchaAI |
| Behavioral | Mouse movement, scroll patterns, timing | Hard |
## Technique 1: Realistic HTTP Headers

```python
import requests
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15",
]

def get_headers():
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
    }
```
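Modern Chromium and Firefox builds also send `Sec-Fetch-*` fetch-metadata headers, and their absence can flag a request as non-browser traffic. A sketch of pairing a Chrome User-Agent with matching fetch-metadata values (`get_chrome_headers` is a hypothetical helper, not part of any library):

```python
import random

# Fetch-metadata headers Chrome sends for a top-level page navigation
# (typing a URL or following an external link).
SEC_FETCH_HEADERS = {
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",  # use "same-origin" when following an internal link
    "Sec-Fetch-User": "?1",
}

def get_chrome_headers(user_agents):
    """Combine a random Chrome UA with matching fetch-metadata headers."""
    return {
        "User-Agent": random.choice(user_agents),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        **SEC_FETCH_HEADERS,
    }
```

Only send these alongside User-Agent strings for browsers that actually emit them; a Safari UA with Chrome-style fetch metadata is itself an inconsistency a detector can spot.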
## Technique 2: Request Timing and Rate Limiting

```python
import time
import random

def polite_delay():
    """Sleep for a random 2-7 seconds between requests."""
    time.sleep(random.uniform(2, 7))

def scrape_pages(urls):
    session = requests.Session()
    results = []
    for url in urls:
        session.headers = get_headers()
        resp = session.get(url, timeout=30)
        results.append(resp.text)
        polite_delay()
    return results
```
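When a site does start rate-limiting, it typically answers with HTTP 429 (or 503). A hedged sketch of retrying those responses with exponential backoff and jitter; `fetch_with_backoff` is a hypothetical helper that takes any callable returning a response-like object (e.g. a `functools.partial` around `session.get`):

```python
import time
import random

def fetch_with_backoff(fetch, url, max_retries=4, base_delay=2.0):
    """Call fetch(url); on a 429/503 status, back off exponentially and retry.

    `fetch` is any callable returning an object with a .status_code attribute.
    """
    for attempt in range(max_retries + 1):
        resp = fetch(url)
        if resp.status_code not in (429, 503):
            return resp
        if attempt == max_retries:
            break
        # Exponential backoff with jitter: 2s, 4s, 8s, ... plus 0-1s of noise
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
    return resp
```

The jitter matters: many clients retrying on exact power-of-two boundaries produce a detectable (and self-defeating) synchronized burst.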
## Technique 3: Proxy Rotation

```python
PROXIES = [
    "http://user:pass@proxy1:8080",
    "http://user:pass@proxy2:8080",
    "http://user:pass@proxy3:8080",
]

def get_proxy():
    proxy = random.choice(PROXIES)
    return {"http": proxy, "https": proxy}

session = requests.Session()
session.proxies = get_proxy()
```
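`random.choice` can pick the same proxy several times in a row, concentrating traffic on one IP. A round-robin rotator spreads requests evenly across the pool; a minimal sketch (the `ProxyRotator` class is an illustration, not a library API):

```python
from itertools import cycle

class ProxyRotator:
    """Round-robin proxy rotation: equal use per proxy, no immediate repeats."""

    def __init__(self, proxies):
        self._pool = cycle(proxies)

    def next(self):
        proxy = next(self._pool)
        # requests expects a scheme-to-proxy mapping
        return {"http": proxy, "https": proxy}
```

Usage: `rotator = ProxyRotator(PROXIES)`, then set `session.proxies = rotator.next()` before each request instead of once per session.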
## Technique 4: Session and Cookie Management

```python
def create_warm_session(base_url):
    """Create a session with realistic cookie history."""
    session = requests.Session()
    session.headers = get_headers()
    # Visit the homepage first to pick up cookies
    session.get(base_url)
    time.sleep(random.uniform(1, 3))
    # Browse a few pages to build cookie history
    session.get(f"{base_url}/about")
    time.sleep(random.uniform(1, 3))
    return session
```
## Technique 5: Referrer Chain

```python
def scrape_with_referrer(session, urls):
    """Add realistic Referer headers based on the previously visited page."""
    prev_url = None
    results = []
    for url in urls:
        headers = get_headers()
        if prev_url:
            headers["Referer"] = prev_url
        resp = session.get(url, headers=headers)
        results.append(resp)
        prev_url = url
        polite_delay()
    return results
```
## When CAPTCHAs Still Appear: CaptchaAI Integration

Even with a perfect stealth configuration, CAPTCHAs will eventually appear at scale. Add CaptchaAI as a fallback:
```python
import requests
import time
from bs4 import BeautifulSoup

API_KEY = "YOUR_API_KEY"

def solve_if_captcha(session, resp, url):
    """Check the response for a CAPTCHA and solve it if present."""
    captcha_indicators = ["g-recaptcha", "cf-turnstile", "captcha"]
    if not any(ind in resp.text.lower() for ind in captcha_indicators):
        return resp  # No CAPTCHA, return the original response

    soup = BeautifulSoup(resp.text, "html.parser")

    # reCAPTCHA
    rc = soup.find("div", class_="g-recaptcha")
    if rc:
        site_key = rc["data-sitekey"]
        submit = requests.get("https://ocr.captchaai.com/in.php", params={
            "key": API_KEY, "method": "userrecaptcha",
            "googlekey": site_key, "pageurl": url,
        })
        if not submit.text.startswith("OK|"):
            raise RuntimeError(f"CaptchaAI submit failed: {submit.text}")
        task_id = submit.text.split("|")[1]
        for _ in range(60):
            time.sleep(5)
            result = requests.get("https://ocr.captchaai.com/res.php", params={
                "key": API_KEY, "action": "get", "id": task_id,
            })
            if result.text == "CAPCHA_NOT_READY":  # the API's literal spelling
                continue
            if result.text.startswith("OK|"):
                token = result.text.split("|", 1)[1]
                return session.post(url, data={"g-recaptcha-response": token})
    return resp
```
```python
# Usage in a scraper
session = create_warm_session("https://example.com")
resp = session.get("https://example.com/data")
resp = solve_if_captcha(session, resp, "https://example.com/data")
```
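The submit-then-poll pattern recurs for every CAPTCHA type, so the polling loop can be factored into a reusable helper. A sketch under the assumption that the caller supplies a zero-argument callable returning the raw `res.php` text (`poll_for_token` is a hypothetical helper, not part of the CaptchaAI API):

```python
import time

def poll_for_token(get_result, interval=5, max_attempts=60):
    """Poll a solver until it returns "OK|<token>"; return the token or None.

    `get_result` is a zero-argument callable returning the raw result string,
    e.g. a lambda wrapping the res.php request. "CAPCHA_NOT_READY" is the
    literal not-ready response used by 2captcha-compatible APIs.
    """
    for _ in range(max_attempts):
        text = get_result()
        if text.startswith("OK|"):
            return text.split("|", 1)[1]
        if text != "CAPCHA_NOT_READY":
            return None  # permanent error (bad key, unsolvable, etc.)
        time.sleep(interval)
    return None
```

Returning `None` on any non-ready error string lets the caller distinguish "unsolvable, skip this page" from a successfully retrieved token.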
## Complete Stealth Scraper

```python
import requests
import time
import random
from bs4 import BeautifulSoup

API_KEY = "YOUR_API_KEY"

class StealthScraper:
    def __init__(self, proxies=None):
        self.session = requests.Session()
        self.proxies = proxies or []

    def scrape(self, url):
        self.session.headers = get_headers()
        if self.proxies:
            # Rotate among this scraper's own proxy list
            proxy = random.choice(self.proxies)
            self.session.proxies = {"http": proxy, "https": proxy}
        resp = self.session.get(url, timeout=30)
        resp = solve_if_captcha(self.session, resp, url)
        polite_delay()
        return resp.text

    def scrape_batch(self, urls):
        results = []
        for url in urls:
            try:
                html = self.scrape(url)
                results.append({"url": url, "html": html, "success": True})
            except Exception as e:
                results.append({"url": url, "error": str(e), "success": False})
        return results
```
## Detection Checklist
| Check | Status |
|---|---|
| Rotating User-Agent strings | ☐ |
| Realistic Accept/Language headers | ☐ |
| Random delays between requests (2-7s) | ☐ |
| Proxy rotation (residential preferred) | ☐ |
| Session cookie management | ☐ |
| Referrer headers | ☐ |
| CaptchaAI integration for CAPTCHA fallback | ☐ |
| Error handling and retries | ☐ |
## FAQ

### What's the most important stealth technique?

Proxy rotation has the highest impact. Most blocking decisions are IP-based. Combining residential proxies with CaptchaAI covers both IP-level and CAPTCHA-level protection.

### Can I scrape at high speed without getting blocked?

Not from a single IP. Distribute requests across many proxies and accept that CAPTCHAs will appear. CaptchaAI solves them in seconds, so the overhead is minimal.

### Does CaptchaAI work with all anti-bot systems?

CaptchaAI solves the CAPTCHA component of anti-bot systems (reCAPTCHA, Turnstile, hCaptcha, Cloudflare Challenge). Other detection layers (JavaScript fingerprinting, behavioral analysis) require browser-level solutions.