Every CAPTCHA solved costs time and money. These techniques reduce how often CAPTCHAs appear during scraping — and CaptchaAI handles the ones that still get through.
## Prevention Techniques
### 1. Use Residential Proxies

Datacenter IPs trigger CAPTCHAs 5-10x more often than residential IPs:

```python
import requests

# Route requests through a residential proxy
proxies = {
    "http": "http://user:pass@residential-proxy.example.com:8080",
    "https": "http://user:pass@residential-proxy.example.com:8080",
}
resp = requests.get(url, proxies=proxies)
```
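The snippet above pins every request to a single endpoint. A minimal rotation sketch over a small pool could look like the following; the endpoint URLs are placeholders for your provider's credentials:

```python
import itertools

# Placeholder endpoints -- substitute your residential provider's details.
PROXY_POOL = [
    "http://user:pass@residential-1.example.com:8080",
    "http://user:pass@residential-2.example.com:8080",
    "http://user:pass@residential-3.example.com:8080",
]
_rotation = itertools.cycle(PROXY_POOL)

def next_proxies():
    """Return a requests-style proxies dict for the next proxy in the pool."""
    proxy = next(_rotation)
    return {"http": proxy, "https": proxy}

# Usage: resp = requests.get(url, proxies=next_proxies(), timeout=30)
```

Many providers also rotate for you behind a single gateway endpoint, in which case the pool above collapses to one entry.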
### 2. Implement Request Delays

Sites track request timing. Perfectly consistent intervals (e.g. exactly one request per second) are a strong bot signal; random delays mimic human behavior:

```python
import random
import time

# Random delay between 3 and 8 seconds
time.sleep(random.uniform(3, 8))
```
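Wrapped in a small helper (the function name is ours, not a library API), the same idea is easy to reuse across a crawl loop:

```python
import random
import time

def human_delay(min_s=3.0, max_s=8.0):
    """Sleep for a random interval and return it, so callers can log it."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

# Usage in a scraping loop:
# for url in urls:
#     resp = session.get(url)
#     human_delay()
```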
### 3. Set Realistic Headers

```python
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://www.google.com/",
    "DNT": "1",
    "Connection": "keep-alive",
}
```
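One common extension (our suggestion, not something the text above requires) is to pick one realistic header profile per session and keep it, since switching User-Agents mid-session is itself a bot signal:

```python
import random

# Illustrative desktop browser profiles; keep these current with real UAs.
HEADER_PROFILES = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    },
    {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        "Accept-Language": "en-GB,en;q=0.8",
    },
]

def pick_headers():
    """Choose one profile at session start; reuse it for the session's lifetime."""
    return dict(random.choice(HEADER_PROFILES))
```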
### 4. Maintain Session Cookies

```python
import time
import requests

session = requests.Session()

# Visit the homepage first to establish cookies
session.get("https://example.com")
time.sleep(2)

# Then access target pages
session.get("https://example.com/data")
```

Sites expect returning visitors to have cookie history. A fresh session hitting deep pages is suspicious.
### 5. Use Referrer Chains

```python
import time
import requests

session = requests.Session()

# Navigate like a human: homepage → search results → detail page
session.get("https://example.com")
time.sleep(2)
session.get(
    "https://example.com/search?q=product",
    headers={"Referer": "https://example.com"},
)
time.sleep(3)
session.get(
    "https://example.com/product/123",
    headers={"Referer": "https://example.com/search?q=product"},
)
```
### 6. Lower Concurrency
| Concurrency | CAPTCHA Rate | Speed |
|---|---|---|
| 1 thread | Lowest | Slow |
| 3 threads | Low | Moderate |
| 10 threads | High | Fast |
| 50 threads | Very high | Fast but blocked |
Start with 1-3 concurrent scrapers per site.
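A bounded worker pool from Python's standard library keeps concurrency at that level; `fetch` here is a stand-in for your real request function:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    """Stand-in for a real request, e.g. session.get(url).text."""
    return f"fetched {url}"

urls = [f"https://example.com/page/{i}" for i in range(10)]

# max_workers=3 caps concurrency per the table above
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(fetch, urls))
```

`pool.map` preserves input order, so `results` lines up with `urls` even though the fetches overlap.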
### 7. Use APIs When Available
Many sites offer public APIs that don't require CAPTCHA solving:
| Site | API Available | Notes |
|---|---|---|
| Amazon | Product Advertising API | Requires approval |
| Google | Custom Search API | 100 free/day |
| Twitter/X | API v2 | Paid tiers |
| Reddit | Reddit API | Free with app registration |
Check if your target has an API before building a scraper.
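As a concrete example, Reddit exposes public JSON listings by appending `.json` to many of its URLs. A sketch, noting that unauthenticated access is heavily rate-limited and a registered app is the production path:

```python
import requests

def listing_url(subreddit):
    """Build the public JSON listing URL for a subreddit."""
    return f"https://www.reddit.com/r/{subreddit}/new.json"

def fetch_listing(subreddit, limit=10):
    resp = requests.get(
        listing_url(subreddit),
        params={"limit": limit},
        headers={"User-Agent": "my-scraper/0.1"},  # Reddit rejects blank UAs
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```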
### 8. Scrape During Off-Peak Hours
Sites are less aggressive with bot detection during low-traffic periods (late night, weekends). Rate limits may be higher and monitoring less strict.
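This can be automated with a simple time-window check. The 1 AM to 6 AM default below is our assumption; tune it to the target site's own traffic pattern, and mind the site's time zone rather than yours:

```python
from datetime import datetime, time

OFF_PEAK_START = time(1, 0)  # 1 AM, assumed low-traffic window
OFF_PEAK_END = time(6, 0)    # 6 AM

def is_off_peak(now=None):
    """True when the (target-site local) time is inside the off-peak window."""
    now = now or datetime.now()
    return OFF_PEAK_START <= now.time() < OFF_PEAK_END
```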
## When Prevention Fails: CaptchaAI
No prevention technique eliminates CAPTCHAs entirely. At scale, you need both prevention and solving:
```python
import time

import requests
from bs4 import BeautifulSoup

API_KEY = "YOUR_API_KEY"

def scrape_with_fallback(url, session):
    resp = session.get(url)

    # If a reCAPTCHA appears, solve it via CaptchaAI
    if "g-recaptcha" in resp.text:
        soup = BeautifulSoup(resp.text, "html.parser")
        site_key = soup.find("div", class_="g-recaptcha")["data-sitekey"]

        # Submit the solving task
        submit = requests.get("https://ocr.captchaai.com/in.php", params={
            "key": API_KEY, "method": "userrecaptcha",
            "googlekey": site_key, "pageurl": url,
        })
        task_id = submit.text.split("|")[1]

        # Poll for the token (up to ~5 minutes)
        for _ in range(60):
            time.sleep(5)
            result = requests.get("https://ocr.captchaai.com/res.php", params={
                "key": API_KEY, "action": "get", "id": task_id,
            })
            if result.text == "CAPCHA_NOT_READY":  # sic -- the API's literal response
                continue
            if result.text.startswith("OK|"):
                token = result.text.split("|")[1]
                # Resubmit with the token (real forms may need extra fields)
                resp = session.post(url, data={"g-recaptcha-response": token})
                break

    return resp.text
```
## Cost Impact of Prevention
Good prevention techniques reduce CaptchaAI usage significantly:
| Approach | CAPTCHAs per 1K pages | Cost |
|---|---|---|
| No prevention | ~200-500 | $0.20-0.50 |
| Basic headers + delays | ~50-100 | $0.05-0.10 |
| Residential proxies + headers | ~10-30 | $0.01-0.03 |
| Full stealth setup | ~5-15 | $0.005-0.015 |
Investing in prevention pays for itself through lower CAPTCHA solving costs.
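The cost column follows from simple arithmetic, assuming roughly $1 per 1,000 solves, which is what the table implies; check CaptchaAI's current pricing before relying on these numbers:

```python
def solving_cost(pages, captchas_per_page, price_per_1k=1.0):
    """Estimated solver spend in dollars: pages * rate * price per solve."""
    return pages * captchas_per_page * price_per_1k / 1000

# No prevention (~350 CAPTCHAs per 1K pages) vs. full stealth (~10 per 1K),
# projected over a 100K-page crawl:
baseline = solving_cost(100_000, 0.35)  # 35.0 dollars
stealth = solving_cost(100_000, 0.01)   # 1.0 dollar
```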
## FAQ
### What's the single most effective technique?
Residential proxy rotation. It addresses the most common trigger (IP reputation) and works across all sites.
### Do I still need CaptchaAI if I use all these techniques?
Yes, for production reliability. Prevention reduces CAPTCHAs but doesn't eliminate them. CaptchaAI ensures your scraper never gets stuck on an unsolved CAPTCHA.
### How do I know which technique helps most for my target site?
Monitor your CAPTCHA rate. Add techniques one at a time and measure the reduction. Start with proxies and headers as they have the highest impact.