Retail Site Data Collection with CAPTCHA Handling

Amazon uses image CAPTCHAs to block automated access. When your traffic crosses its anti-bot threshold, Amazon serves a page asking you to type the characters shown in a distorted image. CaptchaAI's OCR solving can handle these automatically.

How Amazon's CAPTCHA Works

Amazon triggers CAPTCHAs based on:

Signal	Description
Request volume	Too many requests from one IP in a short window
Missing cookies	No Amazon session cookies
Suspicious headers	Bot-like User-Agent or missing headers
IP reputation	Known datacenter or proxy IP ranges

When triggered, Amazon redirects to a page with a distorted text image and an input field. You must solve the image and submit the text to continue.

Requirements

Requirement	Details
CaptchaAI API key	From captchaai.com
Python 3.7+	With requests and beautifulsoup4
Residential proxies	Recommended for sustained scraping

Solving Amazon's Image CAPTCHA

Step 1: Detect the CAPTCHA Page

import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
})

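def is_captcha_page(html):
    """Return True if the HTML looks like Amazon's CAPTCHA page."""
    # The prompt text is the strongest signal. The bare "captcha" check is a
    # broad fallback and can misfire on normal pages that mention the word.
    return ("Type the characters you see in this image" in html
            or "captcha" in html.lower())

url = "https://www.amazon.com/dp/B0EXAMPLE"
resp = session.get(url, timeout=30)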

if is_captcha_page(resp.text):
    print("CAPTCHA detected!")
else:
    print("Page loaded successfully")
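
The substring check above is deliberately loose. A stricter alternative is to look at the page structure described earlier: a form, an image whose src mentions "captcha", and a text input. The following is a minimal sketch using only the standard library; the class and function names are made up for illustration, and the markup it expects is an assumption based on the description in this article, not a guaranteed layout.

```python
from html.parser import HTMLParser


class CaptchaMarkupScanner(HTMLParser):
    """Records which CAPTCHA-related elements appear in a page."""

    def __init__(self):
        super().__init__()
        self.has_form = False
        self.has_captcha_img = False
        self.has_text_input = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.has_form = True
        elif tag == "img" and "captcha" in (attrs.get("src") or "").lower():
            self.has_captcha_img = True
        elif tag == "input" and (attrs.get("type") or "text").lower() == "text":
            # An <input> without a type attribute defaults to a text field
            self.has_text_input = True


def looks_like_captcha_page(html):
    """True only when a form, a CAPTCHA image, and a text input are all present."""
    scanner = CaptchaMarkupScanner()
    scanner.feed(html)
    return scanner.has_form and scanner.has_captcha_img and scanner.has_text_input
```

Because script contents are not parsed as tags, a product page that merely mentions "captcha" inside JavaScript is not flagged by this check.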

Step 2: Extract and Solve the Image

import base64
import time
from urllib.parse import urljoin

API_KEY = "YOUR_API_KEY"

def solve_amazon_captcha(session, captcha_page_html, captcha_page_url):
    soup = BeautifulSoup(captcha_page_html, "html.parser")

    # Find the CAPTCHA image
    img_tag = soup.find("img", src=lambda s: s and "captcha" in s.lower())
    if not img_tag:
        raise RuntimeError("CAPTCHA image not found")

    # The src may be relative, so resolve it against the page URL
    img_url = urljoin(captcha_page_url, img_tag["src"])

    # Download the image with the same session (cookies and headers)
    img_resp = session.get(img_url, timeout=30)
    img_resp.raise_for_status()
    img_base64 = base64.b64encode(img_resp.content).decode()

    # Submit to CaptchaAI as form data so the base64 payload stays out of the URL
    submit_resp = requests.post("https://ocr.captchaai.com/in.php", data={
        "key": API_KEY,
        "method": "base64",
        "body": img_base64
    }, timeout=30)
    # A successful submit looks like "OK|<task id>"; anything else is an error
    if not submit_resp.text.startswith("OK|"):
        raise RuntimeError(f"Submit error: {submit_resp.text}")
    task_id = submit_resp.text.split("|", 1)[1]

    # Poll for the result: up to 30 attempts, 5 seconds apart
    for _ in range(30):
        time.sleep(5)
        result = requests.get("https://ocr.captchaai.com/res.php", params={
            "key": API_KEY, "action": "get", "id": task_id
        }, timeout=30)
        if result.text == "CAPCHA_NOT_READY":
            continue
        if result.text.startswith("OK|"):
            return result.text.split("|", 1)[1]
        raise RuntimeError(f"Solve error: {result.text}")

    raise TimeoutError("Solve timed out")
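
The submit and poll steps both depend on the same pipe-delimited reply format, so it can help to parse it in one place. The sketch below is one way to do that, assuming only the three reply shapes that appear in the code above ("OK|...", "CAPCHA_NOT_READY", and anything else treated as an error). The names SolverReply and parse_solver_reply are hypothetical, not part of any CaptchaAI client.

```python
from typing import NamedTuple, Optional


class SolverReply(NamedTuple):
    status: str           # "ok", "pending", or "error"
    value: Optional[str]  # payload after "OK|", or the raw error text


def parse_solver_reply(text):
    """Classify a raw reply body from in.php or res.php."""
    text = text.strip()
    if text == "CAPCHA_NOT_READY":
        return SolverReply("pending", None)
    if text.startswith("OK|"):
        # Split once so a payload containing "|" is kept whole
        return SolverReply("ok", text.split("|", 1)[1])
    return SolverReply("error", text)
```

With this helper, a submit reply that is not "ok" can be raised as an error immediately instead of failing later with an IndexError on split.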

Step 3: Submit the Solution

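from urllib.parse import urljoin

def submit_captcha_solution(session, captcha_page_html, solution, captcha_page_url):
    soup = BeautifulSoup(captcha_page_html, "html.parser")
    form = soup.find("form")
    if form is None:
        raise RuntimeError("CAPTCHA form not found")

    # Build form data from the existing inputs (keeps hidden fields)
    form_data = {}
    for inp in form.find_all("input"):
        name = inp.get("name")
        if name:
            form_data[name] = inp.get("value", "")

    # Set the CAPTCHA answer
    form_data["field-keywords"] = solution

    # Resolve the action against the page URL (handles relative and absolute values)
    action = urljoin(captcha_page_url, form.get("action") or "")

    # Honor the form's declared method; HTML forms default to GET
    if form.get("method", "get").lower() == "post":
        resp = session.post(action, data=form_data, timeout=30)
    else:
        resp = session.get(action, params=form_data, timeout=30)
    return resp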
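
A form's action attribute can be root-relative, path-relative, absolute, or empty, and urljoin from the standard library resolves all of these against the page URL. The short sketch below illustrates the cases; resolve_action is a hypothetical helper and the example.com URLs are placeholders, not real Amazon paths.

```python
from urllib.parse import urljoin


def resolve_action(page_url, action):
    """Return the absolute URL a form would submit to."""
    # An empty or missing action means "submit to the current page"
    return urljoin(page_url, action or "")


page = "https://www.example.com/errors/page?x=1"

print(resolve_action(page, "/errors/check"))   # https://www.example.com/errors/check
print(resolve_action(page, "check"))           # https://www.example.com/errors/check
print(resolve_action(page, "https://other.example.com/submit"))  # unchanged
print(resolve_action(page, ""))                # same as page
```

Calling urljoin unconditionally avoids special-casing actions that start with "/".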

Full Working Example

import base64
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

API_KEY = "YOUR_API_KEY"

def solve_image(img_bytes):
    """Send image bytes to CaptchaAI and return the solved text."""
    submit = requests.post("https://ocr.captchaai.com/in.php", data={
        "key": API_KEY, "method": "base64",
        "body": base64.b64encode(img_bytes).decode()
    }, timeout=30)
    if not submit.text.startswith("OK|"):
        raise RuntimeError(f"Submit error: {submit.text}")
    task_id = submit.text.split("|", 1)[1]

    # Poll up to 30 times, 5 seconds apart
    for _ in range(30):
        time.sleep(5)
        result = requests.get("https://ocr.captchaai.com/res.php", params={
            "key": API_KEY, "action": "get", "id": task_id
        }, timeout=30)
        if result.text == "CAPCHA_NOT_READY":
            continue
        if result.text.startswith("OK|"):
            return result.text.split("|", 1)[1]
        raise RuntimeError(f"Solve error: {result.text}")
    raise TimeoutError("Solve timed out")

def scrape_amazon_product(url):
    session = requests.Session()
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept-Language": "en-US,en;q=0.9"
    })

    resp = session.get(url, timeout=30)

    # Handle CAPTCHA if present
    if "captcha" in resp.text.lower():
        soup = BeautifulSoup(resp.text, "html.parser")
        img = soup.find("img", src=lambda s: s and "captcha" in s.lower())
        form = soup.find("form")

        if img and form:
            # Download and solve (resolve a relative src against the page URL)
            img_data = session.get(urljoin(resp.url, img["src"]), timeout=30).content
            solution = solve_image(img_data)

            # Submit the solution along with the form's existing inputs
            form_data = {inp.get("name"): inp.get("value", "")
                         for inp in form.find_all("input") if inp.get("name")}
            form_data["field-keywords"] = solution

            action = urljoin(resp.url, form.get("action") or "")
            # Honor the form's declared method; HTML forms default to GET
            if form.get("method", "get").lower() == "post":
                resp = session.post(action, data=form_data, timeout=30)
            else:
                resp = session.get(action, params=form_data, timeout=30)

    # Parse product data
    soup = BeautifulSoup(resp.text, "html.parser")
    title = soup.find("span", {"id": "productTitle"})
    price = soup.find("span", class_="a-price-whole")

    return {
        "title": title.text.strip() if title else None,
        "price": price.text.strip() if price else None
    }

product = scrape_amazon_product("https://www.amazon.com/dp/B0EXAMPLE")
print(product)

Best Practices for Amazon Scraping

Use residential proxies — Amazon blocks datacenter IPs aggressively
Rotate User-Agents — Use a pool of realistic browser strings
Maintain sessions — Keep cookies across requests
Add delays — 3-10 seconds between requests
Set Accept-Language — Always include locale headers
Don't scrape logged-in pages — Product pages are accessible without login
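
The delay and header advice above can be sketched as a small pacing helper. This is an illustration only: the function names are hypothetical, the 3 to 10 second range comes from the list above, and the extra User-Agent strings are placeholders that should be replaced with current, realistic browser strings.

```python
import random
import time

# Placeholder pool (assumption); only the first string appears in this article
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]


def next_delay(low=3.0, high=10.0, rng=random):
    """Pick a random pause in seconds between low and high."""
    return rng.uniform(low, high)


def next_headers(rng=random):
    """Pick a User-Agent from the pool and always include a locale header."""
    return {
        "User-Agent": rng.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }


def paced(urls, sleep=time.sleep, rng=random):
    """Yield (url, headers) pairs, pausing between consecutive URLs."""
    for i, url in enumerate(urls):
        if i:  # no pause before the first URL
            sleep(next_delay(rng=rng))
        yield url, next_headers(rng)
```

The sleep and rng parameters are injectable so the pacing can be exercised in tests without real waiting. To keep cookies and headers consistent, one option is to pick headers once per session rather than per request.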

Troubleshooting

Issue	Fix
CAPTCHA on every request	Use residential proxies; slow down request rate
CAPTCHA solution rejected	Verify image was downloaded correctly; retry
Redirect loops	Check cookie handling (reuse one `requests.Session`); `allow_redirects=True` is already the requests default
Empty product data	Amazon may serve different layouts; check selectors
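
For the "CAPTCHA solution rejected" row, a bounded retry loop is one possible approach. The sketch below assumes that a rejected answer returns a fresh challenge page, which is an assumption rather than something confirmed here. The four callables are hypothetical stand-ins for the detect, solve, and submit steps, passed in so the loop itself stays independent of any HTTP library.

```python
def solve_with_retries(fetch_challenge, solve, submit, is_captcha_page, max_attempts=3):
    """Retry the solve/submit cycle until a non-CAPTCHA page comes back.

    fetch_challenge() -> HTML of the current CAPTCHA page
    solve(html) -> solution text
    submit(html, solution) -> HTML returned after submitting
    is_captcha_page(html) -> bool
    """
    html = fetch_challenge()
    for _ in range(max_attempts):
        solution = solve(html)
        html = submit(html, solution)
        if not is_captcha_page(html):
            return html
        # Still a CAPTCHA: treat the response as the next challenge to solve
    raise RuntimeError(f"CAPTCHA still present after {max_attempts} attempts")
```

Capping the attempts matters because each solve call is a billed request and a tight loop would add to the request volume that triggered the CAPTCHA in the first place.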

FAQ

Does Amazon use reCAPTCHA?

Amazon primarily uses its own image-based CAPTCHA (distorted text) rather than reCAPTCHA. CaptchaAI solves these through its image/OCR endpoint when the image is submitted with method=base64.

How many requests before Amazon shows a CAPTCHA?

It varies. With good proxies and realistic headers, you may scrape hundreds of pages before seeing one. Without proxies, CAPTCHAs can appear after as few as 10-20 requests.

Is scraping Amazon legal?

Scraping publicly available product data is generally legal, but check Amazon's terms of service and applicable laws in your jurisdiction.

