Python Beautiful Soup + CaptchaAI: Handling CAPTCHA-Protected Pages

Beautiful Soup parses HTML. CaptchaAI solves CAPTCHAs. Together with requests, they form the fastest scraping stack for CAPTCHA-protected pages — no browser required.

This approach works when the site serves HTML directly (server-side rendered). For JavaScript-heavy SPAs, use Selenium or Playwright instead.
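Before committing to this stack, it's worth checking that the data you need actually ships in the raw HTML. A minimal sketch of that check (the helper and the `.product-card` selector are illustrative, not from any library):

```python
from bs4 import BeautifulSoup

def is_server_rendered(html, selector):
    """Return True if the element you need is already in the raw HTML.
    If not, the content is likely injected by JavaScript and you need
    Selenium/Playwright instead of requests alone."""
    # "html.parser" is the stdlib parser; swap in "lxml" if installed
    soup = BeautifulSoup(html, "html.parser")
    return soup.select_one(selector) is not None

# A server-rendered page ships its data in the HTML body:
static_html = '<html><body><div class="product-card">Widget</div></body></html>'
# An SPA shell ships an empty mount point and a script bundle:
spa_html = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'

print(is_server_rendered(static_html, ".product-card"))  # True
print(is_server_rendered(spa_html, ".product-card"))     # False
```

In practice you would fetch the page once with `requests.get()` (no JavaScript runs) and probe the response text the same way.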


Prerequisites

pip install beautifulsoup4 requests lxml

The workflow

  1. Fetch the page HTML with requests
  2. Parse with Beautiful Soup to extract CAPTCHA parameters
  3. Send parameters to CaptchaAI to solve
  4. Submit the form with the CAPTCHA token via requests
  5. Parse the result page with Beautiful Soup

Extracting reCAPTCHA sitekeys with Beautiful Soup

import requests
from bs4 import BeautifulSoup

def extract_recaptcha_sitekey(url):
    """Extract reCAPTCHA v2 sitekey from page HTML."""
    resp = requests.get(url, timeout=30)
    soup = BeautifulSoup(resp.text, "lxml")

    # Method 1: data-sitekey attribute on div
    recaptcha_div = soup.find("div", class_="g-recaptcha")
    if recaptcha_div and recaptcha_div.get("data-sitekey"):
        return recaptcha_div["data-sitekey"]

    # Method 2: data-sitekey on any element
    element = soup.find(attrs={"data-sitekey": True})
    if element:
        return element["data-sitekey"]

    # Method 3: from script src
    import re
    for script in soup.find_all("script", src=True):
        match = re.search(r"render=([A-Za-z0-9_-]{40})", script["src"])
        if match:
            return match.group(1)

    return None


sitekey = extract_recaptcha_sitekey("https://example.com/login")
print(f"Sitekey: {sitekey}")

Extracting Turnstile sitekeys

def extract_turnstile_sitekey(url):
    """Extract Cloudflare Turnstile sitekey from page HTML."""
    resp = requests.get(url, timeout=30)
    soup = BeautifulSoup(resp.text, "lxml")

    # Method 1: Turnstile div
    turnstile_div = soup.find("div", class_="cf-turnstile")
    if turnstile_div and turnstile_div.get("data-sitekey"):
        return turnstile_div["data-sitekey"]

    # Method 2: Any element with Turnstile sitekey pattern
    element = soup.find(attrs={"data-sitekey": True})
    if element:
        sitekey = element["data-sitekey"]
        if sitekey.startswith("0x"):
            return sitekey

    # Method 3: In inline script
    import re
    for script in soup.find_all("script"):
        if script.string:
            match = re.search(r"sitekey\s*:\s*['\"](0x[A-Za-z0-9_-]+)['\"]", script.string)
            if match:
                return match.group(1)

    return None
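Here is the extraction exercised against representative Turnstile markup (the sitekey value below is made up). Solving is sketched in comments: with the 2captcha-compatible API that CaptchaAI exposes, Turnstile tasks typically use `method=turnstile` with `sitekey` and `pageurl` parameters — verify the exact names against the CaptchaAI docs before relying on them.

```python
from bs4 import BeautifulSoup

# Representative markup for a Cloudflare Turnstile widget (sitekey is made up)
html = '''
<form action="/login" method="post">
  <div class="cf-turnstile" data-sitekey="0x4AAAAAAADemoSiteKey123"></div>
</form>
'''

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="cf-turnstile")
sitekey = div["data-sitekey"]
print(sitekey)  # 0x4AAAAAAADemoSiteKey123

# To solve, pass the key to CaptchaAI — parameter names shown here follow the
# 2captcha-compatible convention and are an assumption, not confirmed API:
# token = solve_captcha("turnstile", sitekey=sitekey, pageurl="https://example.com/login")
# form_data["cf-turnstile-response"] = token
```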

Extracting form fields

Always extract hidden form fields — they often contain CSRF tokens and other parameters the server expects:

def extract_form_data(soup, form_selector="form"):
    """Extract all form field names and values."""
    form = soup.select_one(form_selector)
    if not form:
        return {}

    data = {}
    # Hidden inputs (CSRF tokens, etc.)
    for inp in form.find_all("input", type="hidden"):
        name = inp.get("name")
        value = inp.get("value", "")
        if name:
            data[name] = value

    # Text inputs with default values
    for inp in form.find_all("input", type=["text", "email", "password"]):
        name = inp.get("name")
        value = inp.get("value", "")
        if name:
            data[name] = value

    return data
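To see what this returns, run it against a small inline form (field names here are illustrative; the function is reproduced so the snippet runs standalone):

```python
from bs4 import BeautifulSoup

# Compact version of extract_form_data, reproduced for a runnable snippet
def extract_form_data(soup, form_selector="form"):
    form = soup.select_one(form_selector)
    if not form:
        return {}
    data = {}
    # Hidden inputs carry CSRF tokens; text/email/password carry defaults
    for inp in form.find_all("input", type=["hidden", "text", "email", "password"]):
        name = inp.get("name")
        if name:
            data[name] = inp.get("value", "")
    return data

# A login form with a CSRF token (names are made up)
html = '''
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="email" name="email" value="">
  <input type="password" name="password">
  <input type="submit" value="Log in">
</form>
'''
soup = BeautifulSoup(html, "html.parser")
print(extract_form_data(soup))
# {'csrf_token': 'abc123', 'email': '', 'password': ''}
```

Note the submit button is skipped (no matching `type`), while the CSRF token is captured — omitting it is the most common reason a form POST bounces back to the login page.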

Complete reCAPTCHA scraping flow

import time
import requests
from bs4 import BeautifulSoup

API_KEY = "YOUR_API_KEY"


def solve_captcha(method, **params):
    """Solve CAPTCHA via CaptchaAI."""
    submit = requests.post("https://ocr.captchaai.com/in.php", data={
        "key": API_KEY, "method": method, "json": 1, **params,
    }, timeout=30).json()

    if submit.get("status") != 1:
        raise Exception(f"Submit error: {submit.get('request')}")

    task_id = submit["request"]
    for _ in range(30):
        time.sleep(5)
        result = requests.get("https://ocr.captchaai.com/res.php", params={
            "key": API_KEY, "action": "get", "id": task_id, "json": 1,
        }, timeout=30).json()
        if result.get("status") == 1:
            return result["request"]
    raise TimeoutError("Solve timed out")


def scrape_protected_page(url, credentials=None):
    """Scrape a reCAPTCHA-protected page — no browser needed."""
    session = requests.Session()
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    })

    # Step 1: Fetch the login page
    resp = session.get(url, timeout=30)
    soup = BeautifulSoup(resp.text, "lxml")

    # Step 2: Extract sitekey
    sitekey = None
    recaptcha_div = soup.find(attrs={"data-sitekey": True})
    if recaptcha_div:
        sitekey = recaptcha_div["data-sitekey"]
    if not sitekey:
        raise ValueError("No CAPTCHA sitekey found")
    print(f"Sitekey: {sitekey}")

    # Step 3: Extract form fields (CSRF tokens, etc.)
    form_data = extract_form_data(soup)
    print(f"Form fields: {list(form_data.keys())}")

    # Step 4: Add credentials
    if credentials:
        form_data.update(credentials)

    # Step 5: Solve CAPTCHA
    token = solve_captcha("userrecaptcha", googlekey=sitekey, pageurl=url)
    form_data["g-recaptcha-response"] = token

    # Step 6: Submit the form
    form = soup.find("form")
    action_url = form.get("action", url) if form else url
    if not action_url.startswith("http"):
        from urllib.parse import urljoin
        action_url = urljoin(url, action_url)

    method = (form.get("method", "POST") if form else "POST").upper()

    if method == "POST":
        result = session.post(action_url, data=form_data, timeout=30)
    else:
        result = session.get(action_url, params=form_data, timeout=30)

    # Step 7: Parse the result
    result_soup = BeautifulSoup(result.text, "lxml")
    return result_soup, session


# Usage
result_soup, session = scrape_protected_page(
    "https://example.com/login",
    credentials={"username": "user@example.com", "password": "pass123"},
)

# Now use the authenticated session to scrape protected content
dashboard = session.get("https://example.com/dashboard", timeout=30)
dashboard_soup = BeautifulSoup(dashboard.text, "lxml")
print(dashboard_soup.title.string)

Scraping search results behind CAPTCHA

def scrape_search_results(search_url, query):
    """Scrape search results from a CAPTCHA-protected search engine."""
    session = requests.Session()
    session.headers["User-Agent"] = (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36"
    )

    # Fetch search page
    resp = session.get(search_url, params={"q": query}, timeout=30)

    # Check if CAPTCHA is present
    soup = BeautifulSoup(resp.text, "lxml")
    sitekey_el = soup.find(attrs={"data-sitekey": True})

    if sitekey_el:
        # Solve CAPTCHA
        sitekey = sitekey_el["data-sitekey"]
        token = solve_captcha("userrecaptcha", googlekey=sitekey, pageurl=resp.url)

        # Resubmit with token
        form_data = extract_form_data(soup)
        form_data["g-recaptcha-response"] = token
        form_data["q"] = query
        resp = session.post(resp.url, data=form_data, timeout=30)
        soup = BeautifulSoup(resp.text, "lxml")

    # Extract results
    results = []
    for item in soup.select(".result, .search-result, .g"):
        title_el = item.select_one("h3, .title")
        link_el = item.select_one("a")
        snippet_el = item.select_one(".snippet, .description, .st")

        if title_el and link_el:
            results.append({
                "title": title_el.get_text(strip=True),
                "url": link_el.get("href", ""),
                "snippet": snippet_el.get_text(strip=True) if snippet_el else "",
            })

    return results

Image CAPTCHA extraction with Beautiful Soup

import base64
from urllib.parse import urljoin

def solve_image_captcha_bs4(url, captcha_img_selector="img.captcha"):
    """Extract, solve, and submit an image CAPTCHA."""
    session = requests.Session()
    resp = session.get(url, timeout=30)
    soup = BeautifulSoup(resp.text, "lxml")

    # Find CAPTCHA image
    img = soup.select_one(captcha_img_selector)
    if not img:
        raise ValueError("CAPTCHA image not found")

    # Download the image
    img_url = img.get("src", "")
    if img_url.startswith("data:image"):
        # Base64 inline image
        img_base64 = img_url.split(",", 1)[1]
    else:
        # URL — download it
        img_url = urljoin(url, img_url)
        img_resp = session.get(img_url, timeout=30)
        img_base64 = base64.b64encode(img_resp.content).decode()

    # Solve
    answer = solve_captcha("base64", body=img_base64)
    print(f"CAPTCHA answer: {answer}")

    # Submit form
    form_data = extract_form_data(soup)
    # Find the captcha input field name
    captcha_input = soup.select_one("input[name*='captcha'], input[name*='code']")
    if captcha_input:
        form_data[captcha_input["name"]] = answer

    form = soup.find("form")
    action = urljoin(url, form.get("action", "")) if form else url
    result = session.post(action, data=form_data, timeout=30)

    return BeautifulSoup(result.text, "lxml"), session
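The `data:` URL branch above can be checked in isolation — inline CAPTCHA images embed the bytes directly in the `src` attribute. A small sketch (the helper name and the payload bytes are made up for illustration):

```python
import base64

def image_payload_from_src(src):
    """Return the base64 payload for a CAPTCHA <img> src.
    Handles inline data: URLs; for ordinary URLs you would download
    the bytes and base64-encode them, as in the function above."""
    if src.startswith("data:image"):
        # "data:image/png;base64,<payload>" — everything after the first comma
        return src.split(",", 1)[1]
    raise ValueError("not an inline image — download and encode it instead")

# Build an example data URL (the bytes are just a stand-in string)
payload = base64.b64encode(b"fake-image-bytes").decode()
src = f"data:image/png;base64,{payload}"

extracted = image_payload_from_src(src)
print(base64.b64decode(extracted))  # b'fake-image-bytes'
```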

Production scraper class

import time
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin


class ProtectedScraper:
    """Scrape CAPTCHA-protected pages without a browser."""

    def __init__(self, api_key):
        self.api_key = api_key
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                          "AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36",
            "Accept-Language": "en-US,en;q=0.9",
        })

    def get(self, url):
        """Fetch and parse a page, solving CAPTCHAs automatically."""
        resp = self.session.get(url, timeout=30)
        soup = BeautifulSoup(resp.text, "lxml")

        # Check for CAPTCHA
        sitekey_el = soup.find(attrs={"data-sitekey": True})
        if sitekey_el:
            soup = self._handle_captcha(soup, resp.url, sitekey_el)

        return soup

    def login(self, url, credentials):
        """Log in through a CAPTCHA-protected form."""
        resp = self.session.get(url, timeout=30)
        soup = BeautifulSoup(resp.text, "lxml")

        form_data = self._extract_form(soup)
        form_data.update(credentials)

        sitekey_el = soup.find(attrs={"data-sitekey": True})
        if sitekey_el:
            token = self._solve(sitekey_el["data-sitekey"], url)
            form_data["g-recaptcha-response"] = token

        form = soup.find("form")
        action = urljoin(url, form.get("action", "")) if form else url

        result = self.session.post(action, data=form_data, timeout=30)
        return BeautifulSoup(result.text, "lxml")

    def _handle_captcha(self, soup, url, sitekey_el):
        token = self._solve(sitekey_el["data-sitekey"], url)
        form_data = self._extract_form(soup)
        form_data["g-recaptcha-response"] = token

        form = soup.find("form")
        action = urljoin(url, form.get("action", "")) if form else url
        resp = self.session.post(action, data=form_data, timeout=30)
        return BeautifulSoup(resp.text, "lxml")

    def _extract_form(self, soup):
        data = {}
        for inp in soup.select("form input[type='hidden']"):
            if inp.get("name"):
                data[inp["name"]] = inp.get("value", "")
        return data

    def _solve(self, sitekey, url):
        submit = requests.post("https://ocr.captchaai.com/in.php", data={
            "key": self.api_key, "method": "userrecaptcha",
            "googlekey": sitekey, "pageurl": url, "json": 1,
        }, timeout=30).json()

        if submit.get("status") != 1:
            raise Exception(f"Error: {submit.get('request')}")

        task_id = submit["request"]
        for _ in range(30):
            time.sleep(5)
            result = requests.get("https://ocr.captchaai.com/res.php", params={
                "key": self.api_key, "action": "get", "id": task_id, "json": 1,
            }, timeout=30).json()
            if result.get("status") == 1:
                return result["request"]
        raise TimeoutError("Solve timed out")


# Usage
scraper = ProtectedScraper("YOUR_API_KEY")

# Login and scrape
scraper.login("https://example.com/login", {
    "email": "user@example.com",
    "password": "pass123",
})

# Now scrape authenticated pages
soup = scraper.get("https://example.com/dashboard")
for row in soup.select("table tr"):
    cells = [td.get_text(strip=True) for td in row.select("td")]
    print(cells)

When to use Beautiful Soup vs browser automation

| Scenario | Use BS4 + requests | Use Selenium/Playwright |
| --- | --- | --- |
| Server-rendered HTML | Yes | Overkill |
| JavaScript-rendered content | No | Yes |
| Complex multi-step form | Maybe | Preferred |
| High-volume scraping | Yes (faster) | Slower |
| Sites with JS fingerprinting | No | Yes |
| Simple login + scrape | Yes | Not needed |

Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| Sitekey extraction returns None | CAPTCHA loaded via JavaScript | Switch to Selenium/Playwright |
| Form submission returns login page | Missing CSRF token | Extract all hidden inputs with extract_form_data() |
| 403 after form POST | Bot detection on headers | Add realistic User-Agent and Referer headers |
| Token rejected | Wrong pageurl parameter | Use the exact URL shown in the browser |
| Cookies lost between requests | Not using requests.Session() | Always use a session object |
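The last row is worth demonstrating: a `requests.Session()` carries cookies into every subsequent request automatically, which is what keeps a solved-CAPTCHA login alive. A sketch using a prepared (unsent) request, with a made-up cookie name:

```python
import requests

session = requests.Session()
# Simulate a login response that set a session cookie
session.cookies.set("sessionid", "abc123", domain="example.com")

# Prepare (without sending) a follow-up request to inspect what would go out
req = requests.Request("GET", "https://example.com/dashboard")
prepped = session.prepare_request(req)
print(prepped.headers.get("Cookie"))  # sessionid=abc123

# A bare requests.get() starts from an empty cookie jar instead,
# which is why logins "vanish" when you don't reuse the session.
```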

Frequently asked questions

Can Beautiful Soup solve CAPTCHAs?

No — Beautiful Soup is an HTML parser. It extracts CAPTCHA parameters (sitekeys, image URLs). CaptchaAI does the actual solving. requests handles the HTTP communication.

When should I use a browser instead?

When the page requires JavaScript to render content, when the CAPTCHA is loaded dynamically, or when the site uses JavaScript-based fingerprinting.

Is this faster than Selenium?

Yes. requests + Beautiful Soup skips browser startup, JavaScript execution, and rendering, making it 5-10x faster per page.


Summary

Python Beautiful Soup + CaptchaAI provides the fastest scraping stack for CAPTCHA-protected pages that serve HTML directly. Parse sitekeys with BS4, solve with the API, and submit via requests.Session().
