Market Research Data Collection with CAPTCHA Handling

Market research requires data from competitor websites, review platforms, job boards, and industry directories, many of which are CAPTCHA-protected. CaptchaAI automates the solving so your data pipeline keeps running.

Common Data Sources and Their CAPTCHAs

| Source Type | Examples | CAPTCHA Type |
|---|---|---|
| Review sites | G2, Trustpilot, Yelp | reCAPTCHA v2/v3 |
| Job boards | LinkedIn, Indeed | Cloudflare Challenge |
| Business directories | YellowPages, Crunchbase | Turnstile, reCAPTCHA |
| Social media | Twitter/X, Reddit | Various |
| Patent databases | USPTO, Google Patents | reCAPTCHA v2 |
| Government data | SEC filings, census | Image CAPTCHA |

Data Collection Pipeline

import requests
import time
import re
import csv
import os

API_KEY = os.environ["CAPTCHAAI_API_KEY"]


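def solve_captcha(params):
    """Submit a task to CaptchaAI and poll until the solution is ready."""
    # Copy the params so the caller's dict is not mutated with the API key.
    payload = {**params, "key": API_KEY}
    resp = requests.get(
        "https://ocr.captchaai.com/in.php", params=payload, timeout=30
    )
    if not resp.text.startswith("OK|"):
        raise RuntimeError(f"Submit: {resp.text}")

    task_id = resp.text.split("|", 1)[1].strip()
    # Poll every 5 seconds, up to 60 attempts (about 5 minutes).
    for _ in range(60):
        time.sleep(5)
        result = requests.get("https://ocr.captchaai.com/res.php", params={
            "key": API_KEY, "action": "get", "id": task_id,
        }, timeout=30)
        if result.text == "CAPCHA_NOT_READY":
            continue  # Not solved yet; keep polling.
        if result.text.startswith("OK|"):
            return result.text.split("|", 1)[1]
        raise RuntimeError(f"Solve: {result.text}")
    raise TimeoutError(f"Task {task_id} was not solved after 60 polls")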


class MarketResearchCollector:
    def __init__(self):
        self.session = requests.Session()
        self.session.headers["User-Agent"] = (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 Chrome/120.0.0.0"
        )

    def fetch(self, url):
        """Fetch a page, solving a reCAPTCHA or Turnstile widget if present."""
        resp = self.session.get(url, timeout=30)

        match = re.search(
            r'data-sitekey=["\']([A-Za-z0-9_-]+)["\']', resp.text
        )
        if not match:
            return resp.text

        # Both widgets expose data-sitekey, so check for the Turnstile
        # marker first; otherwise a Turnstile page is solved as reCAPTCHA.
        if "cf-turnstile" in resp.text:
            field = "cf-turnstile-response"
            token = solve_captcha({
                "method": "turnstile",
                "sitekey": match.group(1),
                "pageurl": url,
            })
        else:
            field = "g-recaptcha-response"
            token = solve_captcha({
                "method": "userrecaptcha",
                "googlekey": match.group(1),
                "pageurl": url,
            })

        # Posting the token back to the same URL is a simplification;
        # real sites usually expect it at their own form endpoint.
        resp = self.session.post(url, data={field: token}, timeout=30)
        return resp.text

    def collect_reviews(self, urls):
        """Collect review data from multiple pages."""
        reviews = []
        for url in urls:
            try:
                html = self.fetch(url)
                page_reviews = self._parse_reviews(html)
                reviews.extend(page_reviews)
                print(f"  Collected {len(page_reviews)} reviews from {url}")
                time.sleep(2)  # Polite delay
            except Exception as e:
                print(f"  Error on {url}: {e}")
        return reviews

    def collect_company_profiles(self, urls):
        """Collect company profile data."""
        profiles = []
        for url in urls:
            try:
                html = self.fetch(url)
                profile = self._parse_profile(html)
                if profile:
                    profiles.append(profile)
                    print(f"  Collected: {profile.get('name', 'Unknown')}")
                time.sleep(2)
            except Exception as e:
                print(f"  Error on {url}: {e}")
        return profiles

    def _parse_reviews(self, html):
        """Extract review data from HTML."""
        reviews = []
        # Generic review extraction patterns
        for match in re.finditer(
            r'class="review-text"[^>]*>(.*?)</div>', html, re.DOTALL
        ):
            reviews.append({
                "text": match.group(1).strip()[:500],
            })
        return reviews

    def _parse_profile(self, html):
        """Extract company profile from HTML, or None if nothing matched."""
        name = re.search(r'<h1[^>]*>(.*?)</h1>', html, re.DOTALL)
        desc = re.search(
            r'class="description"[^>]*>(.*?)</div>', html, re.DOTALL
        )
        # Return None so the caller's `if profile:` check can skip empty pages.
        if not name and not desc:
            return None
        return {
            "name": name.group(1).strip() if name else None,
            "description": desc.group(1).strip()[:300] if desc else None,
        }

    def export_csv(self, data, filename):
        """Export collected data to CSV."""
        if not data:
            return
        keys = data[0].keys()
        with open(filename, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=keys)
            writer.writeheader()
            writer.writerows(data)
        print(f"Exported {len(data)} records to {filename}")

Usage

collector = MarketResearchCollector()

# Collect competitor reviews
review_urls = [
    "https://example-reviews.com/product/competitor-a",
    "https://example-reviews.com/product/competitor-b",
    "https://example-reviews.com/product/competitor-c",
]
reviews = collector.collect_reviews(review_urls)
collector.export_csv(reviews, "competitor_reviews.csv")

# Collect company profiles
profile_urls = [
    "https://example-directory.com/company/alpha-corp",
    "https://example-directory.com/company/beta-inc",
]
profiles = collector.collect_company_profiles(profile_urls)
collector.export_csv(profiles, "company_profiles.csv")

Use Cases

Competitive Pricing Intelligence

Monitor competitor pricing across e-commerce platforms. Track price changes, promotions, and stock levels.
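
One way to track price changes is to compare each day's scraped snapshot with the previous one. The following is a minimal sketch, assuming each snapshot is a dict mapping a product identifier to a price; the function name `diff_prices` and the sample SKUs are hypothetical and not part of the collector above.

```python
def diff_prices(previous, current):
    """Return price changes between two {sku: price} snapshots."""
    changes = []
    for sku, new_price in current.items():
        old_price = previous.get(sku)
        # Skip new products, zero/missing old prices, and unchanged prices.
        if not old_price or old_price == new_price:
            continue
        pct = (new_price - old_price) / old_price * 100
        changes.append({
            "sku": sku,
            "old": old_price,
            "new": new_price,
            "pct_change": round(pct, 1),
        })
    return changes


# Hypothetical snapshots from two consecutive daily runs.
yesterday = {"widget-a": 20.00, "widget-b": 35.00}
today = {"widget-a": 18.00, "widget-b": 35.00, "widget-c": 12.50}
for change in diff_prices(yesterday, today):
    print(change)
```

Products that appear only in the newer snapshot are skipped here; a real monitor would probably report them separately as new listings.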

Brand Sentiment Analysis

Collect reviews and ratings from review platforms. Aggregate sentiment data across multiple sources.
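
Aggregating across sources usually means putting ratings on a common scale first, since one platform may use 5 stars and another 10 points. The sketch below uses normalised ratings as a rough stand-in for sentiment. It assumes each review record carries `source`, `rating`, and `scale` fields; these are hypothetical, as the `_parse_reviews` method above only extracts text.

```python
from collections import defaultdict
from statistics import mean


def aggregate_ratings(reviews):
    """Average ratings per source after normalising each to a 0-1 scale."""
    by_source = defaultdict(list)
    for review in reviews:
        by_source[review["source"]].append(review["rating"] / review["scale"])
    return {
        source: round(mean(scores), 2)
        for source, scores in by_source.items()
    }


# Hypothetical records; field names are assumptions for this sketch.
sample = [
    {"source": "site-a", "rating": 4, "scale": 5},
    {"source": "site-a", "rating": 5, "scale": 5},
    {"source": "site-b", "rating": 7, "scale": 10},
]
print(aggregate_ratings(sample))
```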

Job Market Analysis

Scrape job postings to understand hiring trends, salary ranges, and skill demand in your industry.
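
Skill demand can be approximated by counting how many postings mention each skill. This is a minimal sketch, assuming the posting text has already been extracted to plain strings; the `SKILLS` watch list and the sample postings are hypothetical.

```python
import re
from collections import Counter

SKILLS = ["python", "sql", "kubernetes", "react"]  # hypothetical watch list


def count_skill_mentions(postings, skills=SKILLS):
    """Count how many postings mention each skill (case-insensitive)."""
    counts = Counter()
    for text in postings:
        lowered = text.lower()
        for skill in skills:
            # Word boundaries keep "sql" from matching inside "nosql".
            if re.search(rf"\b{re.escape(skill)}\b", lowered):
                counts[skill] += 1
    return counts


postings = [
    "Senior Python developer with SQL experience",
    "React front-end engineer",
    "Data analyst: SQL, Python, Tableau",
]
print(count_skill_mentions(postings).most_common())
```

Word-boundary matching works for alphanumeric skill names; names containing symbols (for example "c++") would need a different matching rule.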

Patent Landscape Research

Collect patent filings from public databases to track innovation trends and competitor R&D activity.

Scaling Tips

| Factor | Recommendation |
|---|---|
| Request spacing | 2-5 seconds between pages |
| Concurrent collectors | 5-10 for moderate scale |
| Proxy rotation | Required for 100+ pages/hour |
| Data deduplication | Hash-based dedup before storage |
| Scheduling | Run daily or weekly for trend data |
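
Two of these recommendations are easy to sketch in code: hash-based deduplication and randomised request spacing. The example below is one possible approach, assuming records are JSON-serialisable dicts like those returned by the collector; the helper names `record_hash`, `deduplicate`, and `polite_delay` are hypothetical.

```python
import hashlib
import json
import random


def record_hash(record):
    """Stable SHA-256 fingerprint of a record, independent of key order."""
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(records):
    """Drop records whose fingerprint was already seen, keeping order."""
    seen = set()
    unique = []
    for record in records:
        digest = record_hash(record)
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique


def polite_delay(low=2.0, high=5.0):
    """Pick a random pause (seconds) within the 2-5 second spacing range."""
    return random.uniform(low, high)


reviews = [{"text": "Great tool"}, {"text": "Great tool"}, {"text": "Too slow"}]
print(deduplicate(reviews))
```

In the collector, `time.sleep(polite_delay())` could replace the fixed `time.sleep(2)` so requests are not evenly spaced.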

FAQ

Is scraping public data for market research legal?

Public data scraping is generally permitted. Always check the site's terms of service and comply with local regulations. Don't scrape personal data without consent.

How much does this cost with CaptchaAI?

Typical market research scraping triggers CAPTCHAs on roughly 30% of pages. At 1,000 pages per day, expect about 300 solves, costing roughly $0.50 to $3 per day in total.
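
The arithmetic behind that estimate can be sketched as a small helper. The per-1,000-solve prices used here are assumptions back-calculated from the figures above ($0.50 / 300 solves is about $1.67 per 1,000, and $3 / 300 solves is $10 per 1,000); they are not published CaptchaAI pricing, and the function name is hypothetical.

```python
def estimate_daily_cost(pages_per_day, captcha_rate, price_per_1000_solves):
    """Return (expected solves, expected cost in dollars) for one day."""
    solves = round(pages_per_day * captcha_rate)
    cost = solves / 1000 * price_per_1000_solves
    return solves, cost


# Assumed price range per 1,000 solves, derived from the article's totals.
for price in (1.67, 10.0):
    solves, cost = estimate_daily_cost(1000, 0.30, price)
    print(f"{solves} solves at ${price}/1000 -> ${cost:.2f} per day")
```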

How do I handle sites that block scrapers entirely?

Combine CaptchaAI with proxy rotation and realistic request patterns. See our Scraping Without Getting Blocked guide.
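
As a rough illustration of proxy rotation, the sketch below cycles through a list of proxies in round-robin order and passes each one through the `proxies` argument that `requests` sessions accept. The proxy URLs and the function name `fetch_via_proxy` are hypothetical placeholders.

```python
import itertools

PROXIES = [  # hypothetical endpoints; replace with your provider's
    "http://proxy-1.example.com:8080",
    "http://proxy-2.example.com:8080",
]


def fetch_via_proxy(session, url, proxy_cycle, timeout=30):
    """Send a GET through the next proxy in the rotation."""
    proxy = next(proxy_cycle)
    return session.get(
        url,
        proxies={"http": proxy, "https": proxy},
        timeout=timeout,
    )


# Round-robin iterator shared by all requests from one collector.
proxy_cycle = itertools.cycle(PROXIES)
```

With the collector above, `fetch_via_proxy(self.session, url, proxy_cycle)` could stand in for the plain `self.session.get(url)` call in `fetch`.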
