Octoparse is a visual web scraping tool that lets non-coders extract data. When a CAPTCHA blocks an extraction run, CaptchaAI can solve it outside Octoparse so the run can continue.
When Octoparse Encounters CAPTCHAs
| Scenario | What Happens |
|---|---|
| reCAPTCHA on target page | Extraction stops, manual solve needed |
| Cloudflare challenge | Page loads but no data extracted |
| Rate-limiting CAPTCHA | After N pages, CAPTCHA appears |
| Login-protected data | Login form has CAPTCHA |
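Before choosing an approach, it helps to know which of these scenarios a page falls into. The following is a minimal sketch that labels a response from its status code and HTML. The function name `classify_block` and the marker strings are assumptions chosen for illustration, not part of Octoparse or CaptchaAI, and real pages may use different markup.

```python
import re

# Heuristic markers only; these are assumptions and real pages vary.
RECAPTCHA_MARKERS = ("g-recaptcha", "recaptcha/api.js")
CLOUDFLARE_MARKERS = ("cf-chl", "challenge-platform", "just a moment...")


def classify_block(status_code: int, html: str) -> str:
    """Return a rough label for why a page yielded no data."""
    lowered = html.lower()
    # Cloudflare interstitials are checked first because they replace the whole page.
    if any(marker in lowered for marker in CLOUDFLARE_MARKERS):
        return "cloudflare_challenge"
    if any(marker in lowered for marker in RECAPTCHA_MARKERS):
        # A password field next to the widget suggests a login form.
        if re.search(r'type=["\']password["\']', lowered):
            return "login_captcha"
        return "recaptcha"
    if status_code == 429:
        return "rate_limited"
    return "no_captcha_detected"
```

A label such as `login_captcha` points toward the cookie approach below, while `recaptcha` on individual data pages points toward the scripted approach.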
Approach: Pre-Solve + Cookie Injection
Since Octoparse is a visual tool, the integration runs outside it: a Python helper logs in, solves the CAPTCHA, and exports the session cookies for Octoparse to import:
```python
import json
import time

import requests


class OctoparseCaptchaHelper:
    """Solve CAPTCHAs and export cookies for Octoparse."""

    def __init__(self, api_key):
        self.api_key = api_key
        self.session = requests.Session()

    def solve_and_get_cookies(self, login_url, sitekey, credentials):
        """
        Solve login CAPTCHA and return session cookies.

        Steps:
        1. Visit login page to get initial cookies
        2. Solve CAPTCHA via CaptchaAI
        3. Submit login form with token
        4. Export authenticated cookies
        """
        # Step 1: Get initial cookies
        self.session.get(login_url, timeout=15)

        # Step 2: Solve CAPTCHA
        token = self._solve_recaptcha(sitekey, login_url)

        # Step 3: Submit login
        login_data = {
            **credentials,
            "g-recaptcha-response": token,
        }
        resp = self.session.post(login_url, data=login_data, timeout=30)
        # Only HTTP-level failures are caught here; a site may return 200
        # for a rejected login, so verify the session before relying on it.
        if resp.status_code != 200:
            raise RuntimeError(f"Login failed: {resp.status_code}")

        # Step 4: Export cookies
        cookies = []
        for cookie in self.session.cookies:
            cookies.append({
                "name": cookie.name,
                "value": cookie.value,
                "domain": cookie.domain,
                "path": cookie.path,
            })
        return cookies

    def export_cookies_for_octoparse(self, cookies, output_file="cookies.json"):
        """Save cookies in format importable by Octoparse."""
        with open(output_file, "w", encoding="utf-8") as f:
            json.dump(cookies, f, indent=2)
        print(f"Cookies saved to {output_file}")
        print("Import these in Octoparse: Task → Advanced Settings → Cookies")

    def _solve_recaptcha(self, sitekey, pageurl):
        """Solve reCAPTCHA via CaptchaAI."""
        # Submit the task
        resp = requests.post("https://ocr.captchaai.com/in.php", data={
            "key": self.api_key,
            "method": "userrecaptcha",
            "googlekey": sitekey,
            "pageurl": pageurl,
            "json": 1,
        }, timeout=30)
        result = resp.json()
        if result.get("status") != 1:
            raise RuntimeError(f"Submit error: {result.get('request')}")
        task_id = result["request"]

        # Wait 15 s, then poll every 5 s for up to 24 attempts
        time.sleep(15)
        for _ in range(24):
            resp = requests.get("https://ocr.captchaai.com/res.php", params={
                "key": self.api_key, "action": "get",
                "id": task_id, "json": 1,
            }, timeout=15)
            data = resp.json()
            if data.get("status") == 1:
                return data["request"]
            # Any response other than "not ready" is treated as an error
            if data.get("request") != "CAPCHA_NOT_READY":
                raise RuntimeError(data.get("request"))
            time.sleep(5)
        raise TimeoutError("Solve timeout")


# Usage
helper = OctoparseCaptchaHelper("YOUR_API_KEY")
cookies = helper.solve_and_get_cookies(
    login_url="https://example.com/login",
    sitekey="6Le-wvkSAAAAAPBMRTvw0Q4Muexq9bi0DJwx_mJ-",
    credentials={"username": "user", "password": "pass"},
)
helper.export_cookies_for_octoparse(cookies)
```
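The usage above hard-codes the sitekey. On many pages the reCAPTCHA widget exposes it in a `data-sitekey` attribute, so it can often be read from the login page's HTML instead. A minimal sketch, assuming that attribute is present in the static HTML; `extract_sitekey` is a hypothetical helper name, and pages that render the widget purely from JavaScript will not match.

```python
import re
from typing import Optional

# Matches data-sitekey="..." or data-sitekey='...' (letters, digits, "_" and "-").
SITEKEY_PATTERN = re.compile(r'data-sitekey=["\']([\w-]+)["\']')


def extract_sitekey(html: str) -> Optional[str]:
    """Return the first data-sitekey value found in the HTML, or None."""
    match = SITEKEY_PATTERN.search(html)
    return match.group(1) if match else None
```

The HTML could come from the response of the initial `self.session.get(login_url)` call, which the helper currently discards.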
Approach: API-Based Extraction with CAPTCHA Solving
For more control, call CaptchaAI directly from a Python script that runs alongside Octoparse. The function below reuses the OctoparseCaptchaHelper class defined above:
```python
# Relies on the imports and OctoparseCaptchaHelper from the previous block.
def extract_with_captcha(api_key, urls, sitekey):
    """Extract data from CAPTCHA-protected pages."""
    # One helper is enough; it only holds the API key and a session
    helper = OctoparseCaptchaHelper(api_key)
    results = []

    for url in urls:
        print(f"Processing: {url}")

        # Solve CAPTCHA for this page
        token = helper._solve_recaptcha(sitekey, url)

        # Access page with token
        resp = requests.post(url, data={
            "g-recaptcha-response": token,
        }, timeout=30)

        # Parse response
        if resp.status_code == 200:
            results.append({
                "url": url,
                "content_length": len(resp.text),
                "status": "success",
            })
        else:
            results.append({
                "url": url,
                "status": f"failed ({resp.status_code})",
            })

        time.sleep(3)  # Rate limit

    return results
```
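If the script should reuse the session the helper already authenticated, rather than solving a CAPTCHA for every page, the exported cookies can be turned into a `Cookie` header. A minimal sketch, assuming the `cookies.json` layout written by `export_cookies_for_octoparse`; `load_cookies` and `cookie_header_for` are hypothetical helper names, and the domain matching is simplified (no path, expiry, or Secure handling).

```python
import json
from urllib.parse import urlparse


def load_cookies(path: str = "cookies.json") -> list:
    """Read the cookie list previously saved by the helper."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def cookie_header_for(url: str, cookies: list) -> str:
    """Build a Cookie header value from the cookies whose domain matches the URL."""
    host = urlparse(url).hostname or ""
    pairs = []
    for cookie in cookies:
        domain = (cookie.get("domain") or "").lstrip(".")
        # Simplified match: the exact host, or a parent domain of it.
        if domain and (host == domain or host.endswith("." + domain)):
            pairs.append(f"{cookie['name']}={cookie['value']}")
    return "; ".join(pairs)
```

The resulting string can be passed as `headers={"Cookie": ...}` on a request. Whether the target site accepts a replayed session this way depends on the site.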
Octoparse Configuration Tips
| Setting | Recommendation |
|---|---|
| Page load wait | Set to 10+ seconds for CAPTCHA pages |
| Retry on error | Enable with 3 retries |
| Cookie import | Use exported cookies from helper |
| Cloud extraction | Use Octoparse cloud with pre-solved cookies |
| Local extraction | Use local mode for initial CAPTCHA bypass |
FAQ
Can Octoparse solve CAPTCHAs automatically?
Octoparse has limited built-in CAPTCHA handling. For reliable solving, use CaptchaAI to pre-solve and export session cookies, or switch to a code-based approach for CAPTCHA-heavy sites.
When should I use Octoparse vs. a coded solution?
Use Octoparse for simple, low-CAPTCHA sites. For sites with frequent CAPTCHAs, a Python script with CaptchaAI gives you more control and reliability.
Can I schedule the cookie refresh?
Yes. Run the Python helper on a schedule (e.g., via cron or Task Scheduler) to refresh cookies before each Octoparse extraction run.
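A scheduled job does not need to log in again on every run if the cookie file is still recent. The sketch below checks the file's age before refreshing. The name `needs_refresh` and the one-hour default are assumptions for illustration; the right lifetime depends on how long the target site keeps a session alive.

```python
import os
import time


def needs_refresh(cookie_file: str, max_age_seconds: float = 3600) -> bool:
    """Return True if the cookie file is missing or older than max_age_seconds."""
    if not os.path.exists(cookie_file):
        return True
    age = time.time() - os.path.getmtime(cookie_file)
    return age > max_age_seconds
```

A cron or Task Scheduler job could call this first and run the helper only when it returns `True`, which also avoids paying for unnecessary solves.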
Handle CAPTCHAs in visual scraping — try CaptchaAI.