Static pages are easy. React apps, login-gated content, and aggressive rate limiters are where most crawlers break. Here is how Crawl4AI handles them.
## The Problem with JS-Heavy Sites
Crawl4AI uses Playwright under the hood, which means it renders JavaScript -- unlike simple HTTP scrapers. This handles most modern websites. Where things break: single-page apps that load content asynchronously, pages behind login walls, and sites with aggressive bot detection or rate limiting.
## Waiting for Content to Load
JavaScript-rendered content does not exist in the initial HTML response -- it appears after JS executes. Crawl4AI needs to wait for the content before extracting it.
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

config = CrawlerRunConfig(
    # Wait until a specific CSS selector appears in the DOM
    wait_for="css:.product-listing",
    # Or wait for a JavaScript condition to be true
    # wait_for="js:() => document.querySelectorAll('.product').length > 0",
    # Or simply wait a fixed time (less reliable, use as a last resort)
    # wait_for="js:() => new Promise(r => setTimeout(r, 3000))",
    # Page load timeout
    page_timeout=30000,  # 30 seconds
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://app.example.com/products",
        config=config,
    )
```

## Crawling Behind a Login
For content that requires authentication, there are two approaches: inject session cookies, or use Crawl4AI's hooks to run the login flow before crawling.
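Cookie injection expects one dict per cookie, while DevTools hands you a single raw `Cookie` header string. A minimal stdlib sketch of the conversion (the helper name and the `.example.com` domain are illustrative):

```python
from http.cookies import SimpleCookie

def cookie_header_to_dicts(header: str, domain: str) -> list[dict]:
    """Split a raw Cookie header value into per-cookie dicts."""
    jar = SimpleCookie()
    jar.load(header)
    return [
        {"name": name, "value": morsel.value, "domain": domain, "path": "/"}
        for name, morsel in jar.items()
    ]

cookies = cookie_header_to_dicts("session_token=abc123; theme=dark", ".example.com")
print(cookies)  # two dicts, one per cookie
```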
### Option 1: Inject pre-authenticated cookies
```python
# First, get your session cookie using your browser's DevTools:
# Network tab -> find a request -> copy the Cookie header value
config = CrawlerRunConfig(
    session_id="authenticated-session",  # name to reuse this session
)

async with AsyncWebCrawler() as crawler:
    # Set cookies before crawling
    await crawler.crawler_strategy.set_cookies([
        {
            "name": "session_token",
            "value": "your-session-cookie-value",
            "domain": ".example.com",
            "path": "/",
        }
    ])
    result = await crawler.arun(
        url="https://app.example.com/protected-content",
        config=config,
    )
```

### Option 2: Use hooks to run the login flow
```python
import os

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def login_hook(page, context, **kwargs):
    # Navigate to the login page and fill in credentials
    # (APP_EMAIL / APP_PASSWORD are example variable names)
    await page.goto("https://app.example.com/login")
    await page.fill("#email", os.environ["APP_EMAIL"])
    await page.fill("#password", os.environ["APP_PASSWORD"])
    await page.click("#login-button")
    # Wait for the redirect to confirm the login succeeded
    await page.wait_for_url("**/dashboard")

async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
    # Run the login hook once the page context exists, before the first crawl.
    # (on_browser_created fires too early -- its callback receives a browser,
    # not a page, so page.fill() would fail there.)
    crawler.crawler_strategy.set_hook("on_page_context_created", login_hook)
    result = await crawler.arun(
        url="https://app.example.com/protected-reports",
        config=CrawlerRunConfig(wait_for="css:.report-table"),
    )
```

Store credentials in environment variables, not in code, and never commit them to version control. For production use, prefer cookie injection (which can use short-lived tokens) over embedding passwords in code.

## Handling Rate Limits
Most documentation sites and content platforms have rate limits. Crawl4AI does not automatically handle 429 responses -- you need to build backoff logic around your crawling loops.
```python
import asyncio
import random

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def crawl_with_backoff(crawler, url, config, max_retries=3):
    for attempt in range(max_retries):
        result = await crawler.arun(url=url, config=config)
        if result.success:
            return result
        if result.status_code == 429:
            # Exponential backoff with jitter
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited on {url}. Waiting {wait_time:.1f}s...")
            await asyncio.sleep(wait_time)
        else:
            print(f"Failed {url}: {result.error_message}")
            return None
    print(f"Giving up on {url} after {max_retries} attempts")
    return None

async def batch_crawl_polite(urls, delay_between=2.0, max_concurrent=3):
    config = CrawlerRunConfig(css_selector="article")
    async with AsyncWebCrawler() as crawler:
        results = []
        # Process in small batches to stay under rate limits
        for i in range(0, len(urls), max_concurrent):
            batch = urls[i:i + max_concurrent]
            batch_results = await asyncio.gather(*[
                crawl_with_backoff(crawler, url, config)
                for url in batch
            ])
            results.extend(batch_results)
            # Polite delay between batches
            if i + max_concurrent < len(urls):
                await asyncio.sleep(delay_between)
        return [r for r in results if r is not None]
```

## Caching Crawl Results
Re-crawling the same pages repeatedly wastes time and triggers rate limits. Crawl4AI has a built-in caching system.
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.cache_context import CacheMode

config = CrawlerRunConfig(
    # Use the cache if available, crawl if not
    cache_mode=CacheMode.ENABLED,
    # Force a re-crawl even if cached
    # cache_mode=CacheMode.BYPASS,
    # Only use the cache, never crawl (for offline testing)
    # cache_mode=CacheMode.READ_ONLY,
)

async with AsyncWebCrawler() as crawler:
    # First call: crawls and caches
    result1 = await crawler.arun(url="https://docs.example.com/api", config=config)
    # Second call: returns the cached result instantly
    result2 = await crawler.arun(url="https://docs.example.com/api", config=config)
```

## Quick Reference
- Use `wait_for="css:.selector"` to wait for JS-rendered content
- Inject cookies for authenticated crawling -- prefer this over embedding credentials
- Use the `on_page_context_created` hook to run login flows before crawling
- Add exponential backoff with jitter for 429 rate-limit responses
- Keep `max_concurrent` low (3-5) and add delays between batches
- Enable `CacheMode.ENABLED` to avoid re-crawling unchanged pages
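Putting the pieces together, the techniques above can all be combined in a single run config. A sketch, not a drop-in recipe -- the selector and timeout need tuning per site:

```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.cache_context import CacheMode

config = CrawlerRunConfig(
    wait_for="css:article",        # wait for JS-rendered content
    css_selector="article",        # extract only the article body
    page_timeout=30000,            # 30-second page load budget
    cache_mode=CacheMode.ENABLED,  # reuse cached results between runs
)
```

Pass this config to a backoff-wrapped crawl loop like `crawl_with_backoff` above and you cover all four failure modes this post started with.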