Static pages are easy. React apps, login-gated content, and aggressive rate limiters are where most crawlers break. Here is how Crawl4AI handles them.

The Problem with JS-Heavy Sites

Crawl4AI uses Playwright under the hood, which means it renders JavaScript -- unlike simple HTTP scrapers. This handles most modern websites. Where things break: single-page apps that load content asynchronously, pages behind login walls, and sites with aggressive bot detection or rate limiting.

Waiting for Content to Load

JavaScript-rendered content does not exist in the initial HTML response -- it appears after JS executes. Crawl4AI needs to wait for the content before extracting it.

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
 
config = CrawlerRunConfig(
    # Wait until a specific CSS selector appears in the DOM
    wait_for="css:.product-listing",
 
    # Or wait for a JavaScript condition to be true
    # wait_for="js:() => document.querySelectorAll('.product').length > 0",
 
    # Or wait a fixed time (less reliable -- use as a last resort).
    # The promise must resolve truthy, or the wait never completes:
    # wait_for="js:() => new Promise(r => setTimeout(() => r(true), 3000))",
 
    # Page load timeout
    page_timeout=30000,   # 30 seconds
)
 
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://app.example.com/products",
        config=config,
    )

Crawling Behind a Login

For content that requires authentication, there are two approaches: inject session cookies, or use Crawl4AI's hooks to run the login flow before crawling.

Option 1: Inject pre-authenticated cookies

# First, get your session cookie using your browser's DevTools
# Network tab -> find a request -> copy cookie header value
 
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
 
# Cookies set on BrowserConfig are added to every browser context
browser_config = BrowserConfig(
    cookies=[
        {
            "name": "session_token",
            "value": "your-session-cookie-value",
            "domain": ".example.com",
            "path": "/",
        }
    ],
)
 
config = CrawlerRunConfig(
    session_id="authenticated-session",  # name to reuse this session
)
 
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(
        url="https://app.example.com/protected-content",
        config=config,
    )
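
The Cookie header you copy out of DevTools is a single "name=value; name2=value2" string; splitting it into the per-cookie dicts shown above is mechanical. A minimal sketch (cookies_from_header is a made-up helper name, not part of Crawl4AI):

```python
def cookies_from_header(cookie_header: str, domain: str, path: str = "/") -> list:
    """Turn a raw Cookie header copied from DevTools into the list of
    per-cookie dicts that Playwright-based crawlers accept."""
    cookies = []
    for pair in cookie_header.split(";"):
        name, sep, value = pair.strip().partition("=")
        if not sep or not name:
            continue  # skip malformed fragments
        cookies.append({"name": name, "value": value, "domain": domain, "path": path})
    return cookies
```

For example, cookies_from_header("session_token=abc123; theme=dark", ".example.com") yields two cookie dicts ready to pass to the crawler.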

Option 2: Use hooks to run the login flow

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, BrowserConfig
 
async def login_hook(page, context, **kwargs):
    # Navigate to the login page and fill in credentials
    await page.goto("https://app.example.com/login")
    await page.fill("#email", "your-email@example.com")
    await page.fill("#password", "your-password")
    await page.click("#login-button")
    # Wait for the redirect to confirm the login succeeded
    await page.wait_for_url("**/dashboard")
    return page
 
async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
    # Run the login flow on each new page context, before the first crawl.
    # (on_browser_created fires too early -- it has no page to work with.)
    crawler.crawler_strategy.set_hook("on_page_context_created", login_hook)
 
    result = await crawler.arun(
        url="https://app.example.com/protected-reports",
        config=CrawlerRunConfig(wait_for="css:.report-table"),
    )

Store credentials in environment variables, not in code. Never commit credentials to version control. For production use, prefer cookie injection (which can use short-lived tokens) over embedding passwords in code.
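
Applying that advice to the hook above, the credentials can come from the environment. A small sketch -- CRAWL_EMAIL and CRAWL_PASSWORD are arbitrary example names, use whatever your deployment already follows:

```python
import os

def load_login_credentials():
    """Fetch crawl credentials from the environment, failing fast if
    either is missing, so a misconfigured deployment errors immediately
    instead of submitting an empty login form."""
    email = os.environ.get("CRAWL_EMAIL")
    password = os.environ.get("CRAWL_PASSWORD")
    if not email or not password:
        raise RuntimeError("Set CRAWL_EMAIL and CRAWL_PASSWORD before crawling")
    return email, password
```

Inside login_hook you would then call load_login_credentials() instead of hard-coding the strings passed to page.fill().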

Handling Rate Limits

Most documentation sites and content platforms have rate limits. Crawl4AI does not automatically handle 429 responses -- you need to build backoff logic around your crawling loops.

import asyncio
import random
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
 
async def crawl_with_backoff(crawler, url, config, max_retries=3):
    for attempt in range(max_retries):
        result = await crawler.arun(url=url, config=config)
 
        if result.success:
            return result
 
        if result.status_code == 429:
            # Exponential backoff with jitter
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited on {url}. Waiting {wait_time:.1f}s...")
            await asyncio.sleep(wait_time)
        else:
            print(f"Failed {url}: {result.error_message}")
            return None
 
    print(f"Giving up on {url} after {max_retries} attempts")
    return None
 
async def batch_crawl_polite(urls, delay_between=2.0, max_concurrent=3):
    config = CrawlerRunConfig(css_selector="article")
 
    async with AsyncWebCrawler() as crawler:
        results = []
        # Process in small batches to stay under rate limits
        for i in range(0, len(urls), max_concurrent):
            batch = urls[i:i + max_concurrent]
            batch_results = await asyncio.gather(*[
                crawl_with_backoff(crawler, url, config)
                for url in batch
            ])
            results.extend(batch_results)
            # Polite delay between batches
            if i + max_concurrent < len(urls):
                await asyncio.sleep(delay_between)
 
    return [r for r in results if r is not None]
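
The delay schedule crawl_with_backoff produces is worth seeing in isolation: an exponential base (1s, 2s, 4s, ...) plus up to one second of random jitter per attempt. A standalone sketch of the same arithmetic (backoff_delays is an illustrative helper, not part of Crawl4AI):

```python
import random

def backoff_delays(max_retries=3, seed=None):
    """The schedule used above: 2 ** attempt seconds of base delay plus
    up to 1 second of jitter, i.e. roughly 1-2s, 2-3s, then 4-5s."""
    rng = random.Random(seed)
    return [(2 ** attempt) + rng.uniform(0, 1) for attempt in range(max_retries)]
```

The jitter matters: without it, several workers rate-limited at the same moment would all retry in lockstep and get rate-limited again together.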

Caching Crawl Results

Re-crawling the same pages repeatedly wastes time and triggers rate limits. Crawl4AI has a built-in caching system.

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.cache_context import CacheMode
 
config = CrawlerRunConfig(
    # Use cache if available, crawl if not
    cache_mode=CacheMode.ENABLED,
 
    # Force re-crawl even if cached
    # cache_mode=CacheMode.BYPASS,
 
    # Only use cache, never crawl (for offline testing)
    # cache_mode=CacheMode.READ_ONLY,
)
 
async with AsyncWebCrawler() as crawler:
    # First call: crawls and caches
    result1 = await crawler.arun(url="https://docs.example.com/api", config=config)
 
    # Second call: returns cached result instantly
    result2 = await crawler.arun(url="https://docs.example.com/api", config=config)
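
If freshness matters, one approach is to track crawl timestamps yourself and switch to CacheMode.BYPASS when an entry is too old; the cache mode decision then reduces to an age check. A hypothetical helper (is_stale is not a Crawl4AI function):

```python
from datetime import datetime, timedelta

def is_stale(last_crawled, max_age_hours=24.0):
    """True when a cached page is older than max_age_hours; the caller
    would then crawl with cache_mode=CacheMode.BYPASS to force a refresh,
    and CacheMode.ENABLED otherwise."""
    return datetime.now() - last_crawled > timedelta(hours=max_age_hours)
```

Where the timestamps come from (a database, a sidecar file, crawl logs) is up to you; the point is that the mode switch is a one-line condition.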

Quick Reference

  • Use wait_for='css:.selector' to wait for JS-rendered content
  • Inject cookies for authenticated crawling -- prefer over embedding credentials
  • Use the on_page_context_created hook to run login flows before crawling
  • Add exponential backoff with jitter for 429 rate limit responses
  • Keep max_concurrent low (3-5) and add delays between batches
  • Enable CacheMode.ENABLED to avoid re-crawling unchanged pages