Avoiding 403 Errors in Web Scraping: 11 Proven Methods That Actually Work

A 403 error stops your scraper cold. Your requests reach the target site, but the server refuses to return data. Knowing how to avoid 403 errors when web scraping is the difference between a scraper that runs for weeks and one that dies in minutes.

This guide covers the most common reasons you get 403 Forbidden errors when web scraping and 11 fixes for them, each shown with code.

What Is a 403 Forbidden Error?

A 403 error is an HTTP status code that means the server understood your request but refuses to fulfill it. The server is saying: “I know what you want, but you’re not allowed to have it.”

This is different from other errors:

  • 404 Not Found: The page doesn’t exist
  • 500 Server Error: The server crashed or broke
  • 429 Too Many Requests: You’re making requests too fast
  • 403 Forbidden: You’re blocked specifically

When a scraper gets a 403, the site has identified the request as suspicious or unwanted and explicitly denied access. The block is intentional: the server saw the request and said no.
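
Because each of these status codes calls for a different reaction, it can help to make the distinction explicit in code. The sketch below is one possible way to do that; the function name `suggestAction` and the action labels are hypothetical, not part of any library.

```php
<?php
// Hypothetical helper: maps an HTTP status code to a suggested scraper action.
// The action labels are illustrative only.
function suggestAction(int $status): string
{
    if ($status >= 200 && $status < 300) {
        return 'parse';          // content was returned
    }
    if ($status === 403) {
        return 'review-request'; // blocked: check headers, cookies, IP
    }
    if ($status === 429) {
        return 'slow-down';      // rate limited: add delay, retry later
    }
    if ($status === 404) {
        return 'skip';           // page does not exist
    }
    if ($status >= 500) {
        return 'retry-later';    // server-side failure
    }
    return 'inspect';            // anything else needs a manual look
}

foreach ([200, 403, 404, 429, 500] as $code) {
    echo $code . ' => ' . suggestAction($code) . "\n";
}
```

Keeping this decision in one place means the rest of the scraper can branch on the label instead of repeating status comparisons.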

Why Sites Return 403 Errors in Web Scraping

Sites block scrapers with 403 errors for legitimate reasons:

1. Protection Against Bots
Web scraping can overload servers. A scraper making 1000 requests per minute looks like a DDoS attack. Sites use 403 errors to stop this.

2. Commercial Competition
E-commerce sites block scrapers because competitors scrape prices. Job boards block scrapers to prevent job posting duplication. News sites block to prevent content theft.

3. User Agent Detection
If your scraper sends no User-Agent, or claims to be a browser without behaving like one, sites flag it as suspicious and return 403.

4. Rate Limiting
Making 100 requests per second from one IP can trigger automatic blocking.

5. Missing or Invalid Headers
A real browser sends specific headers. A scraper that sends none looks fake and gets blocked.

6. Cookies and Session State
Some sites require you to have a valid session cookie. Without it, they return 403.

7. Referer Header Validation
Sites check where requests come from. If the Referer header is missing or wrong, they block you.

Fix 1: Set a Proper User-Agent Header

This is one of the most common reasons for 403 errors. Real browsers send a User-Agent header identifying themselves. Scrapers that omit it are often blocked immediately.

<?php
// WRONG - No user agent (gets 403)
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://example.com/page");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
// Many sites return 403 for this
?>

Output:

HTTP/1.1 403 Forbidden
Content-Type: text/html
Access denied - Suspicious activity detected

The Fix: Set a realistic User-Agent header.

<?php
// CORRECT - Proper user agent (usually works)
$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "https://example.com/page");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// Set a realistic user agent
curl_setopt($ch, CURLOPT_USERAGENT, 
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ' .
    'AppleWebKit/537.36 (KHTML, like Gecko) ' .
    'Chrome/91.0.4472.124 Safari/537.36'
);

$response = curl_exec($ch);
$http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);

echo "HTTP Status: $http_code\n";
if ($http_code == 200) {
    echo "Success - page retrieved\n";
} else if ($http_code == 403) {
    echo "Still getting 403 - try other fixes\n";
}

curl_close($ch);
?>

Output:

HTTP Status: 200
Success - page retrieved

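Use a realistic, current browser User-Agent. A string from a recent Chrome or Firefox release works better than one from an old Internet Explorer version. The Chrome/91 string in these examples is only illustrative; replace it with a current one.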

For the official cURL User-Agent documentation and best practices, see PHP curl_setopt documentation.

Fix 2: Add Referer and Accept Headers

Real browsers send more than just User-Agent. They send Referer (which page you came from) and Accept headers (what content types you accept).

<?php
$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "https://example.com/products");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);

// Let cURL send Accept-Encoding and decompress the response itself.
// Sending Accept-Encoding only as a raw header would return compressed bytes.
curl_setopt($ch, CURLOPT_ENCODING, '');

// Complete header set that looks like a real browser
curl_setopt($ch, CURLOPT_HTTPHEADER, [
    'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) ' .
               'AppleWebKit/537.36 (KHTML, like Gecko) ' .
               'Chrome/91.0.4472.124 Safari/537.36',
    'Accept: text/html,application/xhtml+xml,' .
            'application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language: en-US,en;q=0.5',
    'DNT: 1',
    'Connection: keep-alive',
    'Upgrade-Insecure-Requests: 1',
    'Referer: https://example.com/'
]);

$response = curl_exec($ch);
$http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);

echo "Status: $http_code\n";
curl_close($ch);
?>

Output:

Status: 200

This header set mimics a modern Chrome browser and bypasses most basic 403 blocks.

Fix 3: Handle Cookies and Sessions

Some sites require a valid session. Without cookies, they return 403. The fix is to save and reuse cookies across requests.

<?php
class CookieAwareScraper {
    private $cookie_file;
    private $ch;

    public function __construct() {
        // Create temporary file for cookies
        $this->cookie_file = tempnam(sys_get_temp_dir(), 'curl_');
        $this->ch = curl_init();
    }

    public function fetch($url) {
        curl_setopt($this->ch, CURLOPT_URL, $url);
        curl_setopt($this->ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($this->ch, CURLOPT_TIMEOUT, 10);

        // Save cookies to file
        curl_setopt($this->ch, CURLOPT_COOKIEJAR, 
                   $this->cookie_file);

        // Load cookies from file (maintains session)
        curl_setopt($this->ch, CURLOPT_COOKIEFILE, 
                   $this->cookie_file);

        // Real browser headers
        curl_setopt($this->ch, CURLOPT_HTTPHEADER, [
            'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) ' .
                       'AppleWebKit/537.36 (KHTML, like Gecko) ' .
                       'Chrome/91.0.4472.124 Safari/537.36'
        ]);

        $response = curl_exec($this->ch);
        $http_code = curl_getinfo($this->ch, 
                                 CURLINFO_HTTP_CODE);

        return [
            'status' => $http_code,
            'content' => $response
        ];
    }

    public function cleanup() {
        curl_close($this->ch);
        @unlink($this->cookie_file);
    }
}

// Usage
$scraper = new CookieAwareScraper();

// First request establishes session
$result1 = $scraper->fetch('https://example.com/login');
echo "First request: " . $result1['status'] . "\n";

// Second request uses the session cookie (no 403)
$result2 = $scraper->fetch('https://example.com/protected-page');
echo "Second request: " . $result2['status'] . "\n";

$scraper->cleanup();
?>

Output:

First request: 200
Second request: 200

This pattern maintains session state across multiple requests, which many sites require to avoid returning 403.

Fix 4: Respect Rate Limiting

Making requests too fast causes 403 errors. The fix is to add delays between requests.

<?php
function scrapeRespectfully($urls) {
    foreach ($urls as $url) {
        echo "Fetching: $url\n";

        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        curl_setopt($ch, CURLOPT_USERAGENT,
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ' .
            'AppleWebKit/537.36');

        $response = curl_exec($ch);
        $http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);

        if ($http_code == 403) {
            echo "✗ Got 403 - backing off\n";
            sleep(5); // Wait 5 seconds on 403
        } else if ($http_code == 200) {
            echo "✓ Success\n";
        } else {
            echo "? HTTP $http_code\n";
        }

        curl_close($ch);

        // Wait between requests (be respectful)
        // 2-3 seconds is a common starting point
        sleep(2);
    }
}

$urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3'
];

scrapeRespectfully($urls);
?>

Output:

Fetching: https://example.com/page1
✓ Success
Fetching: https://example.com/page2
✓ Success
Fetching: https://example.com/page3
✓ Success
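
A fixed delay is easy to fingerprint, and some servers state how long they want you to wait in a `Retry-After` header (common on 429 and 503 responses). The sketch below shows one way to honor that hint and otherwise fall back to a randomized delay. The helper names `retryAfterSeconds` and `politeDelay` and their defaults are assumptions made for this example.

```php
<?php
// Hypothetical helper: seconds to wait according to a Retry-After header value.
// Returns null when the header is absent or cannot be parsed.
function retryAfterSeconds(?string $value, ?int $now = null): ?int
{
    if ($value === null || trim($value) === '') {
        return null;
    }
    $value = trim($value);
    if (ctype_digit($value)) {
        return (int) $value;        // delta-seconds form, e.g. "120"
    }
    $when = strtotime($value);      // HTTP-date form
    if ($when === false) {
        return null;
    }
    $now = $now ?? time();
    return max(0, $when - $now);
}

// Hypothetical helper: base delay plus random jitter, in seconds.
function politeDelay(float $base = 2.0, float $jitter = 1.0): float
{
    return $base + (mt_rand() / mt_getrandmax()) * $jitter;
}

// Usage sketch: prefer the server's hint, fall back to a jittered delay.
$hint  = retryAfterSeconds('120');
$delay = $hint ?? politeDelay();
echo "Waiting {$delay}s\n";
// usleep((int) round($delay * 1000000));
```

In a real scraper the header value would come from the response, for example by collecting headers with `CURLOPT_HEADERFUNCTION`.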

Fix 5: Follow Redirects

Some sites redirect requests. If you don’t follow redirects, you get the redirect response itself (or an intermediate error page) instead of the real content.

<?php
$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "https://example.com/old-page");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);

// CRITICAL: Follow redirects (301, 302, 307, etc)
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 5); // Max 5 redirects

curl_setopt($ch, CURLOPT_USERAGENT,
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ' .
    'AppleWebKit/537.36');

$response = curl_exec($ch);
$http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
$final_url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);

echo "Final URL: $final_url\n";
echo "Status: $http_code\n";

curl_close($ch);
?>

Output:

Final URL: https://example.com/new-page
Status: 200

Fix 6: Check for Cloudflare Protection

Cloudflare is a security service that blocks many scrapers with 403 errors. It detects non-browser requests and blocks them.

Testing for Cloudflare:

<?php
function isCloudFlareBlocking($url) {
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);

    $response = curl_exec($ch);
    $http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    curl_close($ch);

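    // Cloudflare responses carry a "CF-RAY" header and "Server: cloudflare";
    // stripos() makes the match case-insensitive
    if ($http_code == 403 && 
        (stripos($response, 'cf-ray:') !== false ||
         stripos($response, 'cloudflare') !== false)) {
        return true; // Cloudflare is blocking
    }

    // Weak signal: other CDNs and firewalls also use this wording
    if ($http_code == 403 && 
        stripos($response, 'Access Denied') !== false) {
        return true; // Possible Cloudflare
    }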

    return false;
}

if (isCloudFlareBlocking('https://example.com')) {
    echo "CloudFlare is blocking this site\n";
    echo "Solution: Use a headless browser like Puppeteer\n";
} else {
    echo "Regular 403 - try the other fixes\n";
}
?>

Output:

CloudFlare is blocking this site
Solution: Use a headless browser like Puppeteer

If Cloudflare is blocking you, cURL alone usually won’t work. You need a headless browser (Puppeteer, Playwright) that renders JavaScript and can complete Cloudflare’s challenge.

Cloudflare is a major security service. For their documentation on protection mechanisms, see the Cloudflare Developer Documentation.

Fix 7: Use Proxy Rotation

If one IP gets blocked, use a different IP. Proxies let you rotate through different IPs to avoid getting the same one blocked repeatedly.

<?php
class ProxyRotatingScraper {
    private $proxies = [
        'proxy1.example.com:8080',
        'proxy2.example.com:8080',
        'proxy3.example.com:8080'
    ];

    private $current_proxy_index = 0;

    public function fetch($url) {
        $proxy = $this->getNextProxy();

        $ch = curl_init();

        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);

        // Use proxy
        curl_setopt($ch, CURLOPT_PROXY, $proxy);
        curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_HTTP);

        // Real browser headers
        curl_setopt($ch, CURLOPT_USERAGENT,
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ' .
            'AppleWebKit/537.36');

        $response = curl_exec($ch);
        $http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);

        echo "Proxy: $proxy | Status: $http_code\n";

        curl_close($ch);

        return [
            'status' => $http_code,
            'content' => $response,
            'proxy_used' => $proxy
        ];
    }

    private function getNextProxy() {
        $proxy = $this->proxies[$this->current_proxy_index];
        $this->current_proxy_index = 
            ($this->current_proxy_index + 1) % 
            count($this->proxies);
        return $proxy;
    }
}

// Usage
$scraper = new ProxyRotatingScraper();

for ($i = 0; $i < 5; $i++) {
    $scraper->fetch('https://example.com/page' . $i);
}
?>

Output:

Proxy: proxy1.example.com:8080 | Status: 200
Proxy: proxy2.example.com:8080 | Status: 200
Proxy: proxy3.example.com:8080 | Status: 200
Proxy: proxy1.example.com:8080 | Status: 200
Proxy: proxy2.example.com:8080 | Status: 200
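
Plain round-robin keeps sending traffic through a proxy even after the target has blocked it. One possible refinement, sketched below, is to count consecutive 403 responses per proxy and retire a proxy once it crosses a threshold. The class `ProxyPool`, its method names, and the default of three strikes are all assumptions for illustration.

```php
<?php
// Hypothetical proxy pool: round-robin rotation that retires a proxy
// after it has produced too many consecutive 403 responses.
class ProxyPool
{
    private array $proxies;
    private array $strikes = [];
    private int $index = 0;
    private int $maxStrikes;

    public function __construct(array $proxies, int $maxStrikes = 3)
    {
        $this->proxies = array_values($proxies);
        $this->maxStrikes = $maxStrikes;
    }

    // Returns the next usable proxy, or null when all are retired.
    public function next(): ?string
    {
        if ($this->proxies === []) {
            return null;
        }
        $this->index = $this->index % count($this->proxies);
        $proxy = $this->proxies[$this->index];
        $this->index++;
        return $proxy;
    }

    // Record the status a proxy produced; retire it after repeated 403s.
    public function report(string $proxy, int $status): void
    {
        if ($status !== 403) {
            $this->strikes[$proxy] = 0; // any other status resets the count
            return;
        }
        $this->strikes[$proxy] = ($this->strikes[$proxy] ?? 0) + 1;
        if ($this->strikes[$proxy] >= $this->maxStrikes) {
            $this->proxies = array_values(array_diff($this->proxies, [$proxy]));
        }
    }

    public function count(): int
    {
        return count($this->proxies);
    }
}

// Usage sketch: pick a proxy, make the request through it, report the status.
$pool  = new ProxyPool(['proxy1.example.com:8080', 'proxy2.example.com:8080']);
$proxy = $pool->next();
$pool->report($proxy, 403); // first strike against proxy1
echo "Active proxies: " . $pool->count() . "\n";
```

The pool only tracks state; the request itself would still go through `CURLOPT_PROXY` as in the class above.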

Fix 8: Don’t Scrape Admin or Protected URLs

Some URLs return 403 regardless of your headers because they’re protected pages. Attempting to scrape `/admin`, `/wp-admin`, `/api/private`, or similar will usually fail.

Check what’s actually scrapeable:

<?php
$urls = [
    'https://example.com/public-page',     // OK
    'https://example.com/products',        // OK
    'https://example.com/admin',           // Usually 403
    'https://example.com/wp-admin',        // Usually 403
    'https://example.com/api/private',     // Usually 403
    'https://example.com/account',         // Usually requires login
];

foreach ($urls as $url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);

    curl_exec($ch);
    $http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    // match requires PHP 8.0 or later
    $status = match($http_code) {
        200 => '✓ Public - can scrape',
        403 => '✗ Forbidden - don\'t scrape',
        404 => '✗ Not found',
        default => "? HTTP $http_code"
    };

    echo "$url => $status\n";

    curl_close($ch);
}
?>

Output:

https://example.com/public-page => ✓ Public - can scrape
https://example.com/products => ✓ Public - can scrape
https://example.com/admin => ✗ Forbidden - don't scrape
https://example.com/wp-admin => ✗ Forbidden - don't scrape
https://example.com/api/private => ✗ Forbidden - don't scrape
https://example.com/account => ? HTTP 302
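
Rather than discovering protected URLs by requesting them, you can filter them out before any request is made. The sketch below is one way to do that; the function name `isLikelyProtected` and the prefix list are assumptions, and the list should be adjusted to the site you are working with.

```php
<?php
// Hypothetical pre-filter: skip URLs whose path starts with a prefix that
// is typically protected. The prefix list is an assumption; adjust per site.
function isLikelyProtected(
    string $url,
    array $prefixes = ['/admin', '/wp-admin', '/api/private', '/account']
): bool {
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return false; // no path component (e.g. a bare domain)
    }
    foreach ($prefixes as $prefix) {
        // Match "/admin" and "/admin/..." but not "/administrators-guide"
        if ($path === $prefix || strpos($path, $prefix . '/') === 0) {
            return true;
        }
    }
    return false;
}

$candidates = [
    'https://example.com/products',
    'https://example.com/admin/users',
    'https://example.com/administrators-guide',
];
$allowed = array_values(array_filter($candidates, fn($u) => !isLikelyProtected($u)));
print_r($allowed);
```

Matching on whole path segments avoids dropping public pages whose names merely begin with the same letters.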

Fix 9: Implement Intelligent Backoff

When you get a 403, the smart thing is to back off exponentially. Don’t hammer the same URL repeatedly – you risk a permanent block.

<?php
function fetchWithBackoff($url, $max_attempts = 5) {
    for ($attempt = 1; $attempt <= $max_attempts; $attempt++) {
        $ch = curl_init();

        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        curl_setopt($ch, CURLOPT_USERAGENT,
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ' .
            'AppleWebKit/537.36');

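        // Let cURL negotiate and decode compression; a raw
        // Accept-Encoding header would return compressed bytes
        curl_setopt($ch, CURLOPT_ENCODING, '');

        // Add the remaining browser headers
        curl_setopt($ch, CURLOPT_HTTPHEADER, [
            'Accept: text/html,application/xhtml+xml',
            'Accept-Language: en-US,en;q=0.9',
            'DNT: 1',
            'Connection: keep-alive'
        ]);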

        $response = curl_exec($ch);
        $http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);

        curl_close($ch);

        echo "Attempt $attempt: HTTP $http_code\n";

        // Success
        if ($http_code == 200) {
            return $response;
        }

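        // 403 - back off exponentially
        if ($http_code == 403) {
            if ($attempt == $max_attempts) {
                break; // no point waiting after the final attempt
            }
            $wait_seconds = pow(2, $attempt); // 2, 4, 8, 16
            // "{$var}" form: "${var}" is deprecated since PHP 8.2
            echo "Got 403. Waiting {$wait_seconds}s before retry...\n";
            sleep($wait_seconds);
            continue;
        }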

        // Other error - don't retry
        if ($http_code >= 400) {
            throw new Exception(
                "HTTP $http_code - giving up"
            );
        }
    }

    throw new Exception(
        "Failed after $max_attempts attempts"
    );
}

// Usage
try {
    $content = fetchWithBackoff(
        'https://example.com/page'
    );
    echo "Success!\n";
} catch (Exception $e) {
    echo "Error: " . $e->getMessage();
}
?>

Output:

Attempt 1: HTTP 403
Got 403. Waiting 2s before retry...
Attempt 2: HTTP 403
Got 403. Waiting 4s before retry...
Attempt 3: HTTP 200
Success!
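
Pure doubling grows quickly, so longer retry loops usually cap the delay, and adding a little randomness keeps several workers from retrying in lockstep. The sketch below computes such a schedule; the function name `backoffDelay` and its default base, cap, and jitter values are assumptions for this example.

```php
<?php
// Hypothetical helper: delay in seconds before retry number $attempt (1-based).
// Doubles each attempt, is capped at $cap, and adds up to $jitter seconds
// of random spread.
function backoffDelay(
    int $attempt,
    float $base = 2.0,
    float $cap = 60.0,
    float $jitter = 0.0
): float {
    $delay = min($cap, $base * (2 ** ($attempt - 1)));
    if ($jitter > 0) {
        $delay += (mt_rand() / mt_getrandmax()) * $jitter;
    }
    return $delay;
}

// With the defaults the schedule is 2, 4, 8, 16, 32, 60, 60 seconds.
for ($attempt = 1; $attempt <= 7; $attempt++) {
    printf("attempt %d: wait %.0fs\n", $attempt, backoffDelay($attempt));
}
```

The first five delays match the 2, 4, 8, 16, 32 sequence used above; the cap only takes effect from the sixth attempt.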

Fix 10: Check robots.txt and Terms of Service

Sites that don’t want to be scraped usually say so in robots.txt or their terms of service. Ignoring this might result in permanent 403 blocks or legal action.

<?php
function checkScrapingPermission($domain) {
    // Check robots.txt
    $robots_url = "https://$domain/robots.txt";
    $robots = @file_get_contents($robots_url);

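    // Match only a bare "Disallow: /" line; a plain substring search
    // would also match path rules such as "Disallow: /admin"
    if ($robots && 
        preg_match('/^\s*Disallow:\s*\/\s*$/mi', $robots)) {
        echo "⚠ robots.txt: Disallows all scraping\n";
        return false;
    }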

    if ($robots && 
        strpos($robots, 'User-agent: *') !== false) {
        echo "✓ robots.txt: General rules found\n";
    }

    // Check for API (heuristic: a 200 at /api hints at, but does not prove, a public API)
    $api_url = "https://$domain/api";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $api_url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    curl_exec($ch);
    $http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($http_code == 200) {
        echo "✓ Official API found - use this instead\n";
        return true;
    }

    echo "✓ Scraping might be acceptable\n";
    return true;
}

checkScrapingPermission('example.com');
?>

Output:

✓ robots.txt: General rules found
✓ Official API found - use this instead

Fix 11: Use a Headless Browser for Complex Sites

If none of the above work, the site probably requires JavaScript rendering or has advanced bot detection. Use a headless browser instead of cURL.

<?php
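// Using Puppeteer with PHP (requires Node.js)
// composer require nesk/puphpeteer
// npm install @nesk/puphpeteer

require __DIR__ . '/vendor/autoload.php';

use Nesk\Puphpeteer\Puppeteer;

$puppeteer = new Puppeteer([
    'read_timeout'  => 30
]);

try {
    // Chromium flags are passed as launch options
    $browser = $puppeteer->launch(['args' => ['--no-sandbox']]);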
    $page = $browser->newPage();

    // Set realistic user agent
    $page->setUserAgent(
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ' .
        'AppleWebKit/537.36 (KHTML, like Gecko) ' .
        'Chrome/91.0.4472.124 Safari/537.36'
    );

    // Navigate to page
    $page->goto('https://example.com/page', 
               ['waitUntil' => 'networkidle0']);

    // Get content (after JavaScript renders)
    $content = $page->content();

    echo "Success! Got page with JavaScript rendering\n";
    echo substr($content, 0, 100) . "...\n";

    $browser->close();

} catch (\Exception $e) {
    echo "Error: " . $e->getMessage();
}
?>

Output:

Success! Got page with JavaScript rendering
<!DOCTYPE html><html><head><title>Page Title</title>...

Headless browsers handle JavaScript, and they can get past many Cloudflare challenges and advanced bot-detection checks that cURL cannot.

For Puppeteer (headless browser) documentation, see the official Puppeteer NPM package.

Debugging 403 Errors: Quick Checklist for Web Scrapers

When you get a 403 error, go through this checklist:

<?php
function debugAnd403Error($url) {
    echo "Debugging 403 error for: $url\n";
    echo "===================\n\n";

    // Test 1: Basic request
    echo "1. Testing basic request...\n";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    curl_exec($ch);
    $code1 = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    echo "   Result: HTTP $code1\n\n";

    // Test 2: With user agent
    echo "2. Testing with User-Agent...\n";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    curl_setopt($ch, CURLOPT_USERAGENT,
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ' .
        'AppleWebKit/537.36');
    curl_exec($ch);
    $code2 = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    echo "   Result: HTTP $code2\n\n";

    // Test 3: With full headers
    echo "3. Testing with full browser headers...\n";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    curl_setopt($ch, CURLOPT_HTTPHEADER, [
        'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) ' .
                   'AppleWebKit/537.36',
        'Accept: text/html,application/xhtml+xml',
        'Accept-Language: en-US,en;q=0.9',
        'Accept-Encoding: gzip, deflate',
        'DNT: 1',
        'Connection: keep-alive'
    ]);
    curl_exec($ch);
    $code3 = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    echo "   Result: HTTP $code3\n\n";

    // Summary
    echo "Summary:\n";
    if ($code1 == 403 && $code2 == 200) {
        echo "✓ Issue: Missing User-Agent\n";
        echo "✓ Solution: Use Fix #1\n";
    } else if ($code2 == 403 && $code3 == 200) {
        echo "✓ Issue: Missing headers\n";
        echo "✓ Solution: Use Fix #2\n";
    } else if ($code3 == 403) {
        echo "✓ Issue: Likely CloudFlare or advanced bot detection\n";
        echo "✓ Solution: Use Fix #6 or #11\n";
    } else {
        echo "? Unable to determine - try other fixes\n";
    }
}

debugAnd403Error('https://example.com/page');
?>

Output:

Debugging 403 error for: https://example.com/page
===================

1. Testing basic request...
   Result: HTTP 403

2. Testing with User-Agent...
   Result: HTTP 200

3. Testing with full browser headers...
   Result: HTTP 200

Summary:
✓ Issue: Missing User-Agent
✓ Solution: Use Fix #1

When NOT to Scrape (Avoiding 403 Entirely)

The best way to avoid 403 errors is to not scrape sites that don’t want to be scraped. Check first:

  • Does the site have an official API? Use it instead. It’s usually faster, more reliable, and explicitly permitted.
  • Does robots.txt say Disallow: /? Respect it. They’re telling you not to scrape.
  • Does the terms of service prohibit scraping? Respect it. Scraping anyway is a legal risk.
  • Is the data behind authentication? Don’t scrape accounts you don’t own.
  • Is the site actively blocking scrapers? They don’t want to be scraped. Find another data source.

For more on this, see the ethical web scraping guide which covers legal considerations and when NOT to scrape.

Frequently Asked Questions

What’s the difference between a 403 and a 429 error?

A 403 Forbidden means the server denies you access. A 429 Too Many Requests means you’re making requests too fast. 403 is intentional blocking. 429 is rate limiting. Fix 403 with headers and cookies. Fix 429 with delays and slower request rates.

Will changing my User-Agent get me past Cloudflare?

No. Cloudflare checks for browser-like behavior, not just the User-Agent. It looks at JavaScript execution, cookie handling, and other signals. A User-Agent alone won’t bypass Cloudflare. You need a headless browser (Fix #11).

Can I keep scraping a site that returns 403?

Technically yes if the data is public. Legally, it depends on the terms of service and jurisdiction. Getting a 403 is a site telling you they don’t want you there. Continuing despite that is bad faith and increases legal risk. See the ethical web scraping guide for details.

Should I rotate User-Agents?

No, not necessary. Using one realistic User-Agent consistently is better than rotating through 10 different ones. Rotating User-Agents actually looks more suspicious to bot detection systems. Use one good User-Agent for all requests.

How long should I wait when I get a 403?

Use exponential backoff. Wait 2 seconds, then 4, then 8, then 16. Don’t wait forever – if it’s still 403 after 5 attempts, the site is blocking you intentionally. Move on.

Can proxies fix all 403 errors?

Proxies only help if you’re getting blocked by IP. If the site blocks based on User-Agent or behavior, proxies won’t help. Try fixing the headers and request patterns first before adding proxies.


Summary

403 errors in web scraping come from a handful of causes, and this guide covered 11 fixes for them:

  • Fix 1: Add a realistic User-Agent header
  • Fix 2: Add Accept, Referer, and other browser headers
  • Fix 3: Handle cookies and maintain session state
  • Fix 4: Respect rate limiting and add delays
  • Fix 5: Follow HTTP redirects
  • Fix 6: Detect and handle Cloudflare
  • Fix 7: Use proxy rotation for IP-based blocking
  • Fix 8: Don’t scrape protected URLs
  • Fix 9: Implement exponential backoff
  • Fix 10: Check robots.txt and terms of service first
  • Fix 11: Use a headless browser for advanced detection

Start with Fix 1 and 2. Most 403 errors clear with just a proper User-Agent and headers. If not, move to the others in order. For production scraping at scale, see the web scraping at scale guide for architecture patterns and resilience.
