Getting blocked is the most frustrating part of web scraping. Your script works perfectly on the first run, then starts returning 403 errors, empty responses, or CAPTCHA pages – and the only thing that changed is the site noticed you.
Most blocking happens because scrapers make requests that look nothing like a real browser. The fixes aren’t complicated but they need to be applied together – one technique alone rarely works on sites with serious bot detection.
This guide covers 7 proven techniques to avoid getting blocked while web scraping in PHP, with working code for each one. Start with techniques 1 and 2 – they fix the majority of blocking issues on standard websites. Add the others progressively if you’re still getting blocked.
How Websites Detect Scrapers
Before fixing the problem it helps to understand what triggers the block. Websites look for patterns that don’t match real browser behavior:
- Missing or suspicious headers – real browsers send 8-10 headers on every request. A cURL request with no headers configured sends only a minimal set, such as Host and Accept: */*.
- Request timing – humans take 5-30 seconds to read a page before clicking a link. Scripts fire requests in milliseconds.
- No cookies – real browsers accumulate session cookies across requests. Scripts with no cookies on every hit look automated.
- Same IP, high volume – 500 requests from one IP in 10 minutes is impossible for a human.
- TLS fingerprint – advanced sites check the cryptographic handshake pattern. cURL’s TLS fingerprint is different from Chrome’s even with identical headers.
Each technique below addresses one or more of these signals. Working through them in order handles progressively more sophisticated detection.
Technique 1: Send a Complete Browser Header Set
The single most common reason PHP scrapers get blocked is sending incomplete headers. Most tutorials tell you to add a User-Agent and stop there. That’s not enough – real browsers send 8-10 headers on every request, and sites that check for bots look at the full set, not just the User-Agent.
What a Real Browser Sends
Open Chrome DevTools → Network tab → click any request → Headers. You’ll see something like this:
GET / HTTP/1.1
Host: books.toscrape.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1
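What a basic curl request sends without extra headers (this is the command-line tool; PHP’s cURL extension sends no User-Agent at all unless you set one):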
GET / HTTP/1.1
Host: books.toscrape.com
User-Agent: curl/8.1.2
Accept: */*
The difference is immediately obvious to any bot detection system.
The Fix – Complete Header Set
<?php
function scrape_url($url) {
$ch = curl_init();
curl_setopt_array($ch, [
CURLOPT_URL => $url,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_CONNECTTIMEOUT => 10,
CURLOPT_TIMEOUT => 30,
CURLOPT_ENCODING => '', // handle gzip/deflate automatically
// Send headers that match what Chrome actually sends
CURLOPT_HTTPHEADER => [
'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
'Accept-Language: en-US,en;q=0.5',
// List only encodings your libcurl build can decode - drop "br" if it lacks Brotli support
'Accept-Encoding: gzip, deflate, br',
'Connection: keep-alive',
'Upgrade-Insecure-Requests: 1',
'Sec-Fetch-Dest: document',
'Sec-Fetch-Mode: navigate',
'Sec-Fetch-Site: none',
'Sec-Fetch-User: ?1',
],
]);
$html = curl_exec($ch);
$errno = curl_errno($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
if ($errno || $httpCode !== 200) {
echo "Failed: HTTP $httpCode" . PHP_EOL;
return false;
}
return $html;
}
$html = scrape_url("https://books.toscrape.com/");
if ($html) {
echo "Fetched successfully. Length: " . strlen($html) . " bytes." . PHP_EOL;
}
?>
Output:
Fetched successfully. Length: 51274 bytes.
Adding a Referer Header
Real users navigate between pages – they come from somewhere. When scraping product pages, add a Referer header pointing to the category or listing page you “came from”:
<?php
// Scraping a product page - add Referer showing you came from the listing
curl_setopt($ch, CURLOPT_HTTPHEADER, [
'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language: en-US,en;q=0.5',
'Referer: https://books.toscrape.com/', // came from the homepage
'Connection: keep-alive',
]);
?>
Rotating User Agents
Sending the same User-Agent on every request across hundreds of pages is a detectable pattern. Rotate through a list of real browser strings:
<?php
function get_random_user_agent() {
$agents = [
// Chrome on Windows
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
// Chrome on Mac
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36',
// Firefox on Windows
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0',
// Safari on Mac
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15',
// Chrome on Linux
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
// Edge on Windows
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36 Edg/124.0.0.0',
];
return $agents[array_rand($agents)];
}
// Use in your request ($ch is an existing cURL handle)
$agent = get_random_user_agent(); // pick once so the log matches what was sent
curl_setopt($ch, CURLOPT_HTTPHEADER, [
'User-Agent: ' . $agent,
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language: en-US,en;q=0.5',
'Connection: keep-alive',
]);
echo "Using agent: " . $agent . PHP_EOL;
?>
Output:
Using agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36
Verifying Your Headers Are Being Sent
Use httpbin.org to confirm what headers your script is actually sending – it reflects everything back as JSON:
<?php
$html = scrape_url("https://httpbin.org/headers");
if ($html) {
$data = json_decode($html, true);
$headers = $data['headers'] ?? [];
echo "Headers received by server:" . PHP_EOL;
foreach ($headers as $name => $value) {
echo " $name: $value" . PHP_EOL;
}
}
?>
Output:
Headers received by server:
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.5
Connection: keep-alive
Host: httpbin.org
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36
This confirms the server sees a complete browser-like header set. If any critical headers are missing they’ll show as absent here – fix them before scraping your actual target.
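Checking the reflected headers by eye gets tedious once the list grows. The following is a minimal sketch of an automated check, assuming the input is the decoded 'headers' array from httpbin.org/headers. The function name find_missing_headers and the list of required headers are illustrative choices, not part of any library.

```php
<?php
// Hypothetical helper: list expected header names absent from what the server saw.
function find_missing_headers(array $received, array $required): array {
    // HTTP header names are case-insensitive, so compare in lowercase
    $seen = array_map('strtolower', array_keys($received));
    $missing = [];
    foreach ($required as $name) {
        if (!in_array(strtolower($name), $seen, true)) {
            $missing[] = $name;
        }
    }
    return $missing;
}

// Example input shaped like the 'headers' object returned by httpbin.org/headers
$received = [
    'Accept'     => 'text/html',
    'Host'       => 'httpbin.org',
    'User-Agent' => 'Mozilla/5.0',
];
// Assumed list of headers you consider essential; adjust to your target
$required = ['User-Agent', 'Accept', 'Accept-Language', 'Sec-Fetch-Mode'];

$missing = find_missing_headers($received, $required);
if ($missing) {
    echo "Missing: " . implode(', ', $missing) . PHP_EOL;
} else {
    echo "All required headers present." . PHP_EOL;
}
```

Running a check like this before each scraping job can catch a header that was dropped by accident, for example when a later curl_setopt call overwrites the whole CURLOPT_HTTPHEADER list.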
Technique 2: Add Random Delays Between Requests
Request timing is one of the easiest bot signals to detect. A human reads a page for 5-30 seconds before clicking to the next one. A script fires requests as fast as the network allows – often 10-50 per second. The difference is obvious to any rate limiting system.
Fixed delays help but are still detectable. sleep(2) on every request creates a perfectly regular pattern – a request every exactly 2.000 seconds. Real humans don’t browse with that kind of precision. Random delays are harder to fingerprint.
Basic Random Delay
<?php
function random_delay($minSeconds = 1, $maxSeconds = 3) {
// usleep works in microseconds - multiply seconds by 1,000,000
$microseconds = rand($minSeconds * 1000000, $maxSeconds * 1000000);
usleep($microseconds);
$actual = round($microseconds / 1000000, 2);
echo "Waited {$actual}s" . PHP_EOL;
}
// Usage between requests
$urls = [
"https://books.toscrape.com/catalogue/page-1.html",
"https://books.toscrape.com/catalogue/page-2.html",
"https://books.toscrape.com/catalogue/page-3.html",
];
foreach ($urls as $url) {
$html = scrape_url($url);
if ($html) {
echo "Fetched: $url" . PHP_EOL;
}
random_delay(1, 3); // wait 1-3 seconds before next request
}
?>
Output:
Fetched: https://books.toscrape.com/catalogue/page-1.html
Waited 2.34s
Fetched: https://books.toscrape.com/catalogue/page-2.html
Waited 1.18s
Fetched: https://books.toscrape.com/catalogue/page-3.html
Waited 2.87s
The varying gaps look like a real user clicking through pages at a natural pace.
Delay Based on Page Type
Not every request needs the same delay. Listing pages load fast and users move through them quickly. Product detail pages take longer to read. Matching delay length to page type makes the pattern more realistic:
<?php
function page_type_delay($pageType = 'listing') {
$delays = [
'listing' => [1, 3], // quick - just scanning titles
'detail' => [3, 8], // longer - reading the full product page
'search' => [2, 5], // medium - reviewing search results
'api' => [0, 1], // minimal - API calls are expected to be fast
];
$range = $delays[$pageType] ?? [1, 3];
$microseconds = rand($range[0] * 1000000, $range[1] * 1000000);
usleep($microseconds);
$actual = round($microseconds / 1000000, 2);
echo "[$pageType delay] Waited {$actual}s" . PHP_EOL;
}
// Simulate realistic browsing pattern
$html = scrape_url("https://books.toscrape.com/");
page_type_delay('listing');
$html = scrape_url("https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html");
page_type_delay('detail');
$html = scrape_url("https://books.toscrape.com/catalogue/page-2.html");
page_type_delay('listing');
?>
Output:
[listing delay] Waited 1.73s
[detail delay] Waited 5.42s
[listing delay] Waited 2.11s
Tracking Request Rate
Delays alone don’t tell you your actual request rate. On slow servers where each request takes 3-4 seconds, you’re already well under any limit. On fast servers with 500ms responses, even a 1-second delay still means 40 requests per minute. Track the rate explicitly:
<?php
class RateLimiter {
private $maxRequestsPerMinute;
private $requestTimes = [];
public function __construct($maxRequestsPerMinute = 20) {
$this->maxRequestsPerMinute = $maxRequestsPerMinute;
}
public function wait() {
$now = microtime(true);
$oneMinAgo = $now - 60;
// Remove requests older than 1 minute from tracking
$this->requestTimes = array_filter(
$this->requestTimes,
fn($time) => $time > $oneMinAgo
);
// If at limit, wait until oldest request falls outside window
if (count($this->requestTimes) >= $this->maxRequestsPerMinute) {
$oldestRequest = min($this->requestTimes);
$waitTime = ($oldestRequest + 60) - $now;
if ($waitTime > 0) {
echo "Rate limit reached - waiting " . round($waitTime, 1) . "s" . PHP_EOL;
usleep((int)($waitTime * 1000000));
}
}
// Always add a small random delay regardless
$baseDelay = rand(500000, 2000000); // 0.5 to 2 seconds
usleep($baseDelay);
$this->requestTimes[] = microtime(true);
}
public function getRequestCount() {
return count($this->requestTimes);
}
}
// Usage
$limiter = new RateLimiter(15); // max 15 requests per minute
$urls = [
"https://books.toscrape.com/catalogue/page-1.html",
"https://books.toscrape.com/catalogue/page-2.html",
"https://books.toscrape.com/catalogue/page-3.html",
];
foreach ($urls as $url) {
$limiter->wait();
$html = scrape_url($url);
if ($html) {
echo "Fetched page - requests this minute: " . $limiter->getRequestCount() . PHP_EOL;
}
}
?>
Output:
Fetched page - requests this minute: 1
Fetched page - requests this minute: 2
Fetched page - requests this minute: 3
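The arithmetic behind the request-rate figures in this section can be made explicit. This is a small sketch assuming a strictly sequential scraper, where each cycle is one response time plus one delay. The function names requests_per_minute and delay_for_rate are hypothetical.

```php
<?php
// Hypothetical helper: sustained requests per minute for a sequential scraper.
function requests_per_minute(float $responseSeconds, float $delaySeconds): float {
    $cycle = $responseSeconds + $delaySeconds; // time spent per request
    if ($cycle <= 0) {
        throw new InvalidArgumentException('Cycle time must be positive.');
    }
    return 60 / $cycle;
}

// Hypothetical helper: delay needed to stay at or under a target rate.
function delay_for_rate(float $responseSeconds, float $maxPerMinute): float {
    if ($maxPerMinute <= 0) {
        throw new InvalidArgumentException('Target rate must be positive.');
    }
    // Time budget per request minus the time the server already takes
    return max(0.0, 60 / $maxPerMinute - $responseSeconds);
}

// 500ms responses with a 1-second delay
echo requests_per_minute(0.5, 1.0) . PHP_EOL;
// Delay needed to hold 500ms responses to 15 requests per minute
echo delay_for_rate(0.5, 15) . PHP_EOL;
```

With 500ms responses and a 1-second delay the first call gives 40 requests per minute, and holding the same server to 15 requests per minute needs a delay of about 3.5 seconds.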
Adding Occasional Longer Pauses
Real users take breaks – they answer a message, get a drink, read something else. Adding occasional longer pauses every N requests makes the pattern less uniform:
<?php
$requestCount = 0;
foreach ($urls as $url) {
$requestCount++;
$html = scrape_url($url);
if ($html) {
echo "Request $requestCount fetched." . PHP_EOL;
}
// Every 10 requests take a longer break
if ($requestCount % 10 === 0) {
$breakTime = rand(15, 45); // 15-45 second break
echo "Taking a {$breakTime}s break after $requestCount requests..." . PHP_EOL;
sleep($breakTime);
} else {
random_delay(1, 3);
}
}
?>
Output after 10 requests:
Request 1 fetched.
Request 2 fetched.
...
Request 10 fetched.
Taking a 27s break after 10 requests...
Choosing the Right Delay for Each Site
There’s no universal delay that works everywhere. As a starting point:
- Small sites and blogs – 1-3 seconds. Low traffic, less sophisticated detection.
- Medium e-commerce sites – 2-5 seconds. More likely to have rate limiting.
- Large platforms – 5-10 seconds minimum. Serious bot detection, worth being cautious.
- If you get a 429 (Too Many Requests) – double your delay immediately and add a longer recovery pause before continuing.
<?php
function handle_rate_limit_response($httpCode, &$baseDelay) {
if ($httpCode === 429) {
$baseDelay = min($baseDelay * 2, 60); // double delay, max 60s
echo "Rate limited (429). New base delay: {$baseDelay}s. Pausing 30s..." . PHP_EOL;
sleep(30);
return true;
}
return false;
}
$baseDelay = 2;
$maxRetries = 3; // stop retrying a URL after this many 429 responses
foreach ($urls as $url) {
$httpCode = 0;
for ($attempt = 0; $attempt <= $maxRetries; $attempt++) {
$ch = curl_init();
curl_setopt_array($ch, [
CURLOPT_URL => $url,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_HTTPHEADER => [
'User-Agent: ' . get_random_user_agent(),
],
]);
$html = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
// On a 429 this pauses and doubles $baseDelay, then the loop retries the same URL
if (!handle_rate_limit_response($httpCode, $baseDelay)) {
break; // not rate limited - stop retrying
}
}
if ($httpCode === 200) {
echo "Fetched: $url" . PHP_EOL;
}
// Apply current base delay with randomness
$delay = rand($baseDelay * 1000000, ($baseDelay + 2) * 1000000);
usleep($delay);
}
?>
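Some servers send a Retry-After header with a 429 response, giving either a number of seconds or an HTTP date. When it is present, it is usually a better guide than a fixed 30-second pause. Below is a minimal sketch of parsing that value. It assumes you have already captured the header string yourself (for example with CURLOPT_HEADERFUNCTION, which is not shown here). The function name retry_after_seconds, the 30-second fallback, and the 300-second cap are illustrative assumptions.

```php
<?php
// Hypothetical helper: turn a Retry-After header value into seconds to wait.
// The value is either a number of seconds or an HTTP date.
function retry_after_seconds(?string $value, int $fallback = 30, int $max = 300, ?int $now = null): int {
    $now = $now ?? time();
    if ($value === null || trim($value) === '') {
        return $fallback; // header absent - use the default pause
    }
    $value = trim($value);
    if (ctype_digit($value)) {
        return min((int)$value, $max); // delta-seconds form, capped
    }
    $timestamp = strtotime($value); // HTTP-date form
    if ($timestamp === false) {
        return $fallback; // unparseable value
    }
    // Dates in the past mean "retry now"; far-future dates are capped
    return min(max(0, $timestamp - $now), $max);
}

// Seconds form
echo retry_after_seconds('120') . PHP_EOL;
// Header missing - falls back to the default
echo retry_after_seconds(null) . PHP_EOL;
```

The cap keeps a misconfigured or hostile server from stalling the scraper for hours. Whether to honor very long waits or give up on the URL is a design choice left to the caller.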
Technique 3: Handle Cookies and Session State
Every time you open a real website your browser receives cookies – session identifiers, tracking tokens, preference settings. On the next request it sends those cookies back automatically. This back-and-forth is how sites know you’re the same person who visited the previous page.
A scraper that sends zero cookies on every request looks immediately suspicious. Sites that use session-based detection see a new anonymous visitor on each hit – no session history, no cookies, no continuity. Many modern sites use cookies specifically to detect bots.
Basic Cookie Jar Setup
<?php
$cookieFile = __DIR__ . '/cookies.txt';
function scrape_with_cookies($url, $cookieFile) {
$ch = curl_init();
curl_setopt_array($ch, [
CURLOPT_URL => $url,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_CONNECTTIMEOUT => 10,
CURLOPT_TIMEOUT => 30,
CURLOPT_ENCODING => '',
// Read cookies from file on each request
CURLOPT_COOKIEFILE => $cookieFile,
// Write new cookies to the same file after each request
CURLOPT_COOKIEJAR => $cookieFile,
CURLOPT_HTTPHEADER => [
'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language: en-US,en;q=0.5',
'Connection: keep-alive',
],
]);
$html = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
if ($httpCode !== 200) {
echo "HTTP $httpCode on $url" . PHP_EOL;
return false;
}
return $html;
}
// First request - site sets cookies, they get saved to cookies.txt
$html = scrape_with_cookies("https://books.toscrape.com/", $cookieFile);
echo "Homepage fetched." . PHP_EOL;
// Second request - cookies.txt is read and sent automatically
$html = scrape_with_cookies("https://books.toscrape.com/catalogue/page-2.html", $cookieFile);
echo "Page 2 fetched with session cookies." . PHP_EOL;
?>
Output:
Homepage fetched.
Page 2 fetched with session cookies.
After the first request a cookies.txt file appears in your script’s directory containing any cookies the site set. Every subsequent request reads from and writes to that file automatically – exactly like a real browser session.
Simulating a Real Browsing Session
Real users don’t land directly on deep pages. They visit the homepage first, then navigate. Simulate this flow before hitting the pages you actually want to scrape:
<?php
function simulate_browsing_session($targetUrl, $cookieFile) {
$domain = parse_url($targetUrl, PHP_URL_SCHEME) . '://' .
parse_url($targetUrl, PHP_URL_HOST);
echo "Step 1: Visiting homepage to establish session..." . PHP_EOL;
$homepage = scrape_with_cookies($domain . '/', $cookieFile);
if (!$homepage) {
echo "Could not reach homepage." . PHP_EOL;
return false;
}
// Pause like a real user reading the homepage
$delay = rand(2000000, 5000000);
usleep($delay);
echo "Step 2: Navigating to target page..." . PHP_EOL;
// Now fetch the actual target with established session
$html = scrape_with_cookies($targetUrl, $cookieFile);
return $html;
}
$cookieFile = __DIR__ . '/session_cookies.txt';
$targetUrl = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html";
$html = simulate_browsing_session($targetUrl, $cookieFile);
if ($html) {
echo "Target page fetched with full session context." . PHP_EOL;
}
?>
Output:
Step 1: Visiting homepage to establish session...
Step 2: Navigating to target page...
Target page fetched with full session context.
Inspecting What Cookies Were Set
<?php
function get_cookies_set($cookieFile) {
if (!file_exists($cookieFile) || filesize($cookieFile) === 0) {
echo "No cookies file found or file is empty." . PHP_EOL;
return [];
}
$cookies = [];
$lines = file($cookieFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
foreach ($lines as $line) {
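// cURL writes HttpOnly cookies with a "#HttpOnly_" prefix - keep those, skip real comments
if (strpos($line, '#HttpOnly_') === 0) {
$line = substr($line, strlen('#HttpOnly_'));
} elseif (strpos($line, '#') === 0) {
continue;
}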
$parts = explode("\t", $line);
if (count($parts) >= 7) {
$cookies[] = [
'domain' => $parts[0],
'name' => $parts[5],
'value' => $parts[6],
'expires' => (int)$parts[4] ? date('Y-m-d', (int)$parts[4]) : 'session',
];
}
}
return $cookies;
}
// Fetch a page first to populate cookies
scrape_with_cookies("https://books.toscrape.com/", $cookieFile);
// Then inspect what was set
$cookies = get_cookies_set($cookieFile);
if (empty($cookies)) {
echo "No cookies were set by this site." . PHP_EOL;
} else {
echo "Cookies set by site:" . PHP_EOL;
foreach ($cookies as $cookie) {
echo " {$cookie['name']}: {$cookie['value']} (expires: {$cookie['expires']})" . PHP_EOL;
}
}
?>
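Example output (values are illustrative; a site that sets no cookies prints the "No cookies" message instead):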
Cookies set by site:
session_id: a8f3k2p9x1 (expires: session)
csrftoken: xK9mP2qL8nVwR4tY7bJc (expires: 2026-12-31)
Managing Cookie Files for Multiple Sessions
When scraping multiple sites or running parallel scrapers, use separate cookie files for each session to prevent cross-contamination:
<?php
function get_cookie_file($sessionName) {
$cookieDir = __DIR__ . '/cookies/';
if (!is_dir($cookieDir)) {
mkdir($cookieDir, 0755, true);
}
return $cookieDir . preg_replace('/[^a-z0-9_]/', '_', strtolower($sessionName)) . '.txt';
}
// Each site gets its own cookie file
$booksCookie = get_cookie_file('books_toscrape');
$exampleCookie = get_cookie_file('example_site');
scrape_with_cookies("https://books.toscrape.com/", $booksCookie);
echo "Books session: " . $booksCookie . PHP_EOL;
?>
Output:
Books session: /var/www/html/scraper/cookies/books_toscrape.txt
Clearing Cookies Between Fresh Sessions
Sometimes you need a completely fresh session – no cookie history. Empty the file rather than deleting it: cURL can cope with a missing cookie file, but emptying it keeps the path and its permissions in place for the next run:
<?php
function clear_session($cookieFile) {
if (file_exists($cookieFile)) {
file_put_contents($cookieFile, '');
echo "Session cleared: $cookieFile" . PHP_EOL;
}
}
// Clear between scraping runs that should appear as new visitors
clear_session($cookieFile);
// Now start a fresh session
scrape_with_cookies("https://books.toscrape.com/", $cookieFile);
echo "Fresh session started." . PHP_EOL;
?>
Output:
Session cleared: /var/www/html/scraper/cookies/books_toscrape.txt
Fresh session started.
Use fresh sessions when a site tracks visitor history and you need to appear as a new user. Keep the same session when scraping across multiple pages of the same site – continuity looks more natural than a new cookie jar on every request.
Technique 4: Rotate Proxies to Distribute Requests
Headers, delays, and cookies handle most blocking issues on standard websites. When you’re scraping at higher volume or hitting sites with IP-based rate limiting, a single IP address becomes the bottleneck. Every request comes from the same place – easy to detect and block.
Proxies route your requests through different IP addresses so no single IP accumulates enough requests to trigger a block. The site sees traffic from multiple locations instead of one script hammering from one address.
Basic Proxy Setup With cURL
<?php
function scrape_with_proxy($url, $proxy, $cookieFile = null) {
$ch = curl_init();
curl_setopt_array($ch, [
CURLOPT_URL => $url,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_CONNECTTIMEOUT => 15, // slightly longer for proxy connections
CURLOPT_TIMEOUT => 45,
CURLOPT_ENCODING => '',
// Proxy settings
CURLOPT_PROXY => $proxy['host'] . ':' . $proxy['port'],
CURLOPT_PROXYTYPE => $proxy['type'] ?? CURLPROXY_HTTP,
CURLOPT_HTTPHEADER => [
'User-Agent: ' . get_random_user_agent(),
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language: en-US,en;q=0.5',
'Connection: keep-alive',
],
]);
// Add proxy authentication if required
if (!empty($proxy['username']) && !empty($proxy['password'])) {
curl_setopt($ch, CURLOPT_PROXYUSERPWD, $proxy['username'] . ':' . $proxy['password']);
}
// Add cookie file if provided
if ($cookieFile) {
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
}
$html = curl_exec($ch);
$errno = curl_errno($ch);
$error = curl_error($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
if ($errno) {
echo "Proxy error: $error" . PHP_EOL;
return false;
}
if ($httpCode !== 200) {
echo "HTTP $httpCode via proxy {$proxy['host']}" . PHP_EOL;
return false;
}
return $html;
}
// Example proxy configuration
$proxy = [
'host' => '123.45.67.89',
'port' => '8080',
'type' => CURLPROXY_HTTP,
'username' => 'proxy_user', // leave empty if no auth required
'password' => 'proxy_pass',
];
$html = scrape_with_proxy("https://books.toscrape.com/", $proxy);
if ($html) {
echo "Fetched via proxy. Length: " . strlen($html) . " bytes." . PHP_EOL;
}
?>
Output:
Fetched via proxy. Length: 51274 bytes.
Proxy Types – HTTP, SOCKS4, SOCKS5
<?php
// HTTP proxy - most common, works for standard scraping
$httpProxy = [
'host' => '123.45.67.89',
'port' => '8080',
'type' => CURLPROXY_HTTP,
];
// SOCKS4 proxy - older protocol, no username/password authentication
$socks4Proxy = [
'host' => '123.45.67.89',
'port' => '1080',
'type' => CURLPROXY_SOCKS4,
];
// SOCKS5 proxy - supports authentication and DNS resolution through proxy
// CURLPROXY_SOCKS5_HOSTNAME lets the proxy resolve hostnames;
// plain CURLPROXY_SOCKS5 resolves them on your own machine
$socks5Proxy = [
'host' => '123.45.67.89',
'port' => '1080',
'type' => CURLPROXY_SOCKS5_HOSTNAME,
'username' => 'user',
'password' => 'pass',
];
?>
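Use SOCKS5 when available, but pick the right constant: in cURL, CURLPROXY_SOCKS5 still resolves hostnames on your own machine, while CURLPROXY_SOCKS5_HOSTNAME sends the hostname to the proxy for resolution. The second option prevents DNS leaks that can expose your real location even when using a proxy.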
Building a Proxy Rotation System
<?php
class ProxyRotator {
private $proxies = [];
private $failedProxies = [];
private $currentIndex = 0;
private $requestCounts = [];
public function __construct(array $proxies) {
$this->proxies = $proxies;
shuffle($this->proxies); // randomize order on start
}
public function getNextProxy() {
$available = array_filter(
$this->proxies,
fn($proxy) => !in_array($proxy['host'], $this->failedProxies)
);
if (empty($available)) {
echo "All proxies failed - resetting failed list." . PHP_EOL;
$this->failedProxies = [];
$available = $this->proxies;
}
$available = array_values($available);
$proxy = $available[$this->currentIndex % count($available)];
$this->currentIndex++;
// Track request count per proxy
$key = $proxy['host'];
$this->requestCounts[$key] = ($this->requestCounts[$key] ?? 0) + 1;
return $proxy;
}
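public function markFailed($proxy) {
// Record each host once so the failed/active counts in getStats() stay accurate
if (!in_array($proxy['host'], $this->failedProxies)) {
$this->failedProxies[] = $proxy['host'];
}
echo "Proxy marked as failed: {$proxy['host']}" . PHP_EOL;
}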
public function getStats() {
return [
'total' => count($this->proxies),
'failed' => count($this->failedProxies),
'active' => count($this->proxies) - count($this->failedProxies),
'request_counts' => $this->requestCounts,
];
}
}
// Define your proxy pool
$proxies = [
['host' => '123.45.67.89', 'port' => '8080', 'type' => CURLPROXY_HTTP],
['host' => '123.45.67.90', 'port' => '8080', 'type' => CURLPROXY_HTTP],
['host' => '123.45.67.91', 'port' => '8080', 'type' => CURLPROXY_HTTP],
['host' => '123.45.67.92', 'port' => '1080', 'type' => CURLPROXY_SOCKS5,
'username' => 'user', 'password' => 'pass'],
];
$rotator = new ProxyRotator($proxies);
$urls = [
"https://books.toscrape.com/catalogue/page-1.html",
"https://books.toscrape.com/catalogue/page-2.html",
"https://books.toscrape.com/catalogue/page-3.html",
"https://books.toscrape.com/catalogue/page-4.html",
];
foreach ($urls as $url) {
$proxy = $rotator->getNextProxy();
$html = scrape_with_proxy($url, $proxy);
if ($html) {
echo "Fetched via {$proxy['host']}: $url" . PHP_EOL;
} else {
$rotator->markFailed($proxy);
echo "Retrying $url without failed proxy..." . PHP_EOL;
// Retry with next proxy
$proxy = $rotator->getNextProxy();
$html = scrape_with_proxy($url, $proxy);
}
random_delay(1, 3);
}
// Print stats
$stats = $rotator->getStats();
echo PHP_EOL . "Proxy stats:" . PHP_EOL;
echo "Total proxies: {$stats['total']}" . PHP_EOL;
echo "Active: {$stats['active']}" . PHP_EOL;
echo "Failed: {$stats['failed']}" . PHP_EOL;
?>
Output:
Fetched via 123.45.67.89: https://books.toscrape.com/catalogue/page-1.html
Fetched via 123.45.67.90: https://books.toscrape.com/catalogue/page-2.html
Fetched via 123.45.67.91: https://books.toscrape.com/catalogue/page-3.html
Fetched via 123.45.67.92: https://books.toscrape.com/catalogue/page-4.html
Proxy stats:
Total proxies: 4
Active: 4
Failed: 0
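The loop above retries a failed URL once and does not check whether the retry succeeded. A bounded retry that walks through several proxies can be sketched as follows. This is an illustration under assumptions, not part of the ProxyRotator class: the function name fetch_with_rotation is hypothetical, and $fetch is a stand-in for a function such as scrape_with_proxy so the logic can be exercised without network access.

```php
<?php
// Hypothetical helper: try successive proxies until one returns content.
// $fetch is any callable taking ($url, $proxy) and returning HTML or false.
function fetch_with_rotation(string $url, array $proxies, callable $fetch, int $maxAttempts = 3): array {
    $proxies = array_values($proxies); // ensure 0-based indexes
    $failed = [];
    $attempts = min($maxAttempts, count($proxies));
    for ($i = 0; $i < $attempts; $i++) {
        $proxy = $proxies[$i];
        $html = $fetch($url, $proxy);
        if ($html !== false) {
            return ['html' => $html, 'proxy' => $proxy, 'failed' => $failed];
        }
        $failed[] = $proxy['host']; // remember which proxies did not work
    }
    return ['html' => false, 'proxy' => null, 'failed' => $failed];
}

// Stand-in fetcher for demonstration: pretends the first proxy is dead
$fakeFetch = function ($url, $proxy) {
    return $proxy['host'] === '123.45.67.89' ? false : '<html>ok</html>';
};
$proxies = [
    ['host' => '123.45.67.89', 'port' => '8080'],
    ['host' => '123.45.67.90', 'port' => '8080'],
];

$result = fetch_with_rotation('https://books.toscrape.com/', $proxies, $fakeFetch);
if ($result['html'] !== false) {
    echo "Fetched via {$result['proxy']['host']}; failed proxies: " . count($result['failed']) . PHP_EOL;
} else {
    echo "All attempts failed." . PHP_EOL;
}
```

Returning the list of failed hosts lets the caller decide what to do with them, for example passing each one to the rotator's markFailed method.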
Testing a Proxy Before Using It
Free proxies are often dead or too slow to be useful. Test each proxy before adding it to your rotation:
<?php
function test_proxy($proxy, $testUrl = "https://httpbin.org/ip", $timeout = 10) {
$ch = curl_init();
curl_setopt_array($ch, [
CURLOPT_URL => $testUrl,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_PROXY => $proxy['host'] . ':' . $proxy['port'],
CURLOPT_PROXYTYPE => $proxy['type'] ?? CURLPROXY_HTTP,
CURLOPT_TIMEOUT => $timeout,
CURLOPT_CONNECTTIMEOUT => $timeout,
]);
$start = microtime(true);
$response = curl_exec($ch);
$errno = curl_errno($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
$responseTime = round((microtime(true) - $start) * 1000); // ms
if ($errno || $httpCode !== 200) {
echo "Proxy {$proxy['host']} - FAILED (HTTP $httpCode)" . PHP_EOL;
return false;
}
$data = json_decode($response, true);
$proxyIp = $data['origin'] ?? 'unknown';
echo "Proxy {$proxy['host']} - OK | IP: $proxyIp | Response: {$responseTime}ms" . PHP_EOL;
return true;
}
// Test all proxies before starting
$workingProxies = [];
foreach ($proxies as $proxy) {
if (test_proxy($proxy)) {
$workingProxies[] = $proxy;
}
}
echo PHP_EOL . count($workingProxies) . "/" . count($proxies) . " proxies working." . PHP_EOL;
?>
Output:
Proxy 123.45.67.89 - OK | IP: 123.45.67.89 | Response: 342ms
Proxy 123.45.67.90 - OK | IP: 123.45.67.90 | Response: 521ms
Proxy 123.45.67.91 - FAILED (HTTP 403)
Proxy 123.45.67.92 - OK | IP: 123.45.67.92 | Response: 287ms
3/4 proxies working.
Where to Get Proxies
Free proxies found online are mostly dead, slow, or actively malicious. For serious scraping use paid proxy services that maintain clean IP pools:
- Residential proxies – real home IP addresses, hardest to detect, most expensive. Worth it for sites with aggressive bot detection.
- Datacenter proxies – faster and cheaper than residential, but easier for sites to identify as proxies. Fine for most scraping tasks.
- Rotating proxy services – give you a single endpoint that automatically rotates IPs on each request. Simplest to integrate – no rotation code needed on your end.
For small scraping projects – a few hundred pages per day – you don’t need proxies at all. Focus on headers, delays, and cookies first. Add proxies only when you’re hitting volume limits that those techniques can’t solve.
Technique 5: Detect and Handle Blocks in Responses
A 403 status code is the obvious block. The harder problem is when a site returns 200 with a CAPTCHA page, a bot check, or a completely different page than what you requested. cURL reports success – your script carries on – and you end up saving garbage data without knowing anything went wrong.
Detecting blocks in the response body, not just the status code, is what separates scrapers that fail silently from ones that actually tell you what’s happening.
Checking the HTTP Status Code
<?php
function check_http_status($httpCode, $url) {
switch (true) {
case $httpCode === 200:
return 'ok';
case $httpCode === 403:
echo "Blocked (403 Forbidden): $url" . PHP_EOL;
return 'blocked';
case $httpCode === 404:
echo "Not found (404): $url" . PHP_EOL;
return 'not_found';
case $httpCode === 429:
echo "Rate limited (429 Too Many Requests): $url" . PHP_EOL;
return 'rate_limited';
case $httpCode === 503:
echo "Service unavailable (503): $url" . PHP_EOL;
return 'unavailable';
case $httpCode >= 500:
echo "Server error ($httpCode): $url" . PHP_EOL;
return 'server_error';
case $httpCode === 0:
echo "No response received: $url" . PHP_EOL;
return 'no_response';
default:
echo "Unexpected status $httpCode: $url" . PHP_EOL;
return 'unknown';
}
}
// Usage
$ch = curl_init();
curl_setopt_array($ch, [
CURLOPT_URL => "https://books.toscrape.com/",
CURLOPT_RETURNTRANSFER => true,
CURLOPT_HTTPHEADER => [
'User-Agent: ' . get_random_user_agent(),
],
]);
$html = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
$status = check_http_status($httpCode, "https://books.toscrape.com/");
echo "Status: $status" . PHP_EOL;
?>
Output on success:
Status: ok
Output when blocked:
Blocked (403 Forbidden): https://books.toscrape.com/
Status: blocked
Detecting Soft Blocks in the Response Body
Soft blocks return HTTP 200 but serve a challenge page instead of the real content. Without checking the body you’d never know:
<?php
function detect_block_in_response($html, $url) {
if (!$html) return 'empty_response';
$lowerHtml = strtolower($html);
// Common block page indicators
$blockSignals = [
// 'recaptcha' must come first: it contains 'captcha' and would never match otherwise
'recaptcha' => 'reCAPTCHA detected',
'captcha' => 'CAPTCHA detected',
'please verify you are human'=> 'Human verification required',
'access denied' => 'Access denied page',
'blocked' => 'Block page detected',
'unusual traffic' => 'Unusual traffic warning',
'security check' => 'Security check page',
'cf-browser-verification' => 'Cloudflare browser check',
'ray id' => 'Cloudflare block',
'ddos-guard' => 'DDoS protection block',
'robot or human' => 'Bot detection page',
'automated access' => 'Automated access block',
'too many requests' => 'Rate limit page',
'suspicious activity' => 'Suspicious activity block',
];
foreach ($blockSignals as $signal => $description) {
if (strpos($lowerHtml, $signal) !== false) {
echo "Block detected on $url: $description" . PHP_EOL;
return $signal;
}
}
return false;
}
$html = scrape_url("https://books.toscrape.com/");
$block = detect_block_in_response($html, "https://books.toscrape.com/");
if ($block) {
echo "Scraper is being blocked." . PHP_EOL;
} else {
echo "Response looks clean - proceeding." . PHP_EOL;
}
?>
Output on clean response:
Response looks clean - proceeding.
Output on Cloudflare block:
Block detected on https://example.com/: Cloudflare block
Scraper is being blocked.
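Substring checks over the whole body can misfire when a legitimate page simply contains a word like "blocked". One way to reduce false positives, sketched below, is to look only at the page's title. The function name title_block_signal() and the list of title phrases are illustrative assumptions, not part of the scraper above. Adjust the phrases to the challenge pages you actually encounter.

```php
<?php
// Hypothetical helper: look for challenge-page phrases in the <title> only.
function title_block_signal($html) {
    // Extract the first <title> element, if there is one
    if (!preg_match('/<title[^>]*>(.*?)<\/title>/is', (string) $html, $m)) {
        return false;
    }
    $title = strtolower(trim(html_entity_decode($m[1])));
    // Assumed phrases - extend with what you observe on your targets
    $phrases = ['just a moment', 'attention required', 'access denied', 'captcha'];
    foreach ($phrases as $phrase) {
        if (strpos($title, $phrase) !== false) {
            return $phrase;
        }
    }
    return false;
}

$challenge = '<html><head><title>Just a moment...</title></head><body></body></html>';
$normal    = '<html><head><title>Blocked Drains: A Plumbing Guide</title></head><body></body></html>';

var_dump(title_block_signal($challenge)); // matches the "just a moment" phrase
var_dump(title_block_signal($normal));    // no phrase matches, so false
```

A title check like this works best alongside the body check rather than instead of it: a hit in the title is strong evidence, while a hit only in the body deserves a second look.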
Validating the Response Contains Expected Content
The most reliable block detection is checking that the response contains what you actually expect – not just the absence of block signals:
<?php
function validate_response($html, $expectedSelectors, $url) {
if (!$html || strlen($html) < 500) {
echo "Response too short on $url - likely a block page." . PHP_EOL;
return false;
}
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
foreach ($expectedSelectors as $selector => $description) {
$nodes = $xpath->query($selector);
if ($nodes->length === 0) {
echo "Expected '$description' not found on $url" . PHP_EOL;
return false;
}
}
return true;
}
// Define what a valid page should contain
$expectedSelectors = [
'//article[contains(@class,"product_pod")]' => 'product listings',
'//header' => 'page header',
];
$html = scrape_url("https://books.toscrape.com/");
$valid = validate_response($html, $expectedSelectors, "https://books.toscrape.com/");
if ($valid) {
echo "Response validated - expected content found." . PHP_EOL;
} else {
echo "Response invalid - content missing or blocked." . PHP_EOL;
}
?>
Output on valid response:
Response validated - expected content found.
Output when site serves block page:
Expected 'product listings' not found on https://books.toscrape.com/
Response invalid - content missing or blocked.
Automatic Recovery on Detection
When a block is detected, attempt recovery before giving up – wait longer, clear cookies and rebuild the session, then retry:
<?php
function scrape_with_block_recovery($url, $cookieFile, $maxAttempts = 3) {
$attempt = 0;
$baseDelay = 5;
while ($attempt < $maxAttempts) {
$attempt++;
$html = scrape_with_cookies($url, $cookieFile);
if (!$html) {
echo "Attempt $attempt failed - no response." . PHP_EOL;
sleep($baseDelay * $attempt);
continue;
}
// Check for soft blocks
$block = detect_block_in_response($html, $url);
if ($block === false) {
// Also validate expected content
$expectedSelectors = [
'//article[contains(@class,"product_pod")]' => 'product listings',
];
if (validate_response($html, $expectedSelectors, $url)) {
if ($attempt > 1) {
echo "Recovered on attempt $attempt." . PHP_EOL;
}
return $html;
}
}
echo "Attempt $attempt blocked. Waiting " . ($baseDelay * $attempt) . "s..." . PHP_EOL;
// Recovery actions
if ($attempt === 1) {
// First retry: just wait longer
sleep($baseDelay);
} elseif ($attempt === 2) {
// Second retry: clear cookies and start fresh session
clear_session($cookieFile);
echo "Cleared cookies - starting fresh session." . PHP_EOL;
sleep($baseDelay * 2);
// Re-establish session from homepage
scrape_with_cookies(
parse_url($url, PHP_URL_SCHEME) . '://' . parse_url($url, PHP_URL_HOST),
$cookieFile
);
sleep(rand(3, 7));
}
}
echo "All $maxAttempts attempts failed for: $url" . PHP_EOL;
return false;
}
$cookieFile = __DIR__ . '/cookies.txt';
$html = scrape_with_block_recovery(
"https://books.toscrape.com/catalogue/page-1.html",
$cookieFile
);
if ($html) {
echo "Page fetched and validated successfully." . PHP_EOL;
}
?>
Output on immediate success:
Page fetched and validated successfully.
Output when blocked on the first attempt, then recovered (the detection message printed by detect_block_in_response() or validate_response() appears first):
Attempt 1 blocked. Waiting 5s...
Recovered on attempt 2.
Page fetched and validated successfully.
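The recovery function above waits a fixed multiple of $baseDelay, so every blocked scraper retries on the same schedule. A common alternative is exponential backoff with a random component. The sketch below is one possible way to compute such a wait. The function name backoff_delay() and its default values are assumptions for illustration only.

```php
<?php
// Hypothetical helper: seconds to wait before retry number $attempt (1-based).
function backoff_delay($attempt, $baseDelay = 5, $maxDelay = 120) {
    // Exponential growth: base, 2x base, 4x base, ... capped at $maxDelay
    $ceiling = min($maxDelay, $baseDelay * (2 ** ($attempt - 1)));
    // Pick a random wait between half the ceiling and the ceiling,
    // so retries do not land on a predictable schedule
    return random_int((int) ceil($ceiling / 2), (int) $ceiling);
}

foreach ([1, 2, 3, 4] as $attempt) {
    echo "Attempt $attempt: wait " . backoff_delay($attempt) . "s" . PHP_EOL;
}
```

Inside a retry loop you would pass the result to sleep() in place of the fixed $baseDelay * $attempt.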
Logging All Block Events
Track when and where your scraper gets blocked – patterns in the log tell you which pages trigger detection and help you tune your approach:
<?php
function log_block_event($url, $reason, $attempt) {
$entry = sprintf(
"[%s] BLOCK | Attempt: %d | Reason: %s | URL: %s",
date('Y-m-d H:i:s'),
$attempt,
$reason,
$url
);
file_put_contents(__DIR__ . '/blocks.log', $entry . PHP_EOL, FILE_APPEND);
echo $entry . PHP_EOL;
}
// Use inside your scraping loop
if ($block) {
log_block_event($url, $block, $attempt);
}
?>
Example blocks.log contents after a run:
[2026-05-03 09:14:22] BLOCK | Attempt: 1 | Reason: ray id | URL: https://example.com/page-5
[2026-05-03 09:22:11] BLOCK | Attempt: 1 | Reason: captcha | URL: https://example.com/page-23
[2026-05-03 09:22:18] BLOCK | Attempt: 2 | Reason: captcha | URL: https://example.com/page-23
If the same URL keeps appearing in the block log it’s likely a login-protected or restricted page – not a temporary rate limit. Remove it from your target list rather than repeatedly hitting it.
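Spotting repeat offenders by eye gets tedious once the log grows. The sketch below counts block entries per URL, assuming the log line format shown above. The function name count_blocks_by_url() is hypothetical, and the sample lines are stand-ins for a real blocks.log.

```php
<?php
// Hypothetical helper: count BLOCK entries per URL in the log format above.
function count_blocks_by_url(array $lines) {
    $counts = [];
    foreach ($lines as $line) {
        // Capture the value after "URL: " on lines that contain "BLOCK"
        if (preg_match('/\bBLOCK\b.*\| URL: (\S+)\s*$/', $line, $m)) {
            $counts[$m[1]] = ($counts[$m[1]] ?? 0) + 1;
        }
    }
    arsort($counts); // most-blocked URLs first
    return $counts;
}

// Sample lines; with a real log use:
// $lines = file(__DIR__ . '/blocks.log', FILE_IGNORE_NEW_LINES);
$lines = [
    '[2026-05-03 09:14:22] BLOCK | Attempt: 1 | Reason: ray id | URL: https://example.com/page-5',
    '[2026-05-03 09:22:11] BLOCK | Attempt: 1 | Reason: captcha | URL: https://example.com/page-23',
    '[2026-05-03 09:22:18] BLOCK | Attempt: 2 | Reason: captcha | URL: https://example.com/page-23',
];

$counts = count_blocks_by_url($lines);
foreach ($counts as $url => $count) {
    echo "$count  $url" . PHP_EOL;
}
```

URLs at the top of this list are the first candidates to drop from your target list.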
Technique 6: Check and Respect robots.txt
robots.txt is a file that website owners use to tell crawlers which pages they can and cannot access. Ignoring it doesn’t just create legal risk – it’s also one of the quickest ways to get your IP banned. Sites that monitor robots.txt violations can flag a scraper as soon as it requests a disallowed path.
Respecting robots.txt keeps your scraper sustainable. You avoid the pages most likely to trigger detection, and you stay within the boundaries the site owner set.
Reading robots.txt
<?php
function fetch_robots_txt($domain) {
$robotsUrl = rtrim($domain, '/') . '/robots.txt';
$ch = curl_init();
curl_setopt_array($ch, [
CURLOPT_URL => $robotsUrl,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_CONNECTTIMEOUT => 10,
CURLOPT_TIMEOUT => 15,
CURLOPT_HTTPHEADER => [
'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
],
]);
$response = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
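// Treat anything other than a 200 as "no usable robots.txt" so an HTML
// error page is never parsed as if it were a list of rules
if ($httpCode !== 200 || !$response) {
echo "No usable robots.txt at $robotsUrl (HTTP $httpCode) - assuming all allowed." . PHP_EOL;
return null;
}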
echo "robots.txt found at $robotsUrl" . PHP_EOL;
return $response;
}
$robotsTxt = fetch_robots_txt("https://books.toscrape.com");
if ($robotsTxt) {
echo PHP_EOL . $robotsTxt . PHP_EOL;
}
?>
Output:
robots.txt found at https://books.toscrape.com/robots.txt
User-agent: *
Disallow:
An empty Disallow: means everything is allowed. This is the most permissive robots.txt possible – books.toscrape.com is designed for scraping practice so it allows everything.
Parsing robots.txt Rules
<?php
function parse_robots_txt($robotsTxt, $userAgent = '*') {
if (!$robotsTxt) {
return ['disallowed' => [], 'allowed' => [], 'crawl_delay' => null];
}
$rules = ['disallowed' => [], 'allowed' => [], 'crawl_delay' => null];
$lines = explode("\n", $robotsTxt);
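$applicable = false;
$inAgentGroup = false; // true while reading consecutive User-agent lines
foreach ($lines as $line) {
// Strip comments (including inline ones), then skip empty lines
$line = trim(preg_replace('/#.*$/', '', $line));
if ($line === '') continue;
// Check User-agent directive
if (stripos($line, 'User-agent:') === 0) {
$agent = trim(substr($line, 11));
// A group applies if it names "*" or a token contained in our user agent
$matches = ($agent === '*' || ($agent !== '' && stripos($userAgent, $agent) !== false));
// Consecutive User-agent lines share one group of rules
$applicable = $inAgentGroup ? ($applicable || $matches) : $matches;
$inAgentGroup = true;
continue;
}
$inAgentGroup = false;
if (!$applicable) continue;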
// Parse Disallow rules
if (stripos($line, 'Disallow:') === 0) {
$path = trim(substr($line, 9));
if ($path !== '') {
$rules['disallowed'][] = $path;
}
continue;
}
// Parse Allow rules
if (stripos($line, 'Allow:') === 0) {
$path = trim(substr($line, 6));
if ($path !== '') {
$rules['allowed'][] = $path;
}
continue;
}
// Parse Crawl-delay
if (stripos($line, 'Crawl-delay:') === 0) {
// Round fractional delays up so a value like 0.5 does not become 0
$rules['crawl_delay'] = (int) ceil((float) trim(substr($line, 12)));
}
}
return $rules;
}
// Parse and display rules
$robotsTxt = fetch_robots_txt("https://example.com");
$rules = parse_robots_txt($robotsTxt);
echo "Disallowed paths:" . PHP_EOL;
if (empty($rules['disallowed'])) {
echo " None - all paths allowed." . PHP_EOL;
} else {
foreach ($rules['disallowed'] as $path) {
echo " $path" . PHP_EOL;
}
}
echo "Crawl delay: " . ($rules['crawl_delay'] ? $rules['crawl_delay'] . "s" : "None specified") . PHP_EOL;
?>
Output for a typical e-commerce site:
Disallowed paths:
/admin/
/checkout/
/account/
/cart/
/search?
Crawl delay: 2
Checking if a URL is Allowed
<?php
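function is_url_allowed($url, $rules) {
$path = parse_url($url, PHP_URL_PATH) ?: '/';
// Include the query string so rules such as "/search?" can match
$query = parse_url($url, PHP_URL_QUERY);
if ($query !== null && $query !== false) {
$path .= '?' . $query;
}
// The most specific (longest) matching rule wins; Allow wins a tie.
// Plain prefix matching only - wildcards (* and $) are not handled here.
$longestAllow = -1;
foreach ($rules['allowed'] as $allowedPath) {
if (strpos($path, $allowedPath) === 0) {
$longestAllow = max($longestAllow, strlen($allowedPath));
}
}
$longestDisallow = -1;
foreach ($rules['disallowed'] as $disallowedPath) {
// "Disallow: /" is a prefix of every path, so it blocks the entire site
if (strpos($path, $disallowedPath) === 0) {
$longestDisallow = max($longestDisallow, strlen($disallowedPath));
}
}
return $longestDisallow <= $longestAllow; // also true when nothing matches
}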
// Test against some URLs
$testUrls = [
"https://example.com/products/",
"https://example.com/admin/users",
"https://example.com/checkout/",
"https://example.com/blog/post-1",
"https://example.com/account/settings",
];
foreach ($testUrls as $url) {
$allowed = is_url_allowed($url, $rules);
$status = $allowed ? "ALLOWED" : "DISALLOWED";
echo "$status: $url" . PHP_EOL;
}
?>
Output:
ALLOWED: https://example.com/products/
DISALLOWED: https://example.com/admin/users
DISALLOWED: https://example.com/checkout/
ALLOWED: https://example.com/blog/post-1
DISALLOWED: https://example.com/account/settings
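is_url_allowed() above uses plain prefix matching, which covers most rules. Many robots.txt files also use two special characters: * matches any run of characters, and a trailing $ anchors the rule to the end of the path. The sketch below shows one way to support them. The function name robots_pattern_matches() is hypothetical and it is not wired into the functions above.

```php
<?php
// Hypothetical helper: does a robots.txt path pattern match a URL path?
// Supports "*" (any run of characters) and a trailing "$" (end of path).
function robots_pattern_matches($pattern, $path) {
    $anchored = substr($pattern, -1) === '$';
    if ($anchored) {
        $pattern = substr($pattern, 0, -1);
    }
    // Quote regex metacharacters, then turn each quoted "*" back into ".*"
    $regex = str_replace('\*', '.*', preg_quote($pattern, '#'));
    // Rules always match from the start of the path
    $regex = '#^' . $regex . ($anchored ? '$' : '') . '#';
    return preg_match($regex, $path) === 1;
}

$checks = [
    ['/admin/',    '/admin/users'],
    ['/*.pdf$',    '/files/report.pdf'],
    ['/*.pdf$',    '/files/report.pdf?download=1'],
    ['/private*/', '/private-area/notes'],
];
foreach ($checks as [$pattern, $path]) {
    $result = robots_pattern_matches($pattern, $path) ? 'match' : 'no match';
    echo "$pattern vs $path: $result" . PHP_EOL;
}
```

To use it, you could swap the strpos() prefix checks in is_url_allowed() for calls to this function while keeping the rest of the logic unchanged.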
Respecting the Crawl-Delay Directive
Some robots.txt files specify a Crawl-delay – the minimum number of seconds to wait between requests. Respecting it is both polite and practical – sites that set a crawl delay are usually monitoring for violations:
<?php
function get_crawl_delay($rules, $defaultDelay = 2) {
if ($rules['crawl_delay'] !== null) {
echo "Using crawl delay from robots.txt: {$rules['crawl_delay']}s" . PHP_EOL;
return $rules['crawl_delay'];
}
return $defaultDelay;
}
$crawlDelay = get_crawl_delay($rules);
$urls = [
"https://example.com/products/page-1",
"https://example.com/products/page-2",
"https://example.com/products/page-3",
];
foreach ($urls as $url) {
if (!is_url_allowed($url, $rules)) {
echo "Skipping disallowed URL: $url" . PHP_EOL;
continue;
}
$html = scrape_url($url);
if ($html) {
echo "Fetched: $url" . PHP_EOL;
}
sleep($crawlDelay); // respect the crawl delay
}
?>
Output:
Using crawl delay from robots.txt: 2s
Fetched: https://example.com/products/page-1
Fetched: https://example.com/products/page-2
Fetched: https://example.com/products/page-3
Building a Complete robots.txt Checker
Wrap everything into one reusable class that fetches, parses, and enforces robots.txt rules automatically:
<?php
class RobotsChecker {
private $rules = [];
private $cacheFile = null;
private $cacheTtl = 86400; // 24 hours
public function __construct($cacheFile = null) {
$this->cacheFile = $cacheFile;
}
public function load($domain) {
// Check cache first
if ($this->cacheFile && file_exists($this->cacheFile)) {
$cache = json_decode(file_get_contents($this->cacheFile), true);
if ($cache && isset($cache[$domain])) {
$cached = $cache[$domain];
if (time() - $cached['fetched_at'] < $this->cacheTtl) {
$this->rules[$domain] = $cached['rules'];
echo "Loaded robots.txt from cache for $domain" . PHP_EOL;
return;
}
}
}
// Fetch fresh
$robotsTxt = fetch_robots_txt($domain);
$this->rules[$domain] = parse_robots_txt($robotsTxt);
// Cache the result
if ($this->cacheFile) {
$cache = file_exists($this->cacheFile)
? json_decode(file_get_contents($this->cacheFile), true)
: [];
$cache[$domain] = [
'rules' => $this->rules[$domain],
'fetched_at' => time(),
];
file_put_contents($this->cacheFile, json_encode($cache));
}
}
public function isAllowed($url) {
$domain = parse_url($url, PHP_URL_SCHEME) . '://' . parse_url($url, PHP_URL_HOST);
if (!isset($this->rules[$domain])) {
$this->load($domain);
}
return is_url_allowed($url, $this->rules[$domain]);
}
public function getCrawlDelay($domain, $default = 2) {
if (!isset($this->rules[$domain])) {
$this->load($domain);
}
return get_crawl_delay($this->rules[$domain], $default);
}
}
// Usage
$checker = new RobotsChecker(__DIR__ . '/robots_cache.json');
$urls = [
"https://books.toscrape.com/catalogue/page-1.html",
"https://books.toscrape.com/admin/",
"https://books.toscrape.com/catalogue/page-2.html",
];
foreach ($urls as $url) {
if (!$checker->isAllowed($url)) {
echo "Skipping: $url" . PHP_EOL;
continue;
}
$html = scrape_url($url);
if ($html) {
echo "Fetched: $url" . PHP_EOL;
}
$domain = parse_url($url, PHP_URL_SCHEME) . '://' . parse_url($url, PHP_URL_HOST);
sleep($checker->getCrawlDelay($domain));
}
?>
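Output, assuming the site's robots.txt disallowed /admin/ (books.toscrape.com itself disallows nothing, so there the /admin/ URL would pass the check instead of being skipped):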
robots.txt found at https://books.toscrape.com/robots.txt
Fetched: https://books.toscrape.com/catalogue/page-1.html
Skipping: https://books.toscrape.com/admin/
Fetched: https://books.toscrape.com/catalogue/page-2.html
The cache file prevents fetching robots.txt on every scraping run. Once loaded it stays valid for 24 hours – robots.txt rarely changes more frequently than that.
Technique 7: Use APIs When Available
The most reliable way to avoid getting blocked is to not scrape at all. Many sites that look like scraping targets have official APIs that provide the same data – structured, stable, and without any of the blocking risk. Before writing a single line of scraping code, check if an API exists.
APIs are faster, more reliable, less likely to break when a site redesigns, and explicitly authorized by the data owner. A scraper that works today can break tomorrow when a site updates its HTML. A versioned API endpoint usually stays stable far longer.
Finding Hidden APIs
Many JavaScript-heavy sites load their data from internal API endpoints that your scraper can hit directly – no HTML parsing required. These aren’t always documented but they’re accessible:
- Open Chrome DevTools → Network tab
- Reload the page
- Filter by Fetch/XHR
- Look for requests returning JSON with the data you need
- Copy the request URL and headers
- Hit it directly with cURL
<?php
// Instead of scraping the HTML page
// Call the internal API the JavaScript uses
function call_internal_api($apiUrl, $headers = []) {
$ch = curl_init();
$defaultHeaders = [
'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
'Accept: application/json, text/plain, */*',
'Accept-Language: en-US,en;q=0.5',
'X-Requested-With: XMLHttpRequest',
'Connection: keep-alive',
];
curl_setopt_array($ch, [
CURLOPT_URL => $apiUrl,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_CONNECTTIMEOUT => 10,
CURLOPT_TIMEOUT => 30,
CURLOPT_HTTPHEADER => array_merge($defaultHeaders, $headers),
]);
$response = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
if ($httpCode !== 200) {
echo "API request failed: HTTP $httpCode" . PHP_EOL;
return false;
}
$data = json_decode($response, true);
if (json_last_error() !== JSON_ERROR_NONE) {
echo "Invalid JSON response: " . json_last_error_msg() . PHP_EOL;
return false;
}
return $data;
}
// Example: hitting an internal product API
$data = call_internal_api("https://example.com/api/products?page=1&limit=20");
if ($data) {
echo "Products returned: " . count($data['products'] ?? []) . PHP_EOL;
foreach ($data['products'] ?? [] as $product) {
echo $product['name'] . " - $" . $product['price'] . PHP_EOL;
}
}
?>
Output:
Products returned: 20
Laptop Stand Pro - $49.99
Mechanical Keyboard - $89.99
USB-C Hub - $34.99
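Internal endpoints like this are usually paginated, so a single call rarely returns everything. The sketch below walks pages until one comes back empty. The function name fetch_all_products() is hypothetical, the 'products' key follows the example response above, and the fetch step is passed in as a callable so the sketch runs without a network connection.

```php
<?php
// Hypothetical helper: walk a paginated JSON endpoint until a page is empty.
// $fetchPage takes a page number and returns decoded JSON (array) or false.
function fetch_all_products(callable $fetchPage, $maxPages = 50) {
    $all = [];
    for ($page = 1; $page <= $maxPages; $page++) {
        $data  = $fetchPage($page);
        $items = is_array($data) ? ($data['products'] ?? []) : [];
        if (empty($items)) {
            break; // request failed or no more results
        }
        $all = array_merge($all, $items);
    }
    return $all;
}

// Stand-in for the network call so the sketch runs offline
$fakePages = [
    1 => ['products' => [['name' => 'Laptop Stand Pro'], ['name' => 'Mechanical Keyboard']]],
    2 => ['products' => [['name' => 'USB-C Hub']]],
];
$products = fetch_all_products(fn($page) => $fakePages[$page] ?? ['products' => []]);

echo count($products) . " products collected" . PHP_EOL;
```

With a real endpoint, the callable would build the page URL, call call_internal_api(), and pause between requests. The $maxPages cap is a safety net against an endpoint that never returns an empty page.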
Working With Public APIs
Many platforms provide official public APIs with proper documentation. These are always preferable to scraping:
<?php
function call_public_api($endpoint, $apiKey = null, $params = []) {
$url = $endpoint;
if (!empty($params)) {
$url .= '?' . http_build_query($params);
}
$ch = curl_init();
$headers = [
'Accept: application/json',
'Content-Type: application/json',
];
// Add API key if provided
if ($apiKey) {
$headers[] = 'Authorization: Bearer ' . $apiKey;
}
curl_setopt_array($ch, [
CURLOPT_URL => $url,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_CONNECTTIMEOUT => 10,
CURLOPT_TIMEOUT => 30,
CURLOPT_HTTPHEADER => $headers,
]);
$response = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
if ($httpCode === 401) {
echo "API authentication failed - check your API key." . PHP_EOL;
return false;
}
if ($httpCode === 429) {
echo "API rate limit exceeded - slow down requests." . PHP_EOL;
return false;
}
if ($httpCode !== 200) {
echo "API error: HTTP $httpCode" . PHP_EOL;
return false;
}
return json_decode($response, true);
}
// Example: Open Library API for book data - no API key required
$data = call_public_api(
"https://openlibrary.org/search.json",
null,
['q' => 'php programming', 'limit' => 5]
);
if ($data) {
echo "Books found: " . $data['numFound'] . PHP_EOL . PHP_EOL;
foreach ($data['docs'] ?? [] as $book) {
echo $book['title'] . PHP_EOL;
echo " Author: " . implode(', ', $book['author_name'] ?? ['Unknown']) . PHP_EOL;
echo " Year: " . ($book['first_publish_year'] ?? 'N/A') . PHP_EOL;
echo PHP_EOL;
}
}
?>
Output:
Books found: 247
Learning PHP, MySQL & JavaScript
Author: Robin Nixon
Year: 2009
PHP and MySQL Web Development
Author: Luke Welling, Laura Thomson
Year: 2001
Modern PHP
Author: Josh Lockhart
Year: 2015
Handling API Rate Limits
Public APIs enforce rate limits – a maximum number of requests per minute, hour, or day. Exceeding them repeatedly can get your API key suspended. Handle limits properly:
<?php
class ApiClient {
private $apiKey;
private $baseUrl;
private $requestsPerMinute;
private $requestTimes = [];
public function __construct($baseUrl, $apiKey = null, $requestsPerMinute = 60) {
$this->baseUrl = rtrim($baseUrl, '/');
$this->apiKey = $apiKey;
$this->requestsPerMinute = $requestsPerMinute;
}
private function enforceRateLimit() {
$now = microtime(true);
$oneMinAgo = $now - 60;
$this->requestTimes = array_filter(
$this->requestTimes,
fn($time) => $time > $oneMinAgo
);
if (count($this->requestTimes) >= $this->requestsPerMinute) {
$oldestRequest = min($this->requestTimes);
$waitTime = ($oldestRequest + 60) - $now;
if ($waitTime > 0) {
echo "Rate limit: waiting " . round($waitTime, 1) . "s..." . PHP_EOL;
usleep((int)($waitTime * 1000000));
}
}
$this->requestTimes[] = microtime(true);
}
public function get($endpoint, $params = [], $isRetry = false) {
$this->enforceRateLimit();
$url = $this->baseUrl . $endpoint;
if (!empty($params)) {
$url .= '?' . http_build_query($params);
}
$headers = ['Accept: application/json'];
if ($this->apiKey) {
$headers[] = 'Authorization: Bearer ' . $this->apiKey;
}
$ch = curl_init();
curl_setopt_array($ch, [
CURLOPT_URL => $url,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_CONNECTTIMEOUT => 10,
CURLOPT_TIMEOUT => 30,
CURLOPT_HTTPHEADER => $headers,
]);
$response = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
if ($httpCode === 429) {
if ($isRetry) {
// Already retried once - give up instead of looping forever
echo "Still rate limited after retry on $endpoint" . PHP_EOL;
return false;
}
echo "Rate limited by API - waiting 60s before retry..." . PHP_EOL;
sleep(60);
return $this->get($endpoint, $params, true); // retry once
}
if ($httpCode !== 200) {
echo "API error $httpCode on $endpoint" . PHP_EOL;
return false;
}
return json_decode($response, true);
}
}
// Usage - 30 requests per minute limit
$client = new ApiClient("https://openlibrary.org", null, 30);
$subjects = ['php', 'python', 'javascript'];
foreach ($subjects as $subject) {
$data = $client->get('/subjects/' . $subject . '.json', ['limit' => 5]);
if ($data) {
echo "Subject: $subject - " . ($data['work_count'] ?? 0) . " books" . PHP_EOL;
}
}
?>
Output:
Subject: php - 156 books
Subject: python - 892 books
Subject: javascript - 743 books
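The client above always waits a fixed 60 seconds after a 429. Many APIs tell you exactly how long to wait through a Retry-After response header, which holds either a number of seconds or an HTTP date. The sketch below converts that value into seconds. The function name retry_after_seconds() is hypothetical, and reading the header from cURL (for example with CURLOPT_HEADERFUNCTION) is left out to keep the example short.

```php
<?php
// Hypothetical helper: convert a Retry-After header value into seconds to wait.
// The value is either a whole number of seconds or an HTTP date.
function retry_after_seconds($value, $now, $default = 60) {
    $value = trim((string) $value);
    if ($value === '') {
        return $default; // header missing - fall back
    }
    if (ctype_digit($value)) {
        return (int) $value; // delta-seconds form
    }
    $timestamp = strtotime($value); // HTTP-date form
    if ($timestamp === false) {
        return $default; // unparseable - fall back
    }
    return max(0, $timestamp - $now);
}

echo retry_after_seconds('120', time()) . PHP_EOL;

$now = strtotime('Wed, 21 Oct 2015 07:27:00 GMT');
echo retry_after_seconds('Wed, 21 Oct 2015 07:28:00 GMT', $now) . PHP_EOL;
```

The result could replace the hard-coded sleep(60) in the 429 branch, ideally with an upper bound so a misbehaving server cannot stall the script for hours.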
When to Scrape vs When to Use an API
The decision is straightforward:
- Official API exists – use it. Always. No exceptions. Scraping when an API is available wastes your time, creates unnecessary server load, and risks violating terms of service.
- Internal API found in DevTools – use it if it returns the data you need. Not officially supported but faster and more stable than HTML scraping.
- No API, static HTML content – scrape with proper headers, delays, cookies, and robots.txt compliance.
- No API, JavaScript-rendered content – check for inline JSON first, then consider a headless browser. Read the dynamic content web scraping guide for the full approach.
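For the last case, "inline JSON" usually means structured data the page already embeds in a script tag. The sketch below pulls JSON-LD blocks out of a page with DOMDocument. The function name extract_json_ld() and the sample HTML are assumptions for illustration. Real pages may embed their data under a different script type or in a JavaScript variable instead.

```php
<?php
// Hypothetical helper: collect every JSON-LD block embedded in a page.
function extract_json_ld($html) {
    if (trim((string) $html) === '') {
        return [];
    }
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath   = new DOMXPath($dom);
    $results = [];
    foreach ($xpath->query('//script[@type="application/ld+json"]') as $node) {
        $decoded = json_decode($node->textContent, true);
        // Keep only blocks that decode cleanly
        if (json_last_error() === JSON_ERROR_NONE && is_array($decoded)) {
            $results[] = $decoded;
        }
    }
    return $results;
}

// Sample page: the visible body is empty until JavaScript runs,
// but the product data is already present as JSON-LD
$html = '<html><head><script type="application/ld+json">'
      . '{"@type":"Product","name":"USB-C Hub","offers":{"price":"34.99"}}'
      . '</script></head><body><div id="app"></div></body></html>';

$blocks = extract_json_ld($html);
echo $blocks[0]['name'] . ' - ' . $blocks[0]['offers']['price'] . PHP_EOL;
```

If this returns the data you need, there is no reason to start a headless browser for that page.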
Putting It All Together
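Each technique works in isolation but the real protection comes from combining them. Here’s a complete scraping function that combines techniques 1 through 6 in one place (technique 7 replaces scraping rather than adding to it). It reuses fetch_robots_txt(), parse_robots_txt(), and is_url_allowed() from Technique 6, so include those functions in the same script: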
<?php
// ============================================
// Complete Anti-Block Scraper
// ============================================
set_time_limit(0);
ini_set('log_errors', 1);
ini_set('error_log', __DIR__ . '/scraper_errors.log');
$cookieFile = __DIR__ . '/cookies.txt';
$logFile = __DIR__ . '/scraper.log';
function log_message($message) {
global $logFile;
$entry = '[' . date('Y-m-d H:i:s') . '] ' . $message . PHP_EOL;
file_put_contents($logFile, $entry, FILE_APPEND);
echo $entry;
}
function get_random_user_agent() {
$agents = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
];
return $agents[array_rand($agents)];
}
function random_delay($min = 1, $max = 3) {
$microseconds = rand($min * 1000000, $max * 1000000);
usleep($microseconds);
}
function detect_block($html) {
if (!$html || strlen($html) < 500) return 'empty_response';
$signals = [
'captcha', 'recaptcha', 'access denied',
'cf-browser-verification', 'ray id',
'unusual traffic', 'security check',
'please verify you are human',
];
$lower = strtolower($html);
foreach ($signals as $signal) {
if (strpos($lower, $signal) !== false) {
return $signal;
}
}
return false;
}
function scrape_safely($url, $cookieFile, $proxy = null, $maxRetries = 3) {
$attempt = 0;
$baseDelay = 3;
// Step 1: Check robots.txt before first request
static $robotsRules = [];
$domain = parse_url($url, PHP_URL_SCHEME) . '://' . parse_url($url, PHP_URL_HOST);
if (!isset($robotsRules[$domain])) {
$robotsTxt = fetch_robots_txt($domain);
$robotsRules[$domain] = parse_robots_txt($robotsTxt ?? '');
}
if (!is_url_allowed($url, $robotsRules[$domain])) {
log_message("Skipping disallowed URL: $url");
return false;
}
while ($attempt < $maxRetries) {
$attempt++;
$ch = curl_init();
$options = [
CURLOPT_URL => $url,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_CONNECTTIMEOUT => 10,
CURLOPT_TIMEOUT => 30,
CURLOPT_ENCODING => '',
// Technique 1: Complete header set
CURLOPT_HTTPHEADER => [
'User-Agent: ' . get_random_user_agent(),
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
'Accept-Language: en-US,en;q=0.5',
// Accept-Encoding is added by CURLOPT_ENCODING => '' above, limited to what this cURL build can decode
'Connection: keep-alive',
'Upgrade-Insecure-Requests: 1',
'Sec-Fetch-Dest: document',
'Sec-Fetch-Mode: navigate',
'Sec-Fetch-Site: same-origin', // consistent with the same-site Referer sent below
'Sec-Fetch-User: ?1',
'Referer: ' . $domain . '/',
],
// Technique 3: Cookie handling
CURLOPT_COOKIEFILE => $cookieFile,
CURLOPT_COOKIEJAR => $cookieFile,
];
// Technique 4: Proxy rotation
if ($proxy) {
$options[CURLOPT_PROXY] = $proxy['host'] . ':' . $proxy['port'];
$options[CURLOPT_PROXYTYPE] = $proxy['type'] ?? CURLPROXY_HTTP;
if (!empty($proxy['username'])) {
$options[CURLOPT_PROXYUSERPWD] = $proxy['username'] . ':' . $proxy['password'];
}
}
curl_setopt_array($ch, $options);
$html = curl_exec($ch);
$errno = curl_errno($ch);
$error = curl_error($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
// Network error
if ($errno) {
log_message("Attempt $attempt - cURL error $errno: $error");
random_delay($baseDelay, $baseDelay + 2);
continue;
}
// HTTP errors
if ($httpCode === 429) {
log_message("Attempt $attempt - rate limited (429). Waiting 30s...");
sleep(30);
continue;
}
if ($httpCode === 403) {
log_message("Attempt $attempt - blocked (403): $url");
random_delay($baseDelay * $attempt, $baseDelay * $attempt + 5);
continue;
}
if ($httpCode >= 500) {
log_message("Attempt $attempt - server error ($httpCode). Retrying...");
random_delay($baseDelay, $baseDelay + 3);
continue;
}
if ($httpCode !== 200) {
log_message("Attempt $attempt - HTTP $httpCode: $url");
return false;
}
// Technique 5: Detect soft blocks
$block = detect_block($html);
if ($block) {
log_message("Attempt $attempt - soft block detected: $block");
if ($attempt === 2) {
// Clear cookies and re-establish session
file_put_contents($cookieFile, '');
log_message("Cleared cookies - starting fresh session.");
random_delay($baseDelay * 2, $baseDelay * 3);
// Visit homepage first
scrape_safely($domain . '/', $cookieFile, $proxy, 1);
random_delay(3, 6);
} else {
random_delay($baseDelay * $attempt, $baseDelay * $attempt + 5);
}
continue;
}
// Success
if ($attempt > 1) {
log_message("Recovered on attempt $attempt: $url");
}
return $html;
}
log_message("All $maxRetries attempts failed: $url");
return false;
}
// ---- Main Scraping Loop ----
log_message("Scraper started.");
// Technique 3: Establish session from homepage first
log_message("Establishing session...");
scrape_safely("https://books.toscrape.com/", $cookieFile);
random_delay(2, 4);
$urls = [
"https://books.toscrape.com/catalogue/page-1.html",
"https://books.toscrape.com/catalogue/page-2.html",
"https://books.toscrape.com/catalogue/page-3.html",
];
$success = 0;
$failed = 0;
foreach ($urls as $index => $url) {
$html = scrape_safely($url, $cookieFile);
if ($html) {
$success++;
log_message("Fetched page " . ($index + 1) . " - " . strlen($html) . " bytes.");
} else {
$failed++;
}
// Technique 2: Random delay between requests
random_delay(1, 3);
// Technique 2: Longer break every 10 requests
if (($index + 1) % 10 === 0) {
$break = rand(15, 30);
log_message("Taking {$break}s break...");
sleep($break);
}
}
log_message("Done. Success: $success | Failed: $failed");
?>
Output (log lines only; fetch_robots_txt() also prints its own one-line status on the first request):
[2026-05-03 09:00:01] Scraper started.
[2026-05-03 09:00:01] Establishing session...
[2026-05-03 09:00:02] Fetched page 1 - 49832 bytes.
[2026-05-03 09:00:05] Fetched page 2 - 49801 bytes.
[2026-05-03 09:00:07] Fetched page 3 - 49798 bytes.
[2026-05-03 09:00:09] Done. Success: 3 | Failed: 0
Frequently Asked Questions
Why does my PHP scraper keep getting blocked even with a User-Agent set?
User-Agent alone covers one of many detection signals. Sites check the full header set, request timing, cookie history, and IP reputation. Add the complete browser header set from Technique 1, enable cookie handling, and add random delays between requests. These three changes together fix the majority of blocking issues.
Do I need proxies to avoid getting blocked?
Not for most scraping projects. Proxies become necessary when you’re sending high volumes of requests from one IP – hundreds of pages per hour on sites with IP-based rate limiting. For typical scraping jobs of a few hundred pages per day, proper headers, delays, and cookies are enough. Add proxies only when those techniques stop working.
How do I know if my scraper is being blocked?
Four signals to check: HTTP 403 status code means explicit block. HTTP 429 means rate limited. A 200 response with very short content length (under 500 bytes for a page that should be much larger) suggests a block page. Your XPath selectors returning zero results on pages that should have data suggests the content was replaced with a bot challenge page. Always check both the status code and the response body.
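As a rough illustration, the four signals can be folded into a single check. The function name classify_response(), its labels, and the 500-byte default are assumptions for this sketch. $expectedMatches stands for the number of nodes your main XPath selector returned.

```php
<?php
// Hypothetical helper: combine the four block signals into one label.
function classify_response($httpCode, $body, $expectedMatches, $minLength = 500) {
    if ($httpCode === 403) return 'blocked';        // explicit block
    if ($httpCode === 429) return 'rate_limited';   // too many requests
    if ($httpCode !== 200) return 'http_error';     // anything else unexpected
    if (strlen((string) $body) < $minLength) return 'suspiciously_short';
    if ($expectedMatches === 0) return 'content_missing'; // selectors found nothing
    return 'ok';
}

$page = str_repeat('<p>content</p>', 100); // stand-in for a full page body

echo classify_response(403, '', 0) . PHP_EOL;
echo classify_response(200, '<html>hi</html>', 0) . PHP_EOL;
echo classify_response(200, $page, 0) . PHP_EOL;
echo classify_response(200, $page, 20) . PHP_EOL;
```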
What is Cloudflare and why does it block my scraper?
Cloudflare is a security service many sites use to protect against bots and DDoS attacks. It analyzes requests in detail – TLS fingerprint, JavaScript execution, browser behavior patterns, IP reputation. Basic cURL requests fail most of these checks. Cloudflare-protected sites are significantly harder to scrape with PHP alone – you’ll need a headless browser or a specialized proxy service with Cloudflare bypass capability.
How many requests per minute is safe for most sites?
10-20 requests per minute is a safe starting point for most sites without explicit rate limiting. If a site’s robots.txt specifies a crawl delay, use that number. If you start getting 429 responses, halve your request rate and wait 30-60 seconds before continuing. For large platforms – e-commerce, news sites, social media – stay under 10 requests per minute to be safe.
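One way to apply the "halve your rate on 429" advice automatically is to adjust the delay after every response. The sketch below is a minimal version of that idea. The function name next_delay(), the 3-second floor (about 20 requests per minute), the 60-second cap, and the 10% recovery step are all assumptions to tune per site.

```php
<?php
// Hypothetical helper: pick the next delay (seconds) from the last status code.
function next_delay($currentDelay, $httpCode, $minDelay = 3.0, $maxDelay = 60.0) {
    if ($httpCode === 429) {
        // Halving the request rate means doubling the gap between requests
        return min($maxDelay, $currentDelay * 2);
    }
    // Otherwise ease back toward the minimum by 10% per request
    return max($minDelay, $currentDelay * 0.9);
}

$delay = 3.0;
foreach ([200, 429, 429, 200, 200] as $code) {
    $delay = next_delay($delay, $code);
    echo "HTTP $code -> next delay " . round($delay, 2) . "s" . PHP_EOL;
}
```

In a scraping loop you would feed each response code into next_delay() and pass the result to usleep(), in addition to the one-off 30-60 second pause after a 429.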
Is web scraping legal?
Scraping publicly available data sits in a legal grey area in most jurisdictions. The risk increases significantly when you scrape behind a login, violate terms of service, collect personal data, or use scraped data commercially. Always check the site’s terms of service and robots.txt before scraping. When a site offers an API, use it instead – it’s explicitly authorized, and as long as you stay within the API’s terms it removes most of the legal ambiguity.
Summary
Getting blocked is almost always preventable. Work through these techniques in order – start with the first two and add more only if you’re still getting blocked:
- Complete headers – send the full browser header set, not just a User-Agent. Rotate user agents across requests.
- Random delays – wait a random 1-3 seconds between requests at minimum. Add longer breaks every N requests.
- Cookie handling – enable a cookie jar, establish a session from the homepage before scraping deeper pages.
- Proxy rotation – only needed at high volume. Test proxies before adding them to rotation.
- Block detection – check the response body, not just the status code. Log every block event for pattern analysis.
- robots.txt – check it before scraping any new site. Respect crawl delays. Skip disallowed paths.
- APIs first – check for an official API or hidden JSON endpoint before writing any scraping code.
For the complete PHP cURL scraping setup that these techniques build on, read the PHP cURL web scraping complete guide. For handling the connection errors and timeouts that happen when a site starts rate limiting you, the PHP cURL timeout guide covers retry logic and error code 28 in detail. And for the seven most common mistakes that cause scrapers to fail silently, the web scraping errors guide covers each one with working fixes.
Ethical Web Scraping Guidelines
While learning how to avoid getting blocked while web scraping, it’s important to follow ethical practices to ensure responsible usage.
- Always check the website’s terms of service
- Respect robots.txt rules
- Avoid sending excessive requests
- Do not collect personal or sensitive data
- Use scraped data responsibly
Following these guidelines helps you build sustainable and safe scraping solutions.
Conclusion
Avoiding blocks in web scraping is not complicated once you understand how websites detect bots. Start with simple improvements like headers and delays, and scale up with proxies only when needed.
In my experience, focusing on small improvements first makes a big difference before jumping into advanced solutions.
These techniques help you avoid getting blocked while web scraping, even on stricter websites.
If you’re building scraping projects, start simple and improve step by step. It will save you a lot of time and frustration.
