PHP web scraping errors are frustrating because they rarely come with a clear explanation. Your script runs, returns something, and you only realize something went wrong when you look at the data – or don’t, because nothing was saved at all.
Most scraping failures fall into seven categories. Each one has a specific symptom, a specific cause, and a specific fix. This guide covers all seven with working PHP code so you can identify exactly what went wrong and fix it without guessing.
All examples use books.toscrape.com – a site built specifically for scraping practice. Every code block here runs against a real target.
What You Need
- PHP 7.4 or higher
- cURL extension enabled – verify with php -m | grep curl
- PDO extension for database examples
- Basic familiarity with curl_init() and DOMDocument
If you’re new to PHP cURL scraping, read the PHP cURL web scraping guide first – it covers the request setup and HTML parsing that the error fixes here build on.
Error 1: Wrong or Broken Selectors
The most common scraping error isn’t a crash – it’s silence. Your script runs without errors, but the data you’re trying to extract comes back empty. No exception, no warning, just an empty array or a NodeList with zero items.
This happens when your XPath query or CSS selector doesn’t match the actual HTML structure of the page.
What It Looks Like
<?php
$html = scrape_url("https://books.toscrape.com/");
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
// Wrong selector — class name is slightly off
$titles = $xpath->query('//article[@class="product"]//h3/a');
echo "Titles found: " . $titles->length . PHP_EOL;
?>
Output:
Titles found: 0
No error. No warning. Just zero results – because the actual class is product_pod, not product.
How to Debug It
Before assuming your selector is correct, verify what’s actually in the document:
<?php
$html = scrape_url("https://books.toscrape.com/");
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
// Step 1: Check the document loaded at all
$body = $xpath->query('//body');
echo "Body found: " . $body->length . PHP_EOL;
// Step 2: Count total elements to confirm HTML is there
$allArticles = $xpath->query('//article');
echo "Total articles: " . $allArticles->length . PHP_EOL;
// Step 3: Check what classes those articles actually have
foreach ($allArticles as $index => $article) {
echo "Article $index class: " . $article->getAttribute('class') . PHP_EOL;
if ($index >= 2) break; // just check first 3
}
?>
Output:
Body found: 1
Total articles: 20
Article 0 class: product_pod
Article 1 class: product_pod
Article 2 class: product_pod
Now you can see the actual class name and fix the selector.
The Fix
<?php
// Wrong — @class="product" requires an exact match, and the actual class is product_pod
$titles = $xpath->query('//article[@class="product"]//h3/a');
// Right — use contains() to match partial class names
$titles = $xpath->query('//article[contains(@class,"product_pod")]//h3/a');
echo "Titles found: " . $titles->length . PHP_EOL;
foreach ($titles as $title) {
echo $title->getAttribute('title') . PHP_EOL;
}
?>
Output:
Titles found: 20
A Light in the Attic
Tipping the Velvet
Soumission
Sharp Objects
...
Dumping Raw HTML to Find the Right Selector
When you can’t tell from the output what the correct selector should be, save the raw response to a file and inspect it directly:
<?php
$html = scrape_url("https://books.toscrape.com/");
// Save raw HTML to file and open it in browser or text editor
file_put_contents(__DIR__ . '/debug_page.html', $html);
echo "Raw HTML saved — open debug_page.html to inspect the structure." . PHP_EOL;
?>
Open the file in your browser, right-click the element you want to target, and select Inspect. The actual class names, IDs, and attributes are right there. Never write a selector from memory – always verify against the actual HTML.
Testing XPath Queries Without Running Your Script
Chrome DevTools lets you test XPath directly in the browser console before writing any PHP. Open the console on the target page and run:
$x('//article[contains(@class,"product_pod")]//h3/a')
If it returns an array of elements, the query works. If it returns an empty array, fix the query before touching your PHP code. This saves the round-trip of editing, running, and reading output every time you adjust a selector.
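The same quick check works from the command line if you'd rather stay in PHP. A small throwaway helper – the file name test_xpath.php is just an example – takes a saved HTML file (such as the debug_page.html dump from the previous section) and an XPath expression, then prints what matches:
<?php
// test_xpath.php — usage: php test_xpath.php debug_page.html '//article[contains(@class,"product_pod")]//h3/a'
if ($argc < 3) {
    exit("Usage: php test_xpath.php <html-file> <xpath-query>" . PHP_EOL);
}
[$script, $file, $query] = $argv;
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML(file_get_contents($file));
libxml_clear_errors();
$nodes = (new DOMXPath($dom))->query($query);
if ($nodes === false) {
    exit("Malformed XPath expression." . PHP_EOL);
}
echo "Matches: " . $nodes->length . PHP_EOL;
foreach ($nodes as $i => $node) {
    echo "[$i] " . trim($node->textContent) . PHP_EOL;
    if ($i >= 4) break; // preview the first five matches only
}
?>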
When the Selector Was Working and Then Stopped
If your scraper was extracting data correctly and then suddenly returns empty results without any code changes, the site updated its HTML structure. Add a check that alerts you when expected data goes missing:
<?php
$titles = $xpath->query('//article[contains(@class,"product_pod")]//h3/a');
if ($titles->length === 0) {
$message = date('Y-m-d H:i:s') . " | Selector returned 0 results on "
. $url . " — site structure may have changed." . PHP_EOL;
file_put_contents(__DIR__ . '/scraper_alerts.log', $message, FILE_APPEND);
echo $message;
exit; // stop processing rather than saving empty data
}
echo "Found " . $titles->length . " titles. Proceeding." . PHP_EOL;
?>
Output when selector breaks:
2026-05-01 09:14:22 | Selector returned 0 results on https://books.toscrape.com/ — site structure may have changed.
Exiting immediately when results are empty prevents your database from being overwritten with blank records – which is worse than having no data at all.
Error 2: Getting Blocked – 403 Errors and Empty Responses
Your script sends a request, gets a response, but the content is either a 403 status, a CAPTCHA page, or a stripped-down HTML shell with none of the data you need. The request technically succeeded – cURL got a response – but the site identified your script as a bot and served something different.
What It Looks Like
<?php
$ch = curl_init();
curl_setopt_array($ch, [
CURLOPT_URL => "https://example.com/products",
CURLOPT_RETURNTRANSFER => true,
]);
$response = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
echo "HTTP Status: " . $httpCode . PHP_EOL;
echo "Response length: " . strlen($response) . " bytes" . PHP_EOL;
?>
Output:
HTTP Status: 403
Response length: 2841 bytes
cURL returned a response – 2841 bytes of it – but it’s an “Access Denied” page, not the product data. No cURL error fires because from cURL’s perspective the request completed successfully.
Why It Happens
Sites detect bots through several signals. The most common ones that PHP cURL trips on:
- Missing or default User-Agent – cURL’s default user agent string identifies itself as curl/x.x.x. Most sites block it immediately.
- Missing Accept headers – real browsers send a full set of headers on every request. A bare cURL request with only a URL is immediately suspicious.
- No cookies – real browsers accumulate cookies across a session. Requests with zero cookies on every hit look automated.
- Request rate – hitting pages faster than any human could read them triggers rate limiting.
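The first three signals are addressed by the fixes below. The request-rate signal just needs a pause between page fetches – a randomized delay looks less mechanical than a fixed sleep(1). A minimal sketch (the function name and the 1–4 second range are arbitrary choices, not requirements):
<?php
// Call between requests inside your scraping loop — a randomized pause
// breaks the perfectly regular timing pattern that rate limiters look for.
function polite_delay(int $minMs = 1000, int $maxMs = 4000): void {
    usleep(random_int($minMs, $maxMs) * 1000); // usleep() takes microseconds
}

// Usage:
// $html = scrape_url($url);
// polite_delay();
?>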
Fix 1: Add Proper Headers
<?php
function scrape_url($url) {
$ch = curl_init();
curl_setopt_array($ch, [
CURLOPT_URL => $url,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_CONNECTTIMEOUT => 10,
CURLOPT_TIMEOUT => 30,
CURLOPT_ENCODING => '',
CURLOPT_HTTPHEADER => [
'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Language: en-US,en;q=0.5',
// Accept-Encoding is negotiated and decoded automatically by CURLOPT_ENCODING above
'Connection: keep-alive',
'Upgrade-Insecure-Requests: 1',
],
]);
$response = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
if ($httpCode === 403) {
echo "Blocked (403) on: $url" . PHP_EOL;
return false;
}
if ($httpCode === 200) {
return $response;
}
echo "Unexpected HTTP $httpCode on: $url" . PHP_EOL;
return false;
}
$html = scrape_url("https://books.toscrape.com/");
if ($html) {
echo "Fetched successfully. Length: " . strlen($html) . " bytes." . PHP_EOL;
}
?>
Output:
Fetched successfully. Length: 51274 bytes.
Fix 2: Detect What You Actually Got
A 200 status doesn’t guarantee you got the right page. Some sites return 200 with a CAPTCHA or bot-check page instead of the content you requested. Check the response body too:
<?php
function is_blocked_response($html) {
$blockSignals = [
'captcha',
'robot',
'unusual traffic',
'access denied',
'please verify',
'security check',
'cloudflare',
'ray id',
];
$lowerHtml = strtolower($html);
foreach ($blockSignals as $signal) {
if (strpos($lowerHtml, $signal) !== false) {
return $signal; // return what triggered it
}
}
return false;
}
$html = scrape_url("https://books.toscrape.com/");
if ($html) {
$blocked = is_blocked_response($html);
if ($blocked) {
echo "Got 200 but response looks like a block page. Trigger: '$blocked'" . PHP_EOL;
} else {
echo "Response looks clean. Proceeding." . PHP_EOL;
}
}
?>
Output on clean response:
Response looks clean. Proceeding.
Output when site returns a CAPTCHA with 200 status:
Got 200 but response looks like a block page. Trigger: 'captcha'
Fix 3: Rotate User Agents
Sending the same user agent string on every request is a detectable pattern. Rotate through real browser strings:
<?php
function get_random_user_agent() {
$agents = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
];
return $agents[array_rand($agents)];
}
// Use inside your request
curl_setopt($ch, CURLOPT_HTTPHEADER, [
'User-Agent: ' . get_random_user_agent(),
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language: en-US,en;q=0.5',
]);
?>
Fix 4: Add a Referer Header
Some sites check whether requests come from a known page. Adding a Referer header that looks like you navigated from the site’s own homepage helps bypass these checks:
<?php
curl_setopt($ch, CURLOPT_HTTPHEADER, [
'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
'Referer: https://books.toscrape.com/',
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
]);
?>
Fix 5: Enable Cookie Handling
Real browsers carry cookies from page to page. Enabling a cookie jar makes your requests look more like a real browsing session:
<?php
$cookieFile = __DIR__ . '/cookies.txt';
curl_setopt_array($ch, [
CURLOPT_COOKIEFILE => $cookieFile,
CURLOPT_COOKIEJAR => $cookieFile,
]);
?>
What to Do When You’re Still Getting Blocked
If you’ve added proper headers, cookies, and delays and are still getting blocked, the site is running more sophisticated detection – TLS fingerprinting, JavaScript challenges, or behavioral analysis. At that point cURL alone won’t cut it. Your options are:
- Check for an API – open browser DevTools → Network tab → filter by XHR/Fetch. Many sites that block HTML scraping have a public or semi-public API their JavaScript calls. Hitting the API directly is faster and more reliable than scraping HTML.
- Check for a mobile version – mobile sites often have simpler HTML and less aggressive bot detection; sometimes just sending a mobile User-Agent is enough to get it (see the sketch after this list).
- Use a headless browser – Puppeteer or Playwright execute JavaScript and mimic real browser behavior. More complex to set up but handles sites that cURL cannot.
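On the mobile-version point, the experiment costs a single request – all that changes is the User-Agent. Whether the site actually serves different markup is site-specific, so treat this sketch (against a hypothetical URL) as a quick test, not a guaranteed fix:
<?php
$ch = curl_init();
curl_setopt_array($ch, [
    CURLOPT_URL            => "https://example.com/products", // hypothetical target
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_HTTPHEADER     => [
        // An iPhone Safari User-Agent — some sites return simpler mobile HTML for it
        'User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 17_4 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Mobile/15E148 Safari/604.1',
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    ],
]);
$html = curl_exec($ch);
curl_close($ch);
echo "Mobile response length: " . strlen($html) . " bytes" . PHP_EOL;
?>
Compare the saved response against the desktop version – if the data is there and the bot checks are gone, scrape the mobile pages instead.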
Error 3: Empty Response From JavaScript-Rendered Pages
Your request returns a 200 status, the response has content, but the data you’re trying to scrape isn’t there. The page looks fine in a browser but your XPath query returns nothing. This isn’t a selector problem – the data simply doesn’t exist in what cURL fetched.
What It Looks Like
<?php
$html = scrape_url("https://example-js-site.com/products");
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
$prices = $xpath->query('//span[@class="price"]');
echo "Prices found: " . $prices->length . PHP_EOL;
?>
Output:
Prices found: 0
Open the page in a browser and the prices are right there. But cURL fetched the page before JavaScript ran – so the price elements don’t exist yet in the raw HTML.
How to Confirm It’s a JavaScript Problem
Save the raw cURL response and open it in a browser. If the data is missing from the saved file, JavaScript is responsible:
<?php
$html = scrape_url("https://example-js-site.com/products");
file_put_contents(__DIR__ . '/debug_page.html', $html);
echo "Saved raw HTML - open debug_page.html in browser." . PHP_EOL;
// Also check if the page has obvious JS framework markers
$jsMarkers = [
'ng-app', // Angular
'data-reactroot', // React
'__NEXT_DATA__', // Next.js
'nuxt', // Nuxt.js
'vue', // Vue.js
'window.__data', // generic JS data injection
];
foreach ($jsMarkers as $marker) {
if (strpos($html, $marker) !== false) {
echo "JS framework detected: '$marker'" . PHP_EOL;
}
}
?>
Output on a JavaScript-rendered site:
Saved raw HTML - open debug_page.html in browser.
JS framework detected: '__NEXT_DATA__'
Fix 1: Find the Hidden API Call
JavaScript-rendered sites load their data from an API endpoint that the browser calls after the page loads. This API is often easier to scrape than the HTML and returns clean JSON.
To find it: open Chrome DevTools → Network tab → reload the page → filter by Fetch/XHR. Look for requests returning JSON with the data you need. Copy that URL and hit it directly with cURL:
<?php
// Instead of scraping the HTML page, call the API directly
$apiUrl = "https://example-js-site.com/api/products?page=1&limit=20";
$ch = curl_init();
curl_setopt_array($ch, [
CURLOPT_URL => $apiUrl,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_HTTPHEADER => [
'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
'Accept: application/json',
'X-Requested-With: XMLHttpRequest',
'Referer: https://example-js-site.com/products',
],
]);
$response = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
if ($httpCode === 200) {
$data = json_decode($response, true);
$products = $data['products'] ?? [];
foreach ($products as $product) {
echo $product['name'] . " - " . $product['price'] . PHP_EOL;
}
} else {
echo "API request failed: HTTP $httpCode" . PHP_EOL;
}
?>
Output:
Laptop Stand Pro - $49.99
Mechanical Keyboard - $89.99
USB-C Hub - $34.99
Hitting the API directly is faster, more reliable, and less likely to break when the site redesigns its frontend. Always check for an API before reaching for a headless browser.
Fix 2: Extract Inline JSON From the Page
Many JavaScript frameworks embed the initial page data as JSON inside a <script> tag. That JSON is part of the HTML the server sends – meaning cURL can read it even though the rendered elements don’t exist yet in the raw markup.
<?php
$html = scrape_url("https://example-js-site.com/products");
// Look for JSON embedded in script tags
// Next.js sites commonly use __NEXT_DATA__
if (preg_match('/<script id="__NEXT_DATA__"[^>]*>(.*?)<\/script>/s', $html, $matches)) {
$jsonData = json_decode($matches[1], true);
// Navigate the JSON structure to find your data
$products = $jsonData['props']['pageProps']['products'] ?? [];
foreach ($products as $product) {
echo $product['name'] . " - $" . $product['price'] . PHP_EOL;
}
} else {
echo "No inline JSON found." . PHP_EOL;
}
?>
Output when data is found:
Laptop Stand Pro - $49.99
Mechanical Keyboard - $89.99
USB-C Hub - $34.99
Output when not found:
No inline JSON found.
This works on Next.js, Nuxt.js, and many React sites that do server-side rendering. Check the raw HTML source for window.__INITIAL_STATE__, window.__data, or any <script> tag containing a large JSON object – these are all inline data patterns worth checking before using a headless browser.
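The extraction pattern is the same for those globals – only the marker changes. A sketch for window.__INITIAL_STATE__, again against a hypothetical URL, with the JSON keys left for you to inspect:
<?php
$html = scrape_url("https://example-js-site.com/products"); // hypothetical URL

// Many SPAs assign their bootstrap data to a global before any other script runs
if (preg_match('/window\.__INITIAL_STATE__\s*=\s*(\{.*?\})\s*;?\s*<\/script>/s', $html, $matches)) {
    $state = json_decode($matches[1], true);
    if (json_last_error() === JSON_ERROR_NONE) {
        // Inspect the top-level keys first, then drill down to the data you need
        print_r(array_keys($state));
    } else {
        echo "Found the assignment but the JSON did not decode cleanly." . PHP_EOL;
    }
} else {
    echo "No window.__INITIAL_STATE__ found in the raw HTML." . PHP_EOL;
}
?>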
Fix 3: Use a Headless Browser as a Last Resort
If there’s no API and no inline JSON, you need a tool that actually executes JavaScript. Puppeteer is the most widely used option – it’s a Node.js library that controls a real Chrome browser programmatically.
Install it and run it alongside your PHP script:
npm install puppeteer
Create a Node.js script that fetches the rendered HTML and saves it to a file:
// scrape.js
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example-js-site.com/products', {
waitUntil: 'networkidle2', // wait until JS finishes loading
});
const html = await page.content();
require('fs').writeFileSync('rendered_page.html', html);
console.log('Saved rendered HTML.');
await browser.close();
})();
Then call it from PHP and parse the saved file:
<?php
// Run the Node.js script to get rendered HTML
shell_exec('node scrape.js');
// Now parse the fully rendered HTML
$html = file_get_contents(__DIR__ . '/rendered_page.html');
if (!$html) {
exit("Rendered HTML not found." . PHP_EOL);
}
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
$prices = $xpath->query('//span[@class="price"]');
echo "Prices found: " . $prices->length . PHP_EOL;
foreach ($prices as $price) {
echo trim($price->textContent) . PHP_EOL;
}
?>
Output:
Prices found: 3
$49.99
$89.99
$34.99
Choosing the Right Approach
Work through these options in order – each one is simpler and faster than the next:
- Hidden API – check DevTools Network tab first. If the data comes from an API call, use that. Fastest option, least likely to break.
- Inline JSON – check the HTML source for embedded JSON in script tags. Works on most server-side rendered JavaScript frameworks.
- Headless browser – only if the above two fail. More complex, slower, higher resource usage, but handles any JavaScript-rendered page.
Error 4: Connection and Timeout Errors Inside a Scraping Loop
A scraper that works perfectly on a single URL often breaks when you run it across hundreds of pages. Connection errors, timeouts, and server hiccups that you’d never notice on one request become inevitable at scale. Without handling them, one failed request crashes the entire job.
What It Looks Like
<?php
$urls = [
"https://books.toscrape.com/catalogue/page-1.html",
"https://books.toscrape.com/catalogue/page-2.html",
"https://books.toscrape.com/catalogue/page-3.html",
];
foreach ($urls as $url) {
$ch = curl_init();
curl_setopt_array($ch, [
CURLOPT_URL => $url,
CURLOPT_RETURNTRANSFER => true,
]);
$response = curl_exec($ch);
curl_close($ch);
// No error checking - if one fails, $response is false
// and the script either crashes or silently saves nothing
$dom = new DOMDocument();
$dom->loadHTML($response); // crashes here if $response is false
}
?>
Output when one request fails:
Warning: DOMDocument::loadHTML(): Empty string supplied as input in scraper.php on line 18
Everything after the failed URL gets skipped. If the failure happens on page 3 of 50, you lose pages 3 through 50 silently.
The Fix: Check Every Request Before Using the Response
<?php
function scrape_url($url) {
$ch = curl_init();
curl_setopt_array($ch, [
CURLOPT_URL => $url,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_CONNECTTIMEOUT => 10,
CURLOPT_TIMEOUT => 30,
CURLOPT_LOW_SPEED_LIMIT => 500,
CURLOPT_LOW_SPEED_TIME => 10,
CURLOPT_HTTPHEADER => [
'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
],
]);
$response = curl_exec($ch);
$errno = curl_errno($ch);
$error = curl_error($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
if ($errno) {
echo "cURL error $errno on $url: $error" . PHP_EOL;
return false;
}
if ($httpCode !== 200) {
echo "HTTP $httpCode on $url" . PHP_EOL;
return false;
}
return $response;
}
$urls = [
"https://books.toscrape.com/catalogue/page-1.html",
"https://books.toscrape.com/catalogue/page-2.html",
"https://books.toscrape.com/catalogue/page-3.html",
];
$failed = [];
$success = 0;
foreach ($urls as $url) {
$html = scrape_url($url);
if (!$html) {
$failed[] = $url;
continue; // skip to next URL instead of crashing
}
// Safe to parse - we know $html is valid content
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
$books = $xpath->query('//article[contains(@class,"product_pod")]');
$success++;
echo "Page scraped: " . $books->length . " books found." . PHP_EOL;
sleep(1);
}
echo PHP_EOL . "Done. Success: $success | Failed: " . count($failed) . PHP_EOL;
if (!empty($failed)) {
echo "Failed URLs:" . PHP_EOL;
foreach ($failed as $url) {
echo " - $url" . PHP_EOL;
}
}
?>
Output when all succeed:
Page scraped: 20 books found.
Page scraped: 20 books found.
Page scraped: 20 books found.
Done. Success: 3 | Failed: 0
Output when one fails:
Page scraped: 20 books found.
cURL error 28 on https://books.toscrape.com/catalogue/page-2.html: Operation timed out after 30000 milliseconds
Page scraped: 20 books found.
Done. Success: 2 | Failed: 1
Failed URLs:
- https://books.toscrape.com/catalogue/page-2.html
Adding Retry Logic to the Loop
Collecting failed URLs is only useful if you do something with them. Run a second pass on failures with a longer timeout before giving up:
<?php
function scrape_with_retry($url, $maxRetries = 3, $timeout = 30) {
$attempt = 0;
while ($attempt < $maxRetries) {
$attempt++;
$ch = curl_init();
curl_setopt_array($ch, [
CURLOPT_URL => $url,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_CONNECTTIMEOUT => 10,
CURLOPT_TIMEOUT => $timeout,
CURLOPT_HTTPHEADER => [
'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
],
]);
$response = curl_exec($ch);
$errno = curl_errno($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
if ($errno === 0 && $httpCode === 200) {
return $response;
}
if ($errno === 28 || $httpCode >= 500) {
echo "Attempt $attempt failed on $url. Retrying in " . ($attempt * 2) . "s..." . PHP_EOL;
sleep($attempt * 2);
continue;
}
// Permanent failure - no point retrying
return false;
}
return false;
}
// First pass - normal timeout
$urls = [
"https://books.toscrape.com/catalogue/page-1.html",
"https://books.toscrape.com/catalogue/page-2.html",
"https://books.toscrape.com/catalogue/page-3.html",
];
$failed = [];
$results = [];
foreach ($urls as $url) {
$html = scrape_with_retry($url, 3, 30);
if ($html) {
$results[$url] = $html;
} else {
$failed[] = $url;
}
sleep(1);
}
// Second pass - extended timeout on failed URLs
if (!empty($failed)) {
echo PHP_EOL . "Running second pass on " . count($failed) . " failed URLs..." . PHP_EOL;
foreach ($failed as $url) {
$html = scrape_with_retry($url, 2, 60);
if ($html) {
$results[$url] = $html;
echo "Recovered: $url" . PHP_EOL;
} else {
// Log permanently failed URLs
file_put_contents(
__DIR__ . '/failed_urls.log',
date('Y-m-d H:i:s') . " | " . $url . PHP_EOL,
FILE_APPEND
);
echo "Permanently failed - logged: $url" . PHP_EOL;
}
sleep(2);
}
}
echo PHP_EOL . "Total pages collected: " . count($results) . PHP_EOL;
?>
Output when second pass recovers a failed URL:
Attempt 1 failed on https://books.toscrape.com/catalogue/page-2.html. Retrying in 2s...
Attempt 2 failed on https://books.toscrape.com/catalogue/page-2.html. Retrying in 4s...
Running second pass on 1 failed URLs...
Recovered: https://books.toscrape.com/catalogue/page-2.html
Total pages collected: 3
Preventing PHP From Killing Long-Running Scrapers
PHP’s default execution time limit is 30 seconds when running under a web server (the CLI has no limit by default). A scraper hitting 100 pages with 1-second delays needs at least 100 seconds to finish – PHP kills it before it gets there:
<?php
// Add at the top of any long-running scraper script
set_time_limit(0); // no PHP execution time limit
ini_set('memory_limit', '256M'); // enough memory for large jobs
// Also catch fatal errors that would otherwise kill the script silently
register_shutdown_function(function() {
$error = error_get_last();
if ($error && in_array($error['type'], [E_ERROR, E_PARSE, E_CORE_ERROR])) {
file_put_contents(
__DIR__ . '/scraper_errors.log',
date('Y-m-d H:i:s') . " FATAL: " . $error['message'] . PHP_EOL,
FILE_APPEND
);
}
});
?>
set_time_limit(0) removes the execution time cap for the current script only – it doesn’t change your PHP configuration globally. Add it to every scraper script that runs more than a handful of requests.
Error 5: Duplicate Data in the Database
You run your scraper twice – maybe to pick up new listings, maybe after fixing a bug – and end up with every record duplicated. Or you run it daily and after a week the same 100 products appear 700 times. The scraper works correctly, but nothing prevents it from inserting the same data repeatedly.
What It Looks Like
<?php
// Naive insert — no duplicate checking
function save_book($pdo, $title, $price) {
$stmt = $pdo->prepare("INSERT INTO books (title, price) VALUES (:title, :price)");
$stmt->execute([':title' => $title, ':price' => $price]);
}
// Run this twice and you get two identical rows every time
save_book($pdo, "A Light in the Attic", "£51.77");
save_book($pdo, "A Light in the Attic", "£51.77");
?>
Check the database after two runs:
SELECT * FROM books WHERE title = 'A Light in the Attic';
+----+----------------------+--------+
| id | title | price |
+----+----------------------+--------+
| 1 | A Light in the Attic | £51.77 |
| 2 | A Light in the Attic | £51.77 |
+----+----------------------+--------+
Two identical rows. Run the scraper ten times and you have ten copies of every book.
Fix 1: Add a UNIQUE Constraint to the Table
The most reliable fix is enforcing uniqueness at the database level. Add a UNIQUE constraint on the column – or combination of columns – that identifies a unique record:
-- Add to your CREATE TABLE statement
CREATE TABLE books (
id INT AUTO_INCREMENT PRIMARY KEY,
title VARCHAR(255) NOT NULL,
price VARCHAR(20),
rating VARCHAR(20),
url VARCHAR(500),
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
UNIQUE KEY unique_title (title)
);
-- Or add to an existing table
ALTER TABLE books ADD UNIQUE KEY unique_title (title);
Now any attempt to insert a duplicate title throws an error instead of creating a duplicate row. But you don’t want your script to crash on that error – you want it to update the existing record instead.
Fix 2: INSERT … ON DUPLICATE KEY UPDATE
This MySQL statement inserts a new row if the unique key doesn’t exist, or updates the existing row if it does. One query handles both cases:
<?php
function save_book($pdo, $title, $price, $rating, $url) {
$sql = "INSERT INTO books (title, price, rating, url)
VALUES (:title, :price, :rating, :url)
ON DUPLICATE KEY UPDATE
price = VALUES(price),
rating = VALUES(rating),
scraped_at = CURRENT_TIMESTAMP";
try {
$stmt = $pdo->prepare($sql);
$stmt->execute([
':title' => $title,
':price' => $price,
':rating' => $rating,
':url' => $url,
]);
// rowCount() returns 1 for insert, 2 for update, 0 for no change
return $stmt->rowCount();
} catch (PDOException $e) {
echo "Database error: " . $e->getMessage() . PHP_EOL;
return false;
}
}
// Usage
$pdo = new PDO("mysql:host=localhost;dbname=scraper;charset=utf8mb4", $user, $pass, [
PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);
$result = save_book($pdo, "A Light in the Attic", "£51.77", "One", "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/");
if ($result === 1) {
echo "New record inserted." . PHP_EOL;
} elseif ($result === 2) {
echo "Existing record updated." . PHP_EOL;
} elseif ($result === 0) {
echo "No change — data is identical." . PHP_EOL;
}
?>
Output on first run:
New record inserted.
Output on second run with same data:
No change — data is identical.
Output on second run with updated price:
Existing record updated.
Fix 3: Track Inserts and Updates Across a Full Scrape
On large jobs it’s useful to know how many records were new vs updated at the end of the run:
<?php
$inserted = 0;
$updated = 0;
$unchanged = 0;
$books = [
['title' => 'A Light in the Attic', 'price' => '£51.77', 'rating' => 'One', 'url' => 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/'],
['title' => 'Tipping the Velvet', 'price' => '£53.74', 'rating' => 'One', 'url' => 'https://books.toscrape.com/catalogue/tipping-the-velvet_999/'],
['title' => 'Soumission', 'price' => '£50.10', 'rating' => 'One', 'url' => 'https://books.toscrape.com/catalogue/soumission_998/'],
];
foreach ($books as $book) {
$result = save_book($pdo, $book['title'], $book['price'], $book['rating'], $book['url']);
if ($result === 1) $inserted++;
elseif ($result === 2) $updated++;
elseif ($result === 0) $unchanged++;
}
echo PHP_EOL . "Scrape complete." . PHP_EOL;
echo "Inserted: $inserted" . PHP_EOL;
echo "Updated: $updated" . PHP_EOL;
echo "Unchanged: $unchanged" . PHP_EOL;
?>
Output on first run:
Scrape complete.
Inserted: 3
Updated: 0
Unchanged: 0
Output on second run when prices have changed:
Scrape complete.
Inserted: 0
Updated: 2
Unchanged: 1
Fix 4: Handling Composite Unique Keys
Sometimes a single column isn’t enough to identify a unique record. A product might appear on multiple category pages with the same title but a different URL. Use a composite unique key in that case:
-- Unique on combination of title AND url
ALTER TABLE books ADD UNIQUE KEY unique_title_url (title, url);
<?php
// Now two books with the same title but different URLs are both stored
$sql = "INSERT INTO books (title, price, url)
VALUES (:title, :price, :url)
ON DUPLICATE KEY UPDATE
price = VALUES(price),
scraped_at = CURRENT_TIMESTAMP";
?>
Checking for Duplicates Before Inserting
If you can’t modify the table schema – for example, on an existing production database – check for the record manually before inserting:
<?php
function book_exists($pdo, $title) {
$stmt = $pdo->prepare("SELECT COUNT(*) FROM books WHERE title = :title");
$stmt->execute([':title' => $title]);
return (int) $stmt->fetchColumn() > 0;
}
function save_book_safe($pdo, $title, $price, $rating, $url) {
if (book_exists($pdo, $title)) {
// Update instead of insert
$stmt = $pdo->prepare("
UPDATE books
SET price = :price, rating = :rating, scraped_at = CURRENT_TIMESTAMP
WHERE title = :title
");
$stmt->execute([':price' => $price, ':rating' => $rating, ':title' => $title]);
echo "Updated: $title" . PHP_EOL;
} else {
// Insert new record
$stmt = $pdo->prepare("
INSERT INTO books (title, price, rating, url)
VALUES (:title, :price, :rating, :url)
");
$stmt->execute([':title' => $title, ':price' => $price, ':rating' => $rating, ':url' => $url]);
echo "Inserted: $title" . PHP_EOL;
}
}
save_book_safe($pdo, "A Light in the Attic", "£51.77", "One", "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/");
?>
Output on first run:
Inserted: A Light in the Attic
Output on second run:
Updated: A Light in the Attic
The manual check approach works but runs two queries per record instead of one. On large datasets the ON DUPLICATE KEY UPDATE approach is faster – use that whenever you control the schema.
Error 6: Silent Failures — When Your Scraper Stops Without Telling You
Silent failures are the most damaging scraping errors because you don’t know they happened. The script runs, finishes without a crash, and you assume the data is there. It isn’t. Somewhere in the middle the scraper hit an error, swallowed it, and kept going – saving nothing for the pages that failed.
What It Looks Like
<?php
// Error suppression — the @ operator hides all warnings and errors
$dom = new DOMDocument();
@$dom->loadHTML($response); // if $response is false, no warning shown
$xpath = new DOMXPath($dom);
$books = $xpath->query('//article[contains(@class,"product_pod")]');
foreach ($books as $book) {
// If $book processing throws an error, @ suppresses it
@save_book($pdo, $book['title'], $book['price']);
}
echo "Scrape complete." . PHP_EOL;
?>
Output:
Scrape complete.
Looks fine. But if $response was false on three pages, those pages were silently skipped. If the database connection dropped mid-job, every insert after that point failed without a trace. The script always says “Scrape complete” regardless of what actually happened.
Fix 1: Never Use the @ Error Suppression Operator
The @ operator before a function call suppresses all errors and warnings from that call. It’s sometimes used to silence DOMDocument warnings on messy HTML – but the correct fix is libxml_use_internal_errors(true), not suppression:
<?php
// Wrong — suppresses all errors including ones you need to see
@$dom->loadHTML($response);
// Right — captures libxml errors internally without hiding PHP errors
libxml_use_internal_errors(true);
$dom->loadHTML($response);
libxml_clear_errors();
?>
Fix 2: Set Up Error Logging at the Start of Every Scraper
<?php
// Add these lines at the top of every scraper script
set_time_limit(0);
ini_set('memory_limit', '256M');
ini_set('log_errors', 1);
ini_set('error_log', __DIR__ . '/scraper_errors.log');
// Catch fatal errors that bypass normal error handling
register_shutdown_function(function() {
$error = error_get_last();
if ($error && in_array($error['type'], [E_ERROR, E_PARSE, E_CORE_ERROR, E_COMPILE_ERROR])) {
$message = sprintf(
"[%s] FATAL ERROR: %s in %s on line %d",
date('Y-m-d H:i:s'),
$error['message'],
$error['file'],
$error['line']
);
file_put_contents(__DIR__ . '/scraper_errors.log', $message . PHP_EOL, FILE_APPEND);
echo $message . PHP_EOL;
}
});
echo "Scraper started: " . date('Y-m-d H:i:s') . PHP_EOL;
?>
Fix 3: Log Every Meaningful Event
Don’t just log errors – log what the scraper is doing so you can reconstruct exactly what happened when something goes wrong:
<?php
function log_event($message, $level = 'INFO') {
$entry = sprintf(
"[%s] [%s] %s",
date('Y-m-d H:i:s'),
$level,
$message
);
echo $entry . PHP_EOL;
file_put_contents(__DIR__ . '/scraper.log', $entry . PHP_EOL, FILE_APPEND);
}
// Use throughout your scraper
log_event("Starting scrape of books.toscrape.com");
log_event("Fetching page 1");
$html = scrape_with_retry("https://books.toscrape.com/", 3, 2);
if (!$html) {
log_event("Failed to fetch page 1 after 3 attempts", 'ERROR');
} else {
log_event("Page 1 fetched successfully — " . strlen($html) . " bytes");
}
?>
Output and log file contents:
[2026-05-01 10:22:11] [INFO] Starting scrape of books.toscrape.com
[2026-05-01 10:22:11] [INFO] Fetching page 1
[2026-05-01 10:22:12] [INFO] Page 1 fetched successfully — 51274 bytes
Fix 4: Validate Data Before Saving It
Saving empty or malformed data silently is a common form of silent failure. A page that returned a bot-check response instead of real content will have no books – but without validation your script saves nothing and moves on without flagging it:
<?php
function validate_book($title, $price, $url) {
$errors = [];
if (empty($title) || $title === 'N/A') {
$errors[] = "Missing title";
}
if (empty($price) || $price === 'N/A') {
$errors[] = "Missing price";
}
if (!filter_var($url, FILTER_VALIDATE_URL)) {
$errors[] = "Invalid URL: $url";
}
return $errors;
}
// Use before every insert
$title = "A Light in the Attic";
$price = "£51.77";
$url = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/";
$errors = validate_book($title, $price, $url);
if (!empty($errors)) {
log_event("Validation failed for '$title': " . implode(', ', $errors), 'WARNING');
} else {
save_book($pdo, $title, $price, 'One', $url);
log_event("Saved: $title");
}
?>
Output on valid data:
[2026-05-01 10:22:14] [INFO] Saved: A Light in the Attic
Output when page returned bot-check response:
[2026-05-01 10:22:14] [WARNING] Validation failed for 'N/A': Missing title, Missing price
Fix 5: Add a Completion Summary
Every scraper should print a summary at the end – not just “complete” but actual numbers that confirm the job did what it was supposed to:
<?php
$startTime = microtime(true);
$stats = [
'pages_attempted' => 0,
'pages_success' => 0,
'pages_failed' => 0,
'records_inserted'=> 0,
'records_updated' => 0,
'records_skipped' => 0,
];
// ... scraping loop runs here, updating $stats throughout ...
$duration = round(microtime(true) - $startTime, 2);
$summary = "
========================================
Scrape Complete: " . date('Y-m-d H:i:s') . "
Duration: {$duration}s
Pages attempted: {$stats['pages_attempted']}
Pages succeeded: {$stats['pages_success']}
Pages failed: {$stats['pages_failed']}
Records inserted: {$stats['records_inserted']}
Records updated: {$stats['records_updated']}
Records skipped: {$stats['records_skipped']}
========================================";
echo $summary . PHP_EOL;
log_event($summary);
?>
Output:
========================================
Scrape Complete: 2026-05-01 10:45:33
Duration: 127.4s
Pages attempted: 50
Pages succeeded: 49
Pages failed: 1
Records inserted: 960
Records updated: 20
Records skipped: 0
========================================
One failed page out of 50 is immediately visible. Without this summary you’d have no idea unless you counted database rows manually. When the numbers don’t add up – pages succeeded times records per page should roughly equal records inserted plus updated – you know something went wrong and where to start looking.
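That sanity check can live in code too. A rough sketch, assuming the $stats array from the example above and the roughly 20 books per catalogue page that books.toscrape.com serves:
<?php
// Rough consistency check — the numbers never match exactly, but a large gap is a red flag
$expectedRecords = $stats['pages_success'] * 20; // ~20 books per catalogue page
$actualRecords   = $stats['records_inserted'] + $stats['records_updated'] + $stats['records_skipped'];

if ($actualRecords < $expectedRecords * 0.9) {
    log_event("Record count looks low: expected ~$expectedRecords, got $actualRecords", 'WARNING');
}
?>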
Error 7: Memory Exhaustion on Large Scrapes
Your scraper works perfectly on 10 pages. On 500 pages it crashes halfway through with a fatal error. Nothing changed in your code – you just hit the memory ceiling that small test runs never exposed.
Fatal error: Allowed memory size of 134217728 bytes exhausted
(tried to allocate 20480 bytes) in scraper.php on line 47
This happens when you accumulate data in memory throughout the entire job instead of writing it out and releasing it as you go.
What It Looks Like
<?php
$allBooks = []; // grows with every page, never released
$url = "https://books.toscrape.com/";
$page = 1;
while ($url) {
$html = scrape_with_retry($url, 3, 30);
if (!$html) break;
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
$books = $xpath->query('//article[contains(@class,"product_pod")]');
foreach ($books as $book) {
$titleNode = $xpath->query('.//h3/a', $book)->item(0);
$priceNode = $xpath->query('.//*[contains(@class,"price_color")]', $book)->item(0);
// Every page adds to the same array — never written out, never freed
$allBooks[] = [
'title' => $titleNode ? $titleNode->getAttribute('title') : 'N/A',
'price' => $priceNode ? trim($priceNode->textContent) : 'N/A',
];
}
$page++;
// ... pagination logic
}
// Only writes to database at the very end
// If script crashes at page 300, all 300 pages of data are lost
foreach ($allBooks as $book) {
save_book($pdo, $book['title'], $book['price']);
}
?>
Scraping 50 pages of 20 books each means 1000 arrays sitting in memory simultaneously. Scale that to thousands of pages with larger data and the memory limit hits fast.
Fix 1: Write to Database Immediately, Free Memory After
The fix is straightforward – save each page’s data as soon as it’s scraped, then release the variables before moving to the next page:
<?php
$url = "https://books.toscrape.com/";
$page = 1;
$inserted = 0;
$updated = 0;
while ($url) {
$html = scrape_with_retry($url, 3, 30);
if (!$html) {
log_event("Failed to fetch page $page — stopping.", 'ERROR');
break;
}
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
$books = $xpath->query('//article[contains(@class,"product_pod")]');
foreach ($books as $book) {
$titleNode = $xpath->query('.//h3/a', $book)->item(0);
$priceNode = $xpath->query('.//*[contains(@class,"price_color")]', $book)->item(0);
$ratingNode = $xpath->query('.//*[contains(@class,"star-rating")]', $book)->item(0);
$linkNode = $xpath->query('.//h3/a', $book)->item(0);
$title = $titleNode ? $titleNode->getAttribute('title') : 'N/A';
$price = $priceNode ? trim($priceNode->textContent) : 'N/A';
$rating = $ratingNode ? str_replace('star-rating ', '', $ratingNode->getAttribute('class')) : 'N/A';
$link = $linkNode ? "https://books.toscrape.com/catalogue/" . ltrim($linkNode->getAttribute('href'), '../') : '';
// Save immediately instead of accumulating
$result = save_book($pdo, $title, $price, $rating, $link);
if ($result === 1) $inserted++;
if ($result === 2) $updated++;
}
log_event("Page $page done — inserted: $inserted | updated: $updated");
// Find the next page link before freeing the DOM — no need to fetch the page a second time
$nextNode = $xpath->query('//li[contains(@class,"next")]/a')->item(0);
if ($nextNode) {
// The homepage links to "catalogue/page-2.html", inner pages to "page-N.html" — normalize both
$url = "https://books.toscrape.com/catalogue/" . basename($nextNode->getAttribute('href'));
$page++;
sleep(1);
} else {
$url = null;
}
// Free memory before the next iteration
unset($html, $books, $nextNode, $xpath);
$dom = null;
}
?>
Fix 2: Monitor Memory Usage During the Run
Track memory consumption per page so you can catch growth before it becomes a crash:
<?php
function log_memory($page) {
$used = memory_get_usage(true);
$peak = memory_get_peak_usage(true);
$usedMB = round($used / 1048576, 2);
$peakMB = round($peak / 1048576, 2);
echo "Page $page — Memory: {$usedMB}MB | Peak: {$peakMB}MB" . PHP_EOL;
// Warn if approaching the limit — skip when memory_limit is -1 (unlimited)
$limit = ini_get('memory_limit');
if ($limit !== '-1') {
$limitMB = (int) $limit; // assumes the limit is expressed in MB, e.g. "256M"
if ($usedMB > $limitMB * 0.8) {
echo "WARNING: Using over 80% of memory limit ({$limitMB}MB)" . PHP_EOL;
}
}
}
// Call inside your scraping loop
log_memory($page);
?>
Output with proper memory management:
Page 1 — Memory: 8.25MB | Peak: 8.25MB
Page 10 — Memory: 8.31MB | Peak: 8.75MB
Page 25 — Memory: 8.29MB | Peak: 8.75MB
Page 50 — Memory: 8.33MB | Peak: 8.75MB
Output without memory management (accumulating array):
Page 1 — Memory: 8.25MB | Peak: 8.25MB
Page 10 — Memory: 24.50MB | Peak: 24.50MB
Page 25 — Memory: 58.75MB | Peak: 58.75MB
Page 50 — Memory: 112.4MB | Peak: 112.4MB
Memory climbing linearly with each page confirms you’re accumulating data. Flat memory usage confirms you’re releasing it correctly.
Fix 3: Force Garbage Collection on Long Jobs
PHP’s garbage collector doesn’t always free circular references immediately. On very long scraping jobs, force it manually every N pages:
<?php
// Inside your scraping loop
if ($page % 50 === 0) {
$collected = gc_collect_cycles();
log_event("Page $page — forced GC, collected $collected cycles");
}
?>
Output:
[2026-05-01 11:14:22] [INFO] Page 50 — forced GC, collected 143 cycles
[2026-05-01 11:22:47] [INFO] Page 100 — forced GC, collected 97 cycles
Fix 4: Write to CSV Instead of Database for Very Large Jobs
If you’re scraping millions of records, writing to a CSV file first and importing to the database in bulk is faster and uses less memory than per-row PDO inserts:
<?php
$csvFile = __DIR__ . '/books_' . date('Y-m-d') . '.csv';
$handle = fopen($csvFile, 'w');
// Write header row
fputcsv($handle, ['title', 'price', 'scraped_at']);
// Inside scraping loop — write each row immediately
foreach ($books as $book) {
$titleNode = $xpath->query('.//h3/a', $book)->item(0);
$priceNode = $xpath->query('.//*[contains(@class,"price_color")]', $book)->item(0);
$title = $titleNode ? $titleNode->getAttribute('title') : 'N/A';
$price = $priceNode ? trim($priceNode->textContent) : 'N/A';
fputcsv($handle, [$title, $price, date('Y-m-d H:i:s')]);
}
// Close file handle when done
fclose($handle);
echo "Data written to: $csvFile" . PHP_EOL;
?>
Output:
Data written to: /var/www/html/books_2026-05-01.csv
fputcsv() writes one row at a time directly to disk. The file handle stays open throughout the job but the data itself is never held in memory – each row is written and released immediately. Import the finished CSV to MySQL with:
LOAD DATA INFILE '/var/www/html/books_2026-05-01.csv'
INTO TABLE books
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS
(title, price, scraped_at);
This is orders of magnitude faster than thousands of individual INSERT statements for large datasets.
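If you’d rather trigger that import from PHP than from the MySQL shell, the LOCAL variant works through PDO. A sketch, assuming local_infile is enabled on both the server and the client (it is disabled by default on many installations) and reusing the $csvFile, $user, and $pass from the earlier examples:
<?php
$pdo = new PDO("mysql:host=localhost;dbname=scraper;charset=utf8mb4", $user, $pass, [
    PDO::ATTR_ERRMODE            => PDO::ERRMODE_EXCEPTION,
    PDO::MYSQL_ATTR_LOCAL_INFILE => true, // required for LOAD DATA LOCAL INFILE
]);

$rows = $pdo->exec(
    "LOAD DATA LOCAL INFILE " . $pdo->quote($csvFile) . "
     INTO TABLE books
     FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
     LINES TERMINATED BY '\\n'
     IGNORE 1 ROWS
     (title, price, scraped_at)"
);
echo "Imported $rows rows from $csvFile" . PHP_EOL;
?>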
Quick Reference: Web Scraping Errors and Fixes
| Symptom | Likely Cause | Fix |
|---|---|---|
| XPath returns 0 results, no error | Wrong selector or class name | Dump raw HTML, inspect actual attributes, use contains(@class) |
| HTTP 403 or empty response | Missing headers, no user agent | Add full headers, rotate user agents, enable cookie jar |
| 200 status but data missing | JavaScript-rendered content | Find hidden API, extract inline JSON, or use headless browser |
| cURL error 28 | Timeout — connection, transfer, or stall | Set CURLOPT_CONNECTTIMEOUT + CURLOPT_TIMEOUT, add retry logic |
| Duplicate rows in database | No uniqueness enforcement | Add UNIQUE constraint, use ON DUPLICATE KEY UPDATE |
| Script finishes but data is missing | Silent failures, error suppression | Remove @ operator, add logging and completion summary |
| Fatal memory exhausted error | Accumulating data in memory | Write to database immediately, unset() after each page |
Frequently Asked Questions
Why does my web scraper return empty results with no error?
Two causes cover most cases. Either your XPath or CSS selector doesn’t match the actual HTML – use contains(@class) instead of exact class matching and dump the raw HTML to verify the structure. Or the content is loaded by JavaScript after the initial page load – cURL only fetches the raw HTML before JavaScript runs, so the data simply isn’t there yet. Save the raw response to a file and open it in a browser to confirm which one you’re dealing with.
Why is my scraper getting blocked even with a user agent set?
User agent alone isn’t enough. Modern bot detection looks at the full request signature – missing Accept headers, no cookies, identical requests firing at perfectly regular intervals, and requests that never follow the natural browsing pattern of visiting a homepage before deeper pages. Add the full header set, enable a cookie jar, add random delays between requests, and add a Referer header pointing to the site’s own domain.
What is the difference between a 403 and an empty response?
A 403 means the server received your request and explicitly refused it – you’re identified as a bot and blocked. An empty response usually means cURL got something back but it wasn’t HTML content, or the connection dropped before any data transferred. Check curl_getinfo($ch, CURLINFO_HTTP_CODE) for the status code and strlen($response) for the response size – these two numbers together tell you which problem you have.
How do I scrape a website that uses JavaScript to load data?
Work through three options in order. First, check Chrome DevTools Network tab for API calls the JavaScript makes – hitting that API directly with cURL is the fastest solution. Second, check the raw HTML source for embedded JSON in script tags – frameworks like Next.js and Nuxt.js inject initial data this way. Third, if neither works, use a headless browser like Puppeteer that executes JavaScript before handing you the rendered HTML.
How do I stop my scraper from saving duplicate records?
Add a UNIQUE constraint to the column that identifies a unique record, then use INSERT ... ON DUPLICATE KEY UPDATE instead of a plain INSERT. This handles both new records and updates in a single query without checking for existence first. If you can’t modify the schema, check for the record manually before inserting – but the UNIQUE constraint approach is faster and more reliable.
Why does my scraper crash halfway through with no error message?
Most likely PHP’s max_execution_time killed it, or the memory limit was reached and the error went unlogged. Add set_time_limit(0) and ini_set('memory_limit', '256M') at the top of the script. Also add a register_shutdown_function() that logs fatal errors to a file – PHP’s shutdown function fires even when the script is killed by a fatal error, so you’ll always get a log entry explaining what happened.
How much memory should a PHP scraper use?
On a well-written scraper that writes data immediately and frees variables after each page, memory usage should stay roughly flat throughout the entire job – typically 8-15MB regardless of how many pages you scrape. If memory climbs linearly with each page, you’re accumulating data in an array instead of releasing it. Use memory_get_usage(true) inside your loop to monitor it in real time.
Summary
Most PHP web scraping errors come down to the same root causes – not checking what you actually got, not handling failures explicitly, and not releasing resources as you go. The fixes aren’t complicated but they need to be in place from the start, not retrofitted after something breaks in production.
The seven errors covered here in order of how often they appear:
- Wrong selectors – dump the raw HTML and verify before writing any XPath
- Getting blocked – add full headers, cookies, and random delays from day one
- JavaScript-rendered content – check for a hidden API before reaching for a headless browser
- Timeout and connection errors – always set both CURLOPT_CONNECTTIMEOUT and CURLOPT_TIMEOUT, always add retry logic
- Duplicate data – enforce uniqueness at the database level, not in PHP
- Silent failures – log everything, never use the @ operator, always print a completion summary
- Memory exhaustion – write and release per page, never accumulate in a single array
For the complete PHP cURL scraping setup that puts all of these error-handling patterns together – including pagination, MySQL storage, rate limiting, and cookie handling – read the PHP cURL web scraping complete guide. If timeouts are your specific problem, the PHP cURL timeout guide covers error code 28, curl_getinfo() timing breakdown, and retry logic in detail.
Next Step
Continue learning by reading our PHP cURL scraping guide.
New to scraping? Start with our PHP web scraper guide.
Build projects like PHP price tracker.
Automate scripts using PHP cron job automation.
See this real PHP scraping example to understand better.
