Web scraping sounds complicated until you build your first scraper. At its core a scraper does three things: fetch a web page, find the data you want inside the HTML, and do something useful with it — save it to a file, store it in a database, or display it on screen.
This guide builds a complete working PHP web scraper from scratch. By the end you’ll have a scraper that fetches real pages, extracts structured data, and saves it to a CSV file — all with code you can run immediately and adapt for your own projects.
No prior scraping experience needed. Basic PHP knowledge is enough to follow along.
What is a PHP Web Scraper?
A PHP web scraper is a script that automatically collects data from websites. Instead of manually copying information from web pages, the script does it programmatically — fetching the page, reading the HTML, and pulling out exactly the data you need.
Common uses for a PHP web scraper:
- Price monitoring — track product prices across multiple sites and get notified when they drop
- Content aggregation — collect news articles, job listings, or property listings from multiple sources
- Data research — gather publicly available data for analysis without manual copying
- Automation — feed scraped data into databases, reports, or other applications automatically
How a PHP Web Scraper Works
Every PHP web scraper follows the same three-step process:
- Fetch — send an HTTP request to the target URL and get the HTML response back
- Parse — load the HTML into a parser that lets you navigate and query the document structure
- Extract — use selectors to find the specific elements containing the data you want
PHP has everything needed to do all three built in — no external libraries required to get started.
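To see the shape of it before diving into details, here is the entire pattern condensed into one minimal sketch – no error handling yet; each step gets a proper treatment in the sections below:
<?php
// 1. Fetch - grab the raw HTML (error handling omitted in this preview)
$ch = curl_init("https://books.toscrape.com/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

// 2. Parse - load the HTML into a queryable document
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// 3. Extract - pull out the pieces you care about
foreach ($xpath->query('//h3/a') as $link) {
    echo $link->getAttribute('title') . PHP_EOL;
}
?>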
What You Need
- PHP 7.4 or higher
- cURL extension — verify with php -m | grep curl
- A terminal or command line to run scripts
All examples in this guide use books.toscrape.com — a site built specifically for scraping practice. It’s safe to scrape, always available, and structured the same way as most real e-commerce sites.
Fetching a Web Page With cURL
The first step of every scraper is fetching the page. PHP has a built-in function called file_get_contents() that can fetch URLs, but controlling headers, timeouts, or redirects with it means hand-building stream contexts, and it gives you no status codes or structured errors when a request fails. Most real websites reject requests that don't send browser-like headers – so a bare file_get_contents() call fails on anything beyond a simple test.
Use cURL instead. It handles everything a real browser does at the request level.
Your First cURL Request
<?php
$ch = curl_init();
curl_setopt_array($ch, [
    CURLOPT_URL            => "https://books.toscrape.com/",
    CURLOPT_RETURNTRANSFER => true, // return response as string
    CURLOPT_FOLLOWLOCATION => true, // follow redirects
    CURLOPT_CONNECTTIMEOUT => 10,   // connection timeout in seconds
    CURLOPT_TIMEOUT        => 30,   // total request timeout
    CURLOPT_ENCODING       => '',   // handle compressed responses
    CURLOPT_HTTPHEADER     => [
        'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: en-US,en;q=0.5',
    ],
]);

$html     = curl_exec($ch);
$errno    = curl_errno($ch);
$error    = curl_error($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($errno) {
    echo "Request failed: " . $error . PHP_EOL;
    exit;
}
if ($httpCode !== 200) {
    echo "HTTP error: " . $httpCode . PHP_EOL;
    exit;
}

echo "Page fetched successfully." . PHP_EOL;
echo "Response size: " . strlen($html) . " bytes." . PHP_EOL;
?>
Output:
Page fetched successfully.
Response size: 51274 bytes.
What Each Option Does
CURLOPT_RETURNTRANSFER – without this, cURL prints the response directly to the screen instead of returning it as a string. Always set this to true.
CURLOPT_FOLLOWLOCATION – follows redirects automatically. Without this, if the site redirects HTTP to HTTPS, you get an empty response with no error.
CURLOPT_CONNECTTIMEOUT and CURLOPT_TIMEOUT – set time limits so your script doesn’t hang indefinitely on slow or unresponsive servers. Always set both.
User-Agent header – identifies your request as coming from a Chrome browser instead of a PHP script. Without this, most sites either block the request or serve a stripped version of the page.
curl_errno() and CURLINFO_HTTP_CODE – two separate checks. curl_errno() catches network-level failures where the request never completed. CURLINFO_HTTP_CODE catches HTTP errors like 403 or 404 where the request completed but the server rejected it. You need both – a 403 response doesn’t trigger curl_errno().
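To see the two failure modes side by side, the short sketch below (a demonstration, not part of the scraper) requests one hostname that can never resolve and one page that presumably doesn't exist. The .invalid TLD is reserved, so DNS is guaranteed to fail; the missing-page URL is an assumption:
<?php
foreach (["https://no-such-host.invalid/", "https://books.toscrape.com/no-such-page.html"] as $url) {
    $ch = curl_init($url);
    curl_setopt_array($ch, [CURLOPT_RETURNTRANSFER => true, CURLOPT_TIMEOUT => 10]);
    curl_exec($ch);
    // The first URL sets a non-zero errno with HTTP code 0;
    // the second completes with errno 0 but a 404 HTTP code.
    echo $url . " -> errno " . curl_errno($ch) . ", HTTP " . curl_getinfo($ch, CURLINFO_HTTP_CODE) . PHP_EOL;
    curl_close($ch);
}
?>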
Making the Fetch Function Reusable
Wrap the cURL setup in a function so you can reuse it throughout your scraper without repeating the same options every time:
<?php
function fetch_page($url) {
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_TIMEOUT        => 30,
        CURLOPT_ENCODING       => '',
        CURLOPT_HTTPHEADER     => [
            'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
            'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language: en-US,en;q=0.5',
        ],
    ]);

    $html     = curl_exec($ch);
    $errno    = curl_errno($ch);
    $error    = curl_error($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($errno) {
        echo "cURL error on $url: $error" . PHP_EOL;
        return false;
    }
    if ($httpCode !== 200) {
        echo "HTTP $httpCode on $url" . PHP_EOL;
        return false;
    }
    return $html;
}

// Usage
$html = fetch_page("https://books.toscrape.com/");
if ($html) {
    echo "Fetched " . strlen($html) . " bytes." . PHP_EOL;
}
?>
Output:
Fetched 51274 bytes.
Every scraper you build from here forward starts with this function. The URL changes, the function stays the same.
Checking What You Actually Got
Before writing any parsing code, verify the response contains what you expect. Save the raw HTML to a file and open it in your browser:
<?php
$html = fetch_page("https://books.toscrape.com/");
if ($html) {
    file_put_contents('fetched_page.html', $html);
    echo "Saved to fetched_page.html - open in browser to inspect." . PHP_EOL;
}
?>
Open fetched_page.html in Chrome or Firefox. If the page looks right – same content you’d see visiting the URL normally – you’re ready to parse it. If it looks different (a login page, a CAPTCHA, or missing content), the site is treating your request differently than a real browser. That’s a signal to adjust your headers before going further.
Parsing HTML and Extracting Data
Fetching the page gives you a raw HTML string – one long block of text. To extract specific data from it you need a parser that understands HTML structure and lets you query it the same way you’d query a database.
PHP’s built-in DOMDocument class does exactly this. Combine it with DOMXPath and you can target any element on the page with a precise query.
Loading HTML Into DOMDocument
<?php
$html = fetch_page("https://books.toscrape.com/");
if (!$html) {
    exit("Failed to fetch page." . PHP_EOL);
}

// Suppress warnings from malformed HTML - common on real sites
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
// Clear the suppressed errors so they don't accumulate
libxml_clear_errors();

// Create an XPath object to query the document
$xpath = new DOMXPath($dom);
echo "HTML loaded successfully." . PHP_EOL;
?>
Output:
HTML loaded successfully.
libxml_use_internal_errors(true) is important. Real websites have messy HTML – unclosed tags, missing attributes, encoding quirks. Without this line, DOMDocument floods your output with warnings for every imperfect tag it finds. This captures those warnings internally so your output stays clean.
Understanding XPath
XPath is a query language for navigating HTML and XML documents. Think of it like CSS selectors but more powerful. The pattern is always the same:
// Start from anywhere in the document
//tagname
// Filter by attribute
//tagname[@attribute="value"]
// Filter by partial attribute match
//tagname[contains(@attribute, "value")]
// Go deeper - find a child element
//parent//child
// Get a specific attribute value
//tagname/@attribute
You don’t need to memorize XPath syntax. The two patterns you’ll use 90% of the time are //tag[@class="value"] and //tag[contains(@class,"value")]. Use contains() when an element has multiple classes.
Extracting Book Titles
<?php
// Query all book title links inside article elements
$titles = $xpath->query('//article[contains(@class,"product_pod")]//h3/a');
echo "Books found: " . $titles->length . PHP_EOL . PHP_EOL;

foreach ($titles as $title) {
    // Title text is in the "title" attribute, not the link text
    echo $title->getAttribute('title') . PHP_EOL;
}
?>
Output:
Books found: 20
A Light in the Attic
Tipping the Velvet
Soumission
Sharp Objects
Sapiens: A Brief History of Humankind
The Requiem Red
...
Extracting Multiple Fields From Each Book
Real scrapers extract several fields at once from each item. Use a context node – passing the current element as the second argument to $xpath->query() – to search within each book card individually:
<?php
// Get all book cards
$books = $xpath->query('//article[contains(@class,"product_pod")]');
echo "Extracting data from " . $books->length . " books..." . PHP_EOL . PHP_EOL;

$results = [];
foreach ($books as $book) {
    // Search within this book card only using "." prefix
    $titleNode  = $xpath->query('.//h3/a', $book)->item(0);
    $priceNode  = $xpath->query('.//*[contains(@class,"price_color")]', $book)->item(0);
    $ratingNode = $xpath->query('.//*[contains(@class,"star-rating")]', $book)->item(0);
    $linkNode   = $titleNode; // title and URL live on the same <a> element

    // Extract values safely - check for missing nodes before reading
    $title  = $titleNode ? $titleNode->getAttribute('title') : 'N/A';
    $price  = $priceNode ? trim($priceNode->textContent) : 'N/A';
    $rating = $ratingNode ? str_replace('star-rating ', '', $ratingNode->getAttribute('class')) : 'N/A';
    // Hrefs are relative and may carry "../" or "catalogue/" prefixes depending
    // on the page - strip them before prepending the base URL
    $link = $linkNode
        ? "https://books.toscrape.com/catalogue/" . preg_replace('#^(\.\./)*(catalogue/)?#', '', $linkNode->getAttribute('href'))
        : 'N/A';

    $results[] = [
        'title'  => $title,
        'price'  => $price,
        'rating' => $rating,
        'url'    => $link,
    ];
}

// Display the extracted data
foreach ($results as $book) {
    echo $book['title'] . PHP_EOL;
    echo "  Price: " . $book['price'] . PHP_EOL;
    echo "  Rating: " . $book['rating'] . PHP_EOL;
    echo "  URL: " . $book['url'] . PHP_EOL;
    echo PHP_EOL;
}
?>
Output:
Extracting data from 20 books...
A Light in the Attic
Price: £51.77
Rating: One
URL: https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
Tipping the Velvet
Price: £53.74
Rating: One
URL: https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html
Soumission
Price: £50.10
Rating: One
URL: https://books.toscrape.com/catalogue/soumission_998/index.html
...
Finding the Right XPath for Any Page
You don’t need to guess XPath queries. Chrome DevTools shows you exactly what to use:
- Open the target page in Chrome
- Right-click the element you want to scrape
- Select Inspect
- In the Elements panel, right-click the highlighted HTML tag
- Select Copy → Copy XPath
Test the copied XPath in the Chrome console before using it in PHP:
$x('//article[contains(@class,"product_pod")]//h3/a')
If it returns an array of elements the query works. Fix it in the console first – it’s faster than editing and rerunning your PHP script every time.
Handling Missing Elements Safely
Not every page has every element you’re looking for. Always check that a node exists before trying to read from it:
<?php
$priceNode = $xpath->query('.//*[contains(@class,"price_color")]', $book)->item(0);
// Wrong - warns and yields null (or fatals on a method call) when $priceNode is missing
$price = $priceNode->textContent;
// Right - check first, provide fallback
$price = $priceNode ? trim($priceNode->textContent) : 'N/A';
?>
Using ->item(0) returns null when the query finds nothing – it doesn’t throw an error. The ternary check $node ? $node->value : 'N/A' is the pattern to use on every single node you extract. One missing element on one page shouldn’t crash the entire scraper.
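If typing that ternary for every field feels repetitive, you can fold it into a tiny helper. This is an optional convenience, not something used in the scripts above, and the function name is made up:
<?php
// Hypothetical helper - wraps the query-then-check pattern in one call
function node_text(DOMXPath $xpath, string $query, DOMNode $context, string $fallback = 'N/A'): string {
    $node = $xpath->query($query, $context)->item(0);
    return $node ? trim($node->textContent) : $fallback;
}

// Usage inside the extraction loop:
// $price = node_text($xpath, './/*[contains(@class,"price_color")]', $book);
?>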
Saving Scraped Data to a CSV File
Printing data to the terminal is useful for testing. For anything real you need to save it somewhere – a file you can open in Excel, import into a database, or pass to another script.
CSV is the simplest format to start with. Every spreadsheet application opens it, every database can import it, and PHP writes it in one line per row with no external libraries.
Writing a Basic CSV File
<?php
$html = fetch_page("https://books.toscrape.com/");
if (!$html) {
    exit("Failed to fetch page." . PHP_EOL);
}

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$books = $xpath->query('//article[contains(@class,"product_pod")]');

// Open file for writing - 'w' creates it if it doesn't exist
$csvFile = __DIR__ . '/books.csv';
$handle  = fopen($csvFile, 'w');
if (!$handle) {
    exit("Could not create CSV file." . PHP_EOL);
}

// Write the header row first
fputcsv($handle, ['Title', 'Price', 'Rating', 'URL']);

$count = 0;
foreach ($books as $book) {
    $titleNode  = $xpath->query('.//h3/a', $book)->item(0);
    $priceNode  = $xpath->query('.//*[contains(@class,"price_color")]', $book)->item(0);
    $ratingNode = $xpath->query('.//*[contains(@class,"star-rating")]', $book)->item(0);
    $linkNode   = $titleNode; // title and URL live on the same <a> element

    $title  = $titleNode ? $titleNode->getAttribute('title') : 'N/A';
    $price  = $priceNode ? trim($priceNode->textContent) : 'N/A';
    $rating = $ratingNode ? str_replace('star-rating ', '', $ratingNode->getAttribute('class')) : 'N/A';
    // Strip relative prefixes ("../", "catalogue/") before prepending the base URL
    $link = $linkNode
        ? "https://books.toscrape.com/catalogue/" . preg_replace('#^(\.\./)*(catalogue/)?#', '', $linkNode->getAttribute('href'))
        : 'N/A';

    // Write one row per book - fputcsv handles commas and quotes automatically
    fputcsv($handle, [$title, $price, $rating, $link]);
    $count++;
}

fclose($handle);
echo "Saved $count books to books.csv" . PHP_EOL;
?>
Output:
Saved 20 books to books.csv
Open books.csv in Excel or Google Sheets and you’ll see 20 rows of clean structured data – title, price, rating, and URL – ready to work with.
Appending vs Overwriting
The 'w' mode in fopen() overwrites the file every time the script runs. For a scraper that runs once and saves everything, that's fine. For a scraper that loops across multiple pages, open in 'a' (append) mode and write the header only when the file is new:
<?php
$csvFile   = __DIR__ . '/books.csv';
$isNewFile = !file_exists($csvFile);

// Open in append mode - adds to existing file instead of overwriting
$handle = fopen($csvFile, 'a');
if (!$handle) {
    exit("Could not open CSV file." . PHP_EOL);
}

// Only write header row if the file is brand new
if ($isNewFile) {
    fputcsv($handle, ['Title', 'Price', 'Rating', 'URL']);
}

// Write rows from current page
foreach ($books as $book) {
    // ... extract fields ...
    fputcsv($handle, [$title, $price, $rating, $link]);
}

fclose($handle);
?>
This way you can call the same code on page 1, page 2, and page 50 without the earlier pages getting overwritten. The header only appears once at the top regardless of how many pages you scrape.
Complete Single-Page Scraper With CSV Output
Putting everything together – fetching, parsing, and saving – in one clean script:
<?php
// ---- Configuration ----
$targetUrl = "https://books.toscrape.com/";
$csvFile   = __DIR__ . '/books.csv';

// ---- Fetch ----
function fetch_page($url) {
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_TIMEOUT        => 30,
        CURLOPT_ENCODING       => '',
        CURLOPT_HTTPHEADER     => [
            'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
            'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language: en-US,en;q=0.5',
        ],
    ]);
    $html     = curl_exec($ch);
    $errno    = curl_errno($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($errno || $httpCode !== 200) {
        echo "Failed to fetch $url (HTTP $httpCode)" . PHP_EOL;
        return false;
    }
    return $html;
}

// ---- Parse ----
$html = fetch_page($targetUrl);
if (!$html) {
    exit("Scraper stopped - could not fetch page." . PHP_EOL);
}
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$books = $xpath->query('//article[contains(@class,"product_pod")]');
if ($books->length === 0) {
    exit("No books found - page structure may have changed." . PHP_EOL);
}

// ---- Save ----
$handle = fopen($csvFile, 'w');
if (!$handle) {
    exit("Could not create CSV file." . PHP_EOL);
}
fputcsv($handle, ['Title', 'Price', 'Rating', 'URL']);

$saved = 0;
foreach ($books as $book) {
    $titleNode  = $xpath->query('.//h3/a', $book)->item(0);
    $priceNode  = $xpath->query('.//*[contains(@class,"price_color")]', $book)->item(0);
    $ratingNode = $xpath->query('.//*[contains(@class,"star-rating")]', $book)->item(0);
    $linkNode   = $titleNode; // title and URL live on the same <a> element

    $title  = $titleNode ? $titleNode->getAttribute('title') : 'N/A';
    $price  = $priceNode ? trim($priceNode->textContent) : 'N/A';
    $rating = $ratingNode ? str_replace('star-rating ', '', $ratingNode->getAttribute('class')) : 'N/A';
    // Strip relative prefixes ("../", "catalogue/") before prepending the base URL
    $link = $linkNode
        ? "https://books.toscrape.com/catalogue/" . preg_replace('#^(\.\./)*(catalogue/)?#', '', $linkNode->getAttribute('href'))
        : 'N/A';

    fputcsv($handle, [$title, $price, $rating, $link]);
    $saved++;
}

fclose($handle);
echo "Done. Saved $saved books to: $csvFile" . PHP_EOL;
?>
Output:
Done. Saved 20 books to: /var/www/html/scraper/books.csv
books.csv contents:
Title,Price,Rating,URL
A Light in the Attic,£51.77,One,https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
Tipping the Velvet,£53.74,One,https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html
Soumission,£50.10,One,https://books.toscrape.com/catalogue/soumission_998/index.html
Sharp Objects,£47.82,Four,https://books.toscrape.com/catalogue/sharp-objects_997/index.html
...
Run this script from the terminal:
php scraper.php
The CSV file appears in the same directory as the script. Open it directly in Excel, Google Sheets, or import it into MySQL – the data is ready to use immediately.
Scraping Multiple Pages
One page of 20 books is a test run. A real scraper needs to follow the site through all 50 pages and collect everything. The only difference between scraping one page and fifty is a loop that finds the next page link and keeps going until there isn’t one.
Finding the Next Page Link
Before writing the loop, identify how the site handles pagination. On books.toscrape.com, each page has a “next” button in the bottom navigation. Inspect it in Chrome DevTools and you’ll see:
<li class="next">
<a href="catalogue/page-2.html">next</a>
</li>
The XPath to find this link:
//li[contains(@class,"next")]/a
When this element exists there’s another page to scrape. When it’s gone you’re on the last page.
The Multi-Page Scraper
<?php
set_time_limit(0);

function fetch_page($url) {
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_TIMEOUT        => 30,
        CURLOPT_ENCODING       => '',
        CURLOPT_HTTPHEADER     => [
            'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
            'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language: en-US,en;q=0.5',
        ],
    ]);
    $html     = curl_exec($ch);
    $errno    = curl_errno($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($errno || $httpCode !== 200) {
        echo "Failed: $url (HTTP $httpCode)" . PHP_EOL;
        return false;
    }
    return $html;
}

// ---- Setup CSV ----
$csvFile = __DIR__ . '/all_books.csv';
$handle  = fopen($csvFile, 'w');
if (!$handle) {
    exit("Could not create CSV file." . PHP_EOL);
}
fputcsv($handle, ['Title', 'Price', 'Rating', 'URL']);

// ---- Scraping Loop ----
$currentUrl = "https://books.toscrape.com/";
$page       = 1;
$totalSaved = 0;

while ($currentUrl) {
    echo "Scraping page $page..." . PHP_EOL;
    $html = fetch_page($currentUrl);
    if (!$html) {
        echo "Failed to fetch page $page. Stopping." . PHP_EOL;
        break;
    }

    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    $books = $xpath->query('//article[contains(@class,"product_pod")]');
    if ($books->length === 0) {
        echo "No books found on page $page. Stopping." . PHP_EOL;
        break;
    }

    // Extract and save each book on this page
    foreach ($books as $book) {
        $titleNode  = $xpath->query('.//h3/a', $book)->item(0);
        $priceNode  = $xpath->query('.//*[contains(@class,"price_color")]', $book)->item(0);
        $ratingNode = $xpath->query('.//*[contains(@class,"star-rating")]', $book)->item(0);
        $linkNode   = $titleNode; // title and URL live on the same <a> element

        $title  = $titleNode ? $titleNode->getAttribute('title') : 'N/A';
        $price  = $priceNode ? trim($priceNode->textContent) : 'N/A';
        $rating = $ratingNode ? str_replace('star-rating ', '', $ratingNode->getAttribute('class')) : 'N/A';
        // Strip relative prefixes ("../", "catalogue/") before prepending the base URL
        $link = $linkNode
            ? "https://books.toscrape.com/catalogue/" . preg_replace('#^(\.\./)*(catalogue/)?#', '', $linkNode->getAttribute('href'))
            : 'N/A';

        fputcsv($handle, [$title, $price, $rating, $link]);
        $totalSaved++;
    }
    echo "Page $page done - {$books->length} books saved." . PHP_EOL;

    // Find the next page link
    $nextNode = $xpath->query('//li[contains(@class,"next")]/a')->item(0);
    if ($nextNode) {
        // The homepage links to "catalogue/page-2.html", catalogue pages to "page-N.html" -
        // strip the optional "catalogue/" prefix so the base URL is never doubled
        $nextHref   = preg_replace('#^catalogue/#', '', $nextNode->getAttribute('href'));
        $currentUrl = "https://books.toscrape.com/catalogue/" . $nextHref;
        $page++;
        sleep(1); // pause between requests
    } else {
        echo "Last page reached." . PHP_EOL;
        $currentUrl = null;
    }

    // Free memory before next page
    unset($html, $books);
    $dom = null;
}

fclose($handle);
echo PHP_EOL . "Scrape complete." . PHP_EOL;
echo "Total books saved: $totalSaved" . PHP_EOL;
echo "File: $csvFile" . PHP_EOL;
?>
Output:
Scraping page 1...
Page 1 done - 20 books saved.
Scraping page 2...
Page 2 done - 20 books saved.
Scraping page 3...
Page 3 done - 20 books saved.
...
Scraping page 50...
Page 50 done - 20 books saved.
Last page reached.
Scrape complete.
Total books saved: 1000
File: /var/www/html/scraper/all_books.csv
Why sleep(1) Is in the Loop
The sleep(1) call pauses the script for one second between each page request. Without it the scraper hits 50 pages as fast as possible – potentially hundreds of requests per minute. Most sites detect this pattern and block your IP.
One second between requests is the minimum for most sites. If you start getting blocked or seeing empty responses mid-scrape, increase it to 2-3 seconds. The scrape takes longer but stays under the radar.
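A fixed one-second gap is itself a detectable pattern. If you want the timing to look less mechanical, a randomized delay is a one-line change – a sketch, with the 1-3 second range chosen arbitrarily:
<?php
// Replace sleep(1) with a random pause between 1 and 3 seconds.
// usleep() takes microseconds; random_int() gives an unbiased value.
usleep(random_int(1000000, 3000000));
?>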
Why unset() Is in the Loop
unset($html, $books) and $dom = null at the end of each iteration release the memory used by that page before the next one loads. Without this, every page's HTML and parsed DOM accumulates in memory throughout the entire job. On 50 pages it's manageable. On 500 pages your script hits the memory limit and crashes.
Always free variables you no longer need inside scraping loops. It costs nothing and prevents the most common cause of scraper crashes on large jobs.
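To confirm the cleanup is actually working, you can log memory usage once per page – if the number climbs steadily instead of holding flat, something is still accumulating. A one-line sketch:
<?php
// Print after the unset/null lines at the bottom of the loop
echo "Memory after page $page: " . round(memory_get_usage(true) / 1048576, 1) . " MB" . PHP_EOL;
?>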
Adding a Safety Cap
When testing a new scraper add a page limit so it doesn’t run all 50 pages every time you make a small change:
<?php
$maxPages = 3; // remove this line when ready for full run

while ($currentUrl) {
    if ($page > $maxPages) {
        echo "Test limit reached. Stopping at page $maxPages." . PHP_EOL;
        break;
    }
    // ... rest of the loop
}
?>
Output with cap set to 3:
Scraping page 1...
Page 1 done - 20 books saved.
Scraping page 2...
Page 2 done - 20 books saved.
Scraping page 3...
Page 3 done - 20 books saved.
Test limit reached. Stopping at page 3.
Set $maxPages = 3 while developing, remove the check when you’re ready for the full scrape. This prevents accidentally hammering a site with 50 requests every time you test a code change.
Common Beginner Mistakes and How to Avoid Them
Most PHP web scraper problems come from the same handful of mistakes. Knowing what they are before you hit them saves hours of debugging.
Mistake 1: Using file_get_contents() Instead of cURL
file_get_contents() works on basic URLs, but tuning headers, cookies, redirects, or timeouts requires hand-built stream contexts, and it reports failures as nothing more than false plus a warning. Most real websites block requests without a proper User-Agent header – something a bare file_get_contents() call never sends.
<?php
// Wrong - gets blocked by most real sites
$html = file_get_contents("https://example.com/products");
// Right - full control over headers and request behavior
$html = fetch_page("https://example.com/products");
?>
Mistake 2: Using @ to Suppress Errors
The @ operator before a function call hides all errors and warnings from that call. Beginners use it to silence DOMDocument warnings – but it hides real errors too, making bugs impossible to find.
<?php
// Wrong - hides all errors including ones you need to see
@$dom->loadHTML($html);
// Right - captures only libxml parsing warnings internally
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
?>
Mistake 3: Not Checking if the Request Succeeded
If curl_exec() fails it returns false. Passing false to DOMDocument::loadHTML() emits a warning and leaves you with an empty document on PHP 7, and throws a ValueError on PHP 8 – either way the scraper produces nothing, with no clear sign of which request went wrong.
<?php
// Wrong - no check, scraper continues on failed requests
$html = fetch_page($url);
$dom->loadHTML($html); // warns or throws if $html is false

// Right - check every request before using the response
$html = fetch_page($url);
if (!$html) {
    echo "Failed to fetch $url - skipping." . PHP_EOL;
    continue;
}
$dom->loadHTML($html);
?>
Mistake 4: Using Exact Class Matching Instead of contains()
Elements often have multiple CSS classes. An exact match on @class="product" fails if the element has class="product active featured". Use contains() to match partial class names:
<?php
// Wrong - fails when element has multiple classes
$books = $xpath->query('//article[@class="product_pod"]');
// Right - matches regardless of other classes present
$books = $xpath->query('//article[contains(@class,"product_pod")]');
?>
Mistake 5: No Delay Between Requests
Firing requests as fast as PHP can execute them is the fastest way to get your IP blocked. A scraper that requests one page per second looks like a normal user. One that fires 50 requests in 3 seconds gets flagged immediately.
<?php
while ($currentUrl) {
    $html = fetch_page($currentUrl);
    // Process page...

    // Wrong - no delay, maximum request speed
    // Right - pause between requests
    sleep(1); // minimum 1 second, increase to 2-3 if getting blocked
}
?>
Mistake 6: Storing Everything in Memory Before Saving
Collecting all scraped data in one giant array and saving at the end uses more memory with every page. On large scrapes the script hits PHP’s memory limit and crashes – losing everything collected so far.
<?php
// Wrong - accumulates all data in memory
$allBooks = [];
while ($currentUrl) {
    // ... scrape page ...
    $allBooks[] = $bookData; // grows with every page
}
// Only saves if script survives to the end
foreach ($allBooks as $book) {
    fputcsv($handle, $book);
}

// Right - save immediately and free memory
while ($currentUrl) {
    // ... scrape page ...
    fputcsv($handle, [$title, $price, $rating, $link]); // save now
    unset($html, $books); // free memory
    $dom = null;
}
?>
Mistake 7: Hardcoding Paths Instead of Using __DIR__
Hardcoded file paths break when you move the project to a different server or directory. Use __DIR__ which always returns the absolute path of the current file’s directory:
<?php
// Wrong - breaks if project moves to a different location
$csvFile = '/var/www/html/myproject/books.csv';
require '/var/www/html/myproject/config.php';
// Right - works regardless of where the project lives
$csvFile = __DIR__ . '/books.csv';
require __DIR__ . '/config.php';
?>
Mistake 8: Scraping Without Checking robots.txt
Before scraping any site check its robots.txt file. It tells you which paths the site owner wants left alone. Ignoring it doesn’t just risk legal issues – it’s also the quickest way to get permanently blocked:
<?php
// Always check robots.txt before scraping a new site
$robotsUrl = "https://books.toscrape.com/robots.txt";
$robots = fetch_page($robotsUrl);

if ($robots) {
    echo $robots . PHP_EOL;
} else {
    echo "No robots.txt found - proceed with caution." . PHP_EOL;
}

// books.toscrape.com output:
// User-agent: *
// Disallow:
// (empty Disallow means everything is allowed)
?>
An empty Disallow: means the entire site is open to scraping. Disallow: / means the owner wants nothing scraped. Specific paths like Disallow: /admin/ mean those paths only should be avoided.
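If you want your scraper to act on those rules automatically, a prefix check like the sketch below covers the simple cases. It is deliberately minimal – a real parser also handles User-agent groups, Allow rules, and wildcards:
<?php
// Minimal sketch: true if $path starts with any Disallow prefix.
// Ignores User-agent grouping - assumes the rules apply to everyone.
function path_disallowed(string $robots, string $path): bool {
    foreach (preg_split('/\r?\n/', $robots) as $line) {
        $line = trim($line);
        if (stripos($line, 'Disallow:') === 0) {
            $rule = trim(substr($line, 9));
            if ($rule !== '' && strpos($path, $rule) === 0) {
                return true;
            }
        }
    }
    return false;
}

var_dump(path_disallowed("User-agent: *\nDisallow: /admin/", "/admin/login")); // bool(true)
var_dump(path_disallowed("User-agent: *\nDisallow:", "/catalogue/"));          // bool(false)
?>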
Frequently Asked Questions
What is the best way to build a PHP web scraper?
Use cURL for fetching pages and DOMDocument with DOMXPath for parsing HTML. This combination is built into PHP, requires no external libraries, and handles the full range of real-world scraping tasks – custom headers, cookies, redirects, and structured data extraction. For beginners this is the right starting point before reaching for external libraries like Guzzle or Symfony DomCrawler.
Why is my PHP web scraper returning empty results?
Two causes cover most cases. Either your XPath selector doesn’t match the actual HTML – inspect the page in Chrome DevTools and verify the exact class names and element structure before writing your query. Or the content is loaded by JavaScript after the initial page load – cURL fetches raw HTML only, not JavaScript-rendered content. Save the raw response to a file and open it in your browser to confirm the data is actually there before assuming your selector is wrong.
How do I stop my PHP web scraper from getting blocked?
Always send a realistic User-Agent header – the default cURL user agent is blocked by most sites. Add the full Accept and Accept-Language headers to match what a real browser sends. Enable cookie handling with a cookie jar file. Add a delay of 1-2 seconds between requests. These four changes fix the majority of blocking issues on standard websites without needing proxies or more complex solutions.
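Cookie handling is the one item on that list the fetch_page() function above doesn't do. Adding it is two cURL options – a sketch, with the cookies.txt filename being an arbitrary choice:
<?php
// Two options to add inside fetch_page() alongside the existing ones
$cookieFile = __DIR__ . '/cookies.txt';
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);  // write cookies to disk when the handle closes
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); // send stored cookies with each request
?>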
Can I scrape JavaScript-rendered websites with PHP?
Not directly with cURL. PHP cURL fetches the raw HTML before JavaScript runs – if the data loads after page render, cURL won’t see it. First check Chrome DevTools Network tab for API calls the JavaScript makes — hitting that API directly with cURL is often simpler than scraping HTML. If there’s no API, check the HTML source for inline JSON in script tags. As a last resort use a headless browser like Puppeteer to render the page and save the HTML, then parse it with PHP.
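For the inline-JSON case, the extraction is a normal XPath query plus json_decode(). The sketch below assumes the data sits in a script tag with type="application/json" – verify that against the actual page source first:
<?php
// Hypothetical sketch - the script tag and its structure vary per site
$node = $xpath->query('//script[@type="application/json"]')->item(0);
if ($node) {
    $data = json_decode(trim($node->textContent), true);
    if (json_last_error() === JSON_ERROR_NONE) {
        print_r($data); // inspect the structure, then pull out the fields you need
    }
}
?>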
How do I save scraped data to a database instead of CSV?
Replace fputcsv() with PDO database calls. Create your table first with a UNIQUE constraint on the field that identifies a unique record, then use INSERT ... ON DUPLICATE KEY UPDATE so re-running the scraper updates existing records instead of creating duplicates. The PHP cURL web scraping complete guide covers the full database storage implementation with working code.
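As a rough sketch of what that swap looks like – table name, columns, and credentials here are all placeholders, and the assumed schema uses url as the unique key:
<?php
// Assumed schema (placeholder):
// CREATE TABLE books (url VARCHAR(255) PRIMARY KEY, title TEXT, price VARCHAR(16), rating VARCHAR(8));
$pdo = new PDO('mysql:host=localhost;dbname=scraper;charset=utf8mb4', 'user', 'password', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);
$stmt = $pdo->prepare(
    'INSERT INTO books (url, title, price, rating)
     VALUES (:url, :title, :price, :rating)
     ON DUPLICATE KEY UPDATE title = VALUES(title), price = VALUES(price), rating = VALUES(rating)'
);

// Inside the scraping loop, replacing fputcsv():
$stmt->execute(['url' => $link, 'title' => $title, 'price' => $price, 'rating' => $rating]);
?>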
How do I run my PHP web scraper automatically on a schedule?
Use a cron job to run your scraper script automatically on whatever schedule you need – daily, hourly, or weekly. Add the cron entry pointing at your PHP binary and script path, redirect output to a log file, and the server handles execution without any manual intervention. The PHP cron job guide covers the full setup including cPanel configuration and debugging when jobs don’t run.
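For reference, a typical crontab entry looks like this – the binary and script paths are placeholders for your own server:
# Run the scraper daily at 6:00 AM; append all output (including errors) to a log
0 6 * * * /usr/bin/php /path/to/scraper.php >> /path/to/scraper.log 2>&1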
Is PHP good for web scraping?
For most scraping tasks, yes. PHP has cURL and DOMDocument built in, runs on virtually every hosting environment, and handles static HTML scraping well. Where it falls short is JavaScript-heavy sites and very large scale operations needing parallel requests. For those cases Python with Scrapy or BeautifulSoup is more commonly used — but for a developer already working in PHP, the built-in tools are more than capable for the majority of real scraping projects.
Next Steps
You now have a complete working PHP web scraper that fetches pages, parses HTML, extracts structured data, and saves it to a CSV file across multiple pages. That covers the foundation of every scraping project regardless of complexity.
Three directions to go from here depending on what you’re building:
- Store data in MySQL instead of CSV – the PHP cURL web scraping complete guide covers database storage, error handling, retry logic, rate limiting, and cookie-based session handling in full detail with working code throughout
- Handle timeout and connection errors – scrapers hitting real targets need retry logic and proper timeout configuration. The PHP cURL timeout guide covers error code 28, curl_getinfo() timing breakdowns, and building retry loops that recover from failures automatically (a minimal retry sketch follows this list)
- Run the scraper on a schedule – once your scraper works reliably, automate it with a cron job so it runs daily without manual intervention. The PHP cron job guide covers everything from cron syntax to debugging silent failures
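As promised above, here is a minimal retry sketch built on the fetch_page() function from this guide – the attempt count and backoff timing are arbitrary starting points:
<?php
// Retry a fetch up to $maxAttempts times, backing off a little longer each time
function fetch_with_retry(string $url, int $maxAttempts = 3) {
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $html = fetch_page($url);
        if ($html !== false) {
            return $html;
        }
        echo "Attempt $attempt failed for $url - retrying..." . PHP_EOL;
        sleep($attempt * 2); // 2s, then 4s, then give up
    }
    return false;
}
?>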
The scraper built in this guide works against books.toscrape.com – a site designed for scraping practice. Once you’re comfortable with the code, try pointing it at a real target and adapt the XPath queries to match the actual HTML structure. The fetch, parse, and extract pattern stays the same – only the selectors change.
