PHP cURL Web Scraping Example: The Complete Working Guide

Introduction

Most PHP cURL scraping tutorials show you two functions, a screenshot of some output, and call it a day.

That works fine if all you need is to fetch a single static page. But the moment you hit a site that checks request headers, requires a session cookie, or blocks your IP after three requests – that basic example falls apart immediately.

This PHP cURL web scraping example covers what most tutorials skip: what actually happens when you scrape real websites. That means proper request setup, parsing HTML without regex hacks, handling errors, scraping paginated content, storing data to MySQL, and avoiding blocks. Every code block here runs. The output is real.

What You Need Before Starting

No special setup required. Just make sure your environment has:

  • PHP 7.4 or higher
  • cURL extension enabled – verify with
    php -m | grep curl
  • A basic understanding of PHP functions

If the cURL check returns nothing, you need to uncomment extension=curl in your php.ini file and restart your server.
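
You can also confirm it from inside PHP itself with a quick check:

<?php
// Quick runtime check that the cURL extension is available
if (!extension_loaded('curl')) {
    exit("cURL extension is not enabled." . PHP_EOL);
}

echo "cURL version: " . curl_version()['version'] . PHP_EOL;
?>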

PHP cURL Web Scraping Example: Making Your First Request

Here’s the bare minimum cURL request most tutorials show you:

<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://books.toscrape.com/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);
?>

This works on cooperative websites. On anything with basic bot detection, you’ll get a 403, an empty response, or a CAPTCHA page — and the script won’t tell you why.

Here’s a proper starting point:

<?php
function scrape_url($url) {
    $ch = curl_init();

    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_TIMEOUT        => 30,
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_ENCODING       => '',
        CURLOPT_HTTPHEADER     => [
            'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
            'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language: en-US,en;q=0.5',
            'Connection: keep-alive',
        ],
    ]);

    $response = curl_exec($ch);
    $error    = curl_error($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    curl_close($ch);

    if ($error) {
        echo "cURL error: " . $error;
        return false;
    }

    if ($httpCode !== 200) {
        echo "HTTP error: " . $httpCode;
        return false;
    }

    return $response;
}

$html = scrape_url("https://books.toscrape.com/");

if ($html) {
    echo "Page fetched successfully. Length: " . strlen($html) . " bytes";
}
?>

Output:

Page fetched successfully. Length: 51274 bytes

Let’s break down what each option actually does:

  • CURLOPT_FOLLOWLOCATION — follows redirects automatically. Without this, if the site redirects HTTP to HTTPS, you get an empty response.
  • CURLOPT_MAXREDIRS — caps redirect chains at 5. Prevents infinite redirect loops from hanging your script.
  • CURLOPT_TIMEOUT — total time allowed for the request. Set this or a slow server will block your script indefinitely.
  • CURLOPT_CONNECTTIMEOUT — time allowed just to establish the connection. Useful when the server is down or unreachable.
  • CURLOPT_ENCODING — setting this to an empty string tells cURL to handle compressed responses (gzip, deflate) automatically.
  • User-Agent header — without this, your request identifies itself as a cURL bot. Most sites either block it or serve a stripped-down response.

The curl_getinfo($ch, CURLINFO_HTTP_CODE) check is important. curl_exec() returns false only on a network-level failure — not on a 403 or 404. You can get a full response body back from a “blocked” page and never know it unless you check the status code.

Parsing HTML with DOMDocument

The approach most tutorials reach for looks like this:

preg_match_all('/<h3><a[^>]*title="([^"]*)"/', $response, $matches);

Regex on HTML breaks the moment the site changes one attribute, adds a class, or reformats whitespace. It’s fragile, hard to read, and a nightmare to debug. Don’t use it for anything beyond a throwaway script.

PHP has a built-in HTML parser called DOMDocument. It’s not pretty to work with, but it’s reliable and handles messy real-world HTML without breaking.

Basic Setup

<?php
$html = scrape_url("https://books.toscrape.com/");

// Suppress warnings from malformed HTML — common on real sites
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

libxml_clear_errors();

$xpath = new DOMXPath($dom);
?>

The libxml_use_internal_errors(true) line is important. Real websites have malformed HTML — unclosed tags, missing attributes, encoding issues. Without this, DOMDocument throws hundreds of warnings and your output becomes unreadable.
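
If a page parses but your queries keep coming back empty, inspecting those suppressed warnings before clearing them can point at the problem. A small sketch:

<?php
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Peek at the first few parser warnings before discarding them
foreach (array_slice(libxml_get_errors(), 0, 5) as $error) {
    echo "libxml: " . trim($error->message) . " (line " . $error->line . ")" . PHP_EOL;
}

libxml_clear_errors();
?>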

Extracting Text Content

Let’s pull all book titles from books.toscrape.com:

<?php
// Select all <h3> tags inside an <article> with class "product_pod"
$titles = $xpath->query('//article[contains(@class,"product_pod")]//h3/a');

foreach ($titles as $title) {
    echo $title->getAttribute('title') . PHP_EOL;
}
?>

Output:

A Light in the Attic
Tipping the Velvet
Soumission
Sharp Objects
Sapiens: A Brief History of Humankind
The Requiem Red
...

Extracting Multiple Fields at Once

In real scraping projects you rarely want just one field. Here’s how to pull structured data — title, price, and rating together:

<?php
$books = $xpath->query('//article[contains(@class,"product_pod")]');

$results = [];

foreach ($books as $book) {
    // Title
    $titleNode = $xpath->query('.//h3/a', $book)->item(0);
    $title = $titleNode ? $titleNode->getAttribute('title') : 'N/A';

    // Price
    $priceNode = $xpath->query('.//*[contains(@class,"price_color")]', $book)->item(0);
    $price = $priceNode ? trim($priceNode->textContent) : 'N/A';

    // Rating (stored as a word in the class: "star-rating Three")
    $ratingNode = $xpath->query('.//*[contains(@class,"star-rating")]', $book)->item(0);
    $ratingClass = $ratingNode ? $ratingNode->getAttribute('class') : '';
    $rating = str_replace('star-rating ', '', $ratingClass);

    $results[] = [
        'title'  => $title,
        'price'  => $price,
        'rating' => $rating,
    ];
}

foreach ($results as $book) {
    echo $book['title'] . " | " . $book['price'] . " | " . $book['rating'] . " stars" . PHP_EOL;
}
?>

Output:

A Light in the Attic | £51.77 | One stars
Tipping the Velvet | £53.74 | One stars
Soumission | £50.10 | One stars
Sharp Objects | £47.82 | Four stars
Sapiens: A Brief History of Humankind | £54.23 | Two stars

Extracting Links

<?php
$links = $xpath->query('//article[contains(@class,"product_pod")]//h3/a');

foreach ($links as $link) {
    $href = $link->getAttribute('href');
    // ltrim()'s second argument is a character list, not a string,
    // so strip leading "../" segments with a regex instead
    $href = preg_replace('#^(\.\./)+#', '', $href);
    // Homepage hrefs already start with "catalogue/"; avoid doubling it
    if (strpos($href, 'catalogue/') === 0) {
        $href = substr($href, strlen('catalogue/'));
    }
    echo "https://books.toscrape.com/catalogue/" . $href . PHP_EOL;
}
?>

Output:

https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html
https://books.toscrape.com/catalogue/soumission_998/index.html

XPath looks intimidating at first, but the pattern is always the same: start with // to search anywhere in the document, add the tag name, filter by attribute with [@class="..."] or [contains(@class,"...")], and chain with // to go deeper. Ten minutes with XPath saves hours of regex debugging.
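
A few more queries following that same pattern ($contextNode stands in for any DOMNode you already hold; the selectors are generic illustrations, not tied to any particular site):

<?php
$xpath->query('//a');                           // every link in the document
$xpath->query('//div[@id="content"]');          // exact attribute match
$xpath->query('//li[contains(@class,"item")]'); // partial class match
$xpath->query('//ul[@class="menu"]//a');        // chain deeper into a subtree
$xpath->query('.//span', $contextNode);         // relative query scoped to $contextNode
?>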

Handling Errors and Retries

A scraper that crashes on the first failed request is useless for any real project. Servers go down, connections time out, and sites occasionally return a 503 just to make your life difficult. You need to handle this without babysitting the script.

What Can Actually Go Wrong

There are two separate layers where requests fail:

  • Network-level failures — cURL itself fails. DNS lookup fails, connection refused, timeout reached. curl_exec() returns false and curl_error() tells you why.
  • HTTP-level failures — cURL succeeds but the server returns an error code. 403 (blocked), 404 (page gone), 429 (rate limited), 503 (server overloaded). curl_exec() returns the response body and you’d never know something went wrong unless you check the status code.

Most beginners only handle the first type. The second one is where scrapers silently fail — storing error pages as data, or missing content entirely.

A Scraper Function With Retry Logic

<?php
function scrape_with_retry($url, $maxRetries = 3, $delay = 2) {
    $attempt = 0;

    while ($attempt < $maxRetries) {
        $attempt++;

        $ch = curl_init();

        curl_setopt_array($ch, [
            CURLOPT_URL            => $url,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_MAXREDIRS      => 5,
            CURLOPT_TIMEOUT        => 30,
            CURLOPT_CONNECTTIMEOUT => 10,
            CURLOPT_ENCODING       => '',
            CURLOPT_HTTPHEADER     => [
                'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
                'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                'Accept-Language: en-US,en;q=0.5',
                'Connection: keep-alive',
            ],
        ]);

        $response = curl_exec($ch);
        $curlError = curl_error($ch);
        $httpCode  = curl_getinfo($ch, CURLINFO_HTTP_CODE);

        curl_close($ch);

        // Network-level failure
        if ($curlError) {
            echo "Attempt $attempt failed (cURL error): $curlError" . PHP_EOL;
            sleep($delay);
            continue;
        }

        // Rate limited — wait longer before retrying
        if ($httpCode === 429) {
            echo "Attempt $attempt failed (rate limited). Waiting 10 seconds..." . PHP_EOL;
            sleep(10);
            continue;
        }

        // Server error — worth retrying
        if ($httpCode >= 500) {
            echo "Attempt $attempt failed (HTTP $httpCode). Retrying in {$delay}s..." . PHP_EOL;
            sleep($delay);
            continue;
        }

        // Blocked or not found — no point retrying
        if ($httpCode === 403 || $httpCode === 404) {
            echo "Failed permanently (HTTP $httpCode): $url" . PHP_EOL;
            return false;
        }

        // Success
        if ($httpCode === 200) {
            return $response;
        }

        // Any other status code: treat it as transient and retry after a pause
        sleep($delay);
    }

    echo "All $maxRetries attempts failed for: $url" . PHP_EOL;
    return false;
}

// Usage
$html = scrape_with_retry("https://books.toscrape.com/", 3, 2);

if ($html) {
    echo "Fetched successfully on retry logic." . PHP_EOL;
}
?>

Output on a working request:

Fetched successfully on retry logic.

Output when a server returns 503 twice then recovers:

Attempt 1 failed (HTTP 503). Retrying in 2s...
Attempt 2 failed (HTTP 503). Retrying in 2s...
Fetched successfully on retry logic.

Understanding the Retry Logic

Not every error deserves a retry. Here’s the reasoning behind each case:

  • 429 (Too Many Requests) — the site is explicitly telling you to slow down. Retrying immediately makes it worse. Wait longer.
  • 5xx errors — server-side problems, usually temporary. Worth retrying after a short pause.
  • 403 (Forbidden) — the site has blocked your request. Retrying the same URL with the same headers won’t help. Either rotate your user agent, add more headers, or accept that this page isn’t accessible.
  • 404 (Not Found) — the page doesn’t exist. No point retrying at all.
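
If you want the waits to spread out as failures repeat, a common refinement is exponential backoff with jitter. A minimal sketch of what could replace the fixed sleep($delay) inside the retry loop above (the 60-second cap is an arbitrary choice):

<?php
// Exponential backoff with jitter: 2s, 4s, 8s... capped at 60s
$wait = min(60, $delay * (2 ** ($attempt - 1)));
$wait += rand(0, 1000) / 1000; // up to 1s of random jitter
usleep((int) ($wait * 1000000));
?>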

Logging Failed URLs

On large scraping jobs, some URLs will always fail. Instead of losing them silently, log them to a file:

<?php
function log_failed_url($url, $reason) {
    $logFile = 'failed_urls.log';
    $entry   = date('Y-m-d H:i:s') . " | " . $reason . " | " . $url . PHP_EOL;
    file_put_contents($logFile, $entry, FILE_APPEND);
}

// Use inside your scraper
$html = scrape_with_retry($url, 3, 2);

if (!$html) {
    log_failed_url($url, "Max retries exceeded");
}
?>

This gives you a file you can review after a run and re-queue the failed URLs without restarting the entire job.
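
Reading the log back in for a retry pass is straightforward. A small helper sketch that assumes the format written by log_failed_url() above:

<?php
function load_failed_urls($logFile = 'failed_urls.log') {
    if (!file_exists($logFile)) {
        return [];
    }

    $urls = [];

    foreach (file($logFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        $parts  = explode(' | ', $line);
        $urls[] = end($parts); // the URL is the last field in each entry
    }

    return array_unique($urls);
}

// Re-queue the failures on the next run
foreach (load_failed_urls() as $url) {
    $html = scrape_with_retry($url, 3, 2);
}
?>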

Scraping Paginated Content

Most real scraping targets don’t keep everything on one page. Products, articles, job listings — they’re split across dozens or hundreds of pages. If your scraper only hits the first page, you’re missing most of the data.

There are two common pagination patterns you’ll run into:

  • URL-based pagination — the page number is in the URL. ?page=2, /page/2/, &offset=20. Predictable and easy to loop through.
  • Next button pagination — the page has a “Next” link you follow until it disappears. More reliable because you’re not guessing the total page count.

Method 1: URL-Based Pagination

Books.toscrape.com uses URLs like /catalogue/page-2.html. If you know the pattern, just loop:

<?php
$baseUrl  = "https://books.toscrape.com/catalogue/page-{page}.html";
$allBooks = [];
$page     = 1;
$maxPages = 50; // safety cap — always set one

while ($page <= $maxPages) {
    $url  = str_replace('{page}', $page, $baseUrl);
    $html = scrape_with_retry($url, 3, 2);

    if (!$html) {
        echo "Stopping at page $page — request failed." . PHP_EOL;
        break;
    }

    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $books = $xpath->query('//article[contains(@class,"product_pod")]');

    // No books found means we've gone past the last page
    if ($books->length === 0) {
        echo "No results on page $page. Done." . PHP_EOL;
        break;
    }

    foreach ($books as $book) {
        $titleNode = $xpath->query('.//h3/a', $book)->item(0);
        $priceNode = $xpath->query('.//*[contains(@class,"price_color")]', $book)->item(0);

        $allBooks[] = [
            'title' => $titleNode ? $titleNode->getAttribute('title') : 'N/A',
            'price' => $priceNode ? trim($priceNode->textContent) : 'N/A',
            'page'  => $page,
        ];
    }

    echo "Page $page scraped — " . $books->length . " books found." . PHP_EOL;

    $page++;
    sleep(1); // pause between requests — covered properly in the rate limiting section
}

echo "Total books scraped: " . count($allBooks) . PHP_EOL;
?>

Output:

Page 1 scraped — 20 books found.
Page 2 scraped — 20 books found.
Page 3 scraped — 20 books found.
...
Page 50 scraped — 20 books found.
Total books scraped: 1000

Method 2: Following the Next Button

URL patterns aren’t always predictable. The safer approach is to find the “next” link on each page and follow it until it no longer exists:

<?php
$url      = "https://books.toscrape.com/";
$allBooks = [];
$page     = 1;

while ($url) {
    $html = scrape_with_retry($url, 3, 2);

    if (!$html) {
        echo "Request failed on page $page. Stopping." . PHP_EOL;
        break;
    }

    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);

    // Scrape current page data
    $books = $xpath->query('//article[contains(@class,"product_pod")]');

    foreach ($books as $book) {
        $titleNode = $xpath->query('.//h3/a', $book)->item(0);
        $priceNode = $xpath->query('.//*[contains(@class,"price_color")]', $book)->item(0);

        $allBooks[] = [
            'title' => $titleNode ? $titleNode->getAttribute('title') : 'N/A',
            'price' => $priceNode ? trim($priceNode->textContent) : 'N/A',
        ];
    }

    echo "Page $page scraped — " . $books->length . " books." . PHP_EOL;

    // Find the next page link
    $nextNode = $xpath->query('//li[contains(@class,"next")]/a')->item(0);

    if ($nextNode) {
        $nextHref = $nextNode->getAttribute('href');
        // Build absolute URL from relative href. The homepage's next link is
        // "catalogue/page-2.html" while catalogue pages link to "page-N.html",
        // so only prepend "catalogue/" when it isn't already there.
        $url = (strpos($nextHref, 'catalogue/') === 0)
            ? "https://books.toscrape.com/" . $nextHref
            : "https://books.toscrape.com/catalogue/" . $nextHref;
        $page++;
        sleep(1);
    } else {
        // No next button — we're on the last page
        echo "Last page reached." . PHP_EOL;
        $url = null;
    }
}

echo "Total books scraped: " . count($allBooks) . PHP_EOL;
?>

Output:

Page 1 scraped — 20 books.
Page 2 scraped — 20 books.
...
Page 50 scraped — 20 books.
Last page reached.
Total books scraped: 1000

Between the two methods, following the next button is more reliable. URL patterns change — a site redesign can break your loop entirely. A “next” link either exists or it doesn’t.

Tracking Progress on Long Jobs

If you’re scraping hundreds of pages, you want to know where you are without printing every single line. Track progress every N pages:

<?php
if ($page % 10 === 0) {
    echo "Progress: page $page | total collected: " . count($allBooks) . PHP_EOL;
}
?>

Output every 10 pages:

Progress: page 10 | total collected: 200
Progress: page 20 | total collected: 400
Progress: page 30 | total collected: 600

Storing Scraped Data to MySQL

Printing scraped data to the terminal is fine for testing. For anything real — price tracking, content aggregation, monitoring — you need it in a database where you can query it, update it, and run the scraper repeatedly without duplicating records.

Creating the Table

Run this once to set up your table:

CREATE TABLE books (
    id         INT AUTO_INCREMENT PRIMARY KEY,
    title      VARCHAR(255) NOT NULL,
    price      VARCHAR(20),
    rating     VARCHAR(20),
    url        VARCHAR(500),
    scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE KEY unique_title (title)
);

The UNIQUE KEY on title is important. It prevents duplicate rows when you run the scraper more than once. Instead of checking manually whether a record exists before every insert, you let the database handle it.

Connecting to the Database

<?php
function get_db_connection() {
    $host     = 'localhost';
    $dbname   = 'scraper_db';
    $username = 'your_username';
    $password = 'your_password';

    try {
        $pdo = new PDO(
            "mysql:host=$host;dbname=$dbname;charset=utf8mb4",
            $username,
            $password,
            [
                PDO::ATTR_ERRMODE            => PDO::ERRMODE_EXCEPTION,
                PDO::ATTR_DEFAULT_FETCH_MODE => PDO::FETCH_ASSOC,
                PDO::ATTR_EMULATE_PREPARES   => false,
            ]
        );
        return $pdo;
    } catch (PDOException $e) {
        echo "Connection failed: " . $e->getMessage() . PHP_EOL;
        return null;
    }
}
?>

Use PDO over mysqli. It handles errors cleanly through exceptions, works with multiple database types if you ever switch, and makes prepared statements straightforward to write.

Inserting Scraped Records

<?php
function save_book($pdo, $title, $price, $rating, $url) {
    $sql = "INSERT INTO books (title, price, rating, url)
            VALUES (:title, :price, :rating, :url)
            ON DUPLICATE KEY UPDATE
                price      = VALUES(price),
                rating     = VALUES(rating),
                scraped_at = CURRENT_TIMESTAMP";

    try {
        $stmt = $pdo->prepare($sql);
        $stmt->execute([
            ':title'  => $title,
            ':price'  => $price,
            ':rating' => $rating,
            ':url'    => $url,
        ]);
        return true;
    } catch (PDOException $e) {
        echo "Insert failed for '$title': " . $e->getMessage() . PHP_EOL;
        return false;
    }
}
?>

ON DUPLICATE KEY UPDATE is doing the heavy lifting here. If the title already exists, it updates the price and rating instead of throwing an error or inserting a duplicate. This means you can run the same scraper daily and always have fresh data without bloating the table.

Putting It All Together

<?php
$pdo = get_db_connection();

if (!$pdo) {
    exit("Could not connect to database." . PHP_EOL);
}

$url      = "https://books.toscrape.com/";
$inserted = 0;
$updated  = 0;
$page     = 1;

while ($url) {
    $html = scrape_with_retry($url, 3, 2);

    if (!$html) {
        echo "Stopping at page $page." . PHP_EOL;
        break;
    }

    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $books = $xpath->query('//article[contains(@class,"product_pod")]');

    foreach ($books as $book) {
        $titleNode  = $xpath->query('.//h3/a', $book)->item(0);
        $priceNode  = $xpath->query('.//*[contains(@class,"price_color")]', $book)->item(0);
        $ratingNode = $xpath->query('.//*[contains(@class,"star-rating")]', $book)->item(0);
        $linkNode   = $xpath->query('.//h3/a', $book)->item(0);

        $title  = $titleNode  ? $titleNode->getAttribute('title') : 'N/A';
        $price  = $priceNode  ? trim($priceNode->textContent) : 'N/A';
        $rating = $ratingNode ? str_replace('star-rating ', '', $ratingNode->getAttribute('class')) : 'N/A';
        // ltrim() treats '../' as a character list, so normalize the href with a regex instead
        $link   = $linkNode   ? "https://books.toscrape.com/catalogue/" . preg_replace('#^(\.\./|catalogue/)+#', '', $linkNode->getAttribute('href')) : '';

        // Check whether the title already exists so inserts and updates
        // can be counted separately (a prepared statement, not string quoting)
        $check = $pdo->prepare("SELECT COUNT(*) FROM books WHERE title = :title");
        $check->execute([':title' => $title]);
        $exists = (int) $check->fetchColumn() > 0;

        save_book($pdo, $title, $price, $rating, $link);

        if ($exists) {
            $updated++;
        } else {
            $inserted++;
        }
    }

    echo "Page $page done — inserted: $inserted | updated: $updated" . PHP_EOL;

    $nextNode = $xpath->query('//li[contains(@class,"next")]/a')->item(0);

    if ($nextNode) {
        $nextHref = $nextNode->getAttribute('href');
        // Same prefix handling as before: the homepage href already includes "catalogue/"
        $url = (strpos($nextHref, 'catalogue/') === 0)
            ? "https://books.toscrape.com/" . $nextHref
            : "https://books.toscrape.com/catalogue/" . $nextHref;
        $page++;
        sleep(1);
    } else {
        $url = null;
    }
}

echo PHP_EOL . "Scrape complete." . PHP_EOL;
echo "Total inserted: $inserted" . PHP_EOL;
echo "Total updated:  $updated" . PHP_EOL;
?>

Output on first run:

Page 1 done — inserted: 20 | updated: 0
Page 2 done — inserted: 40 | updated: 0
...
Page 50 done — inserted: 1000 | updated: 0

Scrape complete.
Total inserted: 1000
Total updated:  0

Output on second run:

Page 1 done — inserted: 0 | updated: 20
Page 2 done — inserted: 0 | updated: 40
...
Scrape complete.
Total inserted: 0
Total updated:  1000

Retrieving and Using the Data

<?php
// Get all books under £10 sorted by price
$stmt = $pdo->prepare("
    SELECT title, price, rating
    FROM books
    WHERE CAST(REPLACE(REPLACE(price, '£', ''), ',', '') AS DECIMAL(10,2)) < 10
    ORDER BY CAST(REPLACE(REPLACE(price, '£', ''), ',', '') AS DECIMAL(10,2)) ASC
");

$stmt->execute();
$cheapBooks = $stmt->fetchAll();

foreach ($cheapBooks as $book) {
    echo $book['title'] . " — " . $book['price'] . PHP_EOL;
}
?>

Output:

Behind Closed Doors — £4.10
The Black Maria — £6.52
Starving Hearts — £6.99
Set Me Free — £9.05

The price is stored as a string with a currency symbol, so sorting requires stripping it out in the query. If you’re building something where price comparisons matter, store it as a DECIMAL column from the start and strip the symbol before inserting.
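
If you go that route, a small helper like this (parse_price is a hypothetical addition, not part of the code above) cleans the value before insertion:

<?php
// Hypothetical helper: turn "£51.77" into a number for a DECIMAL(10,2) column
function parse_price($raw) {
    $clean = preg_replace('/[^0-9.]/', '', $raw); // strip currency symbols and commas
    return $clean === '' ? null : (float) $clean;
}

echo parse_price('£51.77') . PHP_EOL; // 51.77
?>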

Rate Limiting and Avoiding Blocks

Sending requests as fast as possible is the fastest way to get your IP banned. Sites have automated systems that flag unusual traffic patterns — too many requests per second, no delay between pages, identical headers on every request. Any one of these triggers a block.

Rate limiting isn’t just about being polite. It’s about keeping your scraper running.

Basic Delays

The simplest thing you can do is add a pause between requests:

<?php
// Fixed delay — pause 2 seconds between every request
sleep(2);

// Random delay — harder to fingerprint as a bot
// This mimics the irregular timing of a real person clicking links
usleep(rand(1000000, 3000000)); // between 1 and 3 seconds
?>

Random delays work better than fixed ones. A bot hitting a page every exactly 2.000 seconds is a detectable pattern. A gap that varies between 1 and 3 seconds looks more like a human.

Tracking Request Rate

Delays alone don’t tell you your actual request rate. On a slow server where each request takes 3-4 seconds, you might already be well under any limit. On a fast server, even a 1-second delay can still be too aggressive if you’re hitting the same domain repeatedly.

Track it explicitly:

<?php
function scrape_with_rate_limit($urls, $requestsPerMinute = 20) {
    $results         = [];
    $minGap          = 60 / $requestsPerMinute; // seconds between requests
    $lastRequestTime = 0;

    foreach ($urls as $url) {
        $elapsed = microtime(true) - $lastRequestTime;

        if ($elapsed < $minGap) {
            $waitTime = ($minGap - $elapsed) * 1000000; // convert to microseconds
            usleep((int) $waitTime);
        }

        $lastRequestTime = microtime(true);
        $html            = scrape_with_retry($url, 3, 2);

        if ($html) {
            $results[$url] = $html;
            echo "Fetched: $url" . PHP_EOL;
        }
    }

    return $results;
}

// Usage — scrape 20 URLs at no more than 15 requests per minute
$urls = [
    "https://books.toscrape.com/catalogue/page-1.html",
    "https://books.toscrape.com/catalogue/page-2.html",
    "https://books.toscrape.com/catalogue/page-3.html",
];

$pages = scrape_with_rate_limit($urls, 15);
echo "Scraped " . count($pages) . " pages." . PHP_EOL;
?>

Output:

Fetched: https://books.toscrape.com/catalogue/page-1.html
Fetched: https://books.toscrape.com/catalogue/page-2.html
Fetched: https://books.toscrape.com/catalogue/page-3.html
Scraped 3 pages.

Rotating User Agents

Sending the same user agent string on every request is a fingerprint. Rotating through a list of real browser user agents makes your traffic look less uniform:

<?php
function get_random_user_agent() {
    $agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    ];

    return $agents[array_rand($agents)];
}

// Use inside your cURL setup
curl_setopt($ch, CURLOPT_HTTPHEADER, [
    'User-Agent: ' . get_random_user_agent(),
    'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language: en-US,en;q=0.5',
]);
?>

Respecting robots.txt

Before scraping any site, check its robots.txt file. It tells you which paths the site owner explicitly wants left alone:

<?php
function is_scraping_allowed($domain, $path = '/') {
    $robotsUrl = rtrim($domain, '/') . '/robots.txt';
    $html      = scrape_with_retry($robotsUrl, 2, 1);

    if (!$html) {
        return true; // if robots.txt doesn't exist, assume allowed
    }

    $lines = explode("\n", $html);
    $block = false;

    // Simplified parser: it handles one User-agent line per group and ignores
    // Allow directives and wildcard paths
    foreach ($lines as $line) {
        $line = trim($line);

        if (stripos($line, 'User-agent:') === 0) {
            $agent = trim(substr($line, strlen('User-agent:')));
            // Apply the rules that follow only if the group targets everyone (or us)
            $block = ($agent === '*' || strcasecmp($agent, 'php-curl') === 0);
            continue;
        }

        if ($block && stripos($line, 'Disallow:') === 0) {
            $disallowed = trim(str_ireplace('Disallow:', '', $line));

            if ($disallowed && strpos($path, $disallowed) === 0) {
                return false;
            }
        }
    }

    return true;
}

// Check before scraping
$domain = "https://books.toscrape.com";
$path   = "/catalogue/page-2.html";

if (is_scraping_allowed($domain, $path)) {
    echo "Scraping allowed. Proceeding..." . PHP_EOL;
    $html = scrape_with_retry($domain . $path, 3, 2);
} else {
    echo "Path disallowed by robots.txt. Skipping." . PHP_EOL;
}
?>

Output:

Scraping allowed. Proceeding...

What Actually Gets You Blocked

Rate limiting covers most cases, but these are the specific patterns that trigger bans faster than anything else:

  • No delay between requests — even 500ms is better than nothing. Zero delay is an immediate red flag.
  • Scraping at 3am every night on a fixed schedule — automated timing patterns are detectable. Vary the time you run your scraper.
  • Hitting the same session-heavy pages repeatedly — login walls, checkout pages, user dashboards. These are monitored more closely than public pages.
  • Ignoring a 429 and retrying immediately — when a site tells you to slow down, slow down. Ignoring it escalates from a rate limit to an IP ban.
  • Making hundreds of requests with no cookies — real browsers accumulate cookies across a session. A request with zero cookies on every hit looks like a bot.

The last point leads into session handling with cookies — which is how you scrape sites that require login or maintain state between pages.

Handling Cookies and Sessions

Some sites require you to be logged in before you can access the data you need. Others maintain session state between pages — shopping carts, search filters, pagination state. Without cookies, each request looks like a completely new visitor and the site either redirects you to a login page or resets your session.

cURL handles cookies through two options: a cookie jar file that stores cookies between requests, and a cookie file it reads from on each request. They can be the same file.

Basic Cookie Handling

<?php
$cookieFile = __DIR__ . '/cookies.txt';

$ch = curl_init();

curl_setopt_array($ch, [
    CURLOPT_URL            => "https://example.com/",
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_COOKIEFILE     => $cookieFile, // read cookies from this file
    CURLOPT_COOKIEJAR      => $cookieFile, // write cookies to this file
    CURLOPT_HTTPHEADER     => [
        'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    ],
]);

$response = curl_exec($ch);
curl_close($ch);
?>

Point both CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR to the same file. cURL reads existing cookies before the request and writes new ones after. The file gets created automatically if it doesn’t exist.
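
To see what cURL is holding, you can ask the handle for its cookie list before you close it. A short sketch (assumes $ch is still open; CURLINFO_COOKIELIST returns Netscape-format lines):

<?php
// Call before curl_close(); the list lives on the handle
foreach (curl_getinfo($ch, CURLINFO_COOKIELIST) as $cookie) {
    echo $cookie . PHP_EOL; // one Netscape-format line per cookie
}
?>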

Logging Into a Site

Login forms submit credentials via a POST request. Before you can do that, you usually need to fetch the login page first — many sites embed a CSRF token in the form that must be included with the POST.

<?php
$cookieFile = __DIR__ . '/session_cookies.txt';

function curl_request($url, $options = []) {
    global $cookieFile;

    $ch = curl_init();

    $defaults = [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEFILE     => $cookieFile,
        CURLOPT_COOKIEJAR      => $cookieFile,
        CURLOPT_TIMEOUT        => 30,
        CURLOPT_HTTPHEADER     => [
            'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
            'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language: en-US,en;q=0.5',
        ],
    ];

    // The + operator keeps the left operand's values on key collisions,
    // so $options must come first to override the defaults
    curl_setopt_array($ch, $options + $defaults);
    curl_setopt($ch, CURLOPT_URL, $url);

    $response = curl_exec($ch);
    $error    = curl_error($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    curl_close($ch);

    if ($error) {
        echo "cURL error: $error" . PHP_EOL;
        return false;
    }

    return ['body' => $response, 'status' => $httpCode];
}

// Step 1: Fetch the login page to get the CSRF token
$loginPage = curl_request("https://example.com/login");

if (!$loginPage) {
    exit("Could not reach login page." . PHP_EOL);
}

// Step 2: Extract the CSRF token from the login form
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($loginPage['body']);
libxml_clear_errors();

$xpath     = new DOMXPath($dom);
$tokenNode = $xpath->query('//input[@name="_token"]')->item(0);
$csrfToken = $tokenNode ? $tokenNode->getAttribute('value') : '';

if (!$csrfToken) {
    echo "No CSRF token found — form may use a different field name." . PHP_EOL;
}

echo "CSRF token: $csrfToken" . PHP_EOL;

// Step 3: Submit the login form
$postData = http_build_query([
    '_token'   => $csrfToken,
    'email'    => 'your@email.com',
    'password' => 'yourpassword',
]);

$loginResponse = curl_request("https://example.com/login", [
    CURLOPT_POST       => true,
    CURLOPT_POSTFIELDS => $postData,
    CURLOPT_HTTPHEADER => [
        'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
        'Content-Type: application/x-www-form-urlencoded',
        'Referer: https://example.com/login',
    ],
]);

if ($loginResponse['status'] === 200 || $loginResponse['status'] === 302) {
    echo "Login submitted. Status: " . $loginResponse['status'] . PHP_EOL;
} else {
    echo "Login failed. Status: " . $loginResponse['status'] . PHP_EOL;
}
?>

Output on success:

CSRF token: xK9mP2qL8nVwR4tY7bJc
Login submitted. Status: 200

Verifying the Login Worked

A 200 status after login doesn’t mean you’re authenticated. Some sites return 200 with the login form again when credentials are wrong. Always check the response content to confirm:

<?php
// After login, fetch a page that only authenticated users can see
$dashboard = curl_request("https://example.com/dashboard");

if (!$dashboard) {
    exit("Request failed." . PHP_EOL);
}

// Check for signs of successful login
if (strpos($dashboard['body'], 'Welcome') !== false || 
    strpos($dashboard['body'], 'Logout') !== false) {
    echo "Login confirmed. Session active." . PHP_EOL;
} elseif (strpos($dashboard['body'], 'Login') !== false || 
          strpos($dashboard['body'], 'Sign in') !== false) {
    echo "Login failed — redirected back to login page." . PHP_EOL;
} else {
    echo "Unclear — check response manually." . PHP_EOL;
}
?>

Output:

Login confirmed. Session active.

Persisting Cookies Between Script Runs

If your cookie file is valid, you don’t need to log in every time the script runs. Check first:

<?php
function is_session_valid($cookieFile, $checkUrl, $signedInIndicator = 'Logout') {
    if (!file_exists($cookieFile) || filesize($cookieFile) === 0) {
        return false;
    }

    $response = curl_request($checkUrl);

    if (!$response) {
        return false;
    }

    return strpos($response['body'], $signedInIndicator) !== false;
}

$cookieFile = __DIR__ . '/session_cookies.txt';

if (is_session_valid($cookieFile, "https://example.com/dashboard")) {
    echo "Existing session valid. Skipping login." . PHP_EOL;
} else {
    echo "Session expired or missing. Logging in again." . PHP_EOL;
    // run login flow here
}
?>

Output when session is still active:

Existing session valid. Skipping login.

Output when session has expired:

Session expired or missing. Logging in again.

Cleaning Up Cookie Files

Cookie files accumulate over time and can contain stale or conflicting session data. Clear them between fresh scraping runs:

<?php
function clear_cookies($cookieFile) {
    if (file_exists($cookieFile)) {
        file_put_contents($cookieFile, '');
        echo "Cookie file cleared." . PHP_EOL;
    }
}

// Call this before starting a fresh session
clear_cookies(__DIR__ . '/session_cookies.txt');
?>

Don’t delete the file — just empty it. Keeping the file in place means the path you point CURLOPT_COOKIEFILE at always exists, and the next run starts from a clean, predictable state.

Common Mistakes and How to Fix Them

Most scraping problems fall into a small number of categories. Here are the ones that waste the most debugging time, and exactly how to fix them.

1. Empty Response With No Error

Your script runs, curl_exec() returns something, but the response is empty or shorter than expected. No cURL error, no obvious indication of what went wrong.

Nine times out of ten, the site served a different response than what you expected — usually a redirect to a login page, a bot check page, or a compressed response your script can’t read.

Debug it like this:

<?php
$ch = curl_init();

curl_setopt_array($ch, [
    CURLOPT_URL            => "https://example.com/",
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_ENCODING       => '',
    CURLOPT_VERBOSE        => true,  // prints the full request/response headers
    CURLOPT_STDERR         => fopen('curl_debug.txt', 'w'),
]);

$response = curl_exec($ch);

$info = curl_getinfo($ch);
echo "Final URL: "       . $info['url']          . PHP_EOL;
echo "HTTP Status: "     . $info['http_code']     . PHP_EOL;
echo "Response size: "   . $info['size_download'] . " bytes" . PHP_EOL;
echo "Content-Type: "    . $info['content_type']  . PHP_EOL;
echo "Total time: "      . $info['total_time']    . "s" . PHP_EOL;

curl_close($ch);
?>

Output:

Final URL: https://example.com/login?redirect=%2F
HTTP Status: 200
Response size: 3820 bytes
Content-Type: text/html; charset=UTF-8
Total time: 0.843s

The final URL being a login page tells you everything. You fetched successfully — you just fetched the wrong page. The curl_debug.txt file will show you the full header exchange if you need to dig deeper.

2. SSL Certificate Errors

You’ll see this on self-signed certificates or misconfigured HTTPS setups:

cURL error: SSL certificate problem: unable to get local issuer certificate

The wrong fix that you’ll find everywhere:

<?php
// Don't do this in production
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
?>

Disabling SSL verification makes your script vulnerable to man-in-the-middle attacks. It’s fine for quick local testing, never for anything that handles real data.

The correct fix is to update your CA certificate bundle:

<?php
// Download the latest cacert.pem from https://curl.se/ca/cacert.pem
// Then point cURL at it
curl_setopt($ch, CURLOPT_CAINFO, __DIR__ . '/cacert.pem');
?>

3. Getting Garbled or Encoded Output

Your scraped text looks like this:

été â€" Près de l’église

This is a UTF-8 string being treated as Latin-1 somewhere in the chain. Fix it at the point where you output or store the data:

<?php
// When outputting to terminal
header('Content-Type: text/html; charset=utf-8'); // for browser output

// When the source page uses a different encoding
$response = mb_convert_encoding($response, 'UTF-8', 'ISO-8859-1');

// When DOMDocument mangles encoding
$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $html);

// When storing to MySQL — make sure your connection uses utf8mb4
$pdo = new PDO("mysql:host=localhost;dbname=scraper;charset=utf8mb4", $user, $pass);
?>

The utf8mb4 charset matters here. MySQL’s older utf8 charset only supports 3-byte characters and silently drops emojis and some special characters.
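
If an existing table was created with the older charset, one statement converts it (a sketch for the books table above; back up first):

ALTER TABLE books CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;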

4. Memory Exhaustion on Large Scrapes

Scraping thousands of pages and storing everything in a single array will eventually kill your script:

Fatal error: Allowed memory size of 134217728 bytes exhausted

The fix isn’t increasing the memory limit — it’s not accumulating data in memory in the first place. Write to the database or a file as you go, then clear the variable:

<?php
$page = 1;

while ($url) {
    $html = scrape_with_retry($url, 3, 2);

    if (!$html) break;

    $books = parse_books($html); // returns array for current page only

    // Write immediately instead of accumulating
    foreach ($books as $book) {
        save_book($pdo, $book['title'], $book['price'], $book['rating'], $book['url']);
    }

    // Free memory — don't carry this array into the next iteration
    unset($books, $html);

    echo "Page $page saved to database." . PHP_EOL;
    $page++;

    // Also useful for long-running scripts — force garbage collection
    if ($page % 50 === 0) {
        gc_collect_cycles();
    }

    // ... pagination logic
}
?>

5. DOMDocument Ignoring Part of the Page

You write an XPath query that should match ten elements and get back two, or none at all. The HTML looks correct when you view source in the browser.

The most common cause: the content is loaded by JavaScript after the initial page load. PHP cURL only fetches the raw HTML — it doesn’t execute JavaScript. What you see in browser DevTools after the page fully loads is different from what cURL receives.

Verify what cURL actually got:

<?php
$html = scrape_with_retry("https://example.com/products", 3, 2);

// Save the raw response to a file and open it in a browser
file_put_contents('debug_page.html', $html);
echo "Raw HTML saved to debug_page.html" . PHP_EOL;
?>

Open debug_page.html in your browser. If the data you’re trying to scrape isn’t there, it’s JavaScript-rendered. You’ll need a headless browser for that — cURL alone won’t work.

6. XPath Query Returns Nothing

Your XPath looks right but returns an empty NodeList. Before assuming the query is wrong, check whether the HTML actually loaded:

<?php
$nodes = $xpath->query('//div[@class="product-title"]');

if ($nodes === false) {
    echo "XPath query is malformed." . PHP_EOL;
} elseif ($nodes->length === 0) {
    // Check if the DOM loaded anything at all
    $body = $xpath->query('//body');
    echo "Body nodes found: " . $body->length . PHP_EOL;

    // Try a broader query to see what's actually in the document
    $allDivs = $xpath->query('//div');
    echo "Total divs in document: " . $allDivs->length . PHP_EOL;
} else {
    echo "Found " . $nodes->length . " matching nodes." . PHP_EOL;
}
?>

Output when the page loaded but the selector is wrong:

Body nodes found: 1
Total divs in document: 47

If divs exist but your specific query returns nothing, the class name or structure is different from what you expect. Dump the raw HTML and check the actual attribute values — sites sometimes use dynamically generated class names that change between page loads.
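
A quick way to see which class names the document you actually received uses:

<?php
// List every distinct class attribute in the fetched document
$classes = [];

foreach ($xpath->query('//*[@class]') as $node) {
    $classes[$node->getAttribute('class')] = true;
}

foreach (array_keys($classes) as $class) {
    echo $class . PHP_EOL;
}
?>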

7. Script Stops Without Explanation

The script runs for a while then stops silently mid-scrape. No error output, no indication of where it stopped.

Two likely causes: PHP’s max_execution_time killed it, or a fatal error fired while error display was turned off.

<?php
// At the top of any long-running scraper script
set_time_limit(0);          // no execution time limit
ini_set('memory_limit', '256M');

// Log all errors to a file instead of suppressing them
ini_set('log_errors', 1);
ini_set('error_log', __DIR__ . '/scraper_errors.log');

// Also catch fatal errors
register_shutdown_function(function() {
    $error = error_get_last();
    if ($error && in_array($error['type'], [E_ERROR, E_PARSE, E_CORE_ERROR])) {
        file_put_contents(
            __DIR__ . '/scraper_errors.log',
            date('Y-m-d H:i:s') . " FATAL: " . $error['message'] . " in " . $error['file'] . " line " . $error['line'] . PHP_EOL,
            FILE_APPEND
        );
    }
});
?>

Run this at the top of every scraper script. Silent failures on long jobs cost more debugging time than almost anything else.

Frequently Asked Questions

Is PHP cURL good for web scraping?

For static websites, yes. cURL is fast, built into PHP, requires no external dependencies, and handles the majority of scraping tasks – fetching pages, submitting forms, managing cookies, following redirects. For most straightforward scraping jobs it’s all you need.

Where it falls short is JavaScript-heavy sites. If the data you need is loaded after the page renders via JavaScript, cURL won’t see it. In that case you need a headless browser like Puppeteer or Playwright – tools that actually execute JavaScript the way a real browser does.

Why is my PHP cURL scraper getting blocked?

Usually one of four reasons: no user agent set (the default cURL user agent is blocked by most sites), requests firing too fast with no delay, no cookies being sent so every request looks like a fresh anonymous visit, or the site uses JavaScript to render content and is detecting that your client never executes it.

Start with the basics – add a realistic user agent, add a 1-2 second delay between requests, enable cookie handling. That fixes the majority of blocking issues on normal websites.

Can PHP scrape JavaScript-rendered websites?

Not directly with cURL. cURL fetches raw HTML only — it doesn’t run JavaScript. If a site loads its data via JavaScript after the initial page load, what cURL receives is the bare HTML shell with empty containers where the content should be.

Some workarounds without a headless browser: check if the site has an API that the JavaScript is calling (open DevTools → Network tab → XHR requests), look for a mobile version of the site that may serve static HTML, or check if the data exists in a <script> tag as inline JSON that you can extract directly.

How do I scrape multiple pages in PHP?

Two approaches covered in this guide — loop through predictable URL patterns like /page-1.html, /page-2.html, or follow the “next” link on each page until it no longer exists. The second method is more reliable because it doesn’t depend on knowing the total page count or the URL structure in advance.

How do I store scraped data in PHP?

MySQL with PDO is the most practical option for anything beyond a quick test. Use INSERT ... ON DUPLICATE KEY UPDATE so re-running the scraper updates existing records instead of creating duplicates. For simpler jobs where you just need a file, fputcsv() writes scraped data directly to a CSV that opens in Excel or imports into any database.
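
A minimal CSV export sketch, assuming the $results array built earlier in this guide:

<?php
$fh = fopen('books.csv', 'w');

fputcsv($fh, ['title', 'price', 'rating']); // header row

foreach ($results as $book) {
    fputcsv($fh, [$book['title'], $book['price'], $book['rating']]);
}

fclose($fh);
?>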

What is the difference between curl_error and HTTP status codes?

curl_error() only catches network-level failures — the request never completed. A 403, 404, or 500 response is technically a successful request from cURL’s perspective, so curl_error() returns nothing. Always check curl_getinfo($ch, CURLINFO_HTTP_CODE) separately to catch HTTP errors that cURL considers successful.

Is web scraping legal?

It depends on what you scrape and how you use it. Scraping publicly available data for personal use or research sits in a grey area in most jurisdictions. Scraping behind a login, ignoring robots.txt, violating a site’s terms of service, or using scraped data commercially introduces real legal risk. When in doubt, check the site’s terms of service and robots.txt before you start.


Where to Go From Here

At this point you have a working scraper that handles real-world conditions — proper headers, error recovery, pagination, database storage, session handling, and rate limiting. That covers most scraping projects you’ll encounter.

Three directions worth exploring next depending on what you’re building:

  • JavaScript-rendered sites — if cURL keeps returning empty content, look into running Puppeteer via Node.js alongside your PHP script. It handles sites that require JavaScript execution.
  • Automating scraper runs — once your scraper works reliably, scheduling it with a PHP cron job lets it run daily or hourly without manual intervention. Useful for price monitoring, news aggregation, or any data that changes over time.
  • Scaling to large volumes — scraping thousands of URLs benefits from running requests in parallel rather than sequentially. PHP’s cURL multi-handle (curl_multi_init()) lets you fire multiple requests simultaneously while keeping rate limits under control.
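
On that last point, here is a minimal curl_multi sketch: it fetches a small batch concurrently, and you would still pause between batches so your rate limits apply.

<?php
$urls = [
    "https://books.toscrape.com/catalogue/page-1.html",
    "https://books.toscrape.com/catalogue/page-2.html",
    "https://books.toscrape.com/catalogue/page-3.html",
];

$mh      = curl_multi_init();
$handles = [];

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Drive all transfers until none are still running
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh); // wait for activity instead of busy-looping
    }
} while ($running && $status === CURLM_OK);

foreach ($handles as $url => $ch) {
    echo $url . " : " . strlen(curl_multi_getcontent($ch)) . " bytes" . PHP_EOL;
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}

curl_multi_close($mh);
?>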

The code in this guide runs against books.toscrape.com – a site built specifically for scraping practice. Test everything there before pointing your scraper at a real target.

