Table of Contents
Scraping data to the terminal is useful for testing. For anything real – price tracking, job aggregation, content monitoring – you need it stored somewhere you can query, update, and run the scraper against repeatedly without losing previous results.
MySQL is the standard choice for PHP scraping projects. It handles millions of rows efficiently, runs on every hosting environment, and integrates cleanly with PHP via PDO. This guide covers the complete PHP MySQL scraping workflow – database setup, safe connections, inserting scraped data with prepared statements, handling duplicates, and querying stored results.
Every code block here uses PDO with prepared statements. If you’ve seen tutorials that build SQL strings by concatenating variables directly – like "INSERT INTO products VALUES ('$title', '$price')" – that approach is vulnerable to SQL injection and should never be used in production. Prepared statements are not optional.
What We’re Building
By the end of this guide you’ll have a complete PHP MySQL scraping script that:
- Fetches product data from a live website using cURL
- Parses the HTML with DOMDocument and XPath
- Stores results to MySQL using PDO prepared statements
- Handles duplicate records without crashing
- Retrieves and displays stored data
What You Need
- PHP 7.4 or higher with PDO and PDO_MySQL enabled
- MySQL 5.7 or higher
- cURL extension enabled
Verify PDO is available:
<?php
// Check PDO and MySQL driver are enabled
if (!extension_loaded('pdo')) {
echo "PDO not available." . PHP_EOL;
} elseif (!in_array('mysql', PDO::getAvailableDrivers())) {
echo "PDO MySQL driver not available." . PHP_EOL;
} else {
echo "PDO MySQL ready." . PHP_EOL;
}
?>
Output:
PDO MySQL ready.
If PDO MySQL isn’t available, uncomment extension=pdo_mysql in your php.ini file and restart your web server.
Creating the Database and Table
Run this SQL once to set up the database structure. The table design matters more than most tutorials suggest – the right columns and constraints prevent duplicate data and make querying efficient from day one.
Basic Table Setup
-- Create the database
CREATE DATABASE IF NOT EXISTS scraper_db
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
USE scraper_db;
-- Create the products table
CREATE TABLE IF NOT EXISTS products (
id INT AUTO_INCREMENT PRIMARY KEY,
title VARCHAR(255) NOT NULL,
price DECIMAL(10,2) DEFAULT NULL,
currency VARCHAR(10) DEFAULT 'GBP',
rating VARCHAR(20) DEFAULT NULL,
url VARCHAR(500) DEFAULT NULL,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
UNIQUE KEY unique_url (url)
);
Three decisions in this schema worth explaining:
utf8mb4 character set – standard utf8 in MySQL only supports 3-byte characters and silently drops emojis and some special characters. utf8mb4 is the correct full Unicode implementation. Always use it.
DECIMAL(10,2) for price – storing prices as VARCHAR means you can’t sort or compare them numerically. DECIMAL(10,2) stores up to 99,999,999.99 with exact precision – no floating point rounding errors on currency values.
UNIQUE KEY on url – prevents duplicate rows when you re-run the scraper. Combined with ON DUPLICATE KEY UPDATE later, this lets you run the scraper daily and always have fresh data without accumulating duplicates.
Adding Indexes for Querying
If you’ll query by price range or filter by rating frequently, add indexes to those columns:
-- Add after creating the table
ALTER TABLE products ADD INDEX idx_price (price);
ALTER TABLE products ADD INDEX idx_scraped_at (scraped_at);
ALTER TABLE products ADD INDEX idx_rating (rating);
Indexes slow down inserts slightly but make SELECT queries on large datasets dramatically faster. Add them on columns you’ll filter or sort by regularly.
Verifying the Structure
-- Confirm the table was created correctly
DESCRIBE products;
Output:
+------------+---------------+------+-----+-------------------+
| Field | Type | Null | Key | Default |
+------------+---------------+------+-----+-------------------+
| id | int | NO | PRI | NULL |
| title | varchar(255) | NO | | NULL |
| price | decimal(10,2) | YES | | NULL |
| currency | varchar(10) | YES | | GBP |
| rating | varchar(20) | YES | | NULL |
| url | varchar(500) | YES | UNI | NULL |
| scraped_at | timestamp | YES | | CURRENT_TIMESTAMP |
+------------+---------------+------+-----+-------------------+
Creating the Table From PHP
If you prefer to create the table programmatically rather than through a MySQL client, do it in PHP at script startup:
<?php
function create_table_if_not_exists($pdo) {
$sql = "CREATE TABLE IF NOT EXISTS products (
id INT AUTO_INCREMENT PRIMARY KEY,
title VARCHAR(255) NOT NULL,
price DECIMAL(10,2) DEFAULT NULL,
currency VARCHAR(10) DEFAULT 'GBP',
rating VARCHAR(20) DEFAULT NULL,
url VARCHAR(500) DEFAULT NULL,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
ON UPDATE CURRENT_TIMESTAMP,
UNIQUE KEY unique_url (url)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci";
try {
$pdo->exec($sql);
echo "Table ready." . PHP_EOL;
return true;
} catch (PDOException $e) {
echo "Failed to create table: " . $e->getMessage() . PHP_EOL;
return false;
}
}
?>
Connecting to MySQL With PDO
PDO (PHP Data Objects) is the correct way to connect to MySQL in modern PHP. It supports prepared statements natively, handles errors through exceptions rather than silent failures, and works with multiple database types if you ever need to switch.
The older mysqli extension works too but PDO is cleaner for scraping projects where you’re running many similar insert and select queries.
Basic PDO Connection
<?php
function get_db_connection() {
$host = 'localhost';
$dbname = 'scraper_db';
$username = 'your_username';
$password = 'your_password';
$charset = 'utf8mb4';
$dsn = "mysql:host=$host;dbname=$dbname;charset=$charset";
$options = [
PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
PDO::ATTR_DEFAULT_FETCH_MODE => PDO::FETCH_ASSOC,
PDO::ATTR_EMULATE_PREPARES => false,
];
try {
$pdo = new PDO($dsn, $username, $password, $options);
return $pdo;
} catch (PDOException $e) {
echo "Connection failed: " . $e->getMessage() . PHP_EOL;
return null;
}
}
$pdo = get_db_connection();
if ($pdo) {
echo "Connected to MySQL successfully." . PHP_EOL;
}
?>
Output:
Connected to MySQL successfully.
What Each PDO Option Does
ERRMODE_EXCEPTION – throws a PDOException on any database error instead of returning false silently. Without this, failed queries return false and your scraper continues saving nothing with no indication anything went wrong.
FETCH_ASSOC – returns rows as associative arrays ($row['title']) instead of both numeric and associative indexes ($row[0] and $row['title']). Cleaner to work with.
EMULATE_PREPARES = false – forces PDO to use real prepared statements rather than emulating them in PHP. Real prepared statements are more secure and let MySQL optimize repeated queries.
For the full list of PDO options and attributes see the official PHP PDO documentation.
Keeping Credentials Out of the Script
Hardcoding database credentials in your script is a security problem – they end up in version control, shared files, and server logs. Store them in a separate config file outside your web root:
<?php
// config.php - store this outside public_html
// e.g. /home/username/config.php not /home/username/public_html/config.php
return [
'db' => [
'host' => 'localhost',
'name' => 'scraper_db',
'username' => 'your_username',
'password' => 'your_password',
'charset' => 'utf8mb4',
],
];
?>
<?php
// In your scraper script - load config from outside web root
$config = require '/home/username/config.php';
function get_db_connection($config) {
$db = $config['db'];
$dsn = "mysql:host={$db['host']};dbname={$db['name']};charset={$db['charset']}";
$options = [
PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
PDO::ATTR_DEFAULT_FETCH_MODE => PDO::FETCH_ASSOC,
PDO::ATTR_EMULATE_PREPARES => false,
];
try {
return new PDO($dsn, $db['username'], $db['password'], $options);
} catch (PDOException $e) {
echo "Connection failed: " . $e->getMessage() . PHP_EOL;
return null;
}
}
$pdo = get_db_connection($config);
if ($pdo) {
echo "Connected using config file." . PHP_EOL;
}
?>
Output:
Connected using config file.
Testing the Connection
Verify the connection and character set are configured correctly before running any scraping jobs:
<?php
function test_db_connection($pdo) {
try {
// Check MySQL version
$version = $pdo->query("SELECT VERSION()")->fetchColumn();
echo "MySQL version: $version" . PHP_EOL;
// Check character set
$charset = $pdo->query("SELECT @@character_set_connection")->fetchColumn();
echo "Connection charset: $charset" . PHP_EOL;
// Check the products table exists
$tableCheck = $pdo->query(
"SELECT COUNT(*) FROM information_schema.tables
WHERE table_schema = DATABASE()
AND table_name = 'products'"
)->fetchColumn();
if ($tableCheck) {
echo "Products table: exists" . PHP_EOL;
} else {
echo "Products table: not found - run CREATE TABLE first" . PHP_EOL;
}
return true;
} catch (PDOException $e) {
echo "Test failed: " . $e->getMessage() . PHP_EOL;
return false;
}
}
$pdo = get_db_connection($config);
if ($pdo) {
test_db_connection($pdo);
}
?>
Output:
MySQL version: 8.0.37
Connection charset: utf8mb4
Products table: exists
Handling Connection Failures Gracefully
A scraping job that runs for an hour and then fails because the database connection dropped loses all collected data. Add a reconnect function for long-running jobs:
<?php
function get_db_connection_with_retry($config, $maxAttempts = 3) {
$attempt = 0;
while ($attempt < $maxAttempts) {
$attempt++;
$pdo = get_db_connection($config);
if ($pdo) {
if ($attempt > 1) {
echo "Connected on attempt $attempt." . PHP_EOL;
}
return $pdo;
}
echo "Connection attempt $attempt failed. Retrying in 5s..." . PHP_EOL;
sleep(5);
}
echo "All $maxAttempts connection attempts failed." . PHP_EOL;
return null;
}
$pdo = get_db_connection_with_retry($config);
if (!$pdo) {
exit("Cannot continue without database connection." . PHP_EOL);
}
echo "Ready to scrape and store." . PHP_EOL;
?>
Output on first attempt success:
Ready to scrape and store.
Output when first attempt fails:
Connection attempt 1 failed. Retrying in 5s...
Connected on attempt 2.
Ready to scrape and store.
Scraping and Inserting Data With Prepared Statements
This section puts the scraping and database storage together. The fetch function gets the HTML, the parse function extracts the data, and the insert function saves it to MySQL using prepared statements – not string concatenation.
Why Prepared Statements Are Non-Negotiable
This is how most tutorials show inserting scraped data:
<?php
// NEVER do this - SQL injection vulnerability
$title = "A Light in the Attic";
$price = "51.77";
$sql = "INSERT INTO products (title, price) VALUES ('$title', '$price')";
$conn->query($sql);
?>
If a scraped title contains a single quote – like “It’s Hard to Say Goodbye” – this query breaks. If it contains deliberately crafted SQL – possible when scraping user-generated content – it executes that SQL against your database. Prepared statements fix both problems permanently:
<?php
// Correct - prepared statement with bound parameters
$sql = "INSERT INTO products (title, price) VALUES (:title, :price)";
$stmt = $pdo->prepare($sql);
$stmt->execute([':title' => $title, ':price' => $price]);
?>
The parameters are never interpreted as SQL – they’re treated as data values regardless of their content. This is the only correct approach.
The Fetch Function
<?php
function fetch_page($url) {
$ch = curl_init();
curl_setopt_array($ch, [
CURLOPT_URL => $url,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_CONNECTTIMEOUT => 10,
CURLOPT_TIMEOUT => 30,
CURLOPT_ENCODING => '',
CURLOPT_HTTPHEADER => [
'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language: en-US,en;q=0.5',
],
]);
$html = curl_exec($ch);
$errno = curl_errno($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
if ($errno || $httpCode !== 200) {
echo "Fetch failed: HTTP $httpCode on $url" . PHP_EOL;
return false;
}
return $html;
}
?>
The Parse Function
<?php
function parse_books($html) {
if (!$html) return [];
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
$books = $xpath->query('//article[contains(@class,"product_pod")]');
$results = [];
foreach ($books as $book) {
$titleNode = $xpath->query('.//h3/a', $book)->item(0);
$priceNode = $xpath->query('.//*[contains(@class,"price_color")]', $book)->item(0);
$ratingNode = $xpath->query('.//*[contains(@class,"star-rating")]', $book)->item(0);
$linkNode = $xpath->query('.//h3/a', $book)->item(0);
$title = $titleNode ? $titleNode->getAttribute('title') : null;
$price = $priceNode ? (float) preg_replace('/[^0-9.]/', '', $priceNode->textContent) : null;
$rating = $ratingNode ? str_replace('star-rating ', '', $ratingNode->getAttribute('class')) : null;
$url = $linkNode ? "https://books.toscrape.com/catalogue/" . ltrim($linkNode->getAttribute('href'), '../') : null;
// Skip records with no title
if (!$title) continue;
$results[] = [
'title' => $title,
'price' => $price,
'currency' => 'GBP',
'rating' => $rating,
'url' => $url,
];
}
return $results;
}
?>
The Insert Function With Prepared Statements
<?php
function insert_product($pdo, $product) {
$sql = "INSERT INTO products (title, price, currency, rating, url)
VALUES (:title, :price, :currency, :rating, :url)";
try {
$stmt = $pdo->prepare($sql);
$stmt->execute([
':title' => $product['title'],
':price' => $product['price'],
':currency' => $product['currency'],
':rating' => $product['rating'],
':url' => $product['url'],
]);
echo "Inserted: {$product['title']} - £{$product['price']}" . PHP_EOL;
return true;
} catch (PDOException $e) {
echo "Insert failed for '{$product['title']}': " . $e->getMessage() . PHP_EOL;
return false;
}
}
?>
Putting It Together – Scrape and Store in One Script
<?php
set_time_limit(0);
$pdo = get_db_connection($config);
if (!$pdo) {
exit("Database connection failed." . PHP_EOL);
}
// Ensure table exists
create_table_if_not_exists($pdo);
// Fetch and parse
echo "Fetching page..." . PHP_EOL;
$html = fetch_page("https://books.toscrape.com/");
if (!$html) {
exit("Could not fetch page." . PHP_EOL);
}
$books = parse_books($html);
echo "Found " . count($books) . " books. Inserting..." . PHP_EOL . PHP_EOL;
$inserted = 0;
$failed = 0;
foreach ($books as $book) {
$result = insert_product($pdo, $book);
$result ? $inserted++ : $failed++;
}
echo PHP_EOL . "Done. Inserted: $inserted | Failed: $failed" . PHP_EOL;
?>
Output:
Fetching page...
Found 20 books. Inserting...
Inserted: A Light in the Attic - £51.77
Inserted: Tipping the Velvet - £53.74
Inserted: Soumission - £50.10
Inserted: Sharp Objects - £47.82
...
Done. Inserted: 20 | Failed: 0
Using a Transaction for Bulk Inserts
Inserting records one at a time inside a loop means each insert is its own database transaction. Wrapping the whole batch in a single transaction is significantly faster on large inserts – and rolls back cleanly if anything fails halfway through:
<?php
function insert_products_batch($pdo, $products) {
$sql = "INSERT INTO products (title, price, currency, rating, url)
VALUES (:title, :price, :currency, :rating, :url)";
$stmt = $pdo->prepare($sql);
$inserted = 0;
$failed = 0;
try {
$pdo->beginTransaction();
foreach ($products as $product) {
$stmt->execute([
':title' => $product['title'],
':price' => $product['price'],
':currency' => $product['currency'],
':rating' => $product['rating'],
':url' => $product['url'],
]);
$inserted++;
}
$pdo->commit();
echo "Transaction committed. Inserted: $inserted" . PHP_EOL;
} catch (PDOException $e) {
$pdo->rollBack();
echo "Transaction rolled back: " . $e->getMessage() . PHP_EOL;
$failed = count($products);
$inserted = 0;
}
return ['inserted' => $inserted, 'failed' => $failed];
}
// Usage
$books = parse_books($html);
$result = insert_products_batch($pdo, $books);
echo "Inserted: {$result['inserted']} | Failed: {$result['failed']}" . PHP_EOL;
?>
Output:
Transaction committed. Inserted: 20
Inserted: 20 | Failed: 0
On 1000 rows the transaction approach is typically 5-10x faster than individual inserts. The tradeoff is that if any single insert fails the entire batch rolls back. For scraping jobs where partial success is acceptable, insert individually. For jobs where you need all-or-nothing consistency, use a transaction.
Handling Duplicates With ON DUPLICATE KEY UPDATE
Re-running the scraper on the same site will try to insert records that already exist. Without duplicate handling the script crashes on the UNIQUE constraint violation – or worse, if there’s no constraint, silently creates duplicate rows.
The correct approach is INSERT ... ON DUPLICATE KEY UPDATE. It inserts new records and updates existing ones in a single query – no separate SELECT check needed, no extra round trips to the database.
Basic ON DUPLICATE KEY UPDATE
<?php
function upsert_product($pdo, $product) {
$sql = "INSERT INTO products (title, price, currency, rating, url)
VALUES (:title, :price, :currency, :rating, :url)
ON DUPLICATE KEY UPDATE
price = VALUES(price),
rating = VALUES(rating),
scraped_at = CURRENT_TIMESTAMP";
try {
$stmt = $pdo->prepare($sql);
$stmt->execute([
':title' => $product['title'],
':price' => $product['price'],
':currency' => $product['currency'],
':rating' => $product['rating'],
':url' => $product['url'],
]);
// rowCount() returns 1 for insert, 2 for update, 0 for no change
$rows = $stmt->rowCount();
if ($rows === 1) {
echo "Inserted: {$product['title']}" . PHP_EOL;
} elseif ($rows === 2) {
echo "Updated: {$product['title']}" . PHP_EOL;
} else {
echo "No change: {$product['title']}" . PHP_EOL;
}
return $rows;
} catch (PDOException $e) {
echo "Upsert failed for '{$product['title']}': " . $e->getMessage() . PHP_EOL;
return false;
}
}
// Test it
$pdo = get_db_connection($config);
$product = [
'title' => 'A Light in the Attic',
'price' => 51.77,
'currency' => 'GBP',
'rating' => 'One',
'url' => 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
];
// First run - inserts
upsert_product($pdo, $product);
// Second run same data - no change
upsert_product($pdo, $product);
// Third run with updated price - updates
$product['price'] = 47.50;
upsert_product($pdo, $product);
?>
Output:
Inserted: A Light in the Attic
No change: A Light in the Attic
Updated: A Light in the Attic
Tracking Insert vs Update Counts Across a Full Run
<?php
$html = fetch_page("https://books.toscrape.com/");
$books = parse_books($html);
$stats = [
'inserted' => 0,
'updated' => 0,
'unchanged' => 0,
'failed' => 0,
];
foreach ($books as $book) {
$result = upsert_product($pdo, $book);
if ($result === false) {
$stats['failed']++;
} elseif ($result === 1) {
$stats['inserted']++;
} elseif ($result === 2) {
$stats['updated']++;
} else {
$stats['unchanged']++;
}
}
echo PHP_EOL . "Run complete:" . PHP_EOL;
echo " Inserted: {$stats['inserted']}" . PHP_EOL;
echo " Updated: {$stats['updated']}" . PHP_EOL;
echo " Unchanged: {$stats['unchanged']}" . PHP_EOL;
echo " Failed: {$stats['failed']}" . PHP_EOL;
?>
Output on first run:
Inserted: A Light in the Attic
Inserted: Tipping the Velvet
...
Run complete:
Inserted: 20
Updated: 0
Unchanged: 0
Failed: 0
Output on second run when prices have changed:
Updated: A Light in the Attic
No change: Tipping the Velvet
...
Run complete:
Inserted: 0
Updated: 3
Unchanged: 17
Failed: 0
When to Preserve Old Values vs Overwrite
Sometimes you want to keep the original value rather than overwrite on update. Use IF inside the UPDATE clause to make that decision per column:
<?php
$sql = "INSERT INTO products (title, price, currency, rating, url)
VALUES (:title, :price, :currency, :rating, :url)
ON DUPLICATE KEY UPDATE
-- Always update price to latest value
price = VALUES(price),
-- Only update rating if it was previously NULL
rating = IF(rating IS NULL, VALUES(rating), rating),
-- Never overwrite the original URL
-- url column not included in UPDATE clause
-- Always update the timestamp
scraped_at = CURRENT_TIMESTAMP";
$stmt = $pdo->prepare($sql);
$stmt->execute([
':title' => $product['title'],
':price' => $product['price'],
':currency' => $product['currency'],
':rating' => $product['rating'],
':url' => $product['url'],
]);
?>
Checking if a Record Exists Before Inserting
Occasionally you need to know whether a record is new or existing before deciding what to do – not just after the upsert. Check first with a SELECT:
<?php
function product_exists($pdo, $url) {
$stmt = $pdo->prepare(
"SELECT COUNT(*) FROM products WHERE url = :url"
);
$stmt->execute([':url' => $url]);
return (int) $stmt->fetchColumn() > 0;
}
function insert_only_new($pdo, $product) {
if (product_exists($pdo, $product['url'])) {
echo "Skipping existing: {$product['title']}" . PHP_EOL;
return false;
}
$sql = "INSERT INTO products (title, price, currency, rating, url)
VALUES (:title, :price, :currency, :rating, :url)";
$stmt = $pdo->prepare($sql);
try {
$stmt->execute([
':title' => $product['title'],
':price' => $product['price'],
':currency' => $product['currency'],
':rating' => $product['rating'],
':url' => $product['url'],
]);
echo "Inserted new: {$product['title']}" . PHP_EOL;
return true;
} catch (PDOException $e) {
echo "Insert failed: " . $e->getMessage() . PHP_EOL;
return false;
}
}
// Usage
foreach ($books as $book) {
insert_only_new($pdo, $book);
}
?>
Output when all records already exist:
Skipping existing: A Light in the Attic
Skipping existing: Tipping the Velvet
Skipping existing: Soumission
...
The SELECT-then-INSERT approach uses two queries per record instead of one. For most scraping jobs ON DUPLICATE KEY UPDATE is faster and simpler. Use the SELECT check only when the decision of what to do with new vs existing records is more complex than the upsert handles.
Retrieving and Querying Stored Data
Storing data is only half the job. The value comes from querying it – filtering by price, sorting by rating, finding records scraped within a date range, or comparing current prices to previous ones.
Fetching All Records
<?php
function get_all_products($pdo) {
$stmt = $pdo->query(
"SELECT id, title, price, currency, rating, scraped_at
FROM products
ORDER BY scraped_at DESC"
);
return $stmt->fetchAll();
}
$products = get_all_products($pdo);
echo "Total products stored: " . count($products) . PHP_EOL . PHP_EOL;
foreach ($products as $product) {
echo "{$product['title']}" . PHP_EOL;
echo " Price: {$product['currency']} {$product['price']}" . PHP_EOL;
echo " Rating: {$product['rating']}" . PHP_EOL;
echo " Scraped: {$product['scraped_at']}" . PHP_EOL;
echo PHP_EOL;
}
?>
Output:
Total products stored: 20
A Light in the Attic
Price: GBP 51.77
Rating: One
Scraped: 2026-05-03 09:00:01
Tipping the Velvet
Price: GBP 53.74
Rating: One
Scraped: 2026-05-03 09:00:02
...
Filtering by Price Range
<?php
function get_products_by_price_range($pdo, $minPrice, $maxPrice) {
$stmt = $pdo->prepare(
"SELECT title, price, currency, rating
FROM products
WHERE price BETWEEN :min AND :max
ORDER BY price ASC"
);
$stmt->execute([
':min' => $minPrice,
':max' => $maxPrice,
]);
return $stmt->fetchAll();
}
// Get books priced between £10 and £30
$affordable = get_products_by_price_range($pdo, 10, 30);
echo "Books between £10 and £30: " . count($affordable) . PHP_EOL . PHP_EOL;
foreach ($affordable as $book) {
echo "{$book['title']} - £{$book['price']}" . PHP_EOL;
}
?>
Output:
Books between £10 and £30: 3
Starving Hearts - £13.99
The Black Maria - £22.50
Set Me Free - £27.89
Filtering by Rating
<?php
function get_products_by_rating($pdo, $rating) {
$stmt = $pdo->prepare(
"SELECT title, price, url
FROM products
WHERE rating = :rating
ORDER BY price ASC"
);
$stmt->execute([':rating' => $rating]);
return $stmt->fetchAll();
}
// Get all five-star books
$topRated = get_products_by_rating($pdo, 'Five');
echo "Five-star books: " . count($topRated) . PHP_EOL . PHP_EOL;
foreach ($topRated as $book) {
echo "{$book['title']} - £{$book['price']}" . PHP_EOL;
}
?>
Output:
Five-star books: 4
Libertarianism for Beginners - £51.33
Mesaerion: The Best Science Fiction Stories 1800-1849 - £37.59
Olio - £23.88
Our Band Could Be Your Life - £57.25
Getting Summary Statistics
<?php
function get_product_stats($pdo) {
$stmt = $pdo->query(
"SELECT
COUNT(*) AS total_products,
MIN(price) AS cheapest,
MAX(price) AS most_expensive,
ROUND(AVG(price), 2) AS average_price,
SUM(price) AS total_value,
MIN(scraped_at) AS first_scraped,
MAX(scraped_at) AS last_scraped
FROM products"
);
return $stmt->fetch();
}
$stats = get_product_stats($pdo);
echo "Database Statistics:" . PHP_EOL;
echo str_repeat('-', 40) . PHP_EOL;
echo "Total products: {$stats['total_products']}" . PHP_EOL;
echo "Cheapest: £{$stats['cheapest']}" . PHP_EOL;
echo "Most expensive: £{$stats['most_expensive']}" . PHP_EOL;
echo "Average price: £{$stats['average_price']}" . PHP_EOL;
echo "Total value: £{$stats['total_value']}" . PHP_EOL;
echo "First scraped: {$stats['first_scraped']}" . PHP_EOL;
echo "Last scraped: {$stats['last_scraped']}" . PHP_EOL;
?>
Output:
Database Statistics:
----------------------------------------
Total products: 20
Cheapest: £10.00
Most expensive: £57.25
Average price: £41.73
Total value: £834.60
First scraped: 2026-05-01 09:00:01
Last scraped: 2026-05-03 09:00:01
Searching by Keyword
<?php
function search_products($pdo, $keyword) {
$stmt = $pdo->prepare(
"SELECT title, price, rating
FROM products
WHERE title LIKE :keyword
ORDER BY price ASC"
);
// Wrap keyword in wildcards for partial matching
$stmt->execute([':keyword' => '%' . $keyword . '%']);
return $stmt->fetchAll();
}
$results = search_products($pdo, 'the');
echo "Products containing 'the': " . count($results) . PHP_EOL . PHP_EOL;
foreach ($results as $product) {
echo "{$product['title']} - £{$product['price']}" . PHP_EOL;
}
?>
Output:
Products containing 'the': 5
Tipping the Velvet - £53.74
In the Woods - £19.63
The Black Maria - £22.50
The Requiem Red - £32.83
The Dirty Little Secrets of Getting Your Dream Job - £33.34
Getting Recently Scraped Products
<?php
function get_recently_scraped($pdo, $hours = 24) {
$stmt = $pdo->prepare(
"SELECT title, price, scraped_at
FROM products
WHERE scraped_at >= DATE_SUB(NOW(), INTERVAL :hours HOUR)
ORDER BY scraped_at DESC"
);
$stmt->execute([':hours' => $hours]);
return $stmt->fetchAll();
}
// Get everything scraped in the last 24 hours
$recent = get_recently_scraped($pdo, 24);
echo "Products scraped in last 24 hours: " . count($recent) . PHP_EOL;
foreach ($recent as $product) {
echo "{$product['title']} - scraped at {$product['scraped_at']}" . PHP_EOL;
}
?>
Output:
Products scraped in last 24 hours: 20
A Light in the Attic - scraped at 2026-05-03 09:00:01
Tipping the Velvet - scraped at 2026-05-03 09:00:02
...
Complete End-to-End PHP MySQL Scraping Script
This brings everything together – connection, table creation, fetching, parsing, upserting, and a summary report. Save this as scraper.php and run it with php scraper.php.
<?php
// ============================================
// PHP MySQL Scraping - Complete Script
// ============================================
set_time_limit(0);
error_reporting(E_ALL);
ini_set('log_errors', 1);
ini_set('error_log', __DIR__ . '/scraper_errors.log');
// ---- Configuration ----
$dbConfig = [
'host' => 'localhost',
'name' => 'scraper_db',
'username' => 'your_username',
'password' => 'your_password',
'charset' => 'utf8mb4',
];
$targetUrl = "https://books.toscrape.com/";
$logFile = __DIR__ . '/scraper.log';
$startTime = microtime(true);
// ---- Logging ----
function log_message($message) {
global $logFile;
$entry = '[' . date('Y-m-d H:i:s') . '] ' . $message . PHP_EOL;
file_put_contents($logFile, $entry, FILE_APPEND);
echo $entry;
}
// ---- Database ----
function get_db_connection($config) {
$dsn = "mysql:host={$config['host']};dbname={$config['name']};charset={$config['charset']}";
try {
return new PDO($dsn, $config['username'], $config['password'], [
PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
PDO::ATTR_DEFAULT_FETCH_MODE => PDO::FETCH_ASSOC,
PDO::ATTR_EMULATE_PREPARES => false,
]);
} catch (PDOException $e) {
log_message("DB connection failed: " . $e->getMessage());
return null;
}
}
function create_table($pdo) {
$pdo->exec("CREATE TABLE IF NOT EXISTS products (
id INT AUTO_INCREMENT PRIMARY KEY,
title VARCHAR(255) NOT NULL,
price DECIMAL(10,2) DEFAULT NULL,
currency VARCHAR(10) DEFAULT 'GBP',
rating VARCHAR(20) DEFAULT NULL,
url VARCHAR(500) DEFAULT NULL,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
ON UPDATE CURRENT_TIMESTAMP,
UNIQUE KEY unique_url (url)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci");
}
// ---- Fetch ----
function fetch_page($url) {
$ch = curl_init();
curl_setopt_array($ch, [
CURLOPT_URL => $url,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_CONNECTTIMEOUT => 10,
CURLOPT_TIMEOUT => 30,
CURLOPT_ENCODING => '',
CURLOPT_HTTPHEADER => [
'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language: en-US,en;q=0.5',
],
]);
$html = curl_exec($ch);
$errno = curl_errno($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
if ($errno || $httpCode !== 200) {
log_message("Fetch failed: HTTP $httpCode on $url");
return false;
}
return $html;
}
// ---- Parse ----
function parse_books($html) {
if (!$html) return [];
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
$books = $xpath->query('//article[contains(@class,"product_pod")]');
$results = [];
foreach ($books as $book) {
$titleNode = $xpath->query('.//h3/a', $book)->item(0);
$priceNode = $xpath->query('.//*[contains(@class,"price_color")]', $book)->item(0);
$ratingNode = $xpath->query('.//*[contains(@class,"star-rating")]', $book)->item(0);
$linkNode = $xpath->query('.//h3/a', $book)->item(0);
$title = $titleNode ? $titleNode->getAttribute('title') : null;
$price = $priceNode ? (float) preg_replace('/[^0-9.]/', '', $priceNode->textContent) : null;
$rating= $ratingNode ? str_replace('star-rating ', '', $ratingNode->getAttribute('class')) : null;
$url = $linkNode ? "https://books.toscrape.com/catalogue/" . ltrim($linkNode->getAttribute('href'), '../') : null;
if (!$title) continue;
$results[] = [
'title' => $title,
'price' => $price,
'currency' => 'GBP',
'rating' => $rating,
'url' => $url,
];
}
return $results;
}
// ---- Upsert ----
function upsert_product($pdo, $product) {
$sql = "INSERT INTO products (title, price, currency, rating, url)
VALUES (:title, :price, :currency, :rating, :url)
ON DUPLICATE KEY UPDATE
price = VALUES(price),
rating = VALUES(rating),
scraped_at = CURRENT_TIMESTAMP";
$stmt = $pdo->prepare($sql);
$stmt->execute([
':title' => $product['title'],
':price' => $product['price'],
':currency' => $product['currency'],
':rating' => $product['rating'],
':url' => $product['url'],
]);
return $stmt->rowCount();
}
// ============================================
// MAIN SCRIPT
// ============================================
log_message("Scraper started.");
// Connect
$pdo = get_db_connection($dbConfig);
if (!$pdo) {
exit("Cannot continue without database." . PHP_EOL);
}
// Setup table
create_table($pdo);
log_message("Database ready.");
// Scrape
$currentUrl = $targetUrl;
$page = 1;
$stats = [
'inserted' => 0,
'updated' => 0,
'unchanged' => 0,
'failed' => 0,
'pages' => 0,
];
while ($currentUrl) {
log_message("Fetching page $page: $currentUrl");
$html = fetch_page($currentUrl);
if (!$html) {
log_message("Failed to fetch page $page - stopping.");
break;
}
$books = parse_books($html);
if (empty($books)) {
log_message("No books found on page $page - stopping.");
break;
}
// Upsert each book
try {
$pdo->beginTransaction();
foreach ($books as $book) {
$result = upsert_product($pdo, $book);
if ($result === 1) $stats['inserted']++;
elseif ($result === 2) $stats['updated']++;
else $stats['unchanged']++;
}
$pdo->commit();
} catch (PDOException $e) {
$pdo->rollBack();
log_message("Transaction failed on page $page: " . $e->getMessage());
$stats['failed'] += count($books);
}
$stats['pages']++;
log_message("Page $page done - " . count($books) . " books processed.");
// Find next page
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
$nextNode = $xpath->query('//li[contains(@class,"next")]/a')->item(0);
if ($nextNode) {
$currentUrl = "https://books.toscrape.com/catalogue/"
. $nextNode->getAttribute('href');
$page++;
sleep(1);
} else {
log_message("Last page reached.");
$currentUrl = null;
}
unset($html, $books, $dom, $xpath);
}
// Summary
$duration = round(microtime(true) - $startTime, 2);
$summary = "
============================================
Scrape Complete: " . date('Y-m-d H:i:s') . "
Duration: {$duration}s
Pages: {$stats['pages']}
Inserted: {$stats['inserted']}
Updated: {$stats['updated']}
Unchanged: {$stats['unchanged']}
Failed: {$stats['failed']}
============================================";
log_message($summary);
?>
Output:
[2026-05-03 09:00:01] Scraper started.
[2026-05-03 09:00:01] Database ready.
[2026-05-03 09:00:01] Fetching page 1: https://books.toscrape.com/
[2026-05-03 09:00:02] Page 1 done - 20 books processed.
[2026-05-03 09:00:03] Fetching page 2: https://books.toscrape.com/catalogue/page-2.html
[2026-05-03 09:00:04] Page 2 done - 20 books processed.
...
[2026-05-03 09:02:14] Last page reached.
============================================
Scrape Complete: 2026-05-03 09:02:14
Duration: 133.4s
Pages: 50
Inserted: 1000
Updated: 0
Unchanged: 0
Failed: 0
============================================
Frequently Asked Questions
Should I use PDO or mysqli for PHP MySQL scraping?
PDO. It supports prepared statements natively, throws exceptions on errors instead of returning false silently, works with multiple database types, and has a cleaner API for the kind of repeated parameterized queries scraping jobs run. mysqli works but gives you no advantage over PDO and has more verbose syntax for prepared statements.
How do I avoid duplicate records when running the scraper multiple times?
Two parts. First, add a UNIQUE constraint on the column that identifies a unique record – typically the URL. Second, use INSERT ... ON DUPLICATE KEY UPDATE instead of plain INSERT. This handles new records and updates to existing ones in a single query without checking for existence first. The UNIQUE constraint is not optional – without it ON DUPLICATE KEY UPDATE has nothing to trigger on.
Why should I store price as DECIMAL instead of VARCHAR?
VARCHAR prices can’t be sorted or compared numerically without conversion. DECIMAL(10,2) stores exact currency values without floating point rounding errors and lets you run queries like WHERE price BETWEEN 10 AND 50 or ORDER BY price ASC correctly. Strip the currency symbol before inserting and store the number alone – keep currency in a separate column.
Is MySQL good enough for large-scale scraping?
For most scraping projects – millions of rows, dozens of columns, daily updates – MySQL handles it fine with proper indexing. Add indexes on columns you filter or sort by frequently. For truly massive datasets at hundreds of millions of rows, or when you need horizontal scaling across multiple servers, PostgreSQL or a distributed database becomes worth considering. For anything under 50 million rows MySQL with correct schema design is more than enough.
How do I store scraped data from multiple websites in the same database?
Add a source column to your table:
ALTER TABLE products ADD COLUMN source VARCHAR(100) DEFAULT NULL;
ALTER TABLE products DROP INDEX unique_url;
ALTER TABLE products ADD UNIQUE KEY unique_url_source (url, source);
The composite UNIQUE key on url and source prevents duplicates within the same site while allowing the same URL to be tracked from different sources. Insert with ':source' => 'books.toscrape.com' in the parameters and filter by source in your queries.
How do I export scraped data from MySQL to CSV?
<?php
$stmt = $pdo->query("SELECT title, price, currency, rating, url, scraped_at FROM products");
$rows = $stmt->fetchAll();
$csvFile = __DIR__ . '/products_export.csv';
$handle = fopen($csvFile, 'w');
// Header row
fputcsv($handle, ['Title', 'Price', 'Currency', 'Rating', 'URL', 'Scraped At']);
// Data rows
foreach ($rows as $row) {
fputcsv($handle, $row);
}
fclose($handle);
echo "Exported " . count($rows) . " rows to products_export.csv" . PHP_EOL;
?>
Summary
PHP MySQL scraping with PDO and prepared statements is the foundation of any serious data collection project. The key points:
- Always use PDO with prepared statements – never concatenate variables into SQL strings. One scraped title with a single quote will break a raw query. Prepared statements prevent this permanently.
- Design the schema before writing code – DECIMAL for prices, utf8mb4 for character set, UNIQUE constraints on identifying columns. Fixing schema mistakes on a populated table is painful.
- Use ON DUPLICATE KEY UPDATE – handles inserts and updates in one query, makes the scraper safe to run repeatedly without manual duplicate checking.
- Use transactions for batch inserts – significantly faster than individual commits and ensures consistency if something fails mid-batch.
- Index columns you query on – price, rating, scraped_at. Without indexes, filtering a large table scans every row.
For the complete scraping setup this database layer connects to, the PHP cURL web scraping complete guide covers fetching, parsing, error handling, and pagination in full detail. For automating this script to run daily without manual intervention, the PHP cron job guide covers scheduling, logging, and debugging from start to finish. For a complete project that combines scraping, MySQL storage, and email alerts, the PHP price tracker guide puts all of it together.
