Web scraping at scale in PHP requires a completely different architecture than a simple loop. Building a scraper that works on 10 pages is easy. Building one that reliably handles 10,000 pages without overloading the target site, crashing your server, or getting blocked demands queues, rate limiting, and failure recovery.
This guide covers the patterns you actually need for production scraping at scale – job queues, parallel requests, rate limiting, failure handling, monitoring and recovery. All with working code.
The Problems With Naive Scraping
A naive loop that tries to scrape 10,000 pages:
<?php
// This works for small scale but breaks at 10,000 pages
for ($page = 1; $page <= 10000; $page++) {
$url = "https://example.com/products?page=$page";
$html = file_get_contents($url);
// Parse and store...
}
?>
Problems that show up fast:
- Memory growth – if parsed results or DOM objects pile up in one long-running process, you can run out of memory long before the last page
- One failure kills everything – one network timeout and you restart from page 1
- Gets you blocked – 10,000 rapid requests look like a DDoS attack
- Can’t resume – if it crashes at page 5000, you lose all progress
- No visibility – you don’t know what’s happening or what failed
- Overloads the target – you might crash their server
At scale, you need a different architecture.
Web Scraping at Scale in PHP: The Queue-Based Architecture
Instead of a loop, use a job queue:
URLs to Scrape
↓
Queue System (Redis / Database)
↓
Worker Processes (pull URLs from queue)
↓
Rate Limiter (respect target site)
↓
HTTP Requests (with retries)
↓
Data Processor
↓
Storage (MySQL)
↓
Monitoring & Alerts
Benefits:
- Distribute work across multiple workers
- Pause/resume without losing progress
- Rate limit automatically
- Retry failed requests intelligently
- Monitor what’s happening in real time
- Scale up or down as needed
Setting Up a Redis-Based Queue
Redis is a good fit for job queues: it is fast, simple to operate, and has list and set commands that map directly onto queue operations.
Install Redis:
# macOS
brew install redis
# Ubuntu/Debian
sudo apt-get install redis-server
# Verify
redis-cli ping
# Output: PONG
PHP Redis client:
composer require predis/predis
Create a queue manager:
<?php
require __DIR__ . '/vendor/autoload.php'; // Composer autoloader (loads Predis)
class ScrapingQueue {
private $redis;
private $queue_key = 'scraping:urls:queue';
private $working_key = 'scraping:urls:working';
private $failed_key = 'scraping:urls:failed';
private $completed_key = 'scraping:urls:completed';
public function __construct() {
$this->redis = new \Predis\Client();
}
// Add URLs to queue
public function enqueue($urls) {
foreach ($urls as $url) {
$this->redis->rpush(
$this->queue_key,
json_encode([
'url' => $url,
'attempts' => 0,
'created_at' => time()
])
);
}
return count($urls);
}
// Get next job from queue
public function dequeue() {
$job = $this->redis->lpop($this->queue_key);
if ($job) {
// Move to "working" set to track in-progress jobs
$this->redis->sadd($this->working_key, $job);
return json_decode($job, true);
}
return null;
}
// Mark job as completed
public function complete($job) {
$job_json = json_encode($job);
$this->redis->srem($this->working_key, $job_json);
$this->redis->sadd($this->completed_key,
$job_json);
$this->redis->incr('scraping:stats:completed');
}
// Mark job as failed (will retry)
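public function fail($job, $error = '') {
// Encode before mutating: SREM only removes the exact string stored by dequeue()
$original_json = json_encode($job);
$this->redis->srem($this->working_key, $original_json);
$job['attempts']++;
$job['last_error'] = $error;
$job['failed_at'] = time();
$job_json = json_encode($job);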
// Retry up to 5 times
if ($job['attempts'] < 5) {
$this->redis->rpush(
$this->queue_key,
$job_json
);
$this->redis->incr('scraping:stats:retried');
} else {
// Permanent failure after 5 attempts
$this->redis->sadd(
$this->failed_key,
$job_json
);
$this->redis->incr('scraping:stats:failed');
}
}
// Get queue stats
public function stats() {
return [
'pending' => $this->redis->llen(
$this->queue_key
),
'working' => $this->redis->scard(
$this->working_key
),
'completed' => $this->redis->scard(
$this->completed_key
),
'failed' => $this->redis->scard(
$this->failed_key
),
'stats' => [
'completed' => (int)$this->redis->get(
'scraping:stats:completed'
) ?: 0,
'failed' => (int)$this->redis->get(
'scraping:stats:failed'
) ?: 0,
'retried' => (int)$this->redis->get(
'scraping:stats:retried'
) ?: 0
]
];
}
}
// Usage: Add initial URLs
$queue = new ScrapingQueue();
// Generate URLs for all pages
$urls = [];
for ($page = 1; $page <= 10000; $page++) {
$urls[] = "https://example.com/products?page=$page";
}
$count = $queue->enqueue($urls);
echo "Queued $count URLs\n";
?>
For details on the commands used here, see the official Redis documentation.
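The working set stores each job as its exact JSON string, so a set removal only succeeds when it is given a byte-identical string. The sketch below uses no Redis at all: a plain PHP array stands in for the set, and `removeFromSet()` is a hypothetical helper written for this illustration. It shows why a job should be encoded before any of its fields are changed.

```php
<?php
// Hypothetical stand-in for Redis SREM: removes $member only on an exact string match.
function removeFromSet(array $set, string $member): array {
    return array_values(array_filter($set, fn($m) => $m !== $member));
}

$job = [
    'url' => 'https://example.com/products?page=1',
    'attempts' => 0,
    'created_at' => 1700000000,
];
$stored = json_encode($job);   // the string a dequeue step would put in the working set
$working = [$stored];

// Decoding and re-encoding an unchanged job gives the same string back.
$decoded = json_decode($stored, true);
$sameString = json_encode($decoded) === $stored;

// After mutating the job, its encoding no longer matches the stored member.
$mutated = $decoded;
$mutated['attempts']++;
$afterWrongRemove = removeFromSet($working, json_encode($mutated));  // nothing removed
$afterRightRemove = removeFromSet($working, json_encode($decoded));  // entry removed
```

An alternative design is to give every job a short unique ID and track only that ID in the working set, which avoids depending on the JSON encoding at all.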
Rate Limiting and Throttling
Scrape too fast and you get blocked. Rate limiting is essential:
<?php
class RateLimiter {
private $redis;
private $domain;
private $requests_per_second = 1;
public function __construct($domain,
$requests_per_second = 1) {
$this->redis = new \Predis\Client();
$this->domain = $domain;
$this->requests_per_second = $requests_per_second;
}
// Wait until safe to make next request
public function wait() {
$key = "rate_limit:{$this->domain}";
$window = 1; // 1 second window
while (true) {
$requests = $this->redis->incr($key);
if ($requests === 1) {
// First request in window
$this->redis->expire($key, $window);
}
if ($requests <= $this->requests_per_second) {
// Within limit
break;
}
// Wait 100ms and try again
usleep(100000);
}
}
// Concurrent request limit
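public function acquireSlot() {
$key = "slots:{$this->domain}";
$slots = 3; // max 3 concurrent
$slot_id = uniqid('', true);
while (true) {
// RPUSH returns the new list length, so claim first and check after.
// A separate LLEN check would let two workers grab the last slot at once.
if ($this->redis->rpush($key, $slot_id) <= $slots) {
return $slot_id;
}
// Over the limit: give the slot back and try again shortly
$this->redis->lrem($key, 0, $slot_id);
usleep(10000); // 10ms
}
}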
public function releaseSlot($slot_id) {
$key = "slots:{$this->domain}";
$this->redis->lrem($key, 0, $slot_id);
}
}
// Usage in worker
$limiter = new RateLimiter('example.com', 1); // 1 request/sec
$url = 'https://example.com/products?page=1'; // example URL to fetch
$slot = $limiter->acquireSlot();
$limiter->wait();
try {
// Make request
$html = file_get_contents($url);
} finally {
$limiter->releaseSlot($slot);
}
?>
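The `wait()` method is a fixed-window counter: count requests in the current one-second window and stall once the count passes the limit. The same logic can be sketched without Redis for a single process. `FixedWindowCounter` and its injected timestamp are assumptions made for illustration and are not part of the classes above. It only works inside one worker, whereas the Redis version is shared by all of them.

```php
<?php
// In-process fixed-window counter (illustrative only; not shared across workers).
class FixedWindowCounter {
    private int $limit;
    private int $windowSeconds;
    private int $windowStart = 0;
    private int $count = 0;

    public function __construct(int $limit, int $windowSeconds = 1) {
        $this->limit = $limit;
        $this->windowSeconds = $windowSeconds;
    }

    // Returns true if a request may be sent at time $now (Unix seconds).
    public function allow(int $now): bool {
        if ($now - $this->windowStart >= $this->windowSeconds) {
            // A new window has begun: reset the counter.
            $this->windowStart = $now;
            $this->count = 0;
        }
        if ($this->count < $this->limit) {
            $this->count++;
            return true;
        }
        return false;
    }
}

$counter = new FixedWindowCounter(2);   // 2 requests per second
$results = [
    $counter->allow(100),  // first request in the window
    $counter->allow(100),  // second request, still within the limit
    $counter->allow(100),  // third request, over the limit
    $counter->allow(101),  // next window, allowed again
];
```

Passing the timestamp in, rather than calling `time()` inside the class, keeps the logic easy to test.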
The Worker Process
A worker pulls jobs from the queue and processes them:
<?php
class ScrapingWorker {
private $queue;
private $limiter;
private $db;
public function __construct($domain) {
$this->queue = new ScrapingQueue();
$this->limiter = new RateLimiter($domain, 1);
$this->db = new PDO(
'mysql:host=localhost;dbname=scraping',
'root',
''
);
}
// Run worker continuously
public function start() {
echo "Worker started\n";
while (true) {
// Check queue
$job = $this->queue->dequeue();
if (!$job) {
echo "Queue empty, sleeping...\n";
sleep(5);
continue;
}
// Process job
try {
$this->processJob($job);
$this->queue->complete($job);
echo "✓ Processed: {$job['url']}\n";
} catch (\Exception $e) {
echo "✗ Failed: {$e->getMessage()}\n";
$this->queue->fail($job,
$e->getMessage());
sleep(2); // Back off on error
}
}
}
private function processJob($job) {
$url = $job['url'];
// Rate limit
$this->limiter->wait();
// Fetch with timeout and retry logic
$html = $this->fetchWithRetry($url);
if (!$html) {
throw new Exception(
"Failed to fetch after retries"
);
}
// Parse HTML
$products = $this->parseProducts($html);
// Store in database
$this->storeProducts($products);
}
private function fetchWithRetry($url,
$max_retries = 3) {
for ($attempt = 1; $attempt <= $max_retries;
$attempt++) {
try {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,
true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
curl_setopt($ch, CURLOPT_USERAGENT,
'ScraperBot/1.0 (+http://mysite.com)');
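$response = curl_exec($ch);
$http_code = curl_getinfo($ch,
CURLINFO_HTTP_CODE);
$curl_error = curl_error($ch);
curl_close($ch);
// curl_exec() returns false on network errors instead of throwing
if ($response === false) {
throw new Exception("cURL error: $curl_error");
}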
// Success
if ($http_code == 200) {
return $response;
}
// Rate limited - back off exponentially
if ($http_code == 429) {
$wait = pow(2, $attempt);
echo "Rate limited. Waiting {$wait}s\n";
sleep($wait);
continue;
}
// Other error
if ($http_code >= 400) {
throw new Exception(
"HTTP $http_code"
);
}
} catch (\Exception $e) {
if ($attempt == $max_retries) {
throw $e;
}
sleep(2);
continue;
}
}
return null;
}
private function parseProducts($html) {
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_use_internal_errors(false);
$products = [];
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//div[@class="product"]')
as $node) {
$title = $xpath->query(
'.//h2[@class="title"]',
$node
)->item(0)?->textContent;
$price = $xpath->query(
'.//span[@class="price"]',
$node
)->item(0)?->textContent;
if ($title && $price) {
$products[] = [
'title' => trim($title),
'price' => trim($price)
];
}
}
return $products;
}
private function storeProducts($products) {
$stmt = $this->db->prepare(
'INSERT INTO products (title, price)
VALUES (?, ?)
ON DUPLICATE KEY UPDATE
price = VALUES(price)'
);
foreach ($products as $product) {
$stmt->execute([
$product['title'],
$product['price']
]);
}
}
}
// Run worker
$worker = new ScrapingWorker('example.com');
$worker->start();
?>
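`fetchWithRetry()` handles 429 separately but treats every other status of 400 or above the same way. One possible refinement is to sort responses into "done", "retry later", and "give up" before deciding what to do. The name `classifyStatus()` and its labels are hypothetical, and the grouping is one reasonable reading of HTTP semantics, not a rule the worker above follows.

```php
<?php
// Hypothetical helper: map an HTTP status code (0 = no response) to an action.
function classifyStatus(int $code): string {
    if ($code >= 200 && $code < 300) {
        return 'ok';
    }
    // 0 means cURL got no HTTP response at all (timeout, DNS failure, ...).
    // 408, 429 and 5xx are usually temporary conditions.
    if ($code === 0 || $code === 408 || $code === 429 || $code >= 500) {
        return 'retry';
    }
    // Remaining 4xx codes (404, 410, 403, ...) are unlikely to improve on retry.
    if ($code >= 400) {
        return 'permanent';
    }
    // 1xx and 3xx: left to the caller, e.g. by enabling CURLOPT_FOLLOWLOCATION.
    return 'other';
}

$actions = array_map('classifyStatus', [200, 404, 429, 503, 0, 301]);
```

A worker could use the result to skip retries for permanent errors, so a batch of dead links does not consume five attempts each.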
Running Multiple Workers in Parallel
Scale horizontally by running multiple workers:
# Terminal 1
php worker.php
# Terminal 2
php worker.php
# Terminal 3
php worker.php
# You now have 3 workers processing jobs in parallel
# Each respects rate limiting automatically
Use supervisor or systemd to keep workers running:
# /etc/supervisor/conf.d/scraper.conf
[program:scraper-worker]
process_name=%(program_name)s_%(process_num)02d
command=php /path/to/worker.php
autostart=true
autorestart=true
numprocs=3
redirect_stderr=true
stdout_logfile=/var/log/scraper.log
Monitoring and Dashboard
You need visibility into what’s happening:
<?php
class ScrapingMonitor {
private $redis;
public function __construct() {
$this->redis = new \Predis\Client();
}
public function getStatus() {
$queue = new ScrapingQueue();
$stats = $queue->stats();
$total = $stats['stats']['completed'] +
$stats['stats']['failed'] +
$stats['pending'];
$progress = $total > 0 ?
round(($stats['stats']['completed'] / $total) * 100, 2) :
0;
return [
'status' => [
'pending' => $stats['pending'],
'working' => $stats['working'],
'completed' => $stats['stats']['completed'],
'failed' => $stats['stats']['failed'],
'retried' => $stats['stats']['retried']
],
'progress' => $progress . '%',
'estimated_time' => $this->estimateTime(
$stats['stats']['completed'],
$stats['pending']
),
'failed_urls' => $this->getFailedUrls(10)
];
}
private function estimateTime($completed,
$remaining) {
if ($completed == 0) return 'calculating...';
// Average 1 URL per second (adjust as needed)
$seconds = $remaining;
$hours = floor($seconds / 3600);
$minutes = floor(($seconds % 3600) / 60);
return "{$hours}h {$minutes}m remaining";
}
private function getFailedUrls($limit) {
$queue = new ScrapingQueue();
// Implementation to fetch failed URLs
return [];
}
}
// API endpoint
header('Content-Type: application/json');
$monitor = new ScrapingMonitor();
echo json_encode($monitor->getStatus(),
JSON_PRETTY_PRINT);
// Output:
// {
// "status": {
// "pending": 5000,
// "working": 3,
// "completed": 4500,
// "failed": 10,
// "retried": 150
// },
// "progress": "47.32%",
// "estimated_time": "1h 23m remaining",
// "failed_urls": []
// }
?>
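`estimateTime()` assumes one URL per second. If the start time of the run is recorded, the rate can be measured instead. The function below is a sketch: `estimateRemaining()` and its `$elapsedSeconds` parameter are assumptions, since the monitor above does not track when the run began.

```php
<?php
// Hypothetical helper: estimate time left from the observed completion rate.
function estimateRemaining(int $completed, int $remaining, int $elapsedSeconds): string {
    if ($completed <= 0 || $elapsedSeconds <= 0) {
        return 'calculating...';
    }
    $rate = $completed / $elapsedSeconds;        // URLs per second so far
    $seconds = (int) ceil($remaining / $rate);   // seconds left at that rate
    $hours = intdiv($seconds, 3600);
    $minutes = intdiv($seconds % 3600, 60);
    return "{$hours}h {$minutes}m remaining";
}

// 4,500 URLs done in 9,000 s (0.5 URL/s) with 5,000 still pending.
$eta = estimateRemaining(4500, 5000, 9000);
```

At the measured 0.5 URL/s, the 5,000 pending URLs take 10,000 seconds, so `$eta` is "2h 46m remaining" rather than the 1h 23m that a fixed 1 URL/s would suggest.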
Handling Failures and Recovery
At scale, failures are normal. Handle them gracefully:
<?php
class FailureHandler {
private $db;
public function __construct() {
$this->db = new PDO(
'mysql:host=localhost;dbname=scraping',
'root',
''
);
}
// Log failure for investigation
public function logFailure($url, $error,
$attempt) {
$stmt = $this->db->prepare(
'INSERT INTO scraping_failures
(url, error, attempt, logged_at)
VALUES (?, ?, ?, NOW())'
);
$stmt->execute([$url, $error, $attempt]);
}
// Decide whether to retry, based on the kind of error
public function shouldRetry($attempt,
$error) {
// Network errors: retry
if (strpos($error, 'timeout') !== false) {
return $attempt < 5;
}
// 503 Service Unavailable: retry
if (strpos($error, '503') !== false) {
return $attempt < 3;
}
// 404 Not Found: don't retry
if (strpos($error, '404') !== false) {
return false;
}
// Default: retry up to 3 times
return $attempt < 3;
}
// Report critical issues
public function alertOnCriticalFailure($error) {
// Send email, Slack notification, etc
if (stripos($error, 'database') !== false) {
// Database is down - critical
$this->sendAlert(
"CRITICAL: Database connection failed",
$error
);
}
if (stripos($error, 'memory') !== false) {
// Out of memory - critical
$this->sendAlert(
"CRITICAL: Out of memory",
$error
);
}
}
private function sendAlert($subject, $message) {
// Send to monitoring system
mail('ops@example.com', $subject, $message);
}
}
?>
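`shouldRetry()` decides whether to retry but not how long to wait. A common companion is a delay that doubles with each attempt up to a ceiling. `backoffDelay()` and its default values are hypothetical. Jitter is left out so the output is deterministic, though adding a small random offset is usual when many workers retry at once.

```php
<?php
// Hypothetical helper: delay in seconds before retry number $attempt (1-based).
function backoffDelay(int $attempt, int $baseSeconds = 2, int $maxSeconds = 60): int {
    $attempt = max(1, $attempt);
    // base * 2^(attempt - 1): 2, 4, 8, 16, ... capped at $maxSeconds.
    $delay = $baseSeconds * (2 ** ($attempt - 1));
    return (int) min($delay, $maxSeconds);
}

$delays = array_map('backoffDelay', [1, 2, 3, 4, 5, 6]);
```

With the defaults, the first six retries wait 2, 4, 8, 16, 32, and 60 seconds; the cap keeps a long outage from producing multi-hour sleeps.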
Frequently Asked Questions
How many workers should I run?
Start with 3-5 workers and monitor CPU/memory. Each worker is a PHP process. If you have 8 CPU cores, 8 workers is reasonable. Monitor what’s actually happening with stats – if workers are idle waiting for rate limits, you don’t need more.
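A rough way to size the worker pool is to multiply the target request rate by the average time one job occupies a worker (Little's law). The numbers below are made up for illustration, and `workersNeeded()` is a hypothetical helper.

```php
<?php
// Hypothetical helper: workers needed to sustain $ratePerSecond when each
// job keeps a worker busy for $secondsPerJob on average.
function workersNeeded(float $ratePerSecond, float $secondsPerJob): int {
    return max(1, (int) ceil($ratePerSecond * $secondsPerJob));
}

$single = workersNeeded(1.0, 0.8);   // 1 req/s, 0.8 s per job
$busy   = workersNeeded(5.0, 1.5);   // 5 req/s, 1.5 s per job
```

With a limit of 1 request per second on a single domain, one worker is enough by this estimate, and extra workers would mostly wait on the limiter.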
What’s the optimal rate limit?
1 request per second is a safe default for most public sites. Some sites can handle 10/sec. Others block anything faster than 1 every 5 seconds. Start conservative, monitor 429s and 403s (the usual signs of throttling or blocking), and adjust upward cautiously. Respect robots.txt, including a Crawl-delay directive if the site sets one.
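"Start conservative and adjust upward cautiously" can be expressed as additive increase, multiplicative decrease: raise the rate by a small step after a clean batch, and halve it as soon as blocks appear. `adjustRate()`, its bounds, and the step size are assumptions to tune per site.

```php
<?php
// Hypothetical helper: next requests-per-second value after a batch.
// $blocked counts the 429/403 responses seen in that batch.
function adjustRate(float $current, int $blocked, float $min = 0.2, float $max = 10.0): float {
    if ($blocked > 0) {
        // Any block signal: back off sharply.
        return max($min, $current / 2);
    }
    // Clean batch: creep upward.
    return min($max, $current + 0.25);
}

$rate = 1.0;
$rate = adjustRate($rate, 0);   // clean batch
$rate = adjustRate($rate, 0);   // clean batch
$afterClean = $rate;
$rate = adjustRate($rate, 3);   // three blocked responses
$afterBlock = $rate;
```

Starting from 1 request per second, two clean batches raise the rate to 1.5, and a batch with blocked responses cuts it to 0.75.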
How do I resume scraping after a crash?
Jobs still in the queue are pending and need no action. Jobs left in “working” were interrupted mid-run and should be moved back to pending. With all workers stopped, run this on restart:
<?php
require __DIR__ . '/vendor/autoload.php'; // Composer autoloader (loads Predis)
$redis = new Predis\Client();
// Move stuck jobs back to queue
$stuck = $redis->smembers('scraping:urls:working');
foreach ($stuck as $job) {
$redis->rpush('scraping:urls:queue', $job);
}
$redis->del('scraping:urls:working');
?>
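The script above requeues everything in the working set, which is safe only when no workers are running. If some workers may still be alive, one option is to requeue only jobs that have been in progress longer than a timeout. This assumes each job records a `started_at` timestamp when it is dequeued, which the queue class in this guide does not do; `findStaleJobs()` is a hypothetical helper.

```php
<?php
// Hypothetical helper: return jobs whose started_at is more than $timeout seconds old.
// Jobs without a started_at field are treated as stale.
function findStaleJobs(array $workingJobs, int $now, int $timeout = 300): array {
    return array_values(array_filter(
        $workingJobs,
        fn(array $job) => ($now - ($job['started_at'] ?? 0)) > $timeout
    ));
}

$now = 1700001000;
$working = [
    ['url' => 'https://example.com/products?page=1', 'started_at' => $now - 20],   // still fresh
    ['url' => 'https://example.com/products?page=2', 'started_at' => $now - 900],  // stuck
];
$stale = findStaleJobs($working, $now);
```

Only the job that has been running for 900 seconds is returned, so a live worker that picked up page 1 twenty seconds ago is left alone.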
How much data can I store?
That depends on your database. 10,000 pages at 5 KB each is about 50 MB of data, and indexes can roughly double that to 100 MB, which is fine. 1,000,000 pages is about 5 GB of raw data, still reasonable for a single database. Beyond that, consider sharding or a data warehouse.
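The arithmetic can be wrapped in a small function for other page counts. `estimateStorageBytes()` is hypothetical, the factor of 2 for indexes is the same rough assumption as in the answer above, and 1 KB is taken as 1,000 bytes.

```php
<?php
// Hypothetical helper: rough storage estimate in bytes.
// $indexFactor scales raw data size to account for indexes (2.0 = double).
function estimateStorageBytes(int $pages, int $bytesPerPage, float $indexFactor = 2.0): int {
    return (int) round($pages * $bytesPerPage * $indexFactor);
}

$smallMb = estimateStorageBytes(10000, 5000) / 1000000;            // 10,000 pages, with indexes
$largeGb = estimateStorageBytes(1000000, 5000, 1.0) / 1000000000;  // 1,000,000 pages, raw data only
```

This gives 100 MB for 10,000 pages with indexes and 5 GB of raw data for 1,000,000 pages.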
Should I use Redis or a database for the queue?
Redis is faster, but it keeps its data in memory. Unless persistence (RDB snapshots or the append-only file) is enabled, a Redis crash loses the queue. For critical scraping, use both: keep the queue in Redis for speed and periodically save state to a database. Or use a database-backed queue, which is slower but persistent.
Summary
Web scraping at scale requires a different architecture than simple loops:
- Queue system – Redis or database to track work
- Rate limiter – respect targets and avoid blocks
- Worker processes – parallel processing of jobs
- Failure handling – retries, logging, alerts
- Monitoring – visibility into what’s happening
- Recovery – resume after crashes without losing progress
Start simple with a single worker and queue. As volume increases, add more workers. Monitor failure rates and adjust rate limits based on what the target site can handle. For the ethical foundations of scraping at any scale, the ethical web scraping guide covers legal boundaries and best practices.
