Web Scraping at Scale in PHP: The Complete Guide to Queues, Rate Limiting and Monitoring

Web scraping at scale in PHP requires a completely different architecture than a simple loop. Building a scraper that works on 10 pages is easy. Building one that reliably handles 10,000 pages without overloading the target site, crashing your server, or getting blocked demands queues, rate limiting, and failure recovery.

This guide covers the patterns you actually need for production scraping at scale – job queues, parallel requests, rate limiting, failure handling, monitoring and recovery. All with working code.

The Problems With Naive Scraping

A simple loop that tries to scrape 10,000 pages:

<?php
// This works for small scale but breaks at 10,000 pages

for ($page = 1; $page <= 10000; $page++) {
    $url = "https://example.com/products?page=$page";
    $html = file_get_contents($url);

    // Parse and store...
}
?>

Problems that show up fast:

  • Memory growth – keep every page’s results in one array and memory climbs until the script hits its limit
  • One failure kills everything – one network timeout and you restart from page 1
  • Gets you blocked – 10,000 rapid requests look like a DDoS attack
  • Can’t resume – if it crashes at page 5000, you lose all progress
  • No visibility – you don’t know what’s happening or what failed
  • Overloads the target – you might crash their server

At scale, you need a different architecture.

The Queue-Based Architecture

Instead of a loop, use a job queue:

URLs to Scrape
    ↓
Queue System (Redis / Database)
    ↓
Worker Processes (pull URLs from queue)
    ↓
Rate Limiter (respect target site)
    ↓
HTTP Requests (with retries)
    ↓
Data Processor
    ↓
Storage (MySQL)
    ↓
Monitoring & Alerts

Benefits:

  • Distribute work across multiple workers
  • Pause/resume without losing progress
  • Rate limit automatically
  • Retry failed requests intelligently
  • Monitor what’s happening in real time
  • Scale up or down as needed

Setting Up a Redis-Based Queue

Redis is a good fit for job queues – fast, simple, and its list and set commands map directly onto queue operations.

Install Redis:

# macOS
brew install redis

# Ubuntu/Debian
sudo apt-get install redis-server

# Verify
redis-cli ping
# Output: PONG

PHP Redis client:

composer require predis/predis

Create a queue manager:

<?php
class ScrapingQueue {
    private $redis;
    private $queue_key = 'scraping:urls:queue';
    private $working_key = 'scraping:urls:working';
    private $failed_key = 'scraping:urls:failed';
    private $completed_key = 'scraping:urls:completed';

    public function __construct() {
        $this->redis = new \Predis\Client();
    }

    // Add URLs to queue
    public function enqueue($urls) {
        foreach ($urls as $url) {
            $this->redis->rpush(
                $this->queue_key, 
                json_encode([
                    'url' => $url,
                    'attempts' => 0,
                    'created_at' => time()
                ])
            );
        }
        return count($urls);
    }

    // Get next job from queue
    public function dequeue() {
        $job = $this->redis->lpop($this->queue_key);

        if ($job) {
            // Move to "working" set to track in-progress jobs
            $this->redis->sadd($this->working_key, $job);
            return json_decode($job, true);
        }

        return null;
    }

    // Mark job as completed
    public function complete($job) {
        $job_json = json_encode($job);
        $this->redis->srem($this->working_key, $job_json);
        $this->redis->sadd($this->completed_key, 
                          $job_json);
        $this->redis->incr('scraping:stats:completed');
    }

    // Mark job as failed (will retry)
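    public function fail($job, $error = '') {
        // Encode before mutating so the string matches the
        // entry that dequeue() stored in the working set
        $original_json = json_encode($job);
        $this->redis->srem($this->working_key, $original_json);

        $job['attempts']++;
        $job['last_error'] = $error;
        $job['failed_at'] = time();

        $job_json = json_encode($job);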

        // Retry up to 5 times
        if ($job['attempts'] < 5) {
            $this->redis->rpush(
                $this->queue_key, 
                $job_json
            );
            $this->redis->incr('scraping:stats:retried');
        } else {
            // Permanent failure after 5 attempts
            $this->redis->sadd(
                $this->failed_key, 
                $job_json
            );
            $this->redis->incr('scraping:stats:failed');
        }
    }

    // Get queue stats
    public function stats() {
        return [
            'pending' => $this->redis->llen(
                $this->queue_key
            ),
            'working' => $this->redis->scard(
                $this->working_key
            ),
            'completed' => $this->redis->scard(
                $this->completed_key
            ),
            'failed' => $this->redis->scard(
                $this->failed_key
            ),
            'stats' => [
                'completed' => (int)$this->redis->get(
                    'scraping:stats:completed'
                ) ?: 0,
                'failed' => (int)$this->redis->get(
                    'scraping:stats:failed'
                ) ?: 0,
                'retried' => (int)$this->redis->get(
                    'scraping:stats:retried'
                ) ?: 0
            ]
        ];
    }
}

// Usage: Add initial URLs
$queue = new ScrapingQueue();

// Generate URLs for all pages
$urls = [];
for ($page = 1; $page <= 10000; $page++) {
    $urls[] = "https://example.com/products?page=$page";
}

$count = $queue->enqueue($urls);
echo "Queued $count URLs\n";
?>

For details on the list and set commands used here (RPUSH, LPOP, SADD, SREM), see the official Redis documentation.
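To exercise worker logic without a running Redis server, it can help to have a stand-in that follows the same job lifecycle (pending, working, completed, failed). The class below is a minimal sketch, not part of the Redis setup above: `InMemoryQueue` is a hypothetical name, it keeps everything in PHP arrays, and it tracks jobs by URL rather than by their JSON encoding.

```php
<?php
// Hypothetical in-memory stand-in for ScrapingQueue (same method names,
// no Redis). Jobs are keyed by URL instead of by their JSON encoding.
class InMemoryQueue {
    private array $pending = [];    // FIFO list of jobs
    private array $working = [];    // url => job
    private array $completed = [];  // url => job
    private array $failed = [];     // url => job
    private int $max_attempts;

    public function __construct(int $max_attempts = 5) {
        $this->max_attempts = $max_attempts;
    }

    public function enqueue(array $urls): int {
        foreach ($urls as $url) {
            $this->pending[] = [
                'url' => $url,
                'attempts' => 0,
                'created_at' => time()
            ];
        }
        return count($urls);
    }

    public function dequeue(): ?array {
        $job = array_shift($this->pending);
        if ($job === null) {
            return null;
        }
        $this->working[$job['url']] = $job;
        return $job;
    }

    public function complete(array $job): void {
        unset($this->working[$job['url']]);
        $this->completed[$job['url']] = $job;
    }

    public function fail(array $job, string $error = ''): void {
        unset($this->working[$job['url']]);
        $job['attempts']++;
        $job['last_error'] = $error;

        if ($job['attempts'] < $this->max_attempts) {
            $this->pending[] = $job;          // retry later
        } else {
            $this->failed[$job['url']] = $job; // permanent failure
        }
    }

    public function stats(): array {
        return [
            'pending' => count($this->pending),
            'working' => count($this->working),
            'completed' => count($this->completed),
            'failed' => count($this->failed),
        ];
    }
}

// One job succeeds, the other fails five times and is given up on
$queue = new InMemoryQueue();
$queue->enqueue(['https://example.com/a', 'https://example.com/b']);
$queue->complete($queue->dequeue());
for ($i = 0; $i < 5; $i++) {
    $queue->fail($queue->dequeue(), 'HTTP 500');
}
print_r($queue->stats());
```

Keying by URL avoids having to reproduce the exact JSON string when a job moves between states, at the cost of treating two jobs with the same URL as one.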

Rate Limiting and Throttling

Scrape too fast and you get blocked. Rate limiting is essential:

<?php
class RateLimiter {
    private $redis;
    private $domain;
    private $requests_per_second = 1;

    public function __construct($domain, 
                                 $requests_per_second = 1) {
        $this->redis = new \Predis\Client();
        $this->domain = $domain;
        $this->requests_per_second = $requests_per_second;
    }

    // Wait until safe to make next request
    public function wait() {
        $key = "rate_limit:{$this->domain}";
        $window = 1; // 1 second window

        while (true) {
            $requests = $this->redis->incr($key);

            if ($requests === 1) {
                // First request in window
                $this->redis->expire($key, $window);
            }

            if ($requests <= $this->requests_per_second) {
                // Within limit
                break;
            }

            // Wait 100ms and try again
            usleep(100000);
        }
    }

    // Concurrent request limit
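    public function acquireSlot() {
        $key = "slots:{$this->domain}";
        $slots = 3; // max 3 concurrent

        while (true) {
            // Push first, then check the new length: RPUSH is
            // atomic, so two workers cannot both take the last slot
            $slot_id = uniqid('', true);
            $count = $this->redis->rpush($key, $slot_id);

            if ($count <= $slots) {
                return $slot_id;
            }

            // Over the limit: give the slot back and retry
            $this->redis->lrem($key, 0, $slot_id);
            usleep(10000); // 10ms
        }
    }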

    public function releaseSlot($slot_id) {
        $key = "slots:{$this->domain}";
        $this->redis->lrem($key, 0, $slot_id);
    }
}

// Usage in worker
$url = 'https://example.com/products?page=1';
$limiter = new RateLimiter('example.com', 
                           1); // 1 request/sec

$slot = $limiter->acquireSlot();
$limiter->wait();

try {
    // Make request
    $html = file_get_contents($url);
} finally {
    $limiter->releaseSlot($slot);
}
?>
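The `wait()` method above is a fixed-window counter: it allows a set number of requests per window and rejects the rest until the window resets. The sketch below shows the same decision logic in a single process with no Redis, so the behavior can be checked with a fake clock. `FixedWindowLimiter` and `tryAcquire()` are hypothetical names used only for illustration.

```php
<?php
// Hypothetical single-process fixed-window limiter with an injectable clock.
class FixedWindowLimiter {
    private int $limit;
    private float $window;
    private $clock;                     // callable returning seconds as float
    private float $window_start = 0.0;
    private int $count = 0;

    public function __construct(int $limit, float $window = 1.0,
                                ?callable $clock = null) {
        $this->limit = $limit;
        $this->window = $window;
        $this->clock = $clock ?? static function (): float {
            return microtime(true);
        };
    }

    // Returns true and counts the request if the current window has room
    public function tryAcquire(): bool {
        $now = ($this->clock)();

        // Start a fresh window once the previous one has elapsed
        if ($now - $this->window_start >= $this->window) {
            $this->window_start = $now;
            $this->count = 0;
        }

        if ($this->count < $this->limit) {
            $this->count++;
            return true;
        }
        return false;
    }
}

// Fake clock: 2 requests allowed per 1-second window
$now = 100.0;
$limiter = new FixedWindowLimiter(2, 1.0, function () use (&$now) {
    return $now;
});

var_dump($limiter->tryAcquire()); // bool(true)
var_dump($limiter->tryAcquire()); // bool(true)
var_dump($limiter->tryAcquire()); // bool(false), window is full
$now = 101.0;
var_dump($limiter->tryAcquire()); // bool(true), new window
```

One property of fixed windows worth knowing: requests at the end of one window and the start of the next can arrive back to back, so short bursts above the nominal rate are possible.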

The Worker Process

A worker pulls jobs from the queue and processes them:

<?php
class ScrapingWorker {
    private $queue;
    private $limiter;
    private $db;

    public function __construct($domain) {
        $this->queue = new ScrapingQueue();
        $this->limiter = new RateLimiter($domain, 1);
        $this->db = new PDO(
            'mysql:host=localhost;dbname=scraping',
            'root',
            ''
        );
    }

    // Run worker continuously
    public function start() {
        echo "Worker started\n";

        while (true) {
            // Check queue
            $job = $this->queue->dequeue();

            if (!$job) {
                echo "Queue empty, sleeping...\n";
                sleep(5);
                continue;
            }

            // Process job
            try {
                $this->processJob($job);
                $this->queue->complete($job);
                echo "✓ Processed: {$job['url']}\n";
            } catch (\Exception $e) {
                echo "✗ Failed: {$e->getMessage()}\n";
                $this->queue->fail($job, 
                                  $e->getMessage());
                sleep(2); // Back off on error
            }
        }
    }

    private function processJob($job) {
        $url = $job['url'];

        // Rate limit
        $this->limiter->wait();

        // Fetch with timeout and retry logic
        $html = $this->fetchWithRetry($url);

        if (!$html) {
            throw new Exception(
                "Failed to fetch after retries"
            );
        }

        // Parse HTML
        $products = $this->parseProducts($html);

        // Store in database
        $this->storeProducts($products);
    }

    private function fetchWithRetry($url, 
                                     $max_retries = 3) {
        for ($attempt = 1; $attempt <= $max_retries; 
             $attempt++) {

            try {
                $ch = curl_init();

                curl_setopt($ch, CURLOPT_URL, $url);
                curl_setopt($ch, CURLOPT_RETURNTRANSFER, 
                           true);
                curl_setopt($ch, CURLOPT_TIMEOUT, 10);
                curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
                curl_setopt($ch, CURLOPT_USERAGENT,
                    'ScraperBot/1.0 (+http://mysite.com)');

                $response = curl_exec($ch);
                $http_code = curl_getinfo($ch, 
                    CURLINFO_HTTP_CODE);
                $curl_error = curl_error($ch);
                curl_close($ch);

                // Transport failure (DNS, timeout, refused)
                if ($response === false) {
                    throw new Exception(
                        "cURL error: $curl_error"
                    );
                }

                // Success
                if ($http_code == 200) {
                    return $response;
                }

                // Rate limited - back off exponentially
                if ($http_code == 429) {
                    $wait = pow(2, $attempt);
                    echo "Rate limited. Waiting {$wait}s\n";
                    sleep($wait);
                    continue;
                }

                // Other error
                if ($http_code >= 400) {
                    throw new Exception(
                        "HTTP $http_code"
                    );
                }

            } catch (\Exception $e) {
                if ($attempt == $max_retries) {
                    throw $e;
                }
                sleep(2);
                continue;
            }
        }

        return null;
    }

    private function parseProducts($html) {
        $dom = new DOMDocument();
        libxml_use_internal_errors(true);
        $dom->loadHTML($html);
        libxml_use_internal_errors(false);

        $products = [];
        $xpath = new DOMXPath($dom);

        foreach ($xpath->query('//div[@class="product"]') 
                 as $node) {

            $title = $xpath->query(
                './/h2[@class="title"]', 
                $node
            )->item(0)?->textContent;

            $price = $xpath->query(
                './/span[@class="price"]', 
                $node
            )->item(0)?->textContent;

            if ($title && $price) {
                $products[] = [
                    'title' => trim($title),
                    'price' => trim($price)
                ];
            }
        }

        return $products;
    }

    private function storeProducts($products) {
        $stmt = $this->db->prepare(
            'INSERT INTO products (title, price) 
             VALUES (?, ?)
             ON DUPLICATE KEY UPDATE 
             price = VALUES(price)'
        );

        foreach ($products as $product) {
            $stmt->execute([
                $product['title'],
                $product['price']
            ]);
        }
    }
}

// Run worker
$worker = new ScrapingWorker('example.com');
$worker->start();
?>
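The worker backs off exponentially on a 429 but sleeps a fixed 2 seconds on other errors. One common refinement is to compute every delay through a single helper that caps the wait and optionally adds random jitter, so several workers do not all retry at the same instant. The function below is a sketch under those assumptions: `backoffDelay()` and its parameters are hypothetical and are not used by the worker above.

```php
<?php
// Hypothetical helper: capped exponential backoff with optional jitter.
function backoffDelay(int $attempt, float $base = 1.0,
                      float $cap = 60.0, bool $jitter = true): float {
    // Attempt 1 waits $base seconds, then the delay doubles
    // on each further attempt until it reaches $cap
    $delay = min($cap, $base * (2 ** max(0, $attempt - 1)));

    if ($jitter) {
        // "Full jitter": a random delay between 0 and the computed value
        $delay = $delay * (mt_rand() / mt_getrandmax());
    }

    return $delay;
}

// Upper bounds without jitter: 1, 2, 4, 8, 16, 32, 60, 60
foreach ([1, 2, 3, 4, 5, 6, 7, 8] as $attempt) {
    printf(
        "attempt %d: up to %.0fs\n",
        $attempt,
        backoffDelay($attempt, 1.0, 60.0, false)
    );
}
```

In a worker, the result would be passed to `usleep()` after converting seconds to microseconds, for example `usleep((int) (backoffDelay($attempt) * 1000000))`.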

Running Multiple Workers in Parallel

Scale horizontally by running multiple workers:

# Terminal 1
php worker.php

# Terminal 2
php worker.php

# Terminal 3
php worker.php

# You now have 3 workers processing jobs in parallel
# Each respects rate limiting automatically

Use supervisor or systemd to keep workers running:

# /etc/supervisor/conf.d/scraper.conf

[program:scraper-worker]
process_name=%(program_name)s_%(process_num)02d
command=php /path/to/worker.php
autostart=true
autorestart=true
numprocs=3
redirect_stderr=true
stdout_logfile=/var/log/scraper.log
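When supervisor stops or restarts a program it sends SIGTERM by default. A worker that dies in the middle of a job leaves that job stuck in the working set, so it is worth letting the loop finish the current job before exiting. The sketch below assumes the pcntl extension is available (it degrades to a no-op if not); `StopFlag`, `installSignalHandlers()` and `runUntilStopped()` are hypothetical names, and the usage simulates a stop request instead of sending a real signal.

```php
<?php
// Hypothetical stop flag shared between the signal handler and the loop.
final class StopFlag {
    private bool $stop = false;
    public function request(): void { $this->stop = true; }
    public function requested(): bool { return $this->stop; }
}

// Install SIGTERM/SIGINT handlers when the pcntl extension is available.
function installSignalHandlers(StopFlag $flag): bool {
    if (!function_exists('pcntl_async_signals')) {
        return false; // e.g. Windows, or pcntl not compiled in
    }
    pcntl_async_signals(true);
    foreach ([SIGTERM, SIGINT] as $signal) {
        pcntl_signal($signal, function () use ($flag) {
            $flag->request();
        });
    }
    return true;
}

// Process jobs until the source is empty or a stop was requested.
// The flag is checked only between jobs, so a job is never cut in half.
function runUntilStopped(callable $next_job, callable $handle,
                         StopFlag $flag): int {
    $processed = 0;
    while (!$flag->requested()) {
        $job = $next_job();
        if ($job === null) {
            break;
        }
        $handle($job);
        $processed++;
    }
    return $processed;
}

// Simulated run: a stop is requested while page-2 is being handled
$flag = new StopFlag();
installSignalHandlers($flag);
$jobs = ['page-1', 'page-2', 'page-3'];

$done = runUntilStopped(
    function () use (&$jobs) { return array_shift($jobs); },
    function ($job) use ($flag) {
        if ($job === 'page-2') {
            $flag->request();
        }
    },
    $flag
);

echo "Processed $done jobs before stopping\n";
```

Unlike the worker shown earlier, this loop exits when the queue is empty rather than sleeping; a long-running worker would sleep and re-check the flag instead.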

Monitoring and Dashboard

You need visibility into what’s happening:

<?php
class ScrapingMonitor {
    private $redis;

    public function __construct() {
        $this->redis = new \Predis\Client();
    }

    public function getStatus() {
        $queue = new ScrapingQueue();
        $stats = $queue->stats();

        $total = $stats['stats']['completed'] + 
                 $stats['stats']['failed'] + 
                 $stats['pending'];

        $progress = $total > 0 ? 
            round(($stats['stats']['completed'] / $total) * 100, 2) : 
            0;

        return [
            'status' => [
                'pending' => $stats['pending'],
                'working' => $stats['working'],
                'completed' => $stats['stats']['completed'],
                'failed' => $stats['stats']['failed'],
                'retried' => $stats['stats']['retried']
            ],
            'progress' => $progress . '%',
            'estimated_time' => $this->estimateTime(
                $stats['stats']['completed'],
                $stats['pending']
            ),
            'failed_urls' => $this->getFailedUrls(10)
        ];
    }

    private function estimateTime($completed, 
                                   $remaining) {
        if ($completed == 0) return 'calculating...';

        // Average 1 URL per second (adjust as needed)
        $seconds = $remaining;
        $hours = floor($seconds / 3600);
        $minutes = floor(($seconds % 3600) / 60);

        return "{$hours}h {$minutes}m remaining";
    }

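    private function getFailedUrls($limit) {
        // Permanently failed jobs are stored as JSON in a Redis set
        $jobs = $this->redis->smembers('scraping:urls:failed');
        $urls = [];
        foreach (array_slice($jobs, 0, $limit) as $job_json) {
            $job = json_decode($job_json, true);
            $urls[] = $job['url'] ?? $job_json;
        }
        return $urls;
    }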
}

// API endpoint
header('Content-Type: application/json');
$monitor = new ScrapingMonitor();
echo json_encode($monitor->getStatus(), 
                 JSON_PRETTY_PRINT);

// Output:
// {
//   "status": {
//     "pending": 5000,
//     "working": 3,
//     "completed": 4500,
//     "failed": 10,
//     "retried": 150
//   },
//   "progress": "47.32%",
//   "estimated_time": "1h 23m remaining"
// }
?>
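The `estimateTime()` method assumes a fixed rate of 1 URL per second. If the start time of the run is recorded somewhere (for example under a Redis key of your choosing, which is an assumption and not something the code above does), the estimate can be based on observed throughput instead. A minimal sketch, with `estimateRemaining()` as a hypothetical helper:

```php
<?php
// Hypothetical helper: estimate remaining time from observed throughput.
function estimateRemaining(int $completed, int $pending,
                           int $elapsed_seconds): string {
    if ($completed <= 0 || $elapsed_seconds <= 0) {
        return 'calculating...';
    }

    // URLs per second achieved so far
    $rate = $completed / $elapsed_seconds;

    $seconds = (int) ceil($pending / $rate);
    $hours = intdiv($seconds, 3600);
    $minutes = intdiv($seconds % 3600, 60);

    return "{$hours}h {$minutes}m remaining";
}

// 4,500 URLs done in 9,000 seconds is 0.5 URLs/sec,
// so 5,000 pending URLs need about 10,000 seconds
echo estimateRemaining(4500, 5000, 9000), "\n"; // 2h 46m remaining
```

This averages over the whole run, so it reacts slowly if the rate changes partway through; a rolling window over recent completions would track changes faster.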

Handling Failures and Recovery

At scale, failures are normal. Handle them gracefully:

<?php
class FailureHandler {
    private $db;

    public function __construct() {
        $this->db = new PDO(
            'mysql:host=localhost;dbname=scraping',
            'root',
            ''
        );
    }

    // Log failure for investigation
    public function logFailure($url, $error, 
                               $attempt) {
        $stmt = $this->db->prepare(
            'INSERT INTO scraping_failures 
             (url, error, attempt, logged_at) 
             VALUES (?, ?, ?, NOW())'
        );

        $stmt->execute([$url, $error, $attempt]);
    }

    // Decide whether a failure is worth retrying
    public function shouldRetry($attempt, 
                                $error) {
        // Network errors: retry (cURL reports "timed out")
        if (stripos($error, 'timeout') !== false ||
            stripos($error, 'timed out') !== false) {
            return $attempt < 5;
        }

        // 503 Service Unavailable: retry
        if (strpos($error, '503') !== false) {
            return $attempt < 3;
        }

        // 404 Not Found: don't retry
        if (strpos($error, '404') !== false) {
            return false;
        }

        // Default: retry up to 3 times
        return $attempt < 3;
    }

    // Report critical issues
    public function alertOnCriticalFailure($error) {
        // Send email, Slack notification, etc
        if (stripos($error, 'database') !== false) {
            // Database is down - critical
            $this->sendAlert(
                "CRITICAL: Database connection failed",
                $error
            );
        }

        if (stripos($error, 'memory') !== false) {
            // Out of memory - critical
            $this->sendAlert(
                "CRITICAL: Out of memory",
                $error
            );
        }
    }

    private function sendAlert($subject, $message) {
        // Send to monitoring system
        mail('ops@example.com', $subject, $message);
    }
}
?>
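Matching on substrings of the error message works, but it is fragile: a URL or message that happens to contain "404" would be misclassified. Where the numeric HTTP status is available, deciding on that is usually more reliable. The function below is a sketch of that idea: `retryDecision()` is a hypothetical helper, and the choice of which codes count as retryable is an assumption to adjust per target site.

```php
<?php
// Hypothetical classifier: decide on retries from the numeric HTTP status.
function retryDecision(int $http_code, int $attempt,
                       int $max_attempts = 3): string {
    if ($http_code >= 200 && $http_code < 300) {
        return 'ok';
    }

    if ($attempt >= $max_attempts) {
        return 'give_up'; // attempts exhausted
    }

    // 0 means no HTTP response at all (timeout, DNS failure, refused)
    if ($http_code === 0) {
        return 'retry';
    }

    // Request timeout and rate limiting are usually temporary
    if ($http_code === 408 || $http_code === 429) {
        return 'retry';
    }

    // Server-side errors may clear up on their own
    if ($http_code >= 500) {
        return 'retry';
    }

    // Remaining 3xx/4xx codes (404, 410, 403...) will not
    // change if the same request is simply repeated
    return 'give_up';
}

echo retryDecision(503, 1), "\n"; // retry
echo retryDecision(404, 1), "\n"; // give_up
echo retryDecision(0, 3), "\n";   // give_up
```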

Frequently Asked Questions

How many workers should I run?

Start with 3-5 workers and monitor CPU/memory. Each worker is a PHP process. If you have 8 CPU cores, 8 workers is reasonable. Monitor what’s actually happening with stats – if workers are idle waiting for rate limits, you don’t need more.

What’s the optimal rate limit?

1 request per second is a reasonable default for most public sites. Some sites can handle 10/sec. Others block anything faster than 1 every 5 seconds. Start conservative, monitor 403s and 429s, and adjust upward cautiously. Respect robots.txt if it specifies a rate, typically through a Crawl-delay directive.
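Some sites state their preferred rate in robots.txt with a Crawl-delay line. The directive is not part of the robots.txt standard, but it is common enough to be worth checking. The parser below is a simplified sketch: `crawlDelay()` is a hypothetical helper, it matches user agents by case-insensitive substring, and it ignores every directive other than User-agent and Crawl-delay.

```php
<?php
// Hypothetical helper: find the Crawl-delay that applies to a user agent.
function crawlDelay(string $robots_txt, string $user_agent = '*'): ?float {
    $delays = [];       // lowercased agent => delay in seconds
    $agents = [];       // agents of the group currently being read
    $in_rules = false;  // true once the group has had a non-agent line

    foreach (preg_split('/\R/', $robots_txt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // drop comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }

        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            // A User-agent line after rules starts a new group
            if ($in_rules) {
                $agents = [];
                $in_rules = false;
            }
            $agents[] = strtolower($value);
        } else {
            $in_rules = true;
            if ($field === 'crawl-delay' && is_numeric($value)) {
                foreach ($agents as $agent) {
                    $delays[$agent] = (float) $value;
                }
            }
        }
    }

    // Prefer a group naming this agent, then fall back to "*"
    $ua = strtolower($user_agent);
    foreach ($delays as $agent => $delay) {
        $agent = (string) $agent;
        if ($agent !== '*' && $agent !== '' &&
            strpos($ua, $agent) !== false) {
            return $delay;
        }
    }
    return $delays['*'] ?? null;
}

$robots = "User-agent: *\nCrawl-delay: 5\n\n"
        . "User-agent: ScraperBot\nCrawl-delay: 2\n";

var_dump(crawlDelay($robots, 'ScraperBot/1.0')); // float(2)
var_dump(crawlDelay($robots, 'OtherBot/2.0'));   // float(5)
```

A delay of 5 seconds corresponds to 0.2 requests per second, which the `RateLimiter` shown earlier cannot express with its 1-second window, so a per-request sleep would be needed in that case.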

How do I resume scraping after a crash?

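Jobs still in the queue are pending and need no action. Jobs left in “working” were in progress when the crash happened, so move them back to pending. Run this on restart, before any workers start, so that jobs currently being processed are not queued a second time: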

<?php
$redis = new Predis\Client();

// Move stuck jobs back to queue
$stuck = $redis->smembers('scraping:urls:working');
foreach ($stuck as $job) {
    $redis->rpush('scraping:urls:queue', $job);
}
$redis->del('scraping:urls:working');
?>

How much data can I store?

That depends on your database. Scraping 10,000 pages at 5KB each = 50MB of data. Add indexes and you’re at 100MB. That’s fine. 1,000,000 pages = 5GB which is reasonable for a single database. After that, start sharding or using data warehousing.
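The arithmetic above can be wrapped in a small helper for trying other page counts and sizes. This is a rough sketch only: `estimateStorageBytes()` is a hypothetical function, it uses decimal units (1 MB = 1,000,000 bytes), and the assumption that indexes roughly double the raw size comes from the figures above, not from any measurement.

```php
<?php
// Hypothetical estimator using decimal units (1 MB = 1,000,000 bytes).
function estimateStorageBytes(int $pages, int $bytes_per_page,
                              float $index_overhead = 1.0): int {
    // $index_overhead = 1.0 means indexes add as much again as the raw data
    return (int) round($pages * $bytes_per_page * (1 + $index_overhead));
}

// 10,000 pages at 5 KB each, doubled for indexes
printf("%.0f MB\n", estimateStorageBytes(10000, 5000) / 1e6);

// 1,000,000 pages at 5 KB each, raw data only
printf("%.0f GB raw\n", estimateStorageBytes(1000000, 5000, 0.0) / 1e9);
```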

Should I use Redis or a database for the queue?

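Redis is faster, but it keeps the working data in memory. With the default snapshot-based persistence, a crash can lose whatever changed since the last snapshot, and with persistence disabled you lose the whole queue. For critical scraping, enable AOF persistence, or use both – queue in Redis for speed, periodically save state to database. Or use a database-backed queue (slower but persistent).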


Summary

Web scraping at scale requires a different architecture than simple loops:

  • Queue system – Redis or database to track work
  • Rate limiter – respect targets and avoid blocks
  • Worker processes – parallel processing of jobs
  • Failure handling – retries, logging, alerts
  • Monitoring – visibility into what’s happening
  • Recovery – resume after crashes without losing progress

Start simple with a single worker and queue. As volume increases, add more workers. Monitor failure rates and adjust rate limits based on what the target site can handle. For the ethical foundations of scraping at any scale, the ethical web scraping guide covers legal boundaries and best practices.
