PHP Dynamic Content Web Scraping: How to Scrape JavaScript Websites

Dynamic content web scraping in PHP is one of the most common challenges developers hit when moving beyond basic scraping. PHP cURL is excellent for static websites. But load up a modern e-commerce site, a job board, or a news aggregator and you’ll often find cURL returning empty containers where the data should be.

This guide covers every practical approach to PHP dynamic content web scraping – from finding hidden APIs that bypass the JavaScript problem entirely, to extracting inline JSON from script tags, to integrating a headless browser when nothing else works. Each method has working code and clear guidance on when to use it.

What Is Dynamic Content?

Dynamic content is data that isn’t in the initial HTML response from the server. Instead of delivering complete HTML, the server sends a shell – navigation, layout, empty containers – and JavaScript fills in the actual content after the page loads by making additional requests to APIs or data endpoints.

From a scraper’s perspective: cURL fetches the initial HTML response and stops. JavaScript never runs. The data that would have loaded never loads. You get the page structure without the content.

How to Confirm Content Is JavaScript-Rendered

Before writing any code, confirm the problem is actually JavaScript rendering and not a broken selector or blocked request. Save the raw cURL response and inspect it:

<?php
function fetch_page($url) {
    $ch = curl_init();

    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_TIMEOUT        => 30,
        CURLOPT_ENCODING       => '',
        CURLOPT_HTTPHEADER     => [
            'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
            'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language: en-US,en;q=0.5',
        ],
    ]);

    $html     = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($httpCode !== 200) {
        echo "HTTP $httpCode - request failed." . PHP_EOL;
        return false;
    }

    return $html;
}

$url  = "https://example-js-site.com/products";
$html = fetch_page($url);

if ($html) {
    // Save raw response for inspection
    file_put_contents(__DIR__ . '/raw_response.html', $html);
    echo "Saved raw HTML - open raw_response.html in browser." . PHP_EOL;
    echo "Response size: " . strlen($html) . " bytes." . PHP_EOL;
}
?>

Output:

Saved raw HTML - open raw_response.html in browser.
Response size: 4821 bytes.

Open raw_response.html in Chrome. If the file shows empty product containers or placeholder elements where data should be – and the actual product data appears when you visit the real URL – JavaScript is loading that content. If the file looks the same as the live page, the problem is your selector, not JavaScript rendering.
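This check is easy to script: pick a piece of text you can see on the live page in a browser, and test whether it appears in the raw cURL response. A minimal sketch – the helper name and sample markup are illustrative, not from a real site:

```php
<?php
// Sketch: confirm whether content is server-rendered by checking the raw
// cURL response for text you can see on the live page in a browser.
function content_is_server_rendered(string $rawHtml, string $knownText): bool {
    // stripos: case-insensitive, so markup casing doesn't matter
    return stripos($rawHtml, $knownText) !== false;
}

// An empty shell, as a JS-rendered site would return it to cURL
$shell = '<div id="app"><div class="products"></div></div>';
var_dump(content_is_server_rendered($shell, 'Laptop Stand Pro'));  // bool(false) - JS-rendered

// A static page that includes the content in the first response
$static = '<ul><li>Laptop Stand Pro</li></ul>';
var_dump(content_is_server_rendered($static, 'Laptop Stand Pro')); // bool(true)
```

If the check returns false for text that is clearly visible in the browser, JavaScript is loading that content and one of the methods below applies.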

Detecting JavaScript Framework Markers

<?php
$html = fetch_page("https://example-js-site.com/products");

if ($html) {
    // Check for common JavaScript framework signatures
    $markers = [
        '__NEXT_DATA__'      => 'Next.js (React SSR)',
        '__NUXT__'           => 'Nuxt.js (Vue SSR)',
        'ng-version'         => 'Angular',
        'data-reactroot'     => 'React',
        'data-server-rendered' => 'Vue SSR',
        'window.__INITIAL_STATE__' => 'Redux/Vuex inline state',
        'window.__data'      => 'Generic JS data injection',
        '<div id="app"></div>' => 'Single-page app shell',
        '<div id="root"></div>' => 'React app shell',
    ];

    $detected = [];

    foreach ($markers as $marker => $framework) {
        if (strpos($html, $marker) !== false) {
            $detected[] = $framework;
        }
    }

    if (!empty($detected)) {
        echo "JavaScript frameworks detected:" . PHP_EOL;
        foreach ($detected as $framework) {
            echo "  - $framework" . PHP_EOL;
        }
    } else {
        echo "No JS framework markers found - may be static HTML." . PHP_EOL;
    }
}
?>

Output on a Next.js site:

JavaScript frameworks detected:
  - Next.js (React SSR)
  - Redux/Vuex inline state

Output on a static HTML site:

No JS framework markers found - may be static HTML.

Framework detection narrows down the approach before you start. Next.js and Nuxt.js sites almost always have inline JSON you can extract directly. Pure React or Angular single-page apps usually require hitting an API. Knowing which framework you’re dealing with saves significant trial and error.

What You Need

  • PHP 7.4 or higher with cURL enabled
  • Node.js installed (only for Puppeteer integration in Method 3)
  • Basic familiarity with browser DevTools

If you haven’t built a basic PHP scraper yet, read the PHP web scraper beginner guide first – this guide assumes you’re comfortable with cURL requests and DOMDocument parsing.

Method 1: Finding and Hitting Hidden API Endpoints

Most JavaScript-rendered sites load their data from API endpoints the browser calls after the initial page load. These endpoints aren’t always publicly documented but they’re accessible – and hitting them directly with cURL is faster, cleaner, and more reliable than any other approach to dynamic content scraping.

This should always be your first attempt before reaching for a headless browser.

Finding the API Endpoint in Chrome DevTools

Open the target page in Chrome and follow these steps:

  • Press F12 to open DevTools
  • Click the Network tab
  • Reload the page
  • Click Fetch/XHR to filter API calls only
  • Look for requests returning JSON with the data you need
  • Click a request to see its URL, headers, and response

You’re looking for requests where the Response tab shows JSON containing the product names, prices, or whatever data you’re trying to scrape. Copy that URL – that’s your target.

Hitting the API Directly With cURL

<?php
function call_api($url, $headers = [], $params = []) {
    if (!empty($params)) {
        $url .= '?' . http_build_query($params);
    }

    $defaultHeaders = [
        'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
        'Accept: application/json, text/plain, */*',
        'Accept-Language: en-US,en;q=0.5',
        'X-Requested-With: XMLHttpRequest',
        'Connection: keep-alive',
    ];

    $ch = curl_init();

    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_TIMEOUT        => 30,
        CURLOPT_ENCODING       => '',
        CURLOPT_HTTPHEADER     => array_merge($defaultHeaders, $headers),
    ]);

    $response = curl_exec($ch);
    $errno    = curl_errno($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($errno || $httpCode !== 200) {
        echo "API request failed: HTTP $httpCode" . PHP_EOL;
        return false;
    }

    $data = json_decode($response, true);

    if (json_last_error() !== JSON_ERROR_NONE) {
        echo "Invalid JSON: " . json_last_error_msg() . PHP_EOL;
        return false;
    }

    return $data;
}

// Example: call a product API endpoint found in DevTools
$data = call_api("https://api.example.com/products", [], [
    'page'  => 1,
    'limit' => 20,
]);

if ($data) {
    echo "Products returned: " . count($data['products'] ?? []) . PHP_EOL;

    foreach ($data['products'] ?? [] as $product) {
        echo $product['name'] . " - $" . $product['price'] . PHP_EOL;
    }
}
?>

Output:

Products returned: 20
Laptop Stand Pro - $49.99
Mechanical Keyboard - $89.99
USB-C Hub - $34.99

Copying Request Headers From DevTools

Some API endpoints check for specific headers – an API key in a custom header, an authorization token, or a site-specific header the JavaScript adds to every request. In DevTools, click the request → Headers tab → copy the Request Headers section:

<?php
// Headers copied directly from DevTools Network tab
// Add any custom headers the API requires
$customHeaders = [
    'X-Api-Key: abc123def456',
    'X-Site-Token: eyJhbGciOiJIUzI1NiJ9...',
    'Referer: https://example.com/products',
    'Origin: https://example.com',
];

$data = call_api("https://api.example.com/products", $customHeaders, [
    'page' => 1,
]);

if ($data) {
    echo "Data fetched with custom headers." . PHP_EOL;
}
?>

Handling Paginated API Responses

API endpoints that power pagination usually accept a page number or offset parameter. Loop through them the same way you’d loop through paginated HTML pages:

<?php
function scrape_all_pages($baseUrl, $headers = []) {
    $page       = 1;
    $allItems   = [];
    $hasMore    = true;

    while ($hasMore) {
        echo "Fetching page $page..." . PHP_EOL;

        $data = call_api($baseUrl, $headers, [
            'page'  => $page,
            'limit' => 20,
        ]);

        if (!$data) {
            echo "Request failed on page $page - stopping." . PHP_EOL;
            break;
        }

        $items = $data['products'] ?? $data['items'] ?? $data['data'] ?? [];

        if (empty($items)) {
            echo "No items on page $page - last page reached." . PHP_EOL;
            $hasMore = false;
            break;
        }

        $allItems = array_merge($allItems, $items);
        echo "Page $page - " . count($items) . " items. Total: " . count($allItems) . PHP_EOL;

        // Check for explicit pagination metadata
        $totalPages  = $data['total_pages'] ?? $data['pages'] ?? null;
        $currentPage = $data['current_page'] ?? $data['page'] ?? $page;

        if ($totalPages && $currentPage >= $totalPages) {
            echo "Reached last page ($totalPages)." . PHP_EOL;
            $hasMore = false;
        } else {
            $page++;
            sleep(1); // polite delay between API calls
        }
    }

    return $allItems;
}

$allProducts = scrape_all_pages("https://api.example.com/products");
echo PHP_EOL . "Total products collected: " . count($allProducts) . PHP_EOL;
?>

Output:

Fetching page 1...
Page 1 - 20 items. Total: 20
Fetching page 2...
Page 2 - 20 items. Total: 40
Fetching page 3...
Page 3 - 20 items. Total: 60
...
Reached last page (10).

Total products collected: 200

Handling API Authentication Tokens

Some internal APIs require a bearer token that the site generates when you first load the page. The token appears in the Authorization header of API requests in DevTools. These tokens often expire – fetch a fresh one by hitting the login or session endpoint first:

<?php
function get_auth_token($loginUrl, $credentials) {
    $ch = curl_init();

    curl_setopt_array($ch, [
        CURLOPT_URL            => $loginUrl,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => json_encode($credentials),
        CURLOPT_HTTPHEADER     => [
            'Content-Type: application/json',
            'Accept: application/json',
            'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
        ],
    ]);

    $response = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($httpCode !== 200) {
        echo "Authentication failed: HTTP $httpCode" . PHP_EOL;
        return false;
    }

    $data  = json_decode($response, true);
    $token = $data['token'] ?? $data['access_token'] ?? $data['auth_token'] ?? null;

    if (!$token) {
        echo "Token not found in response." . PHP_EOL;
        return false;
    }

    echo "Token obtained successfully." . PHP_EOL;
    return $token;
}

// Get token first
$token = get_auth_token("https://api.example.com/auth/login", [
    'email'    => 'your@email.com',
    'password' => 'yourpassword',
]);

if ($token) {
    // Use token in subsequent API calls
    $data = call_api("https://api.example.com/protected/products", [
        'Authorization: Bearer ' . $token,
    ]);

    if ($data) {
        echo "Protected data fetched. Items: " . count($data['items'] ?? []) . PHP_EOL;
    }
}
?>

Output:

Token obtained successfully.
Protected data fetched. Items: 50

When the API Approach Fails

API endpoints aren’t always usable. Three situations where you need a different approach:

  • The endpoint requires a dynamic token that changes on every page load and can’t be fetched separately – move to Method 2 (inline JSON) or Method 3 (headless browser)
  • The response is HTML, not JSON – the “API” is actually server-side rendering, not a data API. Parse the HTML response with DOMDocument as you would any static page
  • The endpoint validates browser fingerprint – Cloudflare or similar protection rejects non-browser clients even with correct headers. Method 3 is the only reliable option here
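For the second case – an endpoint that returns server-rendered HTML rather than JSON – DOMDocument parsing looks like this. A sketch with hypothetical class names; copy the real ones from DevTools:

```php
<?php
// Sketch: parsing an HTML fragment returned by an "API" endpoint with
// DOMDocument/DOMXPath. The product-card/name/price selectors are
// hypothetical - substitute the markup the endpoint actually returns.
function parse_html_fragment(string $html): array {
    libxml_use_internal_errors(true);   // suppress warnings on imperfect HTML
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $items = [];

    foreach ($xpath->query('//div[contains(@class,"product-card")]') as $card) {
        // Second argument to query() scopes the search to this card
        $name  = $xpath->query('.//*[contains(@class,"name")]', $card)->item(0);
        $price = $xpath->query('.//*[contains(@class,"price")]', $card)->item(0);

        $items[] = [
            'name'  => $name  ? trim($name->textContent)  : null,
            'price' => $price ? trim($price->textContent) : null,
        ];
    }

    return $items;
}

$fragment = '<div class="product-card"><span class="name">USB-C Hub</span>'
          . '<span class="price">$34.99</span></div>';
print_r(parse_html_fragment($fragment));
```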

Method 2: Extracting Inline JSON From Script Tags

Server-side rendered JavaScript frameworks – Next.js, Nuxt.js, and many custom setups – embed the initial page data directly in the HTML as JSON inside a script tag. This data arrives with the first response, before any JavaScript runs, so cURL can read it even though cURL never sees the rendered page.

If a framework marker like __NEXT_DATA__ appeared in your detection step, this is the method to try before anything else. It’s faster than hitting an API and doesn’t require a headless browser.

Finding Inline JSON in the Page Source

Before writing code, check what’s in the HTML source. In Chrome, right-click the page and select View Page Source – not Inspect, which shows the rendered DOM. Search for these patterns:

__NEXT_DATA__        - Next.js initial props
__NUXT__             - Nuxt.js server state
__INITIAL_STATE__    - Redux state
window.__data        - generic data injection
application/ld+json  - structured data (JSON-LD)
application/json     - inline JSON script blocks

If you find a script tag containing a large JSON object with the data you need – product names, prices, article content – this method will work.

Extracting Next.js Inline Data

<?php
function extract_nextjs_data($html) {
    if (!$html) return false;

    // Next.js embeds data in: <script id="__NEXT_DATA__" type="application/json">
    if (!preg_match(
        '/<script\s+id=["\']__NEXT_DATA__["\'][^>]*>(.*?)<\/script>/s',
        $html,
        $matches
    )) {
        echo "No __NEXT_DATA__ found in page." . PHP_EOL;
        return false;
    }

    $jsonData = json_decode($matches[1], true);

    if (json_last_error() !== JSON_ERROR_NONE) {
        echo "Failed to parse Next.js JSON: " . json_last_error_msg() . PHP_EOL;
        return false;
    }

    return $jsonData;
}

$html = fetch_page("https://example-nextjs-site.com/products");
$data = extract_nextjs_data($html);

if ($data) {
    // Data is nested under props.pageProps in most Next.js apps
    $pageProps = $data['props']['pageProps'] ?? [];

    echo "Page: "      . ($data['page'] ?? 'unknown') . PHP_EOL;
    echo "Build ID: "  . ($data['buildId'] ?? 'unknown') . PHP_EOL;
    echo "Props keys: " . implode(', ', array_keys($pageProps)) . PHP_EOL;

    // Navigate to the actual data - structure varies by site
    $products = $pageProps['products']
             ?? $pageProps['data']['products']
             ?? $pageProps['initialData']
             ?? [];

    echo "Products found: " . count($products) . PHP_EOL . PHP_EOL;

    foreach (array_slice($products, 0, 3) as $product) {
        echo $product['name'] . " - $" . $product['price'] . PHP_EOL;
    }
}
?>

Output:

Page: /products
Build ID: abc123xyz
Props keys: products, categories, totalCount
Products found: 24

Laptop Stand Pro - $49.99
Mechanical Keyboard - $89.99
USB-C Hub - $34.99

Extracting Nuxt.js Inline Data

<?php
function extract_nuxtjs_data($html) {
    if (!$html) return false;

    // Nuxt.js embeds data as: window.__NUXT__={...}
    // or as: <script>window.__NUXT__=...</script>
    if (!preg_match('/window\.__NUXT__\s*=\s*(\{.*?\});?\s*<\/script>/s', $html, $matches)) {
        echo "No __NUXT__ data found." . PHP_EOL;
        return false;
    }

    $jsonData = json_decode($matches[1], true);

    if (json_last_error() !== JSON_ERROR_NONE) {
        echo "Failed to parse Nuxt JSON: " . json_last_error_msg() . PHP_EOL;
        return false;
    }

    return $jsonData;
}

$html = fetch_page("https://example-nuxt-site.com/products");
$data = extract_nuxtjs_data($html);

if ($data) {
    // Nuxt stores page data under data[0] typically
    $pageData = $data['data'][0] ?? $data['state'] ?? [];

    echo "Nuxt data keys: " . implode(', ', array_keys($pageData)) . PHP_EOL;

    $products = $pageData['products'] ?? [];
    echo "Products: " . count($products) . PHP_EOL;
}
?>

Extracting Generic Inline JSON

Many sites that don’t use a specific framework still embed data in script tags. The pattern varies – sometimes it’s a named variable, sometimes a generic JSON block:

<?php
function extract_inline_json($html) {
    if (!$html) return [];

    $results = [];

    // Pattern 1: application/json script blocks
    preg_match_all(
        '/<script[^>]*type=["\']application\/json["\'][^>]*>(.*?)<\/script>/s',
        $html,
        $matches
    );

    foreach ($matches[1] as $index => $jsonString) {
        $data = json_decode(trim($jsonString), true);
        if (json_last_error() === JSON_ERROR_NONE) {
            $results["json_block_$index"] = $data;
        }
    }

    // Pattern 2: window.INITIAL_DATA = {...}
    if (preg_match('/window\.INITIAL_DATA\s*=\s*(\{.*?\});/s', $html, $match)) {
        $data = json_decode($match[1], true);
        if (json_last_error() === JSON_ERROR_NONE) {
            $results['initial_data'] = $data;
        }
    }

    // Pattern 3: window.__STORE__ = {...}
    if (preg_match('/window\.__STORE__\s*=\s*(\{.*?\});/s', $html, $match)) {
        $data = json_decode($match[1], true);
        if (json_last_error() === JSON_ERROR_NONE) {
            $results['store'] = $data;
        }
    }

    // Pattern 4: var pageData = {...}
    if (preg_match('/var\s+pageData\s*=\s*(\{.*?\});/s', $html, $match)) {
        $data = json_decode($match[1], true);
        if (json_last_error() === JSON_ERROR_NONE) {
            $results['page_data'] = $data;
        }
    }

    return $results;
}

$html    = fetch_page("https://example.com/products");
$results = extract_inline_json($html);

if (empty($results)) {
    echo "No inline JSON found." . PHP_EOL;
} else {
    echo "Inline JSON blocks found: " . count($results) . PHP_EOL;

    foreach ($results as $key => $data) {
        echo "  $key: " . count($data) . " top-level keys" . PHP_EOL;
    }
}
?>

Output when data is found:

Inline JSON blocks found: 2
  json_block_0: 3 top-level keys
  initial_data: 8 top-level keys

Extracting JSON-LD Structured Data

Many e-commerce and news sites embed structured data in application/ld+json script tags for SEO. This is publicly intended data – product details, prices, article content – and it’s the easiest inline JSON to parse:

<?php
function extract_json_ld($html) {
    if (!$html) return [];

    preg_match_all(
        '/<script[^>]*type=["\']application\/ld\+json["\'][^>]*>(.*?)<\/script>/s',
        $html,
        $matches
    );

    $results = [];

    foreach ($matches[1] as $jsonString) {
        $data = json_decode(trim($jsonString), true);

        if (json_last_error() === JSON_ERROR_NONE) {
            $results[] = $data;
        }
    }

    return $results;
}

$html     = fetch_page("https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html");
$jsonLd   = extract_json_ld($html);

if (empty($jsonLd)) {
    echo "No JSON-LD found on this page." . PHP_EOL;
} else {
    foreach ($jsonLd as $item) {
        $type = $item['@type'] ?? 'Unknown';
        echo "Type: $type" . PHP_EOL;

        // Common product fields
        if (isset($item['name']))        echo "Name: "   . $item['name']            . PHP_EOL;
        if (isset($item['price']))       echo "Price: "  . $item['price']           . PHP_EOL;
        if (isset($item['description'])) echo "Desc: "   . substr($item['description'], 0, 80) . "..." . PHP_EOL;
    }
}
?>

Output on a product page with JSON-LD:

Type: Product
Name: A Light in the Attic
Price: 51.77
Desc: It's hard to imagine a world without A Light in the Attic. This slim volume...

Inline JSON is often deeply nested. Use a recursive search function to find the data you need without hardcoding the exact path:

<?php
function find_in_json($data, $searchKey, $results = []) {
    if (!is_array($data)) return $results;

    foreach ($data as $key => $value) {
        if ($key === $searchKey) {
            $results[] = $value;
        }

        if (is_array($value)) {
            $results = find_in_json($value, $searchKey, $results);
        }
    }

    return $results;
}

// Find all 'price' values anywhere in the JSON structure
$html = fetch_page("https://example-nextjs-site.com/products");
$data = extract_nextjs_data($html);

if ($data) {
    $prices = find_in_json($data, 'price');
    echo "Prices found in JSON: " . count($prices) . PHP_EOL;

    foreach (array_slice($prices, 0, 5) as $price) {
        echo "  $" . $price . PHP_EOL;
    }

    // Find all 'name' values
    $names = find_in_json($data, 'name');
    echo PHP_EOL . "Names found: " . count($names) . PHP_EOL;

    foreach (array_slice($names, 0, 3) as $name) {
        echo "  $name" . PHP_EOL;
    }
}
?>

Output:

Prices found in JSON: 24
  $49.99
  $89.99
  $34.99
  $24.99
  $15.00

Names found: 27
  Laptop Stand Pro
  Mechanical Keyboard
  USB-C Hub

The recursive search finds values regardless of how deeply nested they are – useful when you’re exploring an unknown JSON structure and don’t yet know the exact path to the data you need.
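A companion helper built on the same recursive idea – hypothetical, not part of the code above – returns the path to each match instead of the value, so once you know where the data lives you can switch to direct array access:

```php
<?php
// Sketch: like find_in_json(), but records the dot-path to each match so
// you can hardcode direct access once the structure is known.
function find_paths_in_json($data, string $searchKey, string $path = '', array $results = []): array {
    if (!is_array($data)) return $results;

    foreach ($data as $key => $value) {
        $current = ($path === '') ? (string) $key : "$path.$key";

        if ($key === $searchKey) {
            $results[] = $current;
        }
        if (is_array($value)) {
            $results = find_paths_in_json($value, $searchKey, $current, $results);
        }
    }

    return $results;
}

// Hypothetical Next.js-style structure
$data = ['props' => ['pageProps' => ['products' => [
    ['name' => 'USB-C Hub', 'price' => 34.99],
]]]];

print_r(find_paths_in_json($data, 'price'));
// prints: [0] => props.pageProps.products.0.price
```

Run it once while exploring, note the path it reports, then replace the recursive search with a direct `$data['props']['pageProps']...` lookup in production.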

Method 3: Integrating Puppeteer With PHP for Full JavaScript Rendering

When there’s no usable API and no inline JSON, the only reliable option is a headless browser – a real browser that runs without a visible window, executes JavaScript, and returns the fully rendered page. Puppeteer is the most widely used option. It controls a real Chromium instance programmatically from Node.js.

PHP doesn’t run Puppeteer directly. The integration works by calling a Node.js script from PHP – via shell_exec() or, for better error handling, proc_open() – passing the URL, and reading the rendered HTML back from stdout or a file. Not elegant, but reliable on any server where Node.js is installed.
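The simplest form of that integration is a shell_exec() one-liner; the proc_open() version later in this section adds timeouts and separate stderr capture. A sketch – the paths are examples:

```php
<?php
// Minimal sketch of the shell_exec() integration. shell_exec() blocks until
// the Node.js script exits and returns its stdout (the rendered HTML).
function fetch_rendered_simple(string $url, string $scriptPath): ?string {
    // escapeshellarg() prevents shell injection through the URL
    $command = 'node ' . escapeshellarg($scriptPath) . ' ' . escapeshellarg($url);
    $html    = shell_exec($command);

    // shell_exec() returns null on failure or when the command prints nothing
    return ($html !== null && trim($html) !== '') ? $html : null;
}

// Example paths - adjust to where render.js actually lives
$html = fetch_rendered_simple(
    'https://example-js-site.com/products',
    __DIR__ . '/puppeteer-scraper/render.js'
);

echo ($html === null ? 'Render failed.' : 'Rendered ' . strlen($html) . ' bytes.') . PHP_EOL;
```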

Installing Puppeteer

# Check Node.js is installed
node --version

# Create a directory for your Puppeteer scripts
mkdir puppeteer-scraper
cd puppeteer-scraper

# Initialize npm and install Puppeteer
npm init -y
npm install puppeteer

Output:

v20.11.0
added 68 packages in 45s

The Puppeteer Script

Save this as render.js in your puppeteer-scraper directory:

// render.js - renders a URL and outputs the HTML
const puppeteer = require('puppeteer');

(async () => {
    // Get URL from command line argument
    const url = process.argv[2];

    if (!url) {
        console.error('Usage: node render.js <url>');
        process.exit(1);
    }

    let browser;

    try {
        browser = await puppeteer.launch({
            headless: 'new',      // use new headless mode
            args: [
                '--no-sandbox',
                '--disable-setuid-sandbox',
                '--disable-dev-shm-usage', // prevents crashes on low-memory servers
                '--disable-gpu',
            ]
        });

        const page = await browser.newPage();

        // Set a realistic viewport
        await page.setViewport({ width: 1280, height: 800 });

        // Set user agent to match a real browser
        await page.setUserAgent(
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'
        );

        // Navigate and wait for network to be idle
        // networkidle2 means no more than 2 network connections for at least 500ms
        await page.goto(url, {
            waitUntil: 'networkidle2',
            timeout: 30000,
        });

        // Get the fully rendered HTML
        const html = await page.content();

        // Output to stdout - PHP will capture this
        console.log(html);

    } catch (error) {
        console.error('Error: ' + error.message);
        process.exit(1);
    } finally {
        if (browser) await browser.close();
    }
})();

Calling Puppeteer From PHP

<?php
function fetch_rendered_page($url, $puppeteerScript, $timeout = 60) {
    // Sanitize URL to prevent command injection
    $safeUrl = escapeshellarg($url);
    $script  = escapeshellarg($puppeteerScript);

    // stderr gets its own pipe below - don't merge it into stdout with 2>&1,
    // or Node error messages would end up mixed into the returned HTML
    $command = "node $script $safeUrl";

    $descriptors = [
        0 => ['pipe', 'r'],
        1 => ['pipe', 'w'],
        2 => ['pipe', 'w'],
    ];

    $process = proc_open($command, $descriptors, $pipes);

    if (!is_resource($process)) {
        echo "Failed to start Puppeteer process." . PHP_EOL;
        return false;
    }

    fclose($pipes[0]);

    // Read output with timeout
    stream_set_timeout($pipes[1], $timeout);
    $html = stream_get_contents($pipes[1]);
    fclose($pipes[1]);

    $errors = stream_get_contents($pipes[2]);
    fclose($pipes[2]);

    $exitCode = proc_close($process);

    if ($exitCode !== 0) {
        echo "Puppeteer error (exit $exitCode): $errors" . PHP_EOL;
        return false;
    }

    if (empty($html)) {
        echo "Puppeteer returned empty response." . PHP_EOL;
        return false;
    }

    return $html;
}

// Usage
$puppeteerScript = '/var/www/html/puppeteer-scraper/render.js';
$url             = 'https://example-js-site.com/products';

echo "Fetching rendered page..." . PHP_EOL;
$html = fetch_rendered_page($url, $puppeteerScript);

if ($html) {
    echo "Rendered HTML received: " . strlen($html) . " bytes." . PHP_EOL;

    // Now parse with DOMDocument as normal
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath    = new DOMXPath($dom);
    $products = $xpath->query('//div[contains(@class,"product-card")]');

    echo "Products found: " . $products->length . PHP_EOL;
}
?>

Output:

Fetching rendered page...
Rendered HTML received: 187432 bytes.
Products found: 24

Waiting for Specific Elements to Load

The networkidle2 wait strategy works for most sites. Some sites need a more specific wait – waiting until a particular element appears before capturing the HTML:

// render_wait.js - wait for a specific element before capturing
const puppeteer = require('puppeteer');

(async () => {
    const url      = process.argv[2];
    const selector = process.argv[3] || '.product-card'; // element to wait for

    if (!url) {
        console.error('Usage: node render_wait.js <url> [selector]');
        process.exit(1);
    }

    let browser;

    try {
        browser = await puppeteer.launch({
            headless: 'new',
            args: ['--no-sandbox', '--disable-setuid-sandbox', '--disable-dev-shm-usage']
        });

        const page = await browser.newPage();

        await page.setViewport({ width: 1280, height: 800 });
        await page.setUserAgent(
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'
        );

        await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 30000 });

        // Wait for the specific element - timeout after 10 seconds
        await page.waitForSelector(selector, { timeout: 10000 });

        // Small additional wait for any animations or lazy loading
        await new Promise(resolve => setTimeout(resolve, 1000));

        const html = await page.content();
        console.log(html);

    } catch (error) {
        console.error('Error: ' + error.message);
        process.exit(1);
    } finally {
        if (browser) await browser.close();
    }
})();

On the PHP side, pass the selector through as a second argument:

<?php
// Call the wait version with a specific selector
function fetch_rendered_page_wait($url, $selector, $puppeteerScript, $timeout = 60) {
    $safeUrl      = escapeshellarg($url);
    $safeSelector = escapeshellarg($selector);
    $script       = escapeshellarg($puppeteerScript);

    $command     = "node $script $safeUrl $safeSelector"; // stderr captured via its own pipe
    $descriptors = [0 => ['pipe', 'r'], 1 => ['pipe', 'w'], 2 => ['pipe', 'w']];
    $process     = proc_open($command, $descriptors, $pipes);

    if (!is_resource($process)) return false;

    fclose($pipes[0]);
    stream_set_timeout($pipes[1], $timeout);
    $html     = stream_get_contents($pipes[1]);
    $errors   = stream_get_contents($pipes[2]);
    fclose($pipes[1]);
    fclose($pipes[2]);
    $exitCode = proc_close($process);

    if ($exitCode !== 0) {
        echo "Error: $errors" . PHP_EOL;
        return false;
    }

    return $html ?: false;
}

$script = '/var/www/html/puppeteer-scraper/render_wait.js';

// Wait until .product-card elements appear before capturing
$html = fetch_rendered_page_wait(
    'https://example-js-site.com/products',
    '.product-card',
    $script
);

if ($html) {
    echo "Page rendered. Size: " . strlen($html) . " bytes." . PHP_EOL;
}
?>

Caching Rendered Pages

Puppeteer is slow – each page render takes 3-8 seconds. For scraping jobs that revisit the same pages, cache the rendered HTML to avoid re-rendering on every run:

<?php
function fetch_rendered_cached($url, $puppeteerScript, $cacheTtl = 3600) {
    $cacheDir  = __DIR__ . '/rendered_cache/';
    $cacheFile = $cacheDir . md5($url) . '.html';

    if (!is_dir($cacheDir)) {
        mkdir($cacheDir, 0755, true);
    }

    // Return cached version if fresh
    if (file_exists($cacheFile)) {
        $age = time() - filemtime($cacheFile);

        if ($age < $cacheTtl) {
            echo "Returning cached render (age: {$age}s)." . PHP_EOL;
            return file_get_contents($cacheFile);
        }
    }

    // Fetch fresh render
    echo "Rendering page with Puppeteer..." . PHP_EOL;
    $html = fetch_rendered_page($url, $puppeteerScript);

    if ($html) {
        file_put_contents($cacheFile, $html);
        echo "Render cached for {$cacheTtl}s." . PHP_EOL;
    }

    return $html;
}

$script = '/var/www/html/puppeteer-scraper/render.js';

// First call renders and caches
$html = fetch_rendered_cached('https://example-js-site.com/products', $script, 3600);

// Second call within 1 hour returns cache instantly
$html = fetch_rendered_cached('https://example-js-site.com/products', $script, 3600);
?>

Output on first call:

Rendering page with Puppeteer...
Render cached for 3600s.

Output on second call within cache window:

Returning cached render (age: 42s).

When to Use Puppeteer vs the Other Methods

Puppeteer adds complexity and slows down your scraper significantly. Use it as a last resort:

  • Use Method 1 (hidden API) – when DevTools shows JSON API calls with the data you need. Fastest option, least likely to break.
  • Use Method 2 (inline JSON) – when the raw HTML contains a script tag with embedded JSON. Fast and requires no extra tools.
  • Use Method 3 (Puppeteer) – only when Methods 1 and 2 both fail. Content is genuinely client-side rendered with no accessible data source, or the site uses advanced bot detection that requires actual browser behavior.

Choosing the Right Dynamic Content Web Scraping PHP Method

Work through these three questions in order before writing any code. Each one points to a specific method and saves significant time compared to trying everything at random.

Decision Framework

1. Open Chrome DevTools → Network → Fetch/XHR
   Does the page make API calls returning JSON with your data?
   YES → Method 1 (hit the API directly with cURL)
   NO  → continue

2. View Page Source → search for __NEXT_DATA__, __NUXT__,
   window.__data, or application/json script tags
   Is your data embedded as inline JSON?
   YES → Method 2 (extract from script tag)
   NO  → continue

3. Nothing above worked
   → Method 3 (Puppeteer — render the page fully)
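Step 2 of the framework can be partially automated. Here's a quick diagnostic sketch that scans raw HTML for the inline-JSON markers listed above – the marker list and function name are illustrative, not exhaustive:

```php
<?php
// Scan raw HTML for common inline-JSON markers (step 2 of the framework).
function detect_inline_json(string $html): array {
    $markers = ['__NEXT_DATA__', '__NUXT__', 'window.__data', 'application/json'];
    $found   = [];

    foreach ($markers as $marker) {
        if (strpos($html, $marker) !== false) {
            $found[] = $marker;
        }
    }

    return $found;
}

// Sample HTML standing in for a fetched page
$html  = '<script id="__NEXT_DATA__" type="application/json">{"props":{}}</script>';
$found = detect_inline_json($html);

echo $found
    ? 'Markers found, try Method 2: ' . implode(', ', $found) . PHP_EOL
    : 'No inline JSON markers, fall back to Method 3' . PHP_EOL;
?>
```

Run this against the output of your existing fetch function – if it returns an empty array, skip straight to Method 3.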

Performance Comparison

<?php
// Method 1: Direct API call
// Speed: very fast (under 1 second per request)
// Reliability: high (API endpoints are stable)
// Complexity: low
$start = microtime(true);
$data  = call_api("https://api.example.com/products");
echo "Method 1 (API): " . round((microtime(true) - $start) * 1000) . "ms" . PHP_EOL;

// Method 2: Inline JSON extraction
// Speed: fast (same as a regular cURL request)
// Reliability: medium (breaks when site updates framework)
// Complexity: low
$start = microtime(true);
$html  = fetch_page("https://example-nextjs-site.com/products");
$data  = extract_nextjs_data($html);
echo "Method 2 (Inline JSON): " . round((microtime(true) - $start) * 1000) . "ms" . PHP_EOL;

// Method 3: Puppeteer rendering
// Speed: slow (3-8 seconds per page)
// Reliability: high (handles anything a browser can)
// Complexity: high
$start = microtime(true);
$html  = fetch_rendered_page("https://example-js-site.com/products", $script);
echo "Method 3 (Puppeteer): " . round((microtime(true) - $start) * 1000) . "ms" . PHP_EOL;
?>

Output:

Method 1 (API):          243ms
Method 2 (Inline JSON):  389ms
Method 3 (Puppeteer):    4821ms

Puppeteer takes 10-20x longer than the other methods. At scale – 500 pages per day – that difference is 2 minutes vs 40 minutes of processing time.

Frequently Asked Questions

Can PHP scrape JavaScript websites directly?

Not by itself. PHP cURL fetches raw HTML before JavaScript runs – any content loaded by JavaScript after page render is invisible to cURL. The three approaches that work are: hitting the API endpoint the JavaScript calls, extracting inline JSON from script tags in the raw HTML, or using Puppeteer via Node.js to render the page fully before PHP processes it.

How do I find the API endpoint a JavaScript site uses?

Open Chrome DevTools, go to the Network tab, filter by Fetch/XHR, and reload the page. Look for requests returning JSON with the data you need. Click any request to see the full URL, headers, and response. Copy the URL and try hitting it directly with cURL – many public endpoints respond without additional authentication. If you get a 401 or 403, copy the request headers (cookies, Referer, X-Requested-With) from DevTools as well.
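Replaying a copied endpoint looks roughly like this – the URL and headers below are placeholders for whatever the Network tab shows:

```php
<?php
// Replay an endpoint copied from DevTools. URL and headers are placeholders;
// substitute the real values from the Network tab's request details.
$ch = curl_init('https://api.example.com/v1/products?page=1');

curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_CONNECTTIMEOUT => 5,
    CURLOPT_TIMEOUT        => 30,
    CURLOPT_HTTPHEADER     => [
        'Accept: application/json',
        'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
        // Add cookies / X-Requested-With here only if the bare request fails
    ],
]);

$body = curl_exec($ch);
curl_close($ch);

$data = ($body !== false) ? json_decode($body, true) : null;
echo is_array($data)
    ? 'Got JSON with keys: ' . implode(', ', array_keys($data)) . PHP_EOL
    : 'Request failed or response was not JSON' . PHP_EOL;
?>
```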

What is __NEXT_DATA__ and how do I extract it?

__NEXT_DATA__ is a script tag Next.js adds to every server-rendered page containing the initial page props as JSON. It looks like <script id="__NEXT_DATA__" type="application/json"> in the page source. Use a regex to match the script tag and json_decode() on its contents. The data is usually under props.pageProps in the resulting array.
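A condensed sketch of that extraction – the sample HTML below stands in for a real Next.js page, and the exact structure under props.pageProps varies per site:

```php
<?php
// Extract and decode the __NEXT_DATA__ script tag from raw HTML.
function extract_next_data(string $html): ?array {
    $pattern = '/<script id="__NEXT_DATA__" type="application\/json">(.*?)<\/script>/s';

    if (!preg_match($pattern, $html, $m)) {
        return null; // tag not present in this page
    }

    $data = json_decode($m[1], true);
    return is_array($data) ? $data : null;
}

// Sample HTML standing in for a fetched Next.js page
$html = '<html><body><script id="__NEXT_DATA__" type="application/json">'
      . '{"props":{"pageProps":{"products":[{"name":"Widget","price":9.99}]}}}'
      . '</script></body></html>';

$data = extract_next_data($html);
echo $data['props']['pageProps']['products'][0]['name'] . PHP_EOL; // Widget
?>
```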

Is Puppeteer free to use?

Yes. Puppeteer is an open-source Node.js library maintained by Google. It downloads a compatible version of Chromium automatically when you install it via npm. The only cost is server resources – headless Chrome uses significant memory, typically 200-400MB per browser instance. On a shared hosting account it may not be available at all since most shared hosts don’t allow running Chrome processes.

My Puppeteer script works locally but fails on the server. Why?

Three common causes. First, Node.js isn’t installed on the server or is a different version – run node --version to check. Second, the --no-sandbox flag is missing from the launch options – most Linux server environments require it. Third, the server doesn’t have the system libraries Chromium needs – run node -e "const p = require('puppeteer'); p.launch()" directly on the server to see the exact error message.

How do I scrape dynamic pagination that loads more content on scroll?

Infinite scroll pagination almost always works by calling an API endpoint when the user scrolls near the bottom of the page. Check DevTools Network tab while slowly scrolling down – you’ll see new API calls fire as more content loads. Hit those endpoints directly with cURL using an incrementing page or offset parameter. If there’s no API call and content genuinely appears through DOM manipulation only, Puppeteer can simulate scrolling with page.evaluate(() => window.scrollTo(0, document.body.scrollHeight)) before capturing the HTML.
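A paging loop against such an endpoint might look like this – the URL, the page parameter, and the items key are assumptions to replace with whatever DevTools shows for the real site:

```php
<?php
// Page through a hypothetical JSON endpoint discovered behind infinite scroll.
// URL, "page" parameter, and "items" key are assumptions; check DevTools.
function fetch_json(string $url): ?array {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_TIMEOUT        => 30,
        CURLOPT_HTTPHEADER     => ['Accept: application/json'],
    ]);
    $body = curl_exec($ch);
    curl_close($ch);

    $data = ($body !== false) ? json_decode($body, true) : null;
    return is_array($data) ? $data : null;
}

$allItems = [];

for ($page = 1; $page <= 50; $page++) { // hard cap as a safety net
    $data = fetch_json("https://api.example.com/products?page={$page}&limit=20");

    if (empty($data['items'])) {
        break; // failed request or empty page = end of results
    }

    $allItems = array_merge($allItems, $data['items']);
    usleep(500000); // 0.5s pause between requests to stay polite
}

echo count($allItems) . " items collected" . PHP_EOL;
?>
```

The hard cap and the empty-page break both matter: without them, an endpoint that echoes the last page forever turns this into an infinite loop.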


Summary

Dynamic content web scraping in PHP comes down to understanding where the data actually lives – not just what the rendered page shows. Work through the methods in order:

  • Method 1 first – check DevTools for API calls. Hitting a JSON endpoint with cURL is faster, more reliable, and less likely to break than any other approach.
  • Method 2 second – check the raw HTML for inline JSON. Next.js, Nuxt.js, and many custom setups embed the initial data directly in script tags that cURL can read without JavaScript execution.
  • Method 3 last resort – Puppeteer handles anything a browser can, but costs 10-20x more time per page. Use it only when the other methods fail.

For the PHP cURL setup and DOMDocument parsing that processes the HTML once you have it, the PHP cURL web scraping complete guide covers every request option and parsing pattern in detail. For stopping your scraper from getting blocked while fetching these pages, the avoiding blocks guide covers headers, delays, proxies, and block detection with working code. And for the 7 most common mistakes that cause scrapers to fail silently on dynamic sites, the web scraping errors guide covers each one with working fixes.
