PHP News Aggregator: Build a Multi-Source News Scraper With RSS and MySQL

A news aggregator collects articles from multiple sources and brings them into one place. Instead of visiting ten different sites every morning, a PHP news aggregator does it automatically – fetching headlines, filtering by topic, storing results, and optionally emailing you a digest.

This guide builds a complete PHP news aggregator from scratch. It covers RSS feed parsing (the right way to aggregate news from sites that support it), HTML scraping for sites that don’t, MySQL storage with duplicate prevention, keyword filtering, and cron automation for daily runs.

What We’re Building

The finished PHP news aggregator does five things:

  • Fetches RSS feeds – reads structured XML news feeds from sites that publish them. Faster and more reliable than HTML scraping.
  • Scrapes HTML news sites – falls back to cURL and DOMDocument for sites without RSS feeds.
  • Stores articles in MySQL – saves every article with title, URL, source, and publication date. Skips duplicates automatically.
  • Filters by keyword – keeps only articles matching topics you care about.
  • Sends email digests – emails a formatted summary of new articles on a schedule.

RSS vs HTML Scraping – Which to Use

Always check for an RSS feed before writing any scraping code. RSS is a structured XML format sites publish specifically for content syndication – it’s faster to parse, less likely to break, and explicitly allowed by the site. HTML scraping is a fallback for sites that don’t offer RSS.

Find a site’s RSS feed by:

  • Adding /feed/ or /rss/ to the site URL – works on most WordPress sites
  • Looking for an RSS icon in the browser or page source
  • Checking /feed.xml or /atom.xml
  • Viewing page source and searching for application/rss+xml
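The last check can be automated. A small sketch (the function name and the sample HTML are my own; real pages advertise feeds the same way) that pulls the advertised feed URL out of a page’s link tags:

```php
<?php
// Sketch: auto-discover a feed URL from a page's <head>, assuming the
// site advertises it with <link rel="alternate" type="application/rss+xml">
// (or the Atom equivalent).
function discover_feed_url(string $html, string $baseUrl): ?string {
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $links = $xpath->query(
        '//link[@rel="alternate"][@type="application/rss+xml" or @type="application/atom+xml"]'
    );

    if ($links->length === 0) return null;

    $href = $links->item(0)->getAttribute('href');

    // Resolve root-relative hrefs like /feed/ against the site URL
    if ($href !== '' && strpos($href, 'http') !== 0) {
        $href = rtrim($baseUrl, '/') . '/' . ltrim($href, '/');
    }

    return $href !== '' ? $href : null;
}

// Illustrative page source - most WordPress sites emit a tag like this
$html = '<html><head>
    <link rel="alternate" type="application/rss+xml" href="/feed/">
</head><body></body></html>';

echo discover_feed_url($html, 'https://example-blog.com') . PHP_EOL;
?>
```

Feed the function the HTML you already fetch with cURL; if it returns null, fall back to the manual checks above.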

What You Need

  • PHP 7.4 or higher with cURL and SimpleXML enabled
  • MySQL 5.7 or higher
  • Basic PHP and SQL knowledge

Verify SimpleXML is available:

<?php
if (extension_loaded('simplexml')) {
    echo "SimpleXML available." . PHP_EOL;
} else {
    echo "SimpleXML not available - enable it in php.ini." . PHP_EOL;
}
?>

Output:

SimpleXML available.
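cURL and the MySQL driver can be verified the same way. A quick loop covering all three (pdo_mysql is the usual extension name for PDO’s MySQL driver, which the storage code later relies on):

```php
<?php
// Check every extension this guide relies on in one pass.
// pdo_mysql backs the PDO MySQL connection used for storage.
$required = ['curl', 'simplexml', 'pdo_mysql'];

foreach ($required as $ext) {
    $status = extension_loaded($ext)
        ? 'available'
        : 'MISSING - enable it in php.ini';
    echo $ext . ': ' . $status . PHP_EOL;
}
?>
```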

If you haven’t built a basic PHP scraper before, read the PHP web scraper beginner guide first – this guide assumes familiarity with cURL requests and DOMDocument parsing.

Setting Up the Database

The aggregator needs two tables. One for news sources – the sites and feeds you want to monitor. One for articles – every item collected with full metadata and duplicate prevention.

Creating the Tables

CREATE DATABASE IF NOT EXISTS news_aggregator
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;

USE news_aggregator;

-- News sources to monitor
CREATE TABLE IF NOT EXISTS sources (
    id          INT AUTO_INCREMENT PRIMARY KEY,
    name        VARCHAR(255)  NOT NULL,
    url         VARCHAR(500)  NOT NULL UNIQUE,
    feed_url    VARCHAR(500)  DEFAULT NULL, -- RSS feed URL if available
    type        ENUM('rss','html') DEFAULT 'rss',
    active      TINYINT       DEFAULT 1,
    last_fetched TIMESTAMP    DEFAULT NULL,
    created_at  TIMESTAMP     DEFAULT CURRENT_TIMESTAMP
);

-- Articles collected from all sources
CREATE TABLE IF NOT EXISTS articles (
    id          INT AUTO_INCREMENT PRIMARY KEY,
    source_id   INT           NOT NULL,
    title       VARCHAR(500)  NOT NULL,
    url         VARCHAR(500)  NOT NULL,
    description TEXT          DEFAULT NULL,
    author      VARCHAR(255)  DEFAULT NULL,
    published_at TIMESTAMP    DEFAULT NULL,
    fetched_at  TIMESTAMP     DEFAULT CURRENT_TIMESTAMP,
    UNIQUE KEY unique_url (url),
    FOREIGN KEY (source_id) REFERENCES sources(id) ON DELETE CASCADE,
    INDEX idx_published (published_at),
    INDEX idx_source (source_id),
    INDEX idx_fetched (fetched_at)
);

Adding News Sources

Insert the sources you want to monitor. These are public RSS feeds that are freely available:

-- PHP and web development news sources with RSS feeds
INSERT INTO sources (name, url, feed_url, type) VALUES
('PHP.net News',     'https://www.php.net',          'https://www.php.net/feed.atom',                   'rss'),
('Hacker News',      'https://news.ycombinator.com', 'https://news.ycombinator.com/rss',                'rss'),
('CSS-Tricks',       'https://css-tricks.com',       'https://css-tricks.com/feed/',                    'rss'),
('Smashing Magazine','https://smashingmagazine.com', 'https://www.smashingmagazine.com/feed/',          'rss'),
('Dev.to',           'https://dev.to',               'https://dev.to/feed',                             'rss');

Connecting to the Database

<?php
function get_db_connection() {
    $host    = 'localhost';
    $dbname  = 'news_aggregator';
    $user    = 'your_username';
    $pass    = 'your_password';
    $charset = 'utf8mb4';

    $dsn = "mysql:host=$host;dbname=$dbname;charset=$charset";

    try {
        $pdo = new PDO($dsn, $user, $pass, [
            PDO::ATTR_ERRMODE            => PDO::ERRMODE_EXCEPTION,
            PDO::ATTR_DEFAULT_FETCH_MODE => PDO::FETCH_ASSOC,
            PDO::ATTR_EMULATE_PREPARES   => false,
        ]);
        return $pdo;
    } catch (PDOException $e) {
        echo "Connection failed: " . $e->getMessage() . PHP_EOL;
        return null;
    }
}

$pdo = get_db_connection();

if ($pdo) {
    $count = $pdo->query("SELECT COUNT(*) FROM sources WHERE active = 1")->fetchColumn();
    echo "Connected. Active sources: $count" . PHP_EOL;
}
?>

Output:

Connected. Active sources: 5

Managing Sources From PHP

<?php
function add_source($pdo, $name, $url, $feedUrl = null, $type = 'rss') {
    $sql = "INSERT INTO sources (name, url, feed_url, type)
            VALUES (:name, :url, :feed_url, :type)
            ON DUPLICATE KEY UPDATE
                name     = VALUES(name),
                feed_url = VALUES(feed_url),
                type     = VALUES(type)";

    try {
        $stmt = $pdo->prepare($sql);
        $stmt->execute([
            ':name'     => $name,
            ':url'      => $url,
            ':feed_url' => $feedUrl,
            ':type'     => $type,
        ]);
        echo "Source added: $name" . PHP_EOL;
        return true;
    } catch (PDOException $e) {
        echo "Failed to add source: " . $e->getMessage() . PHP_EOL;
        return false;
    }
}

function get_active_sources($pdo) {
    return $pdo->query(
        "SELECT * FROM sources WHERE active = 1 ORDER BY name ASC"
    )->fetchAll();
}

function pause_source($pdo, $sourceId) {
    $pdo->prepare("UPDATE sources SET active = 0 WHERE id = :id")
        ->execute([':id' => $sourceId]);
    echo "Source paused." . PHP_EOL;
}

// Add a new source
add_source(
    $pdo,
    'Laravel News',
    'https://laravel-news.com',
    'https://laravel-news.com/feed',
    'rss'
);

// List all active sources
$sources = get_active_sources($pdo);
echo PHP_EOL . "Active sources:" . PHP_EOL;

foreach ($sources as $source) {
    echo "  [{$source['type']}] {$source['name']} - {$source['feed_url']}" . PHP_EOL;
}
?>

Output:

Source added: Laravel News

Active sources:
  [rss] CSS-Tricks - https://css-tricks.com/feed/
  [rss] Dev.to - https://dev.to/feed
  [rss] Hacker News - https://news.ycombinator.com/rss
  [rss] Laravel News - https://laravel-news.com/feed
  [rss] PHP.net News - https://www.php.net/feed.atom
  [rss] Smashing Magazine - https://www.smashingmagazine.com/feed/

Parsing RSS Feeds With SimpleXML

RSS is XML – PHP’s SimpleXML extension reads it directly without any HTML parsing. You get structured data immediately: title, link, description, author, and publication date as named properties rather than DOM nodes you have to query.

Most news sites publish either RSS 2.0 or Atom feeds. The structure is slightly different between them – the code below handles both.

For the full SimpleXML documentation and available methods see the official PHP SimpleXML documentation.

Fetching an RSS Feed

<?php
function fetch_feed($url) {
    $ch = curl_init();

    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_TIMEOUT        => 30,
        CURLOPT_ENCODING       => '',
        CURLOPT_HTTPHEADER     => [
            'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
            'Accept: application/rss+xml, application/atom+xml, application/xml, text/xml, */*',
        ],
    ]);

    $response = curl_exec($ch);
    $errno    = curl_errno($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($errno || $httpCode !== 200) {
        echo "Feed fetch failed: HTTP $httpCode on $url" . PHP_EOL;
        return false;
    }

    return $response;
}

// Test with Hacker News RSS
$xml = fetch_feed("https://news.ycombinator.com/rss");

if ($xml) {
    echo "Feed fetched: " . strlen($xml) . " bytes." . PHP_EOL;
    echo "First 200 chars: " . substr($xml, 0, 200) . PHP_EOL;
}
?>

Output:

Feed fetched: 18432 bytes.
First 200 chars: <rss version="2.0">
<channel>
    <title>Hacker News</title>
    <link>https://news.ycombinator.com/</link>
    <description>Links for the intellectually curious</description>

Parsing RSS 2.0 Feeds

<?php
function parse_rss_feed($xmlString, $sourceId) {
    if (!$xmlString) return [];

    // Suppress XML parsing warnings on malformed feeds
    libxml_use_internal_errors(true);
    $xml = simplexml_load_string($xmlString);
    libxml_clear_errors();

    if (!$xml) {
        echo "Failed to parse RSS XML." . PHP_EOL;
        return [];
    }

    $articles = [];

    // RSS 2.0 structure: rss > channel > item
    $items = $xml->channel->item ?? [];

    foreach ($items as $item) {
        $title       = (string) $item->title;
        $url         = (string) $item->link;
        $description = (string) $item->description;
        $author      = (string) ($item->author ?? $item->children('dc', true)->creator ?? '');
        $pubDate     = (string) $item->pubDate;

        // Skip items without title or URL
        if (!$title || !$url) continue;

        // Clean up description - strip HTML tags
        $description = strip_tags($description);
        $description = trim(substr($description, 0, 500)); // cap at 500 chars

        // Parse publication date - strtotime() returns false on
        // unparseable dates, so guard before formatting
        $timestamp   = $pubDate ? strtotime($pubDate) : false;
        $publishedAt = $timestamp ? date('Y-m-d H:i:s', $timestamp) : null;

        $articles[] = [
            'source_id'    => $sourceId,
            'title'        => trim($title),
            'url'          => trim($url),
            'description'  => $description,
            'author'       => trim($author),
            'published_at' => $publishedAt,
        ];
    }

    return $articles;
}

// Test it
$xml      = fetch_feed("https://news.ycombinator.com/rss");
$articles = parse_rss_feed($xml, 1);

echo "Articles parsed: " . count($articles) . PHP_EOL . PHP_EOL;

foreach (array_slice($articles, 0, 3) as $article) {
    echo $article['title'] . PHP_EOL;
    echo "  URL:       " . $article['url']          . PHP_EOL;
    echo "  Published: " . $article['published_at'] . PHP_EOL;
    echo PHP_EOL;
}
?>

Output:

Articles parsed: 30

PHP 8.4 Released
  URL:       https://www.php.net/releases/8.4/en.php
  Published: 2026-05-01 09:00:00

Show HN: I built a web scraper in PHP that handles JS rendering
  URL:       https://news.ycombinator.com/item?id=12345678
  Published: 2026-05-01 08:45:00

Ask HN: Best practices for rate limiting in scrapers
  URL:       https://news.ycombinator.com/item?id=12345679
  Published: 2026-05-01 08:30:00

Parsing Atom Feeds

Atom is a slightly different XML format used by some sites including PHP.net. The structure uses entry instead of item and the link is an attribute rather than text content:

<?php
function parse_atom_feed($xmlString, $sourceId) {
    if (!$xmlString) return [];

    libxml_use_internal_errors(true);
    $xml = simplexml_load_string($xmlString);
    libxml_clear_errors();

    if (!$xml) {
        echo "Failed to parse Atom XML." . PHP_EOL;
        return [];
    }

    $articles   = [];
    $namespaces = $xml->getNamespaces(true);

    // Atom structure: feed > entry
    foreach ($xml->entry as $entry) {
        $title = (string) $entry->title;

        // Atom link is an attribute: <link href="..." />
        $url = '';
        foreach ($entry->link as $link) {
            $rel = (string) $link->attributes()->rel;
            if ($rel === 'alternate' || $rel === '') {
                $url = (string) $link->attributes()->href;
                break;
            }
        }

        $description = (string) ($entry->summary ?? $entry->content ?? '');
        $author      = (string) ($entry->author->name ?? '');
        $published   = (string) ($entry->published ?? $entry->updated ?? '');

        // strtotime() returns false on unparseable dates - guard
        // before formatting so a bad date stores NULL, not 1970-01-01
        $timestamp   = $published ? strtotime($published) : false;

        if (!$title || !$url) continue;

        $articles[] = [
            'source_id'    => $sourceId,
            'title'        => trim($title),
            'url'          => trim($url),
            'description'  => trim(strip_tags($description)),
            'author'       => trim($author),
            'published_at' => $timestamp ? date('Y-m-d H:i:s', $timestamp) : null,
        ];
    }

    return $articles;
}

// Test with PHP.net Atom feed
$xml      = fetch_feed("https://www.php.net/feed.atom");
$articles = parse_atom_feed($xml, 2);

echo "Atom articles parsed: " . count($articles) . PHP_EOL;

foreach (array_slice($articles, 0, 2) as $article) {
    echo $article['title'] . " - " . $article['published_at'] . PHP_EOL;
}
?>

Output:

Atom articles parsed: 12
PHP 8.4.0 Released - 2026-11-21 00:00:00
PHP 8.3.14 Released - 2026-11-19 00:00:00
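A detail both parsers share: RSS 2.0 pubDate values use RFC 822 dates, while Atom published/updated values use ISO 8601 timestamps. PHP’s strtotime() accepts both, which is why the same conversion works in each parser. The sample dates below are illustrative:

```php
<?php
// strtotime() handles both date formats found in feeds.
// Pin the timezone so the formatted output is deterministic.
date_default_timezone_set('UTC');

$rfc822  = 'Fri, 01 May 2026 09:00:00 +0000'; // RSS 2.0 pubDate style
$iso8601 = '2026-05-01T09:00:00+00:00';       // Atom published/updated style

foreach ([$rfc822, $iso8601] as $raw) {
    $ts = strtotime($raw);
    echo date('Y-m-d H:i:s', $ts) . PHP_EOL;
}
?>
```

Both lines print 2026-05-01 09:00:00, so the stored published_at is comparable across RSS and Atom sources.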

Auto-Detecting RSS vs Atom

Rather than checking which format each feed uses manually, detect it automatically from the XML root element:

<?php
function parse_feed($xmlString, $sourceId) {
    if (!$xmlString) return [];

    libxml_use_internal_errors(true);
    $xml = simplexml_load_string($xmlString);
    libxml_clear_errors();

    if (!$xml) {
        echo "Invalid XML - cannot parse feed." . PHP_EOL;
        return [];
    }

    $rootElement = $xml->getName();

    // Detect format from root element name
    if ($rootElement === 'feed') {
        // Atom feed
        return parse_atom_feed($xmlString, $sourceId);
    } elseif ($rootElement === 'rss') {
        // RSS 2.0 feed - root element is <rss>
        return parse_rss_feed($xmlString, $sourceId);
    } else {
        echo "Unknown feed format: $rootElement" . PHP_EOL;
        return [];
    }
}

// Works on both RSS and Atom automatically
$feeds = [
    ['url' => 'https://news.ycombinator.com/rss', 'source_id' => 1],
    ['url' => 'https://www.php.net/feed.atom',    'source_id' => 2],
];

foreach ($feeds as $feed) {
    $xml      = fetch_feed($feed['url']);
    $articles = parse_feed($xml, $feed['source_id']);
    echo "Parsed " . count($articles) . " articles from {$feed['url']}" . PHP_EOL;
}
?>

Output:

Parsed 30 articles from https://news.ycombinator.com/rss
Parsed 12 articles from https://www.php.net/feed.atom

Scraping HTML News Sites Without RSS

Not every news site publishes an RSS feed. For sites that don’t, cURL and DOMDocument extract the same data from the HTML directly. The approach is identical to any other web scraping project – fetch the page, find the article elements, extract title, URL, and date.

The challenge with news sites specifically is that every site has a different HTML structure. The code needs to be flexible enough to handle different layouts without rewriting the core logic each time.

Generic News Article Extractor

<?php
function scrape_news_page($url, $sourceId, $selectors) {
    $ch = curl_init();

    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_TIMEOUT        => 30,
        CURLOPT_ENCODING       => '',
        CURLOPT_HTTPHEADER     => [
            'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
            'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language: en-US,en;q=0.5',
        ],
    ]);

    $html     = curl_exec($ch);
    $errno    = curl_errno($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($errno || $httpCode !== 200) {
        echo "Scrape failed: HTTP $httpCode on $url" . PHP_EOL;
        return [];
    }

    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath    = new DOMXPath($dom);
    $articles = [];

    // Use the XPath selector for article containers
    $items = $xpath->query($selectors['container']);

    if ($items->length === 0) {
        echo "No articles found on $url - check selector." . PHP_EOL;
        return [];
    }

    foreach ($items as $item) {
        // Extract title
        $titleNode = $xpath->query($selectors['title'], $item)->item(0);
        $title     = $titleNode ? trim($titleNode->textContent) : null;

        // Extract URL
        $linkNode  = $xpath->query($selectors['link'], $item)->item(0);
        $href      = $linkNode ? $linkNode->getAttribute('href') : null;

        // Make relative URLs absolute - handle both root-relative
        // hrefs (/path) and document-relative ones (path)
        if ($href && strpos($href, 'http') !== 0) {
            $parsed = parse_url($url);
            $base   = $parsed['scheme'] . '://' . $parsed['host'];
            $href   = (strpos($href, '/') === 0)
                ? $base . $href
                : rtrim($url, '/') . '/' . $href;
        }

        // Extract description if selector provided
        $description = null;
        if (!empty($selectors['description'])) {
            $descNode    = $xpath->query($selectors['description'], $item)->item(0);
            $description = $descNode ? trim($descNode->textContent) : null;
            if ($description) {
                $description = substr($description, 0, 500);
            }
        }

        // Extract date if selector provided
        $publishedAt = null;
        if (!empty($selectors['date'])) {
            $dateNode = $xpath->query($selectors['date'], $item)->item(0);
            if ($dateNode) {
                // Try datetime attribute first (cleaner format)
                $dateStr = $dateNode->getAttribute('datetime')
                        ?: trim($dateNode->textContent);
                $timestamp   = strtotime($dateStr);
                $publishedAt = $timestamp ? date('Y-m-d H:i:s', $timestamp) : null;
            }
        }

        if (empty($title) || empty($href)) continue;

        $articles[] = [
            'source_id'    => $sourceId,
            'title'        => html_entity_decode($title),
            'url'          => $href,
            'description'  => $description,
            'author'       => null,
            'published_at' => $publishedAt,
        ];
    }

    return $articles;
}
?>

Defining Selectors Per Site

Each site gets its own selector configuration. Find the right XPath by inspecting the site in Chrome DevTools – right-click an article, Inspect, then identify the repeating container element:

<?php
// Selector configurations for different news sites
// Find these by inspecting the site in Chrome DevTools
$siteSelectors = [

    // books.toscrape.com - used as a safe scraping practice target
    'books.toscrape.com' => [
        'container'   => '//article[contains(@class,"product_pod")]',
        'title'       => './/h3/a',
        'link'        => './/h3/a',
        'description' => './/*[contains(@class,"price_color")]',
        'date'        => null,
    ],

    // Generic blog with standard article structure
    'example-blog.com' => [
        'container'   => '//article[contains(@class,"post")]',
        'title'       => './/h2[contains(@class,"entry-title")]/a',
        'link'        => './/h2[contains(@class,"entry-title")]/a',
        'description' => './/div[contains(@class,"entry-summary")]',
        'date'        => './/time[@datetime]',
    ],

    // News site with list layout
    'example-news.com' => [
        'container'   => '//li[contains(@class,"story-item")]',
        'title'       => './/h3/a',
        'link'        => './/h3/a',
        'description' => './/p[contains(@class,"summary")]',
        'date'        => './/span[contains(@class,"date")]',
    ],
];

// Scrape books.toscrape.com as a working example
$articles = scrape_news_page(
    "https://books.toscrape.com/",
    1,
    $siteSelectors['books.toscrape.com']
);

echo "Articles scraped: " . count($articles) . PHP_EOL . PHP_EOL;

foreach (array_slice($articles, 0, 3) as $article) {
    echo $article['title'] . PHP_EOL;
    echo "  URL: " . $article['url'] . PHP_EOL;
    echo PHP_EOL;
}
?>

Output:

Articles scraped: 20

A Light in the Attic
  URL: https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html

Tipping the Velvet
  URL: https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html

Soumission
  URL: https://books.toscrape.com/catalogue/soumission_998/index.html
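Before wiring a new site’s selectors into the aggregator, it helps to verify the XPath against a saved copy of the page. This standalone check (the HTML sample is made up to mirror the example-news.com layout above) confirms the container and title selectors match before anything touches the database:

```php
<?php
// Dry-run a candidate XPath selector pair against sample HTML.
// The markup here is illustrative - paste in a real page source
// saved from the browser to test actual selectors.
$html = '<html><body>
  <li class="story-item"><h3><a href="/a">First story</a></h3></li>
  <li class="story-item"><h3><a href="/b">Second story</a></h3></li>
</body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$items = $xpath->query('//li[contains(@class,"story-item")]');

echo "Matched: " . $items->length . PHP_EOL;

foreach ($items as $item) {
    $link = $xpath->query('.//h3/a', $item)->item(0);
    echo "  " . trim($link->textContent) . " -> " . $link->getAttribute('href') . PHP_EOL;
}
?>
```

A match count of zero means the container selector is wrong; a count with empty titles means the title selector is.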

Storing HTML Scrape Selectors in the Database

Instead of hardcoding selectors in PHP, store them in the sources table. This lets you add new sites without changing code:

-- Add selector columns to sources table
ALTER TABLE sources 
    ADD COLUMN selector_container  VARCHAR(500) DEFAULT NULL,
    ADD COLUMN selector_title      VARCHAR(500) DEFAULT NULL,
    ADD COLUMN selector_link       VARCHAR(500) DEFAULT NULL,
    ADD COLUMN selector_description VARCHAR(500) DEFAULT NULL,
    ADD COLUMN selector_date       VARCHAR(500) DEFAULT NULL;

<?php
// Insert an HTML source with selectors
function add_html_source($pdo, $name, $url, $selectors) {
    $sql = "INSERT INTO sources 
                (name, url, type, selector_container, selector_title, 
                 selector_link, selector_description, selector_date)
            VALUES 
                (:name, :url, 'html', :container, :title, 
                 :link, :description, :date)
            ON DUPLICATE KEY UPDATE
                selector_container   = VALUES(selector_container),
                selector_title       = VALUES(selector_title),
                selector_link        = VALUES(selector_link),
                selector_description = VALUES(selector_description),
                selector_date        = VALUES(selector_date)";

    $stmt = $pdo->prepare($sql);
    $stmt->execute([
        ':name'        => $name,
        ':url'         => $url,
        ':container'   => $selectors['container'],
        ':title'       => $selectors['title'],
        ':link'        => $selectors['link'],
        ':description' => $selectors['description'] ?? null,
        ':date'        => $selectors['date'] ?? null,
    ]);

    echo "HTML source added: $name" . PHP_EOL;
}

$pdo = get_db_connection();

add_html_source($pdo, 'Books to Scrape', 'https://books.toscrape.com/', [
    'container'   => '//article[contains(@class,"product_pod")]',
    'title'       => './/h3/a',
    'link'        => './/h3/a',
    'description' => './/*[contains(@class,"price_color")]',
    'date'        => null,
]);
?>

Output:

HTML source added: Books to Scrape

Loading Selectors From the Database

<?php
function scrape_html_source_from_db($pdo, $source) {
    $selectors = [
        'container'   => $source['selector_container'],
        'title'       => $source['selector_title'],
        'link'        => $source['selector_link'],
        'description' => $source['selector_description'],
        'date'        => $source['selector_date'],
    ];

    if (empty($selectors['container']) || empty($selectors['title'])) {
        echo "Missing selectors for source: {$source['name']}" . PHP_EOL;
        return [];
    }

    $articles = scrape_news_page(
        $source['url'],
        $source['id'],
        $selectors
    );

    // Update last_fetched timestamp
    $pdo->prepare(
        "UPDATE sources SET last_fetched = NOW() WHERE id = :id"
    )->execute([':id' => $source['id']]);

    return $articles;
}

// Load and scrape all HTML sources
$stmt    = $pdo->query("SELECT * FROM sources WHERE type = 'html' AND active = 1");
$sources = $stmt->fetchAll();

foreach ($sources as $source) {
    echo "Scraping: {$source['name']}..." . PHP_EOL;
    $articles = scrape_html_source_from_db($pdo, $source);
    echo "Found " . count($articles) . " articles." . PHP_EOL;
    sleep(2);
}
?>

Output:

Scraping: Books to Scrape...
Found 20 articles.
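
Selectors typed into the database are easy to get wrong, and DOMXPath::query() only fails at scrape time with a warning. As a sketch, a hypothetical validate_xpath_selector() helper (not part of the schema above) can check that an expression at least compiles before add_html_source() stores it:

```php
<?php
// Hypothetical helper: returns true if $expr compiles as an XPath
// expression, false otherwise. Calling it inside add_html_source()
// turns a silent runtime scraping failure into an immediate error.
function validate_xpath_selector(string $expr): bool {
    $dom = new DOMDocument();
    $dom->loadHTML('<html><body></body></html>');
    $xpath = new DOMXPath($dom);

    // DOMXPath::query() returns false (with a warning) for malformed
    // expressions - suppress the warning and check the result
    return @$xpath->query($expr) !== false;
}

var_dump(validate_xpath_selector('//article[contains(@class,"product_pod")]')); // bool(true)
var_dump(validate_xpath_selector('//article[contains(@class,'));                // bool(false)
?>
```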

Storing Articles and Preventing Duplicates

Every time the aggregator runs, it fetches the same feeds again. Without duplicate prevention, the database fills with identical articles on every run. The UNIQUE constraint on the URL column combined with INSERT IGNORE handles this cleanly – new articles get inserted, existing ones are skipped silently.
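
One caveat: the constraint only catches byte-for-byte identical URLs, so the same article reached with and without a tracking parameter counts as two rows. A hypothetical normalize_article_url() step applied before saving tightens this up – the list of parameters to strip is an assumption, adjust it for your sources:

```php
<?php
// Hypothetical pre-save step: strip tracking parameters and fragments
// so the same article always produces the same URL string, letting
// the UNIQUE constraint catch the duplicate.
function normalize_article_url(string $url): string {
    $parts = parse_url($url);
    if ($parts === false || empty($parts['host'])) {
        return $url; // leave unparseable URLs untouched
    }

    $query = '';
    if (!empty($parts['query'])) {
        parse_str($parts['query'], $params);
        // Drop common tracking parameters (utm_*, fbclid, gclid) - assumed list
        $params = array_filter($params, function ($key) {
            return strpos($key, 'utm_') !== 0
                && !in_array($key, ['fbclid', 'gclid'], true);
        }, ARRAY_FILTER_USE_KEY);
        if ($params) {
            $query = '?' . http_build_query($params);
        }
    }

    // Fragment intentionally dropped; lowercase the host
    return ($parts['scheme'] ?? 'https') . '://' . strtolower($parts['host'])
         . ($parts['path'] ?? '/') . $query;
}

echo normalize_article_url('https://Example.com/post?utm_source=rss&id=7#top');
// https://example.com/post?id=7
?>
```

Applying this to `$article['url']` before calling save_article() keeps the deduplication at the database level while making it less sensitive to feed-specific URL decoration.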

Inserting a Single Article

<?php
function save_article($pdo, $article) {
    // INSERT IGNORE skips the row if URL already exists
    // without throwing an error or updating existing data
    $sql = "INSERT IGNORE INTO articles 
                (source_id, title, url, description, author, published_at)
            VALUES 
                (:source_id, :title, :url, :description, :author, :published_at)";

    try {
        $stmt = $pdo->prepare($sql);
        $stmt->execute([
            ':source_id'   => $article['source_id'],
            ':title'       => $article['title'],
            ':url'         => $article['url'],
            ':description' => $article['description'],
            ':author'      => $article['author'],
            ':published_at'=> $article['published_at'],
        ]);

        // rowCount() returns 1 for new insert, 0 if skipped
        return $stmt->rowCount();

    } catch (PDOException $e) {
        echo "Save failed for '{$article['title']}': " . $e->getMessage() . PHP_EOL;
        return false;
    }
}

// Test it
$pdo = get_db_connection();

$article = [
    'source_id'    => 1,
    'title'        => 'PHP 8.4 Released With New Features',
    'url'          => 'https://example.com/php-84-released',
    'description'  => 'PHP 8.4 brings property hooks and other improvements.',
    'author'       => 'PHP Team',
    'published_at' => date('Y-m-d H:i:s'),
];

$result = save_article($pdo, $article);
echo $result === 1 ? "Article saved." . PHP_EOL : "Article already exists - skipped." . PHP_EOL;

// Run again - same article
$result = save_article($pdo, $article);
echo $result === 1 ? "Article saved." . PHP_EOL : "Article already exists - skipped." . PHP_EOL;
?>

Output on first run:

Article saved.

Output on second run:

Article already exists - skipped.

Saving a Batch of Articles

Saving one article at a time inside a loop issues one implicit transaction per article under MySQL's default autocommit mode. Wrapping the batch in a single explicit transaction is significantly faster on large feeds:

<?php
function save_articles_batch($pdo, $articles) {
    if (empty($articles)) return ['saved' => 0, 'skipped' => 0];

    $sql  = "INSERT IGNORE INTO articles 
                 (source_id, title, url, description, author, published_at)
             VALUES 
                 (:source_id, :title, :url, :description, :author, :published_at)";
    $stmt = $pdo->prepare($sql);

    $saved   = 0;
    $skipped = 0;

    try {
        $pdo->beginTransaction();

        foreach ($articles as $article) {
            $stmt->execute([
                ':source_id'    => $article['source_id'],
                ':title'        => $article['title'],
                ':url'          => $article['url'],
                ':description'  => $article['description'] ?? null,
                ':author'       => $article['author']      ?? null,
                ':published_at' => $article['published_at'] ?? null,
            ]);

            $stmt->rowCount() === 1 ? $saved++ : $skipped++;
        }

        $pdo->commit();

    } catch (PDOException $e) {
        $pdo->rollBack();
        echo "Batch save failed: " . $e->getMessage() . PHP_EOL;
        return ['saved' => 0, 'skipped' => 0];
    }

    return ['saved' => $saved, 'skipped' => $skipped];
}

// Test with a batch
$articles = [
    [
        'source_id'    => 1,
        'title'        => 'New Laravel Version Released',
        'url'          => 'https://example.com/laravel-release',
        'description'  => 'Laravel gets major performance improvements.',
        'author'       => 'Taylor Otwell',
        'published_at' => date('Y-m-d H:i:s'),
    ],
    [
        'source_id'    => 1,
        'title'        => 'PHP Security Advisory',
        'url'          => 'https://example.com/php-security',
        'description'  => 'Security patches released for PHP 8.x.',
        'author'       => 'PHP Security Team',
        'published_at' => date('Y-m-d H:i:s'),
    ],
];

$result = save_articles_batch($pdo, $articles);
echo "Saved: {$result['saved']} | Skipped: {$result['skipped']}" . PHP_EOL;
?>

Output on first run:

Saved: 2 | Skipped: 0

Output on second run:

Saved: 0 | Skipped: 2

Validating Articles Before Saving

Feeds occasionally contain malformed entries – empty titles, duplicate URLs within the same feed, or relative URLs that didn’t get resolved. Validate before inserting:

<?php
function validate_article($article) {
    $errors = [];

    if (empty(trim($article['title'] ?? ''))) {
        $errors[] = "Missing title";
    }

    if (empty($article['url'])) {
        $errors[] = "Missing URL";
    } elseif (!filter_var($article['url'], FILTER_VALIDATE_URL)) {
        $errors[] = "Invalid URL: {$article['url']}";
    }

    if (empty($article['source_id'])) {
        $errors[] = "Missing source ID";
    }

    return $errors;
}

function save_articles_validated($pdo, $articles) {
    $valid   = [];
    $invalid = 0;

    foreach ($articles as $article) {
        $errors = validate_article($article);

        if (!empty($errors)) {
            echo "Invalid article skipped: " . implode(', ', $errors) . PHP_EOL;
            $invalid++;
            continue;
        }

        $valid[] = $article;
    }

    $result = save_articles_batch($pdo, $valid);
    $result['invalid'] = $invalid;

    return $result;
}

// Test with mixed valid and invalid articles
$articles = [
    [
        'source_id'    => 1,
        'title'        => 'Valid Article Title',
        'url'          => 'https://example.com/valid-article',
        'description'  => 'A valid article.',
        'author'       => null,
        'published_at' => date('Y-m-d H:i:s'),
    ],
    [
        'source_id'    => 1,
        'title'        => '',  // empty title - invalid
        'url'          => 'https://example.com/no-title',
        'description'  => null,
        'author'       => null,
        'published_at' => null,
    ],
    [
        'source_id'    => 1,
        'title'        => 'Article With Bad URL',
        'url'          => '/relative-url-not-resolved',  // invalid
        'description'  => null,
        'author'       => null,
        'published_at' => null,
    ],
];

$result = save_articles_validated($pdo, $articles);
echo "Saved: {$result['saved']} | Skipped: {$result['skipped']} | Invalid: {$result['invalid']}" . PHP_EOL;
?>

Output:

Invalid article skipped: Missing title
Invalid article skipped: Invalid URL: /relative-url-not-resolved
Saved: 1 | Skipped: 0 | Invalid: 2

Cleaning Up Old Articles

Left unchecked, the articles table grows indefinitely. Add a cleanup function that removes articles older than a set number of days:

<?php
function cleanup_old_articles($pdo, $keepDays = 30) {
    $stmt = $pdo->prepare(
        "DELETE FROM articles 
         WHERE fetched_at < DATE_SUB(NOW(), INTERVAL :days DAY)"
    );

    $stmt->execute([':days' => $keepDays]);
    $deleted = $stmt->rowCount();

    echo "Cleaned up $deleted articles older than $keepDays days." . PHP_EOL;
    return $deleted;
}

// Keep only last 30 days of articles
cleanup_old_articles($pdo, 30);
?>

Output:

Cleaned up 143 articles older than 30 days.

Run this at the end of each aggregator job to keep the database size manageable. For a news aggregator checking multiple sources daily, 30 days of history is typically enough. Reduce to 7 days if disk space is a concern.

Filtering by Keyword and Sending Email Digests

Storing every article from every source creates noise. Keyword filtering keeps only articles relevant to your topics of interest. Combined with a daily email digest, it becomes a genuinely useful monitoring tool – you get a curated summary of what matters without visiting any sites manually.

Keyword Filtering

<?php
function filter_articles_by_keywords($articles, $keywords) {
    if (empty($keywords)) return $articles;

    return array_filter($articles, function($article) use ($keywords) {
        $searchText = strtolower(
            ($article['title']       ?? '') . ' ' .
            ($article['description'] ?? '')
        );

        foreach ($keywords as $keyword) {
            if (strpos($searchText, strtolower(trim($keyword))) !== false) {
                return true; // article matches at least one keyword
            }
        }

        return false; // no keywords matched
    });
}

// Test keyword filtering
$articles = [
    [
        'title'       => 'PHP 8.4 Performance Improvements',
        'description' => 'New benchmarks show 20% speed increase.',
        'url'         => 'https://example.com/php-84',
        'source_id'   => 1,
        'author'      => null,
        'published_at'=> date('Y-m-d H:i:s'),
    ],
    [
        'title'       => 'Laravel 11 Released',
        'description' => 'New features and improvements in Laravel 11.',
        'url'         => 'https://example.com/laravel-11',
        'source_id'   => 1,
        'author'      => null,
        'published_at'=> date('Y-m-d H:i:s'),
    ],
    [
        'title'       => 'JavaScript Framework Comparison 2026',
        'description' => 'React vs Vue vs Svelte performance test.',
        'url'         => 'https://example.com/js-frameworks',
        'source_id'   => 1,
        'author'      => null,
        'published_at'=> date('Y-m-d H:i:s'),
    ],
];

$keywords = ['php', 'laravel', 'scraping'];
$filtered = filter_articles_by_keywords($articles, $keywords);

echo "Total articles: "    . count($articles)           . PHP_EOL;
echo "After filtering: "   . count($filtered)           . PHP_EOL . PHP_EOL;

foreach ($filtered as $article) {
    echo $article['title'] . PHP_EOL;
}
?>

Output:

Total articles: 3
After filtering: 2

PHP 8.4 Performance Improvements
Laravel 11 Released
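
Plain substring matching has a known catch: a short keyword like "api" also matches "rapid" and "capital". If that produces too much noise, a word-boundary variant using preg_match is a possible refinement – a sketch, not part of the filter above:

```php
<?php
// Word-boundary variant: the keyword "api" matches "REST API" but not
// "rapid". Multi-word keywords such as "web scraping" still work
// because \b applies at both ends of the quoted phrase.
function keyword_matches_whole_word(string $text, string $keyword): bool {
    $pattern = '/\b' . preg_quote(trim($keyword), '/') . '\b/i';
    return preg_match($pattern, $text) === 1;
}

var_dump(keyword_matches_whole_word('Rapid prototyping tips', 'api')); // bool(false)
var_dump(keyword_matches_whole_word('New REST API released', 'api'));  // bool(true)
?>
```

Swapping the strpos() check inside filter_articles_by_keywords() for this function is a drop-in change.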

Storing Keywords in the Database

Hardcoding keywords in the script means changing code every time you add a topic. Store them in a separate table instead:

CREATE TABLE IF NOT EXISTS keywords (
    id         INT AUTO_INCREMENT PRIMARY KEY,
    keyword    VARCHAR(100) NOT NULL UNIQUE,
    active     TINYINT DEFAULT 1,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Insert keywords to monitor
INSERT INTO keywords (keyword) VALUES
('php'),
('laravel'),
('web scraping'),
('mysql'),
('api'),
('security');

<?php
function get_active_keywords($pdo) {
    $stmt = $pdo->query(
        "SELECT keyword FROM keywords WHERE active = 1 ORDER BY keyword ASC"
    );
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

function add_keyword($pdo, $keyword) {
    $stmt = $pdo->prepare(
        "INSERT IGNORE INTO keywords (keyword) VALUES (:keyword)"
    );
    $stmt->execute([':keyword' => strtolower(trim($keyword))]);
    echo "Keyword added: $keyword" . PHP_EOL;
}

$pdo      = get_db_connection();
$keywords = get_active_keywords($pdo);

echo "Monitoring keywords: " . implode(', ', $keywords) . PHP_EOL;
?>

Output:

Monitoring keywords: api, laravel, mysql, php, security, web scraping

Retrieving Filtered Articles From the Database

<?php
function get_recent_articles_by_keywords($pdo, $keywords, $hours = 24) {
    if (empty($keywords)) {
        // No keywords - return all recent articles
        $stmt = $pdo->prepare(
            "SELECT a.*, s.name as source_name 
             FROM articles a
             JOIN sources s ON s.id = a.source_id
             WHERE a.fetched_at >= DATE_SUB(NOW(), INTERVAL :hours HOUR)
             ORDER BY a.published_at DESC"
        );
        $stmt->execute([':hours' => $hours]);
        return $stmt->fetchAll();
    }

    // Build dynamic WHERE clause for keywords
    $conditions = [];
    $params     = [':hours' => $hours];

    foreach ($keywords as $index => $keyword) {
        // With PDO::ATTR_EMULATE_PREPARES disabled, the same named
        // placeholder cannot be reused in one statement, so title and
        // description each get their own parameter
        $titleKey = ":kw_title_$index";
        $descKey  = ":kw_desc_$index";

        $conditions[]      = "(a.title LIKE $titleKey OR a.description LIKE $descKey)";
        $params[$titleKey] = '%' . $keyword . '%';
        $params[$descKey]  = '%' . $keyword . '%';
    }

    $whereClause = implode(' OR ', $conditions);

    $sql = "SELECT a.*, s.name as source_name 
            FROM articles a
            JOIN sources s ON s.id = a.source_id
            WHERE a.fetched_at >= DATE_SUB(NOW(), INTERVAL :hours HOUR)
            AND ($whereClause)
            ORDER BY a.published_at DESC";

    $stmt = $pdo->prepare($sql);
    $stmt->execute($params);

    return $stmt->fetchAll();
}

// Get articles matching the active keywords from the last 24 hours
$keywords = get_active_keywords($pdo);
$articles = get_recent_articles_by_keywords($pdo, $keywords, 24);

echo "Relevant articles in last 24 hours: " . count($articles) . PHP_EOL . PHP_EOL;

foreach (array_slice($articles, 0, 3) as $article) {
    echo "[{$article['source_name']}] {$article['title']}" . PHP_EOL;
    echo "  Published: " . ($article['published_at'] ?? 'N/A') . PHP_EOL;
    echo "  URL: {$article['url']}" . PHP_EOL;
    echo PHP_EOL;
}
?>

Output:

Relevant articles in last 24 hours: 8

[Hacker News] PHP 8.4 Performance Benchmarks Released
  Published: 2026-05-03 09:00:00
  URL: https://example.com/php-84-benchmarks

[Laravel News] Laravel 11.x Security Patch
  Published: 2026-05-03 08:30:00
  URL: https://example.com/laravel-security

[Dev.to] Building a PHP Web Scraper in 2026
  Published: 2026-05-03 07:15:00
  URL: https://example.com/php-scraper-2026
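
One subtlety when building LIKE clauses from stored keywords: % and _ are wildcards inside LIKE, so a keyword containing either would match far more than intended. A small escaping helper – a suggested addition, applied when building $params in get_recent_articles_by_keywords() – keeps keywords literal:

```php
<?php
// Escape LIKE wildcards in a keyword before wrapping it in %...%,
// so a stored keyword like "100%_safe" matches literally instead of
// acting as a pattern. MySQL's default LIKE escape character is "\".
function escape_like(string $keyword): string {
    return str_replace(['\\', '%', '_'], ['\\\\', '\\%', '\\_'], $keyword);
}

echo '%' . escape_like('100%_safe') . '%';
// %100\%\_safe%
?>
```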

Building the Email Digest

<?php
function build_digest_email($articles, $dateRange = 'last 24 hours') {
    if (empty($articles)) {
        return [
            'subject' => 'PHP News Digest - No new articles',
            'body'    => "No relevant articles found in the $dateRange.",
            'html'    => "<p>No relevant articles found in the $dateRange.</p>",
        ];
    }

    // Group articles by source
    $bySource = [];
    foreach ($articles as $article) {
        $bySource[$article['source_name']][] = $article;
    }

    $date    = date('D, d M Y');
    $count   = count($articles);
    $subject = "PHP News Digest - $count articles - $date";

    // Plain text version
    $textBody = "PHP News Digest - $date\n";
    $textBody .= str_repeat('=', 50) . "\n\n";
    $textBody .= "$count new articles matching your keywords.\n\n";

    foreach ($bySource as $sourceName => $sourceArticles) {
        $textBody .= strtoupper($sourceName) . "\n";
        $textBody .= str_repeat('-', strlen($sourceName)) . "\n";

        foreach ($sourceArticles as $article) {
            $textBody .= "- {$article['title']}\n";
            $textBody .= "  {$article['url']}\n";
            if (!empty($article['description'])) {
                $textBody .= "  " . substr($article['description'], 0, 120) . "...\n";
            }
            $textBody .= "\n";
        }
    }

    // HTML version
    $htmlBody  = "<!DOCTYPE html><html><body style='font-family:Arial,sans-serif;max-width:700px;margin:0 auto;padding:20px;'>";
    $htmlBody .= "<h2 style='color:#2c3e50;border-bottom:2px solid #3498db;padding-bottom:10px;'>PHP News Digest - $date</h2>";
    $htmlBody .= "<p style='color:#666;'>$count new articles matching your keywords.</p>";

    foreach ($bySource as $sourceName => $sourceArticles) {
        $htmlBody .= "<h3 style='color:#3498db;margin-top:25px;'>" . htmlspecialchars($sourceName) . "</h3>";

        foreach ($sourceArticles as $article) {
            $title       = htmlspecialchars($article['title']);
            $url         = htmlspecialchars($article['url']);
            $description = htmlspecialchars($article['description'] ?? '');
            $published   = $article['published_at'] 
                           ? date('d M Y H:i', strtotime($article['published_at'])) 
                           : '';

            $htmlBody .= "<div style='border-left:3px solid #3498db;padding:10px 15px;margin:10px 0;background:#f9f9f9;'>";
            $htmlBody .= "<p style='margin:0 0 5px;'><a href='$url' style='color:#2c3e50;font-weight:bold;text-decoration:none;'>$title</a></p>";

            if ($description) {
                $htmlBody .= "<p style='margin:5px 0;color:#666;font-size:13px;'>" . substr($description, 0, 150) . "...</p>";
            }

            if ($published) {
                $htmlBody .= "<p style='margin:5px 0 0;color:#999;font-size:12px;'>Published: $published</p>";
            }

            $htmlBody .= "</div>";
        }
    }

    $htmlBody .= "<p style='color:#999;font-size:12px;margin-top:30px;border-top:1px solid #eee;padding-top:10px;'>";
    $htmlBody .= "PHP News Aggregator - Generated " . date('Y-m-d H:i:s') . "</p>";
    $htmlBody .= "</body></html>";

    return [
        'subject' => $subject,
        'body'    => $textBody,
        'html'    => $htmlBody,
    ];
}

function send_digest($email, $articles) {
    $digest  = build_digest_email($articles);
    $headers = implode("\r\n", [
        'From: PHP News Aggregator <aggregator@yoursite.com>',
        'Reply-To: aggregator@yoursite.com',
        'MIME-Version: 1.0',
        'Content-Type: text/html; charset=UTF-8',
    ]);

    $sent = mail($email, $digest['subject'], $digest['html'], $headers);

    if ($sent) {
        echo "Digest sent to $email - {$digest['subject']}" . PHP_EOL;
    } else {
        echo "Failed to send digest email." . PHP_EOL;
    }

    return $sent;
}

// Usage
$pdo      = get_db_connection();
$keywords = get_active_keywords($pdo);
$articles = get_recent_articles_by_keywords($pdo, $keywords, 24);

send_digest('your@email.com', $articles);
?>

Output:

Digest sent to your@email.com - PHP News Digest - 8 articles - Sat, 03 May 2026
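
build_digest_email() produces both a plain-text and an HTML body, but send_digest() only ships the HTML one. If you want text-only clients covered, the two bodies can be combined into a multipart/alternative message. A sketch of the MIME assembly – the boundary scheme is one reasonable choice, not a requirement:

```php
<?php
// Combine the text and HTML bodies into one multipart/alternative
// message body. Clients render the last part they support, so the
// HTML part comes after the text part.
function build_multipart_body(string $text, string $html, string $boundary): string {
    return "--$boundary\r\n"
         . "Content-Type: text/plain; charset=UTF-8\r\n\r\n"
         . $text . "\r\n"
         . "--$boundary\r\n"
         . "Content-Type: text/html; charset=UTF-8\r\n\r\n"
         . $html . "\r\n"
         . "--$boundary--\r\n";
}

$boundary = 'digest_' . bin2hex(random_bytes(8)); // unique per message
$body     = build_multipart_body('Plain digest', '<p>HTML digest</p>', $boundary);
$headers  = implode("\r\n", [
    'From: PHP News Aggregator <aggregator@yoursite.com>',
    'MIME-Version: 1.0',
    "Content-Type: multipart/alternative; boundary=\"$boundary\"",
]);
// mail($email, $digest['subject'], $body, $headers);
?>
```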

Complete PHP News Aggregator Script

Save this as aggregator.php and run it with php aggregator.php. It reads all active sources from the database, fetches RSS feeds and scrapes HTML sources, saves new articles, filters by keywords, and sends the daily digest.

<?php
// ============================================
// PHP News Aggregator - Complete Script
// ============================================

set_time_limit(0);
error_reporting(E_ALL);
ini_set('log_errors', 1);
ini_set('error_log', __DIR__ . '/aggregator_errors.log');

$logFile      = __DIR__ . '/aggregator.log';
$digestEmail  = 'your@email.com';
$startTime    = microtime(true);

// ---- Logging ----
function log_message($message) {
    global $logFile;
    $entry = '[' . date('Y-m-d H:i:s') . '] ' . $message . PHP_EOL;
    file_put_contents($logFile, $entry, FILE_APPEND);
    echo $entry;
}

// ---- Database ----
function get_db_connection() {
    try {
        return new PDO(
            "mysql:host=localhost;dbname=news_aggregator;charset=utf8mb4",
            'your_username',
            'your_password',
            [
                PDO::ATTR_ERRMODE            => PDO::ERRMODE_EXCEPTION,
                PDO::ATTR_DEFAULT_FETCH_MODE => PDO::FETCH_ASSOC,
                PDO::ATTR_EMULATE_PREPARES   => false,
            ]
        );
    } catch (PDOException $e) {
        log_message("DB failed: " . $e->getMessage());
        return null;
    }
}

// ---- Fetch ----
function fetch_url($url) {
    $ch = curl_init();

    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_TIMEOUT        => 30,
        CURLOPT_ENCODING       => '',
        CURLOPT_HTTPHEADER     => [
            'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
            'Accept: application/rss+xml, application/xml, text/html, */*',
            'Accept-Language: en-US,en;q=0.5',
        ],
    ]);

    $response = curl_exec($ch);
    $errno    = curl_errno($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    // Treat transport errors and any non-2xx status as a failure
    if ($errno || $httpCode < 200 || $httpCode >= 300) {
        return false;
    }

    return $response;
}

// ---- RSS Parser ----
function parse_feed($xml, $sourceId) {
    if (!$xml) return [];

    libxml_use_internal_errors(true);
    $feed = simplexml_load_string($xml);
    libxml_clear_errors();

    if (!$feed) return [];

    $rootName = strtolower($feed->getName());
    $articles = [];
    $items    = $rootName === 'feed'
                ? $feed->xpath('//*[local-name()="entry"]')
                : ($feed->channel->item ?? []);

    foreach ($items as $item) {
        if ($rootName === 'feed') {
            $title   = (string) ($item->title   ?? '');
            $desc    = (string) ($item->summary ?? $item->content ?? '');
            $author  = (string) ($item->author->name ?? '');
            $pubDate = (string) ($item->updated ?? $item->published ?? '');
            $links   = $item->xpath('*[local-name()="link"][@rel="alternate"]')
                    ?: $item->xpath('*[local-name()="link"]');
            $url     = $links ? (string) $links[0]['href'] : '';
        } else {
            $title   = (string) ($item->title       ?? '');
            $url     = (string) ($item->link        ?? '');
            $desc    = (string) ($item->description ?? '');
            $author  = (string) ($item->author      ?? '');
            $pubDate = (string) ($item->pubDate     ?? '');
        }

        $title = html_entity_decode(strip_tags(trim($title)));
        $desc  = html_entity_decode(strip_tags(trim($desc)));
        $desc  = substr($desc, 0, 500);

        $publishedAt = null;
        if ($pubDate) {
            $ts          = strtotime($pubDate);
            $publishedAt = $ts ? date('Y-m-d H:i:s', $ts) : null;
        }

        if (empty($title) || empty(trim($url))) continue;

        $articles[] = [
            'source_id'    => $sourceId,
            'title'        => $title,
            'url'          => trim($url),
            'description'  => $desc,
            'author'       => trim($author),
            'published_at' => $publishedAt,
        ];
    }

    return $articles;
}

// ---- HTML Scraper ----
function scrape_html_source($html, $sourceId, $source) {
    if (!$html) return [];

    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath    = new DOMXPath($dom);
    $items    = $xpath->query($source['selector_container']);
    $articles = [];
    $baseUrl  = parse_url($source['url'], PHP_URL_SCHEME) . '://' 
              . parse_url($source['url'], PHP_URL_HOST);

    foreach ($items as $item) {
        $titleNode = $xpath->query($source['selector_title'], $item)->item(0);
        $linkNode  = $xpath->query($source['selector_link'],  $item)->item(0);

        $title = $titleNode ? trim($titleNode->textContent) : null;
        $href  = $linkNode  ? $linkNode->getAttribute('href') : null;

        if ($href && strpos($href, 'http') !== 0) {
            // Insert the missing slash for relative paths like
            // "catalogue/page.html" as well as root-relative "/page"
            $href = $baseUrl . '/' . ltrim($href, '/');
        }

        if (empty($title) || empty($href)) continue;

        $articles[] = [
            'source_id'    => $sourceId,
            'title'        => html_entity_decode($title),
            'url'          => $href,
            'description'  => null,
            'author'       => null,
            'published_at' => null,
        ];
    }

    return $articles;
}

// ---- Save ----
function save_articles_batch($pdo, $articles) {
    if (empty($articles)) return ['saved' => 0, 'skipped' => 0];

    $sql  = "INSERT IGNORE INTO articles
                 (source_id, title, url, description, author, published_at)
             VALUES
                 (:source_id, :title, :url, :description, :author, :published_at)";
    $stmt = $pdo->prepare($sql);

    $saved = $skipped = 0;

    try {
        $pdo->beginTransaction();

        foreach ($articles as $article) {
            if (empty($article['title']) || empty($article['url'])) continue;
            if (!filter_var($article['url'], FILTER_VALIDATE_URL)) continue;

            $stmt->execute([
                ':source_id'    => $article['source_id'],
                ':title'        => $article['title'],
                ':url'          => $article['url'],
                ':description'  => $article['description'] ?? null,
                ':author'       => $article['author']      ?? null,
                ':published_at' => $article['published_at'] ?? null,
            ]);

            $stmt->rowCount() === 1 ? $saved++ : $skipped++;
        }

        $pdo->commit();
    } catch (PDOException $e) {
        $pdo->rollBack();
        log_message("Batch save failed: " . $e->getMessage());
    }

    return ['saved' => $saved, 'skipped' => $skipped];
}

// ---- Digest ----
function send_digest($email, $articles) {
    if (empty($articles)) {
        log_message("No new articles for digest - skipping email.");
        return false;
    }

    $date    = date('D, d M Y');
    $count   = count($articles);
    $subject = "PHP News Digest - $count articles - $date";

    $html  = "<!DOCTYPE html><html><body style='font-family:Arial,sans-serif;max-width:700px;margin:0 auto;padding:20px;'>";
    $html .= "<h2 style='color:#2c3e50;border-bottom:2px solid #3498db;padding-bottom:10px;'>PHP News Digest - $date</h2>";
    $html .= "<p style='color:#666;'>$count new articles matching your keywords.</p>";

    $bySource = [];
    foreach ($articles as $article) {
        $bySource[$article['source_name']][] = $article;
    }

    foreach ($bySource as $sourceName => $sourceArticles) {
        $html .= "<h3 style='color:#3498db;margin-top:25px;'>" . htmlspecialchars($sourceName) . "</h3>";

        foreach ($sourceArticles as $article) {
            $title = htmlspecialchars($article['title']);
            $url   = htmlspecialchars($article['url']);
            $desc  = htmlspecialchars($article['description'] ?? '');

            $html .= "<div style='border-left:3px solid #3498db;padding:10px 15px;margin:10px 0;background:#f9f9f9;'>";
            $html .= "<p style='margin:0 0 5px;'><a href='$url' style='color:#2c3e50;font-weight:bold;'>$title</a></p>";
            if ($desc) {
                $html .= "<p style='color:#666;font-size:13px;margin:5px 0;'>" . substr($desc, 0, 150) . "...</p>";
            }
            $html .= "</div>";
        }
    }

    $html .= "<p style='color:#999;font-size:12px;margin-top:30px;'>PHP News Aggregator - " . date('Y-m-d H:i:s') . "</p>";
    $html .= "</body></html>";

    $headers = "From: aggregator@yoursite.com\r\nMIME-Version: 1.0\r\nContent-Type: text/html; charset=UTF-8";

    return mail($email, $subject, $html, $headers);
}

// ============================================
// MAIN LOOP
// ============================================

log_message("Aggregator started.");

$pdo = get_db_connection();

if (!$pdo) exit("Cannot continue without database." . PHP_EOL);

// Load sources
$sources  = $pdo->query("SELECT * FROM sources WHERE active = 1")->fetchAll();
$keywords = $pdo->query("SELECT keyword FROM keywords WHERE active = 1")->fetchAll(PDO::FETCH_COLUMN);

log_message("Sources: " . count($sources) . " | Keywords: " . implode(', ', $keywords));

$stats = ['sources' => 0, 'articles' => 0, 'saved' => 0, 'skipped' => 0, 'failed' => 0];

foreach ($sources as $source) {
    log_message("Processing: {$source['name']}");
    $stats['sources']++;

    $content  = fetch_url($source['type'] === 'rss' ? $source['feed_url'] : $source['url']);

    if (!$content) {
        log_message("  Failed to fetch - skipping.");
        $stats['failed']++;
        continue;
    }

    $articles = $source['type'] === 'rss'
                ? parse_feed($content, $source['id'])
                : scrape_html_source($content, $source['id'], $source);

    if (!empty($keywords)) {
        $articles = array_values(array_filter($articles, function($article) use ($keywords) {
            $text = strtolower(($article['title'] ?? '') . ' ' . ($article['description'] ?? ''));
            foreach ($keywords as $kw) {
                if (strpos($text, strtolower($kw)) !== false) return true;
            }
            return false;
        }));
    }

    $result = save_articles_batch($pdo, $articles);

    $stats['articles'] += count($articles);
    $stats['saved']    += $result['saved'];
    $stats['skipped']  += $result['skipped'];

    log_message("  Found: " . count($articles) . " | Saved: {$result['saved']} | Skipped: {$result['skipped']}");

    // Update last_fetched
    $pdo->prepare("UPDATE sources SET last_fetched = NOW() WHERE id = :id")
        ->execute([':id' => $source['id']]);

    sleep(2);
}

// Send digest
$newArticles = $pdo->prepare(
    "SELECT a.*, s.name as source_name FROM articles a
     JOIN sources s ON s.id = a.source_id
     WHERE a.fetched_at >= DATE_SUB(NOW(), INTERVAL 24 HOUR)
     ORDER BY a.published_at DESC"
);
$newArticles->execute();
$digest = $newArticles->fetchAll();

send_digest($digestEmail, $digest);

// Cleanup old articles
$pdo->query("DELETE FROM articles WHERE fetched_at < DATE_SUB(NOW(), INTERVAL 30 DAY)");

$duration = round(microtime(true) - $startTime, 2);

$summary = "
============================================
Aggregator Complete: " . date('Y-m-d H:i:s') . "
Duration:  {$duration}s
Sources:   {$stats['sources']}
Articles:  {$stats['articles']}
Saved:     {$stats['saved']}
Skipped:   {$stats['skipped']}
Failed:    {$stats['failed']}
============================================";

log_message($summary);
?>

Output:

[2026-05-03 09:00:01] Aggregator started.
[2026-05-03 09:00:01] Sources: 6 | Keywords: api, laravel, mysql, php, security, web scraping
[2026-05-03 09:00:01] Processing: CSS-Tricks
[2026-05-03 09:00:02] Found: 4 | Saved: 4 | Skipped: 0
[2026-05-03 09:00:02] Processing: Dev.to
[2026-05-03 09:00:04] Found: 7 | Saved: 7 | Skipped: 0
[2026-05-03 09:00:04] Processing: Hacker News
[2026-05-03 09:00:06] Found: 3 | Saved: 3 | Skipped: 0
...
============================================
Aggregator Complete: 2026-05-03 09:00:19
Duration:  18.4s
Sources:   6
Articles:  31
Saved:     24
Skipped:   7
Failed:    0
============================================

Cron Automation

Run the aggregator daily at 7am by adding this to your crontab with crontab -e:

0 7 * * * /usr/bin/php /var/www/html/aggregator.php >> /var/www/html/aggregator_cron.log 2>&1

Test the command manually first:

/usr/bin/php /var/www/html/aggregator.php

For the complete cron setup guide including how to find your PHP binary path, configure cPanel cron jobs, and debug when jobs don’t run, see the PHP cron job guide.
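
One practical safeguard for cron runs: if a run overshoots its schedule (a slow source, a long feed list), cron starts a second copy that scrapes the same sources concurrently. A non-blocking flock() guard at the top of aggregator.php prevents that – the lock file path here is an assumption:

```php
<?php
// Overlap guard for cron: take an exclusive, non-blocking lock on a
// dedicated lock file. If a previous run still holds it, exit instead
// of scraping the same sources twice. The lock is released
// automatically when the process exits.
$lockHandle = fopen(__DIR__ . '/aggregator.lock', 'c'); // assumed path

if (!$lockHandle || !flock($lockHandle, LOCK_EX | LOCK_NB)) {
    echo "Another aggregator run is still active - exiting." . PHP_EOL;
    exit(0);
}

// ... rest of aggregator.php runs here ...
?>
```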

Frequently Asked Questions

What is a PHP news aggregator?

A PHP news aggregator is a script that automatically collects articles from multiple news sources – RSS feeds and websites – and stores them in one place. Instead of visiting each site manually, the aggregator runs on a schedule, fetches new content, filters it by keywords, and optionally emails you a digest of what’s relevant.

What is the difference between RSS and HTML scraping for news aggregation?

RSS feeds are structured XML documents sites publish specifically for content syndication. They’re faster to parse, less likely to break when a site redesigns, and explicitly authorized by the publisher. HTML scraping is a fallback – it extracts the same data from the raw page HTML but requires XPath selectors that can break any time the site changes its layout. Always check for an RSS feed before writing HTML scraping code.
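
The autodiscovery link tag mentioned above can also be found programmatically. As a sketch, a hypothetical discover_feed_urls() helper pulls the declared feed URLs out of a page's head section:

```php
<?php
// Find feed URLs declared via autodiscovery <link> tags - the same
// tags browsers and feed readers look for in a page's <head>.
function discover_feed_urls(string $html): array {
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query(
        '//link[@rel="alternate"][@type="application/rss+xml" or @type="application/atom+xml"]'
    );

    $feeds = [];
    foreach ($nodes as $node) {
        $href = $node->getAttribute('href');
        if ($href !== '') {
            $feeds[] = $href;
        }
    }
    return $feeds;
}

$html = '<html><head>
    <link rel="alternate" type="application/rss+xml" href="https://example.com/feed/">
</head><body></body></html>';

echo discover_feed_urls($html)[0] . PHP_EOL;
// https://example.com/feed/
?>
```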

How do I add a new news source to the aggregator?

For RSS sources, a single call to the add_source() helper is enough:

<?php
$pdo = get_db_connection();

add_source(
    $pdo,
    'New Site Name',
    'https://newsite.com',
    'https://newsite.com/feed',
    'rss'
);
?>

The aggregator picks it up automatically on the next run – no code changes needed.

How do I stop getting duplicate articles?

The UNIQUE constraint on the url column combined with INSERT IGNORE prevents duplicates at the database level. Every time the aggregator runs it tries to insert all fetched articles – existing URLs are silently skipped, new ones are saved. No manual deduplication required.
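Here is the pattern in miniature. This standalone demo uses an in-memory SQLite database purely so it runs without a MySQL server; the table and column names assume the articles schema from earlier in the guide, and the only syntax difference is that SQLite spells MySQL's INSERT IGNORE as INSERT OR IGNORE:

```php
<?php
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// The UNIQUE constraint on url is what blocks duplicates.
$pdo->exec('CREATE TABLE articles (
    id    INTEGER PRIMARY KEY,
    title TEXT,
    url   TEXT UNIQUE
)');

// MySQL: INSERT IGNORE INTO ...; SQLite: INSERT OR IGNORE INTO ...
$stmt = $pdo->prepare('INSERT OR IGNORE INTO articles (title, url) VALUES (?, ?)');
$stmt->execute(['First run',  'https://example.com/a']);
$stmt->execute(['Second run', 'https://example.com/a']); // silently skipped

echo $pdo->query('SELECT COUNT(*) FROM articles')->fetchColumn(), PHP_EOL; // 1
?>
```

Running the inserts twice leaves exactly one row, which is why re-running the whole aggregator is safe.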

Can I aggregate news from sites that block scraping?

RSS feeds are explicitly provided for aggregation – no blocking concerns. For HTML scraping, add proper headers, delays between requests, and a cookie jar. The avoiding blocks guide covers all seven techniques with working code. If a site consistently blocks you and has no RSS feed, check whether it has a public API – many news sites do.
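A minimal sketch of those three measures applied to the HTML fallback fetcher. The URL, cookie-jar path, and delay range are placeholder assumptions to tune per site, not values from the aggregator above:

```php
<?php
// Polite fetch: realistic headers, persistent cookies, and a pause
// between requests so the crawl doesn't hammer the target site.
function polite_fetch($url)
{
    $cookieJar = '/tmp/aggregator_cookies.txt'; // assumed path

    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        // Browser-like headers so the request doesn't look like a bare script.
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
        CURLOPT_HTTPHEADER     => [
            'Accept: text/html,application/xhtml+xml',
            'Accept-Language: en-US,en;q=0.9',
        ],
        // Cookie jar so session cookies persist across requests.
        CURLOPT_COOKIEJAR      => $cookieJar,
        CURLOPT_COOKIEFILE     => $cookieJar,
        CURLOPT_TIMEOUT        => 15,
    ]);
    $html = curl_exec($ch);
    curl_close($ch);

    // Random delay before the next request in the loop.
    sleep(rand(2, 5));

    return $html; // false on failure
}

// Usage: $html = polite_fetch('https://example.com/news');
?>
```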

How do I make the email digest go to multiple recipients?

Pass a comma-separated list of addresses as mail()'s first argument:

<?php
$recipients = ['first@email.com', 'second@email.com'];
$to         = implode(', ', $recipients);

mail($to, $subject, $html, $headers);
?>

For more reliable delivery use PHPMailer with SMTP rather than PHP’s built-in mail(). The PHP price tracker guide covers PHPMailer SMTP setup in detail – the same configuration works here.
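As a rough sketch of what that PHPMailer setup looks like (after composer require phpmailer/phpmailer), with placeholder SMTP host and credentials that you would replace with your provider's values:

```php
<?php
require 'vendor/autoload.php';

use PHPMailer\PHPMailer\PHPMailer;

$mail = new PHPMailer(true); // throw exceptions on failure

// SMTP configuration: host, port, and credentials are placeholders.
$mail->isSMTP();
$mail->Host       = 'smtp.example.com';
$mail->SMTPAuth   = true;
$mail->Username   = 'digest@example.com';
$mail->Password   = 'your-smtp-password';
$mail->SMTPSecure = PHPMailer::ENCRYPTION_STARTTLS;
$mail->Port       = 587;

$mail->setFrom('digest@example.com', 'News Digest');
foreach (['first@email.com', 'second@email.com'] as $recipient) {
    $mail->addAddress($recipient);
}

$mail->isHTML(true);
$mail->Subject = $subject; // digest subject built earlier
$mail->Body    = $html;    // digest HTML built earlier
$mail->send();
?>
```

No test is shown here since the block is SMTP configuration that requires an external package and a live mail server.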


Summary

A PHP news aggregator built on RSS parsing and targeted HTML scraping gives you a fully automated news monitoring system in a few hundred lines of PHP. The key components:

  • RSS first – SimpleXML parses RSS and Atom feeds in seconds. Always check for a feed before writing HTML scraping code.
  • HTML fallback – cURL and DOMDocument handle sites without RSS. Store selectors in the database so adding new sources doesn’t require code changes.
  • INSERT IGNORE for duplicates – combined with a UNIQUE constraint on URL, this makes re-running the aggregator safe and idempotent.
  • Keyword filtering – store keywords in the database, filter at query time. Only relevant articles reach the digest.
  • Daily email digest – formatted HTML email grouped by source. Actionable without visiting the aggregator directly.
  • Cron automation – one crontab entry handles daily execution, logging, and cleanup without manual intervention.

For the MySQL storage layer this aggregator uses, the PHP MySQL scraping guide covers PDO connections, prepared statements, and duplicate handling in full detail. For the cURL fetching and DOMDocument parsing underneath everything, the PHP cURL web scraping complete guide covers every option and pattern used here.

Note: This tutorial is for educational purposes. Always respect website terms before scraping.
