
A pipe-based news article scraping and metadata extraction library

Project description

PipeScraper 🔗

A pipe-based news article scraping and metadata extraction library for Python

Python 3.8+ License: MIT

pipescraper provides a verb-based interface for scraping news websites and extracting structured article metadata with the intuitive pipe (>>) operator. Built on trafilatura, with supplementary publication-time extraction via newspaper4k, it combines powerful extraction capabilities with an elegant, chainable API.

from pipescraper import *

# Your scraping pipeline reads like a story
result = ("https://www.bbc.com/news"   # Replace with your target URL
    >> FetchLinks(max_links=10) 
    >> ExtractArticles() 
    >> FilterArticles(lambda a: a.language == 'en')
    >> ToDataFrame() 
    >> SaveAs("articles.csv")
)

💡 How to read >>: Read the >> operator as "pipe to" or "then". For example, the code above reads: "Take the URL, then fetch links, then extract articles, then filter for English articles, then convert to a DataFrame, then save as CSV."


🌟 Why pipescraper?

Readability First

# โŒ Traditional logic is nested, hard to read, and error-prone
urls = fetch_links("https://www.bbc.com/news", max_links=10)  # Replace with your target URL
articles = []
for url in urls:
    time.sleep(1)
    art = extract_article(url)
    if art.language == 'en' and art.author:
        articles.append(art)
save_to_csv(articles, "articles.csv")

# ✅ pipescraper: Clear and intuitive
("https://www.bbc.com/news"   # Replace with your target URL
    >> FetchLinks(max_links=10) 
    >> ExtractArticles(delay=1.0) 
    >> FilterArticles(lambda a: a.language == 'en' and bool(a.author)) 
    >> ToDataFrame() 
    >> SaveAs("articles.csv")
)

Key Features

  • 🔗 Pipe-based syntax - Chain operations naturally with the >> operator
  • 📰 Comprehensive metadata extraction - Extract URL, source, title, text, author, dates, language, and more
  • ⏰ Publication time parsing - Supplement trafilatura's date extraction with full timestamp support
  • 🤖 Respectful scraping - Built-in robots.txt compliance and request throttling
  • 🌐 Google News search - Search for keywords or sentences across regions and time periods ⭐ NEW
  • 🧠 Automatic URL decoding - Parallel batchexecute decoder for Google News (bypasses the consent wall) ⭐ NEW
  • 📊 Pandas integration - Export to DataFrame with CSV, JSON, and Excel support
  • 🎯 Flexible filtering - Filter articles by language, author, content length, or custom criteria
  • 🧹 Automatic deduplication - Remove duplicate articles by URL
  • ⚡ Parallel scraping - Speed up batch extraction with multi-threaded workers
  • 🔧 PipeFrame integration - Use all PipeFrame verbs (select, filter, mutate, arrange, etc.) for data manipulation
  • 📈 PipePlotly integration - Create visualizations with Grammar of Graphics verbs such as ggplot, geom_bar, and geom_point
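
The "respectful scraping" bullet above can be illustrated with the standard library's robots.txt parser. This is a sketch of the general mechanism only; pipescraper's internal handling may differ, and the rules below are invented for illustration:

```python
# Illustrative only: how robots.txt compliance works in general, using the
# standard library. The rules here are made up, not from a real site.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# An ordinary article path is allowed; the disallowed path is not.
print(rp.can_fetch("MyBot/1.0", "https://example.com/news/story"))  # True
print(rp.can_fetch("MyBot/1.0", "https://example.com/private/x"))   # False
```

A scraper that honors this check (plus a delay between requests) is what "respectful" means in practice.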

🚀 Quick Start

Installation

# Basic installation
pip install pipescraper

# Install with all optional integrations (PipeFrame & PipePlotly)
pip install pipescraper[all]

Or install from source:

git clone https://github.com/Yasser03/pipescraper.git
cd pipescraper
pip install -e .

Hello pipescraper!

from pipescraper import FetchLinks, ExtractArticles, ToDataFrame, SaveAs

# Simple pipeline: URL → Links → Articles → DataFrame → CSV
df = ("https://www.bbc.com/news"   # Replace with your target URL
      >> FetchLinks(max_links=10) 
      >> ExtractArticles() 
      >> ToDataFrame() 
      >> SaveAs("articles.csv"))

print(f"Scraped {len(df)} articles successfully! 🎉")

📚 Core Concepts

The Pipe Operator >>

Chain operations naturally without nested function calls or loops:

# PipeScraper approach (reads like a recipe)
articles = ("https://www.bbc.com/news"  # Replace with your target URL
    >> FetchLinks(max_links=20)
    >> ExtractArticles(skip_errors=True)
    >> Deduplicate()
    >> LimitArticles(10)
)
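
Under the hood, a pipe operator like this is plain Python operator overloading. A minimal sketch of the technique, with invented Verb/Upper/Exclaim names for illustration (not pipescraper's actual classes):

```python
# Minimal sketch of a >> pipeline; not pipescraper's actual implementation.
class Verb:
    def __init__(self, fn):
        self.fn = fn

    def __rrshift__(self, value):
        # `value >> verb` falls back to verb.__rrshift__(value) when the
        # left operand (e.g. a plain str) does not define __rshift__.
        return self.fn(value)

Upper = Verb(str.upper)
Exclaim = Verb(lambda s: s + "!")

print("hello" >> Upper >> Exclaim)  # HELLO!
```

Because a plain string can start the chain via the reflected `__rrshift__` hook, no wrapper call is needed on the left-hand side.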

Core Verbs

Verb              | Purpose                             | Example
FetchLinks()      | Fetch article links from a base URL | >> FetchLinks(max_links=50, delay=1.0)
ExtractArticles() | Extract metadata from URLs          | >> ExtractArticles(workers=5, extract_time=True)
FetchGoogleNews() | Search Google News                  | >> FetchGoogleNews(search="SpaceX", period="1d")
FilterArticles()  | Filter by criteria                  | >> FilterArticles(lambda a: a.language == 'en')
LimitArticles()   | Limit the number of articles        | >> LimitArticles(10)
Deduplicate()     | Remove duplicates                   | >> Deduplicate()
ToDataFrame()     | Convert to a pandas DataFrame       | >> ToDataFrame(include_text=True)
ToPipeFrame()     | Convert to a PipeFrame              | >> ToPipeFrame()
SaveAs()          | Save to a file                      | >> SaveAs("output.csv")
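
As a rough illustration of what Deduplicate() does conceptually (a hypothetical helper, not the library's code), URL-based deduplication keeps the first occurrence of each URL:

```python
# Hypothetical helper showing the idea behind Deduplicate(); not library code.
def dedupe_by_url(articles):
    seen = set()
    unique = []
    for article in articles:
        if article["url"] not in seen:
            seen.add(article["url"])
            unique.append(article)  # first occurrence wins
    return unique

articles = [{"url": "https://a"}, {"url": "https://b"}, {"url": "https://a"}]
print([a["url"] for a in dedupe_by_url(articles)])  # ['https://a', 'https://b']
```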

🔥 Advanced Features

Google News Integration & Decoding

Search Google News for specific topics, using a high-performance parallel decoder that resolves consent-gated URLs automatically.

# Search for multiple related topics
search_articles = (FetchGoogleNews(
                        search=["latest AI breakthroughs", "quantum computing news"],
                        period="7d",
                        max_results=20) 
                   >> ExtractArticles(workers=5) 
                   >> ToDataFrame())

Turbo Parallel Pipeline

Scrape many articles concurrently, and safely, using multi-threaded workers.

# Scrape 50 articles in parallel using 10 workers
df = ("https://www.bbc.com/news"   # Replace with your target URL
      >> FetchLinks(max_links=50) 
      >> ExtractArticles(workers=10) 
      >> ToDataFrame())
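
The workers=10 pattern maps naturally onto a thread pool. A stand-in sketch of the idea, where extract is a dummy (the real step does network I/O, which is why threads help despite the GIL):

```python
from concurrent.futures import ThreadPoolExecutor

def extract(url):
    # Dummy stand-in for per-URL article extraction; real work is network-bound.
    return {"url": url, "title": "..."}

urls = [f"https://example.com/article/{i}" for i in range(50)]
with ThreadPoolExecutor(max_workers=10) as pool:
    articles = list(pool.map(extract, urls))  # map preserves input order

print(len(articles))  # 50
```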

Extracted Metadata Fields

Each article contains the following fields:

Field          | Description                    | Source
url            | Article URL                    | Input
source         | Domain/source name             | Parsed
title          | Article headline               | Trafilatura / newspaper4k
text           | Main article content           | Trafilatura
description    | Article summary                | Trafilatura
author         | Author name(s)                 | Trafilatura / newspaper4k
date_published | Publication date (YYYY-MM-DD)  | Trafilatura / newspaper4k
time_published | Publication time (HH:MM:SS)    | newspaper4k ⭐
language       | Language code (e.g., 'en')     | Trafilatura
tags           | Article tags/categories        | Trafilatura
image_url      | Main article image             | Trafilatura / newspaper4k

โญ Note: time_published is extracted via newspaper4k to supplement trafilatura, which only provides dates.

Data Manipulation & Visualization

Install PipeFrame (pip install pipescraper[pipeframe]) and PipePlotly (pip install pipescraper[pipeplotly]) for seamless end-to-end pipelines:

from pipescraper import FetchLinks, ExtractArticles, ToPipeFrame
from pipeframe import filter, arrange, group_by, summarize
from pipeplotly import ggplot, aes, geom_bar, theme_minimal

# Full Pipeline: Scrape -> Mutate -> Group -> Plot
fig = ("https://www.bbc.com/news"   # Replace with your target URL
       >> FetchLinks(max_links=20) 
       >> ExtractArticles() 
       >> ToPipeFrame() 
       >> filter(lambda df: df['author'].notna())
       >> arrange('date_published', ascending=False)
       >> ggplot(aes(x='source')) 
       >> geom_bar() 
       >> theme_minimal())

fig.show()

🎯 Real-World Examples

Respectful Scrape & Filter

Configure delays and robots.txt compliance.

result = ("https://www.bbc.com/news"   # Replace with your target URL
          >> FetchLinks(
              max_links=50,
              respect_robots=True,
              delay=3.0,
              user_agent="MyBot/1.0 (contact@example.com)"
          ) 
          >> ExtractArticles(delay=2.0)
          >> FilterArticles(lambda a: a.language == 'en' and bool(a.author))
          >> LimitArticles(20)
          >> Deduplicate()
          >> ToDataFrame(include_text=False)
          >> SaveAs("respectful_scrape.csv"))
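
What a delay parameter implies can be sketched with a tiny throttle (illustrative only; pipescraper's actual scheduling may differ):

```python
import time

class Throttle:
    # Illustrative minimal throttle; not pipescraper's actual implementation.
    def __init__(self, delay):
        self.delay = delay
        self.last = float("-inf")

    def wait(self):
        # Sleep just long enough so consecutive calls are `delay` seconds apart.
        remaining = self.delay - (time.monotonic() - self.last)
        if remaining > 0:
            time.sleep(remaining)
        self.last = time.monotonic()

throttle = Throttle(0.05)
start = time.monotonic()
throttle.wait()  # first call returns immediately
throttle.wait()  # second call sleeps ~0.05 s
elapsed = time.monotonic() - start
print(elapsed >= 0.05)  # True
```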

Direct Article Extraction

Extract from a specific URL or list of URLs without link discovery.

df = ("https://www.bbc.com/news/specific-article"   # Replace with your target URL
      >> ExtractArticles() 
      >> ToDataFrame() 
      >> SaveAs("single_article.json"))

🆚 Feature Comparison

pipescraper vs. Trafilatura

Feature             | pipescraper          | Trafilatura
Content extraction  | ✅ (via trafilatura) | ✅
Metadata extraction | ✅ Enhanced          | ✅ Basic
Publication time    | ✅ (via newspaper4k) | ❌ (date only)
Pipe syntax         | ✅                   | ❌
Link discovery      | ✅                   | ❌
Batch / parallel    | ✅                   | Manual
DataFrame export    | ✅ (CSV/JSON/Excel)  | ❌
Google News search  | ✅                   | ❌

Design Decision: pipescraper uses a dual-engine approach. Trafilatura provides industry-leading content extraction, while newspaper4k complements it by capturing the exact time_published, ensuring complete temporal metadata.




๐Ÿค Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📜 License

MIT License - see the LICENSE file for details.


👨‍💻 Author

Dr. Yasser Mustafa

AI & Data Science Specialist | Theoretical Physics PhD

  • 🎓 PhD in Theoretical Nuclear Physics
  • 💼 10+ years in production AI/ML systems
  • 🔬 48+ research publications
  • 🏢 Experience: Government (Abu Dhabi), Media (Track24), Recruitment (Reed), Energy (ADNOC)
  • 📍 Based in Newcastle upon Tyne, UK
  • ✉️ yasser.mustafan@gmail.com
  • 🔗 LinkedIn | GitHub

PipeScraper was born from the need for a more intuitive, pipe-based approach to news scraping, combining the analytical power of trafilatura with the elegance of a functional programming interface.


🌟 Star History

If PipeScraper helps your work, please consider giving it a star! ⭐


📜 How to Cite

If you use PipeScraper in your research or project, please cite it as follows:

@software{pipescraper2026,
  author = {Mustafa, Yasser},
  title = {PipeScraper: A pipe-based news article scraping and metadata extraction library},
  url = {https://github.com/Yasser03/pipescraper},
  version = {0.3.0},
  year = {2026}
}

๐Ÿ™ Acknowledgments

  • trafilatura โ€” Core content extraction engine
  • newspaper4k โ€” Supplementary time extraction
  • pipeframe โ€” Inspiration for pipe-based syntax
  • pipeplotly โ€” Pipe pattern implementation reference

💬 Community

  • Issues: Report bugs or request features
  • Discussions: Ask questions, share use cases

Made with ❤️ by Dr. Yasser Mustafa



Download files

Download the file for your platform.

Source Distribution

pipescraper-0.3.0.tar.gz (31.4 kB)

Uploaded Source

Built Distribution


pipescraper-0.3.0-py3-none-any.whl (28.3 kB)

Uploaded Python 3

File details

Details for the file pipescraper-0.3.0.tar.gz.

File metadata

  • Download URL: pipescraper-0.3.0.tar.gz
  • Upload date:
  • Size: 31.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for pipescraper-0.3.0.tar.gz
  • SHA256: c6d28cae5de8382377edc648208c5845e09d4dd514de22abd2c7f0e8942b8e89
  • MD5: 744ba892b237ceb180f2be4cfc4da283
  • BLAKE2b-256: de59e7ef785a7316d9cf53ea2fd18fb50cb053c2d2caf1cd9890cf94af12b1e1


File details

Details for the file pipescraper-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: pipescraper-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 28.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for pipescraper-0.3.0-py3-none-any.whl
  • SHA256: cb5feacf84383fc4bf46db2290276c63224d6cb5f2ab6addb036f3fc49b36186
  • MD5: 64f51e4b76c65cf003fd8546e00f1d9d
  • BLAKE2b-256: 24fe0f216f4a2a2bc5665c82d0aaf4257724b2edc1a3c827292ceff7dd0401f2

