
A pipe-based news article scraping and metadata extraction library

Project description

PipeScraper 🔗

A pipe-based news article scraping and metadata extraction library for Python

Python 3.8+ License: MIT

pipescraper provides a verb-based interface for scraping news websites and extracting structured article metadata with the intuitive pipe (>>) operator. Built on trafilatura, with supplementary publication-time extraction via newspaper4k, it combines powerful extraction capabilities with an elegant, chainable API.

from pipescraper import *

# Your scraping pipeline reads like a story
result = ("https://www.bbc.com/news"   # Replace with your target URL
    >> FetchLinks(max_links=10) 
    >> ExtractArticles() 
    >> FilterArticles(lambda a: a.language == 'en')
    >> ToDataFrame() 
    >> SaveAs("articles.csv")
)

💡 How to read >>: Read the >> operator as "pipe to" or "then". For example, the code above reads: "Take the URL, then fetch links, then extract articles, then filter for English articles, then convert to a DataFrame, then save as CSV."


🌟 Why pipescraper?

Readability First

# โŒ Traditional logic is nested, hard to read, and error-prone
urls = fetch_links("https://www.bbc.com/news", max_links=10)  # Replace with your target URL
articles = []
for url in urls:
    time.sleep(1)
    art = extract_article(url)
    if art.language == 'en' and art.author:
        articles.append(art)
save_to_csv(articles, "articles.csv")

# ✅ pipescraper: Clear and intuitive
("https://www.bbc.com/news"   # Replace with your target URL
    >> FetchLinks(max_links=10) 
    >> ExtractArticles(delay=1.0) 
    >> FilterArticles(lambda a: a.language == 'en' and bool(a.author)) 
    >> ToDataFrame() 
    >> SaveAs("articles.csv")
)

Key Features

  • 🔗 Pipe-based syntax - Chain operations naturally with the >> operator
  • 📰 Comprehensive metadata extraction - Extract URL, source, title, text, author, dates, language, and more
  • ⏰ Publication time parsing - Supplement trafilatura's date extraction with full timestamp support
  • 🤖 Respectful scraping - Built-in robots.txt compliance and request throttling
  • 🌐 Google News search - Search for keywords or sentences across regions and time periods ⭐ NEW
  • 🧠 Automatic URL decoding - Parallel batchexecute decoder for Google News (bypasses the consent wall) ⭐ NEW
  • 📊 Pandas integration - Export to DataFrame with CSV, JSON, and Excel support
  • 🎯 Flexible filtering - Filter articles by language, author, content length, or custom criteria
  • 🧹 Automatic deduplication - Remove duplicate articles by URL
  • ⚡ Parallel scraping - Speed up batch extraction with multi-threaded workers
  • 🔧 PipeFrame integration - Use all PipeFrame verbs (select, filter, mutate, arrange, etc.) for data manipulation
  • 📈 PipePlotly integration - Create visualizations with Grammar of Graphics verbs such as ggplot, geom_bar, and geom_point
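
The "respectful scraping" bullet above can be illustrated with the standard library's robots.txt parser. This is a sketch of the general mechanism only; pipescraper's internal handling may differ, and the rules below are invented for illustration:

```python
# Illustrative only: how robots.txt compliance works in general, using the
# standard library. The rules here are made up, not from a real site.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# An ordinary article path is allowed; the disallowed path is not.
print(rp.can_fetch("MyBot/1.0", "https://example.com/news/story"))  # True
print(rp.can_fetch("MyBot/1.0", "https://example.com/private/x"))   # False
```

A scraper that honors this check (plus a delay between requests) is what "respectful" means in practice.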

🚀 Quick Start

Installation

# Basic installation
pip install pipescraper

# Install with all optional integrations (PipeFrame & PipePlotly)
pip install pipescraper[all]

Or install from source:

git clone https://github.com/Yasser03/pipescraper.git
cd pipescraper
pip install -e .

Hello pipescraper!

from pipescraper import FetchLinks, ExtractArticles, ToDataFrame, SaveAs

# Simple pipeline: URL → Links → Articles → DataFrame → CSV
df = ("https://www.bbc.com/news"   # Replace with your target URL
      >> FetchLinks(max_links=10) 
      >> ExtractArticles() 
      >> ToDataFrame() 
      >> SaveAs("articles.csv"))

print(f"Scraped {len(df)} articles successfully! 🎉")

📚 Core Concepts

The Pipe Operator >>

Chain operations naturally without nested function calls or loops:

# PipeScraper approach (reads like a recipe)
articles = ("https://www.bbc.com/news"  # Replace with your target URL
    >> FetchLinks(max_links=20)
    >> ExtractArticles(skip_errors=True)
    >> Deduplicate()
    >> LimitArticles(10)
)
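
Under the hood, a pipe operator like this is plain Python operator overloading. A minimal sketch of the technique, with invented Verb/Upper/Exclaim names for illustration (not pipescraper's actual classes):

```python
# Minimal sketch of a >> pipeline; not pipescraper's actual implementation.
class Verb:
    def __init__(self, fn):
        self.fn = fn

    def __rrshift__(self, value):
        # `value >> verb` falls back to verb.__rrshift__(value) when the
        # left operand (e.g. a plain str) does not define __rshift__.
        return self.fn(value)

Upper = Verb(str.upper)
Exclaim = Verb(lambda s: s + "!")

print("hello" >> Upper >> Exclaim)  # HELLO!
```

Because a plain string can start the chain via the reflected `__rrshift__` hook, no wrapper call is needed on the left-hand side.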

Core Verbs

Verb              | Purpose                             | Example
FetchLinks()      | Fetch article links from a base URL | >> FetchLinks(max_links=50, delay=1.0)
ExtractArticles() | Extract metadata from URLs          | >> ExtractArticles(workers=5, extract_time=True)
FetchGoogleNews() | Search Google News                  | >> FetchGoogleNews(search="SpaceX", period="1d")
FilterArticles()  | Filter by criteria                  | >> FilterArticles(lambda a: a.language == 'en')
LimitArticles()   | Limit the number of articles        | >> LimitArticles(10)
Deduplicate()     | Remove duplicates                   | >> Deduplicate()
ToDataFrame()     | Convert to a pandas DataFrame       | >> ToDataFrame(include_text=True)
ToPipeFrame()     | Convert to a PipeFrame              | >> ToPipeFrame()
SaveAs()          | Save to a file                      | >> SaveAs("output.csv")
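
As a rough illustration of what Deduplicate() does conceptually (a hypothetical helper, not the library's code), URL-based deduplication keeps the first occurrence of each URL:

```python
# Hypothetical helper showing the idea behind Deduplicate(); not library code.
def dedupe_by_url(articles):
    seen = set()
    unique = []
    for article in articles:
        if article["url"] not in seen:
            seen.add(article["url"])
            unique.append(article)  # first occurrence wins
    return unique

articles = [{"url": "https://a"}, {"url": "https://b"}, {"url": "https://a"}]
print([a["url"] for a in dedupe_by_url(articles)])  # ['https://a', 'https://b']
```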

🔥 Advanced Features

Google News Integration & Decoding

Search Google News for specific topics, using a high-performance parallel decoder that resolves consent-gated URLs automatically.

# Search for multiple related topics
search_articles = (FetchGoogleNews(
                        search=["latest AI breakthroughs", "quantum computing news"],
                        period="7d",
                        max_results=20) 
                   >> ExtractArticles(workers=5) 
                   >> ToDataFrame())

Turbo Parallel Pipeline

Scrape many articles concurrently, and safely, using multi-threaded workers.

# Scrape 50 articles in parallel using 10 workers
df = ("https://www.bbc.com/news"   # Replace with your target URL
      >> FetchLinks(max_links=50) 
      >> ExtractArticles(workers=10) 
      >> ToDataFrame())
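
The workers=10 pattern maps naturally onto a thread pool. A stand-in sketch of the idea, where extract is a dummy (the real step does network I/O, which is why threads help despite the GIL):

```python
from concurrent.futures import ThreadPoolExecutor

def extract(url):
    # Dummy stand-in for per-URL article extraction; real work is network-bound.
    return {"url": url, "title": "..."}

urls = [f"https://example.com/article/{i}" for i in range(50)]
with ThreadPoolExecutor(max_workers=10) as pool:
    articles = list(pool.map(extract, urls))  # map preserves input order

print(len(articles))  # 50
```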

Extracted Metadata Fields

Each article contains the following fields:

Field          | Description                    | Source
url            | Article URL                    | Input
source         | Domain/source name             | Parsed
title          | Article headline               | Trafilatura / newspaper4k
text           | Main article content           | Trafilatura
description    | Article summary                | Trafilatura
author         | Author name(s)                 | Trafilatura / newspaper4k
date_published | Publication date (YYYY-MM-DD)  | Trafilatura / newspaper4k
time_published | Publication time (HH:MM:SS)    | newspaper4k ⭐
language       | Language code (e.g., 'en')     | Trafilatura
tags           | Article tags/categories        | Trafilatura
image_url      | Main article image             | Trafilatura / newspaper4k

โญ Note: time_published is extracted via newspaper4k to supplement trafilatura, which only provides dates.

Data Manipulation & Visualization

Install PipeFrame (pip install pipescraper[pipeframe]) and PipePlotly (pip install pipescraper[pipeplotly]) for seamless end-to-end pipelines:

from pipescraper import FetchLinks, ExtractArticles, ToPipeFrame
from pipeframe import filter, arrange, group_by, summarize
from pipeplotly import ggplot, aes, geom_bar, theme_minimal

# Full Pipeline: Scrape -> Mutate -> Group -> Plot
fig = ("https://www.bbc.com/news"   # Replace with your target URL
       >> FetchLinks(max_links=20) 
       >> ExtractArticles() 
       >> ToPipeFrame() 
       >> filter(lambda df: df['author'].notna())
       >> arrange('date_published', ascending=False)
       >> ggplot(aes(x='source')) 
       >> geom_bar() 
       >> theme_minimal())

fig.show()

🎯 Real-World Examples

Respectful Scrape & Filter

Configure delays and robots.txt compliance.

result = ("https://www.bbc.com/news"   # Replace with your target URL
          >> FetchLinks(
              max_links=50,
              respect_robots=True,
              delay=3.0,
              user_agent="MyBot/1.0 (contact@example.com)"
          ) 
          >> ExtractArticles(delay=2.0)
          >> FilterArticles(lambda a: a.language == 'en' and bool(a.author))
          >> LimitArticles(20)
          >> Deduplicate()
          >> ToDataFrame(include_text=False)
          >> SaveAs("respectful_scrape.csv"))
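
What a delay parameter implies can be sketched with a tiny throttle (illustrative only; pipescraper's actual scheduling may differ):

```python
import time

class Throttle:
    # Illustrative minimal throttle; not pipescraper's actual implementation.
    def __init__(self, delay):
        self.delay = delay
        self.last = float("-inf")

    def wait(self):
        # Sleep just long enough so consecutive calls are `delay` seconds apart.
        remaining = self.delay - (time.monotonic() - self.last)
        if remaining > 0:
            time.sleep(remaining)
        self.last = time.monotonic()

throttle = Throttle(0.05)
start = time.monotonic()
throttle.wait()  # first call returns immediately
throttle.wait()  # second call sleeps ~0.05 s
elapsed = time.monotonic() - start
print(elapsed >= 0.05)  # True
```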

Direct Article Extraction

Extract from a specific URL or list of URLs without link discovery.

df = ("https://www.bbc.com/news/specific-article"   # Replace with your target URL
      >> ExtractArticles() 
      >> ToDataFrame() 
      >> SaveAs("single_article.json"))

🆚 Feature Comparison

pipescraper vs. Trafilatura

Feature             | pipescraper          | Trafilatura
Content extraction  | ✅ (via trafilatura) | ✅
Metadata extraction | ✅ Enhanced          | ✅ Basic
Publication time    | ✅ (via newspaper4k) | ❌ (date only)
Pipe syntax         | ✅                   | ❌
Link discovery      | ✅                   | ❌
Batch / parallel    | ✅                   | Manual
DataFrame export    | ✅ (CSV/JSON/Excel)  | ❌
Google News search  | ✅                   | ❌

Design Decision: pipescraper uses a dual-engine approach. Trafilatura provides industry-leading content extraction, while newspaper4k complements it by capturing the exact time_published, ensuring complete temporal metadata.




๐Ÿค Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📜 License

MIT License - see the LICENSE file for details.


👨‍💻 Author

Dr. Yasser Mustafa

AI & Data Science Specialist | Theoretical Physics PhD

  • 🎓 PhD in Theoretical Nuclear Physics
  • 💼 10+ years in production AI/ML systems
  • 🔬 48+ research publications
  • 🏢 Experience: Government (Abu Dhabi), Media (Track24), Recruitment (Reed), Energy (ADNOC)
  • 📍 Based in Newcastle upon Tyne, UK
  • ✉️ yasser.mustafan@gmail.com
  • 🔗 LinkedIn | GitHub

PipeScraper was born from the need for a more intuitive, pipe-based approach to news scraping, combining the analytical power of trafilatura with the elegance of a functional programming interface.


🌟 Star History

If PipeScraper helps your work, please consider giving it a star! ⭐


📜 How to Cite

If you use PipeScraper in your research or project, please cite it as follows:

@software{pipescraper2026,
  author = {Mustafa, Yasser},
  title = {PipeScraper: A pipe-based news article scraping and metadata extraction library},
  url = {https://github.com/Yasser03/pipescraper},
  version = {0.3.0},
  year = {2026}
}

๐Ÿ™ Acknowledgments

  • trafilatura โ€” Core content extraction engine
  • newspaper4k โ€” Supplementary time extraction
  • pipeframe โ€” Inspiration for pipe-based syntax
  • pipeplotly โ€” Pipe pattern implementation reference

💬 Community

  • Issues: Report bugs or request features
  • Discussions: Ask questions, share use cases

Made with ❤️ by Dr. Yasser Mustafa



Download files

Download the file for your platform.

Source Distribution

pipescraper-0.3.0.tar.gz (31.4 kB)

Uploaded Source

Built Distribution


pipescraper-0.3.0-py3-none-any.whl (28.3 kB)

Uploaded Python 3

File details

Details for the file pipescraper-0.3.0.tar.gz.

File metadata

  • Download URL: pipescraper-0.3.0.tar.gz
  • Upload date:
  • Size: 31.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for pipescraper-0.3.0.tar.gz
  • SHA256: c6d28cae5de8382377edc648208c5845e09d4dd514de22abd2c7f0e8942b8e89
  • MD5: 744ba892b237ceb180f2be4cfc4da283
  • BLAKE2b-256: de59e7ef785a7316d9cf53ea2fd18fb50cb053c2d2caf1cd9890cf94af12b1e1


File details

Details for the file pipescraper-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: pipescraper-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 28.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for pipescraper-0.3.0-py3-none-any.whl
  • SHA256: cb5feacf84383fc4bf46db2290276c63224d6cb5f2ab6addb036f3fc49b36186
  • MD5: 64f51e4b76c65cf003fd8546e00f1d9d
  • BLAKE2b-256: 24fe0f216f4a2a2bc5665c82d0aaf4257724b2edc1a3c827292ceff7dd0401f2

