PipeScraper
A pipe-based news article scraping and metadata extraction library for Python
pipescraper provides a natural-language, verb-based interface for scraping news websites and extracting structured article metadata using the intuitive pipe (>>) operator. Built on top of trafilatura, with supplementary publication-time extraction via newspaper4k, pipescraper combines powerful extraction capabilities with an elegant, chainable API.
from pipescraper import *

# Your scraping pipeline reads like a story
result = ("https://www.bbc.com/news"  # Replace with your target URL
    >> FetchLinks(max_links=10)
    >> ExtractArticles()
    >> FilterArticles(lambda a: a.language == 'en')
    >> ToDataFrame()
    >> SaveAs("articles.csv")
)
How to read >>
Read the `>>` operator as "pipe to" or "then". For example, the code above reads: "Take the URL, then fetch links, then extract articles, then filter for English articles..."
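Under the hood, this style of chaining relies on Python's reflected right-shift hook. Below is a minimal, self-contained sketch of the pattern; the `Verb` class and the two toy verbs are hypothetical illustrations, not pipescraper's actual classes:

```python
class Verb:
    """Minimal pipe verb: defining __rrshift__ lets an instance appear
    on the right of >>, receiving the left-hand value as input."""
    def __init__(self, fn):
        self.fn = fn

    def __rrshift__(self, upstream):
        # Invoked for `upstream >> self` when upstream has no matching __rshift__
        return self.fn(upstream)

# Toy verbs for illustration only
Upper = Verb(str.upper)
Exclaim = Verb(lambda s: s + "!")

result = "hello" >> Upper >> Exclaim
print(result)  # HELLO!
```

Because each verb returns a plain value, the chain evaluates left to right, exactly as the "pipe to / then" reading suggests.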
Why pipescraper?
Readability First
# ❌ Traditional logic is nested, hard to read, and error-prone
urls = fetch_links("https://www.bbc.com/news", max_links=10)  # Replace with your target URL
articles = []
for url in urls:
    time.sleep(1)
    art = extract_article(url)
    if art.language == 'en' and art.author:
        articles.append(art)
save_to_csv(articles, "articles.csv")

# ✅ pipescraper: Clear and intuitive
("https://www.bbc.com/news"  # Replace with your target URL
    >> FetchLinks(max_links=10)
    >> ExtractArticles(delay=1.0)
    >> FilterArticles(lambda a: a.language == 'en' and bool(a.author))
    >> ToDataFrame()
    >> SaveAs("articles.csv")
)
Key Features
- Pipe-based syntax – Chain operations naturally with the `>>` operator
- Comprehensive metadata extraction – Extract URL, source, title, text, author, dates, language, and more
- Publication time parsing – Supplement trafilatura's date extraction with full timestamp support
- Respectful scraping – Built-in robots.txt compliance and request throttling
- Google News search – Search for keywords or sentences across regions and time periods ⭐ NEW
- Automatic URL decoding – Parallel `batchexecute` decoder for Google News (bypasses the consent wall) ⭐ NEW
- Pandas integration – Export to DataFrame with CSV, JSON, and Excel support
- Flexible filtering – Filter articles by language, author, content length, or custom criteria
- Automatic deduplication – Remove duplicate articles by URL
- Parallel scraping – Speed up batch extraction with multi-threaded workers
- PipeFrame integration – Use all PipeFrame verbs (select, filter, mutate, arrange, etc.) for data manipulation
- PipePlotly integration – Create Grammar of Graphics visualizations with ggplot, geom_bar, geom_point, etc.
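URL-based deduplication amounts to keeping the first occurrence of each URL. A stand-alone sketch of that logic (plain dicts here; pipescraper's own article objects and the internals of its Deduplicate verb may differ):

```python
def deduplicate(articles):
    """Keep the first article seen for each URL, preserving input order."""
    seen = {}
    for article in articles:
        # setdefault only stores the first article for a given URL
        seen.setdefault(article["url"], article)
    return list(seen.values())

items = [
    {"url": "https://example.com/a", "title": "first"},
    {"url": "https://example.com/b", "title": "second"},
    {"url": "https://example.com/a", "title": "duplicate"},
]
unique = deduplicate(items)
# unique keeps "first" and "second"; the repeat of /a is dropped
```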
Quick Start
Installation
# Basic installation
pip install pipescraper

# Install with all optional integrations (PipeFrame & PipePlotly)
# (quoted so the brackets survive shells like zsh)
pip install "pipescraper[all]"
Or install from source:
git clone https://github.com/Yasser03/pipescraper.git
cd pipescraper
pip install -e .
Hello pipescraper!
from pipescraper import FetchLinks, ExtractArticles, ToDataFrame, SaveAs

# Simple pipeline: URL -> Links -> Articles -> DataFrame -> CSV
df = ("https://www.bbc.com/news"  # Replace with your target URL
    >> FetchLinks(max_links=10)
    >> ExtractArticles()
    >> ToDataFrame()
    >> SaveAs("articles.csv"))

print(f"Scraped {len(df)} articles successfully!")
Core Concepts
The Pipe Operator >>
Chain operations naturally without nested function calls or loops:
# PipeScraper approach (reads like a recipe)
articles = ("https://www.bbc.com/news"  # Replace with your target URL
    >> FetchLinks(max_links=20)
    >> ExtractArticles(skip_errors=True)
    >> Deduplicate()
    >> LimitArticles(10)
)
Core Verbs
| Verb | Purpose | Example |
|---|---|---|
| `FetchLinks()` | Fetch article links from a base URL | `>> FetchLinks(max_links=50, delay=1.0)` |
| `ExtractArticles()` | Extract metadata from URLs | `>> ExtractArticles(workers=5, extract_time=True)` |
| `FetchGoogleNews()` | Search Google News | `>> FetchGoogleNews(search="SpaceX", period="1d")` |
| `FilterArticles()` | Filter by criteria | `>> FilterArticles(lambda a: a.language == 'en')` |
| `LimitArticles()` | Limit the number of articles | `>> LimitArticles(10)` |
| `Deduplicate()` | Remove duplicates | `>> Deduplicate()` |
| `ToDataFrame()` | Convert to DataFrame | `>> ToDataFrame(include_text=True)` |
| `ToPipeFrame()` | Convert to PipeFrame | `>> ToPipeFrame()` |
| `SaveAs()` | Save to file | `>> SaveAs("output.csv")` |
Advanced Features
Google News Integration & Decoding
Search Google News for specific topics, leveraging a high-performance parallel decoder that resolves consent-gated URLs automatically.
# Search for multiple related topics
search_articles = (FetchGoogleNews(
        search=["latest AI breakthroughs", "quantum computing news"],
        period="7d",
        max_results=20)
    >> ExtractArticles(workers=5)
    >> ToDataFrame())
Turbo Parallel Pipeline
Scrape large batches concurrently with multi-threaded workers.
# Scrape 50 articles in parallel using 10 workers
df = ("https://www.bbc.com/news"  # Replace with your target URL
    >> FetchLinks(max_links=50)
    >> ExtractArticles(workers=10)
    >> ToDataFrame())
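A workers= option like this maps naturally onto a thread pool, which suits network-bound extraction. A generic stdlib sketch of the pattern (`extract_stub` is a hypothetical stand-in for real per-URL extraction, not a pipescraper function):

```python
from concurrent.futures import ThreadPoolExecutor

def extract_stub(url):
    # Hypothetical stand-in for network-bound article extraction
    return {"url": url, "ok": True}

urls = [f"https://example.com/story-{i}" for i in range(6)]

# pool.map fans the URLs out across threads but yields results in input order
with ThreadPoolExecutor(max_workers=3) as pool:
    articles = list(pool.map(extract_stub, urls))
```

Threads (rather than processes) are the usual choice here because the workers spend most of their time waiting on HTTP responses, where the GIL is released.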
Extracted Metadata Fields
Each article contains the following fields:
| Field | Description | Source |
|---|---|---|
| `url` | Article URL | Input |
| `source` | Domain/source name | Parsed |
| `title` | Article headline | Trafilatura / newspaper4k |
| `text` | Main article content | Trafilatura |
| `description` | Article summary | Trafilatura |
| `author` | Author name(s) | Trafilatura / newspaper4k |
| `date_published` | Publication date (YYYY-MM-DD) | Trafilatura / newspaper4k |
| `time_published` | Publication time (HH:MM:SS) | newspaper4k ⭐ |
| `language` | Language code (e.g., 'en') | Trafilatura |
| `tags` | Article tags/categories | Trafilatura |
| `image_url` | Main article image | Trafilatura / newspaper4k |

⭐ Note: time_published is extracted via newspaper4k to supplement trafilatura, which only provides dates.
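The field list above maps naturally onto a small record type. A hypothetical sketch using the field names from the table; the class pipescraper actually uses internally may differ:

```python
from dataclasses import dataclass, field

@dataclass
class Article:
    """Hypothetical record mirroring the metadata fields in the table."""
    url: str
    source: str = ""
    title: str = ""
    text: str = ""
    description: str = ""
    author: str = ""
    date_published: str = ""   # YYYY-MM-DD
    time_published: str = ""   # HH:MM:SS
    language: str = ""
    tags: list = field(default_factory=list)
    image_url: str = ""

a = Article(url="https://example.com/x", source="example.com", language="en")
```

A structure like this is also what makes the lambda-based filters elsewhere in the README read cleanly, e.g. `lambda a: a.language == 'en'`.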
Data Manipulation & Visualization
Install PipeFrame (`pip install "pipescraper[pipeframe]"`) and PipePlotly (`pip install "pipescraper[pipeplotly]"`) for seamless end-to-end pipelines:
from pipescraper import FetchLinks, ExtractArticles, ToPipeFrame
from pipeframe import filter, arrange, group_by, summarize
from pipeplotly import ggplot, aes, geom_bar, theme_minimal

# Full pipeline: Scrape -> Filter -> Arrange -> Plot
fig = ("https://www.bbc.com/news"  # Replace with your target URL
    >> FetchLinks(max_links=20)
    >> ExtractArticles()
    >> ToPipeFrame()
    >> filter(lambda df: df['author'].notna())
    >> arrange('date_published', ascending=False)
    >> ggplot(aes(x='source'))
    >> geom_bar()
    >> theme_minimal())
fig.show()
Real-World Examples
Respectful Scrape & Filter
Configure delays and robots.txt compliance.
result = ("https://www.bbc.com/news"  # Replace with your target URL
    >> FetchLinks(
        max_links=50,
        respect_robots=True,
        delay=3.0,
        user_agent="MyBot/1.0 (contact@example.com)"
    )
    >> ExtractArticles(delay=2.0)
    >> FilterArticles(lambda a: a.language == 'en' and bool(a.author))
    >> LimitArticles(20)
    >> Deduplicate()
    >> ToDataFrame(include_text=False)
    >> SaveAs("respectful_scrape.csv"))
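Checks of the respect_robots=True kind can be done with the standard library's urllib.robotparser. A self-contained sketch that parses an inline robots.txt body (in practice it would be fetched from the site's /robots.txt; pipescraper's internals may differ):

```python
from urllib import robotparser

# Parse an inline robots.txt body; normally read from https://<site>/robots.txt
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
])

allowed = rp.can_fetch("MyBot/1.0", "https://example.com/news/story")
blocked = rp.can_fetch("MyBot/1.0", "https://example.com/private/page")
delay = rp.crawl_delay("*")   # seconds to wait between requests
```

Honoring both the Disallow rules and any Crawl-delay directive, as the example pipeline's delay= arguments do, keeps a scraper within a site's stated limits.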
Direct Article Extraction
Extract from a specific URL or list of URLs without link discovery.
df = ("https://www.bbc.com/news/specific-article"  # Replace with your target URL
    >> ExtractArticles()
    >> ToDataFrame()
    >> SaveAs("single_article.json"))
Feature Comparison
pipescraper vs. Trafilatura
| Feature | pipescraper | Trafilatura |
|---|---|---|
| Content extraction | ✅ (via trafilatura) | ✅ |
| Metadata extraction | ✅ Enhanced | ✅ Basic |
| Publication time | ✅ (via newspaper4k) | ❌ (date only) |
| Pipe syntax | ✅ | ❌ |
| Link discovery | ✅ | ❌ |
| Batch / Parallel | ✅ | Manual |
| DataFrame export | ✅ (CSV/JSON/Excel) | ❌ |
| Google News filter | ✅ | ❌ |

Design decision: pipescraper uses a dual-engine approach. Trafilatura provides industry-leading content extraction, while newspaper4k complements it by capturing the exact time_published, ensuring complete temporal metadata.
Learning Resources
- Tutorial Notebook – A complete, hands-on, end-to-end walkthrough
- API Reference – Detailed core documentation
- Examples – More advanced usage examples
- Contributing Guide – How to contribute

Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
License
MIT License - see the LICENSE file for details.
Author
Dr. Yasser Mustafa
AI & Data Science Specialist | Theoretical Physics PhD
- PhD in Theoretical Nuclear Physics
- 10+ years in production AI/ML systems
- 48+ research publications
- Experience: Government (Abu Dhabi), Media (Track24), Recruitment (Reed), Energy (ADNOC)
- Based in Newcastle upon Tyne, UK
- yasser.mustafan@gmail.com
- LinkedIn | GitHub
PipeScraper was born from the need for a more intuitive, pipe-based approach to news scraping, combining the analytical power of trafilatura with the elegance of a functional programming interface.
Star History
If PipeScraper helps your work, please consider giving it a star! ⭐
How to Cite
If you use PipeScraper in your research or project, please cite it as follows:
@software{pipescraper2026,
  author  = {Mustafa, Yasser},
  title   = {PipeScraper: A pipe-based news article scraping and metadata extraction library},
  url     = {https://github.com/Yasser03/pipescraper},
  version = {0.3.0},
  year    = {2026}
}
Acknowledgments
- trafilatura – Core content extraction engine
- newspaper4k – Supplementary time extraction
- pipeframe – Inspiration for pipe-based syntax
- pipeplotly – Pipe pattern implementation reference
Community
- Issues: Report bugs or request features
- Discussions: Ask questions, share use cases

Made with ❤️ by Dr. Yasser Mustafa