Web Article Extractor
A generic, production-ready Python module for extracting article text and publication dates from web URLs using a three-stage pipeline: HTML parsers (newspaper3k → trafilatura) with Google Gemini LLM fallback.
Features
- 🎯 Three-Stage Extraction: newspaper3k → trafilatura → Gemini LLM fallback
- 📊 CSV-Based Workflow: Process multiple URLs from CSV with configurable column mappings
- 🔧 YAML Configuration: Flexible column mapping without code changes
- 📝 Structured Logging: JSON-formatted logs with CLI-configurable levels
- 📅 ISO 8601 Dates: Automatic date normalization to standard format
- 🏗️ Provider Pattern: Extensible architecture for adding new LLM providers
- ✅ High Quality: Black (line length 108), isort, pylint score 10.0, pytest coverage ≥90%
- 🚀 Production Ready: Pre-commit hooks, CI/CD, comprehensive tests
Installation
```bash
# Clone repository
git clone https://github.com/yourusername/web-article-extractor.git
cd web-article-extractor

# Install in development mode
pip install -e ".[dev]"

# Or install from PyPI (when published)
pip install web-article-extractor
```
Quick Start
1. Set up Gemini API Key
```bash
export GEMINI_API_KEY="your-api-key-here"
```
2. Create Configuration File
Create `config.yaml`:

```yaml
id_column: rest_id
url_columns:
  - Web site restaurant
  - Web site Chef
  - Web
```
3. Run Extraction
```bash
web-article-extractor input.csv --output-csv output.csv --config config.yaml --log-level INFO
```
Usage Examples
Command Line
```bash
# Basic usage
web-article-extractor restaurants.csv --output-csv results.csv --config config.yaml

# With debug logging
web-article-extractor input.csv -o output.csv -c config.yaml --log-level DEBUG

# With different log levels
web-article-extractor input.csv --output-csv output.csv --config config.yaml --log-level WARNING
```
Programmatic Usage
```python
from web_article_extractor import ArticleExtractor
from web_article_extractor.config import Config
from web_article_extractor.logger import setup_logger

# Setup logging
setup_logger("web_article_extractor", "INFO")

# Load configuration
config = Config.from_yaml("config.yaml")

# Create extractor
extractor = ArticleExtractor()

# Process CSV
extractor.process_csv("input.csv", "output.csv", config)
```
Input/Output Format
Input CSV
Your CSV should contain:
- One column with unique identifiers (specified in `id_column`)
- One or more columns with URLs (specified in `url_columns`)
Example:
```csv
rest_id,Web site restaurant,Web site Chef
1,https://example.com/restaurant,https://example.com/chef
2,https://test.com/place,
```
Output CSV
Generated CSV contains:
| Column | Description |
|---|---|
| `id` | The identifier from your input CSV |
| `url` | The URL that was processed |
| `extracted_text` | Extracted article text |
| `publication_date` | ISO 8601 formatted date (YYYY-MM-DD) |
| `extraction_method` | Method used: `newspaper`, `trafilatura`, or `gemini` |
| `status` | `success` or `error` |
| `error_message` | Error details if status is `error` |
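As a sketch of downstream use, the output CSV can be loaded with pandas (already a project dependency). The column names match the table above; the sample rows below are made up for illustration:

```python
import io

import pandas as pd

# A miniature output CSV with the columns described above (sample data).
output_csv = io.StringIO(
    "id,url,extracted_text,publication_date,extraction_method,status,error_message\n"
    "1,https://example.com/restaurant,Some article text,2024-03-05,newspaper,success,\n"
    "2,https://test.com/place,,,,error,HTTP 404\n"
)

df = pd.read_csv(output_csv)

# Keep only rows that were extracted successfully.
successes = df[df["status"] == "success"]
print(len(successes))  # 1
```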
Three-Stage Extraction Pipeline
1. newspaper3k (Stage 1)
   - Fast, specialized for news articles
   - Extracts text + publish date
   - Falls back if extraction fails or text < 100 chars
2. trafilatura (Stage 2)
   - Generic web page extractor
   - Better for diverse site structures
   - Falls back if extraction fails or text < 100 chars
3. Google Gemini (Stage 3)
   - LLM-powered extraction using Gemini 2.0 Flash
   - Ultimate fallback when HTML parsing fails
   - Uses AI to understand and extract content
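The fallback chain above can be sketched generically. Note this is an illustration of the pattern, not the project's actual API: `MIN_CHARS`, the stage signatures, and the dummy stages are all assumptions:

```python
from typing import Callable, Optional, Tuple

MIN_CHARS = 100  # threshold from the pipeline description

# Each stage takes a URL and returns (text, date); it may fail by
# raising or by returning too little text.
Stage = Callable[[str], Tuple[Optional[str], Optional[str]]]


def extract_with_fallback(url: str, stages: dict) -> dict:
    """Try each stage in order; fall through when text is missing or too short."""
    for name, stage in stages.items():
        try:
            text, date = stage(url)
        except Exception:
            continue
        if text and len(text) >= MIN_CHARS:
            return {"text": text, "date": date, "method": name}
    return {"text": None, "date": None, "method": None}


# Dummy stages for illustration: the first two fail, the third succeeds.
stages = {
    "newspaper": lambda url: ("too short", None),
    "trafilatura": lambda url: (None, None),
    "gemini": lambda url: ("x" * 150, "2024-03-05"),
}
result = extract_with_fallback("https://example.com", stages)
print(result["method"])  # gemini
```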
Configuration
YAML Schema
```yaml
# Required: Column name containing unique identifiers
id_column: id

# Required: List of column names containing URLs to extract
url_columns:
  - url_column_1
  - url_column_2
  - url_column_3
```
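The real module validates this schema with Pydantic. As a dependency-free illustration of the same shape, here is a minimal stand-in (the `ConfigSketch` class below is hypothetical, not the project's `Config`):

```python
from dataclasses import dataclass, field


@dataclass
class ConfigSketch:
    """Minimal stand-in for the YAML schema above."""
    id_column: str
    url_columns: list = field(default_factory=list)

    def __post_init__(self):
        # Mirror the "Required" comments in the schema.
        if not self.id_column:
            raise ValueError("id_column is required")
        if not self.url_columns:
            raise ValueError("url_columns must list at least one column")


# Equivalent of the dict produced by yaml.safe_load on the example above.
raw = {"id_column": "id", "url_columns": ["url_column_1", "url_column_2"]}
cfg = ConfigSketch(**raw)
print(cfg.id_column)  # id
```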
Environment Variables
- `GEMINI_API_KEY`: Google Gemini API key (required)
Logging Levels
- `DEBUG`: Detailed extraction attempts, all stages
- `INFO`: Successful extractions, progress updates (default)
- `WARNING`: Recoverable issues
- `ERROR`: Failed extractions
- `CRITICAL`: System-level failures
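The project lists python-json-logger for its structured logs. As a stdlib-only sketch of what one-JSON-object-per-line logging looks like (the formatter below is illustrative, not the project's `setup_logger`):

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "name": record.name,
            "message": record.getMessage(),
        })


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("web_article_extractor.demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("extracted %s", "https://example.com")
line = json.loads(stream.getvalue())
print(line["level"])  # INFO
```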
Development
Setup Development Environment
```bash
# Install with development dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

# Run pre-commit on all files
pre-commit run --all-files
```
Running Tests
```bash
# Run all tests with coverage
pytest

# Run with coverage report
pytest --cov=web_article_extractor --cov-report=html

# Run specific test file
pytest tests/test_unit.py

# Run with verbose output
pytest -v
```
Code Quality
```bash
# Format code with black
black --line-length=108 src/ tests/

# Sort imports
isort --profile=black --line-length=108 src/ tests/

# Run pylint
pylint src/web_article_extractor

# Run all checks (via pre-commit)
pre-commit run --all-files
```
Architecture
The module follows these design principles:
- Provider Pattern: Extensible LLM provider system
- Configuration-Driven: YAML-based, no hardcoded values
- Structured Logging: JSON logs for production observability
- Three-Stage Pipeline: HTML parsers first, LLM as fallback
- ISO 8601 Dates: Standardized date format
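On the ISO 8601 principle: publication dates arrive in many shapes and are normalized to YYYY-MM-DD. The project depends on python-dateutil for this; below is a stdlib sketch of the idea (the format list is illustrative, not the project's actual logic):

```python
from datetime import datetime
from typing import Optional


def to_iso(raw: str) -> Optional[str]:
    """Normalize a raw date string to ISO 8601 (YYYY-MM-DD), or None."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y"):
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return None


print(to_iso("March 5, 2024"))  # 2024-03-05
print(to_iso("05/03/2024"))     # 2024-03-05
```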
For detailed architecture documentation, see .github/instructions/architecture.instructions.md.
Extending the Module
Adding a New LLM Provider
```python
# src/web_article_extractor/providers/openai.py
from openai import OpenAI

from .base import BaseAPIProvider


class OpenAIProvider(BaseAPIProvider):
    def get_env_key_name(self) -> str:
        return "OPENAI_API_KEY"

    def get_default_model(self) -> str:
        return "gpt-4"

    def _initialize_client(self):
        return OpenAI(api_key=self.api_key)

    def query(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
```
Requirements
- Python 3.13+
- google-generativeai
- newspaper3k
- trafilatura
- pyyaml
- pandas
- requests
- click
- python-json-logger
- python-dateutil
License
MIT License - see LICENSE file for details.
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes with tests
- Ensure all tests pass and coverage ≥90%
- Run pre-commit hooks
- Submit a pull request
Support
For issues, questions, or contributions, please open an issue on GitHub.
Acknowledgments
- newspaper3k for fast news article extraction
- trafilatura for robust web content extraction
- Google Gemini for LLM-powered fallback extraction
- Pydantic for robust configuration validation
Changelog
0.1.0 (2026-02-01)
- Initial release
- Three-stage extraction pipeline
- YAML configuration with Pydantic validation
- Structured logging
- CLI interface with options instead of arguments
- Provider pattern for LLM extensibility
- Comprehensive test suite (≥90% coverage)
- One test file per source module following Python standards
File details

Details for the file web_article_extractor-1.1.0.tar.gz.

File metadata
- Download URL: web_article_extractor-1.1.0.tar.gz
- Upload date:
- Size: 20.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `5f5d596f8a8fca3bb1565c5bb861bf312fc8ee6c89b5fcdc884acf9150e038f1` |
| MD5 | `4f5bd6d9baea887e78b8ea2af2fbcd48` |
| BLAKE2b-256 | `196071c115e7129fd304f9e7a9cab0cc89a9eb6f889fb78ad67c42144691c12c` |
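A downloaded file can be checked against a SHA256 digest like the one above using hashlib (the path in the usage comment is a placeholder):

```python
import hashlib


def sha256_of(path: str) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


# Usage against a downloaded distribution (placeholder path):
# sha256_of("web_article_extractor-1.1.0.tar.gz")

# Quick sanity check of the helper on known bytes:
print(hashlib.sha256(b"hello").hexdigest()[:8])  # 2cf24dba
```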
Provenance

The following attestation bundles were made for web_article_extractor-1.1.0.tar.gz:

Publisher: publish.yml on callidio/web-article-extractor

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: web_article_extractor-1.1.0.tar.gz
- Subject digest: 5f5d596f8a8fca3bb1565c5bb861bf312fc8ee6c89b5fcdc884acf9150e038f1
- Sigstore transparency entry: 894499829
- Sigstore integration time:
- Permalink: callidio/web-article-extractor@b8c490a3fe7c1f82027d206feeb444a3d8499717
- Branch / Tag: refs/tags/1.1.0
- Owner: https://github.com/callidio
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@b8c490a3fe7c1f82027d206feeb444a3d8499717
- Trigger Event: release
File details

Details for the file web_article_extractor-1.1.0-py3-none-any.whl.

File metadata
- Download URL: web_article_extractor-1.1.0-py3-none-any.whl
- Upload date:
- Size: 15.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `2107c4a04907cc77551c91e5b37d8ce08e19d7adfd376b6855244de74cf6a359` |
| MD5 | `fb394bfc4d2f8200a72ce54789e75199` |
| BLAKE2b-256 | `b696e54630bda4a6ca2db42185cf60b4a10053dcae584fd435ef8918a58590ac` |
Provenance

The following attestation bundles were made for web_article_extractor-1.1.0-py3-none-any.whl:

Publisher: publish.yml on callidio/web-article-extractor

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: web_article_extractor-1.1.0-py3-none-any.whl
- Subject digest: 2107c4a04907cc77551c91e5b37d8ce08e19d7adfd376b6855244de74cf6a359
- Sigstore transparency entry: 894499879
- Sigstore integration time:
- Permalink: callidio/web-article-extractor@b8c490a3fe7c1f82027d206feeb444a3d8499717
- Branch / Tag: refs/tags/1.1.0
- Owner: https://github.com/callidio
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@b8c490a3fe7c1f82027d206feeb444a3d8499717
- Trigger Event: release