Web Article Extractor
A generic, production-ready Python module for extracting article text and publication dates from web URLs using a three-stage pipeline: HTML parsers (newspaper3k → trafilatura) with Google Gemini LLM fallback.
Features
- 🎯 Three-Stage Extraction: newspaper3k → trafilatura → Gemini LLM fallback
- 📊 CSV-Based Workflow: Process multiple URLs from CSV with configurable column mappings
- 🔧 YAML Configuration: Flexible column mapping without code changes
- 📝 Structured Logging: JSON-formatted logs with CLI-configurable levels
- 📅 ISO 8601 Dates: Automatic date normalization to standard format
- 🏗️ Provider Pattern: Extensible architecture for adding new LLM providers
- ✅ High Quality: Black (line length 108), isort, pylint score 10.0, pytest coverage ≥90%
- 🚀 Production Ready: Pre-commit hooks, CI/CD, comprehensive tests
Installation
```bash
# Clone repository
git clone https://github.com/yourusername/web-article-extractor.git
cd web-article-extractor

# Install in development mode
pip install -e ".[dev]"

# Or install from PyPI (when published)
pip install web-article-extractor
```
Quick Start
1. Set up Gemini API Key
```bash
export GEMINI_API_KEY="your-api-key-here"
```
2. Create Configuration File
Create `config.yaml`:

```yaml
id_column: rest_id
url_columns:
  - Web site restaurant
  - Web site Chef
  - Web
```
3. Run Extraction
```bash
web-article-extractor input.csv --output-csv output.csv --config config.yaml --log-level INFO
```
Usage Examples
Command Line
```bash
# Basic usage
web-article-extractor restaurants.csv --output-csv results.csv --config config.yaml

# With debug logging
web-article-extractor input.csv -o output.csv -c config.yaml --log-level DEBUG

# With different log levels
web-article-extractor input.csv --output-csv output.csv --config config.yaml --log-level WARNING
```
Programmatic Usage
```python
from web_article_extractor import ArticleExtractor
from web_article_extractor.config import Config
from web_article_extractor.logger import setup_logger

# Setup logging
setup_logger("web_article_extractor", "INFO")

# Load configuration
config = Config.from_yaml("config.yaml")

# Create extractor
extractor = ArticleExtractor()

# Process CSV
extractor.process_csv("input.csv", "output.csv", config)
```
Input/Output Format
Input CSV
Your CSV should contain:
- One column with unique identifiers (specified in `id_column`)
- One or more columns with URLs (specified in `url_columns`)
Example:
```csv
rest_id,Web site restaurant,Web site Chef
1,https://example.com/restaurant,https://example.com/chef
2,https://test.com/place,
```
Output CSV
Generated CSV contains:
| Column | Description |
|---|---|
| `id` | The identifier from your input CSV |
| `url` | The URL that was processed |
| `extracted_text` | Extracted article text |
| `publication_date` | ISO 8601 formatted date (YYYY-MM-DD) |
| `extraction_method` | Method used: `newspaper`, `trafilatura`, or `gemini` |
| `status` | `success` or `error` |
| `error_message` | Error details if status is `error` |
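As a sketch of downstream use, the output CSV can be loaded with pandas (already a project dependency). The column names match the table above; the sample rows below are made up for illustration:

```python
import io

import pandas as pd

# A miniature output CSV with the columns described above (sample data).
output_csv = io.StringIO(
    "id,url,extracted_text,publication_date,extraction_method,status,error_message\n"
    "1,https://example.com/restaurant,Some article text,2024-03-05,newspaper,success,\n"
    "2,https://test.com/place,,,,error,HTTP 404\n"
)

df = pd.read_csv(output_csv)

# Keep only rows that were extracted successfully.
successes = df[df["status"] == "success"]
print(len(successes))  # 1
```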
Three-Stage Extraction Pipeline
1. newspaper3k (Stage 1)
   - Fast, specialized for news articles
   - Extracts text + publish date
   - Falls back if extraction fails or text < 100 chars
2. trafilatura (Stage 2)
   - Generic web page extractor
   - Better for diverse site structures
   - Falls back if extraction fails or text < 100 chars
3. Google Gemini (Stage 3)
   - LLM-powered extraction using Gemini 2.0 Flash
   - Ultimate fallback when HTML parsing fails
   - Uses AI to understand and extract content
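The fallback chain above can be sketched generically. Note this is an illustration of the pattern, not the project's actual API: `MIN_CHARS`, the stage signatures, and the dummy stages are all assumptions:

```python
from typing import Callable, Optional, Tuple

MIN_CHARS = 100  # threshold from the pipeline description

# Each stage takes a URL and returns (text, date); it may fail by
# raising or by returning too little text.
Stage = Callable[[str], Tuple[Optional[str], Optional[str]]]


def extract_with_fallback(url: str, stages: dict) -> dict:
    """Try each stage in order; fall through when text is missing or too short."""
    for name, stage in stages.items():
        try:
            text, date = stage(url)
        except Exception:
            continue
        if text and len(text) >= MIN_CHARS:
            return {"text": text, "date": date, "method": name}
    return {"text": None, "date": None, "method": None}


# Dummy stages for illustration: the first two fail, the third succeeds.
stages = {
    "newspaper": lambda url: ("too short", None),
    "trafilatura": lambda url: (None, None),
    "gemini": lambda url: ("x" * 150, "2024-03-05"),
}
result = extract_with_fallback("https://example.com", stages)
print(result["method"])  # gemini
```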
Configuration
YAML Schema
```yaml
# Required: Column name containing unique identifiers
id_column: id

# Required: List of column names containing URLs to extract
url_columns:
  - url_column_1
  - url_column_2
  - url_column_3
```
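The real module validates this schema with Pydantic. As a dependency-free illustration of the same shape, here is a minimal stand-in (the `ConfigSketch` class below is hypothetical, not the project's `Config`):

```python
from dataclasses import dataclass, field


@dataclass
class ConfigSketch:
    """Minimal stand-in for the YAML schema above."""
    id_column: str
    url_columns: list = field(default_factory=list)

    def __post_init__(self):
        # Mirror the "Required" comments in the schema.
        if not self.id_column:
            raise ValueError("id_column is required")
        if not self.url_columns:
            raise ValueError("url_columns must list at least one column")


# Equivalent of the dict produced by yaml.safe_load on the example above.
raw = {"id_column": "id", "url_columns": ["url_column_1", "url_column_2"]}
cfg = ConfigSketch(**raw)
print(cfg.id_column)  # id
```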
Environment Variables
- `GEMINI_API_KEY`: Google Gemini API key (required)
Logging Levels
- `DEBUG`: Detailed extraction attempts, all stages
- `INFO`: Successful extractions, progress updates (default)
- `WARNING`: Recoverable issues
- `ERROR`: Failed extractions
- `CRITICAL`: System-level failures
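The project lists python-json-logger for its structured logs. As a stdlib-only sketch of what one-JSON-object-per-line logging looks like (the formatter below is illustrative, not the project's `setup_logger`):

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "name": record.name,
            "message": record.getMessage(),
        })


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("web_article_extractor.demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("extracted %s", "https://example.com")
line = json.loads(stream.getvalue())
print(line["level"])  # INFO
```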
Development
Setup Development Environment
```bash
# Install with development dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

# Run pre-commit on all files
pre-commit run --all-files
```
Running Tests
```bash
# Run all tests with coverage
pytest

# Run with coverage report
pytest --cov=web_article_extractor --cov-report=html

# Run specific test file
pytest tests/test_unit.py

# Run with verbose output
pytest -v
```
Code Quality
```bash
# Format code with black
black --line-length=108 src/ tests/

# Sort imports
isort --profile=black --line-length=108 src/ tests/

# Run pylint
pylint src/web_article_extractor

# Run all checks (via pre-commit)
pre-commit run --all-files
```
Architecture
The module follows these design principles:
- Provider Pattern: Extensible LLM provider system
- Configuration-Driven: YAML-based, no hardcoded values
- Structured Logging: JSON logs for production observability
- Three-Stage Pipeline: HTML parsers first, LLM as fallback
- ISO 8601 Dates: Standardized date format
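On the ISO 8601 principle: publication dates arrive in many shapes and are normalized to YYYY-MM-DD. The project depends on python-dateutil for this; below is a stdlib sketch of the idea (the format list is illustrative, not the project's actual logic):

```python
from datetime import datetime
from typing import Optional


def to_iso(raw: str) -> Optional[str]:
    """Normalize a raw date string to ISO 8601 (YYYY-MM-DD), or None."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y"):
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return None


print(to_iso("March 5, 2024"))  # 2024-03-05
print(to_iso("05/03/2024"))     # 2024-03-05
```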
For detailed architecture documentation, see .github/instructions/architecture.instructions.md.
Extending the Module
Adding a New LLM Provider
```python
# src/web_article_extractor/providers/openai.py
from openai import OpenAI

from .base import BaseAPIProvider


class OpenAIProvider(BaseAPIProvider):
    def get_env_key_name(self) -> str:
        return "OPENAI_API_KEY"

    def get_default_model(self) -> str:
        return "gpt-4"

    def _initialize_client(self):
        return OpenAI(api_key=self.api_key)

    def query(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
```
Requirements
- Python 3.13+
- google-generativeai
- newspaper3k
- trafilatura
- pyyaml
- pandas
- requests
- click
- python-json-logger
- python-dateutil
License
MIT License - see LICENSE file for details.
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes with tests
- Ensure all tests pass and coverage ≥90%
- Run pre-commit hooks
- Submit a pull request
Support
For issues, questions, or contributions, please open an issue on GitHub.
Acknowledgments
- newspaper3k for fast news article extraction
- trafilatura for robust web content extraction
- Google Gemini for LLM-powered fallback extraction
- Pydantic for robust configuration validation
Changelog
0.1.0 (2026-02-01)
- Initial release
- Three-stage extraction pipeline
- YAML configuration with Pydantic validation
- Structured logging
- CLI interface with options instead of arguments
- Provider pattern for LLM extensibility
- Comprehensive test suite (≥90% coverage)
- One test file per source module following Python standards
File details

Details for the file web_article_extractor-1.1.0.tar.gz.

File metadata
- Download URL: web_article_extractor-1.1.0.tar.gz
- Upload date:
- Size: 20.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `5f5d596f8a8fca3bb1565c5bb861bf312fc8ee6c89b5fcdc884acf9150e038f1` |
| MD5 | `4f5bd6d9baea887e78b8ea2af2fbcd48` |
| BLAKE2b-256 | `196071c115e7129fd304f9e7a9cab0cc89a9eb6f889fb78ad67c42144691c12c` |
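A downloaded file can be checked against a SHA256 digest like the one above using hashlib (the path in the usage comment is a placeholder):

```python
import hashlib


def sha256_of(path: str) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


# Usage against a downloaded distribution (placeholder path):
# sha256_of("web_article_extractor-1.1.0.tar.gz")

# Quick sanity check of the helper on known bytes:
print(hashlib.sha256(b"hello").hexdigest()[:8])  # 2cf24dba
```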
Provenance

The following attestation bundles were made for web_article_extractor-1.1.0.tar.gz:

Publisher: publish.yml on callidio/web-article-extractor

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: web_article_extractor-1.1.0.tar.gz
- Subject digest: 5f5d596f8a8fca3bb1565c5bb861bf312fc8ee6c89b5fcdc884acf9150e038f1
- Sigstore transparency entry: 894499829
- Sigstore integration time:
- Permalink: callidio/web-article-extractor@b8c490a3fe7c1f82027d206feeb444a3d8499717
- Branch / Tag: refs/tags/1.1.0
- Owner: https://github.com/callidio
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@b8c490a3fe7c1f82027d206feeb444a3d8499717
- Trigger Event: release
File details

Details for the file web_article_extractor-1.1.0-py3-none-any.whl.

File metadata
- Download URL: web_article_extractor-1.1.0-py3-none-any.whl
- Upload date:
- Size: 15.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `2107c4a04907cc77551c91e5b37d8ce08e19d7adfd376b6855244de74cf6a359` |
| MD5 | `fb394bfc4d2f8200a72ce54789e75199` |
| BLAKE2b-256 | `b696e54630bda4a6ca2db42185cf60b4a10053dcae584fd435ef8918a58590ac` |
Provenance

The following attestation bundles were made for web_article_extractor-1.1.0-py3-none-any.whl:

Publisher: publish.yml on callidio/web-article-extractor

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: web_article_extractor-1.1.0-py3-none-any.whl
- Subject digest: 2107c4a04907cc77551c91e5b37d8ce08e19d7adfd376b6855244de74cf6a359
- Sigstore transparency entry: 894499879
- Sigstore integration time:
- Permalink: callidio/web-article-extractor@b8c490a3fe7c1f82027d206feeb444a3d8499717
- Branch / Tag: refs/tags/1.1.0
- Owner: https://github.com/callidio
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@b8c490a3fe7c1f82027d206feeb444a3d8499717
- Trigger Event: release