A generic module for extracting text and dates from web articles

Project description

Web Article Extractor

A generic, production-ready Python module for extracting article text and publication dates from web URLs via a three-stage pipeline: HTML parsers (newspaper3k → trafilatura) with a Google Gemini LLM fallback.

License: MIT Python 3.13+ Code style: black

Features

  • 🎯 Three-Stage Extraction: newspaper3k → trafilatura → Gemini LLM fallback
  • 📊 CSV-Based Workflow: Process multiple URLs from CSV with configurable column mappings
  • 🔧 YAML Configuration: Flexible column mapping without code changes
  • 📝 Structured Logging: JSON-formatted logs with CLI-configurable levels
  • 📅 ISO 8601 Dates: Automatic date normalization to standard format
  • 🏗️ Provider Pattern: Extensible architecture for adding new LLM providers
  • High Quality: Black (line length 108), isort, pylint score 10.0, pytest coverage ≥90%
  • 🚀 Production Ready: Pre-commit hooks, CI/CD, comprehensive tests

Installation

# Clone repository
git clone https://github.com/yourusername/web-article-extractor.git
cd web-article-extractor

# Install in development mode
pip install -e ".[dev]"

# Or install from PyPI (when published)
pip install web-article-extractor

Quick Start

1. Set up Gemini API Key

export GEMINI_API_KEY="your-api-key-here"

2. Create Configuration File

Create config.yaml:

id_column: rest_id
url_columns:
  - Web site restaurant
  - Web site Chef
  - Web

3. Run Extraction

web-article-extractor input.csv --output-csv output.csv --config config.yaml --log-level INFO

Usage Examples

Command Line

# Basic usage
web-article-extractor restaurants.csv --output-csv results.csv --config config.yaml

# With debug logging
web-article-extractor input.csv -o output.csv -c config.yaml --log-level DEBUG

# With different log levels
web-article-extractor input.csv --output-csv output.csv --config config.yaml --log-level WARNING

Programmatic Usage

from web_article_extractor import ArticleExtractor
from web_article_extractor.config import Config
from web_article_extractor.logger import setup_logger

# Setup logging
setup_logger("web_article_extractor", "INFO")

# Load configuration
config = Config.from_yaml("config.yaml")

# Create extractor
extractor = ArticleExtractor()

# Process CSV
extractor.process_csv("input.csv", "output.csv", config)

Input/Output Format

Input CSV

Your CSV should contain:

  • One column with unique identifiers (specified in id_column)
  • One or more columns with URLs (specified in url_columns)

Example:

rest_id,Web site restaurant,Web site Chef
1,https://example.com/restaurant,https://example.com/chef
2,https://test.com/place,
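
Given the config from the Quick Start, each non-empty URL cell yields one output row. A stdlib-only sketch of that expansion (the package itself processes CSVs with pandas; the function name here is illustrative):

```python
import csv
import io

# The example input from above.
INPUT_CSV = """rest_id,Web site restaurant,Web site Chef
1,https://example.com/restaurant,https://example.com/chef
2,https://test.com/place,
"""

def expand_rows(text, id_column, url_columns):
    """Yield one (id, url) pair per non-empty URL cell."""
    for row in csv.DictReader(io.StringIO(text)):
        for col in url_columns:
            url = (row.get(col) or "").strip()
            if url:
                yield row[id_column], url

pairs = list(expand_rows(INPUT_CSV, "rest_id",
                         ["Web site restaurant", "Web site Chef"]))
# Three non-empty URL cells in the example → three output rows.
```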

Output CSV

Generated CSV contains:

Column              Description
id                  The identifier from your input CSV
url                 The URL that was processed
extracted_text      Extracted article text
publication_date    ISO 8601 formatted date (YYYY-MM-DD)
extraction_method   Method used: newspaper, trafilatura, or gemini
status              success or error
error_message       Error details if status is error

Three-Stage Extraction Pipeline

  1. newspaper3k (Stage 1)

    • Fast, specialized for news articles
    • Extracts text + publish date
    • Falls back if extraction fails or text < 100 chars
  2. trafilatura (Stage 2)

    • Generic web page extractor
    • Better for diverse site structures
    • Falls back if extraction fails or text < 100 chars
  3. Google Gemini (Stage 3)

    • LLM-powered extraction using Gemini 2.0 Flash
    • Ultimate fallback when HTML parsing fails
    • Uses AI to understand and extract content
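
The fallback rule above can be sketched as follows (the 100-character threshold comes from stages 1 and 2; the stand-in extractor functions are illustrative, and the dict keys mirror the output CSV columns):

```python
MIN_CHARS = 100  # a stage "fails" if it raises or returns less text than this

def extract_with_fallback(url, stages):
    """Try each (name, extractor) stage in order; return the first good result."""
    last_error = None
    for name, extractor in stages:
        try:
            text, date = extractor(url)
        except Exception as exc:  # a crashing stage just triggers the next one
            last_error = exc
            continue
        if text and len(text) >= MIN_CHARS:
            return {"status": "success", "extraction_method": name,
                    "extracted_text": text, "publication_date": date}
    return {"status": "error",
            "error_message": str(last_error) if last_error else "text below threshold"}

# Stand-in extractors: stage 1 returns too little text, stage 2 raises,
# and the LLM stage succeeds.
def short_text(url):
    return "too short", None

def crashes(url):
    raise RuntimeError("parse failed")

def llm(url):
    return "x" * 150, "2024-01-01"

stages = [("newspaper", short_text), ("trafilatura", crashes), ("gemini", llm)]
result = extract_with_fallback("https://example.com/article", stages)
```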

Configuration

YAML Schema

# Required: Column name containing unique identifiers
id_column: id

# Required: List of column names containing URLs to extract
url_columns:
  - url_column_1
  - url_column_2
  - url_column_3
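
The Acknowledgments credit Pydantic for configuration validation; below is a dependency-free sketch of the equivalent checks (field names taken from the schema above; the package's actual Config class may differ):

```python
from dataclasses import dataclass

@dataclass
class Config:
    id_column: str
    url_columns: list

    def __post_init__(self):
        # Mirror the validation Pydantic would perform on the YAML schema.
        if not self.id_column or not isinstance(self.id_column, str):
            raise ValueError("id_column must be a non-empty string")
        if not self.url_columns:
            raise ValueError("url_columns must list at least one column")

# As if parsed from the Quick Start config.yaml:
config = Config(id_column="rest_id",
                url_columns=["Web site restaurant", "Web site Chef", "Web"])
```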

Environment Variables

  • GEMINI_API_KEY: Google Gemini API key (required)

Logging Levels

  • DEBUG: Detailed extraction attempts, all stages
  • INFO: Successful extractions, progress updates (default)
  • WARNING: Recoverable issues
  • ERROR: Failed extractions
  • CRITICAL: System-level failures
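
setup_logger appears in the programmatic example above; here is a minimal stdlib approximation of a JSON-formatted setup (the package itself uses python-json-logger, so this sketch only mimics the idea):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log record."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "name": record.name,
            "message": record.getMessage(),
        })

def setup_logger(name, level):
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(getattr(logging, level))
    return logger

logger = setup_logger("web_article_extractor", "INFO")
```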

Development

Setup Development Environment

# Install with development dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

# Run pre-commit on all files
pre-commit run --all-files

Running Tests

# Run all tests with coverage
pytest

# Run with coverage report
pytest --cov=web_article_extractor --cov-report=html

# Run specific test file
pytest tests/test_unit.py

# Run with verbose output
pytest -v

Code Quality

# Format code with black
black --line-length=108 src/ tests/

# Sort imports
isort --profile=black --line-length=108 src/ tests/

# Run pylint
pylint src/web_article_extractor

# Run all checks (via pre-commit)
pre-commit run --all-files

Architecture

The module follows these design principles:

  • Provider Pattern: Extensible LLM provider system
  • Configuration-Driven: YAML-based, no hardcoded values
  • Structured Logging: JSON logs for production observability
  • Three-Stage Pipeline: HTML parsers first, LLM as fallback
  • ISO 8601 Dates: Standardized date format
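
Date normalization can be sketched with the stdlib alone (the package lists python-dateutil in Requirements, whose parser handles far more formats; the format list and function name here are illustrative):

```python
from datetime import datetime

# A few formats commonly seen in article metadata; dateutil covers many more.
FORMATS = ["%Y-%m-%dT%H:%M:%S", "%Y-%m-%d", "%B %d, %Y", "%d/%m/%Y"]

def to_iso_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None
```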

For detailed architecture documentation, see .github/instructions/architecture.instructions.md.

Extending the Module

Adding a New LLM Provider

# src/web_article_extractor/providers/openai.py
from openai import OpenAI  # requires the openai>=1.0 client library

from .base import BaseAPIProvider

class OpenAIProvider(BaseAPIProvider):
    def get_env_key_name(self) -> str:
        return "OPENAI_API_KEY"

    def get_default_model(self) -> str:
        return "gpt-4"

    def _initialize_client(self):
        # openai>=1.0 uses a client object instead of the module-level api_key
        return OpenAI(api_key=self.api_key)

    def query(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
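
BaseAPIProvider itself is not shown in this README; the following is a speculative sketch of the contract the subclass example implies (method names taken from the example above; the real base class may differ):

```python
import os
from abc import ABC, abstractmethod

class BaseAPIProvider(ABC):
    """Contract implied by the subclass example: read the API key from the
    environment, build a client once, and expose a single query() method."""

    def __init__(self, model=None):
        self.api_key = os.environ.get(self.get_env_key_name())
        if not self.api_key:
            raise RuntimeError(f"{self.get_env_key_name()} is not set")
        self.model = model or self.get_default_model()
        self.client = self._initialize_client()

    @abstractmethod
    def get_env_key_name(self) -> str: ...

    @abstractmethod
    def get_default_model(self) -> str: ...

    @abstractmethod
    def _initialize_client(self): ...

    @abstractmethod
    def query(self, prompt: str) -> str: ...
```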

Requirements

  • Python 3.13+
  • google-generativeai
  • newspaper3k
  • trafilatura
  • pyyaml
  • pandas
  • requests
  • click
  • python-json-logger
  • python-dateutil

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes with tests
  4. Ensure all tests pass and coverage ≥90%
  5. Run pre-commit hooks
  6. Submit a pull request

Support

For issues, questions, or contributions, please open an issue on GitHub.

Acknowledgments

  • newspaper3k for fast news article extraction
  • trafilatura for robust web content extraction
  • Google Gemini for LLM-powered fallback extraction
  • Pydantic for robust configuration validation

Changelog

0.1.0 (2026-02-01)

  • Initial release
  • Three-stage extraction pipeline
  • YAML configuration with Pydantic validation
  • Structured logging
  • CLI interface with options instead of arguments
  • Provider pattern for LLM extensibility
  • Comprehensive test suite (≥90% coverage)
  • One test file per source module following Python standards



Download files


Source Distribution

web_article_extractor-1.1.0.tar.gz (20.3 kB)


Built Distribution


web_article_extractor-1.1.0-py3-none-any.whl (15.4 kB)


File details

Details for the file web_article_extractor-1.1.0.tar.gz.

File metadata

  • Download URL: web_article_extractor-1.1.0.tar.gz
  • Upload date:
  • Size: 20.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for web_article_extractor-1.1.0.tar.gz
Algorithm     Hash digest
SHA256        5f5d596f8a8fca3bb1565c5bb861bf312fc8ee6c89b5fcdc884acf9150e038f1
MD5           4f5bd6d9baea887e78b8ea2af2fbcd48
BLAKE2b-256   196071c115e7129fd304f9e7a9cab0cc89a9eb6f889fb78ad67c42144691c12c


Provenance

The following attestation bundles were made for web_article_extractor-1.1.0.tar.gz:

Publisher: publish.yml on callidio/web-article-extractor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file web_article_extractor-1.1.0-py3-none-any.whl.

File hashes

Hashes for web_article_extractor-1.1.0-py3-none-any.whl
Algorithm     Hash digest
SHA256        2107c4a04907cc77551c91e5b37d8ce08e19d7adfd376b6855244de74cf6a359
MD5           fb394bfc4d2f8200a72ce54789e75199
BLAKE2b-256   b696e54630bda4a6ca2db42185cf60b4a10053dcae584fd435ef8918a58590ac


Provenance

The following attestation bundles were made for web_article_extractor-1.1.0-py3-none-any.whl:

Publisher: publish.yml on callidio/web-article-extractor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
