Journalist

A powerful async news content extraction library with modern API for web scraping and article analysis.

Features

🚀 Modern Async API - Built with asyncio for high-performance concurrent scraping
📰 Universal News Support - Works with news websites and content from any language or region
🎯 Smart Content Extraction - Multiple extraction methods (readability, CSS selectors, JSON-LD)
🔄 Flexible Persistence - Memory-only or filesystem persistence modes
🛡️ Error Handling - Robust error handling with custom exception types
📊 Session Management - Built-in session management with race condition protection
🧪 Well Tested - Comprehensive unit tests with high coverage
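
To make the JSON-LD method named in the feature list more concrete, here is a minimal standard-library sketch that reads article metadata from `<script type="application/ld+json">` blocks. It is not the library's implementation; `JsonLdCollector`, `extract_json_ld_articles` and the sample HTML are made up for this illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.blocks.append("".join(self._buffer))
            self._in_json_ld = False


def extract_json_ld_articles(html):
    """Return JSON-LD objects whose @type is Article or NewsArticle."""
    collector = JsonLdCollector()
    collector.feed(html)
    articles = []
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than fail the whole page
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and item.get("@type") in ("NewsArticle", "Article"):
                articles.append(item)
    return articles


# Hypothetical page used only for this example
sample_html = """
<html><head>
<script type="application/ld+json">
{"@type": "NewsArticle", "headline": "Example headline", "author": {"name": "A. Writer"}}
</script>
</head><body><p>Body text</p></body></html>
"""

found = extract_json_ld_articles(sample_html)
print(found[0]["headline"])
```

Real pages often nest articles inside an `@graph` array or use other `@type` values, which this sketch does not handle.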

Installation

Option 1: Using pip (Recommended)

pip install journ4list

Option 2: Using Poetry

poetry add journ4list

Option 3: Development Installation

Using Poetry (Recommended for Development)

# Clone the repository
git clone https://github.com/oktay-be/journalist.git
cd journalist

# Install with Poetry
poetry install

# Activate virtual environment
poetry shell

Using pip-tools (Alternative)

# Clone the repository
git clone https://github.com/oktay-be/journalist.git
cd journalist

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install pip-tools
pip install pip-tools

# Compile and install dependencies
pip-compile requirements.in --output-file requirements.txt
pip install -r requirements.txt

Quick Start

Basic Usage

import asyncio
from journalist import Journalist

async def main():
    # Create journalist instance
    journalist = Journalist(persist=True, scrape_depth=1)

    # Extract content from news sites
    result = await journalist.read(
        urls=[
            "https://www.bbc.com/news",
            "https://www.reuters.com/"
        ],
        keywords=["technology", "sports", "economy"]
    )

    # Access extracted articles
    for article in result['articles']:
        print(f"Title: {article['title']}")
        print(f"URL: {article['url']}")
        print(f"Content: {article['content'][:200]}...")
        print("-" * 50)

    # Check extraction summary
    summary = result['extraction_summary']
    print(f"Processed {summary['urls_processed']} URLs")
    print(f"Found {summary['articles_extracted']} articles")
    print(f"Extraction took {summary['extraction_time_seconds']} seconds")

# Run the example
asyncio.run(main())

Memory-Only Mode (No File Persistence)

import asyncio
from journalist import Journalist

async def main():
    # Use memory-only mode for temporary scraping
    journalist = Journalist(persist=False)

    result = await journalist.read(
        urls=["https://www.cnn.com/"],
        keywords=["news", "breaking"]
    )

    # Articles are stored in memory only
    print(f"Found {len(result['articles'])} articles")
    print(f"Session ID: {result['session_id']}")

asyncio.run(main())
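
Since memory-only mode writes nothing to disk, a caller who wants to keep a particular result has to save it explicitly. The sketch below shows one way to do that with the standard library. It assumes the result dictionary is JSON-serializable and that the session ID is safe to use as a file name; `save_result` and the sample data are hypothetical, not part of the library.

```python
import json
import tempfile
from pathlib import Path


def save_result(result, directory):
    """Write a result dictionary to <directory>/<session_id>.json and return the path."""
    path = Path(directory) / f"{result['session_id']}.json"
    path.write_text(json.dumps(result, ensure_ascii=False, indent=2), encoding="utf-8")
    return path


# Hand-written stand-in for the dictionary returned by read()
sample_result = {
    "articles": [
        {"title": "Example", "url": "https://example.com/a", "content": "Text"}
    ],
    "session_id": "demo-session",
}

with tempfile.TemporaryDirectory() as tmp:
    saved = save_result(sample_result, tmp)
    # Read the file back to confirm the round trip preserves the data
    reloaded = json.loads(saved.read_text(encoding="utf-8"))

print(reloaded["session_id"])
```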

Concurrent Scraping

import asyncio
from journalist import Journalist

async def scrape_multiple_sources():
    """Example of concurrent scraping with multiple journalist instances."""

    # Create tasks for different news sources
    async def scrape_sports():
        journalist = Journalist(persist=True, scrape_depth=2)
        return await journalist.read(
            urls=["https://www.espn.com/", "https://www.skysports.com/"],
            keywords=["football", "basketball"]
        )

    async def scrape_tech():
        journalist = Journalist(persist=True, scrape_depth=1)
        return await journalist.read(
            urls=["https://www.techcrunch.com/", "https://www.wired.com/"],
            keywords=["technology", "software"]
        )

    # Run concurrently
    sports_task = asyncio.create_task(scrape_sports())
    tech_task = asyncio.create_task(scrape_tech())

    sports_result, tech_result = await asyncio.gather(sports_task, tech_task)

    print(f"Sports articles: {len(sports_result['articles'])}")
    print(f"Tech articles: {len(tech_result['articles'])}")

asyncio.run(scrape_multiple_sources())
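
With plain `asyncio.gather`, the first exception raised by any task propagates to the caller and the other results are lost. The sketch below shows a variant that uses `return_exceptions=True` so one failing source does not discard the rest. `fake_scrape` is a stand-in for a coroutine that awaits `journalist.read(...)`; it is used here only so the example runs without network access.

```python
import asyncio


async def fake_scrape(name, fail=False):
    """Stand-in for a coroutine that awaits Journalist.read()."""
    await asyncio.sleep(0)
    if fail:
        raise RuntimeError(f"{name} failed")
    return {"articles": [{"title": f"{name} story"}]}


async def scrape_all():
    names = ["sports", "tech", "finance"]
    # return_exceptions=True returns exceptions as values instead of raising them
    outcomes = await asyncio.gather(
        fake_scrape("sports"),
        fake_scrape("tech", fail=True),
        fake_scrape("finance"),
        return_exceptions=True,
    )
    succeeded = {}
    failed = {}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, Exception):
            failed[name] = str(outcome)
        else:
            succeeded[name] = outcome
    return succeeded, failed


succeeded, failed = asyncio.run(scrape_all())
print(sorted(succeeded), failed)
```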

Configuration

Journalist Parameters

persist (bool, default: True) - Enable filesystem persistence for session data
scrape_depth (int, default: 1) - Depth level for link discovery and scraping
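
The README does not spell out how scrape_depth is counted. As a rough mental model only, the sketch below assumes depth means the number of link hops followed from each starting URL, and runs a breadth-first walk over a small in-memory link graph. `discover` and the `site` mapping are hypothetical and do not reflect the library's internals.

```python
from collections import deque


def discover(links, seeds, depth):
    """Breadth-first link discovery, following at most `depth` hops from the seeds."""
    seen = set(seeds)
    order = list(seeds)
    queue = deque((url, 0) for url in seeds)
    while queue:
        url, hops = queue.popleft()
        if hops == depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append((target, hops + 1))
    return order


# Toy link graph: each page maps to the pages it links to
site = {
    "/": ["/news", "/sport"],
    "/news": ["/news/story-1"],
    "/sport": ["/sport/match-1"],
}

print(discover(site, ["/"], depth=1))
print(discover(site, ["/"], depth=2))
```

Under this assumption, each extra level of depth can multiply the number of pages fetched, so small values are the cautious choice.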

Environment Configuration

The library uses sensible defaults but can be configured via the JournalistConfig class:

from journalist.config import JournalistConfig

# Get current workspace path
workspace = JournalistConfig.get_base_workspace_path()
print(f"Workspace: {workspace}")  # Output: .journalist_workspace

Error Handling

The library provides custom exception types for better error handling:

import asyncio
from journalist import Journalist
from journalist.exceptions import NetworkError, ExtractionError, ValidationError

async def robust_scraping():
    try:
        journalist = Journalist()
        result = await journalist.read(
            urls=["https://example-news-site.com/"],
            keywords=["important", "news"]
        )
        return result

    except NetworkError as e:
        print(f"Network error: {e}")
        if hasattr(e, 'status_code'):
            print(f"HTTP Status: {e.status_code}")

    except ExtractionError as e:
        print(f"Content extraction failed: {e}")
        if hasattr(e, 'url'):
            print(f"Failed URL: {e.url}")

    except ValidationError as e:
        print(f"Input validation error: {e}")

    except Exception as e:
        print(f"Unexpected error: {e}")

asyncio.run(robust_scraping())
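
Network failures are often transient, so a caller may prefer to retry rather than only print the error. The sketch below wraps an awaitable operation in a retry loop with exponential backoff. The `NetworkError` class defined here is a local stand-in so the example runs on its own; `with_retries` and `flaky_read` are hypothetical helpers, not part of the library.

```python
import asyncio


class NetworkError(Exception):
    """Local stand-in for journalist.exceptions.NetworkError."""


async def with_retries(operation, attempts=3, base_delay=0.01):
    """Await operation(), retrying on NetworkError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return await operation()
        except NetworkError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller handle it
            await asyncio.sleep(base_delay * 2 ** attempt)


calls = {"count": 0}


async def flaky_read():
    """Fails twice, then succeeds; imitates a transient network problem."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise NetworkError("temporary failure")
    return {"articles": [], "session_id": "demo"}


result = asyncio.run(with_retries(flaky_read))
print(calls["count"], result["session_id"])
```

In real use, `operation` could be a small function that returns `journalist.read(...)`, and `base_delay` would normally be on the order of seconds.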

API Reference

Journalist Class

`__init__(persist=True, scrape_depth=1)`

Initialize a new Journalist instance.

Parameters:

persist (bool): Enable filesystem persistence
scrape_depth (int): Link discovery depth level

`async read(urls, keywords=None)`

Extract content from provided URLs with optional keyword filtering.

Parameters:

urls (List[str]): List of website URLs to process
keywords (Optional[List[str]]): Keywords for relevance filtering
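
The README does not describe how relevance is computed. The sketch below assumes the simplest possible rule, a case-insensitive substring match on title and content, to show what keyword filtering could look like. `find_keywords` and `filter_articles` are hypothetical helpers.

```python
def find_keywords(text, keywords):
    """Return the keywords that occur in text, ignoring case."""
    lowered = text.casefold()
    return [keyword for keyword in keywords if keyword.casefold() in lowered]


def filter_articles(articles, keywords):
    """Keep articles matching at least one keyword and record which ones matched."""
    kept = []
    for article in articles:
        matched = find_keywords(article["title"] + " " + article["content"], keywords)
        if matched:
            kept.append({**article, "keywords_found": matched})
    return kept


# Hand-written sample articles (not real output)
sample_articles = [
    {"title": "New Technology Fund Launches", "content": "Investors welcome the move."},
    {"title": "Weather update", "content": "Rain expected on Friday."},
]

kept = filter_articles(sample_articles, ["technology", "economy"])
print([article["title"] for article in kept])
```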

Returns:

Dict[str, Any]: Dictionary containing extracted articles and metadata

Return Structure:

{
    'articles': [
        {
            'title': str,
            'url': str,
            'content': str,
            'author': str,
            'published_date': str,
            'keywords_found': List[str]
        }
    ],
    'session_id': str,
    'extraction_summary': {
        'session_id': str,
        'urls_requested': int,
        'urls_processed': int,
        'articles_extracted': int,
        'extraction_time_seconds': float,
        'keywords_used': List[str]
    }
}
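
For callers who use a type checker, the structure above can be written down as `TypedDict` classes. This is only a sketch: the class names `Article`, `ExtractionSummary` and `ReadResult` and the helper `articles_by_keyword` are hypothetical, and the sample data is hand-written rather than real output.

```python
from typing import Dict, List, TypedDict


class Article(TypedDict):
    title: str
    url: str
    content: str
    author: str
    published_date: str
    keywords_found: List[str]


class ExtractionSummary(TypedDict):
    session_id: str
    urls_requested: int
    urls_processed: int
    articles_extracted: int
    extraction_time_seconds: float
    keywords_used: List[str]


class ReadResult(TypedDict):
    articles: List[Article]
    session_id: str
    extraction_summary: ExtractionSummary


def articles_by_keyword(result: ReadResult) -> Dict[str, List[str]]:
    """Group article titles under each keyword they matched."""
    grouped: Dict[str, List[str]] = {}
    for article in result["articles"]:
        for keyword in article["keywords_found"]:
            grouped.setdefault(keyword, []).append(article["title"])
    return grouped


# Hand-written sample shaped like the documented return structure
sample_result: ReadResult = {
    "articles": [
        {
            "title": "Chip makers report record demand",
            "url": "https://example.com/chips",
            "content": "...",
            "author": "",
            "published_date": "",
            "keywords_found": ["technology", "economy"],
        },
        {
            "title": "Markets close higher",
            "url": "https://example.com/markets",
            "content": "...",
            "author": "",
            "published_date": "",
            "keywords_found": ["economy"],
        },
    ],
    "session_id": "demo-session",
    "extraction_summary": {
        "session_id": "demo-session",
        "urls_requested": 1,
        "urls_processed": 1,
        "articles_extracted": 2,
        "extraction_time_seconds": 0.5,
        "keywords_used": ["technology", "economy"],
    },
}

grouped = articles_by_keyword(sample_result)
print(grouped)
```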

Development

Running Tests

# Using Poetry
poetry run pytest

# Using pip
pytest

# With coverage
pytest --cov=journalist --cov-report=html

Code Quality

# Format code
black src tests

# Sort imports
isort src tests

# Type checking
mypy src

# Linting
pylint src

Development Dependencies

The project supports both Poetry and pip-tools for dependency management:

Poetry (pyproject.toml):

poetry install --with dev

pip-tools (requirements.in):

pip-compile requirements.in --output-file requirements.txt
python -m pip install -r requirements.txt

Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Make your changes
Add tests for new functionality
Ensure tests pass (pytest)
Format code (black src tests)
Commit changes (git commit -m 'Add amazing feature')
Push to branch (git push origin feature/amazing-feature)
Open a Pull Request

Changelog

v0.1.0 (2025-06-17)

Initial release
Async API for universal news content extraction
Support for multiple extraction methods
Memory and filesystem persistence modes
Comprehensive error handling
Session management with race condition protection
Concurrent scraping support

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Oktay Burak Ertas
Email: oktay.burak.ertas@gmail.com

Acknowledgments

Built with modern Python async/await patterns
Optimized for global news websites
Inspired by newspaper3k and readability libraries
Uses BeautifulSoup4 and lxml for HTML parsing

Release history

0.13.0 - Dec 27, 2025
0.12.0 - Dec 21, 2025
0.11.0 - Jul 24, 2025
0.10.0 - Jul 24, 2025
0.9.0 - Jul 24, 2025
0.8.0 - Jul 24, 2025
0.7.0 - Jul 22, 2025
0.6.0 - Jul 22, 2025
0.5.0 - Jul 22, 2025
0.4.0 - Jul 22, 2025
0.3.0 - Jun 26, 2025 (this version)
0.2.0 - Jun 21, 2025
0.1.0 - Jun 18, 2025

Download files

Source Distribution

journ4list-0.3.0.tar.gz (35.9 kB)

Uploaded Jun 26, 2025 Source

Built Distribution

journ4list-0.3.0-py3-none-any.whl (42.9 kB)

Uploaded Jun 26, 2025 Python 3

File details

Details for the file journ4list-0.3.0.tar.gz.

File metadata

Download URL: journ4list-0.3.0.tar.gz
Upload date: Jun 26, 2025
Size: 35.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for journ4list-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`cd71e121716bd42952687fbf255bc9d1253891860f75431308f1930b37a331a4`
MD5	`186668119217cd54ecfdc078424e92e9`
BLAKE2b-256	`9992a84241b2bb913b3d01e238db097bf42e8abc8160b99adef4de896aba7439`

Provenance

The following attestation bundles were made for journ4list-0.3.0.tar.gz:

Publisher: publish.yml on oktay-be/journalist

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: journ4list-0.3.0.tar.gz
- Subject digest: cd71e121716bd42952687fbf255bc9d1253891860f75431308f1930b37a331a4
- Sigstore transparency entry: 251901699
- Sigstore integration time: Jun 26, 2025
Source repository:
- Permalink: oktay-be/journalist@dffc57171175cf261fec1e56df0412eb76dc539f
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/oktay-be
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@dffc57171175cf261fec1e56df0412eb76dc539f
- Trigger Event: push

File details

Details for the file journ4list-0.3.0-py3-none-any.whl.

File metadata

Download URL: journ4list-0.3.0-py3-none-any.whl
Upload date: Jun 26, 2025
Size: 42.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for journ4list-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`155c52aa14e41997b147d938667e9e243e607f699b639a0595b2ae535d9606cf`
MD5	`87bd1910b7666bb001955c907c329313`
BLAKE2b-256	`dd5aed411f263b49b71620e295a775d4cb636007edf8f9cc5068b39b0b705c25`

Provenance

The following attestation bundles were made for journ4list-0.3.0-py3-none-any.whl:

Publisher: publish.yml on oktay-be/journalist

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: journ4list-0.3.0-py3-none-any.whl
- Subject digest: 155c52aa14e41997b147d938667e9e243e607f699b639a0595b2ae535d9606cf
- Sigstore transparency entry: 251901710
- Sigstore integration time: Jun 26, 2025
Source repository:
- Permalink: oktay-be/journalist@dffc57171175cf261fec1e56df0412eb76dc539f
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/oktay-be
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@dffc57171175cf261fec1e56df0412eb76dc539f
- Trigger Event: push
