Journalist
A powerful async news content extraction library with modern API for web scraping and article analysis.
Features
🚀 Modern Async API - Built with asyncio for high-performance concurrent scraping
📰 Universal News Support - Works with news websites and content from any language or region
🎯 Smart Content Extraction - Multiple extraction methods (readability, CSS selectors, JSON-LD)
🔄 Flexible Persistence - Memory-only or filesystem persistence modes
🛡️ Error Handling - Robust error handling with custom exception types
📊 Session Management - Built-in session management with race condition protection
🧪 Well Tested - Comprehensive unit tests with high coverage
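To make the JSON-LD extraction method in the list above more concrete, here is a minimal standard-library sketch of pulling article metadata out of a page's `application/ld+json` block. It is an illustration of the general technique only, not the library's own implementation. The names `JsonLdParser` and `extract_article_metadata` are hypothetical, and the sketch handles only a top-level JSON object.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_article_metadata(html):
    """Return title, author and date from the first article-like JSON-LD block."""
    parser = JsonLdParser()
    parser.feed(html)
    for raw in parser.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks
        if isinstance(data, dict) and data.get("@type") in ("NewsArticle", "Article"):
            author = data.get("author")
            if isinstance(author, dict):
                author = author.get("name")
            return {
                "title": data.get("headline"),
                "author": author,
                "published_date": data.get("datePublished"),
            }
    return None  # no usable JSON-LD metadata found
```

A real extractor would typically fall back to another method, such as readability heuristics or CSS selectors, when this returns `None`.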
Installation
Option 1: Using pip (Recommended)
pip install journ4list
Option 2: Using Poetry
poetry add journ4list
Option 3: Development Installation
Using Poetry (Recommended for Development)
# Clone the repository
git clone https://github.com/username/journalist.git
cd journalist
# Install with Poetry
poetry install
# Activate virtual environment
poetry shell
Using pip-tools (Alternative)
# Clone the repository
git clone https://github.com/username/journalist.git
cd journalist
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install pip-tools
pip install pip-tools
# Compile and install dependencies
pip-compile requirements.in --output-file requirements.txt
pip install -r requirements.txt
Quick Start
Basic Usage
import asyncio
from journalist import Journalist

async def main():
    # Create journalist instance
    journalist = Journalist(persist=True, scrape_depth=1)

    # Extract content from news sites
    result = await journalist.read(
        urls=[
            "https://www.bbc.com/news",
            "https://www.reuters.com/"
        ],
        keywords=["technology", "sports", "economy"]
    )

    # Access extracted articles
    for article in result['articles']:
        print(f"Title: {article['title']}")
        print(f"URL: {article['url']}")
        print(f"Content: {article['content'][:200]}...")
        print("-" * 50)

    # Check extraction summary
    summary = result['extraction_summary']
    print(f"Processed {summary['urls_processed']} URLs")
    print(f"Found {summary['articles_extracted']} articles")
    print(f"Extraction took {summary['extraction_time_seconds']} seconds")

# Run the example
asyncio.run(main())
Memory-Only Mode (No File Persistence)
import asyncio
from journalist import Journalist

async def main():
    # Use memory-only mode for temporary scraping
    journalist = Journalist(persist=False)

    result = await journalist.read(
        urls=["https://www.cnn.com/"],
        keywords=["news", "breaking"]
    )

    # Articles are stored in memory only
    print(f"Found {len(result['articles'])} articles")
    print(f"Session ID: {result['session_id']}")

asyncio.run(main())
Concurrent Scraping
import asyncio
from journalist import Journalist

async def scrape_multiple_sources():
    """Example of concurrent scraping with multiple journalist instances."""

    # Create tasks for different news sources
    async def scrape_sports():
        journalist = Journalist(persist=True, scrape_depth=2)
        return await journalist.read(
            urls=["https://www.espn.com/", "https://www.skysports.com/"],
            keywords=["football", "basketball"]
        )

    async def scrape_tech():
        journalist = Journalist(persist=True, scrape_depth=1)
        return await journalist.read(
            urls=["https://www.techcrunch.com/", "https://www.wired.com/"],
            keywords=["technology", "software"]
        )

    # Run concurrently
    sports_task = asyncio.create_task(scrape_sports())
    tech_task = asyncio.create_task(scrape_tech())
    sports_result, tech_result = await asyncio.gather(sports_task, tech_task)

    print(f"Sports articles: {len(sports_result['articles'])}")
    print(f"Tech articles: {len(tech_result['articles'])}")

asyncio.run(scrape_multiple_sources())
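When the number of sources grows, it can help to cap how many scrapes run at once and to keep one failing source from discarding the others' results. The sketch below shows that pattern with `asyncio.Semaphore` and `asyncio.gather(..., return_exceptions=True)`. To stay runnable without network access it uses `fake_read`, a hypothetical stand-in coroutine, where a real program would await `journalist.read(...)`. The `scrape_all` helper and its `max_concurrent` parameter are also assumptions for illustration.

```python
import asyncio


async def fake_read(source):
    """Hypothetical stand-in for a scraping coroutine; fails for one source."""
    await asyncio.sleep(0.01)
    if source == "broken":
        raise RuntimeError(f"could not scrape {source}")
    return {"articles": [f"{source}-article"]}


async def scrape_all(sources, max_concurrent=2):
    """Run scrapes concurrently, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(source):
        async with semaphore:
            return await fake_read(source)

    # return_exceptions=True returns errors as values instead of raising,
    # so one failure does not discard the results of the other sources.
    results = await asyncio.gather(
        *(bounded(s) for s in sources), return_exceptions=True
    )
    succeeded = {s: r for s, r in zip(sources, results) if not isinstance(r, Exception)}
    failed = {s: r for s, r in zip(sources, results) if isinstance(r, Exception)}
    return succeeded, failed


if __name__ == "__main__":
    ok, errors = asyncio.run(scrape_all(["sports", "tech", "broken"]))
    print(f"Succeeded: {sorted(ok)}")
    print(f"Failed: {sorted(errors)}")
```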
Configuration
Journalist Parameters
- persist (bool, default: True) - Enable filesystem persistence for session data
- scrape_depth (int, default: 1) - Depth level for link discovery and scraping
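The exact semantics of scrape_depth are not spelled out here. One common interpretation of a depth limit is a breadth-first walk that stops following links after a fixed number of hops from the starting page. The sketch below illustrates that interpretation only. `discover_links` and the in-memory `link_graph` are hypothetical and stand in for real page fetches; the library may count depth differently.

```python
from collections import deque


def discover_links(start_url, link_graph, max_depth):
    """Breadth-first link discovery limited to max_depth hops from start_url.

    link_graph is a hypothetical mapping of URL -> list of linked URLs.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for linked in link_graph.get(url, []):
            if linked not in seen:
                seen.add(linked)
                queue.append((linked, depth + 1))
    return order


if __name__ == "__main__":
    graph = {"home": ["a", "b"], "a": ["c"], "b": ["a"], "c": ["d"]}
    for depth in range(3):
        print(depth, discover_links("home", graph, depth))
```

Under this reading, each extra level of depth can multiply the number of pages visited, which is why a small default is a sensible starting point.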
Environment Configuration
The library uses sensible defaults but can be configured via the JournalistConfig class:
from journalist.config import JournalistConfig
# Get current workspace path
workspace = JournalistConfig.get_base_workspace_path()
print(f"Workspace: {workspace}") # Output: .journalist_workspace
Error Handling
The library provides custom exception types for better error handling:
import asyncio
from journalist import Journalist
from journalist.exceptions import NetworkError, ExtractionError, ValidationError

async def robust_scraping():
    try:
        journalist = Journalist()
        result = await journalist.read(
            urls=["https://example-news-site.com/"],
            keywords=["important", "news"]
        )
        return result

    except NetworkError as e:
        print(f"Network error: {e}")
        if hasattr(e, 'status_code'):
            print(f"HTTP Status: {e.status_code}")

    except ExtractionError as e:
        print(f"Content extraction failed: {e}")
        if hasattr(e, 'url'):
            print(f"Failed URL: {e.url}")

    except ValidationError as e:
        print(f"Input validation error: {e}")

    except Exception as e:
        print(f"Unexpected error: {e}")

asyncio.run(robust_scraping())
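Network failures are often transient, so a caller may want to retry before giving up. The following is a minimal sketch of retrying an async call with exponential backoff. It is not a feature of the library: `retry_async` is a hypothetical helper, and the `NetworkError` class defined here is a local stand-in so the sketch runs on its own. In real use you would import `NetworkError` from `journalist.exceptions`, as in the example above.

```python
import asyncio


class NetworkError(Exception):
    """Local stand-in so the sketch runs without the library installed."""


async def retry_async(operation, attempts=3, base_delay=0.01):
    """Await operation(), retrying on NetworkError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return await operation()
        except NetworkError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            await asyncio.sleep(base_delay * 2 ** attempt)


async def demo():
    """Simulate an operation that fails twice and then succeeds."""
    calls = 0

    async def flaky():
        nonlocal calls
        calls += 1
        if calls < 3:
            raise NetworkError("temporary failure")
        return "ok"

    result = await retry_async(flaky)
    return result, calls


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

To wrap a real scrape, `operation` could be a small function that returns `journalist.read(urls=..., keywords=...)`. Only `NetworkError` is retried here, on the assumption that extraction and validation errors would fail the same way on every attempt.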
API Reference
Journalist Class
__init__(persist=True, scrape_depth=1)
Initialize a new Journalist instance.
Parameters:
- persist (bool): Enable filesystem persistence
- scrape_depth (int): Link discovery depth level
async read(urls, keywords=None)
Extract content from provided URLs with optional keyword filtering.
Parameters:
- urls (List[str]): List of website URLs to process
- keywords (Optional[List[str]]): Keywords for relevance filtering
Returns:
Dict[str, Any]: Dictionary containing extracted articles and metadata
Return Structure:
{
    'articles': [
        {
            'title': str,
            'url': str,
            'content': str,
            'author': str,
            'published_date': str,
            'keywords_found': List[str]
        }
    ],
    'session_id': str,
    'extraction_summary': {
        'session_id': str,
        'urls_requested': int,
        'urls_processed': int,
        'articles_extracted': int,
        'extraction_time_seconds': float,
        'keywords_used': List[str]
    }
}
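Because the result is a plain dictionary, downstream code can describe it with type hints and process it without any library support. The sketch below mirrors the article fields documented above in a `TypedDict` and groups article titles by the keywords that matched. The `Article` type and `group_titles_by_keyword` helper are hypothetical conveniences, not part of the library.

```python
from collections import defaultdict
from typing import Dict, List, TypedDict


class Article(TypedDict):
    """Mirrors the article fields listed in the return structure."""
    title: str
    url: str
    content: str
    author: str
    published_date: str
    keywords_found: List[str]


def group_titles_by_keyword(articles: List[Article]) -> Dict[str, List[str]]:
    """Map each matched keyword to the titles of the articles containing it."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for article in articles:
        for keyword in article["keywords_found"]:
            grouped[keyword].append(article["title"])
    return dict(grouped)
```

With a real result, this would be called as `group_titles_by_keyword(result['articles'])`.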
Development
Running Tests
# Using Poetry
poetry run pytest
# Using pip
pytest
# With coverage
pytest --cov=journalist --cov-report=html
Code Quality
# Format code
black src tests
# Sort imports
isort src tests
# Type checking
mypy src
# Linting
pylint src
Development Dependencies
The project supports both Poetry and pip-tools for dependency management:
Poetry (pyproject.toml):
poetry install --with dev
pip-tools (requirements.in):
pip-compile requirements.in --output-file requirements.txt
python -m pip install -r requirements.txt
Contributing
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes
- Add tests for new functionality
- Ensure tests pass (pytest)
- Format code (black src tests)
- Commit changes (git commit -m 'Add amazing feature')
- Push to branch (git push origin feature/amazing-feature)
- Open a Pull Request
Changelog
v0.1.0 (2025-06-17)
- Initial release
- Async API for universal news content extraction
- Support for multiple extraction methods
- Memory and filesystem persistence modes
- Comprehensive error handling
- Session management with race condition protection
- Concurrent scraping support
License
This project is licensed under the MIT License - see the LICENSE file for details.
Author
Oktay Burak Ertas
Email: oktay.burak.ertas@gmail.com
Acknowledgments
- Built with modern Python async/await patterns
- Optimized for global news websites
- Inspired by newspaper3k and readability libraries
- Uses BeautifulSoup4 and lxml for HTML parsing