Yosoi - AI-Powered CSS Selector Discovery
Discover CSS selectors once with AI, scrape forever with BeautifulSoup
Give Yosoi a URL, and it uses AI to automatically discover the best CSS selectors for extracting headlines, authors, dates, body text, and related content. Discovery takes 3 seconds and costs $0.001 per domain — then scrape thousands of articles for free with BeautifulSoup.
Key Benefits:
- Fast: 3 seconds to discover selectors per domain
- Cheap: $0.001 per domain (one-time cost)
- Accurate: Validates selectors before saving
- Reusable: Discover once, use forever
- Production-Ready: Type-safe, linted, tested
Quick Start
Installation
# Clone the repository
git clone <your-repo>
cd yosoi
# Install dependencies (using uv)
uv sync
# For development tools
uv sync --group dev
Configuration
Create a .env file (see env.example):
# Set keys for whichever providers you want to use
GROQ_KEY=your_groq_api_key_here # groq/...
GEMINI_KEY=your_gemini_api_key_here # gemini/...
OPENAI_KEY=your_openai_api_key_here # openai/...
CEREBRAS_KEY=your_cerebras_api_key_here # cerebras/...
OPENROUTER_KEY=your_openrouter_key_here # openrouter/...
# Optional: Observability
LOGFIRE_TOKEN=your_logfire_token_here # For Logfire tracing
Get API Keys:
- Groq (Free): https://console.groq.com/keys
- Gemini: https://aistudio.google.com/app/apikey
- OpenRouter (Free tier available): https://openrouter.ai/keys
- Logfire (Optional): https://logfire.pydantic.dev
Note: some free models on OpenRouter require adjusting your privacy settings to allow training on your data.
Basic Usage
# Process a single URL (uses GROQ_KEY or GEMINI_KEY from .env)
uv run yosoi --url https://example.com/article
# Specify model explicitly with -m provider/model-name
uv run yosoi -m groq/llama-3.3-70b-versatile --url https://example.com/article
uv run yosoi -m gemini/gemini-2.0-flash --url https://example.com/article
uv run yosoi -m openai/gpt-4o --url https://example.com/article
uv run yosoi -m openrouter/gpt-oss --url https://example.com/article
# Process multiple URLs from a file
uv run yosoi -m groq/llama-3.3-70b-versatile --file urls.txt
# Force re-discovery
uv run yosoi --url https://example.com --force
# Show summary of all saved selectors
uv run yosoi --summary
# Enable debug mode (saves extracted HTML)
uv run yosoi --url https://example.com --debug
URLs File Format
Create urls.txt with one URL per line:
https://example.com/article1
https://example.com/article2
# Comments are allowed
https://example.com/article3
Or use JSON format (urls.json):
[
  {"url": "https://example.com/article1"},
  {"url": "https://example.com/article2"}
]
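Either format can be parsed in a few lines. The sketch below is a hypothetical loader for illustration, not Yosoi's actual CLI code:

```python
import json
from pathlib import Path

def load_urls(path: str) -> list[str]:
    """Load URLs from a .txt file (one per line, # comments allowed) or a .json list.

    Hypothetical helper; Yosoi's CLI has its own loader.
    """
    text = Path(path).read_text()
    if path.endswith(".json"):
        return [entry["url"] for entry in json.loads(text)]
    return [
        line.strip()
        for line in text.splitlines()
        if line.strip() and not line.strip().startswith("#")
    ]
```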
Project Structure
.
├── .yosoi/ # Hidden runtime directory
│ └── selectors/ # Persisted selector snapshots
├── yosoi/ # Main package
│ ├── __init__.py
│ ├── __main__.py
│ ├── cli.py # CLI entry point & argument parsing
│ ├── core/ # Core pipeline logic
│ │ ├── pipeline.py # SelectorDiscoveryPipeline orchestrator
│ │ ├── cleaning/
│ │ │ └── cleaner.py # HTML noise removal
│ │ ├── discovery/
│ │ │ ├── agent.py # pydantic-ai agent definition
│ │ │ └── config.py # Model / provider configuration
│ │ ├── extraction/
│ │ │ └── extractor.py # Relevant HTML extraction
│ │ ├── fetcher/
│ │ │ ├── base.py # Abstract fetcher interface
│ │ │ ├── simple.py # httpx-based fetcher
│ │ │ ├── playwright.py # Playwright fetcher (JS-heavy sites)
│ │ │ └── smart.py # Auto-selects fetcher strategy
│ │ └── verification/
│ │ └── verifier.py # CSS selector verification
│ ├── models/ # Pydantic data models
│ │ ├── selectors.py # SelectorSet & field models
│ │ └── results.py # Pipeline result types
│ ├── storage/ # Persistence layer
│ │ ├── persistence.py # JSON read/write for selectors
│ │ ├── tracking.py # Domain-level tracking & stats
│ │ └── debug.py # Debug HTML snapshot storage
│ ├── outputs/ # Output formatters
│ │ ├── json.py # JSON report formatter
│ │ ├── markdown.py # Markdown / Rich table output
│ │ └── utils.py # Shared output helpers
│ ├── prompts/ # LLM prompt templates (markdown)
│ │ ├── discovery_system.md
│ │ └── discovery_user.md
│ └── utils/ # Shared utilities
│ ├── exceptions.py # Custom exception hierarchy
│ ├── files.py # File-path helpers
│ ├── headers.py # HTTP header rotation
│ ├── logging.py # Structured logging setup
│ ├── prompts.py # Prompt loader utility
│ └── retry.py # Retry / back-off helpers
├── tests/
│ ├── conftest.py
│ ├── integration/
│ │ ├── test_pipeline.py # End-to-end pipeline tests
│ │ └── test_snapshots.py # Selector snapshot regression tests
│ └── unit/
│ ├── test_discovery_bs4.py
│ ├── test_pipeline.py
│ └── test_pydantic_flow.py
├── pyproject.toml # Project config & dependencies
├── env.example # API key template
└── urls.txt # Example URL list
How It Works
Phase 1: Smart HTML Extraction
Full HTML (2MB)
↓
Remove noise (scripts, styles, nav, footer)
↓
Find main content (<article>, <main>, .content)
↓
Extract ~30k chars of relevant HTML
↓
Send to AI
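The steps above can be sketched with BeautifulSoup. This is an illustrative sketch, not the actual implementation (which lives in yosoi/core/cleaning and yosoi/core/extraction); the tag list simply mirrors the flow diagram:

```python
from bs4 import BeautifulSoup

# Noise tags named in the flow above; the real cleaner may remove more
NOISE_TAGS = ["script", "style", "nav", "footer"]

def extract_relevant_html(html: str, max_chars: int = 30_000) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Step 1: remove noise elements that carry no article content
    for tag in soup.find_all(NOISE_TAGS):
        tag.decompose()
    # Step 2: prefer semantic containers for the main content
    main = soup.find("article") or soup.find("main") or soup.select_one(".content") or soup
    # Step 3: trim to a budget the model can read cheaply
    return str(main)[:max_chars]
```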
Phase 2: AI Analysis
AI reads actual HTML structure
↓
Finds real class names & IDs
↓
Returns up to 3 selectors per field:
- Primary (most specific)
- Fallback (reliable backup)
- Tertiary (generic)
Phase 3: Validation
Test each selector on the actual page
↓
Find first working selector per field
↓
Mark which priority worked (primary/fallback/tertiary)
↓
Save validated selectors to JSON
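The validation loop above can be sketched as follows. This is illustrative (the real logic lives in yosoi/core/verification/verifier.py, and validate_field is a hypothetical helper name):

```python
from __future__ import annotations

from bs4 import BeautifulSoup

PRIORITIES = ("primary", "fallback", "tertiary")

def validate_field(soup: BeautifulSoup, candidates: dict[str, str]) -> tuple[str, str] | None:
    """Return (priority, selector) for the first candidate that matches, else None."""
    for priority in PRIORITIES:
        selector = candidates.get(priority, "NA")
        if selector != "NA" and soup.select_one(selector) is not None:
            return priority, selector
    return None

# Validate headline candidates against a fetched page
html = "<article><h1 class='article-title'>Hello</h1></article>"
soup = BeautifulSoup(html, "html.parser")
result = validate_field(soup, {"primary": "h1.article-title", "fallback": "h1", "tertiary": "h2"})
# result == ("primary", "h1.article-title")
```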
Output Format
Selectors are saved as JSON files in the .yosoi/selectors/ directory:
{
  "headline": {
    "primary": "h1.article-title",
    "fallback": "h1",
    "tertiary": "h2"
  },
  "author": {
    "primary": "a[href*='/author/']",
    "fallback": ".byline",
    "tertiary": "NA"
  },
  "date": {
    "primary": "time.published-date",
    "fallback": "time",
    "tertiary": ".date"
  },
  "body_text": {
    "primary": "article.content p",
    "fallback": "article p",
    "tertiary": "p"
  },
  "related_content": {
    "primary": "aside.related a",
    "fallback": ".sidebar a",
    "tertiary": "NA"
  }
}
Using Discovered Selectors
Once selectors are discovered, use them with standard BeautifulSoup:
from yosoi.storage.persistence import SelectorStorage
from bs4 import BeautifulSoup
import requests

# Load discovered selectors
storage = SelectorStorage()
selectors = storage.load_selectors('example.com')

# Scrape using the selectors (fast & free!)
url = 'https://example.com/another-article'
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')

# Extract data using validated selectors
headline_selector = selectors['headline']['primary']
headline = soup.select_one(headline_selector)
if headline:
    print(f"Headline: {headline.get_text(strip=True)}")

# Extract body text
body_selector = selectors['body_text']['primary']
paragraphs = soup.select(body_selector)
body_text = '\n\n'.join(p.get_text(strip=True) for p in paragraphs)
print(f"\nBody:\n{body_text}")
Using as a Library
from yosoi.core.pipeline import SelectorDiscoveryPipeline
import os
# Initialize with your preferred provider
pipeline = SelectorDiscoveryPipeline(
    ai_api_key=os.getenv('GROQ_KEY'),
    model_name='llama-3.3-70b-versatile',
    provider='groq'
)
# Process a URL
success = pipeline.process_url('https://example.com/article')
# Process multiple URLs
urls = ['https://example.com/article1', 'https://example.com/article2']
pipeline.process_urls(urls, force=False)
# Show summary
pipeline.show_summary()
Supported AI Models
Use -m provider/model-name to select any model explicitly. The API key is read from the corresponding environment variable.
| Provider | -m prefix | Env key | Example model |
|---|---|---|---|
| Groq | groq/ | GROQ_KEY | llama-3.3-70b-versatile |
| Google Gemini | gemini/ | GEMINI_KEY | gemini-2.0-flash |
| OpenAI | openai/ | OPENAI_KEY | gpt-4o |
| Cerebras | cerebras/ | CEREBRAS_KEY | llama-3.3-70b |
| OpenRouter | openrouter/ | OPENROUTER_KEY | gpt-oss |
If -m is not provided, Yosoi auto-detects: uses Groq if GROQ_KEY is set, otherwise Gemini.
Observability with Logfire
Yosoi integrates with Logfire for comprehensive observability:
What's Tracked:
- Request/response traces for each URL
- AI model calls and responses
- Selector validation results
- Performance metrics
- Error tracking
Enable Logfire:
- Sign up at https://logfire.pydantic.dev
- Get your token
- Add LOGFIRE_TOKEN=your_token to .env
- Run your discovery process
- View traces in the Logfire dashboard
Features
- AI-Powered - Uses Groq/Gemini to read HTML and find selectors
- Cheap - $0.001 per domain
- Verified - Tests each selector on the live page before saving
- Organized - Clean JSON output per domain
- Rich CLI - Nice terminal output with progress indicators
- Type-Safe - Full type hints with mypy checking
- Observable - Integrated with Logfire for tracing
- Production-Ready - Linted, formatted, and tested
Troubleshooting
AI Returns All "NA"
Cause: Site has poor semantic HTML or heavy JavaScript rendering
Solution:
- Check if the site requires JavaScript (use debug mode: --debug)
- Review the extracted HTML in the debug_html/ directory
- Use the Playwright fetcher for JavaScript-heavy sites (SmartFetcher selects it automatically)
- If the site still fails, re-run with --force after verifying the page loads in a browser
Selectors Don't Work
Cause: Site structure changed or uses dynamic content
Solution:
- Re-run with --force to re-discover selectors
- Check if the site requires authentication
- Verify selectors with --debug mode
API Key Errors
Problem: GROQ_KEY or GEMINI_KEY not found
Solution:
- Ensure the .env file exists in the project root
- Verify the key is correctly formatted (no quotes needed)
- Check that the key has not expired in the provider's dashboard
HTTP Errors (403, 429, 500)
- 403 Forbidden: Site blocks scrapers - may need different User-Agent
- 429 Too Many Requests: Rate limited - add delays between requests
- 5xx Server Error: Server issue - Yosoi will skip retries automatically
Import Errors
Problem: ModuleNotFoundError for pydantic_ai, logfire, etc.
Solution:
# Reinstall dependencies
uv sync
# If still failing, try clean install
rm -rf .venv
uv sync
Best Practices
For Reliable Scraping
- Test on multiple pages: Validate selectors work across different articles
- Use fallback selectors: Always have primary/fallback/tertiary
- Monitor changes: Re-discover periodically (sites change)
- Handle missing data: Not all fields exist on all pages
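The fallback and missing-data points above can be combined into one small helper. A sketch, assuming the saved JSON format shown earlier (select_with_fallback is a hypothetical name, not a Yosoi API):

```python
from __future__ import annotations

from bs4 import BeautifulSoup

def select_with_fallback(soup: BeautifulSoup, candidates: dict[str, str]) -> str | None:
    """Try primary, then fallback, then tertiary; return text, or None if the field is absent."""
    for priority in ("primary", "fallback", "tertiary"):
        selector = candidates.get(priority, "NA")
        if selector == "NA":
            continue
        node = soup.select_one(selector)
        if node is not None:
            return node.get_text(strip=True)
    return None  # field genuinely missing on this page

soup = BeautifulSoup("<article><h1>Title</h1></article>", "html.parser")
headline = select_with_fallback(soup, {"primary": "h1.article-title", "fallback": "h1", "tertiary": "h2"})
# headline == "Title" (primary missed, fallback matched)
```

Returning None instead of raising keeps per-article scraping loops simple: absent fields are recorded as missing rather than aborting the batch.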
For Better AI Results
- Use debug mode first: Check what HTML is being sent to AI
- Prefer semantic HTML: Sites with <article>, <time>, etc. work best
- Avoid paywalled sites: Content behind login walls won't work
- Check rate limits: Respect the site's robots.txt and rate limits
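The robots.txt point can be checked in code: Python's standard library ships a robots.txt parser. A minimal sketch (the rules shown are illustrative):

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body directly; in practice you would call
# rp.set_url("https://example.com/robots.txt") and rp.read()
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])
print(rp.can_fetch("yosoi-example", "https://example.com/article"))       # True
print(rp.can_fetch("yosoi-example", "https://example.com/private/page"))  # False
```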
For Production Use
- Cache selectors: Store and reuse for same domain
- Add error handling: Sites can change or go down
- Use Logfire: Monitor success rates and failures
- Set timeouts: Don't let requests hang indefinitely
Limitations / Future Developments
- JavaScript-rendered content: Not visible in raw HTML (a candidate for future development)
- Paywalled sites: Cannot access content behind logins
- Dynamic selectors: Sites that change class names frequently
- Rate limits: Some sites may block or rate-limit requests
Citation
If you use yosoi in your research or project, please cite it using the metadata provided in the CITATION.cff file.
BibTeX
If you are using LaTeX, you can use the following entry:
@software{Berg_yosoi_2026,
  author = {Berg, Andrew and Miles, Houston and Mefford, Braeden and Wang, Mia},
  license = {Apache-2.0},
  month = feb,
  title = {{yosoi}},
  url = {https://github.com/CascadingLabs/Yosoi},
  version = {0.0.1-alpha6},
  year = {2026}
}