AI-Powered Selector Discovery - Discover once, scrape forever

Maintainers

andrewpberg HoustonSM

Yosoi - AI-Powered CSS Selector Discovery

Discover CSS selectors once with AI, scrape forever with BeautifulSoup

Give Yosoi a URL, and it uses AI to automatically discover the best CSS selectors for extracting headlines, authors, dates, body text, and related content. Discovery takes 3 seconds and costs $0.001 per domain — then scrape thousands of articles for free with BeautifulSoup.

Key Benefits:

Fast: 3 seconds to discover selectors per domain
Cheap: $0.001 per domain (one-time cost)
Accurate: Validates selectors before saving
Reusable: Discover once, use forever
Production-Ready: Type-safe, linted, tested

Quick Start

Installation

# Clone the repository
git clone https://github.com/CascadingLabs/Yosoi.git yosoi
cd yosoi

# Install dependencies (using uv)
uv sync

# For development tools
uv sync --group dev

Configuration

Create a .env file (see env.example):

# Set keys for whichever providers you want to use
GROQ_KEY=your_groq_api_key_here           # groq/...
GEMINI_KEY=your_gemini_api_key_here       # gemini/...
OPENAI_KEY=your_openai_api_key_here       # openai/...
CEREBRAS_KEY=your_cerebras_api_key_here   # cerebras/...
OPENROUTER_KEY=your_openrouter_key_here  # openrouter/...

# Optional: Observability
LOGFIRE_TOKEN=your_logfire_token_here     # For Logfire tracing

Get API Keys:

Groq (Free): https://console.groq.com/keys
Gemini: https://aistudio.google.com/app/apikey
OpenRouter (Free tier available): https://openrouter.ai/keys
Logfire (Optional): https://logfire.pydantic.dev

Some free OpenRouter models only work if your OpenRouter privacy settings allow training on your data.

Basic Usage

# Process a single URL (uses GROQ_KEY or GEMINI_KEY from .env)
uv run yosoi --url https://example.com/article

# Specify model explicitly with -m provider/model-name
uv run yosoi -m groq/llama-3.3-70b-versatile --url https://example.com/article
uv run yosoi -m gemini/gemini-2.0-flash --url https://example.com/article
uv run yosoi -m openai/gpt-4o --url https://example.com/article
uv run yosoi -m openrouter/gpt-oss --url https://example.com/article

# Process multiple URLs from a file
uv run yosoi -m groq/llama-3.3-70b-versatile --file urls.txt

# Force re-discovery
uv run yosoi --url https://example.com --force

# Show summary of all saved selectors
uv run yosoi --summary

# Enable debug mode (saves extracted HTML)
uv run yosoi --url https://example.com --debug

URLs File Format

Create urls.txt with one URL per line:

https://example.com/article1
https://example.com/article2
# Comments are allowed
https://example.com/article3

Or use JSON format (urls.json):

[
  {"url": "https://example.com/article1"},
  {"url": "https://example.com/article2"}
]
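Both formats can be read with a few lines of Python. The following is a minimal sketch, not Yosoi's own loader; the function name `load_urls` is hypothetical, and it assumes comment lines start with `#` and that the JSON file is a list of objects with a `url` key, as in the examples above.

```python
import json
from pathlib import Path


def load_urls(path: str) -> list[str]:
    """Read URLs from a plain-text file or a JSON list of {"url": ...} objects."""
    text = Path(path).read_text(encoding="utf-8")
    if path.endswith(".json"):
        return [entry["url"] for entry in json.loads(text)]
    urls = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines starting with '#'
        if line and not line.startswith("#"):
            urls.append(line)
    return urls
```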

Project Structure

.
├── .yosoi/                        # Hidden runtime directory
│   └── selectors/                 # Persisted selector snapshots
├── yosoi/                         # Main package
│   ├── __init__.py
│   ├── __main__.py
│   ├── cli.py                     # CLI entry point & argument parsing
│   ├── core/                      # Core pipeline logic
│   │   ├── pipeline.py            # SelectorDiscoveryPipeline orchestrator
│   │   ├── cleaning/
│   │   │   └── cleaner.py         # HTML noise removal
│   │   ├── discovery/
│   │   │   ├── agent.py           # pydantic-ai agent definition
│   │   │   └── config.py          # Model / provider configuration
│   │   ├── extraction/
│   │   │   └── extractor.py       # Relevant HTML extraction
│   │   ├── fetcher/
│   │   │   ├── base.py            # Abstract fetcher interface
│   │   │   ├── simple.py          # httpx-based fetcher
│   │   │   ├── playwright.py      # Playwright fetcher (JS-heavy sites)
│   │   │   └── smart.py           # Auto-selects fetcher strategy
│   │   └── verification/
│   │       └── verifier.py        # CSS selector verification
│   ├── models/                    # Pydantic data models
│   │   ├── selectors.py           # SelectorSet & field models
│   │   └── results.py             # Pipeline result types
│   ├── storage/                   # Persistence layer
│   │   ├── persistence.py         # JSON read/write for selectors
│   │   ├── tracking.py            # Domain-level tracking & stats
│   │   └── debug.py               # Debug HTML snapshot storage
│   ├── outputs/                   # Output formatters
│   │   ├── json.py                # JSON report formatter
│   │   ├── markdown.py            # Markdown / Rich table output
│   │   └── utils.py               # Shared output helpers
│   ├── prompts/                   # LLM prompt templates (markdown)
│   │   ├── discovery_system.md
│   │   └── discovery_user.md
│   └── utils/                     # Shared utilities
│       ├── exceptions.py          # Custom exception hierarchy
│       ├── files.py               # File-path helpers
│       ├── headers.py             # HTTP header rotation
│       ├── logging.py             # Structured logging setup
│       ├── prompts.py             # Prompt loader utility
│       └── retry.py               # Retry / back-off helpers
├── tests/
│   ├── conftest.py
│   ├── integration/
│   │   ├── test_pipeline.py       # End-to-end pipeline tests
│   │   └── test_snapshots.py      # Selector snapshot regression tests
│   └── unit/
│       ├── test_discovery_bs4.py
│       ├── test_pipeline.py
│       └── test_pydantic_flow.py
├── pyproject.toml                 # Project config & dependencies
├── env.example                    # API key template
└── urls.txt                       # Example URL list

How It Works

Phase 1: Smart HTML Extraction

Full HTML (2MB)
  ↓
Remove noise (scripts, styles, nav, footer)
  ↓
Find main content (<article>, <main>, .content)
  ↓
Extract ~30k chars of relevant HTML
  ↓
Send to AI
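
The steps in the diagram above could be sketched with BeautifulSoup roughly as follows. This is an illustration rather than Yosoi's actual cleaner or extractor: the function name `extract_relevant_html`, the noise-tag list, and the 30,000-character cap are assumptions drawn from the diagram.

```python
from bs4 import BeautifulSoup

# Assumed noise tags and size budget, taken from the diagram
NOISE_TAGS = ["script", "style", "nav", "footer"]
MAX_CHARS = 30_000


def extract_relevant_html(html: str, max_chars: int = MAX_CHARS) -> str:
    """Strip noisy elements, pick the main content container, and truncate."""
    soup = BeautifulSoup(html, "html.parser")

    # Remove noise one element at a time; re-querying avoids touching
    # tags that were already destroyed along with a removed parent.
    tag = soup.find(NOISE_TAGS)
    while tag is not None:
        tag.decompose()
        tag = soup.find(NOISE_TAGS)

    # Prefer a semantic container, then <body>, then the whole document
    main = soup.select_one("article, main, .content")
    if main is None:
        main = soup.body if soup.body is not None else soup

    # Plain character truncation; this may cut through a tag
    return str(main)[:max_chars]
```

Truncating by character count is the simplest way to stay within a prompt budget, at the cost of possibly ending mid-element.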

Phase 2: AI Analysis

AI reads actual HTML structure
  ↓
Finds real class names & IDs
  ↓
Returns up to 3 selectors per field:
  - Primary (most specific)
  - Fallback (reliable backup)
  - Tertiary (generic)

Phase 3: Validation

Test each selector on the actual page
  ↓
Find first working selector per field
  ↓
Mark which priority worked (primary/fallback/tertiary)
  ↓
Save validated selectors to JSON
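
As a rough sketch of this validation step (not Yosoi's verifier), the function below tries each candidate in priority order and records the first one that matches anything on the page. The name `validate_selectors` and the shape of its return value are hypothetical, and treating "NA" as "no selector" is an assumption.

```python
from bs4 import BeautifulSoup

PRIORITIES = ("primary", "fallback", "tertiary")


def validate_selectors(html: str, candidates: dict) -> dict:
    """Return, per field, the first selector that matches and its priority."""
    soup = BeautifulSoup(html, "html.parser")
    validated = {}
    for field, by_priority in candidates.items():
        for priority in PRIORITIES:
            selector = by_priority.get(priority, "NA")
            if selector == "NA":
                continue  # assumed marker for "no selector"
            try:
                matched = soup.select(selector)
            except Exception:
                continue  # skip selectors the CSS engine cannot parse
            if matched:
                validated[field] = {"priority": priority, "selector": selector}
                break
    return validated
```

Fields with no working selector are simply absent from the result, so callers can tell "not found" apart from "found via tertiary".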

Output Format

Selectors are saved as JSON files in the .yosoi/selectors/ directory:

{
  "headline": {
    "primary": "h1.article-title",
    "fallback": "h1",
    "tertiary": "h2"
  },
  "author": {
    "primary": "a[href*='/author/']",
    "fallback": ".byline",
    "tertiary": "NA"
  },
  "date": {
    "primary": "time.published-date",
    "fallback": "time",
    "tertiary": ".date"
  },
  "body_text": {
    "primary": "article.content p",
    "fallback": "article p",
    "tertiary": "p"
  },
  "related_content": {
    "primary": "aside.related a",
    "fallback": ".sidebar a",
    "tertiary": "NA"
  }
}
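
When consuming a file like this, it can help to turn each field into an ordered list of usable selectors. The sketch below assumes that "NA" marks a missing selector, as the example suggests; the helper name `usable_selectors` is hypothetical.

```python
import json

ORDER = ("primary", "fallback", "tertiary")


def usable_selectors(raw_json: str) -> dict[str, list[str]]:
    """Return, per field, the selectors in priority order with "NA" dropped."""
    data = json.loads(raw_json)
    return {
        field: [entry[p] for p in ORDER if entry.get(p, "NA") != "NA"]
        for field, entry in data.items()
    }
```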

Using Discovered Selectors

Once selectors are discovered, use them with standard BeautifulSoup:

from yosoi.storage.persistence import SelectorStorage
from bs4 import BeautifulSoup
import requests

# Load discovered selectors
storage = SelectorStorage()
selectors = storage.load_selectors('example.com')

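# Scrape using the selectors (fast & free!)
url = 'https://example.com/another-article'
response = requests.get(url, timeout=10)  # avoid hanging on slow servers
response.raise_for_status()               # stop early on 4xx/5xx responses
soup = BeautifulSoup(response.text, 'html.parser')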

# Extract data using validated selectors
headline_selector = selectors['headline']['primary']
headline = soup.select_one(headline_selector)
if headline:
    print(f"Headline: {headline.get_text(strip=True)}")

# Extract body text
body_selector = selectors['body_text']['primary']
paragraphs = soup.select(body_selector)
body_text = '\n\n'.join(p.get_text(strip=True) for p in paragraphs)
print(f"\nBody:\n{body_text}")

Using as a Library

from yosoi.core.pipeline import SelectorDiscoveryPipeline
import os

# Initialize with your preferred provider
pipeline = SelectorDiscoveryPipeline(
    ai_api_key=os.getenv('GROQ_KEY'),
    model_name='llama-3.3-70b-versatile',
    provider='groq'
)

# Process a URL
success = pipeline.process_url('https://example.com/article')

# Process multiple URLs
urls = ['https://example.com/article1', 'https://example.com/article2']
pipeline.process_urls(urls, force=False)

# Show summary
pipeline.show_summary()

Supported AI Models

Use -m provider/model-name to select any model explicitly. The API key is read from the corresponding environment variable.

Provider	`-m` prefix	Env key	Example model
Groq	`groq/`	`GROQ_KEY`	`llama-3.3-70b-versatile`
Google Gemini	`gemini/`	`GEMINI_KEY`	`gemini-2.0-flash`
OpenAI	`openai/`	`OPENAI_KEY`	`gpt-4o`
Cerebras	`cerebras/`	`CEREBRAS_KEY`	`llama-3.3-70b`
OpenRouter	`openrouter/`	`OPENROUTER_KEY`	`gpt-oss`

If -m is not provided, Yosoi auto-detects the provider: it uses Groq if GROQ_KEY is set, and Gemini otherwise.
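
The mapping in the table and the auto-detect rule can be expressed as a small lookup. This is a sketch of the described behaviour, not Yosoi's configuration code; `resolve_model` is a hypothetical name, and the default model used after auto-detection is left as `None` because the README does not state it.

```python
import os
from typing import Mapping, Optional, Tuple

ENV_KEYS = {
    "groq": "GROQ_KEY",
    "gemini": "GEMINI_KEY",
    "openai": "OPENAI_KEY",
    "cerebras": "CEREBRAS_KEY",
    "openrouter": "OPENROUTER_KEY",
}


def resolve_model(
    model_arg: Optional[str], env: Optional[Mapping[str, str]] = None
) -> Tuple[str, Optional[str], str]:
    """Return (provider, model_name, env_key) for a `-m provider/model` value."""
    env = os.environ if env is None else env
    if model_arg:
        # Split on the first slash only, so model names may contain slashes
        provider, _, model_name = model_arg.partition("/")
        if provider not in ENV_KEYS or not model_name:
            raise ValueError(f"expected provider/model-name, got {model_arg!r}")
        return provider, model_name, ENV_KEYS[provider]
    # Auto-detect: Groq if its key is set, otherwise Gemini
    if env.get("GROQ_KEY"):
        return "groq", None, "GROQ_KEY"
    return "gemini", None, "GEMINI_KEY"
```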

Observability with Logfire

Yosoi integrates with Logfire for comprehensive observability:

What's Tracked:

Request/response traces for each URL
AI model calls and responses
Selector validation results
Performance metrics
Error tracking

Enable Logfire:

Sign up at https://logfire.pydantic.dev
Get your token
Add LOGFIRE_TOKEN=your_token to .env
Run your discovery process
View traces in Logfire dashboard

Features

AI-Powered - Uses Groq/Gemini to read HTML and find selectors
Cheap - $0.001 per domain
Verified - Tests each selector on the live page before saving
Organized - Clean JSON output per domain
Rich CLI - Nice terminal output with progress indicators
Type-Safe - Full type hints with mypy checking
Observable - Integrated with Logfire for tracing
Production-Ready - Linted, formatted, and tested

Troubleshooting

AI Returns All "NA"

Cause: Site has poor semantic HTML or heavy JavaScript rendering

Solution:

Check if site requires JavaScript (use debug mode: --debug)
Review extracted HTML in debug_html/ directory
Use the Playwright fetcher for JavaScript-heavy sites (SmartFetcher selects it automatically)
If fetching fails outright, verify that the page loads in a browser, then re-run with --force

Selectors Don't Work

Cause: Site structure changed or uses dynamic content

Solution:

Re-run with --force to re-discover selectors
Check if site requires authentication
Verify selectors with --debug mode

API Key Errors

Problem: GROQ_KEY or GEMINI_KEY not found

Solution:

Ensure .env file exists in project root
Verify key is correctly formatted (no quotes needed)
Check key has not expired at provider's dashboard
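
A quick way to rule out the first two causes is to parse the .env file and see which keys are empty or absent. The helper names below (`read_env_file`, `missing_keys`) are hypothetical and this is a diagnostic sketch only; it does not reproduce how Yosoi loads its environment.

```python
from pathlib import Path


def read_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=value lines, ignoring comments and blank lines."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop trailing inline comments, surrounding whitespace, and quotes
        value = value.split(" #", 1)[0].strip().strip("'\"")
        values[key.strip()] = value
    return values


def missing_keys(values: dict[str, str], required=("GROQ_KEY", "GEMINI_KEY")) -> list[str]:
    """List the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]
```

Since auto-detection only needs one of the two keys, a result listing both is the case that actually blocks a run.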

HTTP Errors (403, 429, 500)

403 Forbidden: Site blocks scrapers - may need different User-Agent
429 Too Many Requests: Rate limited - add delays between requests
5xx Server Error: Server issue - Yosoi will retry automatically
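
For your own scraping code, rate limits and transient server errors are commonly handled with exponential back-off. The sketch below is a generic illustration and not Yosoi's retry helper; `fetch_with_backoff`, the retryable status set, and the delay values are all assumptions.

```python
import time
from typing import Callable

# Statuses treated as transient in this sketch
RETRYABLE = {429, 500, 502, 503, 504}


def fetch_with_backoff(
    fetch: Callable[[], int],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `fetch` until it returns a non-retryable status or attempts run out.

    `fetch` performs one request and returns its HTTP status code.
    Delays double after each failed attempt: base, 2*base, 4*base, ...
    """
    status = fetch()
    for attempt in range(1, max_attempts):
        if status not in RETRYABLE:
            break
        sleep(base_delay * 2 ** (attempt - 1))
        status = fetch()
    return status
```

A 403 is returned immediately here, since retrying a blocked request rarely helps without changing headers.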

Import Errors

Problem: ModuleNotFoundError for pydantic_ai, logfire, etc.

Solution:

# Reinstall dependencies
uv sync

# If still failing, try clean install
rm -rf .venv
uv sync

Best Practices

For Reliable Scraping

Test on multiple pages: Validate selectors work across different articles
Use fallback selectors: Always have primary/fallback/tertiary
Monitor changes: Re-discover periodically (sites change)
Handle missing data: Not all fields exist on all pages

For Better AI Results

Use debug mode first: Check what HTML is being sent to AI
Prefer semantic HTML: Sites with <article>, <time>, etc. work best
Avoid paywalled sites: Content behind login walls won't work
Check rate limits: Respect site's robots.txt and rate limits
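
Checking robots.txt can be done with the standard library. The helper below is a small sketch; the function name `allowed_by_robots` is hypothetical, and it expects the robots.txt text to have been downloaded already.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

For example, with a robots.txt that disallows `/private/`, an article URL outside that path is allowed while anything under it is not.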

For Production Use

Cache selectors: Store and reuse for same domain
Add error handling: Sites can change or go down
Use Logfire: Monitor success rates and failures
Set timeouts: Don't let requests hang indefinitely
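
If you manage selectors outside Yosoi's own storage, a per-domain cache is straightforward to build. The sketch below is a hypothetical stand-alone cache, not the `SelectorStorage` class; the file naming scheme (one JSON file per hostname, with a leading `www.` removed) is an assumption.

```python
import json
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse


def domain_key(url: str) -> str:
    """Derive a cache key from a URL's hostname, ignoring a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host.removeprefix("www.")


class SelectorCache:
    """Store one JSON file of selectors per domain."""

    def __init__(self, directory: str) -> None:
        self.directory = Path(directory)

    def _path_for(self, url: str) -> Path:
        return self.directory / f"{domain_key(url)}.json"

    def get(self, url: str) -> Optional[dict]:
        path = self._path_for(url)
        if not path.exists():
            return None
        return json.loads(path.read_text(encoding="utf-8"))

    def put(self, url: str, selectors: dict) -> None:
        self.directory.mkdir(parents=True, exist_ok=True)
        self._path_for(url).write_text(json.dumps(selectors, indent=2), encoding="utf-8")
```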

Limitations / Future Developments

JavaScript-rendered content: Not visible in raw HTML (maybe a future development)
Paywalled sites: Cannot access content behind logins
Dynamic selectors: Sites that change class names frequently
Rate limits: Some sites may block or rate-limit requests

Citation

If you use yosoi in your research or project, please cite it using the metadata provided in the CITATION.cff file.

BibTeX

If you are using LaTeX, you can use the following entry:

@software{Berg_yosoi_2026,
author = {Berg, Andrew and Miles, Houston and Mefford, Braeden and Wang, Mia},
license = {Apache-2.0},
month = feb,
title = {{yosoi}},
url = {https://github.com/CascadingLabs/Yosoi},
version = {0.0.1-alpha6},
year = {2026}
}

Release history

0.0.1a16 pre-release

Mar 29, 2026

0.0.1a15 pre-release

Mar 29, 2026

0.0.1a14 pre-release

Mar 29, 2026

0.0.1a13 pre-release

Mar 29, 2026

0.0.1a12 pre-release

Mar 29, 2026

This version

0.0.1a11 pre-release

Mar 16, 2026

0.0.1a9 pre-release

Feb 20, 2026

0.0.1a7 pre-release

Feb 20, 2026

0.0.1a6 pre-release

Feb 19, 2026

0.0.1a5 pre-release

Feb 14, 2026

Download files

Source Distribution

yosoi-0.0.1a11.tar.gz (103.7 kB)

Uploaded Mar 16, 2026 Source

Built Distribution

yosoi-0.0.1a11-py3-none-any.whl (131.6 kB)

Uploaded Mar 16, 2026 Python 3

File details

Details for the file yosoi-0.0.1a11.tar.gz.

File metadata

Download URL: yosoi-0.0.1a11.tar.gz
Upload date: Mar 16, 2026
Size: 103.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for yosoi-0.0.1a11.tar.gz
Algorithm	Hash digest
SHA256	`fe14883164a2f54726d1bcd875e74edc919c4a4c00ebe871370570f1fada1405`
MD5	`69c9a021a13b7116cb5784508a34cc7e`
BLAKE2b-256	`50e52d75f75b299118ebab03fdac359d4c0994d21bab0fde06c3a4936ab957e8`

Provenance

The following attestation bundles were made for yosoi-0.0.1a11.tar.gz:

Publisher: publish.yaml on CascadingLabs/Yosoi

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: yosoi-0.0.1a11.tar.gz
- Subject digest: fe14883164a2f54726d1bcd875e74edc919c4a4c00ebe871370570f1fada1405
- Sigstore transparency entry: 1108787803
- Sigstore integration time: Mar 16, 2026
Source repository:
- Permalink: CascadingLabs/Yosoi@648ed08e120079682afeabbf7799b532703a4714
- Branch / Tag: refs/tags/0.0.1a11
- Owner: https://github.com/CascadingLabs
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yaml@648ed08e120079682afeabbf7799b532703a4714
- Trigger Event: release

File details

Details for the file yosoi-0.0.1a11-py3-none-any.whl.

File metadata

Download URL: yosoi-0.0.1a11-py3-none-any.whl
Upload date: Mar 16, 2026
Size: 131.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for yosoi-0.0.1a11-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e380bf3611eeae4d11902beb6f8a52d7b36aff6620620f848ccc50c05c2cd321`
MD5	`5a9d329c5c2f4d9d8425d61d1ecdfce7`
BLAKE2b-256	`85c50ff798829030f076c0d3276f9887e4ace6076d574d3a2ca886aa1ccbccce`

Provenance

The following attestation bundles were made for yosoi-0.0.1a11-py3-none-any.whl:

Publisher: publish.yaml on CascadingLabs/Yosoi

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: yosoi-0.0.1a11-py3-none-any.whl
- Subject digest: e380bf3611eeae4d11902beb6f8a52d7b36aff6620620f848ccc50c05c2cd321
- Sigstore transparency entry: 1108787860
- Sigstore integration time: Mar 16, 2026
Source repository:
- Permalink: CascadingLabs/Yosoi@648ed08e120079682afeabbf7799b532703a4714
- Branch / Tag: refs/tags/0.0.1a11
- Owner: https://github.com/CascadingLabs
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yaml@648ed08e120079682afeabbf7799b532703a4714
- Trigger Event: release
