
Yosoi - AI-Powered CSS Selector Discovery

Discover CSS selectors once with AI, scrape forever with BeautifulSoup

Give Yosoi a URL, and it uses AI to automatically discover the best CSS selectors for extracting headlines, authors, dates, body text, and related content. Discovery takes 3 seconds and costs $0.001 per domain — then scrape thousands of articles for free with BeautifulSoup.

Key Benefits:

  • Fast: 3 seconds to discover selectors per domain
  • Cheap: $0.001 per domain (one-time cost)
  • Accurate: Validates selectors before saving
  • Reusable: Discover once, use forever
  • Production-Ready: Type-safe, linted, tested

Quick Start

Installation

# Clone the repository
git clone <your-repo>
cd yosoi

# Install dependencies (using uv)
uv sync

# For development tools
uv sync --group dev

Configuration

Create a .env file (see env.example):

# Choose one or both providers
GROQ_KEY=your_groq_api_key_here           # For Llama 3.3 (faster, recommended)
GEMINI_KEY=your_gemini_api_key_here       # For Gemini 2.0 Flash

# Optional: Observability
LOGFIRE_TOKEN=your_logfire_token_here     # For Logfire tracing

Get API keys from your provider's dashboard: the Groq Console for GROQ_KEY, or Google AI Studio for GEMINI_KEY.

Basic Usage

# Process a single URL
uv run yosoi --url https://example.com/article

# Process multiple URLs from a file
uv run yosoi --file urls.txt

# Force re-discovery
uv run yosoi --url https://example.com --force

# Show summary of all saved selectors
uv run yosoi --summary

# Enable debug mode (saves extracted HTML)
uv run yosoi --url https://example.com --debug

URLs File Format

Create urls.txt with one URL per line:

https://example.com/article1
https://example.com/article2
# Comments are allowed
https://example.com/article3

Or use JSON format (urls.json):

[
  {"url": "https://example.com/article1"},
  {"url": "https://example.com/article2"}
]
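A loader that accepts both formats might look like this (an illustrative sketch; `load_urls` is a hypothetical helper, not part of Yosoi's CLI):

```python
import json
from pathlib import Path


def load_urls(path: str) -> list[str]:
    """Load URLs from a .txt file (one per line, '#' comments) or a .json file."""
    text = Path(path).read_text()
    if path.endswith(".json"):
        # JSON format: a list of {"url": "..."} objects
        return [entry["url"] for entry in json.loads(text)]
    # Text format: skip blank lines and comment lines
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]
```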

Project Structure

.
├── .yosoi/                   # Hidden helper directory
│   └── selectors/            # Discovered selectors
├── main.py                   # CLI entry point & orchestrator
├── selector_discovery.py     # AI-powered selector discovery
├── selector_validator.py     # Selector validation & testing
├── selector_storage.py       # JSON storage operations
├── services.py               # Shared services (Logfire config)
├── models.py                 # Pydantic models
├── pyproject.toml            # Project config & dependencies
├── .env                      # API keys (create this)
├── CHEAT_SHEET.md            # Dev tools quick reference
└── selectors/                # Output directory
    └── selectors_*.json      # Discovered selectors per domain

How It Works

Phase 1: Smart HTML Extraction

Full HTML (2MB)
  ↓
Remove noise (scripts, styles, nav, footer)
  ↓
Find main content (<article>, <main>, .content)
  ↓
Extract ~30k chars of relevant HTML
  ↓
Send to AI
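The steps above can be sketched with BeautifulSoup (a simplified illustration, not Yosoi's actual extraction code; the tag and class choices follow the diagram):

```python
from bs4 import BeautifulSoup

NOISE_TAGS = ["script", "style", "nav", "footer"]


def extract_main_html(html: str, max_chars: int = 30_000) -> str:
    """Strip noise tags, prefer a main-content container, truncate for the AI."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(NOISE_TAGS):
        tag.decompose()  # drop the tag and everything inside it
    # Prefer semantic containers; fall back to <body>, then the whole document
    main = soup.find("article") or soup.find("main") or soup.select_one(".content")
    return str(main or soup.body or soup)[:max_chars]
```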

Phase 2: AI Analysis

AI reads actual HTML structure
  ↓
Finds real class names & IDs
  ↓
Returns 3 selectors per field:
  - Primary (most specific)
  - Fallback (reliable backup)
  - Tertiary (generic)
  ↓
Smart fallback if AI fails
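The per-field response has a fixed three-tier shape. It can be sketched with plain dataclasses (models.py defines the real classes as Pydantic models; dataclasses are used here only to keep the sketch dependency-free, with field names taken from the output format shown later):

```python
from dataclasses import dataclass


@dataclass
class SelectorSet:
    """One field's selectors, tried in priority order; 'NA' marks an empty tier."""
    primary: str
    fallback: str
    tertiary: str = "NA"


@dataclass
class DiscoveredSelectors:
    headline: SelectorSet
    author: SelectorSet
    date: SelectorSet
    body_text: SelectorSet
    related_content: SelectorSet
```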

Phase 3: Validation

Test each selector on the actual page
  ↓
Find first working selector per field
  ↓
Mark which priority worked (primary/fallback/tertiary)
  ↓
Save validated selectors to JSON

Output Format

Selectors are saved as JSON files in the .yosoi/selectors/ directory:

{
  "headline": {
    "primary": "h1.article-title",
    "fallback": "h1",
    "tertiary": "h2"
  },
  "author": {
    "primary": "a[href*='/author/']",
    "fallback": ".byline",
    "tertiary": "NA"
  },
  "date": {
    "primary": "time.published-date",
    "fallback": "time",
    "tertiary": ".date"
  },
  "body_text": {
    "primary": "article.content p",
    "fallback": "article p",
    "tertiary": "p"
  },
  "related_content": {
    "primary": "aside.related a",
    "fallback": ".sidebar a",
    "tertiary": "NA"
  }
}

Using Discovered Selectors

Once selectors are discovered, use them with standard BeautifulSoup:

from selector_storage import SelectorStorage
from bs4 import BeautifulSoup
import requests

# Load discovered selectors
storage = SelectorStorage()
selectors = storage.load_selectors('example.com')

# Scrape using the selectors (fast & free!)
url = 'https://example.com/another-article'
html = requests.get(url, timeout=10).text  # set a timeout so requests don't hang
soup = BeautifulSoup(html, 'html.parser')

# Extract data using validated selectors
headline_selector = selectors['headline']['primary']
headline = soup.select_one(headline_selector)
if headline:
    print(f"Headline: {headline.get_text(strip=True)}")

# Extract body text
body_selector = selectors['body_text']['primary']
paragraphs = soup.select(body_selector)
body_text = '\n\n'.join(p.get_text(strip=True) for p in paragraphs)
print(f"\nBody:\n{body_text}")

Using as a Library

from main import SelectorDiscoveryPipeline
import os

# Initialize with your preferred provider
pipeline = SelectorDiscoveryPipeline(
    ai_api_key=os.getenv('GROQ_KEY'),
    model_name='llama-3.3-70b-versatile',
    provider='groq'
)

# Process a URL
success = pipeline.process_url('https://example.com/article')

# Process multiple URLs
urls = ['https://example.com/article1', 'https://example.com/article2']
pipeline.process_urls(urls, force=False)

# Show summary
pipeline.show_summary()

Supported AI Models

Groq (Recommended)

  • Model: llama-3.3-70b-versatile
  • Cost: Free tier available
  • Setup: GROQ_KEY in .env

Google Gemini

  • Model: gemini-2.0-flash-exp
  • Cost: Free tier available
  • Setup: GEMINI_KEY in .env

The system automatically uses Groq if GROQ_KEY is set, otherwise falls back to Gemini.

Observability with Logfire

Yosoi integrates with Logfire for comprehensive observability:

What's Tracked:

  • Request/response traces for each URL
  • AI model calls and responses
  • Selector validation results
  • Performance metrics
  • Error tracking

Enable Logfire:

  1. Sign up at https://logfire.pydantic.dev
  2. Get your token
  3. Add LOGFIRE_TOKEN=your_token to .env
  4. Run your discovery process
  5. View traces in Logfire dashboard

Features

  • AI-Powered: Uses Groq/Gemini to read HTML and find selectors
  • Cheap: $0.001 per domain
  • Validated: Tests each selector before saving
  • Organized: Clean JSON output per domain
  • Fallback System: Uses heuristics when AI fails
  • Rich CLI: Terminal output with progress indicators
  • Type-Safe: Full type hints with mypy checking
  • Observable: Integrated with Logfire for tracing
  • Production-Ready: Linted, formatted, and tested

Troubleshooting

AI Returns All "NA"

Cause: Site has poor semantic HTML or heavy JavaScript rendering.

Solution:

  • Check if site requires JavaScript (use debug mode: --debug)
  • Review extracted HTML in debug_html/ directory
  • Consider using Selenium for JavaScript-heavy sites
  • Fallback heuristics will be used automatically

Selectors Don't Work

Cause: Site structure changed or uses dynamic content.

Solution:

  • Re-run with --force to re-discover selectors
  • Check if site requires authentication
  • Verify selectors with --debug mode

API Key Errors

Problem: GROQ_KEY or GEMINI_KEY not found.

Solution:

  • Ensure .env file exists in project root
  • Verify key is correctly formatted (no quotes needed)
  • Check key has not expired at provider's dashboard

HTTP Errors (403, 429, 500)

  • 403 Forbidden: Site blocks scrapers - may need different User-Agent
  • 429 Too Many Requests: Rate limited - add delays between requests
  • 5xx Server Error: Server issue - Yosoi will skip retries automatically
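A small stdlib sketch of those mitigations (illustrative only; the User-Agent string is a placeholder, and Yosoi's own HTTP handling may differ):

```python
import urllib.request

# Placeholder User-Agent; some sites block the default Python client string.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}


def backoff_seconds(status: int, base: float = 1.0) -> float:
    """Pause length before the next request, given the last response status."""
    if status == 429:
        return base * 10  # rate limited: back off hard
    return base  # otherwise just pace requests politely


def fetch(url: str, timeout: float = 10.0) -> str:
    """GET with a custom User-Agent and a hard timeout."""
    req = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```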

Import Errors

Problem: ModuleNotFoundError for pydantic_ai, logfire, etc.

Solution:

# Reinstall dependencies
uv sync

# If still failing, try clean install
rm -rf .venv
uv sync

Best Practices

For Reliable Scraping

  1. Test on multiple pages: Validate selectors work across different articles
  2. Use fallback selectors: Always have primary/fallback/tertiary
  3. Monitor changes: Re-discover periodically (sites change)
  4. Handle missing data: Not all fields exist on all pages

For Better AI Results

  1. Use debug mode first: Check what HTML is being sent to AI
  2. Prefer semantic HTML: Sites with <article>, <time>, etc. work best
  3. Avoid paywalled sites: Content behind login walls won't work
  4. Check rate limits: Respect site's robots.txt and rate limits

For Production Use

  1. Cache selectors: Store and reuse for same domain
  2. Add error handling: Sites can change or go down
  3. Use Logfire: Monitor success rates and failures
  4. Set timeouts: Don't let requests hang indefinitely

Limitations / Future Developments

  • JavaScript-rendered content: Not visible in raw HTML (headless-browser support is a possible future development)
  • Paywalled sites: Cannot access content behind logins
  • Dynamic selectors: Sites that change class names frequently
  • Rate limits: Some sites may block or rate-limit requests

Citation

If you use yosoi in your research or project, please cite it using the metadata provided in the CITATION.cff file.

BibTeX

If you are using LaTeX, you can use the following entry:

@software{Berg_yosoi_2026,
  author  = {Berg, Andrew and Miles, Houston and Mefford, Braeden and Wang, Mia},
  license = {Apache-2.0},
  month   = feb,
  title   = {{yosoi}},
  url     = {https://github.com/CascadingLabs/Yosoi},
  version = {0.0.1-alpha6},
  year    = {2026}
}
