AI-powered API documentation scraper and converter
APIAS - AI Powered API Documentation Scraper
APIAS (AI Powered API Documentation Scraper) extracts API documentation from web pages and converts it into structured formats such as Markdown, JSON, and YAML.
Features
- Scrape API documentation from web pages
- Support for multiple documentation formats
- AI-powered content extraction and structuring
- Command-line interface for easy use
- Multiple output formats (Markdown, JSON, YAML)
- Batch processing mode with interactive TUI
Requirements
- Python 3.10 or higher (Python 3.9 is not supported)
- OpenAI API key (for AI-powered extraction)
Installation
Using uv (Recommended)
The fastest way to install APIAS is using uv:
# Install as a tool (recommended for CLI usage)
uv tool install apias --python=3.10
# Or install in a project
uv add apias
Using pip
pip install apias
Verify Python Version
python --version # Should be 3.10 or higher
Quick Start
from apias import apias
# Basic usage
doc = apias.scrape_url("https://api.example.com/docs")
print(doc.to_markdown())
# With custom configuration
config = {
    "format": "markdown",
    "output": "api_docs.md",
}
apias.scrape_and_save("https://api.example.com/docs", config)
Command Line Usage
# Scrape a single page
apias --url https://api.example.com/docs
# Scrape multiple pages from a website (batch mode)
apias --url https://example.com --mode batch
# Limit how many pages to scrape
apias --url https://example.com --mode batch --limit 50
# Use a configuration file
apias --url https://example.com --config apias_config.yaml
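If you drive APIAS from a script, the commands above can be assembled in Python. This is a minimal sketch, assuming the apias executable is on your PATH. build_apias_command is a hypothetical helper (not part of APIAS), and it only uses the flags shown above.

```python
import shlex
from typing import List, Optional


def build_apias_command(
    url: str,
    batch: bool = False,
    limit: Optional[int] = None,
    config: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list from the flags shown above (hypothetical helper)."""
    cmd = ["apias", "--url", url]
    if batch:
        cmd += ["--mode", "batch"]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    if config is not None:
        cmd += ["--config", config]
    return cmd


cmd = build_apias_command("https://example.com", batch=True, limit=50)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the command as a list (rather than one string) avoids shell quoting problems with unusual URLs.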
Configuration Guide
APIAS can be configured using a YAML file, and its settings are easiest to picture as a team of workers in a factory. Generate an example config with:
apias --generate-config
This creates apias_config.yaml, which you can edit.
Understanding the Settings (Explained Simply)
num_threads - How Many Workers?
num_threads: 5 # Default: 5 workers
Imagine you have a big pile of web pages to process. num_threads is like choosing how many workers to hire:
+---> Worker 1 ---> processes page A
|
Your Pages -------->+---> Worker 2 ---> processes page B
(waiting) |
+---> Worker 3 ---> processes page C
|
+---> Worker 4 ---> processes page D
|
+---> Worker 5 ---> processes page E
- num_threads: 1 = One worker, processes pages one by one (slow but gentle on the website)
- num_threads: 5 = Five workers processing 5 pages at the same time (faster!)
- num_threads: 10 = Ten workers (even faster, but uses more computer power)
Warning: Don't use more than 10-15 threads! Too many workers might:
- Overwhelm the website you're scraping (they might block you!)
- Hit OpenAI rate limits (the AI can only handle so many requests)
- Use too much memory on your computer
Recommendation: Start with 5. Increase to 10 if everything works smoothly.
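The worker picture above maps onto a thread pool. Here is a rough sketch of the idea using Python's standard library. It is not APIAS's actual code: process_page and process_all are hypothetical stand-ins, and the real tool may schedule work differently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def process_page(url: str) -> str:
    """Stand-in for fetching and converting one page (hypothetical)."""
    return f"processed {url}"


def process_all(urls: List[str], num_threads: int = 5) -> List[str]:
    """Process pages with at most num_threads workers running at once."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() returns results in input order, whichever worker finishes first
        return list(pool.map(process_page, urls))


pages = [f"https://example.com/page-{name}" for name in "ABCDE"]
results = process_all(pages, num_threads=5)
print(results[0])
```

With num_threads=1 the same code processes pages one by one, which is the "slow but gentle" setting.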
max_retries - How Many Times to Try Again?
max_retries: 3 # Default: 3 attempts
Sometimes things fail (network hiccups, server busy, etc.). max_retries is how many times APIAS will try again before giving up:
Attempt 1: "Hey server, give me this page!"
Server: "Sorry, I'm busy!" (FAIL)
Attempt 2: *waits 1 second* "Okay, how about now?"
Server: "Still busy!" (FAIL)
Attempt 3: *waits 2 seconds* "Please?"
Server: "Here you go!" (SUCCESS!)
- max_retries: 0 = Never retry (give up immediately on any error)
- max_retries: 3 = Try up to 3 times before giving up
- max_retries: 5 = Very persistent, keeps trying longer
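The conversation above can be sketched as a retry loop. Two assumptions are baked in, so treat this as an illustration rather than APIAS's implementation: max_retries is counted as retries after the first attempt, and the wait doubles each time (the example above only shows 1 second, then 2 seconds). fetch_with_retries and flaky_server are hypothetical names.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def fetch_with_retries(
    fetch: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying on failure with a doubling delay.

    Assumption: max_retries counts retries after the first attempt.
    """
    if max_retries < 0:
        raise ValueError("max_retries cannot be negative")
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise AssertionError("unreachable")


calls = {"count": 0}
waits: List[float] = []


def flaky_server() -> str:
    """Fails twice, then succeeds, like the server in the example above."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("Sorry, I'm busy!")
    return "Here you go!"


# Record the waits in a list instead of really sleeping
result = fetch_with_retries(flaky_server, max_retries=3, sleep=waits.append)
print(result, waits)  # Here you go! [1.0, 2.0]
```

Passing sleep as a parameter keeps the sketch testable without waiting for real seconds to pass.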
chunk_size - How Big Are the Pieces?
chunk_size: 50000 # Default: 50,000 characters
Web pages can be HUGE. We can't send a giant page to the AI all at once (it would choke!). So we cut it into smaller pieces called "chunks":
Giant Web Page (200,000 characters)
====================================
Gets cut into pieces:
[ Chunk 1 ] [ Chunk 2 ] [ Chunk 3 ] [ Chunk 4 ]
(50,000) (50,000) (50,000) (50,000)
| | | |
v v v v
AI AI AI AI
| | | |
v v v v
[Result 1] [Result 2] [Result 3] [Result 4]
Then all results get merged back together!
- chunk_size: 30000 = Smaller pieces (more API calls, but safer for complex pages)
- chunk_size: 50000 = Default balance
- chunk_size: 100000 = Bigger pieces (fewer API calls, but might hit token limits)
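To make the trade-off concrete, here is a naive fixed-width splitter. It is only a sketch of the idea: split_into_chunks is a hypothetical helper, and APIAS may well cut at smarter boundaries than raw character counts.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 50000) -> List[str]:
    """Cut text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


page = "x" * 200_000  # stand-in for a giant web page
chunks = split_into_chunks(page, chunk_size=50_000)
print(len(chunks), [len(c) for c in chunks])

# Joining the pieces gives back the original text
merged = "".join(chunks)
```

With chunk_size 30000 the same 200,000-character page needs 7 chunks instead of 4, which is where the extra API calls come from.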
model - Which AI Brain to Use?
model: gpt-4o-mini # Default: the smart but affordable one
Different AI models have different abilities and costs:
| Model | Speed | Quality | Cost | Best For |
|---|---|---|---|---|
| gpt-4o-mini | Fast | Good | Low | Most scraping tasks (recommended) |
| gpt-4o | Fast | Excellent | Medium | Complex documentation |
| gpt-4-turbo | Medium | Excellent | High | When quality matters most |
| gpt-3.5-turbo | Very Fast | Okay | Very Low | Simple pages, budget mode |
limit - Maximum Pages to Scrape
limit: 50 # Only scrape up to 50 pages (null = no limit)
In batch mode, a website might have thousands of pages. Use limit to control how many:
# Command line:
apias --url https://example.com --mode batch --limit 100
# Or in config file:
limit: 100
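Conceptually, a limit just cuts the list of discovered pages short. A tiny sketch of that idea follows. apply_limit is a hypothetical helper, not an APIAS function, and None plays the role of null (no limit).

```python
from itertools import islice
from typing import Iterable, List, Optional


def apply_limit(urls: Iterable[str], limit: Optional[int] = None) -> List[str]:
    """Keep at most `limit` URLs; None means no limit."""
    return list(islice(urls, limit))


# Pretend the crawler discovered 1,000 pages
discovered = (f"https://example.com/page-{i}" for i in range(1000))
limited = apply_limit(discovered, limit=100)
print(len(limited))
```

Because islice stops pulling from the generator once the limit is reached, the remaining pages are never even looked at.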
Quick Reference: Common Configurations
For Small Websites (< 50 pages)
num_threads: 3
max_retries: 3
chunk_size: 50000
model: gpt-4o-mini
limit: null
For Large Websites (100+ pages)
num_threads: 8
max_retries: 5
chunk_size: 40000
model: gpt-4o-mini
limit: 500
For Slow/Unstable Connections
num_threads: 2
max_retries: 5
retry_delay: 2.0
chunk_size: 30000
model: gpt-4o-mini
For CI/CD (Headless, No User Interaction)
num_threads: 5
no_tui: true
quiet: true
auto_resume: true
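If you build configs from a script, it can help to keep the documented defaults in one place and override only what a preset changes. The sketch below does that with a dataclass. The Settings class and its validation rules are hypothetical (APIAS may validate differently); only the field names and default values come from this guide.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Settings:
    """Defaults as documented above; the class itself is hypothetical."""

    num_threads: int = 5
    max_retries: int = 3
    chunk_size: int = 50000
    model: str = "gpt-4o-mini"
    limit: Optional[int] = None  # None mirrors YAML null (no limit)

    def __post_init__(self) -> None:
        # Assumed sanity checks, not APIAS's own rules
        if self.num_threads < 1:
            raise ValueError("num_threads must be at least 1")
        if self.max_retries < 0:
            raise ValueError("max_retries cannot be negative")
        if self.chunk_size < 1:
            raise ValueError("chunk_size must be at least 1")
        if self.limit is not None and self.limit < 1:
            raise ValueError("limit must be positive or None")


defaults = Settings()
# The "Large Websites" preset expressed as overrides on the defaults
large_site = replace(defaults, num_threads=8, max_retries=5, chunk_size=40000, limit=500)
print(large_site)
```

replace() builds a new object and re-runs the checks, so a bad override fails early instead of halfway through a batch.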
Environment Variables
You can also use environment variables:
# Required: Your OpenAI API key
export OPENAI_API_KEY="sk-your-key-here"
# Then run APIAS
apias --url https://example.com
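When APIAS is launched from a script, checking for the key first gives a clearer error than a failed API call later. A small sketch, where get_openai_key is a hypothetical helper and the error message is made up for illustration:

```python
import os
from typing import Mapping


def get_openai_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the OpenAI key, or fail with a readable message."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it before running apias")
    return key


# Passing a dict makes the check easy to try without touching your real environment
print(get_openai_key({"OPENAI_API_KEY": "sk-your-key-here"}))
```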
Contributing
We welcome contributions! Please see our Contributing Guide for details.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Security
For security issues, please see our Security Policy.
Changelog
See CHANGELOG.md for a list of changes.
Support
- Documentation: https://github.com/Emasoft/apias/docs
- Issues: https://github.com/Emasoft/apias/issues
File details
Details for the file apias-0.1.22.tar.gz.
File metadata
- Download URL: apias-0.1.22.tar.gz
- Upload date:
- Size: 407.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1ab2d265bc2390523862a2d067d6cb1aec3469c73c6126f299426d2fbad15fd8 |
| MD5 | 9ae1df545c524ce8d26af420a8233cc5 |
| BLAKE2b-256 | 4635960086dd925c519906adb4756e1dd8d6abb66b419cf7faf08463a07453da |
File details
Details for the file apias-0.1.22-py3-none-any.whl.
File metadata
- Download URL: apias-0.1.22-py3-none-any.whl
- Upload date:
- Size: 137.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e39deab1d1a060586b3f974aa2c8857647e6786e734fdaa68b80894f8b4e56a0 |
| MD5 | d3af7c9b0ba69182361f6c3335a62675 |
| BLAKE2b-256 | 42562a0e4c5982821b14e4e663feed0950d691b3a4ea3fd545e720d5c0700172 |