ai_text_outline
Extract Table of Contents from Tibetan texts and return character indices of where each section begins.
Features
- Automatic ToC detection using དཀར་ཆག (dkar chag) markers
- Regex-based extraction for structured ToC sections
- LLM-powered fallback using Google Gemini, OpenAI, or Anthropic Claude
- Multi-provider support — choose your preferred LLM
- Fuzzy matching to locate sections even with text variations
- Efficient API usage — only sends relevant ToC sections to the LLM, not the entire text
Installation
Basic Installation
pip install ai_text_outline
With Specific LLM Provider
Install with support for a specific LLM provider:
# For Google Gemini (recommended)
pip install ai_text_outline[gemini]
# For OpenAI
pip install ai_text_outline[openai]
# For Anthropic Claude
pip install ai_text_outline[claude]
# For all providers
pip install ai_text_outline[all]
For Development
pip install -e ".[dev,all]"
Configuration
Environment Variables
The package requires an API key for at least one LLM provider. Set one of the following environment variables:
Google Gemini (Recommended)
export GEMINI_API_KEY="your-gemini-api-key-here"
Get your API key at: https://ai.google.dev/
OpenAI
export OPENAI_API_KEY="your-openai-api-key-here"
Anthropic Claude
export ANTHROPIC_API_KEY="your-anthropic-api-key-here"
Multiple Providers
If you set multiple API keys, the package uses this priority order:
- Gemini (if GEMINI_API_KEY is set)
- OpenAI (if OPENAI_API_KEY is set)
- Claude (if ANTHROPIC_API_KEY is set)
You can override the default by passing the provider parameter to the function.
Usage
Basic Usage
Extract ToC from a file:
from ai_text_outline import extract_toc_indices
# Extract from file
indices = extract_toc_indices(file_path="path/to/tibetan_text.txt")
print(indices) # [150, 2450, 5200, ...]
Extract from text string:
# Extract from text string
text = "..." # Your Tibetan text
indices = extract_toc_indices(text=text)
Advanced Configuration
indices = extract_toc_indices(
    text=text,
    provider="gemini",       # explicitly choose provider
    model="gemini-1.5-pro",  # use a specific model
    chars_per_page=2000,     # characters per page (for offset estimation)
    fuzzy_threshold=0.9,     # fuzzy match threshold (0.0-1.0)
)
For Backend Integration
Your backend should:
1. Install the package:
pip install ai_text_outline[gemini]
2. Set the API key in your environment:
export GEMINI_API_KEY="your-key"
3. Call the function when a user clicks the ToC extraction button:
from ai_text_outline import extract_toc_indices

@app.post("/extract-toc")
def extract_toc(request):
    # Option 1: from a file path
    file_path = request.json.get("file_path")
    # Option 2: from text content
    text = request.json.get("text")
    try:
        indices = extract_toc_indices(file_path=file_path, text=text)
        return {"success": True, "indices": indices}
    except ValueError as e:
        return {"success": False, "error": str(e)}, 400
Return Value
Returns a sorted list of integers representing character indices where each ToC section begins:
[150, 2450, 5200] # Character positions in the text
If no ToC is found, returns an empty list: []
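A typical use of the returned indices is slicing the original text into its sections; a minimal sketch in plain Python (split_by_indices is a hypothetical helper, and the index values are illustrative, not real output):

```python
# Split a text into sections given sorted character start indices,
# as returned by extract_toc_indices (illustrative values only).
def split_by_indices(text, indices):
    # Pair each start index with the next one (or the end of the text).
    bounds = list(indices) + [len(text)]
    return [text[bounds[i]:bounds[i + 1]] for i in range(len(indices))]

sample = "intro...SECTION ONE body...SECTION TWO body..."
sections = split_by_indices(sample, [8, 27])
print(len(sections))  # 2
```

An empty index list simply yields an empty list of sections, so the no-ToC case needs no special handling.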
How It Works
Pipeline Overview
- Load text from file or string
- Find ToC section using དཀར་ཆག markers (or use first quarter/100 pages as fallback)
- Extract ToC entries using regex patterns
- Fallback to LLM if regex fails (sends only ToC section, not whole text)
- Locate section starts by page markers (if present) or fuzzy title matching
- Return sorted indices
ToC Section Detection
The package looks for དཀར་ཆག (Table of Contents marker) in the text:
- Takes the first occurrence as ToC start
- Takes the last occurrence as ToC body end anchor
- Extends until a double newline or 4 more pages, whichever comes first
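The windowing heuristic above can be sketched with plain string methods (an assumed simplification of the package's behaviour; it omits the 4-page cap and keeps only the double-newline rule):

```python
# Sketch of the ToC-window heuristic: start at the first དཀར་ཆག marker,
# extend past the last marker to the next blank line (or end of text).
def toc_window(text, marker="དཀར་ཆག"):
    start = text.find(marker)
    if start == -1:
        return None  # no marker; caller falls back to the first pages
    last = text.rfind(marker)
    end = text.find("\n\n", last)
    return text[start:end if end != -1 else len(text)]

doc = "preamble\nདཀར་ཆག\nentry one 1\nentry two 5\n\nbody text"
print(toc_window(doc))
```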
Entry Extraction
Attempts regex patterns first:
- Tibetan text + delimiter (༎ ། . …) + page number
- Supports both Arabic (0-9) and Tibetan numerals (༠-༩)
If regex fails, sends the extracted ToC section to LLM for structured extraction (JSON format).
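A pattern in the spirit of the one described above can be sketched as follows (this is an illustrative regex, not the package's actual pattern; Tibetan digits occupy U+0F20–U+0F29, so converting them is a fixed-offset mapping):

```python
import re

# Illustrative ToC-entry pattern: a Tibetan title, a delimiter
# (༎ ། . …), then a page number in Arabic or Tibetan digits.
ENTRY = re.compile(r"([\u0F00-\u0FFF][^\n༎།.…]*)[༎།.…]+\s*([0-9\u0F20-\u0F29]+)")

def tibetan_to_int(num):
    # Map each Tibetan digit (༠-༩) to its Arabic equivalent, then parse.
    return int("".join(str(ord(c) - 0x0F20) if "༠" <= c <= "༩" else c
                       for c in num))

line = "ལེའུ་དང་པོ། ༡༥"
m = ENTRY.search(line)
print(m.group(1), tibetan_to_int(m.group(2)))  # title and page 15
```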
Section Location
For each ToC entry:
- If page numbers exist in text: finds page marker, returns position after it
- If no page markers: fuzzy matches the title using rapidfuzz
- Searches within ±50% of the expected page offset
- Picks best match with similarity ≥ 90%
- If no match found: skips silently (not included in output)
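The package uses rapidfuzz for this, but the thresholding idea can be sketched with the standard library's difflib, whose ratio() also scores in 0.0–1.0 so the default 0.9 threshold carries over directly (a stand-in sketch, not the package's matcher):

```python
from difflib import SequenceMatcher

# Threshold-based title matching (stdlib stand-in for rapidfuzz).
def best_match(title, candidates, threshold=0.9):
    scored = [(SequenceMatcher(None, title, c).ratio(), c) for c in candidates]
    score, match = max(scored)
    # Below the threshold the entry is skipped, mirroring the rule above.
    return match if score >= threshold else None

print(best_match("Chapter One", ["Chapter 0ne", "Chapter Two"]))
```

A single-character OCR-style variation ("0" for "O") still scores above 0.9, while an unrelated title falls well below it and is dropped.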
Error Handling
- ValueError: Raised if neither or both of file_path / text are provided
- FileNotFoundError: Raised if the file doesn't exist
- UnicodeDecodeError: Raised if file is not UTF-8 encoded
- ValueError: Raised if no API key is configured
For LLM errors (rate limits, auth failures), the package logs a warning and returns an empty list [].
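The first rule (exactly one of file_path / text) can be illustrated with a small validation sketch; validate_args is a hypothetical helper written for this example, not part of the package:

```python
# Hypothetical sketch of the "neither or both" argument rule.
def validate_args(file_path=None, text=None):
    # Exactly one of the two inputs must be supplied.
    if (file_path is None) == (text is None):
        raise ValueError("provide exactly one of file_path or text")
    return file_path if file_path is not None else text

try:
    validate_args()  # neither given -> ValueError
except ValueError as e:
    print("error:", e)
```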
Logging
Enable debug logging to see detailed extraction steps:
import logging
logging.basicConfig(level=logging.DEBUG)
indices = extract_toc_indices(text=text)
Debug output shows:
- Text length loaded
- Provider and model used
- Whether དཀར་ཆག section was found
- Number of ToC entries extracted
- Which entries couldn't be located
Requirements
- Python 3.9+
- At least one LLM API key (Gemini, OpenAI, or Claude)
Performance Notes
- Typical Tibetan texts (< 1000 pages): ~1-2 seconds
- Large texts (> 1000 pages): ~2-5 seconds depending on ToC complexity
- Only sends ToC section to LLM (not full text) → much cheaper API calls
Troubleshooting
"No API key found" Error
Make sure you've set one of these environment variables:
- GEMINI_API_KEY
- OPENAI_API_KEY
- ANTHROPIC_API_KEY
Check with:
echo $GEMINI_API_KEY
LLM Returns Empty Response
This typically means:
- The ToC format is unusual (try looking at the text manually)
- The LLM couldn't identify the structure (try a different model or provider)
- API rate limit reached (wait and retry)
Enable debug logging to see what text was sent to the LLM:
import logging
logging.basicConfig(level=logging.DEBUG)
Regex Extraction Works But Indices Are Wrong
The fuzzy matching threshold (default 0.9) may be too strict. Try:
indices = extract_toc_indices(text=text, fuzzy_threshold=0.85)
License
MIT License — See LICENSE file for details