Extract Table of Contents from Tibetan texts and return section start indices

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

ai_text_outline

Extract Table of Contents from Tibetan texts and return character indices of where each section begins.

Features

Automatic ToC detection using དཀར་ཆག (dkar chag) markers
Regex-based extraction for structured ToC sections
LLM-powered fallback using Google Gemini, OpenAI, or Anthropic Claude
Multi-provider support — choose your preferred LLM
Fuzzy matching to locate sections even with text variations
Efficient API usage — only sends relevant ToC sections to the LLM, not the entire text

Installation

Basic Installation

pip install ai_text_outline

With Specific LLM Provider

Install with support for a specific LLM provider:

# For Google Gemini (recommended)
pip install "ai_text_outline[gemini]"

# For OpenAI
pip install "ai_text_outline[openai]"

# For Anthropic Claude
pip install "ai_text_outline[claude]"

# For all providers
pip install "ai_text_outline[all]"

For Development

pip install -e ".[dev,all]"

Configuration

Environment Variables

The package requires an API key for at least one LLM provider. Set one of the following environment variables:

Google Gemini (Recommended)

export GEMINI_API_KEY="your-gemini-api-key-here"

Get your API key at: https://ai.google.dev/

OpenAI

export OPENAI_API_KEY="your-openai-api-key-here"

Anthropic Claude

export ANTHROPIC_API_KEY="your-anthropic-api-key-here"

Multiple Providers

If you set multiple API keys, the package uses this priority order:

1. Gemini (if GEMINI_API_KEY is set)
2. OpenAI (if OPENAI_API_KEY is set)
3. Claude (if ANTHROPIC_API_KEY is set)

You can override the default provider by passing the provider parameter to extract_toc_indices.
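The priority order above can be sketched as a small lookup. This is an illustration only, not the package's code: the function name resolve_provider, the env parameter, and the provider strings "openai" and "claude" are assumptions (only "gemini" appears as a provider value in this README).

```python
import os
from typing import Mapping, Optional

# Documented priority order: Gemini, then OpenAI, then Claude.
# The provider strings other than "gemini" are assumed for illustration.
PROVIDER_ENV_VARS = (
    ("gemini", "GEMINI_API_KEY"),
    ("openai", "OPENAI_API_KEY"),
    ("claude", "ANTHROPIC_API_KEY"),
)


def resolve_provider(provider: Optional[str] = None,
                     env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical helper: pick a provider, honouring an explicit override."""
    env = os.environ if env is None else env
    keys = dict(PROVIDER_ENV_VARS)
    if provider is not None:
        if provider not in keys:
            raise ValueError(f"Unknown provider: {provider!r}")
        if not env.get(keys[provider]):
            raise ValueError(f"{keys[provider]} is not set")
        return provider
    # No override: the first provider with a non-empty key wins.
    for name, var in PROVIDER_ENV_VARS:
        if env.get(var):
            return name
    raise ValueError("No API key found")
```

Passing env explicitly makes the lookup easy to exercise without touching the real environment.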

Usage

Basic Usage

Extract ToC from a file:

from ai_text_outline import extract_toc_indices

# Extract from file
indices = extract_toc_indices(file_path="path/to/tibetan_text.txt")
print(indices)  # [150, 2450, 5200, ...]

Extract from text string:

# Extract from text string
text = "..."  # Your Tibetan text
indices = extract_toc_indices(text=text)

Advanced Configuration

indices = extract_toc_indices(
    text=text,
    provider="gemini",              # Explicitly choose provider
    model="gemini-1.5-pro",         # Use specific model
    chars_per_page=2000,            # Chars per page (for estimation)
    fuzzy_threshold=0.9,            # Fuzzy match threshold (0.0-1.0)
)

For Backend Integration

Your backend should:

1. Install the package:
```
pip install "ai_text_outline[gemini]"
```
2. Set the API key in your environment:
```
export GEMINI_API_KEY="your-key"
```

3. Call the function when a user clicks the ToC extraction button:

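from flask import Flask, jsonify, request  # assumes Flask 2.0+ for @app.post

from ai_text_outline import extract_toc_indices

app = Flask(__name__)


@app.post("/extract-toc")
def extract_toc():
    payload = request.get_json(silent=True) or {}

    # Send exactly one of these; both or neither raises ValueError
    file_path = payload.get("file_path")  # Option 1: from file path
    text = payload.get("text")            # Option 2: from text content

    try:
        indices = extract_toc_indices(file_path=file_path, text=text)
        return jsonify({"success": True, "indices": indices})
    except ValueError as e:
        # Covers invalid arguments and a missing API key
        return jsonify({"success": False, "error": str(e)}), 400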

Return Value

Returns a sorted list of integers representing character indices where each ToC section begins:

[150, 2450, 5200]  # Character positions in the text

If no ToC is found, returns an empty list: []
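Because the indices are plain character offsets, they can be used directly to slice the text. A minimal sketch, assuming each section runs from its start index to the next one (or to the end of the text); split_sections is a hypothetical helper, not part of the package.

```python
from typing import List


def split_sections(text: str, indices: List[int]) -> List[str]:
    """Slice text into sections, each running from one index to the next."""
    bounds = sorted(indices) + [len(text)]
    # Pair each start with the following bound; an empty index list yields [].
    return [text[start:end] for start, end in zip(bounds, bounds[1:])]


sample = "front matter|chapter one|chapter two"
sections = split_sections(sample, [13, 25])
print(sections)
```

Text before the first index (front matter, the ToC itself) is not included in any section.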

How It Works

Pipeline Overview

1. Load text from a file or string
2. Find the ToC section using དཀར་ཆག markers (falling back to the first quarter of the text or the first 100 pages)
3. Extract ToC entries using regex patterns
4. Fall back to the LLM if regex fails (sends only the ToC section, not the whole text)
5. Locate section starts by page markers (if present) or fuzzy title matching
6. Return sorted indices

ToC Section Detection

The package looks for དཀར་ཆག (Table of Contents marker) in the text:

Takes the first occurrence as ToC start
Takes the last occurrence as ToC body end anchor
Extends until a double newline or 4 more pages, whichever comes first
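The detection rule above can be sketched as follows. This is an approximation under stated assumptions, not the package's implementation: a page is estimated as 2000 characters (the chars_per_page value shown under Advanced Configuration), and the name find_toc_span and its parameters are hypothetical.

```python
from typing import Optional, Tuple

TOC_MARKER = "དཀར་ཆག"


def find_toc_span(text: str, chars_per_page: int = 2000,
                  extra_pages: int = 4) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the ToC section, or None if no marker exists."""
    first = text.find(TOC_MARKER)
    if first == -1:
        return None
    # The last occurrence anchors the end of the ToC body.
    anchor = text.rfind(TOC_MARKER) + len(TOC_MARKER)
    # Extend to the next blank line or 4 more pages, whichever comes first.
    limit = min(len(text), anchor + extra_pages * chars_per_page)
    blank = text.find("\n\n", anchor, limit)
    end = blank if blank != -1 else limit
    return first, end
```

When the function returns None, the pipeline's documented fallback (first quarter of the text or first 100 pages) would apply instead.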

Entry Extraction

Attempts regex patterns first:

Tibetan text + delimiter (༎ ། . …) + page number
Supports both Arabic (0-9) and Tibetan numerals (༠-༩)

If regex fails, sends the extracted ToC section to LLM for structured extraction (JSON format).
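As an illustration of the regex step, the sketch below matches one entry per line in the shape "Tibetan title + delimiter + page number" and converts Tibetan numerals to integers. The exact patterns used by the package are not shown in this README, so ENTRY_RE and parse_toc_entries are assumptions made for the example.

```python
import re
from typing import List, Tuple

# Map Tibetan digits (U+0F20 to U+0F29) to ASCII digits.
TIBETAN_DIGITS = str.maketrans("༠༡༢༣༤༥༦༧༨༩", "0123456789")

ENTRY_RE = re.compile(
    r"^(?P<title>[\u0F00-\u0FFF ]+?)"   # Tibetan title, matched lazily
    r"[ \t]*[༎།.…]+[ \t]*"              # one or more delimiter characters
    r"(?P<page>[0-9༠-༩]+)[ \t]*$",      # Arabic or Tibetan page number
    re.MULTILINE,
)


def parse_toc_entries(toc_text: str) -> List[Tuple[str, int]]:
    """Return (title, page) pairs for every line that looks like a ToC entry."""
    entries = []
    for match in ENTRY_RE.finditer(toc_text):
        page = int(match.group("page").translate(TIBETAN_DIGITS))
        entries.append((match.group("title").strip(), page))
    return entries
```

Lines that do not fit the pattern are simply skipped; if nothing matches at all, that is the point where the LLM fallback would take over.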

Section Location

For each ToC entry:

If page numbers exist in text: finds page marker, returns position after it
If no page markers: fuzzy matches the title using rapidfuzz
- Searches within ±50% of expected page offset
- Picks best match with similarity ≥ 90%
- If no match found: skips silently (not included in output)
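The fuzzy-matching branch can be sketched with the standard library. Note the substitution: this example uses difflib.SequenceMatcher in place of rapidfuzz so that it runs without extra dependencies, and its scores will not be identical to rapidfuzz's. It also assumes that "±50% of expected page offset" means a window from 0.5x to 1.5x of page * chars_per_page; locate_title and its parameters are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Optional


def locate_title(text: str, title: str, page: int,
                 chars_per_page: int = 2000,
                 fuzzy_threshold: float = 0.9) -> Optional[int]:
    """Return the best-matching position of title near its expected offset."""
    expected = page * chars_per_page
    lo = max(0, int(expected * 0.5))
    hi = min(len(text), int(expected * 1.5))
    n = len(title)
    best_score, best_pos = 0.0, None
    # Slide a title-sized window across the search range (slow but simple).
    for pos in range(lo, max(lo, hi - n) + 1):
        score = SequenceMatcher(None, title, text[pos:pos + n]).ratio()
        if score > best_score:
            best_score, best_pos = score, pos
    # Entries below the threshold are skipped, mirroring the documented behaviour.
    return best_pos if best_score >= fuzzy_threshold else None
```

A sliding window like this is quadratic in the worst case; a real implementation would likely use a faster partial-match routine.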

Error Handling

ValueError: Raised if neither or both of file_path and text are provided
FileNotFoundError: Raised if file doesn't exist
UnicodeDecodeError: Raised if file is not UTF-8 encoded
ValueError: Raised if no API key is configured

For LLM errors (rate limits, auth failures), the package logs a warning and returns an empty list [].
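The input-related errors listed above follow naturally from how text is loaded. A minimal sketch of that validation, assuming the file is read as UTF-8; load_text is a hypothetical helper and not the package's actual function.

```python
from pathlib import Path
from typing import Optional


def load_text(file_path: Optional[str] = None,
              text: Optional[str] = None) -> str:
    """Return the input text, requiring exactly one of file_path or text."""
    if (file_path is None) == (text is None):
        raise ValueError("Provide exactly one of file_path or text")
    if text is not None:
        return text
    # read_text raises FileNotFoundError for a missing file and
    # UnicodeDecodeError if the bytes are not valid UTF-8.
    return Path(file_path).read_text(encoding="utf-8")
```

Since UnicodeDecodeError is a subclass of ValueError in Python, callers that need to tell the two apart should catch UnicodeDecodeError first.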

Logging

Enable debug logging to see detailed extraction steps:

import logging

logging.basicConfig(level=logging.DEBUG)
indices = extract_toc_indices(text=text)

Debug output shows:

Text length loaded
Provider and model used
Whether དཀར་ཆག section was found
Number of ToC entries extracted
Which entries couldn't be located
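Setting the root logger to DEBUG also surfaces output from HTTP clients and other libraries. One option is to raise only this package's logger. The sketch assumes the package logs under a logger named "ai_text_outline" (the usual convention of naming loggers after the module); this README does not confirm the logger name.

```python
import logging

# Keep other libraries at WARNING and give log lines a readable format.
logging.basicConfig(
    level=logging.WARNING,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

# Assumption: the package's loggers live under the "ai_text_outline" namespace.
logging.getLogger("ai_text_outline").setLevel(logging.DEBUG)
```

If no debug output appears, fall back to the root-level configuration shown above.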

Requirements

Python 3.9+
At least one LLM API key (Gemini, OpenAI, or Claude)

Performance Notes

Typical Tibetan texts (< 1000 pages): ~1-2 seconds
Large texts (> 1000 pages): ~2-5 seconds depending on ToC complexity
Only the ToC section is sent to the LLM (not the full text), which keeps API calls much cheaper

Troubleshooting

"No API key found" Error

Make sure you've set one of these environment variables:

GEMINI_API_KEY
OPENAI_API_KEY
ANTHROPIC_API_KEY

Check with:

echo $GEMINI_API_KEY

LLM Returns Empty Response

This typically means:

The ToC format is unusual (try looking at the text manually)
The LLM couldn't identify the structure (try a different model or provider)
API rate limit reached (wait and retry)

Enable debug logging to see what text was sent to the LLM:

import logging
logging.basicConfig(level=logging.DEBUG)
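For the rate-limit case, "wait and retry" can be wrapped in a small helper. Because LLM errors are reported as an empty list rather than an exception, the sketch retries on an empty result with exponential backoff. retry_if_empty and its parameters are hypothetical; keep in mind that an empty list can also mean the text has no ToC, so retries should be bounded.

```python
import time
from typing import Callable, List


def retry_if_empty(call: Callable[[], List[int]], attempts: int = 3,
                   base_delay: float = 2.0,
                   sleep: Callable[[float], None] = time.sleep) -> List[int]:
    """Call until a non-empty list comes back or attempts run out."""
    result: List[int] = []
    for attempt in range(attempts):
        result = call()
        if result:
            return result
        if attempt < attempts - 1:
            # Wait 2s, 4s, 8s, ... between attempts.
            sleep(base_delay * 2 ** attempt)
    return result
```

Usage would look like retry_if_empty(lambda: extract_toc_indices(text=text)).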

Regex Extraction Works But Indices Are Wrong

The fuzzy matching threshold (default 0.9) may be too strict, so sections whose titles differ slightly from the ToC are skipped. Try lowering it:

indices = extract_toc_indices(text=text, fuzzy_threshold=0.85)

License

MIT License — See LICENSE file for details

Release history

0.6.0 (Apr 10, 2026)
0.5.0 (Apr 8, 2026)
0.4.0 (Apr 2, 2026)
0.3.1 (Apr 2, 2026)
0.3.0 (Apr 2, 2026)
0.2.2 (Apr 2, 2026)
0.2.1 (Apr 2, 2026)
0.1.1 (Apr 1, 2026)
0.1.0 (Apr 1, 2026), this version

Download files

Source Distribution

ai_text_outline-0.1.0.tar.gz (19.1 kB)

Uploaded Apr 1, 2026 Source

Built Distribution

ai_text_outline-0.1.0-py3-none-any.whl (15.9 kB)

Uploaded Apr 1, 2026 Python 3

File details

Details for the file ai_text_outline-0.1.0.tar.gz.

File metadata

Download URL: ai_text_outline-0.1.0.tar.gz
Upload date: Apr 1, 2026
Size: 19.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.13

File hashes

Hashes for ai_text_outline-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`f2b36ccc979613b11c12c7ce2ac0ca37d6d9519acbb98b269ba58ae3fb20d91a`
MD5	`138d5c351ec10f037111dee7ff979126`
BLAKE2b-256	`705bf59cb18407b9613f4accfe8b62695e46db764868d2114b96a21d9afbc4c1`

File details

Details for the file ai_text_outline-0.1.0-py3-none-any.whl.

File metadata

Download URL: ai_text_outline-0.1.0-py3-none-any.whl
Upload date: Apr 1, 2026
Size: 15.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.13

File hashes

Hashes for ai_text_outline-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a14f954d72721d68e5e00311ded10139b7e2494bb71790233754157c2b23d999`
MD5	`a9e3b5a0a9b697a7b83431eaceb34cd6`
BLAKE2b-256	`e9fba636a60cf7ed34e5eed69aa06ccf786aea891dc78b551fa6dc7ff0e5bcb8`
