
Simple text extraction from files using Vectorize Iris

Project description

Vectorize Iris Python SDK

Document text extraction for Python

Extract text, tables, and structured data from PDFs, images, and documents with a single function call. Built on Vectorize Iris, the industry-leading AI extraction service.


Why Iris?

Traditional OCR tools struggle with complex layouts, poor scans, and structured data. Iris uses advanced AI to deliver:

  • ⚡ High accuracy - Even with poor-quality scans or complex documents
  • 📊 Structure preservation - Maintains tables, lists, and formatting
  • 🎯 Smart chunking - Semantic splitting perfect for RAG pipelines
  • 🔍 Metadata extraction - Extract specific fields using natural language
  • ✨ Simple API - One line of code to extract text

Quick Start

Installation

pip install vectorize-iris

Authentication

Set your credentials (get them at vectorize.io):

export VECTORIZE_API_TOKEN="your-token"
export VECTORIZE_ORG_ID="your-org-id"
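If you prefer to fail fast before making any API calls, a small pre-flight check can confirm both variables are set. This helper is plain stdlib, not part of the SDK:

```python
import os

def check_credentials(env=os.environ):
    """Return the names of any missing Vectorize credentials."""
    required = ("VECTORIZE_API_TOKEN", "VECTORIZE_ORG_ID")
    return [name for name in required if not env.get(name)]

missing = check_credentials()
if missing:
    print("Missing credentials:", ", ".join(missing))
```

You can also pass credentials directly via `ExtractionOptions(api_token=..., org_id=...)` if you'd rather not rely on the environment.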

Basic Usage

from vectorize_iris import extract_text_from_file

result = extract_text_from_file('document.pdf')
print(result.text)

That's it! Iris handles file upload, extraction, and polling automatically.

Features

Basic Text Extraction

from vectorize_iris import extract_text_from_file

result = extract_text_from_file('document.pdf')
print(result.text)

Output:

This is the extracted text from your PDF document.
All formatting and structure are preserved.

Tables, lists, and other elements are properly extracted.

Extract from Bytes

from vectorize_iris import extract_text

with open('document.pdf', 'rb') as f:
    file_bytes = f.read()

result = extract_text(file_bytes, 'document.pdf')
print(f"Extracted {len(result.text)} characters")

Output:

Extracted 5536 characters

Chunking for RAG

from vectorize_iris import extract_text_from_file, ExtractionOptions

result = extract_text_from_file(
    'long-document.pdf',
    options=ExtractionOptions(
        chunk_size=512
    )
)

for i, chunk in enumerate(result.chunks):
    print(f"Chunk {i+1}: {chunk[:100]}...")

Output:

Chunk 1: # Introduction
This document covers the basics of machine learning...

Chunk 2: ## Neural Networks
Neural networks are computational models inspired by...

Chunk 3: ### Training Process
The training process involves adjusting weights...
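Downstream RAG pipelines usually need each chunk tagged with its provenance. One common way to package the chunks for an embedding step is JSON Lines; this is a generic sketch, not an SDK feature:

```python
import json

def chunks_to_jsonl(chunks, source):
    """Serialize chunks as JSON Lines: one record per chunk,
    tagged with the source file and chunk index."""
    return "\n".join(
        json.dumps({"source": source, "chunk_index": i, "text": chunk})
        for i, chunk in enumerate(chunks)
    )

records = chunks_to_jsonl(["# Introduction", "## Neural Networks"], "long-document.pdf")
print(records)
```

Each line is an independent JSON object, so the file can be streamed into an embedding job without loading everything at once.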

Custom Parsing Instructions

from vectorize_iris import extract_text_from_file, ExtractionOptions

result = extract_text_from_file(
    'report.pdf',
    options=ExtractionOptions(
        parsing_instructions='Extract only tables and numerical data, ignore narrative text'
    )
)

print(result.text)

Output:

Q1 2024 Revenue: $1,250,000
Q2 2024 Revenue: $1,450,000
Q3 2024 Revenue: $1,680,000

Region    | Sales  | Growth
----------|--------|-------
North     | $500K  | +12%
South     | $380K  | +8%
East      | $420K  | +15%
West      | $380K  | +10%

Inferred Metadata Schema

from vectorize_iris import extract_text_from_file, ExtractionOptions

result = extract_text_from_file(
    'invoice.pdf',
    options=ExtractionOptions(
        infer_metadata_schema=True
    )
)

import json
metadata = json.loads(result.metadata)
print(json.dumps(metadata, indent=2))

Output:

{
  "document_type": "invoice",
  "invoice_number": "INV-2024-001",
  "date": "2024-01-15",
  "total_amount": 1250.00,
  "currency": "USD",
  "vendor": "Acme Corp"
}

Async API

import asyncio
from vectorize_iris import extract_text_from_file_async

async def extract_multiple():
    files = ['doc1.pdf', 'doc2.pdf', 'doc3.pdf']

    tasks = [extract_text_from_file_async(f) for f in files]
    results = await asyncio.gather(*tasks)

    for file, result in zip(files, results):
        print(f"{file}: {len(result.text)} chars extracted")

asyncio.run(extract_multiple())

Output:

doc1.pdf: 3421 chars extracted
doc2.pdf: 5892 chars extracted
doc3.pdf: 2156 chars extracted
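`asyncio.gather` launches every task at once, which may be more concurrency than you want for large batches. A standard way to cap it is `asyncio.Semaphore`; the sketch below uses a hypothetical `fake_extract` stand-in, but any async function, including `extract_text_from_file_async`, can be passed in its place:

```python
import asyncio

async def bounded_gather(fn, items, limit=3):
    """Run fn(item) for every item, with at most `limit` running concurrently."""
    sem = asyncio.Semaphore(limit)

    async def run(item):
        async with sem:
            return await fn(item)

    return await asyncio.gather(*(run(item) for item in items))

# Stand-in for extract_text_from_file_async; swap in the real call.
async def fake_extract(path):
    await asyncio.sleep(0)
    return f"text of {path}"

results = asyncio.run(bounded_gather(fake_extract, ["a.pdf", "b.pdf"]))
```

Results come back in input order, so `zip(files, results)` works just as in the example above.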

Error Handling

from vectorize_iris import extract_text_from_file, VectorizeIrisError

try:
    result = extract_text_from_file('document.pdf')
    print(result.text)
except VectorizeIrisError as e:
    print(f"Extraction failed: {e}")

Output:

Extraction failed: File not found: document.pdf

Batch Processing

from vectorize_iris import extract_text_from_file
from pathlib import Path

for pdf_file in sorted(Path('documents').glob('*.pdf')):
    print(f"Processing {pdf_file}...")
    result = extract_text_from_file(str(pdf_file))

    # with_suffix only swaps the extension, unlike str.replace,
    # which would also rewrite '.pdf' anywhere else in the path
    output_file = pdf_file.with_suffix('.txt')
    output_file.write_text(result.text)

    print(f"  ✓ Saved to {output_file}")

Output:

Processing documents/report-q1.pdf...
  ✓ Saved to documents/report-q1.txt
Processing documents/report-q2.pdf...
  ✓ Saved to documents/report-q2.txt
Processing documents/report-q3.pdf...
  ✓ Saved to documents/report-q3.txt

API Reference

extract_text_from_file(file_path, options=None)

Extract text from a file.

Parameters:

  • file_path (str): Path to the file
  • options (ExtractionOptions, optional): Extraction options

Returns: ExtractionResultData with:

  • success (bool): Whether extraction succeeded
  • text (str): Extracted text
  • chunks (list[str], optional): Text chunks if chunking enabled
  • metadata (str, optional): JSON metadata if requested

extract_text(file_bytes, file_name, options=None)

Extract text from bytes.

Parameters:

  • file_bytes (bytes): File content
  • file_name (str): File name
  • options (ExtractionOptions, optional): Extraction options

Returns: ExtractionResultData

Async versions

  • extract_text_from_file_async() - Async version of extract_text_from_file
  • extract_text_async() - Async version of extract_text

ExtractionOptions

ExtractionOptions(
    chunk_size=512,                # default: 256
    parsing_instructions='...',    # custom instructions
    infer_metadata_schema=True,    # auto-detect metadata
    api_token='...',              # override env var
    org_id='...',                 # override env var
    poll_interval=2,              # seconds between checks
    timeout=300                   # max seconds to wait
)
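`poll_interval` and `timeout` control the SDK's internal wait loop. Conceptually, the pattern is a deadline-bounded poll; the sketch below shows the idea with an injectable clock, and is a rough approximation, not the SDK's actual implementation:

```python
import time

def poll_until_done(check, poll_interval=2, timeout=300,
                    clock=time.monotonic, sleep=time.sleep):
    """Call check() every poll_interval seconds until it returns a
    non-None result; raise TimeoutError after timeout seconds."""
    deadline = clock() + timeout
    while True:
        result = check()
        if result is not None:
            return result
        if clock() + poll_interval > deadline:
            raise TimeoutError(f"no result within {timeout}s")
        sleep(poll_interval)
```

Raising `timeout` (and, for large documents, `poll_interval`) trades responsiveness for fewer status checks.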

📚 Full Documentation | 🏠 Back to Main README

Project details


Download files

Download the file for your platform.

Source Distribution

vectorize_iris-0.0.2.tar.gz (181.8 kB)

Uploaded Source

Built Distribution


vectorize_iris-0.0.2-py3-none-any.whl (9.9 kB)

Uploaded Python 3

File details

Details for the file vectorize_iris-0.0.2.tar.gz.

File metadata

  • Download URL: vectorize_iris-0.0.2.tar.gz
  • Size: 181.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for vectorize_iris-0.0.2.tar.gz:

  • SHA256: b9a16ba7807ecc94ad388c4369f6ccb1c2683a96d2701f52cb65b06accf390cd
  • MD5: 34222209d68128a9a7eeca701d8dc716
  • BLAKE2b-256: 9f8f32e9f434858d90a602fe4211e63696bdf42a03973300e5482eaee555f80f


Provenance

The following attestation bundles were made for vectorize_iris-0.0.2.tar.gz:

Publisher: release.yml on vectorize-io/vectorize-iris

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file vectorize_iris-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: vectorize_iris-0.0.2-py3-none-any.whl
  • Size: 9.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for vectorize_iris-0.0.2-py3-none-any.whl:

  • SHA256: b6063f69b77a988c891b53adeb7d36051e2aa7d83e04ac312bb2f1efb6e42d0a
  • MD5: c149b6a59945acdb497d7e7cbe341293
  • BLAKE2b-256: 52af83e73508eb2f13f7baefa7001565508edbc4c192dacfb77de8e57924bb1a


Provenance

The following attestation bundles were made for vectorize_iris-0.0.2-py3-none-any.whl:

Publisher: release.yml on vectorize-io/vectorize-iris

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
