Simple text extraction from files using Vectorize Iris
Vectorize Iris Python SDK
Document text extraction for Python
Extract text, tables, and structured data from PDFs, images, and documents with a single function call. Built on Vectorize Iris, an AI-based extraction service.
Why Iris?
Traditional OCR tools struggle with complex layouts, poor scans, and structured data. Iris uses advanced AI to deliver:
- ✨ High accuracy - Even with poor quality or complex documents
- 📊 Structure preservation - Maintains tables, lists, and formatting
- 🎯 Smart chunking - Semantic splitting perfect for RAG pipelines
- 🔍 Metadata extraction - Extract specific fields using natural language
- ⚡ Simple API - One line of code to extract text
Quick Start
Installation
pip install vectorize-iris
Authentication
Set your credentials (get them at vectorize.io):
export VECTORIZE_TOKEN="your-token"
export VECTORIZE_ORG_ID="your-org-id"
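Before the first extraction call, it can help to confirm that both variables are visible to the Python process. The following is a minimal sketch using only the standard library; `missing_credentials` is a hypothetical helper for illustration and is not part of the SDK.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_VARS = ("VECTORIZE_TOKEN", "VECTORIZE_ORG_ID")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credentials()
if missing:
    print("Missing credentials: " + ", ".join(missing))
else:
    print("Credentials found.")
```

Passing a plain dict as `env` makes the check easy to exercise in tests without touching the real environment.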
Basic Usage
from vectorize_iris import extract_text_from_file
result = extract_text_from_file('document.pdf')
print(result.text)
That's it! Iris handles file upload, extraction, and polling automatically.
Features
Basic Text Extraction
from vectorize_iris import extract_text_from_file
result = extract_text_from_file('document.pdf')
print(result.text)
Output:
This is the extracted text from your PDF document.
All formatting and structure is preserved.
Tables, lists, and other elements are properly extracted.
Extract from Bytes
from vectorize_iris import extract_text
with open('document.pdf', 'rb') as f:
    file_bytes = f.read()
result = extract_text(file_bytes, 'document.pdf')
print(f"Extracted {len(result.text)} characters")
Output:
Extracted 5536 characters
Chunking for RAG
from vectorize_iris import extract_text_from_file, ExtractionOptions
result = extract_text_from_file(
    'long-document.pdf',
    options=ExtractionOptions(
        chunk_size=512
    )
)
for i, chunk in enumerate(result.chunks):
    print(f"Chunk {i+1}: {chunk[:100]}...")
Output:
Chunk 1: # Introduction
This document covers the basics of machine learning...
Chunk 2: ## Neural Networks
Neural networks are computational models inspired by...
Chunk 3: ### Training Process
The training process involves adjusting weights...
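In a RAG pipeline, chunks are usually stored alongside an identifier and their source so that retrieved passages can be traced back to a document. The sketch below shows one way to do that with plain Python; `build_chunk_records` and the record layout are assumptions for illustration, and the sample list stands in for `result.chunks`.

```python
def build_chunk_records(chunks, source):
    """Pair each chunk with a stable ID and the name of its source file."""
    return [
        {"id": f"{source}#chunk-{i}", "source": source, "text": chunk}
        for i, chunk in enumerate(chunks, start=1)
    ]


# Stand-in for result.chunks; real chunks come from extract_text_from_file.
sample_chunks = [
    "# Introduction\nThis document covers the basics...",
    "## Neural Networks\nNeural networks are...",
]
records = build_chunk_records(sample_chunks, "long-document.pdf")
print(records[0]["id"])  # long-document.pdf#chunk-1
```

Each record can then be embedded and written to whichever vector store the pipeline uses.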
Custom Parsing Instructions
from vectorize_iris import extract_text_from_file, ExtractionOptions
result = extract_text_from_file(
    'report.pdf',
    options=ExtractionOptions(
        parsing_instructions='Extract only tables and numerical data, ignore narrative text'
    )
)
print(result.text)
Output:
Q1 2024 Revenue: $1,250,000
Q2 2024 Revenue: $1,450,000
Q3 2024 Revenue: $1,680,000
Region | Sales | Growth
----------|--------|-------
North | $500K | +12%
South | $380K | +8%
East | $420K | +15%
West | $380K | +10%
Inferred Metadata Schema
import json
from vectorize_iris import extract_text_from_file, ExtractionOptions
result = extract_text_from_file(
    'invoice.pdf',
    options=ExtractionOptions(
        infer_metadata_schema=True
    )
)
# result.metadata is a JSON string, so decode it before use
metadata = json.loads(result.metadata)
print(json.dumps(metadata, indent=2))
Output:
{
  "document_type": "invoice",
  "invoice_number": "INV-2024-001",
  "date": "2024-01-15",
  "total_amount": 1250.0,
  "currency": "USD",
  "vendor": "Acme Corp"
}
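The API reference lists `metadata` as an optional JSON string, so code that consumes it may want to tolerate a missing or malformed value. A minimal sketch, assuming an empty dict is an acceptable fallback; `parse_metadata` is a hypothetical helper, not an SDK function.

```python
import json


def parse_metadata(metadata):
    """Decode a metadata JSON string; return {} if it is absent or not a JSON object."""
    if not metadata:
        return {}
    try:
        parsed = json.loads(metadata)
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}


# Stand-in for result.metadata.
sample = '{"document_type": "invoice", "total_amount": 1250.00}'
fields = parse_metadata(sample)
print(fields.get("total_amount"))  # 1250.0
```

Using `fields.get(...)` rather than indexing keeps downstream code working when an inferred schema omits a field.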
Async API
import asyncio
from vectorize_iris import extract_text_from_file_async
async def extract_multiple():
    files = ['doc1.pdf', 'doc2.pdf', 'doc3.pdf']
    # Start all extractions concurrently and wait for every result
    tasks = [extract_text_from_file_async(f) for f in files]
    results = await asyncio.gather(*tasks)
    for file, result in zip(files, results):
        print(f"{file}: {len(result.text)} chars extracted")
asyncio.run(extract_multiple())
Output:
doc1.pdf: 3421 chars extracted
doc2.pdf: 5892 chars extracted
doc3.pdf: 2156 chars extracted
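With many files, launching every extraction at once may be undesirable. The sketch below caps the number of calls in flight with `asyncio.Semaphore`. `gather_limited` is a hypothetical helper, and `fake_extract` is a stand-in for `extract_text_from_file_async` so the example runs without network access.

```python
import asyncio


async def gather_limited(func, items, limit=3):
    """Run func(item) for every item with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item):
        async with semaphore:
            return await func(item)

    # asyncio.gather returns results in the same order as its inputs.
    return await asyncio.gather(*(run_one(item) for item in items))


async def fake_extract(path):
    """Stand-in for extract_text_from_file_async (assumption for this sketch)."""
    await asyncio.sleep(0)
    return f"text from {path}"


results = asyncio.run(gather_limited(fake_extract, ["doc1.pdf", "doc2.pdf"], limit=1))
print(results)  # ['text from doc1.pdf', 'text from doc2.pdf']
```

In real use, `fake_extract` would be replaced by the SDK's async function, and `limit` chosen to suit whatever request rate is appropriate for the account.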
Error Handling
from vectorize_iris import extract_text_from_file, VectorizeIrisError
try:
    result = extract_text_from_file('document.pdf')
    print(result.text)
except VectorizeIrisError as e:
    print(f"Extraction failed: {e}")
Output:
Extraction failed: File not found: document.pdf
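When processing several files, one failure need not stop the rest. The following sketch collects successes and failures separately. `run_all` is a hypothetical helper; `fake_extract` and `ValueError` stand in for `extract_text_from_file` and `VectorizeIrisError` so the example runs offline.

```python
def run_all(func, items, error_type=Exception):
    """Apply func to each item, separating results from error messages."""
    results, failures = {}, {}
    for item in items:
        try:
            results[item] = func(item)
        except error_type as exc:
            failures[item] = str(exc)
    return results, failures


def fake_extract(path):
    """Stand-in extractor that fails for paths ending in '.missing'."""
    if path.endswith(".missing"):
        raise ValueError(f"File not found: {path}")
    return f"text from {path}"


results, failures = run_all(fake_extract, ["a.pdf", "b.missing"], error_type=ValueError)
print(failures)  # {'b.missing': 'File not found: b.missing'}
```

With the SDK, the call would presumably look like `run_all(extract_text_from_file, paths, error_type=VectorizeIrisError)`.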
Batch Processing
import glob
from vectorize_iris import extract_text_from_file
for pdf_file in glob.glob('documents/*.pdf'):
    print(f"Processing {pdf_file}...")
    result = extract_text_from_file(pdf_file)
    output_file = pdf_file.replace('.pdf', '.txt')
    # Write as UTF-8 so non-ASCII text is saved consistently across platforms
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(result.text)
    print(f" ✓ Saved to {output_file}")
Output:
Processing documents/report-q1.pdf...
✓ Saved to documents/report-q1.txt
Processing documents/report-q2.pdf...
✓ Saved to documents/report-q2.txt
Processing documents/report-q3.pdf...
✓ Saved to documents/report-q3.txt
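The loop above derives the output name with `str.replace`, which rewrites every occurrence of `.pdf` in the path, including any in directory names. A small sketch of an alternative using `pathlib`, which changes only the final suffix; `text_output_path` is a hypothetical helper name.

```python
from pathlib import Path


def text_output_path(pdf_path):
    """Return the sibling .txt path for the given input file."""
    return Path(pdf_path).with_suffix(".txt")


print(text_output_path("documents/report-q1.pdf"))
# A directory name containing ".pdf" is left untouched:
print(text_output_path("archive.pdf.d/scan.pdf"))
```

The returned `Path` can be passed directly to `open()` or used with `Path.write_text`.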
API Reference
extract_text_from_file(file_path, options=None)
Extract text from a file.
Parameters:
- file_path (str): Path to the file
- options (ExtractionOptions, optional): Extraction options
Returns: ExtractionResultData with:
- success (bool): Whether extraction succeeded
- text (str): Extracted text
- chunks (list[str], optional): Text chunks if chunking enabled
- metadata (str, optional): JSON metadata if requested
extract_text(file_bytes, file_name, options=None)
Extract text from bytes.
Parameters:
- file_bytes (bytes): File content
- file_name (str): File name
- options (ExtractionOptions, optional): Extraction options
Returns: ExtractionResultData
Async versions
- extract_text_from_file_async(): Async version of extract_text_from_file
- extract_text_async(): Async version of extract_text
ExtractionOptions
ExtractionOptions(
    chunk_size=512,                # default: 256
    parsing_instructions='...',    # custom instructions
    infer_metadata_schema=True,    # auto-detect metadata
    api_token='...',               # override env var
    org_id='...',                  # override env var
    poll_interval=2,               # seconds between checks
    timeout=300                    # max seconds to wait
)
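To illustrate how a polling interval and a timeout typically interact, here is a generic polling loop. This is a sketch under stated assumptions, not the SDK's actual implementation: `poll_until`, `fake_check`, and the convention that `None` means "not ready yet" are all hypothetical.

```python
import time


def poll_until(check, poll_interval=2.0, timeout=300.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns a non-None value or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        result = check()
        if result is not None:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"no result after {timeout} seconds")
        sleep(poll_interval)


# Fake status check that reports completion on the third call.
calls = {"n": 0}


def fake_check():
    calls["n"] += 1
    return "done" if calls["n"] >= 3 else None


print(poll_until(fake_check, poll_interval=0, timeout=5))  # done
```

A shorter interval notices completion sooner at the cost of more status requests, while the timeout bounds how long a caller can be blocked.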