
Simple text extraction from files using Vectorize Iris

Project description

Vectorize Iris Python SDK

Document text extraction for Python

Extract text, tables, and structured data from PDFs, images, and documents with a single function call. Built on Vectorize Iris, the industry-leading AI extraction service.


Why Iris?

Traditional OCR tools struggle with complex layouts, poor scans, and structured data. Iris uses advanced AI to deliver:

  • ⚡ High accuracy - Even with poor-quality scans or complex documents
  • 📊 Structure preservation - Maintains tables, lists, and formatting
  • 🎯 Smart chunking - Semantic splitting perfect for RAG pipelines
  • 🔍 Metadata extraction - Extract specific fields using natural language
  • ✨ Simple API - One line of code to extract text

Quick Start

Installation

pip install vectorize-iris

Authentication

Set your credentials (get them at vectorize.io):

export VECTORIZE_API_TOKEN="your-token"
export VECTORIZE_ORG_ID="your-org-id"
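If you prefer to fail fast before making any API calls, a small pre-flight check can confirm both variables are set. This helper is plain stdlib, not part of the SDK:

```python
import os

def check_credentials(env=os.environ):
    """Return the names of any missing Vectorize credentials."""
    required = ("VECTORIZE_API_TOKEN", "VECTORIZE_ORG_ID")
    return [name for name in required if not env.get(name)]

missing = check_credentials()
if missing:
    print("Missing credentials:", ", ".join(missing))
```

You can also pass credentials directly via `ExtractionOptions(api_token=..., org_id=...)` if you'd rather not rely on the environment.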

Basic Usage

from vectorize_iris import extract_text_from_file

result = extract_text_from_file('document.pdf')
print(result.text)

That's it! Iris handles file upload, extraction, and polling automatically.

Features

Basic Text Extraction

from vectorize_iris import extract_text_from_file

result = extract_text_from_file('document.pdf')
print(result.text)

Output:

This is the extracted text from your PDF document.
All formatting and structure are preserved.

Tables, lists, and other elements are properly extracted.

Extract from Bytes

from vectorize_iris import extract_text

with open('document.pdf', 'rb') as f:
    file_bytes = f.read()

result = extract_text(file_bytes, 'document.pdf')
print(f"Extracted {len(result.text)} characters")

Output:

Extracted 5536 characters

Chunking for RAG

from vectorize_iris import extract_text_from_file, ExtractionOptions

result = extract_text_from_file(
    'long-document.pdf',
    options=ExtractionOptions(
        chunk_size=512
    )
)

for i, chunk in enumerate(result.chunks):
    print(f"Chunk {i+1}: {chunk[:100]}...")

Output:

Chunk 1: # Introduction
This document covers the basics of machine learning...

Chunk 2: ## Neural Networks
Neural networks are computational models inspired by...

Chunk 3: ### Training Process
The training process involves adjusting weights...
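Downstream RAG pipelines usually need each chunk tagged with its provenance. One common way to package the chunks for an embedding step is JSON Lines; this is a generic sketch, not an SDK feature:

```python
import json

def chunks_to_jsonl(chunks, source):
    """Serialize chunks as JSON Lines: one record per chunk,
    tagged with the source file and chunk index."""
    return "\n".join(
        json.dumps({"source": source, "chunk_index": i, "text": chunk})
        for i, chunk in enumerate(chunks)
    )

records = chunks_to_jsonl(["# Introduction", "## Neural Networks"], "long-document.pdf")
print(records)
```

Each line is an independent JSON object, so the file can be streamed into an embedding job without loading everything at once.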

Custom Parsing Instructions

from vectorize_iris import extract_text_from_file, ExtractionOptions

result = extract_text_from_file(
    'report.pdf',
    options=ExtractionOptions(
        parsing_instructions='Extract only tables and numerical data, ignore narrative text'
    )
)

print(result.text)

Output:

Q1 2024 Revenue: $1,250,000
Q2 2024 Revenue: $1,450,000
Q3 2024 Revenue: $1,680,000

Region    | Sales  | Growth
----------|--------|-------
North     | $500K  | +12%
South     | $380K  | +8%
East      | $420K  | +15%
West      | $380K  | +10%

Inferred Metadata Schema

from vectorize_iris import extract_text_from_file, ExtractionOptions

result = extract_text_from_file(
    'invoice.pdf',
    options=ExtractionOptions(
        infer_metadata_schema=True
    )
)

import json
metadata = json.loads(result.metadata)
print(json.dumps(metadata, indent=2))

Output:

{
  "document_type": "invoice",
  "invoice_number": "INV-2024-001",
  "date": "2024-01-15",
  "total_amount": 1250.00,
  "currency": "USD",
  "vendor": "Acme Corp"
}

Async API

import asyncio
from vectorize_iris import extract_text_from_file_async

async def extract_multiple():
    files = ['doc1.pdf', 'doc2.pdf', 'doc3.pdf']

    tasks = [extract_text_from_file_async(f) for f in files]
    results = await asyncio.gather(*tasks)

    for file, result in zip(files, results):
        print(f"{file}: {len(result.text)} chars extracted")

asyncio.run(extract_multiple())

Output:

doc1.pdf: 3421 chars extracted
doc2.pdf: 5892 chars extracted
doc3.pdf: 2156 chars extracted
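`asyncio.gather` launches every task at once, which may be more concurrency than you want for large batches. A standard way to cap it is `asyncio.Semaphore`; the sketch below uses a hypothetical `fake_extract` stand-in, but any async function, including `extract_text_from_file_async`, can be passed in its place:

```python
import asyncio

async def bounded_gather(fn, items, limit=3):
    """Run fn(item) for every item, with at most `limit` running concurrently."""
    sem = asyncio.Semaphore(limit)

    async def run(item):
        async with sem:
            return await fn(item)

    return await asyncio.gather(*(run(item) for item in items))

# Stand-in for extract_text_from_file_async; swap in the real call.
async def fake_extract(path):
    await asyncio.sleep(0)
    return f"text of {path}"

results = asyncio.run(bounded_gather(fake_extract, ["a.pdf", "b.pdf"]))
```

Results come back in input order, so `zip(files, results)` works just as in the example above.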

Error Handling

from vectorize_iris import extract_text_from_file, VectorizeIrisError

try:
    result = extract_text_from_file('document.pdf')
    print(result.text)
except VectorizeIrisError as e:
    print(f"Extraction failed: {e}")

Output:

Extraction failed: File not found: document.pdf

Batch Processing

from vectorize_iris import extract_text_from_file
from pathlib import Path

for pdf_file in sorted(Path('documents').glob('*.pdf')):
    print(f"Processing {pdf_file}...")
    result = extract_text_from_file(str(pdf_file))

    # with_suffix only swaps the extension, unlike str.replace,
    # which would also rewrite '.pdf' anywhere else in the path
    output_file = pdf_file.with_suffix('.txt')
    output_file.write_text(result.text)

    print(f"  ✓ Saved to {output_file}")

Output:

Processing documents/report-q1.pdf...
  ✓ Saved to documents/report-q1.txt
Processing documents/report-q2.pdf...
  ✓ Saved to documents/report-q2.txt
Processing documents/report-q3.pdf...
  ✓ Saved to documents/report-q3.txt

API Reference

extract_text_from_file(file_path, options=None)

Extract text from a file.

Parameters:

  • file_path (str): Path to the file
  • options (ExtractionOptions, optional): Extraction options

Returns: ExtractionResultData with:

  • success (bool): Whether extraction succeeded
  • text (str): Extracted text
  • chunks (list[str], optional): Text chunks if chunking enabled
  • metadata (str, optional): JSON metadata if requested

extract_text(file_bytes, file_name, options=None)

Extract text from bytes.

Parameters:

  • file_bytes (bytes): File content
  • file_name (str): File name
  • options (ExtractionOptions, optional): Extraction options

Returns: ExtractionResultData

Async versions

  • extract_text_from_file_async() - Async version of extract_text_from_file
  • extract_text_async() - Async version of extract_text

ExtractionOptions

ExtractionOptions(
    chunk_size=512,                # default: 256
    parsing_instructions='...',    # custom instructions
    infer_metadata_schema=True,    # auto-detect metadata
    api_token='...',              # override env var
    org_id='...',                 # override env var
    poll_interval=2,              # seconds between checks
    timeout=300                   # max seconds to wait
)
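`poll_interval` and `timeout` control the SDK's internal wait loop. Conceptually, the pattern is a deadline-bounded poll; the sketch below shows the idea with an injectable clock, and is a rough approximation, not the SDK's actual implementation:

```python
import time

def poll_until_done(check, poll_interval=2, timeout=300,
                    clock=time.monotonic, sleep=time.sleep):
    """Call check() every poll_interval seconds until it returns a
    non-None result; raise TimeoutError after timeout seconds."""
    deadline = clock() + timeout
    while True:
        result = check()
        if result is not None:
            return result
        if clock() + poll_interval > deadline:
            raise TimeoutError(f"no result within {timeout}s")
        sleep(poll_interval)
```

Raising `timeout` (and, for large documents, `poll_interval`) trades responsiveness for fewer status checks.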

📚 Full Documentation | 🏠 Back to Main README

Project details


Download files

Download the file for your platform.

Source Distribution

vectorize_iris-0.0.2.tar.gz (181.8 kB)

Uploaded Source

Built Distribution


vectorize_iris-0.0.2-py3-none-any.whl (9.9 kB)

Uploaded Python 3

File details

Details for the file vectorize_iris-0.0.2.tar.gz.

File metadata

  • Download URL: vectorize_iris-0.0.2.tar.gz
  • Size: 181.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for vectorize_iris-0.0.2.tar.gz:

  • SHA256: b9a16ba7807ecc94ad388c4369f6ccb1c2683a96d2701f52cb65b06accf390cd
  • MD5: 34222209d68128a9a7eeca701d8dc716
  • BLAKE2b-256: 9f8f32e9f434858d90a602fe4211e63696bdf42a03973300e5482eaee555f80f


Provenance

The following attestation bundles were made for vectorize_iris-0.0.2.tar.gz:

Publisher: release.yml on vectorize-io/vectorize-iris

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file vectorize_iris-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: vectorize_iris-0.0.2-py3-none-any.whl
  • Size: 9.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for vectorize_iris-0.0.2-py3-none-any.whl:

  • SHA256: b6063f69b77a988c891b53adeb7d36051e2aa7d83e04ac312bb2f1efb6e42d0a
  • MD5: c149b6a59945acdb497d7e7cbe341293
  • BLAKE2b-256: 52af83e73508eb2f13f7baefa7001565508edbc4c192dacfb77de8e57924bb1a


Provenance

The following attestation bundles were made for vectorize_iris-0.0.2-py3-none-any.whl:

Publisher: release.yml on vectorize-io/vectorize-iris

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
