
Vectorize Iris Python SDK

Document text extraction for Python

Extract text, tables, and structured data from PDFs, images, and documents with a single function call. Built on Vectorize Iris, the industry-leading AI extraction service.


Why Iris?

Traditional OCR tools struggle with complex layouts, poor scans, and structured data. Iris uses advanced AI to deliver:

  • High accuracy - Even with poor quality or complex documents
  • 📊 Structure preservation - Maintains tables, lists, and formatting
  • 🎯 Smart chunking - Semantic splitting perfect for RAG pipelines
  • 🔍 Metadata extraction - Extract specific fields using natural language
  • Simple API - One line of code to extract text

Quick Start

Installation

pip install vectorize-iris

Authentication

Set your credentials (get them at vectorize.io):

export VECTORIZE_TOKEN="your-token"
export VECTORIZE_ORG_ID="your-org-id"
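If exporting shell variables is inconvenient (for example in a notebook), the same two variables can be set from Python before calling the SDK. This is just standard environment handling, not an SDK-specific API:

```python
import os

# Set the credentials the SDK reads from the environment.
# setdefault leaves any values already exported in the shell untouched.
os.environ.setdefault("VECTORIZE_TOKEN", "your-token")
os.environ.setdefault("VECTORIZE_ORG_ID", "your-org-id")
```

Per-call overrides are also possible via the `api_token` and `org_id` fields of `ExtractionOptions` (see the API reference below).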

Basic Usage

from vectorize_iris import extract_text_from_file

result = extract_text_from_file('document.pdf')
print(result.text)

That's it! Iris handles file upload, extraction, and polling automatically.
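Under the hood this is a submit-then-poll workflow, which the `poll_interval` and `timeout` options (see the API reference below) let you tune. A minimal sketch of that generic pattern, with hypothetical `check_status`/`fetch_result` callables standing in for the SDK's internals:

```python
import time

def poll_until_done(check_status, fetch_result, poll_interval=2, timeout=300):
    """Call check_status() every poll_interval seconds until it reports
    completion, then return fetch_result(); give up after timeout seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check_status() == "done":
            return fetch_result()
        time.sleep(poll_interval)
    raise TimeoutError("extraction did not finish in time")
```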

Features

Basic Text Extraction

from vectorize_iris import extract_text_from_file

result = extract_text_from_file('document.pdf')
print(result.text)

Output:

This is the extracted text from your PDF document.
All formatting and structure is preserved.

Tables, lists, and other elements are properly extracted.

Extract from Bytes

from vectorize_iris import extract_text

with open('document.pdf', 'rb') as f:
    file_bytes = f.read()

result = extract_text(file_bytes, 'document.pdf')
print(f"Extracted {len(result.text)} characters")

Output:

Extracted 5536 characters

Chunking for RAG

from vectorize_iris import extract_text_from_file, ExtractionOptions

result = extract_text_from_file(
    'long-document.pdf',
    options=ExtractionOptions(
        chunk_size=512
    )
)

for i, chunk in enumerate(result.chunks):
    print(f"Chunk {i+1}: {chunk[:100]}...")

Output:

Chunk 1: # Introduction
This document covers the basics of machine learning...

Chunk 2: ## Neural Networks
Neural networks are computational models inspired by...

Chunk 3: ### Training Process
The training process involves adjusting weights...

Custom Parsing Instructions

from vectorize_iris import extract_text_from_file, ExtractionOptions

result = extract_text_from_file(
    'report.pdf',
    options=ExtractionOptions(
        parsing_instructions='Extract only tables and numerical data, ignore narrative text'
    )
)

print(result.text)

Output:

Q1 2024 Revenue: $1,250,000
Q2 2024 Revenue: $1,450,000
Q3 2024 Revenue: $1,680,000

Region    | Sales  | Growth
----------|--------|-------
North     | $500K  | +12%
South     | $380K  | +8%
East      | $420K  | +15%
West      | $380K  | +10%

Inferred Metadata Schema

from vectorize_iris import extract_text_from_file, ExtractionOptions

result = extract_text_from_file(
    'invoice.pdf',
    options=ExtractionOptions(
        infer_metadata_schema=True
    )
)

import json
metadata = json.loads(result.metadata)
print(json.dumps(metadata, indent=2))

Output:

{
  "document_type": "invoice",
  "invoice_number": "INV-2024-001",
  "date": "2024-01-15",
  "total_amount": 1250.00,
  "currency": "USD",
  "vendor": "Acme Corp"
}

Async API

import asyncio
from vectorize_iris import extract_text_from_file_async

async def extract_multiple():
    files = ['doc1.pdf', 'doc2.pdf', 'doc3.pdf']

    tasks = [extract_text_from_file_async(f) for f in files]
    results = await asyncio.gather(*tasks)

    for file, result in zip(files, results):
        print(f"{file}: {len(result.text)} chars extracted")

asyncio.run(extract_multiple())

Output:

doc1.pdf: 3421 chars extracted
doc2.pdf: 5892 chars extracted
doc3.pdf: 2156 chars extracted

Error Handling

from vectorize_iris import extract_text_from_file, VectorizeIrisError

try:
    result = extract_text_from_file('document.pdf')
    print(result.text)
except VectorizeIrisError as e:
    print(f"Extraction failed: {e}")

Output:

Extraction failed: File not found: document.pdf

Batch Processing

from vectorize_iris import extract_text_from_file
from pathlib import Path
import glob

for pdf_file in glob.glob('documents/*.pdf'):
    print(f"Processing {pdf_file}...")
    result = extract_text_from_file(pdf_file)

    # with_suffix swaps only the final extension, unlike str.replace,
    # which would also rewrite any '.pdf' elsewhere in the path
    output_file = Path(pdf_file).with_suffix('.txt')
    output_file.write_text(result.text)

    print(f"  ✓ Saved to {output_file}")

Output:

Processing documents/report-q1.pdf...
  ✓ Saved to documents/report-q1.txt
Processing documents/report-q2.pdf...
  ✓ Saved to documents/report-q2.txt
Processing documents/report-q3.pdf...
  ✓ Saved to documents/report-q3.txt

API Reference

extract_text_from_file(file_path, options=None)

Extract text from a file.

Parameters:

  • file_path (str): Path to the file
  • options (ExtractionOptions, optional): Extraction options

Returns: ExtractionResultData with:

  • success (bool): Whether extraction succeeded
  • text (str): Extracted text
  • chunks (list[str], optional): Text chunks if chunking enabled
  • metadata (str, optional): JSON metadata if requested
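Since `chunks` and `metadata` are optional, it is worth guarding access to them. A rough stand-in mirroring the documented fields (this is an illustration, not the SDK's actual class):

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative mirror of the documented ExtractionResultData fields.
@dataclass
class ExtractionResultData:
    success: bool
    text: str
    chunks: Optional[list] = None   # present only if chunking was enabled
    metadata: Optional[str] = None  # present only if metadata was requested

result = ExtractionResultData(success=True, text="Hello, world")
if result.success and result.chunks is not None:
    print(f"{len(result.chunks)} chunks")
```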

extract_text(file_bytes, file_name, options=None)

Extract text from bytes.

Parameters:

  • file_bytes (bytes): File content
  • file_name (str): File name
  • options (ExtractionOptions, optional): Extraction options

Returns: ExtractionResultData

Async versions

  • extract_text_from_file_async() - Async version of extract_text_from_file
  • extract_text_async() - Async version of extract_text

ExtractionOptions

ExtractionOptions(
    chunk_size=512,               # default: 256
    parsing_instructions='...',   # custom instructions
    infer_metadata_schema=True,   # auto-detect metadata
    api_token='...',              # override env var
    org_id='...',                 # override env var
    poll_interval=2,              # seconds between checks
    timeout=300                   # max seconds to wait
)

📚 Full Documentation | 🏠 Back to Main README
