Simple text extraction from files using Vectorize Iris

Python API - Examples

Simple Python library for extracting text from documents using Vectorize Iris.

Installation

pip install vectorize-iris

Set your credentials:

export VECTORIZE_API_TOKEN="your-token"
export VECTORIZE_ORG_ID="your-org-id"
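
Before the first extraction call, it can help to confirm that both variables are visible to the Python process. The following is a minimal sketch using only the standard library; the helper name `missing_credentials` is hypothetical and is not part of vectorize-iris.

```python
import os

# Variable names taken from the export lines above
REQUIRED_VARS = ("VECTORIZE_API_TOKEN", "VECTORIZE_ORG_ID")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Missing credentials:", ", ".join(missing))
    else:
        print("Credentials found")
```

Passing a plain dict instead of `os.environ` makes the check easy to exercise in tests.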

Basic Text Extraction

from vectorize_iris import extract_text_from_file

result = extract_text_from_file('document.pdf')
print(result.text)

Output:

This is the extracted text from your PDF document.
All formatting and structure is preserved.

Tables, lists, and other elements are properly extracted.

Extract from Bytes

from vectorize_iris import extract_text

with open('document.pdf', 'rb') as f:
    file_bytes = f.read()

result = extract_text(file_bytes, 'document.pdf')
print(f"Extracted {len(result.text)} characters")

Output:

Extracted 5536 characters

Chunking for RAG

from vectorize_iris import extract_text_from_file, ExtractionOptions

result = extract_text_from_file(
    'long-document.pdf',
    options=ExtractionOptions(
        chunk_size=512
    )
)

for i, chunk in enumerate(result.chunks):
    print(f"Chunk {i+1}: {chunk[:100]}...")

Output:

Chunk 1: # Introduction
This document covers the basics of machine learning...

Chunk 2: ## Neural Networks
Neural networks are computational models inspired by...

Chunk 3: ### Training Process
The training process involves adjusting weights...
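
For a RAG pipeline, each chunk usually needs an identifier and a pointer back to its source before it is embedded and indexed. The sketch below shows one way to shape `result.chunks` into such records. The function `chunks_to_records` and the record layout are assumptions made for illustration, not something the library provides, and the sample list stands in for real extraction output so the code runs offline.

```python
import hashlib


def chunks_to_records(chunks, source):
    """Pair each chunk with a stable ID and its position in the document."""
    records = []
    for index, chunk in enumerate(chunks):
        # A short content hash keeps the ID stable across re-runs on the same text
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()[:12]
        records.append({
            "id": f"{source}#{index}-{digest}",
            "source": source,
            "index": index,
            "text": chunk,
        })
    return records


# Stand-in data; in practice this would be result.chunks
sample_chunks = [
    "# Introduction\nThis document covers the basics...",
    "## Neural Networks\nNeural networks are computational models...",
]
records = chunks_to_records(sample_chunks, "long-document.pdf")
print(len(records), records[0]["id"])
```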

Custom Parsing Instructions

from vectorize_iris import extract_text_from_file, ExtractionOptions

result = extract_text_from_file(
    'report.pdf',
    options=ExtractionOptions(
        parsing_instructions='Extract only tables and numerical data, ignore narrative text'
    )
)

print(result.text)

Output:

Q1 2024 Revenue: $1,250,000
Q2 2024 Revenue: $1,450,000
Q3 2024 Revenue: $1,680,000

Region    | Sales  | Growth
----------|--------|-------
North     | $500K  | +12%
South     | $380K  | +8%
East      | $420K  | +15%
West      | $380K  | +10%

Inferred Metadata Schema

import json

from vectorize_iris import extract_text_from_file, ExtractionOptions

result = extract_text_from_file(
    'invoice.pdf',
    options=ExtractionOptions(
        infer_metadata_schema=True
    )
)

# metadata is returned as a JSON string, so parse it before use
metadata = json.loads(result.metadata)
print(json.dumps(metadata, indent=2))

Output:

{
  "document_type": "invoice",
  "invoice_number": "INV-2024-001",
  "date": "2024-01-15",
  "total_amount": 1250.0,
  "currency": "USD",
  "vendor": "Acme Corp"
}
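
The API reference lists `metadata` as an optional string, so calling `json.loads` on it directly can fail when no metadata comes back. A defensive parse might look like the sketch below; `parse_metadata` is a hypothetical helper, and returning an empty dict on bad input is a design choice of this example, not library behavior.

```python
import json


def parse_metadata(raw):
    """Return metadata as a dict, or {} when it is missing or not a JSON object."""
    if not raw:
        return {}
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return {}
    # Only a JSON object maps cleanly onto a dict of fields
    return parsed if isinstance(parsed, dict) else {}


print(parse_metadata('{"document_type": "invoice"}'))
print(parse_metadata(None))
```

With this in place, `parse_metadata(result.metadata).get("invoice_number")` yields `None` instead of raising when the field or the whole metadata string is absent.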

Async API

import asyncio
from vectorize_iris import extract_text_from_file_async

async def extract_multiple():
    files = ['doc1.pdf', 'doc2.pdf', 'doc3.pdf']

    tasks = [extract_text_from_file_async(f) for f in files]
    results = await asyncio.gather(*tasks)

    for file, result in zip(files, results):
        print(f"{file}: {len(result.text)} chars extracted")

asyncio.run(extract_multiple())

Output:

doc1.pdf: 3421 chars extracted
doc2.pdf: 5892 chars extracted
doc3.pdf: 2156 chars extracted
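
`asyncio.gather` starts every task at once, which may be more parallel requests than is wanted for a large file list. One common way to cap concurrency is an `asyncio.Semaphore`, sketched below. The wrapper `extract_all` and its `max_concurrent` parameter are assumptions of this example; the extract coroutine is passed in, and `fake_extract` stands in for `extract_text_from_file_async` so the sketch runs without credentials.

```python
import asyncio


async def extract_all(files, extract, max_concurrent=3):
    """Run extract(path) for every path, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(path):
        async with semaphore:
            return await extract(path)

    # gather returns results in the same order as the input list
    return await asyncio.gather(*(run_one(path) for path in files))


async def fake_extract(path):
    # Stand-in for extract_text_from_file_async
    await asyncio.sleep(0.01)
    return f"text of {path}"


results = asyncio.run(
    extract_all(["doc1.pdf", "doc2.pdf", "doc3.pdf"], fake_extract, max_concurrent=2)
)
print(results)
```

In real use, `extract_text_from_file_async` would be passed as the `extract` argument in place of `fake_extract`.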

Error Handling

from vectorize_iris import extract_text_from_file, VectorizeIrisError

try:
    result = extract_text_from_file('document.pdf')
    print(result.text)
except VectorizeIrisError as e:
    print(f"Extraction failed: {e}")

Output:

Extraction failed: File not found: document.pdf
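
Some failures, such as a network hiccup, may succeed on a second attempt, while others, such as a missing file, will not. The source does not say which errors are transient, so the following retry wrapper is only a sketch: `with_retries` is a hypothetical helper, and which exception types to retry on is left to the caller.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0,
                 retry_on=(Exception,), sleep=time.sleep):
    """Call func(); on a listed exception, wait and retry with a doubling delay."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            # Re-raise once the final attempt has failed
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

A call might then read `with_retries(lambda: extract_text_from_file('document.pdf'), retry_on=(VectorizeIrisError,))`, using the names from the example above.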

Batch Processing

from pathlib import Path

from vectorize_iris import extract_text_from_file

for pdf_file in sorted(Path('documents').glob('*.pdf')):
    print(f"Processing {pdf_file}...")
    result = extract_text_from_file(str(pdf_file))

    # with_suffix swaps only the final extension, even if '.pdf' appears elsewhere in the path
    output_file = pdf_file.with_suffix('.txt')
    output_file.write_text(result.text, encoding='utf-8')

    print(f"  ✓ Saved to {output_file}")

Output:

Processing documents/report-q1.pdf...
  ✓ Saved to documents/report-q1.txt
Processing documents/report-q2.pdf...
  ✓ Saved to documents/report-q2.txt
Processing documents/report-q3.pdf...
  ✓ Saved to documents/report-q3.txt

API Reference

extract_text_from_file(file_path, options=None)

Extract text from a file.

Parameters:

  • file_path (str): Path to the file
  • options (ExtractionOptions, optional): Extraction options

Returns: ExtractionResultData with:

  • success (bool): Whether extraction succeeded
  • text (str): Extracted text
  • chunks (list[str], optional): Text chunks if chunking enabled
  • metadata (str, optional): JSON metadata if requested

extract_text(file_bytes, file_name, options=None)

Extract text from bytes.

Parameters:

  • file_bytes (bytes): File content
  • file_name (str): File name
  • options (ExtractionOptions, optional): Extraction options

Returns: ExtractionResultData

Async versions

  • extract_text_from_file_async() - Async version of extract_text_from_file
  • extract_text_async() - Async version of extract_text

ExtractionOptions

ExtractionOptions(
    chunk_size=512,                # default: 256
    parsing_instructions='...',    # custom instructions
    infer_metadata_schema=True,    # auto-detect metadata
    api_token='...',               # override env var
    org_id='...',                  # override env var
    poll_interval=2,               # seconds between checks
    timeout=300                    # max seconds to wait
)
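
The comments on `poll_interval` and `timeout` suggest that extraction is checked repeatedly until it finishes or the time limit is reached. The sketch below illustrates how two such settings typically interact. It is not the library's implementation; `poll_until_done`, its `check` callback, and the injectable `clock` and `sleep` are all assumptions made for this example.

```python
import time


def poll_until_done(check, poll_interval=2, timeout=300,
                    clock=time.monotonic, sleep=time.sleep):
    """Call check() until it returns a non-None value or timeout seconds pass."""
    deadline = clock() + timeout
    while True:
        outcome = check()
        if outcome is not None:
            return outcome
        # Give up if the next wait would run past the deadline
        if clock() + poll_interval > deadline:
            raise TimeoutError(f"no result after {timeout} seconds")
        sleep(poll_interval)
```

Under this model, a smaller `poll_interval` notices completion sooner at the cost of more status checks, and `timeout` bounds the total wait.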
