A Python library for extracting data from any document using AI

This project has moved to

https://github.com/NanoNets/docstrange

Try the live demo

https://docstrange.nanonets.com

Nanonets Document Extractor

🚀 Try it instantly! Visit extraction-api.nanonets.com to access our hosted document extractors with a user-friendly interface.

Quick Start

Installation

pip install nanonets-extractor

Basic Usage

from nanonets_extractor import DocumentExtractor

# Initialize extractor
extractor = DocumentExtractor()

# Extract data from any document
result = extractor.extract(
    file_path="invoice.pdf",
    output_type="flat-json"
)

print(result)

Initialization Parameters

DocumentExtractor()

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| api_key | str | No | API key for unlimited access (optional - uses free tier if not provided) |
| model | str | No | AI model: "gemini" or "openai" (optional) |

Examples

# Free tier (with rate limits)
extractor = DocumentExtractor()

# Unlimited access with API key
extractor = DocumentExtractor(api_key="your_api_key")

# Specify a particular model with API key
extractor = DocumentExtractor(api_key="your_api_key", model="openai")

💡 Getting Your API Key: If you hit rate limits, get your FREE API key from https://app.nanonets.com/#/keys for unlimited access.
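To keep the key out of source code, the constructor arguments can be assembled from the environment. The following is a minimal sketch, not part of the library: the variable names `NANONETS_API_KEY` and `NANONETS_MODEL` and the helper `extractor_kwargs` are assumptions made for illustration. Only the `api_key` and `model` parameters and the two model names come from the table above.

```python
import os


def extractor_kwargs(env=None):
    """Build keyword arguments for DocumentExtractor from environment variables.

    NANONETS_API_KEY and NANONETS_MODEL are hypothetical names chosen for this
    sketch; the library itself is not documented as reading them.
    """
    env = os.environ if env is None else env
    kwargs = {}

    api_key = env.get("NANONETS_API_KEY")
    if api_key:
        kwargs["api_key"] = api_key

    model = env.get("NANONETS_MODEL")
    if model:
        # The README lists exactly these two model names.
        if model not in ("gemini", "openai"):
            raise ValueError(f"unknown model: {model!r}")
        kwargs["model"] = model

    return kwargs


# Usage (requires the package to be installed):
# from nanonets_extractor import DocumentExtractor
# extractor = DocumentExtractor(**extractor_kwargs())
```

With neither variable set, the helper returns an empty dict, which corresponds to the free-tier call `DocumentExtractor()`.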

Extract Method

extractor.extract()

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| file_path | str | Yes | Path to your document |
| output_type | str | No | Output format (default: "flat-json") |
| specified_fields | list | No | Extract only specific fields |
| json_schema | dict | No | Custom JSON schema for output |

Output Types

| Type | Description | Required parameters |
|------|-------------|---------------------|
| "markdown" | Clean markdown formatting | None |
| "html" | Semantic HTML structure | None |
| "fields" | Auto-detected key-value pairs | None |
| "tables" | Structured table data | None |
| "csv" | Tabular data in CSV format | None |
| "flat-json" | Flat key-value JSON | None |
| "specified-fields" | Custom field extraction | specified_fields |
| "specified-json" | Custom schema extraction | json_schema |
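The pairing between output types and their required parameters can be checked locally before a document is sent. This is a sketch under assumptions: `check_extract_args` and `REQUIRED_PARAM` are hypothetical names, and how the library itself reacts to a missing parameter is not documented here. The mapping simply mirrors the table above.

```python
# Output types from the table above, mapped to the parameter each one requires.
REQUIRED_PARAM = {
    "markdown": None,
    "html": None,
    "fields": None,
    "tables": None,
    "csv": None,
    "flat-json": None,
    "specified-fields": "specified_fields",
    "specified-json": "json_schema",
}


def check_extract_args(output_type="flat-json", specified_fields=None, json_schema=None):
    """Validate the combination and return keyword arguments for extract()."""
    if output_type not in REQUIRED_PARAM:
        raise ValueError(f"unknown output_type: {output_type!r}")

    supplied = {"specified_fields": specified_fields, "json_schema": json_schema}
    needed = REQUIRED_PARAM[output_type]
    if needed is not None and not supplied[needed]:
        raise ValueError(f"output_type {output_type!r} requires {needed}")

    # Pass through only the optional arguments that were actually given.
    kwargs = {"output_type": output_type}
    kwargs.update({name: value for name, value in supplied.items() if value is not None})
    return kwargs


# Usage (requires the package to be installed):
# kwargs = check_extract_args("specified-fields", specified_fields=["total"])
# result = extractor.extract(file_path="invoice.pdf", **kwargs)
```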

Supported Document Types

Works with any document type:

  • 📄 PDFs - Invoices, contracts, reports
  • 🖼️ Images - Screenshots, photos, scans
  • 📊 Spreadsheets - Excel, CSV files
  • 📝 Text Documents - Word docs, text files
  • 🆔 ID Documents - Passports, licenses, certificates
  • 🧾 Receipts - Any receipt or bill

Examples

Basic Extraction

from nanonets_extractor import DocumentExtractor

extractor = DocumentExtractor()

# Extract all data as key-value pairs
result = extractor.extract("document.pdf", output_type="fields")
print(result)

Different Output Formats

# Get clean markdown formatting
result = extractor.extract("document.pdf", output_type="markdown")
print(result)

# Get semantic HTML structure
result = extractor.extract("document.pdf", output_type="html")
print(result)

# Extract structured table data
result = extractor.extract("document.pdf", output_type="tables")
print(result)

# Get CSV format for tabular data
result = extractor.extract("document.pdf", output_type="csv")
print(result)

Extract Specific Fields

# Extract only specific fields
result = extractor.extract(
    file_path="invoice.pdf",
    output_type="specified-fields",
    specified_fields=["invoice_number", "total", "customer_name"]
)
print(result)

Custom JSON Schema

# Use custom schema
schema = {
    "invoice_number": "string",
    "line_items": [
        {
            "description": "string",
            "amount": "number"
        }
    ]
}

result = extractor.extract(
    file_path="invoice.pdf",
    output_type="specified-json",
    json_schema=schema
)
print(result)
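Because the output is produced by an AI model, it may be worth checking that a result has the shape the schema asked for. The sketch below assumes the result is a Python dict that mirrors the schema, which this README does not confirm. `matches_schema` is a hypothetical helper, and it only understands the "string" and "number" type names used in the example schema.

```python
def matches_schema(value, schema):
    """Return True if value has the shape described by the shorthand schema."""
    if isinstance(schema, dict):
        # Every key named in the schema must be present and match recursively.
        return isinstance(value, dict) and all(
            key in value and matches_schema(value[key], sub)
            for key, sub in schema.items()
        )
    if isinstance(schema, list):
        if not isinstance(value, list):
            return False
        if not schema:
            return True  # no item template given, any list is accepted
        return all(matches_schema(item, schema[0]) for item in value)
    if schema == "string":
        return isinstance(value, str)
    if schema == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return True  # type names this sketch does not know are not checked


schema = {
    "invoice_number": "string",
    "line_items": [{"description": "string", "amount": "number"}],
}
sample = {
    "invoice_number": "INV-001",
    "line_items": [{"description": "Widget", "amount": 12.5}],
}
print(matches_schema(sample, schema))
```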

Batch Processing

# Process multiple files
files = ["doc1.pdf", "doc2.jpg", "doc3.docx"]
results = extractor.extract_batch(
    file_paths=files,
    output_type="fields"
)

for file_path, result in results.items():
    print(f"{file_path}: {result}")
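How `extract_batch` behaves when one file in the list fails is not described here. If failures need to be tracked per file, one option is to loop over `extract()` directly. In this sketch `extract_each` is a hypothetical helper that accepts any object with an `extract(file_path=..., output_type=...)` method, as shown in the examples above.

```python
def extract_each(extractor, file_paths, output_type="fields"):
    """Call extractor.extract() once per file and collect errors separately.

    Returns (results, errors): two dicts keyed by file path.
    """
    results, errors = {}, {}
    for path in file_paths:
        try:
            results[path] = extractor.extract(file_path=path, output_type=output_type)
        except Exception as exc:
            # Record the failure and continue with the remaining files.
            errors[path] = str(exc)
    return results, errors


# Usage (requires the package to be installed):
# results, errors = extract_each(extractor, ["doc1.pdf", "doc2.jpg", "doc3.docx"])
# for path, message in errors.items():
#     print(f"{path} failed: {message}")
```

This trades any batching the library may perform internally for a clear record of which files succeeded.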

Command Line Interface

# Free tier (with rate limits)
nanonets-extractor document.pdf

# With API key for unlimited access
nanonets-extractor document.pdf --api-key your_api_key

# Specify output format with API key
nanonets-extractor document.pdf --api-key your_api_key --output-type markdown

# Extract specific fields
nanonets-extractor invoice.pdf --output-type specified-fields --fields "invoice_number,total,date"

# Save to file
nanonets-extractor document.pdf --output result.json

# Process multiple files
nanonets-extractor *.pdf --output-dir results/

Error Handling

from nanonets_extractor import DocumentExtractor
from nanonets_extractor.exceptions import ExtractionError, UnsupportedFileError

extractor = DocumentExtractor()

try:
    result = extractor.extract("document.pdf")
    print(result)
except UnsupportedFileError as e:
    print(f"File type not supported: {e}")
except ExtractionError as e:
    print(f"Extraction failed: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
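On the free tier, a call may fail because of rate limits, and retrying after a pause can help. The sketch below is a generic retry with exponential backoff. `extract_with_retry` is a hypothetical helper, and which exceptions count as transient is an assumption left to the caller through `retry_on`, since this README does not say which exception a rate limit raises.

```python
import time


def extract_with_retry(extract, file_path, attempts=3, base_delay=1.0,
                       retry_on=(Exception,), sleep=time.sleep, **kwargs):
    """Call extract(file_path, **kwargs), retrying on the given exceptions.

    The delay doubles after each failed attempt: base_delay, 2*base_delay, ...
    The last failure is re-raised unchanged.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return extract(file_path, **kwargs)
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Usage (requires the package to be installed):
# from nanonets_extractor.exceptions import ExtractionError
# result = extract_with_retry(extractor.extract, "document.pdf",
#                             retry_on=(ExtractionError,), output_type="fields")
```

Retrying makes little sense for an unsupported file type, so `UnsupportedFileError` is best left out of `retry_on`.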

Supported File Formats

  • PDFs: .pdf
  • Images: .png, .jpg, .jpeg, .tiff, .bmp, .gif
  • Documents: .docx, .doc
  • Spreadsheets: .xlsx, .xls, .csv
  • Text: .txt, .rtf
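A list of paths can be filtered against these extensions before any file is sent, which avoids a round trip for files that would be rejected. This is a minimal sketch: `split_by_support` and `SUPPORTED_EXTENSIONS` are hypothetical names, the set is copied from the list above, and the case-insensitive comparison is an assumption.

```python
from pathlib import Path

# Extensions copied from the "Supported File Formats" list.
SUPPORTED_EXTENSIONS = {
    ".pdf",
    ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".gif",
    ".docx", ".doc",
    ".xlsx", ".xls", ".csv",
    ".txt", ".rtf",
}


def split_by_support(paths):
    """Split paths into (supported, unsupported) by file extension."""
    supported, unsupported = [], []
    for path in paths:
        # Compare in lower case so that "SCAN.PDF" is treated like "scan.pdf".
        if Path(path).suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            unsupported.append(path)
    return supported, unsupported


ok, skipped = split_by_support(["invoice.pdf", "SCAN.JPG", "notes.md", "archive.zip"])
print(ok)
print(skipped)
```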

License

This project is licensed under the MIT License - see the LICENSE file for details.
