A Python library for extracting data from any document using AI

This project has moved to:

https://github.com/NanoNets/docstrange

Try the live demo:

https://docstrange.nanonets.com

Nanonets Document Extractor

A Python library for extracting data from any document using AI.

🚀 Try it instantly! Visit extraction-api.nanonets.com to access our hosted document extractors with a user-friendly interface.

Quick Start

Installation

pip install nanonets-extractor

Basic Usage

from nanonets_extractor import DocumentExtractor

# Initialize extractor
extractor = DocumentExtractor()

# Extract data from any document
result = extractor.extract(
    file_path="invoice.pdf",
    output_type="flat-json"
)

print(result)

Initialization Parameters

DocumentExtractor()

Parameter	Type	Required	Description
`api_key`	str	No	API key for unlimited access; the rate-limited free tier is used if omitted
`model`	str	No	AI model to use: `"gemini"` or `"openai"`

Examples

# Free tier (with rate limits)
extractor = DocumentExtractor()

# Unlimited access with API key
extractor = DocumentExtractor(api_key="your_api_key")

# Specify a particular model with API key
extractor = DocumentExtractor(api_key="your_api_key", model="openai")

💡 Getting Your API Key: If you hit rate limits, get your FREE API key from https://app.nanonets.com/#/keys for unlimited access.
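One way to keep the key out of source code is to read it from an environment variable and pass `api_key` only when it is set. The following is a minimal sketch: the variable name `NANONETS_API_KEY` and the helper `extractor_kwargs` are assumptions made for illustration and are not part of the library.

```python
import os


def extractor_kwargs(env=None, model=None):
    """Build keyword arguments for DocumentExtractor.

    NANONETS_API_KEY is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    kwargs = {}

    # Only pass api_key when a non-empty value is present, so the
    # extractor falls back to the free tier otherwise.
    api_key = env.get("NANONETS_API_KEY", "").strip()
    if api_key:
        kwargs["api_key"] = api_key

    # The parameter table lists "gemini" and "openai" as model values.
    if model is not None:
        if model not in ("gemini", "openai"):
            raise ValueError(f"Unsupported model: {model!r}")
        kwargs["model"] = model
    return kwargs


# Usage (requires the package to be installed):
# from nanonets_extractor import DocumentExtractor
# extractor = DocumentExtractor(**extractor_kwargs(model="openai"))
```

Building the arguments as a dictionary keeps the free-tier and keyed cases on one code path.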

Extract Method

extractor.extract()

Parameter	Type	Required	Description
`file_path`	str	Yes	Path to your document
`output_type`	str	No	Output format (default: "flat-json")
`specified_fields`	list	No	Extract only specific fields
`json_schema`	dict	No	Custom JSON schema for output

Output Types

Type	Description	Parameters Required
`"markdown"`	Clean markdown formatting	None
`"html"`	Semantic HTML structure	None
`"fields"`	Auto-detected key-value pairs	None
`"tables"`	Structured table data	None
`"csv"`	Tabular data in CSV format	None
`"flat-json"`	Flat key-value JSON	None
`"specified-fields"`	Custom field extraction	`specified_fields`
`"specified-json"`	Custom schema extraction	`json_schema`
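The pairing of output types and required parameters in the table above can be checked before a request is made. This is a sketch only: `REQUIRED_PARAM` and `build_extract_args` are hypothetical names, and the library may perform its own validation.

```python
# Extra parameter each output type needs, copied from the Output Types table.
REQUIRED_PARAM = {
    "markdown": None,
    "html": None,
    "fields": None,
    "tables": None,
    "csv": None,
    "flat-json": None,
    "specified-fields": "specified_fields",
    "specified-json": "json_schema",
}


def build_extract_args(file_path, output_type="flat-json", **extra):
    """Return keyword arguments for extract(), failing early on a mismatch."""
    if output_type not in REQUIRED_PARAM:
        raise ValueError(f"Unknown output_type: {output_type!r}")

    required = REQUIRED_PARAM[output_type]
    if required is not None and not extra.get(required):
        raise ValueError(f"output_type {output_type!r} requires {required!r}")

    return {"file_path": file_path, "output_type": output_type, **extra}


# Usage (requires the package to be installed):
# args = build_extract_args("invoice.pdf", "specified-fields",
#                           specified_fields=["invoice_number", "total"])
# result = extractor.extract(**args)
```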

Supported Document Types

Works with any document type:

📄 PDFs - Invoices, contracts, reports
🖼️ Images - Screenshots, photos, scans
📊 Spreadsheets - Excel, CSV files
📝 Text Documents - Word docs, text files
🆔 ID Documents - Passports, licenses, certificates
🧾 Receipts - Any receipt or bill

Examples

Basic Extraction

from nanonets_extractor import DocumentExtractor

extractor = DocumentExtractor()

# Extract all data as key-value pairs
result = extractor.extract("document.pdf", output_type="fields")
print(result)

Different Output Formats

# Get clean markdown formatting
result = extractor.extract("document.pdf", output_type="markdown")
print(result)

# Get semantic HTML structure
result = extractor.extract("document.pdf", output_type="html")
print(result)

# Extract structured table data
result = extractor.extract("document.pdf", output_type="tables")
print(result)

# Get CSV format for tabular data
result = extractor.extract("document.pdf", output_type="csv")
print(result)

Extract Specific Fields

# Extract only specific fields
result = extractor.extract(
    file_path="invoice.pdf",
    output_type="specified-fields",
    specified_fields=["invoice_number", "total", "customer_name"]
)

Custom JSON Schema

# Use custom schema
schema = {
    "invoice_number": "string",
    "line_items": [
        {
            "description": "string",
            "amount": "number"
        }
    ]
}

result = extractor.extract(
    file_path="invoice.pdf",
    output_type="specified-json",
    json_schema=schema
)
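After a schema-based extraction it can be useful to confirm that the result contains every key the schema asked for. The sketch below assumes the result is a dictionary (or parsed JSON) that mirrors the schema's nesting, which the README does not state. `missing_keys` is a hypothetical helper, not a library function.

```python
def missing_keys(schema, data, path=""):
    """Return the paths that appear in schema but are absent from data."""
    missing = []
    if isinstance(schema, dict):
        if not isinstance(data, dict):
            return [path or "<root>"]
        for key, sub_schema in schema.items():
            child = f"{path}.{key}" if path else key
            if key not in data:
                missing.append(child)
            else:
                missing.extend(missing_keys(sub_schema, data[key], child))
    elif isinstance(schema, list) and schema:
        # A one-element list in the schema describes every item of the list.
        if not isinstance(data, list):
            return [path or "<root>"]
        for index, item in enumerate(data):
            missing.extend(missing_keys(schema[0], item, f"{path}[{index}]"))
    # Leaf descriptors such as "string" or "number" are not type-checked here.
    return missing


# Usage with the schema and result from the example above:
# problems = missing_keys(schema, result)
# if problems:
#     print("Missing from result:", problems)
```

The check reports missing keys only; validating the value types would need a fuller schema validator.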

Batch Processing

# Process multiple files
files = ["doc1.pdf", "doc2.jpg", "doc3.docx"]
results = extractor.extract_batch(
    file_paths=files,
    output_type="fields"
)

for file_path, result in results.items():
    print(f"{file_path}: {result}")
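Batch results are often written to disk, one file per input. The sketch below assumes, as the loop above suggests, that `results` maps each input path to its extraction result, and additionally assumes each result can be serialized as JSON. `save_results` is a hypothetical helper.

```python
import json
from pathlib import Path


def save_results(results, out_dir):
    """Write each result to <out_dir>/<input file name>.json and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    written = []
    for file_path, result in results.items():
        # Keep the original extension in the name so that doc1.pdf and
        # doc1.jpg do not overwrite each other.
        target = out / (Path(file_path).name + ".json")
        # default=str is a fallback for values json cannot encode natively.
        text = json.dumps(result, indent=2, ensure_ascii=False, default=str)
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written


# Usage with the batch example above:
# save_results(results, "results")
```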

Command Line Interface

# Free tier (with rate limits)
nanonets-extractor document.pdf

# With API key for unlimited access
nanonets-extractor document.pdf --api-key your_api_key

# Specify output format with API key
nanonets-extractor document.pdf --api-key your_api_key --output-type markdown

# Extract specific fields
nanonets-extractor invoice.pdf --output-type specified-fields --fields "invoice_number,total,date"

# Save to file
nanonets-extractor document.pdf --output result.json

# Process multiple files
nanonets-extractor *.pdf --output-dir results/

Error Handling

from nanonets_extractor import DocumentExtractor
from nanonets_extractor.exceptions import ExtractionError, UnsupportedFileError

extractor = DocumentExtractor()

try:
    result = extractor.extract("document.pdf")
    print(result)
except UnsupportedFileError as e:
    print(f"File type not supported: {e}")
except ExtractionError as e:
    print(f"Extraction failed: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
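Because the free tier is rate limited, a caller may want to retry a failed extraction after a pause. The README does not say which exceptions are transient, so the sketch below takes the exception types as a parameter. `with_retries` is a hypothetical helper, and retrying on `ExtractionError` is an assumption.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0,
                 retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying with exponential backoff on the given exceptions."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error.
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))


# Usage (assumes ExtractionError is worth retrying, which is not documented):
# result = with_retries(
#     lambda: extractor.extract("document.pdf"),
#     retry_on=(ExtractionError,),
# )
```

Passing `sleep` as a parameter makes the backoff easy to test without real delays.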

Supported File Formats

PDFs: .pdf
Images: .png, .jpg, .jpeg, .tiff, .bmp, .gif
Documents: .docx, .doc
Spreadsheets: .xlsx, .xls, .csv
Text: .txt, .rtf
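The extension list above can be used to screen files locally before sending them for extraction. This is a minimal sketch: `SUPPORTED_EXTENSIONS` is copied from the list, `split_supported` is a hypothetical helper, and the library's own checks may differ.

```python
from pathlib import Path

# Extensions copied from the Supported File Formats list.
SUPPORTED_EXTENSIONS = {
    ".pdf",
    ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".gif",
    ".docx", ".doc",
    ".xlsx", ".xls", ".csv",
    ".txt", ".rtf",
}


def split_supported(paths):
    """Split paths into (supported, skipped) by file extension, ignoring case."""
    supported, skipped = [], []
    for path in paths:
        if Path(path).suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            skipped.append(path)
    return supported, skipped


# Usage with the batch example:
# files, skipped = split_supported(["doc1.pdf", "clip.mp4", "doc3.docx"])
# results = extractor.extract_batch(file_paths=files, output_type="fields")
```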

License

This project is licensed under the MIT License - see the LICENSE file for details.

Release history

0.2.2: Aug 25, 2025 (this version)
0.1.6: Jul 23, 2025
0.1.4: Jul 23, 2025
0.1.3: Jul 23, 2025
0.1.1: Jul 23, 2025

Download files

Source distribution: nanonets_extractor-0.2.2.tar.gz (25.1 kB), uploaded Aug 25, 2025 via twine/6.1.0 CPython/3.12.11, without Trusted Publishing.

Algorithm	Hash digest
SHA256	`2adbb35e4548fb9df28c5bab8612040bdb0a05979c66a9b7473cb2ed7dc2e106`
MD5	`a86319b9b45c7e6de300d858e0970af5`
BLAKE2b-256	`8116680f48d4ebb151f097b2c13fab812ca0b17ff4a02b0f8bb3c211ebcbe4a8`

Built distribution: nanonets_extractor-0.2.2-py3-none-any.whl (14.4 kB, Python 3), uploaded Aug 25, 2025 via twine/6.1.0 CPython/3.12.11, without Trusted Publishing.

Algorithm	Hash digest
SHA256	`7776abb533e7b8a13dd1e77846cd4c43ad660263b69807c891c5b62276b2c274`
MD5	`48ccd3711a1501b37ae978ff855b3125`
BLAKE2b-256	`765432b9fe34a0cfd5f42b652df0f717631c3aa184b34a72c77d7a91744e9712`
