langchain-xparse

LangChain integration for the xParse Parse API, an intelligent document parsing service. It converts unstructured documents (PDF, images, Word, Excel, PPT, etc.) into AI-friendly structured data (JSON, Markdown) with rich metadata.

Installation

From PyPI:

pip install langchain-xparse

Configuration

Set your TextIn credentials (from the TextIn Workspace) as environment variables:

export XPARSE_APP_ID="your-app-id"
export XPARSE_SECRET_CODE="your-secret-code"

Or pass them when creating the loader:

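from langchain_xparse import XParseLoader

# Credentials passed explicitly instead of via environment variables
loader = XParseLoader(
    file_path="doc.pdf",
    app_id="your-app-id",
    secret_code="your-secret-code",
)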
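
Whichever route you choose, it can help to confirm that the credentials are present before the first request. The following is a minimal sketch that uses only the standard library; the helper name missing_credentials is hypothetical and is not part of langchain-xparse. Only the two variable names come from this README.

```python
import os

# Variable names taken from the Configuration section of this README.
REQUIRED_VARS = ("XPARSE_APP_ID", "XPARSE_SECRET_CODE")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper for illustration; not part of langchain-xparse.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credentials()
if missing:
    print("Missing credentials:", ", ".join(missing))
else:
    print("Both credentials are set.")
```

Passing a plain dict as env makes the check easy to exercise in tests without touching the real environment.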

Usage

Basic Usage

from langchain_xparse import XParseLoader

loader = XParseLoader(file_path="example.pdf")
docs = loader.load()
print(docs[0].page_content[:200])
print(docs[0].metadata)  # source, category, element_id, filename, page_number

Lazy Load

for doc in loader.lazy_load():
    # process each document
    print(doc.page_content[:100])

Async Load

# "async for" must run inside a coroutine (for example, one started with asyncio.run)
async for doc in loader.alazy_load():
    # process each document asynchronously
    print(doc.page_content[:100])
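
Because async for is only valid inside a coroutine, the loop above needs a driver. The sketch below shows one way to do that with asyncio.run. StubLoader is a hypothetical stand-in that only mimics the alazy_load() shape so the example runs without credentials or network access; in real use you would pass an XParseLoader instead.

```python
import asyncio
from types import SimpleNamespace


class StubLoader:
    """Hypothetical stand-in that mimics the alazy_load() shape of a loader."""

    def __init__(self, texts):
        self._texts = texts

    async def alazy_load(self):
        for text in self._texts:
            await asyncio.sleep(0)  # yield control, as real I/O would
            yield SimpleNamespace(page_content=text, metadata={})


async def collect_previews(loader, width=100):
    """Consume an async document stream and return truncated previews."""
    previews = []
    async for doc in loader.alazy_load():
        previews.append(doc.page_content[:width])
    return previews


previews = asyncio.run(
    collect_previews(StubLoader(["first page", "second page"]), width=5)
)
print(previews)
```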

Custom Parse Configuration

Customize parsing behavior using the config parameter. See Parse Config Documentation for details.

loader = XParseLoader(
    file_path="doc.pdf",
    config={
        "document": {
            "password": "pdf-password"  # For encrypted PDFs
        },
        "capabilities": {
            "include_hierarchy": True,         # Include parent-child relationships
            "include_inline_objects": True,    # Extract formulas, handwriting, etc.
            "include_table_structure": True,   # Detailed table structure
            "include_char_details": True,      # Character-level details
            "include_image_data": True,        # Image URLs and data
            "pages": True,                     # Page metadata
            "title_tree": True,                # Document outline/TOC
            "table_view": "html"               # Table format: "html" or "markdown"
        },
        "scope": {
            "page_range": "1-10"               # Process specific pages
        },
        "config": {
            "force_engine": "textin",          # Engine selection (expert mode)
            "engine_params": {
                "formula_level": 0,
                "image_output_type": "url"
            }
        }
    }
)
docs = loader.load()
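
When only a few options change between runs, the nested dict can be assembled programmatically. This is a sketch under assumptions: build_parse_config is a hypothetical helper, it uses only keys that appear in the example above, and it assumes the service accepts a config that omits the other sections. Check the Parse Config Documentation before relying on that.

```python
def build_parse_config(page_range=None, table_view="html", password=None):
    """Assemble a config dict from keys shown in this README.

    Hypothetical helper for illustration; not part of langchain-xparse.
    """
    # The README lists "html" and "markdown" as the two table formats.
    if table_view not in ("html", "markdown"):
        raise ValueError(
            f"table_view must be 'html' or 'markdown', got {table_view!r}"
        )
    config = {"capabilities": {"table_view": table_view}}
    if page_range is not None:
        config["scope"] = {"page_range": page_range}
    if password is not None:
        config["document"] = {"password": password}
    return config


config = build_parse_config(page_range="1-10", table_view="markdown")
print(config)
```

The resulting dict can then be passed as the config argument shown above.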

Multiple Files

loader = XParseLoader(file_path=["a.pdf", "b.pdf", "c.docx"])
for doc in loader.lazy_load():
    print(f"{doc.metadata.get('source')}: {doc.page_content[:50]}")
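
Since documents from all files arrive in one stream, grouping them by their source metadata is a common next step. The sketch below uses SimpleNamespace objects as stand-ins for loaded documents so that it runs without the service; group_by_source is a hypothetical helper, and the "unknown" fallback key is an assumption for documents that lack a source.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_source(docs):
    """Group document-like objects by their 'source' metadata value."""
    grouped = defaultdict(list)
    for doc in docs:
        grouped[doc.metadata.get("source", "unknown")].append(doc)
    return dict(grouped)


# Stand-in documents; real ones would come from loader.lazy_load().
docs = [
    SimpleNamespace(page_content="Intro", metadata={"source": "a.pdf"}),
    SimpleNamespace(page_content="Summary", metadata={"source": "b.pdf"}),
    SimpleNamespace(page_content="Details", metadata={"source": "a.pdf"}),
]

grouped = group_by_source(docs)
for source, items in grouped.items():
    print(source, len(items))
```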

File-like Object

When passing a file-like object instead of a path, you must set metadata_filename:

with open("doc.pdf", "rb") as f:
    loader = XParseLoader(file=f, metadata_filename="doc.pdf")
    docs = loader.load()

Document Metadata

Each loaded document includes rich metadata:

  • source: File path or filename
  • category: Element type (Title, NarrativeText, Table, Image, Formula, etc.)
  • element_id: Unique element identifier
  • filename: Original filename
  • page_number: Page number (if available)
  • parent_id: Parent element ID (with include_hierarchy)
  • children_ids: Child element IDs (with include_hierarchy)
  • Additional element-specific metadata
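
The fields listed above can be used to filter elements and to rebuild the hierarchy. The following sketch works on plain metadata dicts with made-up sample values. It assumes that parent_id holds the element_id of the parent and is absent for top-level elements; neither the value formats nor the two helper names come from the library.

```python
def select_category(metadatas, category):
    """Return the metadata dicts whose 'category' matches exactly."""
    return [meta for meta in metadatas if meta.get("category") == category]


def index_children(metadatas):
    """Map each parent_id to the element_ids that name it as their parent."""
    children = {}
    for meta in metadatas:
        parent = meta.get("parent_id")
        if parent is not None:
            children.setdefault(parent, []).append(meta["element_id"])
    return children


# Made-up sample metadata; real values come from doc.metadata.
metadatas = [
    {"element_id": "e1", "category": "Title", "page_number": 1},
    {"element_id": "e2", "category": "NarrativeText", "parent_id": "e1"},
    {"element_id": "e3", "category": "Table", "parent_id": "e1"},
]

tables = select_category(metadatas, "Table")
children = index_children(metadatas)
print([meta["element_id"] for meta in tables])
print(children)
```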
