langchain-xparse

LangChain integration with xParse Pipeline API for document parsing, chunking and embedding. Supports parse / chunk / embed stages only (extract is not supported in this loader).

Installation

From PyPI:

pip install langchain-xparse

Local editable install:

pip install -e .

Configuration

Set your TextIn credentials (from the TextIn Workspace):

export XPARSE_APP_ID="your-app-id"
export XPARSE_SECRET_CODE="your-secret-code"

Or pass them when creating the loader:

from langchain_xparse import XParseLoader

loader = XParseLoader(
    file_path="doc.pdf",
    app_id="your-app-id",
    secret_code="your-secret-code",
)
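
Because credentials can come from either the environment or explicit arguments, it can help to resolve them once and fail early with a clear message. The following is a minimal sketch, assuming the variable names shown above; `resolve_credentials` is a hypothetical helper and not part of langchain-xparse, and how the loader itself prioritizes arguments over the environment is not stated here.

```python
import os


def resolve_credentials(app_id=None, secret_code=None):
    """Hypothetical helper: prefer explicit values, fall back to the environment."""
    app_id = app_id or os.environ.get("XPARSE_APP_ID")
    secret_code = secret_code or os.environ.get("XPARSE_SECRET_CODE")
    # Collect the names of any credentials that are still unset.
    missing = [
        name
        for name, value in (
            ("XPARSE_APP_ID", app_id),
            ("XPARSE_SECRET_CODE", secret_code),
        )
        if not value
    ]
    if missing:
        raise RuntimeError("Missing TextIn credentials: " + ", ".join(missing))
    return {"app_id": app_id, "secret_code": secret_code}
```

The returned dict can then be unpacked into the constructor, for example `XParseLoader(file_path="doc.pdf", **resolve_credentials())`.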

Usage

Basic (parse only)

from langchain_xparse import XParseLoader

loader = XParseLoader(file_path="example.pdf")
docs = loader.load()
print(docs[0].page_content[:200])
print(docs[0].metadata)  # source, category, element_id, filename, page_number, ...
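
Since each returned document carries metadata such as `page_number`, one common follow-up is to regroup the elements by page. The sketch below assumes only that every item has `page_content` and a `metadata` dict, as in the example above; `group_by_page` is a hypothetical helper, not part of the package.

```python
from collections import defaultdict


def group_by_page(docs):
    """Map page_number -> list of page_content strings, in document order."""
    pages = defaultdict(list)
    for doc in docs:
        # Elements without a page_number are bucketed under the key None.
        pages[doc.metadata.get("page_number")].append(doc.page_content)
    return dict(pages)
```

With the loader above, `group_by_page(loader.load())` would give one list of element texts per page.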

Lazy load

for doc in loader.lazy_load():
    print(doc.page_content[:50])  # replace with your own processing
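
Lazy loading pairs naturally with batch processing, for example when documents are written to a store in groups. The helper below is a generic sketch using only the standard library; `batched` is a hypothetical name and works on any iterable, including the iterator returned by `loader.lazy_load()`.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items without materializing the whole input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

A typical use would be `for batch in batched(loader.lazy_load(), 32): ...`, where the loop body handles up to 32 documents at a time.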

Async

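# Must run inside a coroutine (for example one started with asyncio.run)
async for doc in loader.alazy_load():
    print(doc.page_content[:50])  # replace with your own processing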
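
An `async for` loop is only valid inside a coroutine, so a script needs an entry point such as `asyncio.run`. The sketch below shows that pattern with a stand-in async generator so it runs without credentials; `fake_alazy_load` and `collect` are hypothetical names, and in real use `loader.alazy_load()` would take the place of the stand-in.

```python
import asyncio


async def fake_alazy_load():
    """Stand-in for loader.alazy_load(): an async generator of items."""
    for text in ("first", "second"):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield text


async def collect(async_iterable):
    """Drain an async iterator into a list from inside a coroutine."""
    return [item async for item in async_iterable]


results = asyncio.run(collect(fake_alazy_load()))
print(results)
```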

Convenience params (parse + chunk, or parse + chunk + embed)

loader = XParseLoader(
    file_path="doc.pdf",
    parse_provider="textin",
    chunk_strategy="by_title",
    chunk_max_characters=500,
    chunk_overlap=50,
)
# Or with embed:
loader = XParseLoader(
    file_path="doc.pdf",
    parse_provider="textin",
    chunk_strategy="basic",
    chunk_max_characters=1000,
    embed_provider="qwen",
    embed_model_name="text-embedding-v4",
)
docs = loader.load()
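
To build intuition for what a maximum chunk size and an overlap mean in general, the following local sketch splits a string into overlapping character windows. It is only an illustration and an assumption on our part: the `by_title`, `basic`, and `by_page` strategies are applied by the xParse pipeline, and their exact behavior is not described here. `sliding_chunks` is a hypothetical function, not part of the package.

```python
def sliding_chunks(text, max_characters, overlap):
    """Split text into windows of max_characters that share `overlap` characters."""
    if max_characters < 1:
        raise ValueError("max_characters must be at least 1")
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must satisfy 0 <= overlap < max_characters")
    step = max_characters - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks
```

Each window starts `max_characters - overlap` characters after the previous one, so neighboring chunks repeat the last `overlap` characters.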

Custom stages (advanced)

loader = XParseLoader(
    file_path="doc.pdf",
    stages=[
        {"type": "parse", "config": {"provider": "textin"}},
        {"type": "chunk", "config": {"strategy": "by_page", "max_characters": 800}},
    ],
)
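
Since this loader supports only the parse, chunk, and embed stages, a stage list can be checked before it is passed in. The sketch below builds the same structure as the example above and rejects other stage types; `build_stages` and `SUPPORTED_STAGE_TYPES` are hypothetical names, and the loader may perform its own validation.

```python
SUPPORTED_STAGE_TYPES = ("parse", "chunk", "embed")


def build_stages(*stages):
    """Hypothetical helper: turn (type, config) pairs into stage dicts."""
    result = []
    for stage_type, config in stages:
        if stage_type not in SUPPORTED_STAGE_TYPES:
            raise ValueError(f"unsupported stage type: {stage_type!r}")
        # Copy the config so later changes by the caller do not leak in.
        result.append({"type": stage_type, "config": dict(config)})
    return result


stages = build_stages(
    ("parse", {"provider": "textin"}),
    ("chunk", {"strategy": "by_page", "max_characters": 800}),
)
```

The resulting list can be passed as `stages=stages` when creating the loader.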

Multiple files

loader = XParseLoader(file_path=["a.pdf", "b.pdf"])
for doc in loader.lazy_load():
    print(doc.metadata.get("source"), doc.page_content[:50])
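
When the files live in a directory, the path list can be collected with `pathlib` instead of being typed by hand. This is a small sketch; `collect_pdfs` is a hypothetical helper, and it returns plain strings because the example above passes string paths (whether `Path` objects are also accepted is not stated here).

```python
from pathlib import Path


def collect_pdfs(directory):
    """Return sorted PDF paths (as strings) found directly inside `directory`."""
    return sorted(str(path) for path in Path(directory).glob("*.pdf"))
```

For example, `XParseLoader(file_path=collect_pdfs("reports"))` would load every PDF in a `reports` folder.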

File-like object

When passing a file-like object instead of a path, you must set metadata_filename:

with open("doc.pdf", "rb") as f:
    loader = XParseLoader(file=f, metadata_filename="doc.pdf")
    docs = loader.load()
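
The same requirement applies when the bytes are already in memory, for example after a download. The sketch below wraps raw bytes in `io.BytesIO` and pairs them with a filename; `in_memory_source` is a hypothetical helper, and it assumes the loader accepts any binary file-like object for `file`.

```python
import io


def in_memory_source(data, filename):
    """Hypothetical helper: wrap raw bytes as keyword arguments for the loader."""
    if not filename:
        raise ValueError("metadata_filename is required for file-like input")
    return {"file": io.BytesIO(data), "metadata_filename": filename}
```

The result can be unpacked into the constructor, as in `XParseLoader(**in_memory_source(pdf_bytes, "doc.pdf"))`, where `pdf_bytes` holds the document contents.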
