An Excel Loader for Langchain that Preserves Document Structure

Usage

pip install langchain-excel-loader

from langchain_excel_loader import StructuredExcelLoader

# Initialize the loader with your Excel file
loader = StructuredExcelLoader("path/to/your/file.xlsx")

# Load all documents (one per sheet)
docs = loader.load()

Background

LangChain's current solution for loading .xlsx files is the Unstructured document loader. This has two disadvantages:

  1. No attempt is made to preserve the structure of the document. By contrast, the CSV loader ingests the file row by row and labels each cell with its column title:

CSV loader example

csv:

Name,Age
Harry,21
Mary,48

Output:

[Document(page_content='Name: Harry\nAge: 21', metadata={'source': 'csv.csv', 'row': 0}),
 Document(page_content='Name: Mary\nAge: 48', metadata={'source': 'csv.csv', 'row': 1})]

Documents like these give the LLM the context to understand the meaning behind data.

The Unstructured Excel loader does not take this approach. It concatenates all the text content of the .xlsx file into a single string, with no indication of columns or rows.

  2. The Unstructured package is large and has multiple system dependencies, so it is not suitable for all environments and use cases.
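The row-by-row behaviour described under the first disadvantage can be sketched with the standard library alone. This is an illustration of the idea, not LangChain's actual CSV loader: the function name csv_rows_to_documents is hypothetical, and plain dictionaries stand in for Document objects.

```python
import csv
import io


def csv_rows_to_documents(csv_text, source):
    """Turn each CSV row into a dict with 'page_content' and 'metadata'.

    Hypothetical helper for illustration; plain dicts stand in for
    LangChain Document objects.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_index, row in enumerate(reader):
        # Pair every cell with its column title, one "title: value" per line.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_index}}
        )
    return documents


sample = "Name,Age\nHarry,21\nMary,48\n"
docs = csv_rows_to_documents(sample, source="csv.csv")
for doc in docs:
    print(doc)
```

Each row becomes its own document, and every value carries its column title, which is the context that a single concatenated string loses.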

Implementation of the StructuredExcelLoader

This package provides a StructuredExcelLoader, which uses openpyxl to read the .xlsx file. Because Excel spreadsheets have a less rigid structure than CSV files, the loader preserves the column and row reference of each cell, which gives the LLM more information from which to infer the meaning of the document.

Example Output

Given an Excel file sample.xlsx with two sheets:

Sheet: "Employees"

Employee   | Department | Salary
John Doe   | Sales      | 50000
Jane Smith | Marketing  | 55000

Sheet: "Departments"

Department | Location | Manager
Sales      | New York | Bob Wilson
Marketing  | Chicago  | Sarah Lee

The StructuredExcelLoader will create separate documents for each sheet:

[
    Document(
        page_content='''SHEET: "Employees"

ROW 1:
CELL A1: Employee
CELL B1: Department
CELL C1: Salary

ROW 2:
CELL A2: John Doe
CELL B2: Sales
CELL C2: 50000

ROW 3:
CELL A3: Jane Smith
CELL B3: Marketing
CELL C3: 55000''',
        metadata={'source': 'sample.xlsx', 'sheet_name': 'Employees'}
    ),
    
    Document(
        page_content='''SHEET: "Departments"

ROW 1:
CELL A1: Department
CELL B1: Location
CELL C1: Manager

ROW 2:
CELL A2: Sales
CELL B2: New York
CELL C2: Bob Wilson

ROW 3:
CELL A3: Marketing
CELL B3: Chicago
CELL C3: Sarah Lee''',
        metadata={'source': 'sample.xlsx', 'sheet_name': 'Departments'}
    )
]
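The page_content layout above can be reproduced with a short formatter. The following is a minimal sketch, not the package's actual implementation: the names column_letter and format_sheet are hypothetical, rows are passed in as plain lists so that the example runs without openpyxl, and skipping empty cells is an assumption.

```python
def column_letter(index):
    """Convert a 1-based column index to Excel letters (1 -> A, 27 -> AA)."""
    letters = ""
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def format_sheet(sheet_name, rows):
    """Render rows (a list of lists of cell values) in the SHEET/ROW/CELL layout."""
    blocks = [f'SHEET: "{sheet_name}"']
    for row_number, row in enumerate(rows, start=1):
        lines = [f"ROW {row_number}:"]
        for column_number, value in enumerate(row, start=1):
            if value is None:  # assumption: empty cells are skipped
                continue
            lines.append(f"CELL {column_letter(column_number)}{row_number}: {value}")
        blocks.append("\n".join(lines))
    # Blank line between the sheet header and each row block.
    return "\n\n".join(blocks)


employees = [
    ["Employee", "Department", "Salary"],
    ["John Doe", "Sales", 50000],
    ["Jane Smith", "Marketing", 55000],
]
print(format_sheet("Employees", employees))
```

With openpyxl, the rows for each worksheet could be obtained from worksheet.iter_rows(values_only=True) and passed to a formatter like this one.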

Disadvantages

This approach is weaker when the .xlsx file is extremely complex, because the LLM struggles to keep track of the positions of multiple tables within a sheet. With the latest models (e.g. GPT-4o and Gemini 2.5 at the time of writing), this limit has improved, along with the LLMs' ability to interpret cell references accurately.

Future Work

Once the effectiveness of this approach is validated, it should be incorporated into the langchain_community.document_loaders module, alongside the existing UnstructuredExcelLoader, which remains useful in some cases.

Alternatively, an additional boolean argument called "preserve_structure" could be provided, set to true by default. If it is explicitly set to false, the loader would produce documents as raw text strings without cell references.
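The proposed flag could behave roughly as follows. This is a hypothetical sketch of a feature that does not exist yet: the function render_rows and its output format are assumptions for illustration only, and columns beyond Z are out of scope.

```python
def render_rows(rows, preserve_structure=True):
    """Render rows with cell references, or as raw text when the flag is False."""
    lines = []
    for row_number, row in enumerate(rows, start=1):
        for column_number, value in enumerate(row, start=1):
            if preserve_structure:
                # Assumption: at most 26 columns, so a single letter is enough.
                reference = f"{chr(ord('A') + column_number - 1)}{row_number}"
                lines.append(f"CELL {reference}: {value}")
            else:
                # Raw text only, with no cell reference.
                lines.append(str(value))
    return "\n".join(lines)


rows = [["Name", "Age"], ["Harry", 21]]
print(render_rows(rows))
print(render_rows(rows, preserve_structure=False))
```

Defaulting the flag to true keeps the structured output as the standard behaviour, while callers who want plain text must opt out explicitly.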
