An Excel Loader for LangChain that Preserves Document Structure
Usage
pip install langchain-excel-loader
from langchain_excel_loader import StructuredExcelLoader
# Initialize the loader with your Excel file
loader = StructuredExcelLoader("path/to/your/file.xlsx")
# Load all documents (one per sheet)
docs = loader.load()
Background
LangChain's current solution for loading .xlsx files is the Unstructured document loader. This has two disadvantages:
- No attempt is made to preserve the structure of the document. The CSV loader, by contrast, ingests the file row by row and pairs each cell with its column title:
CSV loader example
csv:
Name,Age
Harry,21
Mary,48
Output:
[Document(page_content='Name: Harry\nAge: 21', metadata={'source': 'csv.csv', 'row': 0}),
Document(page_content='Name: Mary\nAge: 48', metadata={'source': 'csv.csv', 'row': 1})]
Documents like these give the LLM the context to understand the meaning behind data.
Instead of an approach like the above, the Unstructured Excel Loader simply concatenates all the text content of the .xlsx file into one string, with no indication of columns or rows.
- The second disadvantage is that the Unstructured package is large and has multiple system dependencies, so it is not suitable for all environments and use cases.
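The row-wise format described in the first point can be illustrated with the standard library alone. This is a minimal sketch, not LangChain's CSVLoader itself: the function name csv_rows_to_texts is hypothetical, and it only produces the text content for each row, without Document objects or metadata.

```python
import csv
import io


def csv_rows_to_texts(csv_text: str) -> list[str]:
    """Render each CSV data row as 'column: value' lines, one string per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        "\n".join(f"{column}: {value}" for column, value in row.items())
        for row in reader
    ]


# Same data as the CSV example above.
sample = "Name,Age\nHarry,21\nMary,48\n"
texts = csv_rows_to_texts(sample)
for index, text in enumerate(texts):
    print(f"row {index}: {text!r}")
```

Each string carries its own column titles, which is the context that a single concatenated text blob loses.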
Implementation of the StructuredExcelLoader
This package provides a StructuredExcelLoader, which uses openpyxl to read the .xlsx file. Because Excel spreadsheets have a less fixed structure than CSV files, the loader preserves the column and row reference of each cell, giving the LLM more to work with when inferring meaning from the document.
Example Output
Given an Excel file sample.xlsx with two sheets:
Sheet: "Employees"
| Employee | Department | Salary |
|---|---|---|
| John Doe | Sales | 50000 |
| Jane Smith | Marketing | 55000 |
Sheet: "Departments"
| Department | Location | Manager |
|---|---|---|
| Sales | New York | Bob Wilson |
| Marketing | Chicago | Sarah Lee |
The StructuredExcelLoader will create separate documents for each sheet:
[
Document(
page_content='''SHEET: "Employees"
ROW 1:
CELL A1: Employee
CELL B1: Department
CELL C1: Salary
ROW 2:
CELL A2: John Doe
CELL B2: Sales
CELL C2: 50000
ROW 3:
CELL A3: Jane Smith
CELL B3: Marketing
CELL C3: 55000''',
metadata={'source': 'sample.xlsx', 'sheet_name': 'Employees'}
),
Document(
page_content='''SHEET: "Departments"
ROW 1:
CELL A1: Department
CELL B1: Location
CELL C1: Manager
ROW 2:
CELL A2: Sales
CELL B2: New York
CELL C2: Bob Wilson
ROW 3:
CELL A3: Marketing
CELL B3: Chicago
CELL C3: Sarah Lee''',
metadata={'source': 'sample.xlsx', 'sheet_name': 'Departments'}
)
]
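The text layout shown above can be reproduced with a short formatter. The following is a sketch under stated assumptions, not the package's actual implementation: render_sheet and column_letter are hypothetical names, rows are passed in as plain Python sequences, and skipping empty cells is an assumption. With openpyxl, such rows could be obtained from a worksheet via iter_rows(values_only=True).

```python
from typing import Iterable, Sequence


def column_letter(index: int) -> str:
    """Convert a 1-based column index to Excel letters (1 -> A, 27 -> AA)."""
    letters = ""
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def render_sheet(sheet_name: str, rows: Iterable[Sequence[object]]) -> str:
    """Render one sheet as SHEET / ROW / CELL lines with A1-style references."""
    lines = [f'SHEET: "{sheet_name}"']
    for row_number, row in enumerate(rows, start=1):
        lines.append(f"ROW {row_number}:")
        for column_number, value in enumerate(row, start=1):
            if value is None:
                # Assumption: empty cells are skipped rather than printed.
                continue
            reference = f"{column_letter(column_number)}{row_number}"
            lines.append(f"CELL {reference}: {value}")
    return "\n".join(lines)


employees = [
    ["Employee", "Department", "Salary"],
    ["John Doe", "Sales", 50000],
    ["Jane Smith", "Marketing", 55000],
]
text = render_sheet("Employees", employees)
print(text)
```

Keeping the cell reference on every line means a fragment of the document still says where its values came from, even if the text is later split into chunks.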
Disadvantages
This approach is weaker when the .xlsx file is extremely complex, because the LLM struggles to keep track of the positions of multiple tables within a sheet. With the latest models (e.g. GPT-4o and Gemini 2.5 at the time of writing), this limit has eased as LLMs have become better at interpreting cell references accurately.
Future Work
After the effectiveness of this approach is validated, it should be incorporated into the langchain_community.document_loaders repository, alongside the existing UnstructuredExcelLoader, which remains useful in some cases.
Alternatively, an additional boolean argument called "preserve_structure" could be provided, set to True by default. If it is explicitly set to False, the loader could produce documents as raw text strings without cell references.
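How such a flag might behave can be sketched as follows. This is purely illustrative: the function render_rows is hypothetical, the raw-text format (cell values joined by spaces) is an assumption, and the cell references are limited to columns A to Z to keep the example short.

```python
def render_rows(rows, preserve_structure=True):
    """Illustrate the proposed preserve_structure flag on a list of rows."""
    lines = []
    for row_number, row in enumerate(rows, start=1):
        if preserve_structure:
            lines.append(f"ROW {row_number}:")
            for column_number, value in enumerate(row, start=1):
                # Assumption: at most 26 columns, so a single letter suffices.
                column = chr(ord("A") + column_number - 1)
                lines.append(f"CELL {column}{row_number}: {value}")
        else:
            # Assumption: raw mode joins the cell values of a row with spaces.
            lines.append(" ".join(str(value) for value in row))
    return "\n".join(lines)


rows = [["Department", "Location"], ["Sales", "New York"]]
structured = render_rows(rows)
raw = render_rows(rows, preserve_structure=False)
print(structured)
print(raw)
```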