Universal chunking functions to extract LLM-friendly chunks from any file type.
Unichunking
Extract LLM-friendly chunks from any file type.
Supported file types:
- DOCX & DOCX-like (DOC, ODT)
- PPTX & PPTX-like (PPT, ODP)
- XLSX & XLSX-like (XLS, ODS)
- TXT, MD, CSV
- IPYNB
Installation
To install, run the following command:
python3 -m pip install delos-unichunking
How to use
The main functions are:
- extract_subchunks returns a list of all the text particles in the file.
- split_chunks_with_overlap transforms a list of subchunks on a given page into a list of chunks, following default or specified parameters for minimum/maximum token size and overlap.
- build_chunked_pages returns a list of "pages", which are lists of formatted chunks, following the structure of the document.
- compute_pages approximates the pagination of a file that does not have a native pagination system (such as DOCX) by comparing it to a PDF version.
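To illustrate what chunking with overlap means, here is a minimal, self-contained sketch. It is not the package's implementation and does not call its API: the function name split_with_overlap, its parameters, and the use of whitespace-separated words as "tokens" are all assumptions made for illustration.

```python
def split_with_overlap(tokens, max_tokens=8, overlap=2):
    """Hypothetical helper: group tokens into windows of at most
    `max_tokens`, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    step = max_tokens - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        # Stop once a window has reached the end of the input.
        if start + max_tokens >= len(tokens):
            break
    return chunks


# Whitespace-separated words stand in for tokens (an assumption).
words = "one two three four five six seven eight nine ten".split()
chunks = split_with_overlap(words, max_tokens=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk starts with the last word of the previous one, so context that straddles a chunk boundary is still visible to the model reading either chunk.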
Specificities
Please note that the package requires a LibreOffice installation to run soffice commands, which are used for file conversions: for instance, DOC and ODT files are first converted to DOCX and processed as such.
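The kind of conversion described above can be sketched as follows. This only shows how a headless soffice conversion is typically invoked from Python; the helper names build_convert_command and convert are hypothetical, and the exact commands the package runs internally may differ.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(source, target_format="docx", outdir="."):
    """Hypothetical helper: build a headless LibreOffice conversion command."""
    return [
        "soffice", "--headless",
        "--convert-to", target_format,
        "--outdir", str(outdir),
        str(source),
    ]


def convert(source, target_format="docx", outdir="."):
    """Hypothetical helper: run the conversion and return the expected output path."""
    # Fail early with a clear message if LibreOffice is not installed.
    if shutil.which("soffice") is None:
        raise RuntimeError("LibreOffice (soffice) was not found on PATH")
    subprocess.run(build_convert_command(source, target_format, outdir), check=True)
    return Path(outdir) / f"{Path(source).stem}.{target_format}"


# Building the command does not require LibreOffice to be installed.
command = build_convert_command("report.odt", "docx", "converted")
print(" ".join(command))
```

Checking for soffice with shutil.which before processing files gives a clearer error than a failed subprocess call.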
The page numbers computed for DOCX files are an approximation and can be off by a few pages for large files.
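One way to picture why such page numbers are approximate is to match a snippet of document text against the text of each PDF page. The sketch below is purely illustrative: locate_page, its probe_words parameter, and the matching strategy are assumptions, not the method used by compute_pages.

```python
def locate_page(snippet, pdf_pages, probe_words=5):
    """Hypothetical helper: return the 1-based number of the first PDF page
    whose text contains the first `probe_words` words of `snippet`."""
    probe = " ".join(snippet.split()[:probe_words])
    if not probe:
        return None
    for number, page_text in enumerate(pdf_pages, start=1):
        # Normalize whitespace so line breaks in the PDF text do not matter.
        if probe in " ".join(page_text.split()):
            return number
    return None  # no page matched, so the caller has to estimate


# Page texts as they might come out of a PDF text extractor (made-up data).
pdf_pages = [
    "Introduction to the project and its goals.",
    "Installation requires  LibreOffice\nto be present.",
]
page = locate_page("Installation requires LibreOffice to be present on the machine", pdf_pages)
print(page)
```

When a snippet does not match any page exactly, for example because of hyphenation or layout differences, a page number has to be estimated from neighbouring matches, which is where drift on large files can come from.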
Artificial page numbers are used for page-less formats such as TXT files, and to split large XLSX sheets into multiple pages that respect a tokens-per-page limit.
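The idea of artificial pages bounded by a token budget can be sketched as below. The function assign_artificial_pages, its tokens_per_page default, and the word-count approximation of tokens are assumptions for illustration, not the package's actual logic.

```python
def assign_artificial_pages(lines, tokens_per_page=500):
    """Hypothetical helper: group lines into pages so that each page stays
    within `tokens_per_page`, counting whitespace-separated words as tokens."""
    pages = [[]]
    used = 0
    for line in lines:
        cost = len(line.split())
        # Start a new page when the current one is non-empty and would overflow.
        if pages[-1] and used + cost > tokens_per_page:
            pages.append([])
            used = 0
        pages[-1].append(line)
        used += cost
    return pages if pages[0] else []


rows = ["a b c", "d e", "f g h i", "j"]
pages = assign_artificial_pages(rows, tokens_per_page=5)
for number, page in enumerate(pages, start=1):
    print(number, page)
```

A single line larger than the budget still gets its own page in this sketch rather than being split, which keeps rows intact at the cost of an occasional oversized page.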
File details
Details for the file delos_unichunking-0.8.21.tar.gz.
File metadata
- Download URL: delos_unichunking-0.8.21.tar.gz
- Upload date:
- Size: 28.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 41a9eee58f23b166469b3b27707f5f4cb3546c02e2f03bc7905b32cb0fd9e799 |
| MD5 | 37576f214908beffbc6b0d05990e349e |
| BLAKE2b-256 | 4171c0ac74ade8911a892898f03e9f8e23a01b0202defc88d56d5af4532393b2 |
File details
Details for the file delos_unichunking-0.8.21-py3-none-any.whl.
File metadata
- Download URL: delos_unichunking-0.8.21-py3-none-any.whl
- Upload date:
- Size: 39.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | eaf7e6989ff1c4b3b1ceb86a7b1754db471c8b1a7f9e5e754a3693ca655e7b5f |
| MD5 | 5aa42cda6e9bea63fde38c192d538b0c |
| BLAKE2b-256 | 3c199ea3a4dd67021ba40dcb550af067dec40bbff5feb0b4a1cfbf7152ad6ed2 |