Universal chunking functions to extract LLM-friendly chunks from any file type.

Unichunking

Extract LLM-friendly chunks from any file type.

Supported file types:

  • DOCX & DOCX-like (DOC, ODT)
  • PPTX & PPTX-like (PPT, ODP)
  • XLSX & XLSX-like (XLS, ODS)
  • TXT, MD, CSV
  • IPYNB

Installation

To install, run the following command:

python3 -m pip install delos-unichunking
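To confirm that the installation succeeded, the installed version can be read back from the package metadata. This is a minimal sketch using only the standard library; the helper name installed_version is an illustration and not part of the package.

```python
from importlib import metadata


def installed_version(dist_name: str = "delos-unichunking"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version())
```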

How to use

The main functions are:

  • extract_subchunks returns a list of all the text fragments (subchunks) in the file.
  • split_chunks_with_overlap transforms the list of subchunks on a given page into a list of chunks, following default or specified parameters for minimum/maximum token size and overlap.
  • build_chunked_pages returns a list of "pages", which are lists of formatted chunks, following the structure of the document.
  • compute_pages approximates the pagination of a file that does not have a native pagination system (such as DOCX) by comparing it to a PDF version.
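To illustrate the idea behind chunking with overlap, here is a minimal standalone sketch. It is not the package's implementation and does not use its API: the function name split_with_overlap, its parameters, and the use of whitespace-separated words as "tokens" are all assumptions made for the example, and the minimum token size is left out for brevity.

```python
def split_with_overlap(subchunks, max_tokens=8, overlap=2):
    """Group text fragments into chunks of at most `max_tokens` tokens.

    Hypothetical helper: tokens are plain whitespace-separated words, and the
    last `overlap` tokens of each chunk are repeated at the start of the next.
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    # Flatten all fragments into one token stream.
    tokens = [tok for text in subchunks for tok in text.split()]

    step = max_tokens - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        # Stop once the current window has reached the end of the stream.
        if start + max_tokens >= len(tokens):
            break
    return chunks


for chunk in split_with_overlap(["a b c", "d e f g", "h i j"], max_tokens=4, overlap=1):
    print(chunk)
```

With these inputs, each chunk shares its first token with the last token of the previous chunk, which is what keeps context from being cut at chunk boundaries.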

Requirements and limitations

Please note that the package requires a LibreOffice installation to run soffice commands, which are used during file conversions: for instance, DOC and ODT files are first converted to DOCX and processed as such.
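The sketch below shows one way to check that soffice is available and to run a headless conversion to DOCX. The --headless, --convert-to and --outdir options are standard LibreOffice command-line options, but the helper names are hypothetical and the exact command the package runs internally may differ.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(src, out_dir, target="docx", soffice="soffice"):
    """Build a LibreOffice headless conversion command (hypothetical helper)."""
    return [soffice, "--headless", "--convert-to", target,
            "--outdir", str(out_dir), str(src)]


def convert(src, out_dir, target="docx"):
    """Convert `src` with LibreOffice and return the expected output path."""
    exe = shutil.which("soffice")
    if exe is None:
        raise RuntimeError("LibreOffice (soffice) was not found on PATH")
    subprocess.run(build_convert_command(src, out_dir, target, exe), check=True)
    # LibreOffice keeps the base name and swaps the extension.
    return Path(out_dir) / (Path(src).stem + "." + target)


print(build_convert_command("report.doc", "converted"))
```

Calling shutil.which("soffice") before processing files gives an early, readable error when LibreOffice is missing.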

The page numbers computed for DOCX files are an approximation and can be off by a few pages for large files.

Artificial page numbers are used for page-less structures such as TXT files, and to split large XLSX sheets into multiple pages that respect a tokens-per-page limit.
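As an illustration of artificial pagination, the following sketch groups the lines of a page-less text into pages under a token budget. It is not taken from the package: paginate_lines, the tokens_per_page parameter, and word-based token counting are assumptions made for the example.

```python
def paginate_lines(lines, tokens_per_page=500):
    """Group lines into artificial pages under a token budget.

    Hypothetical helper: tokens are whitespace-separated words. A single line
    longer than the budget is kept whole on its own page.
    """
    pages, current, count = [], [], 0
    for line in lines:
        n = len(line.split())
        # Start a new page when adding this line would exceed the budget.
        if current and count + n > tokens_per_page:
            pages.append(current)
            current, count = [], 0
        current.append(line)
        count += n
    if current:
        pages.append(current)
    return pages


sample = ["a b c", "d e", "f g h i", "j"]
for page_number, page in enumerate(paginate_lines(sample, tokens_per_page=5), start=1):
    print(page_number, page)
```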
