
A plugin for OVOS


Document Chunkers

A collection of helpers to process raw documents

Overview

This library provides tools for chunking documents into manageable pieces such as paragraphs and sentences. It's particularly useful for preprocessing text data for natural language processing (NLP) tasks.

Text Segmenters


Usage

Example: Using SaT for Sentence Segmentation

from ovos_document_chunkers import SaTSentenceSplitter

config = {"model": "sat-3l-sm", "use_cuda": False}
splitter = SaTSentenceSplitter(config)

text = "This is a sentence. And this is another one."
sentences = splitter.chunk(text)

for sentence in sentences:
    print(sentence)
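Model-based segmenters like SaT exist because simple punctuation rules break on abbreviations and similar cases. A quick plain-Python illustration of that failure mode (not part of this library):

```python
import re

# Naive rule: a sentence ends at a period followed by whitespace.
text = "Dr. Smith arrived at 5 p.m. He was late."
naive = re.split(r"(?<=\.)\s+", text)
print(naive)  # abbreviations produce spurious "sentences"
```

The naive split yields three chunks instead of the correct two, cutting after "Dr." — exactly the kind of boundary a trained model handles.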

Example: Using WtP for Paragraph Segmentation

from ovos_document_chunkers import WtPParagraphSplitter

config = {"model": "wtp-bert-mini", "use_cuda": False}
splitter = WtPParagraphSplitter(config)

text = "This is a paragraph. It contains multiple sentences.\n\nThis is another paragraph."
paragraphs = splitter.chunk(text)

for paragraph in paragraphs:
    print(paragraph)
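For comparison, the simplest paragraph heuristic is splitting on blank lines; a model-based splitter like WtP aims to recover paragraph boundaries even when such markers are missing or unreliable. A plain-Python baseline (illustration only, not the library's implementation):

```python
# Blank-line heuristic: paragraphs are separated by "\n\n".
text = ("This is a paragraph. It contains multiple sentences.\n\n"
        "This is another paragraph.")

paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
print(len(paragraphs))  # 2
```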

Example: Using PySBD for Sentence Segmentation

from ovos_document_chunkers import PySBDSentenceSplitter

config = {"lang": "en"}
splitter = PySBDSentenceSplitter(config)

text = "This is a sentence. This is another one!"
sentences = splitter.chunk(text)

for sentence in sentences:
    print(sentence)
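pysbd is a rule-based segmenter that covers many edge cases (abbreviations, decimals, ellipses) that a bare regex does not. The regex below is only a rough stand-in to show the basic idea of boundary detection on terminal punctuation:

```python
import re

# Rough approximation: split after ., !, or ? followed by whitespace.
text = "This is a sentence. This is another one!"
sentences = re.split(r"(?<=[.!?])\s+", text)
for s in sentences:
    print(s)
```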

File Formats

Supported File Formats

| Type | Description | Class Name | Expected Input | File Extension |
|------|-------------|------------|----------------|----------------|
| Markdown | Splits Markdown text into sentences or paragraphs | MarkdownSentenceSplitter | String (URL, path, or Markdown text) | .md |
| | | MarkdownParagraphSplitter | String (URL, path, or Markdown text) | .md |
| HTML | Splits HTML text into sentences or paragraphs | HTMLSentenceSplitter | String (URL, path, or HTML text) | .html |
| | | HTMLParagraphSplitter | String (URL, path, or HTML text) | .html |
| PDF | Splits PDF documents into sentences or paragraphs | PDFSentenceSplitter | String (URL or path to PDF file) | .pdf |
| | | PDFParagraphSplitter | String (URL or path to PDF file) | .pdf |
| doc | Splits Microsoft doc documents into sentences or paragraphs | DOCSentenceSplitter | String (URL or path to doc file) | .doc |
| | | DOCParagraphSplitter | String (URL or path to doc file) | .doc |
| docx | Splits Microsoft docx documents into sentences or paragraphs | DOCxSentenceSplitter | String (URL or path to docx file) | .docx |
| | | DOCxParagraphSplitter | String (URL or path to docx file) | .docx |
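Given the file-extension column above, picking the right splitter for a path can be reduced to a lookup. `pick_splitter` below is a hypothetical helper sketched for illustration; it is not part of the library:

```python
from pathlib import Path

# Map file extensions to the sentence-splitter class names listed above.
SPLITTERS = {
    ".md": "MarkdownSentenceSplitter",
    ".html": "HTMLSentenceSplitter",
    ".pdf": "PDFSentenceSplitter",
    ".doc": "DOCSentenceSplitter",
    ".docx": "DOCxSentenceSplitter",
}

def pick_splitter(path: str) -> str:
    """Return the splitter class name for a file path (hypothetical helper)."""
    ext = Path(path).suffix.lower()
    return SPLITTERS[ext]

print(pick_splitter("report.docx"))  # DOCxSentenceSplitter
```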

Usage

Example using MarkdownSentenceSplitter

from ovos_document_chunkers.text.markdown import MarkdownSentenceSplitter
import requests

markdown_text = requests.get("https://github.com/OpenVoiceOS/ovos-core/raw/dev/README.md").text

sentence_splitter = MarkdownSentenceSplitter()
sentences = sentence_splitter.chunk(markdown_text)

print("Sentences:")
for sentence in sentences:
    print(sentence)
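A Markdown-aware splitter has to strip formatting markers before segmenting, so that "# Title" or "**bold**" do not leak into the sentences. A minimal regex sketch of that normalization step (illustration only; the library's actual implementation may differ):

```python
import re

markdown = "# Title\n\nThis is **bold** text with a [link](https://example.com)."

# Rough cleanup: drop heading markers, bold markers, and link targets.
text = re.sub(r"^#+\s*", "", markdown, flags=re.MULTILINE)
text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)
text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)
print(text)
```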

Example using MarkdownParagraphSplitter

from ovos_document_chunkers.text.markdown import MarkdownParagraphSplitter
import requests

markdown_text = requests.get("https://github.com/OpenVoiceOS/ovos-core/raw/dev/README.md").text

paragraph_splitter = MarkdownParagraphSplitter()
paragraphs = paragraph_splitter.chunk(markdown_text)

print("\nParagraphs:")
for paragraph in paragraphs:
    print(paragraph)

Example using HTMLSentenceSplitter

from ovos_document_chunkers import HTMLSentenceSplitter
import requests

html_text = requests.get("https://www.gofundme.com/f/openvoiceos").text

sentence_splitter = HTMLSentenceSplitter()
sentences = sentence_splitter.chunk(html_text)

print("Sentences:")
for sentence in sentences:
    print(sentence)

Example using HTMLParagraphSplitter

from ovos_document_chunkers import HTMLParagraphSplitter
import requests

html_text = requests.get("https://www.gofundme.com/f/openvoiceos").text

paragraph_splitter = HTMLParagraphSplitter()
paragraphs = paragraph_splitter.chunk(html_text)

print("\nParagraphs:")
for paragraph in paragraphs:
    print(paragraph)
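Before any sentence or paragraph splitting can happen on HTML, the visible text must be extracted from the markup, skipping script and style contents. A minimal stdlib sketch with `html.parser` (illustration; the library may use a different parser):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

html = "<html><body><h1>Hello</h1><script>var x=1;</script><p>World.</p></body></html>"
parser = TextExtractor()
parser.feed(html)
print(" ".join(parser.parts))  # Hello World.
```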

Example using PDFParagraphSplitter

from ovos_document_chunkers import PDFParagraphSplitter

pdf_path = "/path/to/your/pdf/document.pdf"

paragraph_splitter = PDFParagraphSplitter()
paragraphs = paragraph_splitter.chunk(pdf_path)

print("\nParagraphs:")
for paragraph in paragraphs:
    print(paragraph)

Credits


This work was sponsored by VisioLab, part of Royal Dutch Visio, the test, education, and research center in the field of (innovative) assistive technology for blind and visually impaired people and professionals. We explore (new) technological developments such as Voice, VR, and AI, and make the knowledge and expertise we gain available to everyone.

