A plugin for OVOS
Document Chunkers
A collection of helpers to process raw documents
Overview
This library provides tools for chunking documents into manageable pieces such as paragraphs and sentences. It's particularly useful for preprocessing text data for natural language processing (NLP) tasks.
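To make "chunking" concrete, here is a naive baseline written with the standard library only. It is a sketch for comparison and not how this library works internally; the function names `naive_paragraphs` and `naive_sentences` are made up for this illustration.

```python
import re


def naive_paragraphs(text):
    """Split on blank lines and drop empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def naive_sentences(text):
    """Split after '.', '!' or '?' when whitespace follows."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


text = "This is a paragraph. It contains multiple sentences.\n\nThis is another paragraph."
paragraphs = naive_paragraphs(text)
sentences = naive_sentences(paragraphs[0])

# The naive rule breaks on abbreviations: "Dr." is treated as a full sentence.
broken = naive_sentences("Dr. Smith arrived.")

print(paragraphs)
print(sentences)
print(broken)
```

A rule this simple mis-splits abbreviations and needs punctuation to be present at all, which is the kind of input the segmenters listed below are designed to handle.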
Text Segmenters
- SaT: "Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation" by Markus Frohmann, Igor Sterner, Benjamin Minixhofer, Ivan Vulić and Markus Schedl (**state-of-the-art, recommended**), 85 languages
- WtP: "Where’s the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation" by Benjamin Minixhofer, Jonas Pfeiffer and Ivan Vulić, 85 languages
- PySBD: "PySBD: Pragmatic Sentence Boundary Disambiguation" by Nipun Sadvilkar and Mark Neumann (rule-based, lightweight), 22 languages
Usage
Example: Using SaT for Sentence Segmentation
from ovos_document_chunkers import SaTSentenceSplitter

# Select the SaT model and run on CPU ("use_cuda": False)
config = {"model": "sat-3l-sm", "use_cuda": False}
splitter = SaTSentenceSplitter(config)

text = "This is a sentence. And this is another one."
sentences = splitter.chunk(text)

# Print one sentence per line
for sentence in sentences:
    print(sentence)
Example: Using WtP for Paragraph Segmentation
from ovos_document_chunkers import WtPParagraphSplitter

# Select the WtP model and run on CPU ("use_cuda": False)
config = {"model": "wtp-bert-mini", "use_cuda": False}
splitter = WtPParagraphSplitter(config)

text = "This is a paragraph. It contains multiple sentences.\n\nThis is another paragraph."
paragraphs = splitter.chunk(text)

# Print one paragraph per line
for paragraph in paragraphs:
    print(paragraph)
Example: Using PySBD for Sentence Segmentation
from ovos_document_chunkers import PySBDSentenceSplitter

# PySBD is rule-based, so only the language code is configured here
config = {"lang": "en"}
splitter = PySBDSentenceSplitter(config)

text = "This is a sentence. This is another one!"
sentences = splitter.chunk(text)

# Print one sentence per line
for sentence in sentences:
    print(sentence)
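Downstream NLP steps often want chunks of bounded size and not single sentences. The sketch below greedily packs consecutive sentences into chunks of at most `max_chars` characters. It assumes only that `chunk()` yields strings, as the loops above suggest; `group_sentences` and `max_chars` are hypothetical names and not part of this library.

```python
def group_sentences(sentences, max_chars=200):
    """Greedily pack consecutive sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept whole in its own chunk.
    """
    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Stand-in for the output of splitter.chunk(text)
sentences = ["This is a sentence.", "This is another one!", "And a third."]
grouped = group_sentences(sentences, max_chars=50)
for chunk in grouped:
    print(chunk)
```

Because the function accepts any iterable of strings, the result of any of the sentence splitters above could be passed in directly, provided that assumption about `chunk()` holds.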
File Formats
Supported File Formats
| Type | Description | Class Name | Expected Input | File Extension |
|---|---|---|---|---|
| Markdown | Splits Markdown text into sentences or paragraphs | MarkdownSentenceSplitter | String (url, path or Markdown text) | .md |
| Markdown | Splits Markdown text into sentences or paragraphs | MarkdownParagraphSplitter | String (url, path or Markdown text) | .md |
| HTML | Splits HTML text into sentences or paragraphs | HTMLSentenceSplitter | String (url, path or HTML text) | .html |
| HTML | Splits HTML text into sentences or paragraphs | HTMLParagraphSplitter | String (url, path or HTML text) | .html |
| PDF | Splits PDF documents into sentences or paragraphs | PDFSentenceSplitter | String (url or path to PDF file) | .pdf |
| PDF | Splits PDF documents into sentences or paragraphs | PDFParagraphSplitter | String (url or path to PDF file) | .pdf |
| doc | Splits Microsoft doc documents into sentences or paragraphs | DOCSentenceSplitter | String (url or path to doc file) | .doc |
| doc | Splits Microsoft doc documents into sentences or paragraphs | DOCParagraphSplitter | String (url or path to doc file) | .doc |
| docx | Splits Microsoft docx documents into sentences or paragraphs | DOCxSentenceSplitter | String (url or path to docx file) | .docx |
| docx | Splits Microsoft docx documents into sentences or paragraphs | DOCxParagraphSplitter | String (url or path to docx file) | .docx |
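When the input is a path or URL, the table above can be turned into a small lookup from file extension to class name. This is a sketch and not something the library ships: `SPLITTER_NAMES` and `splitter_name_for` are hypothetical, the class names are copied from the table, and the `.pdf` extension is an assumption. It returns names as strings so it runs without the package installed.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# (sentence splitter, paragraph splitter) per extension, copied from the table.
# ".pdf" is an assumed extension for the PDF splitters.
SPLITTER_NAMES = {
    ".md": ("MarkdownSentenceSplitter", "MarkdownParagraphSplitter"),
    ".html": ("HTMLSentenceSplitter", "HTMLParagraphSplitter"),
    ".pdf": ("PDFSentenceSplitter", "PDFParagraphSplitter"),
    ".doc": ("DOCSentenceSplitter", "DOCParagraphSplitter"),
    ".docx": ("DOCxSentenceSplitter", "DOCxParagraphSplitter"),
}


def splitter_name_for(source, unit="sentence"):
    """Return the splitter class name for a path or URL, or None if unknown."""
    if unit not in ("sentence", "paragraph"):
        raise ValueError("unit must be 'sentence' or 'paragraph'")
    # urlparse leaves a plain file path unchanged in .path and strips
    # the scheme, host and query string from a URL.
    path = urlparse(source).path
    suffix = PurePosixPath(path).suffix.lower()
    names = SPLITTER_NAMES.get(suffix)
    if names is None:
        return None
    return names[0] if unit == "sentence" else names[1]


print(splitter_name_for("/path/to/your/pdf/document.pdf", unit="paragraph"))
print(splitter_name_for("https://github.com/OpenVoiceOS/ovos-core/raw/dev/README.md"))
```

A URL without a file extension returns `None`, so the caller still has to choose a splitter explicitly in that case.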
Usage
Example using MarkdownSentenceSplitter
from ovos_document_chunkers.text.markdown import MarkdownSentenceSplitter
import requests

# Download the raw Markdown source of the ovos-core README
markdown_text = requests.get("https://github.com/OpenVoiceOS/ovos-core/raw/dev/README.md").text

sentence_splitter = MarkdownSentenceSplitter()
sentences = sentence_splitter.chunk(markdown_text)

print("Sentences:")
for sentence in sentences:
    print(sentence)
Example using MarkdownParagraphSplitter
from ovos_document_chunkers.text.markdown import MarkdownParagraphSplitter
import requests

# Download the raw Markdown source of the ovos-core README
markdown_text = requests.get("https://github.com/OpenVoiceOS/ovos-core/raw/dev/README.md").text

paragraph_splitter = MarkdownParagraphSplitter()
paragraphs = paragraph_splitter.chunk(markdown_text)

print("\nParagraphs:")
for paragraph in paragraphs:
    print(paragraph)
Example using HTMLSentenceSplitter
from ovos_document_chunkers import HTMLSentenceSplitter
import requests

# Download the page HTML
html_text = requests.get("https://www.gofundme.com/f/openvoiceos").text

sentence_splitter = HTMLSentenceSplitter()
sentences = sentence_splitter.chunk(html_text)

print("Sentences:")
for sentence in sentences:
    print(sentence)
Example using HTMLParagraphSplitter
from ovos_document_chunkers import HTMLParagraphSplitter
import requests

# Download the page HTML
html_text = requests.get("https://www.gofundme.com/f/openvoiceos").text

paragraph_splitter = HTMLParagraphSplitter()
paragraphs = paragraph_splitter.chunk(html_text)

print("\nParagraphs:")
for paragraph in paragraphs:
    print(paragraph)
Example using PDFParagraphSplitter
from ovos_document_chunkers import PDFParagraphSplitter

# Replace with the path (or URL) of your PDF
pdf_path = "/path/to/your/pdf/document.pdf"

paragraph_splitter = PDFParagraphSplitter()
paragraphs = paragraph_splitter.chunk(pdf_path)

print("\nParagraphs:")
for paragraph in paragraphs:
    print(paragraph)
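All the examples above follow the same pattern: build a splitter, call `chunk()`, iterate over the result. A sketch of applying that pattern to several documents at once could look like the following. `chunk_documents` and `StubSplitter` are hypothetical names; the stub stands in for a real splitter so the snippet runs without the package, and the only assumption about the real classes is that `chunk(source)` yields strings.

```python
def chunk_documents(sources, splitter):
    """Map each source (path or URL) to the list of non-empty chunks
    produced by splitter.chunk(source)."""
    results = {}
    for source in sources:
        # Building a list also materializes the result if chunk() is a generator.
        results[source] = [c for c in splitter.chunk(source) if c.strip()]
    return results


class StubSplitter:
    """Stand-in for a real splitter: returns canned chunks per source."""

    def __init__(self, canned):
        self.canned = canned

    def chunk(self, source):
        return iter(self.canned[source])


stub = StubSplitter({
    "a.pdf": ["First paragraph.", "   ", "Second paragraph."],
    "b.pdf": ["Only paragraph."],
})
chunks_by_source = chunk_documents(["a.pdf", "b.pdf"], stub)
for source, chunks in chunks_by_source.items():
    print(source, len(chunks))
```

With the package installed, an instance such as `PDFParagraphSplitter()` would take the place of `stub`, under the assumption stated above.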
Credits
This work was sponsored by VisioLab, part of Royal Dutch Visio. VisioLab is the test, education, and research center in the field of (innovative) assistive technology for blind and visually impaired people and professionals. We explore (new) technological developments such as Voice, VR and AI and make the knowledge and expertise we gain available to everyone.