
A plugin for OVOS


Document Chunkers

A collection of helpers to process raw documents

Overview

This library provides tools for chunking documents into manageable pieces such as paragraphs and sentences. It's particularly useful for preprocessing text data for natural language processing (NLP) tasks.
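As a toy illustration of what sentence chunking means (not the library's implementation — the model-based segmenters below handle abbreviations, decimals, and quotes that naive rules miss), a minimal standard-library sketch:

```python
import re

def naive_sentence_chunk(text: str) -> list[str]:
    # Split on sentence-final punctuation followed by whitespace.
    # Real segmenters (SaT, WtP, pysbd) are far more robust than
    # this toy regex.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(naive_sentence_chunk("This is a sentence. And this is another one."))
# → ['This is a sentence.', 'And this is another one.']
```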

Text Segmenters


Usage

Example: Using SaT for Sentence Segmentation

from ovos_document_chunkers import SaTSentenceSplitter

config = {"model": "sat-3l-sm", "use_cuda": False}
splitter = SaTSentenceSplitter(config)

text = "This is a sentence. And this is another one."
sentences = splitter.chunk(text)

for sentence in sentences:
    print(sentence)

Example: Using WtP for Paragraph Segmentation

from ovos_document_chunkers import WtPParagraphSplitter

config = {"model": "wtp-bert-mini", "use_cuda": False}
splitter = WtPParagraphSplitter(config)

text = "This is a paragraph. It contains multiple sentences.\n\nThis is another paragraph."
paragraphs = splitter.chunk(text)

for paragraph in paragraphs:
    print(paragraph)
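Conceptually, paragraph segmentation of plain text reduces to splitting on blank lines; a minimal stdlib sketch of that idea (not what WtP does internally — WtP is model-based and can segment text that lacks newlines entirely):

```python
import re

def naive_paragraph_chunk(text: str) -> list[str]:
    # Paragraphs are runs of text separated by one or more blank lines.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

text = "This is a paragraph. It contains multiple sentences.\n\nThis is another paragraph."
print(naive_paragraph_chunk(text))
# → ['This is a paragraph. It contains multiple sentences.', 'This is another paragraph.']
```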

Example: Using PySBD for Sentence Segmentation

from ovos_document_chunkers import PySBDSentenceSplitter

config = {"lang": "en"}
splitter = PySBDSentenceSplitter(config)

text = "This is a sentence. This is another one!"
sentences = splitter.chunk(text)

for sentence in sentences:
    print(sentence)

File Formats

Supported File Formats

| Type     | Description                                                | Class Name                | Expected Input                        | File Extension |
|----------|------------------------------------------------------------|---------------------------|---------------------------------------|----------------|
| Markdown | Splits Markdown text into sentences or paragraphs          | MarkdownSentenceSplitter  | String (URL, path, or Markdown text)  | .md            |
| Markdown | Splits Markdown text into sentences or paragraphs          | MarkdownParagraphSplitter | String (URL, path, or Markdown text)  | .md            |
| HTML     | Splits HTML text into sentences or paragraphs              | HTMLSentenceSplitter      | String (URL, path, or HTML text)      | .html          |
| HTML     | Splits HTML text into sentences or paragraphs              | HTMLParagraphSplitter     | String (URL, path, or HTML text)      | .html          |
| PDF      | Splits PDF documents into sentences or paragraphs          | PDFSentenceSplitter       | String (URL or path to PDF file)      | .pdf           |
| PDF      | Splits PDF documents into sentences or paragraphs          | PDFParagraphSplitter      | String (URL or path to PDF file)      | .pdf           |
| doc      | Splits Microsoft .doc documents into sentences or paragraphs  | DOCSentenceSplitter    | String (URL or path to .doc file)     | .doc           |
| doc      | Splits Microsoft .doc documents into sentences or paragraphs  | DOCParagraphSplitter   | String (URL or path to .doc file)     | .doc           |
| docx     | Splits Microsoft .docx documents into sentences or paragraphs | DOCxSentenceSplitter   | String (URL or path to .docx file)    | .docx          |
| docx     | Splits Microsoft .docx documents into sentences or paragraphs | DOCxParagraphSplitter  | String (URL or path to .docx file)    | .docx          |
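One way to choose a splitter class from the table is to key off the file extension. A hypothetical dispatch helper — the mapping mirrors the table above, but the helper itself is not part of the library:

```python
from pathlib import Path

# Extension -> paragraph-splitter class name, per the table above.
PARAGRAPH_SPLITTERS = {
    ".md": "MarkdownParagraphSplitter",
    ".html": "HTMLParagraphSplitter",
    ".pdf": "PDFParagraphSplitter",
    ".doc": "DOCParagraphSplitter",
    ".docx": "DOCxParagraphSplitter",
}

def splitter_name_for(path: str) -> str:
    # Normalize the suffix so "Report.DOCX" resolves like "report.docx".
    ext = Path(path).suffix.lower()
    try:
        return PARAGRAPH_SPLITTERS[ext]
    except KeyError:
        raise ValueError(f"unsupported file extension: {ext!r}") from None

print(splitter_name_for("report.docx"))
# → DOCxParagraphSplitter
```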

Usage

Example using MarkdownSentenceSplitter

from ovos_document_chunkers.text.markdown import MarkdownSentenceSplitter
import requests

markdown_text = requests.get("https://github.com/OpenVoiceOS/ovos-core/raw/dev/README.md").text

sentence_splitter = MarkdownSentenceSplitter()
sentences = sentence_splitter.chunk(markdown_text)

print("Sentences:")
for sentence in sentences:
    print(sentence)

Example using MarkdownParagraphSplitter

from ovos_document_chunkers.text.markdown import MarkdownParagraphSplitter
import requests

markdown_text = requests.get("https://github.com/OpenVoiceOS/ovos-core/raw/dev/README.md").text

paragraph_splitter = MarkdownParagraphSplitter()
paragraphs = paragraph_splitter.chunk(markdown_text)

print("\nParagraphs:")
for paragraph in paragraphs:
    print(paragraph)

Example using HTMLSentenceSplitter

from ovos_document_chunkers import HTMLSentenceSplitter
import requests

html_text = requests.get("https://www.gofundme.com/f/openvoiceos").text

sentence_splitter = HTMLSentenceSplitter()
sentences = sentence_splitter.chunk(html_text)

print("Sentences:")
for sentence in sentences:
    print(sentence)

Example using HTMLParagraphSplitter

from ovos_document_chunkers import HTMLParagraphSplitter
import requests

html_text = requests.get("https://www.gofundme.com/f/openvoiceos").text

paragraph_splitter = HTMLParagraphSplitter()
paragraphs = paragraph_splitter.chunk(html_text)

print("\nParagraphs:")
for paragraph in paragraphs:
    print(paragraph)

Example using PDFParagraphSplitter

from ovos_document_chunkers import PDFParagraphSplitter

pdf_path = "/path/to/your/pdf/document.pdf"

paragraph_splitter = PDFParagraphSplitter()
paragraphs = paragraph_splitter.chunk(pdf_path)

print("\nParagraphs:")
for paragraph in paragraphs:
    print(paragraph)
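The PDF and Word splitters accept either a URL or a local path. A sketch of how such input might be told apart — this is a hypothetical helper for illustration, not part of the library's API:

```python
from urllib.parse import urlparse

def is_url(source: str) -> bool:
    # Treat strings with an http(s) scheme and a network location as URLs;
    # everything else is assumed to be a local file path.
    parsed = urlparse(source)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

print(is_url("https://example.com/doc.pdf"))           # → True (hypothetical URL)
print(is_url("/path/to/your/pdf/document.pdf"))        # → False
```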

Credits


This work was sponsored by VisioLab, part of Royal Dutch Visio, a test, education, and research center in the field of (innovative) assistive technology for blind and visually impaired people and professionals. VisioLab explores (new) technological developments such as Voice, VR, and AI, and makes the knowledge and expertise it gains available to everyone.

Download files

Source distribution: ovos_document_chunkers-0.1.2a2.tar.gz (15.0 kB)
  • Uploaded via twine/6.2.0 (CPython/3.9.25); Trusted Publishing: No
  • SHA256: f8aad8d24909b91dcddd083fc1b2c3a9d4d300f50150499aec695dc449583fd6
  • MD5: eaa227fbc2f2d280468c986e3068d15a
  • BLAKE2b-256: c8a373ee6f31f9dc64cfbc35f7e6dc7f54ce4ff7d174bb08211a50c0d9691813

Built distribution: ovos_document_chunkers-0.1.2a2-py3-none-any.whl (19.7 kB, Python 3 wheel)
  • SHA256: b6dd0ce224ddad8ee88f4f4d4135815e7da012a30d47e94aeaf597d10ffa58ba
  • MD5: 95a6ea44b84385e747ec79d1df5d7efd
  • BLAKE2b-256: f9e67ceec37e35c5c548a2aa3973b33ee28e3ea1179e6ed7b96b5196478f0dc3
