A plugin for OVOS

Document Chunkers

A collection of helpers to process raw documents

Overview

This library provides tools for chunking documents into manageable pieces such as paragraphs and sentences. It's particularly useful for preprocessing text data for natural language processing (NLP) tasks.
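To illustrate what chunking means in practice (this is a naive standard-library sketch, not this library's implementation), a document can be split into paragraphs on blank lines and into sentences on terminal punctuation:

```python
import re

def split_paragraphs(text: str) -> list[str]:
    # Paragraphs are separated by one or more blank lines.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(paragraph: str) -> list[str]:
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]

doc = "First sentence. Second sentence!\n\nA new paragraph?"
print(split_paragraphs(doc))
print(split_sentences("First sentence. Second sentence!"))
```

Real segmenters (such as the SaT, WtP, and pySBD backends below) handle abbreviations, missing punctuation, and multiple languages, which this naive rule cannot.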

Text Segmenters

Usage

Example: Using SaT for Sentence Segmentation

from ovos_document_chunkers import SaTSentenceSplitter

# "sat-3l-sm" is a compact SaT (Segment any Text) model; set "use_cuda" to True to run on GPU
config = {"model": "sat-3l-sm", "use_cuda": False}
splitter = SaTSentenceSplitter(config)

text = "This is a sentence. And this is another one."
sentences = splitter.chunk(text)

for sentence in sentences:
    print(sentence)

Example: Using WtP for Paragraph Segmentation

from ovos_document_chunkers import WtPParagraphSplitter

# "wtp-bert-mini" is a lightweight WtP ("Where's the Point") segmentation model
config = {"model": "wtp-bert-mini", "use_cuda": False}
splitter = WtPParagraphSplitter(config)

text = "This is a paragraph. It contains multiple sentences.\n\nThis is another paragraph."
paragraphs = splitter.chunk(text)

for paragraph in paragraphs:
    print(paragraph)

Example: Using PySBD for Sentence Segmentation

from ovos_document_chunkers import PySBDSentenceSplitter

config = {"lang": "en"}
splitter = PySBDSentenceSplitter(config)

text = "This is a sentence. This is another one!"
sentences = splitter.chunk(text)

for sentence in sentences:
    print(sentence)
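All three splitters above are called the same way: construct with a config dict, then pass text to chunk() and get back a list of strings. That shared shape means post-processing can be written once. A small helper (not part of the library, just a sketch) that trims whitespace and drops empty chunks from any splitter's output:

```python
def clean_chunks(chunks: list[str]) -> list[str]:
    # Strip surrounding whitespace and discard empty or whitespace-only chunks.
    return [c.strip() for c in chunks if c and c.strip()]

print(clean_chunks(["  One. ", "", "Two!  "]))  # → ['One.', 'Two!']
```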

File Formats

Supported File Formats

| Type | Description | Class Name | Expected Input | File Extension |
|------|-------------|------------|----------------|----------------|
| Markdown | Splits Markdown text into sentences or paragraphs | MarkdownSentenceSplitter | String (URL, path, or Markdown text) | .md |
| | | MarkdownParagraphSplitter | String (URL, path, or Markdown text) | .md |
| HTML | Splits HTML text into sentences or paragraphs | HTMLSentenceSplitter | String (URL, path, or HTML text) | .html |
| | | HTMLParagraphSplitter | String (URL, path, or HTML text) | .html |
| PDF | Splits PDF documents into sentences or paragraphs | PDFSentenceSplitter | String (URL or path to PDF file) | .pdf |
| | | PDFParagraphSplitter | String (URL or path to PDF file) | .pdf |
| DOC | Splits Microsoft DOC documents into sentences or paragraphs | DOCSentenceSplitter | String (URL or path to DOC file) | .doc |
| | | DOCParagraphSplitter | String (URL or path to DOC file) | .doc |
| DOCX | Splits Microsoft DOCX documents into sentences or paragraphs | DOCxSentenceSplitter | String (URL or path to DOCX file) | .docx |
| | | DOCxParagraphSplitter | String (URL or path to DOCX file) | .docx |
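Since each file type maps to one splitter class, picking a splitter by extension is a simple lookup. A hypothetical dispatch helper (not part of the library; the class names are taken from the table above but represented as strings so the sketch stays self-contained):

```python
import os

# Sentence-splitter class names from the table, keyed by file extension.
SENTENCE_SPLITTERS = {
    ".md": "MarkdownSentenceSplitter",
    ".html": "HTMLSentenceSplitter",
    ".pdf": "PDFSentenceSplitter",
    ".doc": "DOCSentenceSplitter",
    ".docx": "DOCxSentenceSplitter",
}

def splitter_for(path: str) -> str:
    # Normalize the extension so "report.PDF" and "report.pdf" resolve the same way.
    ext = os.path.splitext(path)[1].lower()
    try:
        return SENTENCE_SPLITTERS[ext]
    except KeyError:
        raise ValueError(f"unsupported file type: {ext!r}")

print(splitter_for("report.PDF"))  # → PDFSentenceSplitter
```

In real use you would instantiate the resolved class from ovos_document_chunkers instead of returning its name.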

Usage

Example using MarkdownSentenceSplitter

from ovos_document_chunkers.text.markdown import MarkdownSentenceSplitter
import requests

markdown_text = requests.get("https://github.com/OpenVoiceOS/ovos-core/raw/dev/README.md").text

sentence_splitter = MarkdownSentenceSplitter()
sentences = sentence_splitter.chunk(markdown_text)

print("Sentences:")
for sentence in sentences:
    print(sentence)

Example using MarkdownParagraphSplitter

from ovos_document_chunkers.text.markdown import MarkdownParagraphSplitter
import requests

markdown_text = requests.get("https://github.com/OpenVoiceOS/ovos-core/raw/dev/README.md").text

paragraph_splitter = MarkdownParagraphSplitter()
paragraphs = paragraph_splitter.chunk(markdown_text)

print("\nParagraphs:")
for paragraph in paragraphs:
    print(paragraph)

Example using HTMLSentenceSplitter

from ovos_document_chunkers import HTMLSentenceSplitter
import requests

html_text = requests.get("https://www.gofundme.com/f/openvoiceos").text

sentence_splitter = HTMLSentenceSplitter()
sentences = sentence_splitter.chunk(html_text)

print("Sentences:")
for sentence in sentences:
    print(sentence)

Example using HTMLParagraphSplitter

from ovos_document_chunkers import HTMLParagraphSplitter
import requests

html_text = requests.get("https://www.gofundme.com/f/openvoiceos").text

paragraph_splitter = HTMLParagraphSplitter()
paragraphs = paragraph_splitter.chunk(html_text)

print("\nParagraphs:")
for paragraph in paragraphs:
    print(paragraph)

Example using PDFParagraphSplitter

from ovos_document_chunkers import PDFParagraphSplitter

pdf_path = "/path/to/your/pdf/document.pdf"

paragraph_splitter = PDFParagraphSplitter()
paragraphs = paragraph_splitter.chunk(pdf_path)

print("\nParagraphs:")
for paragraph in paragraphs:
    print(paragraph)

Credits

This work was sponsored by VisioLab, part of Royal Dutch Visio. VisioLab is the test, education, and research center in the field of (innovative) assistive technology for blind and visually impaired people and professionals. "We explore (new) technological developments such as Voice, VR, and AI and make the knowledge and expertise we gain available to everyone."