
A plugin for OVOS


Document Chunkers

A collection of helpers to process raw documents

Overview

This library provides tools for chunking documents into manageable pieces such as paragraphs and sentences. It's particularly useful for preprocessing text data for natural language processing (NLP) tasks.

Text Segmenters


Usage

Example: Using SaT for Sentence Segmentation

from ovos_document_chunkers import SaTSentenceSplitter

config = {"model": "sat-3l-sm", "use_cuda": False}
splitter = SaTSentenceSplitter(config)

text = "This is a sentence. And this is another one."
sentences = splitter.chunk(text)

for sentence in sentences:
    print(sentence)
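Model-based segmenters like SaT exist because simple punctuation rules break on abbreviations and similar cases. A quick plain-Python illustration of that failure mode (not part of this library):

```python
import re

# Naive rule: a sentence ends at a period followed by whitespace.
text = "Dr. Smith arrived at 5 p.m. He was late."
naive = re.split(r"(?<=\.)\s+", text)
print(naive)  # abbreviations produce spurious "sentences"
```

The naive split yields three chunks instead of the correct two, cutting after "Dr." — exactly the kind of boundary a trained model handles.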

Example: Using WtP for Paragraph Segmentation

from ovos_document_chunkers import WtPParagraphSplitter

config = {"model": "wtp-bert-mini", "use_cuda": False}
splitter = WtPParagraphSplitter(config)

text = "This is a paragraph. It contains multiple sentences.\n\nThis is another paragraph."
paragraphs = splitter.chunk(text)

for paragraph in paragraphs:
    print(paragraph)
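For comparison, the simplest paragraph heuristic is splitting on blank lines; a model-based splitter like WtP aims to recover paragraph boundaries even when such markers are missing or unreliable. A plain-Python baseline (illustration only, not the library's implementation):

```python
# Blank-line heuristic: paragraphs are separated by "\n\n".
text = ("This is a paragraph. It contains multiple sentences.\n\n"
        "This is another paragraph.")

paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
print(len(paragraphs))  # 2
```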

Example: Using PySBD for Sentence Segmentation

from ovos_document_chunkers import PySBDSentenceSplitter

config = {"lang": "en"}
splitter = PySBDSentenceSplitter(config)

text = "This is a sentence. This is another one!"
sentences = splitter.chunk(text)

for sentence in sentences:
    print(sentence)
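pysbd is a rule-based segmenter that covers many edge cases (abbreviations, decimals, ellipses) that a bare regex does not. The regex below is only a rough stand-in to show the basic idea of boundary detection on terminal punctuation:

```python
import re

# Rough approximation: split after ., !, or ? followed by whitespace.
text = "This is a sentence. This is another one!"
sentences = re.split(r"(?<=[.!?])\s+", text)
for s in sentences:
    print(s)
```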

File Formats

Supported File Formats

| Type | Description | Class Name | Expected Input | File Extension |
|------|-------------|------------|----------------|----------------|
| Markdown | Splits Markdown text into sentences or paragraphs | MarkdownSentenceSplitter | String (URL, path, or Markdown text) | .md |
| | | MarkdownParagraphSplitter | String (URL, path, or Markdown text) | .md |
| HTML | Splits HTML text into sentences or paragraphs | HTMLSentenceSplitter | String (URL, path, or HTML text) | .html |
| | | HTMLParagraphSplitter | String (URL, path, or HTML text) | .html |
| PDF | Splits PDF documents into sentences or paragraphs | PDFSentenceSplitter | String (URL or path to PDF file) | .pdf |
| | | PDFParagraphSplitter | String (URL or path to PDF file) | .pdf |
| doc | Splits Microsoft doc documents into sentences or paragraphs | DOCSentenceSplitter | String (URL or path to doc file) | .doc |
| | | DOCParagraphSplitter | String (URL or path to doc file) | .doc |
| docx | Splits Microsoft docx documents into sentences or paragraphs | DOCxSentenceSplitter | String (URL or path to docx file) | .docx |
| | | DOCxParagraphSplitter | String (URL or path to docx file) | .docx |
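Given the file-extension column above, picking the right splitter for a path can be reduced to a lookup. `pick_splitter` below is a hypothetical helper sketched for illustration; it is not part of the library:

```python
from pathlib import Path

# Map file extensions to the sentence-splitter class names listed above.
SPLITTERS = {
    ".md": "MarkdownSentenceSplitter",
    ".html": "HTMLSentenceSplitter",
    ".pdf": "PDFSentenceSplitter",
    ".doc": "DOCSentenceSplitter",
    ".docx": "DOCxSentenceSplitter",
}

def pick_splitter(path: str) -> str:
    """Return the splitter class name for a file path (hypothetical helper)."""
    ext = Path(path).suffix.lower()
    return SPLITTERS[ext]

print(pick_splitter("report.docx"))  # DOCxSentenceSplitter
```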

Usage

Example using MarkdownSentenceSplitter

from ovos_document_chunkers.text.markdown import MarkdownSentenceSplitter
import requests

markdown_text = requests.get("https://github.com/OpenVoiceOS/ovos-core/raw/dev/README.md").text

sentence_splitter = MarkdownSentenceSplitter()
sentences = sentence_splitter.chunk(markdown_text)

print("Sentences:")
for sentence in sentences:
    print(sentence)
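A Markdown-aware splitter has to strip formatting markers before segmenting, so that "# Title" or "**bold**" do not leak into the sentences. A minimal regex sketch of that normalization step (illustration only; the library's actual implementation may differ):

```python
import re

markdown = "# Title\n\nThis is **bold** text with a [link](https://example.com)."

# Rough cleanup: drop heading markers, bold markers, and link targets.
text = re.sub(r"^#+\s*", "", markdown, flags=re.MULTILINE)
text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)
text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)
print(text)
```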

Example using MarkdownParagraphSplitter

from ovos_document_chunkers.text.markdown import MarkdownParagraphSplitter
import requests

markdown_text = requests.get("https://github.com/OpenVoiceOS/ovos-core/raw/dev/README.md").text

paragraph_splitter = MarkdownParagraphSplitter()
paragraphs = paragraph_splitter.chunk(markdown_text)

print("\nParagraphs:")
for paragraph in paragraphs:
    print(paragraph)

Example using HTMLSentenceSplitter

from ovos_document_chunkers import HTMLSentenceSplitter
import requests

html_text = requests.get("https://www.gofundme.com/f/openvoiceos").text

sentence_splitter = HTMLSentenceSplitter()
sentences = sentence_splitter.chunk(html_text)

print("Sentences:")
for sentence in sentences:
    print(sentence)

Example using HTMLParagraphSplitter

from ovos_document_chunkers import HTMLParagraphSplitter
import requests

html_text = requests.get("https://www.gofundme.com/f/openvoiceos").text

paragraph_splitter = HTMLParagraphSplitter()
paragraphs = paragraph_splitter.chunk(html_text)

print("\nParagraphs:")
for paragraph in paragraphs:
    print(paragraph)
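Before any sentence or paragraph splitting can happen on HTML, the visible text must be extracted from the markup, skipping script and style contents. A minimal stdlib sketch with `html.parser` (illustration; the library may use a different parser):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

html = "<html><body><h1>Hello</h1><script>var x=1;</script><p>World.</p></body></html>"
parser = TextExtractor()
parser.feed(html)
print(" ".join(parser.parts))  # Hello World.
```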

Example using PDFParagraphSplitter

from ovos_document_chunkers import PDFParagraphSplitter

pdf_path = "/path/to/your/pdf/document.pdf"

paragraph_splitter = PDFParagraphSplitter()
paragraphs = paragraph_splitter.chunk(pdf_path)

print("\nParagraphs:")
for paragraph in paragraphs:
    print(paragraph)

Credits


This work was sponsored by VisioLab, part of Royal Dutch Visio, the test, education, and research center in the field of (innovative) assistive technology for blind and visually impaired people and professionals. We explore (new) technological developments such as Voice, VR, and AI, and make the knowledge and expertise we gain available to everyone.

