
A plugin for OVOS


Document Chunkers

A collection of helpers to process raw documents

Overview

This library provides tools for chunking documents into manageable pieces such as paragraphs and sentences. It's particularly useful for preprocessing text data for natural language processing (NLP) tasks.
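To illustrate what "chunking" produces, here is a minimal, library-independent sketch: paragraphs split on blank lines, sentences on terminal punctuation. This is not how the library itself segments text (it uses trained models and dedicated parsers, as shown below); it only shows the kind of output to expect.

```python
import re

def naive_paragraphs(text: str) -> list:
    # A paragraph is a run of text separated by one or more blank lines
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def naive_sentences(text: str) -> list:
    # Split after ., ! or ? followed by whitespace; trained models handle
    # abbreviations and edge cases that this regex does not
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

doc = "First paragraph. It has two sentences.\n\nSecond paragraph."
print(naive_paragraphs(doc))
print(naive_sentences("First paragraph. It has two sentences."))
```

A real splitter replaces these regexes with a model or parser, but the contract is the same: one string in, a list of chunks out.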

Text Segmenters


Usage

Example: Using SaT (Segment any Text) for Sentence Segmentation

from ovos_document_chunkers import SaTSentenceSplitter

# model name and whether to run inference on a CUDA GPU
config = {"model": "sat-3l-sm", "use_cuda": False}
splitter = SaTSentenceSplitter(config)

text = "This is a sentence. And this is another one."
sentences = splitter.chunk(text)

for sentence in sentences:
    print(sentence)

Example: Using WtP (Where's the Point) for Paragraph Segmentation

from ovos_document_chunkers import WtPParagraphSplitter

config = {"model": "wtp-bert-mini", "use_cuda": False}
splitter = WtPParagraphSplitter(config)

text = "This is a paragraph. It contains multiple sentences.\n\nThis is another paragraph."
paragraphs = splitter.chunk(text)

for paragraph in paragraphs:
    print(paragraph)

Example: Using PySBD for Sentence Segmentation

from ovos_document_chunkers import PySBDSentenceSplitter

config = {"lang": "en"}
splitter = PySBDSentenceSplitter(config)

text = "This is a sentence. This is another one!"
sentences = splitter.chunk(text)

for sentence in sentences:
    print(sentence)

File Formats

Supported File Formats

Markdown (.md): splits Markdown text into sentences or paragraphs
  • MarkdownSentenceSplitter (input: URL, path, or Markdown text)
  • MarkdownParagraphSplitter (input: URL, path, or Markdown text)

HTML (.html): splits HTML text into sentences or paragraphs
  • HTMLSentenceSplitter (input: URL, path, or HTML text)
  • HTMLParagraphSplitter (input: URL, path, or HTML text)

PDF (.pdf): splits PDF documents into sentences or paragraphs
  • PDFSentenceSplitter (input: URL or path to a PDF file)
  • PDFParagraphSplitter (input: URL or path to a PDF file)

DOC (.doc): splits Microsoft Word .doc documents into sentences or paragraphs
  • DOCSentenceSplitter (input: URL or path to a .doc file)
  • DOCParagraphSplitter (input: URL or path to a .doc file)

DOCX (.docx): splits Microsoft Word .docx documents into sentences or paragraphs
  • DOCxSentenceSplitter (input: URL or path to a .docx file)
  • DOCxParagraphSplitter (input: URL or path to a .docx file)
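Because every splitter above exposes the same chunk() interface, a caller can select a class from the file extension. The dispatch helper below is an illustrative sketch, not a helper shipped by ovos_document_chunkers; only the class names are taken from the table above.

```python
from pathlib import Path

# (sentence, paragraph) splitter class names per extension, as documented above
SPLITTERS = {
    ".md": ("MarkdownSentenceSplitter", "MarkdownParagraphSplitter"),
    ".html": ("HTMLSentenceSplitter", "HTMLParagraphSplitter"),
    ".pdf": ("PDFSentenceSplitter", "PDFParagraphSplitter"),
    ".doc": ("DOCSentenceSplitter", "DOCParagraphSplitter"),
    ".docx": ("DOCxSentenceSplitter", "DOCxParagraphSplitter"),
}

def splitter_for(path: str, unit: str = "sentence") -> str:
    # Look up the splitter class name for a file's extension
    ext = Path(path).suffix.lower()
    if ext not in SPLITTERS:
        raise ValueError(f"No splitter registered for {ext!r}")
    sentence_cls, paragraph_cls = SPLITTERS[ext]
    return sentence_cls if unit == "sentence" else paragraph_cls

print(splitter_for("report.docx"))            # DOCxSentenceSplitter
print(splitter_for("notes.md", "paragraph"))  # MarkdownParagraphSplitter
```

In real code you would import the resolved class and call its chunk() method, as in the examples below.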

Usage

Example using MarkdownSentenceSplitter

from ovos_document_chunkers.text.markdown import MarkdownSentenceSplitter
import requests

markdown_text = requests.get("https://github.com/OpenVoiceOS/ovos-core/raw/dev/README.md").text

sentence_splitter = MarkdownSentenceSplitter()
sentences = sentence_splitter.chunk(markdown_text)

print("Sentences:")
for sentence in sentences:
    print(sentence)

Example using MarkdownParagraphSplitter

from ovos_document_chunkers.text.markdown import MarkdownParagraphSplitter
import requests

markdown_text = requests.get("https://github.com/OpenVoiceOS/ovos-core/raw/dev/README.md").text

paragraph_splitter = MarkdownParagraphSplitter()
paragraphs = paragraph_splitter.chunk(markdown_text)

print("\nParagraphs:")
for paragraph in paragraphs:
    print(paragraph)

Example using HTMLSentenceSplitter

from ovos_document_chunkers import HTMLSentenceSplitter
import requests

html_text = requests.get("https://www.gofundme.com/f/openvoiceos").text

sentence_splitter = HTMLSentenceSplitter()
sentences = sentence_splitter.chunk(html_text)

print("Sentences:")
for sentence in sentences:
    print(sentence)

Example using HTMLParagraphSplitter

from ovos_document_chunkers import HTMLParagraphSplitter
import requests

html_text = requests.get("https://www.gofundme.com/f/openvoiceos").text

paragraph_splitter = HTMLParagraphSplitter()
paragraphs = paragraph_splitter.chunk(html_text)

print("\nParagraphs:")
for paragraph in paragraphs:
    print(paragraph)

Example using PDFParagraphSplitter

from ovos_document_chunkers import PDFParagraphSplitter

pdf_path = "/path/to/your/pdf/document.pdf"

paragraph_splitter = PDFParagraphSplitter()
paragraphs = paragraph_splitter.chunk(pdf_path)

print("\nParagraphs:")
for paragraph in paragraphs:
    print(paragraph)

Credits


This work was sponsored by VisioLab, part of Royal Dutch Visio, the test, education, and research center for (innovative) assistive technology for blind and visually impaired people and professionals. VisioLab explores (new) technological developments such as Voice, VR, and AI, and makes the knowledge and expertise it gains available to everyone.

