Tixent: A Text Splitting Tool


Installation

pip install tixent

Example

Suppose we have a function, template, that builds a single string from a list of texts, and a large list of texts to pass to it. Applying the function to the whole list produces one very long string.

Tixent splits the output of the template function into several strings so that the value counter returns for each one does not exceed a given limit (max_count in the example below).

Here, counter is a function that maps a string to an integer. Examples are len, which measures the length of a string in characters, and tiktoken_counter("text-davinci-003"), which measures the number of tokens in a string.
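Since a counter is just a function from a string to an integer, other measures can be written by hand. The sketch below shows two such functions. The names CountFn, word_counter and utf8_byte_counter are hypothetical and are not part of Tixent; it is an assumption that split accepts any callable of this shape.

```python
from typing import Callable

# Shape of a counter as described above: string in, integer out.
CountFn = Callable[[str], int]


def word_counter(text: str) -> int:
    """Count whitespace-separated words (hypothetical helper, not from Tixent)."""
    return len(text.split())


def utf8_byte_counter(text: str) -> int:
    """Count the bytes of the UTF-8 encoding (hypothetical helper, not from Tixent)."""
    return len(text.encode("utf-8"))


sample = "na\u00efve caf\u00e9"  # two words, two non-ASCII characters
counters: dict[str, CountFn] = {
    "len": len,
    "word_counter": word_counter,
    "utf8_byte_counter": utf8_byte_counter,
}
for name, count_fn in counters.items():
    print(f"{name}: {count_fn(sample)}")
```

The three counters give different values for the same string, so the limit passed as max_count only makes sense relative to the counter it is paired with.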

from typing import List

from tixent import split, tiktoken_counter


def summarization_template(texts: List[str]) -> str:
    text = " ".join(texts)
    t = "Summarize the following text.\n"
    t += f'Text: """{text}"""'
    return t


texts = [
    "Lorem ipsum dolor sit amet",
    "consectetur adipiscing elit",
    "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua",
    "Ut enim ad minim veniam",
    "quis nostrud exercitation ullamco laboris nisi",
    "ut aliquip ex ea commodo consequat",
    "Duis aute irure dolor in reprehenderit in voluptate velit",
    "esse cillum dolore eu fugiat nulla pariatur",
    "Excepteur sint occaecat cupidatat non proident",
    "sunt in culpa qui officia deserunt mollit anim id est laborum",
]
counter = tiktoken_counter("text-davinci-003")
max_count = 60

split_texts = split(texts, summarization_template, counter, max_count)
for text in split_texts:
    count = counter(text)
    assert count <= max_count
    print(f"count: {count}")
    print(text)
    print()

Output:

count: 60
Summarize the following text.
Text: """Lorem ipsum dolor sit amet consectetur adipiscing elit sed do eiusmod tempor incididunt ut labore et dolore magna aliqua Ut enim ad minim veniam"""

count: 58
Summarize the following text.
Text: """quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat Duis aute irure dolor in reprehenderit in voluptate velit"""

count: 43
Summarize the following text.
Text: """esse cillum dolore eu fugiat nulla pariatur Excepteur sint occaecat cupidatat non proident"""

count: 31
Summarize the following text.
Text: """sunt in culpa qui officia deserunt mollit anim id est laborum"""
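To make the idea concrete, here is a minimal sketch of one way such a splitter could work: greedily add texts to the current group until the templated string would exceed the limit, then start a new group. The function greedy_split and its error handling are assumptions made for illustration. This is not Tixent's actual implementation, and it is not guaranteed to reproduce the grouping shown above.

```python
from typing import Callable, List


def greedy_split(
    texts: List[str],
    template: Callable[[List[str]], str],
    counter: Callable[[str], int],
    max_count: int,
) -> List[str]:
    """Hypothetical greedy splitter; a sketch, not Tixent's implementation."""
    results: List[str] = []
    batch: List[str] = []
    for text in texts:
        # A text that does not fit even on its own can never be placed.
        if counter(template([text])) > max_count:
            raise ValueError("a single text exceeds max_count after templating")
        # Close the current group if adding this text would exceed the limit.
        if batch and counter(template(batch + [text])) > max_count:
            results.append(template(batch))
            batch = []
        batch.append(text)
    if batch:
        results.append(template(batch))
    return results


def demo_template(parts: List[str]) -> str:
    # Tiny stand-in for a prompt template.
    return "Text: " + " ".join(parts)


chunks = greedy_split(["aaa", "bbb", "ccc", "ddd"], demo_template, len, 14)
for chunk in chunks:
    print(len(chunk), chunk)
```

Note that the limit is checked on the templated string rather than on the raw texts, so the fixed overhead of the template counts toward every group.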
