
split_markdown4gpt

split_markdown4gpt is a Python tool designed to split large Markdown files into smaller sections based on a specified token limit. This is particularly useful for processing large Markdown files with GPT models, as it allows the models to handle the data in manageable chunks.

Version 1.0.1 (2023-06-18)

Installation

You can install split_markdown4gpt via pip:

pip install split_markdown4gpt

CLI usage

After installation, you can use the mdsplit4gpt command to split a Markdown file. Here's the basic syntax:

mdsplit4gpt path_to_your_file.md --model gpt-3.5-turbo --limit 4096 --separator "=== SPLIT ==="

This command splits the Markdown file at path_to_your_file.md into sections of no more than 4096 tokens each (as counted by the gpt-3.5-turbo tokenizer), with === SPLIT === inserted between sections, as shown below.
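If you capture the tool's output in a file (the name split_output.md below is just an example; the README does not prescribe an output location), you can recover the individual sections by splitting on the separator:

# Hypothetical post-processing step, assuming the split output was saved to
# "split_output.md". The file name and this workflow are illustrative only.
with open("split_output.md", encoding="utf-8") as f:
    sections = [s.strip() for s in f.read().split("=== SPLIT ===") if s.strip()]
print(f"Recovered {len(sections)} sections")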

All CLI options:

NAME
    mdsplit4gpt

SYNOPSIS
    mdsplit4gpt MD_PATH <flags>

POSITIONAL ARGUMENTS
    MD_PATH
        Type: Union

FLAGS
    -m, --model=MODEL
        Type: str
        Default: 'gpt-3.5-turbo'
    -l, --limit=LIMIT
        Type: Optional[int]
        Default: None
    -s, --separator=SEPARATOR
        Type: str
        Default: '=== SPLIT ==='

Python usage

You can also use split_markdown4gpt in your Python code. Here's a basic example:

from split_markdown4gpt import split

sections = split("path_to_your_file.md", model="gpt-3.5-turbo", limit=4096)
for section in sections:
    print(section)

This code does the same thing as the CLI command above, but in Python.
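As a quick sanity check, you can count the tokens in each returned section with tiktoken (one of the tool's dependencies). This is a minimal sketch that assumes each section is a plain string; it is not part of the split_markdown4gpt API:

import tiktoken
from split_markdown4gpt import split

# Count tokens per section with the tokenizer for the same model.
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
sections = split("path_to_your_file.md", model="gpt-3.5-turbo", limit=4096)
for i, section in enumerate(sections, start=1):
    print(f"Section {i}: {len(enc.encode(section))} tokens")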

How it works

split_markdown4gpt works by tokenizing the input Markdown file using the specified GPT model's tokenizer (default is gpt-3.5-turbo). It then splits the file into sections, each containing no more than the specified token limit.

The splitting process respects the structure of the Markdown file. It will not split a section (as defined by Markdown headings) across multiple output sections unless the section is longer than the token limit. In that case, it will split the section at the sentence level.
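As a rough illustration of that strategy, here is a simplified sketch (not the library's actual implementation) that greedily packs whole blocks of text into a section until the next block would exceed the token limit:

import tiktoken

def naive_split(text: str, model: str = "gpt-3.5-turbo", limit: int = 4096) -> list[str]:
    # Greedy packing sketch: keep adding whole blocks until the limit is hit,
    # then start a new section. The real tool walks a Markdown syntax tree
    # (headings, paragraphs, lists); blank-line-separated chunks stand in here.
    enc = tiktoken.encoding_for_model(model)
    sections, current, used = [], [], 0
    for block in text.split("\n\n"):
        tokens = len(enc.encode(block))
        if current and used + tokens > limit:
            sections.append("\n\n".join(current))
            current, used = [], 0
        current.append(block)
        used += tokens
    if current:
        sections.append("\n\n".join(current))
    return sections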

The tool uses several libraries to accomplish this:

  • tiktoken for tokenizing the text according to the GPT model's rules.
  • fire for creating the CLI.
  • frontmatter for parsing the Markdown file's front matter (metadata at the start of the file).
  • mistletoe for parsing the Markdown file into a syntax tree.
  • syntok for splitting the text into sentences (see the sentence-splitting sketch after this list).
  • regex and PyYAML for various utility functions.
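For that sentence-level fallback, syntok's segmenter can be used along these lines (a minimal illustration of syntok's documented API, not code taken from split_markdown4gpt):

from syntok import segmenter

text = "This section is too long for one chunk. It is split at sentence boundaries. Each sentence can then be packed separately."
for paragraph in segmenter.process(text):
    for sentence in paragraph:
        # Each sentence is a list of tokens; rejoin them into plain text.
        print("".join(token.spacing + token.value for token in sentence).strip())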

Contributing

Contributions to split_markdown4gpt are welcome! Please open an issue or submit a pull request on the GitHub repository.

License

  • Copyright (c) 2023 Adam Twardoch
  • Written with assistance from ChatGPT
  • Licensed under the Apache License 2.0

Download files

Download the file for your platform. If you're not sure which to choose, see the Python packaging documentation on installing packages.

Source Distribution

split_markdown4gpt-1.0.1.tar.gz (16.8 kB, source)

Built Distribution

split_markdown4gpt-1.0.1-py3-none-any.whl (10.7 kB, Python 3 wheel)

File details

Details for the file split_markdown4gpt-1.0.1.tar.gz.

File metadata

  • Download URL: split_markdown4gpt-1.0.1.tar.gz
  • Size: 16.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.11.4

File hashes

Hashes for split_markdown4gpt-1.0.1.tar.gz:

  • SHA256: 20466e618c6417c8938ea0968962e942e753790859256ffc3ef89296d4782336
  • MD5: a021f153915032dbf0b729261f359ca8
  • BLAKE2b-256: cdd8d285c3cc1be1d5def8cd47ca41ee945324b0f2753cb8f4c6ffe22b0dbac9


File details

Details for the file split_markdown4gpt-1.0.1-py3-none-any.whl.

File metadata

File hashes

Hashes for split_markdown4gpt-1.0.1-py3-none-any.whl:

  • SHA256: 2b0e7fb322869c3529a51d5918756f307395515d921f3dbfc18d25741cc81093
  • MD5: 4635e6d5b6f697a6e98895feed17b3da
  • BLAKE2b-256: e3e74bb69951ce1e5484a8ff612955cb75f7393faaf6a631d5f187e359a19602

