Skip to main content

A package for working with data using NLP and ML.

Project description

structml Logo

PyPI PyPI - Python Version PyPI - Downloads

structml

Not ready for production use

A package for working with data using NLP and ML.

Features

  • The primary feature of this package is the line_heal module, which provides functions to reconstruct the original continuous text from a string that has been broken into multiple lines.
  • Parallel processing for processing multiple strings at once.
  • CPU and CUDA support.

Installation

Requirements: Python >=3.9 and Pytorch >=2.0.

To install the package from PyPI, run the following command:

pip install structml

To install the package from source, run the following command:

git clone https://github.com/PocketDocLabs/structml.git
cd structml
pip install .

Usage

line_heal

The line_heal module provides functions to reconstruct the original continuous text from a string that has been broken into multiple lines. This is done by choosing the most probable way to join lines or paragraphs, and remove hyphens if appropriate. The best fitting option is chosen based on the perplexity of the text, using a language model.

Functions

parse

parse is the main function of the line_heal module. It takes a string of text and returns the healed string.

def parse(string: str, verbose: bool = False, cuda: bool = True) -> str:
    """
    Heal a broken line of text.

    Args:
        string (str): The broken string of text.
        verbose (bool): Whether to print debug information.
        cuda (bool): Whether to use the GPU for processing.

    Returns:
        str: The healed line of text.
    """
Example
from structml import line_heal

line = "This line is broken into mul-\ntiple lines as is common in data\nthat originates from a structured\nsource.\nThe goal of this function is to\nheal the line in a way flows\nnaturally and is easy to read."

healed_line = line_heal.parse(line)

print(healed_line)
Output
This line is broken into multiple lines as is common in data that originates from a structured source.

The goal of this function is to heal the line in a way flows naturally and is easy to read.

parse_list

Note: This function uses large amounts of memory. parse_list is a function that takes a list of strings and returns a list of healed strings. Uses multiple processes to speed up the healing process.

def parse_list(strings: List[str], verbose: bool = False, cuda: bool = True) -> List[str]:
    """
    Heal a list of broken lines of text.

    Args:
        strings (List[str]): The list of broken strings of text.
        verbose (bool): Whether to print debug information.
        cuda (bool): Whether to use the GPU for processing.

    Returns:
        List[str]: The list of healed lines of text.
    """
Example
from structml import line_heal

lines = [
    "This line is broken into mul-\ntiple lines as is common in data\nthat originates from a structured\nsource.\nThe goal of this function is to\nheal the line in a way flows\nnaturally and is easy to read.",
    "Another goal of this function is to\ndo so in a way that is efficient\nand scalable."
]

healed_lines = line_heal.parse_list(lines)

for line in healed_lines:
    print(line)
Output
This line is broken into multiple lines as is common in data that originates from a structured source.

The goal of this function is to heal the line in a way flows naturally and is easy to read.
Another goal of this function is to do so in a way that is efficient and scalable.

Credits

  • Thank you to Sebastian Gabarain for the model that Dans-StructureEvaluator is based on.
  • This project was largely inspired by this project from which the method for healing lines was adapted.
  • The training for this project would not have been possible without Axolotl.

Model pages

  • Dans-StructureEvaluator, used in the line_heal module. Used to check the perplexity of the text with various joining strategies.

License

This project is licensed under the terms of the AGPL-3.0 license. The full text of the license can be found in the LICENSE file.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

structml-0.0.4.tar.gz (17.6 kB view hashes)

Uploaded Source

Built Distribution

structml-0.0.4-py3-none-any.whl (17.7 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page