
ALEA low-level data generation techniques (procedural, KL3M)


ALEA Data Generator


This is a basic synthetic data generation/perturbation library designed by the ALEA Institute to support the creation and augmentation of data without relying on "tainted" LLMs.

Data generation techniques in this library:

  • do not require the use of any LLM or external data source
  • can be used with KL3M, our Fairly Trained LLM

Supported Patterns

The following data generation patterns are supported:

  • Simple string templates with sampled values (e.g., This Agreement, by and between <|company:a|> and <|company:b|>, is made as of <|date|>.)
    • Faker integration for common data types (e.g., names, addresses, dates, etc.)
  • Large templates with sampled values (e.g., jinja2 templates in files)
  • Common document types (e.g., emails, contracts, memos, etc. using templates)
  • Data perturbation (e.g., realistic errors introduced by humans, OCR, or other automated systems)
    • Skipping, doubling, or transposing/swapping characters
    • Skipping, doubling, or transposing/swapping tokens
    • QWERTY and mobile keyboard mistakes (off-by-one key, shift errors, etc.)
    • Homophones (e.g., their vs. there)
    • Synonyms (e.g., big vs. large)
    • Negation/antonyms (e.g., big vs. small)
    • Capitalization errors (e.g., big vs. Big)
    • Punctuation errors (e.g., big vs. big.)
    • OCR-like errors (e.g., misreading characters, smudges, etc.)
  • Representation conversion (e.g., 429 to four hundred twenty-nine or four twenty-nine)
  • Format conversion (e.g., Markdown <-> HTML variants)
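
As a rough illustration of two of the patterns listed above (string templates with sampled values, and skipping/doubling/transposing characters), here is a minimal sketch in plain Python. It is not the library's API: the names `fill_template`, `perturb_chars`, and `SAMPLE_VALUES` are hypothetical, and the hard-coded value pools stand in for whatever sampler (e.g., Faker) the library actually uses. It also assumes that placeholders with the same kind and label should share one value, while different labels of the same kind should get different values where possible.

```python
import random
import re

# Hypothetical value pools for illustration only (not taken from the library).
SAMPLE_VALUES = {
    "company": ["Acme Corp.", "Globex LLC", "Initech Inc."],
    "date": ["January 5, 2021", "March 17, 2022", "October 1, 2023"],
}

# Matches <|kind|> or <|kind:label|>, e.g. <|date|> or <|company:a|>.
PLACEHOLDER = re.compile(r"<\|(\w+)(?::(\w+))?\|>")


def fill_template(template, rng):
    """Replace each placeholder with a sampled value.

    Assumption: the same (kind, label) pair always maps to the same value,
    and distinct labels of one kind get distinct values while the pool lasts.
    """
    chosen = {}

    def substitute(match):
        kind, label = match.group(1), match.group(2)
        key = (kind, label)
        if key not in chosen:
            used = {v for (k, _), v in chosen.items() if k == kind}
            pool = SAMPLE_VALUES[kind]
            options = [v for v in pool if v not in used] or pool
            chosen[key] = rng.choice(options)
        return chosen[key]

    return PLACEHOLDER.sub(substitute, template)


def perturb_chars(text, rate, rng):
    """Skip, double, or transpose characters with probability `rate` each."""
    out = []
    i = 0
    while i < len(text):
        if rng.random() < rate:
            op = rng.choice(["skip", "double", "transpose"])
            if op == "skip":
                i += 1
                continue
            if op == "double":
                out.append(text[i] * 2)
                i += 1
                continue
            if i + 1 < len(text):  # transposing needs a following character
                out.append(text[i + 1] + text[i])
                i += 2
                continue
        out.append(text[i])  # unchanged (also the fallback at end of text)
        i += 1
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so runs are repeatable
    template = (
        "This Agreement, by and between <|company:a|> and <|company:b|>, "
        "is made as of <|date|>."
    )
    clean = fill_template(template, rng)
    print(clean)
    print(perturb_chars(clean, 0.05, rng))
```

Passing an explicit `random.Random` instance keeps generation reproducible, which is useful when the clean and perturbed strings are stored as paired training examples.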

Future Roadmap

  • Document image generation for document/OCR models

License

The ALEA Data Generator library is released under the MIT License. See the LICENSE file for details.

Some of the data generation techniques used in this library may also retrieve data from external sources, which have their own licensing terms. These terms are documented in the alea-data-sources here:

See, e.g., the CMU Pronouncing Dictionary (cmudict), which is used in tasks like homophonic errors:

Support

If you encounter any issues or have questions about using the ALEA Data Generator library, please open an issue on GitHub.

Learn More

To learn more about ALEA and its software and research projects like KL3M and leeky, visit the ALEA website.

Download files

Source Distribution: alea_data_generator-0.1.0.tar.gz (21.4 kB)

Built Distribution: alea_data_generator-0.1.0-py3-none-any.whl (31.5 kB)
