ALEA low-level data generation techniques (procedural, KL3M)

ALEA Data Generator

This is a basic synthetic data generation and perturbation library designed by the ALEA Institute to support the creation and augmentation of data without relying on "tainted" LLMs.

Data generation techniques in this library:

  • do not require the use of any LLM or external data source
  • can be used with KL3M, our Fairly Trained LLM

Supported Patterns

The following data generation patterns are supported:

  • Simple string templates with sampled values (e.g., This Agreement, by and between <|company:a|> and <|company:b|>, is made as of <|date|>.)
    • Faker integration for common data types (e.g., names, addresses, dates, etc.)
  • Large templates with sampled values (e.g., jinja2 templates in files)
  • Common document types (e.g., emails, contracts, memos, etc. using templates)
  • Data perturbation (e.g., realistic errors introduced by humans, OCR, or other automated systems)
    • Skipping, doubling, or transposing/swapping characters
    • Skipping, doubling, or transposing/swapping tokens
    • QWERTY and mobile keyboard mistakes (off-by-one key, shift errors, etc.)
    • Homophones (e.g., their vs. there)
    • Synonyms (e.g., big vs. large)
    • Negation/antonyms (e.g., big vs. small)
    • Capitalization errors (e.g., big vs. Big)
    • Punctuation errors (e.g., big vs. big.)
    • OCR-like errors (e.g., misreading characters, smudges, etc.)
  • Representation conversion (e.g., 429 to four hundred twenty-nine or four twenty-nine)
  • Format conversion (e.g., Markdown <-> HTML variants)
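
As an illustration of the simple string template pattern above, the following is a minimal sketch that fills tags such as <|company:a|> and <|date|> with sampled values. It does not use this library's API: the names SAMPLE_POOLS, TAG_PATTERN, and fill_template, the sample values, and the rule that distinct labels receive distinct values are all assumptions made for the example.

```python
import random
import re

# Hypothetical sample pools for illustration only. A real pipeline might
# draw these values from Faker or another sampler instead.
SAMPLE_POOLS = {
    "company": ["Acme Corp.", "Globex LLC", "Initech, Inc.", "Umbrella Ltd."],
    "date": ["January 5, 2021", "March 17, 2022", "October 30, 2023"],
}

# Matches <|type|> and <|type:label|>, e.g. <|date|> or <|company:a|>.
TAG_PATTERN = re.compile(r"<\|([a-z_]+)(?::([a-z0-9_]+))?\|>")


def fill_template(template, rng):
    """Replace each tag in the template with a value sampled from its pool.

    Tags with the same type and label reuse one value. Tags with the same
    type but different labels get different values while the pool allows it.
    """
    assigned = {}

    def substitute(match):
        kind, label = match.group(1), match.group(2)
        pool = SAMPLE_POOLS[kind]
        if label is None:
            # Unlabeled tags are sampled independently each time.
            return rng.choice(pool)
        if (kind, label) not in assigned:
            used = {value for (k, _), value in assigned.items() if k == kind}
            unused = [value for value in pool if value not in used]
            assigned[(kind, label)] = rng.choice(unused or pool)
        return assigned[(kind, label)]

    return TAG_PATTERN.sub(substitute, template)


if __name__ == "__main__":
    text = (
        "This Agreement, by and between <|company:a|> and <|company:b|>, "
        "is made as of <|date|>."
    )
    print(fill_template(text, random.Random(0)))
```

Passing an explicit random.Random instance keeps the generated samples reproducible for a given seed.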
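
The character-level perturbations listed above (skipping, doubling, and transposing characters) can be sketched in a few lines. This is an illustrative example, not the library's implementation: the function names and the uniform choice of operation and position are assumptions.

```python
import random


def skip_char(text, i):
    """Drop the character at index i."""
    return text[:i] + text[i + 1:]


def double_char(text, i):
    """Repeat the character at index i."""
    return text[:i + 1] + text[i] + text[i + 1:]


def transpose_chars(text, i):
    """Swap the characters at indices i and i + 1, if both exist."""
    if i + 1 >= len(text):
        return text
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


# Hypothetical operation list; a fuller version could add keyboard or
# OCR-style substitutions.
OPERATIONS = [skip_char, double_char, transpose_chars]


def perturb_once(text, rng):
    """Apply one randomly chosen operation at one random position."""
    if not text:
        return text
    operation = rng.choice(OPERATIONS)
    position = rng.randrange(len(text))
    return operation(text, position)


if __name__ == "__main__":
    rng = random.Random(42)
    for _ in range(3):
        print(perturb_once("This Agreement is made as of today.", rng))
```

Each operation changes the length of the text by at most one character, which makes the injected errors easy to bound when building noisy/clean training pairs.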
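
The representation conversion example above (429 to "four hundred twenty-nine" or "four twenty-nine") can be sketched as follows for integers from 0 to 999. The function names, the supported range, and the "oh" convention for values such as 405 are assumptions for this example, not the library's behavior.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = [
    "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
    "eighty", "ninety",
]


def _two_digits(n):
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def formal_words(n):
    """Spell out 0..999 in full, e.g. 429 -> 'four hundred twenty-nine'."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only supports 0..999")
    hundreds, rest = divmod(n, 100)
    if hundreds == 0:
        return _two_digits(rest)
    if rest == 0:
        return f"{ONES[hundreds]} hundred"
    return f"{ONES[hundreds]} hundred {_two_digits(rest)}"


def colloquial_words(n):
    """Spell out 0..999 in spoken shorthand, e.g. 429 -> 'four twenty-nine'."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only supports 0..999")
    hundreds, rest = divmod(n, 100)
    if hundreds == 0 or rest == 0:
        # No shorthand applies; fall back to the full form.
        return formal_words(n)
    if rest < 10:
        # Assumed convention: 405 is read as "four oh five".
        return f"{ONES[hundreds]} oh {ONES[rest]}"
    return f"{ONES[hundreds]} {_two_digits(rest)}"


if __name__ == "__main__":
    print(formal_words(429))
    print(colloquial_words(429))
```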

Future Roadmap

  • Document image generation for document/OCR models

License

The ALEA Data Generator library is released under the MIT License. See the LICENSE file for details.

Some of the data generation techniques used in this library may also retrieve data from external sources, which have their own licensing terms. These terms are documented in the alea-data-sources project.

See, e.g., the CMU Pronouncing Dictionary (cmudict), which is used in tasks like generating homophonic errors.

Support

If you encounter any issues or have questions about using the ALEA Data Generator library, please open an issue on GitHub.

Learn More

To learn more about ALEA and its software and research projects like KL3M and leeky, visit the ALEA website.
