ALEA low-level data generation techniques (procedural, KL3M)

Project description

ALEA Data Generator

This is a basic synthetic data generation and perturbation library, designed by the ALEA Institute to support the creation and augmentation of data without relying on "tainted" LLMs.

Data generation techniques in this library:

  • do not require the use of any LLM or external data source
  • can be used with KL3M, our Fairly Trained LLM

Supported Patterns

The following data generation patterns are supported:

  • Simple string templates with sampled values (e.g., This Agreement, by and between <|company:a|> and <|company:b|>, is made as of <|date|>.)
    • Faker integration for common data types (e.g., names, addresses, dates, etc.)
  • Large templates with sampled values (e.g., jinja2 templates in files)
  • Common document types (e.g., emails, contracts, memos, etc. using templates)
  • Data perturbation (e.g., realistic errors introduced by humans, OCR, or other automated systems)
    • Skipping, doubling, or transposing/swapping characters
    • Skipping, doubling, or transposing/swapping tokens
    • QWERTY and mobile keyboard mistakes (off-by-one key, shift errors, etc.)
    • Homophones (e.g., their vs. there)
    • Synonyms (e.g., big vs. large)
    • Negation/antonyms (e.g., big vs. small)
    • Capitalization errors (e.g., big vs. Big)
    • Punctuation errors (e.g., big vs. big.)
    • OCR-like errors (e.g., misreading characters, smudges, etc.)
  • Representation conversion (e.g., 429 to four hundred twenty-nine or four twenty-nine)
  • Format conversion (e.g., Markdown <-> HTML variants)
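
As a rough illustration of two of the patterns listed above (string templates with sampled values, and character-level perturbation), here is a minimal standard-library sketch. It is not the library's API: the names `SAMPLE_VALUES`, `fill_template`, `transpose_chars`, `double_char`, and `skip_char` are hypothetical, the value pools are made up, and the reading of `<|company:a|>` as "kind `company`, label `a`" (same label gives the same value, different labels give different values) is an assumption based only on the example template.

```python
import random
import re

# Hypothetical value pools for illustration only. The README says the
# library integrates Faker for common data types; this sketch does not.
SAMPLE_VALUES = {
    "company": ["Acme LLC", "Globex Inc.", "Initech Corp."],
    "date": ["January 5, 2024", "March 12, 2023"],
}

# Matches <|kind|> and <|kind:label|> placeholders (assumed syntax).
PLACEHOLDER = re.compile(r"<\|(\w+)(?::(\w+))?\|>")


def fill_template(template: str, rng: random.Random) -> str:
    """Replace each placeholder with a value sampled from SAMPLE_VALUES.

    Placeholders with the same kind and label reuse one value; different
    labels of the same kind get different values while the pool allows it.
    An unknown kind raises KeyError.
    """
    chosen = {}  # (kind, label) -> sampled value

    def substitute(match: re.Match) -> str:
        key = (match.group(1), match.group(2))
        if key not in chosen:
            used = {v for (kind, _), v in chosen.items() if kind == key[0]}
            pool = [v for v in SAMPLE_VALUES[key[0]] if v not in used]
            chosen[key] = rng.choice(pool or SAMPLE_VALUES[key[0]])
        return chosen[key]

    return PLACEHOLDER.sub(substitute, template)


def transpose_chars(text: str, rng: random.Random) -> str:
    """Swap one randomly chosen pair of adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def double_char(text: str, rng: random.Random) -> str:
    """Repeat one randomly chosen character."""
    if not text:
        return text
    i = rng.randrange(len(text))
    return text[: i + 1] + text[i] + text[i + 1:]


def skip_char(text: str, rng: random.Random) -> str:
    """Drop one randomly chosen character."""
    if not text:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i + 1:]


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so runs are repeatable
    template = (
        "This Agreement, by and between <|company:a|> and "
        "<|company:b|>, is made as of <|date|>."
    )
    clean = fill_template(template, rng)
    print(clean)
    print(transpose_chars(clean, rng))
    print(double_char(clean, rng))
    print(skip_char(clean, rng))
```

Passing an explicit `random.Random` instance keeps each step reproducible, which is useful when the clean and perturbed strings are stored as paired training examples.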

Future Roadmap

  • Document image generation for document/OCR models

License

The ALEA Data Generator library is released under the MIT License. See the LICENSE file for details.

Some of the data generation techniques used in this library may also retrieve data from external sources, which have their own licensing terms. These terms are documented in alea-data-sources.

See, e.g., the CMU Pronouncing Dictionary (cmudict), which is used in tasks like homophonic errors.

Support

If you encounter any issues or have questions about using the ALEA Data Generator library, please open an issue on GitHub.

Learn More

To learn more about ALEA and its software and research projects like KL3M and leeky, visit the ALEA website.

Download files

Source Distribution

alea_data_generator-0.1.2.tar.gz (25.1 kB)

Uploaded Source

Built Distribution

alea_data_generator-0.1.2-py3-none-any.whl (41.7 kB)

Uploaded Python 3

File details

Details for the file alea_data_generator-0.1.2.tar.gz.

File metadata

  • Download URL: alea_data_generator-0.1.2.tar.gz
  • Upload date:
  • Size: 25.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.4 CPython/3.12.3 Linux/6.8.0-50-generic

File hashes

Hashes for alea_data_generator-0.1.2.tar.gz
Algorithm Hash digest
SHA256 776f085a5c7cb31a48321edcaee407bacd9c965e25f129a958eaf7fd793a7b61
MD5 98d41490d0d47d779cb5a0fbc95a9ed2
BLAKE2b-256 ac6fdaf373ca824667244fbaa6dcb752007907210ae134c33e740e86943462e9

File details

Details for the file alea_data_generator-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: alea_data_generator-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 41.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.4 CPython/3.12.3 Linux/6.8.0-50-generic

File hashes

Hashes for alea_data_generator-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 17db9ee84c8e4ce6d6dd97ac2cde01f75888dd00805f7265414b00ac9a47a0af
MD5 9b7ef28016d4c452230bceadaa7290ac
BLAKE2b-256 6a67985ea3d47208a058b7996c3a458428b12e68c6f65ee2b9b39741e83bf8ba
