ALEA low-level data generation techniques (procedural, KL3M)
ALEA Data Generator
This is a basic synthetic data generation and perturbation library designed by the ALEA Institute to support the creation and augmentation of data without relying on "tainted" LLMs.
Data generation techniques in this library:
- do not require the use of any LLM or external data source
- can be used with KL3M, our Fairly Trained LLM
Supported Patterns
The following data generation patterns are supported:
- Simple string templates with sampled values (e.g., `This Agreement, by and between <|company:a|> and <|company:b|>, is made as of <|date|>.`)
- Faker integration for common data types (e.g., names, addresses, dates, etc.)
- Large templates with sampled values (e.g., `jinja2` templates in files)
- Common document types (e.g., emails, contracts, memos, etc. using templates)
- Data perturbation (e.g., realistic errors introduced by humans, OCR, or other automated systems)
  - Skipping, doubling, or transposing/swapping characters
  - Skipping, doubling, or transposing/swapping tokens
  - QWERTY and mobile keyboard mistakes (off-by-one key, shift errors, etc.)
  - Homophones (e.g., `their` vs. `there`)
  - Synonyms (e.g., `big` vs. `large`)
  - Negation/antonyms (e.g., `big` vs. `small`)
  - Capitalization errors (e.g., `big` vs. `Big`)
  - Punctuation errors (e.g., `big` vs. `big.`)
  - OCR-like errors (e.g., misreading characters, smudges, etc.)
- Representation conversion (e.g., `429` to `four hundred twenty-nine` or `four twenty-nine`)
- Format conversion (e.g., Markdown <-> HTML variants)
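The string-template and character-perturbation patterns listed above can be sketched in plain Python. This is a minimal illustration using only the standard library, not the library's own API: the names `SAMPLE_VALUES`, `fill_template`, and `perturb_characters` are hypothetical, and the treatment of `<|type:label|>` placeholders (same type and label yields the same value) is an assumption about how such templates might behave.

```python
import random
import re

# Hypothetical value pools (assumption); a real generator might sample from
# Faker or another source instead of fixed lists.
SAMPLE_VALUES = {
    "company": ["Acme Corp.", "Globex LLC", "Initech, Inc."],
    "date": ["January 5, 2024", "March 17, 2023"],
}

# Matches <|type|> or <|type:label|>, e.g. <|date|> or <|company:a|>.
PLACEHOLDER = re.compile(r"<\|(\w+)(?::(\w+))?\|>")


def fill_template(template: str, rng: random.Random) -> str:
    """Replace each placeholder with a value sampled for its type.

    Placeholders that share the same type and label reuse the same value.
    Raises KeyError for a type that has no entry in SAMPLE_VALUES.
    """
    chosen = {}

    def substitute(match: re.Match) -> str:
        key = (match.group(1), match.group(2))
        if key not in chosen:
            chosen[key] = rng.choice(SAMPLE_VALUES[key[0]])
        return chosen[key]

    return PLACEHOLDER.sub(substitute, template)


def perturb_characters(text: str, rng: random.Random, rate: float = 0.05) -> str:
    """Skip, double, or transpose characters with probability `rate` each."""
    out = []
    i = 0
    while i < len(text):
        if rng.random() < rate:
            op = rng.choice(["skip", "double", "transpose"])
            if op == "skip":
                i += 1  # drop this character
                continue
            if op == "double":
                out.extend([text[i], text[i]])  # emit it twice
                i += 1
                continue
            if op == "transpose" and i + 1 < len(text):
                out.extend([text[i + 1], text[i]])  # swap with the next one
                i += 2
                continue
        out.append(text[i])  # unchanged (also the fallback at end of text)
        i += 1
    return "".join(out)


rng = random.Random(42)  # fixed seed so runs are repeatable
clean = fill_template(
    "This Agreement, by and between <|company:a|> and <|company:b|>, "
    "is made as of <|date|>.",
    rng,
)
print(clean)
print(perturb_characters(clean, rng, rate=0.1))
```

Passing an explicit `random.Random` instance keeps the output reproducible, which is useful when a clean/perturbed pair needs to be regenerated later.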
Future Roadmap
- Document image generation for document/OCR models
License
The ALEA Data Generator library is released under the MIT License. See the LICENSE file for details.
Some of the data generation techniques used in this library may also retrieve data from external sources, which have their own licensing terms. These terms are documented in `alea-data-sources`. See, e.g., the CMU Pronouncing Dictionary (`cmudict`), which is used in tasks like homophonic errors.
Support
If you encounter any issues or have questions about using the ALEA Data Generator library, please open an issue on GitHub.
Learn More
To learn more about ALEA and its software and research projects like KL3M and leeky, visit the ALEA website.
File details
Details for the file alea_data_generator-0.1.0.tar.gz.
File metadata
- Download URL: alea_data_generator-0.1.0.tar.gz
- Upload date:
- Size: 21.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.12.3 Linux/6.8.0-41-generic
File hashes
Algorithm | Hash digest
---|---
SHA256 | 8a7b84fbe537b583474cdd15cf48f7c9f2c95b8a3214ade2c349b1a2e724ecf5
MD5 | 481ac9f03b65fe15a3c3b3a56c9f1157
BLAKE2b-256 | 8c3e71b9f44c312fccabf782c6e358aa33059dd9a0473fe4add55d7f48a7ec69
File details
Details for the file alea_data_generator-0.1.0-py3-none-any.whl.
File metadata
- Download URL: alea_data_generator-0.1.0-py3-none-any.whl
- Upload date:
- Size: 31.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.12.3 Linux/6.8.0-41-generic
File hashes
Algorithm | Hash digest
---|---
SHA256 | 2cd0eed5971f14312887c569ca1ed5a39a0573933511ca5dcd68d012c834660a
MD5 | 95fbbaaa4fd8ec1275e08717f31d14d8
BLAKE2b-256 | a0e927abcc363c016913e4e8581d0a78059868142a52fe35496c3fd490028468