A package to convert project codebases into JSONL format for GPT model training.

ProjectCodebaseToJsonl

ProjectCodebaseToJsonl is a Python package designed to convert project codebases into JSONL format. This is particularly useful for preparing data for training GPT models, as it allows for the easy transformation of existing project structures and code into a format compatible with machine learning pipelines.

Installation

To install ProjectCodebaseToJsonl, you can use pip:

pip install ProjectCodebaseToJsonl

Usage

As a Python Module

You can use ProjectCodebaseToJsonl as a module in your Python scripts.

Example:

from codebase_to_jsonl import generate_jsonl_for_project

# Generate JSONL for a project
project_data = generate_jsonl_for_project(
    project_path="path_to_your_project",
    project_name="YourProjectName",
    use_gitignore=True,
    validation_ratio=0.4
)

print("Project Data Generated:")
print(project_data)

Customizing Your Generator

You can customize the behavior of ProjectCodebaseToJsonl by adjusting parameters like use_gitignore and validation_ratio to suit the specific needs of your codebase and desired dataset characteristics.
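To make the effect of validation_ratio concrete, here is a minimal sketch of how a ratio-based split could work. It is an illustration only, not the package's actual implementation: the split_records helper, the seed parameter, and the sample file names are all hypothetical.

```python
import random


def split_records(records, validation_ratio, seed=0):
    """Shuffle records and set aside a validation share (hypothetical helper)."""
    if not 0.0 <= validation_ratio <= 1.0:
        raise ValueError("validation_ratio must be between 0 and 1")
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_ratio)
    # The first n_validation items become validation data, the rest training data.
    return shuffled[n_validation:], shuffled[:n_validation]


# Stand-in records; a real run would hold content extracted from project files.
records = [f"file_{i}.py" for i in range(10)]
training, validation = split_records(records, validation_ratio=0.4)
print(len(training), len(validation))
```

With ten records and a ratio of 0.4, four records land in the validation set and six in the training set. A higher ratio leaves less data for training, so smaller values such as 0.1 or 0.2 are common when the codebase is small.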

Output Example

Running ProjectCodebaseToJsonl generates JSONL files for both training and validation, structured to facilitate GPT model training. Here's an example of the output structure:

{
    "project_name": "YourProjectName",
    "token_count": 12345,
    "training_file": "YourProjectName_training_20240101_123456.jsonl",
    "validation_file": "YourProjectName_validation_20240101_123456.jsonl"
}
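The generated files use the JSONL convention of one JSON object per line, so they can be inspected with the standard library alone. The sketch below writes a small stand-in file and reads it back. The read_jsonl helper and the "text" key are assumptions made for illustration; the actual record schema produced by the package is not documented here.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Return a list with one parsed JSON object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Write a tiny stand-in file; the "text" key is a hypothetical record schema.
sample = [{"text": "print('hello')"}, {"text": "x = 1"}]
path = Path(tempfile.mkdtemp()) / "sample_training.jsonl"
with open(path, "w", encoding="utf-8") as handle:
    for record in sample:
        handle.write(json.dumps(record) + "\n")

loaded = read_jsonl(path)
print(len(loaded), "records")
```

To inspect real output, pass the path of a generated training or validation file to read_jsonl instead of the stand-in file.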

Contributing

Contributions, issues, and feature requests are welcome! Feel free to check the issues page.

License

MIT
