A package to convert project codebases into JSONL format for GPT model training.

ProjectCodebaseToJsonl

ProjectCodebaseToJsonl is a Python package that converts project codebases into JSONL format. This is particularly useful when preparing data for training GPT models, because it turns an existing project's structure and code into a format that machine learning pipelines can consume.

Installation

To install ProjectCodebaseToJsonl, you can use pip:

pip install ProjectCodebaseToJsonl

Usage

As a Python Module

You can use ProjectCodebaseToJsonl as a module in your Python scripts.

Example:

from codebase_to_jsonl import generate_jsonl_for_project

# Generate JSONL for a project
project_data = generate_jsonl_for_project(
    project_path="path_to_your_project",
    project_name="YourProjectName",
    use_gitignore=True,
    validation_ratio=0.4
)

print("Project Data Generated:")
print(project_data)

Customizing Your Generator

You can customize the output by adjusting the parameters of generate_jsonl_for_project, such as use_gitignore and validation_ratio, to suit your codebase and the dataset characteristics you want.
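To illustrate what a validation ratio such as 0.4 means in practice, here is a minimal sketch of a ratio-based split. It is not the package's own implementation, which this README does not describe; split_records and its seed argument are hypothetical names used only for illustration.

```python
import random


def split_records(records, validation_ratio, seed=0):
    """Hypothetical helper: shuffle records and set aside a validation share.

    This only illustrates the idea of a validation ratio; it is an
    assumption, not how ProjectCodebaseToJsonl is known to split data.
    """
    if not 0.0 <= validation_ratio <= 1.0:
        raise ValueError("validation_ratio must be between 0 and 1")
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_ratio)
    # The first n_validation items become validation, the rest training.
    return shuffled[n_validation:], shuffled[:n_validation]


training, validation = split_records(range(10), validation_ratio=0.4)
print(len(training), len(validation))  # prints: 6 4
```

With ten records and a ratio of 0.4, four records go to validation and six to training. A higher ratio leaves less data for training, so the right value depends on how large the codebase is.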

Output Example

Running ProjectCodebaseToJsonl generates JSONL files for both training and validation, structured to facilitate GPT model training. Here's an example of the output structure:

{
    "project_name": "YourProjectName",
    "token_count": 12345,
    "training_file": "YourProjectName_training_20240101_123456.jsonl",
    "validation_file": "YourProjectName_validation_20240101_123456.jsonl"
}
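Once the files exist, they can be inspected with the standard library alone. The sketch below assumes only that each non-empty line is one JSON object, which is what the JSONL format implies; the record schema is not shown in this README, so the demo writes its own placeholder file (sample_training.jsonl, a made-up name) instead of guessing at real contents.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Parse a JSONL file into a list of objects, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Placeholder data: the real record fields are an unknown here.
sample_path = os.path.join(tempfile.mkdtemp(), "sample_training.jsonl")
with open(sample_path, "w", encoding="utf-8") as handle:
    handle.write(json.dumps({"example": 1}) + "\n")
    handle.write(json.dumps({"example": 2}) + "\n")

records = read_jsonl(sample_path)
print(len(records))  # prints: 2
```

To check a real run, pass the path from the "training_file" or "validation_file" entry to read_jsonl in place of sample_path.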

Contributing

Contributions, issues, and feature requests are welcome! Feel free to check the issues page.

License

MIT

