MLCommons datasets format.
🥐 ML Croissant
Python requirements
Python version >= 3.10.
If you do not have a Python environment:
python3 -m venv ~/py3
source ~/py3/bin/activate
Install
python -m pip install ".[dev]"
Verify/load a Croissant dataset
python scripts/validate.py --file ../../datasets/titanic/metadata.json
The command:
- Exits with 0, prints Done and displays encountered warnings, when no error was found in the file.
- Exits with 1 and displays all encountered errors/warnings, otherwise.
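Because the command reports its outcome through the exit code, it can be driven from a script. The following is a minimal sketch, not part of mlcroissant: run_validation is a hypothetical helper that runs any command with the standard library and treats exit code 0 as success.

```python
import subprocess
import sys


def run_validation(command: list[str]) -> tuple[bool, str]:
    """Run a validation command and return (succeeded, combined output).

    Hypothetical helper: it relies only on the exit-code convention
    described above (0 = no error found, non-zero = errors found).
    """
    result = subprocess.run(command, capture_output=True, text=True)
    # Warnings and errors may go to either stream, so keep both.
    output = result.stdout + result.stderr
    return result.returncode == 0, output
```

Assuming it is called from the same directory as the command above, run_validation([sys.executable, "scripts/validate.py", "--file", "../../datasets/titanic/metadata.json"]) would return True together with the printed warnings when the file is valid.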
Similarly, you can generate a dataset by launching:
python scripts/load.py \
--file ../../datasets/titanic/metadata.json \
--record_set passengers \
--num_records 10
Programmatically build JSON-LD files
You can programmatically build Croissant JSON-LD files using the Python API.
import mlcroissant as mlc

metadata = mlc.nodes.Metadata(
    name="...",
)
metadata.to_json()  # This returns the JSON-LD file.
For a full working example, refer to the script to convert Hugging Face datasets to Croissant files. This script uses the Python API to programmatically build JSON-LD files.
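To persist the result, the value returned by to_json() can be written to disk. This is a minimal sketch, assuming to_json() returns a JSON-serializable dictionary (an assumption; check the return type in your installed version). write_jsonld is a hypothetical helper and is not part of mlcroissant.

```python
import json
from pathlib import Path
from typing import Any


def write_jsonld(jsonld: dict[str, Any], path: str) -> Path:
    """Write a JSON-LD dictionary to `path` as indented UTF-8 JSON.

    Hypothetical helper: `jsonld` is assumed to be a plain dictionary.
    """
    target = Path(path)
    text = json.dumps(jsonld, indent=2, ensure_ascii=False)
    target.write_text(text + "\n", encoding="utf-8")
    return target
```

Under that assumption, write_jsonld(metadata.to_json(), "metadata.json") would save the file, which could then be checked with scripts/validate.py.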
Run tests
All tests can be run from the Makefile:
make tests
Design
The most important modules in the library are:
- mlcroissant/_src/structure_graph is responsible for the static analysis of the Croissant files. We convert Croissant files to a Python representation called "structure graph" (using NetworkX). In the process, we catch any static analysis issues (e.g., a missing mandatory field or a logic problem in the file).
- mlcroissant/_src/operation_graph is responsible for the dynamic analysis of the Croissant files (i.e., actually loading the dataset by yielding examples). We convert the structure graph into an "operation graph". Operations are the unit transformations that build the dataset (like Download, Extract, etc.).
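To illustrate the idea of an operation graph, here is a toy sketch using only the standard library. It is not how mlcroissant implements it: the operation names Download and Extract come from the text above, while Read, the dependency table, and execution_order are hypothetical. Each operation runs only after the operations it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency table: each operation maps to the set of
# operations that must finish before it can run.
OPERATIONS = {
    "Download": set(),
    "Extract": {"Download"},
    "Read": {"Extract"},
}


def execution_order(operations: dict[str, set[str]]) -> list[str]:
    """Return the operations in an order that respects dependencies."""
    return list(TopologicalSorter(operations).static_order())


print(execution_order(OPERATIONS))  # ['Download', 'Extract', 'Read']
```

In this sketch, a circular dependency makes TopologicalSorter raise graphlib.CycleError, which is an example of a logic problem that is better caught before any data is loaded.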
Other important modules are:
- mlcroissant/_src/core defines all needed core internals. For instance, Issues are a way to track errors and warnings during the analysis of Croissant files.
- mlcroissant/__init__.py declares the public API with mlcroissant.Dataset.
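The idea of collecting issues instead of failing on the first one can be sketched as follows. This is a simplified, hypothetical stand-in, not the actual Issues class from mlcroissant/_src/core: the method names and the check_mandatory_fields helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Issues:
    """Accumulates errors and warnings found during analysis."""

    errors: list[str] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)

    def add_error(self, message: str) -> None:
        self.errors.append(message)

    def add_warning(self, message: str) -> None:
        self.warnings.append(message)

    def exit_code(self) -> int:
        """Return 0 when no error was recorded, 1 otherwise."""
        return 1 if self.errors else 0


def check_mandatory_fields(node: dict, mandatory: list[str], issues: Issues) -> None:
    """Record one error per missing mandatory field, without raising."""
    for name in mandatory:
        if name not in node:
            issues.add_error(f"missing mandatory field: {name}")
```

Accumulating issues this way lets a single run report every problem in a file, rather than stopping at the first one.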
For the full design and an overview of the implementation, refer to the design doc.
Contribute
All contributions are welcome! We even have good first issues to help you get started on the project. Refer to the GitHub project for more detailed user stories.
The development workflow is as follows:
- Read above how the repo is designed.
- Fork the repository: https://github.com/mlcommons/croissant.
- Clone the newly forked repository:
git clone git@github.com:<YOUR_GITHUB_LDAP>/croissant.git
- Create a new branch:
cd croissant/
git checkout -b feature/my-awesome-new-feature
- Install the repository and dev tools:
cd python/mlcroissant
pip install -e ".[dev]"
- Code the feature. We support VS Code with pre-set settings.
- Push to GitHub:
git add .
git commit -m "<YOUR_COMMIT_MESSAGE>"
git push --set-upstream origin feature/my-awesome-new-feature
- Update your code until all tests are green:
  - pytest runs unit tests.
  - pytype -j auto runs pytype.
- Open a pull request (PR) against the main branch of https://github.com/mlcommons/croissant, and ask for feedback!
Debug
You can debug the validation of the file using the --debug flag:
python scripts/validate.py --file ../../datasets/titanic/metadata.json --debug
This will:
- print extra information, like the generated nodes;
- save the generated structure graph to a folder indicated in the logs.