A Python module to load mnist_datasets from scratch
MNIST Dataset Loader
A uniform interface to the MNIST handwritten digits (default) and Fashion-MNIST datasets, independent of any machine learning framework and with no external dependencies other than NumPy. It downloads, extracts, and loads the dataset for you.
Features
- Pure Python + NumPy: No dependencies on deep learning frameworks.
- Automatic Download & Extraction: Fetches and prepares the dataset automatically.
- Supports Raw MNIST Format: Loads images and labels directly from binary files.
- ARFF Format Support: Provides an option to load data from an ARFF file.
- Custom Storage Location: Allows specifying a custom directory for storing dataset files.
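The extraction step named in the feature list can be sketched with the standard library alone. This is a minimal illustration, not the package's own code: the helper names `extract_gzip` and `demo_extract` and the sample file name are hypothetical, and the download step is left out.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def extract_gzip(src: Path, dst: Path) -> Path:
    """Decompress a .gz file to dst, streaming so the data is not held in memory at once."""
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst


def demo_extract() -> bytes:
    """Round-trip a tiny payload through a temporary directory (hypothetical demo)."""
    # A 10-byte stand-in for a label file: magic 2049, count 2, labels 7 and 3.
    payload = b"\x00\x00\x08\x01" + b"\x00\x00\x00\x02" + b"\x07\x03"
    with tempfile.TemporaryDirectory() as tmp:
        gz_path = Path(tmp) / "sample-labels-idx1-ubyte.gz"
        with gzip.open(gz_path, "wb") as f:
            f.write(payload)
        # Dropping the ".gz" suffix gives the name of the extracted file.
        out_path = extract_gzip(gz_path, gz_path.with_suffix(""))
        return out_path.read_bytes()


extracted = demo_extract()
print(len(extracted))  # 10
```

Streaming with `shutil.copyfileobj` keeps memory use flat, which matters more for the image files than for the small label files.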
MNIST Dataset Structure
The MNIST dataset consists of four binary files:
| File | Description | Count |
|---|---|---|
| train-images-idx3-ubyte.gz | Training images | 60,000 |
| train-labels-idx1-ubyte.gz | Training labels | 60,000 |
| t10k-images-idx3-ubyte.gz | Test images | 10,000 |
| t10k-labels-idx1-ubyte.gz | Test labels | 10,000 |
Note: The original MNIST site does not provide detailed information about the dataset files.
File Format Breakdown
Image File Format (*-images-idx3-ubyte)
| Offset (Bytes) | Content | Description |
|---|---|---|
| 0 - 3 | Magic number | 2051 (0x803 in hex) |
| 4 - 7 | Number of images | Total images in the dataset |
| 8 - 11 | Rows | Should be 28 |
| 12 - 15 | Columns | Should be 28 |
| 16 - *** | Pixel data | Each pixel is one unsigned byte (0-255) |
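The layout in the table above can be read with `struct` and NumPy. The following is a sketch under stated assumptions, not the package's implementation: `parse_idx_images` is a hypothetical helper, the header integers are taken to be big-endian unsigned 32-bit values (as in the IDX format), and the input is a synthetic in-memory buffer rather than a real MNIST file.

```python
import struct

import numpy as np


def parse_idx_images(buf: bytes) -> np.ndarray:
    """Parse an uncompressed IDX3 image buffer into a (count, rows, cols) uint8 array."""
    # Bytes 0-15: four big-endian unsigned 32-bit integers.
    magic, count, rows, cols = struct.unpack(">IIII", buf[:16])
    if magic != 2051:
        raise ValueError(f"unexpected magic number {magic}, expected 2051")
    # Bytes 16 onward: one unsigned byte per pixel, stored image by image, row by row.
    pixels = np.frombuffer(buf, dtype=np.uint8, offset=16)
    if pixels.size != count * rows * cols:
        raise ValueError("pixel data length does not match the header")
    return pixels.reshape(count, rows, cols)


# Synthetic buffer: two 28x28 images, the first all 0 and the second all 255.
header = struct.pack(">IIII", 2051, 2, 28, 28)
body = bytes(28 * 28) + bytes([255]) * (28 * 28)
demo_images = parse_idx_images(header + body)
print(demo_images.shape)  # (2, 28, 28)
```

`np.frombuffer` returns a read-only view of the bytes, so call `.copy()` on the result if the pixels need to be modified.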
Label File Format (*-labels-idx1-ubyte)
| Offset (Bytes) | Content | Description |
|---|---|---|
| 0 - 3 | Magic number | 2049 (0x801 in hex) |
| 4 - 7 | Number of labels | Total labels in the dataset |
| 8 - *** | Label Data | Each label is a single byte (0-9) |
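The label layout above is simpler, since the header holds only two integers. As a hedged sketch (the helper `parse_idx_labels` is hypothetical, and the header is assumed to be big-endian as in the IDX format), a synthetic buffer can be parsed like this:

```python
import struct

import numpy as np


def parse_idx_labels(buf: bytes) -> np.ndarray:
    """Parse an uncompressed IDX1 label buffer into a 1-D uint8 array."""
    # Bytes 0-7: magic number and label count, both big-endian unsigned 32-bit.
    magic, count = struct.unpack(">II", buf[:8])
    if magic != 2049:
        raise ValueError(f"unexpected magic number {magic}, expected 2049")
    # Bytes 8 onward: one byte per label.
    labels = np.frombuffer(buf, dtype=np.uint8, offset=8)
    if labels.size != count:
        raise ValueError("label count does not match the header")
    return labels


# Synthetic buffer holding three labels.
sample = struct.pack(">II", 2049, 3) + bytes([5, 0, 4])
demo_labels = parse_idx_labels(sample)
print(demo_labels.tolist())  # [5, 0, 4]
```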
Installation
Install the package via pip:
pip install mnist_datasets
Usage
Load MNIST Dataset
from mnist_datasets import MNISTLoader
loader = MNISTLoader()
images, labels = loader.load()
assert len(images) == 60000 and len(labels) == 60000
# Load test dataset
test_images, test_labels = loader.load(train=False)
assert len(test_images) == 10000 and len(test_labels) == 10000
Specify a Custom Folder
loader = MNISTLoader(folder='/tmp')
Load Data from an ARFF File
images_from_arff, labels_from_arff = MNISTLoader.from_arff()
Note: The default ARFF source (for handwritten digits) is
https://www.openml.org/data/download/52667/mnist_784.arff. This method is provided for educational purposes and is extremely slow.
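To illustrate why the ARFF route is slow, here is a minimal sketch of parsing ARFF text line by line. It is not the package's parser: `parse_dense_arff` is a hypothetical helper, and it assumes a dense layout with one comma-separated row per sample and the class label in the last column. The attribute names in the sample are made up, and the real file may differ in detail.

```python
import numpy as np


def parse_dense_arff(text: str):
    """Parse a dense, numeric ARFF document whose last column is the class label."""
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if not in_data:
            # Header lines (@relation, @attribute) are ignored until @data appears.
            in_data = line.lower() == "@data"
            continue
        rows.append(line.split(","))
    # Every value is converted from text one at a time, which is the slow part.
    features = np.array([[int(float(v)) for v in r[:-1]] for r in rows], dtype=np.uint8)
    labels = np.array([int(r[-1]) for r in rows], dtype=np.uint8)
    return features, labels


sample_arff = """% tiny stand-in for an MNIST-style ARFF file
@relation demo
@attribute pixel1 numeric
@attribute pixel2 numeric
@attribute class {0,1,2,3,4,5,6,7,8,9}
@data
0,255,7
12,0,3
"""
arff_features, arff_labels = parse_dense_arff(sample_arff)
print(arff_features.tolist(), arff_labels.tolist())  # [[0, 255], [12, 0]] [7, 3]
```

A full MNIST row would carry 784 pixel values plus the label, so the per-value text conversion runs tens of millions of times, compared with a single bulk read for the binary format.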
Verify Consistency Between ARFF and MNIST Binary Format
import numpy as np
images_from_arff, labels_from_arff = MNISTLoader.from_arff(train=False)
images, labels = MNISTLoader().load(train=False)
# np.alltrue was removed in NumPy 2.0; np.all is the equivalent call
np.all(images_from_arff == images), np.all(labels_from_arff == labels)
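Another way to express such a check is `np.array_equal`, which returns a single boolean and also treats arrays of different shapes as unequal. The sketch below uses small synthetic arrays, since the shape and dtype returned by the loader are not stated here.

```python
import numpy as np

# Small synthetic arrays standing in for two copies of the same dataset.
a = np.arange(6, dtype=np.uint8).reshape(2, 3)
b = a.copy()
c = a.reshape(3, 2)  # same values, different shape

same_values = np.array_equal(a, b)
same_shape_required = np.array_equal(a, c)
print(same_values)          # True
print(same_shape_required)  # False
```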
Load Images and Labels from Local Storage
images = MNISTLoader.load_images('/tmp/t10k-images-idx3-ubyte')
labels = MNISTLoader.load_labels('/tmp/t10k-labels-idx1-ubyte')
assert len(images) == 10000 and len(labels) == 10000
Note: All of the above examples also work for Fashion-MNIST with the following tweak:
loader = MNISTLoader('fashion')
Additional steps that may be required or helpful
Install virtual environment support (Ubuntu/Debian)
You can skip this if python3 -m venv works
sudo apt update && sudo apt install -y python3-venv
# 1. Create a virtual environment in `.venv` folder
python3 -m venv .venv
# 2. Activate the virtual environment
source .venv/bin/activate
# 3. Upgrade pip (recommended)
pip install --upgrade pip
# 4. Install PyTorch (CPU-only build)
pip install torch==2.7.0 --index-url https://download.pytorch.org/whl/cpu
Why use this?
This project is designed for those who want an intuitive and dependency-free way to load the MNIST dataset while understanding its raw format in depth.
Contributions & Issues:
Found a bug? Want to contribute? Feel free to open an issue or submit a PR!
File details
Details for the file mnist_datasets-0.13.tar.gz.
File metadata
- Download URL: mnist_datasets-0.13.tar.gz
- Upload date:
- Size: 6.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ac492d3c92e9ad1bf785de984af442bb70a561a40486ee6965ba67ed345fb2ae |
| MD5 | 3d0829b51515ae06706dfab61b026c85 |
| BLAKE2b-256 | 0db563e4eeb3a9e430c06f88105565acdbf5199bd658cf9ffbc7db9364e977f6 |
File details
Details for the file mnist_datasets-0.13-py3-none-any.whl.
File metadata
- Download URL: mnist_datasets-0.13-py3-none-any.whl
- Upload date:
- Size: 7.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 715bea879986c2282aa0913a35d094a69848928854608293b5ee770edc3d42f1 |
| MD5 | f138a10ff08a490ee82c0ac28f672f96 |
| BLAKE2b-256 | c01ed96332ccc21c7311fb35ed5fd39d14f82c63bd445db85cc09fcf5fe07bb1 |