A Python module to load the MNIST datasets from scratch

MNIST Dataset Loader

A uniform interface to the MNIST handwritten digits (default) and Fashion-MNIST datasets, independent of any machine learning framework or external library except NumPy. It downloads, extracts, and loads the dataset for you.

Features

  • Pure Python + NumPy: No dependencies on deep learning frameworks.
  • Automatic Download & Extraction: Fetches and prepares the dataset automatically.
  • Supports Raw MNIST Format: Loads images and labels directly from binary files.
  • ARFF Format Support: Provides an option to load data from an ARFF file.
  • Custom Storage Location: Allows specifying a custom directory for storing dataset files.

MNIST Dataset Structure

The MNIST dataset consists of four binary files:

File                          Description        Count
train-images-idx3-ubyte.gz    Training images    60,000
train-labels-idx1-ubyte.gz    Training labels    60,000
t10k-images-idx3-ubyte.gz     Test images        10,000
t10k-labels-idx1-ubyte.gz     Test labels        10,000

Note: The original MNIST site does not provide detailed information about the dataset files.
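The four files listed above are gzip-compressed, and the package unpacks them automatically. For readers who want to see that step in isolation, here is a minimal sketch using only the standard library. The helper name `gunzip_file` is hypothetical and is not part of `mnist_datasets`.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(gz_path):
    """Decompress `name.gz` next to itself as `name` and return the new path."""
    gz_path = Path(gz_path)
    # "train-images-idx3-ubyte.gz" -> "train-images-idx3-ubyte"
    out_path = gz_path.with_suffix("")
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        # Stream the data in chunks so the whole file never sits in memory.
        shutil.copyfileobj(src, dst)
    return out_path
```

Calling `gunzip_file('/tmp/t10k-images-idx3-ubyte.gz')` would leave an uncompressed `t10k-images-idx3-ubyte` file in the same folder.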

File Format Breakdown

Image File Format (*-images-idx3-ubyte)

Offset (Bytes)    Content             Description
0 - 3             Magic number        2051 (0x803 in hex)
4 - 7             Number of images    Total images in the dataset
8 - 11            Rows                Should be 28
12 - 15           Columns             Should be 28
16 - ***          Pixel data          Each pixel is an unsigned value (0-255)
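The image layout above can be read with nothing more than `struct` and NumPy. The following is a minimal sketch, not the package's own implementation. The function name `read_idx_images` is hypothetical, and the sketch assumes the file is already decompressed and that the four header fields are 32-bit big-endian integers, as the IDX format specifies.

```python
import struct

import numpy as np


def read_idx_images(path):
    """Parse an uncompressed IDX3 image file into a (count, rows, cols) uint8 array."""
    with open(path, "rb") as f:
        # Bytes 0-15: magic number, image count, rows, columns (big-endian uint32).
        magic, count, rows, cols = struct.unpack(">IIII", f.read(16))
        if magic != 2051:
            raise ValueError(f"unexpected magic number {magic}, expected 2051")
        # Bytes 16 onward: one unsigned byte per pixel, stored row by row.
        pixels = np.frombuffer(f.read(), dtype=np.uint8)
    if pixels.size != count * rows * cols:
        raise ValueError("pixel data length does not match the header")
    return pixels.reshape(count, rows, cols)
```

Checking the magic number and the payload length up front turns a truncated or mislabeled file into an immediate error rather than a silently misshapen array.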

Label File Format (*-labels-idx1-ubyte)

Offset (Bytes)    Content             Description
0 - 3             Magic number        2049 (0x801 in hex)
4 - 7             Number of labels    Total labels in the dataset
8 - ***           Label data          Each label is a single byte (0-9)
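The label file follows the same pattern with a shorter header. As a hedged sketch (the function name `read_idx_labels` is hypothetical, and the header fields are again assumed to be 32-bit big-endian integers in an already decompressed file):

```python
import struct

import numpy as np


def read_idx_labels(path):
    """Parse an uncompressed IDX1 label file into a 1-D uint8 array."""
    with open(path, "rb") as f:
        # Bytes 0-7: magic number and label count (big-endian uint32).
        magic, count = struct.unpack(">II", f.read(8))
        if magic != 2049:
            raise ValueError(f"unexpected magic number {magic}, expected 2049")
        # Bytes 8 onward: one byte per label.
        labels = np.frombuffer(f.read(), dtype=np.uint8)
    if labels.size != count:
        raise ValueError("label count does not match the header")
    return labels
```

Because `np.frombuffer` wraps an immutable `bytes` object, the returned array is read-only; call `.copy()` on it if you need to modify the labels in place.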

Installation

Install the package via pip:

pip install mnist_datasets

Usage

Load MNIST Dataset

from mnist_datasets import MNISTLoader
loader = MNISTLoader()
images, labels = loader.load()
assert len(images) == 60000 and len(labels) == 60000

# Load test dataset
test_images, test_labels = loader.load(train=False)
assert len(test_images) == 10000 and len(test_labels) == 10000

Specify a Custom Folder

loader = MNISTLoader(folder='/tmp')

Load Data from an ARFF File

images_from_arff, labels_from_arff = MNISTLoader.from_arff()

Note: The default ARFF file source (for handwritten digits) is https://www.openml.org/data/download/52667/mnist_784.arff. This method is provided for educational purposes and is extremely slow.

Verify Consistency Between ARFF and MNIST Binary Format

import numpy as np
from mnist_datasets import MNISTLoader

images_from_arff, labels_from_arff = MNISTLoader.from_arff(train=False)
images, labels = MNISTLoader().load(train=False)
# np.alltrue was removed in NumPy 2.0; np.all is the direct replacement
np.all(images_from_arff == images), np.all(labels_from_arff == labels)

Load Images and Labels from Local Storage

images = MNISTLoader.load_images('/tmp/t10k-images-idx3-ubyte')
labels = MNISTLoader.load_labels('/tmp/t10k-labels-idx1-ubyte')
assert len(images) == 10000 and len(labels) == 10000

Note: All of the above examples also work for Fashion-MNIST with one change:

loader = MNISTLoader('fashion')

Additional steps that may be required or helpful

Install virtual environment support (Ubuntu/Debian)

You can skip this step if python3 -m venv already works.

sudo apt update && sudo apt install -y python3-venv

# 1. Create a virtual environment in `.venv` folder
python3 -m venv .venv

# 2. Activate the virtual environment
source .venv/bin/activate

# 3. Upgrade pip (recommended)
pip install --upgrade pip

# 4. Install PyTorch (CPU-only build)
pip install torch==2.7.0 --index-url https://download.pytorch.org/whl/cpu

Why use this?

This project is designed for those who want an intuitive, framework-free way to load the MNIST dataset (NumPy is the only dependency) while understanding its raw format in depth.

Contributions & Issues:

Found a bug? Want to contribute? Feel free to open an issue or submit a PR!
