Zero-boilerplate converter from raw data (images, text, categories) to numeric NumPy arrays.

FastNum

Zero-boilerplate conversion from raw data to numeric NumPy arrays.

FastNum detects the kind of data you hand it — an image path, a sentence, a category list, or a batch of any of these — and returns the right numeric representation without a single line of configuration.


Why FastNum?

Most ML preprocessing pipelines repeat the same four patterns:

| Input | Desired output |
| --- | --- |
| Image file | Float32 pixel array normalised to [0, 1] |
| Text sentence | Integer token-ID sequence |
| Flat category list | One-hot matrix |
| Batch of sentences | Padded token-ID matrix |

FastNum collapses all four into one call: fn.to_num(data).
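
For comparison, two of these patterns written by hand look roughly like the sketch below. It uses plain NumPy and is not FastNum's code: the helper names `one_hot` and `normalise_pixels` are hypothetical, and sorting the classes to fix the column order is an assumption made only for this example.

```python
import numpy as np

def one_hot(labels):
    """One-hot encode labels; columns follow sorted class order (an assumption)."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    out = np.zeros((len(labels), len(classes)), dtype=np.float32)
    # Set exactly one cell per row: the column of that row's class.
    out[np.arange(len(labels)), [index[label] for label in labels]] = 1.0
    return out, classes

def normalise_pixels(img_uint8):
    """Scale a uint8 image array to float32 values in [0, 1]."""
    return img_uint8.astype(np.float32) / 255.0

encoded, classes = one_hot(["dog", "cat", "dog", "bird"])
pixels = normalise_pixels(np.array([[[0, 128, 255]]], dtype=np.uint8))
```

Each helper is short, but every project ends up carrying its own copy; that repetition is what a single entry point is meant to remove.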


Installation

pip install fastnum

Or from source:

git clone https://github.com/your-username/fastnum.git
cd fastnum
pip install -e ".[dev]"

Requirements: Python ≥ 3.9, numpy ≥ 1.24, opencv-python ≥ 4.8.


Quick start

from fastnum import FastNum

fn = FastNum()

# --- Image -----------------------------------------------------------
pixels = fn.to_num("photo.jpg")          # (H, W, 3) float32, values in [0, 1]
pixels = fn.to_num("photo.jpg", image_size=(224, 224))  # resize on the fly

# Batch of images (all resized to the same shape for stacking)
batch = fn.to_num(["a.jpg", "b.jpg"], image_size=(224, 224))  # (2, 224, 224, 3)

# --- Plain text ------------------------------------------------------
tokens = fn.to_num("the cat sat on the mat")   # int32 array of token IDs
print(fn.decode(tokens))                        # → "the cat sat on the mat"

# --- Category list ---------------------------------------------------
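labels = ["dog", "cat", "dog", "bird"]
one_hot = fn.to_num(labels)
# One row per label, one column per distinct class (3 here).
# The column order is not documented in this README.
# array([[0., 1., 0.],
#        [1., 0., 0.],
#        [0., 1., 0.],
#        [0., 0., 1.]], dtype=float32)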

# --- Sentence batch --------------------------------------------------
matrix = fn.to_num(["hello world", "foo bar baz"])
# int32 matrix (2, 3), shorter rows are right-padded with pad_token_id

# --- Raw NumPy array -------------------------------------------------
import numpy as np
fn.to_num(np.array([1, 2, 3]))              # returned as float32; shape and values unchanged
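
The sentence-batch case above can be approximated by hand as follows. This is a minimal sketch, not FastNum's implementation: `encode_batch` is a hypothetical helper, whitespace tokenisation is an assumption, and `PAD_ID` simply mirrors the documented default `pad_token_id=0`.

```python
import numpy as np

PAD_ID = 0  # mirrors the documented default pad_token_id

def encode_batch(sentences, vocab):
    """Tokenise on whitespace, grow `vocab` in place, right-pad to the longest row."""
    rows = []
    for sentence in sentences:
        ids = []
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab)  # next free ID
            ids.append(vocab[word])
        rows.append(ids)
    max_len = max((len(r) for r in rows), default=0)
    # Start from a matrix full of padding, then copy each row into its left part.
    matrix = np.full((len(rows), max_len), PAD_ID, dtype=np.int32)
    for i, r in enumerate(rows):
        matrix[i, :len(r)] = r
    return matrix

vocab = {"[PAD]": PAD_ID}
matrix = encode_batch(["hello world", "foo bar baz"], vocab)
```

In this sketch the first row has two tokens and one trailing padding cell, and the second row fills all three columns.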

API reference

FastNum(pad_token_id=0)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| pad_token_id | int | 0 | ID reserved for the [PAD] token. The special token is inserted into the vocabulary at construction time so real words are always assigned different IDs. |

to_num(data, image_size=None) → np.ndarray

| Parameter | Type | Description |
| --- | --- | --- |
| data | str \| list[str] \| np.ndarray | Input to convert. |
| image_size | tuple[int, int] \| None | Target (H, W) for image resizing. |

The dtype and shape of the returned array depend on the input:

| Input | dtype | Shape |
| --- | --- | --- |
| Image path / list of paths | float32 | (H, W, C) / (N, H, W, C) |
| Sentence | int32 | (T,) |
| Category list | float32 | (N, num_classes) |
| Sentence batch | int32 | (N, max_len) |
| np.ndarray | float32 | same as input |

decode(token_ids) → str

Converts a token-ID sequence back to whitespace-separated text. Padding tokens are silently dropped.
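
The behaviour described above could be sketched as below. This is an illustration under assumptions, not the library's source: the standalone `decode` function and its explicit `inverse_vocab` argument are hypothetical, chosen so the example runs without FastNum installed.

```python
def decode(token_ids, inverse_vocab, pad_token_id=0):
    """Join the words for `token_ids` with spaces, skipping padding cells."""
    return " ".join(
        inverse_vocab[int(i)] for i in token_ids if int(i) != pad_token_id
    )

# A hand-built ID-to-word mapping standing in for a trained vocabulary.
inverse_vocab = {0: "[PAD]", 1: "hello", 2: "world"}
text = decode([1, 2, 0], inverse_vocab)
```

Because padding is filtered before the join, a padded row from a sentence batch decodes to the same text as its unpadded original.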

vocab_size → int

Number of entries currently in the vocabulary, including [PAD].


The [PAD] token and collision safety

FastNum reserves pad_token_id inside the vocabulary at construction time:

self.vocab        = {"[PAD]": pad_token_id}
self.inverse_vocab = {pad_token_id: "[PAD]"}

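Because [PAD] occupies a slot before any text is tokenised, the vocabulary is never empty, so _get_or_add, which assigns each new word the ID len(self.vocab), hands out IDs starting at 1. With the default pad_token_id=0, a word ID can therefore never equal pad_token_id. The argument depends on that default: if IDs are assigned exactly as described, a positive pad_token_id of k would be reached by the k-th new word, so a non-default value is only safe if _get_or_add skips it. With the default, this means: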

  • A padded cell in a token matrix will never decode to a real word.
  • decode() does not need a special-case filter beyond i != self.pad_token_id — the two sets are disjoint by construction.
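
The reservation scheme can be reproduced in a few lines to see which IDs it hands out. `TinyVocab` and its `get_or_add` method are hypothetical stand-ins written from the description above, not FastNum's actual class, and the real `_get_or_add` may contain guards this sketch lacks.

```python
class TinyVocab:
    """Hypothetical stand-in for the vocabulary logic described above."""

    def __init__(self, pad_token_id=0):
        self.pad_token_id = pad_token_id
        # [PAD] is registered before any word is seen.
        self.vocab = {"[PAD]": pad_token_id}
        self.inverse_vocab = {pad_token_id: "[PAD]"}

    def get_or_add(self, word):
        """Return the ID for `word`, assigning len(self.vocab) to unseen words."""
        if word not in self.vocab:
            new_id = len(self.vocab)
            self.vocab[word] = new_id
            self.inverse_vocab[new_id] = word
        return self.vocab[word]

# Default pad ID: word IDs start at 1 and never touch 0.
v = TinyVocab()
ids = [v.get_or_add(w) for w in "the cat sat on the mat".split()]

# Non-default pad ID: in this simplified sketch the second new word lands on 2.
w = TinyVocab(pad_token_id=2)
other_ids = [w.get_or_add(x) for x in ["a", "b"]]
```

With the default, `ids` contains no padding ID, which is the disjointness property the bullets rely on. The second instance shows that, in this simplified version, the property is tied to `pad_token_id=0`.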

Development

# Run tests with coverage
pytest

# Lint
ruff check fastnum

# Type-check
mypy fastnum

License

MIT © your-username
