Fast TF-IDF vectorization with Rust-backed preprocessing

is-it-slop-preprocessing

Fast TF-IDF text vectorization for training AI text detection models.

Implemented in Rust with Python bindings.

The Python bindings let you run the same Rust-based text preprocessing at training time and at inference time, so features are computed identically in both.

Features

  • Token n-grams: Uses tiktoken BPE token sequences (not characters/words)
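  • sklearn-style API: fit, transform, and fit_transform methods for training pipelines (fit_transform takes the params and returns both the vectorizer and the matrix, as shown in Quick Start)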
  • Parallel processing: Automatic multi-threading via Rust/rayon

Installation

pip install is-it-slop-preprocessing

Quick Start

from is_it_slop_preprocessing import TfidfVectorizer, VectorizerParams

# Configure vectorizer
params = VectorizerParams(
    ngram_range=(3, 5),  # 3-5 token n-grams
    min_df=10,           # Ignore terms in < 10 docs
    max_df=0.8,          # Ignore terms in > 80% of docs
    sublinear_tf=True    # Apply log scaling to term frequencies
)

# Fit and transform training data
vectorizer, X_train = TfidfVectorizer.fit_transform(train_texts, params)

# Transform test data
X_test = vectorizer.transform(test_texts)

# Save vectorizer for inference
vectorizer.save("tfidf_vectorizer.bin")

API Overview

VectorizerParams

Configuration for text processing:

  • ngram_range: Tuple of (min_n, max_n) for token n-gram range
  • min_df: Minimum document frequency, given as a count (such as 10) or a proportion (such as 0.1)
  • max_df: Maximum document frequency, given as a count (such as 5000) or a proportion (such as 0.8)
  • sublinear_tf: Apply 1 + log(tf) scaling
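
To make the parameters above concrete, here is a minimal pure-Python sketch of document-frequency filtering and sublinear term-frequency scaling. It does not call this package. The helper names (resolve_df_threshold, filter_by_document_frequency, sublinear_tf) are hypothetical, and the rule that an integer means a count while a float means a proportion is an assumption borrowed from the scikit-learn convention, not something confirmed for the Rust implementation.

```python
import math
from collections import Counter


def resolve_df_threshold(value, n_docs):
    """Turn a min_df/max_df setting into an absolute document count.

    Assumption (scikit-learn convention): ints are counts, floats are
    proportions of the corpus.
    """
    if isinstance(value, float):
        return value * n_docs
    return value


def filter_by_document_frequency(docs, min_df, max_df):
    """Keep terms whose document frequency lies within [min_df, max_df]."""
    n_docs = len(docs)
    # Count each term once per document it appears in.
    df = Counter(term for doc in docs for term in set(doc))
    low = resolve_df_threshold(min_df, n_docs)
    high = resolve_df_threshold(max_df, n_docs)
    return {term for term, count in df.items() if low <= count <= high}


def sublinear_tf(raw_count):
    """Apply 1 + log(tf) scaling; a zero count stays zero."""
    return 1.0 + math.log(raw_count) if raw_count > 0 else 0.0


# Toy corpus: each document is already split into terms.
docs = [["a", "b", "c"], ["a", "b"], ["a", "d"], ["a", "b", "e"]]
kept = filter_by_document_frequency(docs, min_df=2, max_df=0.8)
print(sorted(kept))  # "a" is too common, "c", "d", "e" are too rare
print(sublinear_tf(1), round(sublinear_tf(10), 3))
```

With sublinear scaling, a term that occurs ten times in a document weighs only about 3.3 times as much as a term that occurs once, which dampens the effect of repetition.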

TfidfVectorizer

Main vectorizer class:

  • fit_transform(texts, params): Fit and transform in one pass (faster)
  • fit(texts, params): Fit vocabulary only
  • transform(texts): Transform to TF-IDF matrix
  • save(path): Save to bincode format
  • load(path): Load from bincode format

Why Token N-grams?

Instead of character n-grams or word n-grams, this package builds its features from sequences of BPE tokens:

  • With ngram_range=(3, 5), it extracts every run of 3 to 5 consecutive tiktoken tokens
  • Captures AI-text patterns that span multiple sub-word units
  • More compact vocabulary than character n-grams
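
The extraction step described above can be sketched in a few lines of pure Python. This is an illustration of the idea only, not the package's Rust code: the function token_ngrams is a hypothetical name, and the token IDs are made-up values standing in for the output of a BPE tokenizer such as tiktoken.

```python
def token_ngrams(token_ids, ngram_range=(3, 5)):
    """Return every run of n consecutive token IDs for n in [min_n, max_n]."""
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # A sequence shorter than n yields no n-grams of that length.
        for start in range(len(token_ids) - n + 1):
            ngrams.append(tuple(token_ids[start:start + n]))
    return ngrams


# Hypothetical token IDs standing in for a BPE-encoded sentence.
token_ids = [101, 7, 42, 42, 9, 300]
grams = token_ngrams(token_ids)
print(len(grams))
print(grams[0], grams[-1])
```

Each distinct tuple would become one vocabulary entry, so a six-token text contributes four 3-grams, three 4-grams, and two 5-grams before any min_df or max_df filtering.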

License

MIT
