Suur Data

Intelligent data ingestion, filtering, and tokenization pipeline.

Installation

pip install suur-data

See It In Action

One line to fetch, filter and tokenize any web page

from suur_data import suur_data

result = suur_data("https://en.wikipedia.org/wiki/Neural_network", topic="neural networks")
print(result["total_tokens"])

Fetch 3 classic novels and filter by topic in under 10 seconds

from suur_data import suur_data

result = suur_data(
    [
        "https://www.gutenberg.org/cache/epub/1342/pg1342.txt",
        "https://www.gutenberg.org/cache/epub/84/pg84.txt",
        "https://www.gutenberg.org/cache/epub/11/pg11.txt",
    ],
    topic="love",
    workers=3,
)

print(f"Total tokens: {result['total_tokens']}")
print(f"Chunks kept:  {result['num_chunks']}")

That fetches 3 classic novels totalling 1.3 million characters, filters 3000+ paragraphs down to only the relevant ones, and returns a training-ready tokenized dataset in under 10 seconds — one function call.

What It Returns

result = suur_data("data.txt", topic="neural networks")

result["tokens"]        # flat list of all token IDs
result["batch"]         # list of token ID lists, one per chunk
result["chunks"]        # list of kept text chunks as strings
result["num_chunks"]    # number of chunks kept after filtering
result["total_tokens"]  # total token count across all chunks
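How these keys relate to each other can be checked with a few comparisons. The sketch below uses a hand-built dictionary in place of a real call. It assumes, without confirmation from the documentation above, that `tokens` is the per-chunk lists in `batch` concatenated in order. `describe_result` is a hypothetical helper, not part of suur-data.

```python
from itertools import chain


def describe_result(result):
    """Summarise a result dict and check that its keys agree with each other.

    Assumption: result["tokens"] is result["batch"] flattened in order.
    """
    batch = result["batch"]
    flat = list(chain.from_iterable(batch))
    return {
        "chunks_match": len(result["chunks"]) == len(batch) == result["num_chunks"],
        "tokens_match": flat == result["tokens"],
        "total_match": result["total_tokens"] == len(flat),
        "longest_chunk": max((len(ids) for ids in batch), default=0),
    }


# Hand-built stand-in for a real suur_data(...) result (token IDs are made up).
example = {
    "tokens": [11, 12, 13, 21, 22],
    "batch": [[11, 12, 13], [21, 22]],
    "chunks": ["first kept paragraph", "second kept paragraph"],
    "num_chunks": 2,
    "total_tokens": 5,
}

summary = describe_result(example)
print(summary)
```

If any of the three `*_match` flags comes back False on a real result, the flattening assumption does not hold for that version and `batch` should be treated as the source of truth.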

Full Documentation

All Installation Options

# Core — supports .txt, .csv, .json, .html, URLs
pip install suur-data

# Add PDF support (the quotes stop shells such as zsh from expanding the brackets)
pip install "suur-data[pdf]"

# Add Word document support
pip install "suur-data[docx]"

# Add EPUB support
pip install "suur-data[epub]"

# Add HuggingFace pretrained tokenizers
pip install "suur-data[hf]"

# Everything
pip install "suur-data[all]"

Supported Input Formats

Format	Notes
.txt .md .rst	Plain text, auto encoding detection
.pdf	Requires suur-data[pdf]
.docx	Requires suur-data[docx]
.csv .tsv	All cells joined as text
.json	Recursively flattened key-value pairs
.html .htm	Scripts and styles stripped automatically
.epub	Requires suur-data[epub]
HTTP/HTTPS URL	Auto-downloaded, parsed by extension
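The table above says sources are parsed by extension and that JSON is recursively flattened. One way this could work is sketched below. `detect_format`, `flatten_json`, the format labels, and the HTML fallback for URLs without a recognised extension are all assumptions made for illustration, not the library's actual internals.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extension -> format label, following the table above.
FORMATS = {
    ".txt": "text", ".md": "text", ".rst": "text",
    ".pdf": "pdf", ".docx": "docx",
    ".csv": "csv", ".tsv": "csv",
    ".json": "json",
    ".html": "html", ".htm": "html",
    ".epub": "epub",
}


def detect_format(source, default="html"):
    """Return a format label for a file path or an HTTP(S) URL.

    URLs are matched on the extension of their path component; anything
    unrecognised falls back to `default` (an assumption).
    """
    parsed = urlparse(source)
    path = parsed.path if parsed.scheme in ("http", "https") else source
    suffix = PurePosixPath(path).suffix.lower()
    return FORMATS.get(suffix, default)


def flatten_json(value, prefix=""):
    """Recursively flatten nested dicts and lists into 'key: value' lines."""
    if isinstance(value, dict):
        items = [(f"{prefix}.{k}" if prefix else str(k), v) for k, v in value.items()]
    elif isinstance(value, list):
        items = [(f"{prefix}[{i}]", v) for i, v in enumerate(value)]
    else:
        return [f"{prefix}: {value}"]
    lines = []
    for key, child in items:
        lines.extend(flatten_json(child, key))
    return lines


print(detect_format("https://www.gutenberg.org/cache/epub/1342/pg1342.txt"))
print(flatten_json({"a": {"b": 1}, "c": [2, 3]}))
```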

Python API

Single source

from suur_data import suur_data

# From a local file
result = suur_data("data.txt", topic="machine learning")

# From a URL
result = suur_data("https://en.wikipedia.org/wiki/Neuroscience", topic="brain neurons")

Multiple sources — NEW in 1.1.0

from suur_data import suur_data

result = suur_data(
    [
        "data.txt",
        "research_paper.pdf",
        "https://en.wikipedia.org/wiki/Deep_learning",
        "https://en.wikipedia.org/wiki/Artificial_neural_network",
    ],
    topic="neural networks",
)

All sources are downloaded, merged, filtered together and tokenized in one call.

Parallel downloading with workers — NEW in 1.1.0

from suur_data import suur_data

result = suur_data(
    [
        "https://www.gutenberg.org/cache/epub/1342/pg1342.txt",
        "https://www.gutenberg.org/cache/epub/84/pg84.txt",
        "https://www.gutenberg.org/cache/epub/11/pg11.txt",
    ],
    topic="love",
    workers=3,   # downloads all 3 simultaneously
)

Without workers, sources download one at a time. With workers=3, all three download concurrently. The reported speed-up is roughly 40 percent for 3 sources and grows with more sources.
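The difference between sequential and parallel fetching can be illustrated with a thread pool. This is a minimal sketch that assumes downloads are I/O-bound. Whether suur-data uses threads internally is not stated above. `fake_download` and `fetch_all` are hypothetical names, and the sleep stands in for network latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(url, delay=0.1):
    """Stand-in for an HTTP download: sleeps, then returns placeholder text."""
    time.sleep(delay)
    return f"contents of {url}"


def fetch_all(urls, workers=1, fetch=fake_download):
    """Fetch every URL with up to `workers` threads, preserving input order."""
    if workers <= 1:
        return [fetch(url) for url in urls]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(fetch, urls))


urls = [
    "https://example.org/a.txt",
    "https://example.org/b.txt",
    "https://example.org/c.txt",
]

for n in (1, 3):
    start = time.perf_counter()
    texts = fetch_all(urls, workers=n)
    elapsed = time.perf_counter() - start
    print(f"workers={n}: {elapsed:.2f}s for {len(texts)} sources")
```

In this toy setup the parallel run takes about as long as the slowest single download. Real gains depend on how much of the total time is spent downloading rather than filtering and tokenizing.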

Batch output per chunk — NEW in 1.1.0

result = suur_data("data.txt", topic="neural networks")

# Iterate chunk by chunk
for i, (chunk, tokens) in enumerate(zip(result["chunks"], result["batch"])):
    print(f"Chunk {i+1} ({len(tokens)} tokens):")
    print(chunk[:80])
    print(tokens[:10])

Custom BPE tokenizer trained on your data

result = suur_data(
    "data.txt",
    topic="machine learning",
    tokenizer="custom",
    vocab_size=4000,
    save_dir="./my_tokenizer",
)
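For readers unfamiliar with BPE, the training loop can be sketched in a few lines. This is the textbook character-level algorithm, shown for intuition only; it is not suur-data's implementation. `train_bpe`, `merge_pair` and `encode_word` are hypothetical names. In a real tokenizer the number of merges is roughly the target vocab_size minus the size of the starting alphabet.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one joined symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges from whitespace-separated words."""
    # Each word starts as a tuple of characters, weighted by its frequency.
    words = Counter(tuple(word) for word in text.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        updated = Counter()
        for symbols, freq in words.items():
            updated[merge_pair(symbols, best)] += freq
        words = updated
    return merges


def encode_word(word, merges):
    """Apply the learned merges, in training order, to a single word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = train_bpe("low low low lower lowest", num_merges=3)
print(merges)
print(encode_word("lower", merges))
```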

Strict filter — only highly relevant chunks survive

result = suur_data("data.pdf", topic="quantum computing", threshold=0.15)

Loose filter — keep more content

result = suur_data("data.txt", topic="AI", threshold=0.02)

Skip filter entirely

result = suur_data("data.txt", no_filter=True)

Use directly with HuggingFace Transformers

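import torch
from suur_data import suur_data
from transformers import AutoModelForCausalLM

# Tokenize with the GPT-2 tokenizer so the IDs match the model's vocabulary
result = suur_data("data.txt", topic="neural networks", model="gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

for chunk_tokens in result["batch"]:
    # GPT-2 accepts at most 1024 tokens, so longer chunks are truncated
    input_ids = torch.tensor([chunk_tokens[:1024]])
    with torch.no_grad():
        outputs = model(input_ids)
    print(outputs.logits.shape)  # (1, sequence_length, vocab_size)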
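Truncating at 1024 tokens discards the rest of any longer chunk. If that matters, a chunk can instead be split into several windows that each fit the model's context. `split_into_windows` below is a hypothetical helper, not part of suur-data. The optional overlap is a common choice, not a requirement.

```python
def split_into_windows(token_ids, max_len=1024, stride=None):
    """Split one chunk's token IDs into windows of at most `max_len` tokens.

    `stride` is the step between window starts. It defaults to `max_len`
    (no overlap); a smaller stride makes consecutive windows overlap.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    step = max_len if stride is None else stride
    if step <= 0:
        raise ValueError("stride must be positive")
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the chunk
    return windows


print(split_into_windows(list(range(10)), max_len=4))
print(split_into_windows(list(range(10)), max_len=4, stride=2))
```

Each window can then be wrapped in `torch.tensor([...])` and passed to the model in place of the truncated chunk.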

Save and load tokens

import json

from suur_data import suur_data

result = suur_data("data.txt", topic="neural networks")

# Save
with open("tokens.json", "w") as f:
    json.dump(result["tokens"], f)

# Load
with open("tokens.json", "r") as f:
    tokens = json.load(f)
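The flat list loses the chunk boundaries. To keep each chunk's text next to its token IDs, one option is a JSON Lines file with one record per chunk. The sketch below is an assumption about a convenient layout, not a format suur-data defines. `save_chunks_jsonl` and `load_chunks_jsonl` are hypothetical helpers, and the result is a hand-built stand-in.

```python
import json
import os
import tempfile


def save_chunks_jsonl(result, path):
    """Write one JSON object per line: the chunk text and its token IDs."""
    with open(path, "w", encoding="utf-8") as f:
        for text, ids in zip(result["chunks"], result["batch"]):
            record = {"text": text, "tokens": ids}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_chunks_jsonl(path):
    """Read the file back into a list of {"text", "tokens"} records."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hand-built stand-in for a real result (token IDs are made up).
result_stub = {"chunks": ["alpha beta", "gamma"], "batch": [[1, 2], [3]]}

path = os.path.join(tempfile.mkdtemp(), "chunks.jsonl")
save_chunks_jsonl(result_stub, path)
records = load_chunks_jsonl(path)
print(records[0])
```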

Decode tokens back to text

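from transformers import AutoTokenizer

# `result` comes from an earlier suur_data(...) call that used the default model="gpt2".
# Decode with the same tokenizer that produced the IDs.
tok = AutoTokenizer.from_pretrained("gpt2")
text = tok.decode(result["tokens"])
print(text)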

All Parameters

Parameter	Type	Default	Description
data_location	str or List[str]	required	URL, file path, or list of multiple sources
topic	str	""	Subject for relevance filtering. Empty skips filter
tokenizer	str	"pretrained"	"pretrained" or "custom"
model	str	"gpt2"	HuggingFace model name or Hub ID
vocab_size	int	8000	BPE vocab size for custom tokenizer
threshold	float	0.05	Relevance cutoff between 0.0 and 1.0
save_dir	str	None	Directory to save tokenizer files
no_filter	bool	False	Skip the relevance filter
verbose	bool	True	Show progress output
workers	int	1	Number of parallel download workers

Pretrained Model Shortcuts

Shortcut	Model
gpt2	GPT-2 (OpenAI)
bert	BERT base uncased
roberta	RoBERTa base
distilbert	DistilBERT base uncased
t5	T5 small

You can also pass any HuggingFace Hub model ID directly:

result = suur_data("data.txt", model="facebook/opt-125m")

How the Filter Works

The filter splits text into paragraph chunks, using blank lines as boundaries. If no paragraphs are found, it automatically falls back to sentence-level splitting, grouping every 3 sentences into a chunk.
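The splitting step can be sketched as follows. This is an illustration under stated assumptions: "no paragraphs found" is taken to mean that blank-line splitting yields at most one block, and sentences are detected naively by `.`, `!` or `?` followed by whitespace. `split_into_chunks` is a hypothetical name.

```python
import re


def split_into_chunks(text, sentences_per_chunk=3):
    """Split on blank lines; fall back to groups of sentences otherwise."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    if len(paragraphs) > 1:
        return paragraphs
    # Fallback: naive sentence split after ., ! or ? followed by whitespace.
    sentences = [
        s.strip()
        for s in re.split(r"(?<=[.!?])\s+", text.strip())
        if s.strip()
    ]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


print(split_into_chunks("First paragraph.\n\nSecond paragraph."))
print(split_into_chunks("One. Two! Three? Four. Five."))
```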

Each chunk is scored against the topic using TF-IDF cosine similarity. A gentle length penalty is applied to very short chunks. Chunks below the threshold are dropped.
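A dependency-free sketch of this scoring step is shown below. It uses a smoothed IDF and scales chunks shorter than `min_words` by sqrt(n / min_words). The tokenization, the IDF variant, the penalty shape and the default of 20 words are all assumptions chosen for illustration; the library's exact formula is not documented above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric runs as terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_chunks(chunks, topic, min_words=20):
    """Score each chunk against `topic` with TF-IDF cosine similarity."""
    docs = [Counter(tokenize(c)) for c in chunks]
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in doc)  # document frequency

    def idf(term):
        # Smoothed IDF: terms unseen in the chunks still get a finite weight.
        return math.log((1 + n_docs) / (1 + df[term])) + 1.0

    def vector(counts):
        return {t: tf * idf(t) for t, tf in counts.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query = vector(Counter(tokenize(topic)))
    scores = []
    for counts in docs:
        n_words = sum(counts.values())
        # Gentle penalty: only chunks shorter than min_words are scaled down.
        penalty = min(1.0, math.sqrt(n_words / min_words)) if min_words else 1.0
        scores.append(cosine(query, vector(counts)) * penalty)
    return scores


chunks = [
    "Neural networks learn weights by gradient descent over many examples.",
    "The recipe calls for two cups of flour and a pinch of salt.",
]
scores = score_chunks(chunks, topic="neural networks")
print(scores)
```

The second chunk shares no terms with the topic, so its score is exactly zero. The first is positive but reduced by the length penalty because it has fewer than 20 words.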

If the threshold is too strict and every chunk is dropped, the filter auto-relaxes and keeps the top 10 percent of chunks, so you never get empty output.

result = suur_data("data.txt", topic="AI", threshold=0.10)  # strict
result = suur_data("data.txt", topic="AI", threshold=0.02)  # loose
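The threshold and auto-relax behaviour described above can be sketched as a single function over precomputed scores. `apply_threshold` is a hypothetical name. Keeping scores equal to the threshold, rounding the 10 percent up, and keeping at least one chunk are assumptions.

```python
import math


def apply_threshold(scores, threshold=0.05, keep_fraction=0.10):
    """Return the indices of chunks to keep, in their original order.

    If no score reaches `threshold`, keep the top `keep_fraction` of chunks
    (at least one) instead of returning nothing.
    """
    kept = [i for i, s in enumerate(scores) if s >= threshold]
    if kept or not scores:
        return kept
    n_keep = max(1, math.ceil(len(scores) * keep_fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


scores = [0.01, 0.20, 0.03, 0.08]
print(apply_threshold(scores, threshold=0.05))  # normal case
print(apply_threshold(scores, threshold=0.50))  # nothing passes, so auto-relax
```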

Architecture

Source (URL or file or list of sources)
        |
        v
Stage 1 — Ingest
Handles 8 file types and HTTP download.
Parallel downloading with workers parameter.
Merges all sources into one text string.
        |
        v
Stage 2 — Neural Filter
Strips boilerplate headers and footers.
Splits text into paragraph chunks.
Falls back to sentence splitting if no paragraphs found.
Scores each chunk against topic via TF-IDF cosine similarity.
Applies length penalty to very short chunks.
Shows progress bar while scoring.
Drops chunks below the relevance threshold.
Auto relaxes if threshold is too strict.
        |
        v
Stage 3 — Tokenize
Pretrained: HuggingFace AutoTokenizer with caching (loaded once, reused for all chunks).
Custom: trains a BPE tokenizer on the filtered corpus.
        |
        v
{tokens, batch, chunks, num_chunks, total_tokens}
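The boilerplate-stripping step in Stage 2 is described only briefly. For Project Gutenberg plain-text files, which wrap the book between `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` and `*** END OF THE PROJECT GUTENBERG EBOOK ... ***` marker lines, it could look like the sketch below. `strip_gutenberg_boilerplate` is a hypothetical helper, and the regular expressions are an assumption about how the markers are matched.

```python
import re

# Marker lines that surround the book text in Project Gutenberg files.
START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END markers, if present."""
    start = START_RE.search(text)
    if start:
        text = text[start.end():]  # drop the header and the marker line
    end = END_RE.search(text)
    if end:
        text = text[:end.start()]  # drop the marker line and the licence
    return text.strip()


sample = (
    "Title: Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n\n"
    "Chapter 1\n\nBody text.\n\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Licence text follows."
)
print(strip_gutenberg_boilerplate(sample))
```

Text without the markers passes through unchanged apart from whitespace trimming, so the function is safe to apply to every source.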

Changelog

1.1.0 — Major Update

Multiple sources: pass a list of URLs and files, all merged into one dataset
Parallel workers: the workers parameter downloads all sources simultaneously, roughly 40 percent faster for 3 sources
Batch output: the result is now a dict with tokens per chunk, not just a flat list
Tokenizer caching: the pretrained tokenizer loads once and is reused for all chunks
Sentence-level splitting: automatically falls back to sentence chunking if no paragraphs are found
Boilerplate stripping: removes Gutenberg headers and footers before filtering
Length penalty: a gentle penalty on very short chunks improves filter quality
Auto relax: if the threshold drops everything, the top 10 percent of chunks is kept instead of returning empty output

1.0.0 — Initial Release

Single source ingestion from URL or file
8 supported file formats
TF-IDF relevance filter
Pretrained HuggingFace tokenizer
Custom BPE tokenizer
Command-line interface (CLI)

License

MIT

Release history

Version	Released
1.2.0	Jun 2, 2026
1.1.3	Jun 2, 2026
1.1.2	Jun 1, 2026
1.1.1 (this version)	Jun 1, 2026
1.1.0	Jun 1, 2026
1.0.6	May 31, 2026
1.0.5	May 30, 2026
1.0.4	May 30, 2026
1.0.3	May 30, 2026
1.0.2	May 30, 2026
1.0.1	May 30, 2026
1.0.0	May 30, 2026

Download files

Source Distribution

suur_data-1.1.1.tar.gz (15.9 kB)

Uploaded Jun 1, 2026 Source

Built Distribution

suur_data-1.1.1-py3-none-any.whl (14.6 kB)

Uploaded Jun 1, 2026 Python 3

File details

Details for the file suur_data-1.1.1.tar.gz.

File metadata

Download URL: suur_data-1.1.1.tar.gz
Upload date: Jun 1, 2026
Size: 15.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for suur_data-1.1.1.tar.gz
Algorithm	Hash digest
SHA256	`864c24b9640219ca46fbcbcc00d822a937175f388b692b78b1448fc286e22256`
MD5	`bcd5eab0a0a74ca44767fc838579494e`
BLAKE2b-256	`fc45e0de2373696ea89edc458e9d40bd2711315b3d5843e9f138f99b1fdfa0ec`

File details

Details for the file suur_data-1.1.1-py3-none-any.whl.

File metadata

Download URL: suur_data-1.1.1-py3-none-any.whl
Upload date: Jun 1, 2026
Size: 14.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for suur_data-1.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`97507b986c3a00b6e3424acda442ef0e1756c2856f59ad4cb24226fb60b47c33`
MD5	`0abd80b06fa9f53e97fb6e2cfc013d91`
BLAKE2b-256	`eb1e9ec16e985fe6a706d823a677efcac3be54fa1c2d4a8e266a744e04e32979`
