Suur Data

Intelligent data ingestion, filtering, and tokenization pipeline.

Installation

pip install suur-data

See It In Action

Single Source

from suur_data import suur_data

tokens = suur_data("https://en.wikipedia.org/wiki/Neural_network", topic="neural networks")
print(tokens)

Multiple Sources

from suur_data import suur_data

tokens = suur_data(
    [
        "data.txt",
        "research_paper.pdf",
        "https://en.wikipedia.org/wiki/Deep_learning",
        "https://en.wikipedia.org/wiki/Artificial_neural_network",
    ],
    topic="neural networks",
    threshold=0.05,
)
# suur_data returns a dict (see Batch Output), so read the count from its key
print(f"Total tokens: {tokens['total_tokens']}")

All sources are loaded (URLs are downloaded first), merged into one text, filtered together, and tokenized in a single call.

Full Documentation

All Installation Options

pip install suur-data
pip install "suur-data[pdf]"
pip install "suur-data[docx]"
pip install "suur-data[epub]"
pip install "suur-data[hf]"
pip install "suur-data[all]"

Supported Input Formats

Format	Notes
.txt .md .rst	Plain text, auto encoding detection
.pdf	Requires suur-data[pdf]
.docx	Requires suur-data[docx]
.csv .tsv	All cells joined as text
.json	Recursively flattened key-value pairs
.html .htm	Scripts and styles stripped automatically
.epub	Requires suur-data[epub]
HTTP/HTTPS URL	Auto-downloaded, parsed by extension
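
The .json row above says nested data is "recursively flattened" into key-value pairs. The package's actual flattening code is not shown here, so the following is only a minimal sketch of what such a step could look like. The function names `flatten_json` and `json_to_text`, the dotted-key format, and the `[index]` notation for lists are all assumptions made for illustration.

```python
import json


def flatten_json(value, prefix=""):
    """Yield (path, scalar) pairs from nested dicts and lists (hypothetical helper)."""
    if isinstance(value, dict):
        for key, child in value.items():
            # Join nested keys with a dot: {"meta": {"year": 1958}} -> "meta.year"
            yield from flatten_json(child, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            # Mark list positions with an index: {"tags": ["ml"]} -> "tags[0]"
            yield from flatten_json(child, f"{prefix}[{index}]")
    else:
        yield prefix, value


def json_to_text(raw):
    """Render a JSON document as one 'key: value' line per leaf value."""
    return "\n".join(f"{key}: {val}" for key, val in flatten_json(json.loads(raw)))


if __name__ == "__main__":
    print(json_to_text('{"title": "Perceptron", "tags": ["ml", "history"], "meta": {"year": 1958}}'))
```

Flattening to plain lines like this keeps both keys and values available to a text-based relevance filter, which may be why the format is handled this way.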

Python API

from suur_data import suur_data

# From a URL
tokens = suur_data("https://en.wikipedia.org/wiki/Neuroscience", topic="brain neurons")

# From a local file
tokens = suur_data("data.txt", topic="machine learning")

# Multiple sources at once
tokens = suur_data(
    ["data.txt", "paper.pdf", "https://en.wikipedia.org/wiki/Deep_learning"],
    topic="neural networks"
)

# Custom BPE tokenizer trained on your data
tokens = suur_data("data.txt", topic="machine learning", tokenizer="custom", vocab_size=4000)

# Strict filter
tokens = suur_data("data.pdf", topic="quantum computing", threshold=0.15)

# Save tokenizer to disk
tokens = suur_data("data.txt", topic="biology", save_dir="./my_tokenizer")

# Skip filter entirely
tokens = suur_data("data.txt", no_filter=True)

Batch Output

from suur_data import suur_data

result = suur_data("data.txt", topic="neural networks")

print(result["total_tokens"])    # total token count
print(result["num_chunks"])      # number of chunks kept

# Iterate chunk by chunk
for i, (chunk, tokens) in enumerate(zip(result["chunks"], result["batch"])):
    print(f"Chunk {i+1} ({len(tokens)} tokens):")
    print(chunk[:80])
    print(tokens[:10])

Use Directly With Transformers

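import torch
from suur_data import suur_data
from transformers import AutoModelForCausalLM

# Tokenize with the same vocabulary as the model that will consume the ids
result = suur_data("data.txt", topic="neural networks", model="gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

for chunk_tokens in result["batch"]:
    # GPT-2 accepts at most 1024 positions, so longer chunks are truncated
    input_ids = torch.tensor([chunk_tokens[:1024]])
    with torch.no_grad():
        outputs = model(input_ids)
    print(outputs.logits.shape)  # (1, sequence_length, vocab_size)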
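
The loop above slices each chunk to its first 1024 token ids, so anything beyond GPT-2's context length is dropped. If the full chunk matters, one option is to split it into consecutive windows instead. The helper below is a sketch and not part of suur-data. The name `split_into_windows` and the `max_len` parameter are hypothetical.

```python
def split_into_windows(token_ids, max_len=1024):
    """Split a list of token ids into consecutive windows of at most max_len ids."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    # Step through the list in strides of max_len; the last window may be shorter.
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]


if __name__ == "__main__":
    windows = split_into_windows(list(range(2500)))
    print([len(w) for w in windows])
```

Each window can then be wrapped as `torch.tensor([window])` and passed to the model in place of the truncated slice.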

All Parameters

Parameter	Type	Default	Description
data_location	str or List[str]	required	URL, file path, or list of multiple sources
topic	str	""	Subject for relevance filtering. An empty string skips the filter
tokenizer	str	"pretrained"	"pretrained" or "custom"
model	str	"gpt2"	HuggingFace model name or Hub ID
vocab_size	int	8000	BPE vocab size for custom tokenizer
threshold	float	0.05	Relevance cutoff between 0.0 and 1.0
save_dir	str or None	None	Directory to save tokenizer files. None saves nothing
no_filter	bool	False	Skip the relevance filter
verbose	bool	True	Show progress output

Pretrained Model Shortcuts

Shortcut	Model
gpt2	GPT-2 (OpenAI)
bert	BERT base uncased
roberta	RoBERTa base
distilbert	DistilBERT base uncased
t5	T5 small
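
How the shortcuts are resolved internally is not documented here. A plausible sketch is a lookup table that falls back to the name as given, so full Hub IDs keep working. The shortcut names below come from the table above. The Hub IDs on the right are the commonly used checkpoint names matching each description and are an assumption, as are the names `MODEL_SHORTCUTS` and `resolve_model`.

```python
# Hypothetical mapping from shortcut to HuggingFace Hub ID (assumed, not
# taken from the suur-data source).
MODEL_SHORTCUTS = {
    "gpt2": "gpt2",
    "bert": "bert-base-uncased",
    "roberta": "roberta-base",
    "distilbert": "distilbert-base-uncased",
    "t5": "t5-small",
}


def resolve_model(name):
    """Return the Hub ID for a shortcut, or the name unchanged if it is not one."""
    return MODEL_SHORTCUTS.get(name, name)


if __name__ == "__main__":
    print(resolve_model("bert"))
    print(resolve_model("my-org/my-model"))
```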

Architecture

Source (URL or file or list of sources)
        |
        v
Stage 1 — Ingest
Handles 8 file types and HTTP download.
Merges all sources into one text string.
        |
        v
Stage 2 — Neural Filter
Splits text into paragraph chunks.
Scores each chunk against topic via TF-IDF cosine similarity.
Shows progress bar while scoring.
Drops chunks below the relevance threshold.
        |
        v
Stage 3 — Tokenize
Pretrained: HuggingFace AutoTokenizer (GPT-2, BERT, etc.)
Custom: trains a BPE tokenizer on the filtered corpus.
        |
        v
{tokens, batch, chunks, num_chunks, total_tokens}
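
Stage 2 scores paragraph chunks against the topic with TF-IDF cosine similarity and drops those below the threshold. The sketch below illustrates that technique with the standard library only. It is not the package's implementation: the function names, the regex tokenizer, the blank-line paragraph split, and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for c in counts for term in c)
    # Smoothed IDF: a term found in every document still gets a non-zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def filter_chunks(text, topic, threshold=0.05):
    """Keep paragraphs whose similarity to the topic is at least threshold."""
    chunks = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Vectorize chunks and topic together so they share one vocabulary and IDF.
    vectors = tfidf_vectors(chunks + [topic])
    topic_vec = vectors[-1]
    return [c for c, v in zip(chunks, vectors) if cosine(v, topic_vec) >= threshold]


if __name__ == "__main__":
    sample = (
        "Neural networks learn weights by gradient descent.\n\n"
        "The recipe calls for two cups of flour."
    )
    print(filter_chunks(sample, "neural networks"))
```

In this sketch a chunk that shares no terms with the topic scores exactly 0.0, so any positive threshold removes it. Raising the threshold, as in the "Strict filter" example, keeps only chunks dominated by topic terms.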

License

MIT

Release history

Version	Released
1.2.0	Jun 2, 2026
1.1.3	Jun 2, 2026
1.1.2	Jun 1, 2026
1.1.1	Jun 1, 2026
1.1.0	Jun 1, 2026
1.0.6 (this version)	May 31, 2026
1.0.5	May 30, 2026
1.0.4	May 30, 2026
1.0.3	May 30, 2026
1.0.2	May 30, 2026
1.0.1	May 30, 2026
1.0.0	May 30, 2026

Download files

Source Distribution

suur_data-1.0.6.tar.gz (12.6 kB)

Uploaded May 31, 2026 Source

Built Distribution

suur_data-1.0.6-py3-none-any.whl (12.5 kB)

Uploaded May 31, 2026 Python 3

File details

Details for the file suur_data-1.0.6.tar.gz.

File metadata

Download URL: suur_data-1.0.6.tar.gz
Upload date: May 31, 2026
Size: 12.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for suur_data-1.0.6.tar.gz
Algorithm	Hash digest
SHA256	`6d9c7a167fb52efeb797fffa142524a170bf97c61bb299b57d45b5594c2e44da`
MD5	`10a2ff742cfaa4748d90ff25e911065c`
BLAKE2b-256	`90a26a6db266bc3cd14909c16a4ee7fb515fa7a5975a1ae84a99ead5eff5a61f`

File details

Details for the file suur_data-1.0.6-py3-none-any.whl.

File metadata

Download URL: suur_data-1.0.6-py3-none-any.whl
Upload date: May 31, 2026
Size: 12.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for suur_data-1.0.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`779fc6b980bcde7d0bd217b4f37ef2f65cbbf131f100f41c277132f0ceb3e362`
MD5	`d68850962d255909375bbec0aa167bab`
BLAKE2b-256	`f718a56c3782fc80b985024febead967715037ca70f84adacd8a5448c19623c0`
