Suur Data
Intelligent data ingestion, filtering, and tokenization pipeline.
Installation
pip install suur-data
See It In Action
Single Source
from suur_data import suur_data
tokens = suur_data("https://en.wikipedia.org/wiki/Neural_network", topic="neural networks")
print(tokens)
Multiple Sources
from suur_data import suur_data
tokens = suur_data(
    [
        "data.txt",
        "research_paper.pdf",
        "https://en.wikipedia.org/wiki/Deep_learning",
        "https://en.wikipedia.org/wiki/Artificial_neural_network",
    ],
    topic="neural networks",
    threshold=0.05,
)
print(f"Total tokens: {len(tokens)}")
All sources are downloaded, merged, filtered together, and tokenized in one call.
Full Documentation
All Installation Options
# Core — supports .txt, .csv, .json, .html, URLs
pip install suur-data
# Add PDF support
pip install suur-data[pdf]
# Add Word document support
pip install suur-data[docx]
# Add EPUB support
pip install suur-data[epub]
# Add HuggingFace pretrained tokenizers
pip install suur-data[hf]
# Everything
pip install suur-data[all]
Supported Input Formats
| Format | Notes |
|---|---|
| .txt .md .rst | Plain text, auto encoding detection |
| .pdf | Requires suur-data[pdf] |
| .docx | Requires suur-data[docx] |
| .csv .tsv | All cells joined as text |
| .json | Recursively flattened key-value pairs |
| .html .htm | Scripts and styles stripped automatically |
| .epub | Requires suur-data[epub] |
| HTTP/HTTPS URL | Auto-downloaded, parsed by extension |
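The table notes that JSON is "recursively flattened" and that CSV cells are "joined as text". The sketch below shows one way such a conversion could work using only the standard library. The helper names `flatten_json` and `csv_to_text` and the `key.path: value` output layout are assumptions made for illustration; the package's actual output format is not documented here.

```python
import csv
import io
import json


def flatten_json(value, prefix=""):
    """Yield "key.path: value" strings for every leaf of a parsed JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            yield from flatten_json(child, path)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten_json(child, f"{prefix}[{index}]")
    else:
        # Leaf value (string, number, bool, or null)
        yield f"{prefix}: {value}"


def csv_to_text(raw, delimiter=","):
    """Join the non-empty cells of each CSV/TSV row with spaces, one row per line."""
    rows = csv.reader(io.StringIO(raw), delimiter=delimiter)
    return "\n".join(" ".join(cell for cell in row if cell) for row in rows)


# Hypothetical inputs, used only to exercise the helpers
doc = json.loads('{"model": {"name": "gpt2", "layers": [12, 24]}}')
json_text = "\n".join(flatten_json(doc))
csv_text = csv_to_text("name,score\nalpha,0.9\n")
print(json_text)
print(csv_text)
```

Passing `delimiter="\t"` to `csv_to_text` would cover the .tsv case in the same way.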
Python API
from suur_data import suur_data
# From a URL
tokens = suur_data("https://en.wikipedia.org/wiki/Neuroscience", topic="brain neurons")
# From a local file
tokens = suur_data("data.txt", topic="machine learning")
# Multiple sources at once
tokens = suur_data(
    ["data.txt", "paper.pdf", "https://en.wikipedia.org/wiki/Deep_learning"],
    topic="neural networks",
)
# Custom BPE tokenizer trained on your data
tokens = suur_data("data.txt", topic="machine learning", tokenizer="custom", vocab_size=4000)
# Strict filter — only highly relevant chunks survive
tokens = suur_data("data.pdf", topic="quantum computing", threshold=0.15)
# Save tokenizer to disk for reuse
tokens = suur_data("data.txt", topic="biology", save_dir="./my_tokenizer")
# Skip filter entirely
tokens = suur_data("data.txt", no_filter=True)
All Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| data_location | str or List[str] | required | URL, file path, or list of multiple sources |
| topic | str | "" | Subject for relevance filtering. Empty skips filter |
| tokenizer | str | "pretrained" | "pretrained" or "custom" |
| model | str | "gpt2" | HuggingFace model name or Hub ID |
| vocab_size | int | 8000 | BPE vocab size for custom tokenizer |
| threshold | float | 0.05 | Relevance cutoff between 0.0 and 1.0 |
| save_dir | str | None | Directory to save tokenizer files |
| no_filter | bool | False | Skip the relevance filter |
| verbose | bool | True | Show progress output |
Pretrained Model Shortcuts
| Shortcut | Model |
|---|---|
| gpt2 | GPT-2 (OpenAI) |
| bert | BERT base uncased |
| roberta | RoBERTa base |
| distilbert | DistilBERT base uncased |
| t5 | T5 small |
You can also pass any HuggingFace Hub model ID directly:
tokens = suur_data("data.txt", model="facebook/opt-125m")
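A shortcut table like the one above is typically resolved with a plain lookup that falls through to the raw string when no shortcut matches. The sketch below is an assumption about how that could look, not the package's code: `MODEL_SHORTCUTS` and `resolve_model` are hypothetical names, and the Hub IDs are the commonly used checkpoints that match the table's descriptions.

```python
# Hypothetical mapping from the documented shortcuts to Hugging Face Hub IDs.
# The IDs on the right are assumed from the table's descriptions.
MODEL_SHORTCUTS = {
    "gpt2": "gpt2",
    "bert": "bert-base-uncased",
    "roberta": "roberta-base",
    "distilbert": "distilbert-base-uncased",
    "t5": "t5-small",
}


def resolve_model(name):
    """Return the Hub ID for a shortcut, or the name unchanged if it is not one."""
    return MODEL_SHORTCUTS.get(name, name)


print(resolve_model("bert"))               # shortcut is expanded
print(resolve_model("facebook/opt-125m"))  # full Hub ID passes through
```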
How the Filter Works
The filter splits the text into paragraph chunks, converts each chunk and the topic into TF-IDF vectors, and scores each chunk against the topic using cosine similarity. Chunks that score below the threshold are dropped. If the threshold is so strict that every chunk would be dropped, the filter relaxes automatically and keeps the top 30 percent of chunks by score.
tokens = suur_data("data.txt", topic="AI", threshold=0.10) # strict
tokens = suur_data("data.txt", topic="AI", threshold=0.02) # loose
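The scoring and fallback described in this section can be sketched with the standard library alone. This is an illustration under stated assumptions, not the package's implementation: the function names, the smoothed IDF formula, the blank-line paragraph split, and the rounding used for the 30 percent fallback are all choices made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight dict) per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for c in counts for term in c)
    # Smoothed IDF (an assumption) keeps terms found in every document non-zero.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_chunks(text, topic, threshold=0.05, fallback_fraction=0.30):
    """Keep paragraphs scoring at or above threshold, else the top fraction."""
    chunks = [c.strip() for c in re.split(r"\n\s*\n", text) if c.strip()]
    if not chunks:
        return []
    # Vectorize chunks and topic together so they share one vocabulary and IDF.
    *chunk_vecs, topic_vec = tfidf_vectors(chunks + [topic])
    scores = [cosine(vec, topic_vec) for vec in chunk_vecs]
    kept = [c for c, s in zip(chunks, scores) if s >= threshold]
    if kept:
        return kept
    # Fallback: nothing passed, so keep the best-scoring fraction in original order.
    n_keep = max(1, math.ceil(len(chunks) * fallback_fraction))
    top = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:n_keep]
    return [chunks[i] for i in sorted(top)]


# Hypothetical three-paragraph input
sample = (
    "Neural networks learn weights by gradient descent.\n\n"
    "The recipe needs two eggs and flour.\n\n"
    "Deep neural networks stack many layers."
)
relevant = filter_chunks(sample, "neural networks", threshold=0.05)
fallback = filter_chunks(sample, "neural networks", threshold=0.99)
print(relevant)
print(fallback)
```

With the default threshold the cooking paragraph shares no terms with the topic and is dropped. With the unreachable threshold of 0.99 no chunk passes, so the fallback keeps the single best-scoring chunk.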
Saving and Loading Tokens
import json
from suur_data import suur_data
tokens = suur_data("data.txt", topic="neural networks")
# Token IDs are plain integers, so they serialize directly to JSON
with open("tokens.json", "w") as f:
    json.dump(tokens, f)
with open("tokens.json", "r") as f:
    tokens = json.load(f)
Decoding Tokens Back to Text
from transformers import AutoTokenizer
# Load the same tokenizer that produced the token IDs (the default model is gpt2)
tok = AutoTokenizer.from_pretrained("gpt2")
text = tok.decode(tokens)
print(text)
Architecture
Source (URL or file or list of sources)
|
v
Stage 1 — Ingest
Handles 8 file types and HTTP download.
Merges all sources into one text string.
|
v
Stage 2 — Neural Filter
Splits text into paragraph chunks.
Scores each chunk against topic via TF-IDF cosine similarity.
Shows progress bar while scoring.
Drops chunks below the relevance threshold.
|
v
Stage 3 — Tokenize
Pretrained: HuggingFace AutoTokenizer (GPT-2, BERT, etc.)
Custom: trains a BPE tokenizer on the filtered corpus.
|
v
List[int] — token IDs
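Stage 3 mentions training a BPE tokenizer on the filtered corpus. For readers unfamiliar with byte-pair encoding, the sketch below shows the classic character-level merge-learning loop in plain Python. It is a teaching example, not the package's tokenizer: `train_bpe`, `merge_pair`, and `encode_word` are hypothetical names, and a real trainer would also handle byte-level fallback, special tokens, and mapping symbols to integer IDs. Roughly speaking, a target such as `vocab_size` bounds the number of base symbols plus learned merges.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of pair in symbols with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to num_merges merge rules from whitespace-separated words."""
    # Each word starts as its characters plus an end-of-word marker.
    words = Counter(tuple(word) + ("</w>",) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        # Ties go to the pair that was seen first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = Counter()
        for symbols, freq in words.items():
            merged[merge_pair(symbols, best)] += freq
        words = merged
    return merges


def encode_word(word, merges):
    """Split a word into subword symbols by replaying the merges in learned order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus
merges = train_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(encode_word("lower", merges))
```

On this toy corpus the two most frequent pairs build up the shared stem "low", so unseen words containing that stem are encoded with a single symbol for it.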
License
MIT
File details
Details for the file suur_data-1.0.5.tar.gz.
File metadata
- Download URL: suur_data-1.0.5.tar.gz
- Upload date:
- Size: 12.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5533244e445a678f058c9e5563e11cf02efc75e4f30f2489e2c29faac62b4c9b |
| MD5 | 7ee8d87a67b9f578b34666889ea2ee78 |
| BLAKE2b-256 | 05a7946042902d152e65a80f3f04ed75668baa7a853c6c19fa20b9a9dcf890ed |
File details
Details for the file suur_data-1.0.5-py3-none-any.whl.
File metadata
- Download URL: suur_data-1.0.5-py3-none-any.whl
- Upload date:
- Size: 12.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f985b9ec534383109820ea4a6196067fefd88e3a0a2e243d726f3462dbbb73f8 |
| MD5 | 9f0c23290a0ca3b5d542db87428faabb |
| BLAKE2b-256 | 4ec9be07722a6e6050c6e9a351ef6272fbdbe00e89fb8dce17e6821a4e58b1a0 |