Intelligent data ingestion and tokenization pipeline
Suur Data
Intelligent data ingestion, filtering, and tokenization pipeline.
Installation
```bash
pip install suur-data
```
See It In Action
Single Source
```python
from suur_data import suur_data

tokens = suur_data("https://en.wikipedia.org/wiki/Neural_network", topic="neural networks")
print(tokens)
```
Multiple Sources
```python
from suur_data import suur_data

result = suur_data(
    [
        "data.txt",
        "research_paper.pdf",
        "https://en.wikipedia.org/wiki/Deep_learning",
        "https://en.wikipedia.org/wiki/Artificial_neural_network",
    ],
    topic="neural networks",
    threshold=0.05,
)
# The call returns a dict; the count is stored under "total_tokens"
print(f"Total tokens: {result['total_tokens']}")
```
All sources are downloaded, merged, filtered together, and tokenized in one call.
Full Documentation
All Installation Options
```bash
pip install suur-data
# Quote the extras so shells such as zsh do not expand the brackets
pip install "suur-data[pdf]"
pip install "suur-data[docx]"
pip install "suur-data[epub]"
pip install "suur-data[hf]"
pip install "suur-data[all]"
```
Supported Input Formats
| Format | Notes |
|---|---|
| .txt .md .rst | Plain text, auto encoding detection |
| .pdf | Requires suur-data[pdf] |
| .docx | Requires suur-data[docx] |
| .csv .tsv | All cells joined as text |
| .json | Recursively flattened key-value pairs |
| .html .htm | Scripts and styles stripped automatically |
| .epub | Requires suur-data[epub] |
| HTTP/HTTPS URL | Auto-downloaded, parsed by extension |
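
Three of the rules in the table (CSV cells joined as text, JSON flattened recursively, HTML scripts and styles stripped) can be illustrated with the standard library alone. This is a minimal sketch of the idea, not suur-data's actual parsers: the function names `csv_to_text`, `flatten_json`, and `html_to_text`, and the `key.path: value` output format, are assumptions made for illustration.

```python
import csv
import io
import json
from html.parser import HTMLParser


def csv_to_text(raw: str, delimiter: str = ",") -> str:
    """Join every non-empty cell of a CSV/TSV document into one string."""
    rows = csv.reader(io.StringIO(raw), delimiter=delimiter)
    return " ".join(cell.strip() for row in rows for cell in row if cell.strip())


def flatten_json(value, prefix: str = "") -> list[str]:
    """Recursively flatten nested JSON into 'path: value' lines."""
    if isinstance(value, dict):
        lines = []
        for key, child in value.items():
            lines.extend(flatten_json(child, f"{prefix}.{key}" if prefix else str(key)))
        return lines
    if isinstance(value, list):
        lines = []
        for index, child in enumerate(value):
            lines.extend(flatten_json(child, f"{prefix}[{index}]"))
        return lines
    return [f"{prefix}: {value}"]


class _TextOnly(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(raw: str) -> str:
    """Return the visible text of an HTML document as one string."""
    parser = _TextOnly()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.parts)


if __name__ == "__main__":
    print(csv_to_text("a,b\nc,d\n"))
    print(flatten_json(json.loads('{"a": {"b": 1}, "c": [2, 3]}')))
    print(html_to_text("<p>Hello</p><script>var x = 1;</script><p>world</p>"))
```

Whatever the library does internally, the point is the same: every format is reduced to plain text before filtering, so the later stages never need to know where a chunk came from.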
Python API
```python
from suur_data import suur_data

# From a URL
tokens = suur_data("https://en.wikipedia.org/wiki/Neuroscience", topic="brain neurons")

# From a local file
tokens = suur_data("data.txt", topic="machine learning")

# Multiple sources at once
tokens = suur_data(
    ["data.txt", "paper.pdf", "https://en.wikipedia.org/wiki/Deep_learning"],
    topic="neural networks",
)

# Custom BPE tokenizer trained on your data
tokens = suur_data("data.txt", topic="machine learning", tokenizer="custom", vocab_size=4000)

# Strict filter
tokens = suur_data("data.pdf", topic="quantum computing", threshold=0.15)

# Save tokenizer to disk
tokens = suur_data("data.txt", topic="biology", save_dir="./my_tokenizer")

# Skip filter entirely
tokens = suur_data("data.txt", no_filter=True)
```
Batch Output
```python
from suur_data import suur_data

result = suur_data("data.txt", topic="neural networks")
print(result["total_tokens"])  # total token count
print(result["num_chunks"])    # number of chunks kept

# Iterate chunk by chunk
for i, (chunk, tokens) in enumerate(zip(result["chunks"], result["batch"])):
    print(f"Chunk {i+1} ({len(tokens)} tokens):")
    print(chunk[:80])
    print(tokens[:10])
```
Use Directly With Transformers
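```python
import torch
from transformers import AutoModelForCausalLM

from suur_data import suur_data

# Tokenize with the GPT-2 tokenizer so the IDs match the model's vocabulary
result = suur_data("data.txt", topic="neural networks", model="gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

for chunk_tokens in result["batch"]:
    # GPT-2 accepts at most 1024 positions, so longer chunks are truncated
    input_ids = torch.tensor([chunk_tokens[:1024]])
    with torch.no_grad():
        outputs = model(input_ids)
    print(outputs.logits.shape)  # (1, sequence_length, vocab_size)
```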
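
The loop above keeps only the first 1024 token IDs of each chunk and discards the rest. If the tail of long chunks matters, one option is to split each chunk into consecutive windows instead. The helper below is a sketch of that idea; `token_windows` and its `overlap` parameter are hypothetical and not part of suur-data.

```python
from typing import Iterator, List


def token_windows(token_ids: List[int], size: int = 1024, overlap: int = 0) -> Iterator[List[int]]:
    """Yield consecutive windows of at most `size` IDs.

    Neighbouring windows share `overlap` IDs, which gives the model some
    left context at the start of every window after the first.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    for start in range(0, len(token_ids), step):
        yield token_ids[start:start + size]
        # Stop once a window has reached the end of the sequence
        if start + size >= len(token_ids):
            break


if __name__ == "__main__":
    ids = list(range(10))
    for window in token_windows(ids, size=4, overlap=2):
        print(window)
```

In the Transformers loop, `chunk_tokens[:1024]` would then become an inner loop over `token_windows(chunk_tokens)`, with one forward pass per window.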
All Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| data_location | str or List[str] | required | URL, file path, or list of multiple sources |
| topic | str | "" | Subject for relevance filtering. Empty skips filter |
| tokenizer | str | "pretrained" | "pretrained" or "custom" |
| model | str | "gpt2" | HuggingFace model name or Hub ID |
| vocab_size | int | 8000 | BPE vocab size for custom tokenizer |
| threshold | float | 0.05 | Relevance cutoff between 0.0 and 1.0 |
| save_dir | str | None | Directory to save tokenizer files |
| no_filter | bool | False | Skip the relevance filter |
| verbose | bool | True | Show progress output |
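
To make the role of `vocab_size` concrete, here is a toy byte-pair-encoding trainer in pure Python. It is a sketch of the general BPE technique under simplifying assumptions (whitespace-split words, character-level start symbols, no special tokens) and is not the trainer suur-data uses; `train_bpe`, `merge_pair`, and `encode` are illustrative names.

```python
from collections import Counter


def merge_pair(word, pair):
    """Replace every adjacent occurrence of `pair` in `word` with one merged symbol."""
    merged, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            merged.append(word[i] + word[i + 1])
            i += 2
        else:
            merged.append(word[i])
            i += 1
    return tuple(merged)


def train_bpe(corpus: str, vocab_size: int):
    """Learn merges until the vocabulary has `vocab_size` symbols or no pairs remain."""
    # Each distinct word is a tuple of symbols, starting from single characters
    words = Counter(tuple(word) for word in corpus.split())
    vocab = {symbol for word in words for symbol in word}
    merges = []
    while len(vocab) < vocab_size:
        pairs = Counter()
        for word, freq in words.items():
            for left, right in zip(word, word[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        # Most frequent pair; ties go to the lexicographically largest pair
        best = max(pairs, key=lambda pair: (pairs[pair], pair))
        merges.append(best)
        vocab.add(best[0] + best[1])
        new_words = Counter()
        for word, freq in words.items():
            new_words[merge_pair(word, best)] += freq
        words = new_words
    return merges, vocab


def encode(word: str, merges):
    """Apply the learned merges, in training order, to a single word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    merges, vocab = train_bpe("low low low lower lowest", vocab_size=9)
    print(merges)
    print(encode("lowest", merges))
```

A larger `vocab_size` allows more merges, so frequent words collapse into fewer, longer tokens; a smaller one keeps the vocabulary compact at the cost of longer token sequences.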
Pretrained Model Shortcuts
| Shortcut | Model |
|---|---|
| gpt2 | GPT-2 (OpenAI) |
| bert | BERT base uncased |
| roberta | RoBERTa base |
| distilbert | DistilBERT base uncased |
| t5 | T5 small |
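
A shortcut table like this is typically resolved with a plain lookup that falls back to the name as given, so full Hub IDs still work. The sketch below assumes that behaviour; the `SHORTCUTS` mapping and `resolve_model` are hypothetical, and the Hub IDs on the right are the commonly used checkpoints for each model family rather than values confirmed by suur-data.

```python
# Assumed mapping from shortcut to Hugging Face Hub ID (not taken from suur-data)
SHORTCUTS = {
    "gpt2": "gpt2",
    "bert": "bert-base-uncased",
    "roberta": "roberta-base",
    "distilbert": "distilbert-base-uncased",
    "t5": "t5-small",
}


def resolve_model(name: str) -> str:
    """Return the Hub ID for a known shortcut, or the name unchanged otherwise."""
    return SHORTCUTS.get(name.strip().lower(), name)


if __name__ == "__main__":
    print(resolve_model("bert"))
    print(resolve_model("some-org/some-model"))
```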
Architecture
```text
Source (URL or file or list of sources)
        |
        v
Stage 1: Ingest
    Handles 8 file types and HTTP download.
    Merges all sources into one text string.
        |
        v
Stage 2: Neural Filter
    Splits text into paragraph chunks.
    Scores each chunk against topic via TF-IDF cosine similarity.
    Shows progress bar while scoring.
    Drops chunks below the relevance threshold.
        |
        v
Stage 3: Tokenize
    Pretrained: HuggingFace AutoTokenizer (GPT-2, BERT, etc.)
    Custom: trains a BPE tokenizer on the filtered corpus.
        |
        v
{tokens, batch, chunks, num_chunks, total_tokens}
```
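
The Stage 2 steps (split into paragraphs, score each against the topic with TF-IDF cosine similarity, drop low scorers) can be sketched in pure Python. This is an illustration of the technique under stated assumptions, not suur-data's implementation: paragraphs are split on blank lines, terms are lowercase alphanumeric runs, IDF uses add-one smoothing, and all function names are hypothetical. The empty-topic shortcut that skips the filter is left out.

```python
import math
import re
from collections import Counter


def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def tokenize(text: str) -> list[str]:
    """Lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs: list[str]) -> list[dict]:
    """Return one sparse {term: weight} TF-IDF vector per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Add-one smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(tokens).items()} for tokens in tokenized]


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_chunks(text: str, topic: str, threshold: float = 0.05):
    """Return (score, chunk) pairs whose score is at least `threshold`."""
    chunks = split_paragraphs(text)
    # The topic is vectorized together with the chunks so they share one IDF table
    vectors = tfidf_vectors(chunks + [topic])
    topic_vec = vectors[-1]
    scored = [(cosine(vec, topic_vec), chunk) for vec, chunk in zip(vectors, chunks)]
    return [(score, chunk) for score, chunk in scored if score >= threshold]


if __name__ == "__main__":
    sample = (
        "Neural networks learn weights by gradient descent.\n\n"
        "The recipe needs flour, sugar and butter.\n\n"
        "Deep neural networks stack many layers."
    )
    for score, chunk in filter_chunks(sample, "neural networks"):
        print(f"{score:.3f}  {chunk}")
```

Because the score depends only on shared vocabulary, a paragraph that discusses the topic without using any of its words scores 0.0 and is dropped, which is why a low default threshold such as 0.05 is a sensible starting point.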
License
MIT