Suur Data

Intelligent data ingestion, filtering, and tokenization pipeline.

Installation

pip install suur-data

See It In Action

Single Source

from suur_data import suur_data

tokens = suur_data("https://en.wikipedia.org/wiki/Neural_network", topic="neural networks")
print(tokens)

Multiple Sources

from suur_data import suur_data

result = suur_data(
    [
        "data.txt",
        "research_paper.pdf",
        "https://en.wikipedia.org/wiki/Deep_learning",
        "https://en.wikipedia.org/wiki/Artificial_neural_network",
    ],
    topic="neural networks",
    threshold=0.05,
)
# The return value is a dict, so read the count from its "total_tokens" key
print(f"Total tokens: {result['total_tokens']}")

All sources are downloaded, merged, filtered together, and tokenized in one call.


Full Documentation

All Installation Options

pip install suur-data
pip install suur-data[pdf]
pip install suur-data[docx]
pip install suur-data[epub]
pip install suur-data[hf]
pip install suur-data[all]

Supported Input Formats

| Format | Notes |
| --- | --- |
| .txt .md .rst | Plain text, auto encoding detection |
| .pdf | Requires suur-data[pdf] |
| .docx | Requires suur-data[docx] |
| .csv .tsv | All cells joined as text |
| .json | Recursively flattened key-value pairs |
| .html .htm | Scripts and styles stripped automatically |
| .epub | Requires suur-data[epub] |
| HTTP/HTTPS URL | Auto-downloaded, parsed by extension |
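
To make "recursively flattened key-value pairs" concrete, here is a minimal sketch of one way nested JSON could be turned into plain text lines. The helper names `flatten_json` and `json_to_text` and the dotted-key format are assumptions made for illustration; the format suur-data actually produces may differ.

```python
import json


def flatten_json(value, prefix=""):
    """Recursively yield (key_path, scalar) pairs from parsed JSON."""
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            yield from flatten_json(child, path)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten_json(child, f"{prefix}[{index}]")
    else:
        # Scalars (str, int, float, bool, None) end the recursion.
        yield prefix, value


def json_to_text(raw):
    """Render a JSON string as one 'key: value' line per scalar (hypothetical format)."""
    return "\n".join(f"{key}: {val}" for key, val in flatten_json(json.loads(raw)))


raw = '{"title": "Nets", "meta": {"tags": ["ml", "ai"], "year": 2020}}'
print(json_to_text(raw))
```

Keeping the key path next to each value preserves some context for the relevance filter, at the cost of adding key names to the text.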

Python API

from suur_data import suur_data

# From a URL
tokens = suur_data("https://en.wikipedia.org/wiki/Neuroscience", topic="brain neurons")

# From a local file
tokens = suur_data("data.txt", topic="machine learning")

# Multiple sources at once
tokens = suur_data(
    ["data.txt", "paper.pdf", "https://en.wikipedia.org/wiki/Deep_learning"],
    topic="neural networks"
)

# Custom BPE tokenizer trained on your data
tokens = suur_data("data.txt", topic="machine learning", tokenizer="custom", vocab_size=4000)

# Strict filter
tokens = suur_data("data.pdf", topic="quantum computing", threshold=0.15)

# Save tokenizer to disk
tokens = suur_data("data.txt", topic="biology", save_dir="./my_tokenizer")

# Skip filter entirely
tokens = suur_data("data.txt", no_filter=True)
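
For readers unfamiliar with what tokenizer="custom" trains, byte-pair encoding builds its vocabulary by repeatedly merging the most frequent adjacent pair of symbols. The sketch below illustrates that idea at character level, without end-of-word markers. It is not suur-data's trainer, and `learn_bpe_merges` is a hypothetical name.

```python
from collections import Counter


def learn_bpe_merges(corpus, num_merges):
    """Learn up to num_merges pair merges from whitespace-split words."""
    # Each distinct word is a tuple of symbols, starting as single characters.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite each word with the winning pair fused into one symbol.
        merged_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
    return merges


merges = learn_bpe_merges("low low low lower lowest", num_merges=2)
print(merges)
```

In this simplified view the final vocabulary is roughly the starting alphabet plus one entry per learned merge, which is the quantity a parameter like vocab_size bounds.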

Batch Output

result = suur_data("data.txt", topic="neural networks")

print(result["total_tokens"])    # total token count
print(result["num_chunks"])      # number of chunks kept

# Iterate chunk by chunk
for i, (chunk, tokens) in enumerate(zip(result["chunks"], result["batch"])):
    print(f"Chunk {i+1} ({len(tokens)} tokens):")
    print(chunk[:80])
    print(tokens[:10])

Use Directly With Transformers

import torch
from transformers import AutoModelForCausalLM

from suur_data import suur_data

result = suur_data("data.txt", topic="neural networks", model="gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

for chunk_tokens in result["batch"]:
    input_ids = torch.tensor([chunk_tokens[:1024]])
    with torch.no_grad():
        outputs = model(input_ids)
    print(outputs.logits.shape)
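
The loop above keeps only the first 1024 tokens of each chunk, which matches GPT-2's context length, so any tokens beyond that are dropped. A minimal sketch of one alternative is to split long chunks into consecutive windows instead. `split_into_windows` is a hypothetical helper, not part of suur-data.

```python
def split_into_windows(token_ids, max_len=1024):
    """Yield consecutive, non-overlapping windows of at most max_len token ids."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    for start in range(0, len(token_ids), max_len):
        yield token_ids[start:start + max_len]


# Stand-in for one entry of result["batch"]: a chunk of 2500 token ids.
chunk_tokens = list(range(2500))
windows = list(split_into_windows(chunk_tokens, max_len=1024))
print([len(w) for w in windows])
```

Each window can then be wrapped in `torch.tensor([window])` and passed to the model exactly as in the loop above.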

All Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| data_location | str or List[str] | required | URL, file path, or list of multiple sources |
| topic | str | "" | Subject for relevance filtering. Empty skips filter |
| tokenizer | str | "pretrained" | "pretrained" or "custom" |
| model | str | "gpt2" | HuggingFace model name or Hub ID |
| vocab_size | int | 8000 | BPE vocab size for custom tokenizer |
| threshold | float | 0.05 | Relevance cutoff between 0.0 and 1.0 |
| save_dir | str | None | Directory to save tokenizer files |
| no_filter | bool | False | Skip the relevance filter |
| verbose | bool | True | Show progress output |

Pretrained Model Shortcuts

| Shortcut | Model |
| --- | --- |
| gpt2 | GPT-2 (OpenAI) |
| bert | BERT base uncased |
| roberta | RoBERTa base |
| distilbert | DistilBERT base uncased |
| t5 | T5 small |

Architecture

Source (URL or file or list of sources)
        |
        v
Stage 1 — Ingest
Handles 8 file types and HTTP download.
Merges all sources into one text string.
        |
        v
Stage 2 — Neural Filter
Splits text into paragraph chunks.
Scores each chunk against topic via TF-IDF cosine similarity.
Shows progress bar while scoring.
Drops chunks below the relevance threshold.
        |
        v
Stage 3 — Tokenize
Pretrained: HuggingFace AutoTokenizer (GPT-2, BERT, etc.)
Custom: trains a BPE tokenizer on the filtered corpus.
        |
        v
{tokens, batch, chunks, num_chunks, total_tokens}
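
Stage 2 of the diagram can be sketched in plain Python to show how the threshold interacts with the scores. This is an illustrative sketch only: the tokenization, the smoothed IDF formula, and every function name here are assumptions, and suur-data's own scoring may differ in detail.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_paragraphs(text):
    """Split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def tfidf_vector(words, idf):
    """Map each word to its term frequency times inverse document frequency."""
    counts = Counter(words)
    total = sum(counts.values()) or 1
    return {w: (c / total) * idf.get(w, 0.0) for w, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_chunks(text, topic, threshold=0.05):
    """Return (chunk, score) pairs whose similarity to the topic meets the threshold."""
    chunks = split_paragraphs(text)
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    doc_freq = Counter(w for d in docs for w in set(d))
    # Smoothed IDF, so words that appear in every chunk keep a positive weight.
    idf = {w: math.log((1 + n) / (1 + df)) + 1.0 for w, df in doc_freq.items()}
    topic_vec = tfidf_vector(tokenize(topic), idf)
    scored = [(c, cosine(tfidf_vector(d, idf), topic_vec)) for c, d in zip(chunks, docs)]
    return [(c, s) for c, s in scored if s >= threshold]


text = (
    "Neural networks learn weights from data.\n\n"
    "The recipe needs flour, sugar and butter.\n\n"
    "Deep neural networks stack many layers."
)
for chunk, score in filter_chunks(text, "neural networks"):
    print(f"{score:.2f}  {chunk}")
```

A chunk that shares no words with the topic scores 0.0 and is dropped at any positive threshold. Raising the threshold removes chunks that mention the topic words only in passing.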

License

MIT

