
Intelligent data ingestion and tokenization pipeline


Suur Data

Intelligent data ingestion, filtering, and tokenization pipeline.

Installation

pip install suur-data

See It In Action

from suur_data import suur_data

tokens = suur_data("https://en.wikipedia.org/wiki/Neural_network", topic="neural networks")
print(tokens)

That one call downloads a full Wikipedia page, filters it down to the paragraphs relevant to the topic, and returns a list of token IDs (from the GPT-2 tokenizer by default).


Full Documentation

All Installation Options

pip install suur-data
pip install suur-data[pdf]
pip install suur-data[docx]
pip install suur-data[epub]
pip install suur-data[hf]
pip install suur-data[all]

Supported Input Formats

Format            Notes
.txt .md .rst     Plain text, auto encoding detection
.pdf              Requires suur-data[pdf]
.docx             Requires suur-data[docx]
.csv .tsv         All cells joined as text
.json             Recursively flattened key-value pairs
.html .htm        Scripts and styles stripped automatically
.epub             Requires suur-data[epub]
HTTP/HTTPS URL    Auto-downloaded, parsed by extension
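The notes for .csv, .json, and .html describe simple text-extraction rules. The sketch below shows one way such rules could be written with the standard library only. It is not the package's actual code: the function names (`flatten_json`, `csv_to_text`, `html_to_text`) and the exact output layout are assumptions made for illustration.

```python
import csv
import io
from html.parser import HTMLParser


def flatten_json(value, prefix=""):
    """Recursively flatten parsed JSON into 'key.path: value' strings (assumed layout)."""
    if isinstance(value, dict):
        lines = []
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            lines.extend(flatten_json(child, path))
        return lines
    if isinstance(value, list):
        lines = []
        for index, child in enumerate(value):
            lines.extend(flatten_json(child, f"{prefix}[{index}]"))
        return lines
    return [f"{prefix}: {value}"]


def csv_to_text(raw, delimiter=","):
    """Join every non-empty cell of a CSV/TSV document into one string."""
    rows = csv.reader(io.StringIO(raw), delimiter=delimiter)
    return " ".join(cell.strip() for row in rows for cell in row if cell.strip())


class _TextOnly(HTMLParser):
    """Collect visible text while skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(raw):
    """Return the visible text of an HTML document as one string."""
    parser = _TextOnly()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.parts)
```

Each helper returns plain text, which matches the idea that ingestion ends in a single raw text string regardless of the input format.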

Python API

from suur_data import suur_data

From a URL

tokens = suur_data("https://en.wikipedia.org/wiki/Neuroscience", topic="brain neurons")

From a local file

tokens = suur_data("data.txt", topic="machine learning")

Custom BPE tokenizer trained on your data

tokens = suur_data("data.txt", topic="machine learning", tokenizer="custom", vocab_size=4000)

Strict filter — only highly relevant chunks survive

tokens = suur_data("data.pdf", topic="quantum computing", threshold=0.15)

Save the tokenizer to disk for reuse

tokens = suur_data("data.txt", topic="biology", save_dir="./my_tokenizer")

Skip the filter entirely

tokens = suur_data("data.txt", no_filter=True)

All Parameters

Parameter      Type   Default       Description
data_location  str    required      URL or local file path
topic          str    ""            Subject for relevance filtering. Empty skips filter
tokenizer      str    "pretrained"  "pretrained" or "custom"
model          str    "gpt2"        HuggingFace model name or Hub ID
vocab_size     int    8000          BPE vocab size for custom tokenizer
threshold      float  0.05          Relevance cutoff between 0.0 and 1.0
save_dir       str    None          Directory to save tokenizer files
no_filter      bool   False         Skip the relevance filter
verbose        bool   True          Show progress output

Pretrained Model Shortcuts

Shortcut    Model
gpt2        GPT-2 (OpenAI)
bert        BERT base uncased
roberta     RoBERTa base
distilbert  DistilBERT base uncased
t5          T5 small

You can also pass any HuggingFace Hub model ID directly:

tokens = suur_data("data.txt", model="facebook/opt-125m")
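Accepting both shortcuts and full Hub IDs suggests a lookup with a pass-through default. The sketch below is a guess at that behavior, not the package's code: `MODEL_SHORTCUTS` and `resolve_model` are hypothetical names, and the Hub IDs on the right are inferred from the model descriptions in the table.

```python
# Hypothetical shortcut table; the Hub IDs are assumptions inferred from
# the descriptions "BERT base uncased", "RoBERTa base", and so on.
MODEL_SHORTCUTS = {
    "gpt2": "gpt2",
    "bert": "bert-base-uncased",
    "roberta": "roberta-base",
    "distilbert": "distilbert-base-uncased",
    "t5": "t5-small",
}


def resolve_model(name):
    """Map a known shortcut to a Hub ID; return any other name unchanged."""
    return MODEL_SHORTCUTS.get(name.strip().lower(), name)
```

With this shape, `resolve_model("bert")` yields a full model ID, while `resolve_model("facebook/opt-125m")` passes through untouched.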

How the Filter Works

The filter splits the text into paragraph chunks, converts each chunk and the topic into TF-IDF vectors, then scores each chunk against the topic using cosine similarity. Chunks that score below the threshold are dropped. If the threshold is so strict that every chunk would be dropped, the filter relaxes automatically and keeps the top 30 percent of chunks.

Raise the threshold for stricter filtering, lower it to keep more content:

tokens = suur_data("data.txt", topic="AI", threshold=0.10)  # strict
tokens = suur_data("data.txt", topic="AI", threshold=0.02)  # loose
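To make the scoring and the fallback concrete, here is a minimal pure-Python sketch of a TF-IDF cosine filter. It is an illustration under stated assumptions, not the package's implementation: the function names, the smoothed IDF formula, the blank-line paragraph split, and the rounding of the 30 percent fallback are all choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for c in counts for term in c)
    # Smoothed IDF (an assumption), so terms found in every document keep weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_chunks(text, topic, threshold=0.05, fallback_fraction=0.30):
    """Keep paragraphs scoring at or above threshold, else the top fraction."""
    chunks = [c.strip() for c in re.split(r"\n\s*\n", text) if c.strip()]
    if not chunks:
        return []
    *chunk_vecs, topic_vec = tfidf_vectors(chunks + [topic])
    scores = [cosine(vec, topic_vec) for vec in chunk_vecs]
    kept = [c for c, s in zip(chunks, scores) if s >= threshold]
    if kept:
        return kept
    # Everything was dropped: keep the best-scoring 30 percent, in document order.
    n_keep = max(1, math.ceil(len(chunks) * fallback_fraction))
    top = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:n_keep]
    return [chunks[i] for i in sorted(top)]
```

A paragraph that shares no terms with the topic scores exactly 0.0 here, so any positive threshold removes it; the fallback only triggers when no paragraph reaches the threshold.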

Saving and Loading Tokens

import json

tokens = suur_data("data.txt", topic="neural networks")
with open("tokens.json", "w") as f:
    json.dump(tokens, f)

with open("tokens.json", "r") as f:
    tokens = json.load(f)

Decoding Tokens Back to Text

from transformers import AutoTokenizer

# Load the same tokenizer that produced the token IDs (gpt2 is the default).
tok = AutoTokenizer.from_pretrained("gpt2")
text = tok.decode(tokens)
print(text)


Architecture

Source (URL or file)
        |
        v
Stage 1: Ingest
    Handles 8 file types and HTTP download.
    Outputs a single raw text string.
        |
        v
Stage 2: Neural Filter
    Splits text into paragraph chunks.
    Scores each chunk against topic via TF-IDF cosine similarity.
    Shows progress bar while scoring.
    Drops chunks below the relevance threshold.
        |
        v
Stage 3: Tokenize
    Pretrained: HuggingFace AutoTokenizer (GPT-2, BERT, etc.)
    Custom: trains a BPE tokenizer on the filtered corpus.
        |
        v
List[int] (token IDs)
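For readers unfamiliar with the custom path in Stage 3, the sketch below shows the core of BPE training: repeatedly merging the most frequent adjacent symbol pair. It is a toy illustration, not the package's trainer. The names `train_bpe`, `merge_pair`, and `encode_word`, the `</w>` end-of-word marker, and the use of a merge count in place of `vocab_size` are assumptions; in a real trainer the final vocabulary size is roughly the base alphabet plus the number of merges.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from whitespace-split words."""
    # Each word starts as its characters plus an end-of-word marker.
    words = Counter(tuple(word) + ("</w>",) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties go to the pair seen first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = Counter()
        for symbols, freq in words.items():
            merged[merge_pair(symbols, best)] += freq
        words = merged
    return merges


def encode_word(word, merges):
    """Apply the learned merges, in training order, to a single word."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

Because the merges are learned from the filtered corpus, frequent domain terms tend to collapse into single symbols, which is the usual motivation for training a tokenizer on your own data.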


License

MIT

