
The World’s Data Filter™ — find the most valuable data, first.

Surface your highest-value records with information gain, novelty, and quality scoring.
A universal SDK + CLI that ranks and subsets text, JSONL, CSV, logs, and mixed corpora so you see the signal first.
Built on submodular selection (facility location), stable embeddings, diversity, and fast heuristics.

Company: The World’s Data Company • Product: The World’s Data Filter™


✨ What it does

  • Universal features — pluggable extractors for text, JSON/CSV/tabular, and generic blobs.
  • Information Gain — greedy facility‑location selection to cover the dataset with minimal redundancy.
  • Novelty — distances from dataset centroid / past cache to prioritize new signal.
  • Quality filters — language/length heuristics for text; null/variance checks for tabular; duplicate/similarity suppression.
  • Explainable — scores per item: coverage_gain, novelty, quality, and a value_score aggregate.
  • SDK & CLI — embed in Python or run as wdf from the terminal.
  • Deterministic — stable SHA‑256–based embeddings by default (swap for your own encoder at any time).
  • No heavy models — NumPy/SciPy core; scikit‑learn is optional (the [text] extra) for TF‑IDF.

Year 2 roadmap: The World’s Data Index (persistent vector/metadata store) — this repo stays the stateless filter/selector.


🚀 Quickstart (Windows / macOS / Linux)

# 1) Create a virtualenv (Python 3.10+)
python -m venv .venv
# Windows
.\.venv\Scripts\Activate.ps1
# macOS/Linux
# source .venv/bin/activate

# 2) Install
pip install -U pip
pip install -e .[dev]            # add [text] for TF-IDF utilities if you like

# 3) Run the demo
wdf score examples/news.jsonl --text-field text --out scores.csv
wdf filter examples/news.jsonl --text-field text --k 10 --out selected.jsonl --explain

Outputs:

  • scores.csv — per‑item coverage_gain, novelty, quality, value_score
  • selected.jsonl — the top‑K items ranked by the chosen criterion (value_score by default), with per‑item explanations included (disable via --no-explain)

🧠 How it works (high level)

Feature extraction (adapters)

  • Text → deterministic hash embedding (384‑d) or optional TF‑IDF.
  • JSONL/CSV → flattened key/value signals, basic stats (NA ratio, variance), and hash embedding of important string fields.
  • Generic files → filename, size, MIME guess, byte histograms (lightweight), hash embedding of content bytes.

Each item yields a vector x_i (unit‑normalized) and auxiliary quality features.
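The deterministic hash embedding can be sketched in a few lines. This is an illustrative approximation only, not the library's actual encoder — the real implementation may bucket raw bytes, weight tokens, or choose signs differently:

```python
import hashlib
import numpy as np

def hash_embed(text: str, dim: int = 384) -> np.ndarray:
    """Sketch: token-level SHA-256 hashing into a fixed-size unit vector."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:4], "big") % dim   # bucket from first 4 bytes
        sign = 1.0 if digest[4] % 2 == 0 else -1.0      # sign from the 5th byte
        vec[idx] += sign
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0.0 else vec            # unit-normalize
```

Because the embedding depends only on SHA‑256, the same input always yields the same vector across runs and machines — which is what makes the scores reproducible.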

Scoring

  • Facility Location (coverage)
    F(S) = Σ_j max_{i ∈ S} sim(x_i, x_j) — select the items that best cover the rest.
    Greedy selection comes within a (1 − 1/e) factor of the optimum for this monotone submodular objective, and doubles as a redundancy filter.
  • Novelty
    Distance from dataset centroid (or past cache) highlights unusual / new items.
  • Quality
    Text heuristics (language guess, length, printable ratio), tabular health (missing‑ness, low variance), duplicate checks.
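The first two scores can be sketched in plain NumPy. Both functions below are illustrative approximations of the ideas, not the library's implementation; `greedy_facility_location` and `novelty` are hypothetical names:

```python
import numpy as np

def greedy_facility_location(X: np.ndarray, k: int) -> list[int]:
    """Greedily pick k rows of X (unit-normalized) to maximize
    F(S) = sum_j max_{i in S} sim(x_i, x_j)."""
    sim = X @ X.T                      # pairwise cosine similarity
    n = sim.shape[0]
    best = np.full(n, -1.0)            # cosine lower bound: nothing covered yet
    selected: list[int] = []
    for _ in range(min(k, n)):
        # marginal coverage gain of adding each candidate to the current set
        gains = np.maximum(sim, best).sum(axis=1) - best.sum()
        gains[selected] = -np.inf      # never re-pick a selected item
        i = int(np.argmax(gains))
        selected.append(i)
        best = np.maximum(best, sim[i])
    return selected

def novelty(X: np.ndarray) -> np.ndarray:
    """Cosine distance of each item from the normalized dataset centroid."""
    c = X.mean(axis=0)
    c = c / np.linalg.norm(c)          # assumes a non-zero centroid
    return 1.0 - X @ c
```

On a toy corpus of two duplicate vectors plus one outlier, the greedy pass picks one of the duplicates first (it covers the most), then the outlier — the second duplicate adds no marginal coverage, which is exactly the redundancy-suppression behavior described above.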

Value score (combined)

value_score = w_cov * coverage_gain + w_nov * novelty + w_qual * quality
Weights are configurable via the CLI flags (--w-cov, --w-nov, --w-qual) or the SDK weights dict.
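As a sketch, the aggregate is a plain weighted sum. The default weights below are taken from the SDK example in this README and may not match the library's actual defaults:

```python
def value_score(coverage_gain: float, novelty: float, quality: float,
                w_cov: float = 0.7, w_nov: float = 0.2, w_qual: float = 0.1) -> float:
    # Defaults mirror the SDK example in this README (cov=0.7, nov=0.2,
    # qual=0.1); illustrative only.
    return w_cov * coverage_gain + w_nov * novelty + w_qual * quality
```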


🧰 CLI usage

# Score a JSONL corpus (one object per line) with a 'text' field
wdf score examples/news.jsonl --text-field text --out scores.csv

# Filter top-K by value score (explain is on by default)
wdf select examples/news.jsonl --text-field text --k 50 --out selected.jsonl

# Prefer compact JSONL (disable explanations)
wdf select examples/news.jsonl --text-field text --k 50 --out selected.jsonl --no-explain

# From a CSV (choose a text column)
wdf score examples/sample.csv --csv --text-field body --id-field id --out scores.csv

# Tune weights + disable novelty
wdf filter examples/news.jsonl --text-field text --k 20 --w-cov 0.8 --w-nov 0.0 --w-qual 0.2 --out selected.jsonl

Input types supported today

  • .jsonl (id, text, and/or arbitrary fields)
  • .csv (choose columns)
  • Directory of .txt files (--dir)
  • Anything else you can adapt via a custom extractor (see worlddatafilter/extractors/base.py).

You can register your own extractor in ~20 lines — the SDK passes through meta and text to downstream systems.
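A custom extractor might look roughly like the following. The shape is hypothetical — `ExtractedItem`, `suffixes`, and `extract` are illustrative names, and worlddatafilter/extractors/base.py is the authority on the real interface:

```python
# Hypothetical sketch of a custom extractor; names and signatures are
# illustrative, not the library's actual base-class contract.
from dataclasses import dataclass, field

@dataclass
class ExtractedItem:
    text: str                           # text passed downstream for embedding
    meta: dict = field(default_factory=dict)

class MarkdownExtractor:
    """Toy extractor: strips '#' heading markers, keeps the title as meta."""
    suffixes = (".md",)

    def extract(self, path: str) -> ExtractedItem:
        with open(path, encoding="utf-8") as f:
            raw = f.read()
        lines = [ln.lstrip("# ").rstrip() for ln in raw.splitlines()]
        title = lines[0] if lines else ""
        return ExtractedItem(text="\n".join(lines),
                             meta={"title": title, "path": path})
```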


📦 Python SDK

from worlddatafilter import WorldDataFilter, loaders

docs = loaders.load_jsonl("examples/news.jsonl", text_field="text")
wdf = WorldDataFilter()
scores = wdf.score(docs)      # list of ItemScore
selected = wdf.select(docs, k=25, weights=dict(cov=0.7, nov=0.2, qual=0.1))

🧪 Tests & Quality

ruff check .
pytest -q

🔌 Optional extras

  • pip install -e .[text] → scikit‑learn TF‑IDF utilities.
  • pip install -e .[api] → simple FastAPI server exposing /score & /filter (coming soon).

📄 License

Apache License 2.0 © The World’s Data Company



Download files

Download the file for your platform.

Source Distribution

worlds_data_filter-0.1.1.tar.gz (21.4 kB)

Uploaded Source

Built Distribution


worlds_data_filter-0.1.1-py3-none-any.whl (19.9 kB)

Uploaded Python 3

File details

Details for the file worlds_data_filter-0.1.1.tar.gz.

File metadata

  • Download URL: worlds_data_filter-0.1.1.tar.gz
  • Size: 21.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for worlds_data_filter-0.1.1.tar.gz
Algorithm Hash digest
SHA256 2be2fbf486d89b2f7b307e120c38b134ed204f377cbf4d884c299612ad7a7ce1
MD5 e7349f28a29de0c2691d24acf90459b9
BLAKE2b-256 4c5a3ef71b829ab5680c8dcf306233557830f15ee6901dc047804050fa648e41


File details

Details for the file worlds_data_filter-0.1.1-py3-none-any.whl.


File hashes

Hashes for worlds_data_filter-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 ea72a486c0dcebbeffa9e935ae88dd4c7ed3e1bc6fae6fc97c7f5fd33f075a92
MD5 ef3d370a30110fe1ab8e44ac02eb1595
BLAKE2b-256 b1c75042173b6a824c9466230b92ac238c5ae8cfed870283bd9dfa3d98326f3b

