High-performance text metrics and filtering for large-scale corpora and pretrain curation

Cheesecloth

A high-performance, Rust-powered text analysis toolkit for corpus filtering and quality assessment.

Cheesecloth provides 100+ text metrics for:

  • ⚡ Low-latency filtering of LLM pretraining datasets
  • 📊 Empirical research on text quality and characteristics
  • 🔍 Advanced statistical text analysis

Installation

pip install cheesecloth

Quick Examples

import cheesecloth

# Basic character analysis
text = "The quick brown fox jumps over the lazy dog!"
metrics = cheesecloth.get_all_char_metrics(text)
print(f"Character count: {metrics['char_count']}")  # 44
print(f"Letters: {metrics['letter_count']}")        # 35
print(f"ASCII ratio: {metrics['ascii_ratio']:.2f}") # 1.00

# Comprehensive analysis (all metrics at once)
all_metrics = cheesecloth.get_all_metrics(text)
print(f"Questions: {all_metrics['patterns']['question_count']}")            # 0
print(f"Paragraphs: {all_metrics['segmentation']['paragraph_count']}")      # 1
print(f"Type-token ratio: {all_metrics['unigram']['type_token_ratio']:.2f}") # 0.56

Metrics Overview

Cheesecloth implements 100+ text analysis metrics across eight categories:

Category | Description | Examples
Character | Character-level counts and distributions | char_count, letter_count, ascii_ratio, char_entropy
Segmentation | Text structure analysis | paragraph_count, line_count, average_sentence_length
Unigram | Word-level statistics | token_count, type_token_ratio, hapax_legomena_ratio
Pattern | Specific content patterns | question_count, copyright_mention_count, contains_code
Compression | Information density measures | compression_ratio, compression_efficiency
Distribution | Statistical distributions | zipf_fitness_score, burstiness, vocab_growth
Tokenizer | ML tokenization analysis | subword_token_count, subword_efficiency
Readability | Text complexity metrics | readability_score, readability_level

For a complete list of all implemented metrics with detailed descriptions, see our Metrics Reference.
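
To make a few of the metric names above concrete, here is a minimal pure-Python sketch of character entropy, type-token ratio, and hapax legomena ratio. It is not Cheesecloth's implementation: the regex tokenization, the lowercasing, and the choice to divide the hapax count by the token count are all assumptions made for illustration, so its values need not match the library's output.

```python
import math
import re
from collections import Counter


def char_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def unigram_stats(text: str) -> dict:
    """Word-level ratios over lowercase \\w+ tokens (an assumed tokenization)."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return {"token_count": 0, "type_token_ratio": 0.0, "hapax_legomena_ratio": 0.0}
    counts = Counter(tokens)
    # A hapax legomenon is a word type that occurs exactly once.
    hapaxes = sum(1 for n in counts.values() if n == 1)
    return {
        "token_count": len(tokens),
        "type_token_ratio": len(counts) / len(tokens),
        # Assumption: normalize by token count; some definitions use type count.
        "hapax_legomena_ratio": hapaxes / len(tokens),
    }
```

Calling `unigram_stats("a b a c")` sees four tokens and three distinct types, two of which occur once.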

Development Roadmap

Cheesecloth follows a phased development approach:

Phase 1 (Complete) - Metrics Implementation ✅

  • Comprehensive suite of 100+ text metrics
  • High-performance Rust core with Python bindings
  • CLI tools for dataset analysis

Phase 2 (In Progress) - Statistical Research 🔬

  • Empirical baselines from a 1T-token sample (KL3M Data Project)
  • Statistical relationships between metrics and content quality
  • Research publication (see Citation)

Phase 3 (Pending) - Production Filters 🔄

  • Configurable filtering pipelines based on Phase 2 findings
  • Adaptive filtering for streaming data
  • Production tools for large-scale corpus management
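
Phase 3 is still pending, so no filtering API is documented yet. As a rough idea of what a configurable threshold filter over per-document metrics could look like, here is a self-contained sketch. `Rule`, `filter_stream`, and `toy_metrics` are hypothetical names invented for this example; `toy_metrics` stands in for a real metric function so the block runs without Cheesecloth installed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Optional

Metrics = Dict[str, float]


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: keep a document only if a metric lies within bounds."""
    metric: str
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def passes(self, metrics: Metrics) -> bool:
        value = metrics.get(self.metric)
        if value is None:
            return False  # a missing metric counts as a failure
        if self.minimum is not None and value < self.minimum:
            return False
        if self.maximum is not None and value > self.maximum:
            return False
        return True


def filter_stream(
    docs: Iterable[str],
    compute: Callable[[str], Metrics],
    rules: List[Rule],
) -> Iterator[str]:
    """Lazily yield the documents that satisfy every rule."""
    for doc in docs:
        metrics = compute(doc)
        if all(rule.passes(metrics) for rule in rules):
            yield doc


def toy_metrics(text: str) -> Metrics:
    """Stand-in metric function (an assumption, not the library's)."""
    chars = len(text)
    ascii_chars = sum(1 for c in text if ord(c) < 128)
    return {
        "char_count": float(chars),
        "ascii_ratio": ascii_chars / chars if chars else 0.0,
    }


rules = [Rule("char_count", minimum=10), Rule("ascii_ratio", minimum=0.9)]
docs = ["short", "A reasonably long ASCII sentence.", "ééééééééééééééé"]
kept = list(filter_stream(docs, toy_metrics, rules))
print(kept)  # ['A reasonably long ASCII sentence.']
```

Because `filter_stream` is a generator, it processes one document at a time and suits streaming input. Thresholds would in practice come from empirical baselines rather than the arbitrary values used here.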

Key Features

  • Rust Core: High-performance algorithms implemented in Rust
  • Comprehensive Analysis: 100+ metrics from basic to advanced statistical measures
  • Type-Safe Interface: Python classes with IDE completion and convenience methods
  • LLM Integration: Support for ML tokenizers (GPT, BERT, etc.)
  • Statistical Tools: Analyze metric distributions across corpus samples
  • Minimal Dependencies: Lightweight with optional integrations
  • Adaptive Processing: Smart segmentation for large documents
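
As an illustration of the "Statistical Tools" idea, the sketch below summarizes how one metric is distributed across a sample of documents, using only the standard library. `summarize_metric` is a hypothetical helper written for this example and is not part of Cheesecloth's API.

```python
import statistics
from typing import Dict, Iterable


def summarize_metric(values: Iterable[float]) -> Dict[str, float]:
    """Return count, mean, standard deviation, and quartiles for one metric."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two values to summarize a distribution")
    # quantiles(n=4) returns the three cut points between the four quartiles.
    q1, median, q3 = statistics.quantiles(data, n=4)
    return {
        "count": float(len(data)),
        "mean": statistics.fmean(data),
        "stdev": statistics.stdev(data),
        "q1": q1,
        "median": median,
        "q3": q3,
    }
```

A summary like this, computed per metric over a corpus sample, is one simple way to pick filtering thresholds, for example by flagging documents far outside the interquartile range.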

Advanced Examples

Typed Metrics Interface

import cheesecloth
from cheesecloth.tokenized_metrics import AllMetrics

text = """
Copyright © 2025 ALEA Institute. All rights reserved.

Section 1: Introduction to Natural Language Processing

What are the fundamental challenges in processing human language?
"""

# Get all metrics, then wrap them in the type-safe interface
metrics_dict = cheesecloth.get_all_metrics(text)
metrics = AllMetrics.from_dict(metrics_dict)

# Typed attribute access instead of dictionary lookups
print(f"Character count: {metrics.character.char_count}")          # 174
print(f"Has copyright notices: {metrics.patterns.has_copyright_notices}")  # True
print(f"Is educational content: {metrics.patterns.is_educational}")        # True
print(f"Question count: {metrics.patterns.question_count}")                # 1
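
For readers curious how a typed layer can sit on top of dictionary output, here is a minimal dataclass sketch of the pattern. `PatternMetricsSketch` is hypothetical and is not the library's `AllMetrics`. The rule that derives `has_copyright_notices` from a count is likewise an assumption for illustration.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class PatternMetricsSketch:
    """Hypothetical typed view over a dictionary of pattern metrics."""
    question_count: int
    copyright_mention_count: int

    @property
    def has_copyright_notices(self) -> bool:
        # Assumed convenience rule: any mention counts as a notice.
        return self.copyright_mention_count > 0

    @classmethod
    def from_dict(cls, data: Mapping[str, Any]) -> "PatternMetricsSketch":
        # Missing keys default to zero so partial dictionaries still load.
        return cls(
            question_count=int(data.get("question_count", 0)),
            copyright_mention_count=int(data.get("copyright_mention_count", 0)),
        )
```

Declared fields give editors something to autocomplete, and a frozen dataclass prevents accidental mutation of computed results.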

Advanced Statistical Analysis

import cheesecloth

text = """Natural language processing (NLP) is a subfield of linguistics, computer 
science, and artificial intelligence concerned with the interactions between 
computers and human language. The goal is to enable computers to process 
and analyze large amounts of natural language data."""

# Check Zipf's law fitness (how well word frequency follows Zipf's distribution)
zipf_metrics = cheesecloth.get_zipf_metrics(text, include_punctuation=False, case_sensitive=False)
print(f"Zipf fitness score: {zipf_metrics['zipf_fitness_score']:.2f}")  # ~0.39

# Compression-based metrics (measures text complexity/redundancy)
compression_metrics = cheesecloth.get_compression_metrics(text)
print(f"Compression ratio: {compression_metrics['compression_ratio']:.2f}")  # ~1.59
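
To show the ideas behind these two metric families, here is a standard-library sketch. It is not how Cheesecloth computes `compression_ratio` or `zipf_fitness_score`: the zlib compressor, the regex tokenization, and the use of a log-log regression slope as a rough Zipf indicator are all assumptions, so the numbers will differ from the library's.

```python
import math
import re
import zlib
from collections import Counter


def compression_ratio(text: str, level: int = 6) -> float:
    """Original byte length divided by zlib-compressed length."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw, level))


def zipf_slope(text: str) -> float:
    """Least-squares slope of log(frequency) against log(rank).

    Text that follows Zipf's law closely gives a slope near -1.
    """
    tokens = re.findall(r"\w+", text.lower())
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance
```

Highly repetitive text compresses well and yields a large ratio. Very short inputs can yield a ratio below 1 because of compressor overhead.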

CLI Usage

Analyze files and datasets with the CLI:

# Analyze a local file
python -m cheesecloth.cli data/war_and_peace.txt

# Hugging Face dataset with text in 'text' column
python -m cheesecloth.cli imdb --text-column text --limit 100

# Specific metric groups only
python -m cheesecloth.cli data/corpus.jsonl.gz --include-groups basic entropy

The CLI supports:

  • Local text, JSON, and JSONL files (compressed or uncompressed)
  • Hugging Face datasets
  • Pre-tokenized data with custom tokenizers
  • Filtering by metric groups
  • Comprehensive or targeted analysis
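
For pipelines that call the Python API directly instead of the CLI, a JSONL reader along these lines may be useful. It is a sketch, not the CLI's own loader: `iter_texts` is a hypothetical helper, and the default `"text"` field simply mirrors the `--text-column text` example above.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator, Union


def iter_texts(path: Union[str, Path], text_column: str = "text") -> Iterator[str]:
    """Yield the text field of each record in a .jsonl or .jsonl.gz file."""
    path = Path(path)
    # Choose the opener from the file suffix; both are used in text mode.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            value = record.get(text_column) if isinstance(record, dict) else None
            if isinstance(value, str):
                yield value
```

Each yielded string can then be passed to `cheesecloth.get_all_metrics`, as in the earlier examples.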

Citation

If you use Cheesecloth in your research, please cite the KL3M Data Project:

@misc{bommarito2025kl3mdata,
  title={The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models},
  author={Bommarito II, Michael J. and Bommarito, Jillian and Katz, Daniel Martin},
  year={2025},
  eprint={2504.07854},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

About

Cheesecloth is an ALEA Institute project and part of our ongoing research into the development of legal, ethical, and sustainable AI systems.

Licensed under MIT License.

