Algorithmic text humanization with AI detection, tone analysis, paraphrasing, and spinning — 17-stage pipeline, 14 languages, zero dependencies
TextHumanize
The most advanced open-source text naturalization engine
Normalize style, improve readability, and ensure brand-safe content — offline, private, and blazing fast
42,375 lines of code · 75 Python modules · 17-stage pipeline · 14 languages + universal · 1,802 tests
Quick Start · Features · Documentation · Live Demo · License
TextHumanize is a pure-algorithmic text processing engine that normalizes style, improves readability, and removes mechanical patterns from text. No neural networks, no API keys, no internet — just 42K+ lines of finely tuned rules, dictionaries, and statistical methods.
Built-in toolkit: AI Detection · Paraphrasing · Tone Analysis · Watermark Cleaning · Content Spinning · Coherence Analysis · Readability Scoring · Stylistic Fingerprinting · Auto-Tuner · Perplexity Analysis · Plagiarism Detection · Async API · SSE Streaming
Platforms: Python (full) · TypeScript/JavaScript (core) · PHP (full)
Languages: 🇷🇺 RU · 🇺🇦 UK · 🇬🇧 EN · 🇩🇪 DE · 🇫🇷 FR · 🇪🇸 ES · 🇵🇱 PL · 🇧🇷 PT · 🇮🇹 IT · 🇸🇦 AR · 🇨🇳 ZH · 🇯🇵 JA · 🇰🇷 KO · 🇹🇷 TR · 🌍 any language via universal processor
Why TextHumanize?
Problem: Machine-generated text has uniform sentence lengths, bureaucratic vocabulary, formulaic connectors, and low stylistic diversity — reducing readability, engagement, and brand authenticity.
Solution: TextHumanize algorithmically normalizes text style while preserving meaning. Configurable intensity, deterministic output, full change reports. No cloud APIs, no rate limits, no data leaks.
| Advantage | Details |
|---|---|
| 30,000+ chars/sec | Process a full article in milliseconds, not seconds |
| 100% private | All processing is local — your text never leaves your machine |
| Precise control | Intensity 0–100, 9 profiles, keyword preservation, max change ratio |
| 14 languages | Full dictionaries for 14 languages; statistical processor for any other |
| Zero dependencies | Pure Python stdlib — no pip packages, no model downloads, starts in <100ms |
| Reproducible | Seed-based PRNG — same input + same seed = identical output |
| AI detection | 13-metric ensemble + 35-feature statistical detector — no ML required |
| Enterprise-ready | Dual license, 1,802 tests, CI/CD, benchmarks, on-prem deployment |
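The reproducibility row is easy to picture: every random choice is keyed to a seed. Below is a minimal sketch of the idea using the stdlib `random.Random` — the function name and seed-keying scheme are illustrative, not the library's actual internals:

```python
import random

def pick_synonym(word: str, candidates: list, seed: int) -> str:
    """Seed-keyed choice: the same word + seed always yields the same pick."""
    # Derive a deterministic per-word stream from the seed and the word itself
    rng = random.Random(f"{seed}:{word}")
    return rng.choice(candidates)

# Same inputs, same output — on every run and every machine:
first = pick_synonym("utilize", ["use", "apply", "employ"], seed=42)
second = pick_synonym("utilize", ["use", "apply", "employ"], seed=42)
```

Because the stream is derived from the seed rather than global state, results stay stable regardless of call order.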
Comparison with Competitors
| Criterion | TextHumanize | Online Humanizers | GPT/LLM Rewriting |
|---|---|---|---|
| Works offline | ✅ | ❌ | ❌ |
| Privacy | ✅ Local only | ❌ Third-party servers | ❌ Cloud API |
| Speed | 30K+ chars/sec | 2–10 sec (network) | ~500 chars/sec |
| Cost per 1M chars | $0 | $10–50/month | $15–60 (GPT-4) |
| API key required | No | Yes | Yes |
| Deterministic | ✅ Seed-based | ❌ | ❌ |
| Languages | 14 + universal | 1–3 | 10+ but expensive |
| Built-in AI detector | ✅ 13 metrics | ❌ or basic | ❌ |
| Max change control | ✅ max_change_ratio | ❌ | ❌ Unpredictable |
| Open source | ✅ | ❌ | ❌ |
| Self-hosted | ✅ | ❌ | ❌ |
vs. Other Open-Source Libraries
| Feature | TextHumanize | Typical Alternatives |
|---|---|---|
| Pipeline stages | 17 | 2–4 |
| Languages | 14 + universal | 1–2 |
| AI detection | ✅ 13 metrics + statistical ML | ❌ |
| Python tests | 1,802 | 10–50 |
| Codebase size | 42,375 lines | 500–2K |
| Platforms | Python + JS + PHP | Single |
| Plugin system | ✅ | ❌ |
| Tone analysis | ✅ 7 levels | ❌ |
| REST API | ✅ 12 endpoints | ❌ |
| Readability metrics | ✅ 6 indices | 0–1 |
| Morphological engine | ✅ 4 languages | ❌ |
Installation
pip install texthumanize
From source:
git clone https://github.com/ksanyok/TextHumanize.git
cd TextHumanize && pip install -e .
PHP / TypeScript
# PHP
cd php/ && composer install
# TypeScript
cd js/ && npm install
Quick Start
from texthumanize import humanize, analyze, detect_ai, explain
# Humanize text
result = humanize("This text utilizes a comprehensive methodology for implementation.", lang="en")
print(result.text) # → "This text uses a complete method for setup."
print(result.change_ratio) # → 0.15
print(result.quality_score) # → 0.85
# With profile and intensity
result = humanize(text, lang="en", profile="web", intensity=70)
# AI Detection — 13-metric ensemble
ai = detect_ai("Text to check for AI generation.", lang="en")
print(f"AI: {ai['score']:.0%} | {ai['verdict']} | Confidence: {ai['confidence']:.0%}")
# Analyze text metrics
report = analyze("Text to analyze.", lang="en")
print(f"Artificiality: {report.artificiality_score:.1f}/100")
# Full change report
print(explain(result))
All Features at a Glance
from texthumanize import (
    humanize, humanize_batch, humanize_chunked, humanize_ai,
    detect_ai, detect_ai_batch, detect_ai_sentences, detect_ai_mixed,
    paraphrase, analyze_tone, adjust_tone,
    detect_watermarks, clean_watermarks,
    spin, spin_variants, analyze_coherence, full_readability,
    AutoTuner, BenchmarkSuite, STYLE_PRESETS,
)
# Paraphrasing
print(paraphrase("The system works efficiently.", lang="en"))
# Tone — 7-level formality scale
tone = analyze_tone("Please submit the documentation.", lang="en")
casual = adjust_tone("It is imperative to proceed.", target="casual", lang="en")
# Watermarks
clean = clean_watermarks("Te\u200bxt wi\u200bth hid\u200bden chars")
# Spinning
variants = spin_variants("Original text.", count=5, lang="en")
# Batch + chunked processing
results = humanize_batch(["Text 1", "Text 2"], lang="en", max_workers=4)
result = humanize_chunked(large_doc, chunk_size=3000, lang="ru")
# Async API — native asyncio support
from texthumanize import async_humanize, async_detect_ai
result = await async_humanize("Text to process", lang="en")
ai = await async_detect_ai("Text to check", lang="en")
Before & After
Before (AI-generated):
Furthermore, it is important to note that the implementation of cloud computing facilitates the optimization of business processes. Additionally, the utilization of microservices constitutes a significant advancement.
After (TextHumanize, profile="web", intensity=70):
But cloud computing helps optimize how businesses work. Also, microservices are a big step forward.
Feature Matrix
| Category | Feature | Python | JS | PHP |
|---|---|---|---|---|
| Core | humanize() — 17-stage pipeline | ✅ | ✅ | ✅ |
| | humanize_batch() — parallel processing | ✅ | — | ✅ |
| | humanize_chunked() — large text support | ✅ | — | ✅ |
| | humanize_ai() — three-tier AI + rules | ✅ | — | — |
| | analyze() — artificiality scoring | ✅ | ✅ | ✅ |
| | explain() — change report | ✅ | — | ✅ |
| AI Detection | detect_ai() — 13-metric + statistical ML | ✅ | ✅ | ✅ |
| | detect_ai_batch() — batch detection | ✅ | — | — |
| | detect_ai_sentences() — per-sentence | ✅ | — | — |
| | detect_ai_mixed() — mixed content | ✅ | — | — |
| NLP | paraphrase() — syntactic transforms | ✅ | — | ✅ |
| | POSTagger — rule-based POS (EN/RU/UK/DE) | ✅ | — | — |
| | CJKSegmenter — zh/ja/ko word segmentation | ✅ | — | — |
| | SyntaxRewriter — 8 sentence transforms | ✅ | — | — |
| | WordLanguageModel — perplexity (14 langs) | ✅ | — | — |
| | CollocEngine — PMI collocation scoring | ✅ | — | — |
| Tone | analyze_tone() — formality analysis | ✅ | — | ✅ |
| | adjust_tone() — 7-level adjustment | ✅ | — | ✅ |
| Watermarks | detect_watermarks() — 5 types | ✅ | — | ✅ |
| | clean_watermarks() — removal | ✅ | — | ✅ |
| Spinning | spin() / spin_variants() | ✅ | — | ✅ |
| Analysis | analyze_coherence() — paragraph flow | ✅ | — | ✅ |
| | full_readability() — 6 indices | ✅ | — | ✅ |
| | Stylistic fingerprinting | ✅ | — | — |
| Quality | BenchmarkSuite — 6-dimension scoring | ✅ | — | — |
| | FingerprintRandomizer — anti-detection | ✅ | — | — |
| Advanced | Style presets (5 personas) | ✅ | — | — |
| | Auto-Tuner (feedback loop) | ✅ | — | — |
| | Plugin system | ✅ | — | ✅ |
| | REST API (12 endpoints) | ✅ | — | — |
| | CLI (15+ commands) | ✅ | — | — |
| Languages | Full dictionary support | 14 | 2 | 14 |
| | Universal processor | ✅ | ✅ | ✅ |
Profiles
| Profile | Use Case | Sentence Length | Colloquialisms | Default Intensity |
|---|---|---|---|---|
| chat | Messaging, social media | 8–18 words | High | 80 |
| web | Blog posts, articles | 10–22 words | Medium | 60 |
| seo | SEO content (keyword-safe) | 12–25 words | None | 40 |
| docs | Technical documentation | 12–28 words | None | 50 |
| formal | Academic, legal | 15–30 words | None | 30 |
| academic | Research papers | 15–30 words | None | 25 |
| marketing | Sales, promo copy | 8–20 words | Medium | 70 |
| social | Social media posts | 6–15 words | High | 85 |
| email | Business emails | 10–22 words | Medium | 50 |
Style presets: student · copywriter · scientist · journalist · blogger
result = humanize(text, profile="seo", intensity=40,
                  constraints={"keep_keywords": ["API", "cloud"]})
Processing Pipeline
Input → Watermark Cleaning → Segmentation → CJK Segmentation → Typography
→ Debureaucratization → Structure → Repetitions → Liveliness
→ Paraphrasing → Syntax Rewriting → Tone Harmonization → Universal
→ Naturalization → Word LM Quality Gate → Readability → Grammar
→ Coherence Repair → Fingerprint Diversification → Validation → Output
17 stages with adaptive intensity (auto-reduces processing for already-natural text) and graduated retry (retries at lower intensity if change ratio exceeds limit).
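The graduated-retry behavior can be sketched in a few lines. This is an illustrative stand-in, not the pipeline's actual code — `humanize_once`, the stub, and the step size are all hypothetical:

```python
def humanize_with_retry(text, humanize_once, intensity=70,
                        max_change_ratio=0.3, step=20):
    """Graduated retry: if the rewrite changed too much of the text,
    try again at a lower intensity until the change ratio fits the limit."""
    while intensity > 0:
        out, ratio = humanize_once(text, intensity)
        if ratio <= max_change_ratio:
            return out, intensity
        intensity -= step  # back off and retry more conservatively
    return text, 0  # give up: return the input unchanged

# Demo with a stub pipeline whose change ratio simply tracks intensity:
def stub_pipeline(text, intensity):
    return text.upper(), intensity / 100

out, used_intensity = humanize_with_retry("hello world", stub_pipeline)
```

With the stub, the loop tries intensity 70 and 50 (too much change) and settles at 30.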
AI Detection
13-metric ensemble + 35-feature statistical detector. No ML models, no APIs.
| Metric | What It Measures |
|---|---|
| AI Patterns | Formulaic phrases ("it is important to note", "furthermore") |
| Burstiness | Sentence length uniformity (humans vary, AI doesn't) |
| Opening Diversity | Repetitive sentence starts |
| Entropy | Word predictability (Shannon entropy) |
| Vocabulary | Lexical richness (type-to-token ratio) |
| Perplexity | Character-level predictability |
| + 7 more | Stylometry, coherence, grammar perfection, punctuation, rhythm, readability, Zipf |
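Two of these metrics are simple enough to sketch directly. Burstiness here is the coefficient of variation of sentence lengths, and entropy is word-level Shannon entropy — illustrative formulations, not necessarily the detector's exact ones:

```python
import math
import re

def burstiness(text: str) -> float:
    """Coefficient of variation of sentence lengths (0 = perfectly uniform)."""
    lengths = [len(s.split()) for s in re.split(r"[.!?]+", text) if s.strip()]
    if len(lengths) < 2:
        return 0.0
    mean = sum(lengths) / len(lengths)
    var = sum((x - mean) ** 2 for x in lengths) / len(lengths)
    return math.sqrt(var) / mean

def shannon_entropy(text: str) -> float:
    """Word-level Shannon entropy in bits (higher = less predictable)."""
    words = text.lower().split()
    if not words:
        return 0.0
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Uniform sentence lengths (a typical AI tell) give a burstiness near zero, while human prose with short and long sentences mixed scores higher.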
Ensemble: Weighted sum (50%) + Strong signal detector (30%) + Majority voting (20%)
Verdicts: human_written (< 35%) · mixed (35–65%) · ai_generated (≥ 65%)
result = detect_ai("Text to check.", lang="en")
print(f"{result['score']:.0%} — {result['verdict']}")
# Per-sentence detection
for s in detect_ai_sentences(text, lang="en"):
    print(f"{'🤖' if s['label'] == 'ai' else '👤'} {s['text'][:80]}")
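The three-part ensemble described above can be sketched as follows. Only the 50/30/20 combination and the verdict cutoffs come from the docs — the per-metric weights and the strong-signal threshold here are illustrative:

```python
def ensemble_score(metric_scores: dict, weights: dict):
    """Combine per-metric AI scores (each 0..1) into a final score + verdict:
    weighted sum (50%) + strong-signal detector (30%) + majority vote (20%)."""
    weighted = (sum(weights[m] * s for m, s in metric_scores.items())
                / sum(weights.values()))
    # Strong signal: any single metric firing very high (threshold assumed)
    strong = 1.0 if any(s >= 0.9 for s in metric_scores.values()) else 0.0
    # Majority vote: fraction of metrics above the midpoint
    majority = sum(1 for s in metric_scores.values() if s >= 0.5) / len(metric_scores)
    score = 0.5 * weighted + 0.3 * strong + 0.2 * majority
    if score >= 0.65:
        verdict = "ai_generated"
    elif score >= 0.35:
        verdict = "mixed"
    else:
        verdict = "human_written"
    return score, verdict
```

The strong-signal term lets one decisive metric (e.g. a formulaic-phrase hit) outvote an otherwise bland average.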
CLI
texthumanize input.txt -l en -p web -i 70 -o output.txt
texthumanize input.txt --detect-ai
texthumanize input.txt --analyze
texthumanize input.txt --paraphrase -o out.txt
texthumanize input.txt --tone casual
texthumanize dummy --api --port 8080
echo "Text" | texthumanize - -l en
REST API
python -m texthumanize.api --port 8080
| Method | Endpoint | Description |
|---|---|---|
| POST | /humanize | Humanize text |
| POST | /detect-ai | AI detection (single or batch) |
| POST | /analyze | Text metrics |
| POST | /paraphrase | Paraphrase |
| POST | /tone/analyze | Tone analysis |
| POST | /tone/adjust | Tone adjustment |
| POST | /watermarks/detect | Detect watermarks |
| POST | /watermarks/clean | Clean watermarks |
| POST | /spin | Text spinning |
| POST | /coherence | Coherence analysis |
| POST | /readability | Readability metrics |
| GET | /health | Health check |
curl -X POST http://localhost:8080/humanize \
-H "Content-Type: application/json" \
-d '{"text": "Your text here.", "lang": "en", "profile": "web"}'
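The same call from Python needs nothing beyond the stdlib, in keeping with the zero-dependency theme. A small client sketch against the `/humanize` endpoint — the helper names are ours, and the response fields depend on what the server returns:

```python
import json
import urllib.request

API_URL = "http://localhost:8080/humanize"  # default port from the docs

def build_request(text: str, lang: str = "en",
                  profile: str = "web") -> urllib.request.Request:
    """Build the POST request with a JSON body for /humanize."""
    body = json.dumps({"text": text, "lang": lang, "profile": profile})
    return urllib.request.Request(
        API_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def humanize_remote(text: str, lang: str = "en", profile: str = "web") -> dict:
    """Send the request and decode the JSON response (server must be running)."""
    with urllib.request.urlopen(build_request(text, lang, profile), timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Usage: start the server with `python -m texthumanize.api --port 8080`, then call `humanize_remote("Your text here.")`.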
Language Support
| Language | Code | Bureaucratic | Synonyms | Collocations |
|---|---|---|---|---|
| Russian | ru | 70+ | 50+ | 408 |
| Ukrainian | uk | 50+ | 48 | 38 |
| English | en | 40+ | 35+ | 1,578 |
| German | de | 64 | 45 | 125 |
| French | fr | 20 | 20 | 128 |
| Spanish | es | 18 | 18 | 126 |
| Polish | pl | 18 | 18 | 34 |
| Portuguese | pt | 16 | 17 | 36 |
| Italian | it | 16 | 17 | 38 |
| Arabic | ar | 81 | 80 | — |
| Chinese | zh | 80 | 80 | — |
| Japanese | ja | 60+ | 60+ | — |
| Korean | ko | 60+ | 60+ | — |
| Turkish | tr | 60+ | 60+ | — |
The universal processor handles any language using statistical methods (burstiness, perplexity, punctuation normalization).
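Character-level perplexity is one such language-agnostic signal: it needs no dictionary, only a reference sample in the target script. An illustrative add-one-smoothed character-bigram version (not the library's actual model):

```python
import math
from collections import Counter

def char_perplexity(text: str, reference: str) -> float:
    """Perplexity of `text` under a char-bigram model trained on `reference`.
    Add-one smoothing keeps unseen bigrams from zeroing the probability."""
    bigrams = Counter(zip(reference, reference[1:]))
    unigrams = Counter(reference)
    vocab = len(set(reference)) or 1
    pairs = list(zip(text, text[1:]))
    log_prob = 0.0
    for a, b in pairs:
        p = (bigrams[(a, b)] + 1) / (unigrams[a] + vocab)
        log_prob += math.log2(p)
    return 2 ** (-log_prob / max(len(pairs), 1))
```

Text that follows the reference's character patterns scores low perplexity; gibberish or out-of-distribution text scores high — in any script, no language pack needed.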
Performance
All benchmarks on Apple Silicon (M-series), Python 3.12, single thread.
| Function | Text Size | Avg Latency | Per 1K Words | Peak Memory |
|---|---|---|---|---|
| humanize() | 30 words | 0.1 ms | ~5 ms | 4 KB |
| humanize() | 80 words | 1.5 ms | ~19 ms | 4 KB |
| humanize() | 400 words | 0.1 ms | < 1 ms | 6 KB |
| detect_ai() | 30 words | 4.3 ms | — | 22 KB |
| detect_ai() | 80 words | 36.8 ms | — | 71 KB |
| detect_ai() | 400 words | 162 ms | — | 196 KB |
| analyze() | 80 words | 478 ms | — | 362 KB |
| paraphrase() | 80 words | 0.2 ms | — | 8 KB |
| Property | Value |
|---|---|
| LRU cache hit | 11× faster than cold call |
| External network calls | 0 (offline-first) |
| Deterministic (same seed) | ✅ Always |
| Pipeline timeout | 30 s (configurable) |
| Rate limiting (API) | 10 req/s per IP, burst 20 |
Run benchmarks yourself:
python benchmarks/run_benchmark.py
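The LRU-cache speedup works the way `functools.lru_cache` does: a repeated identical call skips the pipeline entirely and returns the memoized result. A toy illustration — the `text.upper()` body is a placeholder, not the real pipeline:

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_process(text: str, lang: str = "en") -> str:
    # Stand-in for an expensive pipeline call; the speedup comes from
    # skipping recomputation on repeated identical (text, lang) inputs.
    return text.upper()

cached_process("hello", "en")          # cold call: cache miss, result computed
warm = cached_process("hello", "en")   # warm call: served straight from the cache
info = cached_process.cache_info()     # hits=1, misses=1
```

Since inputs must hash identically to hit the cache, the benefit shows up in batch workloads with duplicate texts, not in one-off calls.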
Plugin System
from texthumanize import Pipeline, humanize
def add_disclaimer(text: str, lang: str) -> str:
    return text + "\n\n---\nProcessed by TextHumanize."
Pipeline.register_hook(add_disclaimer, after="naturalization")
result = humanize("Your text here.")
Pipeline.clear_plugins()
Available stages: watermark → segmentation → typography → debureaucratization → structure → repetitions → liveliness → universal → naturalization → validation → restore
Architecture
texthumanize/ # 75 Python modules, 42,375 lines
├── core.py # Facade: humanize(), analyze(), detect_ai()
├── pipeline.py # 17-stage pipeline + adaptive intensity
├── api.py # REST API server (12 endpoints)
├── cli.py # CLI (15+ commands)
├── exceptions.py # Exception hierarchy
│
├── analyzer.py # Artificiality scoring + 6 readability metrics
├── detectors.py # AI detector: 13 metrics + ensemble
├── statistical_detector.py # 35-feature ML classifier
├── pos_tagger.py # POS tagger (EN/RU/UK/DE)
├── collocation_engine.py # PMI collocation scoring (2,511 collocations)
├── word_lm.py # Word-level LM (14 langs)
│
├── normalizer.py # Typography (stage 2)
├── decancel.py # Debureaucratization (stage 3)
├── structure.py # Sentence diversification (stage 4)
├── naturalizer.py # Burstiness + perplexity (stage 10)
├── paraphraser_ext.py # Semantic paraphrasing (stage 7)
├── syntax_rewriter.py # Structural transforms (stage 7b)
├── grammar_fix.py # Grammar correction (stage 12)
├── coherence_repair.py # Coherence repair (stage 13)
├── validator.py # Quality validation (stage 14)
│
├── tone.py # Tone analysis & adjustment
├── watermark.py # Watermark detection & cleaning
├── spinner.py # Text spinning
├── coherence.py # Coherence analysis
├── morphology.py # Morphological engine (RU/UK/EN/DE)
├── ... # 30+ more modules
│
└── lang/ # 14 language packs + registry
├── en.py, ru.py, de.py ... # Data only, no logic
└── ar.py, zh.py, ja.py ... # Including CJK + RTL
Design principles: Modular · Declarative rules · Idempotent · Safe defaults · Extensible · Zero dependencies · Lazy imports
Testing & Quality
| Platform | Tests | Status |
|---|---|---|
| Python | 1,802 | ✅ All passing |
| PHP | 223 | ✅ All passing |
| TypeScript | 28 | ✅ All passing |
| Total | 2,053 | ✅ |
pytest -q # 1802 passed
ruff check texthumanize/ # Lint
mypy texthumanize/ # Type check
cd php && php vendor/bin/phpunit # 223 tests
CI/CD runs on every push: Python 3.9–3.13 + PHP 8.1–8.3 matrix, ruff, mypy, pytest with coverage ≥70%.
Security
| Aspect | Implementation |
|---|---|
| Input limits | 1 MB text, 5 MB API body |
| Network calls | Zero. No telemetry, no analytics |
| Dependencies | Zero. Pure stdlib |
| Regex safety | All linear-time, no user input compiled to regex |
| Reproducibility | Seed-based PRNG, deterministic output |
| Sandboxing | Resource limits documented for production |
Docker
docker build -t texthumanize .
docker run -p 8080:8080 texthumanize
# API mode
docker run -p 8080:8080 texthumanize --api --port 8080
# Process a file
docker run -v $(pwd):/data texthumanize /data/input.txt -o /data/output.txt
For Business & Enterprise
| Requirement | How TextHumanize Delivers |
|---|---|
| Predictability | Seed-based PRNG — same input + seed = identical output |
| Privacy | 100% local. Zero network calls. No data leaves your server |
| Auditability | Every call returns change_ratio, quality_score, similarity, explain() report |
| Integration | Python SDK · JS SDK · PHP SDK · CLI · REST API · Docker |
| Reliability | 2,053 tests across 3 platforms, CI/CD with ruff + mypy |
| No vendor lock-in | Zero dependencies. No cloud APIs, no API keys, no rate limits |
| Language coverage | 14 full language packs + universal processor for any language |
Contributing
See CONTRIBUTING.md for development setup, testing, and PR guidelines.
License & Pricing
TextHumanize uses a dual license model:
| Use Case | License | Cost |
|---|---|---|
| Personal / Academic / Open-source | Free License | Free |
| Commercial — 1 dev, 1 project | Indie | $199/year |
| Commercial — up to 5 devs | Startup | $499/year |
| Commercial — up to 20 devs | Business | $1,499/year |
| Enterprise / On-prem / SLA | Enterprise | Contact us |
All commercial licenses include full source code, updates for 1 year, and email support.
Full licensing details → · See LICENSE for legal text · Contact: ksanyok@me.com
Documentation · Live Demo · GitHub · Issues · Discussions · Commercial License
Download files
Source Distribution
Built Distribution
File details
Details for the file texthumanize-0.24.0.tar.gz.
File metadata
- Download URL: texthumanize-0.24.0.tar.gz
- Upload date:
- Size: 1.4 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 80580edbc1dcb41d863c3c9cb12ab7b4b786b2edfe9d3b156ecb887ae7f23e2f |
| MD5 | c9d65dda1d054415f8a61b9344ac37fb |
| BLAKE2b-256 | ae6d97d1a83d338682ef26b9b5cdc5bb5b3f44823d49425e1901ab750074ba90 |
File details
Details for the file texthumanize-0.24.0-py3-none-any.whl.
File metadata
- Download URL: texthumanize-0.24.0-py3-none-any.whl
- Upload date:
- Size: 1.2 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 56b7d7d1157d27fc8c6cd0be58b351b662b5eefd6d62757a0ead81d64be97a6e |
| MD5 | 4aa8152c6b4660dedb0b39d8786c7b7d |
| BLAKE2b-256 | 4c76df9bc951320781af9d42dfaa89b2a0ef00e33f2960d7c6a38aeaa6dae8a5 |