Algorithmic text humanization with AI detection, tone analysis, paraphrasing, and spinning — 17-stage pipeline, 14 languages, zero dependencies
Project description
TextHumanize
The most advanced open-source text naturalization engine
Normalize style, improve readability, and ensure brand-safe content — offline, private, and blazing fast
56,800+ lines of code · 95 Python modules · 17-stage pipeline · 14 languages + universal · 1,995 tests
Quick Start · Features · Documentation · Live Demo · License
TextHumanize is a pure-algorithmic text processing engine that normalizes style, improves readability, and reduces mechanical patterns in text. No neural networks, no API keys, no internet — just 56K+ lines of finely tuned rules, dictionaries, and statistical methods.
Honest note: TextHumanize is a style-normalization tool, not an AI-detection bypass tool. It reduces AI-like patterns (formulaic connectors, uniform sentence length, bureaucratic vocabulary) but does not guarantee that processed text will pass external AI detectors. Quality of humanization varies by language and text type. See Limitations below.
Built-in toolkit: AI Detection · Paraphrasing · Tone Analysis · Watermark Cleaning · Content Spinning · Coherence Analysis · Readability Scoring · Stylistic Fingerprinting · Auto-Tuner · Perplexity Analysis · Plagiarism Detection · Async API · SSE Streaming
Platforms: Python (full) · TypeScript/JavaScript (core) · PHP (full)
Languages: 🇷🇺 RU · 🇺🇦 UK · 🇬🇧 EN · 🇩🇪 DE · 🇫🇷 FR · 🇪🇸 ES · 🇵🇱 PL · 🇧🇷 PT · 🇮🇹 IT · 🇸🇦 AR · 🇨🇳 ZH · 🇯🇵 JA · 🇰🇷 KO · 🇹🇷 TR · 🌍 any language via universal processor
Why TextHumanize?
Problem: Machine-generated text has uniform sentence lengths, bureaucratic vocabulary, formulaic connectors, and low stylistic diversity — reducing readability, engagement, and brand authenticity.
Solution: TextHumanize algorithmically normalizes text style while preserving meaning. Configurable intensity, deterministic output, full change reports. No cloud APIs, no rate limits, no data leaks.
| Advantage | Details |
|---|---|
| ~3,000 chars/sec | Process a full article in under a second |
| 100% private | All processing is local — your text never leaves your machine |
| Precise control | Intensity 0–100, 9 profiles, keyword preservation, max change ratio |
| 14 languages | Full dictionaries for 14 languages; statistical processor for any other |
| Zero dependencies | Pure Python stdlib — no pip packages, no model downloads, starts in <100ms |
| Reproducible | Seed-based PRNG — same input + same seed = identical output |
| AI detection | 13-metric ensemble + 35-feature statistical detector — no ML required |
| Enterprise-ready | Dual license, 1,995 tests, CI/CD, benchmarks, on-prem deployment |
Comparison with Competitors
| Criterion | TextHumanize | Online Humanizers | GPT/LLM Rewriting |
|---|---|---|---|
| Works offline | ✅ | ❌ | ❌ |
| Privacy | ✅ Local only | ❌ Third-party servers | ❌ Cloud API |
| Speed | ~3K chars/sec | 2–10 sec (network) | ~500 chars/sec |
| Cost per 1M chars | $0 | $10–50/month | $15–60 (GPT-4) |
| API key required | No | Yes | Yes |
| Deterministic | ✅ Seed-based | ❌ | ❌ |
| Languages | 14 + universal | 1–3 | 10+ but expensive |
| Built-in AI detector | ✅ 13 metrics | ❌ or basic | ❌ |
| Max change control | ✅ max_change_ratio | ❌ | ❌ Unpredictable |
| Open source | ✅ | ❌ | ❌ |
| Self-hosted | ✅ | ❌ | ❌ |
vs. Other Open-Source Libraries
| Feature | TextHumanize | Typical Alternatives |
|---|---|---|
| Pipeline stages | 17 | 2–4 |
| Languages | 14 + universal | 1–2 |
| AI detection | ✅ 13 metrics + statistical ML | ❌ |
| Python tests | 1,995 | 10–50 |
| Codebase size | 56,800+ lines | 500–2K |
| Platforms | Python + JS + PHP | Single |
| Plugin system | ✅ | ❌ |
| Tone analysis | ✅ 7 levels | ❌ |
| REST API | ✅ 12 endpoints | ❌ |
| Readability metrics | ✅ 6 indices | 0–1 |
| Morphological engine | ✅ 4 languages | ❌ |
Installation
pip install texthumanize
From source:
git clone https://github.com/ksanyok/TextHumanize.git
cd TextHumanize && pip install -e .
PHP / TypeScript
# PHP
cd php/ && composer install
# TypeScript
cd js/ && npm install
Quick Start
from texthumanize import humanize, analyze, detect_ai, explain
# Humanize text
result = humanize("This text utilizes a comprehensive methodology for implementation.", lang="en")
print(result.text) # → "This text uses a complete method for setup."
print(result.change_ratio) # → 0.15
print(result.quality_score) # → 0.85
# With profile and intensity
result = humanize(text, lang="en", profile="web", intensity=70)
# AI Detection — 13-metric ensemble
ai = detect_ai("Text to check for AI generation.", lang="en")
print(f"AI: {ai['score']:.0%} | {ai['verdict']} | Confidence: {ai['confidence']:.0%}")
# Analyze text metrics
report = analyze("Text to analyze.", lang="en")
print(f"Artificiality: {report.artificiality_score:.1f}/100")
# Full change report
print(explain(result))
All Features at a Glance
from texthumanize import (
humanize, humanize_batch, humanize_chunked, humanize_ai,
detect_ai, detect_ai_batch, detect_ai_sentences, detect_ai_mixed,
paraphrase, analyze_tone, adjust_tone,
detect_watermarks, clean_watermarks,
spin, spin_variants, analyze_coherence, full_readability,
AutoTuner, BenchmarkSuite, STYLE_PRESETS,
)
# Paraphrasing
print(paraphrase("The system works efficiently.", lang="en"))
# Tone — 7-level formality scale
tone = analyze_tone("Please submit the documentation.", lang="en")
casual = adjust_tone("It is imperative to proceed.", target="casual", lang="en")
# Watermarks
clean = clean_watermarks("Te\u200bxt wi\u200bth hid\u200bden chars")
# Spinning
variants = spin_variants("Original text.", count=5, lang="en")
# Batch + chunked processing
results = humanize_batch(["Text 1", "Text 2"], lang="en", max_workers=4)
result = humanize_chunked(large_doc, chunk_size=3000, lang="ru")
# Async API — native asyncio support
from texthumanize import async_humanize, async_detect_ai
result = await async_humanize("Text to process", lang="en")
ai = await async_detect_ai("Text to check", lang="en")
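The `clean_watermarks` call above works by stripping invisible Unicode code points. A minimal sketch of that one idea (a simplification, not the library's implementation — the library also detects other watermark types):

```python
import re

# Common zero-width / invisible code points used to watermark text
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")

def strip_zero_width(text: str) -> str:
    """Remove zero-width characters that can hide watermarks in text."""
    return ZERO_WIDTH.sub("", text)

print(strip_zero_width("Te\u200bxt wi\u200bth hid\u200bden chars"))
# → Text with hidden chars
```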
Before & After
Before (AI-generated):
Furthermore, it is important to note that the implementation of cloud computing facilitates the optimization of business processes. Additionally, the utilization of microservices constitutes a significant advancement.
After (TextHumanize, profile="web", intensity=70):
But cloud computing helps optimize how businesses work. Also, microservices are a big step forward.
Feature Matrix
| Category | Feature | Python | JS | PHP |
|---|---|---|---|---|
| Core | `humanize()` — 17-stage pipeline | ✅ | ✅ | ✅ |
| | `humanize_batch()` — parallel processing | ✅ | — | ✅ |
| | `humanize_chunked()` — large text support | ✅ | — | ✅ |
| | `humanize_ai()` — three-tier AI + rules | ✅ | — | — |
| | `analyze()` — artificiality scoring | ✅ | ✅ | ✅ |
| | `explain()` — change report | ✅ | — | ✅ |
| AI Detection | `detect_ai()` — 13-metric + statistical ML | ✅ | ✅ | ✅ |
| | `detect_ai_batch()` — batch detection | ✅ | — | — |
| | `detect_ai_sentences()` — per-sentence | ✅ | — | — |
| | `detect_ai_mixed()` — mixed content | ✅ | — | — |
| NLP | `paraphrase()` — syntactic transforms | ✅ | — | ✅ |
| | `POSTagger` — rule-based POS (EN/RU/UK/DE) | ✅ | — | — |
| | `CJKSegmenter` — zh/ja/ko word segmentation | ✅ | — | — |
| | `SyntaxRewriter` — 8 sentence transforms | ✅ | — | — |
| | `WordLanguageModel` — perplexity (14 langs) | ✅ | — | — |
| | `CollocEngine` — PMI collocation scoring | ✅ | — | — |
| Tone | `analyze_tone()` — formality analysis | ✅ | — | ✅ |
| | `adjust_tone()` — 7-level adjustment | ✅ | — | ✅ |
| Watermarks | `detect_watermarks()` — 5 types | ✅ | — | ✅ |
| | `clean_watermarks()` — removal | ✅ | — | ✅ |
| Spinning | `spin()` / `spin_variants()` | ✅ | — | ✅ |
| Analysis | `analyze_coherence()` — paragraph flow | ✅ | — | ✅ |
| | `full_readability()` — 6 indices | ✅ | — | ✅ |
| | Stylistic fingerprinting | ✅ | — | — |
| Quality | `BenchmarkSuite` — 6-dimension scoring | ✅ | — | — |
| | `FingerprintRandomizer` — anti-detection | ✅ | — | — |
| Advanced | Style presets (5 personas) | ✅ | — | — |
| | Auto-Tuner (feedback loop) | ✅ | — | — |
| | Plugin system | ✅ | — | ✅ |
| | REST API (12 endpoints) | ✅ | — | — |
| | CLI (15+ commands) | ✅ | — | — |
| Languages | Full dictionary support | 14 | 2 | 14 |
| | Universal processor | ✅ | ✅ | ✅ |
Profiles
| Profile | Use Case | Sentence Length | Colloquialisms | Default Intensity |
|---|---|---|---|---|
| `chat` | Messaging, social media | 8–18 words | High | 80 |
| `web` | Blog posts, articles | 10–22 words | Medium | 60 |
| `seo` | SEO content (keyword-safe) | 12–25 words | None | 40 |
| `docs` | Technical documentation | 12–28 words | None | 50 |
| `formal` | Academic, legal | 15–30 words | None | 30 |
| `academic` | Research papers | 15–30 words | None | 25 |
| `marketing` | Sales, promo copy | 8–20 words | Medium | 70 |
| `social` | Social media posts | 6–15 words | High | 85 |
| `email` | Business emails | 10–22 words | Medium | 50 |
Style presets: student · copywriter · scientist · journalist · blogger
result = humanize(text, profile="seo", intensity=40,
constraints={"keep_keywords": ["API", "cloud"]})
Processing Pipeline
Input → Watermark Cleaning → Segmentation → CJK Segmentation → Typography
→ Debureaucratization → Structure → Repetitions → Liveliness
→ Paraphrasing → Syntax Rewriting → Tone Harmonization → Universal
→ Naturalization → Word LM Quality Gate → Readability → Grammar
→ Coherence Repair → Fingerprint Diversification → Validation → Output
17 stages with adaptive intensity (auto-reduces processing for already-natural text) and graduated retry (retries at lower intensity if change ratio exceeds limit).
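The graduated-retry idea can be sketched in a few lines (a hypothetical illustration — `run_pipeline`, the step factors, and the ratio model below are stand-ins, not the library's internals):

```python
def run_pipeline(text: str, intensity: int):
    """Stub pipeline: pretend the change ratio scales with intensity."""
    return text.lower(), intensity / 100

def humanize_with_retry(text, run_pipeline, intensity=70,
                        max_change_ratio=0.4, steps=(1.0, 0.7, 0.5)):
    """Graduated retry: rerun at progressively lower intensity
    until the change ratio fits within the configured limit."""
    for factor in steps:
        candidate, ratio = run_pipeline(text, intensity=int(intensity * factor))
        if ratio <= max_change_ratio:
            return candidate, ratio
    return candidate, ratio  # last attempt, even if still over the limit

text, ratio = humanize_with_retry("SOME TEXT", run_pipeline, intensity=70)
print(ratio)  # → 0.35 (intensity stepped down 70 → 49 → 35 before fitting)
```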
AI Detection
13-metric ensemble + 35-feature statistical detector. No ML models, no APIs.
| Metric | What It Measures |
|---|---|
| AI Patterns | Formulaic phrases ("it is important to note", "furthermore") |
| Burstiness | Sentence length uniformity (humans vary, AI doesn't) |
| Opening Diversity | Repetitive sentence starts |
| Entropy | Word predictability (Shannon entropy) |
| Vocabulary | Lexical richness (type-to-token ratio) |
| Perplexity | Character-level predictability |
| + 7 more | Stylometry, coherence, grammar perfection, punctuation, rhythm, readability, Zipf |
Ensemble: Weighted sum (50%) + Strong signal detector (30%) + Majority voting (20%)
Verdicts: human_written (< 35%) · mixed (35–65%) · ai_generated (≥ 65%)
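The published weights and verdict thresholds can be sketched as follows (a simplified illustration: the detector's per-metric weights and strong-signal cutoff are internal, so the values here are assumptions):

```python
def ensemble_score(metric_scores, metric_weights, strong_threshold=0.8):
    """Combine per-metric AI scores (0..1) per the documented recipe:
    weighted sum (50%) + strong-signal share (30%) + majority vote (20%)."""
    total_w = sum(metric_weights.values())
    weighted = sum(metric_scores[m] * w for m, w in metric_weights.items()) / total_w
    strong = sum(s >= strong_threshold for s in metric_scores.values()) / len(metric_scores)
    majority = 1.0 if sum(s >= 0.5 for s in metric_scores.values()) > len(metric_scores) / 2 else 0.0
    return 0.5 * weighted + 0.3 * strong + 0.2 * majority

def verdict(score):
    """Map an ensemble score to the documented verdict bands."""
    if score < 0.35:
        return "human_written"
    if score < 0.65:
        return "mixed"
    return "ai_generated"

scores = {"patterns": 0.9, "burstiness": 0.8, "entropy": 0.2}
weights = {"patterns": 1.0, "burstiness": 1.0, "entropy": 1.0}
print(verdict(ensemble_score(scores, weights)))  # → ai_generated
```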
result = detect_ai("Text to check.", lang="en")
print(f"{result['score']:.0%} — {result['verdict']}")
# Per-sentence detection
for s in detect_ai_sentences(text, lang="en"):
print(f"{'🤖' if s['label'] == 'ai' else '👤'} {s['text'][:80]}")
CLI
texthumanize input.txt -l en -p web -i 70 -o output.txt
texthumanize input.txt --detect-ai
texthumanize input.txt --analyze
texthumanize input.txt --paraphrase -o out.txt
texthumanize input.txt --tone casual
texthumanize dummy --api --port 8080
echo "Text" | texthumanize - -l en
REST API
python -m texthumanize.api --port 8080
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/humanize` | Humanize text |
| `POST` | `/detect-ai` | AI detection (single or batch) |
| `POST` | `/analyze` | Text metrics |
| `POST` | `/paraphrase` | Paraphrase |
| `POST` | `/tone/analyze` | Tone analysis |
| `POST` | `/tone/adjust` | Tone adjustment |
| `POST` | `/watermarks/detect` | Detect watermarks |
| `POST` | `/watermarks/clean` | Clean watermarks |
| `POST` | `/spin` | Text spinning |
| `POST` | `/coherence` | Coherence analysis |
| `POST` | `/readability` | Readability metrics |
| `GET` | `/health` | Health check |
curl -X POST http://localhost:8080/humanize \
-H "Content-Type: application/json" \
-d '{"text": "Your text here.", "lang": "en", "profile": "web"}'
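The same call from Python using only the stdlib (a sketch: the URL and payload mirror the curl example above; the response shape is an assumption):

```python
import json
import urllib.request

API_URL = "http://localhost:8080/humanize"  # local server started above

def build_request(text: str, lang: str = "en", profile: str = "web") -> urllib.request.Request:
    """Build the POST request that the curl example sends."""
    body = json.dumps({"text": text, "lang": lang, "profile": profile}).encode("utf-8")
    return urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )

# Against a running server:
# with urllib.request.urlopen(build_request("Your text here.")) as resp:
#     print(json.load(resp))
```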
Language Support
| Language | Code | Bureaucratic | Synonyms | Collocations |
|---|---|---|---|---|
| Russian | `ru` | 70+ | 50+ | 408 |
| Ukrainian | `uk` | 50+ | 48 | 38 |
| English | `en` | 40+ | 35+ | 1,578 |
| German | `de` | 64 | 45 | 125 |
| French | `fr` | 20 | 20 | 128 |
| Spanish | `es` | 18 | 18 | 126 |
| Polish | `pl` | 18 | 18 | 34 |
| Portuguese | `pt` | 16 | 17 | 36 |
| Italian | `it` | 16 | 17 | 38 |
| Arabic | `ar` | 81 | 80 | — |
| Chinese | `zh` | 80 | 80 | — |
| Japanese | `ja` | 60+ | 60+ | — |
| Korean | `ko` | 60+ | 60+ | — |
| Turkish | `tr` | 60+ | 60+ | — |
Universal processor works for any language using statistical methods (burstiness, perplexity, punctuation normalization).
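Burstiness, one of the statistical signals mentioned above, needs nothing beyond the stdlib (a naive sketch — the simplistic regex sentence splitter is an assumption, not the library's segmenter):

```python
import re
import statistics

def burstiness(text: str) -> float:
    """Coefficient of variation of sentence lengths in words.
    Varied human prose scores higher than uniform machine output."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return statistics.stdev(lengths) / statistics.mean(lengths)

uniform = "One two three four five. One two three four five. One two three four five."
varied = "Short. This one is a fair bit longer than the first. Tiny."
print(burstiness(uniform) < burstiness(varied))  # → True
```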
Performance
All benchmarks on Apple Silicon (M-series), Python 3.12, single thread.
| Function | Text Size | Avg Latency | Peak Memory |
|---|---|---|---|
| `humanize()` | 30 words | ~60 ms | 4 KB |
| `humanize()` | 80 words | ~200 ms | 4 KB |
| `humanize()` | 400 words | ~1.5 s | 6 KB |
| `detect_ai()` | 30 words | ~50 ms | 22 KB |
| `detect_ai()` | 80 words | ~150 ms | 71 KB |
| `detect_ai()` | 400 words | ~500 ms | 196 KB |
| `analyze()` | 80 words | ~500 ms | 362 KB |
| `paraphrase()` | 80 words | ~5 ms | 8 KB |
| Property | Value |
|---|---|
| LRU cache hit | 11× faster than cold call |
| External network calls | 0 (offline-first) |
| Deterministic (same seed) | ✅ Always |
| Pipeline timeout | 30 s (configurable) |
| Rate limiting (API) | 10 req/s per IP, burst 20 |
Run benchmarks yourself:
python benchmarks/run_benchmark.py
Plugin System
from texthumanize import Pipeline, humanize
def add_disclaimer(text: str, lang: str) -> str:
return text + "\n\n---\nProcessed by TextHumanize."
Pipeline.register_hook(add_disclaimer, after="naturalization")
result = humanize("Your text here.")
Pipeline.clear_plugins()
Available stages: watermark → segmentation → typography → debureaucratization → structure → repetitions → liveliness → universal → naturalization → validation → restore
Architecture
texthumanize/ # 95 Python modules, 56,800+ lines
├── core.py # Facade: humanize(), analyze(), detect_ai()
├── pipeline.py # 17-stage pipeline + adaptive intensity
├── api.py # REST API server (12 endpoints)
├── cli.py # CLI (15+ commands)
├── exceptions.py # Exception hierarchy
│
├── analyzer.py # Artificiality scoring + 6 readability metrics
├── detectors.py # AI detector: 13 metrics + ensemble
├── statistical_detector.py # 35-feature ML classifier
├── pos_tagger.py # POS tagger (EN/RU/UK/DE)
├── collocation_engine.py # PMI collocation scoring (2,511 collocations)
├── word_lm.py # Word-level LM (14 langs)
│
├── normalizer.py # Typography (stage 2)
├── decancel.py # Debureaucratization (stage 3)
├── structure.py # Sentence diversification (stage 4)
├── naturalizer.py # Burstiness + perplexity (stage 10)
├── paraphraser_ext.py # Semantic paraphrasing (stage 7)
├── syntax_rewriter.py # Structural transforms (stage 7b)
├── grammar_fix.py # Grammar correction (stage 12)
├── coherence_repair.py # Coherence repair (stage 13)
├── validator.py # Quality validation (stage 14)
│
├── tone.py # Tone analysis & adjustment
├── watermark.py # Watermark detection & cleaning
├── spinner.py # Text spinning
├── coherence.py # Coherence analysis
├── morphology.py # Morphological engine (RU/UK/EN/DE)
├── ... # 30+ more modules
│
└── lang/ # 14 language packs + registry
├── en.py, ru.py, de.py ... # Data only, no logic
└── ar.py, zh.py, ja.py ... # Including CJK + RTL
Design principles: Modular · Declarative rules · Idempotent · Safe defaults · Extensible · Zero dependencies · Lazy imports
Testing & Quality
| Platform | Tests | Status |
|---|---|---|
| Python | 1,995 | ✅ All passing |
| PHP | 223 | ✅ All passing |
| TypeScript | 28 | ✅ All passing |
| Total | 2,246 | ✅ |
pytest -q # 1995 passed
ruff check texthumanize/ # Lint
mypy texthumanize/ # Type check
cd php && php vendor/bin/phpunit # 223 tests
CI/CD runs on every push: Python 3.9–3.13 + PHP 8.1–8.3 matrix, ruff, mypy, pytest with coverage ≥70%.
Security
| Aspect | Implementation |
|---|---|
| Input limits | 1 MB text, 5 MB API body |
| Network calls | Zero. No telemetry, no analytics |
| Dependencies | Zero. Pure stdlib |
| Regex safety | All linear-time, no user input compiled to regex |
| Reproducibility | Seed-based PRNG, deterministic output |
| Sandboxing | Resource limits documented for production |
Docker
docker build -t texthumanize .
docker run -p 8080:8080 texthumanize
# API mode
docker run -p 8080:8080 texthumanize --api --port 8080
# Process a file
docker run -v $(pwd):/data texthumanize /data/input.txt -o /data/output.txt
For Business & Enterprise
| Requirement | How TextHumanize Delivers |
|---|---|
| Predictability | Seed-based PRNG — same input + seed = identical output |
| Privacy | 100% local. Zero network calls. No data leaves your server |
| Auditability | Every call returns change_ratio, quality_score, similarity, explain() report |
| Integration | Python SDK · JS SDK · PHP SDK · CLI · REST API · Docker |
| Reliability | 2,246 tests across 3 platforms, CI/CD with ruff + mypy |
| No vendor lock-in | Zero dependencies. No cloud APIs, no API keys, no rate limits |
| Language coverage | 14 full language packs + universal processor for any language |
Contributing
See CONTRIBUTING.md for development setup, testing, and PR guidelines.
Limitations
TextHumanize is a style-normalization tool; set your expectations accordingly:
| Aspect | Current State | Notes |
|---|---|---|
| EN humanization | Reduces AI markers by 10–35% | Replaces bureaucratic phrases, varies sentence structure |
| RU humanization | Reduces AI markers by 15–30% | Good at debureaucratization, some sentences may sound awkward |
| UK humanization | Reduces AI markers by 20–50% | Best multilingual support after EN |
| External AI detectors | Not reliable bypass | GPTZero, Originality.ai, etc. use different models |
| Short texts (< 50 words) | Limited effect | Not enough context for meaningful transformation |
| Performance | ~3K chars/sec | Fast for batch processing, but not sub-millisecond |
| Built-in AI detector | Heuristic + statistical | Useful for internal scoring; not equivalent to GPTZero/Turnitin |
| Monotonicity | Higher intensity ≠ always lower AI score | Some transforms at high intensity may create new AI-like patterns |
What TextHumanize does well:
- Removes formulaic connectors ("furthermore", "it is important to note")
- Varies sentence length to add human-like burstiness
- Replaces bureaucratic vocabulary with simpler alternatives
- Deterministic, reproducible results with seed control
- 100% offline, no data leaks, zero dependencies
What TextHumanize does NOT do:
- Guarantee passing external AI detectors (GPTZero, Originality.ai, Turnitin)
- Rewrite text at the semantic level (it's rule-based, not LLM-based)
- Handle domain-specific jargon (medical, legal, etc.)
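The seed-determinism point above can be illustrated with a toy rewriter (purely illustrative — `SYNONYMS` and `toy_rewrite` are hypothetical stand-ins, not the library's code):

```python
import random

# Toy synonym table for the demonstration
SYNONYMS = {"utilize": ["use", "employ"], "comprehensive": ["complete", "thorough"]}

def toy_rewrite(text: str, seed: int) -> str:
    """Seed-driven rewriter: same text + same seed -> identical output."""
    rng = random.Random(seed)
    words = [rng.choice(SYNONYMS[w]) if w in SYNONYMS else w for w in text.split()]
    return " ".join(words)

a = toy_rewrite("we utilize a comprehensive plan", seed=42)
b = toy_rewrite("we utilize a comprehensive plan", seed=42)
print(a == b)  # → True
```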
License & Pricing
TextHumanize uses a dual license model:
| Use Case | License | Cost |
|---|---|---|
| Personal / Academic / Open-source | Free License | Free |
| Commercial — 1 dev, 1 project | Indie | $199/year |
| Commercial — up to 5 devs | Startup | $499/year |
| Commercial — up to 20 devs | Business | $1,499/year |
| Enterprise / On-prem / SLA | Enterprise | Contact us |
All commercial licenses include full source code, updates for 1 year, and email support.
Full licensing details → · See LICENSE for legal text · Contact: ksanyok@me.com
Documentation · Live Demo · GitHub · Issues · Discussions · Commercial License
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file texthumanize-0.25.0.tar.gz.
File metadata
- Download URL: texthumanize-0.25.0.tar.gz
- Upload date:
- Size: 1.4 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `885d4fb47eae29b1a01de02de397bca4740c6f83a2f8bbf39bf7185cfd49f490` |
| MD5 | `2afb0cd92faf90c3133f2541fd35a98e` |
| BLAKE2b-256 | `4411e2be8d05b4804cbca80783267373f7ec04861f3d45b2fe7f6b48a1a2328e` |
File details
Details for the file texthumanize-0.25.0-py3-none-any.whl.
File metadata
- Download URL: texthumanize-0.25.0-py3-none-any.whl
- Upload date:
- Size: 1.2 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `cc9aee3351bb804a65e932746c9a10d31bb15d79d84f00ea6b84cad3cf1e26c5` |
| MD5 | `36439956f20ae9974fbc7d4352118ec2` |
| BLAKE2b-256 | `122c28716b6c7cdcae568b7e2eefd2703baf951d421379b89048afac772485ac` |