Universal text analysis module for detecting language, meaningfulness, and structure
Text Quality Analyzer
Universal Python module for analyzing text quality, detecting language, meaningfulness, and document structure.
Features
- Language Detection: Automatic detection of 97+ languages with confidence scores
- Meaningfulness Check: Determines if text is coherent writing or random characters
- Structure Analysis: Detects Markdown headers, lists, paragraphs, links, and code blocks
- Readability Metrics: Flesch reading ease and other readability scores
- Lightweight: CPU-only, works offline, no GPU required
- Fast: Analyzes 1MB of text in under 1 second
Installation
From source (recommended for development)
cd text-quality-analyzer
pip install -e .
Install dependencies only
pip install -r requirements.txt
Required dependencies
- langid>=1.1.6: Language detection
- textstat>=0.7.3: Readability metrics
- wordfreq>=3.0.0: Word frequency dictionaries
Quick Start
Basic Usage
from text_quality_analyzer import TextProfiler
profiler = TextProfiler()
result = profiler.analyze_text("Your text here")
print(result["language"]) # 'en'
print(result["is_meaningful"]) # True
print(result["is_structured"]) # False
Convenience Function
from text_quality_analyzer import analyze_text
result = analyze_text("## Hello World\n\nThis is a test.")
print(result)
Usage Examples
Example 1: Language Detection
from text_quality_analyzer import TextProfiler
profiler = TextProfiler()
texts = {
    "English": "This is a test document.",
    "Russian": "Это тестовый документ.",
    "Chinese": "这是一个测试文档。"
}
for name, text in texts.items():
    result = profiler.analyze_text(text)
    print(f"{name}: {result['language']} ({result['language_confidence']:.0%})")
# Output:
# English: en (99%)
# Russian: ru (98%)
# Chinese: zh (99%)
Example 2: Quality Filtering
from text_quality_analyzer import TextProfiler
profiler = TextProfiler()
# Filter meaningful text only
texts = [
    "This is a well-written article.",
    "xkcd1234!@#$%^&*()",
    "Short"
]
for text in texts:
    should_process, reason = profiler.should_process_text(
        text,
        require_meaningful=True,
        allowed_languages=["en"]
    )
    print(f"{'ACCEPT' if should_process else 'REJECT'}: {reason}")
Example 3: Structure Detection
from text_quality_analyzer import TextProfiler
profiler = TextProfiler()
markdown_text = """## Introduction
This document has structure.
- Item 1
- Item 2
[Link](https://example.com)
"""
result = profiler.analyze_text(markdown_text)
if result["is_structured"]:
    elements = result["structure_elements"]
    print(f"Headers: {elements['headers']}")
    print(f"Lists: {elements['total_list_items']}")
    print(f"Links: {elements['total_links']}")
Example 4: Detailed Metrics
from text_quality_analyzer import TextProfiler
profiler = TextProfiler()
result = profiler.analyze_text("Your text here")
# Language
print(f"Language: {result['language']}")
print(f"Confidence: {result['language_confidence']:.0%}")
# Meaningfulness
print(f"Meaningful: {result['is_meaningful']}")
print(f"Score: {result['meaningfulness_score']:.2f}")
metrics = result['meaningfulness_metrics']
print(f" Letter ratio: {metrics['letter_ratio']:.2f}")
print(f" Stopword presence: {metrics['stopword_presence']:.2f}")
# Structure
print(f"Structured: {result['is_structured']}")
print(f"Score: {result['structure_score']:.2f}")
# Statistics
print(f"Words: {result['word_count']}")
print(f"Paragraphs: {result['paragraph_count']}")
API Reference
TextProfiler
Main class for text analysis.
profiler = TextProfiler(
    min_confidence=0.7,        # Minimum language detection confidence
    min_meaningfulness=0.6,    # Minimum meaningfulness score
    min_structure=0.5,         # Minimum structure score
    max_text_length=1_000_000  # Maximum text length (1MB)
)
Methods
- analyze_text(text: str) -> dict: Full analysis of text
- quick_check(text: str) -> dict: Boolean checks only (faster)
- get_text_summary(text: str) -> str: Human-readable summary
- should_process_text(text, ...) -> (bool, str): Decision helper for pipelines
Individual Components
from text_quality_analyzer import (
    LanguageDetector,
    MeaningfulnessChecker,
    StructureAnalyzer
)
# Use components individually if needed
lang_detector = LanguageDetector()
result = lang_detector.detect("Hello world")
Output Format
The analyze_text() method returns a dictionary with:
{
    "success": True,
    "language": "en",              # ISO 639-1 code
    "language_confidence": 0.98,   # 0.0-1.0
    "is_meaningful": True,
    "meaningfulness_score": 0.87,  # 0.0-1.0
    "meaningfulness_metrics": {
        "letter_ratio": 0.82,
        "space_ratio": 0.16,
        "stopword_presence": 0.9,
        "avg_word_length": 5.3,
        "dictionary_match": 0.78
    },
    "is_structured": True,
    "structure_score": 0.98,       # 0.0-1.0 (high score due to headers+lists+code)
    "structure_elements": {
        "headers": 3,
        "total_list_items": 5,
        "paragraphs": 4,
        "total_links": 2,
        "code_blocks": 1
    },
    "text_length": 1024,
    "word_count": 145,
    "paragraph_count": 4,
    "readability_index": 58.4,     # Flesch reading ease
    "processing_time_ms": 125
}
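As an illustration of consuming this dictionary, here is a minimal sketch of a decision helper built only on the keys shown above. The function name `decide`, its parameters, and the reason strings are hypothetical; this is not the module's own should_process_text, just one way such a check could be written.

```python
def decide(result, require_meaningful=True, allowed_languages=None, min_confidence=0.7):
    """Return (should_process, reason) from an analyze_text()-style dict.

    Hypothetical helper: the checks and reason strings are assumptions,
    not the library's actual should_process_text logic.
    """
    if not result.get("success", False):
        return False, "analysis failed"
    if allowed_languages is not None and result["language"] not in allowed_languages:
        return False, f"language '{result['language']}' not allowed"
    if result["language_confidence"] < min_confidence:
        return False, "language confidence too low"
    if require_meaningful and not result["is_meaningful"]:
        return False, "text is not meaningful"
    return True, "ok"


# A trimmed-down result using the documented keys
sample = {
    "success": True,
    "language": "en",
    "language_confidence": 0.98,
    "is_meaningful": True,
}
print(decide(sample, allowed_languages=["en", "ru"]))  # (True, 'ok')
```

Returning a reason alongside the boolean keeps rejections easy to log in a pipeline.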
Use Cases
Content Filtering
Filter low-quality or spam content before processing:
# Inside a request handler or pipeline function:
should_process, reason = profiler.should_process_text(
    user_input,
    require_meaningful=True,
    allowed_languages=["en", "ru"]
)
if not should_process:
    return f"Content rejected: {reason}"
Telegram Bot Integration
Filter and classify Telegram posts:
from text_quality_analyzer import TextProfiler

profiler = TextProfiler()

def should_index_post(post_text):
    result = profiler.analyze_text(post_text)
    # Only index meaningful Russian texts
    if result['language'] != 'ru':
        return False
    if not result['is_meaningful']:
        return False
    # Route structured docs to special processing
    # (route_to_structured_queue is an application-defined function)
    if result['is_structured']:
        route_to_structured_queue(post_text)
    return True
Document Classification
Classify documents by structure:
result = profiler.analyze_text(document)
if result['structure_elements']['code_blocks'] > 0:
    doc_type = "technical_documentation"
elif result['structure_elements']['total_list_items'] > 5:
    doc_type = "listicle_article"
elif result['structure_elements']['paragraphs'] > 10:
    doc_type = "long_form_article"
else:
    doc_type = "plain_text"
Running Tests
# Install dev dependencies
pip install -r requirements-dev.txt
# Run tests
pytest tests/ -v
# With coverage
pytest tests/ --cov=text_quality_analyzer --cov-report=html
Running Examples
python3 examples/basic_usage.py
How It Works
Structure Detection Logic
The module uses a multi-level approach to detect structured text:
Step 1: Identify Structure Indicators
- Has headers (≥1 header)
- Has lists (≥2 list items)
- Has code blocks (≥1 block)
- Has links + paragraphs (≥1 link AND ≥3 paragraphs)
Step 2: Calculate Base Score
- 2+ indicators → base score = 0.7 (definitely structured)
- 1 indicator → base score = 0.5 (possibly structured)
- 0 indicators → base score = 0.0 (not structured)
Step 3: Add Quantity Bonuses
- +0.05 per header (max +0.15)
- +0.02 per list item (max +0.15)
- +0.1 for code blocks
Examples:
- Text with 4 headers + 4 lists + 1 code → score = 0.96 ✅ Structured
- Text with 2 headers only → score = 0.60 ✅ Structured
- Plain text → score = 0.00 ❌ Not structured
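The three steps above can be sketched as a small function. This is a rough sketch of the rules as written, not the module's actual code: the name `structure_score`, its parameters, the clamp to 1.0, and the choice to apply bonuses even when no indicator is present are all assumptions. The module's exact bonus arithmetic may differ (the first example above reports 0.96, which these rules as written do not reproduce exactly).

```python
def structure_score(headers=0, list_items=0, code_blocks=0, links=0, paragraphs=0):
    """Score structure from element counts, following the documented steps."""
    # Step 1: count which structure indicators are present.
    indicators = sum([
        headers >= 1,
        list_items >= 2,
        code_blocks >= 1,
        links >= 1 and paragraphs >= 3,
    ])

    # Step 2: base score from the number of indicators.
    if indicators >= 2:
        base = 0.7
    elif indicators == 1:
        base = 0.5
    else:
        base = 0.0

    # Step 3: quantity bonuses, each capped as listed.
    bonus = min(0.05 * headers, 0.15) + min(0.02 * list_items, 0.15)
    if code_blocks >= 1:
        bonus += 0.1

    # Assumption: the final score is clamped to the 0.0-1.0 range.
    return round(min(base + bonus, 1.0), 2)


print(structure_score(headers=2))  # 0.6
print(structure_score())           # 0.0
```

With the default min_structure of 0.5, a text with two headers (0.6) would count as structured and plain text (0.0) would not, matching the last two examples above.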
Meaningfulness Detection
Uses multiple metrics:
- Letter ratio: Proportion of alphabetic characters (expect 0.5-0.9)
- Space ratio: Proper word spacing (expect 0.05-0.25)
- Stopwords: Presence of common words for the language
- Word length: Average word length in normal range (3-12 chars)
- Dictionary match: Words found in language frequency lists
Texts with score ≥ 0.6 are considered meaningful.
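To make the simpler metrics concrete, here is a minimal sketch that computes letter ratio, space ratio, and average word length and compares them with the expected ranges listed above. The function names are hypothetical, and the stopword and dictionary checks (and however the module combines metrics into a single score) are left out because this document does not describe them in detail.

```python
import re


def basic_text_metrics(text):
    """Compute three of the listed metrics for a string."""
    total = len(text)
    if total == 0:
        return {"letter_ratio": 0.0, "space_ratio": 0.0, "avg_word_length": 0.0}

    letters = sum(ch.isalpha() for ch in text)
    spaces = sum(ch.isspace() for ch in text)
    # Words are taken to be runs of Unicode letters (an assumption).
    words = re.findall(r"[^\W\d_]+", text)
    avg_len = sum(len(w) for w in words) / len(words) if words else 0.0

    return {
        "letter_ratio": letters / total,
        "space_ratio": spaces / total,
        "avg_word_length": avg_len,
    }


def in_expected_ranges(metrics):
    """Check each metric against the ranges given in the list above."""
    return {
        "letter_ratio": 0.5 <= metrics["letter_ratio"] <= 0.9,
        "space_ratio": 0.05 <= metrics["space_ratio"] <= 0.25,
        "avg_word_length": 3 <= metrics["avg_word_length"] <= 12,
    }


good = in_expected_ranges(basic_text_metrics("This is a well-written article."))
noise = in_expected_ranges(basic_text_metrics("xkcd1234!@#$%^&*()"))
print(all(good.values()))     # True
print(noise["letter_ratio"])  # False
```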
Performance
- Short text (100 chars): ~1-10ms
- Medium text (1,000 chars): ~2-50ms
- Long text (10,000 chars): ~3-100ms
- First run: ~10 seconds (library initialization)
- Subsequent runs: 0-3ms per text
Note: First analysis is slower due to library initialization. Subsequent analyses are very fast.
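Because the first call pays the initialization cost, a long-running service may want to warm up once at startup. The sketch below shows one way to do that and to time later calls; `timed_ms`, `warm_up`, and the stand-in `fake_analyze` are hypothetical names used so the example runs without the package installed. In practice you would pass profiler.analyze_text instead.

```python
import time


def timed_ms(fn, text):
    """Run fn(text) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(text)
    return result, (time.perf_counter() - start) * 1000.0


def warm_up(fn, sample="Warm-up text for initialization."):
    """Call fn once so later calls skip one-time initialization costs."""
    _, elapsed = timed_ms(fn, sample)
    return elapsed


# Stand-in analyzer (an assumption for illustration only).
def fake_analyze(text):
    return {"word_count": len(text.split())}


startup_ms = warm_up(fake_analyze)
result, ms = timed_ms(fake_analyze, "Your text here")
print(result["word_count"])  # 3
```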
Requirements
- Python >= 3.9
- CPU-only (no GPU required)
- ~100 MB RAM
- ~10 MB disk space
Supported Languages
Primary support:
- English (en)
- Russian (ru)
- Chinese (zh)
- Spanish (es)
- French (fr)
- German (de)
- Arabic (ar)
- Japanese (ja)
- Portuguese (pt)
- Italian (it)
Plus 87+ more languages via langid.
Limitations
- Maximum text length: 1 MB (configurable)
- Minimum text length for language detection: 10 characters
- Text-only input (no binary data)
- CPU-only processing
Future Improvements
Planned features:
- Sentiment analysis
- Document type classification
- Grammar checking
- Keyword extraction
- HTML/PDF support
- Result caching
License
MIT License - see LICENSE file for details
Contributing
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new features
- Submit a pull request
Support
For issues and feature requests, please create an issue in the GitHub repository.
Changelog
0.1.1 (2025-10-28) - Hotfix
Fixed:
- Fixed language confidence display (was showing negative percentages)
  - langid returns a negative log-probability score, which is now converted to a 0-1 probability
  - Confidence now displays correctly as 95-100% for clear texts
- Improved structure detection logic (major improvement)
  - Old: Used a complex weighted formula that was too conservative
  - New: Multi-level approach with explicit structure indicators
  - Result: Texts with headers/lists/code are now correctly identified as structured
  - Example: Text with 4 headers + 4 lists + code → score 0.96 (was 0.50)
Tested:
- Comprehensive test suite with 15 diverse texts
- Multiple languages (English, Russian, Chinese, Spanish)
- Various structures (plain, markdown, code-heavy)
- All tests passing ✅
0.1.0 (2025-10-28)
- Initial release
- Language detection for 97+ languages
- Meaningfulness checking with 5 metrics
- Structure analysis for Markdown documents
- Readability scoring with textstat
- Complete test suite
- Usage examples and documentation