Universal text analysis module for detecting language, meaningfulness, and structure

Text Quality Analyzer

Universal Python module for analyzing text quality, detecting language, meaningfulness, and document structure.

Features

Language Detection: Automatic detection of 97 languages (via langid) with confidence scores
Meaningfulness Check: Determines if text is coherent writing or random characters
Structure Analysis: Detects Markdown headers, lists, paragraphs, links, and code blocks
Readability Metrics: Flesch reading ease and other readability scores
Lightweight: CPU-only, works offline, no GPU required
Fast: Analyzes 1 MB of text in under 1 second after one-time library initialization

Installation

From source (recommended for development)

cd text-quality-analyzer
pip install -e .

Install dependencies only

pip install -r requirements.txt

Required dependencies

langid>=1.1.6 - Language detection
textstat>=0.7.3 - Readability metrics
wordfreq>=3.0.0 - Word frequency dictionaries

Quick Start

Basic Usage

from text_quality_analyzer import TextProfiler

profiler = TextProfiler()
result = profiler.analyze_text("Your text here")

print(result["language"])          # 'en'
print(result["is_meaningful"])     # True
print(result["is_structured"])     # False

Convenience Function

from text_quality_analyzer import analyze_text

result = analyze_text("## Hello World\n\nThis is a test.")
print(result)

Usage Examples

Example 1: Language Detection

from text_quality_analyzer import TextProfiler

profiler = TextProfiler()

texts = {
    "English": "This is a test document.",
    "Russian": "Это тестовый документ.",
    "Chinese": "这是一个测试文档。"
}

for name, text in texts.items():
    result = profiler.analyze_text(text)
    print(f"{name}: {result['language']} ({result['language_confidence']:.0%})")

# Output:
# English: en (99%)
# Russian: ru (98%)
# Chinese: zh (99%)

Example 2: Quality Filtering

from text_quality_analyzer import TextProfiler

profiler = TextProfiler()

# Filter meaningful text only
texts = [
    "This is a well-written article.",
    "xkcd1234!@#$%^&*()",
    "Short"
]

for text in texts:
    should_process, reason = profiler.should_process_text(
        text,
        require_meaningful=True,
        allowed_languages=["en"]
    )
    print(f"{'ACCEPT' if should_process else 'REJECT'}: {reason}")

Example 3: Structure Detection

from text_quality_analyzer import TextProfiler

profiler = TextProfiler()

markdown_text = """## Introduction

This document has structure.

- Item 1
- Item 2

[Link](https://example.com)
"""

result = profiler.analyze_text(markdown_text)

if result["is_structured"]:
    elements = result["structure_elements"]
    print(f"Headers: {elements['headers']}")
    print(f"List items: {elements['total_list_items']}")
    print(f"Links: {elements['total_links']}")

Example 4: Detailed Metrics

from text_quality_analyzer import TextProfiler

profiler = TextProfiler()
result = profiler.analyze_text("Your text here")

# Language
print(f"Language: {result['language']}")
print(f"Confidence: {result['language_confidence']:.0%}")

# Meaningfulness
print(f"Meaningful: {result['is_meaningful']}")
print(f"Meaningfulness score: {result['meaningfulness_score']:.2f}")
metrics = result['meaningfulness_metrics']
print(f"  Letter ratio: {metrics['letter_ratio']:.2f}")
print(f"  Stopword presence: {metrics['stopword_presence']:.2f}")

# Structure
print(f"Structured: {result['is_structured']}")
print(f"Structure score: {result['structure_score']:.2f}")

# Statistics
print(f"Words: {result['word_count']}")
print(f"Paragraphs: {result['paragraph_count']}")

API Reference

TextProfiler

Main class for text analysis.

profiler = TextProfiler(
    min_confidence=0.7,        # Minimum language detection confidence
    min_meaningfulness=0.6,    # Minimum meaningfulness score
    min_structure=0.5,         # Minimum structure score
    max_text_length=1_000_000  # Maximum text length (1MB)
)

Methods

analyze_text(text: str) -> dict: Full analysis of text
quick_check(text: str) -> dict: Boolean checks only (faster)
get_text_summary(text: str) -> str: Human-readable summary
should_process_text(text, ...) -> (bool, str): Decision helper for pipelines

Individual Components

from text_quality_analyzer import (
    LanguageDetector,
    MeaningfulnessChecker,
    StructureAnalyzer
)

# Use components individually if needed
lang_detector = LanguageDetector()
result = lang_detector.detect("Hello world")

Output Format

The analyze_text() method returns a dictionary with:

{
    "success": True,
    "language": "en",                    # ISO 639-1 code
    "language_confidence": 0.98,         # 0.0-1.0
    "is_meaningful": True,
    "meaningfulness_score": 0.87,        # 0.0-1.0
    "meaningfulness_metrics": {
        "letter_ratio": 0.82,
        "space_ratio": 0.16,
        "stopword_presence": 0.9,
        "avg_word_length": 5.3,
        "dictionary_match": 0.78
    },
    "is_structured": True,
    "structure_score": 0.98,             # 0.0-1.0 (high score due to headers+lists+code)
    "structure_elements": {
        "headers": 3,
        "total_list_items": 5,
        "paragraphs": 4,
        "total_links": 2,
        "code_blocks": 1
    },
    "text_length": 1024,
    "word_count": 145,
    "paragraph_count": 4,
    "readability_index": 58.4,           # Flesch reading ease
    "processing_time_ms": 125
}
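
If a pipeline needs stricter cutoffs than the ones used at analysis time, the scores in this dictionary can be re-thresholded without re-running the analysis. The sketch below is illustrative only: `apply_thresholds` is a hypothetical helper, not part of the package, and it assumes each boolean is derived by comparing a score to its threshold with `>=`.

```python
def apply_thresholds(result: dict,
                     min_confidence: float = 0.7,
                     min_meaningfulness: float = 0.6,
                     min_structure: float = 0.5) -> dict:
    """Re-derive pass/fail flags from the scores in an analysis result.

    Hypothetical helper: assumes each flag is simply `score >= threshold`.
    """
    return {
        "language_ok": result["language_confidence"] >= min_confidence,
        "is_meaningful": result["meaningfulness_score"] >= min_meaningfulness,
        "is_structured": result["structure_score"] >= min_structure,
    }


# Scores copied from the sample output above
sample = {
    "language_confidence": 0.98,
    "meaningfulness_score": 0.87,
    "structure_score": 0.98,
}

print(apply_thresholds(sample))
print(apply_thresholds(sample, min_meaningfulness=0.9))
```

The second call shows how a stricter meaningfulness cutoff flips only the `is_meaningful` flag while leaving the other two untouched.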

Use Cases

Content Filtering

Filter low-quality or spam content before processing:

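from text_quality_analyzer import TextProfiler

profiler = TextProfiler()

def check_input(user_input):
    # Returns a rejection message, or None if the content is accepted
    should_process, reason = profiler.should_process_text(
        user_input,
        require_meaningful=True,
        allowed_languages=["en", "ru"]
    )
    if not should_process:
        return f"Content rejected: {reason}"
    return None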

Telegram Bot Integration

Filter and classify Telegram posts:

from text_quality_analyzer import TextProfiler

profiler = TextProfiler()

def should_index_post(post_text):
    result = profiler.analyze_text(post_text)

    # Only index meaningful Russian texts
    if result['language'] != 'ru':
        return False
    if not result['is_meaningful']:
        return False

    # Route structured docs to special processing
    # (route_to_structured_queue is a placeholder for your own function)
    if result['is_structured']:
        route_to_structured_queue(post_text)

    return True

Document Classification

Classify documents by structure:

result = profiler.analyze_text(document)

if result['structure_elements']['code_blocks'] > 0:
    doc_type = "technical_documentation"
elif result['structure_elements']['total_list_items'] > 5:
    doc_type = "listicle_article"
elif result['structure_elements']['paragraphs'] > 10:
    doc_type = "long_form_article"
else:
    doc_type = "plain_text"

Running Tests

# Install dev dependencies
pip install -r requirements-dev.txt

# Run tests
pytest tests/ -v

# With coverage
pytest tests/ --cov=text_quality_analyzer --cov-report=html

Running Examples

python3 examples/basic_usage.py

How It Works

Structure Detection Logic

The module uses a multi-level approach to detect structured text:

Step 1: Identify Structure Indicators

Has headers (≥1 header)
Has lists (≥2 list items)
Has code blocks (≥1 block)
Has links + paragraphs (≥1 link AND ≥3 paragraphs)
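
The indicators above can be approximated with a few regular expressions. This is an illustrative sketch, not the library's parser: the patterns, the `count_elements` and `indicators` names, and the definition of a paragraph (a blank-line-separated block that is not a header or list item) are all assumptions.

```python
import re

FENCE = "`" * 3  # Markdown code fence marker, built here to keep this example readable

def count_elements(text: str) -> dict:
    """Count Markdown elements with simple regexes (assumed patterns)."""
    # Strip fenced code blocks first so their contents are not miscounted.
    code_re = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)
    code_blocks = len(code_re.findall(text))
    body = code_re.sub("", text)

    headers = len(re.findall(r"^#{1,6}[ \t]+\S", body, re.MULTILINE))
    list_items = len(re.findall(r"^[ \t]*(?:[-*+]|\d+\.)[ \t]+\S", body, re.MULTILINE))
    links = len(re.findall(r"\[[^\]]+\]\([^)\s]+\)", body))

    # Paragraphs: blank-line-separated blocks that are not headers or list items.
    blocks = [b.strip() for b in re.split(r"\n\s*\n", body) if b.strip()]
    paragraphs = sum(
        1 for b in blocks
        if not re.match(r"(?:#{1,6}[ \t]|[-*+][ \t]|\d+\.[ \t])", b)
    )
    return {
        "headers": headers,
        "total_list_items": list_items,
        "paragraphs": paragraphs,
        "total_links": links,
        "code_blocks": code_blocks,
    }

def indicators(counts: dict) -> dict:
    """Apply the four indicator rules listed above to the counts."""
    return {
        "has_headers": counts["headers"] >= 1,
        "has_lists": counts["total_list_items"] >= 2,
        "has_code": counts["code_blocks"] >= 1,
        "has_links_and_paragraphs": counts["total_links"] >= 1 and counts["paragraphs"] >= 3,
    }

sample = (
    "## Introduction\n\nThis document has structure.\n\n"
    "- Item 1\n- Item 2\n\n[Link](https://example.com)\n"
)
counts = count_elements(sample)
print(counts)
print(indicators(counts))
```

The sample is the Markdown text from Example 3; under these assumed patterns it triggers the header and list indicators but not the link indicator, because it has fewer than three paragraphs.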

Step 2: Calculate Base Score

2+ indicators → base score = 0.7 (definitely structured)
1 indicator → base score = 0.5 (possibly structured)
0 indicators → base score = 0.0 (not structured)

Step 3: Add Quantity Bonuses

+0.05 per header (max +0.15)
+0.02 per list item (max +0.15)
+0.1 for code blocks

Examples:

Text with 4 headers + 4 lists + 1 code block → score = 0.96 ✅ Structured
Text with 2 headers only → score = 0.60 ✅ Structured
Plain text → score = 0.00 ❌ Not structured
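
The three steps can be written out as a short function. This is a sketch that applies the rules exactly as listed, not the package's actual code: `structure_score` is a hypothetical name, clamping the result to 1.0 is an assumption, and so is applying bonuses even when no indicator fires.

```python
def structure_score(headers: int, list_items: int, code_blocks: int,
                    links: int = 0, paragraphs: int = 0) -> float:
    """Score structure from element counts, following Steps 1-3 literally."""
    # Step 1: count how many structure indicators are present
    indicator_count = sum([
        headers >= 1,
        list_items >= 2,
        code_blocks >= 1,
        links >= 1 and paragraphs >= 3,
    ])

    # Step 2: base score from the number of indicators
    if indicator_count >= 2:
        base = 0.7
    elif indicator_count == 1:
        base = 0.5
    else:
        base = 0.0

    # Step 3: quantity bonuses, each capped as documented
    bonus = (
        min(0.05 * headers, 0.15)
        + min(0.02 * list_items, 0.15)
        + (0.1 if code_blocks >= 1 else 0.0)
    )

    # Clamping to 1.0 is an assumption; the text does not state an upper bound.
    return round(min(base + bonus, 1.0), 2)


print(structure_score(headers=2, list_items=0, code_blocks=0))
print(structure_score(headers=0, list_items=0, code_blocks=0))
print(structure_score(headers=4, list_items=4, code_blocks=1))
```

Applied literally, the rules reproduce the 0.60 figure for two headers and 0.00 for plain text. For 4 headers, 4 list items and 1 code block they give 1.03 before clamping rather than 0.96, so the library presumably applies caps or weights beyond those listed here.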

Meaningfulness Detection

Uses multiple metrics:

Letter ratio: Proportion of alphabetic characters (expect 0.5-0.9)
Space ratio: Proper word spacing (expect 0.05-0.25)
Stopwords: Presence of common words for the language
Word length: Average word length in normal range (3-12 chars)
Dictionary match: Words found in language frequency lists

Texts with score ≥ 0.6 are considered meaningful.
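
The three surface metrics (letter ratio, space ratio, word length) need nothing beyond the standard library and can be sketched as follows. `surface_metrics` and `in_expected_ranges` are hypothetical names, and how the library combines the metrics into a single score is not shown here. The stopword and dictionary metrics are omitted because they depend on per-language word lists.

```python
def surface_metrics(text: str) -> dict:
    """Compute character-level metrics for a text (illustrative sketch)."""
    total = len(text)
    if total == 0:
        return {"letter_ratio": 0.0, "space_ratio": 0.0, "avg_word_length": 0.0}
    words = text.split()
    letters = sum(ch.isalpha() for ch in text)
    spaces = sum(ch.isspace() for ch in text)
    avg_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return {
        "letter_ratio": letters / total,
        "space_ratio": spaces / total,
        "avg_word_length": avg_len,
    }

def in_expected_ranges(metrics: dict) -> dict:
    """Check each metric against the expected ranges quoted above."""
    return {
        "letter_ratio": 0.5 <= metrics["letter_ratio"] <= 0.9,
        "space_ratio": 0.05 <= metrics["space_ratio"] <= 0.25,
        "avg_word_length": 3 <= metrics["avg_word_length"] <= 12,
    }

for text in ["This is a well-written article.", "xkcd1234!@#$%^&*()"]:
    metrics = surface_metrics(text)
    print(text, in_expected_ranges(metrics))
```

The first string falls inside all three ranges. The second falls outside all of them: few letters, no spaces, and a single 18-character "word".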

Performance

Short text (100 chars): ~1-10ms
Medium text (1,000 chars): ~2-50ms
Long text (10,000 chars): ~3-100ms
First run: ~10 seconds (library initialization)
Subsequent runs: 0-3ms per text

Note: The ~10 second cost applies only to the first analysis, while the underlying libraries initialize.
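
To check these figures on your own hardware, time the first call separately from the later ones. The sketch below is a generic timing helper, not part of the package: `time_calls` is a hypothetical name and `fake_analyze` is a stand-in, so pass `profiler.analyze_text` instead to measure the real thing.

```python
import statistics
import time

def time_calls(fn, text: str, repeats: int = 5):
    """Return (first_call_ms, median_later_ms) for repeated calls of fn(text)."""
    timings = []
    for _ in range(repeats + 1):
        start = time.perf_counter()
        fn(text)
        timings.append((time.perf_counter() - start) * 1000.0)
    # The first call includes any one-time initialization cost
    return timings[0], statistics.median(timings[1:])

def fake_analyze(text: str) -> dict:
    # Stand-in for profiler.analyze_text, used so the sketch runs on its own
    return {"word_count": len(text.split())}

first_ms, later_ms = time_calls(fake_analyze, "Your text here " * 100)
print(f"first call: {first_ms:.3f} ms, median of later calls: {later_ms:.3f} ms")
```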

Requirements

Python >= 3.9
CPU-only (no GPU required)
~100 MB RAM
~10 MB disk space

Supported Languages

Primary support:

English (en)
Russian (ru)
Chinese (zh)
Spanish (es)
French (fr)
German (de)
Arabic (ar)
Japanese (ja)
Portuguese (pt)
Italian (it)

Plus 87 more languages via langid (97 in total).

Limitations

Maximum text length: 1 MB (configurable)
Minimum text length for language detection: 10 characters
Text-only input (no binary data)
CPU-only processing
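
These limits can be checked before a text reaches the analyzer. The sketch below is a hypothetical pre-check, not the library's own validation: `precheck` is an invented name, and measuring both limits in characters is an assumption.

```python
def precheck(text, min_chars: int = 10, max_chars: int = 1_000_000):
    """Return (ok, reason) for a candidate input, mirroring the limits above."""
    if not isinstance(text, str):
        return False, "not a text string"
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False, f"too short for language detection ({len(stripped)} < {min_chars} chars)"
    if len(text) > max_chars:
        return False, f"too long ({len(text)} > {max_chars} chars)"
    return True, "ok"

for candidate in ["Short", b"raw bytes", "This is a test document."]:
    print(repr(candidate), precheck(candidate))
```

Returning a reason alongside the boolean mirrors the `(bool, str)` shape of `should_process_text`, so both checks can feed the same logging path.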

Future Improvements

Planned features:

Sentiment analysis
Document type classification
Grammar checking
Keyword extraction
HTML/PDF support
Result caching

License

MIT License - see LICENSE file for details

Contributing

Contributions welcome! Please:

Fork the repository
Create a feature branch
Add tests for new features
Submit a pull request

Support

For issues and feature requests, please create an issue in the GitHub repository.

Changelog

0.1.1 (2025-10-28) - Hotfix

Fixed:

Fixed language confidence display (was showing negative percentages)
- langid returns an unnormalized log-probability (a negative number), now properly converted to 0-1 probability
- Confidence now displays correctly as 95-100% for clear texts
Improved structure detection logic (major improvement)
- Old: Used complex weighted formula that was too conservative
- New: Multi-level approach with explicit structure indicators
- Result: Texts with headers/lists/code are now correctly identified as structured
- Examples: Text with 4 headers + 4 lists + code → score 0.96 (was 0.50)
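
For context on the confidence fix: one common way to map raw log-probability scores to the 0-1 range is to normalize across all candidate languages (a softmax). The sketch below uses made-up scores and is not the package's conversion code; `normalize_log_scores` is a hypothetical name. langid itself documents a `norm_probs=True` option on `LanguageIdentifier.from_modelstring` that yields normalized probabilities.

```python
import math

def normalize_log_scores(log_scores: dict) -> dict:
    """Convert per-language log scores to probabilities that sum to 1."""
    # Subtract the maximum before exponentiating to avoid underflow
    peak = max(log_scores.values())
    exps = {lang: math.exp(score - peak) for lang, score in log_scores.items()}
    total = sum(exps.values())
    return {lang: value / total for lang, value in exps.items()}

# Made-up log scores for illustration only
raw = {"en": -54.4, "de": -61.2, "fr": -63.9}
probs = normalize_log_scores(raw)
best = max(probs, key=probs.get)
print(f"{best}: {probs[best]:.0%}")
```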

Tested:

Comprehensive test suite with 15 diverse texts
Multiple languages (English, Russian, Chinese, Spanish)
Various structures (plain, markdown, code-heavy)
All tests passing ✅

0.1.0 (2025-10-28)

Initial release
Language detection for 97 languages
Meaningfulness checking with 5 metrics
Structure analysis for Markdown documents
Readability scoring with textstat
Complete test suite
Usage examples and documentation

This version

0.1.2

Jan 10, 2026

Download files

Source Distribution

text_quality_analyzer-0.1.2.tar.gz (25.5 kB view details)

Uploaded Jan 10, 2026 Source

Built Distribution

text_quality_analyzer-0.1.2-py3-none-any.whl (23.0 kB view details)

Uploaded Jan 10, 2026 Python 3

File details

Details for the file text_quality_analyzer-0.1.2.tar.gz.

File metadata

Download URL: text_quality_analyzer-0.1.2.tar.gz
Upload date: Jan 10, 2026
Size: 25.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for text_quality_analyzer-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`8058885cb84d7da0f78d02ed816b2525863b778291a0f87bc4851bc39eb6ef86`
MD5	`b7e00d0d98129e39f0c144d536e34d90`
BLAKE2b-256	`5dd42e022271b5559e8c25be5a79f9b575308d8a27a310d0d99b7207ddb7880d`

File details

Details for the file text_quality_analyzer-0.1.2-py3-none-any.whl.

File metadata

Download URL: text_quality_analyzer-0.1.2-py3-none-any.whl
Upload date: Jan 10, 2026
Size: 23.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for text_quality_analyzer-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e343588e33092beb38e3278b643e69cfd41c35e1d539beec6689485ca7f38fb2`
MD5	`3b989ba99b491f881e2c3ee51ac729cd`
BLAKE2b-256	`ff52341a68675ac930c2c3dfe4dff503ec5b8032083f5a84a8f53299f9483b0f`
