Text Quality Analyzer

Universal Python module for analyzing text quality, detecting language, meaningfulness, and document structure.

Features

  • Language Detection: Automatic detection of 97 languages (via langid) with confidence scores
  • Meaningfulness Check: Determines if text is coherent writing or random characters
  • Structure Analysis: Detects Markdown headers, lists, paragraphs, links, and code blocks
  • Readability Metrics: Flesch reading ease and other readability scores
  • Lightweight: CPU-only, works offline, no GPU required
  • Fast: Analyzes 1MB of text in under 1 second after the one-time library initialization (about 10 seconds on first run)

Installation

From source (recommended for development)

# Run from a local clone of the repository
cd text-quality-analyzer
pip install -e .

Install dependencies only

pip install -r requirements.txt

Required dependencies

  • langid>=1.1.6 - Language detection
  • textstat>=0.7.3 - Readability metrics
  • wordfreq>=3.0.0 - Word frequency dictionaries

Quick Start

Basic Usage

from text_quality_analyzer import TextProfiler

profiler = TextProfiler()
result = profiler.analyze_text("Your text here")

print(result["language"])          # 'en'
print(result["is_meaningful"])     # True
print(result["is_structured"])     # False

Convenience Function

from text_quality_analyzer import analyze_text

result = analyze_text("## Hello World\n\nThis is a test.")
print(result)

Usage Examples

Example 1: Language Detection

from text_quality_analyzer import TextProfiler

profiler = TextProfiler()

texts = {
    "English": "This is a test document.",
    "Russian": "Это тестовый документ.",
    "Chinese": "这是一个测试文档。"
}

for name, text in texts.items():
    result = profiler.analyze_text(text)
    print(f"{name}: {result['language']} ({result['language_confidence']:.0%})")

# Output:
# English: en (99%)
# Russian: ru (98%)
# Chinese: zh (99%)
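
langid itself reports an unnormalized log-probability (a negative number) rather than a 0-1 value, so a confidence like the ones printed above has to be derived from it. How this package does the conversion is not documented here; the sketch below shows one common approach, a softmax over per-language log-scores. The function name and the sample scores are made up for illustration.

```python
import math

def normalize_log_scores(log_scores):
    """Convert per-language log-scores into probabilities that sum to 1."""
    top = max(log_scores.values())
    # Subtract the maximum before exponentiating to avoid underflow.
    exps = {lang: math.exp(score - top) for lang, score in log_scores.items()}
    total = sum(exps.values())
    return {lang: value / total for lang, value in exps.items()}

# Hypothetical log-scores for three candidate languages.
scores = {"en": -54.4, "de": -62.1, "fr": -65.0}
probs = normalize_log_scores(scores)
best = max(probs, key=probs.get)
print(f"{best}: {probs[best]:.0%}")
```

If you use langid directly, its `LanguageIdentifier.from_modelstring(model, norm_probs=True)` constructor returns normalized probabilities without a manual step.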

Example 2: Quality Filtering

from text_quality_analyzer import TextProfiler

profiler = TextProfiler()

# Filter meaningful text only
texts = [
    "This is a well-written article.",
    "xkcd1234!@#$%^&*()",
    "Short"
]

for text in texts:
    should_process, reason = profiler.should_process_text(
        text,
        require_meaningful=True,
        allowed_languages=["en"]
    )
    print(f"{'ACCEPT' if should_process else 'REJECT'}: {reason}")

Example 3: Structure Detection

from text_quality_analyzer import TextProfiler

profiler = TextProfiler()

markdown_text = """## Introduction

This document has structure.

- Item 1
- Item 2

[Link](https://example.com)
"""

result = profiler.analyze_text(markdown_text)

if result["is_structured"]:
    elements = result["structure_elements"]
    print(f"Headers: {elements['headers']}")
    print(f"List items: {elements['total_list_items']}")
    print(f"Links: {elements['total_links']}")

Example 4: Detailed Metrics

from text_quality_analyzer import TextProfiler

profiler = TextProfiler()
result = profiler.analyze_text("Your text here")

# Language
print(f"Language: {result['language']}")
print(f"Confidence: {result['language_confidence']:.0%}")

# Meaningfulness
print(f"Meaningful: {result['is_meaningful']}")
print(f"Score: {result['meaningfulness_score']:.2f}")
metrics = result['meaningfulness_metrics']
print(f"  Letter ratio: {metrics['letter_ratio']:.2f}")
print(f"  Stopword presence: {metrics['stopword_presence']:.2f}")

# Structure
print(f"Structured: {result['is_structured']}")
print(f"Score: {result['structure_score']:.2f}")

# Statistics
print(f"Words: {result['word_count']}")
print(f"Paragraphs: {result['paragraph_count']}")

API Reference

TextProfiler

Main class for text analysis.

profiler = TextProfiler(
    min_confidence=0.7,        # Minimum language detection confidence
    min_meaningfulness=0.6,    # Minimum meaningfulness score
    min_structure=0.5,         # Minimum structure score
    max_text_length=1_000_000  # Maximum text length (1MB)
)

Methods

  • analyze_text(text: str) -> dict: Full analysis of text
  • quick_check(text: str) -> dict: Boolean checks only (faster)
  • get_text_summary(text: str) -> str: Human-readable summary
  • should_process_text(text, ...) -> (bool, str): Decision helper for pipelines

Individual Components

from text_quality_analyzer import (
    LanguageDetector,
    MeaningfulnessChecker,
    StructureAnalyzer
)

# Use components individually if needed
lang_detector = LanguageDetector()
result = lang_detector.detect("Hello world")

Output Format

The analyze_text() method returns a dictionary with:

{
    "success": True,
    "language": "en",                    # ISO 639-1 code
    "language_confidence": 0.98,         # 0.0-1.0
    "is_meaningful": True,
    "meaningfulness_score": 0.87,        # 0.0-1.0
    "meaningfulness_metrics": {
        "letter_ratio": 0.82,
        "space_ratio": 0.16,
        "stopword_presence": 0.9,
        "avg_word_length": 5.3,
        "dictionary_match": 0.78
    },
    "is_structured": True,
    "structure_score": 0.98,             # 0.0-1.0 (high score due to headers+lists+code)
    "structure_elements": {
        "headers": 3,
        "total_list_items": 5,
        "paragraphs": 4,
        "total_links": 2,
        "code_blocks": 1
    },
    "text_length": 1024,
    "word_count": 145,
    "paragraph_count": 4,
    "readability_index": 58.4,           # Flesch reading ease
    "processing_time_ms": 125
}
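
A small helper can turn this dictionary into a one-line log message. The sketch below reads only keys shown above; the function name is hypothetical, and it assumes that when "success" is False the remaining keys may be missing, which this README does not specify.

```python
def summarize_result(result):
    """Format the main fields of an analysis result as one line."""
    # Assumption: a failed analysis may not carry the other keys.
    if not result.get("success", False):
        return "analysis failed"
    elements = result.get("structure_elements", {})
    return (
        f"{result['language']} ({result['language_confidence']:.0%}), "
        f"meaningful={result['is_meaningful']} ({result['meaningfulness_score']:.2f}), "
        f"structured={result['is_structured']} ({result['structure_score']:.2f}), "
        f"{result['word_count']} words, {elements.get('headers', 0)} headers"
    )

# A subset of the documented output, used as sample input.
sample = {
    "success": True,
    "language": "en",
    "language_confidence": 0.98,
    "is_meaningful": True,
    "meaningfulness_score": 0.87,
    "is_structured": True,
    "structure_score": 0.98,
    "structure_elements": {"headers": 3, "total_list_items": 5},
    "word_count": 145,
}
print(summarize_result(sample))
# en (98%), meaningful=True (0.87), structured=True (0.98), 145 words, 3 headers
```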

Use Cases

Content Filtering

Filter low-quality or spam content before processing:

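from text_quality_analyzer import TextProfiler

profiler = TextProfiler()

def check_content(user_input):
    """Return a rejection message, or None if the input is acceptable."""
    should_process, reason = profiler.should_process_text(
        user_input,
        require_meaningful=True,
        allowed_languages=["en", "ru"]
    )
    if not should_process:
        return f"Content rejected: {reason}"
    return None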

Telegram Bot Integration

Filter and classify Telegram posts:

from text_quality_analyzer import TextProfiler

profiler = TextProfiler()

def should_index_post(post_text):
    result = profiler.analyze_text(post_text)

    # Only index meaningful Russian texts
    if result['language'] != 'ru':
        return False
    if not result['is_meaningful']:
        return False

    # Route structured docs to special processing
    # (route_to_structured_queue is your application's own function)
    if result['is_structured']:
        route_to_structured_queue(post_text)

    return True

Document Classification

Classify documents by structure:

result = profiler.analyze_text(document)

if result['structure_elements']['code_blocks'] > 0:
    doc_type = "technical_documentation"
elif result['structure_elements']['total_list_items'] > 5:
    doc_type = "listicle_article"
elif result['structure_elements']['paragraphs'] > 10:
    doc_type = "long_form_article"
else:
    doc_type = "plain_text"

Running Tests

# Install dev dependencies
pip install -r requirements-dev.txt

# Run tests
pytest tests/ -v

# With coverage
pytest tests/ --cov=text_quality_analyzer --cov-report=html

Running Examples

python3 examples/basic_usage.py

How It Works

Structure Detection Logic

The module uses a multi-level approach to detect structured text:

Step 1: Identify Structure Indicators

  • Has headers (≥1 header)
  • Has lists (≥2 list items)
  • Has code blocks (≥1 block)
  • Has links + paragraphs (≥1 link AND ≥3 paragraphs)

Step 2: Calculate Base Score

  • 2+ indicators → base score = 0.7 (definitely structured)
  • 1 indicator → base score = 0.5 (possibly structured)
  • 0 indicators → base score = 0.0 (not structured)

Step 3: Add Quantity Bonuses

  • +0.05 per header (max +0.15)
  • +0.02 per list item (max +0.15)
  • +0.1 for code blocks

Examples:

  • Text with 4 headers + 4 lists + 1 code → score = 0.96 ✅ Structured
  • Text with 2 headers only → score = 0.60 ✅ Structured
  • Plain text → score = 0.00 ❌ Not structured
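
The three steps can be sketched as a single function over element counts. This is an illustration of the rules as written above, not the module's actual code: the function name is hypothetical, and two details are assumptions (bonuses apply only when at least one indicator is present, and the result is capped at 1.0). Under these assumptions the sketch reproduces the 0.60 for two headers, but returns 1.0 rather than 0.96 for 4 headers + 4 list items + 1 code block, so the real implementation evidently differs in some detail not described here.

```python
def structure_score(headers=0, list_items=0, code_blocks=0, links=0, paragraphs=0):
    """Score text structure from element counts, following Steps 1-3."""
    # Step 1: count which structure indicators are present.
    indicators = sum([
        headers >= 1,
        list_items >= 2,
        code_blocks >= 1,
        links >= 1 and paragraphs >= 3,
    ])

    # Step 2: base score from the number of indicators.
    if indicators >= 2:
        base = 0.7
    elif indicators == 1:
        base = 0.5
    else:
        # Assumption: no indicators means no bonuses either.
        return 0.0

    # Step 3: quantity bonuses, each capped as documented.
    bonus = min(0.05 * headers, 0.15) + min(0.02 * list_items, 0.15)
    if code_blocks >= 1:
        bonus += 0.1

    # Assumption: the final score is capped at 1.0.
    return round(min(base + bonus, 1.0), 2)

print(structure_score(headers=2))   # one indicator plus two header bonuses
print(structure_score())            # plain text
```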

Meaningfulness Detection

Uses multiple metrics:

  • Letter ratio: Proportion of alphabetic characters (expect 0.5-0.9)
  • Space ratio: Proper word spacing (expect 0.05-0.25)
  • Stopwords: Presence of common words for the language
  • Word length: Average word length in normal range (3-12 chars)
  • Dictionary match: Words found in language frequency lists

Texts with a score ≥ 0.6 (the default min_meaningfulness threshold) are considered meaningful.
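
Three of these metrics need nothing beyond the standard library, so they are easy to illustrate. The sketch below computes them and checks each against the expected range listed above. The function names are hypothetical, and how the module weights the metrics into a single score is not documented here, so no combined score is computed. Stopword and dictionary checks are left out because they depend on language data.

```python
def basic_text_metrics(text):
    """Compute letter ratio, space ratio, and average word length."""
    total = len(text)
    if total == 0:
        return {"letter_ratio": 0.0, "space_ratio": 0.0, "avg_word_length": 0.0}
    # Words are whitespace-separated tokens; punctuation counts toward length.
    words = text.split()
    return {
        "letter_ratio": sum(ch.isalpha() for ch in text) / total,
        "space_ratio": sum(ch.isspace() for ch in text) / total,
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
    }

def in_expected_ranges(metrics):
    """Check each metric against the ranges documented above."""
    return {
        "letter_ratio": 0.5 <= metrics["letter_ratio"] <= 0.9,
        "space_ratio": 0.05 <= metrics["space_ratio"] <= 0.25,
        "avg_word_length": 3 <= metrics["avg_word_length"] <= 12,
    }

for sample in ["This is a well-written article.", "xkcd1234!@#$%^&*()"]:
    print(sample, in_expected_ranges(basic_text_metrics(sample)))
```

The first sample falls inside all three ranges; the second falls outside all of them, since it has few letters, no spaces, and one 18-character token.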

Performance

  • Short text (100 chars): ~1-10ms
  • Medium text (1,000 chars): ~2-50ms
  • Long text (10,000 chars): ~3-100ms
  • First run: ~10 seconds (library initialization)
  • Subsequent runs: 0-3ms per text

Note: First analysis is slower due to library initialization. Subsequent analyses are very fast.
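
Because of this warm-up cost, it helps to time the first call separately from later ones when benchmarking. The helper below is a generic sketch (its name and parameters are not part of this package) that accepts any one-argument callable.

```python
import time

def time_calls(func, text, repeats=5):
    """Return (first_call_ms, mean_later_ms) for func(text)."""
    # The first call includes any one-time initialization.
    start = time.perf_counter()
    func(text)
    first_ms = (time.perf_counter() - start) * 1000

    # Later calls measure steady-state cost.
    later = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(text)
        later.append((time.perf_counter() - start) * 1000)
    return first_ms, sum(later) / len(later)
```

To measure this module, pass the analyzer method as the callable, for example `time_calls(TextProfiler().analyze_text, "Your text here")`. In a long-running service, one throwaway call at startup moves the initialization cost out of the first real request.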

Requirements

  • Python >= 3.9
  • CPU-only (no GPU required)
  • ~100 MB RAM
  • ~10 MB disk space

Supported Languages

Primary support:

  • English (en)
  • Russian (ru)
  • Chinese (zh)
  • Spanish (es)
  • French (fr)
  • German (de)
  • Arabic (ar)
  • Japanese (ja)
  • Portuguese (pt)
  • Italian (it)

Plus 87 more languages via langid (97 in total).

Limitations

  • Maximum text length: 1 MB (configurable via max_text_length)
  • Minimum text length for language detection: 10 characters
  • Text-only input (no binary data)
  • CPU-only processing
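
In a pipeline, these limits can be checked before calling the analyzer. The guard below is a sketch with a hypothetical name. It assumes the length limit is counted in characters, matching the default max_text_length=1_000_000, and it does not reproduce whatever the module does internally with out-of-range input.

```python
def check_input(text, min_chars=10, max_chars=1_000_000):
    """Return (ok, reason) for the documented input limits."""
    if not isinstance(text, str):
        return False, "input must be text (str), not binary data"
    if len(text) > max_chars:
        return False, f"text longer than {max_chars} characters"
    # Ignore surrounding whitespace when checking the minimum length.
    if len(text.strip()) < min_chars:
        return False, f"text shorter than {min_chars} characters"
    return True, "ok"

print(check_input("Short"))
print(check_input("This sentence is long enough."))
```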

Future Improvements

Planned features:

  • Sentiment analysis
  • Document type classification
  • Grammar checking
  • Keyword extraction
  • HTML/PDF support
  • Result caching

License

MIT License - see LICENSE file for details

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new features
  4. Submit a pull request

Support

For issues and feature requests, please create an issue in the GitHub repository.

Changelog

0.1.1 (2025-10-28) - Hotfix

Fixed:

  • Fixed language confidence display (was showing negative percentages)
    • langid returns an unnormalized log-probability (a negative number), which is now converted to a 0-1 probability
    • Confidence now displays correctly as 95-100% for clear texts
  • Improved structure detection logic (major improvement)
    • Old: Used complex weighted formula that was too conservative
    • New: Multi-level approach with explicit structure indicators
    • Result: Texts with headers/lists/code are now correctly identified as structured
    • Examples: Text with 4 headers + 4 lists + code → score 0.96 (was 0.50)

Tested:

  • Comprehensive test suite with 15 diverse texts
  • Multiple languages (English, Russian, Chinese, Spanish)
  • Various structures (plain, markdown, code-heavy)
  • All tests passing ✅

0.1.0 (2025-10-28)

  • Initial release
  • Language detection for 97+ languages
  • Meaningfulness checking with 5 metrics
  • Structure analysis for Markdown documents
  • Readability scoring with textstat
  • Complete test suite
  • Usage examples and documentation
