Skip to main content

A comprehensive set of indicators and metrics for assessing text complexity in Spanish.

Project description

PUCP-Metrix

A comprehensive set of indicators and metrics for assessing text complexity in Spanish, developed by the Artificial Intelligence Group at PUCP (Pontificia Universidad Católica del Perú).

Overview

PUCP-Metrix is a Python library that provides an extensive collection of text complexity metrics specifically designed for Spanish texts. It implements various linguistic and psycholinguistic measures inspired by Coh-Metrix, adapted and optimized for Spanish language processing.

Features

The library calculates over 100 different text complexity metrics organized into several categories:

📊 Descriptive Indices

Basic text structure and length statistics:

  • DESPC: Paragraph count
  • DESPCi: Paragraph count incidence per 1000 words
  • DESSC: Sentence count
  • DESSCi: Sentence count incidence per 1000 words
  • DESWC: Word count (alphanumeric words)
  • DESWCU: Unique word count
  • DESWCUi: Unique word count incidence per 1000 words
  • DESPL: Average paragraph length (sentences per paragraph)
  • DESPLd: Standard deviation of paragraph length
  • DESSL: Average sentence length (words per sentence)
  • DESSLd: Standard deviation of sentence length
  • DESSNSL: Average sentence length excluding stopwords
  • DESSNSLd: Standard deviation of sentence length excluding stopwords
  • DESSLmax: Maximum sentence length
  • DESSLmin: Minimum sentence length
  • DESWLsy: Average syllables per word
  • DESWLsyd: Standard deviation of syllables per word
  • DESCWLsy: Average syllables per content word
  • DESCWLsyd: Standard deviation of syllables per content word
  • DESCWLlt: Average letters per content word
  • DESCWLltd: Standard deviation of letters per content word
  • DESWLlt: Average letters per word
  • DESWLltd: Standard deviation of letters per word
  • DESWNSLlt: Average letters per word (excluding stopwords)
  • DESWNSLltd: Standard deviation of letters per word (excluding stopwords)
  • DESLLlt: Average letters per lemma
  • DESLLltd: Standard deviation of letters per lemma

📖 Readability Indices

Traditional readability formulas adapted for Spanish:

  • RDFHGL: Fernández-Huertas Grade Level
  • RDSPP: Szigriszt-Pazos Perspicuity
  • RDMU: Readability µ index
  • RDSMOG: SMOG index
  • RDFOG: Gunning Fog index
  • RDHS: Honoré Statistic
  • RDBR: Brunet index

🔗 Syntactic Complexity Indices

Measures of syntactic structure complexity:

  • SYNNP: Mean number of modifiers per noun phrase
  • SYNLE: Mean number of words before main verb
  • SYNMEDwrd: Minimal edit distance of words between adjacent sentences
  • SYNMEDlem: Minimal edit distance of lemmas between adjacent sentences
  • SYNMEDpos: Minimal edit distance of POS tags between adjacent sentences
  • SYNCLS1: Ratio of sentences with 1 clause
  • SYNCLS2: Ratio of sentences with 2 clauses
  • SYNCLS3: Ratio of sentences with 3 clauses
  • SYNCLS4: Ratio of sentences with 4 clauses
  • SYNCLS5: Ratio of sentences with 5 clauses
  • SYNCLS6: Ratio of sentences with 6 clauses
  • SYNCLS7: Ratio of sentences with 7 clauses

🎯 Syntactic Pattern Density Indices

Density measures of specific syntactic patterns:

  • DRNP: Noun phrase density per 1000 words
  • DRNPc: Noun phrase count
  • DRVP: Verb phrase density per 1000 words
  • DRVPc: Verb phrase count
  • DRNEG: Negation expression density per 1000 words
  • DRNEGc: Negation expression count
  • DRGER: Gerund form density per 1000 words
  • DRGERc: Gerund count
  • DRINF: Infinitive form density per 1000 words
  • DRINFc: Infinitive count
  • DRCCONJ: Coordinating conjunction density per 1000 words
  • DRCCONJc: Coordinating conjunction count
  • DRSCONJ: Subordinating conjunction density per 1000 words
  • DRSCONJc: Subordinating conjunction count

🌐 Connective Indices

Analysis of discourse connectives:

  • CNCAll: All connectives incidence per 1000 words
  • CNCCaus: Causal connectives incidence per 1000 words
  • CNCLogic: Logical connectives incidence per 1000 words
  • CNCADC: Adversative connectives incidence per 1000 words
  • CNCTemp: Temporal connectives incidence per 1000 words
  • CNCAdd: Additive connectives incidence per 1000 words

🔗 Referential Cohesion Indices

Measures of referential overlap between sentences:

  • CRFNO1: Noun overlap between adjacent sentences
  • CRFAO1: Argument overlap between adjacent sentences
  • CRFSO1: Stem overlap between adjacent sentences
  • CRFCWO1: Content word overlap between adjacent sentences (mean)
  • CRFCWO1d: Content word overlap between adjacent sentences (std dev)
  • CRFANP1: Anaphore overlap between adjacent sentences
  • CRFNOa: Noun overlap between all sentences
  • CRFAOa: Argument overlap between all sentences
  • CRFSOa: Stem overlap between all sentences
  • CRFCWOa: Content word overlap between all sentences (mean)
  • CRFCWOad: Content word overlap between all sentences (std dev)
  • CRFANPa: Anaphore overlap between all sentences

🌊 Semantic Cohesion Indices

LSA-based semantic similarity measures:

  • SECLOSadj: LSA overlap between adjacent sentences (mean)
  • SECLOSadjd: LSA overlap between adjacent sentences (std dev)
  • SECLOSall: LSA overlap between all sentences (mean)
  • SECLOSalld: LSA overlap between all sentences (std dev)
  • SECLOPadj: LSA overlap between adjacent paragraphs (mean)
  • SECLOPadjd: LSA overlap between adjacent paragraphs (std dev)
  • SECLOSgiv: LSA overlap between given and new sentences (mean)
  • SECLOSgivd: LSA overlap between given and new sentences (std dev)

📝 Lexical Diversity Indices

Various measures of vocabulary richness:

  • LDTTRa: Type-token ratio for all words
  • LDTTRcw: Type-token ratio for content words
  • LDTTRno: Type-token ratio for nouns
  • LDTTRvb: Type-token ratio for verbs
  • LDTTRadv: Type-token ratio for adverbs
  • LDTTRadj: Type-token ratio for adjectives
  • LDTTRLa: Type-token ratio for all lemmas
  • LDTTRLno: Type-token ratio for noun lemmas
  • LDTTRLvb: Type-token ratio for verb lemmas
  • LDTTRLadv: Type-token ratio for adverb lemmas
  • LDTTRLadj: Type-token ratio for adjective lemmas
  • LDTTRLpron: Type-token ratio for pronouns
  • LDTTRLrpron: Type-token ratio for relative pronouns
  • LDTTRLipron: Type-token ratio for indefinite pronouns
  • LDTTRLifn: Type-token ratio for functional words
  • LDMLTD: Measure of Textual Lexical Diversity (MTLD)
  • LDVOCd: Vocabulary Complexity Diversity (VoCD)
  • LDMaas: Maas index
  • LDDno: Noun density
  • LDDvb: Verb density
  • LDDadv: Adverb density
  • LDDadj: Adjective density

📊 Word Information Indices

Incidence of different word types:

  • WRDCONT: Content word incidence per 1000 words
  • WRDCONTc: Content word count
  • WRDNOUN: Noun incidence per 1000 words
  • WRDNOUNc: Noun count
  • WRDVERB: Verb incidence per 1000 words
  • WRDVERBc: Verb count
  • WRDADJ: Adjective incidence per 1000 words
  • WRDADJc: Adjective count
  • WRDADV: Adverb incidence per 1000 words
  • WRDADVc: Adverb count
  • WRDPRO: Personal pronoun incidence per 1000 words
  • WRDPROc: Personal pronoun count
  • WRDPRP1s: First person singular pronoun incidence per 1000 words
  • WRDPRP1sc: First person singular pronoun count
  • WRDPRP1p: First person plural pronoun incidence per 1000 words
  • WRDPRP1pc: First person plural pronoun count
  • WRDPRP2s: Second person singular pronoun incidence per 1000 words
  • WRDPRP2sc: Second person singular pronoun count
  • WRDPRP2p: Second person plural pronoun incidence per 1000 words
  • WRDPRP2pc: Second person plural pronoun count
  • WRDPRP3s: Third person singular pronoun incidence per 1000 words
  • WRDPRP3sc: Third person singular pronoun count
  • WRDPRP3p: Third person plural pronoun incidence per 1000 words
  • WRDPRP3pc: Third person plural pronoun count

🎯 Textual Simplicity Indices

Measures of sentence length distribution:

  • TSSRsh: Ratio of short sentences (< 11 words)
  • TSSRmd: Ratio of medium sentences (11-12 words)
  • TSSRlg: Ratio of long sentences (13-14 words)
  • TSSRxl: Ratio of very long sentences (≥ 15 words)

📈 Word Frequency Indices

Measures based on word frequency in Spanish corpora:

  • WFRCno: Rare noun count
  • WFRCnoi: Rare noun incidence per 1000 words
  • WFRCvb: Rare verb count
  • WFRCvbi: Rare verb incidence per 1000 words
  • WFRCadj: Rare adjective count
  • WFRCadji: Rare adjective incidence per 1000 words
  • WFRCadv: Rare adverb count
  • WFRCadvi: Rare adverb incidence per 1000 words
  • WFRCcw: Rare content word count
  • WFRCcwi: Rare content word incidence per 1000 words
  • WFRCcwd: Distinct rare content word count
  • WFRCcwdi: Distinct rare content word incidence per 1000 words
  • WFMcw: Mean frequency of content words
  • WFMw: Mean frequency of all words
  • WFMrw: Mean frequency of rarest words per sentence
  • WFMrcw: Mean frequency of rarest content words per sentence

🧠 Psycholinguistic Indices

Measures based on psycholinguistic properties of words:

Concreteness measures:

  • PSYC: Overall concreteness ratio
  • PSYC0: Very low concreteness ratio (1-2.5)
  • PSYC1: Low concreteness ratio (2.5-4)
  • PSYC2: Medium concreteness ratio (4-5.5)
  • PSYC3: High concreteness ratio (5.5-7)

Imageability measures:

  • PSYIM: Overall imageability ratio
  • PSYIM0: Very low imageability ratio (1-2.5)
  • PSYIM1: Low imageability ratio (2.5-4)
  • PSYIM2: Medium imageability ratio (4-5.5)
  • PSYIM3: High imageability ratio (5.5-7)

Familiarity measures:

  • PSYFM: Overall familiarity ratio
  • PSYFM0: Very low familiarity ratio (1-2.5)
  • PSYFM1: Low familiarity ratio (2.5-4)
  • PSYFM2: Medium familiarity ratio (4-5.5)
  • PSYFM3: High familiarity ratio (5.5-7)

Age of Acquisition measures:

  • PSYAoA: Overall age of acquisition ratio
  • PSYAoA0: Very early acquisition ratio (1-2.5)
  • PSYAoA1: Early acquisition ratio (2.5-4)
  • PSYAoA2: Medium acquisition ratio (4-5.5)
  • PSYAoA3: Late acquisition ratio (5.5-7)

Arousal measures:

  • PSYARO: Overall arousal ratio
  • PSYARO0: Very low arousal ratio (1-3)
  • PSYARO1: Low arousal ratio (3-5)
  • PSYARO2: Medium arousal ratio (5-7)
  • PSYARO3: High arousal ratio (7-9)

Valence measures:

  • PSYVAL: Overall valence ratio
  • PSYVAL0: Very negative valence ratio (1-4)
  • PSYVAL1: Negative valence ratio (3-5)
  • PSYVAL2: Positive valence ratio (5-7)
  • PSYVAL3: Very positive valence ratio (7-9)

Installation

Prerequisites

  • Python 3.12 or higher

Install the package

# Using UV (recommended)
uv add iapucp-metrix

# Or using pip
pip install iapucp-metrix

Install Spanish language model

After installing the package, you need to install the required Spanish spaCy model:

# Using the provided script
./install_es_core_news

# Or manually
uv pip install es_core_news_lg@https://github.com/explosion/spacy-models/releases/download/es_core_news_lg-3.8.0/es_core_news_lg-3.8.0-py3-none-any.whl

Quick Start

from iapucp_metrix.analyzer import Analyzer

# Initialize analyzer
analyzer = Analyzer()

# Process multiple texts efficiently
texts = [
    "Primer texto para analizar...",
    "Segundo texto con contenido diferente...",
    "Tercer texto para completar el análisis..."
]

# Compute metrics with multiprocessing
metrics_list = analyzer.compute_metrics(
    texts, 
    workers=4,     # Use 4 CPU cores
    batch_size=2   # Process 2 texts per batch
)

# Process results
for i, metrics in enumerate(metrics_list):
    print(f"Text {i+1}:")
    print(f"  Readability (Fernández-Huertas): {metrics['RDFHGL']:.2f}")

Development

Setting up the development environment

# Clone the repository
git clone https://github.com/your-org/pucp-metrix.git
cd pucp-metrix

# Install dependencies
uv sync

# Install the Spanish model
./install_es_core_news

# Run tests
uv run pytest

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

iapucp_metrix-0.1.0.tar.gz (31.1 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

iapucp_metrix-0.1.0-py3-none-any.whl (31.1 MB view details)

Uploaded Python 3

File details

Details for the file iapucp_metrix-0.1.0.tar.gz.

File metadata

  • Download URL: iapucp_metrix-0.1.0.tar.gz
  • Upload date:
  • Size: 31.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.12

File hashes

Hashes for iapucp_metrix-0.1.0.tar.gz
Algorithm Hash digest
SHA256 d8f576dad3d5cbc8e27db8b3d1d2578438f1d1766e4f6dc9b67f41adfcca8557
MD5 f6f2342a30e821a4ee2df3f14cccc566
BLAKE2b-256 1be9e7ed5ef5edacc8ceff46980e96d9ac39568aaaff596609f0a4b5f2468ba1

See more details on using hashes here.

File details

Details for the file iapucp_metrix-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for iapucp_metrix-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 c43f950477c044f4d0c2b9f99bafad769e79392b2bb86d433894cc437565f20b
MD5 0ef042539ad948cd86cf38f368ef20f2
BLAKE2b-256 81cf28305b9ac6b1293cbb38171c6d5ad15368aae06e50780a4102aa0a85c4be

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page