
A simple program for calculating lexical diversity

Project description

To install using pip:

pip install lexical-diversity

Get started:

>>> from lexical_diversity import lex_div as ld

Pre-processing texts:

For convenience, you can tokenize texts using the tokenize() function or a predefined tokenizer (e.g., from NLTK, as sketched below):

>>> text = """The state was named for the Colorado River, which Spanish travelers named the Río Colorado for the ruddy silt the river carried from the mountains. The Territory of Colorado was organized on February 28, 1861, and on August 1, 1876, U.S. President Ulysses S. Grant signed Proclamation 230 admitting Colorado to the Union as the 38th state. Colorado is nicknamed the "Centennial State" because it became a state a century after the signing of the United States Declaration of Independence. Colorado is bordered by Wyoming to the north, Nebraska to the northeast, Kansas to the east, Oklahoma to the southeast, New Mexico to the south, Utah to the west, and touches Arizona to the southwest at the Four Corners. Colorado is noted for its vivid landscape of mountains, forests, high plains, mesas, canyons, plateaus, rivers, and desert lands. Colorado is part of the western or southwestern United States, and one of the Mountain States. Denver is the capital and most populous city of Colorado. Residents of the state are known as Coloradans, although the antiquated term "Coloradoan" is occasionally used."""

>>> tok = ld.tokenize(text)
>>> print(tok[:10])
['the', 'state', 'was', 'named', 'for', 'the', 'colorado', 'river', 'which', 'spanish']
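
Any tokenizer that returns a flat list of strings will also work. As a minimal sketch (assuming NLTK and its 'punkt' tokenizer data are installed; the lowercasing and punctuation filtering below are illustrative choices, not part of the package):

>>> import nltk
>>> nltk.download('punkt')  # one-time download of the tokenizer model
>>> nltk_tok = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]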

You can also lemmatize the texts using the simple flemmatize() function, which is not part-of-speech sensitive ('run' as a noun and 'run' as a verb are treated as the same word). However, it is likely better to use a part-of-speech-sensitive lemmatizer (e.g., spaCy), as sketched below.

>>> flt = ld.flemmatize(text)
>>> print(flt[:10])
['the', 'state', 'be', 'name', 'for', 'the', 'colorado', 'river', 'which', 'spanish']  
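
As a minimal sketch of a part-of-speech-sensitive alternative (assuming spaCy and its small English model en_core_web_sm are installed; filtering to alphabetic tokens is an illustrative choice):

>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")  # pipeline with a POS tagger and lemmatizer
>>> doc = nlp(text)
>>> spacy_lemmas = [t.lemma_.lower() for t in doc if t.is_alpha]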

Calculating lexical diversity:

Simple TTR

>>> ld.ttr(flt)
0.5777777777777777
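
TTR is simply the number of unique word types divided by the number of tokens, so (assuming ttr() counts raw types over raw tokens, as the value suggests) it can be reproduced directly:

>>> len(set(flt)) / len(flt)  # types / tokens
0.5777777777777777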

Root TTR

>>> ld.root_ttr(flt)
7.751702321999271

Log TTR

>>> ld.log_ttr(flt)
0.8943634681549878

Maas TTR

>>> ld.maas_ttr(flt)
0.04683980831849556
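
Each of these is a simple transformation of the type count (T) and token count (N). A sketch of the standard formulas, assuming base-10 logarithms for the Maas index (not the package's own code; the last decimal places may differ):

import math

t, n = len(set(flt)), len(flt)  # types and tokens

root_ttr = t / math.sqrt(n)                                  # ~7.7517, Guiraud's index
log_ttr = math.log10(t) / math.log10(n)                      # ~0.8944, Herdan's C
maas = (math.log10(n) - math.log10(t)) / math.log10(n) ** 2  # ~0.0468, Maas index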

Mean segmental TTR (MSTTR)

By default, the segment size is 50 words. However, this can be customized using the window_length argument.

>>> ld.msttr(flt)
0.7133333333333333

>>> ld.msttr(flt, window_length=25)
0.7885714285714285
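
Conceptually, MSTTR splits the text into consecutive non-overlapping segments of window_length tokens, discards any leftover tokens at the end, and averages the TTR of each segment. A rough sketch of that idea (not the package's exact implementation):

def msttr_sketch(tokens, window_length=50):
    # consecutive, non-overlapping segments; a short remainder is discarded
    segments = [tokens[i:i + window_length]
                for i in range(0, len(tokens) - window_length + 1, window_length)]
    # mean TTR across segments
    return sum(len(set(seg)) / len(seg) for seg in segments) / len(segments)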

Moving average TTR (MATTR)

By default, the window size is 50 words. However, this can be customized using the window_length argument.

>>> ld.mattr(flt)
0.7206106870229007

>>> ld.mattr(flt, window_length=25)
0.7961538461538458
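
MATTR instead slides a window of window_length tokens through the text one token at a time and averages the TTR of every window, which makes it less sensitive to where segment boundaries happen to fall. A rough sketch (not the package's exact implementation):

def mattr_sketch(tokens, window_length=50):
    # overlapping windows, advanced one token at a time
    windows = [tokens[i:i + window_length]
               for i in range(len(tokens) - window_length + 1)]
    # mean TTR across windows
    return sum(len(set(w)) / len(w) for w in windows) / len(windows)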

Hypergeometric distribution D (HDD)

A more straightforward and reliable implementation of vocD (Malvern, Richards, Chipere, & Durán, 2004), as per McCarthy and Jarvis (2007, 2010).

>>> ld.hdd(flt)
0.7346993253061275
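
HD-D treats the score as the expected TTR of a random 42-token sample drawn from the text without replacement: each type contributes the hypergeometric probability of appearing at least once in the sample, scaled by 1/42. A rough sketch of that idea (the sample size of 42 follows McCarthy and Jarvis; this is not the package's exact code):

from math import comb

def hdd_sketch(tokens, sample_size=42):
    n = len(tokens)
    score = 0.0
    for word_type in set(tokens):
        freq = tokens.count(word_type)
        # hypergeometric probability that this type does NOT appear in the sample
        p_zero = comb(n - freq, sample_size) / comb(n, sample_size)
        # contribution of this type to the expected TTR of the sample
        score += (1.0 - p_zero) / sample_size
    return score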

Measure of lexical textual diversity (MTLD)

Calculates MTLD based on McCarthy and Jarvis (2010).

>>> ld.mtld(flt)
36.50595044690307
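
MTLD counts "factors": reading the text token by token, each time the running TTR falls to the 0.72 threshold a factor is completed and the counters reset; the leftover stretch at the end contributes a partial factor. MTLD is the token count divided by the factor count, averaged over a forward and a backward pass. A rough sketch that glosses over edge cases the package handles (exact threshold comparison, very short or highly diverse texts):

def mtld_factors(tokens, threshold=0.72):
    factors, types, count = 0.0, set(), 0
    for tok in tokens:
        types.add(tok)
        count += 1
        if len(types) / count <= threshold:  # running TTR has hit the threshold
            factors += 1
            types, count = set(), 0          # start a new factor
    if count:                                # partial credit for the final stretch
        factors += (1 - len(types) / count) / (1 - threshold)
    return factors

def mtld_sketch(tokens, threshold=0.72):
    forward = len(tokens) / mtld_factors(tokens, threshold)
    backward = len(tokens) / mtld_factors(list(reversed(tokens)), threshold)
    return (forward + backward) / 2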

Measure of lexical textual diversity (moving average, wrap)

Calculates MTLD using a moving window approach. Instead of calculating partial factors, it wraps to the beginning of the text to complete the last factors.

>>> ld.mtld_ma_wrap(flt)
33.68333333333333

Measure of lexical textual diversity (moving average, bi-directional)

Calculates MTLD with the moving window approach in each direction (forward and backward) through the text and averages the two scores.

>>> ld.mtld_ma_bid(flt)
35.46626265150569

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lexical_diversity-0.1.1.tar.gz (119.6 kB)

Uploaded Source

Built Distribution

lexical_diversity-0.1.1-py3-none-any.whl (117.8 kB)

Uploaded Python 3

File details

Details for the file lexical_diversity-0.1.1.tar.gz.

File metadata

  • Download URL: lexical_diversity-0.1.1.tar.gz
  • Upload date:
  • Size: 119.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.1 CPython/3.7.3

File hashes

Hashes for lexical_diversity-0.1.1.tar.gz:

  • SHA256: 0a0a6aefdccb9423e1676d3f1767ddd2e5399a2451987ef24f700f55bd7d6210
  • MD5: a979a2c4014af8e18bb407995ae2f33b
  • BLAKE2b-256: 30110c49e65c234960d4d76145a48f1fe437166f74562d10865e0a5ed31fa818


File details

Details for the file lexical_diversity-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: lexical_diversity-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 117.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.1 CPython/3.7.3

File hashes

Hashes for lexical_diversity-0.1.1-py3-none-any.whl:

  • SHA256: c4a70e7e120962dfbc7d3b3dad059cb20b16f8d2a12440ebf2f6ca77c738f2af
  • MD5: 30819b8dbe637db0c6c6cbd18400db53
  • BLAKE2b-256: 6237d6f959b2255b1321b3d359d902dbd83dec3c7bb6443168d79f8911a94ae3

