Text Analysis Software

Project description

neolo

Text Analysis Software for Saulo Brandão. Developed by Joshua Crowgey in summer 2014.

usage: neolo [-h] [--dicts DICT [DICT ...]] [--mltd] [--msttr] [--hdd]
             [--verbose] [--wordlen] [--wordtypes] [--hapax] [--punc-ratio]
             [--no-hyphen] [--no-apostrophe] [--sents [ABBREV]]
             [--stemming LANGUAGE]
             TEXT

Extract lexical statistics from a text file.

positional arguments:
  TEXT                  the text you want to investigate

optional arguments:
  -h, --help            show this help message and exit
  --dicts DICT [DICT ...]
                        a list of reference texts to compute neologism count
  --mltd                measure of lexical textual diversity
  --msttr               mean segmental type-token ratio
  --hdd                 HD-D probabilistic TTR
  --verbose, -v         increase the verbosity (can be repeated: -vvv)
  --wordlen, -w         print the distribution of words by length
  --wordtypes, -t       print the distribution of wordtypes (unigrams) by
                        count
  --hapax, -x           print the list of hapax legomena
  --punc-ratio, -p      print the ratio of punctuation tokens out of total
                        tokens
  --no-hyphen, -y       remove the hyphen (-) from the list of punctuation
                        symbols used in tokenization
  --no-apostrophe, -a   remove the apostrophe (') from the list of punctuation
                        symbols used in tokenization
  --sents [ABBREV], -s [ABBREV]
                        print sentence length statistics, using an (optional)
                        abbreviations file containing strings which don't end
                        sentences (e.g. Mr.). One abbreviation per line, include
                        relevant punctuation. Note that items in the
                        abbreviations file will also be protected during later
                        tokenization.
  --stemming LANGUAGE, -m LANGUAGE
                        stem words using NLTK prior to processing them
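
For --sents, the abbreviations file is a plain text file with one abbreviation per line, punctuation included. As a purely illustrative example (the file name abbrev.txt and its entries are hypothetical), such a file might contain:

Mr.
Mrs.
Dr.
etc.

and would be passed like this:

user@computer:~/src/neolo$ ./neolo texts/mary.txt --sents abbrev.txt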

Neologism Count

The program's name reflects its original functionality: counting neologisms. The neologism count is computed by comparing the text against known wordlists or dictionaries. Word types found in the text under consideration that do not appear in any of the reference dictionaries/wordlists are counted as neologisms.
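
The idea can be sketched in a few lines of Python. This is an illustration of the approach, not neolo's actual implementation; the tokenization here is a naive lowercase split on non-letter characters (keeping apostrophes):

import re

def neologisms(text_path, dict_paths):
    # Return word types in the text that appear in none of the reference wordlists.
    def word_types(path):
        with open(path, encoding="utf-8") as f:
            return {w for w in re.split(r"[^a-z']+", f.read().lower()) if w}

    known = set()
    for p in dict_paths:
        known |= word_types(p)
    return word_types(text_path) - known

The examples below show the same computation done by neolo itself.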

To show a simple example, suppose you have a text file called mary.txt which contains the following traditional poem:

Mary had a little lamb,
Her fleece was white as snow.
Everywhere that mary went,
the lamb was sure to go.

If you're using the Debian distribution of GNU/Linux, there is a list of English words stored in /usr/share/dict/words that you can use as a reference. You can ask neolo to check mary.txt for neologisms with the --dicts option, which takes a list of one or more filenames to use as references when calculating neologisms.

user@computer:~/src/neolo$ ./neolo texts/mary.txt --dicts /usr/share/dict/words
Opening texts/mary.txt with encoding:  utf-8 
Tokenizing, downcasing, stemming text: texts/mary.txt ... done.
Counting and sorting words in text: texts/mary.txt ...done.
Opening /usr/share/dict/words with encoding:  utf-8 
Tokenizing, downcasing, stemming dict files: ['/usr/share/dict/words'] ... done.
Counting and sorting words in dictonaries: ['/usr/share/dict/words'] ...done.
Neologism list:

Statistics:
-----------
Text size: 21 tokens in 18 types.
Number of hapax legomena: 15
TTR (type-token ratio): 0.8571428571428571
HTR (hapax-token ratio): 0.7142857142857143
HTyR (hapax-type ratio): 0.8333333333333334
Neologisms:  0 types not found in 1 dictionaries
Dictionaries contained 234937 tokens in 233615 types.

As you can see, there are no words in mary.txt that aren't in the reference wordlist file, so neolo reports "Neologisms: 0 types not found in 1 dictionaries" and leaves the neologism list empty.
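
The ratio statistics in the report follow directly from the token, type, and hapax counts. As a quick sanity check using the numbers above:

TTR  = 18 types / 21 tokens   ≈ 0.857
HTR  = 15 hapaxes / 21 tokens ≈ 0.714
HTyR = 15 hapaxes / 18 types  ≈ 0.833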

However, if you edit mary.txt so that the poem's second line reads "Her pleece was white as snow." instead of "Her fleece was white as snow.", neolo prints a neologism list along with its regular output.

user@computer:~/src/neolo$ ./neolo texts/mary.txt --dicts /usr/share/dict/words
Opening texts/mary.txt with encoding:  utf-8 
Tokenizing, downcasing, stemming text: texts/mary.txt ... done.
Counting and sorting words in text: texts/mary.txt ...done.
Opening /usr/share/dict/words with encoding:  utf-8 
Tokenizing, downcasing, stemming dict files: ['/usr/share/dict/words'] ... done.
Counting and sorting words in dictonaries: ['/usr/share/dict/words'] ...done.
Neologism list:
pleece

Statistics:
-----------
Text size: 21 tokens in 18 types.
Number of hapax legomena: 15
TTR (type-token ratio): 0.8571428571428571
HTR (hapax-token ratio): 0.7142857142857143
HTyR (hapax-type ratio): 0.8333333333333334
Neologisms:  1 types not found in 1 dictionaries
Dictionaries contained 234937 tokens in 233615 types.

MLTD

MSTTR
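
neolo reports the mean segmental type-token ratio when given the --msttr flag. The segment size neolo uses isn't documented here, but as a rough illustration of the metric itself, MSTTR is typically computed by cutting the token stream into consecutive fixed-size segments, taking the type-token ratio of each segment, and averaging them. A minimal sketch, assuming 100-token segments and discarding a trailing partial segment:

def msttr(tokens, segment_size=100):
    # Mean segmental TTR: average TTR over consecutive fixed-size segments.
    segments = [tokens[i:i + segment_size]
                for i in range(0, len(tokens) - segment_size + 1, segment_size)]
    if not segments:
        return None  # text is shorter than one segment
    return sum(len(set(seg)) / len(seg) for seg in segments) / len(segments)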

HD-D


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

neolo-0.1.2.tar.gz (9.1 kB)

Uploaded Source

Built Distribution

neolo-0.1.2-py3-none-any.whl (10.0 kB)

Uploaded Python 3
