
ling_feature_extractor

Description

  • A corpus-linguistic tool to extract and search for linguistic features in a text or a corpus.
  • The main version provides 95 built-in linguistic features, versus 98 in the Thesis_Project version. The three removed features (words per utterance, number of utterances, and number of overlaps) were dropped because they are not generally available in an ordinary corpus.
  • Over 2/3 of these features come from Biber et al. (2006), with 42 features also present in Biber (1988). These features are generally known as part of the Multi-Dimensional (MD) analysis framework.
  • The program was mainly tested on two online accessible corpora, namely the British Academic Spoken Corpus and the Michigan Corpus of Academic English, but due to copyright concerns it is demonstrated here on the included test_sample.

Prerequisites

  • Computer Languages:
    • Python 3.6+: check with cmd: python --version or python3 --version (Download Page);
    • Java 1.8+: check with cmd: java -version (Download Page).
  • Python packages
Package          Description                                     Pip download
stanfordcorenlp  A Python wrapper for StanfordCoreNLP            pip/pip3 install stanfordcorenlp
pandas           Used for storing extracted feature frequencies  pip/pip3 install pandas

In addition, the program relies heavily on Python's built-in packages, especially the re package for regular expressions.
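To see why regular expressions matter here, note that the program works on POS-tagged text in the word_TAG format, so feature regexes match the word and its tag together. A minimal sketch (the sentence below is invented; the pattern mirrors the 'you know' example later on this page):

```python
import re

# A POS-tagged line in the word_TAG format produced by the tagger
tagged = "you_PRP know_VBP ,_, it_PRP was_VBD n't_RB easy_JJ"

# '_\S+' matches whatever tag follows the underscore, so the regex
# finds the phrase regardless of how each word happens to be tagged.
matches = re.findall(r"you_\S+ know_\S+", tagged)
print(matches)  # ['you_PRP know_VBP']
```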

Installation

  • Download directly from this page and cd into the project folder.
  • By pip: pip/pip3 install LFExtractor

Usage

path to StanfordCoreNLP

Before first use, please specify the directory of StanfordCoreNLP in text_processor.py under the LFE folder.

  • nlp = StanfordCoreNLP("/path/to/StanfordCoreNLP/")

Example: nlp = StanfordCoreNLP("/Users/wzx/p_package/stanford-corenlp-4.1.0")

Dealing with a corpus of files

from LFE.extractor import CorpusLFE

lfe = CorpusLFE('/directory/to/the/corpus/under/analysis/')
# get frequency data, the tagged corpus, and the extracted features with the default settings
lfe.corpus_feature_fre_extraction()    # defaults: lfe.corpus_feature_fre_extraction(normalized_rate=100, save_tagged_corpus=True, save_extracted_features=True, left=0, right=0)
# change the normalized rate, turn off the tagged text, and keep the extracted features with the specified context to display
lfe.corpus_feature_fre_extraction(1000, False, True, 2, 3)  # frequencies normalized per 1,000 words; no tagged corpus; extracted features saved with 2 words of left and 3 words of right context

# get frequency data only
lfe.corpus_feature_fre_extraction(save_tagged_corpus=False, save_extracted_features=False)
# get tagged corpus only
lfe.save_tagged_corpus()
# get extracted feature only
lfe.save_corpus_extracted_features()   # lfe.save_corpus_extracted_features(left=0, right=0)
# set how many words of context to display on each side of the target pattern
lfe.save_corpus_extracted_features(2, 3)

# extract and save specific linguistic feature by feature name
# to see the built-in features' names, use `show_feature_names()`
from LFE.extractor import *
print(show_feature_names())   # Six letter words and longer, Contraction, Agentless passive, By passive...
# specify which feature to extract and save
lfe.save_corpus_one_extracted_feature_by_name('Six letter words and longer')

# extract and save a specific linguistic feature by regex, for example, 'you know'
lfe.save_corpus_one_extracted_feature_by_regex(r'you_\S+ know_\S+', 2, 2, feature_name='You Know')  # extract the phrase 'you know' with 2 words of context on each side; note the '_\S+' after each word, since the corpus is automatically POS tagged
# for more complex structures, features_set.py can be utilized, for example, to extract the "article + adj + noun" structure
from LFE import features_set as fs
ART = fs.ART
ADJ = fs.ADJ
NOUN = fs.NOUN
lfe.save_corpus_one_extracted_feature_by_regex(rf'{ART} {ADJ} {NOUN}', 2, 2, 'Noun phrase')
# result example (use test_sample): away_RB by_IN	【 the_DT whole_JJ thing_NN 】	In_IN fact_NN 
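For reference, the normalized_rate parameter above follows the standard corpus-linguistic per-n-words normalization. A minimal sketch of the arithmetic (the helper below is illustrative, not part of LFE):

```python
def normalize(raw_count, total_words, normalized_rate=100):
    """Scale a raw feature count to a frequency per `normalized_rate` words."""
    return raw_count * normalized_rate / total_words

# e.g. 37 contractions in a 2,500-word file, reported per 1,000 words:
print(normalize(37, 2500, 1000))  # 14.8
```

This is why frequencies from files of different lengths can be compared directly once they share the same normalized rate.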

Dealing with a text

from LFE import extractor as ex

# check the functionalities contained in ex by dir(ex)
# show built-in feature names
print(ex.show_feature_names())   # Six letter words and longer, Contraction, Agentless passive, By passive...
# get built-in features' regex by its name
print(ex.get_feature_regex_by_name('Contraction'))  # (n't| '\S\S?)_[^P]\S+
# get built-in features' names by regex
print(ex.get_feature_name_by_regex(r"(n't| '\S\S?)_[^P]\S+"))  # Contraction

# text processing
# tagged file
ex.save_single_tagged_text('/path/to/the/file')
# cleaned file
ex.save_single_cleaned_text('/path/to/the/file')

# display extracted feature by name
res = ex.display_extracted_feature_by_name('/path/to/the/file', 'Contraction', left=0, right=0)
print(res)  #  's_VBZ, n't_NEG, 've_VBP...
# save the result
ex.save_extracted_feature_by_name('/path/to/the/file', 'Contraction', left=0, right=0)

# display extracted feature by regex, for example, noun phrase
from LFE import features_set as fs

ART = fs.ART
ADJ = fs.ADJ
NOUN = fs.NOUN
res = ex.display_extracted_feature_by_regex(rf'{ART} {ADJ} {NOUN}', 2, 2, 'Noun phrase')
print(res)  # One_CD is_VBZ	【 the_DT extraordinary_JJ evidence_NN 】	of_IN human_JJ
# save the result
ex.save_extracted_feature_by_regex(rf'{ART} {ADJ} {NOUN}', 2, 2, 'Noun phrase')

# get the frequency data of all the linguistic features for a file 
res = ex.get_single_file_feature_fre('/path/to/the/file', normalized_rate=100, save_tagged_file=True, save_extracted_features=True, left=0, right=0)
print(res)
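To see how a built-in pattern such as the Contraction regex shown above behaves, it can also be run directly with re (the tagged sentence below is invented for illustration):

```python
import re

contraction = r"(n't| '\S\S?)_[^P]\S+"   # the built-in Contraction pattern shown above
tagged = "I_PRP 've_VBP seen_VBN it_PRP ,_, have_VB n't_RB you_PRP ?_."

# group(0) gives the whole match: the contraction token plus its POS tag;
# [^P] keeps tags starting with P (e.g. possessive 's_POS) from matching.
hits = [m.group(0) for m in re.finditer(contraction, tagged)]
print(hits)  # [" 've_VBP", "n't_RB"]
```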

Dealing with a part of a corpus

from LFE.extractor import *

lfe = CorpusLFE('/directory/to/the/corpus/under/analysis/')
# get the list of file paths and select the files you want to examine
fp_list = lfe.get_filepath_list()   
# loop through the selected files and apply the single-file functions shown above to get the results you want
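For instance, a subset can be selected by filename before looping (the paths below are hypothetical stand-ins for what get_filepath_list() returns):

```python
import os

# Hypothetical paths of the kind returned by lfe.get_filepath_list()
fp_list = ['/corpus/lecture01.txt', '/corpus/seminar01.txt', '/corpus/seminar02.txt']

# Keep only the seminar files; each one can then be passed to a
# single-file function, e.g. ex.save_single_tagged_text(fp), in a loop.
subset = [fp for fp in fp_list if os.path.basename(fp).startswith('seminar')]
print(subset)  # ['/corpus/seminar01.txt', '/corpus/seminar02.txt']
```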
