Generating Vocabulary Management Profiles in Python

VMP

Generate Vocabulary Management Profiles (VMPs) for an individual text or for a corpus of texts.

from vmp import VMP, LoadData

# Example 1: Using a list of strings
data = ["This is the first text.", "Here is the second text."]
result = VMP.calculate(
    data=data,
    delta_values=[9, 11],  # Window sizes; each value should be an odd number
    common_words_option='both',  # Options: 'yes', 'no', 'both'
    num_common_words=1000,  # Optional parameter for number of common words
    common_words_url='https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english.txt',
    # common_words_file='path_to_your_common_words_file.txt',  # Alternatively, use this
    clean_option=True  # Default is True
)
print("Results for list of strings:")
print(result)

# Example 2: Using a DataFrame with .txt files
data_loader = LoadData()
df_txt = data_loader.load_data('path_to_your_txt_files_directory', file_type='txt')
result_txt = VMP.calculate(
    data=df_txt,
    delta_values=[9, 11],  # Window sizes; each value should be an odd number
    common_words_option='both',  # Options: 'yes', 'no', 'both'
    num_common_words=1000,  # Optional parameter for number of common words
    common_words_url='https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english.txt',
    # common_words_file='path_to_your_common_words_file.txt',  # Alternatively, use this
    clean_option=True  # Default is True
)
print("Results for DataFrame with .txt files:")
print(result_txt)

# Example 3: Using a DataFrame with .csv file
data_loader = LoadData()
df_csv = data_loader.load_data('path_to_your_csv_file.csv', file_type='csv')
result_csv = VMP.calculate(
    data=df_csv,
    delta_values=[9, 11],  # Window sizes; each value should be an odd number
    common_words_option='both',  # Options: 'yes', 'no', 'both'
    num_common_words=1000,  # Optional parameter for number of common words
    common_words_url='https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english.txt',
    # common_words_file='path_to_your_common_words_file.txt',  # Alternatively, use this
    clean_option=True  # Default is True
)
print("Results for DataFrame with .csv file:")
print(result_csv)

# Example 4: Using a DataFrame with .gz file
data_loader = LoadData()
df_gz = data_loader.load_data('path_to_your_gz_file.gz', file_type='gz')
result_gz = VMP.calculate(
    data=df_gz,
    delta_values=[9, 11],  # Window sizes; each value should be an odd number
    common_words_option='both',  # Options: 'yes', 'no', 'both'
    num_common_words=1000,  # Optional parameter for number of common words
    common_words_url='https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english.txt',
    # common_words_file='path_to_your_common_words_file.txt',  # Alternatively, use this
    clean_option=True  # Default is True
)
print("Results for DataFrame with .gz file:")
print(result_gz)

The package handles all preprocessing internally. Only the window sizes (delta_values, reported as delta_x in the output) and the common-word list need to be specified.

Input

The VMP.calculate method requires a text or corpus as input. Input can be passed directly as a list of strings, or loaded with LoadData from an individual .txt document, from a directory (corpus) containing multiple .txt documents, or from a .csv or .gz file in which each row contains the text of one document.

Output

The VMP.calculate method returns a dictionary whose results contain the following fields:

index: The index position of the interval in the original text.
last_pos: The position of the last token in the interval within the original text.
avg_score: The average score for the interval, representing the relative distance of repeated tokens within the window.
last_word: The last word in the interval.
context: The text within the interval, providing context for the analysis.
last_previous_position: A dictionary showing the last previous position of each token in the interval before the current window.
filename: The source filename or identifier of the text being analyzed.
delta_x: The size of the interval (window) used in the analysis.
vocab_option: Indicates whether common words were replaced with 'x' (commonYes) or not (commonNo).
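The vocab_option field can be illustrated with a minimal sketch of the masking step it describes. This is not the package's own code: the function name mask_common_words and its case-insensitive matching are assumptions made for illustration only.

```python
def mask_common_words(tokens, common_words):
    """Replace every token found in common_words with the placeholder 'x'.

    Hypothetical helper: the name and the case-insensitive comparison are
    assumptions, not part of the vmp package's documented API.
    """
    common = {word.lower() for word in common_words}
    return ["x" if token.lower() in common else token for token in tokens]


tokens = ["The", "cat", "sat", "on", "the", "mat"]
masked = mask_common_words(tokens, common_words=["the", "on"])
print(masked)  # ['x', 'cat', 'sat', 'x', 'x', 'mat']
```

Under this reading, a commonYes profile would be computed on the masked token list and a commonNo profile on the original tokens.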

Installation

Install from PyPI:

pip install vmp

Or install directly from GitHub:

pip install git+https://github.com/matthewdurward/vmp.git

How It Works

Vocabulary Management Profiles (VMPs) were initially conceived by Youmans (https://journals.sagepub.com/doi/abs/10.2190/BY6N-ABUA-EM1D-RX0V) as a form of discourse and narrative analysis.

This package follows Youmans' implementation of VMP2.2 (https://web.archive.org/web/20060911150345/http://web.missouri.edu/~youmansc/vmp/help/vmp22.html).

VMP2.2 calculates ratios using a wrap-around method, equivalent to a second pass through the text. The first occurrence of a word near the beginning of the text is compared to its last occurrence near the end, which gives a ratio closer to 0.0 than to 1.0. Words that appear only once in the text retain a ratio of 1.0. Unlike a first-pass analysis, VMP2.2 therefore avoids a rapid downtrend at the beginning of the text: it models a second reading, in which the start of the text is as familiar as the end. This matches typical reading experience, where rhetorical structures are more evident on subsequent readings than on the first.
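The wrap-around idea can be sketched in a few lines. This is a minimal illustration, not the package's implementation. It assumes that a token's ratio is the circular distance back to its previous occurrence divided by the text length, and that avg_score is the mean of those ratios over a window of delta_x tokens. The function names wraparound_ratios and moving_average are hypothetical.

```python
def wraparound_ratios(tokens):
    """Per-token ratio under the assumed wrap-around scheme.

    Assumption: ratio = circular distance back to the previous occurrence
    of the same token, divided by the number of tokens in the text.
    """
    n = len(tokens)
    # Seed with the LAST position of each token so that a first occurrence
    # "wraps around" and is compared with the final occurrence in the text.
    previous = {token: i for i, token in enumerate(tokens)}
    ratios = []
    for i, token in enumerate(tokens):
        distance = (i - previous[token]) % n
        if distance == 0:
            # The token occurs only once: it wraps back to itself.
            distance = n
        ratios.append(distance / n)
        previous[token] = i
    return ratios


def moving_average(values, delta_x):
    """Mean of each run of delta_x consecutive values (delta_x must be odd)."""
    if delta_x % 2 == 0:
        raise ValueError("delta_x must be an odd number")
    return [
        sum(values[i:i + delta_x]) / delta_x
        for i in range(len(values) - delta_x + 1)
    ]


tokens = ["a", "b", "a", "c"]
ratios = wraparound_ratios(tokens)
print(ratios)  # [0.5, 1.0, 0.5, 1.0]
print(moving_average(ratios, delta_x=3))
```

In this sketch the singletons "b" and "c" keep a ratio of 1.0, while the first "a" is measured against the last "a" rather than being treated as new, so the profile does not start with an artificial run of 1.0 values.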
