Pipelines and management structure for NLP analysis of a corpus of texts

Project description

nlp_pipeline

Collection of NLP tools for processing and analyzing text data.

Introduction

The fundamental input of the library is a metadata file. By default it contains the columns ["text_id", "web_filepath", "local_raw_filepath", "local_txt_filepath", "detected_language"]. Only "web_filepath" has to be provided by the user, so a list of URLs pointing to text documents is all that is required to use the library. If the corpus has additional columns of interest, such as titles, pass them via the metadata_addt_column_names argument when instantiating nlp_processor, as shown in the example code below.
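To make the shape of the metadata concrete, here is a minimal standard-library sketch of a table with the default columns plus two extra ones. It is only an illustration: the library generates its own metadata file, and the helper name build_metadata_rows, the CSV layout, the IDs starting at 1, and the example.com URLs are all assumptions made for this sketch.

```python
import csv
import io

# Default metadata columns as listed in the introduction
DEFAULT_COLUMNS = [
    "text_id",
    "web_filepath",
    "local_raw_filepath",
    "local_txt_filepath",
    "detected_language",
]


def build_metadata_rows(urls, addt_column_names=()):
    """Hypothetical helper: one row per URL, every other column left empty."""
    columns = DEFAULT_COLUMNS + list(addt_column_names)
    rows = []
    for text_id, url in enumerate(urls, start=1):  # IDs starting at 1 is an assumption
        row = dict.fromkeys(columns, "")
        row["text_id"] = text_id
        row["web_filepath"] = url
        rows.append(row)
    return columns, rows


# Placeholder URLs, not real documents
urls = ["https://example.com/report_a.pdf", "https://example.com/page_b.html"]
columns, rows = build_metadata_rows(urls, ["title", "year"])

# Write the table as CSV into an in-memory buffer and show it
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Only "web_filepath" is filled in here; the remaining columns are the ones the library is described as managing itself.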

The library takes the list of documents, then downloads, converts, and organizes them in a specific file structure. The resulting files can then be used to generate insights such as word counts.

Example code

from nlp_pipeline.nlp_pipeline import nlp_processor

# additional columns I want to track in the metadata
metadata_addt_column_names = ["title", "year"] 

# instantiating the processor object
processor = nlp_processor(
	data_path = "path_to_store_text_documents/",
	metadata_addt_column_names = metadata_addt_column_names,
	windows_tesseract_path = "path_to_tesseract.exe", # Windows only; otherwise leave blank and make sure Tesseract is installed and on your PATH
	windows_poppler_path = "path_to_poppler/bin" # Windows only; otherwise leave blank and make sure Poppler is installed and on your PATH
)

# instantiating the processor generates a metadata file and creates the directory structure
# you can now add data to the metadata file (titles, etc.); when finished, run the following so the metadata in the processor object reflects the local file
processor.refresh_object_metadata()

# if you ever make changes to the local files, e.g., delete a PDF, run the following to make sure the metadata file reflects that
processor.sync_local_metadata()

# download the documents with metadata IDs 1, 2, and 3
text_ids = [1,2,3]
processor.download_text_id(text_ids)

# convert the PDFs or HTMLs to .txt
processor.convert_to_text(text_ids)

# transform the text (stemming, etc.)
# run help(processor.transform_text) for more information
processor.transform_text(
        text_ids = text_ids,
        path_prefix = "all_transformed", # prefix for the files produced by this transformation
        perform_lower = True, # lower case the text
        perform_replace_newline_period = True, # replace periods and newline characters with |
        perform_remove_punctuation = True, # remove punctuation marks
        perform_remove_stopwords = True, # remove stopwords (the, and, etc.)
        perform_stemming = True, # stem the words (e.g., "runs" becomes "run")
        stemmer = "snowball" # which stemmer to use. If in doubt, use snowball
)

# from the transformed text, generate a CSV with word counts in each document
processor.gen_word_count_csv(
        text_ids = text_ids, 
        path_prefix = "all_transformed", # prefix used previously for the transformation
        exclude_words = ["for"] # list of words to not include in the word counts
)

# get sentiment of a group of texts
processor.gen_sentiment_csv(text_ids, "all_transformed")

# get n_words, sentences, and pages of texts
processor.gen_summary_stats_csv(text_ids, "all_transformed")

# bar plot of most common words in a document or group of documents
p, plot_df = processor.bar_plot_word_count(
	text_ids = text_ids, 
	path_prefix = "all_transformed", # prefix used previously for the transformation
	n_words = 10, # top n words to show
	title = "Plot Title"
)

# word cloud of most common words in a document or group of documents
p, plot_df = processor.word_cloud(
	text_ids = text_ids, 
	path_prefix = "all_transformed", # prefix used previously for the transformation
	n_words = 10 # top n words to show
)

# plot of word occurrences over time
p, plot_df = processor.plot_word_occurrences(
    text_ids_list = text_ids, # can also be a list of lists, e.g. [[1,2,3], [4,5,6]], to get counts per group (per decade, for instance)
    word = "green", 
    path_prefix = "all_transformed", 
    x_labels = [1,2,3],
    title = "Plot Title"
)

# plot average sentiment or neutral proportion in documents
p, plot_df = processor.plot_sentiment(
    text_ids_list = text_ids, 
    path_prefix = "all_transformed", 
    x_labels = [1,2,3],
    sentiment_col = "neutral_proportion",
    title = "Plot Title"
)

# plot various summary stats in documents
p, plot_df = processor.plot_summary_stats(
    text_ids_list = text_ids, 
    path_prefix = "all_transformed", 
    x_labels = [1,2,3],
    summary_stat_col = "n_words", # one of: n_words, n_unique_words, n_sentences, n_pages, avg_word_length, avg_word_incidence
    title = "Plot Title"
)
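
To give a feel for what the transformation and word-count options above describe, here is a small plain-Python sketch of the same kind of steps. It is not the library's implementation: the functions transform and word_counts, the tiny stopword list, the order of the steps, and the choice to keep "|" as a segment marker are all assumptions made for illustration, and stemming is left out.

```python
import re
import string
from collections import Counter

# Hypothetical, deliberately tiny stopword list (the library's own list is not shown here)
STOPWORDS = {"the", "and", "a", "of", "for", "is"}


def transform(text):
    """Toy stand-in for the steps named in transform_text; not the library's code."""
    text = text.lower()  # perform_lower
    text = re.sub(r"[.\n]", "|", text)  # perform_replace_newline_period
    # perform_remove_punctuation: drop punctuation but keep the "|" marker (assumption)
    punctuation = string.punctuation.replace("|", "")
    text = text.translate(str.maketrans("", "", punctuation))
    # perform_remove_stopwords, applied segment by segment; empty segments are dropped
    segments = []
    for segment in text.split("|"):
        words = [w for w in segment.split() if w not in STOPWORDS]
        if words:
            segments.append(" ".join(words))
    return "|".join(segments)


def word_counts(transformed, exclude_words=()):
    """Count words in transformed text, skipping any word in exclude_words."""
    excluded = set(exclude_words)
    words = transformed.replace("|", " ").split()
    return Counter(w for w in words if w not in excluded)


sample = "The green roof is cool.\nGreen roofs, and walls!"
transformed = transform(sample)
counts = word_counts(transformed, exclude_words=["cool"])
print(transformed)
print(counts.most_common(3))
```

Without stemming, "roof" and "roofs" are counted separately, which is the kind of split the perform_stemming option is meant to address.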

Release history

Version	Release date
0.0.36	Apr 22, 2024
0.0.35	Mar 20, 2024
0.0.34	Mar 20, 2024
0.0.33	Mar 20, 2024
0.0.32	Mar 20, 2024
0.0.31	Feb 22, 2024
0.0.30	Feb 20, 2024
0.0.29	Feb 20, 2024
0.0.28	Feb 20, 2024
0.0.27	Feb 19, 2024
0.0.26	Jun 5, 2023
0.0.25	Jun 2, 2023
0.0.24	May 9, 2023
0.0.23	May 5, 2023
0.0.22	May 5, 2023
0.0.21	May 5, 2023
0.0.20	May 4, 2023
0.0.19	May 3, 2023
0.0.18	Apr 21, 2023
0.0.17	Apr 20, 2023
0.0.16	Apr 11, 2023
0.0.15	Apr 7, 2023
0.0.14	Apr 6, 2023
0.0.13	Mar 3, 2023
0.0.12	Mar 1, 2023
0.0.11	Feb 28, 2023
0.0.10	Feb 23, 2023
0.0.9	Feb 22, 2023
0.0.8	Feb 22, 2023
0.0.7	Feb 21, 2023
0.0.6	Feb 20, 2023
0.0.5	Feb 19, 2023
0.0.4	Feb 19, 2023
0.0.3 (this version)	Feb 17, 2023
0.0.2	Feb 17, 2023
0.0.1	Feb 16, 2023

Download files

Source Distribution

nlp_pipeline-0.0.3.tar.gz (14.7 kB)

Uploaded Feb 17, 2023 Source

Hashes for nlp_pipeline-0.0.3.tar.gz
Algorithm	Hash digest
SHA256	`412a9841544c4769d6896b3845ceb49f8b38981f7d97830210bd241bfec3cc13`
MD5	`0a3f35548302982dd8d7db0f35e90055`
BLAKE2b-256	`8bc683b652c555666911d34d32ad1fd1e61e2c1da9a1af68c7db83c2aff37492`