The alpacka Python package, used to extract and visualize metadata from text data sets

Project description

Code for the alpacka Python package, used to extract metadata from text data sets

The "functions" folder contains functions for calculating the NCOF and TF-IDF scores for a user-specified data set.

The "Pipes" file contains pipelines for the two methods, which can be used to create a better workflow when using the package, as well as a tool for loading the data.

To use the package, begin by importing Pipes; you can then instantiate the data loader, NCOF, or TFIDF class.

Edit the config.ini file to change the base setup and to set the paths to your data.

Walkthrough

Install the alpacka package through pip, or download it from GitHub.

> pip install alpacka

Link to github repo

Set up

Many of the classes and functions in the alpacka package require data paths and other information in order to run. For convenience, the code reads this information from a config file, so that you as a user only have to enter it once.

The config file is formatted as a standard config.ini file, and an example is located in the package directory. The directory can be found by calling pip show alpacka in your prompt and following the Location path to the alpacka folder.

    > pip show alpacka
    ...
    > Location: c:\users\path\to\venv\lib\site-packages

For convenience, copy the config.ini file into your project directory or working path; it can, however, be placed anywhere you like.
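The copy step can also be done from Python instead of by hand. The following is a minimal sketch, not part of alpacka: `copy_package_file` is a hypothetical helper, and it assumes the example config.ini sits directly in the top-level folder of the installed package.

```python
import importlib.util
import shutil
from pathlib import Path


def copy_package_file(package_name, filename, dest_dir="."):
    """Copy `filename` from an installed package's folder into `dest_dir`.

    Hypothetical helper for illustration; it assumes the file lives in the
    package's top-level folder (next to its __init__.py).
    """
    # find_spec locates a top-level package without importing it.
    spec = importlib.util.find_spec(package_name)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"package {package_name!r} was not found")
    source = Path(spec.origin).parent / filename
    if not source.is_file():
        raise FileNotFoundError(f"{source} does not exist")
    dest = Path(dest_dir) / filename
    shutil.copyfile(source, dest)
    return dest


# Usage, assuming alpacka is installed and ships config.ini as described:
# copy_package_file("alpacka", "config.ini")
```

If the file is not where this sketch expects it, the helper raises FileNotFoundError with the path it tried, which can be compared against the Location reported by pip show alpacka.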

Now we are ready to start to work with alpacka.

Import the data processor from the data loader and instantiate it.

    from alpacka.pipes.data_loader import data_process
    d = data_process(config_path = "config.ini")

Settings are loaded from the config.ini file.

Import and instantiate the NCOF and TF-IDF methods.

    from alpacka.pipes.ncof_pipeline import ncof_pipe
    ncof = ncof_pipe(config_path="config.ini")

    from alpacka.pipes.tfidf_pipeline import tfidf_pipe
    tfidf = tfidf_pipe(config_path="config.ini")

Load the data using the data_process class

The data path can be changed by editing the config.ini file. In the config.ini file, set data_file and data_folder to the file name and the path to the location of your data.

    [Data]
    Verbose = True
    num_words = None
    Supported_inputs = list
    Input_data_type = list
    stop_word_path = alpacka/functions/Stopord.txt
    data_file = data_file_name.csv          <-----
    data_folder = path\to\data\folder       <-----
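To check what an edited config file actually contains, it can be read with Python's standard configparser module. This is only a sketch for inspecting the file; how alpacka itself parses config.ini is not shown in this walkthrough. The sample text mirrors some of the keys above, without the arrow annotations.

```python
import configparser

# Sample mirroring part of the [Data] section; raw string keeps backslashes.
sample = r"""
[Data]
Verbose = True
num_words = None
Input_data_type = list
data_file = data_file_name.csv
data_folder = path\to\data\folder
"""

config = configparser.ConfigParser()
config.read_string(sample)  # for a real file: config.read("config.ini")

section = config["Data"]
verbose = section.getboolean("Verbose")  # option names are case-insensitive
data_file = section["data_file"]
data_folder = section["data_folder"]
# configparser returns strings, so "None" is text until converted explicitly.
num_words = None if section["num_words"] == "None" else int(section["num_words"])

print(data_file, data_folder, verbose, num_words)
```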

Alternatively, you can call the set_data_file and set_data_folder methods of the data_process class, passing the name of your data file and the path to your data folder.

    d.set_data_file('data_file_name.csv')
    # raw string, so that \t and \f are not read as escape sequences
    d.set_data_folder(r'path\to\data\folder')
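Windows-style paths written as ordinary Python string literals can be changed silently, because sequences such as `\t` are interpreted as escape characters. The short sketch below shows the effect and two ways around it. Whether set_data_folder accepts path objects is not stated here, so the example converts the path to a plain string.

```python
from pathlib import PureWindowsPath

plain = "path\to"   # "\t" is read as a single tab character
raw = r"path\to"    # a raw string keeps the backslash as written

print(repr(plain))
print(repr(raw))

# Building the path from its parts avoids backslashes in literals altogether.
folder = str(PureWindowsPath("path") / "to" / "data" / "folder")
print(folder)
```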

Now you should be ready to load your data. Load the data by calling the load_file method of your data_process instance. Currently the alpacka package only supports .csv files as input.

The required inputs of the load_file call are the names of the columns that contain the data and its labels.

    data, labels = d.load_file('preprocessed_text', 'label')

In this example, a csv file was loaded in which the data is contained in the column named preprocessed_text and the labels in a column named label.
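To make the expected file layout concrete, the sketch below reads two named columns from a small in-memory csv using only the standard library. It illustrates the idea of picking a text column and a label column by name; `read_text_and_labels` is a hypothetical helper and is not how alpacka's load_file is implemented.

```python
import csv
import io


def read_text_and_labels(csv_file, text_column, label_column):
    """Return the text and label columns of an open csv file as two lists."""
    reader = csv.DictReader(csv_file)  # the first row is used as the header
    data, labels = [], []
    for row in reader:
        data.append(row[text_column])
        labels.append(row[label_column])
    return data, labels


# A made-up two-row file with the column names used in the walkthrough.
sample_csv = io.StringIO(
    "preprocessed_text,label\n"
    "great film,1\n"
    "boring plot,0\n"
)
data, labels = read_text_and_labels(sample_csv, "preprocessed_text", "label")
print(data)
print(labels)
```

Note that the csv module returns every value as a string, so the labels in this sketch are "1" and "0" rather than integers.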

Now that the data is loaded, we can calculate the NCOF and TF-IDF scores for the data set by calling the calc_ncof and calc_TFIDF methods of their respective classes. The scores are saved within the class instances and can be viewed and assigned to an external variable by calling .get_Score().

    ncof.calc_ncof(data, labels)
    score_ncof = ncof.get_Score()

    tfidf.calc_TFIDF(data, labels)
    score_tfidf = tfidf.get_Score()
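For readers unfamiliar with TF-IDF, the sketch below computes the textbook form of the score: term frequency within a document multiplied by the logarithm of the inverse document frequency. It is an illustration only. `tfidf_scores` is a hypothetical helper, and the variant alpacka uses (for example, how it takes the labels into account) is not described here and may differ.

```python
import math
from collections import Counter


def tfidf_scores(documents):
    """Return one {term: tf-idf} dict per document (textbook formulation)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Made-up example documents.
docs = ["good movie good plot", "bad movie", "good acting"]
scores = tfidf_scores(docs)
for doc, doc_scores in zip(docs, scores):
    print(doc, doc_scores)
```

In this formulation a term that appears in every document gets a score of zero, while a term concentrated in few documents scores higher.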

From the score, the outliers can be separated and identified by calling .split_score().

No inputs are needed; the method is fully self-contained.

    ncof.split_score()
    tfidf.split_score()

The NCOF score is now ready to be plotted, and its outliers can be viewed by calling

    pos_outliers = ncof.get_Pos_outliers()

and

    neg_outliers = ncof.get_neg_outliers()

This will extract the indexes of the NCOF outliers.

The TF-IDF method requires one additional step before it is ready to be plotted.

This step identifies which outliers occur only in the positive or only in the negative class, and is done by calling .unique_outliers_per_class().

    tfidf.unique_outliers_per_class()

The results can be viewed by calling

    pos_outliers = tfidf.get_outliers_unique_pos()

and

    neg_outliers = tfidf.get_outliers_unique_neg()

This will extract the indexes of the TF-IDF outliers.

The results can now finally be plotted with the .scatter() method.

For simplicity, the method does not require any inputs. If more control over the plot is needed, it is recommended that you create a custom function.

    ncof.scatter()
    tfidf.scatter()
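The steps of the walkthrough can be collected into a single function, sketched below. Every alpacka call and method name is copied verbatim from the walkthrough, including the differing capitalisation of get_Pos_outliers and get_neg_outliers. The wrapper `run_walkthrough`, its parameters, and the returned dictionary are assumptions made for illustration. The imports are placed inside the function so that the file can be loaded even where alpacka is not installed.

```python
def run_walkthrough(config_path="config.ini",
                    text_column="preprocessed_text",
                    label_column="label"):
    """Run the walkthrough steps in order and return the collected results."""
    # Imported here so that defining the function does not require alpacka.
    from alpacka.pipes.data_loader import data_process
    from alpacka.pipes.ncof_pipeline import ncof_pipe
    from alpacka.pipes.tfidf_pipeline import tfidf_pipe

    # Load the data and labels from the csv file named in the config.
    d = data_process(config_path=config_path)
    data, labels = d.load_file(text_column, label_column)

    # NCOF: score, split into outliers, read out the outlier indexes.
    ncof = ncof_pipe(config_path=config_path)
    ncof.calc_ncof(data, labels)
    score_ncof = ncof.get_Score()
    ncof.split_score()

    # TF-IDF: same steps, plus the per-class uniqueness step.
    tfidf = tfidf_pipe(config_path=config_path)
    tfidf.calc_TFIDF(data, labels)
    score_tfidf = tfidf.get_Score()
    tfidf.split_score()
    tfidf.unique_outliers_per_class()

    results = {
        "score_ncof": score_ncof,
        "ncof_pos_outliers": ncof.get_Pos_outliers(),
        "ncof_neg_outliers": ncof.get_neg_outliers(),
        "score_tfidf": score_tfidf,
        "tfidf_pos_outliers": tfidf.get_outliers_unique_pos(),
        "tfidf_neg_outliers": tfidf.get_outliers_unique_neg(),
    }

    # Plot both scores.
    ncof.scatter()
    tfidf.scatter()
    return results


# Usage, assuming alpacka is installed and config.ini points at your data:
# results = run_walkthrough("config.ini")
```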
