
Generic functions for SixAds data science projects

Project description

sixadsml

A package used by the SixAds data science department. To learn more about SixAds, visit https://sixads.net/.

The source repository for this package is hosted on Bitbucket: https://bitbucket.org/eligijus112/sixadsml/src/master/

Installation

In the Anaconda prompt (Windows users), type:

curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py
pip install SixAdsDS

sixadsml.clean_text

Functions for text preprocessing and summarizing. A common way to use these functions is to combine them into a pipeline whose input is a list of strings.
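The pipeline idea can be sketched as follows. The stand-in implementations below only illustrate the shared list-in/list-out shape that the package functions (to_lower, rm_digits, clean_ws, documented below) are assumed to follow; they are not the package's actual code.

```python
import re

# Stand-in implementations mirroring the list-in/list-out shape of the
# package functions documented below (to_lower, rm_digits, clean_ws).
def to_lower(string_list):
    return [s.lower() for s in string_list]

def rm_digits(string_list):
    return [re.sub(r"\d+", "", s) for s in string_list]

def clean_ws(string_list):
    return [re.sub(r"\s+", " ", s).strip() for s in string_list]

def pipeline(string_list, steps):
    # Apply each step in order; every step takes and returns a list of strings
    for step in steps:
        string_list = step(string_list)
    return string_list

docs = ["Python 3 is  AWESOME", "R is good as   well"]
cleaned = pipeline(docs, [to_lower, rm_digits, clean_ws])
print(cleaned)  # ['python is awesome', 'r is good as well']
```

Because every function maps a list of strings to a list of strings, the steps can be reordered or swapped freely.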

lemmatize_word

lemmatize_word(string_list, engine=WordNetLemmatizer())

Lemmatizes words using a WordNet lemmatizer engine

Parameters

string_list : list List which stores strings

engine : WordNetLemmatizer() (default) An object from the nltk.stem.wordnet library

Returns

List with the same length as *string_list* where each word in each
string is lemmatized

to_str

to_str(string_list)

Converts every list element to str type

Parameters

string_list : list List which stores strings

Returns

List with the same length as *string_list* where every list element
is converted to a str type object

rm_short_words

rm_short_words(string_list, lower_bound=1, upper_bound=2)

Removes words whose character length falls between lower_bound and upper_bound (inclusive)

Parameters

string_list : list List which stores strings

lower_bound: int Integer indicating the lower bound of a character length

upper_bound: int Integer indicating the upper bound of a character length

Returns

List with the same length as *string_list* where every whitespace-separated
word is removed if its length falls in the range
[lower_bound, upper_bound]

Examples

string_list = ['python is awesome', 'R is good as well']

rm_short_words(string_list)

rm_short_words(string_list, 4, 5)

to_single

to_single(string_list)

Converts every word in string_list to its singular form.

Parameters

string_list: list List which stores strings

Returns

List with the same length as *string_list* where every word is converted
to singular form

to_lower

to_lower(string_list)

Makes every word in the string_list lowercase

Parameters

string_list : list List which stores strings

Returns

List with the same length as *string_list* where every word is converted
to lowercase

rm_stop_words

rm_stop_words(string_list)

Removes stop words using the nltk stopwords module.

Parameters

string_list : list List which stores strings

Returns

List with the same length as *string_list* where every string is without
stopwords

rm_punctuations

rm_punctuations(string_list)

Removes punctuation and other special characters from a string list

Parameters

string_list : list List which stores strings

Returns

List with the same length as *string_list* where every string is stripped of
punctuation and other special characters
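A minimal illustrative re-implementation, assuming "special characters" means everything that is not a letter, digit, or whitespace (the package's exact character set may differ):

```python
import re

def rm_punctuations_sketch(string_list):
    # Keep only word characters (letters, digits, underscore) and whitespace;
    # everything else is treated as punctuation / special characters.
    return [re.sub(r"[^\w\s]", "", s) for s in string_list]

print(rm_punctuations_sketch(["hello, world!", "50% off?!"]))
# ['hello world', '50 off']
```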

rm_digits

rm_digits(string_list)

Removes digits from a string list

Parameters

string_list : list List which stores strings

Returns

List with the same length as *string_list* where every string is without
digits

stem_words

stem_words(string_list, stemmer=SnowballStemmer('english'))

A function to stem the words in a given list of strings

Parameters

string_list : list List which stores strings

stemmer : a word stemmer from the nltk.stem library; default is nltk.stem.SnowballStemmer('english')

Returns

List with the same length as *string_list* where every word is stemmed

clean_ws

clean_ws(string_list)

Collapses runs of consecutive whitespace characters

Parameters

string_list : list List which stores strings

Returns

List with the same length as *string_list* where every run of consecutive
whitespace is collapsed into a single space
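The behaviour described above can be sketched with a regular expression (an illustrative re-implementation, not the package's actual code):

```python
import re

def clean_ws_sketch(string_list):
    # Collapse any run of whitespace (spaces, tabs, newlines) to a single
    # space and trim leading/trailing whitespace.
    return [re.sub(r"\s+", " ", s).strip() for s in string_list]

print(clean_ws_sketch(["  too   many\tspaces "]))  # ['too many spaces']
```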

build_vocab

build_vocab(string_list, verbose=True)

A function that creates a term frequency vocabulary from the text

Parameters

string_list : list List which stores strings

verbose : boolean; default=True Whether to show the timing of the for loop

Returns

A dictionary in which each key is a unique term from *string_list* and
each value is the number of times that term appears across all
strings

Example

string_list = ['python is awesome', 'R is awesome as well']

build_vocab(string_list)
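The term-frequency counting can be sketched with collections.Counter (an illustrative equivalent; the package version additionally reports loop timing when verbose=True):

```python
from collections import Counter

def build_vocab_sketch(string_list):
    # Count how many times each whitespace-separated term appears
    # across all strings in the list.
    counts = Counter()
    for s in string_list:
        counts.update(s.split())
    return dict(counts)

print(build_vocab_sketch(['python is awesome', 'R is awesome as well']))
# {'python': 1, 'is': 2, 'awesome': 2, 'R': 1, 'as': 1, 'well': 1}
```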

sixadsml.images

Functions to preprocess images from the web or a local machine

img_read_url

img_read_url(url, h=256, w=256, to_grey=False, timeout=2)

Fetches an image from a URL

Parameters

url : string The URL of the image

h: int Desired height of the returned image (px)

w: int Desired width of the returned image (px)

to_grey: bool should the image be returned in greyscale?

timeout: int maximum wait time before dropping the request

Returns

A numpy array with dimensions (h, w, 3), or (h, w, 1) if to_grey=True

img_read_url_PIL

img_read_url_PIL(url, h=256, w=256, timeout=2)

Fetches an image from a URL using the PIL framework

Parameters

url : string The URL of the image

h: int Desired height of the returned image (px)

w: int Desired width of the returned image (px)

timeout: int maximum wait time before dropping the request

Returns

PIL.Image.Image

img_read

img_read(path, h=256, w=256, to_grey=False)

Reads an image from the local machine

Parameters

path : string path to image on a local machine

h: int Desired height of the returned image (px)

w: int Desired width of the returned image (px)

to_grey: bool Should the image be returned in greyscale?

Returns

Numpy array with dimensions (h, w, 3), or (h, w, 1) if to_grey=True

return_image_hist

return_image_hist(image, no_bins_per_channel=10, normalize=False)

Function to get the histogram of the colours in a photo

Parameters

image : numpy ndarray A numpy array with the shape (x, y, 3)

no_bins_per_channel: int How many bins the histogram should have for each colour channel

normalize : bool Should the histogram values be normalized to sum to 1?

Returns

A list of size 3 * no_bins_per_channel representing the distribution
of colors in the image
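The per-channel binning can be sketched in plain Python (an illustrative version; the real function operates on a numpy array and a hypothetical nested-list "image" stands in for it here):

```python
def channel_hist(values, no_bins=10, max_val=256):
    # Bin one colour channel's pixel values (0-255) into no_bins counts.
    bins = [0] * no_bins
    width = max_val / no_bins
    for v in values:
        bins[min(int(v / width), no_bins - 1)] += 1
    return bins

# A 2x2 "image" as nested lists of (r, g, b) pixels standing in for the
# numpy array of shape (x, y, 3) that the real function expects.
image = [[(0, 128, 255), (10, 130, 250)],
         [(30, 140, 240), (200, 5, 5)]]
pixels = [px for row in image for px in row]

hist = []
for channel in range(3):  # r, g, b
    hist += channel_hist([px[channel] for px in pixels])

print(len(hist))  # 30  (3 channels * 10 bins per channel)
```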

sixadsml.utility

Utility functions

make_connection

make_connection(specs)

Creates a connection based on the information in specs. Usually, the specs dictionary is the output of the read_yaml function.

Parameters

specs : dictionary A dictionary that stores the user, password, host and db keys

Returns

An SQLAlchemy connection object
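A specs file read by read_yaml might look like the following (hypothetical values; only the key names — user, password, host and db — come from the description above):

```yaml
# Hypothetical connection specs consumed by make_connection via read_yaml
user: db_user
password: secret
host: localhost
db: sixads
```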

exec_file

exec_file(file, add_params=None)

Executes a file with the .py extension

Parameters

file: string path to the python file

add_params: additional parameters that are used in the file being executed

Returns

Whatever output the executable file outputs

read_yaml

read_yaml(file)

Reads a .yml or .yaml file

Parameters

file: string

path to the .yml or .yaml file

Returns

Dictionary with the .yml or .yaml file contents

chunks_of_n

chunks_of_n(l, n)

Splits a list into n roughly equal-sized chunks

Parameters

l: list

n: int

Returns

A list of size *n* with the items of *l* split as equally as possible
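One way to implement this splitting (an illustrative sketch, not necessarily the package's code) distributes any remainder one element at a time across the first chunks:

```python
def chunks_of_n(l, n):
    # Split l into n contiguous chunks whose sizes differ by at most one.
    # divmod gives the base chunk size k and the remainder m; the first m
    # chunks each receive one extra element.
    k, m = divmod(len(l), n)
    return [l[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]

print(chunks_of_n([1, 2, 3, 4, 5], 2))  # [[1, 2, 3], [4, 5]]
```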

unique

unique(l)

A handy function to return unique elements of a list or a numpy array

Parameters

l : list or array

Returns

A list or array containing unique elements of l
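For plain lists, an order-preserving sketch of the same idea (the package version may instead rely on numpy.unique, which sorts its output):

```python
def unique_sketch(l):
    # dict keys preserve insertion order (Python 3.7+) and are unique,
    # so this returns each element once, in first-seen order.
    return list(dict.fromkeys(l))

print(unique_sketch([3, 1, 3, 2, 1]))  # [3, 1, 2]
```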

sixadsml.sql_utility

Functions for downloading data from and writing data to the database

Get_sql

Get_sql()

Class that deals with downloading data

get_google_tree

Get_sql.get_google_tree(connection)

Function to download the google taxonomy tree from the sixads database

Parameters

connection: sql_alchemy connection object

Returns

A pandas dataframe

get_data

Get_sql.get_data(connection, select_part, from_part, where_part='')

Function that constructs a query from the given parts and executes it

Parameters

connection: sql_alchemy connection object

select_part: list list of strings identifying the desired columns

from_part: string the table name

where_part: string additional constraints

Returns

A pandas dataframe
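How the query parts are assembled might look like this (a hypothetical sketch; the package executes the result through the SQLAlchemy connection and returns a pandas dataframe):

```python
def build_query(select_part, from_part, where_part=""):
    # Join the column list, name the table, and append the optional
    # WHERE clause only when constraints were given.
    query = f"SELECT {', '.join(select_part)} FROM {from_part}"
    if where_part:
        query += f" WHERE {where_part}"
    return query

print(build_query(["id", "title"], "products", "price > 10"))
# SELECT id, title FROM products WHERE price > 10
```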

Write_sql

Write_sql()

Class that deals with writing data

write_to_table

Write_sql.write_to_table(specs, table, data, if_exists='replace')

Writes data to the desired table

Parameters

specs: dictionary Must contain the keys user, password, host and db

table: string A string referring to the table to write to

data: pandas dataframe Data which we want to write to the table

if_exists: string What to do if the table already exists. Possible string values: 'replace', 'append', 'fail'

sixadsml.embeddings

Class for dealing with word embeddings

load_from_text

load_from_text(path)

Reads word embeddings from a text document and returns them as a dictionary

Parameters

path: string path to a txt document containing the word embeddings

Returns

A dictionary where the keys are individual words and the values are
their embedding vectors
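Parsing such a document can be sketched as follows (an illustrative version that takes an iterable of lines; the real function reads from a file path, and the GloVe-style "word v1 v2 ... vn" row format is an assumption):

```python
def load_from_text_sketch(lines):
    # Each line holds a word followed by its embedding coordinates,
    # separated by whitespace: "word v1 v2 ... vn".
    embeddings = {}
    for line in lines:
        parts = line.split()
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings

rows = ["cat 0.1 0.2 0.3", "dog 0.4 0.5 0.6"]
print(load_from_text_sketch(rows)["cat"])  # [0.1, 0.2, 0.3]
```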

tokenize_text

tokenize_text(string_list, max_features, max_len)

Creates a tokenizer from a given text list

Parameters

string_list: list List containing strings

max_features: int The maximum number of unique words that the tokenizer saves in memory

max_len: int The length of the vector into which each element of string_list will be converted.

Returns

A tuple of the tokenized text and the fitted tokenizer for future use.
The first element of the tuple is an array of shape (len(*string_list*), max_len)

create_embedding_matrix

create_embedding_matrix(embeddings, tokenizer, max_features, embed_size=300)

Function to create the embedding matrix to use in neural networks. This goes directly to the embedding layer.

Parameters

embeddings: dictionary output of load_from_text() function

tokenizer: keras.Tokenizer object output of tokenize_text() function

max_features: int how many unique tokens to use

embed_size: int the dimensionality of the embedding vectors; default=300

Returns

A numpy.ndarray of shape (max_features, embed_size)
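Conceptually, the matrix maps each token index from the tokenizer to its embedding vector. A minimal sketch with plain lists instead of numpy (word_index is a hypothetical stand-in for the tokenizer.word_index attribute of a fitted Keras tokenizer):

```python
def create_embedding_matrix_sketch(embeddings, word_index, max_features, embed_size=3):
    # Row i holds the embedding vector for the token with index i;
    # words missing from the embeddings dictionary stay all-zero.
    matrix = [[0.0] * embed_size for _ in range(max_features)]
    for word, i in word_index.items():
        if i < max_features and word in embeddings:
            matrix[i] = embeddings[word]
    return matrix

embeddings = {"cat": [0.1, 0.2, 0.3]}   # e.g. output of load_from_text
word_index = {"cat": 1, "dog": 2}       # e.g. tokenizer.word_index
m = create_embedding_matrix_sketch(embeddings, word_index, max_features=3)
print(m[1])  # [0.1, 0.2, 0.3]
```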


Download files

Source Distribution

SixAdsDS-0.1.6.tar.gz (10.4 kB)

Built Distribution

SixAdsDS-0.1.6-py3-none-any.whl (12.2 kB)
