Generic functions for SixAds data science projects
sixadsml
Package used by the SixAds data science department. To learn more about SixAds, visit https://sixads.net/.
The source repository for this package is hosted on Bitbucket: https://bitbucket.org/eligijus112/sixadsml/src/master/
Installation
Windows users: in the Anaconda Prompt, type:
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py
pip install SixAdsDS
sixadsml.clean_text
Functions for text preprocessing and summarizing. A common way to use these functions is to combine them into a pipeline whose input is a list of strings.
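The pipeline idea can be sketched as follows. This is a minimal illustration and not the package's own code: run_pipeline, lower_sketch and strip_digits_sketch are hypothetical stand-ins written with the standard library so that the example runs without sixadsml installed. With the package available, functions such as to_lower or rm_digits would presumably take their place.

```python
import re
from functools import reduce


def lower_sketch(string_list):
    """Stand-in step (hypothetical): lowercase every string."""
    return [s.lower() for s in string_list]


def strip_digits_sketch(string_list):
    """Stand-in step (hypothetical): drop every digit character."""
    return [re.sub(r"\d", "", s) for s in string_list]


def run_pipeline(string_list, steps):
    """Feed the list through each step in order; every step maps list -> list."""
    return reduce(lambda data, step: step(data), steps, string_list)


cleaned = run_pipeline(["Python 3 is Awesome"], [lower_sketch, strip_digits_sketch])
print(cleaned)
```

Because every step takes a list of strings and returns a list of the same length, the steps can be reordered or swapped without changing the surrounding code.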
lemmatize_word
lemmatize_word(string_list, engine=<WordNetLemmatizer>)
Lemmatize words using one of the WordNet engines
Parameters
string_list : list List which stores strings
engine : WordNetLemmatizer() (default) An object from the nltk.stem.wordnet library
Returns
List with the same length as *string_list* where each word in each
string is lemmatized
to_str
to_str(string_list)
Converts every list element to str type
Parameters
string_list : list List which stores strings
Returns
List with the same length as *string_list* where every list element
is converted to a str type object
rm_short_words
rm_short_words(string_list, lower_bound=1, upper_bound=2)
Removes words whose length falls in the range [lower_bound, upper_bound]
Parameters
string_list : list List which stores strings
lower_bound: int Integer indicating the lower bound of the word length
upper_bound: int Integer indicating the upper bound of the word length
Returns
List with the same length as *string_list* where every whitespace-separated
word is removed if its length is in the range
[lower_bound, upper_bound]
Examples
string_list = ['python is awesome', 'R is good as well']
rm_short_words(string_list)
rm_short_words(string_list, 4, 5)
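To make the documented behaviour concrete, here is a small stand-in written from the description above. It is an assumption-based sketch, not the package's implementation: rm_short_words_sketch is a hypothetical name, and the real function may treat whitespace differently when it rejoins the remaining words.

```python
def rm_short_words_sketch(string_list, lower_bound=1, upper_bound=2):
    """Drop whitespace-separated words whose length is in [lower_bound, upper_bound]."""
    return [
        " ".join(w for w in s.split() if not lower_bound <= len(w) <= upper_bound)
        for s in string_list
    ]


string_list = ['python is awesome', 'R is good as well']

# Default bounds remove words of length 1 or 2.
short_removed = rm_short_words_sketch(string_list)

# Bounds 4..5 remove only words of length 4 or 5.
mid_removed = rm_short_words_sketch(string_list, 4, 5)

print(short_removed)
print(mid_removed)
```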
to_single
to_single(string_list)
Converts every word in string_list to its singular form.
Parameters
string_list: list List which stores strings
Returns
List with the same length as *string_list* where every word is converted
to singular form
to_lower
to_lower(string_list)
Makes every word in the string_list lowercase
Parameters
string_list : list List which stores strings
Returns
List with the same length as *string_list* where every word is converted
to lowercase
rm_stop_words
rm_stop_words(string_list)
Removes stop words using the nltk stopwords module.
Parameters
string_list : list List which stores strings
Returns
List with the same length as *string_list* where every string is without
stopwords
rm_punctuations
rm_punctuations(string_list)
Removes punctuations and other special characters from a string list
Parameters
string_list : list List which stores strings
Returns
List with the same length as *string_list* where every string is without
punctuations and other special characters
rm_digits
rm_digits(string_list)
Removes digits from a string list
Parameters
string_list : list List which stores strings
Returns
List with the same length as *string_list* where every string is without
digits
stem_words
stem_words(string_list, stemmer=SnowballStemmer('english'))
Stems the words in a given list of strings
Parameters
string_list : list List which stores strings
stemmer : word stemmer from nltk.stem library; nltk.stem.SnowballStemmer('english') default
Returns
List with the same length as *string_list* where every word is stemmed
clean_ws
clean_ws(string_list)
Cleans up runs of one or more whitespace characters
Parameters
string_list : list List which stores strings
Returns
List with the same length as *string_list* where every string has at most
one whitespace character in a row
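One plausible reading of this behaviour is a regular-expression substitution. The sketch below is an assumption, not the package's code: clean_ws_sketch is a hypothetical name, and whether the real function also trims leading and trailing whitespace is not stated in this document.

```python
import re


def clean_ws_sketch(string_list):
    """Collapse each run of whitespace to one space and trim both ends (assumed)."""
    return [re.sub(r"\s+", " ", s).strip() for s in string_list]


print(clean_ws_sketch(["  python   is \t awesome "]))
```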
build_vocab
build_vocab(string_list, verbose=True)
A function that creates a term frequency vocabulary from the text
Parameters
string_list : list List which stores strings
verbose : boolean; default=True Whether to show the timing of the for loop
Returns
Dictionary in which each key is a unique term in the string_list and
each value is the number of times that term appeared in the
strings
Example
string_list = ['python is awesome', 'R is awesome as well']
build_vocab(string_list)
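The kind of dictionary described above can be reproduced with collections.Counter. This is a sketch under assumptions, not the package's implementation: build_vocab_sketch is a hypothetical name, terms are taken to be whitespace-separated tokens, and the verbose timing output is left out.

```python
from collections import Counter


def build_vocab_sketch(string_list):
    """Count how often each whitespace-separated term occurs across all strings."""
    vocab = Counter()
    for s in string_list:
        vocab.update(s.split())
    return dict(vocab)


string_list = ['python is awesome', 'R is awesome as well']
vocab = build_vocab_sketch(string_list)
print(vocab)
```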
sixadsml.images
Functions to preprocess images from the web or a local machine
img_read_url
img_read_url(url, h=256, w=256, to_grey=False, timeout=2)
Reads an image from a URL and returns it
Parameters
url : string url (in a string format)
h: int Desired height of the returned image (px)
w: int Desired width of the returned image (px)
to_grey: bool should the image be returned in greyscale?
timeout: int maximum wait time before dropping the request
Returns
A numpy array with dimensions (h, w, 3) or (h, w, 1) if to_grey=True
img_read_url_PIL
img_read_url_PIL(url, h=256, w=256, timeout=2)
Reads an image from a URL and returns it (using the PIL framework)
Parameters
url : string url (in a string format)
h: int Desired height of the returned image (px)
w: int Desired width of the returned image (px)
timeout: int maximum wait time before dropping the request
Returns
PIL.Image.Image
img_read
img_read(path, h=256, w=256, to_grey=False)
Reads an image from the local machine
Parameters
path : string path to image on a local machine
h: int Desired height of the returned image (px)
w: int Desired width of the returned image (px)
to_grey: bool Should the image be returned in greyscale?
Returns
Numpy array with dimensions (h, w, 3) or (h, w, 1) if to_grey=True
return_image_hist
return_image_hist(image, no_bins_per_channel=10, normalize=False)
Function to get the histogram of the colours in a photo
Parameters
image : numpy ndarray A numpy array with the shape (x, y, 3)
no_bins_per_channel: int How many bins the histogram should have for each color channel
normalize : bool Should the coordinates add up to 1?
Returns
A list of size 3 * no_bins_per_channel representing the distribution
of colors in the image
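The shape of the result, 3 * no_bins_per_channel values, can be illustrated with a pure-Python stand-in. Everything here is an assumption made for illustration: image_hist_sketch is a hypothetical name, the image is given as nested lists instead of a numpy array, channel values are taken to lie in 0..255, and bins are of equal width. The package's actual binning may differ.

```python
def image_hist_sketch(image, no_bins_per_channel=10, normalize=False):
    """image: rows of pixels, each pixel a sequence of 3 integers in 0..255 (assumed)."""
    hist = [0] * (3 * no_bins_per_channel)
    for row in image:
        for pixel in row:
            for channel, value in enumerate(pixel):
                # Map 0..255 onto bin indices 0..no_bins_per_channel - 1.
                bin_idx = min(int(value) * no_bins_per_channel // 256,
                              no_bins_per_channel - 1)
                hist[channel * no_bins_per_channel + bin_idx] += 1
    if normalize and sum(hist) > 0:
        total = sum(hist)
        hist = [count / total for count in hist]
    return hist


# A 1 x 2 image: one pixel (0, 128, 255) and one black pixel.
tiny_image = [[(0, 128, 255), (0, 0, 0)]]
counts = image_hist_sketch(tiny_image, no_bins_per_channel=2)
shares = image_hist_sketch(tiny_image, no_bins_per_channel=2, normalize=True)
print(counts)
```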
sixadsml.utility
Utility functions
make_connection
make_connection(specs)
Creates a connection based on the information in the specs. Usually, the specs dictionary is the output of the read_yaml function
Parameters
specs : dictionary A dictionary that stores the user, password, host and db keys
Returns
A SQLAlchemy connection object
exec_file
exec_file(file, add_params=None)
Executes a file with the .py extension
Parameters
file: string path to the python file
add_params: additional parameters that are used in the file that is being executed
Returns
Whatever the executed file outputs
read_yaml
read_yaml(file)
Reads a .yml or .yaml file
Parameters
file: string path to the .yml or .yaml file
Returns
Dictionary with the .yml or .yaml file contents
chunks_of_n
chunks_of_n(l, n)
Splits a list into n parts of equal size
Parameters
l: list
n: int
Returns
A list of size *n* with the items of *l* split equally
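A sketch of splitting a list into n parts, following the description above. It is not the package's code: chunks_of_n_sketch is a hypothetical name, and the handling of lists whose length is not divisible by n (here the first parts get one extra item each) is an assumption.

```python
def chunks_of_n_sketch(l, n):
    """Split l into n consecutive parts whose sizes differ by at most one."""
    size, extra = divmod(len(l), n)
    chunks, start = [], 0
    for i in range(n):
        # The first `extra` chunks absorb the remainder, one item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(l[start:end])
        start = end
    return chunks


parts = chunks_of_n_sketch([1, 2, 3, 4, 5, 6, 7], 3)
print(parts)
```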
unique
unique(l)
A handy function to return unique elements of a list or a numpy array
Parameters
l : list or array
Returns
A list or array containing unique elements of l
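For plain lists, one common way to obtain unique elements is dict.fromkeys, which keeps the first occurrence of each item. This is an illustrative sketch only: unique_sketch is a hypothetical name, and this document does not say whether the package's function preserves order or sorts the result.

```python
def unique_sketch(l):
    """Return the distinct items of l in order of first appearance (assumed order)."""
    return list(dict.fromkeys(l))


print(unique_sketch([3, 1, 3, 2, 1]))
```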
sixadsml.sql_utility
Functions to deal with downloading and writing data to the database
Get_sql
Get_sql(self, /, *args, **kwargs)
Class that deals with downloading data
get_google_tree
Get_sql.get_google_tree(connection)
Function to download the google taxonomy tree from the sixads database
Parameters
connection: SQLAlchemy connection object
Returns
A pandas dataframe
get_data
Get_sql.get_data(connection, select_part, from_part, where_part='')
Function that constructs a query from the given parts and executes it
Parameters
select_part: list list of strings identifying the desired columns
from_part: string the table name
where_part: string additional constraints
Returns
A pandas dataframe
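How the three parts might be assembled into a query string can be sketched as below. This is an assumption, not the package's code: build_query_sketch is a hypothetical helper, the table and column names are made up, and it is assumed that where_part is passed without the WHERE keyword. The sketch only builds the string and does not execute it.

```python
def build_query_sketch(select_part, from_part, where_part=""):
    """Assemble a SELECT statement from a column list, a table name and an optional filter."""
    query = "SELECT {} FROM {}".format(", ".join(select_part), from_part)
    if where_part:
        # Assumption: the caller supplies only the condition, without "WHERE".
        query += " WHERE " + where_part
    return query


# Hypothetical table and column names, for illustration only.
query = build_query_sketch(["id", "title"], "products", "price > 10")
print(query)
```

Since the parts are concatenated as plain strings, they should come from trusted code and not from user input.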
Write_sql
Write_sql(self, /, *args, **kwargs)
Class that deals with writing data
write_to_table
Write_sql.write_to_table(specs, table, data, if_exists='replace')
Writes data to the desired table
Parameters
specs: dictionary Must contain the keys user, password, host and db
table: string A string referring to the table which we want to write to
data: pandas dataframe Data which we want to write to the table
if_exists: string What to do if the table already exists. Possible string values: 'replace', 'append', 'fail'
sixadsml.embeddings
Class for dealing with word embeddings
load_from_text
load_from_text(path)
Reads the word embeddings from a txt document and saves them as a dictionary
Parameters
path: string path to a txt document containing the word embeddings
Returns
A dictionary where the keys are individual words and the
values are the vectors
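A sketch of the word-to-vector dictionary, assuming the text file uses the common layout of one word per line followed by its space-separated coordinates. The file layout is an assumption, load_embeddings_sketch is a hypothetical name, and the sketch reads from any iterable of lines so that it can be tried on an in-memory sample.

```python
import io


def load_embeddings_sketch(lines):
    """Parse 'word v1 v2 ... vk' lines into a {word: [v1, ..., vk]} dictionary."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            # Skip blank or malformed lines.
            continue
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings


# In-memory stand-in for a txt document with two 3-dimensional embeddings.
sample = io.StringIO("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
vectors = load_embeddings_sketch(sample)
print(vectors["dog"])
```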
tokenize_text
tokenize_text(string_list, max_features, max_len)
Creates a tokenizer from a given text list
Parameters
string_list: list List containing strings
max_features: int The maximum number of unique words that the tokenizer saves in memory
max_len: int The length of the vector into which all elements of string_list will be converted to.
Returns
A tuple of the tokenized text and the fitted tokenizer for future use.
The first element of the tuple is an array of shape (len(*string_list*), max_len)
create_embedding_matrix
create_embedding_matrix(embeddings, tokenizer, max_features, embed_size=300)
Function to create the embedding matrix to use in neural networks. This goes directly to the embedding layer.
Parameters
embeddings: dictionary output of load_from_text() function
tokenizer: keras.Tokenizer object output of tokenize_text() function
max_features: int how many unique tokens to use
embed_size: int how many coordinates does the embedding have; default=300
Returns
A numpy.ndarray of shape (max_features, embed_size)
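The idea behind the embedding matrix can be sketched with plain lists: row i holds the vector of the word whose tokenizer index is i, and words without a known vector keep a row of zeros. This is an illustration under assumptions, not the package's code: embedding_matrix_sketch is a hypothetical name, it takes the word-to-index mapping directly (a Keras tokenizer exposes one as word_index, with indices starting at 1), and it returns nested lists instead of a numpy.ndarray.

```python
def embedding_matrix_sketch(embeddings, word_index, max_features, embed_size=300):
    """Build a max_features x embed_size matrix; row i is the vector for index i."""
    matrix = [[0.0] * embed_size for _ in range(max_features)]
    for word, idx in word_index.items():
        if idx >= max_features:
            # Words beyond the feature limit get no row.
            continue
        vector = embeddings.get(word)
        if vector is not None and len(vector) == embed_size:
            matrix[idx] = list(vector)
    return matrix


# Toy inputs: 2-dimensional vectors and a three-word index.
toy_embeddings = {"cat": [1.0, 2.0]}
toy_index = {"cat": 1, "dog": 2, "bird": 3}
matrix = embedding_matrix_sketch(toy_embeddings, toy_index, max_features=3, embed_size=2)
print(matrix)
```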