Generic functions for SixAds data science projects
Project description
sixadsml
Package that is used by the SixAds data science department. To know more about sixads, visit https://sixads.net/.
The github link for this package is https://bitbucket.org/eligijus112/sixadsml/src/master/
Installation
In anaconda prompt type (windows users):
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py
pip install SixAdsDS
sixadsml.clean_text
Functions for text preprocesing/summarizing. A common way to use these functions is to combine them into a pipeline which the input is a list containing strings.
lemmatize_word
lemmatize_word(string_list, engine=<WordNetLemmatizer>)
Lemmatize words using one of the WordNet engines
Parameters
string_list : list List which stores strings
engine : WordNetLemmatizer() (default) An object from the nltk.stem.wodnet library
Returns
List with the same length as *string_list* where each word in each
string is lemmatized
to_str
to_str(string_list)
Converts every list element to str type
Parameters
string_list : list List which stores strings
Returns
List with the same length as *string_list* where every list element
is converted to a str type object
rm_short_words
rm_short_words(string_list, lower_bound=1, upper_bound=2)
Removes characters that are in the range of lower_bound and upper_bound
Parameters
string_list : list List which stores strings
lower_bound: int Integer indicating the lower bound of a character length
upper_bound: int Integer indicating the upper bound of a character length
Returns
List with the same length as *string_list* where every character that is
split by whitespace is removed if it has a length in the range
[lower_bound, upper_bound]
Examples
string_list = ['python is awesome', 'R is good as well']
rm_short_words(string_list)
rm_short_words(string_list, 4, 5)
to_single
to_single(string_list)
Converts every word in string_list to it's singular form.
Parameters
string_list: list List which stores strings
Returns
List with the same length as *string_list* where every word is converted
to singular form
to_lower
to_lower(string_list)
Makes every word in the string_list lowercase
Parameters
string_list : list List which stores strings
Returns
List with the same length as *string_list* where every word is converted
to lowercase
rm_stop_words
rm_stop_words(string_list)
Removes stop words using the nltk stopwords module.
Parameters
string_list : list List which stores strings
Returns
List with the same length as *string_list* where every string is without
stopwords
rm_punctuations
rm_punctuations(string_list)
Removes punctuations and other special characters from a string list
Parameters
string_list : list List which stores strings
Returns
List with the same length as *string_list* where every string is without
punctuations and other special characters
rm_digits
rm_digits(string_list)
Removes digits from a string list
Parameters
string_list : list List which stores strings
Returns
List with the same length as *string_list* where every string is without
digits
stem_words
stem_words(string_list, stemmer=<nltk.stem.snowball.SnowballStemmer object at 0x000001C95E8BCA20>)
A function to stemm the words in a given string vector
Parameters
string_list : list List which stores strings
stemmer : word stemmer from nltk.stem library; nltk.stem.SnowballStemmer('english') default
Returns
List with the same length as *string_list* where every character is stemmed
clean_ws
clean_ws(string_list)
Cleans one or more whitespaces
Parameters
string_list : list List which stores strings
Returns
List with the same length as *string_list* where every string has only
one or less whitespace
build_vocab
build_vocab(string_list, verbose=True)
A function that creates a term frequency vocabulary from the text
Parameters
string_list : list List which stores strings
verbose : boolean; default=True Whether to show the timing of the for loop
Returns
dictionary that each key is a unique term in the string_list and
the key value is the number of times a certain term appeared in the
string
Example
string_list = string_list = ['python is awesome', 'R is awesome as well']
build_vocab(string_list)
sixadsml.images
Functions to preproces images from the web or a local machine
img_read_url
img_read_url(url, h=256, w=256, to_grey=False, timeout=2)
Returns an image via an url
Parameters
url : string url (in a string format)
h: int Desired height of the returned image (px)
w: int Desired width of the returned image (px)
to_grey: bool should the image be returned in greyscale?
timeout: int maximum wait time before dropping the request
Returns
A numpy array width dimensions (h, w, 3) or (h, w, 1) if to_grey=True
img_read_url_PIL
img_read_url_PIL(url, h=256, w=256, timeout=2)
Returns an image via an url (using PIL framework)
Parameters
url : string url (in a string format)
h: int Desired height of the returned image (px)
w: int Desired width of the returned image (px)
timeout: int maximum wait time before dropping the request
Returns
PIL.Image.Image
img_read
img_read(path, h=256, w=256, to_grey=False)
Reads an image from the local machine
Parameters
path : string path to image on a local machine
h: int Desired height of the returned image (px)
w: int Desired width of the returned image (px)
to_grey: bool Should the image be returned in greyscale?
Returns
Numpy array width dimensions (h, w, 3) or (h, w, 1) if to_grey=True
return_image_hist
return_image_hist(image, no_bins_per_channel=10, normalize=False)
Function to get the histogram of the colours in a photo
Parameters
image : numpy ndarray A numpy array with the shape (x, y, 3)
no_bins_per_channel: int How many bins should a histgoram have for each channel of colors
normalize : bool Should the coordinates add up to 1?
Returns
A list of size 3 * no_bins_per_channel representing the distribution
of colors in the image
sixadsml.utility
Utility functions
make_connection
make_connection(specs)
Creates a connection based on the information in the specs. Ussually, the specs dictionary is the output of the read_yaml function
Parameters
specs : dictionary A dictionary that stores the user, password, host and db keys
Returns
An sql_alchemy connection object
exec_file
exec_file(file, add_params=None)
Executes a file with the .py extension
Parameters
file: string path to the python file
add_params: additional parameters that are used in the file that is beeing executed
Returns
Whatever output the executable file outputs
read_yaml
read_yaml(file)
Reads a .yml or .yaml file
Parameters
path: string
path to the .yml or .yaml files
Returns
Dictionary with the .yml or .yaml file contents
chunks_of_n
chunks_of_n(l, n)
Splits a list into n equal sizes
Parameters
l: list
n: int
Returns
A list of size *n* with the items of *l* splited equaly
unique
unique(l)
A handy function to return unique elements of a list or a numpy array
Parameters
l : list or array
Returns
A list or array containing unique elements of l
sixadsml.sql_utility
Functions to deal with downloading and writting data to the database
Get_sql
Get_sql(self, /, *args, **kwargs)
Class that deals with downloading data
get_google_tree
Get_sql.get_google_tree(connection)
Function to download the google taxonomy tree from the sixads database
Parameters
connection: sql_alchemy connection object
Returns
A pandas dataframe
get_data
Get_sql.get_data(connection, select_part, from_part, where_part='')
Function that construcs a query from the given parts and executes it
Parameters
select_part: list list of strings identifying the desired columns
from_part: string the table name
where_part: string additional constaints
Returns
A pandas dataframe
Write_sql
Write_sql(self, /, *args, **kwargs)
Class that deals with writting data
write_to_table
Write_sql.write_to_table(specs, table, data, if_exists='replace')
Writes data to the desired table
Parameters
specs: dictionary Must contain the keys user, password, host and db
table: string A string refering to the table which we want to write to
data: pandas dataframe Data which we want to write to the table
if_exists: string What to do if the table already exists. Possible string values: 'replace', 'append', 'fail'
sixadsml.embeddings
Class for dealing with word embeddings
load_from_text
load_from_text(path)
Reads the word embeddings from a txt documents and saves it as a dictionary
Parameters
path: string path to a txt document containing the word embeddings
Returns
A dictionary where the key values are individual words and the
values are the vectors
tokenize_text
tokenize_text(string_list, max_features, max_len)
Creates a tokenizer from a given text list
Parameters
string_list: list List containing strings
max_features: int The maximum number of unique words that the tokenizer saves in memory
max_len: int The length of the vector into which all elements of string_list will be converted to.
Returns
A tuple of the tokenized text and the fitted tokenizer for future use.
The first element of the tuple is an array of shape (len(*string_list*), max_len)
create_embedding_matrix
create_embedding_matrix(embeddings, tokenizer, max_features, embed_size=300)
Function to create the embedding matrix to use in neural networks. This goes directly to the embedding layer.
Parameters
embeddings: dictionary output of load_from_text() function
tokenizer: keras.Tokenizer object output of tokenize_text() function
max_features: int how many unique tokens to use
embed_size: int how many coordinates does the embedding have; default=300
Returns
A numpy.ndarray of shape (max_features, embed_size)
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file SixAdsDS-0.1.6.tar.gz.
File metadata
- Download URL: SixAdsDS-0.1.6.tar.gz
- Upload date:
- Size: 10.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.12.1 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.6.3 requests-toolbelt/0.8.0 tqdm/4.29.1 CPython/3.6.8
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
8cc61c6c85b0b281c72feb3fb63444c9a906c699c224bd0c6c1ce5c7bfdb85c6
|
|
| MD5 |
4b736eceea971127e3d7d019d946615d
|
|
| BLAKE2b-256 |
e059b4ea1ea62adc0817a80ef9d1ea107aaa197ce3e47744fbb3b491d8c4226f
|
File details
Details for the file SixAdsDS-0.1.6-py3-none-any.whl.
File metadata
- Download URL: SixAdsDS-0.1.6-py3-none-any.whl
- Upload date:
- Size: 12.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.12.1 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.6.3 requests-toolbelt/0.8.0 tqdm/4.29.1 CPython/3.6.8
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
bede90ba5bb26d6a754c35d7820eb648dd5891e9d4c51b56916c98eb8ccf25b6
|
|
| MD5 |
a133982bba14a4bd4adc3ab05d4be0f2
|
|
| BLAKE2b-256 |
78c5636394c3e4b04ffa9cad475c9b376facf78704403b742ae862863dc330dd
|