A toolbox for Information Retrieval & Text Mining.
Information Retrieval & Text Mining Toolbox
This repository provides core functions for Information Retrieval and Text Mining (IRTM) processing. It is under continuous development.
Quick Install using 'pip/pip3' & GitHub
pip install git+https://github.com/KanishkNavale/IRTM-Toolbox.git
Import Module
from irtm.toolbox import *
Using Functions
- Soundex: A phonetic algorithm for indexing names by sound, as pronounced in English.
    print(soundex('Muller'))
    print(soundex('Mueller'))
    >>> 'M466'
    >>> 'M466'
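For readers unfamiliar with the technique, here is a minimal sketch of the standard American Soundex rules. It is not the toolbox's implementation: `soundex_sketch` is a hypothetical name, and the codes it produces may differ from those printed by the toolbox's `soundex`, whose exact rules are not documented here.

```python
# Letter groups and their digits in standard American Soundex.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex_sketch(name):
    """Hypothetical helper: standard American Soundex, zero-padded to 4 characters."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = _CODES.get(first, "")  # code of the previously processed letter
    for ch in letters[1:]:
        if ch in "hw":
            continue  # 'h' and 'w' do not separate letters with the same code
        code = _CODES.get(ch, "")  # vowels and unmapped letters get no code
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (first.upper() + "".join(digits) + "000")[:4]


print(soundex_sketch("Robert"), soundex_sketch("Rupert"))  # R163 R163
```

Under these rules, similar-sounding names such as 'Robert' and 'Rupert' share one code, which is the property that makes Soundex useful for name indexing.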
- Tokenizer: Converts a sequence of characters into a sequence of tokens.
    print(tokenize('LINUX'))
    print(tokenize('Text Mining 2021'))
    >>> ['linux']
    >>> ['text', 'mining']
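The outputs above suggest that tokens are lowercased and that digits are dropped. A minimal sketch that reproduces those two examples is given below; `tokenize_sketch` is a hypothetical name, and the toolbox's `tokenize` may apply further steps (for example stop-word removal or stemming) that this sketch does not attempt.

```python
import re


def tokenize_sketch(text):
    """Hypothetical helper: lowercase the text and keep runs of ASCII letters."""
    # Digits, whitespace, and punctuation act as separators and are discarded.
    return re.findall(r"[a-z]+", text.lower())


print(tokenize_sketch("LINUX"))             # ['linux']
print(tokenize_sketch("Text Mining 2021"))  # ['text', 'mining']
```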
- Vectorize: Converts a list of strings into a token-based weight matrix, with one row per input string.
    vector = vectorize([
        'texts ([string]): a multiline or a single line string.',
        'dict ([list], optional): list of tokens. Defaults to None.',
        'enable_Idf (bool, optional): use IDF or not. Defaults to True.',
        'normalize (str, optional): normalization of vector. Defaults to l2.',
        'max_dim ([int], optional): dimension of vector. Defaults to None.',
        'smooth (bool, optional): restricts value >0. Defaults to True.',
        'weightedTf (bool, optional): Tf = 1+log(Tf). Defaults to True.',
        'return_features (bool, optional): feature vector. Defaults to False.'
    ])
    print(f'Vector Shape={vector.shape}')
    >>> Vector Shape=(8, 37)
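The parameter descriptions in the example (IDF weighting, `Tf = 1+log(Tf)`, l2 normalization) point to a TF-IDF scheme. The sketch below shows one common way to compute such weights in plain Python. It is an assumption-laden illustration, not the toolbox's code: `tfidf_sketch` and its parameter names are hypothetical, the tokenizer is a simple letters-only split, and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` is borrowed from a widely used convention rather than taken from the toolbox.

```python
import math
import re
from collections import Counter


def tfidf_sketch(texts, vocab=None, use_idf=True, sublinear_tf=True):
    """Hypothetical helper: return (rows, vocab) with l2-normalized TF-IDF rows."""
    docs = [re.findall(r"[a-z]+", t.lower()) for t in texts]
    if vocab is None:
        vocab = sorted({tok for doc in docs for tok in doc})
    # Document frequency: in how many texts each token occurs.
    df = Counter(tok for doc in docs for tok in set(doc))
    n_docs = len(docs)
    rows = []
    for doc in docs:
        counts = Counter(doc)
        row = []
        for tok in vocab:
            tf = float(counts[tok])
            if tf > 0 and sublinear_tf:
                tf = 1.0 + math.log(tf)  # Tf = 1 + log(Tf)
            # Assumed smoothed IDF; the added 1 keeps every weight positive.
            idf = math.log((1 + n_docs) / (1 + df[tok])) + 1.0 if use_idf else 1.0
            row.append(tf * idf)
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # guard empty rows
        rows.append([v / norm for v in row])
    return rows, vocab


rows, vocab = tfidf_sketch(["text mining", "text retrieval"])
print(vocab)                     # ['mining', 'retrieval', 'text']
print(len(rows), len(rows[0]))   # 2 3
```

In this small example 'mining' appears in only one text while 'text' appears in both, so 'mining' receives the larger weight in the first row.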
- Predict Token Weights: Computes the importance of each token by optimizing a classification objective.
    import numpy as np

    dictionary = ['vector', 'string', 'bool']
    vector = vectorize([
        'X ([np.array]): vectorized matrix columns arraged as per the dictionary.',
        'y ([labels]): True classification labels.',
        'epochs ([int]): Optimization epochs.',
        'verbose (bool, optional): Enable verbose outputs. Defaults to False.',
        'dict ([type], optional): list of tokens. Defaults to None.'
    ], dict=dictionary)
    # Random binary labels (0 or 1), one per row of the vectorized matrix.
    labels = np.random.randint(2, size=(vector.shape[0], 1))
    weights = predict_weights(vector, labels, 100, dict=dictionary)
    >>> Token-Weights Mappings: {'vector': 0.22097790924850977, 'string': 0.39296369957440075, 'bool': 0.689853175081446}
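The description does not say which classifier or optimizer is used. As one plausible reading, the sketch below fits a logistic regression by batch gradient descent and reports the learned coefficient of each token. Everything here is an assumption made for illustration: `predict_weights_sketch`, its `lr` parameter, and the choice of logistic regression are hypothetical and are not taken from the toolbox.

```python
import math


def predict_weights_sketch(X, y, epochs, vocab, lr=0.5):
    """Hypothetical helper: map each token to its logistic-regression coefficient.

    X is a list of rows whose columns follow the order of `vocab`;
    y is a list of 0/1 labels, one per row.
    """
    w = [0.0] * len(vocab)
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(vocab)
        grad_b = 0.0
        for row, label in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            err = p - label                 # gradient of the log-loss w.r.t. z
            for j, xj in enumerate(row):
                grad_w[j] += err * xj
            grad_b += err
        # Average the gradients over the batch and take one descent step.
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return dict(zip(vocab, w))


# Toy data: token 'a' occurs only in class 1, token 'b' only in class 0.
X = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
y = [1, 1, 0, 0]
print(predict_weights_sketch(X, y, epochs=100, vocab=["a", "b"]))
```

With this toy data the coefficient for 'a' comes out positive and the one for 'b' negative, reflecting which class each token indicates.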