A toolbox for Information Retrieval & Text Mining.
Information Retrieval & Text Mining Toolbox
This repository holds core functions for Information Retrieval & Text Mining (IRTM) processing. It is under continuous development.
Quick Install using 'pip/pip3' & GitHub
pip install git+https://github.com/KanishkNavale/IRTM-Toolbox.git
Import Module
from irtm.toolbox import *
Using Functions
- Soundex: A phonetic algorithm for indexing names by sound, as pronounced in English.
print(soundex('Muller'))
print(soundex('Mueller'))
>>> 'M466'
>>> 'M466'
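To illustrate how a phonetic code makes similar-sounding names collide, here is a minimal sketch of the classic American Soundex rules. It is not the toolbox's implementation: the name `soundex_sketch` is hypothetical, and this variant pads with zeros, so it returns 'M460' for both names rather than the 'M466' shown above.

```python
# Letter-to-digit groups used by classic American Soundex.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex_sketch(name: str) -> str:
    """Return a 4-character classic Soundex code (hypothetical helper)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()          # the first letter is kept as-is
    prev = _CODES.get(letters[0], "")  # digit of the previously seen letter
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:    # skip repeats of the same digit
            code += digit
        # 'h' and 'w' do not separate equal digits; vowels do.
        if ch not in "hw":
            prev = digit
    return (code + "000")[:4]          # pad with zeros, truncate to 4


print(soundex_sketch("Muller"))   # M460
print(soundex_sketch("Mueller"))  # M460
```

Both spellings map to the same code, which is the property that makes the code usable as an index key for names.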
- Tokenizer: Converts a sequence of characters into a sequence of tokens.
print(tokenize('LINUX'))
print(tokenize('Text Mining 2021'))
>>> ['linux']
>>> ['text', 'mining']
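The outputs above suggest that tokens are lowercased and that digits are dropped. A minimal sketch that reproduces those two behaviours with a regular expression follows; `tokenize_sketch` is a hypothetical name, and the toolbox may apply further rules (for example stop-word removal or stemming) that are not shown here.

```python
import re
from typing import List


def tokenize_sketch(text: str) -> List[str]:
    """Lowercase the text and keep only runs of ASCII letters (assumed rule)."""
    return re.findall(r"[a-z]+", text.lower())


print(tokenize_sketch("LINUX"))             # ['linux']
print(tokenize_sketch("Text Mining 2021"))  # ['text', 'mining']
```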
- Vectorize: Converts a list of strings into a token-based weight tensor.
vector = vectorize([
    'texts ([string]): a multiline or a single line string.',
    'dict ([list], optional): list of tokens. Defaults to None.',
    'enable_Idf (bool, optional): use IDF or not. Defaults to True.',
    'normalize (str, optional): normalization of vector. Defaults to l2.',
    'max_dim ([int], optional): dimension of vector. Defaults to None.',
    'smooth (bool, optional): restricts value >0. Defaults to True.',
    'weightedTf (bool, optional): Tf = 1+log(Tf). Defaults to True.',
    'return_features (bool, optional): feature vector. Defaults to False.'
])
print(f'Vector Shape={vector.shape}')
>>> Vector Shape=(8, 37)
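The options named in the example strings (IDF weighting, Tf = 1+log(Tf), l2 normalization) describe a TF-IDF style vectorizer. The sketch below shows one way those pieces fit together using only the standard library. It is an illustration under assumptions, not the toolbox's code: `tfidf_sketch` is a hypothetical name, and the smoothed IDF formula `log((1 + N) / (1 + df)) + 1` is an assumed choice.

```python
import math
import re
from typing import List, Tuple


def tfidf_sketch(texts: List[str]) -> Tuple[List[List[float]], List[str]]:
    """Return (l2-normalized TF-IDF rows, sorted vocabulary)."""
    docs = [re.findall(r"[a-z]+", t.lower()) for t in texts]
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of texts that contain each token.
    df = {tok: sum(tok in doc for doc in docs) for tok in vocab}
    rows = []
    for doc in docs:
        row = []
        for tok in vocab:
            count = doc.count(tok)
            # Weighted term frequency: 1 + log(tf) when the token occurs.
            tf = 1.0 + math.log(count) if count else 0.0
            # Smoothed inverse document frequency (assumed formula).
            idf = math.log((1.0 + n_docs) / (1.0 + df[tok])) + 1.0
            row.append(tf * idf)
        # l2 normalization; all-zero rows are left unchanged.
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return rows, vocab


matrix, vocab = tfidf_sketch(["text mining", "text retrieval"])
print(vocab)  # ['mining', 'retrieval', 'text']
print(f"Shape=({len(matrix)}, {len(matrix[0])})")  # Shape=(2, 3)
```

Each row corresponds to one input string and each column to one vocabulary token, which matches the (texts, tokens) shape printed above.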
- Predict Token Weights: Computes the importance of each token by optimizing a classifier.
import numpy as np

dictionary = ['vector', 'string', 'bool']
vector = vectorize([
    'X ([np.array]): vectorized matrix columns arraged as per the dictionary.',
    'y ([labels]): True classification labels.',
    'epochs ([int]): Optimization epochs.',
    'verbose (bool, optional): Enable verbose outputs. Defaults to False.',
    'dict ([type], optional): list of tokens. Defaults to None.'
], dict=dictionary)
# Note: randint(1, ...) draws from [0, 1), so every label here is 0.
labels = np.random.randint(1, size=(vector.shape[0], 1))
weights = predict_weights(vector, labels, 100, dict=dictionary)
>>> Token-Weights Mappings: {'vector': 0.22097790924850977, 'string': 0.39296369957440075, 'bool': 0.689853175081446}
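The README does not say which classifier or optimizer `predict_weights` uses. As one possible reading of "classification optimization", the sketch below fits a logistic regression by batch gradient descent and reports the learned coefficient of each token as its weight. The name `token_weights_sketch`, the learning rate, and the choice of logistic regression are all assumptions made for illustration.

```python
import math
from typing import Dict, List


def token_weights_sketch(X: List[List[float]], y: List[int], epochs: int,
                         tokens: List[str], lr: float = 0.5) -> Dict[str, float]:
    """Fit logistic regression by gradient descent; return token -> coefficient."""
    weights = [0.0] * len(tokens)
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(tokens)
        grad_b = 0.0
        for row, label in zip(X, y):
            z = bias + sum(w * v for w, v in zip(weights, row))
            pred = 1.0 / (1.0 + math.exp(-z))  # sigmoid
            err = pred - label                 # gradient of the log-loss w.r.t. z
            for j, v in enumerate(row):
                grad_w[j] += err * v
            grad_b += err
        # Average the gradients over the batch and take one descent step.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return dict(zip(tokens, weights))


# Toy data: column 0 marks class 1, column 1 marks class 0.
X = [[1, 0], [1, 0], [0, 1], [0, 1]]
y = [1, 1, 0, 0]
print(token_weights_sketch(X, y, 200, ["spam", "ham"]))
```

In this toy run the token that co-occurs with label 1 receives a positive weight and the other a negative one. The toolbox's own weights may be scaled or signed differently.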
- Page Rank: Computes the page rank from a chain matrix.
import numpy as np

chain_matrix = np.array([[0, 0, 1], [1, 0, 1], [0, 1, 0]])
print(page_rank(chain_matrix))
rank, TPM = page_rank(chain_matrix, return_TransMatrix=True)
print(f'Page Rank: {rank} \nTransition Probability Matrix: \n{TPM}')
>>> [0.0047 0.997 0.0767]
>>> Page Rank: [0.0047 0.997 0.0767]
Transition Probability Matrix:
[[0.03333333 0.03333333 0.93333333]
 [0.48333333 0.03333333 0.48333333]
 [0.03333333 0.93333333 0.03333333]]
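The transition matrix printed above is consistent with row-normalizing the chain matrix and mixing in a teleport probability of 0.1, for example 0.9 * 1 + 0.1 / 3 = 0.9333. The sketch below builds that matrix and runs power iteration on it, using plain lists so it runs without dependencies. `page_rank_sketch` is a hypothetical name, the teleport value is inferred from the printed matrix, and the sketch returns a rank vector that sums to 1, which differs from the scaling of the toolbox's rank output.

```python
from typing import List, Tuple


def page_rank_sketch(chain: List[List[float]], teleport: float = 0.1,
                     iterations: int = 100) -> Tuple[List[float], List[List[float]]]:
    """Return (rank vector, transition probability matrix)."""
    n = len(chain)
    tpm = []
    for row in chain:
        total = sum(row)
        if total == 0:
            probs = [1.0 / n] * n            # dangling node: jump anywhere
        else:
            probs = [v / total for v in row]  # row-normalize outgoing links
        # Mix link-following with a uniform teleport step.
        tpm.append([(1.0 - teleport) * p + teleport / n for p in probs])
    rank = [1.0 / n] * n                      # start from the uniform distribution
    for _ in range(iterations):
        # One power-iteration step: rank <- rank @ tpm
        rank = [sum(rank[i] * tpm[i][j] for i in range(n)) for j in range(n)]
    return rank, tpm


rank, tpm = page_rank_sketch([[0, 0, 1], [1, 0, 1], [0, 1, 0]])
print([round(p, 4) for p in tpm[0]])  # [0.0333, 0.0333, 0.9333]
```

The first row of `tpm` matches the first row of the transition probability matrix shown above.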