A toolbox for Information Retrieval & Text Mining.
Information Retrieval & Text Mining Toolbox
This repository holds core functions for Information Retrieval & Text Mining (IRTM) processing. It is under continuous development.
Quick Install using 'pip/pip3' & GitHub
pip install git+https://github.com/KanishkNavale/IRTM-Toolbox.git
Import Module
from irtm.toolbox import *
Using Functions
- Soundex: A phonetic algorithm for indexing names by sound, as pronounced in English.
print(soundex('Muller'))
print(soundex('Mueller'))
>>> 'M466'
>>> 'M466'
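For readers unfamiliar with the idea, here is a minimal sketch of the classic American Soundex scheme (first letter plus three digits). The function name `soundex_sketch` is hypothetical and this is not the toolbox's implementation: the classic variant pads with zeros and returns 'M460' for both names, whereas the toolbox prints 'M466', so the toolbox evidently follows a different convention.

```python
def soundex_sketch(name: str) -> str:
    """Classic American Soundex (hypothetical helper, not the toolbox's code)."""
    # Map each consonant group to its Soundex digit.
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for letter in group:
            codes[letter] = digit

    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0]
    previous = codes.get(letters[0], "")
    for letter in letters[1:]:
        digit = codes.get(letter, "")
        # Emit a digit only when it differs from the previous code.
        if digit and digit != previous:
            result += digit
        # H and W do not separate equal codes; vowels do.
        if letter not in "HW":
            previous = digit
    # Pad with zeros and keep four characters.
    return (result + "000")[:4]


print(soundex_sketch("Muller"))
print(soundex_sketch("Mueller"))
```

Both names map to the same code, which is the property a phonetic index relies on.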
- Tokenizer: Converts a sequence of characters into a sequence of tokens.
print(tokenize('LINUX'))
print(tokenize('Text Mining 2021'))
>>> ['linux']
>>> ['text', 'mining']
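The outputs above suggest that tokens are lowercased and that digits are dropped. A minimal sketch that reproduces that behaviour with a regular expression follows; `tokenize_sketch` is a hypothetical name, and the toolbox's own tokenizer may apply further steps (for example stop-word removal or stemming) that this sketch does not.

```python
import re
from typing import List


def tokenize_sketch(text: str) -> List[str]:
    """Lowercase the text and keep runs of ASCII letters only (assumed behaviour)."""
    return re.findall(r"[a-z]+", text.lower())


print(tokenize_sketch("LINUX"))
print(tokenize_sketch("Text Mining 2021"))
```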
- Vectorize: Converts strings into a token-based weight tensor.
vector = vectorize([
    'texts ([string]): a multiline or a single line string.',
    'dict ([list], optional): list of tokens. Defaults to None.',
    'enable_Idf (bool, optional): use IDF or not. Defaults to True.',
    'normalize (str, optional): normalization of vector. Defaults to l2.',
    'max_dim ([int], optional): dimension of vector. Defaults to None.',
    'smooth (bool, optional): restricts value >0. Defaults to True.',
    'weightedTf (bool, optional): Tf = 1+log(Tf). Defaults to True.',
    'return_features (bool, optional): feature vector. Defaults to False.'
])
print(f'Vector Shape={vector.shape}')
>>> Vector Shape=(8, 37)
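The option names above (IDF, l2 normalization, smoothing, Tf = 1+log(Tf)) point to a TF-IDF weighting. The sketch below shows one way those pieces can fit together in plain Python. The function `tfidf_sketch`, its simple tokenizer, and the smoothed IDF formula `log((1 + N) / (1 + df)) + 1` are assumptions made for illustration; the toolbox may compute its weights differently.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def tfidf_sketch(texts: List[str]) -> Tuple[List[List[float]], List[str]]:
    """Return (matrix, vocabulary): one l2-normalized TF-IDF row per text."""
    docs = [re.findall(r"[a-z]+", text.lower()) for text in texts]
    vocab = sorted({token for doc in docs for token in doc})
    n_docs = len(docs)

    # Document frequency: in how many texts each token occurs.
    df = Counter(token for doc in docs for token in set(doc))
    # Smoothed IDF (assumed formula); always greater than zero.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    matrix = []
    for doc in docs:
        counts = Counter(doc)
        # Weighted term frequency: 1 + log(tf) for tokens that occur.
        row = [(1 + math.log(counts[t])) * idf[t] if counts[t] else 0.0
               for t in vocab]
        # l2 normalization; empty rows are left as zeros.
        norm = math.sqrt(sum(v * v for v in row))
        matrix.append([v / norm for v in row] if norm else row)
    return matrix, vocab


matrix, vocab = tfidf_sketch(["text mining", "text retrieval"])
print(vocab)
print(len(matrix), len(matrix[0]))
```

In this toy input, 'text' occurs in both documents and therefore receives a lower weight than 'mining' or 'retrieval', which is the effect IDF is meant to have.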
- Predict Token Weights: Computes importance of a token based on classification optimization.
import numpy as np

dictionary = ['vector', 'string', 'bool']
vector = vectorize([
    'X ([np.array]): vectorized matrix columns arranged as per the dictionary.',
    'y ([labels]): True classification labels.',
    'epochs ([int]): Optimization epochs.',
    'verbose (bool, optional): Enable verbose outputs. Defaults to False.',
    'dict ([type], optional): list of tokens. Defaults to None.'
], dict=dictionary)
# Random binary labels (0 or 1), for demonstration only.
# Note: randint(1, ...) would return only zeros because the upper bound is exclusive.
labels = np.random.randint(2, size=(vector.shape[0], 1))
weights = predict_weights(vector, labels, 100, dict=dictionary)
>>> Token-Weights Mappings: {'vector': 0.22097790924850977, 'string': 0.39296369957440075, 'bool': 0.689853175081446}
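The README does not say which optimization `predict_weights` runs. One plausible reading, offered purely as an assumption, is a logistic-regression classifier trained by gradient descent whose learned coefficients are read as token weights. The sketch below illustrates that idea; `predict_weights_sketch`, the learning rate `lr`, and the toy data are all hypothetical.

```python
import math
from typing import Dict, List


def predict_weights_sketch(X: List[List[float]], y: List[int], epochs: int,
                           tokens: List[str], lr: float = 0.5) -> Dict[str, float]:
    """Fit logistic regression by batch gradient descent (assumed method)."""
    w = [0.0] * len(tokens)
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(tokens)
        grad_b = 0.0
        for row, label in zip(X, y):
            z = b + sum(wi * xi for wi, xi in zip(w, row))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            err = p - label
            for j, xj in enumerate(row):
                grad_w[j] += err * xj
            grad_b += err
        # Average the gradient over the samples and take one step.
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return dict(zip(tokens, w))


# Toy data: columns follow the token order; 'vector' marks class 1,
# 'string' marks class 0, and 'bool' appears equally in both classes.
X = [[1, 0, 0], [1, 0, 1], [0, 1, 0], [0, 1, 1]]
y = [1, 1, 0, 0]
print(predict_weights_sketch(X, y, 200, ["vector", "string", "bool"]))
```

With this toy data the coefficient for 'vector' comes out positive, the one for 'string' negative, and the one for 'bool' stays near zero, since 'bool' carries no information about the class.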
- Page Rank: Computes page rank from a chain matrix.
import numpy as np

chain_matrix = np.array([[0, 0, 1], [1, 0, 1], [0, 1, 0]])
print(page_rank(chain_matrix))
rank, TPM = page_rank(chain_matrix, return_TransMatrix=True)
print(f'Page Rank: {rank} \nTransition Probability Matrix: \n{TPM}')
>>> [0.0047 0.997 0.0767]
>>> Page Rank: [0.0047 0.997 0.0767]
Transition Probability Matrix:
[[0.03333333 0.03333333 0.93333333]
 [0.48333333 0.03333333 0.48333333]
 [0.03333333 0.93333333 0.03333333]]
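The printed transition matrix is consistent with row-normalizing the chain matrix and mixing in uniform teleportation with probability 0.1 (for example, 0.9 * 1 + 0.1 / 3 = 0.9333). The sketch below rebuilds that matrix and then runs a textbook power iteration. The names `transition_matrix` and `page_rank_sketch`, the teleport value (inferred from the printed numbers), and the handling of all-zero rows are assumptions. The power iteration returns a probability distribution that sums to 1, so its values and scaling can differ from the rank vector the toolbox prints.

```python
from typing import List


def transition_matrix(chain: List[List[float]], teleport: float = 0.1) -> List[List[float]]:
    """Row-normalize the chain matrix and add uniform teleportation."""
    n = len(chain)
    matrix = []
    for row in chain:
        total = sum(row)
        if total == 0:
            # Assumed handling of a page without outgoing links: jump anywhere.
            matrix.append([1.0 / n] * n)
        else:
            matrix.append([(1 - teleport) * v / total + teleport / n for v in row])
    return matrix


def page_rank_sketch(chain: List[List[float]], teleport: float = 0.1,
                     iterations: int = 100) -> List[float]:
    """Power iteration: repeatedly apply rank <- rank * P from a uniform start."""
    P = transition_matrix(chain, teleport)
    n = len(P)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        rank = [sum(rank[i] * P[i][j] for i in range(n)) for j in range(n)]
    return rank


chain = [[0, 0, 1], [1, 0, 1], [0, 1, 0]]
for row in transition_matrix(chain):
    print([round(v, 8) for v in row])
print(page_rank_sketch(chain))
```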
File details
Details for the file irtm-0.0.4.tar.gz.
File metadata
- Download URL: irtm-0.0.4.tar.gz
- Upload date:
- Size: 6.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.2 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 38086b858d3b712d07d08e70816fcba9a84aeb2ac87136ba67e21486e61852a6 |
| MD5 | 3ddcb1c310d0e56506b34c3b2318d5d8 |
| BLAKE2b-256 | 92f234672d84cc281b67fa04dcf76b725f31918dc3b80941e73234bf686d45e2 |
File details
Details for the file irtm-0.0.4-py3-none-any.whl.
File metadata
- Download URL: irtm-0.0.4-py3-none-any.whl
- Upload date:
- Size: 5.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.2 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ea4b123b41b6a53e812e86668b21f71d6e779256111ed8f66088315f368cd4b3 |
| MD5 | 236e5bc86668fa942f7f01b3ea671d7f |
| BLAKE2b-256 | 1131f3b23d000b644bd511f2225fdf69ba91815dc39278465152d84f9affeaef |