Project description

Information Retrieval & Text Mining Toolbox

This repository provides core functions for information retrieval and text mining (IRTM). It is under continuous development.

Quick Install using 'pip/pip3' & GitHub

pip install git+https://github.com/KanishkNavale/IRTM-Toolbox.git

Import Module

import numpy as np  # used by the label and chain-matrix examples below
from irtm.toolbox import *

Using Functions

Soundex: A phonetic algorithm for indexing names by sound, as pronounced in English.
```
print(soundex('Muller'))
print(soundex('Mueller'))
```
```
>>> 'M466'
>>> 'M466'
```
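
For readers unfamiliar with the algorithm, the following is a minimal sketch of classic American Soundex. It is not the toolbox's implementation, and `soundex_sketch` is a hypothetical name used only for illustration. Classic Soundex pads with zeros and yields `M460` for both names, so the toolbox output shown above (`M466`) presumably comes from a different variant.

```python
def soundex_sketch(name: str) -> str:
    """Classic American Soundex (illustrative; not the irtm implementation)."""
    # Map each consonant group to its Soundex digit.
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in group:
            codes[ch] = digit

    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""

    first = letters[0]
    encoded = []
    prev = codes.get(first, "")  # code of the previous coded letter
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        digit = codes.get(ch, "")  # vowels map to "" and reset `prev`
        if digit and digit != prev:
            encoded.append(digit)
        prev = digit

    # Pad with zeros and truncate to one letter plus three digits.
    return (first + "".join(encoded) + "000")[:4]


print(soundex_sketch("Muller"), soundex_sketch("Mueller"))
```

Both spellings map to the same code, which is the property that makes the scheme useful for matching names by sound.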

Tokenizer: Converts a sequence of characters into a sequence of tokens.

print(tokenize('LINUX'))
print(tokenize('Text Mining 2021'))

>>> ['linux']
>>> ['text', 'mining']
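
The outputs above suggest that the tokenizer lowercases its input and drops numeric tokens. A minimal sketch that reproduces just that behavior is shown here; `tokenize_sketch` is a hypothetical helper, and the real `tokenize` may apply further steps (for example stop-word removal or stemming) that these two examples do not reveal.

```python
import re


def tokenize_sketch(text: str) -> list:
    """Lowercase the text and keep only purely alphabetic runs (assumed behavior)."""
    return re.findall(r"[a-z]+", text.lower())


print(tokenize_sketch("LINUX"))
print(tokenize_sketch("Text Mining 2021"))
```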

Vectorize: Converts a list of strings into a matrix of token-based weights, one row per string.

vector = vectorize([
        'texts ([string]): a multiline or a single line string.',
        'dict ([list], optional): list of tokens. Defaults to None.',
        'enable_Idf (bool, optional): use IDF or not. Defaults to True.',
        'normalize (str, optional): normalization of vector. Defaults to l2.',
        'max_dim ([int], optional): dimension of vector. Defaults to None.',
        'smooth (bool, optional): restricts value >0. Defaults to True.',
        'weightedTf (bool, optional): Tf = 1+log(Tf). Defaults to True.',
        'return_features (bool, optional): feature vector. Defaults to False.'
        ])

print(f'Vector Shape={vector.shape}')

>>> Vector Shape=(8, 37)
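
To make the options named in the strings above more concrete, here is a rough TF-IDF sketch using a weighted term frequency `1 + log(tf)`, a smoothed IDF, and L2 row normalization. The function name `tfidf_sketch`, the regex tokenizer, and the exact smoothing formula are assumptions made for illustration; the toolbox's `vectorize` may compute its weights differently.

```python
import math
import re
from collections import Counter

import numpy as np


def tfidf_sketch(texts):
    """Return an L2-normalized TF-IDF matrix and its vocabulary (illustrative)."""
    # Assumed tokenization: lowercase alphabetic runs.
    docs = [re.findall(r"[a-z]+", t.lower()) for t in texts]
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: j for j, tok in enumerate(vocab)}
    n_docs = len(docs)

    # Document frequency: number of documents containing each token.
    df = Counter(tok for doc in docs for tok in set(doc))

    matrix = np.zeros((n_docs, len(vocab)))
    for i, doc in enumerate(docs):
        for tok, tf in Counter(doc).items():
            # Smoothed IDF (assumed formula); always strictly positive.
            idf = math.log((1 + n_docs) / (1 + df[tok])) + 1
            # Weighted term frequency: 1 + log(tf).
            matrix[i, index[tok]] = (1 + math.log(tf)) * idf

    # L2-normalize each row; leave all-zero rows untouched.
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return matrix / norms, vocab


matrix, vocab = tfidf_sketch(["text mining", "text retrieval"])
print(vocab, matrix.shape)
```

Each row corresponds to one input string and each column to one token of the vocabulary, which is why eight input strings produce a matrix with eight rows.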

Predict Token Weights: Computes importance of a token based on classification optimization.

dictionary = ['vector', 'string', 'bool']
vector = vectorize([
        'X ([np.array]): vectorized matrix columns arranged as per the dictionary.',
        'y ([labels]): True classification labels.',
        'epochs ([int]): Optimization epochs.',
        'verbose (bool, optional): Enable verbose outputs. Defaults to False.',
        'dict ([type], optional): list of tokens. Defaults to None.'
        ], dict=dictionary)

labels = np.random.randint(2, size=(vector.shape[0], 1))  # random binary labels (0 or 1)
weights = predict_weights(vector, labels, 100, dict=dictionary)

>>> Token-Weights Mappings: {'vector': 0.22097790924850977, 
                             'string': 0.39296369957440075, 
                             'bool': 0.689853175081446}
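
One common way to obtain token importances from a classification objective is to fit a linear classifier and read off its coefficients. The sketch below does this with plain logistic regression trained by gradient descent. This is an assumption about the general idea only: `token_weights_sketch`, its learning rate `lr`, and the choice of logistic regression are hypothetical and not taken from the toolbox's `predict_weights`.

```python
import numpy as np


def token_weights_sketch(X, y, epochs, dictionary, lr=0.5):
    """Fit logistic regression and map each token to its learned coefficient."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        # Predicted probability of the positive class.
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        error = p - y
        # Gradient step on the mean log-loss.
        w -= lr * (X.T @ error) / len(y)
        b -= lr * error.mean()
    return dict(zip(dictionary, w.tolist()))


# Toy data: token 'a' appears only in positive rows, 'b' only in negative rows.
X_toy = [[1, 0], [1, 0], [0, 1], [0, 1]]
y_toy = [1, 1, 0, 0]
print(token_weights_sketch(X_toy, y_toy, 100, ["a", "b"]))
```

In this toy setup the token tied to the positive class receives a positive weight and the other a negative one; with random labels, as in the example above, the resulting weights carry no real meaning.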

Page Rank: Computes the page rank of each node from a chain matrix.

chain_matrix = np.array([[0, 0, 1],
                         [1, 0, 1],
                         [0, 1, 0]])

print(page_rank(chain_matrix))

rank, TPM = page_rank(chain_matrix, return_TransMatrix=True)
print(f'Page Rank: {rank} \nTransition Probability Matrix: \n{TPM}')

>>> [0.0047 0.997  0.0767]
>>> Page Rank: [0.0047 0.997  0.0767] 
    Transition Probability Matrix: 
    [[0.03333333 0.03333333 0.93333333]
     [0.48333333 0.03333333 0.48333333]
     [0.03333333 0.93333333 0.03333333]]
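
The transition probability matrix above is consistent with row-normalizing the chain matrix and mixing in a uniform teleport probability of 0.1. The sketch below builds that matrix and then finds the stationary distribution by power iteration. `page_rank_sketch` and its parameters are hypothetical, and the power-iteration step is an assumption: it returns a probability vector summing to 1, so its rank values need not match the toolbox output shown above.

```python
import numpy as np


def page_rank_sketch(chain_matrix, teleport=0.1, tol=1e-12, max_iter=1000):
    """Return (rank, transition matrix) via power iteration (illustrative)."""
    adjacency = np.asarray(chain_matrix, dtype=float)
    n = adjacency.shape[0]

    # Row-normalize; rows without out-links jump uniformly to any node.
    row_sums = adjacency.sum(axis=1, keepdims=True)
    safe = np.where(row_sums == 0, 1.0, row_sums)
    stochastic = np.where(row_sums == 0, 1.0 / n, adjacency / safe)

    # Mix in uniform teleportation so every entry is positive.
    tpm = (1.0 - teleport) * stochastic + teleport / n

    # Repeatedly apply the matrix until the distribution stops changing.
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        updated = rank @ tpm
        converged = np.abs(updated - rank).sum() < tol
        rank = updated
        if converged:
            break
    return rank, tpm


chain = np.array([[0, 0, 1],
                  [1, 0, 1],
                  [0, 1, 0]])
rank, tpm = page_rank_sketch(chain)
print(rank)
print(tpm)
```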

Release history

0.0.4

Sep 6, 2021

0.0.3

Aug 31, 2021

0.0.2

Aug 29, 2021

0.0.1

Aug 29, 2021

Download files

Source Distribution

irtm-0.0.4.tar.gz (6.3 kB)

Uploaded Sep 6, 2021 Source

Built Distribution

irtm-0.0.4-py3-none-any.whl (5.9 kB)

Uploaded Sep 6, 2021 Python 3

Hashes for irtm-0.0.4.tar.gz
Algorithm	Hash digest
SHA256	`38086b858d3b712d07d08e70816fcba9a84aeb2ac87136ba67e21486e61852a6`
MD5	`3ddcb1c310d0e56506b34c3b2318d5d8`
BLAKE2b-256	`92f234672d84cc281b67fa04dcf76b725f31918dc3b80941e73234bf686d45e2`

Hashes for irtm-0.0.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ea4b123b41b6a53e812e86668b21f71d6e779256111ed8f66088315f368cd4b3`
MD5	`236e5bc86668fa942f7f01b3ea671d7f`
BLAKE2b-256	`1131f3b23d000b644bd511f2225fdf69ba91815dc39278465152d84f9affeaef`