
A toolbox for Information Retrieval & Text Mining.


Information Retrieval & Text Mining Toolbox

This repository provides core functions for Information Retrieval & Text Mining (IRTM) processing. It is under continuous development.

Quick Install using 'pip/pip3' & GitHub

pip install git+https://github.com/KanishkNavale/IRTM-Toolbox.git

Import Module

import numpy as np  # used by examples 4 and 5
from irtm.toolbox import *

Using Functions

  1. Soundex: A phonetic algorithm for indexing names by sound, as pronounced in English.

    print(soundex('Muller'))
    print(soundex('Mueller'))
    
    >>> 'M466'
    >>> 'M466'
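
To illustrate the idea behind phonetic indexing, here is a minimal sketch of the classic American Soundex rules (first letter plus three digits). The function name `classic_soundex` is hypothetical and is not part of the toolbox; the toolbox's own variant may encode some names differently.

```python
def classic_soundex(name: str) -> str:
    """Classic American Soundex: first letter followed by three digits."""
    # Map each consonant to its Soundex digit; vowels, h, w and y have no code.
    codes = {}
    for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in group:
            codes[ch] = digit

    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""

    encoded = letters[0].upper()
    previous = codes.get(letters[0], "")
    for ch in letters[1:]:
        digit = codes.get(ch, "")
        # Emit a digit only when it differs from the previous code.
        if digit and digit != previous:
            encoded += digit
        # 'h' and 'w' do not separate equal codes; vowels do.
        if ch not in "hw":
            previous = digit
    # Pad with zeros and truncate to four characters.
    return (encoded + "000")[:4]


print(classic_soundex("Robert"), classic_soundex("Rupert"))  # R163 R163
```

Names that sound alike, such as 'Robert' and 'Rupert', collapse to the same code, which is what makes the code usable as an index key.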
    
  2. Tokenizer: Converts a sequence of characters into a sequence of tokens.

    print(tokenize('LINUX'))
    print(tokenize('Text Mining 2021'))
    
    >>> ['linux']
    >>> ['text', 'mining']
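
The behaviour shown above (lower-casing and dropping the number) can be approximated with a single regular expression. This is only a sketch under that assumption; `simple_tokenize` is a hypothetical name, and the toolbox's `tokenize` may apply further rules that the two examples do not reveal.

```python
import re
from typing import List


def simple_tokenize(text: str) -> List[str]:
    """Lower-case the text and keep only runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


print(simple_tokenize("LINUX"))             # ['linux']
print(simple_tokenize("Text Mining 2021"))  # ['text', 'mining']
```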
    
  3. Vectorize: Converts a list of strings into a token-based weight matrix, one row per input string.

    vector = vectorize([
            'texts ([string]): a multiline or a single line string.',
            'dict ([list], optional): list of tokens. Defaults to None.',
            'enable_Idf (bool, optional): use IDF or not. Defaults to True.',
            'normalize (str, optional): normalization of vector. Defaults to l2.',
            'max_dim ([int], optional): dimension of vector. Defaults to None.',
            'smooth (bool, optional): restricts value >0. Defaults to True.',
            'weightedTf (bool, optional): Tf = 1+log(Tf). Defaults to True.',
            'return_features (bool, optional): feature vector. Defaults to False.'
            ])
    
    print(f'Vector Shape={vector.shape}')
    
    >>> Vector Shape=(8, 37)
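
The options listed in the example strings (weighted Tf, IDF, smoothing, l2 normalization) describe a TF-IDF weighting. The sketch below shows one common way to combine them, using plain Python lists so it runs without dependencies. The name `tfidf_sketch`, its parameters, and the smoothed IDF formula `log((1 + n) / (1 + df)) + 1` are assumptions for illustration, not the toolbox's confirmed implementation.

```python
import math
import re
from collections import Counter


def tfidf_sketch(texts, weighted_tf=True, smooth=True):
    """Return (rows, vocab): one l2-normalized TF-IDF row per input text."""
    # Count tokens per document (letters only, lower-cased).
    docs = [Counter(re.findall(r"[a-z]+", t.lower())) for t in texts]
    vocab = sorted(set().union(*docs))
    n = len(docs)

    # Inverse document frequency per token.
    idf = {}
    for tok in vocab:
        df = sum(1 for doc in docs if tok in doc)
        if smooth:
            idf[tok] = math.log((1 + n) / (1 + df)) + 1  # always > 0
        else:
            idf[tok] = math.log(n / df)

    rows = []
    for doc in docs:
        row = []
        for tok in vocab:
            count = doc[tok]
            if count == 0:
                tf = 0.0
            elif weighted_tf:
                tf = 1 + math.log(count)  # Tf = 1 + log(Tf)
            else:
                tf = float(count)
            row.append(tf * idf[tok])
        # l2 normalization; an all-zero row is left unchanged.
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return rows, vocab


rows, vocab = tfidf_sketch(["text mining", "text retrieval"])
print(vocab)  # ['mining', 'retrieval', 'text']
```

A token that appears in fewer documents ('mining') receives a larger weight than one that appears in all of them ('text').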
    
  4. Predict Token Weights: Computes the importance of each token by optimizing a classifier on labelled data.

    dictionary = ['vector', 'string', 'bool']
    vector = vectorize([
            'X ([np.array]): vectorized matrix columns arranged as per the dictionary.',
            'y ([labels]): True classification labels.',
            'epochs ([int]): Optimization epochs.',
            'verbose (bool, optional): Enable verbose outputs. Defaults to False.',
            'dict ([type], optional): list of tokens. Defaults to None.'
            ], dict=dictionary)
    
    labels = np.random.randint(2, size=(vector.shape[0], 1))  # random 0/1 labels, for illustration only
    weights = predict_weights(vector, labels, 100, dict=dictionary)
    
    >>> Token-Weights Mappings: {'vector': 0.22097790924850977, 
                                 'string': 0.39296369957440075, 
                                 'bool': 0.689853175081446}
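
The source does not say which classifier or optimizer `predict_weights` uses. As one possible reading of "classification optimization", the sketch below fits a logistic regression by gradient descent and reports the learned coefficient of each token. The name `token_weights_sketch`, the learning rate, and the choice of logistic regression are all assumptions.

```python
import math


def token_weights_sketch(X, y, epochs=100, lr=0.5, vocab=None):
    """Fit logistic regression by batch gradient descent; return token -> weight."""
    n_samples, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, row)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            err = p - label
            for j, xj in enumerate(row):
                grad_w[j] += err * xj
            grad_b += err
        # Step against the mean gradient of the log-loss.
        w = [wj - lr * g / n_samples for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n_samples
    names = vocab if vocab is not None else range(n_features)
    return dict(zip(names, w))


# Toy data: column 0 occurs only in class 1, column 1 only in class 0.
X = [[1, 0], [1, 0], [0, 1], [0, 1]]
y = [1, 1, 0, 0]
print(token_weights_sketch(X, y, epochs=200, vocab=["good", "bad"]))
```

In this toy setup the token tied to class 1 ends up with a positive weight and the other with a negative one; real inputs would be the rows returned by the vectorizer.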
    
  5. Page Rank: Computes the page rank of each node from a chain matrix.

    chain_matrix = np.array([[0, 0, 1],
                             [1, 0, 1],
                             [0, 1, 0]])
    
    print(page_rank(chain_matrix))
    
    rank, TPM = page_rank(chain_matrix, return_TransMatrix=True)
    print(f'Page Rank: {rank} \nTransition Probability Matrix: \n{TPM}')
    
    >>> [0.0047 0.997  0.0767]
    >>> Page Rank: [0.0047 0.997  0.0767] 
        Transition Probability Matrix: 
        [[0.03333333 0.03333333 0.93333333]
         [0.48333333 0.03333333 0.48333333]
         [0.03333333 0.93333333 0.03333333]]
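
The printed transition matrix is consistent with a teleport probability of 0.1: each row of the chain matrix is scaled to sum to 1, multiplied by 0.9, and 0.1/3 is added to every entry. The sketch below rebuilds that matrix and estimates the ranks by power iteration. The names `transition_matrix` and `page_rank_sketch`, the teleport value, and the iteration count are assumptions inferred from the output above. The sketch returns a probability distribution, so its scores need not match the toolbox's, which appear to be scaled differently.

```python
def transition_matrix(chain, teleport=0.1):
    """Row-normalize the chain matrix and mix in uniform teleportation."""
    n = len(chain)
    matrix = []
    for row in chain:
        out = sum(row)
        if out == 0:
            # A node with no outgoing links jumps anywhere with equal chance.
            matrix.append([1.0 / n] * n)
        else:
            matrix.append([teleport / n + (1 - teleport) * v / out for v in row])
    return matrix


def page_rank_sketch(chain, teleport=0.1, iterations=100):
    """Power iteration: repeatedly apply rank <- rank * P from a uniform start."""
    P = transition_matrix(chain, teleport)
    n = len(P)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        rank = [sum(rank[i] * P[i][j] for i in range(n)) for j in range(n)]
    return rank, P


chain = [[0, 0, 1],
         [1, 0, 1],
         [0, 1, 0]]
rank, P = page_rank_sketch(chain)
for row in P:
    print([round(v, 8) for v in row])
```

Because every row of the transition matrix sums to 1, the rank vector stays a probability distribution throughout the iteration.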
    

