A toolbox for Information Retrieval & Text Mining.

Information Retrieval & Text Mining Toolbox

This repository provides core functions for Information Retrieval & Text Mining (IRTM). It is under continuous development.

Quick Install using 'pip/pip3' & GitHub

pip install git+https://github.com/KanishkNavale/IRTM-Toolbox.git

Import Module

import numpy as np  # the examples below call NumPy directly
from irtm.toolbox import *

Using Functions

  1. Soundex: A phonetic algorithm for indexing names by sound, as pronounced in English.

    print(soundex('Muller'))
    print(soundex('Mueller'))
    
    >>> 'M466'
    >>> 'M466'
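
For readers who want to see how a phonetic code of this kind is built, here is a minimal sketch of the standard American Soundex rules. It is not the toolbox's implementation: `soundex_sketch` and `CODES` are hypothetical names, and the codes it produces may differ from the toolbox's output (for 'Muller' this variant yields 'M460').

```python
# Standard American Soundex; an illustrative sketch, not the toolbox's code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex_sketch(name):
    """Return the 4-character Soundex code of `name` (hypothetical helper)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]                  # the first letter is kept as-is
    previous = CODES.get(letters[0], "")  # code of the last relevant letter
    for letter in letters[1:]:
        digit = CODES.get(letter, "")
        if digit and digit != previous:
            encoded += digit
        # H and W do not separate equal codes; vowels do.
        if letter not in "HW":
            previous = digit
    return (encoded + "000")[:4]          # pad with zeros, truncate to 4


print(soundex_sketch("Muller"))   # M460
print(soundex_sketch("Mueller"))  # M460
```

Both spellings map to the same code, which is the property that makes Soundex useful for indexing names by sound.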
    
  2. Tokenizer: Converts a sequence of characters into a sequence of tokens.

    print(tokenize('LINUX'))
    print(tokenize('Text Mining 2021'))
    
    >>> ['linux']
    >>> ['text', 'mining']
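
The outputs above suggest that tokens are lowercased and that digits are dropped. A minimal sketch of that behavior, assuming a simple "runs of letters" rule, could look like the following. `tokenize_sketch` is a hypothetical name, and the toolbox may apply further steps that are not shown here.

```python
import re


def tokenize_sketch(text):
    """Lowercase `text` and keep maximal runs of ASCII letters (assumed rule)."""
    # Digits, punctuation and whitespace act as separators and are discarded.
    return re.findall(r"[a-z]+", text.lower())


print(tokenize_sketch("LINUX"))             # ['linux']
print(tokenize_sketch("Text Mining 2021"))  # ['text', 'mining']
```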
    
  3. Vectorize: Converts a list of strings into a token-based weight matrix with one row per string.

    vector = vectorize([
            'texts ([string]): a multiline or a single line string.',
            'dict ([list], optional): list of tokens. Defaults to None.',
            'enable_Idf (bool, optional): use IDF or not. Defaults to True.',
            'normalize (str, optional): normalization of vector. Defaults to l2.',
            'max_dim ([int], optional): dimension of vector. Defaults to None.',
            'smooth (bool, optional): restricts value >0. Defaults to True.',
            'weightedTf (bool, optional): Tf = 1+log(Tf). Defaults to True.',
            'return_features (bool, optional): feature vector. Defaults to False.'
            ])
    
    print(f'Vector Shape={vector.shape}')
    
    >>> Vector Shape=(8, 37)
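
The option names in the example (IDF, 'l2' normalization, Tf = 1+log(Tf), smoothing) describe a TF-IDF style weighting. The sketch below shows one common way to compute it with NumPy. It is an assumption-laden illustration rather than the toolbox's code: `tfidf_sketch` is a hypothetical name, and the smoothed IDF formula used here is one widely used convention that may not match the toolbox's `smooth` option.

```python
import re

import numpy as np


def tfidf_sketch(texts, vocabulary=None):
    """Return (matrix, vocabulary) with one L2-normalized TF-IDF row per text."""
    docs = [re.findall(r"[a-z]+", t.lower()) for t in texts]
    if vocabulary is None:
        vocabulary = sorted({tok for doc in docs for tok in doc})
    index = {tok: j for j, tok in enumerate(vocabulary)}

    # Raw term counts; tokens outside the vocabulary are ignored.
    tf = np.zeros((len(docs), len(vocabulary)))
    for i, doc in enumerate(docs):
        for tok in doc:
            if tok in index:
                tf[i, index[tok]] += 1.0

    # Weighted term frequency: 1 + log(tf) wherever the token occurs.
    present = tf > 0
    tf[present] = 1.0 + np.log(tf[present])

    # Smoothed inverse document frequency (assumed convention).
    df = present.sum(axis=0)
    idf = np.log((1.0 + len(docs)) / (1.0 + df)) + 1.0

    weights = tf * idf
    norms = np.linalg.norm(weights, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows untouched
    return weights / norms, vocabulary


X, vocab = tfidf_sketch(["text mining", "text retrieval"])
print(vocab)    # ['mining', 'retrieval', 'text']
print(X.shape)  # (2, 3)
```

In this small example 'mining' receives a larger weight than 'text' in the first row, because 'text' occurs in both documents.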
    
  4. Predict Token Weights: Computes the importance of each token by optimizing a classification objective.

    dictionary = ['vector', 'string', 'bool']
    vector = vectorize([
            'X ([np.array]): vectorized matrix columns arranged as per the dictionary.',
            'y ([labels]): True classification labels.',
            'epochs ([int]): Optimization epochs.',
            'verbose (bool, optional): Enable verbose outputs. Defaults to False.',
            'dict ([type], optional): list of tokens. Defaults to None.'
            ], dict=dictionary)
    
    # randint(1, ...) would only ever return 0; randint(2, ...) draws random binary labels.
    labels = np.random.randint(2, size=(vector.shape[0], 1))
    weights = predict_weights(vector, labels, 100, dict=dictionary)
    
    >>> Token-Weights Mappings: {'vector': 0.22097790924850977, 
                                 'string': 0.39296369957440075, 
                                 'bool': 0.689853175081446}
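
The README does not say which optimizer or loss `predict_weights` uses. As one possible reading of "classification optimization", the sketch below fits a logistic regression by gradient descent and reports the learned coefficient of each token. Everything in it is an assumption made for illustration: the name `token_weights_sketch`, the learning rate `lr`, and the choice of logistic loss are not taken from the toolbox.

```python
import numpy as np


def token_weights_sketch(X, y, epochs=100, lr=0.1, vocabulary=None):
    """Fit logistic regression by gradient descent; return per-token weights."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1)  # accept (n,) or (n, 1) labels
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(label = 1)
        error = p - y                           # gradient of the log loss
        w -= lr * (X.T @ error) / len(y)
        b -= lr * error.mean()
    if vocabulary is None:
        return w
    return dict(zip(vocabulary, w.tolist()))


# Toy data: the first token marks class 1, the second marks class 0.
X_toy = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
y_toy = np.array([1, 1, 0, 0])
print(token_weights_sketch(X_toy, y_toy, 200, vocabulary=["good", "bad"]))
```

Under this reading, a token with a large positive coefficient pushes the prediction toward class 1, and a large negative one pushes it toward class 0.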
    
  5. Page Rank: Computes the page rank from a chain matrix.

    chain_matrix = np.array([[0, 0, 1],
                             [1, 0, 1],
                             [0, 1, 0]])
    
    print(page_rank(chain_matrix))
    
    rank, TPM = page_rank(chain_matrix, return_TransMatrix=True)
    print(f'Page Rank: {rank} \nTransition Probability Matrix: \n{TPM}')
    
    >>> [0.0047 0.997  0.0767]
    >>> Page Rank: [0.0047 0.997  0.0767] 
        Transition Probability Matrix: 
        [[0.03333333 0.03333333 0.93333333]
         [0.48333333 0.03333333 0.48333333]
         [0.03333333 0.93333333 0.03333333]]
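
The transition matrix printed above is consistent with row-normalizing the chain matrix and mixing in a teleport probability of 0.1 (for example, 0.9 * 1 + 0.1 / 3 = 0.9333). The sketch below rebuilds that matrix and then finds the rank vector by power iteration. It is an illustration under stated assumptions, not the toolbox's code: `page_rank_sketch`, `teleport`, `tol` and `max_iter` are hypothetical names, and the rank here is scaled to sum to 1, so its values may differ from the toolbox's output.

```python
import numpy as np


def page_rank_sketch(adjacency, teleport=0.1, tol=1e-10, max_iter=1000):
    """Return (rank, transition_matrix) for a square 0/1 chain matrix."""
    A = np.asarray(adjacency, dtype=float)
    n = A.shape[0]

    # Row-normalize; rows without outgoing links become uniform.
    out_degree = A.sum(axis=1, keepdims=True)
    safe_degree = np.where(out_degree > 0, out_degree, 1.0)
    P = np.where(out_degree > 0, A / safe_degree, 1.0 / n)

    # Teleportation: with probability `teleport`, jump to a random page.
    P = (1.0 - teleport) * P + teleport / n

    # Power iteration on the row vector: rank <- rank @ P until it settles.
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        updated = rank @ P
        converged = np.abs(updated - rank).sum() < tol
        rank = updated
        if converged:
            break
    return rank, P


chain_matrix = np.array([[0, 0, 1],
                         [1, 0, 1],
                         [0, 1, 0]])
rank, TPM = page_rank_sketch(chain_matrix)
print(TPM)
print(rank, rank.sum())
```

Because every entry of the transition matrix is positive after teleportation, the iteration converges to a unique stationary distribution regardless of the starting vector.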
    
