A Python text analysis library for relevance and subtheme detection

TextScope 📚🔍

TextScope is a Python package that helps determine the relevance of a text to predefined profiles of interest and aligns it with specific subthemes. The package is designed to be flexible and configurable via a config.py file.

Installation

You can install TextScope using pip:

pip install textscope

Configuration

Before using TextScope, define your profiles of interest and subthemes in the config.py file. Example:

THEMES = ['technology', 'AI', 'machine learning', 'software']

SUBTHEMES = {
    '1': 'Natural Language Processing',
    '2': 'Transformer-based architecture',
    '3': 'Computer Vision and multimodality'
}

In this example config, we defined a set of themes or keyphrases related to AI. They can be used in combination with the relevance filter to keep only highly on-topic documents. We also defined a set of subthemes, which are used to determine whether the analyzed text discusses each subtheme.
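Because a small slip in config.py (a missing comma, an empty label) is easy to make, a quick sanity check before running an analysis can help. The sketch below is not part of TextScope: `validate_config` is a hypothetical helper, and the rules it enforces (string keys, non-empty string labels) are assumptions based on the example above.

```python
def validate_config(themes, subthemes):
    """Return a list of problems found in a THEMES / SUBTHEMES pair.

    Hypothetical helper, not part of TextScope. The checks are assumptions
    derived from the shape of the example config.
    """
    problems = []

    if not isinstance(themes, list) or not themes:
        problems.append("THEMES must be a non-empty list")
    else:
        for theme in themes:
            if not isinstance(theme, str) or not theme.strip():
                problems.append(f"invalid theme: {theme!r}")

    if not isinstance(subthemes, dict) or not subthemes:
        problems.append("SUBTHEMES must be a non-empty dict")
    else:
        for key, label in subthemes.items():
            if not isinstance(key, str):
                problems.append(f"subtheme key is not a string: {key!r}")
            if not isinstance(label, str) or not label.strip():
                problems.append(f"invalid label for subtheme {key!r}")

    return problems


THEMES = ['technology', 'AI', 'machine learning', 'software']
SUBTHEMES = {
    '1': 'Natural Language Processing',
    '2': 'Transformer-based architecture',
    '3': 'Computer Vision and multimodality',
}

print(validate_config(THEMES, SUBTHEMES))  # prints [] when nothing is wrong
```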

Relevance Analysis

To determine if a text is relevant to any of the predefined themes or profiles:

from textscope.relevance_analyzer import RelevanceAnalyzer

model_name = 'intfloat/multilingual-e5-large-instruct'
text = "This article discusses the latest advancements in AI and machine learning."
analyzer = RelevanceAnalyzer(model_name)
rel_score = analyzer.analyze(text)
print(rel_score)  # returns a high relevance score for these themes (> 86)

One possible application of this method is filtering out texts that are not highly relevant to the topic. Future versions of TextScope will include a filter_corpus method (currently under development) that removes out-of-scope texts from a corpus. NOTE: TextScope is agnostic to the underlying embedding model, but we highly recommend the multilingual E5 instruct model (intfloat/multilingual-e5-large-instruct). It is highly flexible and accepts instructions in natural language.
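Until filter_corpus is available, the filtering idea can be approximated by hand. The following is a minimal sketch, not the library's implementation: `filter_relevant` and `toy_score` are hypothetical names, and the threshold of 86 is borrowed from the comment in the example above. In practice you would pass `analyzer.analyze` as `score_fn`, assuming it returns a single numeric score as that example suggests.

```python
from typing import Callable, Iterable, List


def filter_relevant(
    texts: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 86.0,
) -> List[str]:
    """Keep only the texts whose relevance score is above the threshold.

    Hypothetical helper: score_fn stands in for a scoring callable such as
    RelevanceAnalyzer.analyze.
    """
    return [text for text in texts if score_fn(text) > threshold]


def toy_score(text: str) -> float:
    """Stand-in scorer so the sketch runs without downloading a model."""
    return 90.0 if "machine learning" in text.lower() else 40.0


corpus = [
    "This article discusses the latest advancements in AI and machine learning.",
    "A recipe for slow-cooked tomato sauce.",
]

kept = filter_relevant(corpus, toy_score)
print(kept)  # only the first text remains
```

Passing the scorer as an argument keeps the filter independent of any particular embedding model.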

The default config file provided with TextScope defines a profile of interest in Spanish related to pathological gambling and a list of subthemes representing symptoms of the pathology. It is an example of the multilingual support of this package and its application to complex real scenarios.

Subtheme Analysis

This class lets you test whether a text discusses each of the subthemes defined in the config:

from textscope.subtheme_analyzer import SubthemeAnalyzer

model_name = 'intfloat/multilingual-e5-large-instruct'
text = "Transformer-based architecture is the state-of-the-art in NLP."
analyzer = SubthemeAnalyzer(model_name)
subth_pres = analyzer.analyze_bin(text) # default threshold set to 86.
print(subth_pres)  # For this sentence and subthemes it should output {'1':1, '2':1, '3':0}

If we do not want a binary output, we also provide a method that outputs the similarity:

from textscope.subtheme_analyzer import SubthemeAnalyzer

model_name = 'intfloat/multilingual-e5-large-instruct'
text = "Transformer-based architecture is the state-of-the-art in NLP."
analyzer = SubthemeAnalyzer(model_name)
subth_prob = analyzer.analyze(text)  # similarity per subtheme, no threshold applied
print(subth_prob)  # outputs a similarity score for each subtheme key instead of 0/1 flags
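If you want a cut-off other than the default, the similarity scores can be turned into binary flags yourself. This is a sketch under assumptions: `binarize` is a hypothetical helper, the scores are assumed to come back as a dict keyed by subtheme id on the same scale as the threshold of 86, and the strict "greater than" comparison may differ from what analyze_bin does internally. The score values below are invented for illustration only.

```python
def binarize(scores, threshold=86.0):
    """Map each subtheme score to 1 if it is above the threshold, else 0.

    Hypothetical helper, not part of TextScope.
    """
    return {key: int(value > threshold) for key, value in scores.items()}


# Made-up scores for illustration; real values depend on the model and text.
example_scores = {'1': 88.2, '2': 91.5, '3': 79.4}

print(binarize(example_scores))        # {'1': 1, '2': 1, '3': 0}
print(binarize(example_scores, 90.0))  # {'1': 0, '2': 1, '3': 0}
```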

Testing

To run tests for TextScope, use the following command:

pytest -s tests/
