A Python text analysis library for relevance and subtheme detection
TextScope 📚🔍
TextScope is a Python package that helps determine the relevance of a text to predefined profiles of interest and aligns it with specific subthemes. The package is designed to be flexible and configurable via a config.py file.
Installation
You can install TextScope using pip:
pip install textscope
Configuration
Before using TextScope, define your profiles of interest and subthemes in the config.py file. Example:
THEMES = ['technology', 'AI', 'machine learning', 'software']
SUBTHEMES = {
    '1': 'Natural Language Processing',
    '2': 'Transformer-based architecture',
    '3': 'Computer Vision and multimodality'
}
In this example config, we define a series of themes or keyphrases related to AI. They can be used in combination with the relevance filter to keep only highly on-topic documents. We also define a series of subthemes, which are used to determine whether the analyzed text discusses each subtheme.
Relevance Analysis
To determine if a text is relevant to any of the predefined themes or profiles:
from textscope.relevance_analyzer import RelevanceAnalyzer
model_name = 'intfloat/multilingual-e5-large-instruct'
text = "This article discusses the latest advancements in AI and machine learning."
analyzer = RelevanceAnalyzer(model_name)
rel_score = analyzer.analyze(text)
print(rel_score)  # should be a high relevance score for the themes (> 86)
One possible application of this method is to filter out texts that are not highly relevant to the topic. Future versions of TextScope will include a filter_corpus method (currently under development) that removes out-of-scope texts from a corpus. NOTE: TextScope is agnostic to the underlying embedding model, but we highly recommend the multilingual e5 instruct version. It is highly flexible and accepts instructions in natural language.
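Until filter_corpus is available, the filtering idea above can be sketched by hand. This is a minimal illustration, not TextScope's implementation: filter_by_relevance, toy_score and the sample corpus are hypothetical names and data, and the sketch assumes the scorer returns one numeric score per text, with 86 reused as the cut-off.

```python
from typing import Callable, Iterable, List


def filter_by_relevance(
    texts: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 86.0,
) -> List[str]:
    """Keep only the texts whose relevance score reaches the threshold."""
    return [text for text in texts if score_fn(text) >= threshold]


def toy_score(text: str) -> float:
    """Stand-in scorer (hypothetical) so the sketch runs without a model."""
    return 90.0 if "machine learning" in text.lower() else 40.0


corpus = [
    "Machine learning models keep improving.",
    "A recipe for tomato soup.",
]
kept = filter_by_relevance(corpus, toy_score)
print(kept)  # ['Machine learning models keep improving.']
```

In practice, analyzer.analyze could be passed as score_fn in place of toy_score, assuming it returns a single numeric score as the example above suggests.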
The default config file provided with TextScope defines a profile of interest in Spanish related to pathological gambling and a list of subthemes representing symptoms of the pathology. It is an example of the multilingual support of this package and its application to complex real scenarios.
Subtheme Analysis
This class tests whether a text discusses each of the subthemes defined in the config:
from textscope.subtheme_analyzer import SubthemeAnalyzer
model_name = 'intfloat/multilingual-e5-large-instruct'
text = "Transformer-based architecture is the state-of-the-art in NLP."
analyzer = SubthemeAnalyzer(model_name)
subth_pres = analyzer.analyze_bin(text) # default threshold set to 86.
print(subth_pres) # For this sentence and subthemes it should output {'1':1, '2':1, '3':0}
If you do not want a binary output, the analyze method returns the similarity instead:
from textscope.subtheme_analyzer import SubthemeAnalyzer
model_name = 'intfloat/multilingual-e5-large-instruct'
text = "Transformer-based architecture is the state-of-the-art in NLP."
analyzer = SubthemeAnalyzer(model_name)
subth_prob = analyzer.analyze(text)  # similarity scores, no threshold applied
print(subth_prob)  # similarity for each subtheme rather than a 0/1 flag
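The relationship between the similarity output and the binary output can be sketched as a simple thresholding step. This is an illustration under assumptions, not the library's code: it assumes the similarity output is a dictionary mapping subtheme keys to scores and that a score at or above the threshold counts as present. binarize_scores and the example values are hypothetical.

```python
from typing import Dict


def binarize_scores(scores: Dict[str, float], threshold: float = 86.0) -> Dict[str, int]:
    """Map each subtheme score to 1 if it reaches the threshold, else 0."""
    return {key: int(score >= threshold) for key, score in scores.items()}


# Made-up similarity values, for illustration only
example_scores = {"1": 88.2, "2": 91.5, "3": 79.0}
flags = binarize_scores(example_scores)
print(flags)  # {'1': 1, '2': 1, '3': 0}
```

Keeping the raw scores and applying the threshold separately makes it easy to try a stricter or looser cut-off without re-encoding the text.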
Testing
To run tests for TextScope, use the following command:
pytest -s tests/
Hashes for textscope-0.1.4-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 1322374cde4a4fcb596f8013f2988d96315b3d7a820aba34a0038a64e063f284
MD5 | 66b05dec3d23c685c655c236675d6fdd
BLAKE2b-256 | 31a4f6f90dee5479d84997809b7dbb70c5b3d0e2826c52be2ac91fb4b667d768