An automatic thesaurus generator.
thesaurus_generator
This module provides a simple tool to automatically generate a thesaurus in Spanish from a given .txt file.
Installation
You can install this module from PyPI:
pip install thesaurus-generator
Usage
Here is a simple example:
from thesaurus_generator import ThesaurusGenerator
# Generate the thesaurus.
t = ThesaurusGenerator()
thesaurus = t.generate('./topics/topic_2.txt')
# Save the thesaurus in JSON format.
t.save_thesaurus('thesaurus.json')
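Since the thesaurus is saved in JSON format, you can load it back later with the standard library:

import json

# Load a previously saved thesaurus.
with open('thesaurus.json', encoding='utf-8') as f:
    thesaurus = json.load(f)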
Configuration
As in the example above, you can use this module without providing any configuration, since the default configuration works well in most cases. If you do need customization, you can provide your own configuration.
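How the configuration is passed is a detail of the package API; the sketch below assumes the ThesaurusGenerator constructor accepts a configuration dictionary (the `config` keyword is an assumption, not a documented signature):

from thesaurus_generator import ThesaurusGenerator

# Hypothetical: override two options and keep the rest of the defaults.
custom_config = {
    "verbose": True,
    "thesaurus_similarity_threshold": 0.85,
}
t = ThesaurusGenerator(config=custom_config)  # `config=` is an assumption; check the package API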
Here is a summary of the configuration supported:
- `verbose` defines whether logs describing the process are printed.
- `use_spacy` defines whether the pipeline uses spaCy's pipeline. It is True by default, and should stay True unless you have an incompatibility with spaCy: https://spacy.io/
- `use_stanza` defines whether the pipeline uses Stanza's pipeline. It is True by default, and should stay True unless you have an incompatibility with Stanza: https://www.stanza.es/
- `key_terms` defines the configuration used to extract the most important terms from the text. If its value is `'auto'`, the default configuration is used. The format of the configuration is:
  - `config` defines the configuration used to extract the most important terms of each length. It is an array of three objects, where the object at index `i` (0-based) defines the configuration used to extract terms formed by `i + 1` words. Each object must contain:
    - `criteria` defines the criterion used to extract the key words. It can be `tf-idf`, which scores the relevance of each term based on a TF-IDF index; `text-relevance`, which scores the relevance of each term by the similarity between the embedding of the whole text and the embedding of the term; or `both`, which averages the two metrics.
    - `ratio` defines the ratio of the highest-scoring elements that are kept. E.g. a ratio of 0.05 keeps the 5% of terms with the highest score.
    - `remove_stop_words` defines the criterion used to remove stop words. The possible values are `None`, which indicates that stop words are not removed; `'hard'`, which discards any term that contains a stop word; and `'soft'`, which only discards terms formed entirely of stop words (see the sketch after this list).
  - `stop_words` is a list of stop words to be considered.
- `key_terms_from_models` defines the configuration used to extract key terms using external models. If its value is `'auto'`, the default configuration is loaded. The configuration contains:
  - `models` defines the models to use, as a comma-separated string, e.g. `textrank,keybert,yake`. The available models are count, keybert, rake, spacy, textrank and yake.
  - `verbose` defines whether the number of terms extracted by every model is logged.
  - `count.ratio` defines the ratio of terms kept when using the count model. This model considers the terms repeated most often in the text to be the most important.
  - `keybert` defines the configurations used for this model. As KeyBERT (https://github.com/MaartenGr/KeyBERT) supports a wide variety of configurations, this option accepts an array of configurations; terms are extracted once per configuration and the results are joined. Each element must be an object with the keys `diversity`, `nr_candidates`, `num_terms`, `use_maxsum` and `use_mmr`; the meaning of each property is documented at https://github.com/MaartenGr/KeyBERT
  - `rake.ratio` defines the ratio of terms kept when using the rake model, which also favors the most repeated terms in the text. https://pypi.org/project/rake-nltk/
  - `spacy.ratio` defines the ratio of terms kept when using the spacy model, which also favors the most repeated terms in the text. https://spacy.io/
  - `textrank.ratio` defines the ratio of terms kept when using the textrank model, which also favors the most repeated terms in the text. https://github.com/davidadamojr/TextRank
  - `yake.ratio` defines the ratio of terms kept when using the yake model, which also favors the most repeated terms in the text. https://pypi.org/project/yake/
- `special_characters` is a list of characters; any term that contains one of them is removed. You can provide an empty array to disable this feature.
- `filter_terms` defines the criteria used to discard irrelevant terms. If its value is `'auto'`, the default configuration is used. The format of the configuration is:
  - `criteria` defines how terms are filtered. If the value is `'included'`, only the terms that match the patterns in `included_pos_tagging` are kept; if the value is `'excluded'`, the terms that match the patterns in `excluded_pos_tagging` are discarded.
  - `pos_tagging_groups` defines a mapping between keywords and spaCy POS tags (https://web.archive.org/web/20190206204307/).
  - `included_pos_tagging` is a list of patterns that the extracted terms may match. Each element is a list of keys from `pos_tagging_groups`.
  - `excluded_pos_tagging` is a list of patterns that the extracted terms must not match. Each element is a list of keys from `pos_tagging_groups`.
- `similarity` defines the similarity measure between the terms that appear in the generated thesaurus. If its value is `'auto'`, the default configuration is used (a sketch of the `'transformers'` metric appears after the default configuration below). The format of the configuration is:
  - `metric` is the metric used to calculate the similarity. It can be `'spacy'`, which uses the document similarity defined by spaCy (https://spacy.io/api/doc); `'transformers'`, which uses the similarity defined in sentence_transformers (https://pypi.org/project/sentence-transformers/); or `'tfhub'`, which uses the similarity defined by this TF-Hub model: https://tfhub.dev/google/universal-sentence-encoder/4
  - `remove_stop_words` defines whether stop words are removed from the terms before computing the metric.
- `thesaurus_similarity_threshold` defines the minimum similarity score two terms need for their relation to be included in the generated thesaurus.
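To make the `remove_stop_words` modes concrete, here is a minimal plain-Python sketch of the 'hard' and 'soft' criteria described above (an illustration, not the module's actual implementation):

from typing import Optional

# Example stop words; the module defaults to nltk.corpus.stopwords.words('spanish').
STOP_WORDS = {"de", "la", "el", "en"}

def keep_term(term: str, mode: Optional[str]) -> bool:
    """Return True if `term` survives the given stop-word criterion."""
    words = term.lower().split()
    if mode is None:          # stop words are not removed
        return True
    if mode == "hard":        # discard terms containing any stop word
        return not any(w in STOP_WORDS for w in words)
    if mode == "soft":        # discard terms formed entirely of stop words
        return not all(w in STOP_WORDS for w in words)
    raise ValueError(f"unknown mode: {mode!r}")

assert keep_term("teoría de conjuntos", "soft")       # mixed term survives 'soft'
assert not keep_term("teoría de conjuntos", "hard")   # 'de' is a stop word, so 'hard' drops it
assert not keep_term("de la", "soft")                 # only stop words: dropped in both modes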
Default configuration
Here is the default configuration the module uses:
{
"verbose": false,
"use_spacy": true,
"use_stanza": true,
"key_terms": {
"config": [
{
"criteria": "text-relevance",
"ratio": 0.05,
"remove_stop_words": "soft"
},
{
"criteria": "text-relevance",
"ratio": 0.05,
"remove_stop_words": "soft"
},
{
"criteria": "text-relevance",
"ratio": 0.02,
"remove_stop_words": "soft"
}
],
"stop_words": `nltk.corpus.stopwords.words('spanish')`
},
"key_terms_from_models": {
"models": "textrank,keybert,yake",
"verbose": false,
"count": { "ratio": 0.2 },
"keybert": [
{
"diversity": 0.5,
"nr_candidates": 20,
"num_terms": 15,
"use_maxsum": false,
"use_mmr": false
},
{
"diversity": 0.5,
"nr_candidates": 20,
"num_terms": 15,
"use_maxsum": true,
"use_mmr": false
},
{
"diversity": 0.7,
"nr_candidates": 20,
"num_terms": 15,
"use_maxsum": false,
"use_mmr": true
},
{
"diversity": 0.2,
"nr_candidates": 20,
"num_terms": 15,
"use_maxsum": false,
"use_mmr": true
}
],
"rake": { "ratio": 0.1 },
"spacy": { "ratio": 0.1 },
"textrank": { "ratio": 0.2 },
"yake": { "num_terms": 125 }
},
"special_characters": ["𝒇", "𝑓", "𝒈", "α"],
"filter_terms": {
"criteria": "included",
"pos_tagging_groups": {
"ADJ": ["ADJ"],
"ADV": ["ADV"],
"DET": ["DET", "ADP", "SCONJ", "CCONJ"],
"NOUN": ["NOUN", "PROPN", "NUM"],
"OTHER": ["PUNCT", "SPACE", "PART", "SYM", "INTJ", "X"],
"PRON": ["PRON"],
"VERB": ["VERB", "AUX"]
},
"excluded_pos_tagging": [
["PRON"],
["ADJ"],
["DET"],
["ADV"],
["OTHER"],
["DET", "NOUN"],
["*", "DET"],
["DET", "VERB"],
["DET", "PRON"],
["DET", "ADJ"],
["DET", "ADV"],
["DET", "OTHER"],
["DET", "DET", "*"],
["*", "DET", "DET"],
["*", "*", "DET"]
],
"included_pos_tagging": [
["NOUN"],
["NOUN", "ADJ"],
["NOUN", "ADV", "ADJ"],
["NOUN", "DET", "NOUN"]
]
},
"similarity": {
"metric": "transformers",
"remove_stop_words": true
},
"thesaurus_similarity_threshold": 0.8
}
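The default `similarity.metric` is `'transformers'` with a `thesaurus_similarity_threshold` of 0.8. Here is a minimal sketch of what such a metric looks like using sentence-transformers directly (the model name and the use of cosine similarity are assumptions, not necessarily what the module does internally):

from sentence_transformers import SentenceTransformer, util

# The model name below is an assumption; the module does not document which one it loads.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

emb = model.encode(["gato", "felino"])
score = util.cos_sim(emb[0], emb[1]).item()

# With the default threshold of 0.8, only pairs scoring at least 0.8
# would be related in the generated thesaurus.
print(f"similarity: {score:.2f}, related: {score >= 0.8}")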
Other features
Once you have run the generate method of the ThesaurusGenerator class, you have access to the following attributes:
- `thesaurus` is the generated thesaurus.
- `terms` is the list of extracted terms.
- `filtered_terms` is the list of terms remaining after filtering.
- `token_pair_similarities` is a list of all term pairs and their similarity.
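For example, continuing the usage example above:

from thesaurus_generator import ThesaurusGenerator

t = ThesaurusGenerator()
t.generate('./topics/topic_2.txt')

print(len(t.terms))                   # number of extracted terms
print(len(t.filtered_terms))          # number of terms remaining after filtering
print(t.token_pair_similarities[:5])  # first few term pairs with their similarity
print(t.thesaurus)                    # the generated thesaurus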