Takes a list of documents and returns fully automated, labeled dictionaries in which topic names are the keys and semantically similar keywords from the documents are the values.
Project description
docs2tops stands for documents to topics.
In short, it:
- extracts n-grams from the documents
- extracts meaningful moregrams (n-grams of two or more words)
- creates a semi-automated dictionary: if the user provides candidate topics, docs2tops returns similar keywords for each of those topics
- creates a fully automated dictionary
In both cases (whether or not the user supplies topics), docs2tops returns two dictionaries. If the user did not provide any topics, the first dictionary contains only a message and no topics.
The fully automated dictionary is always created.
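The n-gram steps listed above can be illustrated with a minimal sketch. This is not docs2tops's own implementation: the function names `extract_ngrams` and `frequent_moregrams`, the regex tokenizer, and the `min_count` frequency threshold are all assumptions made for illustration, and "meaningful" is approximated here as "occurs at least `min_count` times".

```python
from collections import Counter
import re


def extract_ngrams(doc, n):
    """Return all contiguous n-word sequences in a document (hypothetical helper)."""
    # Assumed tokenizer: lowercase runs of letters, digits, and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", doc.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequent_moregrams(docs, max_n=3, min_count=2):
    """Count 2-grams up to max_n-grams and keep those seen at least min_count times."""
    counts = Counter()
    for doc in docs:
        for n in range(2, max_n + 1):
            counts.update(extract_ngrams(doc, n))
    return {gram: count for gram, count in counts.items() if count >= min_count}


docs = [
    "Fast delivery and great taste",
    "Great taste but slow delivery",
]
print(extract_ngrams(docs[0], 2))
# ['fast delivery', 'delivery and', 'and great', 'great taste']
print(frequent_moregrams(docs))
# {'great taste': 2}
```

A plain frequency cut-off is the simplest possible filter; the package may well use a different criterion to decide which moregrams are meaningful.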
The docs2tops function takes a list of documents. Optionally, you can also provide candidate_topics_list and moregrams_sample_size.
docs2tops(docs_input_list, candidate_topics_list=None, moregrams_sample_size=None)
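Because `docs_input_list` is a list of documents, it may help to drop missing or empty entries before calling the function. The sketch below is an assumption about sensible input preparation, not something the package is documented to require; `clean_docs` is a hypothetical helper name.

```python
def clean_docs(raw_docs):
    """Keep only non-empty strings, with surrounding whitespace stripped."""
    cleaned = []
    for doc in raw_docs:
        # Skip None, NaN (a float), and strings that are empty or whitespace only.
        if isinstance(doc, str) and doc.strip():
            cleaned.append(doc.strip())
    return cleaned


raw = ["  Smells great ", None, "", float("nan"), "Arrived late"]
docs = clean_docs(raw)
print(docs)
# ['Smells great', 'Arrived late']
# `docs` could then be passed as docs_input_list.
```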
Installation
Run the following to install:
pip install docs2tops
Usage
from docs2tops import docs2tops
import pandas as pd
df = pd.read_csv(r"C:\Users\my_file.csv")
docs = df['my_texual_content'].to_list()
candidate_topics_list = ['smell', 'taste', 'delivery', 'packaging']
moregrams_sample_size = 100
user_input_dict, fully_auto_dict = docs2tops(docs_input_list=docs,
                                             candidate_topics_list=candidate_topics_list,
                                             moregrams_sample_size=moregrams_sample_size)
list_dicts = [user_input_dict, fully_auto_dict]
for result in list_dicts:
    print(result)
    print('number of topics: ', len(result))
    print('---')
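To inspect the results beyond printing them, the following sketch ranks topics by how many keywords they hold. It assumes each dictionary maps a topic name to a list of keywords, which the description suggests but does not spell out; `topic_sizes` is a hypothetical helper, and `example_dict` is hand-written stand-in data, not real docs2tops output.

```python
def topic_sizes(topic_dict):
    """Return (topic, keyword_count) pairs, largest topics first."""
    # Assumes every value is a list (or other sized collection) of keywords.
    sizes = [(topic, len(keywords)) for topic, keywords in topic_dict.items()]
    return sorted(sizes, key=lambda pair: pair[1], reverse=True)


# Stand-in data shaped like the assumed output format.
example_dict = {
    "taste": ["sweet", "bitter"],
    "delivery": ["late delivery", "courier", "shipping time"],
}
for topic, size in topic_sizes(example_dict):
    print(f"{topic}: {size} keywords")
# delivery: 3 keywords
# taste: 2 keywords
```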
Developing docs2tops
To install docs2tops along with the tools you need to develop and run tests, run the following in your virtual environment:
pip install -e .[dev]