Explicit online topic extraction for documents from Wikipedia
Overview of the package wikitopic
Wikitopic is an explicit topic extractor for English documents. It was developed by the Emerging Risk Group at Kemmy Business School, University of Limerick, as part of research on cyber risk prediction by Dr Arash Kia and Dr Finbarr Murphy.
The package pre-processes the document, counts word frequencies, and stores the words in a list of lists sorted in ascending order of frequency, where each inner list holds all the words that share the same frequency. Wikitopic then builds a search expression by adding these lists one at a time and searches Wikipedia after each addition. It continues until it reaches the minimum search result, that is, the last non-empty result before a search returns nothing. The first item of that minimum search result is the best-matching topic for the document.
Wikitopic also outputs a list containing the first search result from every step up to the final one. This list traces a top-down path from a general topic to the most specific topic for the document. The third element of the output is a list of the top n most frequent words.
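The procedure described above can be sketched roughly as follows. This is a minimal illustration, not the package's actual source: the function names `group_by_frequency` and `narrow_topic`, the `fake_search` stand-in, and its tiny index are all hypothetical, and the order in which the frequency groups are consumed is an assumption based on the description.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple


def group_by_frequency(words: Iterable[str]) -> List[List[str]]:
    """Group words that share a count; groups are ordered by ascending count."""
    counts = Counter(words)
    by_count: Dict[int, List[str]] = {}
    for word, count in counts.items():
        by_count.setdefault(count, []).append(word)
    return [sorted(by_count[count]) for count in sorted(by_count)]


def narrow_topic(
    groups: List[List[str]],
    search: Callable[[str], List[str]],
) -> Tuple[str, List[str]]:
    """Add one group at a time to the query until the search comes back empty.

    Returns the first hit of the last non-empty search and the list of
    first hits collected along the way.
    """
    query_words: List[str] = []
    path: List[str] = []
    for group in groups:
        query_words.extend(group)
        results = search(" ".join(query_words))
        if not results:
            break  # the previous step was the minimum search result
        path.append(results[0])
    best = path[-1] if path else "No Topic"  # fallback label is an assumption
    return best, path


# Hypothetical stand-in for a Wikipedia search, so the sketch runs offline.
FAKE_INDEX = {
    "policy": ["Policy", "Public policy"],
    "policy cyber": ["Cyber policy"],
}


def fake_search(query: str) -> List[str]:
    return FAKE_INDEX.get(query, [])


words = ["risk", "cyber", "risk", "policy", "cyber", "risk"]
groups = group_by_frequency(words)
best, path = narrow_topic(groups, fake_search)
print(groups)
print(best)
print(path)
```

To query Wikipedia itself instead of the fake index, the `search` function of the wikipedia package, which returns a list of page titles for a query string, could be passed as the `search` argument.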
Citation information
If you use the package in your research, please cite it as follows: Kia, A. N., Murphy, F., 2020. Wikipedia topic extractor. GitHub; [accessed date]. https://github.com/conkrug/wikitopic.
Installation information and requirements
The prerequisites for the package are the nltk and wikipedia Python packages. The re, string, and collections modules are also used, but they are part of the Python standard library and need no separate installation. If nltk and wikipedia are not installed, you can install them with these commands:
pip install nltk
pip install wikipedia
For the Anaconda distribution, you can do the following:
conda install -c conda-forge wikipedia
conda install -c anaconda nltk
After installing the prerequisites (if they were not already installed), you can install the wikitopic package with this command:
pip install wikitopic
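Before installing, it can help to check which prerequisites are already available. The snippet below is a small sketch using only the standard library; the helper name `missing_modules` is hypothetical and not part of wikitopic.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# nltk and wikipedia are the third-party prerequisites named above.
missing = missing_modules(["nltk", "wikipedia"])
if missing:
    print("Install these first:", ", ".join(missing))
else:
    print("All prerequisites are available.")
```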
Quick-start examples
These lines of code show a simple example of topic extraction with wikitopic:
from wikitopic import WikiTopicExtractor
print(WikiTopicExtractor("This is a sample English text"))
As you can see, you can pass a sentence directly to the WikiTopicExtractor class and print the result.
Output:
Best topic extracted for the document is: Lorem ipsum
Path to the topic is: ['No Topic', 'Lorem ipsum']
5 most frequent words in the document are: ['text', 'english', 'sample']
Alternatively, you can read a text file from a path and extract its topic:
import wikitopic
from wikitopic import WikiTopicExtractor  # needed because the class is used unqualified below
path = 'c://cav//taxonomy//sampleText.txt'  # Change the path to match your own system
text = wikitopic.read_from_file(path)
w = WikiTopicExtractor(text, 10)  # 10 for the top 10 frequent words
result = w.output_topic()
print("Best matched topic is: ", result[0])
print("From general to specific topic: ", result[1])
print("Top 10 frequent words: ", result[2])
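The third element of the result is the list of most frequent words. As a rough illustration of how such a list can be computed, the sketch below uses a plain regular expression for tokenization and a tiny hand-written stop-word set. Both are assumptions made to keep the example self-contained; the package's own pre-processing relies on nltk and may behave differently. The name `top_frequent_words` is hypothetical.

```python
import re
from collections import Counter
from typing import List

# Tiny illustrative stop-word set (assumption; not the list the package uses).
STOP_WORDS = {"a", "an", "the", "is", "this", "of", "and", "to", "in"}


def top_frequent_words(text: str, n: int = 5) -> List[str]:
    """Lowercase the text, drop stop words, and return the n most common words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(token for token in tokens if token not in STOP_WORDS)
    return [word for word, _ in counts.most_common(n)]


top = top_frequent_words("This is a sample English text")
print(top)
```

When fewer than n distinct words remain after filtering, the returned list is simply shorter than n.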
Acknowledgements
This work was part of a larger project funded by the European Union’s Horizon 2020 research and innovation programme via the MALAGA Project under grant agreement No 844864.