A Python package to generate document profiler or extract metadata from text in parallel using several Docker images
Project description
DocProfiler
A Python package to generate document profiles and extract metadata from text in parallel using several Docker images and NLP tools/framweworks.
Abstract
Amount of unstructured data has been growing continously which includes text documents, social media text, web blogs and articles. Extracting metadata and generating profiles can increase performance of Text/Document Reterival in field of Information Retrieval and also help us to analyze and understand data by converting it from a completly unstructured to a semi-structured format. Motive behind this python library is to combine open-source NLP tools/technologies together in a much efficient and easy way with help of Docker images and Asynchronous functionalities of Python to process them in parallel.
NLP tools/framework
Task | Framework/Model | Docker Image (GPU support available) | Ports |
---|---|---|---|
Unsupervised Keyphrase Extraction | SIFRank-2020 | docker pull aayushpatel007/sifrank-keyphrases | 5000 |
Named Entity Recognition | FlairNer | docker pull aayushpatel007/flair_ner | 5001 |
Entity Linking | TAGME | docker pull aayushpatel007/tagme-entity-linking | 5002 |
Text Summarization | TextRank | docker pull aayushpatel007/text-summarization | 5003 |
GeoParsing | Mordecai (Upcoming) | --Upcoming-- | N.A |
Language Detection | --Upcoming-- | --Upcoming-- | N.A |
Readability Analysis | --Upcoming-- | --Upcoming-- | N.A |
Run Docker containers
To generate a document profile, you would need to run the above containers: (It is not necassary to run all the containers. You can select the task and run it accordingly.) However, the ports mentioned above needs to be opened for the following containers:
Eg:
docker container run -d -p 5000:5000 aayushpatel007/sifrank-keyphrases 0 # Replace 0 with -1 while running on a CPU.
docker container run -d -p 5001:5001 aayushpatel007/flair_ner 1 # defaut uses ner-ontonotes trained model. If running on CPU, you can replace 1 with 0.
docker container run -d -p 5002:5002 aayushpatel007/tagme-entity-linking 0.2 "tag_me_api_token" # Note that when running entity-linking container you need TAGME API token
docker container run -d - p 5003:5003 aayushpatel007/text-summarization
Now you can access your API using:
http://host_ip_address:5000/sifrank # For Unsupervised Keyphrase detection
http://host_ip_address:5001/flairner # For Named Entity Recognition. 14 entities (PERSON, GPE, EVENTS, ORG ,NORP ,ETC)
http://host_ip_address:5002/tagme # For Entity-Linking
http://host_ip_address:5003/textrank # For Text Summarization
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for docprofiler-1.0.0-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 9bf70172bfb147833caf11177606db9b1454e3922830ecafa6dd189aa4c842a7 |
|
MD5 | 6c212b904756e15475ad9623bd6e1a38 |
|
BLAKE2b-256 | 0202bb822588545d8adaf34d9b658a53641ec4482c3e40e15505f3ccae0ef01b |