
Counting stop words.


Common Phrase Detection


This is a Python API library developed for detecting stop phrases.

Table of Contents

  • Background
  • Install
  • API
  • Maintainers

NLP (Natural Language Processing) techniques are very helpful in various applications such as sentiment analysis, chatbots, and other areas. Developing NLP models requires a large, clean corpus for learning word relations. One of the challenges in achieving a clean corpus is stop phrases. Stop phrases usually do not carry much information about the text, and so they must be identified and removed.
The aim of this repo is to provide a structure for processing HTML pages (a valuable source of text for all languages), finding a certain number of possible word combinations, and using human input to identify stop phrases.

Install

  1. Make sure you have docker, docker-compose, and Python 3.8 or above installed.

  2. Create a .env file with the desired values, based on the .env.example file.

  3. After cloning the project, go to the project directory and run the command below.

docker-compose -f docker-compose-dev.yml build
  4. After the images are built successfully, run the command below to start the project.
docker-compose -f docker-compose-dev.yml up -d
  5. To use the API, we need to create a database and collection in Mongo. First, open a bash shell in the Mongo container.
docker exec -it db bash
  6. Authenticate inside the Mongo container.
mongo -u ${MONGO_INITDB_ROOT_USERNAME} -p ${MONGO_INITDB_ROOT_PASSWORD} --authenticationDatabase admin
  7. Create the database and collection based on the MONGO_PHRASE_DB and MONGO_PHRASE_COL names you provided in step 2.
use phrasedb;  // Database creation
db.createCollection("common_phrase");  // Collection creation
  8. Now you're ready to use the API.
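As a sketch, the .env file from step 2 might look like the following. The variable names below are the ones referenced in steps 6 and 7; all values are placeholders, and your .env.example may define additional variables.

```shell
# Mongo root credentials (used when authenticating in step 6)
MONGO_INITDB_ROOT_USERNAME=admin
MONGO_INITDB_ROOT_PASSWORD=change-me
# Database and collection names (created in step 7)
MONGO_PHRASE_DB=phrasedb
MONGO_PHRASE_COL=common_phrase
```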

API

This API has three endpoints.

Document Process

Here you can pass an HTML text in the request body to process it.

The process stages are:

  • Fetching all H1-H6 and p tags

  • Cleaning text

  • Finding bags of words (combinations of 1 to 5 words)

  • Counting the number of occurrences in the text

  • Integrating results into the database (updating the count field of the phrase if it already exists, otherwise inserting a new record)
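The extraction and counting stages above can be sketched in pure Python roughly as follows. This is an illustrative sketch, not the library's actual implementation; the names TextExtractor and count_phrases are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser

TARGET_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p"}


class TextExtractor(HTMLParser):
    """Collects the text found inside h1-h6 and p tags."""

    def __init__(self):
        super().__init__()
        self._depth = 0  # > 0 while inside a target tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in TARGET_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in TARGET_TAGS and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.chunks.append(data)


def count_phrases(html, max_len=5):
    """Count 1- to 5-word phrases in the h1-h6 and p text of an HTML page."""
    parser = TextExtractor()
    parser.feed(html)
    # Basic cleaning: lowercase and keep word characters only.
    words = re.findall(r"\w+", " ".join(parser.chunks).lower())
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i : i + n])] += 1
    return counts


counts = count_phrases("<h1>Hello world</h1><p>Hello again, world!</p>")
# counts["hello"] == 2 and counts["hello world"] == 1
```

In the real service, each resulting phrase/count pair would then be upserted into the Mongo collection rather than kept in memory.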

Status Updater

Changes the status of a phrase to either stop or highlight.
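Under the hood, a status update presumably amounts to a Mongo update on the phrase's record. The following sketch builds pymongo-style filter/update documents; the field names "phrase" and "status" are assumptions about the schema, not confirmed by the source.

```python
def build_status_update(phrase, status):
    """Build a Mongo filter/update pair that sets a phrase's status.

    Field names ("phrase", "status") are assumed -- the real schema may differ.
    """
    if status not in ("stop", "highlight"):
        raise ValueError("status must be 'stop' or 'highlight'")
    filter_doc = {"phrase": phrase}
    update_doc = {"$set": {"status": status}}
    return filter_doc, update_doc


f, u = build_status_update("click here", "stop")
# f == {"phrase": "click here"} and u == {"$set": {"status": "stop"}}
```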

Data Fetcher

Fetching data from the database based on status. Here you can fetch phrases under four different status conditions:

  • Stop phrases

  • Highlight phrases

  • Phrases that have a status (either stop or highlight)

  • Phrases whose status is not yet determined
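The four conditions above map naturally onto Mongo query documents. This sketch assumes a "status" field on each record; the mode names and the field name are illustrative, not taken from the actual API.

```python
def build_status_filter(mode):
    """Return a Mongo query document for each of the four fetch conditions.

    The "status" field name is an assumption about the schema.
    """
    filters = {
        "stop": {"status": "stop"},
        "highlight": {"status": "highlight"},
        "has_status": {"status": {"$in": ["stop", "highlight"]}},
        "undetermined": {"status": {"$exists": False}},
    }
    return filters[mode]
```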

API details

  • API Base URL
127.0.0.1:8000
  • API Swagger UI
127.0.0.1:8000/docs

For further details and how to make requests to each endpoint, refer to the API's Swagger UI.
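As a rough sketch, a request to the document-process endpoint might be constructed like this. The path /doc-process and the JSON payload shape are hypothetical; consult the Swagger UI at /docs for the real contract.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"


def make_process_request(html_text):
    """Build (but do not send) a POST request carrying HTML in the body.

    The endpoint path and payload field name are guesses -- check /docs
    for the actual schema.
    """
    payload = json.dumps({"text": html_text}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/doc-process",  # hypothetical path
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = make_process_request("<p>some page</p>")
# urllib.request.urlopen(req) would send it once the containers are up
```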

Maintainers

Maani Beygi
Reza Shabrang
