WordCount in Python with a lot more functionality
Project description
# WordCount Python - [ wc.py ]
Utility tool to count the occurrences of words in text using an NLP tokenizer.
## Overview
This repository contains the CLI and SDK for the WordCount Python [ wc.py ].
`wc.py` provides a set of tools to analyse the number of occurences of words across a single or multiple documents. It can be accessed through the CLI, or directly through the SDK provided by the `WCExtractor` and the `WCCore` classes in the `wcpy` module.
For the **CLI interface quickstart** please refer to the **User Guide below**.
For the **SDK interface quickstart** please refer to the **SDK Interface below**.
For more advanced documnetation please refer to the official [WCPY documentation](https://axsauze.github.io/wcpy/).
# Installation
You can install it from pip by running:
```
pip install wc.py
```
This will install the script in your computer so you'll be able to call it directly with `wc.py`.
# CLI User Guide
## Usage
After installing it you can view the usage and options with `wc.py -h`:
```
usage: wc.py [-h] [-v] [--limit LIMIT] [--reverse]
[--filter-words FILTER_WORDS [FILTER_WORDS ...]]
[--file-ext FILE_EXT] [--truncate TRUNCATE]
[--columns COLUMNS [COLUMNS ...]] [--output-file OUTPUT_FILE]
paths [paths ...]
Count the number of words in the files on a folder
positional arguments:
paths (REQUIRED) Path(s) to folders and/or files to count words from
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
--limit LIMIT (Optional) Limit the number of results that you would like to display.
--reverse (Optional) List is sorted in ascending order by default, use this flag to reverse sorting to descending order.
--filter-words FILTER_WORDS [FILTER_WORDS ...]
(Optional) You can get results filtered to only the list of words provided.
--file-ext FILE_EXT (Optional) This is the default file extention for the files being used
--truncate TRUNCATE (Optional) Output is often quite large, you can truncate the output by passing a number greater than 5
--columns COLUMNS [COLUMNS ...]
(Optional) This argument allows you to choose the columns to be displayed in the output. Options are: word, count, files and sentences.
--output-file OUTPUT_FILE
(Optional) Define an output file to save the output
EXAMPLE USAGE:
wc.py ./
wc.py ./ --limit 10
wc.py doc1.txt doc2.txt --filter-words tool awesome an
wc.py docs/ tests/ --truncate 100 --columns word count
wc.py ./ --filter-words tool awesome an --truncate 50 --output output.txt
```
## Examples
#### Counts of word occurences in documents in this folder recusively
```
wc.py ./
```
#### Word occurrences in this folder docs with limit of the top 10
```
wc.py ./ --limit 10
```
#### Word occurences in multiple files showing only specific words
```
wc.py doc2.txt doc1.txt --filter-words tool awesome an
```
#### Word occurences in folder with output truncated and only 2 columns
```
wc.py tests/test_data/ --truncate 20 --columns word count
```
#### Saving output to file
```
wc.py ./ --filter-words tool awesome an --truncate 50 --output output.txt
```
#### Get the current version
```
wc.py -v
```
# SDK Interface
It is possible to interact with the SDK in multiple levels, the two most common usecases will be:
* WCCore class - Interact with filepaths
* WCExtractor class - Interact with files and text
## WCCore class
### generate_wc_dict(self, paths)
This function finds all the files in a given set of paths, and builds a dictionary with the following structure:
### generate_wc_list(self, paths)
This function finds all the files in a given set of paths, and builds a sorted list (by word count) of the following structure
## WCExtractor class
### extract_wc_from_file
This function extracts all the text from a file and builds a wc_dict object
### extract_wc_from_line
This function extracts all the words from a line and adds it to a wc_dict object
## WCExtractorProcessor class
This class does all the processing to convert a WC Dict into a sorted WCList object.
### process_dict_wc_to_list
As function name suggest, this function converts a WCDict object into a sorted WCList object.
## Core WC Types
### WCDict
```
{
<WORD_1:: STR>: {
word_count: <WORD_COUNT:: INT>,
files: {
<FILE_PATH:: STR>: [
<LINE_1:: STR>,
<LINE_2:: STR>,
…
]
},
{
…
}
},
<WORD_2:: STR>: ...
}
```
### WCList
```
[
{
"word": <WORD_1:: STR>,
word_count: <WORD_COUNT:: INT>,
files: {
<FILE_PATH:: STR>: [
<LINE_1:: STR>,
<LINE_2:: STR>,
…
]
}
},
{
"word": <WORD_2:: STR>,
...
}
]
```
# Contributing
If you'd like to contribute, feel free to submit a pull request, open bugs/issues and join the discussion.
## Install VirtualEnv and Requirements
Python 3.X is used, and it's strongly recommended to set up the project in a virtual environment:
```
virtualenv --no-site-packages -p python3 venv
```
Then install it using the setup.py command
```
python setup.py install_data
```
You can also install the requirements directly by running
```
python -r requirements.txt
```
## NLTK
This package uses the NLTK `english.pickle` dataset. The dataset includes in both, the repository and the PyPi package, however if you want to donwload more of the languages you can do so with the following command:
```
python -c "import nltk; nltk.download('punkt')"
```
## Testing
`py.test` is used to run the tests, in order to run it simply run:
```
python setup.py test
```
## Cleaning
To clean all the files generated during runtime simply run:
```
python setup.py clean
```
# Roadmap
* Support multiple types of documents
Utility tool to count the occurrences of words in text using an NLP tokenizer.
## Overview
This repository contains the CLI and SDK for the WordCount Python [ wc.py ].
`wc.py` provides a set of tools to analyse the number of occurences of words across a single or multiple documents. It can be accessed through the CLI, or directly through the SDK provided by the `WCExtractor` and the `WCCore` classes in the `wcpy` module.
For the **CLI interface quickstart** please refer to the **User Guide below**.
For the **SDK interface quickstart** please refer to the **SDK Interface below**.
For more advanced documnetation please refer to the official [WCPY documentation](https://axsauze.github.io/wcpy/).
# Installation
You can install it from pip by running:
```
pip install wc.py
```
This will install the script in your computer so you'll be able to call it directly with `wc.py`.
# CLI User Guide
## Usage
After installing it you can view the usage and options with `wc.py -h`:
```
usage: wc.py [-h] [-v] [--limit LIMIT] [--reverse]
[--filter-words FILTER_WORDS [FILTER_WORDS ...]]
[--file-ext FILE_EXT] [--truncate TRUNCATE]
[--columns COLUMNS [COLUMNS ...]] [--output-file OUTPUT_FILE]
paths [paths ...]
Count the number of words in the files on a folder
positional arguments:
paths (REQUIRED) Path(s) to folders and/or files to count words from
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
--limit LIMIT (Optional) Limit the number of results that you would like to display.
--reverse (Optional) List is sorted in ascending order by default, use this flag to reverse sorting to descending order.
--filter-words FILTER_WORDS [FILTER_WORDS ...]
(Optional) You can get results filtered to only the list of words provided.
--file-ext FILE_EXT (Optional) This is the default file extention for the files being used
--truncate TRUNCATE (Optional) Output is often quite large, you can truncate the output by passing a number greater than 5
--columns COLUMNS [COLUMNS ...]
(Optional) This argument allows you to choose the columns to be displayed in the output. Options are: word, count, files and sentences.
--output-file OUTPUT_FILE
(Optional) Define an output file to save the output
EXAMPLE USAGE:
wc.py ./
wc.py ./ --limit 10
wc.py doc1.txt doc2.txt --filter-words tool awesome an
wc.py docs/ tests/ --truncate 100 --columns word count
wc.py ./ --filter-words tool awesome an --truncate 50 --output output.txt
```
## Examples
#### Counts of word occurences in documents in this folder recusively
```
wc.py ./
```
#### Word occurrences in this folder docs with limit of the top 10
```
wc.py ./ --limit 10
```
#### Word occurences in multiple files showing only specific words
```
wc.py doc2.txt doc1.txt --filter-words tool awesome an
```
#### Word occurences in folder with output truncated and only 2 columns
```
wc.py tests/test_data/ --truncate 20 --columns word count
```
#### Saving output to file
```
wc.py ./ --filter-words tool awesome an --truncate 50 --output output.txt
```
#### Get the current version
```
wc.py -v
```
# SDK Interface
It is possible to interact with the SDK in multiple levels, the two most common usecases will be:
* WCCore class - Interact with filepaths
* WCExtractor class - Interact with files and text
## WCCore class
### generate_wc_dict(self, paths)
This function finds all the files in a given set of paths, and builds a dictionary with the following structure:
### generate_wc_list(self, paths)
This function finds all the files in a given set of paths, and builds a sorted list (by word count) of the following structure
## WCExtractor class
### extract_wc_from_file
This function extracts all the text from a file and builds a wc_dict object
### extract_wc_from_line
This function extracts all the words from a line and adds it to a wc_dict object
## WCExtractorProcessor class
This class does all the processing to convert a WC Dict into a sorted WCList object.
### process_dict_wc_to_list
As function name suggest, this function converts a WCDict object into a sorted WCList object.
## Core WC Types
### WCDict
```
{
<WORD_1:: STR>: {
word_count: <WORD_COUNT:: INT>,
files: {
<FILE_PATH:: STR>: [
<LINE_1:: STR>,
<LINE_2:: STR>,
…
]
},
{
…
}
},
<WORD_2:: STR>: ...
}
```
### WCList
```
[
{
"word": <WORD_1:: STR>,
word_count: <WORD_COUNT:: INT>,
files: {
<FILE_PATH:: STR>: [
<LINE_1:: STR>,
<LINE_2:: STR>,
…
]
}
},
{
"word": <WORD_2:: STR>,
...
}
]
```
# Contributing
If you'd like to contribute, feel free to submit a pull request, open bugs/issues and join the discussion.
## Install VirtualEnv and Requirements
Python 3.X is used, and it's strongly recommended to set up the project in a virtual environment:
```
virtualenv --no-site-packages -p python3 venv
```
Then install it using the setup.py command
```
python setup.py install_data
```
You can also install the requirements directly by running
```
python -r requirements.txt
```
## NLTK
This package uses the NLTK `english.pickle` dataset. The dataset includes in both, the repository and the PyPi package, however if you want to donwload more of the languages you can do so with the following command:
```
python -c "import nltk; nltk.download('punkt')"
```
## Testing
`py.test` is used to run the tests, in order to run it simply run:
```
python setup.py test
```
## Cleaning
To clean all the files generated during runtime simply run:
```
python setup.py clean
```
# Roadmap
* Support multiple types of documents
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
wcpy-1.2.tar.gz
(349.3 kB
view details)
Built Distribution
wcpy-1.2-py3-none-any.whl
(352.6 kB
view details)
File details
Details for the file wcpy-1.2.tar.gz
.
File metadata
- Download URL: wcpy-1.2.tar.gz
- Upload date:
- Size: 349.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | e08d65c07a00605894f233da25c0fb79d02ba49792eddd1854dbde2ca9da078c |
|
MD5 | 3bed6fddf5e75ed3d668fd6d2e1f543e |
|
BLAKE2b-256 | 17111b08765fcf0b4f3cb23cc95352709ee1e1dbf5d186f6f6d4e8a7006ea44a |
File details
Details for the file wcpy-1.2-py3-none-any.whl
.
File metadata
- Download URL: wcpy-1.2-py3-none-any.whl
- Upload date:
- Size: 352.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | d4d8d97c1b87589b8c1a708ed7688a9327881231fd460841687dd14c5a7870e3 |
|
MD5 | e8f5f7c2fb8549592cc730e8786c880f |
|
BLAKE2b-256 | d8bdb5720a66061c8b9fb118c1d317833ec94d219da427de4cf1f5d1c1e4e0b5 |