NLP Document Classification Tool.
Project description
SortStream
Grant Holtes 2020
A GUI Document Classification Tool
Sortstream allows non-technical users to train and distribute AI document sorting models.
Benefits
1. Your data never leaves your system
SortStream never sends your data via the internet with all processing happening on your computer, including model training. SortStream doesn’t even require an internet connection to run, with the only internet connection required used to verify the product code. This allows you to retain full ownership of your sensitive data and abide by local sovereignty laws
2. Simple to use and incorporate into existing automation solutions
SortStream uses a simple directory based approach to allow non-technical users to easily adapt their workflows to use SortStream. This allows allows existing automation or RPA tools to use with SortStream without complex APIs.
3. Reliable results
Sortstream provides a confidence when it predicts a document’s type, allowing the user to filter out predictions where the model is uncertain of the results. This confidence level is statistically interpretable as the probability of the document being a certain type, based on the training data used.
Usage
GUI Usage
from sortstream import document_classifier_gui
document_classifier_gui() #opens a user interface.
Usual Usage
from sortstream import document_classifier
sorter = document_classifier()
#data folder defaults to current working directory
sorter.fit(data_folder = "/PATH/TO/ROOT/DATA/FOLDER")
sorter.predict()
Installation:
Use pip to install SortStream and its dependancies.
$ pip install -e git+https://github.com/Gholtes/SortStream.git#egg=sortstream
Requirements
Python 3.7+. A simply method to install Python is to use install anaconda or by installing directly from python.org
Python packages: numpy, sklearn, nltk, requests, PyPDF2
Operation
SortStream uses a directory based approch to labelling training data and outputting predictions.
The training dataset is orgaised as a root folder containing folders for each class or type of document. Each of these subfolders should contain the training examples for each class.
When predict is called, sortstream will gather input documents from a folder and attempt to move them into the approporate class subfolder. These class prediction subfolders are in the 'predictions' folder.
Fit - How to Build a Model:
In order to sort documents, the system needs to learn how to identify the category of each document. To do this, the user needs to provide a number of examples of each type of document the system would be expected to sort.
- Find a few examples of each category (between 10 and 100 is a good sample size)
- For each category, make a folder that contains all of the examples. The name of this folder will be the name of the category.
- Put all of these category example folders in a folder named data which is in the same file location as the program.
- Open the program and click build model or
fit()
The program will learn from the examples in the data folder, and build a model. This model is saved alongside the program file. Do not change the folder name of the saved model.
Each time build model or fit()
is used, a new model is created. If multiple models are present in the folder, the program will use the first one it finds.
Predict - Classifying Documents:
- Place the documents that need to be classified into input_files_to_be_sorted folder. This folder can either be created manually or will be automatically created the first time you press the predict button or call the
predict()
method. - Open the program, use the slider to select a confidence value (see Confidence), and click predict or call the
predict()
method.
The files will be sorted from the input_files_to_be_sorted folder into subfolders in the predictions folder. Each subfolder contains the documents that belong to a specific category.
If the system is not confident in a result, it will move the file to a specific subfolder named NotClassified in the predictions folder. These need to be sorted manually.
Each time predict is used, a log file is saved to the logs folder. This CSV file lists every file classified and includes confidence levels for all categories alongside other logs.
Sharing Models:
The model files, which are named SortStreammodelXXXXX can be shared between users, and will be used if in the same folder as the SortStream.py program file.
Each time build model is used, a new model is created. If multiple models are present in the folder, the program will use the first one it finds.
Confidence:
When classifying documents, the user can select the desired confidence level. This is the probability that the classification made by the program is correct, based on the data that the model was trained on.
The below heuristic can be used:
High confidence:
It is highly unlikely that the program will classify a document incorrectly. However, it is likely that only a few documents will be classified, with the remaining requiring manual sorting.
Low confidence:
It is more likely that the program will classify a document incorrectly, but very few documents will require manual sorting.
Ultimately the best confidence level to use will change based on the use case, so it is worth inspecting the logs files to see what confidence levels the program is producing on your data. It is also worth considering the cost of misclassification relative to the cost of manually sorting documents when choosing the confidence level.
Feature Pipeline
I am excited to be working on a number of additional pieces of functionality to make SortStream even more useful. These include:
- Support for MS Office documents
- Command line interface (CLI) for easy integration into automation solutions
- Advanced user settings to customise and evaluate the model
- Automatic dataset balancing
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file SortStream-1.2.2.tar.gz
.
File metadata
- Download URL: SortStream-1.2.2.tar.gz
- Upload date:
- Size: 11.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.6.0 requests/2.24.0 setuptools/50.3.2 requests-toolbelt/0.9.1 tqdm/4.50.2 CPython/3.7.7
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 10f609a73696455bd2649729e1b4f8b10bcfbf221dff02141fbd86dad9699d72 |
|
MD5 | 22d0aee7d8a24d15cb17d31cfad3ce9c |
|
BLAKE2b-256 | 29110d26917d549a7933a20fd5e642108a1e3b184a5cc0d1d555bdfcbc38625f |