Parser for entity/address free text (based on libpostal/spacy)
Payment text parser
Description
Takes a text string and parses entity/address free text in order to:
- Flag entity fields ('ORG' for companies, 'PER' for individuals, 'PER_ORG' when the decision is uncertain) <= Based on Spacy
- Flag address components ('house number', 'town', 'country', ...) <= Based on Libpostal
- Flag other fields (i.e. neither entity nor address) with POS tags ('NE', 'ADJ', 'NN', ...) <= Based on CoreNLP
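To make the address-component labels concrete, here is a minimal sketch that groups Libpostal's raw output by label. It assumes the `postal` binding is installed, whose `postal.parser.parse_address` returns a list of `(value, label)` pairs. The helper names `group_components` and `parse_with_libpostal` are hypothetical and not part of this package, and Libpostal's own label names (for example `house_number`, `city`) may differ from the names this package reports.

```python
from collections import defaultdict


def group_components(pairs):
    """Group (value, label) pairs, as returned by Libpostal, by label."""
    grouped = defaultdict(list)
    for value, label in pairs:
        grouped[label].append(value)
    return dict(grouped)


def parse_with_libpostal(text):
    # Imported lazily so group_components works without Libpostal installed.
    from postal.parser import parse_address
    return group_components(parse_address(text))
```

Keeping a list per label is a deliberate choice here: free text can contain the same component type more than once, and a plain dict of strings would silently drop the duplicates.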
More generally, the package includes the following features:
- A data generator for entity/address fields and free text fields (based on open data)
- A model distinguishing entity/address field from free text (using Keras/TensorFlow and CoreNLP) to apply dedicated heuristics
- A series of cleaning and postprocessing steps, including true case recognition (CoreNLP)
- A parser of entity/address/other fields as described above, using a re-trained vanilla Spacy model (based on labeled open data)
- Simple heuristics and metrics applied after the parsing to improve accuracy
- Part-of-speech (POS) tagging of the remaining flags (Spacy and/or CoreNLP) for downstream processing
This package is specifically intended to be used together with the upstream Swiftflow pipeline that parses all fields from the SWIFT MT messages, including the entity/address and free text fields, which are decisive for inter-banking transactional communication.
Installation
The package relies mainly on Libpostal and Spacy. It also uses Keras on TensorFlow to recognize whether the text input is free text or entity/address text.
Prerequisite: Libpostal
Refer to the Libpostal installation instructions (https://github.com/openvenues/libpostal).
Once Libpostal is installed, the Python binding postal will be installed as part of the package with pip (see below).
Payment_text_parser
The other dependencies, including Spacy, are installed via pip together with the present package:
Create environment
Python 3.7 is recommended.
Native Python:
/usr/local/bin/python3 -m venv <my_env>
source <my_env>/bin/activate
Conda:
conda create --name <my_env> python=3.7
conda activate <my_env>
From pip
pip install payment-text-parser --use-feature=2020-resolver
python -m spacy download de_core_news_sm
From git
pip install git+https://gitlab.com/alpina-analytics/payment_text_parser.git
python -m spacy download de_core_news_sm
From requirements.txt
git clone https://gitlab.com/alpina-analytics/payment_text_parser.git
cd payment_text_parser
pip install -r requirements.txt
python -m spacy download de_core_news_sm
export PYTHONPATH=$(pwd)
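After any of the installation routes above, a quick sanity check can confirm that the main pieces are importable. This is only a sketch: `missing_modules` is a hypothetical helper, and the module names in `required` are taken from this README (`postal` for the Libpostal binding, `de_core_news_sm` for the Spacy model, `payment_text_parser` for the package itself).

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Top-level module names assumed from this README.
    required = ["postal", "spacy", "de_core_news_sm", "payment_text_parser"]
    missing = missing_modules(required)
    print("Missing:", ", ".join(missing) if missing else "none")
```

`find_spec` only locates the modules without importing them, so the check stays fast and does not load the Spacy model or the Libpostal data.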
Usage
Script
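from payment_text_parser.entity_extractor.entity_extractor import ExtractorClass

text = "John Deere Les Abues 2 75000 Paris"  # sample input, same as in the webserver test
e = ExtractorClass(text)
d_res = e.d_res  # parsing result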
Webserver
# Launch
python main.py
# Test
curl -H "Content-type: application/json" -X POST http://127.0.0.1:5000/parse -d '{"text":"John Deere Les Abues 2 75000 Paris"}'
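The same request can be sent from Python using only the standard library. The sketch below mirrors the curl call (same URL, header, and JSON body). The function names are hypothetical, and the README does not specify the response format, so decoding the reply as JSON is an assumption.

```python
import json
import urllib.request

PARSE_URL = "http://127.0.0.1:5000/parse"  # endpoint from the curl example


def build_parse_request(text, url=PARSE_URL):
    """Build a POST request carrying {"text": ...} as a JSON body."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_remote(text, url=PARSE_URL, timeout=10):
    """Send the request; assumes (not confirmed here) that the reply is JSON."""
    request = build_parse_request(text, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the webserver running, `parse_remote("John Deere Les Abues 2 75000 Paris")` sends the same payload as the curl test.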
Optional: start Stanford NLP server
Required if:
- Field type detection enabled by
ExtractorClass(text,check_field_type=True)
- POS-tagging of rest fields enabled by
ExtractorClass(text,create_nlp_tags_rest_text=True)
If the server is not started, a warning message is shown, but full processing can still take place.
The CoreNLP server can be started as follows:
cd ./core_nlp/stanford-corenlp-full-2018-10-05
java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-german.properties \
-preload tokenize,ssplit,pos,parse \
-status_port 9000 -port 9000 -timeout 15000
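Before enabling the CoreNLP-dependent options, it can help to check whether the server is actually listening. The sketch below tests for an open TCP port (9000, as in the command above) and switches on `check_field_type` and `create_nlp_tags_rest_text` only when it is reachable. The helper names are hypothetical, and an open port does not prove that the service behind it is CoreNLP.

```python
import socket


def corenlp_reachable(host="127.0.0.1", port=9000, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def extractor_options(host="127.0.0.1", port=9000):
    """Enable the CoreNLP-dependent options only when the server answers."""
    up = corenlp_reachable(host, port)
    return {"check_field_type": up, "create_nlp_tags_rest_text": up}
```

The result can then be passed on as `ExtractorClass(text, **extractor_options())`.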
References
Spacy
Libpostal
https://github.com/openvenues/libpostal
CoreNLP
https://stackoverflow.com/questions/33259191/installing-libicu-dev-on-mac
https://stackoverflow.com/questions/50217214/import-error-for-icu-in-mac-and-ubuntu-although-pyicu-is-installed-correctly/50364835#50364835
https://www.khalidalnajjar.com/setup-use-stanford-corenlp-server-python/