Zensols Natural Language Parsing
From the paper DeepZensols: A Deep Learning Natural Language Processing Framework for Experimentation and Reproducibility. This framework wraps the spaCy framework and creates lightweight features in a class hierarchy that reflects the structure of natural language. The motivation is to generate features from the parsed text in an object-oriented fashion that is fast and easy to pickle.
Other features include:
- Parse and normalize a stream of tokens with stop word and punctuation filters, up/down casing, Porter stemming, and others.
- Detached features that are safe and easy to pickle to disk.
- Configuration-driven parsing and token normalization using configuration factories.
- Pretty print functionality for easy natural language feature selection.
- A comprehensive scoring module including the following scoring methods:
  - Rouge
  - Bleu
  - SemEval-2013 Task 9.1
  - Levenshtein distance
  - Exact match
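The token normalization described in the list above can be pictured as a chain of small filters applied to a token stream. The following is a minimal standard-library sketch of that idea only; it does not use or reproduce this package's own normalizers, and the names `STOP_WORDS`, `drop_stop_words`, `drop_punctuation`, `lower_case` and `normalize` are hypothetical.

```python
import string
from typing import Callable, Iterable, List

# Tiny illustrative stop word list (an assumption); real lists are much larger.
STOP_WORDS = frozenset({'a', 'an', 'the', 'of', 'was'})

TokenStep = Callable[[Iterable[str]], Iterable[str]]


def drop_stop_words(tokens: Iterable[str]) -> Iterable[str]:
    """Remove tokens whose lower-cased form is a stop word."""
    return (t for t in tokens if t.lower() not in STOP_WORDS)


def drop_punctuation(tokens: Iterable[str]) -> Iterable[str]:
    """Remove tokens made up entirely of ASCII punctuation."""
    return (t for t in tokens
            if not all(c in string.punctuation for c in t))


def lower_case(tokens: Iterable[str]) -> Iterable[str]:
    """Down-case every remaining token."""
    return (t.lower() for t in tokens)


def normalize(tokens: Iterable[str], steps: List[TokenStep]) -> List[str]:
    """Apply each step in order and collect the resulting stream."""
    for step in steps:
        tokens = step(tokens)
    return list(tokens)


tokens = ['Obama', 'was', 'the', '44th', 'president', 'of', 'the',
          'United', 'States', '.']
normalized = normalize(tokens, [drop_stop_words, drop_punctuation, lower_case])
print(normalized)
# ['obama', '44th', 'president', 'united', 'states']
```

Keeping each step as a function over a stream makes the order of the steps explicit, which is the same property a configuration-driven normalizer would need to expose.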
Documentation
Obtaining / Installing
The library can be installed with pip from the PyPI repository:
pip3 install zensols.nlp
The smallest base spaCy model will automatically be downloaded on first use. You can download other models, such as the medium base model, using the following command:
python -m spacy download en_core_web_md
Usage
A parser using the default configuration can be obtained by:
from zensols.nlp import FeatureDocumentParser
parser: FeatureDocumentParser = FeatureDocumentParser.default_instance()
doc = parser('Obama was the 44th president of the United States.')
for tok in doc.tokens:
    print(tok.norm, tok.pos_, tok.tag_)
print(doc.entities)
output:
Obama PROPN NNP
was AUX VBD
the DET DT
44th ADJ JJ
president NOUN NN
of ADP IN
the United States DET DT
. PUNCT .
(<Obama>, <44th>, <the United States>)
However, minimal effort is needed to configure the parser using a resource library:
from io import StringIO
from zensols.config import ImportIniConfig, ImportConfigFactory
from zensols.nlp import FeatureDocument, FeatureDocumentParser
CONFIG = """
# import the `zensols.nlp` library
[import]
config_file = resource(zensols.nlp): resources/obj.conf
# override the parser to keep only the `ent_` and `tag_` token features
[doc_parser]
token_feature_ids = set: ent_, tag_
"""
if (__name__ == '__main__'):
    fac = ImportConfigFactory(ImportIniConfig(StringIO(CONFIG)))
    doc_parser: FeatureDocumentParser = fac('doc_parser')
    sent = 'He was George Washington and first president of the United States.'
    doc: FeatureDocument = doc_parser(sent)
    for tok in doc.tokens:
        tok.write()
This uses a resource library to source in the configuration from this package so minimal configuration is necessary. More advanced configuration examples are also available.
See the feature documents for more information.
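To make the shape of the configuration above easier to see, here is a rough standard-library approximation that reads the same two sections with `configparser`. It is not how `zensols.config` is implemented: the `parse_set_option` helper is hypothetical, and treating the `set:` prefix as "a comma-separated set of strings" is an assumption read off the example.

```python
from configparser import ConfigParser
from typing import Set

CONFIG = """
[import]
config_file = resource(zensols.nlp): resources/obj.conf

[doc_parser]
token_feature_ids = set: ent_, tag_
"""


def parse_set_option(value: str) -> Set[str]:
    """Hypothetical helper: turn 'set: a, b' into {'a', 'b'}."""
    prefix, _, items = value.partition(':')
    if prefix.strip() != 'set':
        raise ValueError(f'expected a "set:" prefix, got: {value!r}')
    return {item.strip() for item in items.split(',') if item.strip()}


parser = ConfigParser()
parser.read_string(CONFIG)

# the [import] section names the resource file that supplies the defaults
imported = parser['import']['config_file']
# the [doc_parser] section overrides a single option of those defaults
feature_ids = parse_set_option(parser['doc_parser']['token_feature_ids'])

print(imported)
print(sorted(feature_ids))
# resource(zensols.nlp): resources/obj.conf
# ['ent_', 'tag_']
```

The point of the layout is that only the overridden option appears in the local configuration; everything else comes from the imported resource file.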
Scoring
Certain scores in the scoring module need additional Python packages. These are installed with:
pip install -r src/python/requirements-score.txt
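Two of the scoring methods listed earlier, Levenshtein distance and exact match, are simple enough to illustrate without extra packages. The sketch below is a plain dynamic-programming version for intuition only; it is not the scoring module's implementation or API, and the function names `levenshtein` and `exact_match` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions and substitutions needed to turn ``a`` into ``b``."""
    # keep the shorter string as the row to use less memory
    if len(a) < len(b):
        a, b = b, a
    # distances from the empty prefix of ``a`` to every prefix of ``b``
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def exact_match(a: str, b: str) -> bool:
    """Return whether the two strings are identical."""
    return a == b


print(levenshtein('kitten', 'sitting'))  # 3
print(exact_match('Obama', 'Obama'))     # True
```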
Attribution
This project, or example code, uses:
- spaCy for natural language parsing
- msgpack and smart-open for Python disk serialization
- nltk for the Porter stemmer functionality
Citation
If you use this project in your research, please use the following BibTeX entry:
@inproceedings{landes-etal-2023-deepzensols,
title = "{D}eep{Z}ensols: A Deep Learning Natural Language Processing Framework for Experimentation and Reproducibility",
author = "Landes, Paul and
Di Eugenio, Barbara and
Caragea, Cornelia",
editor = "Tan, Liling and
Milajevs, Dmitrijs and
Chauhan, Geeticka and
Gwinnup, Jeremy and
Rippeth, Elijah",
booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
month = dec,
year = "2023",
address = "Singapore, Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.nlposs-1.16",
pages = "141--146"
}
Changelog
An extensive changelog is available here.
Community
Please star this repository and let me know how and where you use this API. Contributions as pull requests, feedback, and any other input are welcome.
License
Copyright (c) 2020 - 2026 Paul Landes