HuSpaCy: industrial strength Hungarian natural language processing
HuSpaCy is a Python library providing industrial-strength Hungarian language processing facilities through spaCy models. The released pipelines consist of a tokenizer, sentence splitter, lemmatizer, tagger (also predicting morphological features), dependency parser and a named entity recognizer. Word and phrase embeddings are also available through spaCy's API. All models offer high throughput, moderate memory usage and close to state-of-the-art accuracy. A live demo is available here; model releases are published to the Hugging Face Hub.
This repository contains material to build HuSpaCy and all of its models in a reproducible way.
Available Models
We provide several pretrained models. hu_core_news_lg is a CNN-based large model that strikes a good balance between accuracy and processing speed.
This default model provides tokenization, sentence splitting, part-of-speech tagging (UD labels with detailed morphosyntactic features), lemmatization, dependency parsing and named entity recognition, and ships with pretrained word vectors.
The second model, hu_core_news_trf, is built on huBERT and provides the same functionality as the large model, except for the word vectors.
It comes with considerably higher accuracy at the price of increased computational resource usage, so we suggest using it with GPU support.
The hu_core_news_md pipeline considerably improves on hu_core_news_lg's throughput by trading off some accuracy. This model can be a good choice when processing speed is crucial.
A demo of these models is available at Hugging Face Spaces.
Models' changes are recorded in the respective changelog files (lg, md, trf, vectors).
Installation
To get started, first download one of the models. The easiest way to do this is to install huspacy (from PyPI) and then fetch a model through its API.
pip install huspacy
import huspacy
# Download the latest CPU optimized model
huspacy.download()
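The same helper can presumably be pointed at a specific pipeline as well; the model-name argument below is an assumption about huspacy's download API rather than documented behaviour, so double-check the API documentation before relying on it.
import huspacy
# Assumed signature: download() takes an optional model name; without arguments it fetches the default CPU optimized model
huspacy.download("hu_core_news_md")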
You can install the latest models directly from the 🤗 Hugging Face Hub:
- CPU optimized large model:
pip install https://huggingface.co/huspacy/hu_core_news_lg/resolve/main/hu_core_news_lg-any-py3-none-any.whl
- GPU optimized transformers model:
pip install https://huggingface.co/huspacy/hu_core_news_trf/resolve/main/hu_core_news_trf-any-py3-none-any.whl
To speed up inference on GPUs, CUDA should be installed as described in https://spacy.io/usage.
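As a rough sketch of the GPU setup (the CUDA extra name depends on your local CUDA version and spaCy release, so treat it as an example), the transformer model can be switched to GPU inference before loading:
pip install "spacy[cuda12x]"  # pick the extra matching your installed CUDA version
import spacy
import hu_core_news_trf
# Request GPU allocation before the pipeline is created; prefer_gpu() silently falls back to CPU
spacy.prefer_gpu()
nlp = hu_core_news_trf.load()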
Usage
HuSpaCy is fully compatible with spaCy's API; newcomers can easily get started with the spaCy 101 guide.
Although HuSpaCy models can be loaded with spacy.load(...), the library also provides convenience methods for easily accessing downloaded models.
# Load the model using spacy.load(...)
import spacy
nlp = spacy.load("hu_core_news_lg")
# Load the default large model (if downloaded)
import huspacy
nlp = huspacy.load()
# Load the model directly as a module
import hu_core_news_lg
nlp = hu_core_news_lg.load()
# Process texts
doc = nlp("Csiribiri csiribiri zabszalma - négy csillag közt alszom ma.")
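After processing, the annotations can be read back through the usual spaCy attributes. The snippet below relies only on generic spaCy APIs; the concrete values naturally depend on which HuSpaCy model is loaded.
# Token-level annotations: lemma, UD part-of-speech, morphological features and dependency relation
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.morph, token.dep_)
# Named entities found by the NER component
for ent in doc.ents:
    print(ent.text, ent.label_)
# Word vectors are exposed for pipelines shipping pretrained embeddings (e.g. hu_core_news_lg)
print(doc[0].vector.shape)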
API documentation is available on our website.
Development
Each model has its own dependency structure managed by poetry. For details, check the models' readmes (lg, trf, vectors).
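Since each model is organized as a spaCy project, a build roughly follows the usual poetry plus spacy project workflow. The commands below are only a sketch: the actual workflow and step names live in each model's project.yml and README.
cd hu_core_news_lg
poetry install                    # install the model-specific dependencies
poetry run spacy project assets   # fetch the assets declared in project.yml
poetry run spacy project run all  # "all" is a hypothetical workflow name; see project.yml for the real ones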
Repository structure
├── .github -- GitHub configuration files
├── hu_core_news_lg -- spaCy 3.x project files for building the large model
│   ├── configs -- spaCy pipeline configuration files
│   ├── meta.json -- Model metadata
│   ├── poetry.lock -- Poetry lock file
│   ├── poetry.toml -- Poetry configs
│   ├── project.lock -- Auto-generated project script
│   ├── project.yml -- spaCy project file describing the steps needed to build the model
│   ├── pyproject.toml -- Python project definition file
│   ├── CHANGELOG.md -- Model changelog
│   └── README.md -- Instructions on building a model from scratch
├── hu_core_news_trf -- spaCy 3.x project files for building the transformer-based model
│   ├── configs -- spaCy pipeline configuration files
│   ├── meta.json -- Model metadata
│   ├── poetry.lock -- Poetry lock file
│   ├── poetry.toml -- Poetry configs
│   ├── project.lock -- Auto-generated project script
│   ├── project.yml -- spaCy project file describing the steps needed to build the model
│   ├── pyproject.toml -- Python project definition file
│   ├── CHANGELOG.md -- Model changelog
│   └── README.md -- Instructions on building a model from scratch
├── hu_vectors_web_lg -- spaCy 3.x project files for building the word vectors
│   ├── configs -- spaCy pipeline configuration files
│   ├── poetry.lock -- Poetry lock file
│   ├── poetry.toml -- Poetry configs
│   ├── project.lock -- Auto-generated project script
│   ├── project.yml -- spaCy project file describing the steps needed to build the model
│   ├── pyproject.toml -- Python project definition file
│   ├── CHANGELOG.md -- Model changelog
│   └── README.md -- Instructions on building a model from scratch
├── huspacy -- Subproject for the PyPI distributable package
│   ├── huspacy -- The huspacy Python package
│   ├── test -- huspacy tests
│   ├── poetry.lock -- Poetry lock file
│   ├── poetry.toml -- Poetry configs
│   ├── pyproject.toml -- Python project definition file
│   ├── CHANGELOG.md -- HuSpaCy changelog
│   └── README.md -> ../README.md
├── scripts -- CLI scripts
├── LICENSE -- License file
└── README.md -- This file
Citing
If you use the models or this library in your research, please cite this paper.
Additionally, please indicate the version of the model you used so that your research can be reproduced.
@misc{HuSpaCy:2021,
  title = {{HuSpaCy: an industrial-strength Hungarian natural language processing toolkit}},
  booktitle = {{XVIII. Magyar Sz{\'a}m{\'\i}t{\'o}g{\'e}pes Nyelv{\'e}szeti Konferencia}},
  author = {Orosz, Gy{\"o}rgy and Sz{\'a}nt{\'o}, Zsolt and Berkecz, P{\'e}ter and Szab{\'o}, Gerg{\H o} and Farkas, Rich{\'a}rd},
  location = {{Szeged}},
  year = {2022},
}
License
This library is released under the Apache 2.0 License.
Trained models have their own license (CC BY-SA 4.0) as described on the models page.
Contact
For feature requests, issues and bugs, please use the GitHub Issue Tracker. Otherwise, please use the Discussion Forums.
Authors
HuSpaCy is developed by the SzegedAI team, coordinated by György Orosz, within the MILAB program of the Hungarian AI National Laboratory.