HuSpaCy: industrial strength Hungarian natural language processing
HuSpaCy is a spaCy model and library providing industrial-strength Hungarian language processing facilities. A live demo is available here. This repository contains the material needed to build the models for HuSpaCy.
Installation
To get started with the latest Hungarian model, install huspacy
from PyPI:
pip install huspacy
This should be followed by the model download:
import huspacy
huspacy.download()
Alternatively, one can install the latest models directly from Hugging Face Hub:
pip install https://huggingface.co/huspacy/hu_core_news_lg/resolve/main/hu_core_news_lg-any-py3-none-any.whl
To speed up inference, you might want to run the models on a GPU, for which you need to add CUDA support to spaCy as described here.
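Once spaCy has been installed with CUDA support, the GPU still has to be activated before the model is loaded. The following is a minimal sketch, assuming the standard `spacy.prefer_gpu()` call and the `hu_core_news_lg` package installed as shown above; `load_pipeline` is a hypothetical helper name, not part of huspacy.

```python
def load_pipeline(model_name: str = "hu_core_news_lg", try_gpu: bool = True):
    """Load a spaCy pipeline, activating the GPU first when one is usable.

    Returns an (nlp, on_gpu) pair. `load_pipeline` is an illustrative helper,
    not a function provided by huspacy or spaCy.
    """
    # Imported lazily so that defining this helper does not require spaCy.
    import spacy

    # prefer_gpu() has to run before the pipeline is loaded. It returns False
    # (and spaCy stays on the CPU) when no usable GPU setup is found.
    on_gpu = spacy.prefer_gpu() if try_gpu else False
    nlp = spacy.load(model_name)
    return nlp, on_gpu
```

Calling `nlp, on_gpu = load_pipeline()` then gives a pipeline that runs on the GPU when `on_gpu` is true and silently falls back to the CPU otherwise.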
Usage
# Load the model through huspacy
import huspacy
nlp = huspacy.load()
# Load the model using spacy.load().
import spacy
nlp = spacy.load("hu_core_news_lg")
# Or load the model directly as a module.
import hu_core_news_lg
nlp = hu_core_news_lg.load()
# Either way you get the same model and can start processing your texts.
doc = nlp('Csiribiri csiribiri zabszalma - négy csillag közt alszom ma.')
For a detailed guide on usage, check spaCy's documentation.
Available Models
Currently, we support only a single large model, which offers a good balance between accuracy and speed. You can try out the tool's capabilities in this interactive demo.
hu_core_news_lg
provides tokenization, sentence splitting, part-of-speech tagging (UD labels with detailed morphosyntactic features), lemmatization, dependency parsing, and named entity recognition, and ships with pretrained word vectors.
Changes to the models are recorded in the changelog.
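The annotations that hu_core_news_lg provides are exposed through spaCy's standard token and entity attributes. The sketch below shows one way to collect them; `TokenRow`, `token_rows`, `entity_spans` and `demo` are hypothetical helper names introduced for illustration, and `demo` assumes that huspacy and the model are installed.

```python
from typing import Iterable, List, NamedTuple, Tuple


class TokenRow(NamedTuple):
    """One row of per-token annotations (illustrative container)."""
    text: str
    lemma: str
    pos: str
    morph: str
    dep: str


def token_rows(doc: Iterable) -> List[TokenRow]:
    """Collect text, lemma, UD POS tag, morphology and dependency label."""
    return [
        TokenRow(t.text, t.lemma_, t.pos_, str(t.morph), t.dep_)
        for t in doc
    ]


def entity_spans(doc) -> List[Tuple[str, str]]:
    """Collect (text, label) pairs for the named entities in a Doc."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def demo() -> None:
    # Requires `pip install huspacy` and a downloaded model.
    import huspacy

    nlp = huspacy.load()
    doc = nlp("Csiribiri csiribiri zabszalma - négy csillag közt alszom ma.")
    for row in token_rows(doc):
        print(row)
    print(entity_spans(doc))
    # Sentence splitting is available through doc.sents.
    print([sent.text for sent in doc.sents])
```

Running `demo()` prints one row per token followed by the recognized entities and the sentences; the exact labels depend on the model version.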
Development
Installing requirements
poetry install
will install all the dependencies. For better performance, you might need to reinstall spaCy with GPU support, e.g.
poetry add spacy[cuda92]
will add support for CUDA 9.2
Repository structure
├── .github -- GitHub configuration files
├── data -- Data files
│ ├── external -- External models required to train models (e.g. word vectors)
│ ├── processed -- Processed data ready to feed to spaCy
│ └── raw -- Raw data, mostly corpora as they are obtained from the web
├── hu_core_news_lg -- spaCy 3.x project files for building a model for news texts
│ ├── configs -- spaCy pipeline configuration files
│ ├── project.lock -- Auto-generated project script
│ ├── project.yml -- spaCy 3 project file describing the steps needed to build the model
│ └── README.md -- Instructions on building a model from scratch
├── huspacy -- subproject for the PyPI distributable package
├── tools -- Source package for tools
│ └── cli -- Command line scripts (Python)
├── models -- Trained models and their metadata
├── resources -- Resource files
├── scripts -- Bash scripts
├── tests -- Test files
├── CHANGELOG.md -- Keeps the changelog
├── LICENSE -- License file
├── poetry.lock -- Locked Poetry dependencies file
├── poetry.toml -- Poetry configurations
├── pyproject.toml -- Python project configuration, including dependencies managed with Poetry
└── README.md -- This file
Citing
If you use the models or this library in your research, please cite this paper.
Additionally, please indicate the version of the model you used so that your research can be reproduced.
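One way to find the model version to report is to read it from the loaded pipeline's metadata. This is a minimal sketch, assuming the standard `lang`, `name` and `version` keys of spaCy's `nlp.meta` dictionary; `model_identifier` and `describe_loaded_model` are hypothetical helper names.

```python
from typing import Mapping


def model_identifier(meta: Mapping[str, str]) -> str:
    """Build a 'lang_name-version' string from a spaCy pipeline's meta dict."""
    return f"{meta['lang']}_{meta['name']}-{meta['version']}"


def describe_loaded_model(model_name: str = "hu_core_news_lg") -> str:
    """Load the installed model and return its identifier (needs spaCy)."""
    # Imported lazily so that model_identifier() works without spaCy.
    import spacy

    nlp = spacy.load(model_name)
    return model_identifier(nlp.meta)
```

The string returned by `describe_loaded_model()` can be quoted next to the citation so that others can install the same model version.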
License
This library is released under the Apache 2.0 License. See the LICENSE
file for more details.
The trained models have their own license as described on the models hub.
Contact
For feature requests and bug reports, please use the GitHub Issue Tracker. Otherwise, please use the Discussion Forums.
Acknowledgments
The project was supported by the Ministry of Innovation and Technology NRDI Office within the framework of the Artificial Intelligence National Laboratory Program.