HuSpaCy: industrial strength Hungarian natural language processing

HuSpaCy is a spaCy library providing industrial-strength Hungarian language processing facilities through spaCy models. The released pipelines consist of a tokenizer, sentence splitter, lemmatizer, tagger (which also predicts morphological features), dependency parser and a named entity recognition module. Word and phrase embeddings are also available through spaCy's API. All models have high throughput, decent memory usage and close to state-of-the-art accuracy. A live demo is available on Hugging Face Spaces, and model releases are published to Hugging Face Hub.

This repository contains material to build HuSpaCy and all of its models in a reproducible way.

Installation

To get started, first download one of the models. The easiest way is to install huspacy from PyPI and then fetch a model through its API.

pip install huspacy

import huspacy

# Download the latest CPU optimized model
huspacy.download()

Install the models directly

You can install the latest models directly from 🤗 Hugging Face Hub:

CPU optimized large model: pip install hu_core_news_lg@https://huggingface.co/huspacy/hu_core_news_lg/resolve/main/hu_core_news_lg-any-py3-none-any.whl
GPU optimized transformers model: pip install hu_core_news_trf@https://huggingface.co/huspacy/hu_core_news_trf/resolve/main/hu_core_news_trf-any-py3-none-any.whl

To speed up inference on GPU, CUDA must be installed as described in https://spacy.io/usage.

Quickstart

HuSpaCy is fully compatible with spaCy's API, so newcomers can easily get started with the spaCy 101 guide.

Although HuSpaCy models can be loaded with spacy.load(...), the tool also provides convenience methods for accessing downloaded models.

# Load the model using spacy.load(...)
import spacy
nlp = spacy.load("hu_core_news_lg")

# Load the default large model (if downloaded)
import huspacy
nlp = huspacy.load()

# Load the model directly as a module
import hu_core_news_lg
nlp = hu_core_news_lg.load()

To process texts, simply call the loaded model (i.e. the nlp callable object):

doc = nlp("Csiribiri csiribiri zabszalma - négy csillag közt alszom ma.")

As HuSpaCy is built on spaCy, the returned doc object contains all the annotations produced by the pipeline components.

API documentation is available on our website.
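As a minimal sketch of reading those annotations, the helpers below collect per-token and per-entity fields from a processed document. The attributes used (text, lemma_, pos_, morph, dep_, head, ents, label_) are standard spaCy Token, Span and Doc attributes; the helper names token_rows and entity_rows are hypothetical and not part of HuSpaCy.

```python
from typing import Iterable, List, Tuple


def token_rows(doc: Iterable) -> List[Tuple[str, str, str, str, str, str]]:
    """Collect (text, lemma, POS, morphology, dependency label, head text) per token."""
    return [
        (t.text, t.lemma_, t.pos_, str(t.morph), t.dep_, t.head.text)
        for t in doc
    ]


def entity_rows(doc) -> List[Tuple[str, str]]:
    """Collect (text, label) for each named entity span in the document."""
    return [(ent.text, ent.label_) for ent in doc.ents]


# Usage, assuming spaCy and the hu_core_news_lg model are installed:
#
#   import spacy
#   nlp = spacy.load("hu_core_news_lg")
#   doc = nlp("Csiribiri csiribiri zabszalma - négy csillag közt alszom ma.")
#   for row in token_rows(doc):
#       print(*row, sep="\t")
#   print(entity_rows(doc))
```

The helpers only rely on attribute access, so they work with any spaCy pipeline that sets these annotations, not just the Hungarian ones.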

Models overview

We provide several pretrained models:

hu_core_news_lg is a CNN-based large model which achieves a good balance between accuracy and processing speed. This default model provides tokenization, sentence splitting, part-of-speech tagging (UD labels with detailed morphosyntactic features), lemmatization, dependency parsing and named entity recognition, and ships with pretrained word vectors.
hu_core_news_trf is built on huBERT and provides the same functionality as the large model, except for the word vectors. It offers much higher accuracy at the cost of increased computational resource usage. We suggest using it with GPU support.
hu_core_news_md greatly improves on hu_core_news_lg's throughput by losing some accuracy. This model could be a good choice when processing speed is crucial.
hu_core_news_trf_xl is an experimental model built on XLM-RoBERTa-large. It provides the same functionality as the hu_core_news_trf model, but offers slightly higher accuracy at the cost of significantly increased computational resource usage. We suggest using it with GPU support.

HuSpaCy's model versions follow spaCy's versioning scheme.
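In spaCy's scheme, a package version a.b.c encodes the spaCy major version (a), the spaCy minor version (b) and the pipeline version (c). The sketch below shows how that convention could be checked by hand; the function names and the version strings in the usage comment are made up for illustration, and spaCy's own `python -m spacy validate` command is the more reliable way to check installed pipelines.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split an 'a.b.c' version string into integers, ignoring any suffix after c."""
    major, minor, rest = version.split(".", 2)
    # Keep only the leading digits of the third part, so '0rc1' still parses as 0.
    digits = ""
    for ch in rest:
        if not ch.isdigit():
            break
        digits += ch
    return int(major), int(minor), int(digits or 0)


def is_compatible(model_version: str, spacy_version: str) -> bool:
    """Return True if the model's a.b matches spaCy's major.minor version."""
    return parse_version(model_version)[:2] == parse_version(spacy_version)[:2]


# Usage with illustrative version strings:
#   is_compatible("3.7.0", "3.7.5")  -> same major.minor, compatible
#   is_compatible("3.5.2", "3.7.5")  -> different minor, not compatible
```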

A demo of the models is available at Hugging Face Spaces.

To read more about the models' architecture, we suggest reading the relevant sections of spaCy's documentation.

Comparison

Models	`md`	`lg`	`trf`	`trf_xl`
Embeddings	100d floret	300d floret	transformer: `huBERT`	transformer: `XLM-RoBERTa-large`
Target hardware	CPU	CPU	GPU	GPU
Accuracy	⭑⭑⭑⭒	⭑⭑⭑⭑	⭑⭑⭑⭑⭒	⭑⭑⭑⭑⭑
Resource usage	⭑⭑⭑⭑⭑	⭑⭑⭑⭑	⭑⭑	⭒
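The table can be turned into a small lookup for choosing a pipeline. This is only a sketch of one possible rule of thumb: the ModelInfo class, the MODELS mapping and the pick_model function are hypothetical helpers, not part of HuSpaCy, and the values are copied from the table above.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ModelInfo:
    package: str
    embeddings: str
    target_hardware: str


# Values copied from the comparison table.
MODELS: Dict[str, ModelInfo] = {
    "md": ModelInfo("hu_core_news_md", "100d floret", "CPU"),
    "lg": ModelInfo("hu_core_news_lg", "300d floret", "CPU"),
    "trf": ModelInfo("hu_core_news_trf", "huBERT", "GPU"),
    "trf_xl": ModelInfo("hu_core_news_trf_xl", "XLM-RoBERTa-large", "GPU"),
}


def pick_model(has_gpu: bool, prefer_speed: bool = False) -> str:
    """Hypothetical rule of thumb: return a package name for the given constraints."""
    if has_gpu:
        # trf_xl is described as experimental, so this sketch never picks it by default.
        return MODELS["trf"].package
    # On CPU, trade some accuracy for throughput only when speed is the priority.
    return MODELS["md" if prefer_speed else "lg"].package
```

The returned package name can be passed to spacy.load(...) once the corresponding model is installed.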

Citation

If you use HuSpaCy or any of its models, please cite it as:

@InProceedings{HuSpaCy:2023,
    author = {Orosz, Gy{\"o}rgy and Szab{\'o}, Gerg{\H{o}} and Berkecz, P{\'e}ter and Sz{\'a}nt{\'o}, Zsolt and Farkas, Rich{\'a}rd},
    editor = {Ek{\v{s}}tein, Kamil and P{\'a}rtl, Franti{\v{s}}ek and Konop{\'i}k, Miloslav},
    title = {{Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines}},
    booktitle = {{Text, Speech, and Dialogue}},
    year = "2023",
    publisher = {{Springer Nature Switzerland}},
    address = {{Cham}},
    pages = "58--69",
    isbn = "978-3-031-40498-6"
}

@InProceedings{HuSpaCy:2021,
  title = {{HuSpaCy: an industrial-strength Hungarian natural language processing toolkit}},
  booktitle = {{XVIII. Magyar Sz{\'a}m{\'\i}t{\'o}g{\'e}pes Nyelv{\'e}szeti Konferencia}},
  author = {Orosz, Gy{\"o}rgy and Sz{\' a}nt{\' o}, Zsolt and Berkecz, P{\' e}ter and Szab{\' o}, Gerg{\H o} and Farkas, Rich{\' a}rd},
  location = {{Szeged}},
  pages = "59--73",
  year = {2022},
}

Contact

For feature requests, issues and bugs please use the GitHub Issue Tracker. Otherwise, reach out to us in the Discussion Forum.

Authors

HuSpaCy is developed by the SzegedAI team, coordinated by Orosz György, within the MILAB program of the Hungarian AI National Laboratory.

License

This library is released under the Apache 2.0 License.

Trained models have their own license (CC BY-SA 4.0) as described on the models page.
