Use fast UDPipe models directly in spaCy

spaCy + UDPipe

This package wraps the fast and efficient UDPipe language-agnostic NLP pipeline (via its Python bindings), so you can use UDPipe pre-trained models as a spaCy pipeline for 50+ languages out-of-the-box. Inspired by spacy-stanza, this package offers slightly less accurate models that are in turn much faster (see benchmarks for UDPipe and Stanza).

Installation

Use the package manager pip to install spacy-udpipe.

pip install spacy-udpipe

After installation, use spacy_udpipe.download() to download the pre-trained model for the desired language.

Usage

The loaded UDPipeLanguage object is a spaCy Language object, i.e., the object you use to process text and create a Doc object.

import spacy_udpipe

spacy_udpipe.download("en") # download English model

text = "Wikipedia is a free online encyclopedia, created and edited by volunteers around the world."
nlp = spacy_udpipe.load("en")

doc = nlp(text)
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)

As all attributes are computed once and set in the custom Tokenizer, the Language.pipeline is empty.

Loading a custom model

The following code snippet demonstrates how to load a custom UDPipe model (for the Croatian language):

import spacy_udpipe

nlp = spacy_udpipe.load_from_path(lang="hr",
                                  path="./custom_croatian.udpipe",
                                  meta={"description": "Custom 'hr' model"})
text = "Wikipedija je enciklopedija slobodnog sadržaja."

doc = nlp(text)
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)

This can be done for any of the languages supported by spaCy. For an exhaustive list, see spaCy languages.

Authors and acknowledgment

Created by Antonio Šajatović during an internship at the Text Analysis and Knowledge Engineering Lab (TakeLab).

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update the tests as appropriate. Tests are run automatically for each pull request on the master branch. To run the tests locally, first install the package in editable mode (pip install -e .) and then run pytest from the root source directory.

License

MIT © Text Analysis and Knowledge Engineering Lab (TakeLab)

Project status

Maintained by Text Analysis and Knowledge Engineering Lab (TakeLab).

Notes

  • All available pre-trained models are licensed under CC BY-NC-SA 4.0.

  • A full list of pre-trained models for supported languages is available in languages.json.

  • This package exposes a spacy_languages entry point in its setup.py, so full support for serialization is enabled:

    nlp = spacy_udpipe.load("en")
    nlp.to_disk("./udpipe-spacy-model")
    

    To properly load a saved model, you must pass the udpipe_model argument when loading it:

    import spacy
    import spacy_udpipe
    udpipe_model = spacy_udpipe.UDPipeModel("en")
    nlp = spacy.load("./udpipe-spacy-model", udpipe_model=udpipe_model)
    
  • Known possible issues:

    • Tag map

      Token.tag_ is a CoNLL XPOS tag (language-specific part-of-speech tag), defined for each language separately by the corresponding Universal Dependencies treebank. Mappings between XPOS and Universal Dependencies POS tags should be defined in a TAG_MAP dictionary (located in language-specific tag_map.py files), along with optional morphological features. See spaCy tag map for more details.

    • Syntax iterators

      In order to extract Doc.noun_chunks, a proper syntax iterator implementation for the language of interest is required. For more details, please see spaCy syntax iterators.

    • Other language-specific issues

      A quick way to check language-specific defaults in spaCy is to visit spaCy language support. Also, please see spaCy language data for details regarding other language-specific data.
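
The tag map issue described above can be pictured with a small, self-contained sketch. It is not the contents of any actual tag_map.py file: the dictionary uses plain strings where spaCy's language data would normally use symbol constants, the four entries are a hand-picked sample of Penn Treebank XPOS tags with Universal Dependencies features, and xpos_to_upos and morph_features are hypothetical helpers introduced only for illustration.

```python
# Illustrative only: a tiny tag map in the general shape described above.
# Keys are language-specific XPOS tags; each value holds the Universal
# Dependencies POS tag plus optional morphological features.
# Assumption: plain strings stand in for spaCy's symbol constants.
TAG_MAP = {
    "NN": {"pos": "NOUN", "Number": "Sing"},
    "NNS": {"pos": "NOUN", "Number": "Plur"},
    "VBD": {"pos": "VERB", "Tense": "Past", "VerbForm": "Fin"},
    "JJ": {"pos": "ADJ", "Degree": "Pos"},
}


def xpos_to_upos(xpos, tag_map=TAG_MAP, default="X"):
    """Return the UPOS tag mapped to an XPOS tag, or `default` if unmapped."""
    return tag_map.get(xpos, {}).get("pos", default)


def morph_features(xpos, tag_map=TAG_MAP):
    """Return the morphological features of an XPOS tag, without the POS entry."""
    entry = tag_map.get(xpos, {})
    return {key: value for key, value in entry.items() if key != "pos"}
```

An XPOS tag with no entry falls back to the default UPOS value in this sketch, which is one way to see why a missing or incomplete tag map is listed as a possible issue.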
