
An adaptation of the Stanford NLP Python library (Stanza) with improvements for specific languages.

Project description

A CLASSLA Fork of Stanza For Processing South Slavic Languages

Installation

pip

We recommend that you install Classla via pip, the Python package manager. To install, run:

pip install classla

This will also resolve all of Classla's dependencies.

Running Classla

Getting started

To run your first Classla pipeline, follow these steps:

>>> import classla
>>> classla.download('sl')                            # to download models in Slovene
>>> nlp = classla.Pipeline('sl')                      # to initialize default Slovene pipeline
>>> doc = nlp("France Prešeren je rojen v Vrbi.")     # to run pipeline
>>> print(doc.conll_file.conll_as_string())           # to print output in conllu format
# newpar id = 1
# sent_id = 1.1
# text = France Prešeren je rojen v Vrbi.
1	France	France	PROPN	Npmsn	Case=Nom|Gender=Masc|Number=Sing	4	nsubj	_	NER=B-per
2	Prešeren	Prešeren	PROPN	Npmsn	Case=Nom|Gender=Masc|Number=Sing	1	flat_name	_	NER=I-per
3	je	biti	AUX	Va-r3s-n	Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Pres|VerbForm=Fin	4	cop	_	NER=O
4	rojen	rojen	ADJ	Appmsnn	Case=Nom|Definite=Ind|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part	0	root	_	NER=O
5	v	v	ADP	Sl	Case=Loc	6	case	_	NER=O
6	Vrbi	Vrba	PROPN	Npfsl	Case=Loc|Gender=Fem|Number=Sing	4	obl	_	NER=B-loc|SpaceAfter=No
7	.	.	PUNCT	Z	_	4	punct	_	NER=O

You can also look at the pipeline_demo.py file for usage examples.
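The CoNLL-U string printed above can be post-processed with plain Python, without Classla installed. A minimal sketch (the `read_tokens` helper and the truncated sample are illustrations, not part of the Classla API):

```python
def read_tokens(conllu: str):
    """Yield (form, lemma, upos) for each token line in a CoNLL-U string."""
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) == 10:  # a well-formed CoNLL-U token line has 10 columns
            yield cols[1], cols[2], cols[3]

# First two token lines of the sample output above.
sample = (
    "# text = France Prešeren je rojen v Vrbi.\n"
    "1\tFrance\tFrance\tPROPN\tNpmsn\t_\t4\tnsubj\t_\tNER=B-per\n"
    "2\tPrešeren\tPrešeren\tPROPN\tNpmsn\t_\t1\tflat_name\t_\tNER=I-per\n"
)
print(list(read_tokens(sample)))
# [('France', 'France', 'PROPN'), ('Prešeren', 'Prešeren', 'PROPN')]
```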

Processors

The Classla pipeline is built from multiple units called processors. By default, Classla runs the tokenize, ner, pos, lemma and depparse processors.

You can specify which processors Classla runs with the processors attribute, as in the following example.

>>> nlp = classla.Pipeline('sl', processors='tokenize,ner,pos,lemma')
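The processor sections below each state which other processors they require. The constraints can be sketched as a small validity check (the `REQUIRES` table and `check_processors` helper are illustrations, not part of Classla):

```python
# Prerequisites of each processor, as described in the sections below.
REQUIRES = {
    "tokenize": set(),
    "ner": {"tokenize"},
    "pos": {"tokenize"},
    "lemma": {"tokenize", "pos"},
    "depparse": {"tokenize", "pos"},
}

def check_processors(spec: str) -> bool:
    """Return True if every requested processor has its prerequisites listed."""
    requested = set(spec.split(","))
    return all(REQUIRES.get(p, set()) <= requested for p in requested)

print(check_processors("tokenize,ner,pos,lemma"))  # True
print(check_processors("lemma"))                   # False: needs tokenize and pos
```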

Tokenization (tokenize)

If your text is already tokenized, separate the tokens (e.g. with spaces) and pass the attribute tokenize_pretokenized=True.

By default, Classla uses a rule-based tokenizer, reldi-tokeniser.

Most important attributes:

tokenize_pretokenized   - [boolean]     bypasses the tokenizer and treats the input as already tokenized
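A sketch of preparing pretokenized input, assuming the Stanza-style convention of one sentence per line with tokens separated by single spaces (verify against the Classla documentation for your version):

```python
# One sentence per line, tokens separated by single spaces.
sentences = [["France", "Prešeren", "je", "rojen", "v", "Vrbi", "."]]
pretokenized = "\n".join(" ".join(tokens) for tokens in sentences)
print(pretokenized)
# The string would then be passed to a pipeline created with
# classla.Pipeline('sl', tokenize_pretokenized=True).
```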

Part-of-speech tagging (pos)

The POS tagging processor produces part-of-speech tags and morphological features as described on the Universal Dependencies website. It is optional and requires the tokenize processor to run beforehand.
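The morphological features appear in the FEATS column of the output, as in `Case=Nom|Gender=Masc|Number=Sing` in the sample above. A minimal sketch of parsing such a string into a dictionary (the `parse_feats` helper is an illustration, not part of the Classla API):

```python
def parse_feats(feats: str) -> dict:
    """Parse a CoNLL-U FEATS string like 'Case=Nom|Number=Sing' into a dict."""
    if feats == "_":
        return {}  # '_' marks an empty feature column
    return dict(pair.split("=", 1) for pair in feats.split("|"))

print(parse_feats("Case=Nom|Gender=Masc|Number=Sing"))
# {'Case': 'Nom', 'Gender': 'Masc', 'Number': 'Sing'}
```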

Lemmatization (lemma)

The lemmatization processor produces a lemma for each word in the input. It requires both the tokenize and pos processors.

Parsing (depparse)

The parsing processor (named depparse in code) creates dependency relations between words, as described on the Universal Dependencies website. It requires the tokenize and pos processors.
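The relations are encoded in the HEAD column of the output: each token points at the index of its head, with 0 marking the root. A sketch of turning the sample parse above into a head-to-dependents map (the `children_by_head` helper is an illustration, not part of the Classla API):

```python
def children_by_head(rows):
    """Build a head-index -> [dependent forms] map from (id, form, head) rows."""
    tree = {}
    for tok_id, form, head in rows:
        tree.setdefault(head, []).append(form)
    return tree

# (id, form, head) triples taken from the sample parse above; head 0 is the root.
rows = [(1, "France", 4), (2, "Prešeren", 1), (3, "je", 4),
        (4, "rojen", 0), (5, "v", 6), (6, "Vrbi", 4), (7, ".", 4)]
print(children_by_head(rows)[0])   # ['rojen'] -- the root of the sentence
```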

NER (ner)

The NER processor finds named entities in the text. It requires the tokenize processor.
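In the output, entities are marked token by token in the MISC column with BIO tags (`NER=B-per`, `NER=I-per`, `NER=O` in the sample above). A sketch of grouping such tags back into entity spans (the `ner_spans` helper is an illustration, not part of the Classla API):

```python
def ner_spans(tokens):
    """Collect (label, text) spans from (form, bio_tag) pairs in BIO scheme."""
    spans, current = [], None
    for form, tag in tokens:
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [form])     # start a new entity span
        elif tag.startswith("I-") and current:
            current[1].append(form)         # continue the current span
        else:
            if current:
                spans.append(current)
            current = None                  # 'O' ends any open span
    if current:
        spans.append(current)
    return [(label, " ".join(parts)) for label, parts in spans]

# NER tags taken from the sample output above.
tokens = [("France", "B-per"), ("Prešeren", "I-per"), ("je", "O"),
          ("rojen", "O"), ("v", "O"), ("Vrbi", "B-loc"), (".", "O")]
print(ner_spans(tokens))
# [('per', 'France Prešeren'), ('loc', 'Vrbi')]
```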

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

classla-0.0.3.tar.gz (154.4 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

classla-0.0.3-py3-none-any.whl (204.9 kB)

Uploaded Python 3

File details

Details for the file classla-0.0.3.tar.gz.

File metadata

  • Download URL: classla-0.0.3.tar.gz
  • Upload date:
  • Size: 154.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.21.0 setuptools/46.4.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.7.1

File hashes

Hashes for classla-0.0.3.tar.gz

  • SHA256: f6adb734a1384288421c048a8fac77937d919833ab7694df7f7568dbb08ff56a
  • MD5: 85b92680f2b5ff6bcc3cdee9b38ce1e4
  • BLAKE2b-256: 6cd123451d9374e9cc04d72b2df793dd65aabe63c2b6ed8e55fe88467c482b61

See more details on using hashes here.

File details

Details for the file classla-0.0.3-py3-none-any.whl.

File metadata

  • Download URL: classla-0.0.3-py3-none-any.whl
  • Upload date:
  • Size: 204.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.21.0 setuptools/46.4.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.7.1

File hashes

Hashes for classla-0.0.3-py3-none-any.whl

  • SHA256: a370fdb5e471a1db4bb7d6f1f2b3b88ec9f363893c367d45103d071627849c43
  • MD5: f60b2ef4a199d86b27170d5e386051bd
  • BLAKE2b-256: e84138334704e6696cf5d46c242ea45ef32bb219daa59e62baf44965bcc51a2e

