Adapted Stanford NLP Python Library with improvements for specific languages.
A CLASSLA Fork of Stanza For Processing Slovene, Croatian, Serbian and Bulgarian
Description
This pipeline allows for processing of Slovene, Croatian, Serbian and Bulgarian on the levels of
- tokenization and sentence splitting
- part-of-speech tagging
- lemmatization
- dependency parsing
- named entity recognition
Installation
pip
We recommend that you install CLASSLA via pip, the Python package manager. To install, run:
pip install classla
This will also resolve all dependencies.
Running CLASSLA
Getting started
To run the CLASSLA pipeline for the first time, follow these steps:
>>> import classla
>>> classla.download('sl') # download models for Slovene, use hr for Croatian, sr for Serbian, bg for Bulgarian
>>> nlp = classla.Pipeline('sl') # initialize the default Slovene pipeline, use hr for Croatian, sr for Serbian, bg for Bulgarian
>>> doc = nlp("France Prešeren je rojen v Vrbi.") # run the pipeline
>>> print(doc.conll_file.conll_as_string()) # print the output in CoNLL-U format
# newpar id = 1
# sent_id = 1.1
# text = France Prešeren je rojen v Vrbi.
1 France France PROPN Npmsn Case=Nom|Gender=Masc|Number=Sing 4 nsubj _ NER=B-per
2 Prešeren Prešeren PROPN Npmsn Case=Nom|Gender=Masc|Number=Sing 1 flat_name _ NER=I-per
3 je biti AUX Va-r3s-n Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Pres|VerbForm=Fin 4 cop _ NER=O
4 rojen rojen ADJ Appmsnn Case=Nom|Definite=Ind|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part 0 root _ NER=O
5 v v ADP Sl Case=Loc 6 case _ NER=O
6 Vrbi Vrba PROPN Npfsl Case=Loc|Gender=Fem|Number=Sing 4 obl _ NER=B-loc|SpaceAfter=No
7 . . PUNCT Z _ 4 punct _ NER=O
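In CoNLL-U, each token line carries ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC); the output above renders the tabs as spaces. As a minimal sketch, independent of the CLASSLA API, such a string can be parsed into plain Python structures like this:

```python
def parse_conllu(conllu: str):
    """Parse a CoNLL-U string into a list of sentences,
    each a list of dicts keyed by column name."""
    columns = ["id", "form", "lemma", "upos", "xpos",
               "feats", "head", "deprel", "deps", "misc"]
    sentences, current = [], []
    for line in conllu.strip().splitlines():
        if line.startswith("#"):     # comment lines (newpar id, sent_id, text)
            continue
        if not line.strip():         # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(dict(zip(columns, line.split("\t"))))
    if current:                      # flush the last sentence
        sentences.append(current)
    return sentences

sample = "1\tFrance\tFrance\tPROPN\tNpmsn\t_\t4\tnsubj\t_\tNER=B-per\n"
parsed = parse_conllu(sample)
print(parsed[0][0]["lemma"])  # France
```

This is generic CoNLL-U handling, not a CLASSLA function; for serious work a dedicated reader such as the conllu package is preferable.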
You can also consult the pipeline_demo.py file for usage examples.
Processors
The CLASSLA pipeline is built from multiple units called processors. By default CLASSLA runs the tokenize, ner, pos, lemma and depparse processors.
You can specify which processors CLASSLA should run via the processors attribute, as in the following example, which performs tokenization, named entity recognition, part-of-speech tagging and lemmatization:
>>> nlp = classla.Pipeline('sl', processors='tokenize,ner,pos,lemma')
Another popular option might be to perform tokenization, part-of-speech tagging, lemmatization and dependency parsing.
>>> nlp = classla.Pipeline('sl', processors='tokenize,pos,lemma,depparse')
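Not every combination of processors is valid, because each processor depends on the output of earlier ones, as described in the sections below. A hypothetical helper, not part of the CLASSLA API, that checks a processors string against those documented dependencies before constructing a pipeline:

```python
# Hypothetical validator (not part of classla): each processor's
# prerequisites, as documented for the CLASSLA pipeline.
REQUIRES = {
    "tokenize": set(),
    "ner": {"tokenize"},
    "pos": {"tokenize"},
    "lemma": {"tokenize", "pos"},
    "depparse": {"tokenize", "pos"},
}

def check_processors(processors: str):
    """Return a list of error messages; an empty list means the combination is valid."""
    selected = {p.strip() for p in processors.split(",")}
    errors = []
    for proc in sorted(selected):
        if proc not in REQUIRES:
            errors.append(f"unknown processor: {proc}")
            continue
        missing = REQUIRES[proc] - selected
        if missing:
            errors.append(f"{proc} requires {sorted(missing)}")
    return errors

print(check_processors("tokenize,pos,lemma,depparse"))  # []
print(check_processors("lemma"))  # reports that lemma requires pos and tokenize
```

CLASSLA itself reports such problems at pipeline construction time; this sketch only makes the dependency rules from the following sections explicit.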
Tokenization and sentence splitting
The tokenization and sentence splitting processor tokenize is the first processor and is required for any further processing.
If you already have tokenized text, separate the tokens with spaces and pass the attribute tokenize_pretokenized=True.
By default CLASSLA uses a rule-based tokenizer, reldi-tokeniser.
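A small sketch of preparing input for tokenize_pretokenized=True. The space-separated tokens follow the description above; the one-sentence-per-line convention is an assumption carried over from Stanza-style pipelines:

```python
# Sketch: assembling pre-tokenized input for tokenize_pretokenized=True.
# Assumption (Stanza convention): one sentence per line, tokens space-separated.
sentences = [
    ["France", "Prešeren", "je", "rojen", "v", "Vrbi", "."],
]
pretokenized_text = "\n".join(" ".join(tokens) for tokens in sentences)
print(pretokenized_text)
# The resulting string would then be passed to a pipeline created with
# classla.Pipeline('sl', tokenize_pretokenized=True).
```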
Part-of-speech tagging
The POS tagging processor pos produces output containing morphosyntactic descriptions following the MULTEXT-East standard, as well as universal part-of-speech tags and universal features following the Universal Dependencies standard. It requires the tokenize processor.
Lemmatization
The lemmatization processor lemma produces lemmas (basic forms) for each token in the input. It requires both the tokenize and pos processors.
Dependency parsing
The dependency parsing processor depparse performs syntactic dependency parsing of sentences following the Universal Dependencies formalism. It requires the tokenize and pos processors.
Named entity recognition
The named entity recognition processor ner identifies named entities in text following the IOB2 format. It requires only the tokenize processor.
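In the IOB2 format, each token is tagged B-<type> at the start of an entity, I-<type> inside one, and O otherwise, as seen in the NER=B-per, NER=I-per and NER=O annotations in the example output earlier. A minimal sketch, independent of CLASSLA, of collecting entity spans from token/tag pairs:

```python
def iob2_spans(tokens, tags):
    """Collect (entity_text, entity_type) pairs from IOB2-tagged tokens."""
    spans, current, etype = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # B- always starts a new entity
            if current:
                spans.append((" ".join(current), etype))
            current, etype = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == etype:
            current.append(token)         # continuation of the open entity
        else:                             # "O" or inconsistent I- closes it
            if current:
                spans.append((" ".join(current), etype))
            current, etype = [], None
    if current:                           # flush a sentence-final entity
        spans.append((" ".join(current), etype))
    return spans

tokens = ["France", "Prešeren", "je", "rojen", "v", "Vrbi", "."]
tags = ["B-per", "I-per", "O", "O", "O", "B-loc", "O"]
print(iob2_spans(tokens, tags))  # [('France Prešeren', 'per'), ('Vrbi', 'loc')]
```

The entity type names here (per, loc) are taken from the example output above.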