Skip to main content

Cutting-edge experimental spaCy components and features

Project description

spacy-experimental: Cutting-edge experimental spaCy components and features

This package includes experimental components and features for spaCy v3.x, for example model architectures, pipeline components and utilities.

Azure Pipelines pypi Version

Installation

Install with pip:

python -m pip install -U pip setuptools wheel
python -m pip install spacy-experimental

Using spacy-experimental

Components and features may be modified or removed in any release, so always specify the exact version as a package requirement if you're experimenting with a particular component, e.g.:

spacy-experimental==0.147.0

Then you can add the experimental components to your config or import from spacy_experimental:

[components.experimental_edit_tree_lemmatizer]
factory = "experimental_edit_tree_lemmatizer"

Components

Edit tree lemmatizer

[components.experimental_edit_tree_lemmatizer]
factory = "experimental_edit_tree_lemmatizer"
# token attr to use as backoff with the predicted trees are not applicable; null to leave unset
backoff = "orth"
# prune trees that are applied less than this frequency in the training data
min_tree_freq = 2
# whether to overwrite existing lemma annotation
overwrite = false
scorer = {"@scorers":"spacy.lemmatizer_scorer.v1"}
# try to apply at most the k most probable edit trees
top_k = 1

Trainable character-based tokenizers

Two trainable tokenizers represent tokenization as a sequence tagging problem over individual characters and use the existing spaCy tagger and NER architectures to perform the tagging.

In the spaCy pipeline, a simple "pretokenizer" is applied as the pipeline tokenizer to split each doc into individual characters and the trainable tokenizer is a pipeline component that retokenizes the doc. The pretokenizer needs to be configured manually in the config or with spacy.blank():

nlp = spacy.blank(
    "en",
    config={
        "nlp": {
            "tokenizer": {"@tokenizers": "spacy-experimental.char_pretokenizer.v1"}
        }
    },
)

The two tokenizers currently reset any existing tag or entity annotation respectively in the process of retokenizing.

Character-based tagger tokenizer

In the tagger version experimental_char_tagger_tokenizer, the tagging problem is represented internally with character-level tags for token start (T), token internal (I), and outside a token (O). This representation comes from Elephant: Sequence Labeling for Word and Sentence Segmentation (Evang et al., 2013).

This is a sentence.
TIIIOTIOTOTIIIIIIIT

With the option annotate_sents, S replaces T for the first token in each sentence and the component predicts both token and sentence boundaries.

This is a sentence.
SIIIOTIOTOTIIIIIIIT

A config excerpt for experimental_char_tagger_tokenizer:

[nlp]
pipeline = ["experimental_char_tagger_tokenizer"]
tokenizer = {"@tokenizers":"spacy-experimental.char_pretokenizer.v1"}

[components]

[components.experimental_char_tagger_tokenizer]
factory = "experimental_char_tagger_tokenizer"
annotate_sents = true
scorer = {"@scorers":"spacy-experimental.tokenizer_senter_scorer.v1"}

[components.experimental_char_tagger_tokenizer.model]
@architectures = "spacy.Tagger.v1"
nO = null

[components.experimental_char_tagger_tokenizer.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.experimental_char_tagger_tokenizer.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 128
attrs = ["ORTH","LOWER","IS_DIGIT","IS_ALPHA","IS_SPACE","IS_PUNCT"]
rows = [1000,500,50,50,50,50]
include_static_vectors = false

[components.experimental_char_tagger_tokenizer.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 128
depth = 4
window_size = 4
maxout_pieces = 2

Character-based NER tokenizer

In the NER version, each character in a token is part of an entity:

T	B-TOKEN
h	I-TOKEN
i	I-TOKEN
s	I-TOKEN
 	O
i	B-TOKEN
s	I-TOKEN
	O
a	B-TOKEN
 	O
s	B-TOKEN
e	I-TOKEN
n	I-TOKEN
t	I-TOKEN
e	I-TOKEN
n	I-TOKEN
c	I-TOKEN
e	I-TOKEN
.	B-TOKEN

A config excerpt for experimental_char_ner_tokenizer:

[nlp]
pipeline = ["experimental_char_ner_tokenizer"]
tokenizer = {"@tokenizers":"spacy-experimental.char_pretokenizer.v1"}

[components]

[components.experimental_char_ner_tokenizer]
factory = "experimental_char_ner_tokenizer"
scorer = {"@scorers":"spacy-experimental.tokenizer_scorer.v1"}

[components.experimental_char_ner_tokenizer.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.experimental_char_ner_tokenizer.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.experimental_char_ner_tokenizer.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 128
attrs = ["ORTH","LOWER","IS_DIGIT","IS_ALPHA","IS_SPACE","IS_PUNCT"]
rows = [1000,500,50,50,50,50]
include_static_vectors = false

[components.experimental_char_ner_tokenizer.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 128
depth = 4
window_size = 4
maxout_pieces = 2

The NER version does not currently support sentence boundaries, but it would be easy to extend using a B-SENT entity type.

Architectures

None currently.

Other

Tokenizers

  • spacy-experimental.char_pretokenizer.v1: Tokenize a text into individual characters.

Scorers

  • spacy-experimental.tokenizer_scorer.v1: Score tokenization.
  • spacy-experimental.tokenizer_senter_scorer.v1: Score tokenization and sentence segmentation.

Bug reports and issues

Please report bugs in the spaCy issue tracker or open a new thread on the discussion board for other issues.

Older documentation

See the READMEs in earlier tagged versions for details about components in earlier releases.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

spacy-experimental-0.2.0.tar.gz (19.5 kB view hashes)

Uploaded Source

Built Distributions

spacy_experimental-0.2.0-cp310-cp310-win_amd64.whl (274.4 kB view hashes)

Uploaded CPython 3.10 Windows x86-64

spacy_experimental-0.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (285.4 kB view hashes)

Uploaded CPython 3.10 manylinux: glibc 2.17+ x86-64

spacy_experimental-0.2.0-cp310-cp310-macosx_10_9_x86_64.whl (308.4 kB view hashes)

Uploaded CPython 3.10 macOS 10.9+ x86-64

spacy_experimental-0.2.0-cp39-cp39-win_amd64.whl (273.8 kB view hashes)

Uploaded CPython 3.9 Windows x86-64

spacy_experimental-0.2.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (285.7 kB view hashes)

Uploaded CPython 3.9 manylinux: glibc 2.17+ x86-64

spacy_experimental-0.2.0-cp39-cp39-macosx_10_9_x86_64.whl (307.6 kB view hashes)

Uploaded CPython 3.9 macOS 10.9+ x86-64

spacy_experimental-0.2.0-cp38-cp38-win_amd64.whl (277.3 kB view hashes)

Uploaded CPython 3.8 Windows x86-64

spacy_experimental-0.2.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (293.1 kB view hashes)

Uploaded CPython 3.8 manylinux: glibc 2.17+ x86-64

spacy_experimental-0.2.0-cp38-cp38-macosx_10_9_x86_64.whl (300.6 kB view hashes)

Uploaded CPython 3.8 macOS 10.9+ x86-64

spacy_experimental-0.2.0-cp37-cp37m-win_amd64.whl (269.0 kB view hashes)

Uploaded CPython 3.7m Windows x86-64

spacy_experimental-0.2.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (281.5 kB view hashes)

Uploaded CPython 3.7m manylinux: glibc 2.17+ x86-64

spacy_experimental-0.2.0-cp37-cp37m-macosx_10_9_x86_64.whl (295.8 kB view hashes)

Uploaded CPython 3.7m macOS 10.9+ x86-64

spacy_experimental-0.2.0-cp36-cp36m-win_amd64.whl (268.6 kB view hashes)

Uploaded CPython 3.6m Windows x86-64

spacy_experimental-0.2.0-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (281.7 kB view hashes)

Uploaded CPython 3.6m manylinux: glibc 2.17+ x86-64

spacy_experimental-0.2.0-cp36-cp36m-macosx_10_9_x86_64.whl (297.4 kB view hashes)

Uploaded CPython 3.6m macOS 10.9+ x86-64

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page