spacy-experimental: Cutting-edge experimental spaCy components and features
This package includes experimental components and features for spaCy v3.x, for example model architectures, pipeline components and utilities.
Installation
Install with pip:
python -m pip install -U pip setuptools wheel
python -m pip install spacy-experimental
Using spacy-experimental
Components and features may be modified or removed in any release, so always specify the exact version as a package requirement if you're experimenting with a particular component, e.g.:
spacy-experimental==0.147.0
Then you can add the experimental components to your config or import from spacy_experimental:
[components.experimental_edit_tree_lemmatizer]
factory = "experimental_edit_tree_lemmatizer"
Components
Edit tree lemmatizer
[components.experimental_edit_tree_lemmatizer]
factory = "experimental_edit_tree_lemmatizer"
# token attr to use as backoff when the predicted trees are not applicable; null to leave unset
backoff = "orth"
# prune trees that are applied fewer than this many times in the training data
min_tree_freq = 2
# whether to overwrite existing lemma annotation
overwrite = false
scorer = {"@scorers":"spacy.lemmatizer_scorer.v1"}
# try to apply at most the k most probable edit trees
top_k = 1
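The interaction between top_k and backoff can be illustrated with a toy sketch. This is not the component's implementation: real edit trees are replaced here by hypothetical suffix-rewrite rules, and the names apply_rule and lemmatize are made up for the illustration. The only behavior taken from the config comments above is that at most the k most probable candidates are tried and that the backoff value is used when none of them applies.

```python
from typing import List, Optional, Tuple

# Hypothetical stand-in for an edit tree: (suffix to strip, suffix to add).
Rule = Tuple[str, str]


def apply_rule(rule: Rule, form: str) -> Optional[str]:
    """Apply a suffix rule, or return None if it is not applicable."""
    old, new = rule
    if not form.endswith(old):
        return None
    return form[: len(form) - len(old)] + new


def lemmatize(
    form: str,
    ranked_rules: List[Rule],
    top_k: int = 1,
    backoff: Optional[str] = None,
) -> Optional[str]:
    """Try at most the top_k best-ranked rules, then fall back to backoff."""
    for rule in ranked_rules[:top_k]:
        lemma = apply_rule(rule, form)
        if lemma is not None:
            return lemma
    return backoff


# Rules are assumed to be sorted from most to least probable.
ranked = [("ing", ""), ("ed", "")]
print(lemmatize("walked", ranked, top_k=1, backoff="walked"))
print(lemmatize("walked", ranked, top_k=2, backoff="walked"))
```

With top_k = 1 only the first rule is tried, it does not apply, and the backoff value is returned unchanged. With top_k = 2 the second rule applies and the suffix is stripped.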
Trainable character-based tokenizers
Two trainable tokenizers represent tokenization as a sequence tagging problem over individual characters and use the existing spaCy tagger and NER architectures to perform the tagging.
In the spaCy pipeline, a simple "pretokenizer" is applied as the pipeline tokenizer to split each doc into individual characters, and the trainable tokenizer is a pipeline component that retokenizes the doc. The pretokenizer needs to be configured manually in the config or with spacy.blank():
import spacy

# Requires spacy-experimental to be installed so that the tokenizer is registered.
nlp = spacy.blank(
    "en",
    config={
        "nlp": {
            "tokenizer": {"@tokenizers": "spacy-experimental.char_pretokenizer.v1"}
        }
    },
)
While retokenizing, the two tokenizers currently reset any existing tag annotation (tagger version) or entity annotation (NER version).
Character-based tagger tokenizer
In the tagger version experimental_char_tagger_tokenizer, the tagging problem is represented internally with character-level tags for token start (T), token internal (I), and outside a token (O). This representation comes from Elephant: Sequence Labeling for Word and Sentence Segmentation (Evang et al., 2013).
This is a sentence.
TIIIOTIOTOTIIIIIIIT
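The mapping between tokens and the T/I/O tags shown above can be sketched in plain Python. This is only an illustration of the representation, not the component's code; the helper names encode_tags and decode_tags are hypothetical, and the encoder assumes the tokens occur in order in the text.

```python
from typing import List, Optional


def encode_tags(text: str, tokens: List[str]) -> str:
    """Tag each character as T (token start), I (token internal) or O (outside)."""
    tags = ["O"] * len(text)
    pos = 0
    for token in tokens:
        # Assumption: tokens appear in order, so search from the last position.
        start = text.index(token, pos)
        tags[start] = "T"
        for i in range(start + 1, start + len(token)):
            tags[i] = "I"
        pos = start + len(token)
    return "".join(tags)


def decode_tags(text: str, tags: str) -> List[str]:
    """Rebuild the tokens from character-level tags."""
    tokens: List[str] = []
    current: Optional[str] = None
    for char, tag in zip(text, tags):
        if tag == "T":
            if current is not None:
                tokens.append(current)
            current = char
        elif tag == "I" and current is not None:
            current += char
        else:
            # O, or an I with no open token, closes the current token.
            if current is not None:
                tokens.append(current)
            current = None
    if current is not None:
        tokens.append(current)
    return tokens


text = "This is a sentence."
tags = encode_tags(text, ["This", "is", "a", "sentence", "."])
print(tags)
print(decode_tags(text, tags))
```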
With the option annotate_sents, S replaces T for the first token in each sentence and the component predicts both token and sentence boundaries.
This is a sentence.
SIIIOTIOTOTIIIIIIIT
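A tag sequence with S can be read back into sentences and tokens as in the following sketch. The function decode_sentences is a hypothetical helper written for this illustration, and treating a leading T as the start of a sentence is an assumption made here to handle sequences that do not begin with S.

```python
from typing import List


def decode_sentences(text: str, tags: str) -> List[List[str]]:
    """Group characters into tokens, and tokens into sentences, from S/T/I/O tags."""
    sentences: List[List[str]] = []
    in_token = False
    for char, tag in zip(text, tags):
        if tag == "S" or (tag == "T" and not sentences):
            # S starts a new sentence; a T before any S also opens one (assumption).
            sentences.append([char])
            in_token = True
        elif tag == "T":
            sentences[-1].append(char)
            in_token = True
        elif tag == "I" and in_token:
            sentences[-1][-1] += char
        else:
            in_token = False
    return sentences


print(decode_sentences("This is a sentence.", "SIIIOTIOTOTIIIIIIIT"))
print(decode_sentences("Hi. Bye.", "SITOSIIT"))
```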
A config excerpt for experimental_char_tagger_tokenizer:
[nlp]
pipeline = ["experimental_char_tagger_tokenizer"]
tokenizer = {"@tokenizers":"spacy-experimental.char_pretokenizer.v1"}
[components]
[components.experimental_char_tagger_tokenizer]
factory = "experimental_char_tagger_tokenizer"
annotate_sents = true
scorer = {"@scorers":"spacy-experimental.tokenizer_senter_scorer.v1"}
[components.experimental_char_tagger_tokenizer.model]
@architectures = "spacy.Tagger.v1"
nO = null
[components.experimental_char_tagger_tokenizer.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"
[components.experimental_char_tagger_tokenizer.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 128
attrs = ["ORTH","LOWER","IS_DIGIT","IS_ALPHA","IS_SPACE","IS_PUNCT"]
rows = [1000,500,50,50,50,50]
include_static_vectors = false
[components.experimental_char_tagger_tokenizer.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 128
depth = 4
window_size = 4
maxout_pieces = 2
Character-based NER tokenizer
In the NER version, each character in a token is part of an entity:
T B-TOKEN
h I-TOKEN
i I-TOKEN
s I-TOKEN
O
i B-TOKEN
s I-TOKEN
O
a B-TOKEN
O
s B-TOKEN
e I-TOKEN
n I-TOKEN
t I-TOKEN
e I-TOKEN
n I-TOKEN
c I-TOKEN
e I-TOKEN
. B-TOKEN
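Reading token boundaries back from such labels amounts to collecting one span per B-TOKEN. The sketch below shows this on plain lists; bio_to_spans is a hypothetical helper for illustration and says nothing about how the component itself retokenizes the doc.

```python
from typing import List, Optional, Tuple


def bio_to_spans(labels: List[str]) -> List[Tuple[int, int]]:
    """Convert per-character B-TOKEN / I-TOKEN / O labels to (start, end) spans."""
    spans: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, label in enumerate(labels):
        if label == "B-TOKEN":
            if start is not None:
                spans.append((start, i))
            start = i
        elif label == "I-TOKEN" and start is not None:
            continue
        else:
            # O, or an I-TOKEN with no open span, closes the current span.
            if start is not None:
                spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans


text = "This is a sentence."
labels = (
    ["B-TOKEN", "I-TOKEN", "I-TOKEN", "I-TOKEN", "O"]
    + ["B-TOKEN", "I-TOKEN", "O"]
    + ["B-TOKEN", "O"]
    + ["B-TOKEN"] + ["I-TOKEN"] * 7
    + ["B-TOKEN"]
)
spans = bio_to_spans(labels)
print(spans)
print([text[start:end] for start, end in spans])
```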
A config excerpt for experimental_char_ner_tokenizer:
[nlp]
pipeline = ["experimental_char_ner_tokenizer"]
tokenizer = {"@tokenizers":"spacy-experimental.char_pretokenizer.v1"}
[components]
[components.experimental_char_ner_tokenizer]
factory = "experimental_char_ner_tokenizer"
scorer = {"@scorers":"spacy-experimental.tokenizer_scorer.v1"}
[components.experimental_char_ner_tokenizer.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null
[components.experimental_char_ner_tokenizer.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"
[components.experimental_char_ner_tokenizer.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 128
attrs = ["ORTH","LOWER","IS_DIGIT","IS_ALPHA","IS_SPACE","IS_PUNCT"]
rows = [1000,500,50,50,50,50]
include_static_vectors = false
[components.experimental_char_ner_tokenizer.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 128
depth = 4
window_size = 4
maxout_pieces = 2
The NER version does not currently support sentence boundaries, but it would be easy to extend using a B-SENT entity type.
Architectures
None currently.
Other
Tokenizers
spacy-experimental.char_pretokenizer.v1: Tokenize a text into individual characters.
Scorers
spacy-experimental.tokenizer_scorer.v1: Score tokenization.
spacy-experimental.tokenizer_senter_scorer.v1: Score tokenization and sentence segmentation.
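As a rough idea of what scoring tokenization involves, one common approach is to compare predicted and gold token spans by their character offsets and compute precision, recall and F-score over exact matches. The sketch below follows that approach as an assumption; it is not taken from the registered scorers, whose details may differ, and the function name prf is hypothetical.

```python
from typing import Set, Tuple

Span = Tuple[int, int]  # (start_char, end_char)


def prf(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Precision, recall and F-score over exactly matching token spans."""
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Gold tokenization of "This is a sentence." versus a prediction that
# fails to split the final period from "sentence".
gold = {(0, 4), (5, 7), (8, 9), (10, 18), (18, 19)}
pred = {(0, 4), (5, 7), (8, 9), (10, 19)}
print(prf(gold, pred))
```

Three of the four predicted spans match the five gold spans, so precision is 0.75 and recall is 0.6 in this example.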
Older documentation
See the READMEs in earlier tagged versions for details about components in earlier releases.