Cutting-edge experimental spaCy components and features
spacy-experimental: Cutting-edge experimental spaCy components and features
This package includes experimental components and features for spaCy v3.x, for example model architectures, pipeline components and utilities.
Installation
Install with pip:
python -m pip install -U pip setuptools wheel
python -m pip install spacy-experimental
Using spacy-experimental
Components and features may be modified or removed in any release, so always specify the exact version as a package requirement if you're experimenting with a particular component, e.g.:
spacy-experimental==0.147.0
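Because components may change between releases, it can help to check at runtime that the installed version matches the one you pinned. The following is a minimal sketch using only the Python standard library; the helper name is_pinned_version is hypothetical and is not part of spacy-experimental.

```python
from importlib.metadata import PackageNotFoundError, version


def is_pinned_version(package: str, expected: str) -> bool:
    """Return True only if `package` is installed at exactly `expected`."""
    try:
        return version(package) == expected
    except PackageNotFoundError:
        # The package is not installed in this environment.
        return False


# "0.147.0" is the example pin used in the text above.
print(is_pinned_version("spacy-experimental", "0.147.0"))
```

A check like this could run at the start of a training script so that an accidental upgrade is noticed before a component silently changes behavior.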
Then you can add the experimental components to your config or import from spacy_experimental:
[components.experimental_char_ner_tokenizer]
factory = "experimental_char_ner_tokenizer"
Components
Trainable character-based tokenizers
Two trainable tokenizers represent tokenization as a sequence tagging problem over individual characters and use the existing spaCy tagger and NER architectures to perform the tagging.
In the spaCy pipeline, a simple "pretokenizer" is applied as the pipeline tokenizer to split each doc into individual characters and the trainable tokenizer is a pipeline component that retokenizes the doc. The pretokenizer needs to be configured manually in the config or with spacy.blank():
import spacy

nlp = spacy.blank(
    "en",
    config={
        "nlp": {
            "tokenizer": {"@tokenizers": "spacy-experimental.char_pretokenizer.v1"}
        }
    },
)
The tagger-based and NER-based tokenizers currently reset any existing tag or entity annotation, respectively, in the process of retokenizing.
Character-based tagger tokenizer
In the tagger version experimental_char_tagger_tokenizer, the tagging problem is represented internally with character-level tags for token start (T), token internal (I), and outside a token (O). This representation comes from Elephant: Sequence Labeling for Word and Sentence Segmentation (Evang et al., 2013).
This is a sentence.
TIIIOTIOTOTIIIIIIIT
With the option annotate_sents, S replaces T for the first token in each sentence and the component predicts both token and sentence boundaries.
This is a sentence.
SIIIOTIOTOTIIIIIIIT
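The tag scheme above can be sketched in plain Python. This is an illustration of the T/I/O (and S) encoding only, not the component's internal code; the function names encode_tags and decode_tags are hypothetical, and the sketch assumes every token occurs in the text in order.

```python
def encode_tags(text, sentences, annotate_sents=False):
    """Tag each character: T = token start, I = token internal, O = outside.

    `sentences` is a list of sentences, each a list of token strings that
    occur in `text` in order. With annotate_sents=True, S replaces T on the
    first token of each sentence.
    """
    tags = ["O"] * len(text)
    pos = 0
    for sent in sentences:
        for i, token in enumerate(sent):
            start = text.index(token, pos)
            tags[start] = "S" if annotate_sents and i == 0 else "T"
            for j in range(start + 1, start + len(token)):
                tags[j] = "I"
            pos = start + len(token)
    return "".join(tags)


def decode_tags(text, tags):
    """Rebuild the token strings from character-level tags."""
    tokens = []
    in_token = False
    for char, tag in zip(text, tags):
        if tag in ("S", "T"):
            tokens.append(char)
            in_token = True
        elif tag == "I" and in_token:
            tokens[-1] += char
        else:
            # "O" characters, and any "I" without a preceding start, are skipped.
            in_token = False
    return tokens


text = "This is a sentence."
sentences = [["This", "is", "a", "sentence", "."]]
print(encode_tags(text, sentences))                       # TIIIOTIOTOTIIIIIIIT
print(encode_tags(text, sentences, annotate_sents=True))  # SIIIOTIOTOTIIIIIIIT
print(decode_tags(text, "TIIIOTIOTOTIIIIIIIT"))
```

Decoding is the step that corresponds to retokenization: the predicted tags determine where the single-character tokens are merged back into words.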
A config excerpt for experimental_char_tagger_tokenizer:
[nlp]
pipeline = ["experimental_char_tagger_tokenizer"]
tokenizer = {"@tokenizers":"spacy-experimental.char_pretokenizer.v1"}
[components]
[components.experimental_char_tagger_tokenizer]
factory = "experimental_char_tagger_tokenizer"
annotate_sents = true
scorer = {"@scorers":"spacy-experimental.tokenizer_senter_scorer.v1"}
[components.experimental_char_tagger_tokenizer.model]
@architectures = "spacy.Tagger.v1"
nO = null
[components.experimental_char_tagger_tokenizer.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"
[components.experimental_char_tagger_tokenizer.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 128
attrs = ["ORTH","LOWER","IS_DIGIT","IS_ALPHA","IS_SPACE","IS_PUNCT"]
rows = [1000,500,50,50,50,50]
include_static_vectors = false
[components.experimental_char_tagger_tokenizer.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 128
depth = 4
window_size = 4
maxout_pieces = 2
Character-based NER tokenizer
In the NER version, each character in a token is part of an entity:
T B-TOKEN
h I-TOKEN
i I-TOKEN
s I-TOKEN
O
i B-TOKEN
s I-TOKEN
O
a B-TOKEN
O
s B-TOKEN
e I-TOKEN
n I-TOKEN
t I-TOKEN
e I-TOKEN
n I-TOKEN
c I-TOKEN
e I-TOKEN
. B-TOKEN
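The listing above can be reproduced with a short plain-Python sketch. It only illustrates the labeling scheme (B-TOKEN for the first character of a token, I-TOKEN for the rest, O for characters outside any token); the helper name char_ner_labels is hypothetical and the component itself may build its annotation differently.

```python
def char_ner_labels(text, tokens):
    """Pair each character of `text` with a B-TOKEN / I-TOKEN / O label.

    `tokens` must list the token strings in the order they occur in `text`.
    """
    labels = ["O"] * len(text)
    pos = 0
    for token in tokens:
        start = text.index(token, pos)
        labels[start] = "B-TOKEN"
        for j in range(start + 1, start + len(token)):
            labels[j] = "I-TOKEN"
        pos = start + len(token)
    return list(zip(text, labels))


for char, label in char_ner_labels("This is a sentence.", ["This", "is", "a", "sentence", "."]):
    print(repr(char), label)
```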
A config excerpt for experimental_char_ner_tokenizer:
[nlp]
pipeline = ["experimental_char_ner_tokenizer"]
tokenizer = {"@tokenizers":"spacy-experimental.char_pretokenizer.v1"}
[components]
[components.experimental_char_ner_tokenizer]
factory = "experimental_char_ner_tokenizer"
scorer = {"@scorers":"spacy-experimental.tokenizer_scorer.v1"}
[components.experimental_char_ner_tokenizer.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null
[components.experimental_char_ner_tokenizer.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"
[components.experimental_char_ner_tokenizer.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 128
attrs = ["ORTH","LOWER","IS_DIGIT","IS_ALPHA","IS_SPACE","IS_PUNCT"]
rows = [1000,500,50,50,50,50]
include_static_vectors = false
[components.experimental_char_ner_tokenizer.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 128
depth = 4
window_size = 4
maxout_pieces = 2
The NER version does not currently support sentence boundaries, but it would be easy to extend using a B-SENT entity type.
Biaffine parser
A biaffine dependency parser, similar to that proposed in Deep Biaffine Attention for Neural Dependency Parsing (Dozat & Manning, 2016). The parser consists of two parts: an arc predicter and an arc labeler. For example:
[components.experimental_arc_predicter]
factory = "experimental_arc_predicter"
[components.experimental_arc_labeler]
factory = "experimental_arc_labeler"
The arc predicter requires that a previous component (such as senter) sets sentence boundaries during training. Therefore, such a component must be added to annotating_components:
[training]
annotating_components = ["senter"]
The biaffine parser sample project provides an example biaffine parser pipeline.
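To make the idea of biaffine arc scoring more concrete, here is a small plain-Python sketch of the scoring function described by Dozat & Manning: the score for token j being the head of token i is a bilinear term plus a head-only bias term. This is an illustration under simplifying assumptions, not the implementation in this package: the function names and toy vectors are made up, the head is chosen greedily per token, and nothing prevents self-attachment or enforces a valid tree.

```python
def biaffine_arc_scores(heads, deps, U, u):
    """Return scores where scores[i][j] scores token j as the head of token i.

    heads[j] and deps[i] are the head and dependent vectors of each token,
    U is a (head_dim x dep_dim) weight matrix, u is a head bias vector:
        score(i, j) = heads[j] . (U @ deps[i]) + heads[j] . u
    """
    scores = []
    for dep in deps:
        # Project the dependent vector through U once per dependent.
        projected = [sum(w * x for w, x in zip(row, dep)) for row in U]
        row_scores = []
        for head in heads:
            bilinear = sum(h * p for h, p in zip(head, projected))
            bias = sum(h * b for h, b in zip(head, u))
            row_scores.append(bilinear + bias)
        scores.append(row_scores)
    return scores


def predict_heads(scores):
    """Greedy decoding: pick the highest-scoring head for each token."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Toy 2-dimensional vectors for three tokens (made-up numbers).
heads = [[0, 2], [2, 0], [1, 1]]
deps = [[1, 0], [0, 1], [1, 2]]
U = [[2, 0], [0, 1]]
u = [0, 1]

scores = biaffine_arc_scores(heads, deps, U, u)
print(scores)                 # [[2, 4, 3], [4, 0, 2], [6, 4, 5]]
print(predict_heads(scores))  # [1, 0, 0]
```

Scoring arcs separately from labeling them mirrors the two-component split above: one component decides which token is the head, the other assigns the dependency label.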
Span Finder
The SpanFinder is a new experimental component that identifies span boundaries by tagging potential start and end tokens. It's an ML approach to suggest candidate spans with higher precision.
SpanFinder uses the following parameters:
threshold: Probability threshold for predicted spans.
predicted_key: Name of the SpanGroup the predicted spans are saved to.
training_key: Name of the SpanGroup the training spans are read from.
max_length: Max length of the predicted spans. No limit when set to 0. Defaults to 0.
min_length: Min length of the predicted spans. No limit when set to 0. Defaults to 0.
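How these parameters might interact can be sketched in plain Python. The sketch assumes the model yields one start probability and one end probability per token and that every start/end pair above the threshold becomes a candidate; the function name candidate_spans and the exact comparison (>=) are assumptions, not taken from the component's source.

```python
def candidate_spans(start_probs, end_probs, threshold=0.5, min_length=0, max_length=0):
    """Return candidate spans as (start, end) token offsets, end exclusive.

    A span is suggested when both its first token's start probability and its
    last token's end probability reach `threshold`. A value of 0 for
    `min_length` or `max_length` means no limit.
    """
    n = len(start_probs)
    spans = []
    for start in range(n):
        if start_probs[start] < threshold:
            continue
        for end in range(start, n):
            length = end - start + 1
            if min_length and length < min_length:
                continue
            if max_length and length > max_length:
                break
            if end_probs[end] >= threshold:
                spans.append((start, end + 1))
    return spans


start_probs = [0.9, 0.1, 0.6, 0.2]
end_probs = [0.2, 0.8, 0.1, 0.7]
print(candidate_spans(start_probs, end_probs))                # [(0, 2), (0, 4), (2, 4)]
print(candidate_spans(start_probs, end_probs, max_length=2))  # [(0, 2), (2, 4)]
print(candidate_spans(start_probs, end_probs, min_length=3))  # [(0, 4)]
```

Lowering the threshold yields more candidates for the downstream span categorizer to consider, while the length limits prune spans that are implausibly short or long.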
Here is a config excerpt for the SpanFinder together with a SpanCategorizer:
[nlp]
lang = "en"
pipeline = ["tok2vec","span_finder","spancat"]
batch_size = 128
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
[components]
[components.tok2vec]
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["ORTH", "SHAPE"]
rows = [5000, 2500]
include_static_vectors = false
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3
[components.span_finder]
factory = "experimental_span_finder"
threshold = 0.35
predicted_key = "span_candidates"
training_key = ${vars.spans_key}
min_length = 0
max_length = 0
[components.span_finder.scorer]
@scorers = "spacy-experimental.span_finder_scorer.v1"
predicted_key = ${components.span_finder.predicted_key}
training_key = ${vars.spans_key}
[components.span_finder.model]
@architectures = "spacy-experimental.SpanFinder.v1"
[components.span_finder.model.scorer]
@layers = "spacy.LinearLogistic.v1"
nO=2
[components.span_finder.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
[components.spancat]
factory = "spancat"
max_positive = null
spans_key = ${vars.spans_key}
threshold = 0.5
[components.spancat.model]
@architectures = "spacy.SpanCategorizer.v1"
[components.spancat.model.reducer]
@layers = "spacy.mean_max_reducer.v1"
hidden_size = 128
[components.spancat.model.scorer]
@layers = "spacy.LinearLogistic.v1"
nO = null
nI = null
[components.spancat.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
[components.spancat.suggester]
@misc = "spacy-experimental.span_finder_suggester.v1"
predicted_key = ${components.span_finder.predicted_key}
This package includes a spaCy project which shows how to train and use the SpanFinder together with SpanCategorizer.
Architectures
None currently.
Other
Tokenizers
spacy-experimental.char_pretokenizer.v1: Tokenize a text into individual characters.
Scorers
spacy-experimental.tokenizer_scorer.v1: Score tokenization.
spacy-experimental.tokenizer_senter_scorer.v1: Score tokenization and sentence segmentation.
Misc
Suggester functions for spancat:
Subtree suggester: Uses dependency annotation to suggest tokens with their syntactic descendants.
spacy-experimental.subtree_suggester.v1
spacy-experimental.ngram_subtree_suggester.v1
Chunk suggester: Suggests noun chunks using the noun chunk iterator, which requires POS and dependency annotation.
spacy-experimental.chunk_suggester.v1
spacy-experimental.ngram_chunk_suggester.v1
Sentence suggester: Uses sentence boundaries to suggest sentence spans.
spacy-experimental.sentence_suggester.v1
spacy-experimental.ngram_sentence_suggester.v1
The package also contains a merge_suggesters function which can be used to combine suggestions from multiple suggesters.
Here are two config excerpts for using the subtree suggester with and without the ngram functionality:
[components.spancat.suggester]
@misc = "spacy-experimental.subtree_suggester.v1"
[components.spancat.suggester]
@misc = "spacy-experimental.ngram_subtree_suggester.v1"
sizes = [1, 2, 3]
Note that all the suggester functions are registered in @misc.
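As an illustration of what "tokens with their syntactic descendants" means for the subtree suggester, here is a plain-Python sketch that derives one span per token from a list of head indices. It is not the registered suggester: the function name subtree_spans is hypothetical, the input is assumed to be a valid projective tree where the root points to itself, and the real function works on Doc objects and may suggest additional spans.

```python
def subtree_spans(heads):
    """Return sorted unique (start, end) spans, end exclusive.

    heads[i] is the index of token i's head; the root is its own head.
    Each span covers a token together with all of its descendants.
    """
    n = len(heads)
    left = list(range(n))
    right = list(range(n))
    for i in range(n):
        # Walk from token i up to the root, widening every ancestor's span.
        node = i
        while heads[node] != node:
            node = heads[node]
            left[node] = min(left[node], i)
            right[node] = max(right[node], i)
    return sorted({(left[i], right[i] + 1) for i in range(n)})


# "She reads old books": She -> reads, reads = root, old -> books, books -> reads
print(subtree_spans([1, 1, 3, 1]))  # [(0, 1), (0, 4), (2, 3), (2, 4)]
```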
Bug reports and issues
Please report bugs in the spaCy issue tracker or open a new thread on the discussion board for other issues.
Older documentation
See the READMEs in earlier tagged versions for details about components in earlier releases.