A custom pipeline component for spaCy that can convert any parsed Doc and its sentences into CoNLL-U format. Also provides a command line entry point.
Parsing to CoNLL with spaCy, spacy-stanza, and spacy-udpipe
This version (2.1.0) is the last version to support spaCy v2 and spacy-stanfordnlp. New versions will require spaCy v3. spacy-stanza will still be supported.
This module allows you to parse text into CoNLL-U format. You can use it as a command line tool, or embed it in your own scripts by adding it as a custom pipeline component to a spaCy, spacy-stanfordnlp, spacy-stanza, or spacy-udpipe pipeline. It also provides an easy-to-use function to quickly initialize a parser.
Note that the module simply takes a parser’s output and puts it in a formatted string adhering to the linked CoNLL-U format. The output tags depend on the spaCy model used. If you want Universal Dependencies tags as output, I advise you to use this library in combination with spacy-stanza, which is a spaCy interface using stanza and its models behind the scenes. Those models use the Universal Dependencies formalism and yield state-of-the-art performance. stanza is a new and improved version of stanfordnlp. The spaCy wrapper for stanfordnlp, spacy-stanfordnlp, is also supported in this library, but its development has been superseded by the stanza wrapper, so its use is not recommended. As an alternative to the Stanford models, you can use the spaCy wrapper for UDPipe, spacy-udpipe, which is slightly less accurate than stanza but much faster.
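For reference, each token line in CoNLL-U consists of ten tab-separated fields, with an underscore marking an empty value. Here is a minimal, self-contained sketch of reading one such line into a dictionary; the field names and the sample line are illustrative, not the output of any particular model:

```python
# The ten CoNLL-U columns in order; "_" marks an empty value.
CONLL_FIELDS = ("id", "form", "lemma", "upostag", "xpostag",
                "feats", "head", "deprel", "deps", "misc")

def parse_conll_line(line: str) -> dict:
    """Split one tab-separated CoNLL-U token line into a field dictionary."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(CONLL_FIELDS, values))

# Illustrative token line:
token = parse_conll_line("2\tlike\tlike\tVERB\tVBP\t_\t0\tROOT\t_\t_\n")
```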
Installation
By default, this package automatically installs only spaCy and the packaging package as dependencies.
Because spaCy’s models are not necessarily trained on Universal Dependencies conventions, their output labels are not UD either. By using spacy-stanza or spacy-udpipe, we get the easy-to-use interface of spaCy as a wrapper around stanza and UDPipe respectively, including their models that are trained on UD data.
NOTE: spacy-stanfordnlp, spacy-stanza and spacy-udpipe are not installed automatically as a dependency for this library, because it might be too much overhead for those who don’t need UD. If you wish to use their functionality (e.g. better performance, real UD output), you have to install them manually.
If you want to retrieve CoNLL info as a pandas DataFrame, this library will automatically export it if it detects that pandas is installed. See the Usage section for more.
To install the library, simply use pip.
pip install spacy_conll
Usage
When the ConllFormatter is added to a spaCy pipeline, it adds CoNLL properties for Token, sentence Span and Doc objects. Note that arbitrary Spans are not included and do not receive these properties.
On all three of these levels, two custom properties are exposed by default, ._.conll and its string representation ._.conll_str. However, if you have pandas installed, then ._.conll_pd will be added automatically, too!
._.conll: raw CoNLL format
- in a Token: a dictionary containing all the expected CoNLL fields as keys and the parsed properties as values.
- in a sentence Span: a list of its tokens’ ._.conll dictionaries (a list of dictionaries).
- in a Doc: a list of its sentences’ ._.conll lists (a list of lists of dictionaries).
._.conll_str: string representation of the CoNLL format
- in a Token: a tab-separated representation of the contents of the CoNLL fields, ending with a newline.
- in a sentence Span: the expected CoNLL format where each row represents a token. When ConllFormatter(include_headers=True) is used, two header lines are included as well, as per the CoNLL format.
- in a Doc: all its sentences’ ._.conll_str combined and separated by new lines.
._.conll_pd: pandas representation of the CoNLL format
- in a Token: a Series representation of this token’s CoNLL properties.
- in a sentence Span: a DataFrame representation of this sentence, with the CoNLL field names as column headers.
- in a Doc: a concatenation of its sentences’ DataFrames, leading to a new DataFrame whose index is reset.
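The nesting of these properties can be sketched in plain Python (the values below are hypothetical, not produced by an actual parser): a token maps to one field dictionary, a sentence to a list of those dictionaries, and a document to a list of such lists; the string form is simply the tab-joined, newline-terminated rendering of the fields.

```python
# Hypothetical ._.conll values, illustrating the nesting only.
token_conll = {"id": "1", "form": "I", "lemma": "I", "upostag": "PRON",
               "xpostag": "PRP", "feats": "_", "head": "2",
               "deprel": "nsubj", "deps": "_", "misc": "_"}

sentence_conll = [token_conll]   # sentence Span: a list of token dictionaries
doc_conll = [sentence_conll]     # Doc: a list of sentence lists

def token_conll_str(fields: dict) -> str:
    """Tab-separated CoNLL representation of one token, ending with a newline."""
    return "\t".join(fields.values()) + "\n"
```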
You can use spacy_conll in your own Python code as a custom pipeline component, or you can use the built-in command-line script which offers typically needed functionality. See the following section for more.
In Python
This library offers the ConllFormatter
class which serves as a custom spaCy pipeline component. It can be
instantiated as follows.
nlp = <initialise parser>
conllformatter = ConllFormatter(nlp)
nlp.add_pipe(conllformatter, last=True)
Because this library supports different spaCy wrappers (spacy, stanfordnlp, stanza, and udpipe), a
convenience function is available as well. With utils.init_parser
you can easily instantiate a parser with a
single line. You can find the function’s signature below. Have a look at the source code to read more about all the
possible arguments or try out the examples.
NOTE: is_tokenized
does not work for spacy-udpipe and disable_sbd
only works for spacy.
Recently, spacy-udpipe has made a change to allow pretokenized text but it depends on the input format and cannot
be fixed at initialisation of the parser. See release v0.3.0 of spacy-udpipe or this PR. Using
is_tokenized
for spacy-stanfordnlp or spacy-stanza also affects sentence segmentation, effectively
only splitting on new lines.
def init_parser(parser: str = 'spacy',
model_or_lang: str = 'en',
*,
is_tokenized: bool = False,
disable_sbd: bool = False,
parser_opts: Optional[Dict] = None,
**kwargs) -> Language:
For instance, if you want to load a Dutch stanza model in silent mode with the CoNLL formatter already attached,
you can simply use the following snippet. parser_opts
is passed to the stanza pipeline initialisation
automatically. Any other keyword arguments (kwargs), on the other hand, are passed to
the ConllFormatter initialisation.
from spacy_conll import init_parser
nlp = init_parser('stanza', 'nl', parser_opts={'verbose': False})
The ConllFormatter allows you to customize the extension names, and you can also specify conversion maps for the output properties.
To illustrate, here is an advanced example, showing the more complex options:
- ext_names: changes the attribute names to a custom key by using a dictionary.
- conversion_maps: a two-level dictionary that looks like {field_name: {tag_name: replacement}}. In other words, you can specify in which field a certain value should be replaced by another. This is especially useful when you are not satisfied with the tagset of a model and wish to change some tags to an alternative.
The example below shows how to manually add the component; it changes the custom attribute conll_pd to pandas (conll_pd is only available if pandas is installed) and converts any -PRON- lemma to PRON.
import spacy
from spacy_conll import ConllFormatter
nlp = spacy.load('en')
conllformatter = ConllFormatter(nlp,
ext_names={'conll_pd': 'pandas'},
conversion_maps={'lemma': {'-PRON-': 'PRON'}})
nlp.add_pipe(conllformatter, after='parser')
doc = nlp('I like cookies.')
print(doc._.pandas)
This is the same as:
from spacy_conll import init_parser
nlp = init_parser(ext_names={'conll_pd': 'pandas'},
conversion_maps={'lemma': {'-PRON-': 'PRON'}})
doc = nlp('I like cookies.')
print(doc._.pandas)
The snippets above will output a pandas DataFrame by using ._.pandas
rather than the standard
._.conll_pd
, and all occurrences of “-PRON-” in the lemma field are replaced by “PRON”.
id form lemma upostag ... head deprel deps misc
0 1 I PRON PRON ... 2 nsubj _ _
1 2 like like VERB ... 0 ROOT _ _
2 3 cookies cookie NOUN ... 2 dobj _ SpaceAfter=No
3 4 . . PUNCT ... 2 punct _ SpaceAfter=No
[4 rows x 10 columns]
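The conversion-map replacement illustrated above can be sketched in plain Python. This is only an illustration of the {field_name: {tag_name: replacement}} logic, not the library’s internal implementation:

```python
def apply_conversion_map(token_fields: dict, conversion_maps: dict) -> dict:
    """Replace field values according to {field_name: {old_value: new_value}}."""
    converted = dict(token_fields)
    for field, mapping in conversion_maps.items():
        if converted.get(field) in mapping:
            converted[field] = mapping[converted[field]]
    return converted

# Hypothetical token fields, before and after the replacement:
token = {"form": "I", "lemma": "-PRON-", "upostag": "PRON"}
fixed = apply_conversion_map(token, {"lemma": {"-PRON-": "PRON"}})
```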
Command line
Upon installation, a command-line script is added under the alias parse-as-conll. You can use it to parse a
string or file into CoNLL format given a number of options.
> parse-as-conll -h
usage: parse-as-conll [-h] [-f INPUT_FILE] [-a INPUT_ENCODING] [-b INPUT_STR]
[-o OUTPUT_FILE] [-c OUTPUT_ENCODING] [-m MODEL_OR_LANG]
[-s] [-t] [-d] [-e] [-j N_PROCESS]
[-p {spacy,stanfordnlp,stanza,udpipe}] [-v]
Parse an input string or input file to CoNLL-U format using a spaCy-wrapped
parser.
optional arguments:
-h, --help show this help message and exit
-f INPUT_FILE, --input_file INPUT_FILE
Path to file with sentences to parse. Has precedence
over 'input_str'. (default: None)
-a INPUT_ENCODING, --input_encoding INPUT_ENCODING
Encoding of the input file. Default value is system
default. (default: cp1252)
-b INPUT_STR, --input_str INPUT_STR
Input string to parse. (default: None)
-o OUTPUT_FILE, --output_file OUTPUT_FILE
Path to output file. If not specified, the output will
be printed on standard output. (default: None)
-c OUTPUT_ENCODING, --output_encoding OUTPUT_ENCODING
Encoding of the output file. Default value is system
default. (default: cp1252)
-m MODEL_OR_LANG, --model_or_lang MODEL_OR_LANG
language model to use (must be installed). Defaults to
an English model (default: en)
-s, --disable_sbd Whether to disable spaCy automatic sentence boundary
detection. In practice, disabling means that every
line will be parsed as one sentence, regardless of its
actual content. Only works when using 'spacy' as
'parser'. (default: False)
-t, --is_tokenized Whether your text has already been tokenized (space-
separated). Setting this option has different
consequences for different parsers: SpaCy will simply
not do any further tokenisation: we simply split the
tokens on whitespace; Stanfordnlp and Stanza will not
tokenize but in addition, will also only do sentence
splitting on newlines. No additional sentence
segmentation is done; For UDpipe we also simply
disable tokenisation and use white-spaced tokens
(works from 0.3.0 upwards). No further sentence
segmentation is done. (default: False)
-d, --include_headers
Whether to include headers before the output of every
sentence. These headers include the sentence text and
the sentence ID as per the CoNLL format. (default:
False)
-e, --no_force_counting
Whether to disable force counting the 'sent_id',
starting from 1 and increasing for each sentence.
Instead, 'sent_id' will depend on how spaCy returns
the sentences. Must have 'include_headers' enabled.
(default: False)
-j N_PROCESS, --n_process N_PROCESS
Number of processes to use in nlp.pipe(). -1 will use
as many cores as available. Requires spaCy v2.2.2.
Might not work for a 'parser' other than 'spacy'.
(default: 1)
-p {spacy,stanfordnlp,stanza,udpipe}, --parser {spacy,stanfordnlp,stanza,udpipe}
Which parser to use. Parsers other than 'spacy' need
to be installed separately. So if you wish to use
'stanfordnlp' models, 'spacy-stanfordnlp' needs to be
installed. For 'stanza' you need 'spacy-stanza', and
for 'udpipe' the 'spacy-udpipe' library is required.
(default: spacy)
-v, --verbose Whether to always print the output to stdout,
regardless of 'output_file'. (default: False)
For example, parsing a single line, multi-sentence string:
> parse-as-conll --input_str "I like cookies . What about you ?" --is_tokenized --include_headers
# sent_id = 1
# text = I like cookies .
1 I -PRON- PRON PRP PronType=prs 2 nsubj _ _
2 like like VERB VBP VerbForm=fin|Tense=pres 0 ROOT _ _
3 cookies cookie NOUN NNS Number=plur 2 dobj _ _
4 . . PUNCT . PunctType=peri 2 punct _ _
# sent_id = 2
# text = What about you ?
1 What what PRON WP _ 2 dep _ _
2 about about ADP IN _ 0 ROOT _ _
3 you -PRON- PRON PRP PronType=prs 2 pobj _ _
4 ? ? PUNCT . PunctType=peri 2 punct _ _
For example, parsing a large input file and writing output to a given output file, using four processes (multiprocessing might only be supported with the 'spacy' parser):
> parse-as-conll --input_file large-input.txt --output_file large-conll-output.txt --include_headers --disable_sbd -j 4
Credits
Based on the initial work by rgalhama.