A custom pipeline component for spaCy that can convert any parsed Doc and its sentences into CoNLL-U format. Also provides a command line entry point.

Parsing to CoNLL with spaCy

This module allows you to parse a text to CoNLL-U format. You can use it as a command line tool, or embed it in your own scripts by adding it as a custom component to a spaCy pipeline.

Installation

Requires spaCy and an installed spaCy language model.

pip install spacy_conll
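
The command line tool defaults to the en_core_web_sm model, so if you do not have a model installed yet, you can fetch one with spaCy's own download command, for example:

python -m spacy download en_core_web_sm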

Usage

Command line

> python -m spacy_conll -h
usage: [-h] [-f INPUT_FILE] [-a INPUT_ENCODING] [-b INPUT_STR]
       [-t] [-o OUTPUT_FILE] [-c OUTPUT_ENCODING] [-m MODEL] [-s]
       [-d] [-e] [-j N_PROCESS] [-v]

Parse an input string or input file to CoNLL-U format.

optional arguments:
  -h, --help            show this help message and exit
  -f INPUT_FILE, --input_file INPUT_FILE
                        Path to file with sentences to parse. Has precedence
                        over 'input_str'. (default: None)
  -a INPUT_ENCODING, --input_encoding INPUT_ENCODING
                        Encoding of the input file. Default value is system
                        default. (default: cp1252)
  -b INPUT_STR, --input_str INPUT_STR
                        Input string to parse. (default: None)
  -t, --is_tokenized    Indicates whether your text has already been tokenized
                        (space-separated). (default: False)
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        Path to output file. If not specified, the output will
                        be printed on standard output. (default: None)
  -c OUTPUT_ENCODING, --output_encoding OUTPUT_ENCODING
                        Encoding of the output file. Default value is system
                        default. (default: cp1252)
  -m MODEL, --model MODEL
                        spaCy model to use (must be installed). (default:
                        en_core_web_sm)
  -s, --disable_sbd     Disables spaCy automatic sentence boundary detection.
                        In practice, disabling means that every line will be
                        parsed as one sentence, regardless of its actual
                        content. (default: False)
  -d, --include_headers
                        To include headers before the output of every
                        sentence. These headers include the sentence text and
                        the sentence ID. (default: False)
  -e, --no_force_counting
                        To disable force counting the 'sent_id', starting from
                        1 and increasing for each sentence. Instead, 'sent_id'
                        will depend on how spaCy returns the sentences. Must
                        have 'include_headers' enabled. (default: False)
  -j N_PROCESS, --n_process N_PROCESS
                        Number of processes to use in nlp.pipe(). -1 will use
                        as many cores as available. Requires spaCy v2.2.2.
                        (default: 1)
  -v, --verbose         To print the output to stdout, regardless of
                        'output_file'. (default: False)

For example, parsing a sentence:

>  python -m spacy_conll --input_str "I like cookies . What about you ?" --is_tokenized --include_headers
# sent_id = 1
# text = I like cookies .
1       I       -PRON-  PRON    PRP     PronType=prs    2       nsubj   _       _
2       like    like    VERB    VBP     VerbForm=fin|Tense=pres 0       ROOT    _       _
3       cookies cookie  NOUN    NNS     Number=plur     2       dobj    _       _
4       .       .       PUNCT   .       PunctType=peri  2       punct   _       _

# sent_id = 2
# text = What about you ?
1       What    what    NOUN    WP      PronType=int|rel        2       dep     _       _
2       about   about   ADP     IN      _       0       ROOT    _       _
3       you     -PRON-  PRON    PRP     PronType=prs    2       pobj    _       _
4       ?       ?       PUNCT   .       PunctType=peri  2       punct   _       _

For example, parsing a large input file and writing the output to another file, using four processes:

> python -m spacy_conll --input_file large-input.txt --output_file large-conll-output.txt --include_headers --disable_sbd -j 4

In Python

spacy_conll is intended to be used as a custom pipeline component in spaCy. Three custom extensions are accessible, by default named conll_str, conll_str_headers, and conll.

  • conll_str: returns the string representation of the CoNLL format

  • conll_str_headers: returns the string representation of the CoNLL format including headers. These headers consist of two lines, namely # sent_id = <i>, indicating which sentence it is in the overall document, and # text = <sentence>, which simply shows the original sentence’s text

  • conll: returns the output as (a list of) tuple(s) where each line is a tuple of its column values

When adding the component to the spaCy pipeline, it is important to insert it after the parser, as shown in the example below.

import spacy
from spacy_conll import ConllFormatter

nlp = spacy.load('en_core_web_sm')
conllformatter = ConllFormatter(nlp)
nlp.add_pipe(conllformatter, after='parser')
doc = nlp('I like cookies. Do you?')
print(doc._.conll_str_headers)

The snippet above prints the following string:

# sent_id = 1
# text = I like cookies.
1   I       -PRON-  PRON    PRP     PronType=prs    2       nsubj   _       _
2   like    like    VERB    VBP     VerbForm=fin|Tense=pres 0       ROOT    _       _
3   cookies cookie  NOUN    NNS     Number=plur     2       dobj    _       _
4   .       .       PUNCT   .       PunctType=peri  2       punct   _       _

# sent_id = 2
# text = Do you?
1   Do      do      AUX     VBP     VerbForm=fin|Tense=pres 0       ROOT    _       _
2   you     -PRON-  PRON    PRP     PronType=prs    1       nsubj   _       _
3   ?       ?       PUNCT   .       PunctType=peri  1       punct   _       _
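
If you prefer to work with the parse programmatically rather than as a string, the conll extension described above exposes the same columns as tuples. The sketch below is a minimal example; the exact grouping of the tuples (per sentence or per document) depends on the installed version, so each entry is simply printed as-is.

import spacy
from spacy_conll import ConllFormatter

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(ConllFormatter(nlp), after='parser')
doc = nlp('I like cookies.')

# Each entry mirrors a line (or a sentence's lines) of the string output,
# with the CoNLL-U column values as tuple elements.
for entry in doc._.conll:
    print(entry)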

DEPRECATED: Spacy2ConllParser

There are two main methods, parse() and parseprint(). The latter is a convenience method for printing the output of parse() to stdout (default) or a file.

from spacy_conll import Spacy2ConllParser
spacyconll = Spacy2ConllParser()

# `parse` returns a generator of the parsed sentences
for parsed_sent in spacyconll.parse(input_str="I like cookies.\nWhat about you?\nI don't like 'em!"):
    do_something(parsed_sent)  # replace with your own processing

# `parseprint` prints output to stdout (default) or a file (use `output_file` parameter)
# This method is called when using the command line
spacyconll.parseprint(input_str='I like cookies.')
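
To write to a file instead of stdout, pass the output_file parameter mentioned in the comment above (the file name here is only illustrative):

spacyconll.parseprint(input_str='I like cookies.', output_file='cookies.conll')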

Credits

Based on the initial work by rgalhama.
