
A custom pipeline component for spaCy that can convert any parsed Doc and its sentences into CoNLL-U format. Also provides a command line entry point.

Parsing to CoNLL with spaCy

This module allows you to parse a text to CoNLL-U format. You can use it as a command line tool, or embed it in your own scripts by adding it as a custom component to a spaCy pipeline.

Installation

Requires spaCy and an installed spaCy language model.

pip install spacy_conll

Usage

Command line

> python -m spacy_conll -h
usage: [-h] [-f INPUT_FILE] [-a INPUT_ENCODING] [-b INPUT_STR]
       [-t] [-o OUTPUT_FILE] [-c OUTPUT_ENCODING] [-m MODEL] [-s]
       [-d] [-e] [-j N_PROCESS] [-v]

Parse an input string or input file to CoNLL-U format.

optional arguments:
  -h, --help            show this help message and exit
  -f INPUT_FILE, --input_file INPUT_FILE
                        Path to file with sentences to parse. Has precedence
                        over 'input_str'. (default: None)
  -a INPUT_ENCODING, --input_encoding INPUT_ENCODING
                        Encoding of the input file. Default value is system
                        default. (default: cp1252)
  -b INPUT_STR, --input_str INPUT_STR
                        Input string to parse. (default: None)
  -t, --is_tokenized    Indicates whether your text has already been tokenized
                        (space-seperated). (default: False)
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        Path to output file. If not specified, the output will
                        be printed on standard output. (default: None)
  -c OUTPUT_ENCODING, --output_encoding OUTPUT_ENCODING
                        Encoding of the output file. Default value is system
                        default. (default: cp1252)
  -m MODEL, --model MODEL
                        spaCy model to use (must be installed). (default:
                        en_core_web_sm)
  -s, --disable_sbd     Disables spaCy automatic sentence boundary detection.
                        In practice, disabling means that every line will be
                        parsed as one sentence, regardless of its actual
                        content. (default: False)
  -d, --include_headers
                        To include headers before the output of every
                        sentence. These headers include the sentence text and
                        the sentence ID. (default: False)
  -e, --no_force_counting
                        To disable force counting the 'sent_id', starting from
                        1 and increasing for each sentence. Instead, 'sent_id'
                        will depend on how spaCy returns the sentences. Must
                        have 'include_headers' enabled. (default: False)
  -j N_PROCESS, --n_process N_PROCESS
                        Number of processes to use in nlp.pipe(). -1 will use
                        as many cores as available. Requires spaCy v2.2.2.
                        (default: 1)
  -v, --verbose         To print the output to stdout, regardless of
                        'output_file'. (default: False)

For example, parsing a sentence:

> python -m spacy_conll --input_str "I like cookies . What about you ?" --is_tokenized --include_headers
# sent_id = 1
# text = I like cookies .
1       I       -PRON-  PRON    PRP     PronType=prs    2       nsubj   _       _
2       like    like    VERB    VBP     VerbForm=fin|Tense=pres 0       ROOT    _       _
3       cookies cookie  NOUN    NNS     Number=plur     2       dobj    _       _
4       .       .       PUNCT   .       PunctType=peri  2       punct   _       _

# sent_id = 2
# text = What about you ?
1       What    what    NOUN    WP      PronType=int|rel        2       dep     _       _
2       about   about   ADP     IN      _       0       ROOT    _       _
3       you     -PRON-  PRON    PRP     PronType=prs    2       pobj    _       _
4       ?       ?       PUNCT   .       PunctType=peri  2       punct   _       _
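
The output above follows the CoNLL-U layout: comment lines starting with #, one tab-separated line per token, and a blank line between sentences. As a minimal sketch of how such output could be read back into Python, the snippet below groups lines into sentences. It is not part of spacy_conll; the function name read_conll_sentences and the sample string are illustrative assumptions.

```python
def read_conll_sentences(conll_text):
    """Group CoNLL-U text into sentences with header fields and token rows."""
    sentences = []
    headers, tokens = {}, []
    for line in conll_text.splitlines():
        line = line.rstrip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if headers or tokens:
                sentences.append({"headers": headers, "tokens": tokens})
                headers, tokens = {}, []
        elif line.startswith("#"):
            # Header lines look like "# key = value".
            key, _, value = line[1:].partition("=")
            headers[key.strip()] = value.strip()
        else:
            # Token lines hold ten tab-separated columns.
            tokens.append(tuple(line.split("\t")))
    if headers or tokens:
        sentences.append({"headers": headers, "tokens": tokens})
    return sentences


# Sample copied from the example output above, with explicit tab characters.
sample = (
    "# sent_id = 1\n"
    "# text = I like cookies .\n"
    "1\tI\t-PRON-\tPRON\tPRP\tPronType=prs\t2\tnsubj\t_\t_\n"
    "2\tlike\tlike\tVERB\tVBP\tVerbForm=fin|Tense=pres\t0\tROOT\t_\t_\n"
    "3\tcookies\tcookie\tNOUN\tNNS\tNumber=plur\t2\tdobj\t_\t_\n"
    "4\t.\t.\tPUNCT\t.\tPunctType=peri\t2\tpunct\t_\t_\n"
    "\n"
    "# sent_id = 2\n"
    "# text = What about you ?\n"
    "1\tWhat\twhat\tNOUN\tWP\tPronType=int|rel\t2\tdep\t_\t_\n"
    "2\tabout\tabout\tADP\tIN\t_\t0\tROOT\t_\t_\n"
)

sentences = read_conll_sentences(sample)
for sent in sentences:
    print(sent["headers"]["sent_id"], sent["headers"]["text"], len(sent["tokens"]))
```

Each sentence ends up as a dictionary with its header fields and a list of token tuples, which is usually enough for quick inspection or filtering.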

As another example, parsing a large input file and writing the output to a file, using four processes:

> python -m spacy_conll --input_file large-input.txt --output_file large-conll-output.txt --include_headers --disable_sbd -j 4
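
The same call can also be scripted. The following is a minimal sketch, assuming spacy_conll is installed for the interpreter that runs the script. The helpers build_conll_command and run_conll_cli are hypothetical names, not part of the package, and only flags listed in the help text above are used.

```python
import subprocess
import sys


def build_conll_command(input_file, output_file, n_process=1,
                        include_headers=False, disable_sbd=False):
    """Build the argument list for `python -m spacy_conll` (hypothetical helper)."""
    cmd = [
        sys.executable, "-m", "spacy_conll",
        "--input_file", input_file,
        "--output_file", output_file,
        "--n_process", str(n_process),
    ]
    if include_headers:
        cmd.append("--include_headers")
    if disable_sbd:
        cmd.append("--disable_sbd")
    return cmd


def run_conll_cli(cmd):
    """Run the command; check=True raises CalledProcessError on a non-zero exit."""
    return subprocess.run(cmd, check=True)


cmd = build_conll_command(
    "large-input.txt",
    "large-conll-output.txt",
    n_process=4,
    include_headers=True,
    disable_sbd=True,
)
# Show the command without running it; call run_conll_cli(cmd) to execute.
print(" ".join(cmd[1:]))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.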

In Python

spacy_conll is intended to be used as a custom pipeline component in spaCy. Three custom extensions are accessible, by default named conll_str, conll_str_headers, and conll.

  • conll_str: returns the string representation of the CoNLL format

  • conll_str_headers: returns the string representation of the CoNLL format including headers. These headers consist of two lines, namely # sent_id = <i>, indicating which sentence it is in the overall document, and # text = <sentence>, which simply shows the original sentence’s text

  • conll: returns the output as (a list of) tuple(s) where each line is a tuple of its column values
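
To make the relation between the string form and the tuple form concrete, here is a sketch in plain Python. It does not call spacy_conll. The rows literal and the helpers rows_to_conll_str and conll_str_to_rows are illustrative assumptions about what one sentence's tuples might look like, based on the example output in this document; the exact value types the library returns may differ.

```python
# One sentence as a list of tuples, one tuple per token line (assumed shape).
rows = [
    ("1", "I", "-PRON-", "PRON", "PRP", "PronType=prs", "2", "nsubj", "_", "_"),
    ("2", "like", "like", "VERB", "VBP", "VerbForm=fin|Tense=pres", "0", "ROOT", "_", "_"),
    ("3", "cookies", "cookie", "NOUN", "NNS", "Number=plur", "2", "dobj", "_", "_"),
    ("4", ".", ".", "PUNCT", ".", "PunctType=peri", "2", "punct", "_", "_"),
]


def rows_to_conll_str(rows):
    """Join the columns of each row with tabs and the rows with newlines."""
    return "\n".join("\t".join(str(col) for col in row) for row in rows)


def conll_str_to_rows(conll_str):
    """Split a CoNLL string back into tuples, skipping blank and header lines."""
    return [
        tuple(line.split("\t"))
        for line in conll_str.splitlines()
        if line and not line.startswith("#")
    ]


as_text = rows_to_conll_str(rows)
round_trip = conll_str_to_rows(as_text)
print(as_text)
```

The string form is convenient for writing files, while the tuple form is easier to index by column.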

When adding the component to the spaCy pipeline, it is important to insert it after the parser, as shown in the example below.

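import spacy
from spacy_conll import ConllFormatter

# Load an installed spaCy model (en_core_web_sm is also the command line default)
nlp = spacy.load('en_core_web_sm')
# Create the component and insert it after the parser
conllformatter = ConllFormatter(nlp)
nlp.add_pipe(conllformatter, after='parser')
doc = nlp('I like cookies. Do you?')
# The custom extensions are available under doc._
print(doc._.conll_str_headers)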

The snippet above will return (and print) the following string:

# sent_id = 1
# text = I like cookies.
1   I       -PRON-  PRON    PRP     PronType=prs    2       nsubj   _       _
2   like    like    VERB    VBP     VerbForm=fin|Tense=pres 0       ROOT    _       _
3   cookies cookie  NOUN    NNS     Number=plur     2       dobj    _       _
4   .       .       PUNCT   .       PunctType=peri  2       punct   _       _

# sent_id = 2
# text = Do you?
1   Do      do      AUX     VBP     VerbForm=fin|Tense=pres 0       ROOT    _       _
2   you     -PRON-  PRON    PRP     PronType=prs    1       nsubj   _       _
3   ?       ?       PUNCT   .       PunctType=peri  1       punct   _       _
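
In the CoNLL-U layout, the seventh and eighth columns of each token line hold the index of the token's head and the dependency relation (HEAD and DEPREL), with head 0 marking the root. As a small sketch, independent of spacy_conll, the second sentence above can be turned into head/dependent pairs. The dependency_edges helper and the hard-coded rows are assumptions for illustration.

```python
# Token rows copied from the second sentence of the example output.
rows = [
    ("1", "Do", "do", "AUX", "VBP", "VerbForm=fin|Tense=pres", "0", "ROOT", "_", "_"),
    ("2", "you", "-PRON-", "PRON", "PRP", "PronType=prs", "1", "nsubj", "_", "_"),
    ("3", "?", "?", "PUNCT", ".", "PunctType=peri", "1", "punct", "_", "_"),
]


def dependency_edges(rows):
    """Return (head form, relation, dependent form) for every token."""
    # Map each token index to its word form so heads can be looked up.
    forms = {int(row[0]): row[1] for row in rows}
    edges = []
    for row in rows:
        head = int(row[6])
        head_form = "<root>" if head == 0 else forms[head]
        edges.append((head_form, row[7], row[1]))
    return edges


edges = dependency_edges(rows)
for head, relation, dependent in edges:
    print(f"{head} --{relation}--> {dependent}")
# <root> --ROOT--> Do
# Do --nsubj--> you
# Do --punct--> ?
```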

DEPRECATED: Spacy2ConllParser

The deprecated Spacy2ConllParser class has two main methods, parse() and parseprint(). The latter is a convenience method that prints the output of parse() to stdout (default) or writes it to a file.

from spacy_conll import Spacy2ConllParser
spacyconll = Spacy2ConllParser()

# `parse` returns a generator of the parsed sentences
for parsed_sent in spacyconll.parse(input_str="I like cookies.\nWhat about you?\nI don't like 'em!"):
    print(parsed_sent)  # replace with your own processing

# `parseprint` prints output to stdout (default) or a file (use `output_file` parameter)
# This method is called when using the command line
spacyconll.parseprint(input_str='I like cookies.')

Credits

Based on the initial work by rgalhama.
