spacy-conll

A custom pipeline component for spaCy that can convert any parsed Doc and its sentences into CoNLL-U format. Also provides a command line entry point.

Development Status
- 4 - Beta
Intended Audience
- Developers
- Science/Research
Operating System
- OS Independent
Programming Language
Topic
- Scientific/Engineering
- Text Processing

Project description

Parsing to CoNLL with spaCy

This module allows you to parse a text to CoNLL-U format. You can use it as a command line tool, or embed it in your own scripts by adding it as a custom component to a spaCy pipeline.

Installation

Requires spaCy and an installed spaCy language model. When using the module from the command line, you also need the packaging package.

pip install spacy_conll
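
Before running the module, it can help to confirm that the requirements listed above are importable. The following is a minimal sketch that uses only the standard library. The helper name missing_requirements is hypothetical and not part of spacy_conll, and en_core_web_sm is assumed as the model only because it is the command line default.

```python
from importlib.util import find_spec


def missing_requirements(names=("spacy", "en_core_web_sm", "packaging")):
    """Return the top-level packages from `names` that cannot be imported.

    Hypothetical helper, not part of spacy_conll.
    """
    # find_spec returns None when a top-level package is not installed
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_requirements()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All requirements found.")
```

An empty list means every named package was found. Pass a different tuple of names if you use another model.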

Usage

Command line

> python -m spacy_conll -h
usage: [-h] [-f INPUT_FILE] [-a INPUT_ENCODING] [-b INPUT_STR]
       [-t] [-o OUTPUT_FILE] [-c OUTPUT_ENCODING] [-m MODEL] [-s]
       [-d] [-e] [-j N_PROCESS] [-v]

Parse an input string or input file to CoNLL-U format.

optional arguments:
  -h, --help            show this help message and exit
  -f INPUT_FILE, --input_file INPUT_FILE
                        Path to file with sentences to parse. Has precedence
                        over 'input_str'. (default: None)
  -a INPUT_ENCODING, --input_encoding INPUT_ENCODING
                        Encoding of the input file. Default value is system
                        default. (default: cp1252)
  -b INPUT_STR, --input_str INPUT_STR
                        Input string to parse. (default: None)
  -t, --is_tokenized    Indicates whether your text has already been tokenized
                        (space-seperated). (default: False)
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        Path to output file. If not specified, the output will
                        be printed on standard output. (default: None)
  -c OUTPUT_ENCODING, --output_encoding OUTPUT_ENCODING
                        Encoding of the output file. Default value is system
                        default. (default: cp1252)
  -m MODEL, --model MODEL
                        spaCy model to use (must be installed). (default:
                        en_core_web_sm)
  -s, --disable_sbd     Disables spaCy automatic sentence boundary detection.
                        In practice, disabling means that every line will be
                        parsed as one sentence, regardless of its actual
                        content. (default: False)
  -d, --include_headers
                        To include headers before the output of every
                        sentence. These headers include the sentence text and
                        the sentence ID. (default: False)
  -e, --no_force_counting
                        To disable force counting the 'sent_id', starting from
                        1 and increasing for each sentence. Instead, 'sent_id'
                        will depend on how spaCy returns the sentences. Must
                        have 'include_headers' enabled. (default: False)
  -j N_PROCESS, --n_process N_PROCESS
                        Number of processes to use in nlp.pipe(). -1 will use
                        as many cores as available. Requires spaCy v2.2.2.
                        (default: 1)
  -v, --verbose         To print the output to stdout, regardless of
                        'output_file'. (default: False)

For example, parsing a sentence:

> python -m spacy_conll --input_str "I like cookies . What about you ?" --is_tokenized --include_headers
# sent_id = 1
# text = I like cookies .
1       I       -PRON-  PRON    PRP     PronType=prs    2       nsubj   _       _
2       like    like    VERB    VBP     VerbForm=fin|Tense=pres 0       ROOT    _       _
3       cookies cookie  NOUN    NNS     Number=plur     2       dobj    _       _
4       .       .       PUNCT   .       PunctType=peri  2       punct   _       _

# sent_id = 2
# text = What about you ?
1       What    what    NOUN    WP      PronType=int|rel        2       dep     _       _
2       about   about   ADP     IN      _       0       ROOT    _       _
3       you     -PRON-  PRON    PRP     PronType=prs    2       pobj    _       _
4       ?       ?       PUNCT   .       PunctType=peri  2       punct   _       _

As another example, parsing a large input file and writing the output to a file, using four processes:

> python -m spacy_conll --input_file large-input.txt --output_file large-conll-output.txt --include_headers --disable_sbd -j 4
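
To drive the same command from a script, one option is the standard library subprocess module. The sketch below only assembles flags that appear in the help text above. The helper names build_command and run_parser are hypothetical and not part of spacy_conll.

```python
import subprocess
import sys


def build_command(input_file, output_file, n_process=1,
                  include_headers=True, disable_sbd=False):
    """Assemble the argument list for `python -m spacy_conll`.

    Hypothetical helper; it only uses flags listed in the help text.
    """
    cmd = [sys.executable, "-m", "spacy_conll",
           "--input_file", str(input_file),
           "--output_file", str(output_file),
           "--n_process", str(n_process)]
    if include_headers:
        cmd.append("--include_headers")
    if disable_sbd:
        cmd.append("--disable_sbd")
    return cmd


def run_parser(input_file, output_file, **options):
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    cmd = build_command(input_file, output_file, **options)
    return subprocess.run(cmd, check=True)
```

Calling run_parser("large-input.txt", "large-conll-output.txt", n_process=4, disable_sbd=True) should be equivalent to the command shown above, assuming spacy_conll and a model are installed in the same Python environment.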

In Python

spacy_conll is intended to be used as a custom pipeline component in spaCy. Three custom extensions are accessible, by default named conll_str, conll_str_headers, and conll.

conll_str: returns the string representation of the CoNLL format
conll_str_headers: returns the string representation of the CoNLL format including headers. These headers consist of two lines, namely # sent_id = <i>, indicating which sentence it is in the overall document, and # text = <sentence>, which simply shows the original sentence’s text
conll: returns the output as (a list of) tuple(s) where each line is a tuple of its column values
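
As a rough illustration of how these three representations relate, the sketch below builds both string forms from a list of tuples. This is not the library's implementation. The token values are copied from the command line example above, the helper names are hypothetical, and storing every column as a string is an assumption.

```python
# Each token is a tuple of the ten CoNLL-U column values
# (assumption: all values are kept as strings).
rows = [
    ("1", "I", "-PRON-", "PRON", "PRP", "PronType=prs", "2", "nsubj", "_", "_"),
    ("2", "like", "like", "VERB", "VBP", "VerbForm=fin|Tense=pres", "0", "ROOT", "_", "_"),
    ("3", "cookies", "cookie", "NOUN", "NNS", "Number=plur", "2", "dobj", "_", "_"),
    ("4", ".", ".", "PUNCT", ".", "PunctType=peri", "2", "punct", "_", "_"),
]


def rows_to_str(rows):
    """Join column values with tabs and tokens with newlines."""
    return "\n".join("\t".join(row) for row in rows) + "\n"


def rows_to_str_with_headers(rows, sent_id, text):
    """Prefix the token lines with the sent_id and text header lines."""
    return f"# sent_id = {sent_id}\n# text = {text}\n" + rows_to_str(rows)


if __name__ == "__main__":
    print(rows_to_str_with_headers(rows, 1, "I like cookies ."))
```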

When adding the component to the spaCy pipeline, it is important to insert it after the parser, as shown in the example below.

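import spacy
from spacy_conll import ConllFormatter

# Load an installed model; 'en_core_web_sm' is the command line default
nlp = spacy.load('en_core_web_sm')
conllformatter = ConllFormatter(nlp)
# Insert the component after the parser so the dependency parse is available
nlp.add_pipe(conllformatter, after='parser')
doc = nlp('I like cookies. Do you?')
print(doc._.conll_str_headers)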

The snippet above prints the following string:

# sent_id = 1
# text = I like cookies.
1       I       -PRON-  PRON    PRP     PronType=prs    2       nsubj   _       _
2       like    like    VERB    VBP     VerbForm=fin|Tense=pres 0       ROOT    _       _
3       cookies cookie  NOUN    NNS     Number=plur     2       dobj    _       _
4       .       .       PUNCT   .       PunctType=peri  2       punct   _       _

# sent_id = 2
# text = Do you?
1       Do      do      AUX     VBP     VerbForm=fin|Tense=pres 0       ROOT    _       _
2       you     -PRON-  PRON    PRP     PronType=prs    1       nsubj   _       _
3       ?       ?       PUNCT   .       PunctType=peri  1       punct   _       _
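
Because the string follows the CoNLL-U layout, it can be read back into per-sentence structures with a few lines of plain Python. The sketch below assumes the columns are tab-separated, as CoNLL-U specifies, and that sentences are separated by a blank line. The function name read_conll is hypothetical and not part of spacy_conll.

```python
def read_conll(conll_str):
    """Split a CoNLL-U string into sentences.

    Returns a list of dicts, each with 'headers' (dict of header fields)
    and 'tokens' (list of tuples of column values).
    """
    sentences = []
    headers, tokens = {}, []
    for line in conll_str.splitlines():
        if not line.strip():
            # A blank line closes the current sentence
            if tokens:
                sentences.append({"headers": headers, "tokens": tokens})
            headers, tokens = {}, []
        elif line.startswith("#"):
            # Header lines look like "# key = value"
            key, _, value = line[1:].partition("=")
            headers[key.strip()] = value.strip()
        else:
            tokens.append(tuple(line.split("\t")))
    # Keep the last sentence if the string does not end with a blank line
    if tokens:
        sentences.append({"headers": headers, "tokens": tokens})
    return sentences
```

Under these assumptions, read_conll(doc._.conll_str_headers) would give one dict per sentence, with the sent_id and text values available under 'headers'.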

DEPRECATED: Spacy2ConllParser

The deprecated Spacy2ConllParser class has two main methods, parse() and parseprint(). The latter is a convenience method that prints the output of parse() to stdout (default) or writes it to a file.

from spacy_conll import Spacy2ConllParser

spacyconll = Spacy2ConllParser()

# `parse` returns a generator of the parsed sentences
for parsed_sent in spacyconll.parse(input_str="I like cookies.\nWhat about you?\nI don't like 'em!"):
    # Replace print() with your own processing of each parsed sentence
    print(parsed_sent)

# `parseprint` prints output to stdout (default) or a file (use `output_file` parameter)
# This method is called when using the command line
spacyconll.parseprint(input_str='I like cookies.')

Credits

Based on the initial work by rgalhama.

Release history

4.0.1

Jul 2, 2024

3.4.0

Apr 2, 2023

3.3.0

Jan 17, 2023

3.2.0

Apr 4, 2022

3.1.0

Oct 31, 2021

3.0.2

Jul 14, 2021

3.0.1

Jul 14, 2021

3.0.0

Jul 12, 2021

3.0.0rc3 pre-release

Jul 8, 2021

3.0.0rc2 pre-release

Jul 7, 2021

3.0.0rc1 pre-release

Jun 29, 2021

2.1.0

Jun 30, 2021

2.0.0

May 11, 2020

1.3.0

Apr 28, 2020

1.2.0

Feb 2, 2020

This version

1.1.0

Jan 21, 2020

1.0.1

Jan 15, 2020

1.0.0

Jan 15, 2020

0.1.6

Jan 17, 2019

0.1.5

Jan 17, 2019

0.1.0

Jan 16, 2019

0.0.3

Jan 14, 2019

0.0.2

Jan 14, 2019

0.0.1

Jan 14, 2019

Download files

Download the file for your platform.

Source Distribution

spacy_conll-1.1.0.tar.gz (11.4 kB)

Uploaded Jan 21, 2020 Source

Built Distribution

spacy_conll-1.1.0-py3-none-any.whl (11.6 kB)

Uploaded Jan 21, 2020 Python 3

Hashes for spacy_conll-1.1.0.tar.gz

Algorithm	Hash digest
SHA256	`912c963bd5f3a0d0ac44a3fc888d2776523eb72c7f23920ca9402bc00da54b09`
MD5	`2b5272daf099293ec63d46097997ffda`
BLAKE2b-256	`2717634c507b863610b13b2f808d741e903beaa33e53ca72ea2fcbc9856cdb34`

Hashes for spacy_conll-1.1.0-py3-none-any.whl

Algorithm	Hash digest
SHA256	`a4f43a31b780c563fdb3c88b845656248f9333f07ec6871ee4bf77a9f0474fac`
MD5	`776f8bf63b3978eb0642180ed70f483e`
BLAKE2b-256	`0fe0d3120020701f33befe63cb130082624c718a7d4b31c2824f0ce79d8d9f75`