Pandas Dataframe integration for spaCy

DframCy

DframCy is a lightweight utility module that integrates pandas DataFrames with spaCy's linguistic annotation and training tasks. It provides clean APIs for converting spaCy's linguistic annotations and its Matcher and PhraseMatcher results to pandas DataFrames. It also supports training and evaluating NLP pipelines from CSV/XLSX/XLS files without any changes to spaCy's underlying APIs.

Getting Started

DframCy is easy to install. Make sure the following requirements are met, then install the package.

Requirements

  • Python 3.5 or later
  • Pandas
  • spaCy >= 2.2.0

You also need to download a spaCy language model:

python -m spacy download en_core_web_sm

For more information, refer to spaCy's Models & Languages documentation.

Installation:

This package can be installed from PyPI by running:

pip install dframcy

To build from source:

git clone https://github.com/yash1994/dframcy.git
cd dframcy
python setup.py install
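
To confirm that the installation succeeded, the installed version can be looked up from Python. This is a minimal sketch, not part of DframCy: `installed_version` is a hypothetical helper, and it assumes Python 3.8 or later, where `importlib.metadata` is in the standard library.

```python
from importlib import metadata  # standard library from Python 3.8 onward


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string if dframcy is installed, otherwise None.
print(installed_version("dframcy"))
```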

Usage

Linguistic Annotations

Get linguistic annotations as a dataframe. For the available annotations (which become the dataframe column names), refer to spaCy's Token API documentation.

import spacy
from dframcy import DframCy

nlp = spacy.load("en_core_web_sm")

dframcy = DframCy(nlp)
doc = dframcy.nlp(u"Apple is looking at buying U.K. startup for $1 billion")

# default columns: ["id", "text", "start", "end", "pos_", "tag_", "dep_", "head", "ent_type_"]
annotation_dataframe = dframcy.to_dataframe(doc)

# can also pass columns names (spaCy's linguistic annotation attributes)
annotation_dataframe = dframcy.to_dataframe(doc, columns=["text", "lemma_", "lower_", "is_punct"])

# for separate entity dataframe
token_annotation_dataframe, entity_dataframe = dframcy.to_dataframe(doc, separate_entity_dframe=True)

# custom attributes can also be included
from spacy.tokens import Token
fruit_getter = lambda token: token.text in ("apple", "pear", "banana")
Token.set_extension("is_fruit", getter=fruit_getter)
doc = dframcy.nlp(u"I have an apple")

annotation_dataframe = dframcy.to_dataframe(doc, custom_attributes=["is_fruit"])
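
For intuition, the conversion from token attributes to dataframe columns can be sketched with plain spaCy and pandas. This is an illustration only and not DframCy's actual implementation: `tokens_to_dataframe` is a hypothetical helper, and the sketch assumes a blank English pipeline so that no model download is needed.

```python
import pandas as pd
import spacy

# A blank pipeline only tokenizes, so model-dependent attributes such as
# pos_ or ent_type_ would be empty here; lexical attributes still work.
nlp = spacy.blank("en")
doc = nlp("I have an apple")


def tokens_to_dataframe(doc, columns):
    """Hypothetical helper: one row per token, one column per Token attribute."""
    rows = [{name: getattr(token, name) for name in columns} for token in doc]
    return pd.DataFrame(rows, columns=columns)


# "idx" is the character offset of the token within the document text.
df = tokens_to_dataframe(doc, ["text", "lower_", "is_punct", "idx"])
print(df)
```

Any attribute name listed in spaCy's Token API can be passed the same way, which mirrors how the `columns` argument is used above.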

Rule-Based Matching

# Token-based Matching
import spacy

nlp = spacy.load("en_core_web_sm")

from dframcy.matcher import DframCyMatcher, DframCyPhraseMatcher
dframcy_matcher = DframCyMatcher(nlp)
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
dframcy_matcher.add("HelloWorld", None, pattern)
doc = dframcy_matcher.nlp("Hello, world! Hello world!")
matches_dataframe = dframcy_matcher(doc)

# Phrase Matching
dframcy_phrase_matcher = DframCyPhraseMatcher(nlp)
terms = [u"Barack Obama", u"Angela Merkel", u"Washington, D.C."]
patterns = [dframcy_phrase_matcher.get_nlp().make_doc(text) for text in terms]
dframcy_phrase_matcher.add("TerminologyList", None, *patterns)
doc = dframcy_phrase_matcher.nlp(u"German Chancellor Angela Merkel and US President Barack Obama "
                                u"converse in the Oval Office inside the White House in Washington, D.C.")
phrase_matches_dataframe = dframcy_phrase_matcher(doc)
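
To show what kind of information a matches dataframe can carry, the sketch below builds one by hand from spaCy's own `Matcher`. It is not DframCy's implementation, and the column names (`match_id`, `start`, `end`, `text`) are assumptions chosen for illustration. It also assumes the list-based `Matcher.add(key, patterns)` signature, which is available in spaCy 2.2.2 and later.

```python
import pandas as pd
import spacy
from spacy.matcher import Matcher

# A blank pipeline is enough because LOWER and IS_PUNCT are lexical attributes.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HelloWorld", [pattern])  # assumes spaCy >= 2.2.2

doc = nlp("Hello, world! Hello world!")

# Each match is a (match_id, start, end) triple of a hash and token indices.
rows = [
    {
        "match_id": nlp.vocab.strings[match_id],  # hash back to the rule name
        "start": start,
        "end": end,
        "text": doc[start:end].text,
    }
    for match_id, start, end in matcher(doc)
]
matches_df = pd.DataFrame(rows, columns=["match_id", "start", "end", "text"])
print(matches_df)
```

Only the first "Hello, world" matches, because the pattern requires a punctuation token between the two words.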

Command Line Interface

DframCy provides a command line interface for converting a plain text file to linguistically annotated text in CSV/JSON format, and for training and evaluating language models from CSV/XLS formatted training data. See the training data example for the expected format. The CLI arguments for training and evaluation are exactly the same as in spaCy's CLI; the only difference is the format of the training data.

# convert
dframcy convert -i plain_text.txt -o annotations.csv -t csv

# train
dframcy train -l en -o spacy_models -t train.csv -d test.csv

# evaluate
dframcy evaluate -m spacy_model/ -d test.csv

# train text classifier
dframcy textcat -o spacy_model/ -t data/textcat_training.csv -d data/textcat_training.csv

