Pandas Dataframe integration for spaCy

DframCy

DframCy is a lightweight utility module that integrates Pandas DataFrames with spaCy's linguistic annotation and training tasks. It provides clean APIs to convert spaCy's linguistic annotations and its Matcher and PhraseMatcher results to Pandas DataFrames, and it also supports training and evaluation of NLP pipelines from CSV/XLSX/XLS files without any changes to spaCy's underlying APIs.

Getting Started

DframCy is easy to install. You need the following:

Requirements

  • Python 3.6 or later
  • Pandas
  • spaCy >= 3.0.0

You also need to download a spaCy language model:

python -m spacy download en_core_web_sm

For more information, refer to spaCy's Models & Languages documentation.

Installation:

This package can be installed from PyPI by running:

pip install dframcy

To build from source:

git clone https://github.com/yash1994/dframcy.git
cd dframcy
python setup.py install
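After installing, it can be useful to confirm that the requirements listed above are actually present. The following is a minimal sketch using only the standard library (importlib.metadata, which needs Python 3.8 or later); the helper names installed_version and major are made up for this example and are not part of DframCy.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def major(version_string):
    """Return the leading integer of a version string such as '3.7.2'."""
    return int(version_string.split(".")[0])


# Report what is installed for each requirement.
for name in ("pandas", "spacy", "dframcy"):
    print(name, installed_version(name) or "not installed")

# DframCy requires spaCy >= 3.0.0, so warn on an older major version.
spacy_version = installed_version("spacy")
if spacy_version is not None and major(spacy_version) < 3:
    print("spaCy >= 3.0.0 is required, found", spacy_version)
```

This only checks package metadata; it does not verify that the en_core_web_sm model has been downloaded.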

Usage

Linguistic Annotations

Convert a Doc's linguistic annotations to a dataframe. The available annotations (used as dataframe column names) are the token attributes listed in spaCy's Token API documentation.

import spacy
from dframcy import DframCy

nlp = spacy.load("en_core_web_sm")

dframcy = DframCy(nlp)
doc = dframcy.nlp(u"Apple is looking at buying U.K. startup for $1 billion")

# default columns: ["id", "text", "start", "end", "pos_", "tag_", "dep_", "head", "ent_type_"]
annotation_dataframe = dframcy.to_dataframe(doc)

# column names (spaCy's linguistic annotation attributes) can also be passed explicitly
annotation_dataframe = dframcy.to_dataframe(doc, columns=["text", "lemma_", "lower_", "is_punct"])

# for separate entity dataframe
token_annotation_dataframe, entity_dataframe = dframcy.to_dataframe(doc, separate_entity_dframe=True)

# custom attributes can also be included
from spacy.tokens import Token
fruit_getter = lambda token: token.text in ("apple", "pear", "banana")
Token.set_extension("is_fruit", getter=fruit_getter)
doc = dframcy.nlp(u"I have an apple")

annotation_dataframe = dframcy.to_dataframe(doc, custom_attributes=["is_fruit"])
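To make the role of the columns argument concrete: each column name is the name of a token attribute, and each token becomes one row. The sketch below illustrates that idea only. It is not DframCy's actual implementation, it uses stand-in token objects (SimpleNamespace) so that it runs without spaCy, and the helper tokens_to_rows is hypothetical.

```python
from types import SimpleNamespace

# Stand-in tokens; with spaCy these would be the Token objects of a Doc.
tokens = [
    SimpleNamespace(text="I", lower_="i", is_punct=False),
    SimpleNamespace(text="have", lower_="have", is_punct=False),
    SimpleNamespace(text="apples", lower_="apples", is_punct=False),
    SimpleNamespace(text=".", lower_=".", is_punct=True),
]


def tokens_to_rows(tokens, columns):
    """Build one dict per token, keyed by the requested attribute names."""
    return [{name: getattr(token, name) for name in columns} for token in tokens]


rows = tokens_to_rows(tokens, ["text", "is_punct"])
for row in rows:
    print(row)
```

A list of dicts like rows can be passed directly to pandas.DataFrame(rows) to obtain a table with one column per attribute.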

Rule-Based Matching

# Token-based Matching
import spacy

nlp = spacy.load("en_core_web_sm")

from dframcy.matcher import DframCyMatcher, DframCyPhraseMatcher, DframCyDependencyMatcher
dframcy_matcher = DframCyMatcher(nlp)
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
dframcy_matcher.add("HelloWorld", None, pattern)
doc = dframcy_matcher.nlp("Hello, world! Hello world!")
matches_dataframe = dframcy_matcher(doc)

# Phrase Matching
dframcy_phrase_matcher = DframCyPhraseMatcher(nlp)
terms = [u"Barack Obama", u"Angela Merkel",u"Washington, D.C."]
patterns = [dframcy_phrase_matcher.get_nlp().make_doc(text) for text in terms]
dframcy_phrase_matcher.add("TerminologyList", None, *patterns)
doc = dframcy_phrase_matcher.nlp(u"German Chancellor Angela Merkel and US President Barack Obama "
                                u"converse in the Oval Office inside the White House in Washington, D.C.")
phrase_matches_dataframe = dframcy_phrase_matcher(doc)

# Dependency Matching
dframcy_dependency_matcher = DframCyDependencyMatcher(nlp)
pattern = [{"RIGHT_ID": "founded_id", "RIGHT_ATTRS": {"ORTH": "founded"}}]
doc = dframcy_dependency_matcher.nlp(u"Bill Gates founded Microsoft. And Elon Musk founded SpaceX")
dependency_matches_dataframe = dframcy_dependency_matcher(doc)
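For orientation, spaCy's own Matcher returns matches as (match_id, start, end) tuples, where start and end are token indices. The sketch below shows how such tuples could be turned into table rows. It is an illustration under assumptions: the token list, the id-to-name lookup and the row keys are stand-ins chosen for this example, and the columns of the dataframe that DframCy actually returns may differ.

```python
# Stand-in data: with spaCy, `tokens` would come from a Doc and `matches`
# from calling a Matcher, which yields (match_id, start, end) tuples.
tokens = ["Hello", ",", "world", "!", "Hello", "world", "!"]
names = {1: "HelloWorld"}  # hypothetical lookup from match id to pattern name
matches = [(1, 0, 3)]      # one match covering tokens 0, 1 and 2


def matches_to_rows(matches, tokens, names):
    """Turn (match_id, start, end) tuples into one dict per match."""
    rows = []
    for match_id, start, end in matches:
        rows.append(
            {
                "name": names[match_id],
                "start": start,
                "end": end,
                # Tokens joined with spaces, so not identical to the source text.
                "span_text": " ".join(tokens[start:end]),
            }
        )
    return rows


rows = matches_to_rows(matches, tokens, names)
print(rows)
```

Only the first "Hello, world" is listed as a match here because the pattern requires a punctuation token between the two words, which the second "Hello world" lacks.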

Command Line Interface

DframCy provides a command-line interface for converting a plain text file to linguistically annotated text in CSV or JSON format. Earlier versions of DframCy also provided CLI utilities for training and evaluating spaCy models from CSV/XLS files. Since the v3 release, spaCy's training pipeline has become much more flexible and robust, so DframCy no longer adds an extra step just for format conversion (CSV/XLS to spaCy's binary format).

# convert
dframcy dframe -i plain_text.txt -o annotations.csv -f csv
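The resulting CSV file can be read back with any CSV reader. Below is a minimal sketch using the standard library's csv module. It assumes the file has a header row and that the columns include text, pos_ and ent_type_ (names taken from the default columns listed earlier); the sample content is invented for illustration, and the real output may contain other columns.

```python
import csv
import io

# Hypothetical sample standing in for annotations.csv; the real column set may differ.
sample = io.StringIO(
    "text,pos_,ent_type_\n"
    "Apple,PROPN,ORG\n"
    "is,AUX,\n"
    "looking,VERB,\n"
)


def read_annotations(handle):
    """Read annotation rows from an open CSV file handle into a list of dicts."""
    return list(csv.DictReader(handle))


rows = read_annotations(sample)

# Keep only tokens that carry an entity label (empty string means no entity).
entities = [row["text"] for row in rows if row["ent_type_"]]
print(entities)
```

For a real file, replace the in-memory sample with open("annotations.csv", newline="", encoding="utf-8").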
