Pandas Dataframe integration for spaCy
DframCy
DframCy is a lightweight utility module that integrates Pandas DataFrames with spaCy's linguistic annotation and training tasks. It provides clean APIs to convert spaCy's linguistic annotations and its Matcher and PhraseMatcher results to Pandas DataFrames, and it supports training and evaluation of NLP pipelines from CSV/XLSX/XLS files without any changes to spaCy's underlying APIs.
Getting Started
DframCy is easy to install. You need the following:
Requirements
- Python 3.6 or later
- Pandas
- spaCy >= 3.0.0
You also need to download a spaCy language model:
python -m spacy download en_core_web_sm
For more information, refer to spaCy's Models & Languages documentation.
Installation:
This package can be installed from PyPI by running:
pip install dframcy
To build from source:
git clone https://github.com/yash1994/dframcy.git
cd dframcy
python setup.py install
Usage
Linguistic Annotations
Convert a document's linguistic annotations to a dataframe. The dataframe column names are spaCy token attributes; refer to spaCy's Token API documentation for the full list.
import spacy
from dframcy import DframCy
nlp = spacy.load("en_core_web_sm")
dframcy = DframCy(nlp)
doc = dframcy.nlp(u"Apple is looking at buying U.K. startup for $1 billion")
# default columns: ["id", "text", "start", "end", "pos_", "tag_", "dep_", "head", "ent_type_"]
annotation_dataframe = dframcy.to_dataframe(doc)
# column names (spaCy's linguistic annotation attributes) can also be passed explicitly
annotation_dataframe = dframcy.to_dataframe(doc, columns=["text", "lemma_", "lower_", "is_punct"])
# for separate entity dataframe
token_annotation_dataframe, entity_dataframe = dframcy.to_dataframe(doc, separate_entity_dframe=True)
# custom attributes can also be included
from spacy.tokens import Token
fruit_getter = lambda token: token.text in ("apple", "pear", "banana")
Token.set_extension("is_fruit", getter=fruit_getter)
doc = dframcy.nlp(u"I have an apple")
annotation_dataframe = dframcy.to_dataframe(doc, custom_attributes=["is_fruit"])
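To make the role of the columns argument concrete, here is a minimal sketch of flattening per-token attributes into table rows. It uses only the standard library: `SimpleToken` is a hypothetical stand-in for spaCy's `Token`, and `tokens_to_rows` is an illustrative helper, not part of DframCy, whose actual implementation may differ.

```python
from dataclasses import dataclass


@dataclass
class SimpleToken:
    """Hypothetical stand-in for a spaCy Token (illustration only)."""
    text: str
    is_punct: bool

    @property
    def lower_(self):
        # Mirrors the naming of spaCy's Token.lower_ attribute.
        return self.text.lower()


def tokens_to_rows(tokens, columns):
    """Build one dict per token, keeping only the requested attributes."""
    return [{col: getattr(tok, col) for col in columns} for tok in tokens]


tokens = [SimpleToken("Apple", False), SimpleToken("is", False), SimpleToken(".", True)]
rows = tokens_to_rows(tokens, columns=["text", "lower_", "is_punct"])
for row in rows:
    print(row)
```

A list of dicts like `rows` is a shape that `pandas.DataFrame(rows)` accepts, giving one column per requested attribute.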
Rule-Based Matching
# Token-based Matching
import spacy
nlp = spacy.load("en_core_web_sm")
from dframcy.matcher import DframCyMatcher, DframCyPhraseMatcher, DframCyDependencyMatcher
dframcy_matcher = DframCyMatcher(nlp)
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
dframcy_matcher.add("HelloWorld", None, pattern)
doc = dframcy_matcher.nlp("Hello, world! Hello world!")
matches_dataframe = dframcy_matcher(doc)
# Phrase Matching
dframcy_phrase_matcher = DframCyPhraseMatcher(nlp)
terms = [u"Barack Obama", u"Angela Merkel", u"Washington, D.C."]
patterns = [dframcy_phrase_matcher.get_nlp().make_doc(text) for text in terms]
dframcy_phrase_matcher.add("TerminologyList", None, *patterns)
doc = dframcy_phrase_matcher.nlp(u"German Chancellor Angela Merkel and US President Barack Obama "
                                 u"converse in the Oval Office inside the White House in Washington, D.C.")
phrase_matches_dataframe = dframcy_phrase_matcher(doc)
# Dependency Matching
dframcy_dependency_matcher = DframCyDependencyMatcher(nlp)
pattern = [{"RIGHT_ID": "founded_id", "RIGHT_ATTRS": {"ORTH": "founded"}}]
doc = dframcy_dependency_matcher.nlp(u"Bill Gates founded Microsoft. And Elon Musk founded SpaceX")
dependency_matches_dataframe = dframcy_dependency_matcher(doc)
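For readers new to token patterns, the following sketch shows roughly what matching the "HelloWorld" pattern above involves and how matches can be laid out as rows. It is a simplified standard-library illustration, not spaCy's or DframCy's implementation: the regex tokenizer, the `CHECKS` table, `find_matches`, and the row keys are all assumptions made for this example.

```python
import re
import string


def tokenize(text):
    """Very rough tokenizer: runs of word characters or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


# Predicates for the two pattern keys used in the example above.
CHECKS = {
    "LOWER": lambda tok, value: tok.lower() == value,
    "IS_PUNCT": lambda tok, value: (len(tok) == 1 and tok in string.punctuation) == value,
}


def find_matches(tokens, name, pattern):
    """Slide the pattern over the tokens; every key of every step must hold."""
    rows = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(CHECKS[key](tok, val)
               for tok, step in zip(window, pattern)
               for key, val in step.items()):
            rows.append({"match": name, "start": start,
                         "end": start + len(pattern),
                         "span_text": " ".join(window)})
    return rows


pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
tokens = tokenize("Hello, world! Hello world!")
matches = find_matches(tokens, "HelloWorld", pattern)
print(matches)
```

Only the first "Hello, world" matches, because the second occurrence has no punctuation token between the two words.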
Command Line Interface
DframCy provides a command-line interface for converting a plain text file to linguistically annotated text in CSV or JSON format. Previous versions of DframCy also supported CLI utilities for training and evaluating spaCy models from CSV/XLS files. Since the v3 release, spaCy's own training pipeline has become much more flexible and robust, so DframCy no longer adds an extra step just for format conversion (CSV/XLS to spaCy's binary format).
# convert
dframcy dframe -i plain_text.txt -o annotations.csv -f csv
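Once annotations are in CSV form, they can be processed with ordinary tools. The sketch below counts part-of-speech tags using the standard library. It assumes the file has a header row with `text` and `pos_` columns; the in-memory sample is made up for illustration and stands in for `annotations.csv`, whose actual columns and values depend on DframCy and the spaCy model used.

```python
import csv
import io
from collections import Counter

# Made-up sample standing in for annotations.csv (assumption: header row
# with "text" and "pos_" columns; real output may contain more columns).
sample_csv = io.StringIO(
    "text,pos_\n"
    "Apple,PROPN\n"
    "is,AUX\n"
    "looking,VERB\n"
    "buying,VERB\n"
)

reader = csv.DictReader(sample_csv)  # the header row supplies the dict keys
pos_counts = Counter(row["pos_"] for row in reader)
print(pos_counts.most_common())
```

To read a real file, replace the `io.StringIO` object with `open("annotations.csv", newline="")`.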