Excel integration with spaCy. Includes entity training and an entity matcher pipe.
Project description
ExcelCy is a toolkit that integrates Excel into the spaCy NLP training workflow. It trains NER from XLSX files, with sentences sourced from PDF, DOCX, PPT, PNG or JPG. ExcelCy also provides a pipeline to match entities using PhraseMatcher, or Matcher with regular expressions.
ExcelCy is Powerful
Simple Style Training, from the spaCy documentation, demonstrates how to train NER using spaCy:
import random

import spacy

TRAIN_DATA = [
    ("Uber blew through $1 million a week", {'entities': [(0, 4, 'ORG')]}),
    ("Google rebrands its business apps", {'entities': [(0, 6, 'ORG')]})]
# start from a blank English pipeline (spaCy v2 training API)
nlp = spacy.blank('en')
optimizer = nlp.begin_training()
for i in range(20):
    # shuffle so each pass sees the examples in a different order
    random.shuffle(TRAIN_DATA)
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer)
nlp.to_disk('/model')
TRAIN_DATA describes the sentences and the annotated entities to be trained. Counting character offsets by hand every time is cumbersome. With ExcelCy, the (start, end) character offsets can be omitted.
from excelcy import ExcelCy
# collect sentences, annotate Entities and train NER using spaCy
excelcy = ExcelCy.execute(file_path='https://github.com/kororo/excelcy/raw/master/tests/data/test_data_01.xlsx')
# use the nlp object as per spaCy API
doc = excelcy.nlp('Google rebrands its business apps')
# or save it for faster bootstrap for application
excelcy.nlp.to_disk('/model')
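To illustrate the bookkeeping that gets removed, the sketch below derives the (start, end) offsets from a phrase-to-label mapping with plain string search. It is not ExcelCy's implementation: `annotate` is a hypothetical helper that only shows the kind of counting the Excel sheets take over.

```python
def annotate(text, phrases):
    """Build a spaCy simple-style training example from phrase -> label pairs.

    Hypothetical helper for illustration only; it is not part of ExcelCy.
    """
    entities = []
    for phrase, label in phrases.items():
        start = text.find(phrase)
        # record every occurrence of the phrase, left to right
        while start != -1:
            end = start + len(phrase)
            entities.append((start, end, label))
            start = text.find(phrase, end)
    return text, {'entities': sorted(entities)}


TRAIN_DATA = [
    annotate('Uber blew through $1 million a week', {'Uber': 'ORG'}),
    annotate('Google rebrands its business apps', {'Google': 'ORG'}),
]
print(TRAIN_DATA)
```

The result has the same shape as the hand-written TRAIN_DATA above, so it could be passed to the same training loop.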
ExcelCy is Friendly
ExcelCy training is divided into phases. An example Excel file can be found in tests/data/test_data_01.xlsx:
1. Discovery
The first phase collects sentences from the data sources listed in sheet “source”. A data source can be either:
Text: Direct sentence values.
Files: PDF, DOCX, PPT, PNG or JPG will be parsed using textract.
2. Preparation
In the next phase, the sentences are analysed in sheet “prepare”, based on:
Current Data Model: Using spaCy API of nlp(sentence).ents
Phrase pattern: Robertus Johansyah, Uber, Google, Amazon
Regex pattern: ^([0-1]?[0-9]|2[0-3]):[0-5][0-9]$
3. Training
This is the main phase of NER training, as described in Simple Style Training. The data is iterated from sheet “train”; check sheet “config” to control the parameters.
4. Consolidation
The last phase is to test and save the results, and to repeat the phases if required.
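As a rough illustration of the regex pattern used in the Preparation phase, the sketch below applies it to whitespace-separated tokens with the standard `re` module. Because the pattern is anchored with `^` and `$`, it is tested against each token rather than the whole sentence. How ExcelCy wires the pattern into spaCy's Matcher is not shown here; `match_tokens` and the `TIME` label are assumptions made for the example.

```python
import re

# Regex pattern from the Preparation phase: a 24-hour HH:MM time.
TIME_PATTERN = re.compile(r'^([0-1]?[0-9]|2[0-3]):[0-5][0-9]$')


def match_tokens(sentence, pattern, label):
    """Return (start, end, label) for each whitespace-separated token that
    matches the anchored pattern. Hypothetical helper, not ExcelCy's API."""
    spans = []
    for token in re.finditer(r'\S+', sentence):
        if pattern.match(token.group()):
            spans.append((token.start(), token.end(), label))
    return spans


spans = match_tokens('The meeting moved from 9:30 to 14:05 today', TIME_PATTERN, 'TIME')
print(spans)
```

The spans use the same (start, end, label) layout as the entities in TRAIN_DATA, which is what lets prepared matches feed the Training phase.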
ExcelCy is Comprehensive
Under the hood, ExcelCy has a strong and well-defined data storage. The data can be inspected at any of the phases above.
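import json

from excelcy import ExcelCy
# Config is one of the data classes defined in storage.py
from excelcy.storage import Config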
excelcy = ExcelCy()
# load configuration from XLSX or YML or JSON
# excelcy.load(file_path='test_data_01.xlsx')
# or define manually
excelcy.storage.config = Config(nlp_base='en_core_web_sm', train_iteration=2, train_drop=0.2)
print(json.dumps(excelcy.storage.items(), indent=2))
# add sources
excelcy.storage.source.add(kind='text', value='Robertus Johansyah is the maintainer ExcelCy')
excelcy.storage.source.add(kind='textract', value='tests/data/source/test_source_01.txt')
excelcy.discover()
print(json.dumps(excelcy.storage.items(), indent=2))
# add phrase matcher Robertus Johansyah -> PERSON
excelcy.storage.prepare.add(kind='phrase', value='Robertus Johansyah', entity='PERSON')
excelcy.prepare()
print(json.dumps(excelcy.storage.items(), indent=2))
# train it
excelcy.train()
print(json.dumps(excelcy.storage.items(), indent=2))
# test it
doc = excelcy.nlp('Robertus Johansyah is maintainer ExcelCy')
print(json.dumps(excelcy.storage.items(), indent=2))
Features
Load multiple data sources such as Word documents, PowerPoint presentations, PDF or images.
Import/Export configuration with JSON, YML or Excel.
Add custom Entity labels.
Rule-based phrase matching using PhraseMatcher.
Rule-based matching using regex + Matcher.
Train a Named Entity Recogniser with ease.
Install
Either use the famous pip or clone this repository and execute the setup.py file.
$ pip install excelcy
# ensure you have the language model installed before
$ spacy download en
Train
To train the spaCy model:
from excelcy import ExcelCy
excelcy = ExcelCy.execute(file_path='test_data_01.xlsx')
Data Definition
ExcelCy has a data definition, which is expressed in api.yml. As long as data is given in this specific format and structure, ExcelCy will be able to support any type of data format. Check out the Excel file format in api.xlsx. Data classes are defined with attrs; see storage.py for more detail.
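To give a feel for why a class-based data definition makes the storage format-agnostic, here is a minimal sketch. It uses the standard library's `dataclasses` as a stand-in for attrs so that it runs without extra dependencies. The classes below are illustrative stand-ins, not the ones in storage.py: the Config fields follow the `Config(...)` call shown earlier, while the `ref` field is an assumption.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Config:
    # parameter names taken from the Config(...) call shown earlier
    nlp_base: str = 'en_core_web_sm'
    train_iteration: int = 2
    train_drop: float = 0.2


@dataclass
class Source:
    idx: int    # row identifier other sheets can refer to
    kind: str   # 'text' or 'textract'
    value: str


@dataclass
class Train:
    idx: int
    text: str
    ref: int    # hypothetical field: idx of the Source row this sentence came from


@dataclass
class Storage:
    config: Config = field(default_factory=Config)
    source: List[Source] = field(default_factory=list)
    train: List[Train] = field(default_factory=list)


storage = Storage()
storage.source.append(Source(idx=1, kind='text', value='Uber blew through $1 million a week'))
storage.train.append(Train(idx=1, text='Uber blew through $1 million a week', ref=1))

# once the storage is a plain dict, it can be serialised to JSON
exported = json.dumps(asdict(storage), indent=2)
print(exported)
```

Because every record is a typed class that converts to a plain dictionary, the same data could in principle be written out as YML or as spreadsheet rows.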
TODO
[X] Start cracking into spaCy
[ ] More features and enhancements listed here
[ ] [link] Add CLI support
[ ] [link] Add export outputs such as identified Entities, Tags
[ ] Add special case for tokenisation described here
[ ] Add custom tags.
[ ] Add classifier text training described here
[ ] Add exception subtext when there are multiple occurrences in the text. (Google Pay is awesome Google product)
[ ] Add tag annotation in sheet: train
[ ] Add ref in data storage
[ ] Improve speed and performance
[X] Add list of patterns easily (such as kitten breeds).
[X] Add more data structure check in Excel and more warning messages
[X] Add plugin support (for now, extend ExcelCy instead).
[X] [link] Improve experience
[X] [link] Add more file formats such as YML and JSON. Standardise and document the data structure.
[X] Add support to accept sentences to Excel
[X] Submit to Prodigy Universe
FAQ
What are the idx columns in the Excel sheets?
They provide a reference between two things. For example, a row in sheet “train” can record which row in sheet “source” its sentence was generated from.
Can ExcelCy import/export to X, Y, Z data format?
ExcelCy has strong and well-defined data storage, thanks to attrs. It is possible to import/export data in any format.
Does ExcelCy accept suggestions/ideas?
Yes! Please submit them as a new issue with the label “enhancement”.
Acknowledgement
This project uses other awesome projects:
attrs: Python Classes Without Boilerplate.
pyexcel: Single API for reading, manipulating and writing data in csv, ods, xls, xlsx and xlsm files.
pyyaml: The next generation YAML parser and emitter for Python.
spacy: Industrial-strength Natural Language Processing (NLP) with Python and Cython.
textract: extract text from any document. no muss. no fuss.