
A full spaCy pipeline and models for scientific/biomedical documents.


SciSpaCy

This repository contains custom pipes and models related to using spaCy for scientific documents. In particular, there is a custom tokenizer that adds tokenization rules on top of spaCy's rule-based tokenizer, and a custom sentence segmenter that adds sentence segmentation rules on top of spaCy's statistical sentence segmenter.

Usage

Using SciSpaCy as is

To use SciSpaCy as is, follow these steps:

  1. Clone this repository
  2. From within this repository, run ./scripts/create_model_package.sh ./scispacy/models/combined_rule_tokenizer_and_segmenter
  3. Run python setup.py sdist
  4. Run pip install --user dist/scispacy-1.0.0.tar.gz
  5. Run pip install --user dist/en_scispacy_core_web_sm-1.0.0.tar.gz

Once you have completed the above steps, you can load SciSpaCy as you would any other spaCy model. For example:

import spacy
nlp = spacy.load("en_scispacy_core_web_sm")

To make full use of this package, you also need to preprocess your text: pass the raw text through custom_tokenizer.remove_new_lines() before passing the result to spaCy.
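The order of operations (clean the raw text first, then parse it) can be sketched as follows. This is a minimal illustration, not the package's code: join_wrapped_lines is a hypothetical stand-in for custom_tokenizer.remove_new_lines(), whose exact behavior is not described here, and parse accepts any callable so the sketch runs without an installed model. In real use you would pass the loaded nlp object and the package's own function instead.

```python
import re
from typing import Callable


def join_wrapped_lines(text: str) -> str:
    """Hypothetical stand-in for custom_tokenizer.remove_new_lines().

    Collapses each newline, together with the whitespace around it,
    into a single space. The real function may behave differently.
    """
    return re.sub(r"\s*\n\s*", " ", text).strip()


def parse(nlp: Callable, raw_text: str, preprocess: Callable[[str], str]):
    """Clean the raw text first, then hand the result to the pipeline."""
    return nlp(preprocess(raw_text))


if __name__ == "__main__":
    raw = "The protein binds\nto the receptor.\nIt is then degraded."
    # str.split stands in for a loaded spaCy pipeline in this demo.
    print(parse(str.split, raw, join_wrapped_lines))
```

Keeping the preprocessing step as a separate argument makes it easy to swap the stand-in for the real function once the package is installed.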

Modifying SciSpaCy

Changing the tokenizer or segmenter

To change the tokenizer or segmenter, all you need to do is change the tokenization or segmentation function, rebuild the model folder, and then follow the above steps for using SciSpaCy as is. In detail:

  1. Change the tokenizer (combined_rule_tokenizer() in scispacy/custom_tokenizer.py) and/or the segmenter (combined_rule_sentence_segmenter() in scispacy/custom_sentence_segmenter.py)
  2. Rebuild the model folder by calling save_model(create_combined_rule_model, /path/to/model/folder), which is defined in scispacy/util.py
  3. Edit the newly created meta.json as you see fit
  4. Go through the steps above for using SciSpaCy as is
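Step 3 can be done by hand or scripted. The sketch below assumes that meta.json sits at the top level of the rebuilt model folder. update_meta is a hypothetical helper, not part of SciSpaCy, and the keys and values shown are illustrative.

```python
import json
from pathlib import Path


def update_meta(model_dir, **fields):
    """Overwrite selected keys in <model_dir>/meta.json and return the result."""
    meta_path = Path(model_dir) / "meta.json"
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    meta.update(fields)
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return meta


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        # Placeholder content; in practice the file already exists in the
        # folder produced by the rebuild step.
        (Path(tmp) / "meta.json").write_text('{"lang": "en", "name": "model"}')
        print(update_meta(tmp, version="1.0.0", description="Custom tokenizer build"))
```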

Adding a new pipe or further customization

Adding a new pipe requires that you:

  1. Create your new pipe
  2. Save your model following the pattern described above for changing the tokenizer or segmenter (steps 2 and 3)
  3. Add your new pipe to Language.factories in proto_model/__init__.py
  4. Follow the steps for using SciSpaCy as is
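Steps 1 and 3 can be sketched as follows. The sketch assumes the spaCy 2.x convention in which a pipe is a callable that receives a document and returns it, and Language.factories maps a pipe name to a function that builds the pipe. TokenCounter, the name "token_counter", and the register helper are all hypothetical. The language class is passed in as a parameter so the sketch runs without spaCy installed; in proto_model/__init__.py it would be the Language class defined there.

```python
class TokenCounter:
    """Hypothetical pipe: records the number of tokens on the document."""

    name = "token_counter"

    def __init__(self, nlp=None, **cfg):
        self.nlp = nlp

    def __call__(self, doc):
        # A pipe receives the document, annotates it, and returns it.
        doc.user_data["n_tokens"] = len(doc)
        return doc


def register(language_cls):
    """Add the pipe's factory to language_cls.factories."""
    language_cls.factories[TokenCounter.name] = (
        lambda nlp, **cfg: TokenCounter(nlp, **cfg)
    )
```

Registering a factory rather than an instance lets spaCy rebuild the pipe by name when the saved model is loaded.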
