
A full spaCy pipeline and models for scientific/biomedical documents.

SciSpaCy

This repository contains custom pipes and models related to using spaCy for scientific documents. In particular, there is a custom tokenizer that adds tokenization rules on top of spaCy's rule-based tokenizer, and a custom sentence segmenter that adds sentence segmentation rules on top of spaCy's statistical sentence segmenter.

Usage

Using SciSpaCy as is

To use SciSpaCy as is, follow these steps:

  1. Clone this repository
  2. From within this repository, run ./scripts/create_model_package.sh ./scispacy/models/combined_rule_tokenizer_and_segmenter
  3. Run python setup.py sdist
  4. Run pip install --user dist/scispacy-1.0.0.tar.gz
  5. Run pip install --user dist/en_scispacy_core_web_sm-1.0.0.tar.gz

Once you have completed the above steps, you can load SciSpaCy as you would any other spaCy model. For example:

import spacy
nlp = spacy.load("en_scispacy_core_web_sm")

To make full use of this package, you will also need to preprocess the text that you will be running through spaCy. This means passing the raw text through custom_tokenizer.remove_new_lines() before passing it through spaCy.
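
For example, here is a minimal end-to-end sketch. It assumes remove_new_lines can be imported from scispacy.custom_tokenizer (as the path above suggests), takes a string and returns a string, and that the model package built in the steps above is installed:

import spacy
from scispacy.custom_tokenizer import remove_new_lines

nlp = spacy.load("en_scispacy_core_web_sm")

# Raw scientific text often contains hard line breaks in the middle of sentences.
raw_text = "Induction of cytokine expression in leukocytes\nby thrombin-stimulated platelets."

# Strip the line breaks before handing the text to spaCy.
clean_text = remove_new_lines(raw_text)
doc = nlp(clean_text)

print([token.text for token in doc])
print([sent.text for sent in doc.sents])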

Modifying SciSpaCy

Changing the tokenizer or segmenter

To change the tokenizer or segmenter, all you need to do is change the tokenization or segmentation function, rebuild the model folder, and then follow the above steps for using SciSpaCy as is. In detail:

  1. Change the tokenizer (combined_rule_tokenizer() in scispacy/custom_tokenizer.py) and/or the segmenter (combined_rule_sentence_segmenter() in scispacy/custom_sentence_segmenter.py)
  2. Rebuild the model folder by running save_model(create_combined_rule_model, /path/to/model/folder) in scispacy/util.py (see the sketch after this list)
  3. Edit the newly created meta.json as you see fit
  4. Go through the steps above for using SciSpaCy as is
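
For step 2, the call might look like the following sketch, assuming save_model and create_combined_rule_model are both importable from scispacy.util and keep the signature implied above (a model-building function plus an output directory):

from scispacy.util import save_model, create_combined_rule_model

# Rebuild the serialized model folder from the (modified) tokenizer/segmenter.
save_model(create_combined_rule_model, "/path/to/model/folder")
# Then edit the generated meta.json (step 3) before packaging.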

Adding a new pipe or further customization

Adding a new pipe requires that you:

  1. Create your new pipe
  2. Save your model following the pattern described above for changing the tokenizer or segmenter (steps 2 and 3)
  3. Add your new pipe to Language.factories in proto_model/__init__.py (see the sketch after this list)
  4. Follow the steps for using SciSpaCy as is
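
As a rough sketch of step 3, the registration below uses spaCy 2.x's Language.factories dictionary; the pipe name and component are hypothetical, not part of this repository:

# In proto_model/__init__.py
from spacy.language import Language

def token_counter(doc):
    """Hypothetical pipe: print the token count and pass the Doc along unchanged."""
    print("tokens:", len(doc))
    return doc

# Register a factory so spacy.load() can construct the pipe by name from meta.json.
Language.factories["token_counter"] = lambda nlp, **cfg: token_counter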
