PYthon Automated Term Extraction

Python implementation of term extraction algorithms such as C-Value, Basic, Combo Basic, Weirdness and Term Extractor, using spaCy POS tagging.

Warning: Weirdness and Term Extractor do not work when installed through pip at the moment, due to errors with uploading CSV files.

Installation

Using pip:

pip install pyate https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz

Quickstart

To get started, simply call one of the implemented algorithms. According to studies, combo_basic is the most precise, though basic and cvalue are not far behind.

from pyate import combo_basic

# source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1994795/
string = """Central to the development of cancer are genetic changes that endow these “cancer cells” with many of the
hallmarks of cancer, such as self-sufficient growth and resistance to anti-growth and pro-death signals. However, while the
genetic changes that occur within cancer cells themselves, such as activated oncogenes or dysfunctional tumor suppressors,
are responsible for many aspects of cancer development, they are not sufficient. Tumor promotion and progression are
dependent on ancillary processes provided by cells of the tumor environment but that are not necessarily cancerous 
themselves. Inflammation has long been associated with the development of cancer. This review will discuss the reflexive 
relationship between cancer and inflammation with particular focus on how considering the role of inflammation in physiologic 
processes such as the maintenance of tissue homeostasis and repair may provide a logical framework for understanding the 
connection between the inflammatory response and cancer."""

print(combo_basic(string).sort_values(ascending=False))
""" (Output)
dysfunctional tumor                1.443147
tumor suppressors                  1.443147
genetic changes                    1.386294
cancer cells                       1.386294
dysfunctional tumor suppressors    1.298612
logical framework                  0.693147
sufficient growth                  0.693147
death signals                      0.693147
many aspects                       0.693147
inflammatory response              0.693147
tumor promotion                    0.693147
ancillary processes                0.693147
tumor environment                  0.693147
reflexive relationship             0.693147
particular focus                   0.693147
physiologic processes              0.693147
tissue homeostasis                 0.693147
cancer development                 0.693147
dtype: float64
"""

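The scores printed above are consistent with a simple closed form, which the sketch below reproduces for a few of the terms. This is inferred from the sample output only, not taken from pyate's source: it assumes score = f * ln(n) + alpha * (number of longer candidates containing the term) + beta * (number of shorter candidates nested in it), where f is the term's frequency and n its length in words. The names combo_basic_sketch and is_sublist, the hand-counted frequencies, and the weights alpha = 0.75 and beta = 0.1 are all assumptions made for illustration.

```python
import math


def is_sublist(short, long):
    """Return True if `short` occurs as a contiguous run inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def combo_basic_sketch(freqs, alpha=0.75, beta=0.1):
    """Score candidate terms with an ASSUMED formula inferred from the sample output."""
    tokens = {term: term.split() for term in freqs}
    scores = {}
    for term, words in tokens.items():
        others = [w for t, w in tokens.items() if t != term]
        # Longer candidates that contain this term.
        containing = sum(is_sublist(words, w) for w in others)
        # Shorter candidates nested inside this term.
        nested = sum(is_sublist(w, words) for w in others)
        scores[term] = (
            freqs[term] * math.log(len(words)) + alpha * containing + beta * nested
        )
    return scores


# Frequencies counted by hand from the Quickstart passage (a subset of the candidates).
freqs = {
    "dysfunctional tumor suppressors": 1,
    "dysfunctional tumor": 1,
    "tumor suppressors": 1,
    "genetic changes": 2,
    "cancer cells": 2,
    "tissue homeostasis": 1,
}

for term, score in sorted(combo_basic_sketch(freqs).items(), key=lambda kv: -kv[1]):
    print(f"{term:<32} {score:.6f}")
```

With these inputs the sketch gives 1.443147 for "dysfunctional tumor" and 1.298612 for "dysfunctional tumor suppressors", matching the listing. Terms that occur once and share no words with another candidate all score ln(2), which is the repeated 0.693147.
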
To add term extraction to a spaCy pipeline, use spaCy's add_pipe method.

import spacy
from pyate.term_extraction_pipeline import TermExtractionPipeline

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(TermExtractionPipeline())
doc = nlp(string)
print(doc._.combo_basic.sort_values(ascending=False).head(5))
""" (Output)
dysfunctional tumor                1.443147
tumor suppressors                  1.443147
genetic changes                    1.386294
cancer cells                       1.386294
dysfunctional tumor suppressors    1.298612
dtype: float64
"""

TermExtractionPipeline.__init__ is defined as follows:

__init__(
  self,
  func: Callable[..., pd.Series] = combo_basic,
  *args,
  **kwargs
)

Here, func is the term extraction algorithm: it takes a corpus (either a string or an iterator of strings) and returns a Pandas Series mapping each term to its termhood. By default, func is combo_basic. Use args and kwargs to override the function's default values, which you can inspect by running help on it (fuller documentation may follow later).
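The signature suggests a forwarding pattern: extra arguments given at construction time are handed to func whenever it runs. The sketch below illustrates only that pattern. ForwardingExtractor and toy_extractor are hypothetical stand-ins, not pyate code, and the toy function returns a plain dict rather than a Pandas Series so the example needs nothing beyond the standard library.

```python
from typing import Callable, Dict


def toy_extractor(text: str, min_count: int = 1) -> Dict[str, int]:
    """Hypothetical stand-in for an algorithm such as combo_basic: counts two-word sequences."""
    words = text.lower().split()
    counts: Dict[str, int] = {}
    for first, second in zip(words, words[1:]):
        bigram = f"{first} {second}"
        counts[bigram] = counts.get(bigram, 0) + 1
    # Keep only sequences seen at least `min_count` times.
    return {term: n for term, n in counts.items() if n >= min_count}


class ForwardingExtractor:
    """Hypothetical class mirroring the __init__ signature above; not pyate code."""

    def __init__(self, func: Callable[..., Dict[str, int]] = toy_extractor, *args, **kwargs):
        self.func = func
        self.args = args
        self.kwargs = kwargs

    def __call__(self, text: str) -> Dict[str, int]:
        # Arguments stored at construction time are passed on at call time.
        return self.func(text, *self.args, **self.kwargs)


text = "tumor cells and tumor cells"
default = ForwardingExtractor()                          # uses min_count=1
strict = ForwardingExtractor(toy_extractor, min_count=2)  # overrides the default

print(default(text))
print(strict(text))
```

Because func is the first positional parameter, any positional override has to come after it, which is why keyword arguments such as min_count=2 are usually the clearer choice.
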

Summary of functions

Each of cvalue, basic, combo_basic, weirdness and term_extractor takes a string or an iterator of strings and returns a Pandas Series of term-value pairs, where a higher value indicates a higher chance that the term is domain-specific. In addition, weirdness and term_extractor take a general_corpus keyword argument, which must be an iterator of strings.
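To see why a general corpus is needed, recall that weirdness is commonly defined as a word's relative frequency in the domain corpus divided by its relative frequency in a general corpus. The sketch below is a word-level illustration of that ratio, not pyate's implementation: the name word_weirdness, the whitespace tokenization, the add-one smoothing, and the tiny corpora are all assumptions.

```python
from collections import Counter
from typing import Dict, Iterable


def word_weirdness(domain_corpus: Iterable[str], general_corpus: Iterable[str]) -> Dict[str, float]:
    """Relative frequency in the domain corpus divided by that in the general corpus."""
    domain = Counter(w for doc in domain_corpus for w in doc.lower().split())
    general = Counter(w for doc in general_corpus for w in doc.lower().split())
    domain_total = sum(domain.values())
    general_total = sum(general.values())
    scores = {}
    for word, count in domain.items():
        # Add-one smoothing (an assumption) avoids dividing by zero for unseen words.
        general_rate = (general[word] + 1) / (general_total + 1)
        scores[word] = (count / domain_total) / general_rate
    return scores


# Toy corpora, invented for this example.
domain_docs = ["the tumor cells resist the signals", "the tumor grows"]
general_docs = ["the cat sat on the mat", "the dog chased the cat"]

scores = word_weirdness(domain_docs, general_docs)
for word in ("tumor", "cells", "the"):
    print(word, round(scores[word], 3))
```

Words that are frequent in the domain text but rare in general text, such as "tumor" here, score above 1, while common function words such as "the" score below 1.
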

Todo

  • Add PU-ATR algorithm, since its precision is a lot higher, though it is more computationally expensive
  • Add PageRank algorithm
  • Add sources

Sources

I cannot seem to find the original Basic and Combo Basic papers, but I found papers that reference them.
