
A toolkit for Scientific Document Processing

Project description

CHANGE LOG

Source Code

  • Rename SingleSummarization to Summarization.
  • Change the format of output files from .txt to .json.

Documentation

  • Move the definition of Pipeline class from Usage to Contribution Guide.
  • Add catalog for Contribution Guide.
  • Add examples for choosing devices in Usage.

SciAssist


About

This is the repository of SciAssist, a toolkit that assists scientists' research. SciAssist currently supports reference string parsing; more functions are under active development by WING@NUS, Singapore. The project was built on an open-source template by ashleve.

Installation

pip install SciAssist

Setup Grobid for pdf processing

After installing the package, you can set up Grobid with the CLI:

setup_grobid

This will set up Grobid. Once installation finishes, start the Grobid server with:

run_grobid
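Before running PDF-based predictions, it can help to verify that the Grobid server is actually up. A minimal sketch, assuming Grobid's default port 8070 and its standard /api/isalive endpoint; adjust the base URL if your setup_grobid configuration differs:

```python
# Hedged sketch: check whether a local Grobid server is reachable.
# Assumes Grobid's default port (8070) and its standard /api/isalive
# endpoint; these are assumptions, not SciAssist-specific settings.
from urllib.request import urlopen
from urllib.error import URLError


def grobid_is_alive(base_url="http://localhost:8070"):
    """Return True if the Grobid server responds at /api/isalive."""
    try:
        with urlopen(f"{base_url}/api/isalive", timeout=5) as resp:
            return resp.getcode() == 200
    except (URLError, OSError):
        return False
```

If this returns False, re-run run_grobid before attempting PDF processing.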

Usage

Here are some example usages.

Reference string parsing:

from SciAssist import ReferenceStringParsing

# Set device="cpu" if you want to use only CPU. The default device is "gpu".
# ref_parser = ReferenceStringParsing(device="cpu")
ref_parser = ReferenceStringParsing(device="gpu")

# For string
res = ref_parser.predict(
    """Calzolari, N. (1982) Towards the organization of lexical definitions on a 
    database structure. In E. Hajicova (Ed.), COLING '82 Abstracts, Charles 
    University, Prague, pp.61-64.""", type="str")
# For text
res = ref_parser.predict("test.txt", type="txt")
# For pdf
res = ref_parser.predict("test.pdf")

Summarization of a single document:

from SciAssist import Summarization

# Set device="cpu" if you want to use only CPU. The default device is "gpu".
# summarizer = Summarization(device="cpu")
summarizer = Summarization(device="gpu")

text = """1 INTRODUCTION . Statistical learning theory studies the learning properties of machine learning algorithms , and more fundamentally , the conditions under which learning from finite data is possible . 
In this context , classical learning theory focuses on the size of the hypothesis space in terms of different complexity measures , such as combinatorial dimensions , covering numbers and Rademacher/Gaussian complexities ( Shalev-Shwartz & Ben-David , 2014 ; Boucheron et al. , 2005 ) . 
Another more recent approach is based on defining suitable notions of stability with respect to perturbation of the data ( Bousquet & Elisseeff , 2001 ; Kutin & Niyogi , 2002 ) . 
In this view , the continuity of the process that maps data to estimators is crucial , rather than the complexity of the hypothesis space . 
Different notions of stability can be considered , depending on the data perturbation and metric considered ( Kutin & Niyogi , 2002 ) . 
Interestingly , the stability and complexity approaches to characterizing the learnability of problems are not at odds with each other , and can be shown to be equivalent as shown in Poggio et al . ( 2004 ) and Shalev-Shwartz et al . ( 2010 ) . 
In modern machine learning overparameterized models , with a larger number of parameters than the size of the training data , have become common . 
The ability of these models to generalize is well explained by classical statistical learning theory as long as some form of regularization is used in the training process ( Bühlmann & Van De Geer , 2011 ; Steinwart & Christmann , 2008 ) . 
However , it was recently shown - first for deep networks ( Zhang et al. , 2017 ) , and more recently for kernel methods ( Belkin et al. , 2019 ) - that learning is possible in the absence of regularization , i.e. , when perfectly fitting/interpolating the data . 
Much recent work in statistical learning theory has tried to find theoretical ground for this empirical finding . 
Since learning using models that interpolate is not exclusive to deep neural networks , we study generalization in the presence of interpolation in the case of kernel methods . 
We study both linear and kernel least squares problems in this paper . """

# For string
res = summarizer.predict(text, type="str")
# For text
res = summarizer.predict("bodytext.txt", type="txt")
# For pdf
res = summarizer.predict("raw.pdf")

Dataset mention extraction:

from SciAssist import DatasetExtraction

# Set device="cpu" if you want to use only CPU. The default device is "gpu".
# extractor = DatasetExtraction(device="cpu")
extractor = DatasetExtraction(device="gpu")

# For string
res = extractor.extract("The impact of gender identity on emotions was examined by researchers using a subsample from the National Longitudinal Study of Adolescent Health. The study aimed to investigate the direct effects of gender identity on emotional experiences and expression. By focusing on a subsample of the larger study, the researchers were able to hone in on the specific relationship between gender identity and emotions. Through their analysis, the researchers sought to determine whether gender identity could have a significant and direct impact on emotional well-being. The findings of the study have important implications for our understanding of the complex interplay between gender identity and emotional experiences, and may help to inform future interventions and support for individuals who experience gender-related emotional distress.", type="str")
# For text: please input the path of your .txt file
res = extractor.extract("test.txt", type="txt")
# For pdf: please input the path of your .pdf file
res = extractor.extract("test.pdf", type="pdf")
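The change log notes that output files moved from .txt to .json. If you want to persist a prediction result yourself, a minimal sketch, assuming the returned object is JSON-serializable (the `res` value below is illustrative, not the toolkit's actual output schema):

```python
import json

# Illustrative stand-in for a pipeline result; the real structure
# returned by SciAssist may differ.
res = {"datasets": ["National Longitudinal Study of Adolescent Health"]}

# Write the result as pretty-printed UTF-8 JSON.
with open("result.json", "w", encoding="utf-8") as f:
    json.dump(res, f, ensure_ascii=False, indent=2)
```

Reloading with `json.load` recovers the same structure.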

Contribution

Here is a brief introduction to incorporating a new task into SciAssist. Generally, to add a new task, you will need to:

1. Git clone this repo and prepare the virtual environment.
2. Install Grobid Server.
3. Create a LightningModule and a LightningDataModule.
4. Train a model.
5. Provide a pipeline for users.
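The pipeline in step 5 follows the pattern the usage examples above rely on: a class whose predict() dispatches on an input `type` of "str", "txt", or "pdf". A minimal sketch of that pattern; the class and method names here are illustrative, not SciAssist's actual internals:

```python
from pathlib import Path


class ExamplePipeline:
    """Illustrative pipeline skeleton (not a real SciAssist class)."""

    def __init__(self, device="gpu"):
        self.device = device  # "gpu" or "cpu", as in the usage examples

    def predict(self, source, type="pdf"):
        # Dispatch on the input type, mirroring the documented interface.
        if type == "str":
            text = source
        elif type == "txt":
            text = Path(source).read_text(encoding="utf-8")
        elif type == "pdf":
            text = self._parse_pdf(source)
        else:
            raise ValueError(f"unsupported input type: {type}")
        return self._run_model(text)

    def _parse_pdf(self, path):
        # A real pipeline would extract text via the Grobid server here.
        raise NotImplementedError

    def _run_model(self, text):
        # Placeholder: a real pipeline runs its trained model on `text`.
        return {"output": text[:50]}
```

The real classes are described in the Contribution Guide; this sketch only shows the dispatch shape users interact with.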

We provide a step-by-step contribution guide; see SciAssist's documentation.

LICENSE

This toolkit is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license. Read LICENSE for more information.

Download files

Download the file for your platform.

Source Distribution

SciAssist-0.1.4.tar.gz (108.4 kB)

Uploaded Source

Built Distribution

SciAssist-0.1.4-py3-none-any.whl (130.5 kB)

Uploaded Python 3

File details

Details for the file SciAssist-0.1.4.tar.gz.

File metadata

  • Download URL: SciAssist-0.1.4.tar.gz
  • Upload date:
  • Size: 108.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.17

File hashes

Hashes for SciAssist-0.1.4.tar.gz
Algorithm Hash digest
SHA256 e0b1ca32fce6c4ee89ab583baaf2212f16388af6ed40e6b8bdbba4cfd0cd2dcf
MD5 3a41cc65a10e762724b0df8c4332bb8b
BLAKE2b-256 b48efde03ab3ea0b16742758c9c9517b80420b0209daf31d618b927a5e876ced


File details

Details for the file SciAssist-0.1.4-py3-none-any.whl.

File metadata

  • Download URL: SciAssist-0.1.4-py3-none-any.whl
  • Upload date:
  • Size: 130.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.17

File hashes

Hashes for SciAssist-0.1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 dda99ce7d9f50f26bf7881bc8643f1cbba782b724a0b16dafc1878165bac3e3f
MD5 9e189c5adad5120923e9a44048337f2c
BLAKE2b-256 106d901c7c7e4de3efade5fa1b623f36783593cde77ee64e94ee08089b2e4e81

