ETL for parsing scientific papers.

Scholaretl

An Extract, Transform and Load (ETL) API for parsing scientific papers. This package is meant to be used with scholarag, our Retrieval Augmented Generation (RAG) tool. It is mainly used to parse scientific papers coming from different sources so that they are compatible with common databases.

  1. Quickstart
  2. List of endpoints
  3. Docker image
  4. Grobid parsing
  5. Funding and Acknowledgement

Quickstart

Step 1 : Install the package.

Simply install the package from PyPI.

pip install scholaretl

You can also clone the GitHub repo and install the package yourself.

Step 2 : Run the FastAPI app.

A simple script is installed with the package and lets you run the app locally. By default the API listens on port 8000.

scholaretl-api

See the -h flag for non-default arguments.

Step 3 : Test the app.

Now that the server is running, you can either query it with curl to get information:

curl http://localhost:8000/settings

Or open a browser at http://localhost:8000/docs and try some of the endpoints. For example, use the parse/pypdf endpoint to parse a local PDF file. Parsing XML files works out of the box. Keep in mind that the XML parsing endpoints are meant to be used with files coming from specific scientific journals (see List of endpoints).
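
The same check can be scripted. The following is a minimal sketch using only the Python standard library; the helper names settings_url and fetch_settings are hypothetical, and it assumes that /settings returns a JSON body, which this README does not state.

```python
import json
from urllib.request import urlopen


def settings_url(base_url: str) -> str:
    """Build the URL of the /settings endpoint from the API base URL."""
    return base_url.rstrip("/") + "/settings"


def fetch_settings(base_url: str = "http://localhost:8000", timeout: float = 5.0):
    """GET /settings and decode the body as JSON (assumed response format)."""
    with urlopen(settings_url(base_url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Call fetch_settings() once scholaretl-api is running; it raises an error if the server is not reachable.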

List of endpoints

Once the app is deployed, all of these endpoints are available:

  • /parse/pubmed_xml: Parses XMLs coming from PubMed.
  • /parse/jats_xml: Parses XMLs coming from PMC.
  • /parse/tei_xml: Parses XMLs produced by Grobid.
  • /parse/xocs_xml: Parses XMLs coming from Scopus (Elsevier).
  • /parse/pypdf: Parses PDFs without keeping the structure of the document.
  • /parse/grobidpdf: Parses PDFs keeping the structure of the document (REQUIRES Grobid, see Grobid parsing).
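
When calling the API from a script, it can help to keep the endpoint paths in one place. The sketch below is only an illustration: the ENDPOINTS table and the endpoint_url helper are hypothetical names, not part of scholaretl, and the default base URL is the local one from the Quickstart.

```python
# Endpoint paths copied from the list above, keyed by parser name.
ENDPOINTS = {
    "pubmed_xml": "/parse/pubmed_xml",  # XML from PubMed
    "jats_xml": "/parse/jats_xml",      # XML from PMC
    "tei_xml": "/parse/tei_xml",        # XML produced by Grobid
    "xocs_xml": "/parse/xocs_xml",      # XML from Scopus (Elsevier)
    "pypdf": "/parse/pypdf",            # PDF, document structure not kept
    "grobidpdf": "/parse/grobidpdf",    # PDF, structure kept (needs Grobid)
}


def endpoint_url(parser: str, base_url: str = "http://localhost:8000") -> str:
    """Return the full URL of a parsing endpoint, or raise ValueError."""
    try:
        path = ENDPOINTS[parser]
    except KeyError:
        known = ", ".join(sorted(ENDPOINTS))
        raise ValueError(f"unknown parser {parser!r}; expected one of: {known}") from None
    return base_url.rstrip("/") + path
```

The request format each endpoint expects is described in the interactive documentation at /docs.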

Docker image

If a Docker container is required, it can be built using the provided Dockerfile. Make sure you have Docker installed.

docker build -t scholaretl:latest . --platform linux/amd64

The --platform linux/amd64 flag depends on the target deployment and should be changed accordingly. The image tag scholaretl:latest can be customized at will. The image can then be tested by running a container locally:

docker run -d -p 8080:8080 scholaretl:latest

The API will accept requests on port 8080, i.e. you can access the UI at http://localhost:8080/docs.
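
Because the container is started in detached mode (-d), the API may need a moment before it answers. The following sketch polls the /docs page until it responds; docs_reachable and wait_until_ready are hypothetical helper names, and the retry count and delay are arbitrary defaults.

```python
import time
from urllib.request import urlopen


def docs_reachable(base_url: str = "http://localhost:8080", timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/docs answers with HTTP 200."""
    try:
        with urlopen(base_url.rstrip("/") + "/docs", timeout=timeout) as response:
            return response.status == 200
    except OSError:  # connection refused, timeout, HTTP error status
        return False


def wait_until_ready(probe=docs_reachable, attempts: int = 10, delay: float = 1.0) -> bool:
    """Call probe() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Passing the probe as an argument keeps the retry loop independent of the URL being checked.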

Grobid parsing

Parsing documents with the Grobid endpoint requires a running Grobid server. To deploy one, simply run:

docker run -p 8070:8070 -d lfoppiano/grobid:0.7.3

Then pass the server's URL to the script in a .env file:

echo SCHOLARETL__GROBID__URL=http://localhost:8070 > .env
scholaretl-api

You can also add the server's URL to the .env file manually. See the env.example file for more information.

If using Docker, pass the server's URL as an environment variable:

docker run -p 8080:8080 -d -e SCHOLARETL__GROBID__URL=http://host.docker.internal:8070 scholaretl:latest
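
Before calling /parse/grobidpdf, it may be worth confirming that the Grobid server is reachable. The sketch below assumes Grobid's /api/isalive health endpoint, which to our knowledge returns the text "true" when the service is up; check the Grobid documentation for your version. The helper names are hypothetical.

```python
from urllib.request import urlopen


def grobid_isalive_url(grobid_url: str) -> str:
    """Build the health-check URL (assumed Grobid endpoint: /api/isalive)."""
    return grobid_url.rstrip("/") + "/api/isalive"


def grobid_is_alive(grobid_url: str = "http://localhost:8070", timeout: float = 5.0) -> bool:
    """Return True if the health check answers "true", False otherwise."""
    try:
        with urlopen(grobid_isalive_url(grobid_url), timeout=timeout) as response:
            return response.read().decode("utf-8").strip().lower() == "true"
    except OSError:  # server not started, wrong port, timeout, HTTP error status
        return False
```

The URL passed here should be the same value as SCHOLARETL__GROBID__URL when the script runs on the host.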

Funding and Acknowledgement

The development of this software was supported by funding to the Blue Brain Project, a research center of the École polytechnique fédérale de Lausanne (EPFL), from the Swiss government’s ETH Board of the Swiss Federal Institutes of Technology.

Copyright (c) 2024 Blue Brain Project/EPFL
