
ETL for parsing scientific papers.

Scholaretl

An Extract, Transform and Load (ETL) API made to parse scientific papers. This package is meant to be used with scholarag, our Retrieval Augmented Generation (RAG) tool. It is mainly used to parse scientific papers coming from different sources, to make them compatible with the usual databases.

  1. Quickstart
  2. List of endpoints
  3. Docker image
  4. Grobid parsing
  5. Funding and Acknowledgement

Quickstart

Step 1 : Install the package.

Simply install the package from PyPI with pip.

pip install scholaretl

You can also clone the GitHub repo and install the package yourself.

Step 2 : Run the FastAPI app.

A simple script is installed with the package; it lets you run the app locally. By default the API listens on port 8000.

scholaretl-api

Run the script with the -h flag to see the non-default arguments.

Step 3 : Test the app.

Now that the server is running, you can either query it with curl to get information:

curl http://localhost:8000/settings

Or open http://localhost:8000/docs in a browser and try some of the endpoints. For example, use the /parse/pypdf endpoint to parse a local PDF file. Parsing XML files works out of the box, but keep in mind that the XML parsing endpoints are meant to be used with files coming from specific scientific journals (see List of endpoints).
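The curl check above can also be done from Python. This is a minimal sketch using only the standard library. The helper names build_url and fetch_settings are hypothetical (they are not part of scholaretl), and it assumes that /settings answers with a JSON body.

```python
import json
import urllib.request
from urllib.parse import urljoin

# Default address from the Quickstart; change it if you start the app on another port.
BASE_URL = "http://localhost:8000"


def build_url(base_url: str, path: str) -> str:
    """Join the API base URL and an endpoint path without doubling slashes."""
    return urljoin(base_url.rstrip("/") + "/", path.lstrip("/"))


def fetch_settings(base_url: str = BASE_URL, timeout: float = 5.0):
    """GET /settings and decode the response (assumed here to be JSON)."""
    url = build_url(base_url, "settings")
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, calling fetch_settings() should return the same payload as the curl command.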

List of endpoints

Once the app is deployed, the following endpoints are available:

  • /parse/pubmed_xml: Parses XMLs coming from PubMed.
  • /parse/jats_xml: Parses XMLs coming from PMC.
  • /parse/tei_xml: Parses XMLs produced by Grobid.
  • /parse/xocs_xml: Parses XMLs coming from Scopus (Elsevier).
  • /parse/pypdf: Parses PDFs without keeping the structure of the document.
  • /parse/grobidpdf: Parses PDFs while keeping the structure of the document (requires Grobid, see Grobid parsing).
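When a pipeline handles files from several sources, the list above can be turned into a small lookup. The sketch below is illustrative only: pick_endpoint and the source labels ("pubmed", "pmc", "grobid", "scopus") are hypothetical names chosen for this example; only the endpoint paths come from the list.

```python
from pathlib import Path
from typing import Optional

# Endpoint paths from the list above, keyed by hypothetical source labels.
XML_ENDPOINT_BY_SOURCE = {
    "pubmed": "/parse/pubmed_xml",
    "pmc": "/parse/jats_xml",
    "grobid": "/parse/tei_xml",
    "scopus": "/parse/xocs_xml",
}


def pick_endpoint(filename: str, source: Optional[str] = None,
                  keep_structure: bool = False) -> str:
    """Return the endpoint path suited to a file, based on its extension and origin."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".pdf":
        # /parse/grobidpdf keeps the document structure but needs a Grobid server.
        return "/parse/grobidpdf" if keep_structure else "/parse/pypdf"
    if suffix == ".xml":
        if source is None:
            raise ValueError("XML files need a source to select the parser")
        try:
            return XML_ENDPOINT_BY_SOURCE[source.lower()]
        except KeyError:
            raise ValueError(f"unknown XML source: {source!r}") from None
    raise ValueError(f"unsupported file type: {suffix!r}")
```

For example, pick_endpoint("article.xml", source="pmc") selects the JATS parser, while a PDF falls back to /parse/pypdf unless keep_structure is set.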

Docker image

If a Docker container is required, it can be built using the provided Dockerfile. Make sure you have Docker installed.

docker build -t scholaretl:latest . --platform linux/amd64

The --platform linux/amd64 flag depends on the target deployment and should be changed accordingly. The scholaretl:latest tag can be customized at will. The image can then be tested by running a container locally:

docker run -d -p 8080:8080 scholaretl:latest

The API will accept requests on port 8080, i.e. you can access the UI at http://localhost:8080/docs.

Grobid parsing

Parsing documents with the Grobid endpoint requires a running Grobid server. To deploy one, simply run:

docker run -p 8070:8070 -d lfoppiano/grobid:0.7.3

Then pass the server's URL to the script in a .env file:

echo SCHOLARETL__GROBID__URL=http://localhost:8070 > .env
scholaretl-api

You can also add the server's URL to the .env file manually. See the env.example file for more information.

If using Docker, pass the server's URL as an environment variable:

docker run -p 8080:8080 -d -e SCHOLARETL__GROBID__URL=http://host.docker.internal:8070 scholaretl:latest
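Before calling /parse/grobidpdf, it can help to confirm that the Grobid server is reachable. The sketch below polls Grobid's /api/isalive health endpoint with the standard library. The function names are hypothetical, and the default URL assumes the local deployment shown above.

```python
import urllib.request


def isalive_url(grobid_url: str) -> str:
    """Build the health-check URL for a Grobid server."""
    return grobid_url.rstrip("/") + "/api/isalive"


def grobid_is_alive(grobid_url: str = "http://localhost:8070",
                    timeout: float = 3.0) -> bool:
    """Return True if the Grobid health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(isalive_url(grobid_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False
```

If grobid_is_alive() returns False, check that the Grobid container is running and that SCHOLARETL__GROBID__URL points to it.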

Funding and Acknowledgement

The development of this software was supported by funding to the Blue Brain Project, a research center of the École polytechnique fédérale de Lausanne (EPFL), from the Swiss government’s ETH Board of the Swiss Federal Institutes of Technology.

Copyright (c) 2024 Blue Brain Project/EPFL
