ETL for parsing scientific papers.
Scholaretl
An Extract, Transform and Load (ETL) API made to parse scientific papers. This package is meant to be used with scholarag, our Retrieval Augmented Generation (RAG) tool. It is mainly used to parse scientific papers coming from different sources and make them compatible with the usual databases.
Quickstart
Step 1 : Install the package.
Simply install the package from PyPI with pip.
pip install scholaretl
You can also clone the GitHub repo and install the package yourself.
Step 2 : Run the FastAPI app.
A simple script is installed with the package and lets you run the app locally. By default, the API listens on port 8000.
scholaretl-api
See the -h flag for non-default arguments.
Step 3 : Test the app.
Now that the server is running, you can either curl it to get information:
curl http://localhost:8000/settings
Or open a browser at http://localhost:8000/docs and try some of the endpoints. For example, use the parse/pypdf endpoint to parse a local PDF file. Parsing XML files works out of the box. Keep in mind that the XML parsing endpoints are meant to be used with files coming from specific scientific journals (see List of endpoints).
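The curl call to /settings shown in this step can also be made from Python. The following is a minimal sketch using only the standard library. It assumes the server from Step 2 is listening on the default port 8000 and that /settings returns a JSON body; the helper names `endpoint_url` and `get_settings` are made up for this example and are not part of scholaretl.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # default port mentioned in Step 2


def endpoint_url(base_url: str, path: str) -> str:
    """Join the server base URL and an endpoint path into one URL."""
    return urljoin(base_url.rstrip("/") + "/", path.lstrip("/"))


def get_settings(base_url: str = BASE_URL, timeout: float = 5.0) -> object:
    """GET /settings and decode the body (assumption: the body is JSON)."""
    with urlopen(endpoint_url(base_url, "settings"), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage, with the server running:
#     print(get_settings())
```

If the server was started on another port, pass the matching base URL to `get_settings`.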
List of endpoints
Once the app is deployed, all these endpoints will be available:
- /parse/pubmed_xml: Parses XMLs coming from PubMed.
- /parse/jats_xml: Parses XMLs coming from PMC.
- /parse/tei_xml: Parses XMLs produced by Grobid.
- /parse/xocs_xml: Parses XMLs coming from Scopus (Elsevier).
- /parse/pypdf: Parses PDFs without keeping the structure of the document.
- /parse/grobidpdf: Parses PDFs keeping the structure of the document (REQUIRES Grobid, see Grobid parsing).
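Sending a local file to one of these endpoints from Python could look like the sketch below. Several details are assumptions that should be checked against the interactive page at /docs: that the endpoints accept a POST request with a multipart/form-data upload, and that the file field is named `inp`. The helpers `encode_multipart` and `parse_file` are hypothetical names for this example, not functions provided by scholaretl.

```python
import uuid
from pathlib import Path
from urllib.request import Request, urlopen

# Assumption: name of the upload field; verify it at http://localhost:8000/docs.
FILE_FIELD = "inp"


def encode_multipart(field: str, filename: str, content: bytes,
                     content_type: str = "application/octet-stream") -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; return (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def parse_file(path: str, endpoint: str = "/parse/pypdf",
               base_url: str = "http://localhost:8000") -> bytes:
    """POST a local file to a parsing endpoint and return the raw response body."""
    file_path = Path(path)
    body, header = encode_multipart(FILE_FIELD, file_path.name, file_path.read_bytes())
    request = Request(base_url + endpoint, data=body,
                      headers={"Content-Type": header}, method="POST")
    with urlopen(request, timeout=60) as resp:
        return resp.read()


# Usage, with the server running and a local file available:
#     print(parse_file("paper.pdf", endpoint="/parse/pypdf"))
```

The same helper can target the XML endpoints by changing the `endpoint` argument, for example `/parse/jats_xml` for a file coming from PMC.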
Docker image
If a Docker container is required, it can be built using the provided Dockerfile. Make sure you have Docker installed.
docker build -t scholaretl:latest . --platform linux/amd64
It can then be tested by running the container locally. The --platform linux/amd64 flag depends on the desired deployment and should be changed accordingly. The scholaretl:latest tag can be customized at will.
A container can then be started from the image using:
docker run -d -p 8080:8080 scholaretl:latest
The API will accept requests on port 8080, i.e. you can access the UI at http://localhost:8080/docs.
Grobid parsing
Parsing documents with the Grobid endpoint requires a running Grobid server. To deploy one, simply run
docker run -p 8070:8070 -d lfoppiano/grobid:0.7.3
Then pass the server's URL to the script in a .env file:
echo SCHOLARETL__GROBID__URL=http://localhost:8070 > .env
scholaretl-api
You can also add the server's URL to the .env file manually. See the env.example file for more information.
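The double underscores in SCHOLARETL__GROBID__URL suggest a prefix plus nested setting names, a convention used by tools such as pydantic-settings. The sketch below illustrates that reading of the variable name in plain Python. It is an assumption about how the name decomposes, not the package's actual settings loader, and the functions `parse_env_lines` and `nest` are hypothetical.

```python
def parse_env_lines(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    pairs = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        pairs[key.strip()] = value.strip()
    return pairs


def nest(pairs: dict[str, str], prefix: str = "SCHOLARETL__",
         delimiter: str = "__") -> dict:
    """Turn PREFIX + A__B=value entries into {"a": {"b": value}}."""
    nested: dict = {}
    for key, value in pairs.items():
        if not key.startswith(prefix):
            continue  # ignore variables that belong to other programs
        parts = key[len(prefix):].lower().split(delimiter)
        node = nested
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return nested


env_text = "SCHOLARETL__GROBID__URL=http://localhost:8070\n"
print(nest(parse_env_lines(env_text)))
# {'grobid': {'url': 'http://localhost:8070'}}
```

Under this reading, other settings listed in env.example would follow the same SCHOLARETL__SECTION__NAME pattern.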
If using docker, pass the server's URL as an environment variable.
docker run -p 8080:8080 -d -e SCHOLARETL__GROBID__URL=http://host.docker.internal:8070 scholaretl:latest
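Before sending PDFs to /parse/grobidpdf, it can help to confirm that the Grobid server is reachable. The sketch below calls Grobid's /api/isalive health endpoint with the standard library. It assumes Grobid is published on port 8070 as in the command above; `isalive_url` and `grobid_is_alive` are illustrative names, not part of scholaretl.

```python
from urllib.error import URLError
from urllib.request import urlopen


def isalive_url(grobid_url: str) -> str:
    """Build the URL of Grobid's health-check endpoint."""
    return grobid_url.rstrip("/") + "/api/isalive"


def grobid_is_alive(grobid_url: str = "http://localhost:8070",
                    timeout: float = 3.0) -> bool:
    """Return True if the Grobid health check answers with HTTP 200."""
    try:
        with urlopen(isalive_url(grobid_url), timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


# Usage:
#     if not grobid_is_alive():
#         print("Grobid is not reachable; start it before using /parse/grobidpdf")
```

When scholaretl itself runs in Docker, the check should be made against the URL the container sees (for example http://host.docker.internal:8070), since localhost inside the container does not point to the host.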
Funding and Acknowledgement
The development of this software was supported by funding to the Blue Brain Project, a research center of the École polytechnique fédérale de Lausanne (EPFL), from the Swiss government’s ETH Board of the Swiss Federal Institutes of Technology.
Copyright (c) 2024 Blue Brain Project/EPFL