getpaper - papers download made easy!

getpaper

Paper downloader

Getting started

Install the library with:

pip install getpaper

If you want to edit the getpaper repository, consider installing it locally in editable mode:

pip install -e .

On Linux systems you sometimes need to check that the build essentials are installed:

sudo apt install build-essential

It is also recommended to use micromamba, conda, Anaconda, or another environment manager to avoid bloating the system Python with too many dependencies.

Usage

Downloading papers

After installation you can either import the library into your Python code or use the console scripts.

If you install from pip, the console scripts map to modules: download calls getpaper/download.py, parse calls getpaper/parse.py, and index calls getpaper/index.py.

download download_pubmed --pubmed 22266545 --folder papers --name pmid

Downloads the paper with the given PubMed ID into the folder 'papers' and uses the PubMed ID as the file name.

download download_doi --doi 10.1519/JSC.0b013e318225bbae --folder papers

Downloads the paper with the given DOI into the folder 'papers'. Because --name is not specified, the DOI is used as the file name.
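If you would rather drive the console script from Python, for instance to loop over a list of identifiers, the two commands above can be wrapped with subprocess. This is only a minimal sketch and not part of getpaper's API: the helper names build_download_command and run_download are hypothetical, and it assumes the download console script is on your PATH and accepts exactly the flags shown above.

```python
import subprocess
from typing import List, Optional


def build_download_command(kind: str, identifier: str, folder: str,
                           name: Optional[str] = None) -> List[str]:
    """Build the argument list for the `download` console script.

    Hypothetical helper: `kind` is "pubmed" or "doi", and the subcommand
    and flag names simply mirror the command-line examples in this README.
    """
    if kind not in ("pubmed", "doi"):
        raise ValueError(f"unsupported identifier kind: {kind}")
    command = ["download", f"download_{kind}", f"--{kind}", identifier,
               "--folder", folder]
    if name is not None:
        # --name is optional; the README says the DOI is used when it is omitted.
        command += ["--name", name]
    return command


def run_download(command: List[str]) -> int:
    """Run the command and return its exit code (assumes `download` is on PATH)."""
    return subprocess.run(command, check=False).returncode


# Only build and show the commands here; call run_download(...) to execute them.
print(build_download_command("pubmed", "22266545", "papers", name="pmid"))
print(build_download_command("doi", "10.1519/JSC.0b013e318225bbae", "papers"))
```

Passing the arguments as a list avoids shell quoting problems with DOIs that contain unusual characters.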

It is also possible to download many papers in parallel with the download_papers(dois: List[str], destination: Path, threads: int) function, for example:

from pathlib import Path
from typing import List
from getpaper.download import download_papers
dois: List[str] = ["10.3390/ijms22031073", "10.1038/s41597-020-00710-z", "wrong"]
destination: Path = Path("./data/output/test/papers").absolute().resolve()
threads: int = 5
results = download_papers(dois, destination, threads)
successful = results[0]
failed = results[1]

Here results is a tuple: an OrderedDict[str, Path] that maps each successfully downloaded DOI to its paper path, and a List[str] of failed DOIs. In the current example:

(OrderedDict([('10.3390/ijms22031073',
               PosixPath('/home/antonkulaga/sources/getpaper/notebooks/data/output/test/papers/10.3390/ijms22031073.pdf')),
              ('10.1038/s41597-020-00710-z',
               PosixPath('/home/antonkulaga/sources/getpaper/notebooks/data/output/test/papers/10.1038/s41597-020-00710-z.pdf'))]),
 ['wrong'])
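To make the shape of this return value concrete, the sketch below unpacks the tuple and prints a short report. The summarize_results helper is hypothetical and the data is a stand-in shaped like the output above, with shortened paths, so the snippet runs without getpaper installed.

```python
from collections import OrderedDict
from pathlib import Path
from typing import List, Mapping, Tuple


def summarize_results(results: Tuple[Mapping[str, Path], List[str]]) -> str:
    """Turn a (successful, failed) tuple into a short human-readable report."""
    successful, failed = results
    total = len(successful) + len(failed)
    lines = [f"downloaded {len(successful)} of {total} papers"]
    for doi, path in successful.items():
        lines.append(f"  ok      {doi} -> {path.name}")
    for doi in failed:
        lines.append(f"  failed  {doi}")
    return "\n".join(lines)


# Stand-in data shaped like the example output (paths shortened for illustration).
example = (
    OrderedDict([
        ("10.3390/ijms22031073", Path("papers/10.3390/ijms22031073.pdf")),
        ("10.1038/s41597-020-00710-z", Path("papers/10.1038/s41597-020-00710-z.pdf")),
    ]),
    ["wrong"],
)
print(summarize_results(example))
```

With real data you would pass the value returned by download_papers instead of example.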

The same function can be called from the command line:

download download_papers --dois "10.3390/ijms22031073" --dois "10.1038/s41597-020-00710-z" --dois "wrong" --folder "data/output/test/papers" --threads 5

You can also call the download.py script directly:

python getpaper/download.py download_papers --dois "10.3390/ijms22031073" --dois "10.1038/s41597-020-00710-z" --dois "wrong" --folder "data/output/test/papers" --threads 5

Parsing the papers

You can parse the downloaded papers with the unstructured library. For example, if the papers are in the folder data/output/test/papers, you can run:

getpaper/parse.py parse_folder --folder data/output/test/papers --cores 5

You can also switch between different PDF parsers:

getpaper/parse.py parse_folder --folder data/output/test/papers --parser pdf_miner --cores 5

You can also parse papers on a per-file basis, for example:

getpaper/parse.py parse_paper --paper data/output/test/papers/10.3390/ijms22031073.pdf

Count tokens

To decide how finely to split the texts and to estimate how much the embeddings will cost, it is useful to compute the number of tokens:

getpaper/parse.py count_tokens --path /home/antonkulaga/sources/non-animal-models/data/inputs/datasets
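Once you have a token count, the splitting and cost question is simple arithmetic. The sketch below is illustrative only: it assumes the chunk size is measured in tokens and that chunks do not overlap, and both the token total and the price per 1000 tokens are hypothetical placeholders, so check your embedding provider's current pricing.

```python
def estimate_chunks(total_tokens: int, chunk_size: int) -> int:
    """Number of chunks needed, assuming non-overlapping chunks measured in tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Integer ceiling division: a partial last chunk still counts as a chunk.
    return (total_tokens + chunk_size - 1) // chunk_size


def estimate_cost(total_tokens: int, price_per_1k_tokens: float) -> float:
    """Embedding cost for a provider that bills per 1000 tokens."""
    return total_tokens / 1000 * price_per_1k_tokens


total_tokens = 1_200_000   # hypothetical result of count_tokens
price = 0.0001             # hypothetical price per 1000 tokens

print("chunks at 6000 tokens:", estimate_chunks(total_tokens, 6000))
print("chunks at 2000 tokens:", estimate_chunks(total_tokens, 2000))
print("estimated cost:", round(estimate_cost(total_tokens, price), 4))
```

Under these assumptions the cost depends only on the total token count; the chunk size changes how many vectors end up in the index.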

Indexing papers

We also provide features to index the papers with OpenAI or Llama embeddings and save them in a ChromaDB vector store. For OpenAI embeddings to work, you have to create a .env file and specify your OpenAI key there; see .env.template for an example.

For example, if your papers are inside the data/output/test/papers folder and you want to build a ChromaDB index at data/output/test/index, you can run:

getpaper/index.py index_papers --papers data/output/test/papers --folder data/output/test/index --collection mypapers --chunk_size 6000

Both Chroma and Qdrant are supported. To use Qdrant, we provide a docker-compose file that sets it up:

cd services
docker compose -f docker-compose.yaml up

Then you can index the papers with Qdrant:

getpaper/index.py index_papers --papers data/output/test/papers --url http://localhost:6333 --collection mypapers --chunk_size 6000 --database Qdrant

You can also check whether the papers were added to the collection in the Qdrant web UI at http://localhost:6333/dashboard
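The same check can be scripted against Qdrant's REST API using only the standard library. Treat this as a sketch under stated assumptions: it assumes the server answers GET /collections/<name> with a JSON body that carries result.points_count, and the helper names collection_url, points_count, and fetch_points_count are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def collection_url(base_url: str, collection: str) -> str:
    """URL of the collection-info endpoint (assumed REST layout)."""
    return f"{base_url.rstrip('/')}/collections/{quote(collection, safe='')}"


def points_count(payload: dict) -> int:
    """Extract the number of stored points from a collection-info response."""
    # Assumes the response has the shape {"result": {"points_count": <int>, ...}}.
    return int(payload["result"]["points_count"])


def fetch_points_count(base_url: str, collection: str, timeout: float = 5.0) -> int:
    """Query a running Qdrant instance; raises URLError if it is unreachable."""
    with urlopen(collection_url(base_url, collection), timeout=timeout) as response:
        return points_count(json.load(response))
```

With the docker-compose setup above running, fetch_points_count("http://localhost:6333", "mypapers") should return a non-zero number after indexing.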

Indexing with Llama-2 embeddings

You can also use Llama-2 embeddings if you install llama-cpp-python and pass the path to the model, for example for the https://huggingface.co/TheBloke/Llama-2-13B-GGML model:

getpaper/index.py index_papers --papers data/output/test/papers --url http://localhost:6333 --collection papers_llama2_2000 --chunk_size 2000 --database Qdrant --embeddings llama --model /home/antonkulaga/sources/getpaper/data/models/llama-2-13b-chat.ggmlv3.q2_K.bin

Instead of explicitly passing the model path, you can also set LLAMA_MODEL in the .env file:

LLAMA_MODEL="/home/antonkulaga/sources/getpaper/data/models/llama-2-13b-chat.ggmlv3.q2_K.bin"
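The lookup order implied here, an explicit --model argument first and LLAMA_MODEL as the fallback, can be sketched as follows. This is an illustration and not getpaper's actual code: resolve_model_path is a hypothetical helper, and it assumes the .env file has already been loaded into the process environment (for example by a dotenv loader).

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_model_path(explicit: Optional[str] = None,
                       env: Mapping[str, str] = os.environ) -> Path:
    """Return the model path, preferring an explicit value over LLAMA_MODEL."""
    candidate = explicit or env.get("LLAMA_MODEL")
    if not candidate:
        raise ValueError("pass a model path or set LLAMA_MODEL")
    # expanduser() lets the value use "~" for the home directory.
    return Path(candidate).expanduser()


# Stand-in environment for illustration; real code would rely on os.environ.
fake_env = {"LLAMA_MODEL": "/models/llama-2-13b-chat.ggmlv3.q2_K.bin"}
print(resolve_model_path(None, fake_env))
print(resolve_model_path("other-model.bin", fake_env))
```

Failing early with a clear message when neither source provides a path avoids a less readable error later, when the model file is opened.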

Note: if you want to use Qdrant Cloud you do not need docker-compose, but you need to provide an API key and look up the URL in your Qdrant Cloud settings:

getpaper/index.py index_papers --papers data/output/test/papers --url https://5bea7502-97d4-4876-98af-0cdf8af4bd18.us-east-1-0.aws.cloud.qdrant.io:6333 --key put_your_key_here --collection mypapers --chunk_size 6000 --database Qdrant

Note: there are temporary issues with Llama embeddings.

Examples

You can run examples.py to see usage examples.

Additional requirements

index.py has local dependencies on other modules. For this reason, if you are running it inside the getpaper project folder, consider installing the package locally:

pip install -e .

Detectron2 is required for using models from the layoutparser model zoo but is not automatically installed with this package. For macOS and Linux, build from source with:

pip install 'git+https://github.com/facebookresearch/detectron2.git@e2ce8dc#egg=detectron2'
