getpaper - papers download made easy!

getpaper

Paper downloader

Getting started

Install the library with:

pip install getpaper

On Linux systems you may also need to make sure the build tools are installed:

sudo apt install build-essential

It is also recommended to install into an isolated environment (micromamba, conda, Anaconda, or similar) so that the system Python is not bloated with extra dependencies.

Usage

Downloading papers

After installation you can either import the library into your Python code or use the console scripts.

If you installed from pip, the console scripts map to modules as follows: download calls getpaper/download.py, parse calls getpaper/parse.py, and index calls getpaper/index.py.

download download_pubmed --pubmed 22266545 --folder papers --name pmid

Downloads the paper with the given PubMed ID into the folder papers and uses the PubMed ID as the file name.

download download_doi --doi 10.1519/JSC.0b013e318225bbae --folder papers

Downloads the paper with the given DOI into the folder papers. Because --name is not specified, the DOI is used as the file name.

You can also download many papers in parallel with the download_papers(dois: List[str], destination: Path, threads: int) function, for example:

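from pathlib import Path
from typing import List

from getpaper.download import download_papers

# DOIs to fetch; "wrong" is deliberately invalid to show how failures are reported
dois: List[str] = ["10.3390/ijms22031073", "10.1038/s41597-020-00710-z", "wrong"]
# resolve() already returns an absolute path, so a separate absolute() call is not needed
destination: Path = Path("./data/output/test/papers").resolve()
threads: int = 5  # number of parallel download threads
results = download_papers(dois, destination, threads)
successful = results[0]  # OrderedDict of doi -> path to the downloaded PDF
failed = results[1]  # list of DOIs that could not be downloaded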

Here results is a tuple: an OrderedDict[str, Path] that maps each successfully downloaded DOI to its paper path, and a List[str] of the DOIs that failed. For the example above:

(OrderedDict([('10.3390/ijms22031073',
               PosixPath('/home/antonkulaga/sources/getpaper/notebooks/data/output/test/papers/10.3390/ijms22031073.pdf')),
              ('10.1038/s41597-020-00710-z',
               PosixPath('/home/antonkulaga/sources/getpaper/notebooks/data/output/test/papers/10.1038/s41597-020-00710-z.pdf'))]),
 ['wrong'])
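The output above suggests that each paper is stored at <destination>/<doi>.pdf, so the DOI prefix becomes a subfolder. Assuming that layout holds (it is inferred from the example output, not from documented behaviour), a small helper could filter out DOIs that are already on disk before download_papers is called again. The names expected_pdf_path and pending_dois are hypothetical and not part of getpaper.

```python
from pathlib import Path
from typing import List


def expected_pdf_path(destination: Path, doi: str) -> Path:
    # Assumed layout, inferred from the example output: <destination>/<doi>.pdf
    return destination / f"{doi}.pdf"


def pending_dois(dois: List[str], destination: Path) -> List[str]:
    # Keep only the DOIs whose PDF is not on disk yet, preserving input order.
    return [doi for doi in dois if not expected_pdf_path(destination, doi).exists()]
```

The filtered list can then be passed on, for example download_papers(pending_dois(dois, destination), destination, threads).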

The same function can be called from the command line:

download download_papers --dois "10.3390/ijms22031073" --dois "10.1038/s41597-020-00710-z" --dois "wrong" --folder "data/output/test/papers" --threads 5
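When the list of DOIs comes from a script, the command line above can be assembled programmatically. The sketch below only mirrors the invocation shown here (one --dois flag per DOI, then --folder and --threads); build_download_command is a hypothetical helper, not something getpaper provides.

```python
import shlex
from typing import List


def build_download_command(dois: List[str], folder: str, threads: int) -> List[str]:
    # Mirror the documented invocation: one --dois flag per DOI.
    command = ["download", "download_papers"]
    for doi in dois:
        command += ["--dois", doi]
    command += ["--folder", folder, "--threads", str(threads)]
    return command


command = build_download_command(
    ["10.3390/ijms22031073", "10.1038/s41597-020-00710-z"],
    "data/output/test/papers",
    5,
)
# shlex.join quotes arguments where needed, so the line is safe to paste into a shell.
print(shlex.join(command))
```

The returned list can be handed directly to subprocess.run(command, check=True), which avoids shell quoting issues altogether.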

Parsing the papers

You can parse the downloaded papers with the unstructured library. For example, if the papers are in the folder data/output/test/papers, you can run:

getpaper/parse.py parse_folder --folder data/output/test/papers --cores 5

You can also parse papers on a per-file basis, for example:

getpaper/parse.py parse_paper --paper data/output/test/papers/10.3390/ijms22031073.pdf

Count tokens

To decide how to split the texts and to estimate how much the embeddings will cost, it is useful to count the tokens:

getpaper/parse.py count_tokens --path /home/antonkulaga/sources/non-animal-models/data/inputs/datasets
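For a quick back-of-the-envelope figure before running count_tokens, the arithmetic can be sketched as below. This is not how count_tokens works: it uses the common rule of thumb of roughly four characters per token for English text, and both the ratio and the price are assumptions to be replaced with real values. The function names are hypothetical.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    # Rough heuristic only: about four characters per token for English text.
    return math.ceil(len(text) / chars_per_token)


def estimate_cost(tokens: int, price_per_1k_tokens: float) -> float:
    # price_per_1k_tokens is a placeholder; look up the current price of your embedding model.
    return tokens / 1000 * price_per_1k_tokens
```

A real tokenizer will give different counts, especially for text dense with formulas, tables, or references, so treat the result as an order-of-magnitude check.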

Indexing papers

We also provide features to index the papers with openai or lambda embeddings and save them in a ChromaDB vector store. For the OpenAI embeddings to work, you have to create a .env file and specify your OpenAI key there; see .env.template for an example.

For example, if your papers are in the data/output/test/papers folder and you want to build a ChromaDB index at data/output/test/index, run:

getpaper/index.py index_papers --papers data/output/test/papers --folder data/output/test/index --collection mypapers --chunk_size 6000
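Because indexing with OpenAI embeddings depends on the key in .env, it can help to check the file before starting a long run. The sketch below is a minimal standalone parser for simple KEY=VALUE lines; the variable name OPENAI_API_KEY is an assumption (check .env.template for the name the project actually expects), and both function names are hypothetical.

```python
from pathlib import Path
from typing import Dict


def read_env_file(path: Path) -> Dict[str, str]:
    # Parse simple KEY=VALUE lines; blank lines and # comments are skipped.
    values: Dict[str, str] = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def has_openai_key(path: Path, key_name: str = "OPENAI_API_KEY") -> bool:
    # key_name is an assumption; .env.template shows the name the project uses.
    return path.is_file() and bool(read_env_file(path).get(key_name))
```

Calling has_openai_key(Path(".env")) before index_papers gives an early, readable failure instead of an authentication error partway through indexing.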

Examples

You can run examples.py to see usage examples.

Additional requirements

Detectron2 is required for using models from the layoutparser model zoo but is not automatically installed with this package. For macOS and Linux, build from source with:

pip install 'git+https://github.com/facebookresearch/detectron2.git@e2ce8dc#egg=detectron2'
