getpaper - papers download made easy!

getpaper

Paper downloader

Getting started

Install the library with:

pip install getpaper

If you want to edit the getpaper repository, consider installing it locally in editable mode:

pip install -e .

On Linux systems you sometimes need to check that the build essentials are installed:

sudo apt install build-essential

It is also recommended to use micromamba, conda, anaconda or another environment manager to avoid bloating the system Python with too many dependencies.

Usage

Downloading papers

After installation you can either import the library into your Python code or use the console scripts.

If you install from pip, the console script download runs getpaper/download.py, parse runs getpaper/parse.py, and index runs getpaper/index.py.

download download_pubmed --pubmed 22266545 --folder "data/output/test/papers" --name pmid

Downloads the paper with the given PubMed ID into the papers folder and uses the PubMed ID as the file name.

download download_doi --doi 10.1038/s41597-020-00710-z --folder "data/output/test/papers"

Downloads the paper with the given DOI into the papers folder. Because --name is not specified, the DOI is used as the name.

It is also possible to download many papers in parallel with the download_papers(dois: List[str], destination: Path, threads: int) function, for example:

from pathlib import Path
from typing import List
from getpaper.download import download_papers

# The last entry is not a valid DOI; it ends up in the failed list.
dois: List[str] = ["10.3390/ijms22031073", "10.1038/s41597-020-00710-z", "wrong"]
destination: Path = Path("./data/output/test/papers").resolve()
threads: int = 5
results = download_papers(dois, destination, threads)
successful = results[0]  # OrderedDict[str, Path]: DOI -> path of the downloaded PDF
failed = results[1]  # List[str]: DOIs that could not be downloaded

Here results is a tuple of an OrderedDict[str, Path] that maps each successfully downloaded DOI to its paper path and a List[str] of failed DOIs. In the current example:

(OrderedDict([('10.3390/ijms22031073',
               PosixPath('/home/antonkulaga/sources/getpaper/notebooks/data/output/test/papers/10.3390/ijms22031073.pdf')),
              ('10.1038/s41597-020-00710-z',
               PosixPath('/home/antonkulaga/sources/getpaper/notebooks/data/output/test/papers/10.1038/s41597-020-00710-z.pdf'))]),
 ['wrong'])
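The returned pair can be unpacked and turned into a short report. The sketch below relies only on the result shape described above (an OrderedDict of DOI to path plus a list of failed DOIs). Note that summarize_results is a hypothetical helper, not part of getpaper, and the sample data stands in for a real download_papers call.

```python
from collections import OrderedDict
from pathlib import Path
from typing import List, Tuple


def summarize_results(results: Tuple["OrderedDict[str, Path]", List[str]]) -> str:
    """Build a human-readable report from a (successful, failed) pair."""
    successful, failed = results  # same order as results[0], results[1]
    lines = [f"downloaded {len(successful)} paper(s), {len(failed)} failed"]
    for doi, path in successful.items():
        lines.append(f"  ok     {doi} -> {path.name}")
    for doi in failed:
        lines.append(f"  failed {doi}")
    return "\n".join(lines)


# Stand-in data with the same shape as the example output above;
# in real use this would be the return value of download_papers(...).
sample = (
    OrderedDict([
        ("10.3390/ijms22031073", Path("papers/10.3390/ijms22031073.pdf")),
        ("10.1038/s41597-020-00710-z", Path("papers/10.1038/s41597-020-00710-z.pdf")),
    ]),
    ["wrong"],
)
report = summarize_results(sample)
print(report)
```

The failed list can be fed back into another download_papers call if a retry makes sense for your use case.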

The same function can be called from the command line:

download download_papers --dois "10.3390/ijms22031073" --dois "10.1038/s41597-020-00710-z" --dois "wrong" --folder "data/output/test/papers" --threads 5
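When the DOI list lives in Python, the repeated --dois flags can be generated rather than typed by hand. This is a minimal sketch that uses only the flags shown in the command above. Note that build_download_command is a hypothetical helper, not part of getpaper.

```python
import shlex
from typing import List


def build_download_command(dois: List[str], folder: str, threads: int) -> List[str]:
    """Assemble the argument list for the `download download_papers` command."""
    args = ["download", "download_papers"]
    for doi in dois:
        # The CLI example repeats --dois once per DOI.
        args += ["--dois", doi]
    args += ["--folder", folder, "--threads", str(threads)]
    return args


command = build_download_command(
    ["10.3390/ijms22031073", "10.1038/s41597-020-00710-z"],
    "data/output/test/papers",
    5,
)
# shlex.join quotes arguments where needed, giving a copy-pasteable line.
print(shlex.join(command))
# To actually run it (requires getpaper to be installed):
# import subprocess; subprocess.run(command, check=True)
```

Passing the arguments as a list to subprocess.run avoids shell quoting issues with unusual characters in DOIs.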

You can also call the download.py script directly:

python getpaper/download.py download_papers --dois "10.3390/ijms22031073" --dois "10.1038/s41597-020-00710-z" --dois "wrong" --folder "data/output/test/papers" --threads 5

Parsing the papers

You can parse the downloaded papers with the unstructured library. For example, if the papers are in the folder data/output/test/papers, you can run:

getpaper/parse.py parse_folder --folder data/output/test/papers --cores 5

You can also switch between different PDF parsers:

getpaper/parse.py parse_folder --folder data/output/test/papers --parser pdf_miner --cores 5

You can also parse papers on a per-file basis, for example:

getpaper/parse.py parse_paper --paper data/output/test/papers/10.3390/ijms22031073.pdf

Combining parsing and downloading

getpaper/parse.py download_and_parse --doi 10.1038/s41597-020-00710-z

Count tokens

To decide how to split the texts and to estimate how much the embeddings will cost, it is useful to count the tokens:

getpaper/parse.py count_tokens --path /home/antonkulaga/sources/non-animal-models/data/inputs/datasets
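Once a token count is known, the number of chunks and a rough embedding cost follow from simple arithmetic. The sketch below is an illustration only: estimate_embedding_cost is a hypothetical helper, the chunks are assumed not to overlap, and the token count and price are made-up numbers rather than real getpaper output or provider pricing.

```python
import math
from typing import Tuple


def estimate_embedding_cost(
    total_tokens: int, chunk_size: int, price_per_1k_tokens: float
) -> Tuple[int, float]:
    """Return (number_of_chunks, estimated_cost) for a corpus of total_tokens.

    Assumes non-overlapping chunks and a flat per-token price.
    """
    if total_tokens < 0 or chunk_size <= 0:
        raise ValueError("total_tokens must be >= 0 and chunk_size must be > 0")
    chunks = math.ceil(total_tokens / chunk_size)
    cost = total_tokens / 1000 * price_per_1k_tokens
    return chunks, cost


# Hypothetical numbers: 250,500 tokens, 1,000-token chunks, 0.0001 per 1k tokens.
chunks, cost = estimate_embedding_cost(250_500, 1000, 0.0001)
print(f"{chunks} chunks, estimated cost {cost:.4f}")
```

If the chunks overlap, more tokens are embedded than the corpus contains, so the estimate should be scaled up accordingly.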

Examples

You can run examples.py to see usage examples.

Additional requirements

index.py has local dependencies on other modules. For this reason, if you are running it inside the getpaper project folder, consider installing the package locally:

pip install -e .

Detectron2 is required for using models from the layoutparser model zoo but is not automatically installed with this package. For macOS and Linux, build from source with:

pip install 'git+https://github.com/facebookresearch/detectron2.git@e2ce8dc#egg=detectron2'

Note

Since version 0.3.0, all indexing features have been moved to the indexpaper library.
