getpaper - papers download made easy!

getpaper

Paper downloader

Getting started

Install the library with:

pip install getpaper

Usage

Downloading papers

After installation, you can either import the library into your Python code or use the console scripts. For example:

download download download_pubmed --pubmed 22266545 --folder papers --name pmid

This downloads the paper with PubMed ID 22266545 into the folder papers and uses the PubMed ID as the file name.

download download download_doi --doi 10.1519/JSC.0b013e318225bbae --folder papers

This downloads the paper with the given DOI into the folder papers. Because --name is not specified, the DOI is used as the file name.

Parsing the papers

You can parse the downloaded papers with the unstructured library. For example, if the papers are in the folder test, you can run:

getpaper/parse.py parse_folder --folder /home/antonkulaga/sources/getpaper/test

You can also parse papers on a per file basis, for example:

getpaper/parse.py parse_paper --paper /home/antonkulaga/sources/getpaper/test/22266545.pdf
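When only the PDF files in a folder should be parsed one at a time, the per-file command can be generated for each of them. This is a sketch under the assumption that parse.py is invoked exactly as in the example above; the helper name parse_commands is hypothetical, and it only builds the commands without running them.

```python
from pathlib import Path
from typing import List


def parse_commands(folder: str) -> List[List[str]]:
    """Build one parse_paper command per PDF found directly in `folder`."""
    pdfs = sorted(Path(folder).glob("*.pdf"))
    return [["getpaper/parse.py", "parse_paper", "--paper", str(pdf)] for pdf in pdfs]
```

Each returned list can be passed to subprocess.run to execute the command.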

Indexing papers

We also provide features to index the papers with OpenAI or lambda embeddings and save them in a ChromaDB vector store. For OpenAI embeddings to work, you have to create a .env file and specify your OpenAI key there; see .env.template for an example.

For example, if your papers are in the data/output/test/papers folder and you want to build a ChromaDB index at data/output/test/index, run:

python getpaper/index.py index_papers --papers data/output/test/papers --folder data/output/test/index --collection mypapers --chunk_size 6000
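Before starting a long indexing run with OpenAI embeddings, it can help to confirm that the key is actually present in the .env file. The sketch below uses only the standard library. The variable name OPENAI_API_KEY is an assumption (it is the name the openai client conventionally reads), so check .env.template for the name getpaper expects. The helpers read_env and has_openai_key are hypothetical and handle only simple KEY=VALUE lines.

```python
from pathlib import Path
from typing import Dict


def read_env(path: str = ".env") -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    env_file = Path(path)
    if not env_file.exists():
        return values
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def has_openai_key(path: str = ".env", name: str = "OPENAI_API_KEY") -> bool:
    """Return True if `name` is set to a non-empty value in the .env file."""
    return bool(read_env(path).get(name))
```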

Examples

You can run examples.py to see usage examples.

Additional requirements

Detectron2 is required for using models from the layoutparser model zoo but is not automatically installed with this package. For macOS and Linux, build from source with:

pip install 'git+https://github.com/facebookresearch/detectron2.git@e2ce8dc#egg=detectron2'
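Since Detectron2 is an optional extra, a quick check can tell you whether it is available in the current environment. This is a minimal sketch using the standard library; the helper name is_installed is made up for illustration, and the check only looks the module up without importing it.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if not is_installed("detectron2"):
        print("detectron2 not found; layoutparser model zoo models will be unavailable")
```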
