ArXiv-Miner: Mine/Scrape Arxiv-Papers To Structured Datasets

ArXiv-Miner

ArXiv Miner is a toolkit for mining research papers on CS ArXiv.

What is ArXiv-Miner

arxiv-miner is a lightweight library that helps power Sci-Genie, a search engine for quickly searching through the full text of papers on CS ArXiv. arxiv-miner extracts and parses LaTeX documents from CS ArXiv, and it supports storing and searching those parsed documents using Elasticsearch. The library is also applicable to other domains such as Math, Physics, and Biology.

Documentation

All documentation on how to install and use arxiv-miner is available on the documentation website or inside the docs folder. Contribution guidelines are also provided there.

Why was ArXiv-Miner created?

ArXiv Miner was created to make it easy to scrape, parse, and search research content on ArXiv. It was built by stitching together solutions from the code of tools such as arxiv-sanity, arxiv-vanity/engrafo, arxivscraper, tex2py, cso-classifier and axcell. The parsed structure of the content can serve as a heuristic baseline for search and for scientific research mining or AI applications.

Core Components of ArXiv-Miner

  • Scraping
  • Parsing
  • Indexing/Storage
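
The parsing and indexing components listed above can be illustrated with a minimal sketch. This is not the arxiv-miner API: every name here (`parse_sections`, `build_index`, `search`, `SAMPLE`) is hypothetical, the scraping step is omitted so the example needs no network access, and a plain in-memory dictionary stands in for Elasticsearch. The section regex is a simplifying assumption and does not handle nested braces in titles.

```python
import re
from collections import defaultdict

# Hypothetical illustration only; none of these names come from arxiv-miner.
# Matches \section{Title} and \section*{Title} (no nested braces in the title).
SECTION_RE = re.compile(r"\\section\*?\{([^}]*)\}")


def parse_sections(latex_source):
    """Split LaTeX source into a {section title: body text} mapping."""
    # With one capturing group, re.split returns
    # [preamble, title1, body1, title2, body2, ...].
    parts = SECTION_RE.split(latex_source)
    titles, bodies = parts[1::2], parts[2::2]
    return {title.strip(): body.strip() for title, body in zip(titles, bodies)}


def build_index(papers):
    """Map each lowercase word to the (paper_id, section) pairs containing it."""
    index = defaultdict(set)
    for paper_id, sections in papers.items():
        for title, body in sections.items():
            for word in re.findall(r"[a-z0-9]+", body.lower()):
                index[word].add((paper_id, title))
    return index


def search(index, word):
    """Return the sorted (paper_id, section) pairs that mention the word."""
    return sorted(index.get(word.lower(), set()))


# A tiny stand-in for a paper's LaTeX source (assumed sample data).
SAMPLE = r"""
\documentclass{article}
\begin{document}
\section{Introduction}
We study transformers for text.
\section{Method}
Our method uses attention.
\end{document}
"""

papers = {"sample-paper": parse_sections(SAMPLE)}
index = build_index(papers)
hits = search(index, "attention")
print(hits)
```

In a real deployment the dictionary index would be replaced by a search backend, and the regex by a proper LaTeX parser; the sketch only shows how parsed section structure makes section-level search possible.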

Family Of Projects With ArXiv-Miner

Disclaimer

This project was developed in a cowboy-coding style during the COVID-19 pandemic, so it may contain bugs and its code is not the most optimized. It was developed primarily to aid CS and Machine Learning/AI research, but the tool can be extended to all 3M+ documents on ArXiv.

Call For Contributors

Contributions that improve the project or fix bugs are very welcome. Please read the contribution guide in the documentation.

Credits and Appreciation

Like all others, this project has been built on the shoulders of giants. A big thanks to the creators of the following libraries/open source projects that aided the development of arxiv-miner and its family of projects:

Licence

MIT
