ArXiv-Miner: Mine/Scrape Arxiv-Papers To Structured Datasets
ArXiv-Miner
ArXiv Miner is a toolkit for mining research papers on CS ArXiv.
What is ArXiv-Miner
arxiv-miner is a small library that helps power Sci-Genie, a search engine for quickly searching the full text of papers on CS ArXiv. arxiv-miner helps extract and parse LaTeX documents from CS ArXiv, and it supports storage and search of those parsed documents using Elasticsearch. The library can also be applied to other ArXiv domains such as Math, Physics, and Biology.
Documentation
All documentation on how to install and use arxiv-miner is provided on the documentation website or in the docs folder. Contribution guidelines are also provided there.
Why was ArXiv-Miner created?
ArXiv Miner was created to make it easy to scrape, parse and search research content on ArXiv. The library stitches together solutions from the code of various tools like arxiv-sanity, arxiv-vanity/engrafo, arxivscraper, tex2py, cso-classifier and axcell. The parsed structure of the content can serve as a heuristic baseline for search or for other scientific research mining/AI applications.
Core Components of ArXiv-Miner
- Scraping
- Parsing
- Indexing/Storage
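The three components above can be sketched end to end with only the Python standard library. This is an illustrative sketch, not arxiv-miner's actual API: every function name here (`parse_feed`, `split_sections`, `build_index`, `search`) is hypothetical, the sample feed and LaTeX are made-up placeholders, and the in-memory inverted index is only a stand-in for Elasticsearch. The scraping step assumes metadata arrives as an Atom feed, which is the format the public ArXiv API returns.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace
SECTION_RE = re.compile(r"\\section\*?\{([^}]*)\}")
TOKEN_RE = re.compile(r"[a-z0-9]+")


def parse_feed(atom_xml):
    """Scraping step (hypothetical helper): read id and title from an Atom feed string."""
    root = ET.fromstring(atom_xml)
    papers = []
    for entry in root.findall(ATOM + "entry"):
        papers.append({
            "id": entry.findtext(ATOM + "id", default="").strip(),
            # Collapse line breaks and repeated spaces inside the title.
            "title": " ".join(entry.findtext(ATOM + "title", default="").split()),
        })
    return papers


def split_sections(latex):
    """Parsing step (hypothetical helper): map each section title to its body text."""
    # With one capture group, split() yields [preamble, title1, body1, title2, body2, ...].
    parts = SECTION_RE.split(latex)
    return {parts[i].strip(): parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}


def build_index(docs):
    """Indexing step (stand-in for Elasticsearch): token -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in TOKEN_RE.findall(text.lower()):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every token in the query."""
    tokens = TOKEN_RE.findall(query.lower())
    if not tokens:
        return set()
    return set.intersection(*(index.get(t, set()) for t in tokens))


# Made-up sample inputs, used only to exercise the sketch.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>A Placeholder
      Paper Title</title>
  </entry>
</feed>"""

SAMPLE_LATEX = r"""
\section{Introduction}
We study transformers for search.
\section{Method}
We index parsed sections.
"""

papers = parse_feed(SAMPLE_FEED)
sections = split_sections(SAMPLE_LATEX)
# Index each section separately so a hit points at a specific part of the paper.
index = build_index({f"{papers[0]['id']}#{name}": body for name, body in sections.items()})
print(search(index, "parsed sections"))
```

Indexing per section rather than per paper is a design choice of this sketch: it keeps search results tied to the parsed structure, which is the kind of structure the parsing component is meant to recover.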
Family Of Projects With ArXiv-Miner
- arxiv-table-miner: Coming Soon.
- arxiv-table-ml-models: Coming Soon.
- semantic-scholar-data-pipeline: https://github.com/valayDave/semantic-scholar-data-pipeline
Disclaimer
This project was developed in a cowboy-coding style during the COVID-19 pandemic, so it may contain bugs and the code is not well optimized. The primary reason for development was to aid CS and Machine Learning/AI research, but this tool can be extended to all 3M+ documents on ArXiv.
Call For Contributors
Contributions that improve the project or fix bugs are very welcome. Please read the contribution guide in the documentation.
Credits and Appreciation
This project, like all others, has been built on the shoulders of giants. A big thanks to the creators of the following libraries and open source projects that aided the development of arxiv-miner and its family of projects:
- arxiv-sanity
- arxiv-vanity/engrafo
- arxivscraper
- tex2py
- cso-classifier
- axcell
- elasticsearch
- Semantic Scholar Open Research corpus
- metaflow
- docsify
Licence
MIT