Wrapper for the arXiv API
Project description
arXiv Loader
This tool is a wrapper of the arXiv API that allows you to retrieve metadata of articles published on arXiv as pandas.DataFrame
.
Please abide by the Terms of Usage of the arXiv API.
Installation
pip install arxivloader
Usage
Please consult the arXiv API documentation for help in constructing a valid query string.
Searching by keyword
To search for a keyword the query needs to start with search_query=
followed by a prefix and the search word.
Possible prefixes are
Prefix | Explanation |
---|---|
ti | Title |
au | Author |
abs | Abstract |
co | Comments |
jr | Journal Reference |
cat | Subject Category |
rn | Report Number |
id | arXiv ID |
all | All of the above |
Please have a look at the arXiv API documentation for details.
import arxivloader
keyword = "DustPy"
prefix = "all"
query = "search_query={pf}:{kw}".format(pf=prefix, kw=keyword)
columns = ["id", "title", "authors"]
df = arxivloader.load(query, columns=columns)
print(df)
id | title | authors | |
---|---|---|---|
0 | 2207.00322v2 | DustPy: A Python Package for Dust Evolution in Protoplanetary Disks | Sebastian Markus Stammler; Tilman Birnstiel |
1 | 2110.04007v1 | The formation of wide exoKuiper belts from migrating dust traps | E. Miller; S. Marino; S. M. Stammler; P. Pinilla; C. Lenz; T. Birnstiel; Th. Henning |
Searching by id
To search for a specific arXiv ID the query needs to start with id_list=
followed by a comma-separated list of arXiv IDs:
import arxivloader
IDs = ["1909.04674", "1909.10526"]
query = "id_list={}".format(",".join(IDs))
columns = ["id", "title", "authors"]
df = arxivloader.load(query, columns=columns)
print(df)
id | title | authors | |
---|---|---|---|
0 | 1909.04674v1 | The DSHARP Rings: Evidence of Ongoing Planetesimal Formation? | Sebastian M. Stammler; Joanna Drazkowska; Til Birnstiel; Hubert Klahr; Cornelis P. Dullemond; Sean M. Andrews |
1 | 1909.10526v1 | Including Dust Coagulation in Hydrodynamic Models of Protoplanetary Disks: Dust Evolution in the Vicinity of a Jupiter-mass Planet | Joanna Drazkowska; Shengtai Li; Til Birnstiel; Sebastian M. Stammler; Hui Li |
Filtering specific articles by keywords
If both, search_query=
and id_list=
are present, the given arXiv articles are filtered by the give key word.
import arxivloader
keyword = "DSHARP"
prefix = "ti"
IDs = ["1909.04674", "1909.10526"]
query = "search_query={pf}:{kw}&id_list={ids}".format(pf=prefix, kw=keyword, ids=",".join(IDs))
columns = ["id", "title", "authors"]
df = arxivloader.load(query, columns=columns)
print(df)
id | title | authors | |
---|---|---|---|
0 | 1909.04674v1 | The DSHARP Rings: Evidence of Ongoing Planetesimal Formation? | Sebastian M. Stammler; Joanna Drazkowska; Til Birnstiel; Hubert Klahr; Cornelis P. Dullemond; Sean M. Andrews |
Searching by date
It is possible to only retrieve entries in a specified date window.
This query selects all publications that have been submitted to astro-ph.EP
on July 1st 2022 between 8am and 1pm.
import arxivloader
prefix = "cat"
cat = "astro-ph.EP"
submittedDate = "[20220701080000+TO+20220701130000]"
query = "search_query={pf}:{cat}+AND+submittedDate:{sd}".format(pf=prefix, cat=cat, sd=submittedDate)
columns = ["id", "title", "authors", "published"]
df = arxivloader.load(query, columns=columns, sortBy="submittedDate", sortOrder="ascending")
print(df)
id | title | authors | published | |
---|---|---|---|---|
0 | 2207.00273v1 | Whistler Waves As a Signature of Converging Magnetic Holes in Space Plasmas | Wence Jiang; Daniel Verscharen; Hui Li; Chi Wang; Kristopher G. Klein | 2022-07-01 08:55:54 |
1 | 2207.00322v2 | DustPy: A Python Package for Dust Evolution in Protoplanetary Disks | Sebastian Markus Stammler; Tilman Birnstiel | 2022-07-01 10:25:59 |
Searching by category
It is possible to search large number of articles by category. Please be responsible with the traffic this query causes.
import arxivloader
keyword = "astro-ph.EP"
prefix = "cat"
query = "search_query={pf}:{kw}".format(pf=prefix, kw=keyword)
columns = ["id", "title", "primary_category", "categories", "published"]
df = arxivloader.load(query, columns=columns, sortBy="submittedDate", sortOrder="descending", num=1000, page_size=100)
print(df.head(5))
id | title | primary_category | categories | published | |
---|---|---|---|---|---|
0 | 2210.11357v1 | The Key Factors Controlling the Seasonality of Planetary Climate | physics.ao-ph | physics.ao-ph; astro-ph.EP | 2022-10-20 15:45:43 |
1 | 2210.11305v1 | On the origin of the dichotomy of stellar activity cycles | astro-ph.SR | astro-ph.SR; astro-ph.EP | 2022-10-20 14:34:33 |
2 | 2210.11207v1 | $\texttt{KOBEsim}$: a Bayesian observing strategy algorithm for planet detection in radial velocity blind-search | astro-ph.EP | astro-ph.EP; astro-ph.IM | 2022-10-20 12:33:03 |
3 | 2210.11103v1 | Lower-than-expected flare temperatures for TRAPPIST-1 | astro-ph.SR | astro-ph.SR; astro-ph.EP | 2022-10-20 08:55:47 |
4 | 2210.10909v1 | TOI-3884 b: A rare 6-R$_{\oplus}$ planet that transits a low-mass star with a giant and likely polar spot | astro-ph.EP | astro-ph.EP | 2022-10-19 22:19:15 |
Options
arxivloader.load()
has several keyword arguments:
Keyword | Default value | Description |
---|---|---|
num |
10 | Maximum total number of entries to be retrieved. |
start |
0 | Starting index of query. |
page_size |
10 | The entries are retrieved in pages. The maximum allowed page size is 30000. |
delay |
3. | Delay in seconds between page requests. |
sortBy |
"relevance" |
Possible values: "relevance" , "lastUpdatedDate" , "submittedDate" . |
sortOrder |
"descending" |
Possible values: "descending" , "ascending" . |
columns |
["id", "title", "summary", "authors", "primary_category", "categories", "comments", "updated", "published", "doi", "links"] |
List of the columns the pandas.DataFrame should contain. |
timeout |
10. | Timeout in seconds for a single page. |
verbosity |
2 | Level of verbosity. |
The default options are usually good enough.
The delay
has to be at least three seconds to be fair with the load on the arXiv API.
It can happen that the arxiv API does not respond for a query. timeout
will set the time after which arxivloader
assumes a failed attempt and will retry at most five times. Please note, that timeout
needs to be larger than the arXiv API takes to process the query, which depends on page_size
. Consider two minutes for ten thousand entries in a page.
If verbosity
is 0
, arxivloader
will not display anything on screen. If verbosity
is 1
, arxivloader
will print out the number of retrieved entries at the end of execution. If verbosity
is 2
, arxivloader
will additionally show a progess bar.
Acknowledgements
Thank you to arXiv for use of its open access interoperability.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file arxivloader-1.0.2.tar.gz
.
File metadata
- Download URL: arxivloader-1.0.2.tar.gz
- Upload date:
- Size: 10.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.11.3 pkginfo/1.8.2 requests/2.28.1 requests-toolbelt/0.9.1 tqdm/4.64.1 CPython/3.9.7
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 101ef4f582e2f3d373d115fca77889f99cd5397bdf7da92d4d7dcf10c865e815 |
|
MD5 | 8c8c0207a437f15ba9e0549c306dca1b |
|
BLAKE2b-256 | da62be5ad2477122382ea93bdfea427897e7b77ce95b881cc2566a754b6d5c27 |