Access to arxiv data
xv
To install: pip install xv
Examples
from xv import *
Raw store
At the time of writing, my attempts to enable graze to automatically confirm the download from Google Drive have not succeeded (when a file is too big, Google Drive tells the user it can't scan the file and asks the user to confirm the download).
Therefore, the following files need to be downloaded manually:
- titles: https://drive.google.com/file/d/1Ul5mPePtoPKHZkH5Rm6dWKAO11dG98GN/view?usp=share_link
- abstracts: https://drive.google.com/file/d/1g3K-wlixFxklTSUQNZKpEgN4WNTFTPIZ/view?usp=share_link
(If those URLs don't work, they may have been updated; see https://alex.macrocosm.so/download for the current links.)
You can then copy the files to the place where graze will look for them:
from pathlib import Path
from xv.util import Graze
from xv.data_access import urls
g = Graze()  # local store of downloaded content, keyed by URL
g[urls['titles']] = Path('TITLES_DATA_LOCAL_FILEPATH').read_bytes()
g[urls['abstracts']] = Path('ABSTRACTS_DATA_LOCAL_FILEPATH').read_bytes()
# from imbed.mdat.arxiv import urls
# from pathlib import Path
# g[urls['titles']] = Path('FILE_WHERE_YOU_DOWNLOADED_TITLES_DATA').read_bytes()
# g[urls['abstracts']] = Path('FILE_WHERE_YOU_DOWNLOADED_ABSTRACTS_DATA').read_bytes()
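Since the copy step depends on files you downloaded by hand, it can help to check them before writing to the store. Here is a minimal sketch of such a check. The helper `register_local_file` is hypothetical (it is not part of xv or graze), and a plain dict stands in for the `Graze` store, assuming only that the store accepts `store[key] = bytes`.

```python
import tempfile
from collections.abc import MutableMapping
from pathlib import Path


def register_local_file(store: MutableMapping, key: str, filepath: str) -> int:
    """Copy a manually downloaded file into `store` under `key`.

    Hypothetical helper: returns the number of bytes written.
    """
    path = Path(filepath).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"No downloaded file found at {path}")
    data = path.read_bytes()
    if not data:
        raise ValueError(f"{path} is empty; the download may have failed")
    store[key] = data
    return len(data)


# Demo with a dict standing in for the Graze store and a throwaway file
# standing in for the real download.
store = {}
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "titles_download"
    demo_file.write_bytes(b"not a real archive")
    n_bytes = register_local_file(store, "titles-url-placeholder", str(demo_file))

print(n_bytes)
```

With the real store, you would pass `g` as `store` and `urls['titles']` or `urls['abstracts']` as `key`.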
from xv.util import Graze
g = Graze()
list(g)
['https://drive.google.com/file/d/1Ul5mPePtoPKHZkH5Rm6dWKAO11dG98GN/view?usp=share_link',
'https://drive.google.com/file/d/1g3K-wlixFxklTSUQNZKpEgN4WNTFTPIZ/view?usp=share_link',
'https://arxiv.org/pdf/0704.0001']
from xv import raw_sources
list(raw_sources)
['titles', 'abstracts']
raw = raw_sources['titles']
list(raw)
['titles_7.parquet',
'titles_23.parquet',
'titles_15.parquet',
'verifyResults.py',
'titles_14.parquet',
'titles_22.parquet',
'titles_6.parquet',
'titles_16.parquet',
'titles_20.parquet',
'titles_4.parquet',
'titles_5.parquet',
'titles_21.parquet',
'params.txt',
'titles_17.parquet',
'exampleEmbed.py',
'titles_12.parquet',
'README.md',
'titles_9.parquet',
'titles_1.parquet',
'titles_13.parquet',
'titles_8.parquet',
'titles_18.parquet',
'titles_3.parquet',
'titles_11.parquet',
'titles_10.parquet',
'titles_19.parquet',
'titles_2.parquet']
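The listing above mixes parquet shards with auxiliary files, in no particular order. A small sketch of how you might separate the two and sort the shards by number; `split_raw_keys` and `SHARD_PATTERN` are hypothetical names for illustration, not part of xv.

```python
import re

# Matches names such as 'titles_7.parquet' and captures the shard number.
SHARD_PATTERN = re.compile(r"^(?P<kind>\w+?)_(?P<number>\d+)\.parquet$")


def split_raw_keys(keys):
    """Return (parquet shards sorted by shard number, other files sorted by name)."""
    shards, others = [], []
    for key in keys:
        match = SHARD_PATTERN.match(key)
        if match:
            shards.append((int(match.group("number")), key))
        else:
            others.append(key)
    return [key for _, key in sorted(shards)], sorted(others)


example_keys = [
    "titles_7.parquet",
    "titles_23.parquet",
    "verifyResults.py",
    "params.txt",
    "titles_1.parquet",
    "README.md",
    "titles_10.parquet",
]
shards, others = split_raw_keys(example_keys)
print(shards)
print(others)
```

With the real store, `split_raw_keys(raw)` should work the same way, since iterating over `raw` yields the file names.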
print(raw['exampleEmbed.py'].decode())
from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-xl')
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Research Paper title for retrieval; Input:"
embeddings = model.encode([[instruction,sentence]])
print(embeddings)
from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-xl')
load INSTRUCTOR_Transformer
max_seq_length 512
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Research Paper title for retrieval; Input:"
embeddings = model.encode([[instruction, sentence]])
print(raw['params.txt'].decode())
prompt: Represent the Research Paper title for retrieval; Input:
type: title
time string: 20230518-185428
model: InstructorXL
version: 2.0
The imbedding data store
And now, we'll transform the raw store to get a convenient interface to the actual data of interest.
b = raw['titles_1.parquet']
len(b)
313383694
from xv import sources # raw store + wrapper. See parquet_codec code.
titles_tables = sources['titles']
abstract_tables = sources['abstracts']
print(list(titles_tables))
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
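To give an idea of what such a wrapper does, here is a minimal sketch of a read-only view that maps integer shard numbers to file names and decodes the bytes on access. This is not xv's `parquet_codec`; `DecodedShards` is a hypothetical class, and `json.loads` stands in for the parquet decoder so the example runs without extra dependencies.

```python
import json
import re
from collections.abc import Mapping


class DecodedShards(Mapping):
    """Read-only view of a bytes store: integer shard keys, decoded values."""

    def __init__(self, raw, kind, decoder):
        self._raw = raw
        self._kind = kind
        self._decoder = decoder
        self._pattern = re.compile(rf"^{re.escape(kind)}_(\d+)\.parquet$")

    def _numbers(self):
        # Keep only shard files and expose their numbers in ascending order.
        matches = map(self._pattern.match, self._raw)
        return sorted(int(m.group(1)) for m in matches if m)

    def __iter__(self):
        return iter(self._numbers())

    def __len__(self):
        return len(self._numbers())

    def __getitem__(self, number):
        key = f"{self._kind}_{number}.parquet"
        return self._decoder(self._raw[key])  # decoded lazily, on access


# Fake raw store; JSON bytes stand in for parquet bytes.
fake_raw = {
    "titles_2.parquet": json.dumps({"title": ["B"]}).encode(),
    "titles_1.parquet": json.dumps({"title": ["A"]}).encode(),
    "README.md": b"not a shard",
}
tables = DecodedShards(fake_raw, "titles", decoder=json.loads)
print(list(tables))
print(tables[1])
```

For real parquet bytes, the decoder could be something like `lambda b: pandas.read_parquet(io.BytesIO(b))`, assuming pandas and a parquet engine are installed.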
titles_df = titles_tables[1]
titles_df
| | title | embeddings | doi |
---|---|---|---|
0 | Calculation of prompt diphoton production cros... | [-0.050620172, 0.041436385, 0.05363288, -0.029... | 0704.0001 |
1 | Sparsity-certifying Graph Decompositions | [0.014515653, 0.023809524, -0.028145121, -0.04... | 0704.0002 |
2 | The evolution of the Earth-Moon system based o... | [-4.766115e-05, 0.017415706, 0.04146007, -0.03... | 0704.0003 |
3 | A determinant of Stirling cycle numbers counts... | [0.027208889, 0.046175897, 0.0010913888, -0.01... | 0704.0004 |
4 | From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a... | [0.0113909235, 0.0042667952, -0.0008565594, -0... | 0704.0005 |
... | ... | ... | ... |
99995 | Multiple Time Dimensions | [0.02682626, -0.0015173098, -0.0019915192, -0.... | 0812.3869 |
99996 | Depth Zero Representations of Nonlinear Covers... | [-0.02740943, 0.011689809, -0.0105154915, -0.0... | 0812.3870 |
99997 | Decting Errors in Reversible Circuits With Inv... | [0.0072460608, 0.0028085636, -0.015064359, -0.... | 0812.3871 |
99998 | Unveiling the birth and evolution of the HII r... | [0.009408689, -0.0047120117, 0.0021392817, -0.... | 0812.3872 |
99999 | The K-Receiver Broadcast Channel with Confiden... | [-0.0026305509, -0.006502139, 0.013400236, -0.... | 0812.3873 |
100000 rows × 3 columns
abstract_df = abstract_tables[1]
abstract_df
| | abstract | embeddings | doi |
---|---|---|---|
0 | A fully differential calculation in perturba... | [-0.035151865, 0.022851437, 0.025942933, -0.02... | 0704.0001 |
1 | We describe a new algorithm, the $(k,\ell)$-... | [0.035485767, -0.0015772493, -0.0016615744, -0... | 0704.0002 |
2 | The evolution of Earth-Moon system is descri... | [-0.014510429, 0.010210799, 0.049661566, -0.01... | 0704.0003 |
3 | We show that a determinant of Stirling cycle... | [0.029191103, 0.047992915, -0.0061754594, -0.0... | 0704.0004 |
4 | In this paper we show how to compute the $\L... | [-0.015174898, 0.01603887, 0.04062805, -0.0246... | 0704.0005 |
... | ... | ... | ... |
99995 | The possibility of physics in multiple time ... | [0.016121766, 0.011126887, 0.018650021, -0.044... | 0812.3869 |
99996 | We generalize the methods of Moy-Prasad, in ... | [-7.164341e-05, -0.007114291, -0.008979887, -0... | 0812.3870 |
99997 | Reversible logic is experience renewed inter... | [0.03194286, -0.00771745, 0.015977046, -0.0474... | 0812.3871 |
99998 | Based on a multiwavelength study, the ISM ar... | [-0.012340169, -0.021712925, 0.00806009, -0.00... | 0812.3872 |
99999 | The secrecy capacity region for the K-receiv... | [0.0012416588, 0.0006933478, -0.0057888636, -0... | 0812.3873 |
100000 rows × 3 columns
abstract_df['doi'].values
array(['0704.0001', '0704.0002', '0704.0003', ..., '0812.3871',
'0812.3872', '0812.3873'], dtype=object)
from xv import arxiv_url
doi = abstract_df['doi'].values[0]
arxiv_url(doi)
'https://arxiv.org/abs/0704.0001'
from xv.data_access import resource_descriptions
resource_descriptions
{'abs': 'Main page of article. Contains links to all other relevant information.',
'pdf': 'Direct link to article pdf',
'format': 'Page giving access to other formats',
'src': 'Access to the original source files submitted by the authors.',
'cits': 'Tracks citations of the article across various platforms and databases.',
'html': 'Link to the ar5iv html page for the article.'}
doi = '0704.0001'
for resource, description in resource_descriptions.items():
    print(f"{resource}: {description}")
    print(f"Example: {arxiv_url(doi, resource)}")
    print("")
abs: Main page of article. Contains links to all other relevant information.
Example: https://arxiv.org/abs/0704.0001
pdf: Direct link to article pdf
Example: https://arxiv.org/pdf/0704.0001
format: Page giving access to other formats
Example: https://arxiv.org/format/0704.0001
src: Access to the original source files submitted by the authors.
Example: https://arxiv.org/src/0704.0001
cits: Tracks citations of the article across various platforms and databases.
Example: https://arxiv.org/cits/0704.0001
html: Link to the ar5iv html page for the article.
Example: https://ar5iv.labs.arxiv.org/html/0704.0001
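The URL patterns in the output above suggest a simple prefix lookup per resource. Here is a minimal sketch under that assumption; `build_arxiv_url` and `ARXIV_URL_PREFIXES` are hypothetical names, and this is not necessarily how `arxiv_url` is implemented in xv.

```python
# Prefixes inferred from the example URLs printed above.
ARXIV_URL_PREFIXES = {
    "abs": "https://arxiv.org/abs/",
    "pdf": "https://arxiv.org/pdf/",
    "format": "https://arxiv.org/format/",
    "src": "https://arxiv.org/src/",
    "cits": "https://arxiv.org/cits/",
    "html": "https://ar5iv.labs.arxiv.org/html/",  # served by ar5iv, not arxiv.org
}


def build_arxiv_url(arxiv_id, resource="abs"):
    """Build the URL of a given resource for an arXiv identifier."""
    try:
        prefix = ARXIV_URL_PREFIXES[resource]
    except KeyError:
        valid = ", ".join(ARXIV_URL_PREFIXES)
        raise ValueError(f"Unknown resource {resource!r}; expected one of: {valid}") from None
    return prefix + arxiv_id


print(build_arxiv_url("0704.0001"))
print(build_arxiv_url("0704.0001", "html"))
```

Keeping the prefixes in a dict means an unknown resource name fails with a clear error instead of producing a URL that does not exist.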
arxiv_url(doi, 'pdf')
'https://arxiv.org/pdf/0704.0001'
pdf_bytes = g[arxiv_url(doi, 'pdf')]
The contents (~1.647MB) of https://arxiv.org/pdf/0704.0001 are being downloaded...
abstract_df.embeddings.values[0].shape
(768,)
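Each embedding is a flat vector of 768 numbers, so two of them can be compared directly. As an illustration, here is a minimal sketch of cosine similarity in plain Python. The function name is hypothetical and the toy 3-dimensional vectors stand in for real embeddings; nothing here is part of xv.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("Vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("Cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


# Toy vectors standing in for 768-dimensional embeddings.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

The same function should accept two entries of `abstract_df.embeddings.values`, since it only iterates over the vectors; for many comparisons a vectorized NumPy version would be much faster.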