Access to arxiv data

xv

To install: pip install xv

Examples

from xv import *

Raw store

At the time of writing, my attempts to make graze automatically confirm Google Drive downloads have not succeeded (when a file is too big, Google Drive tells the user it can't scan the file and asks the user to confirm the download).

Therefore, the following files need to be downloaded manually:

(If those URLs don't work, they may have been updated; see https://alex.macrocosm.so/download .)

You can then copy them over to the place graze will look for by doing:

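from pathlib import Path
from xv.util import Graze
from xv.data_access import urls

g = Graze()  # the URL-keyed store that graze reads from

# Replace the placeholder strings with the paths of the files you downloaded
g[urls['titles']] = Path('TITLES_DATA_LOCAL_FILEPATH').read_bytes()
g[urls['abstracts']] = Path('ABSTRACTS_DATA_LOCAL_FILEPATH').read_bytes()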
# from imbed.mdat.arxiv import urls
# from pathlib import Path

# g[urls['titles']] = Path('FILE_WHERE_YOU_DOWNLOADED_TITLES_DATA').read_bytes()
# g[urls['abstracts']] = Path('FILE_WHERE_YOU_DOWNLOADED_ABSTRACTS_DATA').read_bytes()
from xv.util import Graze

g = Graze()
list(g)
['https://drive.google.com/file/d/1Ul5mPePtoPKHZkH5Rm6dWKAO11dG98GN/view?usp=share_link',
 'https://drive.google.com/file/d/1g3K-wlixFxklTSUQNZKpEgN4WNTFTPIZ/view?usp=share_link',
 'https://arxiv.org/pdf/0704.0001']
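The `g[url] = bytes` and `list(g)` pattern above treats the download cache as a mapping keyed by URL. The following is a minimal sketch of that idea, not graze's actual implementation: `UrlFileCache`, its file-naming scheme, and the `example.com` URL are all hypothetical and only illustrate how manually downloaded bytes can be filed under a URL key.

```python
import tempfile
from collections.abc import MutableMapping
from pathlib import Path
from urllib.parse import quote, unquote


class UrlFileCache(MutableMapping):
    """Hypothetical URL-keyed byte store backed by a local folder."""

    def __init__(self, rootdir):
        self.rootdir = Path(rootdir)
        self.rootdir.mkdir(parents=True, exist_ok=True)

    def _path(self, url):
        # Percent-encode the whole URL so it becomes a single safe file name.
        return self.rootdir / quote(url, safe="")

    def __getitem__(self, url):
        path = self._path(url)
        if not path.is_file():
            raise KeyError(url)
        return path.read_bytes()

    def __setitem__(self, url, data):
        self._path(url).write_bytes(data)

    def __delitem__(self, url):
        path = self._path(url)
        if not path.is_file():
            raise KeyError(url)
        path.unlink()

    def __iter__(self):
        # Decode file names back into the URLs they were stored under.
        return (unquote(p.name) for p in self.rootdir.iterdir())

    def __len__(self):
        return sum(1 for _ in self.rootdir.iterdir())


# Store bytes "downloaded manually" under the URL they came from.
cache = UrlFileCache(tempfile.mkdtemp())
cache["https://example.com/titles.zip"] = b"manually downloaded bytes"
print(list(cache))
```

Once the bytes are stored under the URL, any later lookup of that URL finds them locally instead of triggering a download.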
from xv import raw_sources

list(raw_sources)
['titles', 'abstracts']
raw = raw_sources['titles']
list(raw)
['titles_7.parquet',
 'titles_23.parquet',
 'titles_15.parquet',
 'verifyResults.py',
 'titles_14.parquet',
 'titles_22.parquet',
 'titles_6.parquet',
 'titles_16.parquet',
 'titles_20.parquet',
 'titles_4.parquet',
 'titles_5.parquet',
 'titles_21.parquet',
 'params.txt',
 'titles_17.parquet',
 'exampleEmbed.py',
 'titles_12.parquet',
 'README.md',
 'titles_9.parquet',
 'titles_1.parquet',
 'titles_13.parquet',
 'titles_8.parquet',
 'titles_18.parquet',
 'titles_3.parquet',
 'titles_11.parquet',
 'titles_10.parquet',
 'titles_19.parquet',
 'titles_2.parquet']
print(raw['exampleEmbed.py'].decode())
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-xl')
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Research Paper title for retrieval; Input:"
embeddings = model.encode([[instruction,sentence]])
print(embeddings)
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-xl')
load INSTRUCTOR_Transformer
max_seq_length  512
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Research Paper title for retrieval; Input:"
embeddings = model.encode([[instruction, sentence]])
print(raw['params.txt'].decode())
prompt: Represent the Research Paper title for retrieval; Input:
type: title
time string: 20230518-185428
model: InstructorXL
version: 2.0
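The `params.txt` content shown above is a list of `key: value` lines. As a small sketch, it could be parsed into a dictionary as follows. The helper `parse_params` is hypothetical (not part of `xv`), and the timestamp format `%Y%m%d-%H%M%S` is an assumption inferred from the single value shown.

```python
from datetime import datetime

# The text printed from raw['params.txt'] above, copied verbatim.
params_text = """prompt: Represent the Research Paper title for retrieval; Input:
type: title
time string: 20230518-185428
model: InstructorXL
version: 2.0
"""


def parse_params(text):
    """Parse 'key: value' lines into a dict (hypothetical helper)."""
    params = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first colon only: the prompt itself ends with a colon.
        key, _, value = line.partition(":")
        params[key.strip()] = value.strip()
    return params


params = parse_params(params_text)
# Assumed format: date and time separated by a hyphen.
created = datetime.strptime(params["time string"], "%Y%m%d-%H%M%S")
print(params["model"], created.isoformat())
```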

The imbedding data store

And now, we'll transform the raw store to get a convenient interface to the actual data of interest.

b = raw['titles_1.parquet']
len(b)
313383694
from xv import sources  # raw store + wrapper. See parquet_codec code.

titles_tables = sources['titles']
abstract_tables = sources['abstracts']
print(list(titles_tables))
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
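The raw store is keyed by file names such as `titles_7.parquet`, while the wrapped store is keyed by integers. One plausible way to get from one to the other is sketched below; this is an illustration under that assumption, not the code of the actual wrapper, and `shard_index` is a hypothetical helper. Non-parquet files (`README.md`, `params.txt`, and so on) are filtered out.

```python
import re

# A few of the raw keys listed earlier, including non-data files.
raw_keys = [
    "titles_7.parquet",
    "titles_23.parquet",
    "verifyResults.py",
    "params.txt",
    "titles_1.parquet",
    "README.md",
]

SHARD_PATTERN = re.compile(r"_(\d+)\.parquet$")


def shard_index(filename):
    """Return the integer shard number of a parquet file name, or None."""
    match = SHARD_PATTERN.search(filename)
    return int(match.group(1)) if match else None


# Map integer keys to file names, keeping only the parquet shards.
index_to_file = {
    shard_index(name): name for name in raw_keys if shard_index(name) is not None
}
shard_numbers = sorted(index_to_file)
print(shard_numbers)
```

Sorting the integer keys also gives the shards in numeric order, which the raw file listing does not.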
titles_df = titles_tables[1]
titles_df
                                                   title                                         embeddings        doi
0      Calculation of prompt diphoton production cros...  [-0.050620172, 0.041436385, 0.05363288, -0.029...  0704.0001
1               Sparsity-certifying Graph Decompositions  [0.014515653, 0.023809524, -0.028145121, -0.04...  0704.0002
2      The evolution of the Earth-Moon system based o...  [-4.766115e-05, 0.017415706, 0.04146007, -0.03...  0704.0003
3      A determinant of Stirling cycle numbers counts...  [0.027208889, 0.046175897, 0.0010913888, -0.01...  0704.0004
4      From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...  [0.0113909235, 0.0042667952, -0.0008565594, -0...  0704.0005
...                                                  ...                                                ...        ...
99995                           Multiple Time Dimensions  [0.02682626, -0.0015173098, -0.0019915192, -0....  0812.3869
99996  Depth Zero Representations of Nonlinear Covers...  [-0.02740943, 0.011689809, -0.0105154915, -0.0...  0812.3870
99997  Decting Errors in Reversible Circuits With Inv...  [0.0072460608, 0.0028085636, -0.015064359, -0....  0812.3871
99998  Unveiling the birth and evolution of the HII r...  [0.009408689, -0.0047120117, 0.0021392817, -0....  0812.3872
99999  The K-Receiver Broadcast Channel with Confiden...  [-0.0026305509, -0.006502139, 0.013400236, -0....  0812.3873

100000 rows × 3 columns

abstract_df = abstract_tables[1]
abstract_df
                                                abstract                                         embeddings        doi
0        A fully differential calculation in perturba...  [-0.035151865, 0.022851437, 0.025942933, -0.02...  0704.0001
1        We describe a new algorithm, the $(k,\ell)$-...  [0.035485767, -0.0015772493, -0.0016615744, -0...  0704.0002
2        The evolution of Earth-Moon system is descri...  [-0.014510429, 0.010210799, 0.049661566, -0.01...  0704.0003
3        We show that a determinant of Stirling cycle...  [0.029191103, 0.047992915, -0.0061754594, -0.0...  0704.0004
4        In this paper we show how to compute the $\L...  [-0.015174898, 0.01603887, 0.04062805, -0.0246...  0704.0005
...                                                  ...                                                ...        ...
99995    The possibility of physics in multiple time ...  [0.016121766, 0.011126887, 0.018650021, -0.044...  0812.3869
99996    We generalize the methods of Moy-Prasad, in ...  [-7.164341e-05, -0.007114291, -0.008979887, -0...  0812.3870
99997    Reversible logic is experience renewed inter...  [0.03194286, -0.00771745, 0.015977046, -0.0474...  0812.3871
99998    Based on a multiwavelength study, the ISM ar...  [-0.012340169, -0.021712925, 0.00806009, -0.00...  0812.3872
99999    The secrecy capacity region for the K-receiv...  [0.0012416588, 0.0006933478, -0.0057888636, -0...  0812.3873

100000 rows × 3 columns

abstract_df['doi'].values
array(['0704.0001', '0704.0002', '0704.0003', ..., '0812.3871',
       '0812.3872', '0812.3873'], dtype=object)
from xv import arxiv_url

doi = abstract_df['doi'].values[0]
arxiv_url(doi)
'https://arxiv.org/abs/0704.0001'
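Note that the values in the `doi` column, such as `0704.0001`, are arXiv identifiers in the `YYMM.number` scheme, so the submission year and month can be read off the identifier itself. The sketch below shows one way to do that; `parse_arxiv_id` is a hypothetical helper (not part of `xv`) and it deliberately ignores version suffixes such as `v2` and the older `archive/YYMMNNN` identifiers.

```python
import re

# New-style arXiv identifiers: two-digit year, two-digit month, sequence number.
NEW_STYLE_ID = re.compile(r"^(\d{2})(\d{2})\.(\d{4,5})$")


def parse_arxiv_id(arxiv_id):
    """Split a new-style arXiv identifier into year, month and number."""
    match = NEW_STYLE_ID.match(arxiv_id)
    if match is None:
        raise ValueError(f"Not a new-style arXiv identifier: {arxiv_id!r}")
    yy, mm, number = match.groups()
    # New-style identifiers started in 2007, so the century is 2000.
    return {"year": 2000 + int(yy), "month": int(mm), "number": int(number)}


print(parse_arxiv_id("0704.0001"))
print(parse_arxiv_id("0812.3873"))
```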
from xv.data_access import resource_descriptions
resource_descriptions
{'abs': 'Main page of article. Contains links to all other relevant information.',
 'pdf': 'Direct link to article pdf',
 'format': 'Page giving access to other formats',
 'src': 'Access to the original source files submitted by the authors.',
 'cits': 'Tracks citations of the article across various platforms and databases.',
 'html': 'Link to the ar5iv html page for the article.'}
doi = '0704.0001'

for resource, description in resource_descriptions.items():
    print(f"{resource}: {description}")
    print(f"Example: {arxiv_url(doi, resource)}")
    print("")
abs: Main page of article. Contains links to all other relevant information.
Example: https://arxiv.org/abs/0704.0001

pdf: Direct link to article pdf
Example: https://arxiv.org/pdf/0704.0001

format: Page giving access to other formats
Example: https://arxiv.org/format/0704.0001

src: Access to the original source files submitted by the authors.
Example: https://arxiv.org/src/0704.0001

cits: Tracks citations of the article across various platforms and databases.
Example: https://arxiv.org/cits/0704.0001

html: Link to the ar5iv html page for the article.
Example: https://ar5iv.labs.arxiv.org/html/0704.0001
arxiv_url(doi, 'pdf')
'https://arxiv.org/pdf/0704.0001'
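The example URLs printed above follow a simple pattern: every resource lives under `https://arxiv.org/<resource>/<id>`, except `html`, which points to ar5iv. The function below is a sketch that reproduces those printed URLs; `build_arxiv_url` is a hypothetical stand-in and may differ from how `arxiv_url` is actually implemented.

```python
ARXIV_BASE = "https://arxiv.org"
AR5IV_BASE = "https://ar5iv.labs.arxiv.org"


def build_arxiv_url(arxiv_id, resource="abs"):
    """Build a URL for an arXiv resource, following the pattern shown above."""
    # Only the 'html' resource is served from the ar5iv host.
    base = AR5IV_BASE if resource == "html" else ARXIV_BASE
    return f"{base}/{resource}/{arxiv_id}"


for resource in ["abs", "pdf", "format", "src", "cits", "html"]:
    print(build_arxiv_url("0704.0001", resource))
```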
pdf_bytes = g[arxiv_url(doi, 'pdf')]
The contents  (~1.647MB) of https://arxiv.org/pdf/0704.0001 are being downloaded...
abstract_df.embeddings.values[0].shape
(768,)
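Each embedding is a 768-dimensional vector, and the prompt in `params.txt` says the embeddings were made for retrieval. A common way to use such vectors is to rank items by cosine similarity to a query vector. The sketch below shows the idea in pure Python with toy 3-dimensional vectors and hypothetical keys (`paper_a`, and so on); none of these values come from the actual tables, and `cosine_similarity` and `top_k` are illustrative helpers, not part of `xv`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("Vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("Cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def top_k(query, candidates, k=2):
    """Return the keys of the k candidates most similar to the query."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in candidates.items()]
    return [key for _, key in sorted(scored, reverse=True)[:k]]


# Toy vectors standing in for real 768-dimensional embeddings.
candidates = {
    "paper_a": [1.0, 0.0, 0.0],
    "paper_b": [0.0, 1.0, 0.0],
    "paper_c": [0.9, 0.1, 0.0],
}
query = [1.0, 0.0, 0.0]
print(top_k(query, candidates, k=2))
```

With 100000 rows per table, a vectorized library would be a more practical choice than pure Python; the loop form is used here only to keep the computation explicit.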
