Access to arxiv data

xv

To install: pip install xv

Examples

from xv import *

Raw store

At the time of writing, my attempts to make graze automatically confirm Google Drive downloads have not succeeded (when a file is too big, Google Drive tells the user it can't scan the file and asks the user to confirm the download).

Therefore, the following files need to be downloaded manually:

(If those URLs don't work, they may have been updated; see https://alex.macrocosm.so/download.)

You can then copy them over to the place graze will look for by doing:

from pathlib import Path
from xv.util import Graze
from xv.data_access import urls

g = Graze()

g[urls['titles']] = Path('TITLES_DATA_LOCAL_FILEPATH').read_bytes()
g[urls['abstracts']] = Path('ABSTRACTS_DATA_LOCAL_FILEPATH').read_bytes()
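
A slightly more defensive version of this copy step could look like the sketch below. The helper name `copy_downloads_to_store` is hypothetical (it is not part of xv or graze); it only assumes the store supports item assignment, as in the snippet above. A plain dict can stand in for the Graze instance when trying it out.

```python
from pathlib import Path
from typing import List, Mapping, MutableMapping


def copy_downloads_to_store(
    store: MutableMapping[str, bytes],
    url_to_local_path: Mapping[str, str],
) -> List[str]:
    """Write each downloaded file's bytes into the store under its url key.

    Hypothetical helper: returns the urls whose local file was not found,
    instead of raising, so you can see what still needs to be downloaded.
    """
    missing = []
    for url, local_path in url_to_local_path.items():
        path = Path(local_path).expanduser()
        if not path.is_file():
            missing.append(url)
            continue
        store[url] = path.read_bytes()
    return missing
```

With the names used above, a call might be `copy_downloads_to_store(g, {urls['titles']: 'TITLES_DATA_LOCAL_FILEPATH', urls['abstracts']: 'ABSTRACTS_DATA_LOCAL_FILEPATH'})`; an empty returned list would mean both files were copied.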

from xv.util import Graze

g = Graze()
list(g)

['https://drive.google.com/file/d/1Ul5mPePtoPKHZkH5Rm6dWKAO11dG98GN/view?usp=share_link',
 'https://drive.google.com/file/d/1g3K-wlixFxklTSUQNZKpEgN4WNTFTPIZ/view?usp=share_link',
 'https://arxiv.org/pdf/0704.0001']

from xv import raw_sources

list(raw_sources)

['titles', 'abstracts']

raw = raw_sources['titles']
list(raw)

['titles_7.parquet',
 'titles_23.parquet',
 'titles_15.parquet',
 'verifyResults.py',
 'titles_14.parquet',
 'titles_22.parquet',
 'titles_6.parquet',
 'titles_16.parquet',
 'titles_20.parquet',
 'titles_4.parquet',
 'titles_5.parquet',
 'titles_21.parquet',
 'params.txt',
 'titles_17.parquet',
 'exampleEmbed.py',
 'titles_12.parquet',
 'README.md',
 'titles_9.parquet',
 'titles_1.parquet',
 'titles_13.parquet',
 'titles_8.parquet',
 'titles_18.parquet',
 'titles_3.parquet',
 'titles_11.parquet',
 'titles_10.parquet',
 'titles_19.parquet',
 'titles_2.parquet']

print(raw['exampleEmbed.py'].decode())

from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-xl')
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Research Paper title for retrieval; Input:"
embeddings = model.encode([[instruction,sentence]])
print(embeddings)

from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-xl')

load INSTRUCTOR_Transformer
max_seq_length  512

sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Research Paper title for retrieval; Input:"
embeddings = model.encode([[instruction, sentence]])

print(raw['params.txt'].decode())

prompt: Represent the Research Paper title for retrieval; Input:
type: title
time string: 20230518-185428
model: InstructorXL
version: 2.0
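
If you want these parameters as a dict rather than as text, a minimal sketch is given below. The function `parse_params` is hypothetical (not part of xv), and it assumes the file keeps the simple `key: value` layout printed above. It splits on the first colon only, since the prompt value itself ends with a colon.

```python
def parse_params(text: str) -> dict:
    """Parse 'key: value' lines into a dict, splitting on the first colon only."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition(':')
        if not sep:
            continue  # skip blank lines and lines without a colon
        params[key.strip()] = value.strip()
    return params


# The text printed above, used here as sample input
example = """prompt: Represent the Research Paper title for retrieval; Input:
type: title
time string: 20230518-185428
model: InstructorXL
version: 2.0
"""
params = parse_params(example)
```

With the raw store at hand, the same function could be applied as `parse_params(raw['params.txt'].decode())`. All values stay strings (for example, `params['version']` is `'2.0'`).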

The imbedding data store

And now, we'll transform the raw store to get a convenient interface to the actual data of interest.

b = raw['titles_1.parquet']
len(b)

313383694

from xv import sources  # raw store + wrapper. See parquet_codec code.

titles_tables = sources['titles']
abstract_tables = sources['abstracts']
print(list(titles_tables))

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
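
To give an idea of how file names such as `titles_7.parquet` can be turned into the integer keys listed above, here is a small sketch. It is not the actual `parquet_codec` code; the names `part_number` and `numbered_parts` are made up for illustration, and the file name pattern is inferred from the raw store listing.

```python
import re

# Matches names like 'titles_7.parquet' or 'abstracts_12.parquet'
_PART_PATTERN = re.compile(r'^[a-z]+_(?P<number>\d+)\.parquet$')


def part_number(filename):
    """Return the integer part number of a parquet file name, or None."""
    match = _PART_PATTERN.match(filename)
    return int(match.group('number')) if match else None


def numbered_parts(filenames):
    """Return {number: filename} for the parquet parts, sorted by number.

    Auxiliary files (README.md, params.txt, ...) are ignored.
    """
    parts = {}
    for name in filenames:
        number = part_number(name)
        if number is not None:
            parts[number] = name
    return dict(sorted(parts.items()))
```

Applied to the raw listing, `list(numbered_parts(raw))` would give the keys in numerical order, whatever order the raw store lists its files in.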

titles_df = titles_tables[1]
titles_df

	title	embeddings	doi
0	Calculation of prompt diphoton production cros...	[-0.050620172, 0.041436385, 0.05363288, -0.029...	0704.0001
1	Sparsity-certifying Graph Decompositions	[0.014515653, 0.023809524, -0.028145121, -0.04...	0704.0002
2	The evolution of the Earth-Moon system based o...	[-4.766115e-05, 0.017415706, 0.04146007, -0.03...	0704.0003
3	A determinant of Stirling cycle numbers counts...	[0.027208889, 0.046175897, 0.0010913888, -0.01...	0704.0004
4	From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...	[0.0113909235, 0.0042667952, -0.0008565594, -0...	0704.0005
...	...	...	...
99995	Multiple Time Dimensions	[0.02682626, -0.0015173098, -0.0019915192, -0....	0812.3869
99996	Depth Zero Representations of Nonlinear Covers...	[-0.02740943, 0.011689809, -0.0105154915, -0.0...	0812.3870
99997	Decting Errors in Reversible Circuits With Inv...	[0.0072460608, 0.0028085636, -0.015064359, -0....	0812.3871
99998	Unveiling the birth and evolution of the HII r...	[0.009408689, -0.0047120117, 0.0021392817, -0....	0812.3872
99999	The K-Receiver Broadcast Channel with Confiden...	[-0.0026305509, -0.006502139, 0.013400236, -0....	0812.3873

100000 rows × 3 columns

abstract_df = abstract_tables[1]
abstract_df

	abstract	embeddings	doi
0	A fully differential calculation in perturba...	[-0.035151865, 0.022851437, 0.025942933, -0.02...	0704.0001
1	We describe a new algorithm, the $(k,\ell)$-...	[0.035485767, -0.0015772493, -0.0016615744, -0...	0704.0002
2	The evolution of Earth-Moon system is descri...	[-0.014510429, 0.010210799, 0.049661566, -0.01...	0704.0003
3	We show that a determinant of Stirling cycle...	[0.029191103, 0.047992915, -0.0061754594, -0.0...	0704.0004
4	In this paper we show how to compute the $\L...	[-0.015174898, 0.01603887, 0.04062805, -0.0246...	0704.0005
...	...	...	...
99995	The possibility of physics in multiple time ...	[0.016121766, 0.011126887, 0.018650021, -0.044...	0812.3869
99996	We generalize the methods of Moy-Prasad, in ...	[-7.164341e-05, -0.007114291, -0.008979887, -0...	0812.3870
99997	Reversible logic is experience renewed inter...	[0.03194286, -0.00771745, 0.015977046, -0.0474...	0812.3871
99998	Based on a multiwavelength study, the ISM ar...	[-0.012340169, -0.021712925, 0.00806009, -0.00...	0812.3872
99999	The secrecy capacity region for the K-receiv...	[0.0012416588, 0.0006933478, -0.0057888636, -0...	0812.3873

100000 rows × 3 columns

abstract_df['doi'].values

array(['0704.0001', '0704.0002', '0704.0003', ..., '0812.3871',
       '0812.3872', '0812.3873'], dtype=object)
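
Note that the values in the `doi` column have the form of arXiv identifiers (`YYMM.number`, with an optional version suffix such as `v2`). If you need the submission year and month, a sketch of a parser follows. The function `parse_arxiv_id` is hypothetical (not part of xv) and assumes new-style identifiers only, which is what the data shown here contains.

```python
import re

# New-style arXiv identifier: YYMM.NNNN or YYMM.NNNNN, optionally followed by vN
_NEW_STYLE_ID = re.compile(r'^(\d{2})(\d{2})\.(\d{4,5})(?:v(\d+))?$')


def parse_arxiv_id(arxiv_id: str) -> dict:
    """Split a new-style arXiv identifier into year, month, number and version."""
    match = _NEW_STYLE_ID.match(arxiv_id)
    if match is None:
        raise ValueError(f'Not a new-style arXiv identifier: {arxiv_id!r}')
    yy, mm, number, version = match.groups()
    return {
        'year': 2000 + int(yy),  # new-style identifiers start in 2007
        'month': int(mm),
        'number': int(number),
        'version': int(version) if version else None,
    }
```

For instance, `parse_arxiv_id('0704.0001')` describes paper number 1 of April 2007.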

from xv import arxiv_url

doi = abstract_df['doi'].values[0]
arxiv_url(doi)

'https://arxiv.org/abs/0704.0001'

from xv.data_access import resource_descriptions
resource_descriptions

{'abs': 'Main page of article. Contains links to all other relevant information.',
 'pdf': 'Direct link to article pdf',
 'format': 'Page giving access to other formats',
 'src': 'Access to the original source files submitted by the authors.',
 'cits': 'Tracks citations of the article across various platforms and databases.',
 'html': 'Link to the ar5iv html page for the article.'}

doi = '0704.0001'

for resource, description in resource_descriptions.items():
    print(f"{resource}: {description}")
    print(f"Example: {arxiv_url(doi, resource)}")
    print("")

abs: Main page of article. Contains links to all other relevant information.
Example: https://arxiv.org/abs/0704.0001

pdf: Direct link to article pdf
Example: https://arxiv.org/pdf/0704.0001

format: Page giving access to other formats
Example: https://arxiv.org/format/0704.0001

src: Access to the original source files submitted by the authors.
Example: https://arxiv.org/src/0704.0001

cits: Tracks citations of the article across various platforms and databases.
Example: https://arxiv.org/cits/0704.0001

html: Link to the ar5iv html page for the article.
Example: https://ar5iv.labs.arxiv.org/html/0704.0001

arxiv_url(doi, 'pdf')

'https://arxiv.org/pdf/0704.0001'
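
The URL pattern visible in the outputs above can be reproduced with a few lines of code. This is only a sketch that mirrors the printed examples, not the implementation of `arxiv_url`; the name `build_arxiv_url` and the constants are made up for illustration.

```python
ARXIV_BASE = 'https://arxiv.org'
AR5IV_BASE = 'https://ar5iv.labs.arxiv.org'
KNOWN_RESOURCES = ('abs', 'pdf', 'format', 'src', 'cits', 'html')


def build_arxiv_url(arxiv_id: str, resource: str = 'abs') -> str:
    """Build a URL for an arXiv resource, following the examples printed above."""
    if resource not in KNOWN_RESOURCES:
        raise ValueError(f'Unknown resource: {resource!r}')
    # In the examples, the html pages are served by ar5iv; the rest by arxiv.org
    base = AR5IV_BASE if resource == 'html' else ARXIV_BASE
    return f'{base}/{resource}/{arxiv_id}'
```

Rejecting unknown resource names early is a design choice of this sketch: it avoids silently building a URL that points nowhere.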

pdf_bytes = g[arxiv_url(doi, 'pdf')]

The contents  (~1.647MB) of https://arxiv.org/pdf/0704.0001 are being downloaded...

abstract_df.embeddings.values[0].shape

(768,)
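
Embedding vectors like these are typically compared with cosine similarity. The sketch below uses only the standard library and works on any two equal-length sequences of numbers; the function `cosine_similarity` is written here for illustration and is not part of xv. For large tables, a vectorized numpy version would be the better choice.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError('Vectors must have the same length')
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError('Cosine similarity is undefined for zero vectors')
    return dot / (norm_u * norm_v)
```

With the tables above, one could for example compare the embedding of a title with that of its abstract: `cosine_similarity(titles_df.embeddings.values[0], abstract_df.embeddings.values[0])`.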

Release history

0.1.0 (Feb 6, 2024), this version
0.0.6 (Jan 23, 2024)
0.0.4 (Oct 10, 2022), yanked. Reason: Not working
0.0.3 (Oct 4, 2022), yanked. Reason: Not working properly
0.0.2 (Jan 6, 2021), yanked. Reason: Not working properly