CiteXtract - Bringing structure to the papers on arXiv.
Getting started
To install CiteXtract, run the following command:
pip install citextract
Extracting references
Once installed, references can be extracted from a text with the RefXtract model:
from citextract.models.refxtract import RefXtractor
refxtractor = RefXtractor().load()
text = """This is a test sentence.\n[1] Jacobs, K. 2019. This is a test title. In Proceedings of Some Journal."""
refs = refxtractor(text)
print(refs)
It gives the following output:
['[1] Jacobs, K. 2019. This is a test title. In Proceedings of Some Journal.']
Under the hood, a trained neural network predicts the reference boundaries in the text, and the references are then cut out along these boundaries.
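To make the boundary idea concrete, here is a minimal sketch of the slicing step only. It assumes the network's predictions have already been turned into character offsets for where references start and end. That output format and the helper name slice_references are illustrative assumptions, not CiteXtract's actual internals.

```python
from typing import List, Sequence


def slice_references(text: str, starts: Sequence[int], ends: Sequence[int]) -> List[str]:
    """Pair each predicted start offset with the first end offset after it.

    `starts` and `ends` are hypothetical model outputs: character offsets
    where a reference begins and where it ends (end is exclusive).
    """
    refs = []
    remaining_ends = sorted(ends)
    last_end = 0
    for start in sorted(starts):
        if start < last_end:
            continue  # This start falls inside a reference already emitted.
        # Find the first end boundary that lies after this start.
        end = next((e for e in remaining_ends if e > start), None)
        if end is None:
            break  # No closing boundary left; drop the unmatched start.
        refs.append(text[start:end].strip())
        last_end = end
        # Keep only the end boundaries that have not been consumed yet.
        remaining_ends = [e for e in remaining_ends if e > end]
    return refs


text = "This is a test sentence.\n[1] Jacobs, K. 2019. This is a test title. In Proceedings of Some Journal."
start = text.index("[1]")
print(slice_references(text, [start], [len(text)]))
```

Pairing each start with the nearest following end keeps the sketch tolerant of stray predictions: an unmatched start is dropped rather than producing a reference that runs to the end of the text.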
Extracting titles
Given an extracted reference, its title can be extracted with the TitleXtract model:
from citextract.models.titlextract import TitleXtractor
titlextractor = TitleXtractor().load()
ref = """[1] Jacobs, K. 2019. This is a test title. In Proceedings of Some Journal."""
title = titlextractor(ref)
print(title)
It gives the following output:
'This is a test title.'
Here, a trained neural network extracts the title from the given reference.
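The two models can be chained so that every reference found in a text is mapped to its title. The sketch below assumes, as the example outputs above suggest, that the reference extractor returns a list of strings and the title extractor returns a single string. The function titles_for_references and the two stand-in extractors are hypothetical and only exist so the example runs without downloading the trained models.

```python
from typing import Callable, Dict, List


def titles_for_references(
    text: str,
    extract_refs: Callable[[str], List[str]],
    extract_title: Callable[[str], str],
) -> Dict[str, str]:
    """Map each extracted reference string to its extracted title."""
    return {ref: extract_title(ref) for ref in extract_refs(text)}


# Stand-ins for the trained models (assumptions, not CiteXtract code).
def fake_refs(text: str) -> List[str]:
    # Treat every line that starts with "[" as a reference.
    return [line for line in text.splitlines() if line.startswith("[")]


def fake_title(ref: str) -> str:
    # Take the third ". "-separated field and restore its full stop.
    return ref.split(". ")[2] + "."


text = "This is a test sentence.\n[1] Jacobs, K. 2019. This is a test title. In Proceedings of Some Journal."
print(titles_for_references(text, fake_refs, fake_title))
```

With CiteXtract installed, the loaded refxtractor and titlextractor objects can be passed in place of the stand-ins, provided they behave as assumed.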
Converting an arXiv PDF to text
A utility is also available that takes the URL of an arXiv PDF and converts the PDF to text:
from citextract.utils.pdf import convert_pdf_url_to_text
pdf_url = 'https://arxiv.org/pdf/some_file.pdf'
text = convert_pdf_url_to_text(pdf_url)
print(text)
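When only an arXiv identifier is at hand, the PDF URL has to be built first. The helper below is a small sketch and not part of CiteXtract: the name arxiv_pdf_url is hypothetical, it only handles new-style identifiers such as 1706.03762, and it assumes the https://arxiv.org/pdf/<id>.pdf URL pattern used in the example above.

```python
import re

# New-style arXiv identifiers look like 1706.03762 or 1706.03762v5.
_ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def arxiv_pdf_url(arxiv_id: str) -> str:
    """Build a PDF URL for a new-style arXiv identifier (hypothetical helper)."""
    arxiv_id = arxiv_id.strip()
    if not _ARXIV_ID.match(arxiv_id):
        raise ValueError(f"not a new-style arXiv identifier: {arxiv_id!r}")
    # Assumed URL pattern, matching the example URL in this README.
    return f"https://arxiv.org/pdf/{arxiv_id}.pdf"


print(arxiv_pdf_url("1706.03762"))
```

The resulting string can then be passed to convert_pdf_url_to_text as pdf_url.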
Hashes for citextract-0.0.3-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 06f346c06135af11e72b531597727359114c0054c04a3817ff25c669f6e009fa
MD5 | 296a106fddbb6c8b3c2fbfe259bf9341
BLAKE2b-256 | 681a4da368bc416c7444f410c39de805a5e4eb1ade4c2b546fd613010cc61e2b