CiteXtract - Bringing structure to the papers on arXiv.
Getting started
To install CiteXtract, run the following command:
pip install citextract
Extracting references
References can then be extracted from a text using the RefXtract model:
from citextract.models.refxtract import RefXtractor
refxtractor = RefXtractor().load()
text = """This is a test sentence.\n[1] Jacobs, K. 2019. This is a test title. In Proceedings of Some Journal."""
refs = refxtractor(text)
print(refs)
It gives the following output:
['[1] Jacobs, K. 2019. This is a test title. In Proceedings of Some Journal.']
Under the hood, a trained neural network predicts the reference boundaries, and the references are then extracted from the text using these boundaries.
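To make the boundary idea concrete, here is a minimal sketch of the second step only: cutting references out of a text once the boundaries are known. The `slice_references` helper and the `(start, end)` character-offset representation are assumptions made for illustration; they are not part of the CiteXtract API, and the boundaries are hard-coded here instead of being predicted by a model.

```python
from typing import List, Tuple


def slice_references(text: str, boundaries: List[Tuple[int, int]]) -> List[str]:
    """Cut one substring per (start, end) pair out of text.

    Hypothetical helper: the offsets stand in for whatever boundaries
    a trained model would predict. The end offset is exclusive.
    """
    references = []
    for start, end in sorted(boundaries):
        snippet = text[start:end].strip()
        if snippet:  # skip empty or whitespace-only spans
            references.append(snippet)
    return references


text = "This is a test sentence.\n[1] Jacobs, K. 2019. This is a test title. In Proceedings of Some Journal."
# Hard-coded boundary for the single reference in the example text.
start = text.index("[1]")
print(slice_references(text, [(start, len(text))]))
```

Keeping the slicing step separate from the prediction step means the boundaries can be inspected or corrected before any text is cut.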
Extracting titles
Titles can then be extracted from the found references using the TitleXtract model:
from citextract.models.titlextract import TitleXtractor
titlextractor = TitleXtractor().load()
ref = """[1] Jacobs, K. 2019. This is a test title. In Proceedings of Some Journal."""
title = titlextractor(ref)
print(title)
It gives the following output:
'This is a test title.'
Here, a trained neural network extracts the title from the given reference.
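The two models can be chained to go from raw text to a list of titles. The sketch below assumes only what the examples above show: the reference extractor is called with a string and returns a list of strings, and the title extractor is called with one reference and returns a string. The `extract_titles` function and both stand-in extractors are hypothetical; the stand-ins use toy rules so the example runs without the trained models.

```python
from typing import Callable, List


def extract_titles(
    text: str,
    ref_extractor: Callable[[str], List[str]],
    title_extractor: Callable[[str], str],
) -> List[str]:
    """Return one title per reference found in text (hypothetical helper)."""
    return [title_extractor(ref) for ref in ref_extractor(text)]


def fake_ref_extractor(text: str) -> List[str]:
    # Toy rule: every line that starts with "[" counts as a reference.
    return [line for line in text.splitlines() if line.startswith("[")]


def fake_title_extractor(ref: str) -> str:
    # Toy rule: take the third ". "-separated part of the reference.
    parts = ref.split(". ")
    return parts[2] + "."


text = "This is a test sentence.\n[1] Jacobs, K. 2019. This is a test title. In Proceedings of Some Journal."
titles = extract_titles(text, fake_ref_extractor, fake_title_extractor)
print(titles)
```

With the package installed, the stand-ins would be replaced by the loaded models, for example `extract_titles(text, RefXtractor().load(), TitleXtractor().load())`.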
Converting an arXiv PDF to text
A utility is available that takes the URL of an arXiv PDF and converts the PDF to text:
from citextract.utils.pdf import convert_pdf_url_to_text
pdf_url = 'https://arxiv.org/pdf/some_file.pdf'
text = convert_pdf_url_to_text(pdf_url)
print(text)
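When only an arXiv identifier is at hand, the PDF URL can be built first. The following is a minimal sketch, assuming the `https://arxiv.org/pdf/<id>.pdf` pattern suggested by the example URL above and new-style identifiers such as `1706.03762`. The `arxiv_pdf_url` helper is hypothetical and not part of CiteXtract.

```python
import re

# New-style arXiv identifiers: four digits, a dot, four or five digits,
# and an optional version suffix such as "v2".
_ARXIV_ID = re.compile(r"\d{4}\.\d{4,5}(v\d+)?")


def arxiv_pdf_url(arxiv_id: str) -> str:
    """Build a PDF URL for a new-style arXiv identifier (hypothetical helper)."""
    arxiv_id = arxiv_id.strip()
    if not _ARXIV_ID.fullmatch(arxiv_id):
        raise ValueError(f"not a new-style arXiv identifier: {arxiv_id!r}")
    # Assumed URL pattern, mirroring the example URL in this section.
    return f"https://arxiv.org/pdf/{arxiv_id}.pdf"


pdf_url = arxiv_pdf_url("1706.03762")
print(pdf_url)
```

The resulting `pdf_url` can then be passed to `convert_pdf_url_to_text` as in the example above.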