CiteXtract - Bringing structure to the papers on arXiv.

# CiteXtract

## Getting started

To install CiteXtract, run the following command:

    pip install citextract
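
To confirm that the install succeeded before loading any model, a quick check along the following lines may help. This is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of CiteXtract.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the import system can locate a top-level package."""
    # find_spec returns None when no top-level module with that name is found.
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # Hypothetical usage: check whether the citextract package is importable.
    print("citextract installed:", is_installed("citextract"))
```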


### Extracting references

Once installed, references can be extracted from a text with the RefXtract model:

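    from citextract.models.refxtract import RefXtractor

    # Instantiate the extractor; load() is assumed to fetch the pretrained weights.
    refxtractor = RefXtractor().load()

    text = """This is a test sentence.\n[1] Jacobs, K. 2019. This is a test title. In Proceedings of Some Journal."""
    refs = refxtractor(text)
    print(refs)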


It gives the following output:

    ['[1] Jacobs, K. 2019. This is a test title. In Proceedings of Some Journal.']


Under the hood, a trained neural network predicts the reference boundaries, and the references are then cut out of the text at these boundaries.
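
The boundary-based step can be illustrated without the model. The sketch below assumes that the boundaries arrive as `(start, end)` character offsets, which is an assumption made for illustration and not necessarily how RefXtract represents them; `split_on_boundaries` is a hypothetical helper, not part of the library.

```python
from typing import List, Tuple


def split_on_boundaries(text: str, boundaries: List[Tuple[int, int]]) -> List[str]:
    """Cut one substring per (start, end) character span, skipping empty spans."""
    refs = []
    for start, end in sorted(boundaries):
        snippet = text[start:end].strip()
        if snippet:
            refs.append(snippet)
    return refs


text = (
    "This is a test sentence.\n"
    "[1] Jacobs, K. 2019. This is a test title. In Proceedings of Some Journal."
)

# Stand-in for the model output: one span covering the single reference.
start = text.index("[1]")
boundaries = [(start, len(text))]

print(split_on_boundaries(text, boundaries))
# ['[1] Jacobs, K. 2019. This is a test title. In Proceedings of Some Journal.']
```

In the real model the spans come from the network's predictions; here they are written by hand so that the slicing step can be read in isolation.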

### Extracting titles

Titles can then be extracted from the found references with the TitleXtract model:

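    from citextract.models.titlextract import TitleXtractor

    # Instantiate the extractor; load() is assumed to fetch the pretrained weights.
    titlextractor = TitleXtractor().load()

    ref = """[1] Jacobs, K. 2019. This is a test title. In Proceedings of Some Journal."""
    title = titlextractor(ref)
    print(title)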


It gives the following output:

    'This is a test title.'


Here, a trained neural network extracts the title from the given reference.

### Converting an arXiv PDF to text

A utility is available that takes the URL of an arXiv PDF and converts the PDF to text:

    from citextract.utils.pdf import convert_pdf_url_to_text

    pdf_url = 'https://arxiv.org/pdf/some_file.pdf'
    text = convert_pdf_url_to_text(pdf_url)
    print(text)
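
The pieces in this README can be chained: convert a PDF to text, extract the references, then extract one title per reference. The sketch below shows only the chaining logic. `extract_titles` and the two stand-in extractors are hypothetical and exist so that the example runs without trained models; in practice one would presumably pass the `RefXtractor` and `TitleXtractor` instances, which are called like functions in the examples above.

```python
from typing import Callable, List


def extract_titles(
    text: str,
    ref_extractor: Callable[[str], List[str]],
    title_extractor: Callable[[str], str],
) -> List[str]:
    """Extract references from the text, then one title per reference."""
    return [title_extractor(ref) for ref in ref_extractor(text)]


# Stand-in extractors (NOT the trained models): simple string heuristics.
def fake_ref_extractor(text: str) -> List[str]:
    # Treat every line starting with "[" as a reference.
    return [line for line in text.splitlines() if line.startswith("[")]


def fake_title_extractor(ref: str) -> str:
    # Assume the layout "<authors>. <year>. <title>. <venue>".
    return ref.split(". ")[2] + "."


text = (
    "This is a test sentence.\n"
    "[1] Jacobs, K. 2019. This is a test title. In Proceedings of Some Journal."
)
print(extract_titles(text, fake_ref_extractor, fake_title_extractor))
# ['This is a test title.']
```

Passing the extractors in as arguments keeps the chaining logic independent of how the models are loaded, which also makes it easy to try out with stubs.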

