Small library for extracting references used in scholarly communication.

refextract

About

A library for extracting references used in scholarly communication.

Getting Started

Note: because this library relies on mmap resize functionality, it cannot be installed locally on macOS.

Docker Setup:

Before first use, and whenever a library or dependency is added or changed, build a new Docker image with:

docker build --target refextract-tests -t refextract .

After that, spin up a refextract service with:

docker run -it -v ./tests:/refextract/tests -v ./refextract:/refextract/refextract refextract

Running tests

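Open a shell in the running container. The command below assumes the container is named refextract; if it was started without a name, substitute the name or ID reported by docker ps: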

docker exec -it refextract /bin/bash

Then simply run

pytest .

Usage

To get structured information from a publication reference:

>>> from refextract import extract_journal_reference
>>> reference = extract_journal_reference('J.Phys.,A39,13445')
>>> print(reference)
{
'extra_ibids': [],
'is_ibid': False,
'misc_txt': '',
'page': '13445',
'title': 'J. Phys.',
'type': 'JOURNAL',
'volume': 'A39',
'year': '',
}
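The fields of the returned dictionary can be recombined into a compact citation string. The following is a minimal sketch, not part of refextract: format_journal_reference is a hypothetical helper, and the sample dictionary is copied from the output above so that the snippet runs without the library installed.

```python
def format_journal_reference(ref):
    """Join the non-empty title, volume, year and page fields into one string.

    Hypothetical helper: it only assumes the dictionary keys shown in the
    example output above.
    """
    parts = [ref.get("title", ""), ref.get("volume", "")]
    if ref.get("year"):
        # Wrap the year in parentheses, mirroring "13 (1964) 321" style.
        parts.append("({})".format(ref["year"]))
    parts.append(ref.get("page", ""))
    # Drop empty fields so no double spaces appear in the result.
    return " ".join(p for p in parts if p)


# Sample copied from the extract_journal_reference output above.
sample = {
    "extra_ibids": [],
    "is_ibid": False,
    "misc_txt": "",
    "page": "13445",
    "title": "J. Phys.",
    "type": "JOURNAL",
    "volume": "A39",
    "year": "",
}

print(format_journal_reference(sample))
```

Because year is empty in this example, it is simply skipped rather than rendered as empty parentheses.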

To extract references from a PDF:

>>> from refextract import extract_references_from_file
>>> references = extract_references_from_file('1503.07589.pdf')
>>> print(references[0])
{
'author': ['F. Englert and R. Brout'],
'doi': ['doi:10.1103/PhysRevLett.13.321'],
'journal_page': ['321'],
'journal_reference': ['Phys. Rev. Lett. 13 (1964) 321'],
'journal_title': ['Phys. Rev. Lett.'],
'journal_volume': ['13'],
'journal_year': ['1964'],
'linemarker': ['1'],
'raw_ref': ['[1] F. Englert and R. Brout, \u201cBroken symmetry and the mass of gauge vector mesons\u201d, Phys. Rev. Lett. 13 (1964) 321, doi:10.1103/PhysRevLett.13.321.'],
'texkey': ['Englert:1964et'],
'year': ['1964'],
}
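In the output above every field is a list of strings, so downstream code usually needs to unwrap values. The sketch below shows one possible way to do that and to collect DOIs from a list of references. The helpers first and collect_dois are hypothetical names, not refextract functions, and sample_ref is an abbreviated copy of the entry printed above so that the snippet runs without the library or the PDF.

```python
def first(ref, key, default=None):
    """Return the first value of a list-valued field, or a default."""
    values = ref.get(key) or []
    return values[0] if values else default


def collect_dois(references):
    """Collect unique DOIs, in order, without the leading 'doi:' prefix."""
    dois = []
    for ref in references:
        for doi in ref.get("doi", []):
            if doi.lower().startswith("doi:"):
                doi = doi[len("doi:"):]
            if doi not in dois:
                dois.append(doi)
    return dois


# Abbreviated copy of references[0] from the example output above.
sample_ref = {
    "author": ["F. Englert and R. Brout"],
    "doi": ["doi:10.1103/PhysRevLett.13.321"],
    "journal_title": ["Phys. Rev. Lett."],
    "journal_volume": ["13"],
    "journal_year": ["1964"],
    "journal_page": ["321"],
}

print(first(sample_ref, "author"))
print(collect_dois([sample_ref]))
```

With the real library, the list returned by extract_references_from_file could presumably be passed to collect_dois in place of [sample_ref], assuming each entry has the shape shown above.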

To extract directly from a URL:

>>> from refextract import extract_references_from_url
>>> references = extract_references_from_url('https://arxiv.org/pdf/1503.07589.pdf')
>>> print(references[0])
{
'author': ['F. Englert and R. Brout'],
'doi': ['doi:10.1103/PhysRevLett.13.321'],
'journal_page': ['321'],
'journal_reference': ['Phys. Rev. Lett. 13 (1964) 321'],
'journal_title': ['Phys. Rev. Lett.'],
'journal_volume': ['13'],
'journal_year': ['1964'],
'linemarker': ['1'],
'raw_ref': ['[1] F. Englert and R. Brout, \u201cBroken symmetry and the mass of gauge vector mesons\u201d, Phys. Rev. Lett. 13 (1964) 321, doi:10.1103/PhysRevLett.13.321.'],
'texkey': ['Englert:1964et'],
'year': ['1964'],
}

Notes

refextract depends on pdftotext.
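Since pdftotext is an external program, it can be useful to check for it before attempting PDF extraction. The snippet below is a minimal sketch using only the Python standard library. It assumes pdftotext is expected as an executable on the PATH, which this document does not state explicitly, and pdftotext_available is a hypothetical helper name.

```python
import shutil


def pdftotext_available():
    """Return True if a 'pdftotext' executable is found on the PATH.

    Assumption: the dependency is looked up by this name on the PATH.
    """
    return shutil.which("pdftotext") is not None


if pdftotext_available():
    print("pdftotext found")
else:
    print("pdftotext not found on PATH; PDF extraction may fail")
```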

Acknowledgments

refextract is based on code and ideas from the following people, who contributed to the docextract module in Invenio:

  • Alessio Deiana
  • Federico Poli
  • Gerrit Rindermann
  • Graham R. Armstrong
  • Grzegorz Szpura
  • Jan Aage Lavik
  • Javier Martin Montull
  • Micha Moskovic
  • Samuele Kaplun
  • Thorsten Schwander
  • Tibor Simko

License

GPLv2
