Small library for extracting references used in scholarly communication.

Originally exported from Invenio https://github.com/inveniosoftware/invenio.

Dependencies

Installation

pip install refextract

Usage

To get structured info from a publication reference:

from refextract import extract_journal_reference
reference = extract_journal_reference("J.Phys.,A39,13445")
print(reference)
{
    'extra_ibids': [],
    'is_ibid': False,
    'misc_txt': u'',
    'page': u'13445',
    'title': u'J. Phys.',
    'type': 'JOURNAL',
    'volume': u'A39',
    'year': ''
}
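
The returned dictionary can be turned back into a compact citation string. The following is a minimal sketch that works on the dictionary shape shown above. Note that format_journal_reference is a hypothetical helper, not part of refextract, and the sample dictionary is copied from the output above rather than produced by calling the library.

```python
def format_journal_reference(reference):
    """Join title, volume, page and (if present) year into one string.

    Hypothetical helper: it only assumes the keys shown in the sample
    output ('title', 'volume', 'page', 'year').
    """
    parts = [reference["title"], reference["volume"], reference["page"]]
    if reference.get("year"):
        parts.append("({})".format(reference["year"]))
    # Skip empty fields so that missing values do not leave double spaces.
    return " ".join(part for part in parts if part)


# Sample copied from the documented output (Python 3 string literals).
sample = {
    "extra_ibids": [],
    "is_ibid": False,
    "misc_txt": "",
    "page": "13445",
    "title": "J. Phys.",
    "type": "JOURNAL",
    "volume": "A39",
    "year": "",
}

print(format_journal_reference(sample))
```

Because the year is empty in this sample, only the title, volume and page appear in the result.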

To extract references from a publication full-text PDF:

from refextract import extract_references_from_file
references = extract_references_from_file("some/fulltext/1503.07589v1.pdf")
print(references)
[
    {'author': [u'F. Englert and R. Brout'],
     'doi': [u'10.1103/PhysRevLett.13.321'],
     'journal_page': [u'321'],
     'journal_reference': ['Phys.Rev.Lett.,13,1964'],
     'journal_title': [u'Phys.Rev.Lett.'],
     'journal_volume': [u'13'],
     'journal_year': [u'1964'],
     'linemarker': [u'1'],
     'title': [u'Broken symmetry and the mass of gauge vector mesons'],
     'year': [u'1964']}, ...
]

You can also extract directly from a URL:

from refextract import extract_references_from_url
references = extract_references_from_url("http://arxiv.org/pdf/1503.07589v1.pdf")
print(references)
[
    {'author': [u'F. Englert and R. Brout'],
     'doi': [u'10.1103/PhysRevLett.13.321'],
     'journal_page': [u'321'],
     'journal_reference': ['Phys.Rev.Lett.,13,1964'],
     'journal_title': [u'Phys.Rev.Lett.'],
     'journal_volume': [u'13'],
     'journal_year': [u'1964'],
     'linemarker': [u'1'],
     'title': [u'Broken symmetry and the mass of gauge vector mesons'],
     'year': [u'1964']}, ...
]
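
Since every field in an extracted reference is a list, post-processing usually means flattening those lists. The sketch below gathers all DOIs from a list shaped like the output above. Note that collect_dois is a hypothetical helper, not a refextract function; the first sample entry is abridged from the documented output and the second is invented to illustrate a reference without a DOI, which is assumed to be possible.

```python
def collect_dois(references):
    """Return every DOI found in a list of extracted references, in order."""
    dois = []
    for reference in references:
        # .get() tolerates references that carry no 'doi' key
        # (assumed possible; the documented sample always has one).
        dois.extend(reference.get("doi", []))
    return dois


sample_references = [
    # Abridged from the documented output.
    {
        "author": ["F. Englert and R. Brout"],
        "doi": ["10.1103/PhysRevLett.13.321"],
        "journal_reference": ["Phys.Rev.Lett.,13,1964"],
        "linemarker": ["1"],
    },
    # Hypothetical entry without a DOI, for illustration only.
    {"linemarker": ["2"]},
]

print(collect_dois(sample_references))
```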

Changes

Version 0.2.5 (2018-03-13)

  • Handle all exceptions when extracting TeXkeys.

Version 0.2.4 (2018-02-28)

  • Remove GarbageFullTextError.

Version 0.2.3 (2017-12-19)

  • Handle all possible errors thrown by PyPDF2.

  • Fix normalization of CLIC report numbers.

Version 0.2.2 (2017-07-17)

  • Handle PyPDF2 internal errors.

Version 0.2.1 (2017-07-02)

  • Named destinations may not always have left and top coordinates. This case is now handled gracefully: no TeXkeys are returned by extract_texkeys_from_pdf instead of raising an uncaught exception.

  • Make CFG_PATH_GFILE and CFG_PATH_PDFTOTEXT configurable through environment variables, falling back on the output of the which command, to allow for easier containerization.
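
The fallback described in the last item can be sketched roughly as follows. This is not the library's actual code: resolve_executable is a hypothetical helper, shutil.which stands in for calling the which command, and the program name "file" for CFG_PATH_GFILE is an assumption.

```python
import os
import shutil


def resolve_executable(env_var, program):
    """Return the path set in env_var, else the first match on PATH.

    Returns None when the variable is unset (or empty) and the program
    cannot be found on PATH.
    """
    return os.environ.get(env_var) or shutil.which(program)


# Variable names are from the changelog; the program names are assumed.
CFG_PATH_PDFTOTEXT = resolve_executable("CFG_PATH_PDFTOTEXT", "pdftotext")
CFG_PATH_GFILE = resolve_executable("CFG_PATH_GFILE", "file")
```

In a container image, setting CFG_PATH_PDFTOTEXT explicitly would then take precedence over whatever happens to be on PATH.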

Version 0.2.0 (2017-06-26)

  • Substantial rewrite of the API. In particular:

    • extract_references_from_file, extract_references_from_string, and extract_references_from_url now return a list of the references, instead of an object with keys stats and references.

    • If the number of TeXkeys that were extracted from the PDF metadata matches the number of references parsed by RefExtract, an extra texkey field is added to each returned reference.

    • The API now raises exceptions when it encounters an unrecoverable error.

    • Finally, the API now returns the list of raw references on which refextract worked.
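
The TeXkey rule above (attach keys only when the counts match) can be illustrated with a short sketch. This is an assumption-laden illustration rather than the library's implementation: attach_texkeys is a hypothetical helper, and storing the key as a one-element list is assumed by analogy with the other fields in the sample output.

```python
def attach_texkeys(references, texkeys):
    """Add a 'texkey' field to each reference if the counts line up.

    When the number of keys differs from the number of references, the
    pairing would be ambiguous, so the references are returned unchanged.
    """
    if len(texkeys) != len(references):
        return references
    # Build new dicts so the caller's input is not mutated.
    return [dict(ref, texkey=[key]) for ref, key in zip(references, texkeys)]


# Hypothetical inputs, for illustration only.
refs = [{"linemarker": ["1"]}, {"linemarker": ["2"]}]
print(attach_texkeys(refs, ["Englert:1964et", "Higgs:1964ia"]))
print(attach_texkeys(refs, ["Englert:1964et"]))  # count mismatch: unchanged
```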

Version 0.1.0 (2016-01-12)
