A python library/command-line tool to quickly and automatically generate BibTeX data starting from the pdf file of a scientific publication.

pdf2bib

pdf2bib is a Python library/command-line tool to extract bibliographic information from the .pdf file of a publication (or from a folder containing several .pdf files), and automatically generate BibTeX entries. The pdf file can be either a paper published in a scientific journal (i.e. with a DOI associated with it), or an arXiv preprint. The bibliographic information is retrieved by querying public archives, so an internet connection is required.

pdf2bib can be used either from the command line, or inside your python script or, only for Windows, directly from the right-click context menu of a pdf file or a folder.

Latest stable version

The latest stable version of pdf2bib is 1.2. See here for the full change log.

[v1.2] - 2024-06-18

Main changes

  • Added the CLI option -nostore, which allows the user to opt out of the default behaviour of pdf2doi of storing the found identifier in the pdf metadata. When -nostore is added to the CLI invocation of pdf2bib, the pdf files will not be modified by pdf2doi.

Installation

Use the package manager pip to install pdf2bib.

pip install pdf2bib==1.2

Under Windows, it is also possible to add shortcuts to the right-click context menu.

Description

pdf2bib relies on the library pdf2doi, which can automatically find a valid identifier of a scientific publication (i.e. either a DOI or an arxiv ID) starting from a .pdf file. pdf2doi will query public archives (e.g., http://dx.doi.org for DOIs and http://export.arxiv.org for arxiv IDs) in order to validate any identifier found. The validation process returns raw BibTeX data (see also here), which is then used by pdf2bib to generate a valid BibTeX entry in the format

@article{[LastNameFirstAuthor][PublicationYear][FirstWordTitle],
        title = ...,
        volume = ...,
        issue = ...,
        page = ...,
        publisher = ...,
        url = ...,
        doi = ...,
        journal = ...,
        year = ...,
        month = ...,
        author = ...
}

In the current version the format of the BibTeX entry is not customizable by the user (unless you want to change the code - have fun :D), but this functionality will be implemented in future releases.
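
The key format shown above can be illustrated with a minimal sketch. The helper make_bibtex_key is hypothetical (it is not part of the pdf2bib API), and it assumes that the authors are stored as a single string joined by ' and ', as in the examples in this document. The actual implementation inside pdf2bib may differ.

```python
import re


def make_bibtex_key(metadata):
    """Hypothetical helper: build [LastNameFirstAuthor][PublicationYear][FirstWordTitle]."""
    # Assumption: authors are a single string, e.g. 'A B and C D'.
    authors = metadata.get('author') or ''
    first_author = authors.split(' and ')[0].strip()
    last_name = first_author.split()[-1] if first_author else ''

    year = str(metadata.get('year') or '')

    # First alphanumeric word of the title.
    title_words = re.findall(r'[A-Za-z0-9]+', metadata.get('title') or '')
    first_word = title_words[0] if title_words else ''

    # Missing parts are simply skipped, so an entry without authors
    # gets a key that starts with the year.
    return (last_name + year + first_word).lower()
```

With this sketch, an entry without an author field yields a key that starts with the year, which matches the shape of the author-less entry in the example output further below.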

Usage

pdf2bib can be used either as a stand-alone application invoked from the command line, or by importing it in your python project or, only for Windows, directly from the right-click context menu of a pdf file or a folder.

Command line usage

pdf2bib can be invoked directly from the command line, without having to open a python console. The simplest command-line invocation is

pdf2bib 'path/to/target'

where target is either a valid pdf file or a directory containing pdf files. Adding the optional argument '-v' increases the output verbosity, documenting all steps. For example, when targeting the folder examples we get the following output:

pdf2bib examples -v
[pdf2bib]: Looking for pdf files in the folder examples...
[pdf2bib]: Found 4 pdf files.
[pdf2bib]: ................
[pdf2bib]: Trying to extract data to generate the BibTeX entry for the file: examples-s2.0-0021999186900938-main.pdf
[pdf2bib]: Calling pdf2doi...
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: Validating the possible DOI 10.1016/0021-9991(86)90093-8 via a query to dx.doi.org...
[pdf2doi]: The DOI 10.1016/0021-9991(86)90093-8 is validated by dx.doi.org.
[pdf2doi]: A valid DOI was found in the document info labelled '/identifier'.
[pdf2bib]: pdf2doi found a valid identifier for this paper. Trying to parse the data obtained by pdf2doi into valid BibTeX data..
[pdf2bib]: A valid BibTeX entry was generated.
[pdf2bib]: ................
[pdf2bib]: Trying to extract data to generate the BibTeX entry for the file: examples\chaumet_JAP_07.pdf
[pdf2bib]: Calling pdf2doi...
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: Validating the possible DOI 10.1063/1.2409490 via a query to dx.doi.org...
[pdf2doi]: The DOI 10.1063/1.2409490 is validated by dx.doi.org.
[pdf2doi]: A valid DOI was found in the document info labelled '/identifier'.
[pdf2bib]: pdf2doi found a valid identifier for this paper. Trying to parse the data obtained by pdf2doi into valid BibTeX data..
[pdf2bib]: A valid BibTeX entry was generated.
[pdf2bib]: ................
[pdf2bib]: Trying to extract data to generate the BibTeX entry for the file: examples\PhysRevLett.116.061102.pdf
[pdf2bib]: Calling pdf2doi...
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: Validating the possible DOI 10.1103/PhysRevLett.116.061102 via a query to dx.doi.org...
[pdf2doi]: The DOI 10.1103/PhysRevLett.116.061102 is validated by dx.doi.org.
[pdf2doi]: A valid DOI was found in the document info labelled '/identifier'.
[pdf2bib]: pdf2doi found a valid identifier for this paper. Trying to parse the data obtained by pdf2doi into valid BibTeX data..
[pdf2bib]: A valid BibTeX entry was generated.
[pdf2bib]: ................
[pdf2bib]: Trying to extract data to generate the BibTeX entry for the file: examples\s41586-019-1666-5.pdf
[pdf2bib]: Calling pdf2doi...
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: Validating the possible DOI 10.1038/s41586-019-1666-5 via a query to dx.doi.org...
[pdf2doi]: The DOI 10.1038/s41586-019-1666-5 is validated by dx.doi.org.
[pdf2doi]: A valid DOI was found in the document info labelled '/doi'.
[pdf2bib]: pdf2doi found a valid identifier for this paper. Trying to parse the data obtained by pdf2doi into valid BibTeX data..
[pdf2bib]: A valid BibTeX entry was generated.
[pdf2bib]: ................
@article{jordan1986an,
        title = {An efficient numerical evaluation of the Green's function for the Helmholtz operator on periodic structures},
        volume = {63},
        issue = {1},
        page = {222-235},
        publisher = {Elsevier BV},
        url = {http://dx.doi.org/10.1016/0021-9991(86)90093-8},
        doi = {10.1016/0021-9991(86)90093-8},
        journal = {Journal of Computational Physics},
        year = {1986},
        month = {3},
        author = {Kirk E Jordan and Gerard R Richter and Ping Sheng}
}
@article{chaumet2007coupled,
        title = {Coupled dipole method to compute optical torque: Application to a micropropeller},
        volume = {101},
        issue = {2},
        page = {023106},
        publisher = {AIP Publishing},
        url = {http://dx.doi.org/10.1063/1.2409490},
        doi = {10.1063/1.2409490},
        journal = {Journal of Applied Physics},
        year = {2007},
        month = {1},
        author = {Patrick C. Chaumet and C. Billaudeau}
}
@article{2016observation,
        title = {Observation of Gravitational Waves from a Binary Black Hole Merger},
        volume = {116},
        issue = {6},
        publisher = {American Physical Society (APS)},
        url = {http://dx.doi.org/10.1103/PhysRevLett.116.061102},
        doi = {10.1103/physrevlett.116.061102},
        journal = {Physical Review Letters},
        year = {2016},
        month = {2}
}
@article{arute2019quantum,
        title = {Quantum supremacy using a programmable superconducting processor},
        volume = {574},
        issue = {7779},
        page = {505-510},
        publisher = {Springer Science and Business Media LLC},
        url = {http://dx.doi.org/10.1038/s41586-019-1666-5},
        doi = {10.1038/s41586-019-1666-5},
        journal = {Nature},
        year = {2019},
        month = {10},
        author = {Frank Arute and Kunal Arya and Ryan Babbush and Dave Bacon and Joseph C. Bardin and Rami Barends and Rupak Biswas and Sergio Boixo and Fernando G. S. L. Brandao and David A. Buell and Brian Burkett and Yu Chen and Zijun Chen and Ben Chiaro and Roberto Collins and William Courtney and Andrew Dunsworth and Edward Farhi and Brooks Foxen and Austin Fowler and Craig Gidney and Marissa Giustina and Rob Graff and Keith Guerin and Steve Habegger and Matthew P. Harrigan and Michael J. Hartmann and Alan Ho and Markus Hoffmann and Trent Huang and Travis S. Humble and Sergei V. Isakov and Evan Jeffrey and Zhang Jiang and Dvir Kafri and Kostyantyn Kechedzhi and Julian Kelly and Paul V. Klimov and Sergey Knysh and Alexander Korotkov and Fedor Kostritsa and David Landhuis and Mike Lindmark and Erik Lucero and Dmitry Lyakh and Salvatore Mandrà and Jarrod R. McClean and Matthew McEwen and Anthony Megrant and Xiao Mi and Kristel Michielsen and Masoud Mohseni and Josh Mutus and Ofer Naaman and Matthew Neeley and Charles Neill and Murphy Yuezhen Niu and Eric Ostby and Andre Petukhov and John C. Platt and Chris Quintana and Eleanor G. Rieffel and Pedram Roushan and Nicholas C. Rubin and Daniel Sank and Kevin J. Satzinger and Vadim Smelyanskiy and Kevin J. Sung and Matthew D. Trevithick and Amit Vainsencher and Benjamin Villalonga and Theodore White and Z. Jamie Yao and Ping Yeh and Adam Zalcman and Hartmut Neven and John M. Martinis}
}

Every line which begins with '[pdf2doi]' or '[pdf2bib]' is omitted when the optional argument '-v' is absent. It is also possible to store all BibTeX entries in a text file, or in the system clipboard, by using the optional arguments -s FILENAME_BIBTEX and -clip:

pdf2bib examples -s bibtex.txt -clip
All available bibtex entries have been stored in the file bibtex.txt
All available bibtex entries have been stored in the system clipboard

A list of all optional arguments can be generated with pdf2bib --h:

pdf2bib --h
usage: pdf2bib [-h] [-v] [-nostore] [-s FILENAME_BIBTEX] [-clip] [-install--right--click] [-uninstall--right--click]
               [path ...]

Generate BibTeX entries of scientific publications starting from the pdf files. It requires an internet connection.

positional arguments:
  path                  Relative path of the target pdf file or of the targe folder.

options:
  -h, --help            show this help message and exit
  -v, --verbose         Increase verbosity. By default (i.e. when not using -v), only the text of the found bibtex
                        entries will be printed as output.
  -nostore, --no_store_identifier_metadata
                        pdf2bib uses the library pdf2doi to find the DOI/identifier of a publication. By default,
                        anytime an identifier is found, pdf2doi also adds it to the metadata of the pdf file (if not
                        present yet). By using this additional option, the identifier is not stored in the file
                        metadata.
  -s FILENAME_BIBTEX, --make_bibtex_file FILENAME_BIBTEX
                        Create a text file inside the target directory, with name given by FILENAME_BIBTEX, containing
                        the bibtex entry of each pdf file in the target folder (if any is found).
  -clip, --save_bibtex_clipboard
                        Store all found bibtex entries into the clipboard.
  -install--right--click
                        Add a shortcut to pdf2bib in the right-click context menu of Windows. This allows you to copy
                        the bibtex entry of a pdf file (or all pdf files in a folder) into the clipboard by just right
                        clicking on it! NOTE: this feature is only available on Windows.
  -uninstall--right--click
                        Uninstall the right-click context menu functionalities. NOTE: this feature is only available
                        on Windows.

Creating a bib file from a folder

pdf2bib can be used to quickly generate a .bib file containing the BibTeX entries of all pdf files in a target folder, via the command

pdf2bib 'path\to\target\folder' -s bibtex.bib

The generated .bib file can be imported into other software, such as Zotero, to generate bibliographies for, e.g., Microsoft Word.
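
A similar file can also be produced from within Python. The sketch below is only an illustration: write_bib_file is a hypothetical helper, not part of pdf2bib. It assumes that each result is a dictionary with a 'bibtex' key (see the section on usage inside a python script), and that this value may be missing or empty when no entry could be generated.

```python
from pathlib import Path


def write_bib_file(results, filename):
    """Hypothetical helper: write every available BibTeX entry to `filename`.

    Returns the number of entries written.
    """
    # Skip results for which no BibTeX entry is available (assumption:
    # the 'bibtex' value is then None, empty, or absent).
    entries = [r['bibtex'].strip() for r in results if r.get('bibtex')]
    text = '\n\n'.join(entries)
    Path(filename).write_text(text + ('\n' if entries else ''), encoding='utf-8')
    return len(entries)


# Possible usage (requires pdf2bib and an internet connection):
#   import pdf2bib
#   results = pdf2bib.pdf2bib('path/to/target/folder')
#   write_bib_file(results, 'bibtex.bib')
```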

Manually associate the correct identifier to a file from command line

Occasionally, the BibTeX generation will fail (or give wrong results) because the library pdf2doi, which pdf2bib relies on to find a valid publication identifier, fails to retrieve a DOI/identifier or retrieves an incorrect one. This can be fixed by finding the DOI/identifier manually and adding it to the pdf metadata with pdf2doi, as described here. Any future use of pdf2bib on this file will then retrieve the correct BibTeX information.

Usage inside a python script

pdf2bib can also be used as a library within a python script. The function pdf2bib.pdf2bib is the main point of entry. Its first input argument must be a valid path (either absolute or relative) to a pdf file or to a folder containing pdf files. The same settings available from the command line (see above) are available via the methods set and get of the object pdf2bib.config. For example, we can scan the folder examples with reduced output verbosity:

>>> import pdf2bib
>>> pdf2bib.config.set('verbose',False)
>>> path = r'.\examples'
>>> result = pdf2bib.pdf2bib(path)
>>> print(result[0]['metadata'])
>>> print('\n')
>>> print(result[0]['bibtex'])
{'title': "An efficient numerical evaluation of the Green's function for the Helmholtz operator on periodic structures", 'volume': '63', 'issue': '1', 'page': '222-235', 'publisher': 'Elsevier BV', 'url': 'http://dx.doi.org/10.1016/0021-9991(86)90093-8', 'doi': '10.1016/0021-9991(86)90093-8', 'journal': 'Journal of Computational Physics', 'year': 1986, 'month': 3, 'author': 'Kirk E Jordan and Gerard R Richter and Ping Sheng', 'ENTRYTYPE': 'article'}


@article{jordan1986an,
	title = {An efficient numerical evaluation of the Green's function for the Helmholtz operator on periodic structures},
	volume = {63},
	issue = {1},
	page = {222-235},
	publisher = {Elsevier BV},
	url = {http://dx.doi.org/10.1016/0021-9991(86)90093-8},
	doi = {10.1016/0021-9991(86)90093-8},
	journal = {Journal of Computational Physics},
	year = {1986},
	month = {3},
	author = {Kirk E Jordan and Gerard R Richter and Ping Sheng}
}

The output of the function pdf2bib.pdf2bib is a list of dictionaries (or just a single dictionary if a single file was targeted). Each dictionary has the following keys:

result['identifier']        = DOI or other identifier (or None if nothing is found)
result['identifier_type']   = string specifying the type of identifier (e.g. 'doi' or 'arxiv')
result['path']              = path of the pdf file
result['method']            = method used by pdf2doi to find the identifier
result['validation_info']   = Raw BibTeX data.
result['metadata']          = Dictionary containing bibtex info
result['bibtex']            = A string containing a valid bibtex entry
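
Because the output is either a list or a single dictionary, calling code may want to normalize it before looping. The following is a minimal sketch under that assumption. Both helpers are hypothetical and not part of pdf2bib, and they rely only on the 'identifier' and 'path' keys listed above.

```python
def as_result_list(output):
    """Hypothetical helper: always return a list of result dictionaries."""
    if output is None:
        return []
    if isinstance(output, dict):
        # A single file was targeted: wrap the dictionary in a list.
        return [output]
    return list(output)


def failed_paths(output):
    """Paths of the pdf files for which no identifier was found."""
    return [r.get('path') for r in as_result_list(output)
            if r.get('identifier') is None]
```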

The element result['metadata'] is a dictionary containing the most typical BibTeX fields. The specific keys contained in this dictionary, and their format, depend on several factors, such as (1) whether the paper is associated with a DOI or with an arxiv ID, (2) which method was used by pdf2doi to validate the paper identifier, and (3) which data is available for this paper in the relevant archive. When the paper is associated with a DOI, the result['metadata'] dictionary will always contain at least the keys 'title', 'author', 'journal', 'volume', 'issue', 'page', 'publisher', 'url', 'doi', 'year', 'month', although some of them might be empty. When the paper is associated with an arxiv ID, the result['metadata'] dictionary will always contain the keys 'title', 'author', 'ejournal', 'eprint', 'published', 'url', 'doi', 'arxiv_doi', 'year', 'month', 'day', 'ENTRYTYPE'.
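
Since the available keys differ between DOI and arxiv records, and some values may be empty, it is safer to read the dictionary with .get and fallbacks. A minimal sketch follows. The helper describe is hypothetical (not part of pdf2bib), and the fallback strings are arbitrary choices made for this illustration.

```python
def describe(metadata):
    """Hypothetical helper: one-line summary that tolerates missing or empty keys."""
    # DOI records carry 'journal'; arxiv records carry 'ejournal' instead.
    venue = metadata.get('journal') or metadata.get('ejournal') or 'unknown venue'
    title = metadata.get('title') or 'untitled'
    year = metadata.get('year') or 'n.d.'
    return f'{title} ({venue}, {year})'
```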

Manually associate the correct identifier to a file

As described above for the command line, it is also possible to associate a (manually found) identifier with a pdf file from within python, by using the function pdf2doi.add_found_identifier_to_metadata:

>>> import pdf2doi
>>> pdf2doi.add_found_identifier_to_metadata(path_to_pdf_file, identifier)

Installing the shortcuts in the right-click context menu of Windows

This functionality is only available on Windows (and so far it has been tested only on Windows 10). It adds additional commands to the context menu of Windows which appears when right-clicking on a pdf file or on a folder.

The menu commands allow you to copy the BibTeX entry of a pdf file (or of all pdf files contained in a folder) into the system clipboard.

To install this functionality, first install pdf2bib via pip (as described above), then open a command prompt with administrator rights and run

$ pdf2bib  -install--right--click

To remove it, simply run (again from a terminal with administrator rights)

$ pdf2bib  -uninstall--right--click

If it is not possible to run this command from a terminal with administrator rights, the batch files here can be used instead (see the readme.MD file in the same folder for instructions), although administrator rights are still required.

NOTE: when multiple pdf files are selected, and the right-click context menu commands are used, pdf2bib will be called separately for each file, and thus only the BibTeX entry of the last file will be stored in the clipboard. In order to copy the info of multiple files it is necessary to save them in a folder and right-click on the folder.

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Acknowledgment

I am thankful to my friend and colleague Yarden Mazor for leading the beta-testing efforts for this project.

Donating

If you find this library useful (or amazing!), please consider making donations on my behalf to organizations that advocate for and promote free dissemination of science, such as

arXiv

Sci-Hub

Wikipedia

License

MIT
