A python library/command-line tool to quickly and automatically generate BibTeX data starting from the pdf file of a scientific publication.
# pdf2bib
```pdf2bib``` is a Python library/command-line tool to extract bibliographic information from the .pdf file of a publication
(or from a folder containing several .pdf files), and automatically generate BibTeX entries. The pdf file can be either a paper published in a scientific journal (i.e. with
a DOI associated with it), or an [arXiv](https://arxiv.org/about/donate) preprint.
```pdf2bib``` can be used from the [command line](#command-line-usage), inside your [Python script](#usage-inside-a-python-script) or, on Windows only, directly from the [right-click context menu](#installing-the-shortcuts-in-the-right-click-context-menu-of-windows) of a pdf file or a folder.
## Installation
Use the package manager pip to install pdf2bib.
```bash
pip install pdf2bib==1.0
```
Under Windows, it is also possible to add [shortcuts to the right-click context menu](#installing-the-shortcuts-in-the-right-click-context-menu-of-windows).
<!--
<img src="docs/ContextMenu_pdf.gif" width="500" />
[![Downloads](https://pepy.tech/badge/pdf2doi)](https://pepy.tech/project/pdf2doi?versions=0.4&versions=0.5&versions=0.6)[![Downloads](https://pepy.tech/badge/pdf2doi/month)](https://pepy.tech/project/pdf2doi?versions=0.4&versions=0.5&versions=0.6)
[![Pip Package](https://img.shields.io/pypi/v/pdf2doi?logo=PyPI)](https://pypi.org/project/pdf2doi)
-->
## Table of Contents
- [Installation](#installation)
- [Description](#description)
- [Usage](#usage)
  * [Command line usage](#command-line-usage)
    + [Creating a bib file from a folder](#creating-a-bib-file-from-a-folder)
    + [Manually associate the correct identifier to a file from command line](#manually-associate-the-correct-identifier-to-a-file-from-command-line)
  * [Usage inside a python script](#usage-inside-a-python-script)
    + [Manually associate the correct identifier to a file](#manually-associate-the-correct-identifier-to-a-file)
- [Installing the shortcuts in the right-click context menu of Windows](#installing-the-shortcuts-in-the-right-click-context-menu-of-windows)
- [Contributing](#contributing)
- [License](#license)
- [Acknowledgment](#acknowledgment)
- [Donating](#donating)
## Description
```pdf2bib``` relies on the library [pdf2doi](https://github.com/MicheleCotrufo/pdf2doi), which can automatically find a valid identifier of a scientific publication (i.e. either a DOI or an arXiv ID)
starting from a .pdf file. The identifier is also validated by querying public archives (e.g., http://dx.doi.org for DOIs and http://export.arxiv.org for arXiv IDs).
The validation process returns raw BibTeX data (see also [here](https://github.com/MicheleCotrufo/pdf2doi#usage-inside-a-python-script)), which is then used by
```pdf2bib``` to generate a valid BibTeX entry in the format
```
@article{[LastNameFirstAuthor][PublicationYear][FirstWordTitle],
title = ...,
volume = ...,
issue = ...,
page = ...,
publisher = ...,
url = ...,
doi = ...,
journal = ...,
year = ...,
month = ...,
author = ...
}
```
In the current version the format of the BibTeX entry is not customizable by the user (unless you want to change the code - have fun :D),
but this functionality will be implemented in future releases.
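As an illustration of the entry format above, here is a minimal sketch that builds the key and the entry text from a metadata dictionary. It is not the code used by ```pdf2bib```: the helper names ```make_key``` and ```format_entry``` are hypothetical, and the lowercasing and last-name rules are assumptions inferred from the example keys shown in this README (e.g. ```jordan1986an```).

```python
import re

# Field order taken from the entry template shown above.
FIELD_ORDER = ["title", "volume", "issue", "page", "publisher",
               "url", "doi", "journal", "year", "month", "author"]


def make_key(metadata):
    """Build a [LastNameFirstAuthor][PublicationYear][FirstWordTitle] key (hypothetical helper)."""
    # Authors are assumed to be joined by ' and ', as in the example entries.
    first_author = metadata.get("author", "").split(" and ")[0].strip()
    # Assumption: the last name is the last whitespace-separated token.
    last_name = first_author.split()[-1] if first_author else ""
    year = str(metadata.get("year", ""))
    title_words = metadata.get("title", "").split()
    first_word = title_words[0] if title_words else ""
    # Drop punctuation and lowercase everything, as in the example keys.
    return re.sub(r"[\W_]", "", (last_name + year + first_word).lower())


def format_entry(metadata):
    """Render the metadata as a BibTeX entry string (hypothetical helper)."""
    entry_type = metadata.get("ENTRYTYPE", "article")
    # Fields that are missing from the metadata are simply skipped.
    fields = [f"    {name} = {{{metadata[name]}}}"
              for name in FIELD_ORDER if name in metadata]
    return "@" + entry_type + "{" + make_key(metadata) + ",\n" + ",\n".join(fields) + "\n}"


example = {
    "title": "Coupled dipole method to compute optical torque: Application to a micropropeller",
    "volume": "101",
    "journal": "Journal of Applied Physics",
    "year": 2007,
    "author": "Patrick C. Chaumet and C. Billaudeau",
    "ENTRYTYPE": "article",
}
print(format_entry(example))
```

When the author field is missing, this sketch produces a key made of the year and the first title word only, which matches the ```2016observation``` key in the example output further down.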
## Usage
```pdf2bib``` can be used as a [stand-alone application](#command-line-usage) invoked from the command line, by [importing it in your Python project](#usage-inside-a-python-script) or, on Windows only,
directly from the [right-click context menu](#installing-the-shortcuts-in-the-right-click-context-menu-of-windows) of a pdf file or a folder.
### Command line usage
```pdf2bib``` can be invoked directly from the command line, without having to open a Python console.
The simplest command-line invocation is
```bash
pdf2bib 'path/to/target'
```
where ```target``` is either a valid pdf file or a directory containing pdf files. Adding the optional argument '-v' increases the output verbosity,
documenting all steps.
For example, when targeting the folder [examples](/examples) we get the following output:
```bash
pdf2bib examples -v
[pdf2bib]: Looking for pdf files in the folder examples...
[pdf2bib]: Found 4 pdf files.
[pdf2bib]: ................
[pdf2bib]: Trying to extract data to generate the BibTeX entry for the file: examples\1-s2.0-0021999186900938-main.pdf
[pdf2bib]: Calling pdf2doi...
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: Validating the possible DOI 10.1016/0021-9991(86)90093-8 via a query to dx.doi.org...
[pdf2doi]: The DOI 10.1016/0021-9991(86)90093-8 is validated by dx.doi.org.
[pdf2doi]: A valid DOI was found in the document info labelled '/identifier'.
[pdf2bib]: pdf2doi found a valid identifier for this paper. Trying to parse the data obtained by pdf2doi into valid BibTeX data..
[pdf2bib]: A valid BibTeX entry was generated.
[pdf2bib]: ................
[pdf2bib]: Trying to extract data to generate the BibTeX entry for the file: examples\chaumet_JAP_07.pdf
[pdf2bib]: Calling pdf2doi...
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: Validating the possible DOI 10.1063/1.2409490 via a query to dx.doi.org...
[pdf2doi]: The DOI 10.1063/1.2409490 is validated by dx.doi.org.
[pdf2doi]: A valid DOI was found in the document info labelled '/identifier'.
[pdf2bib]: pdf2doi found a valid identifier for this paper. Trying to parse the data obtained by pdf2doi into valid BibTeX data..
[pdf2bib]: A valid BibTeX entry was generated.
[pdf2bib]: ................
[pdf2bib]: Trying to extract data to generate the BibTeX entry for the file: examples\PhysRevLett.116.061102.pdf
[pdf2bib]: Calling pdf2doi...
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: Validating the possible DOI 10.1103/PhysRevLett.116.061102 via a query to dx.doi.org...
[pdf2doi]: The DOI 10.1103/PhysRevLett.116.061102 is validated by dx.doi.org.
[pdf2doi]: A valid DOI was found in the document info labelled '/identifier'.
[pdf2bib]: pdf2doi found a valid identifier for this paper. Trying to parse the data obtained by pdf2doi into valid BibTeX data..
[pdf2bib]: A valid BibTeX entry was generated.
[pdf2bib]: ................
[pdf2bib]: Trying to extract data to generate the BibTeX entry for the file: examples\s41586-019-1666-5.pdf
[pdf2bib]: Calling pdf2doi...
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: Validating the possible DOI 10.1038/s41586-019-1666-5 via a query to dx.doi.org...
[pdf2doi]: The DOI 10.1038/s41586-019-1666-5 is validated by dx.doi.org.
[pdf2doi]: A valid DOI was found in the document info labelled '/doi'.
[pdf2bib]: pdf2doi found a valid identifier for this paper. Trying to parse the data obtained by pdf2doi into valid BibTeX data..
[pdf2bib]: A valid BibTeX entry was generated.
[pdf2bib]: ................
@article{jordan1986an,
title = {An efficient numerical evaluation of the Green's function for the Helmholtz operator on periodic structures},
volume = {63},
issue = {1},
page = {222-235},
publisher = {Elsevier BV},
url = {http://dx.doi.org/10.1016/0021-9991(86)90093-8},
doi = {10.1016/0021-9991(86)90093-8},
journal = {Journal of Computational Physics},
year = {1986},
month = {3},
author = {Kirk E Jordan and Gerard R Richter and Ping Sheng}
}
@article{chaumet2007coupled,
title = {Coupled dipole method to compute optical torque: Application to a micropropeller},
volume = {101},
issue = {2},
page = {023106},
publisher = {AIP Publishing},
url = {http://dx.doi.org/10.1063/1.2409490},
doi = {10.1063/1.2409490},
journal = {Journal of Applied Physics},
year = {2007},
month = {1},
author = {Patrick C. Chaumet and C. Billaudeau}
}
@article{2016observation,
title = {Observation of Gravitational Waves from a Binary Black Hole Merger},
volume = {116},
issue = {6},
publisher = {American Physical Society (APS)},
url = {http://dx.doi.org/10.1103/PhysRevLett.116.061102},
doi = {10.1103/physrevlett.116.061102},
journal = {Physical Review Letters},
year = {2016},
month = {2}
}
@article{arute2019quantum,
title = {Quantum supremacy using a programmable superconducting processor},
volume = {574},
issue = {7779},
page = {505-510},
publisher = {Springer Science and Business Media LLC},
url = {http://dx.doi.org/10.1038/s41586-019-1666-5},
doi = {10.1038/s41586-019-1666-5},
journal = {Nature},
year = {2019},
month = {10},
author = {Frank Arute and Kunal Arya and Ryan Babbush and Dave Bacon and Joseph C. Bardin and Rami Barends and Rupak Biswas and Sergio Boixo and Fernando G. S. L. Brandao and David A. Buell and Brian Burkett and Yu Chen and Zijun Chen and Ben Chiaro and Roberto Collins and William Courtney and Andrew Dunsworth and Edward Farhi and Brooks Foxen and Austin Fowler and Craig Gidney and Marissa Giustina and Rob Graff and Keith Guerin and Steve Habegger and Matthew P. Harrigan and Michael J. Hartmann and Alan Ho and Markus Hoffmann and Trent Huang and Travis S. Humble and Sergei V. Isakov and Evan Jeffrey and Zhang Jiang and Dvir Kafri and Kostyantyn Kechedzhi and Julian Kelly and Paul V. Klimov and Sergey Knysh and Alexander Korotkov and Fedor Kostritsa and David Landhuis and Mike Lindmark and Erik Lucero and Dmitry Lyakh and Salvatore Mandrà and Jarrod R. McClean and Matthew McEwen and Anthony Megrant and Xiao Mi and Kristel Michielsen and Masoud Mohseni and Josh Mutus and Ofer Naaman and Matthew Neeley and Charles Neill and Murphy Yuezhen Niu and Eric Ostby and Andre Petukhov and John C. Platt and Chris Quintana and Eleanor G. Rieffel and Pedram Roushan and Nicholas C. Rubin and Daniel Sank and Kevin J. Satzinger and Vadim Smelyanskiy and Kevin J. Sung and Matthew D. Trevithick and Amit Vainsencher and Benjamin Villalonga and Theodore White and Z. Jamie Yao and Ping Yeh and Adam Zalcman and Hartmut Neven and John M. Martinis}
}
```
Every line which begins with '[pdf2doi]' or '[pdf2bib]' is omitted when the optional argument '-v' is absent. It is also possible to store all BibTeX entries in
a text file, or in the system clipboard, by using the optional arguments ```-s FILENAME_BIBTEX``` and ```-clip```:
```bash
pdf2bib examples -s bibtex.txt -clip
All available bibtex entries have been stored in the file bibtex.txt
All available bibtex entries have been stored in the system clipboard
```
A list of all optional arguments is printed by ```pdf2bib --h```:
```bash
pdf2bib --h
usage: pdf2bib [-h] [-v] [-s FILENAME_BIBTEX] [-clip] [-install--right--click] [-uninstall--right--click]
[path [path ...]]
Generate BibTeX entries of scientific publications starting from the pdf files. It requires an internet connection.
positional arguments:
path Relative path of the target pdf file or of the targe folder.
optional arguments:
-h, --help show this help message and exit
-v, --verbose Increase verbosity. By default (i.e. when not using -v), only the text of the found bibtex
entries will be printed as output.
-s FILENAME_BIBTEX, --make_bibtex_file FILENAME_BIBTEX
Create a text file inside the target directory, with name given by FILENAME_BIBTEX, containing
the bibtex entry of each pdf file in the target folder (if any is found).
-clip, --save_bibtex_clipboard
Store all found bibtex entries into the clipboard.
-install--right--click
Add a shortcut to pdf2bib in the right-click context menu of Windows. This allows you to copy
the bibtex entry of a pdf file (or all pdf files in a folder) into the clipboard by just right
clicking on it! NOTE: this feature is only available on Windows.
-uninstall--right--click
Uninstall the right-click context menu functionalities. NOTE: this feature is only available
on Windows.
```
#### Creating a bib file from a folder
```pdf2bib``` can be used to quickly generate a .bib file containing the BibTeX entries of all pdf files in a target folder, via the command
```bash
pdf2bib 'path/to/target/folder' -s bibtex.bib
```
The generated .bib file can be imported into other software, such as Zotero, to generate bibliographies for, e.g., Microsoft Word.
#### Manually associate the correct identifier to a file from command line
Occasionally, the BibTeX generation process will fail (or give wrong results) because the library ```pdf2doi``` (which ```pdf2bib``` relies on to find a valid publication identifier)
fails to retrieve a DOI/identifier, or retrieves an incorrect one. This problem can be fixed
by looking up the DOI/identifier manually and adding it to the pdf metadata with ```pdf2doi```, as described [here](https://github.com/MicheleCotrufo/pdf2doi#manually-associate-the-correct-identifier-to-a-file-from-command-line).
After that, any future use of ```pdf2bib``` on this file will retrieve the correct BibTeX data.
### Usage inside a python script
```pdf2bib``` can also be used as a library within a python script. The function ```pdf2bib.pdf2bib``` is the main point of entry.
The first input argument must be a valid path (either absolute or relative) to a pdf file or to a folder containing pdf files.
The same settings available in the command line operation (see above) are available via the methods ```set``` and ```get``` of the object ```pdf2bib.config```.
For example, we can scan the folder [examples](/examples) with reduced output verbosity:
```python
>>> import pdf2bib
>>> pdf2bib.config.set('verbose',False)
>>> path = r'.\examples'
>>> result = pdf2bib.pdf2bib(path)
>>> print(result[0]['metadata'])
>>> print('\n')
>>> print(result[0]['bibtex'])
{'title': "An efficient numerical evaluation of the Green's function for the Helmholtz operator on periodic structures", 'volume': '63', 'issue': '1', 'page': '222-235', 'publisher': 'Elsevier BV', 'url': 'http://dx.doi.org/10.1016/0021-9991(86)90093-8', 'doi': '10.1016/0021-9991(86)90093-8', 'journal': 'Journal of Computational Physics', 'year': 1986, 'month': 3, 'author': 'Kirk E Jordan and Gerard R Richter and Ping Sheng', 'ENTRYTYPE': 'article'}
@article{jordan1986an,
title = {An efficient numerical evaluation of the Green's function for the Helmholtz operator on periodic structures},
volume = {63},
issue = {1},
page = {222-235},
publisher = {Elsevier BV},
url = {http://dx.doi.org/10.1016/0021-9991(86)90093-8},
doi = {10.1016/0021-9991(86)90093-8},
journal = {Journal of Computational Physics},
year = {1986},
month = {3},
author = {Kirk E Jordan and Gerard R Richter and Ping Sheng}
}
```
The output of the function ```pdf2bib.pdf2bib``` is a list of dictionaries (or just a single dictionary if a single file was targeted).
Each dictionary has the following keys:
```
result['identifier'] = DOI or other identifier (or None if nothing is found)
result['identifier_type'] = string specifying the type of identifier (e.g. 'doi' or 'arxiv')
result['path'] = path of the pdf file
result['method'] = method used by pdf2doi to find the identifier
result['validation_info'] = Raw BibTeX data.
result['metadata'] = Dictionary containing bibtex info
result['bibtex'] = A string containing a valid bibtex entry
```
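Building on the keys listed above, the following sketch shows one way to post-process the output of ```pdf2bib.pdf2bib```, for example to collect all entries in a single string or to list the files that still need a manually assigned identifier. The helper names are hypothetical, and the sketch assumes that ```result['bibtex']``` is empty or ```None``` when no entry could be generated, which this README does not state explicitly. The data below is mocked so that the block runs without ```pdf2bib``` installed.

```python
def as_list(result):
    """Normalize the output: a single dict (one file) or a list of dicts (folder)."""
    return [result] if isinstance(result, dict) else list(result)


def collect_bibtex(result):
    """Join all available BibTeX entries, separated by a blank line."""
    return "\n\n".join(r["bibtex"] for r in as_list(result) if r.get("bibtex"))


def files_without_identifier(result):
    """Return the paths of the pdf files for which no identifier was found."""
    return [r["path"] for r in as_list(result) if r.get("identifier") is None]


# Mocked output with the same keys as the dictionaries described above.
mock_result = [
    {"identifier": "10.1063/1.2409490", "identifier_type": "doi",
     "path": "examples/chaumet_JAP_07.pdf",
     "bibtex": "@article{chaumet2007coupled,\n    year = {2007}\n}"},
    {"identifier": None, "identifier_type": None,
     "path": "examples/unknown.pdf", "bibtex": None},
]

print(collect_bibtex(mock_result))
print(files_without_identifier(mock_result))
```

With the real library, ```mock_result``` would be replaced by the value returned by ```pdf2bib.pdf2bib(path)```, and the string returned by ```collect_bibtex``` could be written to a .bib file.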
#### Manually associate the correct identifier to a file
As described [above](#manually-associate-the-correct-identifier-to-a-file-from-command-line) for the command line, a manually found
identifier can also be associated with a pdf file from within Python, by using the function ```pdf2doi.add_found_identifier_to_metadata```:
```python
>>> import pdf2doi
>>> pdf2doi.add_found_identifier_to_metadata(path_to_pdf_file, identifier)
```
## Installing the shortcuts in the right-click context menu of Windows
This functionality is only available on Windows (and so far it has been tested only on Windows 10). It adds additional commands to the context menu of Windows
which appears when right-clicking on a pdf file or on a folder.
<!--<img src="docs/ContextMenu_pdf.png" width="550" /><img src="docs/ContextMenu_folder.png" width="550" />-->
The menu commands allow you to copy the BibTeX entry of a pdf file (or of all pdf files contained in a folder) into the system clipboard.
<!--<img src="docs/ContextMenu_pdf.gif" width="500" />-->
To install this functionality, first install ```pdf2bib``` via pip (as described above), then open a command prompt **with administrator rights** and run
```
$ pdf2bib -install--right--click
```
To remove it, simply run (again from a terminal with administrator rights)
```
$ pdf2bib -uninstall--right--click
```
If it is not possible to run this command from a terminal with administrator rights, the batch files
[here](/right_click_menu_installation) can be used instead (see the readme.MD file in the same folder for instructions), although
administrator rights are still required.
NOTE: when multiple pdf files are selected, and the right-click context menu commands are used, ```pdf2bib``` will be called separately for each file, and thus
only the BibTeX entry of the last file will be stored in the clipboard. In order to copy the info of multiple files it is necessary to save them in a folder and right-click on the folder.
## Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
## Acknowledgment
I am thankful to my friend and colleague Yarden Mazor for leading the beta-testing efforts for this project.
## Donating
If you find this library useful (or amazing!), please consider making donations on my behalf to organizations that advocate for and promote free dissemination of science, such as
- [arXiv](https://arxiv.org/about/donate)
- [Sci-Hub](https://sci-hub.se/donate)
- [Wikipedia](https://donate.wikimedia.org/)
## License
[MIT](https://choosealicense.com/licenses/mit/)