A workflow for creating and editing publication ready scientific reports, from one or more Jupyter Notebooks
Project: https://github.com/chrisjsewell/ipypublish
For an example of the potential input/output, see Example.ipynb and Example.pdf.
Design Philosophy
In essence, the dream is to have the ultimate hybrid of Jupyter Notebook, WYSIWYG editor (e.g. MS Word) and document preparation system (e.g. TexMaker), being able to:
Dynamically (and reproducibly) explore data, run code and output the results
Dynamically edit and visualise the basic components of the document (text, math, figures, tables, references, citations, etc).
Have precise control over what elements are output to the final document and how they are laid out and typeset.
Also be able to output the same source document to different layouts and formats (pdf, html, presentation slides, etc).
Workflow
Create a notebook with some content!
optionally create a .bib file and logo image
Adjust the notebook and cell metadata.
Clone the ipypublish GitHub repository and run the nbpublish.py script for either the specific notebook, or a folder containing multiple notebooks.
A converted folder will be created, into which the final .tex, .pdf and _viewpdf.html files are output, named after the input notebook or folder.
The default latex template outputs all markdown cells (unless tagged latex_ignore), and then only code and output cells with latex metadata tags. See Example.ipynb and Example.pdf for an example of the potential input and output.
Setting up the environment
Using Conda is recommended for package management, in order to create self-contained environments with specific versions of packages. The main external packages required are the Jupyter notebook, Jupyter nbconvert and Pandoc (for conversion to latex):
conda create --name ipyreport -c conda-forge jupyter pandoc
ipypublish can then be installed into this environment:
source activate ipyreport
pip install ipypublish
For converting to PDF, the TeX document preparation ecosystem is required (and, in particular, latexmk), which can be installed from:
ipypublish is automatically tested against Python 2.7 and 3.6, on both Linux and OSX, using Travis CI. To troubleshoot any installation/run issues, it is therefore best to look at the travis config and travis test runs for working configurations.
For helpful extensions to the notebooks core capabilities (like a toc sidebar):
conda install --name ipyreport jupyter_contrib_nbextensions
A more extensive set of useful packages (used to create the example) is listed in conda_packages.txt, and an environment can be created directly from it using conda:
conda create --name ipyreport -c conda-forge -c matsci --file conda_packages.txt
Setting up a Notebook
For improved latex/pdf output, ipynb_latex_setup.py contains import and setup code for the notebook and a number of common packages and functions, including:
numpy, matplotlib, pandas, sympy, …
images_hconcat, images_vconcat and images_gridconcat functions, which use the PIL/Pillow package to create a single image from multiple images (with specified arrangement)
To use this script, in the first cell of a notebook, insert:
from ipypublish.ipynb_latex_setup import *
It is recommended that you also set this cell as an initialisation cell (i.e. have "init_cell": true in the metadata)
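For reference, the relevant cell metadata is minimal. Editing it (via the notebook's metadata editor) to contain:

```json
{
  "init_cell": true
}
```

will mark the cell to run automatically when the notebook loads, provided the initialisation-cells extension (part of jupyter_contrib_nbextensions) is enabled.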
Converting Notebooks
The nbpublish.py script handles passing the notebooks to nbconvert with the appropriate converter. To see all options for this script:
nbpublish -h
For example, to convert the Example.ipynb notebook:
nbpublish -pdf example/notebooks/Example.ipynb
If a folder is input, then the .ipynb files it contains are processed and combined in ‘natural’ sorted order, i.e. 2_name.ipynb before 10_name.ipynb. By default, notebooks beginning ‘_’ are ignored.
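The ‘natural’ sort order described above can be sketched in a few lines of Python; this is an illustration of the ordering (and of skipping notebooks beginning ‘_’), not ipypublish's actual implementation:

```python
import re

def natural_key(name):
    # split the name into digit and non-digit runs, so numeric
    # parts compare as integers rather than strings
    return [int(tok) if tok.isdigit() else tok
            for tok in re.split(r'(\d+)', name)]

names = ['10_name.ipynb', '2_name.ipynb', '_draft.ipynb']
visible = [n for n in names if not n.startswith('_')]  # '_' notebooks ignored
ordered = sorted(visible, key=natural_key)
# → ['2_name.ipynb', '10_name.ipynb']
```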
Currently, three output converters are available out of the box (in the scripts folder):
latex_ipypublish_main.py is the default and converts cells to latex according to metadata tags on an ‘opt in’ basis.
latex_standard_article.py replicates the standard latex article template, which comes with nbconvert.
html_toc_toggle_input.py converts the entire notebook(s) to html and adds a table of contents sidebar and a button to toggle input code on/off.
The current nbconvert --to pdf does not correctly resolve references and citations (since it copies the files to a temporary directory). Therefore nbconvert is only used for the initial nbconvert --to latex phase, followed by using latexmk to create the pdf and correctly resolve everything.
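Conceptually, the pipeline is therefore equivalent to the following two commands (the template path is illustrative; nbpublish orchestrates this for you):

```
jupyter nbconvert --to latex --template <template.tplx> Example.ipynb
latexmk -bibtex -pdf Example.tex
```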
Creating a bespoke converter
nbconvert uses Jinja templates to specify the rules for how each element of the notebook should be converted, and also what each section of the latex file should contain. To create a custom template, it employs an inheritance method to build up the template. However, in my experience this makes it:
non-trivial to understand the full conversion process (having to go through the inheritance tree to find where particular methods have been implemented/overridden)
difficult to swap multiple rules in and out
To improve this, ipypublish implements a pluginesque system to systematically append to blank template placeholders. For example, to create a document (with standard formatting) with a natbib bibliography where only input markdown is output, we could create the following dictionary:
my_tplx_dict = {
    'meta_docstring': 'with a natbib bibliography',
    'notebook_input_markdown': r"""
((( cell.source | citation2latex | strip_files_prefix | convert_pandoc('markdown', 'json', extra_args=[]) | resolve_references | convert_pandoc('json', 'latex') )))
""",
    'document_packages': r"""
\usepackage[numbers, square, super, sort&compress]{natbib}
\usepackage{doi} % hyperlink doi's
""",
    'document_bibliography': r"""
\bibliographystyle{unsrtnat} % sort citations by order of first appearance
\bibliography{bibliography}
"""
}
The converter would then look like this:
from ipypublish.latex.create_tplx import create_tplx
from ipypublish.latex.standard import standard_article as doc
from ipypublish.latex.standard import standard_definitions as defs
from ipypublish.latex.standard import standard_packages as package
oformat = 'Latex'
template = create_tplx([package.tplx_dict, defs.tplx_dict,
                        doc.tplx_dict, my_tplx_dict])
config = {'TemplateExporter.filters': {},
          'Exporter.filters': {}}
Citations and Bibliography
Using Zotero’s Firefox plugin and Zotero Better BibTeX for:
automated .bib file updating
drag and drop cite keys \cite{kirkeminde_thermodynamic_2012}
latexmk -bibtex -pdf (in nbpublish.py) handles creation of the bibliography
\usepackage{doi} turns the DOI numbers into url links
in Zotero-Better-Bibtex I have the option set to only export DOI, if both DOI and URL are present.
Please note, at the time of writing, Better BibTeX does not support Zotero 5.0 (issue#555). For now I have turned off auto-updates of Zotero, though this is probably not wise for long (Zotero 5 Discussion).
Can use:
<cite data-cite="kirkeminde_thermodynamic_2012">(Kirkeminde, 2012)</cite>
to make citations render better in html, but this format is not directly available for drag and drop from Zotero
Live Slideshows
The Reveal.js - Jupyter/IPython Slideshow Extension (RISE) notebook extension offers rendering as a Reveal.js-based slideshow, where you can execute code or show to the audience whatever you can show/do inside the notebook itself!
Dealing with external data
A goal for scientific publishing is automated reproducibility of analyses, which the Jupyter notebook excels at. But, more than that, it should be possible to efficiently reproduce the analysis with different data sets. This entails having one point of access to a data set within the notebook, rather than having copy-pasted data into variables, i.e. this:
data = read_in_data('data_key')
variable1 = data.key1
variable2 = data.key2
...
rather than this:
variable1 = 12345
variable2 = 'something'
...
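A minimal runnable sketch of this pattern, with a hypothetical read_in_data helper backed by a JSON file (the helper and file names are illustrative, not part of ipypublish):

```python
import json
import os
import tempfile
import types

def read_in_data(data_key, path):
    # hypothetical helper: load one data set from a JSON file and
    # expose its keys as attributes
    with open(path) as f:
        return types.SimpleNamespace(**json.load(f)[data_key])

# write a small example data file to act as the single point of access
tmp = tempfile.NamedTemporaryFile('w', suffix='.json', delete=False)
json.dump({'data_key': {'key1': 12345, 'key2': 'something'}}, tmp)
tmp.close()

data = read_in_data('data_key', tmp.name)
variable1 = data.key1   # 12345
variable2 = data.key2   # 'something'
os.unlink(tmp.name)
```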
The best practice for accessing hierarchical data (in my opinion) is to use the JSON format (as long as the data isn’t relational), because it:
is applicable to any data structure
is lightweight and easy to read and edit
has a simple read/write mapping to python objects (using json)
is widely used (especially in web technologies)
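The read/write mapping to Python objects is a standard-library round trip, for example:

```python
import json

record = {'temperature': 293.15, 'samples': [1, 2, 3], 'label': 'run_a'}
text = json.dumps(record, indent=2)    # python dict -> JSON text
restored = json.loads(text)            # JSON text -> python dict
assert restored == record              # lossless round trip
```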
A good way to store multiple pieces of JSON data is in a MongoDB database, accessed via pymongo. This also makes it easy to move all the data to a cloud server at a later time, if required.
conda install pymongo
But if the data comes from files output by different simulation or experimental codes, where the user has no control over the output format, then writing JSON parsers may be the way to go. This is where jsonextended comes in, which implements:
a lightweight plugin system to define bespoke classes for parsing different file extensions and data types.
a ‘lazy loader’ for treating an entire directory structure as a nested dictionary.
For example:
from jsonextended import plugins, edict
plugins.load_plugins_dir('path/to/folder_of_parsers','parsers')
data = edict.LazyLoad('path/to/data')
variable1 = data.folder1.file1_json.key1
variable2 = data[['folder1','file1.json','key2']]
variable3 = data[['folder1','file2.csv','key1']]
variable4 = data[['folder2','subfolder1','file3.other','key1']]
...
If you are dealing with numerical data arrays which are too large to be loaded directly into memory, then the h5py interface to the HDF5 binary data format allows for the manipulation of even multi-terabyte datasets stored on disk, as if they were real NumPy arrays. These files are also supported by jsonextended lazy loading.
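A short sketch of this pattern (assuming h5py and numpy are installed): data are written once to an HDF5 file, and slicing a dataset reads only the requested chunk from disk rather than the whole array:

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), 'data.h5')

# write a dataset (groups like 'results' are created implicitly)
with h5py.File(path, 'w') as f:
    f.create_dataset('results/array', data=np.arange(1000))

# slicing reads only the requested elements from disk
with h5py.File(path, 'r') as f:
    chunk = f['results/array'][10:15]
# chunk → array([10, 11, 12, 13, 14])
```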
Miscellaneous
I also use the Firefox Split Panel extension to view the {name}_viewpdf.html page and monitor changes to the pdf.
bookbook is another package with some conversion capabilities.
Acknowledgements
I took strong influence from:
Notebook concatenation was adapted from nbconvert issue#253