PDF Download and Analysis Tool

Project description

ExPDF

Overview

ExPDF is a tool that extracts the citation relationships between PDFs and creates beautiful, interactive SVG figures inside Jupyter Notebook.

Quickstart

With Jupyter Notebook, it is easy to visualize the citation relationships between PDFs.

First, download and install expdf:

git clone https://github.com/bupt-ipcr/expdf
cd expdf
pip install ./

Second, use expdf to generate a JSON file:

expdf -d pdfs/ASV -o data.json

Finally, open Jupyter Notebook and try:

  import json
  from expdf.visualize import create_fig
  with open('data.json', 'r') as f:
    data = json.load(f)
  fig = create_fig(data)
  fig

Installation

Download expdf from GitHub and install it with pip:

git clone https://github.com/bupt-ipcr/expdf
cd expdf
pip install ./

Run expdf -h to see the help output:

usage: expdf [-h] [-a APPEND_PDF] [-r] [-o OUTPUT_DIR] PDF_PATH

Generate reference relation of all PDFs(given or inside PDF)

positional arguments:
  PDF_PATH              PDF path, or directory of PDFs if -r is used

optional arguments:
  -h, --help            show this help message and exit
  -a APPEND_PDF, --append APPEND_PDF
                        append a PDF file
  -d, --dir, --directory
                        treat PDF_PATH as a directory
  -e EXCLUDE_PDF, --exclude EXCLUDE_PDF
                        exclude a PDF file
  -o OUTPUT_DIR, -O OUTPUT_DIR, --output OUTPUT_DIR
                        output directory, default is current directory
  -v, --vis, --visualize
                        create a html file for visualize
  --vis-html HTML_FILENAME
                        output file name of html visualize

Examples

The simplest way to use expdf is:

expdf pdfs/test.pdf

Pass -d to treat PDF_PATH as a directory; expdf will scan all PDFs in the specified directory:

expdf -d pdfs

Append PDFs with -a, for papers that are not in the same folder:

expdf -d pdfs -a 1.pdf -a 2.pdf

Exclude PDFs with -e. No error is raised if the excluded PDF does not exist:

expdf -d pdfs -e test.pdf

To specify the output directory, use -o, -O or --output:

expdf pdfs/test.pdf -O ./urdir

To generate an HTML visualization file, use -v and --vis-html:

expdf -r pdfs/ASV -v --vis-html='vis.html'
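
The flags above can be pictured as an argparse definition. The following is only a sketch reconstructed from the help text, not expdf's actual source: the function name build_parser, the dest names, and the assumption that -e may be repeated like -a are all hypothetical.

```python
import argparse


def build_parser():
    # Hypothetical parser that mirrors the documented help text;
    # it is not taken from the expdf source code.
    parser = argparse.ArgumentParser(prog="expdf")
    parser.add_argument("pdf_path", metavar="PDF_PATH")
    # action="append" collects every occurrence: -a 1.pdf -a 2.pdf
    parser.add_argument("-a", "--append", action="append", default=[],
                        metavar="APPEND_PDF")
    parser.add_argument("-d", "--dir", "--directory", action="store_true",
                        dest="directory")
    # Assumption: -e can be repeated in the same way as -a
    parser.add_argument("-e", "--exclude", action="append", default=[],
                        metavar="EXCLUDE_PDF")
    parser.add_argument("-o", "-O", "--output", default=".",
                        metavar="OUTPUT_DIR")
    parser.add_argument("-v", "--vis", "--visualize", action="store_true",
                        dest="visualize")
    parser.add_argument("--vis-html", metavar="HTML_FILENAME")
    return parser


args = build_parser().parse_args(["-d", "pdfs", "-a", "1.pdf", "-a", "2.pdf"])
print(args.pdf_path, args.directory, args.append)
```

Repeating -a works because each occurrence is appended to the same list, which is why scattered papers can be added one flag at a time.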

Usage as Python library

expdf has three main parts: ExPDFParser, Graph and render.

  • ExPDFParser

    A parser built on top of pdfminer that looks for the metadata, links and references of a PDF file.

    # ensure you have ./tests/test.pdf
    from expdf import ExPDFParser
    pdf = ExPDFParser("tests/test.pdf")
    print('title: ', pdf.title)
    print('info: ', pdf.info)
    print('metadata: ', pdf.metadata)
    
    print('Links: ')
    for link in pdf.links:
      print(f'- {link}')
    
    print('Refs: ')
    for ref in pdf.refs:
      print(f'- {ref}')
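
To give a feel for what "looking for links" in extracted PDF text can involve, here is a minimal sketch that uses regular expressions. The patterns and the find_links helper are hypothetical illustrations; ExPDFParser's real rules may differ.

```python
import re

# Hypothetical patterns for illustration only; not ExPDFParser's own rules.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")
ARXIV_RE = re.compile(r"arXiv:\s*(\d{4}\.\d{4,5})(v\d+)?", re.IGNORECASE)
DOI_RE = re.compile(r"\b(10\.\d{4,9}/[^\s\"<>]+)")


def find_links(text):
    """Return de-duplicated (kind, value) pairs found in plain text."""
    found = []
    for url in URL_RE.findall(text):
        # strip punctuation that usually belongs to the sentence, not the URL
        found.append(("url", url.rstrip(".,;")))
    for match in ARXIV_RE.finditer(text):
        found.append(("arxiv", match.group(1)))
    for doi in DOI_RE.findall(text):
        found.append(("doi", doi.rstrip(".,;")))
    # dict.fromkeys keeps first-seen order while dropping duplicates
    return list(dict.fromkeys(found))


sample = "See https://github.com/bupt-ipcr/expdf, arXiv:1234.56789 and doi 10.1000/xyz123."
for kind, value in find_links(sample):
    print(f"- {kind}: {value}")
```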
    
  • PDFNode

    PDFNode is a class that maintains a dict of all its instances. Two PDFs that have the same title (or titles that differ only in punctuation) point to the same node. LocalPDFNode is a subclass of PDFNode that lets you modify the references of a PDF.

    It is usually used together with the parser:

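    from expdf import ExPDFParser
    from expdf.graph import PDFNode, LocalPDFNode
    
    # parse the PDF, then register it as a node together with its references
    expdf_parser = ExPDFParser("tests/test.pdf")
    localPDFNode = LocalPDFNode(expdf_parser.title, expdf_parser.refs)
    # get_json is called on PDFNode, which keeps track of all its instances
    pdf_info = PDFNode.get_json()
    print(pdf_info)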
    

    Alternatively, you can assign the title and refs without the parser (a human may be more precise than the parser and its regular expressions):

    from expdf.graph import PDFNode, LocalPDFNode
    
    # just an example; real titles will never look like this
    LocalPDFNode('title0', refs=['title1', 'title2'])
    LocalPDFNode('title1', refs=['title3'])
    LocalPDFNode('title2', refs=['title3'])
    pdf_info = PDFNode.get_json()
    print(pdf_info)
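
The idea of one shared node per title can be sketched with a class-level registry. This is a hypothetical stand-in, not PDFNode's implementation: the class name Node and the normalization rule (lowercase, then drop punctuation and whitespace) are assumptions.

```python
import string


class Node:
    """Hypothetical stand-in for PDFNode: one instance per normalized title."""

    _registry = {}

    def __new__(cls, title, refs=None):
        key = cls.normalize(title)
        if key not in cls._registry:
            node = super().__new__(cls)
            node.title = title
            node.refs = []
            cls._registry[key] = node
        return cls._registry[key]

    def __init__(self, title, refs=None):
        # __init__ runs on every call, so only extend refs, never reset them
        for ref in refs or []:
            target = Node(ref)
            if target not in self.refs:
                self.refs.append(target)

    @staticmethod
    def normalize(title):
        # Assumed rule: lowercase, then drop punctuation and whitespace
        drop = string.punctuation + string.whitespace
        return title.lower().translate(str.maketrans("", "", drop))


a = Node("Deep Learning: A Survey", refs=["Graph Theory"])
b = Node("deep learning, a survey")
print(a is b)
print([node.title for node in a.refs])
```

Because lookup happens in __new__, constructing a node with an already-seen title returns the existing object, so references collected from different PDFs end up on the same node.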
    
  • visualize

    PDFNode gives you information about PDFs, such as their citation relationships (shown as parents and children). But why not visualize it?

    visualize provides a top-level function, create_fig, built on networkx and plotly. networkx provides methods to compute the positions of all nodes, and plotly is a powerful visualization tool.

    render invokes create_fig and writes the result to an HTML file.

    It is recommended to use visualize inside Jupyter Notebook, since plotly only supports events (click, hover, etc.) there. For example:

    expdf -d pdfs/ASV -o data.json
    
    # in your jupyter notebook
    import json
    from expdf.visualize import create_fig
    with open('data.json', 'r') as f:
      data = json.load(f)
    fig = create_fig(data)
    fig
    

    You can also save it as HTML:

    expdf -d pdfs/ASV -o data.json -v --vis-html=vis.html
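
Before a citation graph can be drawn, the references have to be turned into edges. The sketch below shows one way to do that with the placeholder titles used earlier. The build_edges helper and the {title: [cited titles]} mapping are hypothetical; they do not reflect the format of data.json or the output of PDFNode.get_json().

```python
from collections import defaultdict


def build_edges(refs_by_title):
    """Turn a {title: [cited titles]} mapping into edges and a reverse index."""
    edges = []
    cited_by = defaultdict(list)
    for title, refs in refs_by_title.items():
        for ref in refs:
            edges.append((title, ref))      # title cites ref
            cited_by[ref].append(title)     # ref is cited by title
    return edges, dict(cited_by)


# Assumed input shape, using the placeholder titles from the PDFNode example
refs_by_title = {
    "title0": ["title1", "title2"],
    "title1": ["title3"],
    "title2": ["title3"],
}
edges, cited_by = build_edges(refs_by_title)
print(edges)
print(cited_by["title3"])  # ['title1', 'title2']
```

An edge list like this is the kind of input a graph library needs in order to compute node positions.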
    

Various

Download files

Source Distribution: expdf-0.3.0.tar.gz (21.6 kB)

Built Distribution: expdf-0.3.0-py2.py3-none-any.whl (22.9 kB, Python 2 and Python 3)
