Skip to main content

Wrapper for the Trombone project

Project description

pytrombone

Python wrapper for the Trombone project

https://github.com/voyanttools/trombone

Installation

$ pip install pytrombone

Usage

Examples

Consider a situation where we have a bunch of pdfs in a directory named './data/', and we want to calculate the SMOG index on those PDFs.

Making sure that Trombone works

from pytrombone import Trombone, Cache, filepaths_loader

# This will download the Trombone jar in the /tmp/ directory of your machine. 
# Note that Trombone is likely to be deleted on reboot, and will need to be downloaded again.
trombone = Trombone()

# To get the version
version = trombone.get_version()
print(version)

Calculating the SMOG index of 2 files

# To run Trombone on a single file use the run method.
# Note that Trombone parameters are given in the form of a list of tuple of 2 elements.
# The first element of the tuple is the parameter, and the second is its value.
# Also note that Trombone will handle those 2 files concurrently 
# (it will be more performant to give many files at the same time rather than loop on each).
output, error = trombone.run([
    ('tool', 'corpus.DocumentSMOGIndex'),  # Choose the tool you want to use
    ('file', './data/example1.pdf'),
    ('file', './data/example2.pdf'),
    ('storage', 'file'),  # Optional, it allows Trombone to cache pre-processed files (use if you will use the file for many tools)
])
output  # is the successful output of Trombone, in the form of a string
error  # is the failed output of Trombone, in the form of a string

# You can serialize the output, which has the JSON format :
output = trombone.serialize_output(output)
# output is now your results in the form of a dictionary

Calculating the SMOG index in batches

# We first need to setup the cache file (it will allow you to re-run
# your code in case of a problem without having to restart from the beginning)
cache = Cache('./cache.db')

# Then, load the filepaths in batch. pytrombone has a function to do that.
# Note that every file marked as processed will be ignored.
# Also note that the Cache uses the filename of the file as reference.
for filepaths in filepaths_loader('./data/*.pdf', 100, cache):
    # Making tuples to fit the specification of Trombone parameters
    files = [('file', filepath) for filepath in filepaths]

    output, error = trombone.run([
        ('tool', 'corpus.DocumentSMOGIndex'),  # Choose the tool you want to use
        ('storage', 'file'),  # Optional, it allows Trombone to cache pre-processed files (use if you will use the file for many tools)
    ] + files)

    try:
        # If the serialization failed, it is because Trombone failed to performs the analysis.
        # the failed files will be marked as failed in the cache and re-run on the next run.
        # You may want to inspect the "error" variable for more information.
        output = trombone.serialize_output(output)
    except json.JSONDecoder:
        filenames = [os.path.basename(filepath) for filepath in filepaths]
        cache.mark_as_failed(filenames)
        continue

    output  # now has your results in the for of a dictionary

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pytrombone-0.1.3.tar.gz (5.3 kB view hashes)

Uploaded Source

Built Distribution

pytrombone-0.1.3-py3-none-any.whl (5.7 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page