
# Apiarist

A Python 2.5+ package for defining Hive queries which can be run on AWS EMR.

In its current form it addresses only a narrow use case: reading large CSV files into a Hive database, running a Hive query, and writing the results to a CSV file.

Future versions will endeavour to extend the input/output formats and be runnable locally.

It is modeled on [mrjob](https://github.com/Yelp/mrjob) and attempts to present a similar API and use similar common variables to cooperate with `boto`.
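
Apiarist is distributed on PyPI, so it can be installed with pip:

```
pip install apiarist
```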

## A simple Hive job

You will need to provide four methods:

- `table()` the name of the table that your query will select from.
- `input_columns()` the columns in the source data file.
- `output_columns()` the columns that your query will output.
- `query()` the HiveQL query.

This code lives in `/examples`.

```python
from apiarist.job import HiveJob


class EmailRecipientsSummary(HiveJob):

    def table(self):
        return 'emails_sent'

    def input_columns(self):
        return [
            ('day', 'STRING'),
            ('weekday', 'INT'),
            ('sent', 'BIGINT')
        ]

    def output_columns(self):
        return [
            ('year', 'INT'),
            ('weekday', 'INT'),
            ('sent', 'BIGINT')
        ]

    def query(self):
        return "SELECT YEAR(day), weekday, SUM(sent) FROM emails_sent GROUP BY YEAR(day), weekday;"


if __name__ == "__main__":
    EmailRecipientsSummary().run()
```

### Try it out

Locally (must have a Hive server available):

```
python email_recipients_summary.py -r local /path/to/your/local/file.csv
```

EMR:

```
python email_recipients_summary.py -r emr s3://path/to/your/S3/files/
```
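
For EMR runs your AWS credentials must be available to `boto`. One common way to provide them (an assumption based on standard `boto` conventions, not something this README specifies) is via environment variables:

```
export AWS_ACCESS_KEY_ID=<your-access-key>
export AWS_SECRET_ACCESS_KEY=<your-secret-key>
```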

## Command-line options

Arguments can be passed to jobs on the command line, or programmatically with an array of options. Argument handling uses the [optparse](https://docs.python.org/2/library/optparse.html) module.
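
Passing options programmatically presumably mirrors mrjob: construct the job with a list of argument strings and run it. A minimal sketch (the constructor signature is an assumption based on the mrjob convention noted above, not documented here):

```python
from email_recipients_summary import EmailRecipientsSummary

# Hypothetical programmatic invocation: pass the same arguments
# you would use on the command line as a list of strings.
job = EmailRecipientsSummary(['-r', 'local', '/path/to/your/local/file.csv'])
job.run()
```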

Various options can be passed to control the running of the job, in particular the AWS/EMR options:

- `-r` the run mode. Either `local` or `emr` (default is `local`).
- `--output-dir` where the results of the job will go.
- `--s3-scratch-uri` the bucket in which all the temporary files can go.
- `--ec2-instance-type` the base instance type. Default is `m3.xlarge`.
- `--ec2-master-instance-type` if you want the master type to be different.
- `--num-ec2-instances` number of instances (including the master). Default is `2`.
- `--ami-version` the AMI version. Default is `latest`.
- `--hive-version` the Hive version. Default is `latest`.
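
For example, an EMR run that overrides a few of these defaults might look like this (the bucket names are placeholders):

```
python email_recipients_summary.py -r emr \
    --output-dir s3://my-bucket/output/ \
    --s3-scratch-uri s3://my-bucket/scratch/ \
    --num-ec2-instances 4 \
    s3://path/to/your/S3/files/
```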

### Passing options to your jobs

Jobs can be configured to accept arguments.

To do this, add the following method to your job class to configure the options:

```python
def configure_options(self):
    super(EmailRecipientsSummary, self).configure_options()
    self.add_passthrough_option('--year', dest='year')
```

Then pass the option on the command line, like this:

```
python email_recipients_summary.py -r local /path/to/your/local/file.csv --year 2014
```

Finally, incorporate it into your HiveQL query like this:

```python
def query(self):
    q = "SELECT YEAR(day), weekday, SUM(sent) "
    q += "FROM emails_sent "
    q += "WHERE YEAR(day) = {0} ".format(self.options.year)
    q += "GROUP BY YEAR(day), weekday;"
    return q
```

## License

Apiarist source code is released under the Apache 2 License. See the LICENSE file for more information.
