
Client-less data retrieval from Hive.

Project description

hivehoney

Extract data from remote Hive to local Windows OS (without a Hadoop client).

The most difficult part was figuring out expect+pbrun.

Because pbrun asks two interactive questions, I had to pause after sending the password.

More expect+pbrun details are here: https://github.com/hive-scripts/hivehoney/blob/master/expect_pbrun_howto.md
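
For illustration, here is a minimal Python sketch of the kind of expect script hivehoney writes out for pbrun automation. The prompt strings, the second question's wording, the pause length, and the kinit/beeline arguments are assumptions, not the tool's actual script.

  # Hedged sketch: generating an expect script for pbrun automation.
  # Prompts, the second question, the pause and all command arguments
  # are assumptions, not hivehoney's actual output.
  EXPECT_TEMPLATE = """#!/usr/bin/expect
  set timeout -1
  spawn pbrun -u {service_user} bash
  expect "Password:"
  send "{password}\\r"
  # pbrun asks a second interactive question after the password,
  # hence the pause before answering it
  sleep 5
  expect "Reason:"
  send "ad-hoc data extract\\r"
  expect "$ "
  send "kinit -kt /path/to/service.keytab {service_user}\\r"
  expect "$ "
  send "beeline -u jdbc:hive2://hive-host:10000 -f /tmp/query.sql\\r"
  expect "$ "
  send "exit\\r"
  expect eof
  """

  def write_expect_script(path, service_user, password):
      # hivehoney keeps its generated files in /tmp on the bastion host.
      with open(path, "w") as f:
          f.write(EXPECT_TEMPLATE.format(service_user=service_user,
                                         password=password))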

Data access path.

Windows desktop->
               SSH->
                  Linux login->
                       pbrun service login->
                                           kinit
                                           beeline->
                                                   SQL->
                                                       save echo on Windows
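
A hedged paramiko sketch of that chain follows, assuming the generated expect script described above drives the pbrun, kinit, and beeline steps; the script path /tmp/pbrun.exp and the host key policy are placeholders, not hivehoney's actual code.

  import os
  import paramiko

  # Hedged sketch of the access chain above; not hivehoney's actual code.
  # PROXY_HOST, LINUX_USER and LINUX_PWD are the environment variables
  # from the run instructions below.
  client = paramiko.SSHClient()
  client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
  client.connect(os.environ["PROXY_HOST"],
                 username=os.environ["LINUX_USER"],
                 password=os.environ["LINUX_PWD"])

  # The expect script logs in through pbrun, then runs kinit and beeline;
  # beeline's stdout (the echoed query rows) streams back over SSH.
  stdin, stdout, stderr = client.exec_command("expect /tmp/pbrun.exp")

  # Save the echo on the Windows side as CSV.
  with open(r"c:\tmp\data_dump.csv", "wb") as f:
      for chunk in iter(lambda: stdout.read(65536), b""):
          f.write(chunk)
  client.close()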
                                

Run it like this:

set PROXY_HOST=your_bastion_host

set SERVICE_USER=your_func_user

set LINUX_USER=your_SOID

set LINUX_PWD=your_pwd

python hh.py --query_file=query.sql
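
Inside hh.py, those settings would be read from the environment along these lines (only the variable names come from the commands above; the handling is an assumption):

  import os

  # The four settings hh.py expects, per the run instructions above.
  PROXY_HOST = os.environ["PROXY_HOST"]      # bastion/jump host
  SERVICE_USER = os.environ["SERVICE_USER"]  # pbrun functional user
  LINUX_USER = os.environ["LINUX_USER"]      # your SSH login (SOID)
  LINUX_PWD = os.environ["LINUX_PWD"]        # your SSH password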

query.sql

select * from gfocnnsg_work.pytest LIMIT 1000000;

Result:

  TOTAL BYTES:    60000127

  Elapsed: 79.637 s

  exit status:  0

  0

  []

  TOTAL Elapsed: 99.060 s

data_dump.csv

  c:\tmp>dir data_dump.csv



  Directory of c:\tmp

  09/04/2018  12:53 PM        60,000,075 data_dump.csv

                 1 File(s)     60,000,075 bytes

                 0 Dir(s)     321,822,720 bytes free

FAQ

Can it extract a CSV file to a cluster node?

No, it extracts to the local Windows OS.

Can developers integrate Hive Honey into their ETL pipelines?

No. It's for ad-hoc data retrieval.

What's the purpose of Hive Honey?

To improve the experience of Data and Business Analysts working with Big Data.

What are the other ways to extract data from Hive?

You can use Hive syntax to extract data to HDFS or locally (to a node with a Hadoop client); see the sketch below.
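
For example, run from a cluster node that already has beeline, a Hive-native export might look like this sketch; the JDBC URL and output directory are placeholders.

  import subprocess

  # Hedged example of a Hive-native export from a node with a Hadoop
  # client; the JDBC URL and output directory are placeholders.
  hql = (
      "INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pytest_export' "
      "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
      "SELECT * FROM gfocnnsg_work.pytest"
  )
  subprocess.run(
      ["beeline", "-u", "jdbc:hive2://hive-host:10000", "-e", hql],
      check=True,
  )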

Can it be executed without pbrun?

No, it's hardcoded to automate pbrun authorization.

Does it create any files?

Yes, it creates an SQL file with the query and an expect file for pbrun automation (in the /tmp directory of your bastion/jump host).

What are the steps of the data extract?

From the Windows desktop:

1. SSH to Linux.
2. pbrun service login.
3. kinit.
4. beeline executes the SQL.
5. The data echoed by beeline is saved on Windows.

What technology was used to create this tool?

I used Python and paramiko to write it.

Can you modify functionality and add features?

Yes, please ask me for new features.

What other AWS tools have you created?

Do you have any AWS Certifications?

Yes, AWS Certified Developer (Associate).

Can you create similar/custom data tool for our business?

Yes, you can PM me here or email me at alex_buz@yahoo.com.



Download files

Download the file for your platform.

Source Distribution

hivehoney-1.0.3.tar.gz (6.8 kB)

Uploaded: Source

Built Distribution

hivehoney-1.0.3-py2.py3-none-any.whl (10.2 kB)

Uploaded: Python 2, Python 3
