
Moves files to HDFS by creating Hive tables

Project description

Project to move local files all the way to HDFS.

Requirements:
Python 2.7
paramiko

Let's go over the assumptions that this script makes about your data:

1. You have a parent folder that contains (possibly nested) folders of .csv files, where each .csv file corresponds to a different Hive table that you wish to create.
2. These .csv files have a header row that specifies what each column means.
3. The name of the .csv file will be used as the name of the Hive table that will be created.
4. Only non-existing Hive tables will be created. If a Hive table already exists, it will not be removed (unless you set overwrite_existing_hive to True in the config file).
5. You have access to the production Hadoop cluster. If you do not, please file a ticket with AppOps for ssh access to the production cluster: pl1rhd402.internal.edmunds.com
6. If you are building the Hive tables automatically, all of the column types will be STRING.
7. Partitions for your Hive tables are created based on the date on which you run the script. You therefore only ever need to create the tables once; after that you can keep loading data into the tables, and existing data will not be overwritten unless you upload different data more than once a day. If that is your case, please email me at sshuster@edmunds.com
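
Assumptions 1, 3, 6 and 7 can be illustrated with a small sketch. This is not the code in hdfs_load.py: the helper names (find_csv_files, table_name_for, build_ddl, partition_for), the partition column name dt, and the exact DDL clauses are all assumptions made for illustration, and the statement the script really generates may differ.

```python
import csv
import os
from datetime import date


def find_csv_files(parent_dir):
    """Walk parent_dir, including nested folders, and return every .csv path."""
    found = []
    for root, _dirs, files in os.walk(parent_dir):
        for name in sorted(files):
            if name.lower().endswith(".csv"):
                found.append(os.path.join(root, name))
    return found


def table_name_for(csv_path):
    """Use the file name without its extension as the Hive table name."""
    return os.path.splitext(os.path.basename(csv_path))[0]


def build_ddl(table, header_line, hdfs_base_folder, delimiter=","):
    """Build a CREATE TABLE statement in which every column is a STRING.

    The partition column name `dt` is hypothetical.
    """
    columns = next(csv.reader([header_line], delimiter=delimiter))
    column_sql = ",\n".join("  `%s` STRING" % c.strip() for c in columns)
    return (
        "CREATE TABLE IF NOT EXISTS %s (\n%s\n)\n"
        "PARTITIONED BY (`dt` STRING)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '%s'\n"
        "LOCATION '%s/%s'"
    ) % (table, column_sql, delimiter, hdfs_base_folder, table)


def partition_for(run_date=None):
    """Return the partition value for a run: the run date as YYYY-MM-DD."""
    return (run_date or date.today()).isoformat()
```

For a file named vehicles.csv whose header row is make,model,year, build_ddl(table_name_for("vehicles.csv"), "make,model,year", "/stats_team") yields a table called vehicles with three STRING columns. IF NOT EXISTS mirrors assumption 4: a table that is already present is left alone.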

OK great! If you are OK with all of the above, let's now go over the config file, which is where you provide all of the information required to do the job.

First, look at sample_config/allinfo_load.cfg, which is where you specify all of the parameters for the Hive tables you are going to create.
Let's go line by line:

[LocalPaths]
#This is the parent directory containing all of your .csv files on your local machine
local_dir: /Users/sshuster/Documents/Common_Data_Platform_Challenge_Team/allinfo_sample
#If these tables do not exist in Hive yet, DDL SQL files will be generated and stored locally (you can delete them later). Specify a folder where these files can be written
local_sql_dir: /Users/sshuster/Documents/Common_Data_Platform_Challenge_Team/allinfo_sql

[RemotePaths]
#This is the folder on the remote server where your csv files will be moved. Only modify the part after %(base_remote)s
dest_dir: %(base_remote)s/allinfo
#This is the folder on the remote server where your Hive DDL will be moved. Only modify the part after %(base_remote)s
sql_dest_dir: %(base_remote)s/allinfo_sql

#The server to connect to
server: dl1rhd401.internal.edmunds.com
username=sshuster
password=[your password here]
#Do not change
base_remote=/misc/%(username)s

[HDFSLocation]
#This is the folder on HDFS where your Hive tables will reside. NOTE: you will need to contact the DWH team to have a folder created for your team; otherwise you will not have permission to write to it
hdfs_base_folder: /stats_team

[Hive]
#Set equal to True if you want to create the hive tables, otherwise False
create_tables: True
#Set equal to True if you want to overwrite existing tables, otherwise False (ONLY SET TO TRUE IF YOU WANT TO DELETE ALL EXISTING DATA!!)
overwrite_existing_hive: False
#The Delimiter of your csv files
delimiter = ,
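
To see how the %(base_remote)s and %(username)s placeholders resolve, here is a minimal sketch that reads a config of this shape with Python's standard config parser. The function name load_settings and the returned dictionary are assumptions for illustration; hdfs_load.py may read its config differently.

```python
try:
    import configparser  # Python 3
except ImportError:  # Python 2.7, which this project lists as its requirement
    import ConfigParser as configparser


def load_settings(config_path):
    """Read the load config and return the resolved values as a dict."""
    parser = configparser.ConfigParser()
    # read() returns the list of files it managed to parse.
    if not parser.read(config_path):
        raise IOError("could not read config file: %s" % config_path)
    return {
        "local_dir": parser.get("LocalPaths", "local_dir"),
        # %(base_remote)s is expanded from the same section on get().
        "dest_dir": parser.get("RemotePaths", "dest_dir"),
        "sql_dest_dir": parser.get("RemotePaths", "sql_dest_dir"),
        "server": parser.get("RemotePaths", "server"),
        "hdfs_base_folder": parser.get("HDFSLocation", "hdfs_base_folder"),
        "create_tables": parser.getboolean("Hive", "create_tables"),
        "overwrite_existing_hive": parser.getboolean(
            "Hive", "overwrite_existing_hive"
        ),
        "delimiter": parser.get("Hive", "delimiter"),
    }
```

With username=sshuster and base_remote=/misc/%(username)s, dest_dir resolves to /misc/sshuster/allinfo. The parser accepts both ":" and "=" as key/value separators, which is why the sample file can mix them.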



How do you run it?

python hdfs_load.py [path to your config file]
