Gnip Historical library and command-line scripts.

Project description

Python Library
and
Command Line Utilities
for Gnip Historical PowerTrack API


The process for launching a Historical PowerTrack job and retrieving its data
requires only a few steps:
1) create job
2) retrieve and review job quote
3) accept or reject job
4) download data files list
5) download data

Utilities are included to assist with each step.

SETUP UTILITY
=============
First, set up your Gnip credentials. There is a simple utility to create the local credential
file named ".gnip".

$ ./setup_gnip_creds.py
Username: shendrickson@gnip.com
Password:
Password again:
Endpoint URL. Enter your Account Name (eg https://historical.gnip.com:443/accounts/<account name>/): shendrickson
Done creating file ./.gnip
Be sure to run:
chmod og-w .gnip

$ chmod og-w .gnip

If you use the example JSON job description, be sure to change the "serviceUsername"
field to your own, e.g., for Twitter, use your Twitter handle.
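
If you prefer to make that edit programmatically, here is a minimal sketch using only the
standard library (the file name matches the example job description; the handle value
is a placeholder):

import json

# Load the example job description, set the handle, and write it back.
with open("bieber_job1.json") as f:
    job = json.load(f)

job["serviceUsername"] = "your_twitter_handle"  # placeholder -- use your own handle

with open("bieber_job1.json", "w") as f:
    json.dump(job, f, indent=4, sort_keys=True)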

You will likely want to run these utilities from other directories, so be sure to export
an updated PYTHONPATH:

$ export PYTHONPATH=${PYTHONPATH}:path-to-gnip-python-historical-utilities
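
The same effect can be had from inside a script by extending sys.path at runtime; a
minimal sketch (the path below is a placeholder for wherever you keep the utilities):

import sys

# Make the utilities importable without touching the shell environment.
sys.path.append("/path/to/gnip-python-historical-utilities")  # placeholder path

import gnip_historical  # module shipped with these utilities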

CREATE JOB
==========
Create a job description by editing the example JSON file provided ("bieber_job1.json").

You will end up with a single JSON record like this (see the Gnip documentation for option
details). The fromDate and toDate fields are in the format YYYYmmddHHMM:

{
    "dataFormat" : "activity-streams",
    "fromDate" : "201201010000",
    "publisher" : "twitter",
    "rules" :
        [
            {
                "tag" : "bestRuleEver",
                "value" : "bieber"
            }
        ],
    "serviceUsername" : "PUT_YOUR_TWITTER_HANDLE_HERE",
    "streamType" : "track",
    "title" : "BieberJob1",
    "toDate" : "201201010001"
}
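
If you generate these date strings in code, strftime with "%Y%m%d%H%M" produces the
required form; a minimal sketch reproducing the values above:

from datetime import datetime

# fromDate and toDate use minute resolution: YYYYmmddHHMM
from_date = datetime(2012, 1, 1, 0, 0).strftime("%Y%m%d%H%M")
to_date = datetime(2012, 1, 1, 0, 1).strftime("%Y%m%d%H%M")
print(from_date, to_date)  # 201201010000 201201010001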

To create the job,

$ ./create_job.py -f./bieber_job1.json -t "Social Data Phenoms - Bieber"

The response is the JSON record returned by the server. It describes the job (including
the job ID and job URL) or contains any error messages.
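
If you capture that response in a file, a short script makes the record easier to inspect;
a minimal sketch (the "jobURL" key name is an assumption, not confirmed by this README):

import json
import sys

# Pretty-print a saved create_job.py response and pull out the job URL.
# The "jobURL" key name is an assumption -- inspect the printed record to
# confirm the field your account actually returns.
with open(sys.argv[1]) as f:
    response = json.load(f)

print(json.dumps(response, indent=4, sort_keys=True))
print("Job URL:", response.get("jobURL", "<not found>"))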

To get help,

$ ./create_job.py -h
Usage: create_job.py [options]

Options:
  -h, --help            show this help message and exit
  -u URL, --url=URL     Job url.
  -l, --prev-url        Use previous Job URL (only from this configuration
                        file.).
  -v, --verbose         Detailed output.
  -f FILENAME, --filename=FILENAME
                        File defining job (JSON)
  -t TITLE, --title=TITLE
                        Title of project, this title supercedes title in file.


LIST JOBS, get JOB QUOTES and get JOB STATUS:
=============================================
$ ./list_jobs.py -h
Usage: list_jobs.py [options]

Options:
  -h, --help            show this help message and exit
  -u URL, --url=URL     Job url.
  -l, --prev-url        Use previous Job URL (only from this configuration
                        file.).
  -v, --verbose         Detailed output.
  -d SINCEDATESTRING, --since-date=SINCEDATESTRING
                        Only list jobs after date, (default
                        2012-01-01T00:00:00)

For example, I have three completed jobs for which data is available: a Gnip job,
a Bieber job, and a SXSW job.

$ ./list_jobs.py
#########################
TITLE: GNIP2012
STATUS: finished
PROGRESS: 100.0 %
JOB URL: https://historical.gnip.com:443/accounts/shendrickson/publishers/twitter/historical/track/jobs/eeh2vte64.json
#########################
TITLE: Justin Bieber 2009
STATUS: finished
PROGRESS: 100.0 %
JOB URL: https://historical.gnip.com:443/accounts/shendrickson/publishers/twitter/historical/track/jobs/j5epx4e5c3.json
#########################
TITLE: SXSW2010-2012
STATUS: finished
PROGRESS: 100.0 %
JOB URL: https://historical.gnip.com:443/accounts/shendrickson/publishers/twitter/historical/track/jobs/sbxff05b8d.json


To see detailed information or download data filelist,
specify URL with -u or add -v flag (data_files.txt contains
only URLs from last job in list)

DOWNLOAD URLS OF FILES CONTAINING DATA
======================================
To retrieve the locations of the data files this job created on S3, pass the job URL
with the -u flag (or, if you used -u for this job previously, just use -l; see the help output):

$ ./list_jobs.py -u https://historical.gnip.com:443/accounts/shendrickson/publishers/twitter/historical/track/jobs/sbxff05b8d.json
#########################
TITLE: SXSW2010-2012
STATUS: finished
PROGRESS: 100.0 %
JOB URL: https://historical.gnip.com:443/accounts/shendrickson/publishers/twitter/historical/track/jobs/sbxff05b8d.json

RESULT:
Job completed at ........ 2012-09-01 04:35:23
No. of Activities ....... -1
No. of Files ............ -1
Files size (MB) ......... -1
Data URL ................ https://historical.gnip.com:443/accounts/shendrickson/publishers/twitter/historical/track/jobs/sbxff05b8d/results.json
DATA SET:
No. of URLs ............. 131,211
File size (bytes)........ 2,151,308,466
Files (URLs) ............ https://archive.replay.historicals.review.s3.amazonaws.com/historicals/twitter/track/activity-streams/shendrickson/2012/08/28/20100101-20120815_sbxff05b8d/2010/01/01/00/00_activities.json.gz?AWSAccessKeyId=AKIAJ7O2S22DN2NDN7UQ&Expires=1349066046&Signature=hDSc0a%2BRQeG%2BknaSAWpzSUoM1F0%3D
https://archive.replay.historicals.review.s3.amazonaws.com/historicals/twitter/track/activity-streams/shendrickson/2012/08/28/20100101-20120815_sbxff05b8d/2010/01/01/00/10_activities.json.gz?AWSAccessKeyId=AKIAJ7O2S22DN2NDN7UQ&Expires=1349066046&Signature=DOZlXKuMByv5uKgmw4QrCOpmEVw%3D
https://archive.replay.historicals.review.s3.amazonaws.com/historicals/twitter/track/activity-streams/shendrickson/2012/08/28/20100101-20120815_sbxff05b8d/2010/01/01/00/20_activities.json.gz?AWSAccessKeyId=AKIAJ7O2S22DN2NDN7UQ&Expires=1349066046&Signature=X4SFTxwM2X9Y7qwgKCwG6fH8h7w%3D
https://archive.replay.historicals.review.s3.amazonaws.com/historicals/twitter/track/activity-streams/shendrickson/2012/08/28/20100101-20120815_sbxff05b8d/2010/01/01/00/30_activities.json.gz?AWSAccessKeyId=AKIAJ7O2S22DN2NDN7UQ&Expires=1349066046&Signature=WVubKurX%2BAzYeZLX9UnBamSCrHg%3D
https://archive.replay.historicals.review.s3.amazonaws.com/historicals/twitter/track/activity-streams/shendrickson/2012/08/28/20100101-20120815_sbxff05b8d/2010/01/01/00/40_activities.json.gz?AWSAccessKeyId=AKIAJ7O2S22DN2NDN7UQ&Expires=1349066046&Signature=OG9ygKlXNxFvJLlAEWi3hes5yyw%3D
...

Writing files to data_files.txt...

The URLs for the 131K files the job created on S3 have been written to a file in
the local directory, ./data_files.txt.

DOWNLOAD DATA
=============

To retrieve this data, use the utility:

$ ./get_data_files.bash
...

This will launch up to 8 simultaneous cURL connections to S3 to download the files
into a local ./data/year/month/day/hour... directory tree (see name_mangle.py for details).
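
If cURL is not available, the same list can be worked through sequentially in Python;
a minimal sketch that reads data_files.txt and rebuilds a year/month/day/hour tree from
the URL paths (this is a simplification of what name_mangle.py does, not a drop-in
replacement for the bash script):

import os
import urllib.request
from urllib.parse import urlparse

# Download every URL in data_files.txt into ./data/year/month/day/hour/...
# Sequential and single-threaded -- slower than the 8-way cURL script, but
# dependency-free.  The layout simply reuses the last path components of each URL.
with open("data_files.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    parts = urlparse(url).path.split("/")[-5:]      # year/month/day/hour/file
    dest = os.path.join("data", *parts)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)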

ACCEPT/REJECT JOB
=================
After a job is quoted, you can accept or reject the job. The job will not start until it is accepted.

$ ./accept_job -u https://historical.gnip.com:443/accounts/shendrickson/publishers/twitter/historicals/track/jobs/c9pe0day6h.json

or

$ ./reject_job -u https://historical.gnip.com:443/accounts/shendrickson/publishers/twitter/historicals/track/jobs/c9pe0day6h.json

The module gnip_historical.py provides additional functionality you can access programmatically.
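
If you would rather not go through that module, the job and results URLs shown above can be
fetched directly; a minimal sketch, assuming HTTP Basic authentication with the username and
password from your .gnip setup (the job URL below is the example from earlier in this README;
substitute your own):

import json
import urllib.request

# Fetch a job record directly from the Historical PowerTrack API.
# Basic auth with your Gnip username/password is assumed; the URL below is
# the example job URL from this README -- replace it with one of yours.
job_url = "https://historical.gnip.com:443/accounts/shendrickson/publishers/twitter/historical/track/jobs/sbxff05b8d.json"

password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, job_url, "your_username", "your_password")
opener = urllib.request.build_opener(urllib.request.HTTPBasicAuthHandler(password_mgr))

with opener.open(job_url) as resp:
    job = json.loads(resp.read().decode("utf-8"))

print(json.dumps(job, indent=4, sort_keys=True))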

==
Gnip-Python-Historical-Utilities by Scott Hendrickson is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/.



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gnip-historical-0.4.0.tar.gz (23.8 kB)

Uploaded Source

File details

Details for the file gnip-historical-0.4.0.tar.gz.

File metadata

File hashes

Hashes for gnip-historical-0.4.0.tar.gz
Algorithm Hash digest
SHA256 869c279f6f90920a7679a6a9a0d93a39bf784562898ad95e0588ee46b6bcf263
MD5 f26501edfbb2342e2d981234c4c5506c
BLAKE2b-256 4cf027867529f2866e2fe15fa31cbbf4852fc60f2426e61347b4ce429a7f63c2

