
A set of tools to support downloading GDELT data


Loading GDELT data into MongoDB

This is a set of programs for loading the GDELT 2.0 data set into MongoDB.

Quick Start

Install the latest version of Python from python.org. You need at least version 3.6 for this program. Many pre-installed versions of Python are still 2.7, which will not work.

Now install gdelttools

pip install gdelttools

Now get the master file of all the GDELT files.

gdeltloader --master

This will generate a file named something like gdelt-master-file-04-19-2022-19-33-56.txt
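As a rough illustration of what the master file holds, the sketch below parses one line in the layout used by the public GDELT 2.0 master list, assuming each line carries a size in bytes, a checksum, and a URL separated by whitespace. The `MasterEntry` type and `parse_master_line` helper are hypothetical names for this example and are not part of gdelttools; the sample line is invented to show the shape only.

```python
from typing import NamedTuple, Optional


class MasterEntry(NamedTuple):
    """One line of a master file list (assumed layout: size, checksum, URL)."""
    size: int
    checksum: str
    url: str


def parse_master_line(line: str) -> Optional[MasterEntry]:
    """Return a MasterEntry, or None if the line does not have three fields."""
    parts = line.split()
    if len(parts) != 3 or not parts[0].isdigit():
        return None
    return MasterEntry(int(parts[0]), parts[1], parts[2])


# Illustrative line only: the size and checksum are made up.
sample = (
    "1024 0123456789abcdef "
    "http://data.gdeltproject.org/gdeltv2/20220419193000.export.CSV.zip"
)
entry = parse_master_line(sample)
file_name = entry.url.rsplit("/", 1)[-1]
print(file_name)
```

Returning `None` for malformed lines lets a caller skip blank or truncated entries without stopping the whole run.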

Downloading the master data set

To download the master data set associated with GDELT (the export files) you can combine these steps:

gdeltloader --master --download --overwrite

This will get the master file, parse it, extract the list of CSV files, then download and unzip them. The full GDELT 2.0 database runs to several terabytes of data, so this is not recommended.

The overwrite argument ruthlessly overwrites all files with extreme prejudice. Without it the gdeltloader script will attempt to reuse the files you have already downloaded. As each file is unique this may save time if you need to re-download some files.
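The reuse-or-overwrite decision described above can be sketched in a few lines. This is a minimal illustration of the idea, not the actual gdeltloader code; `needs_download` is a hypothetical helper name.

```python
import os
import tempfile


def needs_download(path: str, overwrite: bool) -> bool:
    """Download when overwriting is forced or the file is not on disk yet."""
    return overwrite or not os.path.exists(path)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "20220419193000.export.CSV.zip")

    missing_no_overwrite = needs_download(target, overwrite=False)

    # Simulate a file left over from an earlier run.
    with open(target, "wb") as handle:
        handle.write(b"placeholder")

    present_no_overwrite = needs_download(target, overwrite=False)
    present_overwrite = needs_download(target, overwrite=True)

print(missing_no_overwrite, present_no_overwrite, present_overwrite)
```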

To limit the amount you download you can specify --last to define how many files worth of data you want to download:

gdeltloader --master --download --overwrite --last 20

This will download the most recent 20 files' worth of data. Note that a "file" here is a triplet of export, mentions and gkg data. If you only want one of these you should specify a filter with --filefilter. Without the filter, a command like the one above will actually download 60 files.
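To make the triplet arithmetic concrete, here is a small sketch of how the most recent N timestamps might be selected from a list of URLs, with an optional filter on the file kind. It assumes file names of the form `<timestamp>.<kind>.CSV.zip`; the `select_last` helper and the `example.org` URLs are hypothetical and do not reflect how gdeltloader is implemented.

```python
KINDS = ("export", "mentions", "gkg")


def timestamp(url: str) -> str:
    """Leading timestamp of the file name, e.g. 20220419193000."""
    return url.rsplit("/", 1)[-1].split(".")[0]


def file_kind(url: str) -> str:
    """Second dot-separated part of the file name: export, mentions or gkg."""
    parts = url.rsplit("/", 1)[-1].split(".")
    return parts[1] if len(parts) > 1 else ""


def select_last(urls, last, file_filter="all"):
    """Keep URLs belonging to the `last` most recent timestamps."""
    stamps = sorted({timestamp(u) for u in urls})
    wanted = set(stamps[-last:]) if last > 0 else set()
    return [
        u for u in urls
        if timestamp(u) in wanted
        and (file_filter == "all" or file_kind(u) == file_filter)
    ]


# Three made-up timestamps, each with its export/mentions/gkg triplet.
urls = [
    f"http://example.org/{ts}.{kind}.CSV.zip"
    for ts in ("20220419190000", "20220419191500", "20220419193000")
    for kind in KINDS
]

everything = select_last(urls, 2)            # 2 timestamps x 3 kinds
exports = select_last(urls, 2, "export")     # 2 timestamps x 1 kind
print(len(everything), len(exports))
```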

GDELT 2.0 Encoding and Structure

The GDELT dataset is a large dataset of news events that is updated in near real time. GDELT stands for Global Database of Events, Language, and Tone. The format of records in the GDELT data is defined by the GDELT 2.0 Cookbook.

Each record uses an encoding method called CAMEO coding, which is defined by the CAMEO cookbook.

Once you understand the GDELT recording structure and the CAMEO encoding you will be able to decode a record. To fully decode a record you may need the TABARI dictionaries from which the CAMEO encoding is derived.

How to download GDELT 2.0 data

The gdeltloader script can download CAMEO data and unzip the files so that they can be loaded into MongoDB.

usage: gdeltloader [-h] [--host HOST] [--master] [--update]
                   [--database DATABASE] [--collection COLLECTION]
                   [--local LOCAL] [--overwrite] [--download] [--metadata]
                   [--filefilter {export,gkg,mentions,all}] [--last LAST]

optional arguments:
  -h, --help            show this help message and exit
  --host HOST           MongoDB URI
  --master              GDELT master file [False]
  --update              GDELT update file [False]
  --database DATABASE   Default database for loading [GDELT]
  --collection COLLECTION
                        Default collection for loading [events_csv]
  --local LOCAL         load data from local list of zips
  --overwrite           Overwrite files when they exist already
  --download            download zip files from master or local file
  --metadata            grab meta data files
  --filefilter {export,gkg,mentions,all}
                        download a subset of the data, the default is the
                        export data
  --last LAST           how many recent days of data to download [365]

Version: 0.06a

Here is how to download the last 365 days of GDELT data.

gdeltloader --master --update --download --last 365

This command will only download the export files for the last 365 days, which are the files we are interested in.

How to import downloaded data into MongoDB

Now import the CSV files with mongoimport.

There is a mongoimport.sh script in the gdelttools repo which is already configured with the right arguments. There is also a corresponding field file, gdelt_field_file.ff, which this script uses to ensure correct type mappings.

To run:

sh mongoimport.sh

It will upload all the CSV files in the current working directory.
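The contents of mongoimport.sh are not reproduced here, so the following is only a sketch of what an equivalent per-file invocation might look like. It assumes the unzipped files are tab-delimited (hence `--type tsv`) and that the field file carries type annotations (hence `--columnsHaveTypes`). The database and collection defaults are taken from the gdeltloader help text; `mongoimport_args` is a hypothetical helper. The code only builds and prints the commands, it does not run them.

```python
import glob
import shlex


def mongoimport_args(csv_path,
                     field_file="gdelt_field_file.ff",
                     database="GDELT",
                     collection="events_csv"):
    """Build a mongoimport argument list for one unzipped GDELT file."""
    return [
        "mongoimport",
        "--db", database,
        "--collection", collection,
        "--type", "tsv",                 # assumption: files are tab-delimited
        "--fieldFile", field_file,
        "--columnsHaveTypes",            # assumption: field file includes types
        "--file", csv_path,
    ]


example = mongoimport_args("20220419193000.export.CSV")

# Print one command per CSV file found in the current working directory.
for path in sorted(glob.glob("*.CSV")):
    print(" ".join(shlex.quote(arg) for arg in mongoimport_args(path)))
```

Check the printed commands against the real mongoimport.sh before relying on them, since the script in the repo is the authoritative configuration.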
