A set of tools to support downloading GDELT data

Loading GDELT data into MongoDB

This is a set of programs for loading the GDELT 2.0 data set into MongoDB.

Quick Start

Install the latest version of Python from python.org. You need at least version 3.6 for this program. Many pre-installed versions of Python are only 2.7, which will not work.

Now install gdelttools

pip install gdelttools

Now get the master file of all the GDELT files.

gdeltloader --master --update

This will generate a file named something like gdelt-update-file-04-19-2022-19-33-56.txt
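If several of these files accumulate, the newest one can be picked by the timestamp in its name. The following is a minimal sketch, not part of gdelttools; it assumes the name always follows the pattern gdelt-update-file-MM-DD-YYYY-HH-MM-SS.txt seen in the example above, and the helper names parse_stamp and newest_update_file are made up for illustration.

```python
import re
from datetime import datetime
from pathlib import Path

# Assumed name pattern, inferred from the example file name:
# gdelt-update-file-MM-DD-YYYY-HH-MM-SS.txt
_NAME_RE = re.compile(
    r"^gdelt-update-file-(\d{2}-\d{2}-\d{4}-\d{2}-\d{2}-\d{2})\.txt$"
)


def parse_stamp(filename):
    """Return the datetime embedded in a generated file name, or None."""
    match = _NAME_RE.match(filename)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%m-%d-%Y-%H-%M-%S")


def newest_update_file(directory="."):
    """Return the Path of the most recently stamped update file, or None."""
    stamped = []
    for path in Path(directory).iterdir():
        stamp = parse_stamp(path.name)
        if stamp is not None:
            stamped.append((stamp, path))
    if not stamped:
        return None
    # Compare on the timestamp only, not on the path.
    return max(stamped, key=lambda pair: pair[0])[1]
```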

Now, using the file you just generated, run this grep command to extract the most recent 365 export entries. Note you will need to substitute the name of the file you just created.

grep export gdelt-update-file-[MM-DD-YYYY-HH-MM-SS].txt | tail -n 365 > last_365_days.txt

Now you can download the files in the list you just created using the command:

gdeltloader --download --local last_365_days.txt

GDELT 2.0 Encoding and Structure

The GDELT dataset is a large dataset of news events that is updated in real time. GDELT stands for Global Database of Events, Language, and Tone. The format of records in GDELT data is defined by the GDELT 2.0 Cookbook.

Each record uses an encoding method called CAMEO coding which is defined by the CAMEO cookbook.

Once you understand the GDELT record structure and the CAMEO encoding you will be able to decode a record. To fully decode a record you may need the TABARI dictionaries from which the CAMEO encoding is derived.
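As a small illustration of what decoding involves, the sketch below splits a CAMEO event code into its hierarchy levels and maps the GDELT QuadClass value to its label. It assumes event codes are two to four digits with the first two digits forming the root category, and that QuadClass takes the values 1 to 4; the function name split_event_code is hypothetical. Check both assumptions against the cookbooks before relying on them.

```python
# QuadClass labels as assumed here; verify against the GDELT cookbook.
QUAD_CLASS = {
    1: "Verbal Cooperation",
    2: "Material Cooperation",
    3: "Verbal Conflict",
    4: "Material Conflict",
}


def split_event_code(event_code):
    """Split a CAMEO event code into root, base and full levels.

    Codes are handled as strings so that leading zeros survive.
    """
    code = str(event_code).strip()
    if not code.isdigit() or not 2 <= len(code) <= 4:
        raise ValueError("unexpected CAMEO event code: %r" % (event_code,))
    return {
        "root": code[:2],   # top-level category
        "base": code[:3],   # second level (same as root for 2-digit codes)
        "full": code,       # most specific code available
    }
```

Note that loading these codes as integers would turn "0251" into 251 and lose the root category, so it is safer to keep them as strings.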

How to download GDELT 2.0 data

The gdeltloader script can download CAMEO data and unzip the files so that they can be loaded into MongoDB.

usage: gdeltloader [-h] [--host HOST] [--master] [--update]
                   [--database DATABASE] [--collection COLLECTION]
                   [--local LOCAL] [--overwrite] [--download] [--metadata]

optional arguments:
  -h, --help            show this help message and exit
  --host HOST           MongoDB URI
  --master              GDELT master file [False]
  --update              GDELT update file [False]
  --database DATABASE   Default database for loading [GDELT]
  --collection COLLECTION
                        Default collection for loading [events_csv]
  --local LOCAL         load data from local list of zips
  --overwrite           Overwrite files when they exist already
  --download            download zip files from master or local file
  --metadata            grab meta data files

To start, get the master and update lists of event files.

gdeltloader --master --update

Now grab the subset of files you want. For this example, let's grab the most recent 365 export files. There are three types of files in the master and update files:

150383 297a16b493de7cf6ca809a7cc31d0b93 http://data.gdeltproject.org/gdeltv2/20150218230000.export.CSV.zip
318084 bb27f78ba45f69a17ea6ed7755e9f8ff http://data.gdeltproject.org/gdeltv2/20150218230000.mentions.CSV.zip
10768507 ea8dde0beb0ba98810a92db068c0ce99 http://data.gdeltproject.org/gdeltv2/20150218230000.gkg.csv.zip

Export files contain event data. Mentions files contain other mentions of the initial news event in the current 15-minute cycle. GKG files contain the global knowledge graph.
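Each line in these lists can be split into its three fields and classified by file type. The sketch below is illustrative only and is not how gdeltloader itself parses the file. It assumes the columns are size in bytes, a hex digest, and the URL, and that the file name begins with a YYYYMMDDHHMMSS timestamp followed by the type; MasterEntry and parse_master_line are hypothetical names.

```python
from collections import namedtuple
from datetime import datetime

MasterEntry = namedtuple("MasterEntry", "size md5 url kind timestamp")


def parse_master_line(line):
    """Parse one 'size digest url' line from a master or update file."""
    size, md5, url = line.split()
    # e.g. 20150218230000.export.CSV.zip
    filename = url.rsplit("/", 1)[-1]
    stamp, kind = filename.split(".")[:2]
    return MasterEntry(
        size=int(size),
        md5=md5,
        url=url,
        kind=kind.lower(),  # "export", "mentions" or "gkg"
        timestamp=datetime.strptime(stamp, "%Y%m%d%H%M%S"),
    )
```

A line that does not have exactly three whitespace-separated fields raises ValueError, which makes truncated downloads of the master file easy to spot.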

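We just want the most recent events, so we use the master file to get the last 365 export entries as follows. Each entry covers one 15-minute cycle, so 365 entries span roughly 91 hours rather than 365 days; increase the count passed to tail to cover a longer period.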

$ grep export gdelt_master-file-04-08-2019-14-13-28.txt | tail -n 365 > last_365_days.txt
$ wc last_365_days.txt
  365  1095 38847 last_365_days.txt
$
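On systems without grep and tail, the same selection can be done in Python. This is a minimal sketch that mirrors the shell pipeline above; it assumes one export file per 15-minute cycle when converting days to entries, and the names last_export_lines, write_selection and EXPORTS_PER_DAY are made up for illustration.

```python
from collections import deque

# Assumption: one export file per 15-minute cycle, so 4 * 24 per day.
EXPORTS_PER_DAY = 4 * 24


def last_export_lines(lines, count=365):
    """Mimic `grep export FILE | tail -n COUNT` on an iterable of lines."""
    matches = (line.rstrip("\n") for line in lines if "export" in line)
    # A bounded deque keeps only the final `count` matches in memory.
    return list(deque(matches, maxlen=count))


def write_selection(master_path, out_path, count=365):
    """Write the last `count` export entries of master_path to out_path."""
    with open(master_path) as master:
        selected = last_export_lines(master, count)
    with open(out_path, "w") as out:
        out.writelines(line + "\n" for line in selected)
    return len(selected)
```

Under the 15-minute assumption, a full year of events would need a count of 365 * EXPORTS_PER_DAY entries.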

Now download the data.

gdeltloader --download --local last_365_days.txt 

The --host argument gives the MongoDB URI of the database used to store the downloaded data. The --local argument gives the location on disk of the local file that lists the zips to download. This command will download all the associated zip files and unpack them into uncompressed .CSV files.
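To double-check a download by hand, the size and digest listed in the master file can be compared against the zip on disk before unpacking it. The sketch below is not taken from gdeltloader; it assumes the second column of the master file is an MD5 hex digest of the zip, and md5_of and verify_and_unpack are hypothetical helper names.

```python
import hashlib
import zipfile
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_unpack(zip_path, expected_size, expected_md5, out_dir="."):
    """Check size and digest, then extract; return the extracted names."""
    zip_path = Path(zip_path)
    actual_size = zip_path.stat().st_size
    if actual_size != expected_size:
        raise ValueError(
            "size mismatch for %s: %d != %d"
            % (zip_path, actual_size, expected_size)
        )
    # Assumption: the master file lists an MD5 digest of the zip file.
    if md5_of(zip_path) != expected_md5.lower():
        raise ValueError("digest mismatch for %s" % zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        return archive.namelist()
```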

Now import the CSV files with mongoimport.

Need mongoimport example here
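Pending that example, here is a minimal sketch of how a mongoimport command line might be assembled from Python. It makes several assumptions that should be checked against your data: that the unpacked .CSV files are actually tab-delimited and have no header row (hence --type tsv and an explicit field list), that MongoDB is running locally on the default port, and that you supply the column names yourself from the GDELT cookbook. The function name mongoimport_command is hypothetical.

```python
import shlex


def mongoimport_command(csv_path, field_names,
                        database="GDELT", collection="events_csv"):
    """Build the argument list for importing one unpacked GDELT file."""
    return [
        "mongoimport",
        "--db", database,
        "--collection", collection,
        # Assumption: the files are tab-delimited despite the .CSV name.
        "--type", "tsv",
        # Assumption: no header row, so column names are passed explicitly.
        "--fields", ",".join(field_names),
        "--file", csv_path,
    ]


def as_shell(command):
    """Render an argument list as a copy-and-paste shell command."""
    return " ".join(shlex.quote(part) for part in command)
```

The list form can be passed directly to subprocess.run, which avoids shell quoting problems; as_shell is only for printing. A remote deployment would also need a connection option such as --uri.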

Transforming the data

You can generate GeoJSON points from the existing geo-location lat/long fields by using gdelttools/mapgeolocation.py.

usage: mapgeolocation.py [-h] [--host HOST] [--database DATABASE] [-i INPUTCOLLECTION] [-o OUTPUTCOLLECTION]

optional arguments:
  -h, --help            show this help message and exit
  --host HOST           MongoDB URI [mongodb://localhost:27017]
  --database DATABASE   Default database for loading [GDELT]
  -i INPUTCOLLECTION, --inputcollection INPUTCOLLECTION
                        Default collection for input [events_csv]
  -o OUTPUTCOLLECTION, --outputcollection OUTPUTCOLLECTION
                        Default collection for output [events]

This program expects to read and write data from a database called GDELT. The default input collection is events_csv and the default output collection is events.
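For reference, the kind of conversion involved can be sketched as follows. This is an illustration of building a GeoJSON point from a latitude and longitude pair, not the actual code in mapgeolocation.py; the function name to_geojson_point is hypothetical, and treating missing or out-of-range values as "no point" is an assumption.

```python
def to_geojson_point(latitude, longitude):
    """Build a GeoJSON Point, or return None if a coordinate is unusable."""
    try:
        lat = float(latitude)
        lon = float(longitude)
    except (TypeError, ValueError):
        # Empty strings and None are treated as missing locations.
        return None
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        return None
    # GeoJSON positions are ordered [longitude, latitude].
    return {"type": "Point", "coordinates": [lon, lat]}
```

The coordinate order matters: GeoJSON, and therefore a MongoDB 2dsphere index, expects longitude first, which is the reverse of the usual lat/long reading order.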

To transform the collections run:

python gdelttools/mapgeolocation.py
Processed documents total : 247441

If you run mapgeolocation.py again on the same dataset, it will overwrite the existing records. Each new dataset will be merged into the existing collection of documents.

Source Distribution

gdelttools-0.4a13.tar.gz (13.4 kB)

Uploaded Source

Built Distribution

gdelttools-0.4a13-py3-none-any.whl (12.6 kB)

Uploaded Python 3
