Reconstruct full-text news articles from GDELT Web News NGrams 3.0

Reconstructing Full-Text News Articles from GDELT - gdeltnews

Reconstruct full news article text from the GDELT Web News NGrams 3.0 dataset.

This package helps you:

  1. download GDELT Web NGrams files for a time range,
  2. reconstruct article text from overlapping n-gram fragments,
  3. filter and merge reconstructed CSVs using Boolean queries.

To learn more about the dataset, please visit the official announcement: https://blog.gdeltproject.org/announcing-the-new-web-news-ngrams-3-0-dataset/

Input files are gzip-compressed JSON files named by timestamp (YYYYMMDDHHMMSS), for example: http://data.gdeltproject.org/gdeltv3/webngrams/20250316000100.webngrams.json.gz
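
The file name carries the timestamp of the batch. As a minimal sketch (the helper name timestamp_from_url is hypothetical and not part of gdeltnews), the timestamp can be recovered from a file URL with the standard library alone:

```python
import posixpath
from datetime import datetime
from urllib.parse import urlparse


def timestamp_from_url(url: str) -> datetime:
    """Extract the YYYYMMDDHHMMSS timestamp from a Web NGrams file URL.

    Hypothetical helper for illustration; gdeltnews does not expose it.
    """
    filename = posixpath.basename(urlparse(url).path)
    stamp = filename.split(".", 1)[0]  # text before the first dot
    return datetime.strptime(stamp, "%Y%m%d%H%M%S")


url = "http://data.gdeltproject.org/gdeltv3/webngrams/20250316000100.webngrams.json.gz"
print(timestamp_from_url(url))  # 2025-03-16 00:01:00
```

This can be handy for checking which time window a folder of downloaded files actually covers.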

Reconstruction quality depends on which n-gram fragments the dataset contains for each article: where fragments are missing, the reconstructed text may have gaps.
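
To give an intuition for stitching overlapping fragments, here is a toy sketch. It is not the algorithm used by gdeltnews: it assumes each fragment is a plain string of whitespace-separated words, and the function names merge_overlap and stitch are made up for this example. Fragments are chained greedily whenever the end of one matches the start of another.

```python
from __future__ import annotations


def merge_overlap(left: list[str], right: list[str]) -> list[str] | None:
    """Join two token lists if a suffix of `left` equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left[-size:] == right[:size]:
            return left + right[size:]
    return None  # no shared tokens at the boundary


def stitch(fragments: list[str]) -> str:
    """Greedily grow the first fragment by appending or prepending the others."""
    text = fragments[0].split()
    remaining = [fragment.split() for fragment in fragments[1:]]
    progress = True
    while remaining and progress:
        progress = False
        for fragment in remaining:
            merged = merge_overlap(text, fragment)      # try to append
            if merged is None:
                merged = merge_overlap(fragment, text)  # try to prepend
            if merged is not None:
                text = merged
                remaining.remove(fragment)
                progress = True
                break
    return " ".join(text)


print(stitch(["brown fox jumps over", "the quick brown fox", "over the lazy dog"]))
# the quick brown fox jumps over the lazy dog

# With the middle fragment missing, nothing overlaps and the text stays partial.
print(stitch(["the quick brown fox", "over the lazy dog"]))
# the quick brown fox
```

The second call illustrates why output quality depends on fragment coverage: without a bridging fragment, the two pieces cannot be joined.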

Install

pip install gdeltnews

Quickstart and Docs

The package is documented here.

Step 1: Download Web NGrams files

from gdeltnews.download import download

download(
    "2025-11-25T10:00:00",
    "2025-11-25T13:59:00",
    outdir="gdeltdata",
    decompress=False,
)
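
To inspect what was downloaded before running the reconstruction, the following sketch reads a compressed file record by record. It assumes the files were kept as .json.gz and contain one JSON object per line; the helper iter_records is hypothetical and not part of gdeltnews.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator


def iter_records(path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a gzip-compressed JSON-lines file.

    Assumes newline-delimited JSON; adjust if the file layout differs.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def first_record_keys(path: Path) -> list[str]:
    """Return the sorted field names of the first record, or [] if the file is empty."""
    for record in iter_records(path):
        return sorted(record)
    return []
```

For example, calling first_record_keys on any file in gdeltdata shows which fields each n-gram record carries, without loading the whole file into memory.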

Step 2: Reconstruct articles (run as a script, not in Jupyter)

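Reconstruction runs in multiple processes, and multiprocessing can be unreliable inside Jupyter notebooks. Run this step from a standalone .py script and keep the if __name__ == "__main__" guard shown below.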

from gdeltnews.reconstruct import reconstruct

def main():
    reconstruct(
        input_dir="gdeltdata",
        output_dir="gdeltpreprocessed",
        language="it",
        url_filters=["repubblica.it", "corriere.it"],
        processes=10,  # use None for all available cores
    )

if __name__ == "__main__":
    main()
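
When picking a value for processes, it can help to cap the request at the number of cores the machine reports. The helper below is a user-side sketch only (resolve_processes is a hypothetical name, and gdeltnews may handle the value differently internally):

```python
from __future__ import annotations

import os


def resolve_processes(requested: int | None) -> int:
    """Return a worker count between 1 and the number of reported CPU cores.

    None means "use every core", mirroring the comment in the script above.
    """
    available = os.cpu_count() or 1  # cpu_count() can return None
    if requested is None:
        return available
    return max(1, min(requested, available))


print(resolve_processes(10))
```

The result can then be passed as processes=resolve_processes(10) so that the same script behaves sensibly on both a laptop and a larger server.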

Step 3: Filter, deduplicate, and merge CSVs

from gdeltnews.filtermerge import filtermerge

filtermerge(
    input_dir="gdeltpreprocessed",
    output_file="final_filtered_dedup.csv",
    query='((elezioni OR voto) AND (regionali OR campania)) OR ((fico OR cirielli) AND NOT veneto)'
)
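
To illustrate how a query like the one above can be read, here is a small stand-alone evaluator. It is a sketch, not the filtermerge implementation: it assumes case-insensitive whole-word matching and the usual precedence (NOT binds tighter than AND, which binds tighter than OR). The function name matches is made up for this example.

```python
import re


def matches(query: str, text: str) -> bool:
    """Evaluate a Boolean query (AND, OR, NOT, parentheses) against a text."""
    words = set(re.findall(r"\w+", text.lower()))
    tokens = re.findall(r"\(|\)|[^\s()]+", query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def parse_or():
        value = parse_and()
        while peek() == "OR":
            advance()
            rhs = parse_and()  # always parse the right side, then combine
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() == "AND":
            advance()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():
        token = peek()
        if token is None or token in (")", "AND", "OR"):
            raise ValueError(f"unexpected token: {token!r}")
        if token == "NOT":
            advance()
            return not parse_not()
        if token == "(":
            advance()
            value = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return value
        return advance().lower() in words  # plain search term

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos]!r}")
    return result


query = "((elezioni OR voto) AND (regionali OR campania)) OR ((fico OR cirielli) AND NOT veneto)"
print(matches(query, "Elezioni regionali in Campania"))  # True
print(matches(query, "Fico in visita in Veneto"))        # False
print(matches(query, "Cirielli commenta il voto"))       # True
```

The second text is rejected because it mentions veneto, which the right-hand branch excludes, and it contains none of the election terms required by the left-hand branch.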

Advanced users can pre-filter and download GDELT data via Google BigQuery, then process it directly with wordmatch.py.

Citation

If you use this package for research, please cite:

A. Fronzetti Colladon, R. Vestrelli (2025). “A Python Tool for Reconstructing Full News Text from GDELT.” https://arxiv.org/abs/2504.16063

Credits

Code co-developed with robves99.
