Reconstruct full-text news articles from GDELT Web News NGrams 3.0

Project description

Reconstructing Full-Text News Articles from GDELT - gdeltnews

Reconstruct full news article text from the GDELT Web News NGrams 3.0 dataset.

This package helps you:

  1. download GDELT Web NGrams files for a time range,
  2. reconstruct article text from overlapping n-gram fragments,
  3. filter and merge reconstructed CSVs using Boolean queries.

To learn more about the dataset, please visit the official announcement: https://blog.gdeltproject.org/announcing-the-new-web-news-ngrams-3-0-dataset/

Input files look like: http://data.gdeltproject.org/gdeltv3/webngrams/20250316000100.webngrams.json.gz
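
Each minute has its own file, named by its timestamp. As a minimal sketch, assuming the naming pattern in the example URL above holds for every minute:

from datetime import datetime

def webngrams_url(ts: datetime) -> str:
    # Build the per-minute file URL; the YYYYMMDDHHMMSS naming pattern is
    # inferred from the example URL above, not from an official spec.
    return (
        "http://data.gdeltproject.org/gdeltv3/webngrams/"
        + ts.strftime("%Y%m%d%H%M%S")
        + ".webngrams.json.gz"
    )

print(webngrams_url(datetime(2025, 3, 16, 0, 1)))
# -> http://data.gdeltproject.org/gdeltv3/webngrams/20250316000100.webngrams.json.gz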

Reconstruction quality depends on the n-gram fragments available in the dataset.

Docs

The package documentation is available here; a more detailed explanation of the functions’ logic is provided in the accompanying paper.

GUI Version

If you prefer software with a graphical user interface that runs this code, you can find it here and read the instructions here.

Python Package Quickstart

Install

pip install gdeltnews

Step 1: Download Web NGrams files

from gdeltnews.download import download

download(
    "2025-11-25T10:00:00",  # start of the time range
    "2025-11-25T13:59:00",  # end of the time range
    outdir="gdeltdata",
    decompress=False,  # .json.gz files are read directly in Step 2
)
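
A quick sanity check of the result; a minimal sketch, assuming download() places the .json.gz files directly inside outdir:

from pathlib import Path

# Count and peek at the downloaded per-minute files
# (assumes they sit directly in the output directory).
files = sorted(Path("gdeltdata").glob("*.webngrams.json.gz"))
print(len(files), "files downloaded")
for f in files[:3]:
    print(f.name, f.stat().st_size, "bytes")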

Step 2: Reconstruct articles (run as a script, not in Jupyter)

Multiprocessing can be problematic inside notebooks. Run this from a .py script. The compressed .json.gz files are read directly, so you do not need to decompress them first.

from multiprocessing import freeze_support
from gdeltnews.reconstruct import reconstruct

def main():
    reconstruct(
        input_dir="gdeltdata",
        output_dir="gdeltpreprocessed",
        language="it",
        url_filters=["repubblica.it", "corriere.it"],
        processes=10,  # use None for all available cores
    )

if __name__ == "__main__":
    freeze_support()  # important on Windows
    main()
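
Save the code above in a standalone file (for example run_reconstruct.py; the name is just an illustration) and launch it from a terminal:

python run_reconstruct.py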

Step 3: Filter, deduplicate, and merge CSVs

from gdeltnews.filtermerge import filtermerge

filtermerge(
    input_dir="gdeltpreprocessed",
    output_file="final_filtered_dedup.csv",
    query='((elezioni OR voto) AND (regionali OR campania)) OR ((fico OR cirielli) AND NOT veneto)'
)
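
The output is a plain CSV, so it can be inspected with pandas; a minimal sketch (column names depend on the reconstruction step, so look at df.columns first):

import pandas as pd

# Load the filtered, deduplicated articles and take a first look.
df = pd.read_csv("final_filtered_dedup.csv")
print(df.shape)
print(df.columns.tolist())
print(df.head())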

Advanced users can pre-filter and download GDELT data via Google BigQuery, then process it directly with wordmatch.py.
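
As a rough illustration of that route, here is a sketch using the google-cloud-bigquery client. The table and column names are placeholders, not the actual GDELT schema; check GDELT's BigQuery documentation for the real Web NGrams table before running anything like this.

from google.cloud import bigquery

client = bigquery.Client()

# NOTE: table and column names below are illustrative placeholders only;
# look up the real GDELT Web NGrams table and schema on BigQuery first.
query = """
SELECT *
FROM `gdelt-bq.some_dataset.webngrams`   -- placeholder table name
WHERE lang = 'it'                        -- placeholder column name
  AND url LIKE '%repubblica.it%'
"""
df = client.query(query).to_dataframe()
df.to_csv("bigquery_prefiltered.csv", index=False)

The exported CSV can then be processed with wordmatch.py as described above.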

Citation and Credits

If you use this package for research, please cite:

Fronzetti Colladon, A., & Vestrelli, R. (2026). Free Access to World News: Reconstructing Full-Text Articles from GDELT. Big Data and Cognitive Computing, 10(2), 45. https://doi.org/10.3390/bdcc10020045

Code co-developed with robves99.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gdeltnews-1.0.20.tar.gz (53.5 kB)


Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

gdeltnews-1.0.20-py3-none-any.whl (42.6 kB)


File details

Details for the file gdeltnews-1.0.20.tar.gz.

File metadata

  • Download URL: gdeltnews-1.0.20.tar.gz
  • Size: 53.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for gdeltnews-1.0.20.tar.gz

  • SHA256: ad843a220b51b9cc94dc89c0e88aaaea972631924866f94e87b3c6c26d2dd55d
  • MD5: 222670992de363cd2be5283419d6299c
  • BLAKE2b-256: 614f2ca220c7d032d6c04dd5233e8a5b0a73bb5fd35e484ad5ecaa63dea6e832

See more details on using hashes here.

File details

Details for the file gdeltnews-1.0.20-py3-none-any.whl.

File metadata

  • Download URL: gdeltnews-1.0.20-py3-none-any.whl
  • Size: 42.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for gdeltnews-1.0.20-py3-none-any.whl

  • SHA256: c0be0b1833b150a1a5812363a00802e580bcf82364fc0dce131b69da52744f0b
  • MD5: a1389dbedafa52035d2c1398a0aee3ca
  • BLAKE2b-256: ce59fcd0eb6c3e614b5dbeadcd11e92c35ad4fc44ef82de7c7b0783a9c0c23b3

See more details on using hashes here.
