
Reconstructing Full-Text News Articles from GDELT - gdeltnews

Reconstruct full news article text from the GDELT Web News NGrams 3.0 dataset.

This package helps you:

  1. download GDELT Web NGrams files for a time range,
  2. reconstruct article text from overlapping n-gram fragments,
  3. filter and merge reconstructed CSVs using Boolean queries.

To learn more about the dataset, please visit the official announcement: https://blog.gdeltproject.org/announcing-the-new-web-news-ngrams-3-0-dataset/

Input files look like: http://data.gdeltproject.org/gdeltv3/webngrams/20250316000100.webngrams.json.gz
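The file names encode a UTC timestamp (YYYYMMDDHHMMSS). Assuming one file is published per minute, as the minute-level name above suggests, the URLs for a time range can be enumerated like this (an illustrative sketch; the package's `download` function handles this for you):

```python
from datetime import datetime, timedelta

BASE = "http://data.gdeltproject.org/gdeltv3/webngrams/"

def ngram_urls(start: str, end: str):
    """Yield one Web NGrams file URL per minute between two ISO timestamps (inclusive)."""
    t = datetime.fromisoformat(start)
    stop = datetime.fromisoformat(end)
    while t <= stop:
        # seconds are always "00" in the published file names
        yield f"{BASE}{t:%Y%m%d%H%M}00.webngrams.json.gz"
        t += timedelta(minutes=1)

urls = list(ngram_urls("2025-03-16T00:01:00", "2025-03-16T00:03:00"))
# urls[0] is the example URL shown above
```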

Reconstruction quality depends on the n-gram fragments available in the dataset.
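To illustrate the general idea only (this is not the package's actual algorithm), overlapping word fragments can be chained greedily by their longest suffix/prefix overlap:

```python
def stitch(fragments):
    """Greedily chain word-level fragments by their longest suffix/prefix overlap."""
    text = list(fragments[0])
    for frag in fragments[1:]:
        frag = list(frag)
        # find the longest overlap between the end of `text` and the start of `frag`
        best = 0
        for k in range(min(len(text), len(frag)), 0, -1):
            if text[-k:] == frag[:k]:
                best = k
                break
        text.extend(frag[best:])  # append only the non-overlapping tail
    return " ".join(text)

frags = [
    ["the", "prime", "minister", "said"],
    ["minister", "said", "on", "tuesday"],
    ["on", "tuesday", "that", "talks"],
]
print(stitch(frags))  # the prime minister said on tuesday that talks
```

When fragments do not overlap (or overlap ambiguously), gaps and errors appear, which is why the available n-gram coverage bounds reconstruction quality.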

Docs

The package's documentation is available here, and the accompanying paper provides a more detailed explanation of the functions' logic.

GUI Version

If you prefer software with a graphical user interface that runs this code, you can find it here and read the instructions here.

Python Package Quickstart

Install

pip install gdeltnews

Step 1: Download Web NGrams files

from gdeltnews.download import download

download(
    "2025-11-25T10:00:00",
    "2025-11-25T13:59:00",
    outdir="gdeltdata",
    decompress=False,
)

Step 2: Reconstruct articles (run as a script, not in Jupyter)

Multiprocessing can be problematic inside notebooks. Run this from a .py script.

from multiprocessing import freeze_support
from gdeltnews.reconstruct import reconstruct

def main():
    reconstruct(
        input_dir="gdeltdata",
        output_dir="gdeltpreprocessed",
        language="it",
        url_filters=["repubblica.it", "corriere.it"],
        processes=10,  # use None for all available cores
    )

if __name__ == "__main__":
    freeze_support()  # important on Windows
    main()

Step 3: Filter, deduplicate, and merge CSVs

from gdeltnews.filtermerge import filtermerge

filtermerge(
    input_dir="gdeltpreprocessed",
    output_file="final_filtered_dedup.csv",
    query='((elezioni OR voto) AND (regionali OR campania)) OR ((fico OR cirielli) AND NOT veneto)'
)
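The query combines OR, AND, and NOT with parentheses. As a hypothetical sketch of the matching logic (the package's actual query parser may differ, e.g. in tokenization or phrase handling), the query above is equivalent to:

```python
def matches(text: str) -> bool:
    """Evaluate the example Boolean query against one article's text."""
    w = set(text.lower().split())
    clause1 = ("elezioni" in w or "voto" in w) and ("regionali" in w or "campania" in w)
    clause2 = ("fico" in w or "cirielli" in w) and "veneto" not in w
    return clause1 or clause2

print(matches("Le elezioni regionali in Campania"))  # True
print(matches("Zaia parla del Veneto"))              # False
```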

Advanced users can pre-filter and download GDELT data via Google BigQuery, then process it directly with wordmatch.py.

Citation and Credits

If you use this package for research, please cite: Fronzetti Colladon, A., & Vestrelli, R. (2026). Free Access to World News: Reconstructing Full-Text Articles from GDELT. Big Data and Cognitive Computing, 10(2), 45. https://doi.org/10.3390/bdcc10020045

Code co-developed with robves99.
