gdeltnews-rs

Rust-powered reconstruction of full-text news articles from the GDELT Web News NGrams 3.0 dataset.

This is a from-scratch reimplementation of the gdeltnews Python package by Andrea Fronzetti Colladon and Roberto Vestrelli. It preserves the same reconstruction algorithm and output quality while delivering significantly faster performance through a Rust core exposed via PyO3.

Why This Exists

The original gdeltnews package demonstrated that full-text news articles can be reconstructed from GDELT's fragmented n-gram data with up to 95% textual fidelity (see the paper). However, the Python implementation is CPU-bound on the reconstruction step — the authors reported 1 hour 8 minutes to process 39 files on a single core.

This package reduces that to seconds.

Performance

Benchmarked on a 1 MB gzipped GDELT Web NGrams file (42K fragments, 18 articles):

Implementation                   Time     Speedup
Original Python (single core)    12.3s    1x
gdeltnews-rs                     0.18s    67x

On a larger 55 MB file (687K fragments, 1,176 articles), gdeltnews-rs completed in 11.3s. The Python benchmark on this file was early-stopped (it was taking too long to be practical), so there is no direct comparison. However, the original authors reported 1 hour 8 minutes to process 39 files on a single core — roughly 1.7 minutes per file. Extrapolating from the 1 MB benchmark ratio, the real speedup on larger files is likely well above 67x, as larger articles with more fragments amplify Rust's advantage: zero-copy slice comparisons avoid the repeated list allocation overhead that dominates in Python.

These numbers include gzip decompression, JSON parsing, fragment creation, overlap-based assembly, and JSONL output — the entire pipeline. The Rust version also reads .gz files directly, eliminating the separate decompression step.

What Changed from the Original

Architecture

  • Rust core via PyO3: The CPU-intensive path (JSON parsing, gzip decompression, fragment creation, overlap assembly, JSONL output) is implemented in Rust and compiled to a native Python extension. Users install with pip and get compiled wheels — no Rust toolchain needed.
  • Python wrapper: Download (HTTP/I/O bound) and filtermerge (light CPU) remain in Python. The __init__.py re-exports everything for a clean API.
  • Rayon parallelism: The original used Python's multiprocessing.Pool, which requires serializing data across process boundaries. Rust's rayon provides low-overhead work-stealing thread parallelism at two levels: across files, and across articles within each file. No freeze_support() needed.

Output Format

  • JSONL instead of pipe-delimited CSV: The original used | as a delimiter with QUOTE_NONE, which breaks silently when article text contains | or newlines. JSONL (one JSON object per line) handles all text content safely and is natively supported by pandas: pd.read_json("output.jsonl", lines=True).
  • Reconstruction metadata: Each output record includes fragments_used and fragments_total counts, so you can assess reconstruction completeness per article.
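Because each record carries these counts, per-article completeness can be computed directly when loading the output. A minimal stdlib sketch (the sample values and URL are hypothetical; field names follow the Output Format section):

```python
import json

# One JSONL line per reconstructed article (schema from the Output Format section).
line = '{"text": "Sample article text", "url": "https://example.com/a", "fragments_used": 240, "fragments_total": 285}'

record = json.loads(line)
completeness = record["fragments_used"] / record["fragments_total"]
print(f"{record['url']}: {completeness:.0%} of fragments placed")
```

The same ratio is easy to compute over a whole file with pandas after pd.read_json("output.jsonl", lines=True).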

Algorithm

The reconstruction algorithm is identical to the original: greedy maximum-overlap assembly with position constraints. This was a deliberate decision, since the original algorithm achieves up to 95% textual fidelity as validated against EventRegistry ground truth. The performance gain comes entirely from the language difference:

  • Zero-copy slice comparison: Python's result_words[-k:] == words[:k] creates two new lists on every comparison. Rust's result[len-k..] == candidate[..k] compares memory in place.
  • No garbage collector: Python allocates and deallocates thousands of temporary list objects during the overlap search. Rust uses stack-allocated slices.
  • Native string handling: No interpreter overhead on the tight inner loops.
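The merge step at the heart of the overlap search can be illustrated with a short Python sketch (the function name is illustrative; the package performs this comparison in Rust without the list allocations):

```python
def merge_max_overlap(left, right):
    """Join two word lists at their largest suffix/prefix overlap.

    Mirrors the left[-k:] == right[:k] comparison described above; each
    slice here allocates a new list, which is exactly the overhead the
    Rust core avoids with in-place slice comparison.
    """
    max_k = min(len(left), len(right))
    for k in range(max_k, 0, -1):    # greedy: try the largest overlap first
        if left[-k:] == right[:k]:
            return left + right[k:]  # splice, dropping the duplicated words
    return left + right              # no overlap: plain concatenation

merged = merge_max_overlap(
    ["the", "quick", "brown", "fox"],
    ["brown", "fox", "jumps", "over"],
)
# merged == ["the", "quick", "brown", "fox", "jumps", "over"]
```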

What Was Removed

  • GUI: The tkinter GUI is not included. This package is library-only.
  • CSV output: Replaced by JSONL. The filtermerge module reads JSONL.
  • Decompression step: The Rust engine reads .gz files directly via flate2. No need to decompress to disk first.
  • freeze_support() / multiprocessing boilerplate: Rayon threads work in any context including Jupyter notebooks.

What Was Kept

  • Same reconstruction quality: The greedy overlap algorithm, position constraints, slash artifact removal, and circular overlap cleanup are all preserved exactly.
  • Same download logic: HTTP downloads from data.gdeltproject.org with the same URL format and time range enumeration.
  • Same Boolean query syntax: AND, OR, NOT, parentheses, and quoted phrases for filtermerge.
  • Same language/URL filtering: Filter by language code and URL substrings.
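The query from the Quickstart, '(elections OR primaries) AND democrats AND NOT republicans', is equivalent in spirit to the following plain-Python predicate. This only illustrates the Boolean semantics; it is not the package's parser, and the real tokenization and case handling are up to filtermerge:

```python
def matches(text):
    """Illustrative equivalent of
    '(elections OR primaries) AND democrats AND NOT republicans'."""
    t = text.lower()
    return ("elections" in t or "primaries" in t) \
        and "democrats" in t \
        and "republicans" not in t

print(matches("Democrats sweep the primaries"))       # True
print(matches("Republicans react to the elections"))  # False
```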

Install

pip install gdeltnews-rs

Pre-built wheels are provided for common platforms. If no wheel matches your platform, pip will build from source (requires a Rust toolchain).

For development:

git clone <this-repo>
cd gdeltnews
python -m venv .venv && source .venv/bin/activate
pip install maturin requests tqdm "boolean.py>=5.0"
maturin develop --release

Quickstart

Step 1: Download GDELT files

from gdeltnews_rs import download

download(
    "2025-01-15T10:00:00",
    "2025-01-15T13:59:00",
    outdir="gdeltdata",
)

Unlike the original package, there is no decompress parameter. The Rust engine reads .gz files directly.

Step 2: Reconstruct articles

from gdeltnews_rs import reconstruct

reconstruct(
    "gdeltdata",
    "output",
    language="en",
    url_filters=["reuters.com", "nytimes.com"],
)

This processes all .gz and .json files in the input directory in parallel and writes one JSONL file per input file to the output directory.

Works in scripts, notebooks, anywhere — no freeze_support() needed.

Step 3: Filter and deduplicate

from gdeltnews_rs import filtermerge

filtermerge(
    "output",
    "final.jsonl",
    query='(elections OR primaries) AND democrats AND NOT republicans',
)

Reads all JSONL files from the output directory, applies the Boolean query, deduplicates by URL (keeping the longest text), and writes a single JSONL file.
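The deduplication rule (one record per URL, keeping the longest text) can be sketched as follows. The helper name and the simplified two-field record shape are hypothetical; real records carry the full Output Format schema:

```python
def dedupe_by_url(records):
    """Keep one record per URL, preferring the longest reconstructed text."""
    best = {}
    for rec in records:
        kept = best.get(rec["url"])
        if kept is None or len(rec["text"]) > len(kept["text"]):
            best[rec["url"]] = rec
    return list(best.values())

records = [
    {"url": "https://example.com/a", "text": "short"},
    {"url": "https://example.com/a", "text": "a much longer reconstruction"},
    {"url": "https://example.com/b", "text": "unique"},
]
deduped = dedupe_by_url(records)
# Two records remain; the longer text wins for the duplicated URL.
```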

Low-level API

For more control, you can call the Rust engine directly:

from gdeltnews_rs import reconstruct_file, reconstruct_file_to_jsonl

# Get articles as Python objects
articles = reconstruct_file("path/to/file.json.gz", language="en")
for article in articles:
    print(article.url, len(article.text), f"{article.fragments_used}/{article.fragments_total}")

# Or write directly to JSONL
count = reconstruct_file_to_jsonl("input.json.gz", "output.jsonl", language="en")

Output Format

Each line in the JSONL output is a JSON object:

{
  "text": "Full reconstructed article text...",
  "date": "2025-01-15",
  "url": "https://www.reuters.com/article/...",
  "source": "reuters.com",
  "fragments_used": 285,
  "fragments_total": 285
}

Project Structure

gdeltnews-rs/
├── Cargo.toml                  # Rust dependencies and build config
├── pyproject.toml              # maturin-based Python packaging
├── src/
│   ├── lib.rs                  # PyO3 module entry point
│   ├── parse.rs                # JSON parsing + gzip decompression
│   ├── fragment.rs             # Fragment creation from pre/ngram/post
│   ├── assemble.rs             # Greedy maximum-overlap assembly
│   └── pipeline.rs             # File + article level parallelism
└── python/
    └── gdeltnews_rs/
        ├── __init__.py         # Public API re-exports
        ├── download.py         # GDELT HTTP downloads
        ├── reconstruct.py      # High-level reconstruction wrapper
        └── filtermerge.py      # Boolean query filtering + URL dedup

Credits and Citation

This package is built on the methodology and research of:

Andrea Fronzetti Colladon (Roma Tre University) and Roberto Vestrelli (University of Perugia), who designed the reconstruction algorithm, validated it against ground truth data, and released the original gdeltnews Python package.

If you use this package in research, please cite their paper:

Fronzetti Colladon, A., & Vestrelli, R. (2026). Free Access to World News: Reconstructing Full-Text Articles from GDELT. Big Data and Cognitive Computing, 10(2), 45. https://doi.org/10.3390/bdcc10020045

License

GPL-3.0, same as the original package.
