gdeltnews-rs

Rust-powered reconstruction of full-text news articles from the GDELT Web News NGrams 3.0 dataset.

This is a from-scratch reimplementation of the gdeltnews Python package by Andrea Fronzetti Colladon and Roberto Vestrelli. It preserves the same reconstruction algorithm and output quality while delivering significantly faster performance through a Rust core exposed via PyO3.

Why This Exists

The original gdeltnews package demonstrated that full-text news articles can be reconstructed from GDELT's fragmented n-gram data with up to 95% textual fidelity (see the paper). However, the Python implementation is CPU-bound on the reconstruction step — the authors reported 1 hour 8 minutes to process 39 files on a single core.

This package reduces that to seconds.

Performance

Benchmarked on a 1 MB gzipped GDELT Web NGrams file (42K fragments, 18 articles):

Implementation                   Time     Speedup
Original Python (single core)    12.3s    1x
gdeltnews-rs                     0.18s    67x

On a larger 55 MB file (687K fragments, 1,176 articles), gdeltnews-rs completed in 11.3s. The Python benchmark on this file was early-stopped (it was taking too long to be practical), so there is no direct comparison. However, the original authors reported 1 hour 8 minutes to process 39 files on a single core — roughly 1.7 minutes per file. Extrapolating from the 1 MB benchmark ratio, the real speedup on larger files is likely well above 67x, as larger articles with more fragments amplify Rust's advantage: zero-copy slice comparisons avoid the repeated list allocation overhead that dominates in Python.

These numbers include gzip decompression, JSON parsing, fragment creation, overlap-based assembly, and JSONL output — the entire pipeline. The Rust version also reads .gz files directly, eliminating the separate decompression step.

What Changed from the Original

Architecture

  • Rust core via PyO3: The CPU-intensive path (JSON parsing, gzip decompression, fragment creation, overlap assembly, JSONL output) is implemented in Rust and compiled to a native Python extension. Users install with pip and get compiled wheels — no Rust toolchain needed.
  • Python wrapper: Download (HTTP/I/O bound) and filtermerge (light CPU) remain in Python. The __init__.py re-exports everything for a clean API.
  • Rayon parallelism: The original used Python's multiprocessing.Pool, which requires serializing data across process boundaries. Rust's rayon provides low-overhead work-stealing thread parallelism at two levels: across files, and across articles within each file. No freeze_support() needed.

Output Format

  • JSONL instead of pipe-delimited CSV: The original used | as a delimiter with QUOTE_NONE, which breaks silently when article text contains | or newlines. JSONL (one JSON object per line) handles all text content safely and is natively supported by pandas: pd.read_json("output.jsonl", lines=True).
  • Reconstruction metadata: Each output record includes fragments_used and fragments_total counts, so you can assess reconstruction completeness per article.
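Because each record carries these counts, per-article completeness can be computed directly when loading the output. A minimal stdlib sketch (the sample values and URL are hypothetical; field names follow the Output Format section):

```python
import json

# One JSONL line per reconstructed article (schema from the Output Format section).
line = '{"text": "Sample article text", "url": "https://example.com/a", "fragments_used": 240, "fragments_total": 285}'

record = json.loads(line)
completeness = record["fragments_used"] / record["fragments_total"]
print(f"{record['url']}: {completeness:.0%} of fragments placed")
```

The same ratio is easy to compute over a whole file with pandas after pd.read_json("output.jsonl", lines=True).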

Algorithm

The reconstruction algorithm is identical to the original: greedy maximum-overlap assembly with position constraints. This was a deliberate decision, since the original algorithm achieves up to 95% textual fidelity as validated against EventRegistry ground truth. The performance gain comes entirely from the language difference:

  • Zero-copy slice comparison: Python's result_words[-k:] == words[:k] creates two new lists on every comparison. Rust's result[len-k..] == candidate[..k] compares memory in place.
  • No garbage collector: Python allocates and deallocates thousands of temporary list objects during the overlap search. Rust uses stack-allocated slices.
  • Native string handling: No interpreter overhead on the tight inner loops.
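The merge step at the heart of the overlap search can be illustrated with a short Python sketch (the function name is illustrative; the package performs this comparison in Rust without the list allocations):

```python
def merge_max_overlap(left, right):
    """Join two word lists at their largest suffix/prefix overlap.

    Mirrors the left[-k:] == right[:k] comparison described above; each
    slice here allocates a new list, which is exactly the overhead the
    Rust core avoids with in-place slice comparison.
    """
    max_k = min(len(left), len(right))
    for k in range(max_k, 0, -1):    # greedy: try the largest overlap first
        if left[-k:] == right[:k]:
            return left + right[k:]  # splice, dropping the duplicated words
    return left + right              # no overlap: plain concatenation

merged = merge_max_overlap(
    ["the", "quick", "brown", "fox"],
    ["brown", "fox", "jumps", "over"],
)
# merged == ["the", "quick", "brown", "fox", "jumps", "over"]
```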

What Was Removed

  • GUI: The tkinter GUI is not included. This package is library-only.
  • CSV output: Replaced by JSONL. The filtermerge module reads JSONL.
  • Decompression step: The Rust engine reads .gz files directly via flate2. No need to decompress to disk first.
  • freeze_support() / multiprocessing boilerplate: Rayon threads work in any context including Jupyter notebooks.

What Was Kept

  • Same reconstruction quality: The greedy overlap algorithm, position constraints, slash artifact removal, and circular overlap cleanup are all preserved exactly.
  • Same download logic: HTTP downloads from data.gdeltproject.org with the same URL format and time range enumeration.
  • Same Boolean query syntax: AND, OR, NOT, parentheses, and quoted phrases for filtermerge.
  • Same language/URL filtering: Filter by language code and URL substrings.
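The query from the Quickstart, '(elections OR primaries) AND democrats AND NOT republicans', is equivalent in spirit to the following plain-Python predicate. This only illustrates the Boolean semantics; it is not the package's parser, and the real tokenization and case handling are up to filtermerge:

```python
def matches(text):
    """Illustrative equivalent of
    '(elections OR primaries) AND democrats AND NOT republicans'."""
    t = text.lower()
    return ("elections" in t or "primaries" in t) \
        and "democrats" in t \
        and "republicans" not in t

print(matches("Democrats sweep the primaries"))       # True
print(matches("Republicans react to the elections"))  # False
```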

Install

pip install gdeltnews-rs

Pre-built wheels are provided for common platforms. If no wheel matches your platform, pip will build from source (requires a Rust toolchain).

For development:

git clone <this-repo>
cd gdeltnews
python -m venv .venv && source .venv/bin/activate
pip install maturin requests tqdm "boolean.py>=5.0"
maturin develop --release

Quickstart

Step 1: Download GDELT files

from gdeltnews_rs import download

download(
    "2025-01-15T10:00:00",
    "2025-01-15T13:59:00",
    outdir="gdeltdata",
)

Unlike the original package, there is no decompress parameter. The Rust engine reads .gz files directly.

Step 2: Reconstruct articles

from gdeltnews_rs import reconstruct

reconstruct(
    "gdeltdata",
    "output",
    language="en",
    url_filters=["reuters.com", "nytimes.com"],
)

This processes all .gz and .json files in the input directory in parallel and writes one JSONL file per input file to the output directory.

Works in scripts, notebooks, anywhere — no freeze_support() needed.

Step 3: Filter and deduplicate

from gdeltnews_rs import filtermerge

filtermerge(
    "output",
    "final.jsonl",
    query='(elections OR primaries) AND democrats AND NOT republicans',
)

Reads all JSONL files from the output directory, applies the Boolean query, deduplicates by URL (keeping the longest text), and writes a single JSONL file.
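The deduplication rule (one record per URL, keeping the longest text) can be sketched as follows. The helper name and the simplified two-field record shape are hypothetical; real records carry the full Output Format schema:

```python
def dedupe_by_url(records):
    """Keep one record per URL, preferring the longest reconstructed text."""
    best = {}
    for rec in records:
        kept = best.get(rec["url"])
        if kept is None or len(rec["text"]) > len(kept["text"]):
            best[rec["url"]] = rec
    return list(best.values())

records = [
    {"url": "https://example.com/a", "text": "short"},
    {"url": "https://example.com/a", "text": "a much longer reconstruction"},
    {"url": "https://example.com/b", "text": "unique"},
]
deduped = dedupe_by_url(records)
# Two records remain; the longer text wins for the duplicated URL.
```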

Low-level API

For more control, you can call the Rust engine directly:

from gdeltnews_rs import reconstruct_file, reconstruct_file_to_jsonl

# Get articles as Python objects
articles = reconstruct_file("path/to/file.json.gz", language="en")
for article in articles:
    print(article.url, len(article.text), f"{article.fragments_used}/{article.fragments_total}")

# Or write directly to JSONL
count = reconstruct_file_to_jsonl("input.json.gz", "output.jsonl", language="en")

Output Format

Each line in the JSONL output is a JSON object:

{
  "text": "Full reconstructed article text...",
  "date": "2025-01-15",
  "url": "https://www.reuters.com/article/...",
  "source": "reuters.com",
  "fragments_used": 285,
  "fragments_total": 285
}

Project Structure

gdeltnews-rs/
├── Cargo.toml                  # Rust dependencies and build config
├── pyproject.toml              # maturin-based Python packaging
├── src/
│   ├── lib.rs                  # PyO3 module entry point
│   ├── parse.rs                # JSON parsing + gzip decompression
│   ├── fragment.rs             # Fragment creation from pre/ngram/post
│   ├── assemble.rs             # Greedy maximum-overlap assembly
│   └── pipeline.rs             # File + article level parallelism
└── python/
    └── gdeltnews_rs/
        ├── __init__.py         # Public API re-exports
        ├── download.py         # GDELT HTTP downloads
        ├── reconstruct.py      # High-level reconstruction wrapper
        └── filtermerge.py      # Boolean query filtering + URL dedup

Credits and Citation

This package is built on the methodology and research of:

Andrea Fronzetti Colladon (Roma Tre University) and Roberto Vestrelli (University of Perugia), who designed the reconstruction algorithm, validated it against ground truth data, and released the original gdeltnews Python package.

If you use this package in research, please cite their paper:

Fronzetti Colladon, A., & Vestrelli, R. (2026). Free Access to World News: Reconstructing Full-Text Articles from GDELT. Big Data and Cognitive Computing, 10(2), 45. https://doi.org/10.3390/bdcc10020045

License

GPL-3.0, same as the original package.
