# gdeltnews-rs

Fast, Rust-powered reconstruction of full-text news articles from the GDELT Web News NGrams 3.0 dataset.
This is a from-scratch reimplementation of the gdeltnews Python package by Andrea Fronzetti Colladon and Roberto Vestrelli. It preserves the same reconstruction algorithm and output quality while delivering significantly faster performance through a Rust core exposed via PyO3.
## Why This Exists
The original gdeltnews package demonstrated that full-text news articles can be reconstructed from GDELT's fragmented n-gram data with up to 95% textual fidelity (see the paper). However, the Python implementation is CPU-bound on the reconstruction step — the authors reported 1 hour 8 minutes to process 39 files on a single core.
This package reduces that to seconds.
## Performance
Benchmarked on a 1 MB gzipped GDELT Web NGrams file (42K fragments, 18 articles):
| Implementation | Time | Speedup |
|---|---|---|
| Original Python (single core) | 12.3s | 1x |
| gdeltnews-rs | 0.18s | 67x |
On a larger 55 MB file (687K fragments, 1,176 articles), gdeltnews-rs completed in 11.3s. The Python benchmark on this file was early-stopped (it was taking too long to be practical), so there is no direct comparison. However, the original authors reported 1 hour 8 minutes to process 39 files on a single core — roughly 1.7 minutes per file. Extrapolating from the 1 MB benchmark ratio, the real speedup on larger files is likely well above 67x, as larger articles with more fragments amplify Rust's advantage: zero-copy slice comparisons avoid the repeated list allocation overhead that dominates in Python.
These numbers include gzip decompression, JSON parsing, fragment creation, overlap-based assembly, and JSONL output — the entire pipeline. The Rust version also reads .gz files directly, eliminating the separate decompression step.
## What Changed from the Original
### Architecture
- Rust core via PyO3: The CPU-intensive path (JSON parsing, gzip decompression, fragment creation, overlap assembly, JSONL output) is implemented in Rust and compiled to a native Python extension. Users install with `pip` and get compiled wheels; no Rust toolchain is needed.
- Python wrapper: Download (HTTP/I/O bound) and filtermerge (light CPU) remain in Python. The `__init__.py` re-exports everything for a clean API.
- Rayon parallelism: The original used Python's `multiprocessing.Pool`, which requires serializing data across process boundaries. Rust's rayon provides zero-overhead work-stealing thread parallelism at two levels: across files and across articles within each file. No `freeze_support()` needed.
### Output Format
- JSONL instead of pipe-delimited CSV: The original used `|` as a delimiter with `QUOTE_NONE`, which breaks silently when article text contains `|` or newlines. JSONL (one JSON object per line) handles all text content safely and is natively supported by pandas: `pd.read_json("output.jsonl", lines=True)`.
- Reconstruction metadata: Each output record includes `fragments_used` and `fragments_total` counts, so you can assess reconstruction completeness per article.
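To see why JSONL is robust where pipe-delimited CSV is not, here is a quick stdlib check (illustrative only; the text is made up):

```python
import json

# Article text containing both the old delimiter and a newline
# round-trips intact through a single JSONL line.
text = "Revenue rose 5% | analysts said.\nShares closed higher."
line = json.dumps({"text": text})  # one JSONL line

assert json.loads(line)["text"] == text
assert "\n" not in line  # the newline is escaped, so one record stays on one line
```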
### Algorithm
The reconstruction algorithm is identical to the original — greedy maximum-overlap assembly with position constraints. This was a deliberate decision. The original algorithm achieves 95% textual fidelity as validated against EventRegistry ground truth. The performance gain comes entirely from the language difference:
- Zero-copy slice comparison: Python's `result_words[-k:] == words[:k]` creates two new lists on every comparison. Rust's `result[len - k..] == candidate[..k]` compares memory in place.
- No garbage collector: Python allocates and deallocates thousands of temporary list objects during the overlap search. Rust uses stack-allocated slices.
- Native string handling: No interpreter overhead in the tight inner loops.
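The hot loop those bullets describe can be sketched in pure Python. `merge_overlap` below is a hypothetical, simplified version of the greedy maximum-overlap step (the real implementation also enforces position constraints and cleans up artifacts):

```python
def merge_overlap(result_words, candidate_words, min_overlap=1):
    """Merge two word lists at their largest suffix/prefix overlap."""
    max_k = min(len(result_words), len(candidate_words))
    for k in range(max_k, min_overlap - 1, -1):
        # This slice comparison allocates two temporary lists per check;
        # it is the pattern the Rust core replaces with in-place slices.
        if result_words[-k:] == candidate_words[:k]:
            return result_words + candidate_words[k:]
    return result_words + candidate_words

a = "the quick brown fox".split()
b = "brown fox jumps over".split()
print(merge_overlap(a, b))
# ['the', 'quick', 'brown', 'fox', 'jumps', 'over']
```

Repeating this greedily over thousands of fragments per article is what makes the step CPU-bound in Python.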
## What Was Removed
- GUI: The tkinter GUI is not included. This package is library-only.
- CSV output: Replaced by JSONL. The filtermerge module reads JSONL.
- Decompression step: The Rust engine reads `.gz` files directly via `flate2`. No need to decompress to disk first.
- `freeze_support()` / multiprocessing boilerplate: Rayon threads work in any context, including Jupyter notebooks.
## What Was Kept
- Same reconstruction quality: The greedy overlap algorithm, position constraints, slash artifact removal, and circular overlap cleanup are all preserved exactly.
- Same download logic: HTTP downloads from `data.gdeltproject.org` with the same URL format and time range enumeration.
- Same Boolean query syntax: `AND`, `OR`, `NOT`, parentheses, and quoted phrases for filtermerge.
- Same language/URL filtering: Filter by language code and URL substrings.
## Install
```bash
pip install gdeltnews-rs
```
Pre-built wheels are provided for common platforms. If no wheel matches your platform, pip will build from source (requires a Rust toolchain).
For development:
```bash
git clone <this-repo>
cd gdeltnews
python -m venv .venv && source .venv/bin/activate
pip install maturin requests tqdm "boolean.py>=5.0"
maturin develop --release
```
## Quickstart
### Step 1: Download GDELT files
```python
from gdeltnews_rs import download

download(
    "2025-01-15T10:00:00",
    "2025-01-15T13:59:00",
    outdir="gdeltdata",
)
```
Unlike the original package, there is no `decompress` parameter. The Rust engine reads `.gz` files directly.
### Step 2: Reconstruct articles
```python
from gdeltnews_rs import reconstruct

reconstruct(
    "gdeltdata",
    "output",
    language="en",
    url_filters=["reuters.com", "nytimes.com"],
)
```
This processes all `.gz` and `.json` files in the input directory in parallel and writes one JSONL file per input file to the output directory.
Works in scripts, notebooks, anywhere; no `freeze_support()` needed.
### Step 3: Filter and deduplicate
```python
from gdeltnews_rs import filtermerge

filtermerge(
    "output",
    "final.jsonl",
    query='(elections OR primaries) AND democrats AND NOT republicans',
)
```
Reads all JSONL files from the output directory, applies the Boolean query, deduplicates by URL (keeping the longest text), and writes a single JSONL file.
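The dedup rule (keep the longest text per URL) can be sketched in a few lines of pure Python; `dedup_by_url` is a hypothetical helper for illustration, not part of the package API:

```python
def dedup_by_url(records):
    """Keep only the record with the longest text for each URL."""
    best = {}
    for rec in records:
        kept = best.get(rec["url"])
        if kept is None or len(rec["text"]) > len(kept["text"]):
            best[rec["url"]] = rec
    return list(best.values())

records = [
    {"url": "https://example.com/a", "text": "short"},
    {"url": "https://example.com/a", "text": "a longer reconstruction"},
    {"url": "https://example.com/b", "text": "other"},
]
print(len(dedup_by_url(records)))  # 2
```

Keeping the longest text favors the most complete reconstruction when the same article appears in several input files.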
## Low-level API
For more control, you can call the Rust engine directly:
```python
from gdeltnews_rs import reconstruct_file, reconstruct_file_to_jsonl

# Get articles as Python objects
articles = reconstruct_file("path/to/file.json.gz", language="en")
for article in articles:
    print(article.url, len(article.text), f"{article.fragments_used}/{article.fragments_total}")

# Or write directly to JSONL
count = reconstruct_file_to_jsonl("input.json.gz", "output.jsonl", language="en")
```
## Output Format
Each line in the JSONL output is a JSON object:
```json
{
  "text": "Full reconstructed article text...",
  "date": "2025-01-15",
  "url": "https://www.reuters.com/article/...",
  "source": "reuters.com",
  "fragments_used": 285,
  "fragments_total": 285
}
```
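The per-record counts make it easy to screen out incomplete reconstructions with the stdlib alone. A sketch on a made-up record (the field names match the output format above; the values are hypothetical):

```python
import json

line = '{"text": "...", "date": "2025-01-15", "url": "https://example.com/a", "source": "example.com", "fragments_used": 280, "fragments_total": 285}'
record = json.loads(line)

# Fraction of the article's fragments the assembler managed to place.
completeness = record["fragments_used"] / record["fragments_total"]
print(f"{completeness:.1%}")  # 98.2%
```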
## Project Structure
```text
gdeltnews-rs/
├── Cargo.toml            # Rust dependencies and build config
├── pyproject.toml        # maturin-based Python packaging
├── src/
│   ├── lib.rs            # PyO3 module entry point
│   ├── parse.rs          # JSON parsing + gzip decompression
│   ├── fragment.rs       # Fragment creation from pre/ngram/post
│   ├── assemble.rs       # Greedy maximum-overlap assembly
│   └── pipeline.rs       # File + article level parallelism
└── python/
    └── gdeltnews_rs/
        ├── __init__.py   # Public API re-exports
        ├── download.py   # GDELT HTTP downloads
        ├── reconstruct.py  # High-level reconstruction wrapper
        └── filtermerge.py  # Boolean query filtering + URL dedup
```
## Credits and Citation
This package is built on the methodology and research of:
Andrea Fronzetti Colladon (Roma Tre University) and Roberto Vestrelli (University of Perugia), who designed the reconstruction algorithm, validated it against ground truth data, and released the original gdeltnews Python package.
If you use this package in research, please cite their paper:
Fronzetti Colladon, A., & Vestrelli, R. (2026). Free Access to World News: Reconstructing Full-Text Articles from GDELT. Big Data and Cognitive Computing, 10(2), 45. https://doi.org/10.3390/bdcc10020045
## License
GPL-3.0, same as the original package.