Reconstruct full-text news articles from GDELT Web News NGrams 3.0
Reconstructing Full-Text News Articles from GDELT - gdeltnews
Reconstruct full news article text from the GDELT Web News NGrams 3.0 dataset.
This package helps you:
- download GDELT Web NGrams files for a time range,
- reconstruct article text from overlapping n-gram fragments,
- filter and merge reconstructed CSVs using Boolean queries.
To learn more about the dataset, please visit the official announcement: https://blog.gdeltproject.org/announcing-the-new-web-news-ngrams-3-0-dataset/
Input files look like: http://data.gdeltproject.org/gdeltv3/webngrams/20250316000100.webngrams.json.gz
Reconstruction quality depends on the n-gram fragments available in the dataset.
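To make the idea of rebuilding text from overlapping fragments concrete, here is a toy sketch. It is not the algorithm used by gdeltnews: the function names `merge_pair` and `stitch` are hypothetical, and it assumes fragments are plain whitespace-separated word sequences that overlap by at least `min_overlap` words.

```python
from typing import List, Optional


def merge_pair(left: List[str], right: List[str], min_overlap: int = 2) -> Optional[List[str]]:
    """Join `right` onto `left` if a suffix of `left` equals a prefix of `right`.

    Hypothetical helper for illustration; returns None when no overlap is found.
    """
    max_k = min(len(left), len(right))
    # Try the longest possible overlap first; never go below one shared word.
    for k in range(max_k, max(min_overlap, 1) - 1, -1):
        if left[-k:] == right[:k]:
            return left + right[k:]
    return None


def stitch(fragments: List[str], min_overlap: int = 2) -> str:
    """Greedily grow a text from the first fragment, extending it on either side."""
    remaining = [f.split() for f in fragments]
    text = remaining.pop(0)
    progress = True
    while remaining and progress:
        progress = False
        for i, frag in enumerate(remaining):
            # Try to attach the fragment to the right, then to the left.
            merged = merge_pair(text, frag, min_overlap) or merge_pair(frag, text, min_overlap)
            if merged is not None:
                text = merged
                remaining.pop(i)
                progress = True
                break
    return " ".join(text)


fragments = [
    "brown fox jumps over",
    "jumps over the lazy dog",
    "the quick brown fox",
]
print(stitch(fragments))  # the quick brown fox jumps over the lazy dog
```

If a fragment shares no words with the text built so far, it is simply left out, which is one way to see why the result depends on which fragments are available.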
Install
pip install gdeltnews
Quickstart and Docs
The package is documented here.
Step 1: Download Web NGrams files
from gdeltnews.download import download
download(
    "2025-11-25T10:00:00",
    "2025-11-25T13:59:00",
    outdir="gdeltdata",
    decompress=False,
)
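The time range above is given as ISO 8601 strings, while the input file names carry a compact 14-digit stamp (as in the example URL earlier). The sketch below shows one way to check whether a given file falls inside a range. The helpers `stamp_from_name` and `in_range` are hypothetical and not part of gdeltnews, and the sketch assumes the stamp is formatted as `YYYYMMDDHHMMSS` and that both sides use the same time zone.

```python
import re
from datetime import datetime

# Assumed file name pattern: 14 digits followed by ".webngrams.json.gz".
STAMP_RE = re.compile(r"(\d{14})\.webngrams\.json\.gz$")


def stamp_from_name(name: str) -> datetime:
    """Extract the 14-digit timestamp from a Web NGrams file name or URL."""
    match = STAMP_RE.search(name)
    if match is None:
        raise ValueError(f"not a Web NGrams file name: {name!r}")
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


def in_range(name: str, start: str, end: str) -> bool:
    """True if the file's timestamp lies within [start, end] (ISO 8601 strings)."""
    stamp = stamp_from_name(name)
    return datetime.fromisoformat(start) <= stamp <= datetime.fromisoformat(end)


url = "http://data.gdeltproject.org/gdeltv3/webngrams/20250316000100.webngrams.json.gz"
print(stamp_from_name(url))  # 2025-03-16 00:01:00
```

This can be handy for checking which files in `outdir` belong to the requested window before running the next step.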
Step 2: Reconstruct articles (run as a script, not in Jupyter)
Multiprocessing can be problematic inside notebooks. Run this from a .py script.
from gdeltnews.reconstruct import reconstruct
def main():
    reconstruct(
        input_dir="gdeltdata",
        output_dir="gdeltpreprocessed",
        language="it",
        url_filters=["repubblica.it", "corriere.it"],
        processes=10,  # use None for all available cores
    )

if __name__ == "__main__":
    main()
Step 3: Filter, deduplicate, and merge CSVs
from gdeltnews.filtermerge import filtermerge
filtermerge(
input_dir="gdeltpreprocessed",
output_file="final_filtered_dedup.csv",
query='((elezioni OR voto) AND (regionali OR campania)) OR ((fico OR cirielli) AND NOT veneto)'
)
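To illustrate how a Boolean query of this shape can be read, here is a minimal sketch of an evaluator. It is not the implementation inside `filtermerge`: the function `matches` is hypothetical, and it assumes uppercase `AND`, `OR`, and `NOT` operators, parentheses for grouping, and case-insensitive whole-word matching. The package's actual matching rules may differ.

```python
import re

# Tokens are parentheses or runs of non-space, non-parenthesis characters.
TOKEN_RE = re.compile(r"\(|\)|[^\s()]+")


def matches(query: str, text: str) -> bool:
    """Evaluate a Boolean query (AND, OR, NOT, parentheses) against a text."""
    words = set(re.findall(r"\w+", text.lower()))
    tokens = TOKEN_RE.findall(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():  # lowest precedence
        value = parse_and()
        while peek() == "OR":
            advance()
            rhs = parse_and()
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() == "AND":
            advance()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():  # highest precedence among the operators
        if peek() == "NOT":
            advance()
            return not parse_not()
        return parse_atom()

    def parse_atom():
        tok = advance()
        if tok == "(":
            value = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return value
        if tok in (")", "AND", "OR"):
            raise ValueError(f"unexpected token: {tok!r}")
        return tok.lower() in words  # plain keyword

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos]!r}")
    return result


query = "((elezioni OR voto) AND (regionali OR campania)) OR ((fico OR cirielli) AND NOT veneto)"
print(matches(query, "Elezioni regionali in Campania: il voto di domenica"))  # True
print(matches(query, "Fico parla in Veneto"))  # False
```

Under these assumptions, an article matches if it mentions an election term together with a regional term, or if it mentions one of the two names without mentioning Veneto.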
Advanced users can pre-filter and download GDELT data via Google BigQuery, then process it directly with wordmatch.py.
Citation
If you use this package for research, please cite:
A. Fronzetti Colladon, R. Vestrelli (2025). “A Python Tool for Reconstructing Full News Text from GDELT.” https://arxiv.org/abs/2504.16063
Credits
Code co-developed with robves99.