Scrape content from the PolitiFact website
PolitiFact Scraping
A Python library to scrape fact-check articles, speakers, reviewers and issues from the PolitiFact website. Optionally store everything in MongoDB.
Installation
pip install politifact-scraping
Quick start
Scrape data into Python objects
from politifact_scraping import PolitifactScraper
# Optionally restrict the date range (defaults to all available data since 2007)
scraper = PolitifactScraper(init_date="2025-01-01", end_date="2025-12-31")
# Scrape all articles, speakers, reviewers and issues
articles = scraper.scrape_all_articles()
speakers = scraper.scrape_all_speakers()
reviewers = scraper.scrape_all_reviewers()
issues = scraper.scrape_all_issues()
Each method returns a list of dictionaries. For example, an article dictionary contains keys such as article_url, title, subtitle, article_text, label, speaker_date, publish_date, image_url, sources, and more.
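Because the methods return plain dictionaries, the results can be summarized with the standard library alone. The sketch below tallies articles per label. The sample records are made up for illustration and carry only two of the keys listed above, and `count_labels` is a hypothetical helper, not part of the package.

```python
from collections import Counter

# Made-up records shaped like the article dictionaries described above
# (only a subset of keys; real scraper output has more fields).
sample_articles = [
    {"title": "Claim A", "label": "false"},
    {"title": "Claim B", "label": "half_true"},
    {"title": "Claim C", "label": "false"},
]


def count_labels(articles):
    """Count how many articles carry each label (hypothetical helper)."""
    # .get() tolerates records where the label key is missing
    return Counter(a.get("label") for a in articles)


label_counts = count_labels(sample_articles)
print(label_counts.most_common())
```

The same function should work unchanged on the list returned by `scrape_all_articles()`, assuming each dictionary has a `label` key as described.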
Scrape a single article by title
article = scraper.scrape_article_from_title("The claim you want to search for")
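How the library resolves a title to an article is not documented here. The `fuzzywuzzy` dependency suggests approximate string matching, but that is an assumption. As a rough illustration of the general idea only, the sketch below picks the closest title from a list using the standard library's `difflib` rather than `fuzzywuzzy`. The function name, the threshold and the sample titles are all hypothetical.

```python
from difflib import SequenceMatcher


def best_title_match(query, titles, min_ratio=0.6):
    """Return the title most similar to `query`, or None if nothing is close enough."""

    def ratio(title):
        # Case-insensitive similarity in the range [0, 1]
        return SequenceMatcher(None, query.lower(), title.lower()).ratio()

    best = max(titles, key=ratio, default=None)
    if best is None or ratio(best) < min_ratio:
        return None
    return best


# Made-up titles, for illustration only
candidate_titles = [
    "No, the moon landing was not staged",
    "Unemployment did not double last year",
]
match = best_title_match("moon landing was staged", candidate_titles)
```

The `min_ratio` cutoff keeps unrelated queries from returning an arbitrary article; the value 0.6 is an arbitrary choice for this sketch.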
Scrape and store in MongoDB
from politifact_scraping import PolitiFactDB
db = PolitiFactDB()
# Scrape everything and persist to MongoDB in one call
db.scrape_and_store(init_date="2025-01-01", end_date="2025-12-31")
# Or query previously stored articles
results = db.find_articles(
    filter={"label": "false"},
    populate_speaker=True,
    num_docs=10,
)
Data collected
The scraper extracts four entity types from PolitiFact:
| Entity | Key fields |
|---|---|
| Article | article_url, title, subtitle, article_text, label, speaker_date, publish_date, image_url, sources, language |
| Speaker | speaker_id, name, description, image_url, personal_website_url, truth-o-meter counts |
| Reviewer | reviewer_id, name, job_position, description, image_url, twitter_url, phone_number |
| Issue | issue_id, name, description, image_url, truth-o-meter counts |
Truth-o-meter labels: true, mostly_true, half_true, mostly_false, false, pants_on_fire.
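For numerical analysis it can help to map these labels onto an ordinal scale. The encoding below is a hypothetical convention for illustration (5 for `true` down to 0 for `pants_on_fire`), not something the package provides.

```python
# Labels in the order listed above, from most to least truthful.
TRUTH_O_METER = [
    "true",
    "mostly_true",
    "half_true",
    "mostly_false",
    "false",
    "pants_on_fire",
]

# Hypothetical ordinal encoding: 5 for "true" down to 0 for "pants_on_fire".
LABEL_SCORE = {
    label: len(TRUTH_O_METER) - 1 - i for i, label in enumerate(TRUTH_O_METER)
}


def label_to_score(label):
    """Return the ordinal score for a label, or None for labels outside the scale."""
    return LABEL_SCORE.get(label)
```

Returning `None` for unknown labels, rather than raising, lets a caller skip articles whose label falls outside the six values listed here.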
Environment variables
MongoDB storage requires the following environment variables (e.g. in a .env file):
| Variable | Description |
|---|---|
| MONGODB_HOST | Connection string to the MongoDB cluster |
| MONGODB_USER | User name with access permissions |
| MONGODB_PASSWORD | Password for the user |
The data is stored in a database named politifact with collections articles, speakers, reviewers and issues.
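Before calling the MongoDB features, it can be useful to check that all three variables are set. The sketch below does that with the standard library only. `load_mongo_settings` is a hypothetical helper, and how the package itself reads these variables is not shown here.

```python
import os

REQUIRED_VARS = ("MONGODB_HOST", "MONGODB_USER", "MONGODB_PASSWORD")


def load_mongo_settings(env=None):
    """Collect the required variables, raising if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


# Example with an explicit mapping (placeholder values) instead of the real environment
settings = load_mongo_settings({
    "MONGODB_HOST": "cluster.example.net",
    "MONGODB_USER": "reader",
    "MONGODB_PASSWORD": "change-me",
})
```

When the values live in a .env file, calling `load_dotenv()` from python-dotenv beforehand copies them into `os.environ`, after which `load_mongo_settings()` can be called without arguments.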
Requirements
- Python ≥ 3.10
- beautifulsoup4, requests, fuzzywuzzy, python-Levenshtein, pymongo, python-dotenv, numpy, pydantic-core
License
This project is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Citation
If you use this package in your research, please cite:
@dataset{mario_villar_sanz_2026_19062297,
  author    = {Mario Villar Sanz and
               Zylowski, Thorsten and
               Wölfel, Matthias and
               Rico, Noelia and
               Díaz, Irene},
  title     = {PolitiFact scraping dataset},
  year      = 2026,
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19062297},
  url       = {https://doi.org/10.5281/zenodo.19062297},
}