Scrape content from the PolitiFact website

PolitiFact Scraping

A Python library to scrape fact-check articles, speakers, reviewers and issues from the PolitiFact website. Optionally store everything in MongoDB.

Installation

pip install politifact-scraping

Quick start

Scrape data into Python objects

from politifact_scraping import PolitifactScraper

# Optionally restrict the date range (defaults to all available data since 2007)
scraper = PolitifactScraper(init_date="2025-01-01", end_date="2025-12-31")

# Scrape all articles, speakers, reviewers and issues
articles  = scraper.scrape_all_articles()
speakers  = scraper.scrape_all_speakers()
reviewers = scraper.scrape_all_reviewers()
issues    = scraper.scrape_all_issues()

Each method returns a list of dictionaries. For example, an article dictionary contains keys such as article_url, title, subtitle, article_text, label, speaker_date, publish_date, image_url, sources, and more.
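Because the results are plain dictionaries, they can be summarized with the standard library alone. The sketch below tallies articles by label. The sample records are invented for illustration and use only keys named above, and count_labels is a hypothetical helper, not part of the package.

```python
from collections import Counter

# Hypothetical sample shaped like scraper output; real records carry more keys.
sample_articles = [
    {"title": "Claim A", "label": "false"},
    {"title": "Claim B", "label": "half_true"},
    {"title": "Claim C", "label": "false"},
]

def count_labels(articles):
    """Count how many articles carry each label (missing labels count as None)."""
    return Counter(a.get("label") for a in articles)

label_counts = count_labels(sample_articles)
print(label_counts.most_common())
```

The same helper should work on the list returned by scrape_all_articles(), assuming every record exposes a label key as described.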

Scrape a single article by title

article = scraper.scrape_article_from_title("The claim you want to search for")

Scrape and store in MongoDB

from politifact_scraping import PolitiFactDB

db = PolitiFactDB()

# Scrape everything and persist to MongoDB in one call
db.scrape_and_store(init_date="2025-01-01", end_date="2025-12-31")

# Or query previously stored articles
results = db.find_articles(
    filter={"label": "false"},
    populate_speaker=True,
    num_docs=10,
)

Data collected

The scraper extracts four entity types from PolitiFact:

Entity     Key fields
Article    article_url, title, subtitle, article_text, label, speaker_date, publish_date, image_url, sources, language
Speaker    speaker_id, name, description, image_url, personal_website_url, truth-o-meter counts
Reviewer   reviewer_id, name, job_position, description, image_url, twitter_url, phone_number
Issue      issue_id, name, description, image_url, truth-o-meter counts

Truth-o-meter labels: true, mostly_true, half_true, mostly_false, false, pants_on_fire.
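
For analysis it can help to treat these labels as an ordered scale. A minimal sketch, assuming the order listed above runs from most to least accurate; LABEL_ORDER and label_rank are illustrative names, not part of the package.

```python
# Labels in the order listed above, assumed to run from most to least accurate.
LABEL_ORDER = [
    "true",
    "mostly_true",
    "half_true",
    "mostly_false",
    "false",
    "pants_on_fire",
]

def label_rank(label):
    """Return the position of a label on the scale (0 is "true"); raise on unknown labels."""
    try:
        return LABEL_ORDER.index(label)
    except ValueError:
        raise ValueError(f"unknown Truth-o-meter label: {label!r}") from None

# Sort hypothetical records from the most to the least accurate rating.
records = [{"label": "false"}, {"label": "true"}, {"label": "half_true"}]
ranked = sorted(records, key=lambda r: label_rank(r["label"]))
```

Raising on unknown values rather than ranking them last makes unexpected labels visible early, which may be preferable when the scraped site changes.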

Environment variables

MongoDB storage requires the following environment variables (e.g. in a .env file):

Variable           Description
MONGODB_HOST       Connection string to the MongoDB cluster
MONGODB_USER       User name with access permissions
MONGODB_PASSWORD   Password for the user
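
Before connecting, it can be useful to confirm that all three variables are set. The following is a sketch using only the standard library. missing_mongo_vars is a hypothetical helper, not part of the package, and it does not read a .env file itself; python-dotenv's load_dotenv() can be called first to do that.

```python
import os

REQUIRED_VARS = ("MONGODB_HOST", "MONGODB_USER", "MONGODB_PASSWORD")

def missing_mongo_vars(environ=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]

missing = missing_mongo_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```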

The data is stored in a database named politifact with collections articles, speakers, reviewers and issues.

Requirements

  • Python ≥ 3.10
  • beautifulsoup4, requests, fuzzywuzzy, python-Levenshtein, pymongo, python-dotenv, numpy, pydantic-core

License

This project is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Citation

If you use this package in your research, please cite:

@dataset{mario_villar_sanz_2026_19062297,
  author       = {Mario Villar Sanz and
                  Zylowski, Thorsten and
                  Wölfel, Matthias and
                  Rico, Noelia and
                  Díaz, Irene},
  title        = {PolitiFact scraping dataset},
  year         = 2026,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.19062297},
  url          = {https://doi.org/10.5281/zenodo.19062297},
}

Download files

Source Distribution

politifact_scraping-0.1.1.tar.gz (24.5 kB)

Uploaded Source

Built Distribution

politifact_scraping-0.1.1-py3-none-any.whl (24.2 kB)

Uploaded Python 3

File details

Details for the file politifact_scraping-0.1.1.tar.gz.

File metadata

  • Download URL: politifact_scraping-0.1.1.tar.gz
  • Size: 24.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for politifact_scraping-0.1.1.tar.gz
Algorithm     Hash digest
SHA256        bbc2a10475db2d2777cda31f3863f54bc6bfb13cb6814a86bc67ce2f20a9c3d1
MD5           37a98670be3208f5cb5d71964ff8c345
BLAKE2b-256   ea95cd877f8592807434b7e48a13fd2b9dffdcaeb08c525e372d5e9d92b9d6a2
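
A downloaded archive can be checked against the SHA256 digest listed above. The sketch below uses only hashlib from the standard library; sha256_of and matches_expected are illustrative names, and the local file path is an assumption about where the download was saved.

```python
import hashlib

# SHA256 digest published above for politifact_scraping-0.1.1.tar.gz.
EXPECTED = "bbc2a10475db2d2777cda31f3863f54bc6bfb13cb6814a86bc67ce2f20a9c3d1"

def sha256_of(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_expected(path, expected=EXPECTED):
    """Return True when the file's SHA256 digest equals the expected one."""
    return sha256_of(path) == expected.lower()
```

For example, matches_expected("politifact_scraping-0.1.1.tar.gz") would be called from the directory holding the download; a False result suggests a corrupted or altered file.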

Provenance

The following attestation bundles were made for politifact_scraping-0.1.1.tar.gz:

Publisher: publish.yml on MarioVillar/politifact-scraping

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file politifact_scraping-0.1.1-py3-none-any.whl.

File hashes

Hashes for politifact_scraping-0.1.1-py3-none-any.whl
Algorithm     Hash digest
SHA256        81de36e72282fbf9caf2e138c13a1d4a7022a661ca61db4b49f1d7a836013f60
MD5           0f53b02dea14a0813872038f2e55274e
BLAKE2b-256   27d8642fef065b91a89771ebfa3786714a9e6d79e736f834911298d2ff3486dc

Provenance

The following attestation bundles were made for politifact_scraping-0.1.1-py3-none-any.whl:

Publisher: publish.yml on MarioVillar/politifact-scraping
