Scrape article metadata and comments from NYTimes

Project description

nytimes-scraper

Setup

pip install nytimes-scraper

CLI usage

The scraper automatically fetches every article published on nytimes.com, along with all user comments. Articles are processed month by month, starting with the current month. For each month, a {year}-{month}-articles.pickle and a {year}-{month}-comments.pickle file is generated in the current directory. If the process is restarted, existing output files are not overwritten, and the scraper continues at the month where it left off. To use it, run

python -m nytimes_scraper <API_KEY>
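
The generated pickle files can be inspected with pandas afterwards. A minimal sketch, assuming the outputs are pickled pandas DataFrames; the filename follows the {year}-{month} scheme described above (exact padding may differ), and the toy frame with a web_url column merely stands in for real scraper output:

```python
import pandas as pd

# Stand-in for one month's article output; the real column set comes
# from the scraper, 'web_url' is just an illustrative column here.
articles = pd.DataFrame(
    {'web_url': ['https://www.nytimes.com/2020/02/01/example.html']},
    index=['nyt://article/0000'],
)
articles.to_pickle('2020-2-articles.pickle')

# Later (or after a real scraper run), load a month's output back:
loaded = pd.read_pickle('2020-2-articles.pickle')
print(len(loaded))  # number of articles in that month's file
```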

Programmatic usage

The scraper can also be started programmatically:

import datetime as dt
from nytimes_scraper import run_scraper, scrape_month

# scrape February 2020
article_df, comment_df = scrape_month('<your_api_key>', date=dt.date(2020, 2, 1))

# scrape all articles month by month
run_scraper('<your_api_key>')

Alternatively, the nytimes_scraper.articles and nytimes_scraper.comments modules can be used for more fine-grained access:

import datetime as dt
from nytimes_scraper.nyt_api import NytApi
from nytimes_scraper.articles import fetch_articles_by_month, articles_to_df
from nytimes_scraper.comments import fetch_comments, fetch_comments_by_article, comments_to_df

api = NytApi('<your_api_key>')

# Fetch articles of a specific month
articles = fetch_articles_by_month(api, dt.date(2020, 2, 1))
article_df = articles_to_df(articles)

# Fetch comments from multiple articles
# a) using the results of a previous article query
article_ids_and_urls = list(article_df['web_url'].items())  # (id, url) pairs; Series.iteritems() was removed in pandas 2.0
comments_a = fetch_comments(api, article_ids_and_urls)
comment_df = comments_to_df(comments_a)

# b) using a custom list of articles
comments_b = fetch_comments(api, article_ids_and_urls=[
    ('nyt://article/316ef65c-7021-5755-885c-a9e1ef2cfdf2', 'https://www.nytimes.com/2020/01/03/world/middleeast/trump-iran-suleimani.html'),
    ('nyt://article/b2d1b802-412e-51f7-8864-efc931e87bb3', 'https://www.nytimes.com/2020/01/04/opinion/impeachment-witnesses.html'),
])

# Fetch comments for one specific article by its URL
comments_c = fetch_comments_by_article(api, 'https://www.nytimes.com/2019/11/30/opinion/sunday/bernie-sanders.html')
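
Once both frames exist, comments can be joined back to their articles. A hedged sketch with toy data, assuming article IDs serve as the article frame's index and that the comment frame carries a matching article-ID column (the real column names produced by the scraper may differ):

```python
import pandas as pd

# Toy stand-ins for article_df and comment_df; the column names
# 'articleID' and 'commentBody' are assumptions for illustration.
article_df = pd.DataFrame(
    {'web_url': ['https://www.nytimes.com/a.html', 'https://www.nytimes.com/b.html']},
    index=['nyt://article/1', 'nyt://article/2'],
)
comment_df = pd.DataFrame({
    'articleID': ['nyt://article/1', 'nyt://article/1', 'nyt://article/2'],
    'commentBody': ['First!', 'Nice piece.', 'Disagree.'],
})

# Count comments per article and attach the counts to the article frame
counts = comment_df.groupby('articleID').size().rename('n_comments')
merged = article_df.join(counts)
print(merged['n_comments'].tolist())  # [2, 1]
```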

Download files

Download the file for your platform.

Source Distribution

nytimes-scraper-1.1.2.tar.gz (5.6 kB)

Uploaded Source

Built Distribution

nytimes_scraper-1.1.2-py3-none-any.whl (9.2 kB)

Uploaded Python 3

File details

Details for the file nytimes-scraper-1.1.2.tar.gz.

File metadata

  • Download URL: nytimes-scraper-1.1.2.tar.gz
  • Upload date:
  • Size: 5.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.0.0.post20200311 requests-toolbelt/0.9.1 tqdm/4.46.0 CPython/3.7.6

File hashes

Hashes for nytimes-scraper-1.1.2.tar.gz

  • SHA256: aebcfd4fa74e2a504e05e95035c310ff3b0265905ab89835415a04c463edbfa4
  • MD5: 74729952badec1f6ed0ceca385626c7e
  • BLAKE2b-256: 3b52f4a430429e529b33ba395b5768d2a453bf53883457559b438aaa4bbed0bd

File details

Details for the file nytimes_scraper-1.1.2-py3-none-any.whl.

File metadata

  • Download URL: nytimes_scraper-1.1.2-py3-none-any.whl
  • Upload date:
  • Size: 9.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.0.0.post20200311 requests-toolbelt/0.9.1 tqdm/4.46.0 CPython/3.7.6

File hashes

Hashes for nytimes_scraper-1.1.2-py3-none-any.whl

  • SHA256: 59562677e30dce5d31a9047f5881593a089d229ffe63a73f2e35477548e3bb7c
  • MD5: 6734ba9900855922834eb3005a53a147
  • BLAKE2b-256: fff530d726ec6e3cb4270d9a8fc0f6493b3ad0b9d2320620665de97c3f8b3f4e
