
obscraper: scrape posts from the overcomingbias blog



obscraper lets you scrape blog posts and associated metadata from the overcomingbias blog.

It’s easy to get a single post:

>>> import obscraper
>>> intro_url = 'https://www.overcomingbias.com/2006/11/introduction.html'
>>> post = obscraper.get_post_by_url(intro_url)
>>> post.title
'How To Join'
>>> post.plaintext
'How can we better believe what is true? ...'
>>> post.internal_links
{'http://www.overcomingbias.com/2007/02/moderate_modera.html': 1,
'http://www.overcomingbias.com/2006/12/contributors_be.html': 1}
>>> post.comments
20
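The `internal_links` value shown above is a plain mapping from URL to link count, so counts from several posts can be combined with the standard library. The following is a minimal sketch, not part of obscraper: `total_link_counts` is a hypothetical helper, `links_a` copies the output above, and `links_b` is made-up data for an imagined second post.

```python
from collections import Counter

# Copied from the post.internal_links output shown above.
links_a = {
    'http://www.overcomingbias.com/2007/02/moderate_modera.html': 1,
    'http://www.overcomingbias.com/2006/12/contributors_be.html': 1,
}
# Hypothetical data standing in for a second post's internal_links.
links_b = {
    'http://www.overcomingbias.com/2007/02/moderate_modera.html': 2,
}

def total_link_counts(link_dicts):
    """Sum per-post link counts into one Counter keyed by URL."""
    totals = Counter()
    for links in link_dicts:
        # Counter.update adds counts when given a mapping.
        totals.update(links)
    return totals

totals = total_link_counts([links_a, links_b])
print(totals.most_common(1))
```

With real data, the list passed to `total_link_counts` would hold the `internal_links` attribute of each downloaded post.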

Or a full list of post URLs and edit dates:

>>> import obscraper
>>> edit_dates = obscraper.grab_edit_dates()
...
>>> len(edit_dates)
4352
>>> {url: str(edit_dates[url]) for url in list(edit_dates)[:5]}
{'/2022/01/much-talk-is-sales-patter':
'2022-01-14 20:46:35+00:00',
'/2022/01/old-man-rant':
'2022-01-13 15:21:33+00:00',
'/2022/01/my-11-bets-at-10-1-odds-on-10m-covid-deaths-by-2022':
'2022-01-12 19:15:10+00:00',
'/2022/01/to-innovate-unify-or-fragment':
'2022-01-11 01:03:44+00:00',
'/2022/01/on-what-is-advice-useful':
'2022-01-10 18:46:26+00:00'}
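Once the edit dates are in hand, picking out recently edited posts is ordinary dictionary work. The sketch below assumes the values are timezone-aware `datetime` objects, which is consistent with the `str()` output above but is not confirmed here. It uses a small stand-in dictionary instead of calling obscraper, and `edited_since` is a hypothetical helper.

```python
from datetime import datetime, timezone

# Stand-in data shaped like the mapping above (URL path -> aware datetime).
edit_dates = {
    '/2022/01/much-talk-is-sales-patter':
        datetime(2022, 1, 14, 20, 46, 35, tzinfo=timezone.utc),
    '/2022/01/old-man-rant':
        datetime(2022, 1, 13, 15, 21, 33, tzinfo=timezone.utc),
    '/2022/01/to-innovate-unify-or-fragment':
        datetime(2022, 1, 11, 1, 3, 44, tzinfo=timezone.utc),
}

def edited_since(dates, cutoff):
    """Return the URLs edited at or after cutoff, newest first."""
    recent = [url for url, edited in dates.items() if edited >= cutoff]
    return sorted(recent, key=dates.get, reverse=True)

cutoff = datetime(2022, 1, 12, tzinfo=timezone.utc)
recent_urls = edited_since(edit_dates, cutoff)
print(recent_urls)
```

The cutoff must also be timezone-aware, since Python refuses to compare naive and aware datetimes.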

Features

  • Get posts by their URLs or edit dates, or get all posts hosted on the overcomingbias site

  • Provides detailed post metadata including post URLs, titles, authors, tags, publish dates, and last edit dates

  • Provides a summary of post content, including the full post text as HTML or plaintext and a list of hyperlinks to other overcomingbias posts

  • Multithreading and caching for fast downloads

  • Use via import obscraper or the simple command line interface

  • Comprehensively tested

  • Supports Python 3.8+
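To illustrate the general idea behind combining multithreading and caching for downloads, here is a minimal standard-library sketch. It is not obscraper's implementation, which is not described here. `fake_download`, `fetch`, `fetch_all`, and `download_log` are all hypothetical names, and the download is simulated so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

# Records every simulated download so cache hits can be observed.
download_log = []

def fake_download(url):
    """Hypothetical stand-in for a real network request."""
    download_log.append(url)
    return f'<html>{url}</html>'

@lru_cache(maxsize=None)
def fetch(url):
    """Download a page once; later calls with the same URL hit the cache."""
    return fake_download(url)

def fetch_all(urls, max_workers=4):
    """Fetch several URLs concurrently and return a URL -> page mapping."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its page.
        return dict(zip(urls, pool.map(fetch, urls)))

urls = [
    '/2022/01/much-talk-is-sales-patter',
    '/2022/01/old-man-rant',
    '/2022/01/on-what-is-advice-useful',
]
pages = fetch_all(urls)   # three simulated downloads
pages = fetch_all(urls)   # served from the cache, no new downloads
print(len(download_log))
```

Threads suit this kind of work because each download spends most of its time waiting on the network, and the cache avoids repeating a request for a page that has already been fetched.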

Documentation

Read the full documentation here, including the Installation and Getting Started Guide and the Public API Reference.

Bugs/Requests

Please use the GitHub issue tracker to submit bugs or request features.

Changelog

See the Changelog for a list of fixes and enhancements at each version.

License

Copyright (c) 2022 Christopher McDonald

Distributed under the terms of the MIT license.

All overcomingbias posts are copyright the original authors.

Download files


Source Distribution

obscraper-0.4.1.tar.gz (45.1 kB)

Uploaded Source

Built Distribution

obscraper-0.4.1-py3-none-any.whl (21.1 kB)

Uploaded Python 3

File details

Details for the file obscraper-0.4.1.tar.gz.

File metadata

  • Download URL: obscraper-0.4.1.tar.gz
  • Size: 45.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.10.1 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.9.10

File hashes

Hashes for obscraper-0.4.1.tar.gz
Algorithm    Hash digest
SHA256       993c1a5d037368f76cb779289a30f1afe07d12ce0fc53e4474ae0c3fe56e11e4
MD5          ef7ff8098113d4c68e7b511b086caffd
BLAKE2b-256  bec1bb9640aaa20da200c44730ee713142348be0ba147a6d493463f78fbfb8e5
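
A downloaded file can be checked against the SHA256 digest listed above using Python's `hashlib`. This is a minimal sketch: `sha256_of_file` and `matches_expected` are hypothetical helper names, and the path in the usage comment assumes the archive was saved to the current directory.

```python
import hashlib

# SHA256 digest listed above for obscraper-0.4.1.tar.gz.
EXPECTED_SHA256 = (
    '993c1a5d037368f76cb779289a30f1afe07d12ce0fc53e4474ae0c3fe56e11e4'
)

def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def matches_expected(path, expected=EXPECTED_SHA256):
    """Check a file's SHA256 digest against the expected hex string."""
    return sha256_of_file(path) == expected.lower()

# Usage, assuming the archive is in the current directory:
#     matches_expected('obscraper-0.4.1.tar.gz')
```

Reading in chunks keeps memory use constant regardless of file size. The same functions work for the wheel if its own SHA256 digest is passed as `expected`.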


File details

Details for the file obscraper-0.4.1-py3-none-any.whl.

File metadata

  • Download URL: obscraper-0.4.1-py3-none-any.whl
  • Size: 21.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.10.1 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.9.10

File hashes

Hashes for obscraper-0.4.1-py3-none-any.whl
Algorithm    Hash digest
SHA256       339a1f3c7afa4a6c8393edbac15e7399385a709ff5fdf979a17f89aca3f64297
MD5          c15ede6f2a69c63ec35f931bcf8d02fb
BLAKE2b-256  b91ab6e8453c4b2e9d063eeaa2a4eaab3768e6bc9114d1a36b497b1edf896d31

