
Scrape a subreddit's posts.

Project description

Subreddit trawler

Scrape subreddit posts using the old Reddit interface at https://old.reddit.com, for example:

https://old.reddit.com/r/Chinatown_irl/

https://old.reddit.com/r/China_irl/

  • scrape the subreddit listing (see the sketch after this list)

    • visit each post link
      • skip announcements
        • if the URL contains predictions?tournament, always skip the link; no old-Reddit version is available
          • e.g. https://www.reddit.com/r/wallstreetbets/predictions?tournament=tnmt-0b14066a-ad68-4351-8261-d1c0740c44d2
      • scrape comments
        • submitted text
        • submitted image
        • submitted video
        • nsfw/spoiler flags
  • find the "next" button

    • extract its link
    • go to that link
    • repeat the steps above
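
A minimal sketch of that loop, assuming requests and BeautifulSoup; the selectors follow the element snippets shown further below, and the data-* attribute names are assumptions about old Reddit's listing markup, not this package's API:

import time

import requests
from bs4 import BeautifulSoup

# Any non-default User-Agent; this value is only an example.
HEADERS = {"User-Agent": "subreddit-trawler-example/0.1"}

def crawl_listing(start_url):
    """Yield old-Reddit permalinks for every regular post in a subreddit listing."""
    url = start_url
    while url:
        soup = BeautifulSoup(requests.get(url, headers=HEADERS).text, "html.parser")
        for thing in soup.select("#siteTable div.thing"):
            # Announcements appear as stickied rows; skip them.
            if "stickied" in thing.get("class", []):
                continue
            # Prediction tournaments have no old-Reddit version; always skip.
            if "predictions?tournament" in thing.get("data-url", ""):
                continue
            yield "https://old.reddit.com" + thing.get("data-permalink", "")
        # Follow the "next" button until the last page.
        next_a = soup.select_one("span.next-button a")
        url = next_a["href"] if next_a else None
        time.sleep(2)  # at most one request every two seconds

crawl_listing("https://old.reddit.com/r/China_irl/") then yields one permalink per post, page after page, and each permalink can be visited and scraped in turn.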

Examples for various post types:

Notes

Sample video PostLink:

{
    "id": "z09a7r",
    "author": "Dry_Illustrator5642",
    "timestamp": 1668963979000,
    "url": "https://v.redd.it/4huchegx4x0a1",
    "permalink": "https://old.reddit.com/r/China_irl/comments/z09a7r/翼刀性感电臀舞/",
    "domain": "v.redd.it",
    "comments_count": 1,
    "score": 0,
    "nsfw": false,
    "spoiler": false,
    "type": "video"
}

Actual downloadable video address: https://v.redd.it/4huchegx4x0a1/DASH_720.mp4
Audio address: https://v.redd.it/4huchegx4x0a1/DASH_audio.mp4
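
Both addresses are derived from the post URL by appending a DASH suffix; a small helper along those lines (DASH_720 is just the quality level shown above and may differ per post):

def dash_urls(vreddit_url):
    """Build the video and audio DASH addresses from a v.redd.it post URL."""
    base = vreddit_url.rstrip("/")
    return base + "/DASH_720.mp4", base + "/DASH_audio.mp4"

video, audio = dash_urls("https://v.redd.it/4huchegx4x0a1")
# video -> https://v.redd.it/4huchegx4x0a1/DASH_720.mp4
# audio -> https://v.redd.it/4huchegx4x0a1/DASH_audio.mp4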

Sample image PostLink:

{
    "id": "wv4ydl",
    "author": "darkyknight01",
    "timestamp": 1661201834000,
    "url": "https://i.redd.it/6b66lj3fwbj91.jpg",
    "permalink": "https://old.reddit.com/r/zenfone6/comments/wv4ydl/in_delhi_i_need_info_for_that_how_should_i/",
    "domain": "i.redd.it",
    "comments_count": 1,
    "score": 1,
    "nsfw": false,
    "spoiler": false,
    "type": "image"
}

Sample text PostLink:

{
    "id": "xg61f6",
    "author": "silver2006",
    "timestamp": 1663370013000,
    "url": "/r/zenfone6/comments/xg61f6/need_help_unlocking_the_bootloader/",
    "permalink": "https://old.reddit.com/r/zenfone6/comments/xg61f6/need_help_unlocking_the_bootloader/",
    "domain": "self.zenfone6",
    "comments_count": 4,
    "score": 1,
    "nsfw": false,
    "spoiler": false,
    "type": "text"
}

Sample link PostLink:

{
    "id": "z2bhbm",
    "author": "Counterhaters",
    "timestamp": 1669166866000,
    "url": "https://www.zaobao.com.sg/realtime/china/story20221122-1335992",
    "permalink": "https://old.reddit.com/r/China_irl/comments/z2bhbm/消息中国拟对蚂蚁处以逾10亿美元罚款/",
    "domain": "zaobao.com.sg",
    "comments_count": 1,
    "score": 4,
    "nsfw": false,
    "spoiler": false,
    "type": "link"
}
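
All four samples share the same fields; a hypothetical dataclass mirroring them (illustrative, not necessarily the class the package itself uses):

from dataclasses import dataclass

@dataclass
class PostLink:
    """One record per post, matching the sample JSON objects above."""
    id: str
    author: str
    timestamp: int        # milliseconds since the Unix epoch
    url: str              # media or external URL, or a self-post path
    permalink: str        # old.reddit.com comments page
    domain: str           # v.redd.it, i.redd.it, self.<subreddit>, or external
    comments_count: int
    score: int
    nsfw: bool
    spoiler: bool
    type: str             # "video", "image", "text", or "link"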

Gallery element:

<div class="media-gallery">
    <div class="gallery-tiles">
        <div class="gallery-tile gallery-navigation">
            <div class="media-preview-content gallery-tile-content">
                <img class="preview" src="..." width="..." height="...">
            </div>
        </div>
    </div>
</div>
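
A sketch of collecting every tile's preview image from such a gallery with BeautifulSoup, using the selectors visible in the markup above:

from bs4 import BeautifulSoup

def gallery_image_urls(post_html):
    """Return the preview image src of every tile in a post's media gallery."""
    soup = BeautifulSoup(post_html, "html.parser")
    return [img["src"]
            for img in soup.select("div.media-gallery div.gallery-tile img.preview")
            if img.has_attr("src")]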

The "next" button element:

<span class="next-button">
    <a href="https://old.reddit.com/r/Music/?count=25&after=t3_z1lqur" rel="nofollow next">next ›</a>
</span>
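
Extracting that link from a parsed listing page (soup is a BeautifulSoup document) is a one-liner:

def next_page_url(soup):
    """Return the href of the "next" button, or None on the last page."""
    a = soup.select_one("span.next-button a")
    return a["href"] if a else None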

The element that lists all posts:

<div id="siteTable" class="sitetable linklisting">

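Each row inside #siteTable is a div.thing whose data-* attributes carry most of the PostLink fields; a sketch of building the PostLink record defined earlier from one row (the attribute names are assumptions about old Reddit's markup and worth verifying against a live page):

def post_link_from_row(thing):
    """Build a PostLink from one div.thing row inside #siteTable (a bs4 Tag)."""
    domain = thing.get("data-domain", "")
    if domain.startswith("self."):
        post_type = "text"
    elif domain == "i.redd.it":
        post_type = "image"
    elif domain == "v.redd.it":
        post_type = "video"
    else:
        post_type = "link"
    return PostLink(
        id=thing["data-fullname"].split("_", 1)[-1],   # "t3_z09a7r" -> "z09a7r"
        author=thing.get("data-author", ""),
        timestamp=int(thing.get("data-timestamp", 0)),
        url=thing.get("data-url", ""),
        permalink="https://old.reddit.com" + thing.get("data-permalink", ""),
        domain=domain,
        comments_count=int(thing.get("data-comments-count", 0)),
        score=int(thing.get("data-score", 0)),
        nsfw=thing.get("data-nsfw") == "true",
        spoiler=thing.get("data-spoiler") == "true",
        type=post_type,
    )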

When you forget to change the user agent, Reddit responds with:

<!doctype html>
<html>

<head>
    <title>Too Many Requests</title>
</head>

<body>
    <h1>whoa there, pardner!</h1>
    <p>we're sorry, but you appear to be a bot and we've seen too many requests from you lately. we enforce a hard
        speed limit on requests that appear to come from bots to prevent abuse.</p>
    <p>if you are not a bot but are spoofing one via your browser's user agent string: please change your user agent
        string to avoid seeing this message again.</p>
    <p>please wait 1 second(s) and try again.</p>
    <p>as a reminder to developers, we recommend that clients make no more than <a
            href="http://github.com/reddit/reddit/wiki/API">one request every two seconds</a> to avoid seeing this
        message.</p>
</body>

</html>
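
Setting a distinct User-Agent and pacing requests avoids that response; a minimal example (the header value is arbitrary):

import time

import requests

session = requests.Session()
# Identify the client instead of using requests' default User-Agent string.
session.headers["User-Agent"] = "subreddit-trawler-example/0.1 (contact: you@example.com)"

def polite_get(url):
    """Fetch a page while honouring Reddit's one-request-per-two-seconds guideline."""
    response = session.get(url)
    time.sleep(2)
    return response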


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

subreddit_trawler-0.0.2.tar.gz (1.1 MB)

Uploaded Source

Built Distribution

subreddit_trawler-0.0.2-py3-none-any.whl (19.9 kB)

Uploaded Python 3

File details

Details for the file subreddit_trawler-0.0.2.tar.gz.

File metadata

  • Download URL: subreddit_trawler-0.0.2.tar.gz
  • Upload date:
  • Size: 1.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.8.10

File hashes

Hashes for subreddit_trawler-0.0.2.tar.gz

  • SHA256: e409218a4b1bf2c9d7efacd352332984cd03f426926ad2d939e893514dbac1ce
  • MD5: 321453e707f101d7bbcd1b5aa221c753
  • BLAKE2b-256: 47b89cc231ecac3921daf5819958bcfbf54a933d3a9914d266627d1cccca85a3

See more details on using hashes here.

File details

Details for the file subreddit_trawler-0.0.2-py3-none-any.whl.

File metadata

File hashes

Hashes for subreddit_trawler-0.0.2-py3-none-any.whl

  • SHA256: 318f0b41e4a03a7f16524cf363dde2d6163d8b25c87915bfd15e4d7c5685a1d9
  • MD5: 8076356460b212168d8ad8f984533d9a
  • BLAKE2b-256: 2d48bb7c911f642ddafb4f9739c53d069712b2f2371b06a417be82a8724227ef

See more details on using hashes here.
