
Scrape a subreddit's posts.

Project description

Subreddit trawler

Scrape subreddit posts using the old Reddit interface at https://old.reddit.com, for example:

https://old.reddit.com/r/Chinatown_irl/

https://old.reddit.com/r/China_irl/

  • scrape the subreddit listing (see the sketch after this list)

    • visit each post link
      • skip announcements
        • if the URL contains predictions?tournament, always skip the link; no old-Reddit version is available
          • e.g. https://www.reddit.com/r/wallstreetbets/predictions?tournament=tnmt-0b14066a-ad68-4351-8261-d1c0740c44d2
      • scrape comments
        • submitted text
        • submitted image
        • submitted video
        • nsfw/spoiler flags
  • find the "next" button

    • extract its link
    • go to that link
    • repeat the steps above
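
A minimal sketch of that loop, assuming requests and BeautifulSoup; the selectors follow the element snippets shown further below, and the data-* attribute names are assumptions about old Reddit's listing markup, not this package's API:

import time

import requests
from bs4 import BeautifulSoup

# Any non-default User-Agent; this value is only an example.
HEADERS = {"User-Agent": "subreddit-trawler-example/0.1"}

def crawl_listing(start_url):
    """Yield old-Reddit permalinks for every regular post in a subreddit listing."""
    url = start_url
    while url:
        soup = BeautifulSoup(requests.get(url, headers=HEADERS).text, "html.parser")
        for thing in soup.select("#siteTable div.thing"):
            # Announcements appear as stickied rows; skip them.
            if "stickied" in thing.get("class", []):
                continue
            # Prediction tournaments have no old-Reddit version; always skip.
            if "predictions?tournament" in thing.get("data-url", ""):
                continue
            yield "https://old.reddit.com" + thing.get("data-permalink", "")
        # Follow the "next" button until the last page.
        next_a = soup.select_one("span.next-button a")
        url = next_a["href"] if next_a else None
        time.sleep(2)  # at most one request every two seconds

crawl_listing("https://old.reddit.com/r/China_irl/") then yields one permalink per post, page after page, and each permalink can be visited and scraped in turn.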

Examples for various post types:

Notes

Sample video PostLink:

{
    "id": "z09a7r",
    "author": "Dry_Illustrator5642",
    "timestamp": 1668963979000,
    "url": "https://v.redd.it/4huchegx4x0a1",
    "permalink": "https://old.reddit.com/r/China_irl/comments/z09a7r/翼刀性感电臀舞/",
    "domain": "v.redd.it",
    "comments_count": 1,
    "score": 0,
    "nsfw": false,
    "spoiler": false,
    "type": "video"
}

Actual downloadable video address: https://v.redd.it/4huchegx4x0a1/DASH_720.mp4
Audio address: https://v.redd.it/4huchegx4x0a1/DASH_audio.mp4
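
Both addresses are derived from the post URL by appending a DASH suffix; a small helper along those lines (DASH_720 is just the quality level shown above and may differ per post):

def dash_urls(vreddit_url):
    """Build the video and audio DASH addresses from a v.redd.it post URL."""
    base = vreddit_url.rstrip("/")
    return base + "/DASH_720.mp4", base + "/DASH_audio.mp4"

video, audio = dash_urls("https://v.redd.it/4huchegx4x0a1")
# video -> https://v.redd.it/4huchegx4x0a1/DASH_720.mp4
# audio -> https://v.redd.it/4huchegx4x0a1/DASH_audio.mp4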

Sample image PostLink:

{
    "id": "wv4ydl",
    "author": "darkyknight01",
    "timestamp": 1661201834000,
    "url": "https://i.redd.it/6b66lj3fwbj91.jpg",
    "permalink": "https://old.reddit.com/r/zenfone6/comments/wv4ydl/in_delhi_i_need_info_for_that_how_should_i/",
    "domain": "i.redd.it",
    "comments_count": 1,
    "score": 1,
    "nsfw": false,
    "spoiler": false,
    "type": "image"
}

Sample text PostLink:

{
    "id": "xg61f6",
    "author": "silver2006",
    "timestamp": 1663370013000,
    "url": "/r/zenfone6/comments/xg61f6/need_help_unlocking_the_bootloader/",
    "permalink": "https://old.reddit.com/r/zenfone6/comments/xg61f6/need_help_unlocking_the_bootloader/",
    "domain": "self.zenfone6",
    "comments_count": 4,
    "score": 1,
    "nsfw": false,
    "spoiler": false,
    "type": "text"
}

Sample link PostLink:

{
    "id": "z2bhbm",
    "author": "Counterhaters",
    "timestamp": 1669166866000,
    "url": "https://www.zaobao.com.sg/realtime/china/story20221122-1335992",
    "permalink": "https://old.reddit.com/r/China_irl/comments/z2bhbm/消息中国拟对蚂蚁处以逾10亿美元罚款/",
    "domain": "zaobao.com.sg",
    "comments_count": 1,
    "score": 4,
    "nsfw": false,
    "spoiler": false,
    "type": "link"
}
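
All four samples share the same fields; a hypothetical dataclass mirroring them (illustrative, not necessarily the class the package itself uses):

from dataclasses import dataclass

@dataclass
class PostLink:
    """One record per post, matching the sample JSON objects above."""
    id: str
    author: str
    timestamp: int        # milliseconds since the Unix epoch
    url: str              # media or external URL, or a self-post path
    permalink: str        # old.reddit.com comments page
    domain: str           # v.redd.it, i.redd.it, self.<subreddit>, or external
    comments_count: int
    score: int
    nsfw: bool
    spoiler: bool
    type: str             # "video", "image", "text", or "link"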

Gallery element:

<div class="media-gallery">
    <div class="gallery-tiles">
        <div class="gallery-tile gallery-navigation">
            <div class="media-preview-content gallery-tile-content">
                <img class="preview" src="..." width="..." height="...">
            </div>
        </div>
    </div>
</div>
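
A sketch of collecting every tile's preview image from such a gallery with BeautifulSoup, using the selectors visible in the markup above:

from bs4 import BeautifulSoup

def gallery_image_urls(post_html):
    """Return the preview image src of every tile in a post's media gallery."""
    soup = BeautifulSoup(post_html, "html.parser")
    return [img["src"]
            for img in soup.select("div.media-gallery div.gallery-tile img.preview")
            if img.has_attr("src")]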

The "next" button element:

<span class="next-button">
    <a href="https://old.reddit.com/r/Music/?count=25&after=t3_z1lqur" rel="nofollow next">next ›</a>
</span>
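
Extracting that link from a parsed listing page (soup is a BeautifulSoup document) is a one-liner:

def next_page_url(soup):
    """Return the href of the "next" button, or None on the last page."""
    a = soup.select_one("span.next-button a")
    return a["href"] if a else None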

The element that lists all posts:

<div id="siteTable" class="sitetable linklisting">

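Each row inside #siteTable is a div.thing whose data-* attributes carry most of the PostLink fields; a sketch of building the PostLink record defined earlier from one row (the attribute names are assumptions about old Reddit's markup and worth verifying against a live page):

def post_link_from_row(thing):
    """Build a PostLink from one div.thing row inside #siteTable (a bs4 Tag)."""
    domain = thing.get("data-domain", "")
    if domain.startswith("self."):
        post_type = "text"
    elif domain == "i.redd.it":
        post_type = "image"
    elif domain == "v.redd.it":
        post_type = "video"
    else:
        post_type = "link"
    return PostLink(
        id=thing["data-fullname"].split("_", 1)[-1],   # "t3_z09a7r" -> "z09a7r"
        author=thing.get("data-author", ""),
        timestamp=int(thing.get("data-timestamp", 0)),
        url=thing.get("data-url", ""),
        permalink="https://old.reddit.com" + thing.get("data-permalink", ""),
        domain=domain,
        comments_count=int(thing.get("data-comments-count", 0)),
        score=int(thing.get("data-score", 0)),
        nsfw=thing.get("data-nsfw") == "true",
        spoiler=thing.get("data-spoiler") == "true",
        type=post_type,
    )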

When you forget to change the user agent, Reddit responds with:

<!doctype html>
<html>

<head>
    <title>Too Many Requests</title>
</head>

<body>
    <h1>whoa there, pardner!</h1>
    <p>we're sorry, but you appear to be a bot and we've seen too many requests from you lately. we enforce a hard
        speed limit on requests that appear to come from bots to prevent abuse.</p>
    <p>if you are not a bot but are spoofing one via your browser's user agent string: please change your user agent
        string to avoid seeing this message again.</p>
    <p>please wait 1 second(s) and try again.</p>
    <p>as a reminder to developers, we recommend that clients make no more than <a
            href="http://github.com/reddit/reddit/wiki/API">one request every two seconds</a> to avoid seeing this
        message.</p>
</body>

</html>
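
Setting a distinct User-Agent and pacing requests avoids that response; a minimal example (the header value is arbitrary):

import time

import requests

session = requests.Session()
# Identify the client instead of using requests' default User-Agent string.
session.headers["User-Agent"] = "subreddit-trawler-example/0.1 (contact: you@example.com)"

def polite_get(url):
    """Fetch a page while honouring Reddit's one-request-per-two-seconds guideline."""
    response = session.get(url)
    time.sleep(2)
    return response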


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

subreddit_trawler-0.0.2.tar.gz (1.1 MB)

Uploaded Source

Built Distribution

subreddit_trawler-0.0.2-py3-none-any.whl (19.9 kB)

Uploaded Python 3

File details

Details for the file subreddit_trawler-0.0.2.tar.gz.

File metadata

  • Download URL: subreddit_trawler-0.0.2.tar.gz
  • Upload date:
  • Size: 1.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.8.10

File hashes

Hashes for subreddit_trawler-0.0.2.tar.gz

  • SHA256: e409218a4b1bf2c9d7efacd352332984cd03f426926ad2d939e893514dbac1ce
  • MD5: 321453e707f101d7bbcd1b5aa221c753
  • BLAKE2b-256: 47b89cc231ecac3921daf5819958bcfbf54a933d3a9914d266627d1cccca85a3

See more details on using hashes here.

File details

Details for the file subreddit_trawler-0.0.2-py3-none-any.whl.

File metadata

File hashes

Hashes for subreddit_trawler-0.0.2-py3-none-any.whl

  • SHA256: 318f0b41e4a03a7f16524cf363dde2d6163d8b25c87915bfd15e4d7c5685a1d9
  • MD5: 8076356460b212168d8ad8f984533d9a
  • BLAKE2b-256: 2d48bb7c911f642ddafb4f9739c53d069712b2f2371b06a417be82a8724227ef

See more details on using hashes here.
