Scrape a subreddit's posts.
Project description
Subreddit trawler
Scrape subreddit posts through the old Reddit interface at https://old.reddit.com. Example subreddit URLs:
https://old.reddit.com/r/Chinatown_irl/
https://old.reddit.com/r/China_irl/
- scrape subreddit
  - visit each post link
    - skip announcements
    - if the URL contains predictions?tournament, always skip this link; no old version is available.
      - e.g.: https://www.reddit.com/r/wallstreetbets/predictions?tournament=tnmt-0b14066a-ad68-4351-8261-d1c0740c44d2
    - scrape comments
    - text submission
    - image submission
    - video submission
    - NSFW/spoiler
- find next button
  - extract link
  - go to link
  - repeat above
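The link-filtering step in the outline above can be sketched as follows. This is a minimal illustration, not the package's actual code: the helper names `should_skip` and `to_old_reddit` are hypothetical, and rewriting the host to old.reddit.com is an assumption about how links might be normalised before visiting them.

```python
from urllib.parse import urlsplit, urlunsplit


def should_skip(url: str) -> bool:
    """Return True for prediction-tournament links, which have no old.reddit.com version."""
    return "predictions?tournament" in url


def to_old_reddit(url: str) -> str:
    """Rewrite a www.reddit.com (or bare reddit.com) URL to its old.reddit.com form."""
    parts = urlsplit(url)
    if parts.netloc in ("www.reddit.com", "reddit.com"):
        parts = parts._replace(netloc="old.reddit.com")
    return urlunsplit(parts)


if __name__ == "__main__":
    for link in (
        "https://www.reddit.com/r/wallstreetbets/predictions?tournament=tnmt-0b14066a-ad68-4351-8261-d1c0740c44d2",
        "https://www.reddit.com/r/China_irl/",
    ):
        print("skip" if should_skip(link) else to_old_reddit(link))
```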
Examples for various post types:
- Text post
- Image post
- Video post
- Gallery
- NSFW text (Whats the most NSFW experience you witnessed right in front of your eyes?)
- NSFW image (Grown man ass-kissing)
- NSFW video (Ukrainian drone flies right into the Russian trench)
Notes
Sample video PostLink:
{
  "id": "z09a7r",
  "author": "Dry_Illustrator5642",
  "timestamp": 1668963979000,
  "url": "https://v.redd.it/4huchegx4x0a1",
  "permalink": "https://old.reddit.com/r/China_irl/comments/z09a7r/翼刀性感电臀舞/",
  "domain": "v.redd.it",
  "comments_count": 1,
  "score": 0,
  "nsfw": false,
  "spoiler": false,
  "type": "video"
}
Actual downloadable video address: https://v.redd.it/4huchegx4x0a1/DASH_720.mp4
Audio address: https://v.redd.it/4huchegx4x0a1/DASH_audio.mp4
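The two addresses above follow a simple pattern relative to the post's `url` field, which could be derived as in the sketch below. The function name `dash_urls` and its `quality` parameter are hypothetical; the source only shows `DASH_720.mp4`, so whether other resolutions exist for a given video is an assumption.

```python
from typing import Tuple


def dash_urls(video_url: str, quality: str = "720") -> Tuple[str, str]:
    """Build (video, audio) download URLs from a v.redd.it post URL.

    Only the 720 variant appears in the notes; other quality values are a guess.
    """
    base = video_url.rstrip("/")
    return f"{base}/DASH_{quality}.mp4", f"{base}/DASH_audio.mp4"


if __name__ == "__main__":
    video, audio = dash_urls("https://v.redd.it/4huchegx4x0a1")
    print(video)
    print(audio)
```

The video and audio streams are separate files, so a downloader would need to fetch both and merge them with an external tool.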
Sample image PostLink:
{
  "id": "wv4ydl",
  "author": "darkyknight01",
  "timestamp": 1661201834000,
  "url": "https://i.redd.it/6b66lj3fwbj91.jpg",
  "permalink": "https://old.reddit.com/r/zenfone6/comments/wv4ydl/in_delhi_i_need_info_for_that_how_should_i/",
  "domain": "i.redd.it",
  "comments_count": 1,
  "score": 1,
  "nsfw": false,
  "spoiler": false,
  "type": "image"
}
Sample text PostLink:
{
  "id": "xg61f6",
  "author": "silver2006",
  "timestamp": 1663370013000,
  "url": "/r/zenfone6/comments/xg61f6/need_help_unlocking_the_bootloader/",
  "permalink": "https://old.reddit.com/r/zenfone6/comments/xg61f6/need_help_unlocking_the_bootloader/",
  "domain": "self.zenfone6",
  "comments_count": 4,
  "score": 1,
  "nsfw": false,
  "spoiler": false,
  "type": "text"
}
Sample link PostLink:
{
  "id": "z2bhbm",
  "author": "Counterhaters",
  "timestamp": 1669166866000,
  "url": "https://www.zaobao.com.sg/realtime/china/story20221122-1335992",
  "permalink": "https://old.reddit.com/r/China_irl/comments/z2bhbm/消息中国拟对蚂蚁处以逾10亿美元罚款/",
  "domain": "zaobao.com.sg",
  "comments_count": 1,
  "score": 4,
  "nsfw": false,
  "spoiler": false,
  "type": "link"
}
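The four samples above share one shape, which could be modelled roughly as below. This is a sketch rather than the package's own class: the `PostLink` dataclass here simply mirrors the JSON keys, `guess_type` encodes a domain-to-type mapping inferred from these four samples only, and treating `timestamp` as Unix milliseconds is an assumption based on the magnitude of the values.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class PostLink:
    id: str
    author: str
    timestamp: int  # assumed to be Unix time in milliseconds
    url: str
    permalink: str
    domain: str
    comments_count: int
    score: int
    nsfw: bool
    spoiler: bool
    type: str

    @property
    def created_at(self) -> datetime:
        """Timestamp as an aware UTC datetime (milliseconds assumed)."""
        return datetime.fromtimestamp(self.timestamp / 1000, tz=timezone.utc)


def guess_type(domain: str) -> str:
    """Map a post's domain to a type, following the pattern in the samples."""
    if domain == "v.redd.it":
        return "video"
    if domain == "i.redd.it":
        return "image"
    if domain.startswith("self."):
        return "text"
    return "link"


if __name__ == "__main__":
    raw = """{"id": "z09a7r", "author": "Dry_Illustrator5642",
              "timestamp": 1668963979000, "url": "https://v.redd.it/4huchegx4x0a1",
              "permalink": "https://old.reddit.com/r/China_irl/comments/z09a7r/",
              "domain": "v.redd.it", "comments_count": 1, "score": 0,
              "nsfw": false, "spoiler": false, "type": "video"}"""
    post = PostLink(**json.loads(raw))
    print(post.created_at.isoformat(), guess_type(post.domain))
```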
Gallery element:
<div class="media-gallery">
  <div class="gallery-tiles">
    <div class="gallery-tile gallery-navigation">
      <div class="media-preview-content gallery-tile-content">
        <img class="preview" src="..." width="..." height="...">
      </div>
    </div>
  </div>
</div>
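Given the gallery markup above, the preview image URLs could be collected with the standard library's `html.parser`, as in this sketch. The class and function names are hypothetical, and the `src` of a preview image may be a thumbnail rather than the full-size file; that is not stated in the notes.

```python
from html.parser import HTMLParser
from typing import List


class GalleryParser(HTMLParser):
    """Collects src of <img class="preview"> found inside <div class="media-gallery">."""

    def __init__(self) -> None:
        super().__init__()
        self._gallery_depth = 0  # > 0 while inside the media-gallery div
        self.image_urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if tag == "div":
            # Start counting at the gallery div, then track every nested div.
            if self._gallery_depth or "media-gallery" in classes:
                self._gallery_depth += 1
        elif tag == "img" and self._gallery_depth and "preview" in classes:
            src = attr.get("src")
            if src:
                self.image_urls.append(src)

    def handle_endtag(self, tag):
        if tag == "div" and self._gallery_depth:
            self._gallery_depth -= 1


def gallery_image_urls(html: str) -> List[str]:
    parser = GalleryParser()
    parser.feed(html)
    return parser.image_urls
```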
The "next" button element:
<span class="next-button">
  <a href="https://old.reddit.com/r/Music/?count=25&after=t3_z1lqur" rel="nofollow next">next ›</a>
</span>
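The "find next button, extract link, go to link, repeat" loop could look roughly like the following, using only the standard library. This is a sketch under assumptions: `find_next_url` and `iter_pages` are hypothetical names, the `fetch` callable is supplied by the caller, and `max_pages` is an illustrative safety cap rather than anything the package documents.

```python
from html.parser import HTMLParser
from typing import Callable, Iterator, Optional, Tuple


class NextLinkParser(HTMLParser):
    """Records the href of the first <a> inside <span class="next-button">."""

    def __init__(self) -> None:
        super().__init__()
        self._in_next_span = False
        self.next_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "span" and "next-button" in (attr.get("class") or "").split():
            self._in_next_span = True
        elif tag == "a" and self._in_next_span and self.next_url is None:
            self.next_url = attr.get("href")

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_next_span = False


def find_next_url(html: str) -> Optional[str]:
    parser = NextLinkParser()
    parser.feed(html)
    return parser.next_url


def iter_pages(
    start_url: str, fetch: Callable[[str], str], max_pages: int = 10
) -> Iterator[Tuple[str, str]]:
    """Yield (url, html) for each listing page, following "next" links."""
    url: Optional[str] = start_url
    for _ in range(max_pages):
        if url is None:
            break
        html = fetch(url)
        yield url, html
        url = find_next_url(html)
```

Passing `fetch` in as a parameter keeps the pagination logic testable without network access.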
The element that lists all posts:
<div id="siteTable" class="sitetable linklisting">
The response returned when you forget to change the User-Agent:
<!doctype html>
<html>
<head>
<title>Too Many Requests</title>
</head>
<body>
<h1>whoa there, pardner!</h1>
<p>we're sorry, but you appear to be a bot and we've seen too many requests from you lately. we enforce a hard
speed limit on requests that appear to come from bots to prevent abuse.</p>
<p>if you are not a bot but are spoofing one via your browser's user agent string: please change your user agent
string to avoid seeing this message again.</p>
<p>please wait 1 second(s) and try again.</p>
<p>as a reminder to developers, we recommend that clients make no more than <a
href="http://github.com/reddit/reddit/wiki/API">one request every two seconds</a> to avoid seeing this
message.</p>
</body>
</html>
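One way to avoid and detect the page above is to send a custom User-Agent, keep at least two seconds between requests (the limit the message itself recommends), and check the body for the "Too Many Requests" title. The sketch below uses only the standard library; the `USER_AGENT` value and the function names are hypothetical, and the retry policy is an illustrative choice.

```python
import re
import time
import urllib.error
import urllib.request
from typing import Optional

USER_AGENT = "subreddit-trawler-example/0.1"  # hypothetical value; use your own
MIN_INTERVAL = 2.0  # seconds between requests, per the message above


def retry_after_seconds(html: str) -> Optional[int]:
    """Return the wait time if html is the rate-limit page, else None."""
    if "<title>Too Many Requests</title>" not in html:
        return None
    match = re.search(r"please wait (\d+) second", html)
    return int(match.group(1)) if match else int(MIN_INTERVAL)


def fetch(url: str, retries: int = 3) -> str:
    """Fetch a page with a custom User-Agent, backing off when rate limited."""
    for _ in range(retries):
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        try:
            with urllib.request.urlopen(request) as response:
                html = response.read().decode("utf-8", errors="replace")
        except urllib.error.HTTPError as error:
            html = error.read().decode("utf-8", errors="replace")
            if retry_after_seconds(html) is None:
                raise  # some other HTTP error, not the rate-limit page
        wait = retry_after_seconds(html)
        if wait is None:
            time.sleep(MIN_INTERVAL)  # space out consecutive calls
            return html
        time.sleep(max(wait, MIN_INTERVAL))
    raise RuntimeError(f"still rate limited after {retries} attempts: {url}")
```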
File details
Details for the file subreddit_trawler-0.0.2.tar.gz.
File metadata
- Download URL: subreddit_trawler-0.0.2.tar.gz
- Upload date:
- Size: 1.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.8.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | e409218a4b1bf2c9d7efacd352332984cd03f426926ad2d939e893514dbac1ce
MD5 | 321453e707f101d7bbcd1b5aa221c753
BLAKE2b-256 | 47b89cc231ecac3921daf5819958bcfbf54a933d3a9914d266627d1cccca85a3
File details
Details for the file subreddit_trawler-0.0.2-py3-none-any.whl.
File metadata
- Download URL: subreddit_trawler-0.0.2-py3-none-any.whl
- Upload date:
- Size: 19.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.8.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | 318f0b41e4a03a7f16524cf363dde2d6163d8b25c87915bfd15e4d7c5685a1d9
MD5 | 8076356460b212168d8ad8f984533d9a
BLAKE2b-256 | 2d48bb7c911f642ddafb4f9739c53d069712b2f2371b06a417be82a8724227ef