Weibo trending posts scraper
Scrape trending posts from the Weibo front page.
Weibo API
To understand how Weibo fetches new posts, a network inspection is performed on the mobile website.
A request to this API endpoint is observed: https://m.weibo.cn/api/container/getIndex?containerid=102803&openApp=0
Its response is a nested JSON object. We're interested in the items in data's cards array.
Each card contains an mblog (microblog) object that encapsulates the content as well as metadata about the content.
One call to the API returns ten cards. When the device viewport scrolls toward the bottom of the page, the API is called again to fetch 10 more posts.
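The nesting described above (data, then cards, then mblog) can be walked in a few lines of Python. This is a minimal sketch, not weibo-trending's own implementation: extract_mblogs is a hypothetical helper, the response is assumed to be already decoded into a dict, and the sample data is hand-written to mirror only the nesting, not a real Weibo payload.

```python
from typing import Any, Dict, List


def extract_mblogs(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect the mblog object from every card in a decoded API response.

    Hypothetical helper for illustration; cards without an mblog are skipped.
    """
    cards = response.get("data", {}).get("cards", [])
    return [
        card["mblog"]
        for card in cards
        if isinstance(card, dict) and "mblog" in card
    ]


# Hand-written sample that mirrors only the nesting described above.
sample = {
    "data": {
        "cards": [
            {"mblog": {"id": "1", "text": "first post"}},
            {"note": "a card without an mblog"},
            {"mblog": {"id": "2", "text": "second post"}},
        ]
    }
}

mblogs = extract_mblogs(sample)
print([m["id"] for m in mblogs])
```

Skipping cards that lack an mblog keeps the helper tolerant of responses that mix post cards with other card types, if such responses occur.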
weibo-trending extracts the following fields from mblog:
- text: post text; can contain HTML tags when a video stream is included
- id: post ID (str)
- url: link to the post
- user: the poster
- pics: the URLs of the images included in this post (array of objects)
- created_at: post date (str)
- source: the device the post was submitted from
In addition, the following user fields are also extracted:
- id: user ID (int)
- profile_url: link to the user profile
- screen_name: screen name
- gender: "f" for female, "m" for male. Weibo does not provide codes for non-binary users
- followers_count: the number of followers as a str, in units of 10,000. For example: "433.8万" (4,338,000).
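Because followers_count arrives as a string in units of 10,000, it usually needs converting before it can be sorted or compared. The sketch below shows one way to do that. parse_followers_count is a hypothetical helper, not part of weibo-trending, and it assumes that values without the 万 suffix are plain numbers.

```python
from decimal import Decimal


def parse_followers_count(value) -> int:
    """Convert a followers_count value such as "433.8万" to an integer.

    Assumption: a value without the 万 suffix is already a plain number.
    Decimal is used so that "433.8" is scaled without float rounding error.
    """
    text = str(value).strip()
    if text.endswith("万"):
        return int(Decimal(text[:-1]) * 10_000)
    return int(Decimal(text))


print(parse_followers_count("433.8万"))
```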
Note that repeated calls sometimes return posts that have been returned before.
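When calling the API repeatedly, duplicates can be filtered out by remembering the post IDs already seen. A minimal sketch follows; drop_seen is a hypothetical helper, and it assumes each parsed post is a dict carrying the id field listed above.

```python
from typing import Any, Dict, Iterable, List, Set


def drop_seen(
    mblogs: Iterable[Dict[str, Any]], seen_ids: Set[str]
) -> List[Dict[str, Any]]:
    """Return only posts whose id is not in seen_ids, and record the new ids."""
    fresh = []
    for mblog in mblogs:
        post_id = str(mblog["id"])
        if post_id not in seen_ids:
            seen_ids.add(post_id)
            fresh.append(mblog)
    return fresh


# Hand-written stand-ins for the parsed results of two consecutive calls.
seen: Set[str] = set()
first_call = [{"id": "1"}, {"id": "2"}]
second_call = [{"id": "2"}, {"id": "3"}]  # post 2 is returned again

print([m["id"] for m in drop_seen(first_call, seen)])
print([m["id"] for m in drop_seen(second_call, seen)])
```

The seen set is passed in rather than kept as module state, so the caller decides how long IDs are remembered.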
weibo-trending usage guide
As a library
Install
pip install weibo_trending
Get and parse posts
from weibo_trending import get_new_posts, parse_response
resp = get_new_posts()
mblogs = parse_response(resp)
for mblog in mblogs:
    print(mblog)
As a command line tool
Install
pip install weibo_trending
Usage
python -m weibo_trending --help
usage: weibo_trending [-h] [-d DIR] [-s]
Scrape and parse Weibo trending posts.
optional arguments:
  -h, --help          show this help message and exit
  -d DIR, --dir DIR   specify the output directory. Defaults to the current working directory
  -s, --skip-parsing  whether to skip parsing and dump the raw JSON response from Weibo
python -m weibo_trending
weibo_trending will save each scraped post with the following filename format:
weibo_<user ID>_<post ID>.json
- Example: weibo_1631153043_4834313265233660.json
Each call to weibo_trending usually saves 10 new files. If you get fewer than 10, the response contained one or more deleted posts, which are not saved.
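The filename format can be reproduced outside the tool, for example to check whether a post is already on disk. This is a sketch under assumptions, not the package's own code: post_filename and save_post are hypothetical helpers, and a parsed post is assumed to be a dict with an id field and a nested user dict that has its own id.

```python
import json
from pathlib import Path
from typing import Any, Dict


def post_filename(mblog: Dict[str, Any]) -> str:
    """Build weibo_<user ID>_<post ID>.json from a parsed post."""
    return f"weibo_{mblog['user']['id']}_{mblog['id']}.json"


def save_post(mblog: Dict[str, Any], out_dir: str = ".") -> Path:
    """Write one post as UTF-8 JSON into out_dir and return the file path."""
    path = Path(out_dir) / post_filename(mblog)
    # ensure_ascii=False keeps Chinese text readable in the saved file.
    path.write_text(json.dumps(mblog, ensure_ascii=False), encoding="utf-8")
    return path


# IDs taken from the example filename above; all other fields are omitted.
example = {"id": "4834313265233660", "user": {"id": 1631153043}}
print(post_filename(example))
```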
Develop
git clone https://github.com/ericlingit/weibo-trending.git
cd weibo-trending
python3 -m venv venv
source venv/bin/activate
pip install -U pip wheel
pip install -r requirements.txt
pip install -e .
pytest
Packaging
python -m build --wheel
The built wheel is in ./dist/.
File details
Details for the file weibo_trending-0.0.1.tar.gz.
File metadata
- Download URL: weibo_trending-0.0.1.tar.gz
- Upload date:
- Size: 1.2 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.8.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | 5cb08db462d7bc77a5d924bf40fa5c970c3e50c97ab6c13929d5d1ff22c58c03
MD5 | 7748d33e969ebb55aa4d5ae52b71a52d
BLAKE2b-256 | 7291d0027046a063fda75045a0695558a546491697bd08a21cdb1af687ea4b6d
File details
Details for the file weibo_trending-0.0.1-py3-none-any.whl.
File metadata
- Download URL: weibo_trending-0.0.1-py3-none-any.whl
- Upload date:
- Size: 17.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.8.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | d1106aeeda013900ade4c9f6a83be0c8222a5a54c691e000f8654af13e564346
MD5 | f5e214ddd90279e74d1901e4b563447b
BLAKE2b-256 | d7d260f07933c59b368a3bd88ccc558fdfb7147aef98321abf1e8db47682412c