A Python package for crawling Instagram, based on selenium.
Introduction
A Python package for crawling Instagram, based on [selenium](https://selenium-python.readthedocs.io) and [selenium-wire](https://github.com/wkeeling/selenium-wire). The idea is to use selenium to trigger the JavaScript in the browser, which sends the requests for a specific action to the server, while selenium-wire monitors the network traffic and captures the responses for data extraction.
User Guide
How to Install
Please install it via:
pip install crawlinsta
Or you can install it from the source code:
pip install git+https://github.com/zhiwei2017/crawlinsta.git@master
Prerequisites
Please make sure the language setting of your Instagram account is English or German.
How to Use
Create Browser Driver
To create a browser driver, first import webdriver from the crawlinsta package and create a browser instance:
>>> from crawlinsta import webdriver
>>> driver = webdriver.Chrome('path_to_chromedriver')
>>> # Do some crawling with driver
>>> driver.quit()
If you don’t specify the Chrome driver path, the default one will be used.
Please remember to call:
>>> driver.quit()
when you finish the crawling.
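One way to make sure quit() is always called, even when crawling raises an exception, is to wrap the driver in a context manager. The sketch below is not part of crawlinsta: managed_driver and FakeDriver are hypothetical names, and FakeDriver only stands in for a real browser so the pattern can be run without Chrome.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(make_driver):
    """Yield a driver created by ``make_driver`` and always quit it afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        # Runs even if the crawling code raised, so no browser is left behind.
        driver.quit()


class FakeDriver:
    """Hypothetical stand-in for a real browser driver (demonstration only)."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


with managed_driver(FakeDriver) as demo_driver:
    pass  # do some crawling with demo_driver here

print(demo_driver.closed)  # True
```

With the real package, the factory would presumably be something like `lambda: webdriver.Chrome('path_to_chromedriver')`, as in the example above.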
Login
For the first login, you need your username and password. Pass them to the function login from the module crawlinsta.login:
>>> from crawlinsta.login import login
>>> login(driver, "your_username", "your_password")
Once you have logged in with your username and password, a cookie file is created with the default name instagram_cookies.pkl. For later sessions, you can use the function login_with_cookies from the module crawlinsta.login to log in with that cookie file:
>>> from crawlinsta.login import login_with_cookies
>>> login_with_cookies(driver)
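The two login paths can be combined into a single helper that prefers the cookie file when it exists. This is only a sketch: ensure_login is a hypothetical name, the login functions are passed in as arguments so the code runs without a browser, and it assumes the cookie file is written to the current working directory under its default name.

```python
from pathlib import Path


def ensure_login(driver, username, password, login, login_with_cookies,
                 cookie_file="instagram_cookies.pkl"):
    """Log in with the saved cookie file if present, otherwise with credentials.

    ``login`` and ``login_with_cookies`` are injected; in real use they would
    be the functions imported from crawlinsta.login.
    Returns the name of the method that was used.
    """
    # Assumption: the cookie file lives at ``cookie_file`` relative to the cwd.
    if Path(cookie_file).exists():
        login_with_cookies(driver)
        return "cookies"
    login(driver, username, password)
    return "password"
```

A call would then look like `ensure_login(driver, "your_username", "your_password", login, login_with_cookies)`.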
Crawling
Currently available crawling functions:
crawlinsta.collecting.collect_user_info
Collects user information for the given username.
- Input:
driver: browser driver instance
username (str): username to crawl
- Output:
user_info (dict): user information, including username, full name, biography, external url, number of posts, number of followers, number of followings, and number of reels.
Example:
>>> collect_user_info(driver, "nasa")
{
  "id": "528817151",
  "username": "nasa",
  "fullname": "NASA",
  "biography": "Exploring the universe and our home planet.",
  "follower_count": 97956738,
  "following_count": 77,
  "following_tag_count": 10,
  "is_private": false,
  "is_verified": true,
  "profile_pic_url": "https://dummy.pic.com",
  "post_count": 4116,
  "usertags_count": 0
}
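As a small illustration of working with the result, the following sketch turns a user information dictionary into a one-line summary. summarize_user is a hypothetical helper, and the sample dictionary copies only the fields it needs from the example output above.

```python
def summarize_user(user_info):
    """Build a short, human-readable summary from a user information dict."""
    return (
        f"{user_info['fullname']} (@{user_info['username']}): "
        f"{user_info['follower_count']:,} followers, "
        f"{user_info['post_count']:,} posts"
    )


# Subset of the fields shown in the example output above.
sample_user = {
    "username": "nasa",
    "fullname": "NASA",
    "follower_count": 97956738,
    "post_count": 4116,
}

print(summarize_user(sample_user))
# NASA (@nasa): 97,956,738 followers, 4,116 posts
```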
crawlinsta.collecting.collect_posts_of_user
Collects up to n posts from the account with the given username.
- Input:
driver: browser driver instance
username (str): username to crawl
n (int): maximum number of posts to collect. Defaults to 100; if set to 0, all posts are collected.
- Output:
posts (list): list of posts, each post is a dictionary containing post information, including post code, post url, post type, post caption, post location, post time, number of likes, number of comments, and media url.
Example:
>>> collect_posts_of_user(driver, "dummy_instagram_username", 100)
{
  "posts": [
    {
      "like_count": 817982,
      "comment_count": 3000,
      "id": "3215769692664507668",
      "code": "CygtX9ivC0U",
      "user": {
        "id": "50269116275",
        "username": "dummy_instagram_username",
        "fullname": "",
        "profile_pic_url": "https://scontent.cdninstagram.com/v",
        "is_private": false,
        "is_verified": false
      },
      "taken_at": 1697569769,
      "media_type": "Photo",
      "caption": {
        "id": "17985380039262083",
        "text": "I know what she’s gonna say before she even has the chance 😂",
        "created_at_utc": null
      },
      "accessibility_caption": "",
      "original_width": 1080,
      "original_height": 1920,
      "urls": [
        "https://scontent.cdninstagram.com/o1"
      ],
      "has_shared_to_fb": false,
      "usertags": [],
      "location": null,
      "music": {
        "id": "2614441095386924",
        "is_trending_in_clips": false,
        "artist": {
          "id": "50269116275",
          "username": "dummy_instagram_username",
          "fullname": "",
          "profile_pic_url": "",
          "is_private": null,
          "is_verified": null
        },
        "title": "Original audio",
        "duration_in_ms": null,
        "url": null
      }
    },
    ...
  ],
  "count": 100
}
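The code and taken_at fields of a post are often the first ones to post-process. The sketch below derives a post URL and a timezone-aware timestamp from them. describe_post is a hypothetical helper, and two things are assumptions rather than documented behaviour of crawlinsta: that taken_at is a Unix timestamp in seconds, and that posts are reachable at https://www.instagram.com/p/<code>/.

```python
from datetime import datetime, timezone


def describe_post(post):
    """Return (url, taken_at as an aware UTC datetime) for one post dict."""
    # Assumption: public post URLs follow https://www.instagram.com/p/<code>/.
    url = f"https://www.instagram.com/p/{post['code']}/"
    # Assumption: ``taken_at`` is a Unix timestamp in seconds.
    taken_at = datetime.fromtimestamp(post["taken_at"], tz=timezone.utc)
    return url, taken_at


# Fields copied from the example output above.
sample_post = {"code": "CygtX9ivC0U", "taken_at": 1697569769}

url, taken_at = describe_post(sample_post)
print(url)                   # https://www.instagram.com/p/CygtX9ivC0U/
print(taken_at.isoformat())  # 2023-10-17T19:09:29+00:00
```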
crawlinsta.collecting.collect_reels_of_user
Collects up to n reels from the account with the given username.
- Input:
driver: browser driver instance
username (str): username to crawl
n (int): maximum number of reels to collect. Defaults to 100; if set to 0, all reels are collected.
- Output:
reels (list): list of reels, each reel is a dictionary containing reel information, including reel code, reel url, reel caption, reel time, number of likes, number of comments, and media url.
Example:
>>> collect_reels_of_user(driver, "dummy_instagram_username", 100)
{
  "reels": [
    {
      "like_count": 817982,
      "comment_count": 3000,
      "id": "3215769692664507668",
      "code": "CygtX9ivC0U",
      "user": {
        "id": "50269116275",
        "username": "dummy_instagram_username",
        "fullname": "",
        "profile_pic_url": "https://scontent.cdninstagram.com/v",
        "is_private": false,
        "is_verified": false
      },
      "taken_at": 1697569769,
      "media_type": "Reel",
      "caption": {
        "id": "17985380039262083",
        "text": "I know what she’s gonna say before she even has the chance 😂",
        "created_at_utc": null
      },
      "accessibility_caption": "",
      "original_width": 1080,
      "original_height": 1920,
      "urls": [
        "https://scontent.cdninstagram.com/o1"
      ],
      "has_shared_to_fb": false,
      "usertags": [],
      "location": null,
      "music": {
        "id": "2614441095386924",
        "is_trending_in_clips": false,
        "artist": {
          "id": "50269116275",
          "username": "dummy_instagram_username",
          "fullname": "",
          "profile_pic_url": "",
          "is_private": null,
          "is_verified": null
        },
        "title": "Original audio",
        "duration_in_ms": null,
        "url": null
      }
    },
    ...
  ],
  "count": 100
}
crawlinsta.collecting.collect_tagged_posts_of_user
Collects up to n posts in which the user with the given username is tagged.
- Input:
driver: browser driver instance
username (str): username to crawl
n (int): maximum number of tagged posts to collect. Defaults to 100; if set to 0, all tagged posts are collected.
- Output:
tagged_posts (list): list of tagged posts, each post is a dictionary containing post information, including post code, post url, post type, post caption, post location, post time, number of likes, number of comments, and media url.
Example:
>>> collect_tagged_posts_of_user(driver, "dummy_instagram_username", 100)
{
  "tagged_posts": [
    {
      "like_count": 817982,
      "comment_count": 3000,
      "id": "3215769692664507668",
      "code": "CygtX9ivC0U",
      "user": {
        "id": "50269116275",
        "username": "dummy_instagram_username",
        "fullname": "",
        "profile_pic_url": "https://scontent.cdninstagram.com/v",
        "is_private": false,
        "is_verified": false
      },
      "taken_at": 1697569769,
      "media_type": "Reel",
      "caption": {
        "id": "17985380039262083",
        "text": "I know what she’s gonna say before she even has the chance 😂",
        "created_at_utc": null
      },
      "accessibility_caption": "",
      "original_width": 1080,
      "original_height": 1920,
      "urls": [
        "https://scontent.cdninstagram.com/o1"
      ],
      "has_shared_to_fb": false,
      "usertags": [],
      "location": null,
      "music": {
        "id": "2614441095386924",
        "is_trending_in_clips": false,
        "artist": {
          "id": "50269116275",
          "username": "dummy_instagram_username",
          "fullname": "",
          "profile_pic_url": "",
          "is_private": null,
          "is_verified": null
        },
        "title": "Original audio",
        "duration_in_ms": null,
        "url": null
      }
    },
    ...
  ],
  "count": 100
}
crawlinsta.collecting.get_friendship_status
Get the relationship between the user with username1 and the user with username2, i.e. finding out who is following whom.
- Input:
driver: browser driver instance
username1 (str): username of the person A.
username2 (str): username of the person B.
- Output:
friendship_status (dict): relationship between the two users. “following” indicates whether person A follows person B, and “followed_by” indicates whether person B follows person A.
Example:
>>> get_friendship_status(driver, "dummy_instagram_username1", "dummy_instagram_username2")
{
  "following": false,
  "followed_by": true
}
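The two flags give four possible combinations. A minimal sketch of how they might be interpreted, following the meaning of “following” and “followed_by” given above (describe_friendship is a hypothetical helper, not part of the package):

```python
def describe_friendship(status):
    """Translate a friendship status dict into a short description.

    Person A is the first username passed to get_friendship_status,
    person B the second.
    """
    following = status["following"]      # A follows B
    followed_by = status["followed_by"]  # B follows A
    if following and followed_by:
        return "A and B follow each other"
    if following:
        return "A follows B"
    if followed_by:
        return "B follows A"
    return "no follow relationship"


# Values from the example output above.
print(describe_friendship({"following": False, "followed_by": True}))
# B follows A
```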
crawlinsta.collecting.collect_followers_of_user
Collects up to n followers of the account with the given username.
- Input:
driver: browser driver instance
username (str): username to crawl
n (int): maximum number of followers to collect. Defaults to 100; if set to 0, all followers are collected.
- Output:
followers (list): list of followers, each follower is a dictionary containing follower information, including follower username, follower full name, follower profile picture url etc.
Example:
>>> collect_followers_of_user(driver, "dummy_instagram_username", 100)
{
  "users": [
    {
      "id": "528817151",
      "username": "nasa",
      "fullname": "NASA",
      "is_private": false,
      "is_verified": true,
      "profile_pic_url": "https://dummy.pic.com"
    },
    ...
  ],
  "count": 100
}
crawlinsta.collecting.collect_followings_of_user
Collects up to n users that the account with the given username is following.
- Input:
driver: browser driver instance
username (str): username to crawl
n (int): maximum number of following users to collect. Defaults to 100; if set to 0, all following users are collected.
- Output:
followings (list): list of following users, each following user is a dictionary containing following user information, including following username, following full name, following profile picture url etc.
Example:
>>> collect_followings_of_user(driver, "dummy_instagram_username", 100)
{
  "users": [
    {
      "id": "528817151",
      "username": "nasa",
      "fullname": "NASA",
      "is_private": false,
      "is_verified": true,
      "profile_pic_url": "https://dummy.pic.com"
    },
    ...
  ],
  "count": 100
}
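Because the follower and following results share the same "users" structure, they can be compared directly. The following sketch finds accounts that appear in both lists. mutual_follows is a hypothetical helper and the sample data is invented for the demonstration; only the "users", "id", and "username" keys are taken from the example outputs.

```python
def mutual_follows(followers_result, followings_result):
    """Return usernames present in both results, sorted alphabetically."""
    follower_ids = {user["id"] for user in followers_result["users"]}
    return sorted(
        user["username"]
        for user in followings_result["users"]
        if user["id"] in follower_ids
    )


# Invented sample data in the shape of the example outputs.
followers = {"users": [{"id": "1", "username": "alice"},
                       {"id": "2", "username": "bob"}], "count": 2}
followings = {"users": [{"id": "2", "username": "bob"},
                        {"id": "3", "username": "carol"}], "count": 2}

print(mutual_follows(followers, followings))  # ['bob']
```

Matching on "id" rather than "username" is a design choice here, since the id is the less likely of the two to change between crawls.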
crawlinsta.collecting.collect_likers_of_post
Collects the users who liked a given post.
- Input:
driver: browser driver instance
post_code (str): post code, used to build the directly accessible URL of the post.
n (int): maximum number of likers to collect. Defaults to 100; if set to 0, all likers are collected.
- Output:
likers (list): list of likers, each liker is a dictionary containing liker information, including liker username, liker full name, liker profile picture url etc and friendship status between the post owner and the liker.
Example:
>>> collect_likers_of_post(driver, "WGDBS3D", 100)
{
  "likers": [
    {
      "id": "528817151",
      "username": "nasa",
      "fullname": "NASA",
      "is_private": false,
      "is_verified": true,
      "profile_pic_url": "https://dummy.pic.com"
    },
    ...
  ],
  "count": 100
}
crawlinsta.collecting.collect_comments_of_post
Collects up to n comments of a given post.
- Input:
driver: browser driver instance
post_code (str): post code, used to build the directly accessible URL of the post.
n (int): maximum number of comments to collect. Defaults to 100; if set to 0, all comments are collected.
- Output:
comments (list): list of comments, each comment is a dictionary containing comment information, including comment id, comment text, comment time, comment likes count, comment owner username, comment owner full name, comment owner profile picture url etc.
Example:
>>> collect_comments_of_post(driver, "WGDBS3D", 100)
{
  "comments": [
    {
      "id": "18278957755095859",
      "user": {
        "id": "6293392719",
        "username": "dummy_user"
      },
      "post_id": "3275298868401088037",
      "created_at_utc": 1704669275,
      "status": null,
      "share_enabled": null,
      "is_ranked_comment": null,
      "text": "Fantastic Job",
      "has_translation": false,
      "is_liked_by_post_owner": null,
      "comment_like_count": 0
    },
    ...
  ],
  "count": 100
}
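A typical next step is ranking the collected comments. This sketch picks the most-liked ones using the comment_like_count field. top_comments is a hypothetical helper, and all sample comments except "Fantastic Job" are invented for the demonstration.

```python
def top_comments(result, k=3):
    """Return (username, text, like count) for the k most-liked comments."""
    ranked = sorted(
        result["comments"],
        key=lambda comment: comment["comment_like_count"],
        reverse=True,
    )
    return [
        (c["user"]["username"], c["text"], c["comment_like_count"])
        for c in ranked[:k]
    ]


# Sample data in the shape of the example output; two entries are invented.
sample_comments = {
    "comments": [
        {"user": {"username": "dummy_user"}, "text": "Fantastic Job",
         "comment_like_count": 0},
        {"user": {"username": "other_user"}, "text": "Great shot",
         "comment_like_count": 12},
        {"user": {"username": "third_user"}, "text": "Wow",
         "comment_like_count": 5},
    ],
    "count": 3,
}

for row in top_comments(sample_comments, k=2):
    print(row)
```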
crawlinsta.collecting.search_with_keyword
Searches for users, hashtags, and places matching the given keyword.
- Input:
driver: browser driver instance
keyword (str): keyword for searching.
pers (bool): whether the results should be personalized.
- Output:
search_results (dict): search results, including users, places and hashtags.
Example:
>>> search_with_keyword(driver, "shanghai", pers=True)
{
  "hashtags": [
    {
      "position": 1,
      "hashtag": {
        "id": "17841563224118980",
        "name": "shanghai",
        "post_count": 11302316,
        "profile_pic_url": ""
      }
    }
  ],
  "users": [
    {
      "position": 0,
      "user": {
        "id": "7594441262",
        "username": "shanghai.explore",
        "fullname": "Shanghai 🇨🇳 Travel | Hotels | Food | Tips",
        "profile_pic_url": "https://scontent.cdninstagram.com/v/t51.2885-19/409741157_243678455262812_2168807265478461941_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent.cdninstagram.com&_nc_cat=108&_nc_ohc=S3SAe59tdbUAX9SLkyd&edm=APs17CUBAAAA&ccb=7-5&oh=00_AfALvv52ytTyye_PDEjKCmWAUetHX8BXCGsS7rnFThzNTQ&oe=65ECAABE&_nc_sid=10d13b",
        "is_private": null,
        "is_verified": true
      }
    }
  ],
  "places": [
    {
      "position": 2,
      "place": {
        "location": {
          "id": "106324046073002",
          "name": "Shanghai, China"
        },
        "subtitle": "",
        "title": "Shanghai, China"
      }
    }
  ],
  "personalised": true
}
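The search result groups entries by type, while each entry carries its own "position". The sketch below merges the three groups back into a single ranked list. ranked_results is a hypothetical helper, the sample keeps only the fields it needs from the example output, and it assumes that "position" is the rank of the entry in the overall result list.

```python
def ranked_results(search_results):
    """Flatten users, hashtags, and places into one list ordered by position."""
    # For each kind: the list key in the result and the field used as a label.
    kinds = {
        "user": ("users", "username"),
        "hashtag": ("hashtags", "name"),
        "place": ("places", "title"),
    }
    rows = []
    for kind, (list_key, label_field) in kinds.items():
        for entry in search_results.get(list_key, []):
            rows.append((entry["position"], kind, entry[kind][label_field]))
    return sorted(rows)


# Reduced version of the example output above.
sample_search = {
    "hashtags": [{"position": 1, "hashtag": {"name": "shanghai"}}],
    "users": [{"position": 0, "user": {"username": "shanghai.explore"}}],
    "places": [{"position": 2, "place": {"title": "Shanghai, China"}}],
    "personalised": True,
}

for row in ranked_results(sample_search):
    print(row)
# (0, 'user', 'shanghai.explore')
# (1, 'hashtag', 'shanghai')
# (2, 'place', 'Shanghai, China')
```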
crawlinsta.collecting.collect_top_posts_of_hashtag
Collect top posts of a given hashtag.
- Input:
driver: browser driver instance
hashtag (str): hashtag
- Output:
top_posts (list): list of top posts, each post is a dictionary containing post information, including post code, post url, post type, post caption, post location, post time, number of likes, number of comments, and media url.
Example:
>>> collect_top_posts_of_hashtag(driver, "shanghai")
{
  "top_posts": [
    {
      "like_count": 817982,
      "comment_count": 3000,
      "id": "3215769692664507668",
      "code": "CygtX9ivC0U",
      "user": {
        "id": "50269116275",
        "username": "dummy_instagram_username",
        "fullname": "",
        "profile_pic_url": "https://scontent.cdninstagram.com/v",
        "is_private": false,
        "is_verified": false
      },
      "taken_at": 1697569769,
      "media_type": "Reel",
      "caption": {
        "id": "17985380039262083",
        "text": "I know what she’s gonna say before she even has the chance 😂#shanghai",
        "created_at_utc": null
      },
      "accessibility_caption": "",
      "original_width": 1080,
      "original_height": 1920,
      "urls": [
        "https://scontent.cdninstagram.com/o1"
      ],
      "has_shared_to_fb": false,
      "usertags": [],
      "location": null,
      "music": {
        "id": "2614441095386924",
        "is_trending_in_clips": false,
        "artist": {
          "id": "50269116275",
          "username": "dummy_instagram_username",
          "fullname": "",
          "profile_pic_url": "",
          "is_private": null,
          "is_verified": null
        },
        "title": "Original audio",
        "duration_in_ms": null,
        "url": null
      }
    },
    ...
  ],
  "count": 100
}
crawlinsta.collecting.collect_posts_by_music_id
Collects up to n posts that use the music with the given music_id.
- Input:
driver: browser driver instance.
music_id (str): id of the music.
n (int): maximum number of posts to collect. Defaults to 100; if set to 0, all posts are collected.
- Output:
posts (list): list of posts, each post is a dictionary containing post information, including post code, post url, post type, post caption, post location, post time, number of likes, number of comments, and media url.
Example:
>>> collect_posts_by_music_id(driver, "2614441095386924", 100)
{
  "posts": [
    {
      "like_count": 817982,
      "comment_count": 3000,
      "id": "3215769692664507668",
      "code": "CygtX9ivC0U",
      "user": {
        "id": "50269116275",
        "username": "dummy_instagram_username",
        "fullname": "",
        "profile_pic_url": "https://scontent.cdninstagram.com/v",
        "is_private": false,
        "is_verified": false
      },
      "taken_at": 1697569769,
      "media_type": "Reel",
      "caption": {
        "id": "17985380039262083",
        "text": "I know what she’s gonna say before she even has the chance 😂",
        "created_at_utc": null
      },
      "accessibility_caption": "",
      "original_width": 1080,
      "original_height": 1920,
      "urls": [
        "https://scontent.cdninstagram.com/o1"
      ],
      "has_shared_to_fb": false,
      "usertags": [],
      "location": null,
      "music": {
        "id": "2614441095386924",
        "is_trending_in_clips": false,
        "artist": {
          "id": "50269116275",
          "username": "dummy_instagram_username",
          "fullname": "",
          "profile_pic_url": "",
          "is_private": null,
          "is_verified": null
        },
        "title": "Original audio",
        "duration_in_ms": null,
        "url": null
      }
    },
    ...
  ],
  "count": 100
}
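The music_id argument can be taken from the "music" field of posts collected earlier. A minimal sketch of extracting the distinct ids follows. music_ids is a hypothetical helper, and it assumes that posts without music carry a null (None) "music" value, which the examples above do not show.

```python
def music_ids(posts_result):
    """Return the distinct music ids of the posts, in first-seen order."""
    seen = {}
    for post in posts_result["posts"]:
        # Assumption: posts without music have "music" set to None or absent.
        music = post.get("music")
        if music and music.get("id"):
            seen.setdefault(music["id"], music.get("title"))
    return list(seen)


# Sample data in the shape of the example output; the second id is invented.
sample_posts = {
    "posts": [
        {"code": "CygtX9ivC0U",
         "music": {"id": "2614441095386924", "title": "Original audio"}},
        {"code": "AAAAAAAAAAA", "music": None},
        {"code": "BBBBBBBBBBB",
         "music": {"id": "2614441095386924", "title": "Original audio"}},
        {"code": "CCCCCCCCCCC",
         "music": {"id": "1111111111111111", "title": "Another track"}},
    ],
    "count": 4,
}

print(music_ids(sample_posts))
# ['2614441095386924', '1111111111111111']
```

Each returned id could then be passed as the second argument of collect_posts_by_music_id.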
crawlinsta.collecting.download_media
Downloads the image or video at the given media_url and stores it at the given path.
- Input:
driver: browser driver instance
media_url (str): url of the media for downloading.
file_name (str): path for storing the downloaded media.
Example:
>>> download_media(driver, "dummy_media_url", "dummy")
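The media URLs to download usually come from the "urls" list of collected posts. The sketch below pairs each URL with a file name derived from the post code. download_plan is a hypothetical helper, and the naming scheme is only an illustration: whether file_name needs a file extension is not stated here, so none is added.

```python
def download_plan(posts_result):
    """Return (media_url, file_name) pairs for every media URL of every post."""
    pairs = []
    for post in posts_result["posts"]:
        for index, media_url in enumerate(post["urls"]):
            # Hypothetical naming scheme: <post code>_<index of the media>.
            pairs.append((media_url, f"{post['code']}_{index}"))
    return pairs


# Fields copied from the example post output above.
sample_result = {
    "posts": [
        {"code": "CygtX9ivC0U",
         "urls": ["https://scontent.cdninstagram.com/o1"]},
    ],
    "count": 1,
}

for media_url, file_name in download_plan(sample_result):
    print(media_url, file_name)
    # With a live driver: download_media(driver, media_url, file_name)
```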
Work with Docker Compose
A Dockerfile and a docker-compose.yml file are provided so that you can use the package from a Jupyter server running in a Docker container. First, build the Docker image:
docker-compose build
Then you can run the following command to start the container:
docker-compose up
Once the container is running, you can access the Jupyter server via the link printed in the terminal. The current directory is mounted into the container at /home/work, so you can put your notebooks in the current directory and run them from the Jupyter server in the container. The package is already installed in the container.
To stop the container, you can run the following command:
docker-compose down
Maintainers
Zhiwei Zhang - Maintainer - zhiwei2017@gmail.com