InstaHarvest

Professional Instagram Data Collection Toolkit: a powerful library for Instagram automation, data collection, and analytics, built on Playwright.

Documentation | Report Bug | Request Feature | Changelog
Features

| Category | Capabilities |
|---|---|
| Profile | Stats, verified badge, category, full bio, external links, Threads |
| Web API | 16+ JSON endpoints: profiles, followers, feed, comments, reels, stories, hashtags |
| Content | Posts, Reels, Stories, Highlights, Tagged Posts, with a JSON-first architecture |
| Engagement | Comments (with replies), likes, media download (images/videos via yt-dlp) |
| Social | Followers/Following lists, Follow/Unfollow, Direct Messaging |
| Discovery | Search, Hashtag feeds, Location feeds, Explore, Notifications |
| Performance | Parallel processing, SharedBrowser (one browser for everything), Excel export |
| Reliability | Rate limiting, graceful shutdown (Ctrl+C), auto-save, retry logic |
Installation & Setup
# Install from PyPI
pip install instaharvest
playwright install chrome
# OR install from GitHub (latest dev version)
git clone https://github.com/mpython77/insta-harvester.git
cd insta-harvester
pip install -r requirements.txt
playwright install chrome
Create Instagram session (required, one-time):
from instaharvest import save_session
save_session()
# Browser opens → log in manually → press ENTER → session saved
Warning: without `instagram_session.json`, the library will not work.
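Since every scraper depends on that file, a pre-flight check avoids a confusing failure mid-run. The helper below is a minimal sketch (the helper name is our own; the filename is the one this README says `save_session()` writes):

```python
from pathlib import Path

# Filename written by save_session(), per this README.
SESSION_FILE = "instagram_session.json"

def has_session(path: str = SESSION_FILE) -> bool:
    """Return True when a saved Instagram session file already exists."""
    return Path(path).is_file()

if not has_session():
    print(f"{SESSION_FILE} missing: run save_session() once before scraping")
```

Run this at startup and only open the interactive login flow when it reports the file missing.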
Quick Start: SharedBrowser (Recommended)

One browser for all operations: the fastest and most efficient way to use InstaHarvest.
from instaharvest import SharedBrowser
from instaharvest.config import ScraperConfig
config = ScraperConfig()
with SharedBrowser(config=config) as browser:
    # -- Profile --
    profile = browser.scrape_profile("username")
    print(f"{profile.full_name}: {profile.followers} followers")

    # -- Social actions --
    browser.follow("user1")
    browser.send_message("user1", "Hello!")
    followers = browser.get_followers("user2", limit=100)

    # -- Content scraping --
    post = browser.scrape_post("https://www.instagram.com/p/ABC/")
    reel = browser.scrape_reel("https://www.instagram.com/reel/XYZ/")
    stories = browser.scrape_stories("username")
    comments = browser.scrape_comments("https://www.instagram.com/p/ABC/")

    # -- Discovery --
    results = browser.search("fashion brands")
    hashtag = browser.scrape_hashtag("streetwear")
    notifs = browser.read_notifications()

    # -- Batch operations --
    posts = browser.scrape_posts(["url1", "url2", "url3"])
    files = browser.download_post("https://www.instagram.com/p/ABC/")

    # -- Web API (direct JSON, exact data) --
    profile_json = browser.get_profile_json("username")
    print(f"Exact followers: {profile_json.follower_count:,}")
    feed = browser.get_user_feed_api(profile_json.user_id, count=5)
    reels = browser.get_reels_api(profile_json.user_id)
    highlights = browser.get_highlights_api(profile_json.user_id)
API Reference
1. Profile Scraping
from instaharvest import ProfileScraper
from instaharvest.config import ScraperConfig
config = ScraperConfig()
scraper = ProfileScraper(config=config)
session_data = scraper.load_session()
scraper.setup_browser(session_data)
profile = scraper.scrape('username')
print(f"Posts: {profile.posts}, Followers: {profile.followers}")
print(f"Verified: {profile.is_verified}, Category: {profile.category}")
print(f"Bio: {profile.bio}, Links: {profile.external_links}")
scraper.close()
2. Followers / Following
from instaharvest import FollowersCollector
from instaharvest.config import ScraperConfig
config = ScraperConfig()
collector = FollowersCollector(config=config)
session_data = collector.load_session()
collector.setup_browser(session_data)
followers = collector.get_followers('username', limit=100, print_realtime=True)
following = collector.get_following('username', limit=50)
collector.close()
3. Follow / Unfollow & Direct Messaging
from instaharvest import FollowManager, MessageManager
from instaharvest.config import ScraperConfig
config = ScraperConfig()
# Follow
manager = FollowManager(config=config)
session_data = manager.load_session()
manager.setup_browser(session_data)
manager.follow('username')
manager.batch_follow(['user1', 'user2', 'user3'])
manager.close()
# DM
messenger = MessageManager(config=config)
session_data = messenger.load_session()
messenger.setup_browser(session_data)
messenger.send_message('username', 'Hello!')
messenger.batch_send(['user1', 'user2'], 'Hi there!')
messenger.close()
4. Post & Reel Data (JSON-First)
from instaharvest import PostDataScraper
from instaharvest.config import ScraperConfig
config = ScraperConfig()
scraper = PostDataScraper(config=config)
session_data = scraper.load_session()
scraper.setup_browser(session_data)
post = scraper.scrape('https://www.instagram.com/p/DVs7LK-iO0C/')
# 30+ fields extracted automatically from JSON
print(post.like_count, post.comment_count) # Engagement
print(post.caption, post.tagged_accounts) # Content
print(post.location.name if post.location else 'N/A') # Location
print(post.owner.username if post.owner else 'N/A') # Owner
for slide in post.carousel_slides:                    # Carousel
    print(f"  Slide {slide.slide_index}: {slide.media_type}")
scraper.close()
5. Comment Scraping
from instaharvest import CommentScraper
from instaharvest.exporters import export_comments_to_json, export_comments_to_excel
from instaharvest.config import ScraperConfig
config = ScraperConfig()
scraper = CommentScraper(config=config)
session_data = scraper.load_session()
scraper.setup_browser(session_data)
result = scraper.scrape(
    'https://www.instagram.com/p/POST_ID/',
    max_comments=100,
    include_replies=True
)
for comment in result.comments:
    print(f"@{comment.author.username}: {comment.text}")
    for reply in comment.replies:
        print(f"  ↳ @{reply.author.username}: {reply.text}")
# Export
export_comments_to_json(result, 'comments.json')
export_comments_to_excel(result, 'comments.xlsx')
scraper.close()
6. Stories & Highlights
from instaharvest import StoryScraper, HighlightsScraper
from instaharvest.config import ScraperConfig
config = ScraperConfig()
# Stories: JSON-first, per-slide tag mapping
scraper = StoryScraper(config=config)
session_data = scraper.load_session()
scraper.setup_browser(session_data)
result = scraper.scrape('username', extract_tags=True)
print(f"Stories: {result.story_count}, Tags: {result.all_tagged_accounts}")
for slide in result.slides:
    print(f"  Slide {slide.slide_index}: [{slide.media_type}] {slide.timestamp} → {slide.tagged_accounts}")
scraper.close()
# Highlights: mentions, links, music, locations
hl_scraper = HighlightsScraper(config=config)
session = hl_scraper.load_session()
hl_scraper.setup_browser(session)
full = hl_scraper.scrape_all('mondayswimwear', max_slides_per=100)
print(f"{full.total_highlights} highlights, {full.total_slides} total slides")
hl_scraper.close()
7. Parallel Processing & Orchestrator
from instaharvest import SharedBrowser, InstagramOrchestrator
from instaharvest.config import ScraperConfig
config = ScraperConfig(headless=True)
with SharedBrowser(config=config) as browser:
    orch = InstagramOrchestrator(config, shared_browser=browser)
    results = orch.scrape_complete_profile_advanced(
        'username',
        parallel=3,
        save_excel=True,
        scrape_comments=True,
        scrape_stories=True
    )
    print(f"Scraped {len(results['posts_data'])} posts")
8. Tagged Posts
from instaharvest import TaggedPostsScraper
from instaharvest.config import ScraperConfig
config = ScraperConfig()
scraper = TaggedPostsScraper(config=config)
session = scraper.load_session()
scraper.setup_browser(session)
result = scraper.scrape('mondayswimwear', max_posts=100)
print(f"Total: {result.total_found} tagged posts, Unique taggers: {result.unique_taggers}")
for post in result.tagged_posts:
    print(f"  @{post.owner} → {post.url} ({post.media_type})")
scraper.close()
9. Notifications
from instaharvest import SharedBrowser
from instaharvest.config import ScraperConfig
config = ScraperConfig()
with SharedBrowser(config=config) as browser:
    notifs = browser.read_notifications()
    print(f"Total: {len(notifs)} notifications")
Notification types: follow, post_like, comment_like, comment, mention, follow_request, follow_accepted, thread, story, system
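When a script only cares about a few of these types, a small filter keeps the processing loop readable. The sketch below works on plain dicts with a "type" key; the actual objects returned by read_notifications() may expose the type differently, so treat that key name (and the sample records) as assumptions:

```python
from collections import Counter

# Hypothetical notification records; real ones come from browser.read_notifications().
notifs = [
    {"type": "follow", "user": "alice"},
    {"type": "comment", "user": "bob", "text": "Nice shot!"},
    {"type": "post_like", "user": "carol"},
    {"type": "follow", "user": "dave"},
]

def by_type(items, wanted):
    """Keep only notifications whose type is in the `wanted` set."""
    return [n for n in items if n.get("type") in wanted]

new_followers = by_type(notifs, {"follow"})
print(Counter(n["type"] for n in notifs))        # tally per notification type
print([n["user"] for n in new_followers])        # ['alice', 'dave']
```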
10. Media Download
from instaharvest import SharedBrowser
from instaharvest.config import ScraperConfig
config = ScraperConfig()
with SharedBrowser(config=config) as browser:
    # Handles images, videos, reels, and carousels automatically
    files = browser.download_post("https://www.instagram.com/reel/C-example...")
    print(f"Downloaded {len(files)} files")
Note: video support requires Google Chrome (browser_channel='chrome', the default); Chromium lacks the video codecs.
Web API: Direct JSON Data Extraction

Access Instagram's internal API endpoints directly through Playwright. Returns exact, structured data, with no DOM scraping.
16+ endpoints | 15 data models | Auto-pagination | Rate limiting | POST + GET support
from instaharvest import SharedBrowser
from instaharvest.config import ScraperConfig
config = ScraperConfig(headless=True)
with SharedBrowser(config=config) as browser:
    # -- Profile (exact stats) --
    profile = browser.get_profile_json('mondayswimwear')
    print(f"{profile.full_name}: {profile.follower_count:,} followers")
    user_id = profile.user_id

    # -- Followers / Following --
    followers = browser.get_followers_api(user_id, count=50)
    following = browser.get_following_api(user_id, count=50)

    # -- Feed, comments, likers --
    feed = browser.get_user_feed_api(user_id, count=12)
    comments = browser.get_media_comments_api(feed.posts[0].media_id)
    likers = browser.get_media_likers_api(feed.posts[0].media_id)

    # -- Stories, highlights, reels --
    stories = browser.get_stories_api(user_id)
    highlights = browser.get_highlights_api(user_id)
    reels = browser.get_reels_api(user_id)

    # -- Hashtag & location --
    hashtag = browser.get_hashtag_feed_api('swimwear')
    location = browser.get_location_feed_api('213385402')

    # -- Raw API (any endpoint, GET or POST) --
    raw = browser.fetch_raw_api('/api/v1/users/1059031072/info/')
Available endpoints:

| Method | Description | Returns |
|---|---|---|
| get_profile_json(username) | Profile with exact stats | WebProfileData |
| get_user_info(user_id) | Profile by ID | WebProfileData |
| get_followers_api(id, count) | Followers list (paginated) | FollowListResult |
| get_following_api(id, count) | Following list (paginated) | FollowListResult |
| get_friendship_status(id) | Follow relationship | FriendshipStatus |
| get_user_feed_api(id, count) | User's posts | UserFeedResult |
| get_media_info_api(media_id) | Detailed post info | MediaInfo |
| get_media_comments_api(id) | Post comments | CommentsResult |
| get_media_likers_api(id) | Post likers | LikersResult |
| get_stories_api(id) | Active stories | List[StoryMediaItem] |
| get_highlights_api(id) | Highlights list | HighlightsResult |
| get_reels_api(id) | Reels with play counts | ReelsResult |
| get_hashtag_feed_api(tag) | Hashtag posts | HashtagSection |
| get_location_feed_api(id) | Location posts | LocationSection |
| get_tagged_posts_api(id) | Tagged posts | UserFeedResult |
| fetch_raw_api(endpoint) | Any endpoint (GET/POST) | Dict |
Direct API access (without SharedBrowser):
from instaharvest import InstagramWebAPI
from playwright.sync_api import sync_playwright
with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=True)
    context = browser.new_context(storage_state='instagram_session.json')
    page = context.new_page()
    page.goto('https://www.instagram.com/')
    api = InstagramWebAPI(page=page)
    profile = api.get_profile('mondayswimwear')
    print(f"{profile.follower_count:,} followers")
    browser.close()
Complete Workflow Example
from instaharvest import SharedBrowser, InstagramOrchestrator
from instaharvest.config import ScraperConfig
config = ScraperConfig()
with SharedBrowser(config=config) as browser:
    # 1. Profile analysis
    profile = browser.scrape_profile('target_user')
    print(f"{profile.full_name}: {profile.followers} followers")

    # 2. Collect & follow
    followers = browser.get_followers('target_user', limit=50)
    for f in followers[:10]:
        browser.follow(f)

    # 3. Scrape posts
    post_links = browser.scrape_post_links('target_user')
    posts = browser.scrape_posts([l['url'] for l in post_links[:5]])

    # 4. Stories + Web API
    stories = browser.scrape_stories('target_user')
    profile_json = browser.get_profile_json('target_user')
    reels = browser.get_reels_api(profile_json.user_id)

    # 5. Full orchestrated scrape
    orch = InstagramOrchestrator(config, shared_browser=browser)
    results = orch.scrape_complete_profile_advanced(
        'target_user', parallel=3,
        save_excel=True, scrape_stories=True
    )
    print(f"{len(results['posts_data'])} posts scraped")
Project Structure
insta-harvester/
├── instaharvest/              # Main package
│   ├── __init__.py            # Package entry point
│   ├── base.py                # Base scraper class
│   ├── config.py              # Configuration
│   ├── profile.py             # Profile scraping
│   ├── followers.py           # Followers collection
│   ├── follow.py              # Follow/unfollow
│   ├── message.py             # Direct messaging
│   ├── post_data.py           # Post data (JSON-first)
│   ├── reel_data.py           # Reel data extraction
│   ├── comment_scraper.py     # Comments with replies
│   ├── story_scraper.py       # Story scraping
│   ├── highlight_scraper.py   # Highlights extraction
│   ├── tagged_posts.py        # Tagged posts
│   ├── notifications.py       # Notification reader
│   ├── web_api.py             # Web API (16+ endpoints)
│   ├── shared_browser.py      # SharedBrowser
│   ├── orchestrator.py        # Workflow orchestrator
│   ├── parallel_scraper.py    # Parallel processing
│   ├── downloader.py          # Media download
│   └── ...                    # More modules
├── examples/
│   ├── save_session.py        # Session setup
│   ├── all_in_one.py          # Interactive demo
│   ├── main_advanced.py       # Production scraping
│   ├── example_web_api.py     # Web API demo
│   └── example_custom_config.py
├── tests/                     # 130+ unit tests
└── LICENSE                    # MIT License
Configuration
from instaharvest import ScraperConfig
config = ScraperConfig(
    headless=True,           # Run without browser UI
    viewport_width=1920,
    viewport_height=1080,
    default_timeout=30000,   # 30 seconds
    max_scroll_attempts=50,
    log_level='INFO',
    # Rate limiting
    follow_delay_min=10.0,
    follow_delay_max=15.0,
    message_delay_min=15.0,
    message_delay_max=20.0,
)
See Configuration Guide for all options.
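The min/max delay pairs above exist so that pauses between actions are jittered rather than machine-regular. As an illustrative sketch (this helper is not part of instaharvest; it just shows the intent behind those config fields), a randomized pause can be drawn like this:

```python
import random
import time

def human_delay(lo: float, hi: float, sleep=time.sleep) -> float:
    """Sleep for a random duration in [lo, hi] seconds and return it.

    Mirrors how min/max pairs such as follow_delay_min/follow_delay_max
    are meant to be used: a jittered pause so traffic looks less robotic.
    """
    d = random.uniform(lo, hi)
    sleep(d)
    return d

# Example: a pause between follow actions using the defaults shown above.
# (A no-op sleep is injected so the demo returns instantly.)
pause = human_delay(10.0, 15.0, sleep=lambda _: None)
print(f"paused {pause:.1f}s")
```

Injecting the `sleep` callable also makes the helper trivial to unit-test.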
Troubleshooting
| Problem | Solution |
|---|---|
| playwright command not found | pip install playwright && playwright install chrome |
| No module named 'instaharvest' | pip install instaharvest or pip install -e . |
| Session file not found | Run save_session() first |
| Login required / session expired | Re-run save_session() |
| Instagram says "Try again later" | Increase the rate-limiting delays in the config |
| Could not follow/unfollow | Increase popup_open_delay and action_delay_* |
| Slow-internet errors | Increase page_load_delay and scroll_delay_* |
| Posts: 0 but content exists | Update to the latest version (v2.7.1+) |
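For transient "Try again later" style throttling, wrapping a call in a retry with exponential backoff often helps alongside larger delays. This wrapper is a generic pattern sketch, not an instaharvest API:

```python
import time

def retry(fn, attempts: int = 3, base_delay: float = 5.0, sleep=time.sleep):
    """Call `fn`; on failure wait base_delay * 2**i seconds, then try again.

    Generic helper for transient, throttling-style failures. The last
    failure is re-raised so callers still see a real error.
    """
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            sleep(base_delay * (2 ** i))   # 5s, 10s, 20s, ...

# Demo with a function that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("Try again later")
    return "ok"

print(retry(flaky, attempts=4, sleep=lambda _: None))   # "ok" on the third call
```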
Disclaimer
This tool is for educational purposes only. Follow Instagram's Terms of Service, respect rate limits, and use responsibly.
License

MIT License; see LICENSE for details.
Contributing
Contributions welcome! Submit a Pull Request.
Made with ❤️ by Muydinov Doston

Happy Harvesting!
File details: instaharvest-2.16.0.tar.gz (source distribution)

- Size: 240.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0, CPython/3.13.12

| Algorithm | Hash digest |
|---|---|
| SHA256 | e9da0051a69fa9494d800b95804e0cde2de6b7c2a75930ffe6091c6b4a20173d |
| MD5 | 8cd679f066db6878ecf6d11668777340 |
| BLAKE2b-256 | 48d8e5705b3501e9728c8b72c49b4b9a35e15ead63f5e03bc5146dfad978e227 |
File details: instaharvest-2.16.0-py3-none-any.whl (built distribution)

- Size: 242.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0, CPython/3.13.12

| Algorithm | Hash digest |
|---|---|
| SHA256 | 11c9ab51749d9d4f5c191c6459336a1b88d09cda933e9f97cea8eb4b23a5ecf8 |
| MD5 | 719480feac1df39d96d2f1e1b764441b |
| BLAKE2b-256 | e2b9d420ff9b291e6a8d91d043a9f1acead414e318ff3e2a1854474bb421e0ed |