Flexible multimodal scraper for social media and the open web.

scrapeMM: Multimodal Web Retrieval

A simple web scraper that asynchronously retrieves webpages and social media content, fetching text along with media (images and videos).

This library helps developers and researchers access multimodal data from the web and use it for LLM processing.

Usage

import asyncio

from scrapemm import retrieve

url = "https://example.com"
# retrieve() is awaited here via asyncio.run(), which creates and closes the event loop
result = asyncio.run(retrieve(url))
result.render()

scrapeMM will ask you for the API keys needed for the social media integrations. You may skip any you don't need. You will also be prompted to choose a password, which secures the secrets in an encrypted file.
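Because retrieval is asynchronous, several URLs can be fetched concurrently. The following is a minimal sketch, not part of scrapeMM itself: `retrieve_all`, `fetch`, and `max_concurrency` are hypothetical names, and the helper accepts any async callable that takes a URL so that it runs without the library installed.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable


async def retrieve_all(
    urls: Iterable[str],
    fetch: Callable[[str], Awaitable[Any]],
    max_concurrency: int = 5,
) -> dict[str, Any]:
    """Fetch many URLs concurrently, at most `max_concurrency` at a time.

    `fetch` is any async callable taking a single URL (assumption: scrapeMM's
    `retrieve` can be passed here). Failures are stored as exception objects
    in the result instead of being raised.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url: str) -> Any:
        # The semaphore caps how many fetches are in flight at once.
        async with semaphore:
            return await fetch(url)

    url_list = list(urls)
    results = await asyncio.gather(
        *(fetch_one(u) for u in url_list), return_exceptions=True
    )
    # gather() preserves input order, so zip pairs each URL with its result.
    return dict(zip(url_list, results))
```

With scrapeMM installed, this could be called as `asyncio.run(retrieve_all(urls, retrieve))`, assuming `retrieve` works with a single URL argument as in the usage example above.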

How it works

Input:                                  Output:
URL (string)   -->   retrieve()   -->   MultimodalSequence

The MultimodalSequence is a sequence of Markdown-formatted text and media provided by the ezMM library.
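To illustrate what "a sequence of text and media" means in practice, here is a simplified, hypothetical model. It is not the ezMM API: `MediaRef`, `SequenceSketch`, and their methods are invented for this sketch only and show the general idea of interleaving Markdown text with media items.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class MediaRef:
    """Hypothetical stand-in for an image or video attached to a page."""
    kind: str    # "image" or "video"
    source: str  # URL or local path of the media file


@dataclass
class SequenceSketch:
    """Hypothetical, simplified model of interleaved text and media."""
    items: list[Union[str, MediaRef]] = field(default_factory=list)

    def text_only(self) -> str:
        # Join the Markdown text chunks, skipping media items.
        return "\n".join(i for i in self.items if isinstance(i, str))

    def media(self, kind: str) -> list[MediaRef]:
        # Collect media items of one kind, in document order.
        return [i for i in self.items if isinstance(i, MediaRef) and i.kind == kind]


page = SequenceSketch([
    "# Example headline",
    MediaRef("image", "https://example.com/photo.jpg"),
    "Body text following the image.",
])
```

Keeping media in their original position relative to the text preserves which paragraph an image or video belongs to, which matters when the result is passed to a multimodal LLM.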

Web scraping is done with Firecrawl and Decodo.

Supported Platforms

  • ✅ X/Twitter
  • ✅ Telegram
  • ✅ Bluesky
  • ✅ TikTok
  • ✅ YouTube
  • (✅️) Instagram (works for most content)
  • ⚠️ Facebook (done for videos but not for images yet)
  • ⏳ Threads (TBD)
  • ⏳ Reddit (TBD)
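Before calling the scraper, it can be useful to check whether a URL belongs to one of the platforms listed above. The sketch below is an assumption-laden illustration: the hostname table and `detect_platform` are hypothetical, and scrapeMM's own routing may work differently.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical hostname table; scrapeMM's real routing may differ.
PLATFORM_HOSTS = {
    "x.com": "X/Twitter",
    "twitter.com": "X/Twitter",
    "t.me": "Telegram",
    "bsky.app": "Bluesky",
    "tiktok.com": "TikTok",
    "youtube.com": "YouTube",
    "youtu.be": "YouTube",
    "instagram.com": "Instagram",
    "facebook.com": "Facebook",
}


def detect_platform(url: str) -> Optional[str]:
    """Return the platform name for a URL, or None for the open web."""
    host = (urlparse(url).hostname or "").lower()
    for domain, platform in PLATFORM_HOSTS.items():
        # Match the bare domain or any subdomain (e.g. www.youtube.com),
        # but not lookalikes such as netflix.com for x.com.
        if host == domain or host.endswith("." + domain):
            return platform
    return None
```

For example, `detect_platform("https://www.youtube.com/watch?v=abc")` returns `"YouTube"`, while an ordinary webpage such as `https://example.com` yields `None`.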

Download files

Source Distribution

scrapemm-0.4.0.tar.gz (33.5 kB view details)

Uploaded Source

Built Distribution

scrapemm-0.4.0-py3-none-any.whl (39.9 kB view details)

Uploaded Python 3

File details

Details for the file scrapemm-0.4.0.tar.gz.

File metadata

  • Download URL: scrapemm-0.4.0.tar.gz
  • Upload date:
  • Size: 33.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.11

File hashes

Hashes for scrapemm-0.4.0.tar.gz
Algorithm Hash digest
SHA256 86342816d5afc4791bbb55348adc1e80850a96fb731104de31ee618fba9a5728
MD5 64d69290e46aedd406c6cbdf98b21216
BLAKE2b-256 011a3609ffa5eb840d86b63f3b50baf40e75f7d03b98518ec71ec122aa28418c

File details

Details for the file scrapemm-0.4.0-py3-none-any.whl.

File metadata

  • Download URL: scrapemm-0.4.0-py3-none-any.whl
  • Upload date:
  • Size: 39.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.11

File hashes

Hashes for scrapemm-0.4.0-py3-none-any.whl
Algorithm Hash digest
SHA256 51f6eed18a64f8c956033a68d414ad3b89e6d3f63ed3f686ccd5ed99bb74bf28
MD5 e6fb860aae6a80a78f399305551523a7
BLAKE2b-256 136e9f444c41ee00675061be7d284b7a3c228381171434037f8b5c40ed42cbdd
