
LLM-friendly scraper for media and text from social media and the open web.


scrapeMM: Multimodal Web Retrieval

A simple web scraper that asynchronously retrieves webpages and social media content, fetching text along with media (images and videos).

This library helps developers and researchers access multimodal data from the web and use it for LLM processing.

Usage

from scrapemm import retrieve

url = "https://example.com"
result = retrieve(url)
result.render()
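
Since retrieval is described as asynchronous, several URLs could in principle be fetched concurrently. The following is only a sketch under that assumption: `fake_retrieve` and `retrieve_all` are hypothetical stand-ins (not part of scrapeMM) so the snippet runs without network access. Check whether `retrieve` is awaitable in your installed version before passing it in.

```python
import asyncio


# Hypothetical stand-in for scrapemm.retrieve so the sketch runs offline.
# ASSUMPTION: the real retrieve() can be awaited; verify this for your version.
async def fake_retrieve(url: str) -> str:
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"content of {url}"


async def retrieve_all(urls, retrieve_fn=fake_retrieve):
    """Run retrieve_fn on every URL concurrently; results keep input order."""
    return await asyncio.gather(*(retrieve_fn(u) for u in urls))


urls = ["https://example.com/a", "https://example.com/b"]
results = asyncio.run(retrieve_all(urls))
```

`asyncio.gather` returns results in the same order as the input URLs, which keeps it easy to pair each result with its source.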

How it works

Input:                                  Output:
URL (string)   -->   retrieve()   -->   MultimodalSequence

The MultimodalSequence is a sequence of Markdown-formatted text and media provided by the ezMM library.
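
To make the idea of "a sequence of Markdown-formatted text and media" concrete, here is a minimal illustrative model. It is not ezMM's actual API: `MediaRef` and `to_markdown` are hypothetical names invented for this sketch, and the real `MultimodalSequence` may be structured differently.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class MediaRef:
    """Hypothetical placeholder for a media item (not an ezMM class)."""
    kind: str    # e.g. "image" or "video"
    source: str  # URL or path the media was fetched from


Item = Union[str, MediaRef]


def to_markdown(items: List[Item]) -> str:
    """Interleave text chunks and media references into one Markdown string."""
    parts = []
    for item in items:
        if isinstance(item, str):
            parts.append(item)  # text is already Markdown
        else:
            parts.append(f"![{item.kind}]({item.source})")
    return "\n\n".join(parts)


sequence = [
    "# Example post",
    MediaRef("image", "https://example.com/cat.jpg"),
    "A caption.",
]
markdown = to_markdown(sequence)
```

Keeping text and media in one ordered list preserves where each image or video appeared relative to the surrounding text, which is the property that matters when the result is handed to an LLM.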

Web scraping is done with Firecrawl.

Supported Proprietary APIs

  • ✅ X/Twitter
  • ✅ Telegram
  • ⏳ Facebook
  • ⏳ Instagram
  • ⏳ Threads
  • ⏳ TikTok
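
One way a scraper could decide between a platform API and general web scraping is to dispatch on the URL's hostname. The sketch below is an assumption about how such routing might look, not scrapeMM's implementation: `PLATFORM_DOMAINS` and `pick_backend` are hypothetical names, and the domain list covers only the two platforms marked as supported above.

```python
from urllib.parse import urlparse

# Hypothetical mapping from hostname to backend name (illustration only).
PLATFORM_DOMAINS = {
    "x.com": "x",
    "twitter.com": "x",
    "t.me": "telegram",
}


def pick_backend(url: str) -> str:
    """Return the backend name for a URL, falling back to generic scraping."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # treat www.twitter.com like twitter.com
    return PLATFORM_DOMAINS.get(host, "generic")


backend = pick_backend("https://t.me/some_channel/5")
```

Any URL whose host is not in the mapping falls through to `"generic"`, which here stands for the general-purpose scraping path.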


