LLM-friendly scraper for media and text from social media and the open web.

scrapeMM: Multimodal Web Retrieval

A simple web scraper that asynchronously retrieves webpages and social media content, fetching text along with media (images and videos).

This library aims to help developers and researchers easily access multimodal data from the web and use it for LLM processing.

Usage

from scrapemm import retrieve

url = "https://example.com"
result = retrieve(url)
result.render()
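
Since the library describes its retrieval as asynchronous, a common pattern is to fetch several URLs concurrently. The sketch below only illustrates that pattern with the standard library: `fake_retrieve` and `retrieve_all` are hypothetical stand-ins, not part of scrapeMM, and whether `retrieve` itself must be awaited in your installed version is something to check against the package.

```python
import asyncio


async def fake_retrieve(url: str) -> str:
    """Hypothetical stand-in for an asynchronous retrieval call (not scrapeMM's API)."""
    await asyncio.sleep(0)  # yield control, as a real network request would
    return f"content of {url}"


async def retrieve_all(urls: list[str]) -> list[str]:
    # gather() runs the coroutines concurrently and returns results in input order
    return await asyncio.gather(*(fake_retrieve(u) for u in urls))


results = asyncio.run(
    retrieve_all(["https://example.com/a", "https://example.com/b"])
)
```

Swapping the stand-in for a real coroutine would keep the same structure; only the body of the per-URL call changes.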

How it works

Input:                                  Output:
URL (string)   -->   retrieve()   -->   MultimodalSequence

The MultimodalSequence is a sequence of Markdown-formatted text and media provided by the ezMM library.
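
To make the idea of "a sequence of Markdown-formatted text and media" concrete, here is a toy model of such a structure. It is an illustration only: `ToySequence`, `Image`, `Video`, and their methods are hypothetical names invented for this sketch and do not reflect the actual ezMM classes or API.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Image:
    source: str  # hypothetical: path or URL of the image


@dataclass
class Video:
    source: str  # hypothetical: path or URL of the video


Item = Union[str, Image, Video]


@dataclass
class ToySequence:
    """Ordered mix of Markdown text chunks and media items (toy model, not ezMM)."""
    items: List[Item] = field(default_factory=list)

    def texts(self) -> List[str]:
        # Keep only the text chunks, in order
        return [i for i in self.items if isinstance(i, str)]

    def media(self) -> List[Union[Image, Video]]:
        # Keep only the media items, in order
        return [i for i in self.items if not isinstance(i, str)]

    def to_markdown(self) -> str:
        # Render media as Markdown references so the whole sequence becomes one string
        parts = []
        for item in self.items:
            if isinstance(item, str):
                parts.append(item)
            elif isinstance(item, Image):
                parts.append(f"![image]({item.source})")
            else:
                parts.append(f"[video]({item.source})")
        return "\n\n".join(parts)


seq = ToySequence(["# Title", Image("cat.png"), "Some text.", Video("clip.mp4")])
```

Keeping text and media interleaved in one ordered list preserves where each image or video appeared in the original page, which is typically what a multimodal LLM prompt needs.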

Web scraping is done with Firecrawl.

Supported Proprietary APIs

  • ✅ X/Twitter
  • ✅ Telegram
  • ⏳ Facebook
  • ⏳ Instagram
  • ⏳ Threads
  • ⏳ TikTok
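
One plausible way to combine platform APIs with general web scraping is to route each URL by its hostname. The following is a minimal sketch under that assumption, not scrapeMM's actual logic: `PLATFORM_HOSTS`, `pick_backend`, and the backend labels are hypothetical, and the host table covers only a few illustrative domains.

```python
from urllib.parse import urlparse

# Hypothetical host-to-backend table; illustrative, not exhaustive
PLATFORM_HOSTS = {
    "x.com": "x",
    "twitter.com": "x",
    "t.me": "telegram",
}


def pick_backend(url: str) -> str:
    """Return a backend label for the URL, falling back to generic web scraping."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # treat www.example.com like example.com
    return PLATFORM_HOSTS.get(host, "web")
```

For example, `pick_backend("https://t.me/channel/5")` yields `"telegram"`, while any unrecognized host falls through to `"web"`.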
