
A convenient crawling package for collecting web data.




Weaving connection to the world.


Musubi is a Python library designed for efficiently crawling and extracting website text, enabling users to construct scalable, domain-specific datasets from the ground up for training LLMs.

With Musubi, you can:

🕸️ Extract text data from most websites with common structures and transform it into markdown format.

🤖 Deploy AI agents to help you find optimal parameters for website crawling and implement crawling automatically.

📆 Set up schedulers to run your crawling tasks at specified times.

🗂️ Manage crawling configurations for each website conveniently.

We've also developed a CLI tool that lets you crawl and deploy agents without the need to write any code!


Installation

Python Package

To install Musubi with pip:

pip install musubi-scrape

From source

You can also install Musubi from source to instantly use the latest features before the official release.

pip install git+https://github.com/Musubi-ai/Musubi.git

Usage

In Musubi, the overall crawling process is split into two stages: link crawling and content crawling. In the link-crawling stage, Musubi extracts all links in the specified block on the website; depending on the website's format, it offers four main methods for collecting links to news, documents, and blogs: scan, scroll, click, and onepage. In the content-crawling stage, the text content behind each link is crawled and transformed into markdown format.
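To make the two stages concrete, here is a minimal hand-rolled sketch of the same idea, independent of Musubi's internals: stage 1 collects article links from a listing page with requests and BeautifulSoup, and stage 2 converts each linked page into markdown with trafilatura (the extraction library Musubi builds on, per the Acknowledgement below). The div/post_header block comes from the Literary Hub demo below; everything else is an illustrative assumption.

# A hand-rolled sketch of the two crawling stages (not Musubi's actual
# internals). Assumes requests, beautifulsoup4, and trafilatura are
# installed; markdown output requires a recent trafilatura version.
import requests
import trafilatura
from bs4 import BeautifulSoup

listing_url = "https://lithub.com/category/fictionandpoetry/page/1/"
html = requests.get(listing_url, timeout=30).text

# Stage 1: link crawling -- collect hrefs inside the target block.
soup = BeautifulSoup(html, "html.parser")
links = [
    a["href"]
    for block in soup.find_all("div", class_="post_header")
    for a in block.find_all("a", href=True)
]

# Stage 2: content crawling -- turn each linked page into markdown.
for url in links[:3]:
    downloaded = trafilatura.fetch_url(url)
    markdown = trafilatura.extract(downloaded, output_format="markdown")
    print(markdown[:200] if markdown else "(no content extracted)")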

Key usage

To crawl website content, you can simply use the pipeline function:

from musubi import Pipeline

pipeline_kwargs = {
    "dir": dir,  # Name of the directory in which to store links and text content
    "name": name,  # Name of the saved file
    "class_": class_,  # Type of data on the website
    "prefix": prefix,  # Main prefix of the website URL
    "suffix": suffix,  # The URL Musubi crawls is formulated as prefix + str((page_init_val + pages) * multiplier) + suffix
    "root_path": root_path,  # Root of the URL if the URLs in tags are given in relative form
    "pages": max_pages,  # Number of pages to crawl if type is 'scan', or number of scrolls if type is 'scroll'
    "page_init_val": page_init_val,  # Initial value of the page counter
    "multiplier": multiplier,  # Multiplier applied to the page number
    "block1": block1,  # List of an HTML tag and its class
    "block2": block2,  # Second block when crawling a nested structure
    "type": website_type,  # Crawling method used to collect URLs on the website
    "async_": async_,  # Whether to crawl the website asynchronously
}

pipeline = Pipeline(website_config_path=website_config_path)
pipeline.pipeline(**pipeline_kwargs)
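As a quick sanity check of the URL formula described in the comments above, this is how the crawled URLs are composed in plain Python (assuming the running page index goes from 0 to pages - 1, which matches the demo below; check the source for the exact enumeration):

# Illustration of the URL formula documented above (plain Python, not
# Musubi itself). Assumption: the page index runs from 0 to pages - 1.
prefix = "https://lithub.com/category/fictionandpoetry/page/"
suffix = "/"
page_init_val, multiplier, pages = 1, 1, 3

for page in range(pages):
    print(prefix + str((page_init_val + page) * multiplier) + suffix)
# https://lithub.com/category/fictionandpoetry/page/1/
# https://lithub.com/category/fictionandpoetry/page/2/
# https://lithub.com/category/fictionandpoetry/page/3/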

Demo

Task: Crawl 3 pages of articles from the 'Fiction and Poetry' category on Literary Hub.

from musubi.pipeline import Pipeline


pipe = Pipeline(website_config_path=r"config\test.json")

pipe.pipeline(
    dir="Literary Hub",
    name="Fiction and Poetry",
    class_="English",
    prefix="https://lithub.com/category/fictionandpoetry/page/",
    suffix="/",
    root_path="https://lithub.com",
    pages=3,
    page_init_val=1,
    multiplier=1,
    block1=["div", "post_header"],
    block2=None,
    type="scan",
)

https://github.com/user-attachments/assets/223a5d62-8364-4964-ade6-829306fec271

Scheduler

Musubi allows users to set up a scheduler to run crawling tasks at specified times. To launch a scheduler:

from musubi.scheduler import Controller

controller = Controller()
controller.launch_scheduler()

By default, the scheduler uses tasks.json in the config folder as its task-management configuration and websites.json to implement crawling tasks. Users can customize these paths with arguments:

from musubi.scheduler import Controller

controller = Controller(
    config_dir="folder-of-task.json",
    website_config_path="path-of-website.json"
)

After launching the scheduler, users can add tasks and have the scheduler run them regularly. Currently, two task types are supported: update_all and by_idx. The update_all task crawls all websites stored in the website configuration file, and the by_idx task crawls a specific website identified by its index. Note that the Musubi scheduler follows the common cron format to define when a task runs. For instance, to set a regular update task that crawls all websites stored in websites.json at 12:05:05 on the 5th day of May each year:

from musubi.scheduler import Controller

controller = Controller()

def main():
    status_code, _ = controller.check_status()
    if status_code == 200:
        controller.add_task(
            task_type="update_all",
            task_name="test1",
            update_pages=15,
            cron_params={"month": 5, "day": 5, "hour": 12, "minute": 5, "second": 5}
        )

if __name__ == "__main__":
    main()

For valid cron_params arguments, check the reference.
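For orientation, here are a few illustrative cron_params values, assuming the standard cron trigger fields suggested by the example above (the reference is authoritative):

# Illustrative cron_params values (assumed standard cron fields).
every_midnight = {"hour": 0, "minute": 0}            # daily at 00:00
every_half_hour = {"minute": 30}                     # every hour at minute 30
yearly_may_5th = {"month": 5, "day": 5, "hour": 12}  # 12:00 on May 5th each year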

Notification

Users can set the argument send_notification=True in the add_task function so that the program will send Gmail notifications when scheduled tasks start and finish. Go to this website to apply for an app password and set the environment variable in the .env file:

GOOGLE_APP_PASSWORD="your-app-password"

Then notifications can be enabled like this:

controller.add_task(
    ...,
    send_notification=True,
    sender_email="youe-account@gmail.com"
)

Agent

Musubi provides agents that crawl websites, set crawling schedulers, and analyze crawling configurations with the help of proprietary LLMs from providers such as OpenAI, Anthropic, and Google, as well as open-source LLMs from Hugging Face. Set the API keys in the .env file to use these LLMs:

OPENAI_API_KEY=
GROQ_API_KEY=
XAI_API_KEY=
DEEPSEEK_API_KEY=
ANTHROPIC_API_KEY=
GEMINI_API_KEY=

Alternatively, you can instantiate agents with an API key directly. By default, the API key is stored in the .env file once the agent is instantiated. For example, to use GPT-4o to build a pipeline agent in Musubi:

from musubi.agent import PipelineAgent

agent = PipelineAgent(
    actions=[...],  # list of action functions, e.g. google_search
    model_source="openai",
    api_key="your-openai-apikey",
    model_type="gpt-4o"
)

In addition to the LLM APIs for agents, a Google Search API key and a Google Engine ID are required to use the google_search action with PipelineAgent. Check this documentation to apply for the Google Search API and this website to build a search engine and retrieve its engine ID, then set them in the .env file:

GOOGLE_SEARCH_API=
GOOGLE_ENGINE_ID=
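Before running the agent, you can verify the two credentials with a quick query against the public Custom Search JSON API. This is a standalone check using requests, independent of Musubi; the query string is just an example:

# Standalone sanity check of the Google Search credentials (not part of
# Musubi). Assumes requests is installed and the two variables above are
# set in the environment.
import os
import requests

resp = requests.get(
    "https://www.googleapis.com/customsearch/v1",
    params={
        "key": os.environ["GOOGLE_SEARCH_API"],
        "cx": os.environ["GOOGLE_ENGINE_ID"],
        "q": "Literary Hub fiction and poetry",
    },
    timeout=30,
)
resp.raise_for_status()
print([item["link"] for item in resp.json().get("items", [])])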

Here is a basic example of using a pipeline agent in Musubi to crawl text contents in the 'Fiction and Poetry' category on Literary Hub:

from musubi.agent import PipelineAgent
from musubi.agent.actions import (
    google_search,
    analyze_website,
    get_container,
    get_page_info,
    final_answer
)


actions = [google_search, analyze_website, get_container, get_page_info, final_answer]
pipeline_agent = PipelineAgent(
    actions=actions,
    model_source="openai"
)

prompt = "Help me scrape all pages of articles from the 'Fiction and Poetry' category on Literary Hub."
pipeline_agent.execute(prompt)

Multi-agent System

Beyond instantiating a single agent to perform specific tasks, agents can be coordinated into a hierarchical multi-agent system to execute tasks with greater efficiency, scalability, and adaptability. To build a hierarchical multi-agent system in Musubi, you can simply use MusubiAgent:

from musubi.agent import PipelineAgent, GeneralAgent, SchedulerAgent, MusubiAgent
from musubi.agent.actions import (
    google_search,
    analyze_website,
    get_container,
    get_page_info,
    final_answer,
    domain_analyze,
    type_analyze,
    update_all,
    update_by_idx,
    upload_data_folder,
    del_web_config_by_idx
)


actions = [google_search, analyze_website, get_container, get_page_info, final_answer]
pipeline_agent = PipelineAgent(
    actions=actions
)


general_actions = [domain_analyze, type_analyze, update_all, update_by_idx, upload_data_folder, del_web_config_by_idx]
general_agent = GeneralAgent(
    actions=general_actions
)

main_agent = MusubiAgent(candidates=[general_agent, pipeline_agent])
prompt = "Check how many websites I have scraped already."
main_agent.execute(prompt)

Demo

Task: Crawl 5 pages of articles from the 'Fiction and Poetry' category on Literary Hub.

https://github.com/user-attachments/assets/f61f40fb-882b-4484-9a9d-0304a8967a9e

Check the agent examples for further details on how to use agents in Musubi.

CLI Tools

Musubi also supports executing the aforementioned functions through a command line interface (CLI). The basic structure of a Musubi CLI command is:

musubi [COMMAND] [FLAGS] [ARGUMENTS]

For instance, to add an OpenAI API key to the .env file with the Musubi CLI:

musubi env --openai your-openai-api-key

Use pipeline to crawl a website (suppose we want to crawl the articles on the first 5 pages of the Chinese website '測試', whose URL is https://www.test.com/category?&pages=n):

musubi pipeline \
  --dir 測試 \
  --name 測試文章 \
  --class_ 中文 \
  --prefix "https://www.test.com/category?&pages=" \
  --pages 5 \
  --block1 '["div", "entry-image"]' \
  --type scan

Use agent:

musubi agent \
  --prompt "Help me crawl all pages of articles from the 'Fiction and Poetry' category on Literary Hub." \
  --model_source anthropic \
  --model_type claude-opus-4-20250514

Re-crawl all previously crawled websites according to the specified page numbers:

musubi start-all \
 --website_config_path config/websites.json \
 --update-pages 80

Demo

Task: Re-crawl all websites whose configurations are stored in config\test_websites.json (updating 5 pages).

https://github.com/user-attachments/assets/f7c17fa6-f2ab-48c9-aea1-f795cea362a0

License

This repository is licensed under the Apache-2.0 License.

Background

Musubi (結び) is a Japanese word meaning "to tie something, like a string". In Shinto (神道) and traditional Japanese philosophy, musubi also refers to life, birth, relationships, and the natural cycles of the world.

Citation

If you use Musubi in your research or project, please cite it with the following BibTeX entry:

@misc{musubi2025,
  title =        {Musubi: Weaving connection to the world.},
  author =       {Lung-Chuan Chen},
  howpublished = {\url{https://github.com/Musubi-ai/Musubi}},
  year =         {2025}
}

Acknowledgement

This repo benefits from trafilatura for extracting text content from webpages and PyMuPDF for parsing online PDF documents.
