
A convenient crawling package for collecting web data.




Weaving connection to the world.


Musubi is a Python library designed for efficiently crawling and extracting website text, enabling users to construct scalable, domain-specific datasets from the ground up for training LLMs.

With Musubi, you can:

🕸️ Extract text data from most websites with common structures and transform it into markdown format.

🤖 Deploy AI agents to help you find optimal parameters for website crawling and implement crawling automatically.

📆 Set up schedulers to run your crawling tasks at specified times.

🗂️ Manage crawling configurations for each website conveniently.

We've also developed a CLI tool that lets you crawl and deploy agents without the need to write any code!


Installation

Python Package

To install Musubi with pip:

pip install musubi-scrape

From source

You can also install Musubi from source to instantly use the latest features before the official release.

pip install git+https://github.com/Musubi-ai/Musubi.git

Usage

In Musubi, the overall crawling process is split into two stages: link crawling and content crawling. In the link-crawling stage, Musubi extracts all links in the specified block on the website; depending on the website's format, it offers four main methods for collecting links to news, documents, and blogs: scan, scroll, click, and onepage. In the content-crawling stage, the text content behind each link is crawled and transformed into markdown format.
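To make the two stages concrete, here is a minimal hand-rolled sketch of the same idea, independent of Musubi's internals: stage 1 collects article links from a listing page with requests and BeautifulSoup, and stage 2 converts each linked page into markdown with trafilatura (the extraction library Musubi builds on, per the Acknowledgement below). The div/post_header block comes from the Literary Hub demo below; everything else is an illustrative assumption.

# A hand-rolled sketch of the two crawling stages (not Musubi's actual
# internals). Assumes requests, beautifulsoup4, and trafilatura are
# installed; markdown output requires a recent trafilatura version.
import requests
import trafilatura
from bs4 import BeautifulSoup

listing_url = "https://lithub.com/category/fictionandpoetry/page/1/"
html = requests.get(listing_url, timeout=30).text

# Stage 1: link crawling -- collect hrefs inside the target block.
soup = BeautifulSoup(html, "html.parser")
links = [
    a["href"]
    for block in soup.find_all("div", class_="post_header")
    for a in block.find_all("a", href=True)
]

# Stage 2: content crawling -- turn each linked page into markdown.
for url in links[:3]:
    downloaded = trafilatura.fetch_url(url)
    markdown = trafilatura.extract(downloaded, output_format="markdown")
    print(markdown[:200] if markdown else "(no content extracted)")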

Key usage

To crawl website content, you can simply use the pipeline function:

from musubi import Pipeline

pipeline_kwargs = {
    "dir": dir,  # Name of the directory in which to store links and text content
    "name": name,  # Name of the saved file
    "class_": class_,  # Type of data on the website
    "prefix": prefix,  # Main prefix of the website URL
    "suffix": suffix,  # The URL Musubi crawls is formulated as prefix + str((page_init_val + pages) * multiplier) + suffix
    "root_path": root_path,  # Root of the URL if the URLs in tags are given in relative form
    "pages": max_pages,  # Number of pages to crawl if type is 'scan', or number of scrolls if type is 'scroll'
    "page_init_val": page_init_val,  # Initial value of the page counter
    "multiplier": multiplier,  # Multiplier applied to the page number
    "block1": block1,  # List of an HTML tag and its class
    "block2": block2,  # Second block when crawling a nested structure
    "type": website_type,  # Crawling method used to collect URLs on the website
    "async_": async_,  # Whether to crawl the website asynchronously
}

pipeline = Pipeline(website_config_path=website_config_path)
pipeline.pipeline(**pipeline_kwargs)
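As a quick sanity check of the URL formula described in the comments above, this is how the crawled URLs are composed in plain Python (assuming the running page index goes from 0 to pages - 1, which matches the demo below; check the source for the exact enumeration):

# Illustration of the URL formula documented above (plain Python, not
# Musubi itself). Assumption: the page index runs from 0 to pages - 1.
prefix = "https://lithub.com/category/fictionandpoetry/page/"
suffix = "/"
page_init_val, multiplier, pages = 1, 1, 3

for page in range(pages):
    print(prefix + str((page_init_val + page) * multiplier) + suffix)
# https://lithub.com/category/fictionandpoetry/page/1/
# https://lithub.com/category/fictionandpoetry/page/2/
# https://lithub.com/category/fictionandpoetry/page/3/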

Demo

Task: Crawl 3 pages of articles from the 'Fiction and Poetry' category on Literary Hub.

from musubi.pipeline import Pipeline


pipe = Pipeline(website_config_path=r"config\test.json")

pipe.pipeline(
    dir="Literary Hub",
    name="Fiction and Poetry",
    class_="English",
    prefix="https://lithub.com/category/fictionandpoetry/page/",
    suffix="/",
    root_path="https://lithub.com",
    pages=3,
    page_init_val=1,
    multiplier=1,
    block1=["div", "post_header"],
    block2=None,
    type="scan",
)

https://github.com/user-attachments/assets/223a5d62-8364-4964-ade6-829306fec271

Scheduler

Musubi allows users to set up a scheduler to run crawling tasks at specified times. To launch a scheduler:

from musubi.scheduler import Controller

controller = Controller()
controller.launch_scheduler()

By default, the scheduler uses tasks.json in the config folder as its task-management configuration and websites.json to implement crawling tasks. Users can customize these paths with arguments:

from musubi.scheduler import Controller

controller = Controller(
    config_dir="folder-of-task.json",
    website_config_path="path-of-website.json"
)

After launching the scheduler, users can add tasks and have the scheduler run them regularly. Currently, two task types are supported: update_all and by_idx. The update_all task crawls all websites stored in the website configuration file, and the by_idx task crawls a specific website identified by its index. Note that the Musubi scheduler follows the common cron format to define when a task runs. For instance, to set a regular update task that crawls all websites stored in websites.json at 12:05:05 on the 5th day of May each year:

from musubi.scheduler import Controller

controller = Controller()

def main():
    status_code, _ = controller.check_status()
    if status_code == 200:
        controller.add_task(
            task_type="update_all",
            task_name="test1",
            update_pages=15,
            cron_params={"month": 5, "day": 5, "hour": 12, "minute": 5, "second": 5}
        )

if __name__ == "__main__":
    main()

For valid cron_params arguments, check the reference.
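For orientation, here are a few illustrative cron_params values, assuming the standard cron trigger fields suggested by the example above (the reference is authoritative):

# Illustrative cron_params values (assumed standard cron fields).
every_midnight = {"hour": 0, "minute": 0}            # daily at 00:00
every_half_hour = {"minute": 30}                     # every hour at minute 30
yearly_may_5th = {"month": 5, "day": 5, "hour": 12}  # 12:00 on May 5th each year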

Notification

Users can set the argument send_notification=True in the add_task function so that the program will send Gmail notifications when scheduled tasks start and finish. Go to this website to apply for an app password and set the environment variable in the .env file:

GOOGLE_APP_PASSWORD="your-app-password"

Then notifications can be enabled like this:

controller.add_task(
    ...,
    send_notification=True,
    sender_email="youe-account@gmail.com"
)

Agent

Musubi provides agents that crawl websites, set crawling schedulers, and analyze crawling configurations with the help of proprietary LLMs from providers such as OpenAI, Anthropic, and Google, as well as open-source LLMs from Hugging Face. Set the API keys in the .env file to use these LLMs:

OPENAI_API_KEY=
GROQ_API_KEY=
XAI_API_KEY=
DEEPSEEK_API_KEY=
ANTHROPIC_API_KEY=
GEMINI_API_KEY=

Alternatively, you can instantiate agents with an API key directly. By default, the API key is stored in the .env file once the agent is instantiated. For example, to use GPT-4o to build a pipeline agent in Musubi:

from musubi.agent import PipelineAgent

agent = PipelineAgent(
    actions=[...],  # list of action functions, e.g. google_search
    model_source="openai",
    api_key="your-openai-apikey",
    model_type="gpt-4o"
)

In addition to the LLM APIs for agents, a Google Search API key and a Google Engine ID are required to use the google_search action with PipelineAgent. Check this documentation to apply for the Google Search API and this website to build a search engine and retrieve its engine ID, then set them in the .env file:

GOOGLE_SEARCH_API=
GOOGLE_ENGINE_ID=
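Before running the agent, you can verify the two credentials with a quick query against the public Custom Search JSON API. This is a standalone check using requests, independent of Musubi; the query string is just an example:

# Standalone sanity check of the Google Search credentials (not part of
# Musubi). Assumes requests is installed and the two variables above are
# set in the environment.
import os
import requests

resp = requests.get(
    "https://www.googleapis.com/customsearch/v1",
    params={
        "key": os.environ["GOOGLE_SEARCH_API"],
        "cx": os.environ["GOOGLE_ENGINE_ID"],
        "q": "Literary Hub fiction and poetry",
    },
    timeout=30,
)
resp.raise_for_status()
print([item["link"] for item in resp.json().get("items", [])])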

Here is a basic example of using a pipeline agent in Musubi to crawl text contents in the 'Fiction and Poetry' category on Literary Hub:

from musubi.agent import PipelineAgent
from musubi.agent.actions import (
    google_search,
    analyze_website,
    get_container,
    get_page_info,
    final_answer
)


actions = [google_search, analyze_website, get_container, get_page_info, final_answer]
pipeline_agent = PipelineAgent(
    actions=actions,
    model_source="openai"
)

prompt = "Help me scrape all pages of articles from the 'Fiction and Poetry' category on Literary Hub."
pipeline_agent.execute(prompt)

Multi-agent System

Beyond instantiating a single agent to perform specific tasks, agents can be coordinated into a hierarchical multi-agent system to execute tasks with greater efficiency, scalability, and adaptability. To build a hierarchical multi-agent system in Musubi, you can simply use MusubiAgent:

from musubi.agent import PipelineAgent, GeneralAgent, SchedulerAgent, MusubiAgent
from musubi.agent.actions import (
    google_search,
    analyze_website,
    get_container,
    get_page_info,
    final_answer,
    domain_analyze,
    type_analyze,
    update_all,
    update_by_idx,
    upload_data_folder,
    del_web_config_by_idx
)


actions = [google_search, analyze_website, get_container, get_page_info, final_answer]
pipeline_agent = PipelineAgent(
    actions=actions
)


general_actions = [domain_analyze, type_analyze, update_all, update_by_idx, upload_data_folder, del_web_config_by_idx]
general_agent = GeneralAgent(
    actions=general_actions
)

main_agent = MusubiAgent(candidates=[general_agent, pipeline_agent])
prompt = "Check how many websites I have scraped already."
main_agent.execute(prompt)

Demo

Task: Crawl 5 pages of articles from the 'Fiction and Poetry' category on Literary Hub.

https://github.com/user-attachments/assets/f61f40fb-882b-4484-9a9d-0304a8967a9e

Check the agent examples for further details on how to use agents in Musubi.

CLI Tools

Musubi also supports executing the aforementioned functions through a command line interface (CLI). The basic structure of a Musubi CLI command is:

musubi [COMMAND] [FLAGS] [ARGUMENTS]

For instance, to add an OpenAI API key to the .env file with the Musubi CLI:

musubi env --openai your-openai-api-key

Use pipeline to crawl a website (suppose we want to crawl the articles on the first 5 pages of the Chinese website '測試', whose URL is https://www.test.com/category?&pages=n):

musubi pipeline \
  --dir 測試 \
  --name 測試文章 \
  --class_ 中文 \
  --prefix "https://www.test.com/category?&pages=" \
  --pages 5 \
  --block1 '["div", "entry-image"]' \
  --type scan

Use agent:

musubi agent \
  --prompt "Help me crawl all pages of articles from the 'Fiction and Poetry' category on Literary Hub." \
  --model_source anthropic \
  --model_type claude-opus-4-20250514

Re-crawl all previously crawled websites according to the specified page numbers:

musubi start-all \
 --website_config_path config/websites.json \
 --update-pages 80

Demo

Task: Re-crawl all websites whose configurations are stored in config\test_websites.json (updating 5 pages).

https://github.com/user-attachments/assets/f7c17fa6-f2ab-48c9-aea1-f795cea362a0

License

This repository is licensed under the Apache-2.0 License.

Background

Musubi (結び) is a Japanese word meaning "to tie something, like a string". In Shinto (神道) and traditional Japanese philosophy, musubi also refers to life, birth, relationships, and the natural cycles of the world.

Citation

If you use Musubi in your research or project, please cite it with the following BibTeX entry:

@misc{musubi2025,
  title =        {Musubi: Weaving connection to the world.},
  author =       {Lung-Chuan Chen},
  howpublished = {\url{https://github.com/Musubi-ai/Musubi}},
  year =         {2025}
}

Acknowledgement

This repo benefits from trafilatura for extracting text content from webpages and PyMuPDF for parsing online PDF documents.
