Scrape learn.liferay.com/w/dxp into local Markdown docs (raw/{capability}/*.md) for the liferay-expert Claude Code skill.

liferay-docs-scraper

Scrape learn.liferay.com/w/dxp/* into local Markdown, then let Claude Code answer Liferay DXP questions by searching those files. No bundled Liferay content, no embeddings, no vector DB.

Quickstart

From zero to asking Liferay questions in Claude Code:

# 1. One-time browser setup for crawl4ai/Playwright
uvx --from crawl4ai crawl4ai-setup

# 2. Scrape the official Liferay DXP docs (~30-40 min)
uvx liferay-docs-scraper

# 3. Install the Claude Code skill in your project
npx skills add mordonez/liferay-docs-scraper --skill liferay-expert -a claude-code

# 4. Check that docs and skill are ready
uvx --from liferay-docs-scraper liferay-docs-scraper-doctor

Then ask Claude Code something like:

How do I configure synonym sets in Liferay Search?

The skill searches the local Markdown, reads the matching page, and cites the source URL from that file's frontmatter.

Keep -a claude-code in the install command. It avoids interactive installer edge cases where the skill can appear installed but not land in .claude/skills/.

What this repo does not do

  • It does not ship Liferay documentation text. The package contains the scraper and skill only; each user fetches their own local copy from learn.liferay.com.
  • It does not use embeddings, RAG infrastructure, or a vector database. The skill uses normal file search and reads Markdown directly.
  • It does not scrape automatically from the skill. If docs are missing, the skill tells you what command to run instead of starting a long crawl in the middle of a conversation.

Requirements

  • Python 3.10-3.13
  • uv
  • Node/npm for npx skills add

crawl4ai drives Playwright, so run this once before the first scrape:

uvx --from crawl4ai crawl4ai-setup

Scraper Reference

Run the official-docs scraper:

uvx liferay-docs-scraper

It crawls https://learn.liferay.com/w/dxp/index with crawl4ai's BFS crawler, keeps URLs under /w/dxp/*, extracts .learn-article-content, classifies each page into one of 14 Liferay capabilities, and writes Markdown to one shared docs directory.
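The scope rule above (keep only URLs under /w/dxp/*) can be sketched as a small predicate. This is a minimal illustration, not the package's actual code: the function name `is_in_scope` and the extra host check are assumptions added for the example.

```python
from urllib.parse import urlparse

# Assumed constants for illustration; the README only states the /w/dxp/* rule.
DOCS_HOST = "learn.liferay.com"
DOCS_PREFIX = "/w/dxp/"


def is_in_scope(url: str) -> bool:
    """Return True when the URL points at a page under /w/dxp/ on the docs host."""
    parsed = urlparse(url)
    # The trailing slash in the prefix rejects look-alike paths such as /w/dxpx.
    return parsed.netloc == DOCS_HOST and parsed.path.startswith(DOCS_PREFIX)
```

A crawler would apply a check like this to every discovered link before queueing it, so community articles under /kb-article/* and off-site links never enter the official-docs run.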

Default docs directory:

~/.liferay-docs

Override it when needed:

export LIFERAY_DOCS_DIR="$PWD/.liferay-docs"
uvx liferay-docs-scraper

Directory layout:

~/.liferay-docs/
  raw/{capability}/*.md
  raw/_navigation/{capability}/*.md
  raw/_removed/{capability}/*.md
  reports/filtered/

Useful commands:

# Smaller smoke run
uvx liferay-docs-scraper --max-pages 200

# Check local docs and current-project skill installation
uvx --from liferay-docs-scraper liferay-docs-scraper-doctor

The scraper writes files atomically, retries page fetches through crawl4ai, uses bounded concurrency, and exits non-zero if page fetches or the crawl stream fail. If the crawl is interrupted, already written pages remain usable, but the run is marked failed and orphan quarantine is skipped so a partial crawl cannot move good pages to raw/_removed/.
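The two safety properties described above can be sketched as follows. This is a hedged illustration under stated assumptions, not the scraper's real implementation: `write_atomic` uses the common write-to-temp-then-rename pattern, and `plan_quarantine` is a hypothetical helper showing why a failed run must not quarantine anything.

```python
import os
import tempfile
from pathlib import Path


def write_atomic(path: Path, text: str) -> None:
    """Write text so readers see either the old file or the new one, never a partial file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        os.replace(tmp_name, path)  # atomic rename over the destination
    except BaseException:
        # Clean up the temp file if the write or the rename failed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def plan_quarantine(existing: set[str], seen: set[str], crawl_ok: bool) -> list[str]:
    """Return the files to move to raw/_removed/ (hypothetical helper).

    A page counts as an orphan only if a complete crawl did not see it.
    After a failed or interrupted crawl, "not seen" proves nothing, so
    nothing is quarantined.
    """
    if not crawl_ok:
        return []
    return sorted(existing - seen)
```

With this shape, an interrupted run leaves every previously written page in place, because the quarantine list is empty whenever the crawl did not finish cleanly.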

Community Articles

The community scraper is optional. Its corpus is larger than the official docs and lower-authority:

uvx --from liferay-docs-scraper liferay-docs-scraper-community

This fetches Liferay community How-To and Troubleshooting articles from learn.liferay.com/kb-article/*. It writes them separately:

raw/community-howto/{capability}/*.md
raw/community-troubleshooting/{capability}/*.md

Many community articles are not tagged with a capability by the site; those go to _uncategorized/. The liferay-expert skill treats community content as a secondary source and says so in answers.

Useful options:

uvx --from liferay-docs-scraper liferay-docs-scraper-community --resource-type howto
uvx --from liferay-docs-scraper liferay-docs-scraper-community --limit 100

Skill Reference

Install into the current project:

npx skills add mordonez/liferay-docs-scraper --skill liferay-expert -a claude-code

Manual install also works: copy skills/liferay-expert/SKILL.md to:

.claude/skills/liferay-expert/SKILL.md

The skill resolves docs exactly like the scraper:

  1. $LIFERAY_DOCS_DIR, if set.
  2. ~/.liferay-docs, otherwise.
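The resolution order above can be sketched in a few lines. This is an illustration only: the function name `resolve_docs_dir` is hypothetical, and treating an empty `LIFERAY_DOCS_DIR` as unset is an assumption rather than documented behavior.

```python
import os
from pathlib import Path


def resolve_docs_dir(environ=None) -> Path:
    """Return the docs directory: $LIFERAY_DOCS_DIR if set, else ~/.liferay-docs."""
    env = os.environ if environ is None else environ
    override = env.get("LIFERAY_DOCS_DIR")
    if override:  # assumption: an empty value falls through to the default
        return Path(override).expanduser()
    return Path.home() / ".liferay-docs"
```

Passing the environment as a parameter keeps the function easy to test without mutating `os.environ`.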

When answering, it searches raw/{capability}/*.md, reads the best matching files, and cites their url: frontmatter. It skips raw/_navigation/ unless there is no better source.
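The search-then-cite flow above can be approximated with plain file operations. This is a rough sketch, not what the skill actually runs: it assumes YAML-style frontmatter delimited by `---` lines with a `url:` key, uses a naive case-insensitive substring match, and the names `read_frontmatter_url` and `search_docs` are hypothetical.

```python
from pathlib import Path


def read_frontmatter_url(markdown: str):
    """Return the url: value from a leading '---' frontmatter block, or None."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for line in lines[1:]:
        if line.strip() == "---":  # end of frontmatter
            break
        if line.startswith("url:"):
            return line[len("url:"):].strip().strip("'\"") or None
    return None


def search_docs(docs_dir: Path, term: str):
    """Return (path, source_url) pairs for raw/{capability}/*.md files containing term."""
    needle = term.lower()
    hits = []
    # The glob is exactly one directory deep, so raw/_navigation/{capability}/
    # and raw/_removed/{capability}/ are not reached; the underscore check is
    # an extra guard against files placed directly in an underscore directory.
    for path in sorted((docs_dir / "raw").glob("*/*.md")):
        if path.parent.name.startswith("_"):
            continue
        text = path.read_text(encoding="utf-8")
        if needle in text.lower():
            hits.append((path, read_frontmatter_url(text)))
    return hits
```

Each hit carries the page's own source URL, which is what lets an answer cite learn.liferay.com rather than a local file path.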

Development

uv sync --group dev
uv run ruff check .
uv run pytest
uv build

CI runs lint, tests, and package build on Python 3.10, 3.11, 3.12, and 3.13. It does not run a real scrape. Release publishing is documented in docs/release.md.

License

MIT applies to this tool and skill only. Liferay documentation content remains Liferay's content and is fetched locally by each user.

Download files

Source Distribution

liferay_docs_scraper-0.6.1.tar.gz (43.3 kB)

Uploaded Source

Built Distribution

liferay_docs_scraper-0.6.1-py3-none-any.whl (24.3 kB)

Uploaded Python 3

File details

Details for the file liferay_docs_scraper-0.6.1.tar.gz.

File metadata

  • Download URL: liferay_docs_scraper-0.6.1.tar.gz
  • Upload date:
  • Size: 43.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.11.26

File hashes

Hashes for liferay_docs_scraper-0.6.1.tar.gz
Algorithm Hash digest
SHA256 fc98d53df0fa6d1bd717a007923c6da4b31f7c80a34f6d3c2ec90562dee7cd7e
MD5 d369f035270f79913d09961d5523b0fd
BLAKE2b-256 3577a371f172a9788300c119f708211d390e1de56062aa0251a3aa3cac9bdc8f

File details

Details for the file liferay_docs_scraper-0.6.1-py3-none-any.whl.

File metadata

  • Download URL: liferay_docs_scraper-0.6.1-py3-none-any.whl
  • Upload date:
  • Size: 24.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.11.26

File hashes

Hashes for liferay_docs_scraper-0.6.1-py3-none-any.whl
Algorithm Hash digest
SHA256 4eb55475961f59ca1764a6c9ee061086686068ba29ae8e294994277893f79ee9
MD5 3cc7a22ac2a9297df0b8e5ed83a2ce81
BLAKE2b-256 0d7f3e762ca73b0a7d538f4767ea5cfeb4eb46090d3c60b4315e25e242e3ae15
