
Documentation scraper for PyMeilisearch

Project description

pymeilisearch-scraper

An Ansys fork of meilisearch/docs-scraper

This repository has been forked from meilisearch/docs-scraper and incorporates several enhancements to facilitate usage with Python and Sphinx documentation scraping.

It is used by pymeilisearch when scraping online and local documentation pages.

Added:

  • Ability to install via pip
  • Ignore the trailing " #" at the end of headers in Sphinx documentation
  • A __main__.py that lets you call the scraper as a Python module
  • Inclusion of the desired CNAME when scraping local pages
$ python -m scraper -h
usage: __main__.py [-h] [--meilisearch-host-url MEILISEARCH_HOST_URL]
                   [--meilisearch-api-key MEILISEARCH_API_KEY]
                   config_file

Scrape documentation.

positional arguments:
  config_file           The path to the configuration file.

options:
  -h, --help            show this help message and exit
  --meilisearch-host-url MEILISEARCH_HOST_URL
                        The URL to the meilisearch host
  --meilisearch-api-key MEILISEARCH_API_KEY
                        The URL to the meilisearch host
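
The fork can be installed from PyPI and then invoked as a module against a config file. A minimal sketch, assuming the PyPI project name pymeilisearch-scraper and a hypothetical config.json; the host URL and API key reuse the local-instance example shown later in this README:

$ pip install pymeilisearch-scraper
$ python -m scraper \
    --meilisearch-host-url http://localhost:7700 \
    --meilisearch-api-key myMasterKey \
    config.json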

Original documentation follows:


Meilisearch

docs-scraper

Meilisearch | Meilisearch Cloud | Documentation | Discord | Roadmap | Website | FAQ


docs-scraper is a scraper for your documentation website that indexes the scraped content into a Meilisearch instance.

Meilisearch is an open-source search engine. Discover what Meilisearch is!

This scraper is used in production and runs on the Meilisearch documentation on each deployment.

💡 If you already have your own scraper but you still want to use Meilisearch and our front-end tools, check out this discussion.


⚡ Supercharge your Meilisearch experience

Say goodbye to server deployment and manual updates with Meilisearch Cloud. No credit card required.

⚙️ Usage

Here are the 3 steps to use docs-scraper:

  1. Run a Meilisearch instance
  2. Set your config file
  3. Run the scraper

Run your Meilisearch Instance

Your documentation content needs to be scraped and pushed into a Meilisearch instance.

You can install and run Meilisearch on your machine using curl.

curl -L https://install.meilisearch.com | sh
./meilisearch --master-key=myMasterKey

There are other ways to install Meilisearch.

The host URL and the API key you will provide in the next steps correspond to the credentials of this Meilisearch instance. In the example above, the host URL is http://localhost:7700 and the API key is myMasterKey.

Meilisearch is open-source and can run either on your server or on any cloud provider. Here is a tutorial to run Meilisearch in production.

Set your Config File

The scraper tool needs a config file to know which content you want to scrape. This is done by providing selectors (e.g. the HTML tag/id/class). The config file is passed as an argument. It follows no naming convention and may be named as you want.

Here is an example of a basic config file:

{
  "index_uid": "docs",
  "start_urls": ["https://www.example.com/doc/"],
  "sitemap_urls": ["https://www.example.com/sitemap.xml"],
  "stop_urls": [],
  "selectors": {
    "lvl0": {
      "selector": ".docs-lvl0",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": {
      "selector": ".docs-lvl1",
      "global": true,
      "default_value": "Chapter"
    },
    "lvl2": ".docs-content .docs-lvl2",
    "lvl3": ".docs-content .docs-lvl3",
    "lvl4": ".docs-content .docs-lvl4",
    "lvl5": ".docs-content .docs-lvl5",
    "lvl6": ".docs-content .docs-lvl6",
    "text": ".docs-content p, .docs-content li"
  }
}

The index_uid field is the index identifier in your Meilisearch instance in which your website content is stored. The scraping tool will create a new index if it does not exist.

The docs-content class (the . means this is a class) is the main container of the textual content in this example. Most of the time, this tag is a <main> or an <article> HTML element.

lvlX selectors should use the standard title tags like h1, h2, h3, etc. You can also use static classes. Set a unique id or name attribute to these elements.

Every searchable lvl element outside this main documentation container (for instance, in a sidebar) must be a global selector. Global selectors are picked up once per page and injected into every document built from that page.

You can also check out the config file we use in production for our own documentation site.

💡 To better understand the selectors, go to this section.

🔨 There are many other fields you can set in the config file that allow you to adapt the scraper to your needs. Check out this section.

Run the Scraper

From Source Code

This project supports Python 3.8 and above.

The pipenv command must be installed.

Set both environment variables MEILISEARCH_HOST_URL and MEILISEARCH_API_KEY.
Following on from the example in the first step, they are respectively http://localhost:7700 and myMasterKey.
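
For example, in a POSIX shell:

export MEILISEARCH_HOST_URL=http://localhost:7700
export MEILISEARCH_API_KEY=myMasterKey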

Then, run:

pipenv install
pipenv run ./docs_scraper <path-to-your-config-file>

<path-to-your-config-file> should be the path of your configuration file defined at the previous step.

With Docker

docker run -t --rm \
    -e MEILISEARCH_HOST_URL=<your-meilisearch-host-url> \
    -e MEILISEARCH_API_KEY=<your-meilisearch-api-key> \
    -v <absolute-path-to-your-config-file>:/docs-scraper/<path-to-your-config-file> \
    getmeili/docs-scraper:latest pipenv run ./docs_scraper <path-to-your-config-file>

<absolute-path-to-your-config-file> should be the absolute path of your configuration file defined at the previous step.

⚠️ If you run Meilisearch locally, you must add the --network=host option to this Docker command.
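
For a local instance, the command might look like this (a sketch: the config path is a placeholder and --network=host assumes a Linux Docker host):

docker run -t --rm --network=host \
    -e MEILISEARCH_HOST_URL=http://localhost:7700 \
    -e MEILISEARCH_API_KEY=myMasterKey \
    -v /absolute/path/to/config.json:/docs-scraper/config.json \
    getmeili/docs-scraper:latest pipenv run ./docs_scraper config.json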

In a GitHub Action

To run after your deployment job:

run-scraper:
    needs: <your-deployment-job>
    runs-on: ubuntu-18.04
    steps:
    - uses: actions/checkout@master
    - name: Run scraper
      env:
        HOST_URL: ${{ secrets.MEILISEARCH_HOST_URL }}
        API_KEY: ${{ secrets.MEILISEARCH_API_KEY }}
        CONFIG_FILE_PATH: <path-to-your-config-file>
      run: |
        docker run -t --rm \
          -e MEILISEARCH_HOST_URL=$HOST_URL \
          -e MEILISEARCH_API_KEY=$API_KEY \
          -v $CONFIG_FILE_PATH:/docs-scraper/<path-to-your-config-file> \
          getmeili/docs-scraper:latest pipenv run ./docs_scraper <path-to-your-config-file>

⚠️ We do not recommend using the latest image in production. Use the release tags instead.

Here is the GitHub Action file we use in production for the Meilisearch documentation.

About the API Key

The API key you provide must have permission to add documents to your Meilisearch instance.
In a production environment, we recommend providing the private key instead of the master key: it is safer and has enough permissions to perform such requests.

More about Meilisearch authentication.
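
As a hedged sketch, such a restricted key can be created through the Meilisearch v1.x key management API; the index name docs and the exact set of actions are assumptions to adapt to your own configuration (for example, settings.update is only needed if you use custom_settings):

curl -X POST 'http://localhost:7700/keys' \
  -H 'Authorization: Bearer myMasterKey' \
  -H 'Content-Type: application/json' \
  --data '{
    "description": "docs-scraper key",
    "actions": ["documents.add", "indexes.create", "settings.update"],
    "indexes": ["docs"],
    "expiresAt": null
  }'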

🖌 And for the front-end search bar?

After having scraped your documentation, you might need a search bar to improve your user experience!

About the front part:

  • If your website is a VuePress application, check out the vuepress-plugin-meilisearch repository.
  • For all other kinds of websites, check out the docs-searchbar.js repository.

Both of these libraries provide a front-end search bar perfectly adapted for documentation.

[docs-searchbar demo animation]

🛠 More Configurations

More About the Selectors

Bases

Put simply, selectors tell the scraper "I want to get the content in this HTML element".
The way you target that element is the selector.

A selector can be:

  • a class (e.g. .main-content)
  • an id (e.g. #main-article)
  • an HTML tag (e.g. h1)

With a more concrete example:

"lvl0": {
    "selector": ".navbar-nav .active",
    "global": true,
    "default_value": "Documentation"
},

.navbar-nav .active means "take the content in the class active that is itself in the class navbar-nav".

global: true means you want the same lvl0 (so, the same main title) for all the contents extracted from the same page.

"default_value": "Documentation" will be the displayed value if no content in .navbar-nav .active was found.

NB: You can set the global and default_value attributes for every selector level (lvlX) and not only for the lvl0.

The Levels

You will notice different levels of selectors (lvl0 to lvl6 at most) in the config file. They correspond to different levels of titles and will be displayed this way:

[selectors-display screenshot]

Your data will be displayed with a main title (lvl0), sub-titles (lvl1), sub-sub-titles (lvl2) and so on...

All the Config File Settings

index_uid

The index_uid field is the index identifier in your Meilisearch instance in which your website content is stored. The scraping tool will create a new index if it does not exist.

{
  "index_uid": "example"
}

start_urls

This array contains the list of URLs that will be used to start scraping your website.
The scraper will recursively follow any links (<a> tags) from those pages. It will not follow links that are on another domain.

{
  "start_urls": ["https://www.example.com/docs"]
}
Using Page Rank

This parameter gives more weight to some pages and boosts the records built from them.
Pages with a higher page_rank will be returned before pages with a lower page_rank.

{
  "start_urls": [
    {
      "url": "http://www.example.com/docs/concepts/",
      "page_rank": 5
    },
    {
      "url": "http://www.example.com/docs/contributors/",
      "page_rank": 1
    }
  ]
}

In this example, records built from the Concepts page will be ranked higher than results extracted from the Contributors page.

stop_urls (optional)

The scraper will not follow links that match stop_urls.

{
  "start_urls": ["https://www.example.com/docs"],
  "stop_urls": ["https://www.example.com/about-us"]
}

selectors_key (optional)

This allows you to use custom selectors per page.

If the markup of your website is so different from one page to another that you can't have generic selectors, you can namespace your selectors and specify which set of selectors should be applied to specific pages.

{
  "start_urls": [
    "http://www.example.com/docs/",
    {
      "url": "http://www.example.com/docs/concepts/",
      "selectors_key": "concepts"
    },
    {
      "url": "http://www.example.com/docs/contributors/",
      "selectors_key": "contributors"
    }
  ],
  "selectors": {
    "default": {
      "lvl0": ".main h1",
      "lvl1": ".main h2",
      "lvl2": ".main h3",
      "lvl3": ".main h4",
      "lvl4": ".main h5",
      "text": ".main p"
    },
    "concepts": {
      "lvl0": ".header h2",
      "lvl1": ".main h1.title",
      "lvl2": ".main h2.title",
      "lvl3": ".main h3.title",
      "lvl4": ".main h5.title",
      "text": ".main p"
    },
    "contributors": {
      "lvl0": ".main h1",
      "lvl1": ".contributors .name",
      "lvl2": ".contributors .title",
      "text": ".contributors .description"
    }
  }
}

Here, all documentation pages will use the selectors defined in selectors.default while the page under ./concepts will use selectors.concepts and those under ./contributors will use selectors.contributors.

scrape_start_urls (optional)

By default, the scraper will extract content from the pages defined in start_urls. If your start_urls pages do not have any valuable content, or if they duplicate other pages, you should set this to false.

{
  "scrape_start_urls": false
}

sitemap_urls (optional)

You can pass an array of URLs pointing to your sitemap file(s). If this value is set, the scraper will try to read URLs from your sitemap(s).

{
  "sitemap_urls": ["http://www.example.com/docs/sitemap.xml"]
}

sitemap_alternate_links (optional)

Sitemaps can contain alternative links for URLs. Those are other versions of the same page, in a different language, or with a different URL. By default docs-scraper will ignore those URLs.

Set this to true if you want those other versions to be scraped as well.

{
  "sitemap_urls": ["http://www.example.com/docs/sitemap.xml"],
  "sitemap_alternate_links": true
}

With the above configuration and the sitemap.xml below, both http://www.example.com/docs/ and http://www.example.com/de/ will be scraped.

<url>
  <loc>http://www.example.com/docs/</loc>
  <xhtml:link rel="alternate" hreflang="de" href="http://www.example.com/de/"/>
</url>

selectors_exclude (optional)

This expects an array of CSS selectors. Any element matching one of those selectors will be removed from the page before any data is extracted from it.

This can be used to remove a table of contents, a sidebar, or a footer, and make the other selectors easier to write.

{
  "selectors_exclude": [".footer", "ul.deprecated"]
}

custom_settings (optional)

This field can be used to add Meilisearch settings.

Example:
"custom_settings": {
    "synonyms": {
      "static site generator": [
        "ssg"
      ],
      "ssg": [
        "static site generator"
      ]
    },
    "stopWords": ["of", "the"],
    "filterableAttributes": ["genres", "type"]
  }

Learn more about filterableAttributes, synonyms, stop-words and all available settings in the Meilisearch documentation.

min_indexed_level (optional)

The default value is 0. By increasing it, you can choose not to index records that don't have enough lvlX fields set. For example, with min_indexed_level: 2, the scraper only indexes records that have at least lvl0, lvl1, and lvl2 set.

This is useful when your documentation has pages that share the same lvl0 and lvl1, for example. In that case, you don't want to index all the shared records; you only want to keep the content that differs across pages.

{
  "min_indexed_level": 2
}

only_content_level (optional)

When only_content_level is set to true, the scraper won't create records for the lvlX selectors.
If used, min_indexed_level is ignored.

{
  "only_content_level": true
}

js_render (optional)

When js_render is set to true, the scraper uses ChromeDriver. This is needed for pages that are rendered with JavaScript, for example pages generated with React or Vue, or applications running in development mode (autoreload, watch).

After installing ChromeDriver, provide the path to the binary via the CHROMEDRIVER_PATH environment variable (default value: /usr/bin/chromedriver).
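
For example, if the driver lives somewhere other than the default path (the location below is only an illustration):

export CHROMEDRIVER_PATH=/usr/local/bin/chromedriver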

The default value of js_render is false.

{
  "js_render": true
}

js_wait (optional)

This setting can be used when js_render is set to true and the pages need time to fully load. js_wait takes an integer that specifies the number of seconds the scraper should wait for the page to load.

{
  "js_render": true,
  "js_wait": 1
}

allowed_domains (optional)

This setting specifies the domains that the scraper is allowed to access. In most cases, allowed_domains is set automatically from start_urls and stop_urls. When scraping a domain that contains a port, for example http://localhost:8080, the domain must be added to the configuration manually.

{
  "allowed_domains": ["localhost"]
}

Authentication

WARNING: Please be aware that the scraper will send authentication headers to every scraped site, so use allowed_domains to adjust the scope accordingly!

Basic HTTP

Basic HTTP authentication is supported by setting these environment variables:

  • DOCS_SCRAPER_BASICAUTH_USERNAME
  • DOCS_SCRAPER_BASICAUTH_PASSWORD
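
For example, with placeholder credentials:

export DOCS_SCRAPER_BASICAUTH_USERNAME=<your-username>
export DOCS_SCRAPER_BASICAUTH_PASSWORD=<your-password>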

Cloudflare Access: Identity and Access Management

If you need to scrape sites protected by Cloudflare Access, you have to set the appropriate HTTP headers.

Values for these headers are taken from env variables CF_ACCESS_CLIENT_ID and CF_ACCESS_CLIENT_SECRET.

In the case of Google Cloud Identity-Aware Proxy, the corresponding environment variables must be specified.

Keycloak Access: Identity and Access Management

If you need to scrape a site protected by Keycloak (Gatekeeper), you have to provide a valid access token.

If you set the environment variables KC_URL, KC_REALM, KC_CLIENT_ID, and KC_CLIENT_SECRET, the scraper authenticates itself against Keycloak using the Client Credentials Grant and adds the resulting access token as an Authorization HTTP header to each scraping request.
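
For example, with placeholder values for your own Keycloak setup:

export KC_URL=https://keycloak.example.com
export KC_REALM=<your-realm>
export KC_CLIENT_ID=<your-client-id>
export KC_CLIENT_SECRET=<your-client-secret>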

Installing Chrome Headless

Websites that need JavaScript for rendering are passed through ChromeDriver.
Download the version suited to your OS and then set the environment variable CHROMEDRIVER_PATH.

🤖 Compatibility with Meilisearch

This package guarantees compatibility with version v1.x of Meilisearch, but some features may not be present. Please check the issues for more info.

⚙️ Development Workflow and Contributing

Any new contribution is more than welcome in this project!

If you want to know more about the development workflow or want to contribute, please visit our contributing guidelines for detailed instructions!

Credits

Based on Algolia's docsearch scraper repository from this commit.
Because this repository is expected to diverge significantly from the original one, we don't maintain it as an official fork.


Meilisearch provides and maintains many SDKs and Integration tools like this one. We want to provide everyone with an amazing search experience for any kind of project. If you want to contribute, make suggestions, or just know what's going on right now, visit us in the integration-guides repository.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pymeilisearch_scraper-0.2.3.tar.gz (34.1 kB)

Built Distribution

pymeilisearch_scraper-0.2.3-py3-none-any.whl (32.7 kB)

File details

Details for the file pymeilisearch_scraper-0.2.3.tar.gz.

File metadata

  • Download URL: pymeilisearch_scraper-0.2.3.tar.gz
  • Upload date:
  • Size: 34.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.13

File hashes

Hashes for pymeilisearch_scraper-0.2.3.tar.gz:

  • SHA256: f613baaccd58a93b4747093971294c512c58499ad904410f4024f32ac781cebe
  • MD5: e596a46f0f1a2d23cef8dcb87213b616
  • BLAKE2b-256: ae8b74239f03ac5516f6cb10aa786d1a6bcf63df2605d8da8156395bfd18dfb9

See more details on using hashes here.

File details

Details for the file pymeilisearch_scraper-0.2.3-py3-none-any.whl.

File metadata

File hashes

Hashes for pymeilisearch_scraper-0.2.3-py3-none-any.whl:

  • SHA256: 75c7305d603a5e0a632ed6c4f6bcdf9e7fcbae9bb7ffa751fb84c18829eac2bd
  • MD5: c40f04f40a356a8b107fb862a7b98d67
  • BLAKE2b-256: 782e0d2ce6eddfb1142c9aa33756c29fcddf65a3195cf5da62f6f322b713d57c

See more details on using hashes here.
