canonicalwebteam.directory-parser

Flask extension to parse websites and extract structured data to build sitemaps.

Install

Install the project with pip:

pip install canonicalwebteam.directory-parser

Using the directory parser

Sitemap templates

Include the sitemap templates in your Flask app. Copy the following code block to where your application is instantiated, e.g. app.py. The template loader should be set up right after the app is instantiated.

from pathlib import Path

from canonicalwebteam.flask_base.app import FlaskBase
from jinja2 import ChoiceLoader, FileSystemLoader

import canonicalwebteam.directory_parser as directory_parser


# Set up Flask application
app = FlaskBase(...)


# Include directory parser templates
directory_parser_templates = (
    Path(directory_parser.__file__).parent / "templates"
)

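# Loaders are tried in order. The first entry assumes the app's own
# templates live in ./templates, so they stay available after the
# loader is replaced.
loader = ChoiceLoader(
    [
        FileSystemLoader("templates"),
        FileSystemLoader(str(directory_parser_templates)),
    ]
)

app.jinja_loader = loader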
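
A ChoiceLoader asks each of its loaders in turn and uses the first one that finds the requested template, so the order matters when several template directories are listed. The lookup order can be sketched with the standard library alone. This is only an illustration of "first match wins", not Jinja's implementation, and the function, directory, and file names below are hypothetical.

```python
import tempfile
from pathlib import Path


def resolve_template(name, search_dirs):
    """Return the first existing file called `name`, trying dirs in order."""
    for directory in search_dirs:
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical directories standing in for the app and the package
        app_dir = Path(tmp) / "app_templates"
        pkg_dir = Path(tmp) / "package_templates"
        app_dir.mkdir()
        pkg_dir.mkdir()

        # Both define shared.html; only the package has package_only.xml
        (app_dir / "shared.html").write_text("app version")
        (pkg_dir / "shared.html").write_text("package version")
        (pkg_dir / "package_only.xml").write_text("package template")

        search_dirs = [app_dir, pkg_dir]
        return (
            resolve_template("shared.html", search_dirs).read_text(),
            resolve_template("package_only.xml", search_dirs).read_text(),
            resolve_template("missing.html", search_dirs),
        )


print(demo())
```

The earlier directory wins for names that exist in both, and later directories only fill in names that the earlier ones lack.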

Generate sitemaps

The generate_sitemap function generates a sitemap from a directory path and a base URL, using the sitemap templates.

# Dynamic sitemaps that do not need to be included in the sitemap tree.
# These differ from project to project; check /sitemap.xml for the list.
DYNAMIC_SITEMAPS = [
    "tutorials",
    "engage",
    "ceph/docs",
    "blog",
    "security/notices",
    "security/cves",
    "security/livepatch/docs",
    "robotics/docs",
]

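import os

import flask

directory_path = os.path.join(os.getcwd(), "templates")
base_url = "https://ubuntu.com"
# Assumed output location (not set by the package); adjust for your project
sitemap_path = os.path.join(directory_path, "sitemap_tree.xml")


# Hypothetical route; the response has to be returned from a view function
@app.route("/sitemap_tree.xml")
def serve_sitemap_tree():
    # Build the sitemap from the template directory
    xml_sitemap = directory_parser.generate_sitemap(
        directory_path,
        base_url,
        exclude_paths=DYNAMIC_SITEMAPS,
    )

    if xml_sitemap:
        with open(sitemap_path, "w") as f:
            f.write(xml_sitemap)

    # Serve the existing sitemap
    with open(sitemap_path, "r") as f:
        xml_sitemap = f.read()

    response = flask.make_response(xml_sitemap)
    response.headers["Content-Type"] = "application/xml"
    return response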
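
For readers unfamiliar with the output format, the following sketch shows the general shape of a sitemap document: a urlset element in the sitemaps.org namespace with one url/loc entry per page. It is a standard-library illustration only. The package renders its sitemap through its own templates and may include additional fields, and build_sitemap and the example paths are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(base_url, paths):
    """Build a sitemap XML string with one <url><loc> entry per path."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for path in paths:
        url = ET.SubElement(urlset, "url")
        loc = ET.SubElement(url, "loc")
        # Join base URL and path with exactly one slash between them
        loc.text = base_url.rstrip("/") + "/" + path.lstrip("/")
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical page paths, with and without a leading slash
xml_sitemap = build_sitemap("https://ubuntu.com", ["/", "/desktop", "server"])
print(xml_sitemap)
```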

Parse project directory tree

If you'd like to get the parsed tree of a given directory, you can use the scan_directory function.

directory_path = os.getcwd() + "/templates"
tree = directory_parser.scan_directory(
    directory_path, exclude_paths=DYNAMIC_SITEMAPS
)

scan_directory returns a tree of all the templates found under directory_path.
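
To make the idea of a parsed template tree with excluded paths concrete, here is a minimal standard-library sketch. It is not the package's implementation, and the node shape (path, templates, children) is an assumption chosen for illustration; the structure that scan_directory actually returns may differ. The function name scan_templates and the sample files are hypothetical.

```python
import os
import tempfile


def scan_templates(directory_path, exclude_paths=()):
    """Return a nested dict of the .html files under directory_path.

    Directories whose path relative to directory_path equals an entry in
    exclude_paths, or sits below one, are skipped.
    """
    excluded = {p.strip("/") for p in exclude_paths}

    def is_excluded(rel_path):
        return any(
            rel_path == p or rel_path.startswith(p + "/") for p in excluded
        )

    def walk(current, rel_path):
        node = {"path": "/" + rel_path, "templates": [], "children": []}
        for entry in sorted(os.listdir(current)):
            full = os.path.join(current, entry)
            child_rel = f"{rel_path}/{entry}" if rel_path else entry
            if os.path.isdir(full):
                if not is_excluded(child_rel):
                    node["children"].append(walk(full, child_rel))
            elif entry.endswith(".html"):
                node["templates"].append(entry)
        return node

    return walk(directory_path, "")


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical template layout; "blog" plays the excluded section
        for rel in ("index.html", "desktop/index.html", "blog/index.html"):
            full = os.path.join(tmp, rel)
            os.makedirs(os.path.dirname(full), exist_ok=True)
            with open(full, "w"):
                pass
        return scan_templates(tmp, exclude_paths=["blog"])


print(demo())
```

Excluding by relative path prefix means an entry such as "security/notices" skips only that subtree while the rest of "security" is still scanned.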

Local development

Running the project

This guide assumes that you are using dotrun to run your Flask app.

Include a relative path to the project

This example assumes both projects exist in the same parent directory.

In requirements.txt:

# Comment out package import
# canonicalwebteam.directory-parser==1.2.6

-e ../directory-parser

Run the project with the additional directory mounted

dotrun -m /path/to/canonicalwebteam.directory-parser:../directory-parser

Linting and formatting

This project uses Tox to apply its standard linting and formatting rules:

pip3 install tox  # Install tox
tox -e lint       # Check the format of Python code
tox -e format     # Reformat the Python code
