canonicalwebteam.directory-parser

Flask extension to parse websites and extract structured data to build sitemaps.

Install

Install the project with pip: pip install canonicalwebteam.directory-parser

Using the directory parser

Sitemap templates

Include the sitemap templates in your Flask app by copying the following code block to where your application is instantiated, e.g. app.py. Place the template loader right after the app is instantiated.

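from pathlib import Path

# FlaskBase is provided by the canonicalwebteam.flask-base package
from canonicalwebteam.flask_base.app import FlaskBase
from jinja2 import ChoiceLoader, FileSystemLoader

import canonicalwebteam.directory_parser as directory_parser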


# Set up Flask application
app = FlaskBase(...)


# Include directory parser templates
directory_parser_templates = (
    Path(directory_parser.__file__).parent / "templates"
)

loader = ChoiceLoader(
    [
        FileSystemLoader(str(directory_parser_templates)),
    ]
)

app.jinja_loader = loader

Generate sitemaps

The generate_sitemap function generates a sitemap from a directory path and a base URL, using the sitemap templates.

import os

import flask

# Dynamic sitemaps that do not need to be included in the sitemap tree.
# They differ from project to project and can be checked on /sitemap.xml
DYNAMIC_SITEMAPS = [
    "tutorials",
    "engage",
    "ceph/docs",
    "blog",
    "security/notices",
    "security/cves",
    "security/livepatch/docs",
    "robotics/docs",
]

directory_path = os.getcwd() + "/templates"
base_url = "https://ubuntu.com"

xml_sitemap = directory_parser.generate_sitemap(
    directory_path,
    base_url,
    exclude_paths=DYNAMIC_SITEMAPS,
)

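# sitemap_path is where the generated sitemap is stored. This location is
# only an example; adjust it to your project
sitemap_path = os.getcwd() + "/templates/sitemap_tree.xml"

if xml_sitemap:
    with open(sitemap_path, "w") as f:
        f.write(xml_sitemap)


# Serve the existing sitemap. The route and function names are examples;
# the code must run inside a Flask view so that it can return a response
@app.route("/sitemap_tree.xml")
def serve_sitemap():
    with open(sitemap_path, "r") as f:
        xml_sitemap = f.read()

    response = flask.make_response(xml_sitemap)
    response.headers["Content-Type"] = "application/xml"
    return response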
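
For orientation, the sketch below shows the general shape of an XML sitemap (the sitemaps.org urlset format) built from a list of page paths. It uses only the standard library and is not the package's implementation: build_sitemap and the example paths are hypothetical, and the real output of generate_sitemap comes from the bundled templates and may contain additional fields.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(base_url, paths, exclude_paths=()):
    """Return a sitemap XML string for the given paths (illustrative only)."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for path in paths:
        clean = path.strip("/")
        # Skip any path that is, or sits under, an excluded section
        if any(clean == ex or clean.startswith(ex + "/") for ex in exclude_paths):
            continue
        url = ET.SubElement(urlset, "url")
        loc = ET.SubElement(url, "loc")
        loc.text = f"{base_url.rstrip('/')}/{clean}"
    return ET.tostring(urlset, encoding="unicode")


xml_sitemap = build_sitemap(
    "https://ubuntu.com",
    ["/desktop", "/server/docs", "/blog/some-post"],
    exclude_paths=["blog"],
)
print(xml_sitemap)
```

Here the blog entry is dropped because it falls under an excluded section, which mirrors the role DYNAMIC_SITEMAPS plays above: those sections are expected to publish their own sitemaps.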

Parse project directory tree

If you'd like to get the parsed tree of a given directory, you can use the scan_directory function.

directory_path = os.getcwd() + "/templates"
tree = directory_parser.scan_directory(
    directory_path, exclude_paths=DYNAMIC_SITEMAPS
)

scan_directory returns a tree of all the templates found under directory_path, skipping the paths listed in exclude_paths.
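
As a rough mental model of what a parsed tree can look like, the sketch below walks a templates directory with the standard library and nests each .html file under its parent folders. This is an illustration under assumptions, not the package's code: build_tree, the dictionary layout, and the exclusion logic are hypothetical, and the structure actually returned by scan_directory may differ.

```python
import tempfile
from pathlib import Path


def build_tree(directory, exclude_paths=(), _base=None):
    """Return a nested dict of the .html templates under `directory`."""
    root = Path(directory)
    base = _base or root
    node = {"name": root.name, "templates": [], "children": []}
    for entry in sorted(root.iterdir()):
        # Compare paths relative to the top-level directory, e.g. "blog"
        relative = entry.relative_to(base).as_posix()
        if relative in exclude_paths:
            continue
        if entry.is_dir():
            node["children"].append(build_tree(entry, exclude_paths, base))
        elif entry.suffix == ".html":
            node["templates"].append(entry.name)
    return node


# Build a small throwaway directory to scan
with tempfile.TemporaryDirectory() as tmp:
    templates = Path(tmp) / "templates"
    (templates / "server").mkdir(parents=True)
    (templates / "blog").mkdir()
    (templates / "index.html").write_text("home")
    (templates / "server" / "index.html").write_text("server")
    (templates / "blog" / "post.html").write_text("post")

    tree = build_tree(templates, exclude_paths=["blog"])

print(tree)
```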

Local development

Running the project

This guide assumes that you are using dotrun to run your Flask app.

Include a relative path to the project

This example assumes both projects exist in the same parent directory.

In requirements.txt:

# Comment out package import
# canonicalwebteam.directory-parser==1.2.6

-e ../directory-parser

Run the project with the local package directory mounted

dotrun -m /path/to/canonicalwebteam.directory-parser:../directory-parser

Linting and formatting

To follow the standard linting rules of this project, we use Tox:

pip3 install tox  # Install tox
tox -e lint       # Check the format of Python code
tox -e format     # Reformat the Python code
