canonicalwebteam.directory-parser

Flask extension to parse websites and extract structured data to build sitemaps.

Install

Install the project with pip:

pip install canonicalwebteam.directory-parser

Using the directory parser

Sitemap templates

Include the sitemap templates in your Flask app. Copy the following code block into the file where your application is instantiated, e.g. app.py. Place the template loader right after the app is instantiated.

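from pathlib import Path

from jinja2 import ChoiceLoader, FileSystemLoader
# FlaskBase comes from the canonicalwebteam.flask-base package
from canonicalwebteam.flask_base.app import FlaskBase
import canonicalwebteam.directory_parser as directory_parser


# Set up Flask application
app = FlaskBase(...)


# Include directory parser templates
directory_parser_templates = (
    Path(directory_parser.__file__).parent / "templates"
)

loader = ChoiceLoader(
    [
        # Keep the app's own templates first so they still resolve
        # (assumes the app was created with a template folder)
        app.jinja_loader,
        FileSystemLoader(str(directory_parser_templates)),
    ]
)

app.jinja_loader = loader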

Generate sitemaps

The generate_sitemap function takes a directory path and a base URL and returns an XML sitemap rendered with the sitemap templates. Paths listed in exclude_paths are left out.

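import os

import flask
import canonicalwebteam.directory_parser as directory_parser

# Dynamic sitemaps that do not need to be included in the sitemap tree.
# These differ from project to project and can be checked on /sitemap.xml
DYNAMIC_SITEMAPS = [
    "tutorials",
    "engage",
    "ceph/docs",
    "blog",
    "security/notices",
    "security/cves",
    "security/livepatch/docs",
    "robotics/docs",
]

directory_path = os.getcwd() + "/templates"
base_url = "https://ubuntu.com"
# Example location for the generated file; adjust it for your project
sitemap_path = os.getcwd() + "/templates/sitemap_tree.xml"


# Body of a Flask view (the function name is illustrative;
# register it on whichever route serves your sitemap)
def serve_sitemap():
    xml_sitemap = directory_parser.generate_sitemap(
        directory_path,
        base_url,
        exclude_paths=DYNAMIC_SITEMAPS,
    )

    # Write the freshly generated sitemap to disk
    if xml_sitemap:
        with open(sitemap_path, "w") as f:
            f.write(xml_sitemap)

    # Serve the existing sitemap
    with open(sitemap_path, "r") as f:
        xml_sitemap = f.read()

    response = flask.make_response(xml_sitemap)
    response.headers["Content-Type"] = "application/xml"
    return response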
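
Before serving a generated file, it can help to confirm that it is well-formed XML and to see which URLs it lists. The following is a minimal sketch using only the standard library. The sample document and the helper name list_sitemap_urls are illustrative and not part of this package, and the sketch assumes the generated output follows the standard sitemaps.org urlset format, which is worth checking against your own output.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
# (assumed to match the generated file)
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hand-written sample standing in for the output of generate_sitemap
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://ubuntu.com/desktop</loc></url>
  <url><loc>https://ubuntu.com/server</loc></url>
</urlset>
"""


def list_sitemap_urls(xml_text):
    """Return every <loc> value; raises ET.ParseError on malformed XML."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [
        loc.text.strip()
        for loc in root.iterfind("sm:url/sm:loc", SITEMAP_NS)
    ]


urls = list_sitemap_urls(SAMPLE_SITEMAP)
print(urls)
```

In a real project, the text passed to the helper would be the string returned by generate_sitemap or the contents of the file at sitemap_path.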

Parse project directory tree

If you'd like to get the parsed tree of a given directory, you can use the scan_directory function.

directory_path = os.getcwd() + "/templates"
tree = directory_parser.scan_directory(
    directory_path, exclude_paths=DYNAMIC_SITEMAPS
)

tree holds the parsed tree of all the templates found under directory_path, minus the excluded paths.
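
To illustrate the idea of walking a template directory while skipping excluded paths, here is a rough standalone sketch. It is not the implementation of scan_directory: the function name sketch_scan, the ".html" filter, and the shape of the returned dictionary are all assumptions made for the example.

```python
import tempfile
from pathlib import Path


def sketch_scan(directory, exclude_paths=()):
    """Build a nested dict of .html templates, skipping excluded subpaths."""
    root = Path(directory)

    def walk(folder):
        node = {"name": folder.name, "templates": [], "children": []}
        for entry in sorted(folder.iterdir()):
            relative = entry.relative_to(root).as_posix()
            if entry.is_dir():
                # Skip a directory when its relative path is excluded
                if relative in exclude_paths:
                    continue
                node["children"].append(walk(entry))
            elif entry.suffix == ".html":
                node["templates"].append(entry.name)
        return node

    return walk(root)


# Build a small throwaway template tree to try the sketch on
with tempfile.TemporaryDirectory() as tmp:
    templates = Path(tmp) / "templates"
    (templates / "blog").mkdir(parents=True)
    (templates / "server").mkdir()
    (templates / "index.html").write_text("home")
    (templates / "blog" / "index.html").write_text("dynamic")
    (templates / "server" / "index.html").write_text("server")

    tree = sketch_scan(templates, exclude_paths=["blog"])

print(tree)
```

Here the "blog" directory is excluded, so only the top-level template and the "server" subtree appear in the result.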

Local development

Running the project

This guide assumes that you are using dotrun to run your Flask app.

Include a relative path to the project

This example assumes both projects exist in the same directory.

In requirements.txt:

# Comment out package import
# canonicalwebteam.directory-parser==1.2.6

-e ../directory-parser

Run the project with the package directory mounted

dotrun -m /path/to/canonicalwebteam.directory-parser:../directory-parser

Linting and formatting

This project uses Tox to apply its standard linting and formatting rules:

pip3 install tox  # Install tox
tox -e lint       # Check the format of Python code
tox -e format     # Reformat the Python code
