Module for converting tables from HTML pages (/sveden) into matrices of bs4 tags

sveden-table-matrixizer

Extract and matrixize HTML tables from Russian educational organization pages (/sveden) with full colspan/rowspan support.

sveden-table-matrixizer is an asynchronous Python library that parses tables found on /sveden (сведения об образовательной организации) pages, expands colspan and rowspan attributes into clean two-dimensional matrices, and returns structured head and body cell grids. It gives you full control over edge cases via configurable callbacks.

Features

  • Async page fetching using aiohttp.
  • Full colspan/rowspan expansion – cells spanning multiple rows and/or columns are duplicated into the matrix.
  • Separate head and body matrix extraction – each table yields a head matrix (from <thead>) and a body matrix (from <tbody>).
  • Customizable error handling – replace the default reactions to missing headers, missing bodies, or multiple <thead>/<tbody> elements.
  • Lightweight – only depends on aiohttp and beautifulsoup4.
  • Works with any HTML – designed for /sveden, but usable on any page containing <table> elements.

Installation

pip install sveden-table-matrixizer

Quick Start

import asyncio
from sveden_table_matrixizer import matrixize_tables_from_page


async def main():
    url = "https://example.edu/sveden/"
    tables = await matrixize_tables_from_page(url)

    for i, table in enumerate(tables):
        print(f"Table {i + 1}:")
        print("  Head rows:", len(table.head))
        print("  Body rows:", len(table.body))
        # Access cells as list[list[bs4.Tag]]


asyncio.run(main())

Each MatrixizedTable contains:

  • head: list[list[Tag]] – rows of header cells (each row is a list of bs4.Tag).
  • body: list[list[Tag]] – rows of body cells.
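
Because every cell is a bs4.Tag, a common next step is to reduce a matrix to plain strings. The following is a minimal sketch, assuming only that each cell exposes bs4's get_text method; the helper name to_text_matrix is hypothetical and not part of the library.

```python
from typing import Any


def to_text_matrix(matrix: list[list[Any]]) -> list[list[str]]:
    """Convert a matrix of bs4.Tag cells into a matrix of stripped strings."""
    # Tag.get_text(separator, strip=True) joins nested text nodes with the
    # separator and trims surrounding whitespace from each piece.
    return [[cell.get_text(" ", strip=True) for cell in row] for row in matrix]
```

With a table from the Quick Start, this would be called as to_text_matrix(table.head) or to_text_matrix(table.body).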

Handling Edge Cases

By default, the extractor raises exceptions when a table lacks a header or body, or contains more than one <thead> / <tbody>. You can override this behavior with ExtractorOptions.

import asyncio

from sveden_table_matrixizer import ExtractorOptions, matrixize_tables_from_page


def handle_missing_header(table_tag, collected_tables):
    # Receives the offending <table> tag and a sequence of MatrixizedTable objects.
    print(f"Skipping table without <thead>: {table_tag.get('id', 'no id')}")


opts = ExtractorOptions(
    on_table_no_header=handle_missing_header,
    # other callbacks can be set similarly
)


async def main():
    url = "https://example.edu/sveden/"
    tables = await matrixize_tables_from_page(url, options=opts)
    print(f"Extracted {len(tables)} tables")


asyncio.run(main())

You can also supply async callbacks – the library automatically detects and awaits them.
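
The README does not say how the library tells sync and async callbacks apart. One common way to support both is to call the callback and await the result only when it is awaitable. The sketch below illustrates that pattern; call_maybe_async and both handlers are hypothetical names, and the library's internals may differ.

```python
import asyncio
import inspect
from typing import Any, Callable


async def call_maybe_async(callback: Callable[..., Any], *args: Any) -> Any:
    """Call a callback and await its result only if that result is awaitable."""
    result = callback(*args)
    if inspect.isawaitable(result):
        result = await result
    return result


def sync_handler(table_tag, collected_tables):
    return "sync"


async def async_handler(table_tag, collected_tables):
    await asyncio.sleep(0)  # stand-in for real async work
    return "async"


async def demo() -> list[str]:
    # Both kinds of callback go through the same call site.
    return [
        await call_maybe_async(sync_handler, None, []),
        await call_maybe_async(async_handler, None, []),
    ]
```

Checking the returned value rather than the function object also covers callables that return a coroutine without being declared with async def.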

API Reference

matrixize_tables_from_page(url, *, options=None)

  • Parameters:
    • url (str) – URL of the page to scrape.
    • options (ExtractorOptions, optional) – configuration callbacks.
  • Returns: list[MatrixizedTable] – extracted and matrixized tables.

ExtractorOptions

A frozen dataclass with the following fields (all optional):

  • on_table_no_header
    • Type: Callable[[Tag, Sequence[MatrixizedTable]], Any]
    • Default: no-op
    • Description: called when a table has no <thead>.
  • on_table_no_body
    • Type: Callable[[Tag, Sequence[MatrixizedTable]], Any]
    • Default: no-op
    • Description: called when a table has no <tbody>.
  • on_multiply_table_headers
    • Type: Callable[[Sequence[Tag]], Tag | Never]
    • Default: raises MultiplyTableHeaderError
    • Description: called when more than one <thead> is found; must return a single <thead> element.
  • on_multiply_table_bodies
    • Type: Callable[[Sequence[Tag]], Tag | Never]
    • Default: raises MultiplyTableBodyError
    • Description: called when more than one <tbody> is found; must return a single <tbody> element.
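
The two on_multiply_* callbacks receive the candidate elements and must return exactly one of them. A minimal sketch of such a resolver is shown below. The name keep_first is hypothetical, and the sketch assumes the candidates arrive in document order, which the README does not state.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def keep_first(candidates: Sequence[T]) -> T:
    """Resolve duplicates by keeping the first candidate."""
    # Assumption: the first element is the first <thead>/<tbody> in the document.
    return candidates[0]


# Intended wiring (needs the library installed, so shown as a comment):
# opts = ExtractorOptions(
#     on_multiply_table_headers=keep_first,
#     on_multiply_table_bodies=keep_first,
# )
```

Returning one of the received elements, rather than raising, lets extraction continue for tables with repeated sections.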

MatrixizedTable

@dataclass(frozen=True, kw_only=True)
class MatrixizedTable:
    head: list[list[Tag]]  # matrix of <th> tags
    body: list[list[Tag]]  # matrix of <td> tags

Exceptions

  • NoTableHeaderError – raised when a table has no <thead> and options.on_table_no_header is not overridden.
  • NoTableBodyError – raised when a table has no <tbody> and options.on_table_no_body is not overridden.
  • MultiplyTableHeaderError – default reaction to multiple <thead> elements.
  • MultiplyTableBodyError – default reaction to multiple <tbody> elements.

All exceptions are exported from sveden_table_matrixizer.errors.

How It Works

  1. The page is fetched with aiohttp and parsed by BeautifulSoup.
  2. All <table> tags are collected.
  3. For each table:
    • The <thead> is located; if missing or duplicate, the appropriate callback is invoked.
    • The <tbody> is located similarly.
    • Header rows are expanded: each <th> with colspan/rowspan is replicated into the correct cells of a 2D list.
    • The same expansion is applied to body rows using <td> elements.
  4. A MatrixizedTable(head=..., body=...) is created and added to the result list.
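
The span expansion in step 3 can be sketched with the standard library alone. This is an illustration of the general technique, not the library's actual code: Cell is a hypothetical stand-in for a <th> or <td> tag, and expand_spans is a hypothetical helper name.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """Hypothetical stand-in for a <th>/<td> tag with its span attributes."""
    text: str
    colspan: int = 1
    rowspan: int = 1


def expand_spans(rows: list[list[Cell]]) -> list[list[Cell | None]]:
    """Duplicate each spanning cell into every grid position it covers."""
    grid: dict[tuple[int, int], Cell] = {}
    for r, row in enumerate(rows):
        c = 0
        for cell in row:
            # Skip positions already claimed by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(cell.rowspan):
                for dc in range(cell.colspan):
                    grid[(r + dr, c + dc)] = cell
            c += cell.colspan
    if not grid:
        return []
    height = max(r for r, _ in grid) + 1
    width = max(c for _, c in grid) + 1
    # Positions that no cell covers (ragged tables) are left as None.
    return [[grid.get((r, c)) for c in range(width)] for r in range(height)]


header = [
    [Cell("Name", rowspan=2), Cell("Places", colspan=2)],
    [Cell("Budget"), Cell("Paid")],
]
matrix = [[cell.text if cell else None for cell in row] for row in expand_spans(header)]
# matrix == [["Name", "Places", "Places"], ["Name", "Budget", "Paid"]]
```

The same cell object appears at every position it spans, which mirrors the "duplicated into the matrix" behavior described in Features.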

Dependencies

  • aiohttp
  • beautifulsoup4

License

This project is licensed under the MIT License – see the source repository for details.
