Module for converting tables from HTML pages (/sveden) into matrices of bs4 tags

sveden-table-matrixizer

Extract and matrixize HTML tables from Russian educational organization pages (/sveden) with full colspan/rowspan support.

sveden-table-matrixizer is an asynchronous Python library that parses tables found on /sveden ("information about the educational organization") pages, expands colspan and rowspan attributes into rectangular two-dimensional matrices, and returns the head and body of each table as separate cell grids. Configurable callbacks let you decide how malformed tables are handled.

Features

  • Async page fetching using aiohttp.
  • Full colspan/rowspan expansion – cells spanning multiple rows and/or columns are duplicated into the matrix.
  • Separate head and body matrix extraction – each table yields a head matrix (from <thead>) and a body matrix (from <tbody>).
  • Customizable error handling – replace the default reactions to a missing header, a missing body, or multiple <thead>/<tbody> elements.
  • Lightweight – only depends on aiohttp and beautifulsoup4.
  • Works with any HTML – designed for /sveden, but usable on any page containing <table> elements.

Installation

pip install sveden-table-matrixizer

Quick Start

import asyncio
from sveden_table_matrixizer import matrixize_tables_from_page


async def main():
    url = "https://example.edu/sveden/"
    tables = await matrixize_tables_from_page(url)

    for i, table in enumerate(tables):
        print(f"Table {i + 1}:")
        print("  Head rows:", len(table.head))
        print("  Body rows:", len(table.body))
        # Access cells as list[list[bs4.Tag]]


asyncio.run(main())

Each MatrixizedTable contains:

  • head: list[list[Tag]] – rows of header cells (each row is a list of bs4.Tag).
  • body: list[list[Tag]] – rows of body cells.
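
Because the cells are bs4 tags rather than strings, a common next step is to reduce a matrix to plain text. The following is a minimal sketch of that step; matrix_to_text is a hypothetical helper, not part of the library, and the body matrix is built by hand from a small HTML snippet so the example runs without fetching a page.

```python
from bs4 import BeautifulSoup, Tag


def matrix_to_text(matrix: list[list[Tag]]) -> list[list[str]]:
    """Replace every tag in a cell matrix with its stripped text."""
    return [[cell.get_text(strip=True) for cell in row] for row in matrix]


# Stand-in for a `body` matrix: one row of two <td> tags parsed by hand.
soup = BeautifulSoup(
    "<table><tr><td> Program </td><td>Places</td></tr></table>", "html.parser"
)
body = [soup.find_all("td")]

print(matrix_to_text(body))
```

The same helper should work on table.head and table.body alike, since both are described as list[list[Tag]].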

Handling Edge Cases

By default, the extractor raises an exception when a table lacks a header or body, or contains more than one <thead> or <tbody>. You can override each of these reactions with ExtractorOptions.

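import asyncio
from collections.abc import Sequence

from bs4 import Tag
from sveden_table_matrixizer import ExtractorOptions, matrixize_tables_from_page
from sveden_table_matrixizer.def_funcs import MatrixizedTable


def handle_missing_header(
    table_tag: Tag, collected_tables: Sequence[MatrixizedTable]
) -> None:
    # Log the problem instead of letting the default reaction raise.
    print(f"Skipping table without <thead>: {table_tag.get('id', 'no id')}")


opts = ExtractorOptions(
    on_table_no_header=handle_missing_header,
    # other callbacks can be set similarly
)


async def main():
    url = "https://example.edu/sveden/"
    tables = await matrixize_tables_from_page(url, options=opts)
    print(f"Extracted {len(tables)} tables")


asyncio.run(main())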

You can also supply async callbacks – the library automatically detects and awaits them.
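
One plausible way to support both kinds of callback is to call the function and await the result only if it is awaitable. The sketch below illustrates that pattern with the standard library; call_maybe_async and both handlers are hypothetical names, and this is an assumption about the general technique, not the library's actual code.

```python
import asyncio
import inspect
from typing import Any, Callable


async def call_maybe_async(callback: Callable[..., Any], *args: Any) -> Any:
    """Call `callback`; if it returned an awaitable, await it (hypothetical helper)."""
    result = callback(*args)
    if inspect.isawaitable(result):
        result = await result
    return result


def sync_handler(name: str) -> str:
    return f"sync:{name}"


async def async_handler(name: str) -> str:
    await asyncio.sleep(0)  # stands in for real asynchronous work
    return f"async:{name}"


async def demo() -> list[str]:
    # Both handlers go through the same call site.
    return [
        await call_maybe_async(sync_handler, "table-1"),
        await call_maybe_async(async_handler, "table-2"),
    ]


results = asyncio.run(demo())
print(results)
```

Checking the return value rather than the function itself also covers callables that are not declared with async def but still return a coroutine.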

API Reference

matrixize_tables_from_page(url, *, options=None)

  • Parameters:
    • url (str) – URL of the page to scrape.
    • options (ExtractorOptions, optional) – configuration callbacks.
  • Returns: list[MatrixizedTable] – extracted and matrixized tables.

ExtractorOptions

A frozen dataclass with the following fields (all optional):

  • on_table_no_header
    • Type: Callable[[Tag, Sequence[MatrixizedTable]], Any]
    • Default: raises NoTableHeaderError
    • Description: called when a table has no <thead>.
  • on_table_no_body
    • Type: Callable[[Tag, Sequence[MatrixizedTable]], Any]
    • Default: raises NoTableBodyError
    • Description: called when a table has no <tbody>.
  • on_multiply_table_headers
    • Type: Callable[[Sequence[Tag]], Tag | Never]
    • Default: raises MultiplyTableHeaderError
    • Description: called when more than one <thead> is found; must return a single <thead> element.
  • on_multiply_table_bodies
    • Type: Callable[[Sequence[Tag]], Tag | Never]
    • Default: raises MultiplyTableBodyError
    • Description: called when more than one <tbody> is found; must return a single <tbody> element.
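
As an illustration of the "must return a single element" contract, here is a sketch of a callback that resolves duplicate headers by keeping the <thead> with the most rows. The function name pick_tallest_thead and the selection rule are assumptions made for this example; the HTML is a hand-written stand-in for a malformed table.

```python
from collections.abc import Sequence

from bs4 import BeautifulSoup, Tag


def pick_tallest_thead(theads: Sequence[Tag]) -> Tag:
    """Return the <thead> with the most rows; ties go to the earliest one."""
    return max(theads, key=lambda thead: len(thead.find_all("tr")))


html = """
<table>
  <thead id="a"><tr><th>Code</th></tr></thead>
  <thead id="b"><tr><th>Code</th></tr><tr><th>Name</th></tr></thead>
  <tbody><tr><td>01.03.02</td></tr></tbody>
</table>
"""
theads = BeautifulSoup(html, "html.parser").find_all("thead")
chosen = pick_tallest_thead(theads)
print(chosen["id"])
```

Given the signature documented for this field, such a function would be passed as ExtractorOptions(on_multiply_table_headers=pick_tallest_thead).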

MatrixizedTable

@dataclass(frozen=True, kw_only=True)
class MatrixizedTable:
    head: list[list[Tag]]  # matrix of <th> tags
    body: list[list[Tag]]  # matrix of <td> tags

Exceptions

  • NoTableHeaderError – raised when a table has no <thead> and options.on_table_no_header is not overridden.
  • NoTableBodyError – raised when a table has no <tbody> and options.on_table_no_body is not overridden.
  • MultiplyTableHeaderError – default reaction to multiple <thead> elements.
  • MultiplyTableBodyError – default reaction to multiple <tbody> elements.

All exceptions are exported from sveden_table_matrixizer.errors.

How It Works

  1. The page is fetched with aiohttp and parsed by BeautifulSoup.
  2. All <table> tags are collected.
  3. For each table:
    • The <thead> is located; if missing or duplicate, the appropriate callback is invoked.
    • The <tbody> is located similarly.
    • Header rows are expanded: each <th> with colspan/rowspan is replicated into the correct cells of a 2D list.
    • The same expansion is applied to body rows using <td> elements.
  4. A MatrixizedTable(head=..., body=...) is created and added to the result list.
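
The span expansion in step 3 can be sketched independently of HTML parsing. The example below is a simplified illustration of the general technique, not the library's implementation: Cell is a hypothetical stand-in for a <th> or <td> tag that carries only its text and span values, and expand_spans is a hypothetical name.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    """Stand-in for a <th>/<td> tag: its text plus span attributes."""
    text: str
    colspan: int = 1
    rowspan: int = 1


def expand_spans(rows: list[list[Cell]]) -> list[list[Cell]]:
    """Expand colspan/rowspan so every grid position holds the cell covering it."""
    grid: dict[tuple[int, int], Cell] = {}
    for r, row in enumerate(rows):
        c = 0
        for cell in row:
            # Skip positions already claimed by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            # Copy the cell into every position its spans cover.
            for dr in range(cell.rowspan):
                for dc in range(cell.colspan):
                    grid[(r + dr, c + dc)] = cell
            c += cell.colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    # Positions that no cell covers (ragged tables) get an empty cell.
    empty = Cell("")
    return [[grid.get((r, c), empty) for c in range(n_cols)] for r in range(n_rows)]


header = [
    [Cell("Program", rowspan=2), Cell("Places", colspan=2)],
    [Cell("Budget"), Cell("Paid")],
]
for row in expand_spans(header):
    print([cell.text for cell in row])
```

For this two-row header the result is a 2 × 3 grid: "Program" fills the first column of both rows, and "Places" fills the second and third columns of the first row above "Budget" and "Paid".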

Dependencies

  • aiohttp
  • beautifulsoup4

License

This project is licensed under the MIT License – see the source repository for details.
