Module for converting tables from HTML pages (/sveden) into matrices of bs4 tags

sveden-table-matrixizer

Extract and matrixize HTML tables from Russian educational organization pages (/sveden) with full colspan/rowspan support.

sveden-table-matrixizer is an asynchronous Python library that parses tables found on /sveden ("information about the educational organization") pages, expands colspan and rowspan attributes into clean two-dimensional matrices, and returns structured head and body cell grids. Configurable callbacks let you decide how edge cases are handled.

Features

Async page fetching using aiohttp.
Full colspan/rowspan expansion – cells spanning multiple rows and/or columns are duplicated into the matrix.
Separate head and body matrix extraction – each table yields a head matrix (from <thead>) and a body matrix (from <tbody>).
Customizable error handling – replace the default reactions to missing headers, missing bodies, or multiple <thead>/<tbody> elements.
Lightweight – only depends on aiohttp and beautifulsoup4.
Works with any HTML – designed for /sveden, but usable on any page containing <table> elements.

Installation

pip install sveden-table-matrixizer

Quick Start

import asyncio
from sveden_table_matrixizer import matrixize_tables_from_page


async def main():
    url = "https://example.edu/sveden/"
    tables = await matrixize_tables_from_page(url)

    for i, table in enumerate(tables):
        print(f"Table {i + 1}:")
        print("  Head rows:", len(table.head))
        print("  Body rows:", len(table.body))
        # Access cells as list[list[bs4.Tag]]


asyncio.run(main())

Each MatrixizedTable contains:

head: list[list[Tag]] – rows of header cells (each row is a list of bs4.Tag).
body: list[list[Tag]] – rows of body cells.
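
Because every cell is a bs4.Tag, a common next step is to flatten a matrix into plain strings. The following is a minimal sketch and not part of the library: `matrix_to_text` and `HasText` are hypothetical names, and the helper relies only on the standard `Tag.get_text(strip=True)` method, so it accepts any object that exposes that method.

```python
from typing import Protocol, Sequence


class HasText(Protocol):
    """Anything with a bs4-style get_text method (bs4.Tag satisfies this)."""

    def get_text(self, separator: str = "", strip: bool = False) -> str: ...


def matrix_to_text(matrix: Sequence[Sequence[HasText]]) -> list[list[str]]:
    """Convert a head or body matrix of tags into a matrix of stripped strings."""
    return [[cell.get_text(strip=True) for cell in row] for row in matrix]
```

With a MatrixizedTable in hand, `matrix_to_text(table.head)` and `matrix_to_text(table.body)` would give grids that are easy to print or write to CSV.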

Handling Edge Cases

By default, the extractor raises exceptions when a table lacks a header or body, or contains more than one <thead> / <tbody>. You can override this behavior with ExtractorOptions.

import asyncio

from sveden_table_matrixizer import ExtractorOptions, matrixize_tables_from_page


def handle_missing_header(table_tag, collected_tables):
    # table_tag is the <table> element that has no <thead>.
    print(f"Skipping table without <thead>: {table_tag.get('id', 'no id')}")


opts = ExtractorOptions(
    on_table_no_header=handle_missing_header,
    # other callbacks can be set similarly
)


async def main():
    url = "https://example.edu/sveden/"
    tables = await matrixize_tables_from_page(url, options=opts)
    print(f"Extracted {len(tables)} table(s)")


asyncio.run(main())

You can also supply async callbacks – the library automatically detects and awaits them.
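
One common way to accept both sync and async callbacks is to call the callback and await the result only when it is awaitable. The sketch below illustrates that pattern with the standard library. It is not necessarily how the library does it: `call_maybe_async` and both handlers are hypothetical names, and the string arguments merely stand in for a table tag.

```python
import asyncio
import inspect
from typing import Any, Callable


async def call_maybe_async(callback: Callable[..., Any], *args: Any) -> Any:
    """Call callback(*args); if the result is awaitable, await it."""
    result = callback(*args)
    if inspect.isawaitable(result):
        result = await result
    return result


def sync_handler(table_tag, collected_tables):
    return f"sync:{table_tag}"


async def async_handler(table_tag, collected_tables):
    await asyncio.sleep(0)  # stand-in for real async work
    return f"async:{table_tag}"


async def demo() -> list[str]:
    # Both kinds of callback go through the same code path.
    return [
        await call_maybe_async(sync_handler, "t1", []),
        await call_maybe_async(async_handler, "t2", []),
    ]
```

Checking the returned value rather than the function itself also covers callables that return a coroutine without being declared with `async def`.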

API Reference

`matrixize_tables_from_page(url, *, options=None)`

Parameters:
- url (str) – URL of the page to scrape.
- options (ExtractorOptions, optional) – configuration callbacks.
Returns: list[MatrixizedTable] – extracted and matrixized tables.

`ExtractorOptions`

A frozen dataclass with the following fields (all optional):

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `on_table_no_header` | `Callable[[Tag, Sequence[MatrixizedTable]], Any]` | no-op | Called when a table has no `<thead>`. |
| `on_table_no_body` | `Callable[[Tag, Sequence[MatrixizedTable]], Any]` | no-op | Called when a table has no `<tbody>`. |
| `on_multiply_table_headers` | `Callable[[Sequence[Tag]], Tag \| Never]` | raises `MultiplyTableHeaderError` | Called when more than one `<thead>` is found; must return a single `<thead>` element. |
| `on_multiply_table_bodies` | `Callable[[Sequence[Tag]], Tag \| Never]` | raises `MultiplyTableBodyError` | Called when more than one `<tbody>` is found; must return a single `<tbody>` element. |
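
As an illustration of the last two fields, a callback for `on_multiply_table_headers` might select one `<thead>` instead of raising. The sketch below keeps the one with the most rows. `pick_largest_thead` and `HasFindAll` are hypothetical names, and the selection rule is an assumption chosen for the example. The function uses only `Tag.find_all`, so it can be tried without fetching a page.

```python
from typing import Protocol, Sequence


class HasFindAll(Protocol):
    """Minimal bs4.Tag-like interface used by the callback."""

    def find_all(self, name: str) -> list: ...


def pick_largest_thead(theads: Sequence[HasFindAll]) -> HasFindAll:
    """Return the <thead> with the most <tr> rows (the first one wins ties)."""
    return max(theads, key=lambda thead: len(thead.find_all("tr")))


# Wiring it in, using the field name documented for ExtractorOptions:
# opts = ExtractorOptions(on_multiply_table_headers=pick_largest_thead)
```

The same shape would work for `on_multiply_table_bodies`, with `<tbody>` elements in place of `<thead>` elements.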

`MatrixizedTable`

from dataclasses import dataclass

from bs4 import Tag


@dataclass(frozen=True, kw_only=True)
class MatrixizedTable:
    head: list[list[Tag]]  # matrix of <th> tags
    body: list[list[Tag]]  # matrix of <td> tags

Exceptions

NoTableHeaderError – raised when a table has no <thead> and options.on_table_no_header is not overridden.
NoTableBodyError – raised when a table has no <tbody> and options.on_table_no_body is not overridden.
MultiplyTableHeaderError – default reaction to multiple <thead> elements.
MultiplyTableBodyError – default reaction to multiple <tbody> elements.

All exceptions are exported from sveden_table_matrixizer.errors.

How It Works

1. The page is fetched with aiohttp and parsed by BeautifulSoup.
2. All <table> tags are collected.
3. For each table:
- The <thead> is located; if missing or duplicate, the appropriate callback is invoked.
- The <tbody> is located similarly.
- Header rows are expanded: each <th> with colspan/rowspan is replicated into the correct cells of a 2D list.
- The same expansion is applied to body rows using <td> elements.
4. A MatrixizedTable(head=..., body=...) is created and added to the result list.
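
The expansion step can be sketched independently of HTML parsing. The code below is a simplified illustration under stated assumptions, not the library's implementation: `Cell` and `expand_spans` are hypothetical names, each cell carries its colspan and rowspan as plain integers, and the same object is placed into every grid slot it covers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """Hypothetical stand-in for a <th>/<td> tag with its span attributes."""

    text: str
    colspan: int = 1
    rowspan: int = 1


def expand_spans(rows: list[list[Cell]]) -> list[list[Cell]]:
    """Expand colspan/rowspan so every grid slot holds the cell covering it."""
    grid: list[dict[int, Cell]] = [{} for _ in rows]
    for r, row in enumerate(rows):
        col = 0
        for cell in row:
            # Skip slots already claimed by a rowspan from an earlier row.
            while col in grid[r]:
                col += 1
            # Clamp rowspan so a cell never extends past the last row.
            last_row = min(r + cell.rowspan, len(rows))
            for rr in range(r, last_row):
                for cc in range(col, col + cell.colspan):
                    grid[rr][cc] = cell
            col += cell.colspan
    # Turn each row's {column: cell} mapping into a list ordered by column.
    return [[slots[c] for c in sorted(slots)] for slots in grid]


example = [
    [Cell("A", rowspan=2), Cell("B", colspan=2)],
    [Cell("B1"), Cell("B2")],
]
expanded = expand_spans(example)
texts = [[cell.text for cell in row] for row in expanded]
# texts == [["A", "B", "B"], ["A", "B1", "B2"]]
```

Storing the same object in each covered slot mirrors the "duplicated into the matrix" behaviour described under Features; a real implementation would read the span values from the tag attributes.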

Dependencies

Python ≥ 3.11
aiohttp
beautifulsoup4

License

This project is licensed under the MIT License – see the source repository for details.

Release history

1.0.2 – May 9, 2026
1.0.1 – May 8, 2026 (this version)
1.0.0 – May 8, 2026
