Module for converting HTML tables (from /sveden pages) into a matrix of BeautifulSoup tags
sveden-table-matrixizer
Extract and matrixize HTML tables from Russian educational organization pages (/sveden) with full colspan/rowspan
support.
sveden-table-matrixizer is an asynchronous Python library that parses tables found on /sveden (сведения об
образовательной организации, "information about the educational organization") pages, expands colspan and rowspan
attributes into clean two-dimensional matrices, and returns structured head and body cell grids. Edge cases are
handled through configurable callbacks.
Features
- Async page fetching using aiohttp.
- Full colspan/rowspan expansion – cells spanning multiple rows and/or columns are duplicated into the matrix.
- Separate head and body matrix extraction – each table yields a head matrix (from <thead>) and a body matrix (from <tbody>).
- Customizable error handling – replace default reactions to missing headers, missing bodies, or multiple <thead>/<tbody> elements.
- Lightweight – only depends on aiohttp and beautifulsoup4.
- Works with any HTML – designed for /sveden, but usable on any page containing <table> elements.
Installation
pip install sveden-table-matrixizer
Quick Start
import asyncio

from sveden_table_matrixizer import matrixize_tables_from_page

async def main():
    url = "https://example.edu/sveden/"
    tables = await matrixize_tables_from_page(url)
    for i, table in enumerate(tables):
        print(f"Table {i + 1}:")
        print(" Head rows:", len(table.head))
        print(" Body rows:", len(table.body))
        # Access cells as list[list[bs4.Tag]]

asyncio.run(main())
Each MatrixizedTable contains:
- head: list[list[Tag]] – rows of header cells (each row is a list of bs4.Tag).
- body: list[list[Tag]] – rows of body cells.
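Because every cell is a bs4.Tag, a common next step is to reduce a matrix to plain strings. The following is a minimal sketch, assuming only that each cell exposes BeautifulSoup's get_text() method; matrix_to_text is a hypothetical helper name and is not part of the library.

```python
from typing import Any, Sequence


def matrix_to_text(matrix: Sequence[Sequence[Any]]) -> list[list[str]]:
    """Return the stripped text of every cell, preserving the grid shape.

    Hypothetical helper: it relies only on each cell having a
    BeautifulSoup-style ``get_text(strip=True)`` method, as bs4.Tag does.
    """
    return [[cell.get_text(strip=True) for cell in row] for row in matrix]
```

Calling matrix_to_text(table.body) on a MatrixizedTable would give one list of strings per row. Since spanning cells are duplicated into the matrix, their text appears once for every position they cover.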
Handling Edge Cases
By default, the extractor raises exceptions when a table lacks a header or body, or contains more than one <thead> /
<tbody>. You can override this behavior with ExtractorOptions.
import asyncio

from sveden_table_matrixizer import matrixize_tables_from_page, ExtractorOptions
from sveden_table_matrixizer.def_funcs import MatrixizedTable

def handle_missing_header(table_tag, collected_tables):
    print(f"Skipping table without <thead>: {table_tag.get('id', 'no id')}")

opts = ExtractorOptions(
    on_table_no_header=handle_missing_header,
    # other callbacks can be set similarly
)

async def main():
    url = "https://example.edu/sveden/"  # same placeholder URL as in Quick Start
    tables = await matrixize_tables_from_page(url, options=opts)

asyncio.run(main())
You can also supply async callbacks – the library automatically detects and awaits them.
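How the library performs this detection internally is not shown here. One common way to accept both kinds of callback is to call it and await the result only when that result is awaitable. The sketch below illustrates that pattern under this assumption; call_maybe_async, sync_cb, and async_cb are hypothetical names.

```python
import asyncio
import inspect
from typing import Any, Callable


async def call_maybe_async(callback: Callable[..., Any], *args: Any) -> Any:
    """Invoke a callback and await its result only when it is awaitable."""
    result = callback(*args)
    if inspect.isawaitable(result):
        result = await result
    return result


def sync_cb(name: str) -> str:
    return f"sync:{name}"


async def async_cb(name: str) -> str:
    await asyncio.sleep(0)  # stand-in for real asynchronous work
    return f"async:{name}"


async def demo() -> list[str]:
    return [
        await call_maybe_async(sync_cb, "t1"),
        await call_maybe_async(async_cb, "t2"),
    ]


results = asyncio.run(demo())
print(results)
```

Checking the returned value rather than the function object means the pattern also covers ordinary functions that happen to return a coroutine.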
API Reference
matrixize_tables_from_page(url, *, options=None)
- Parameters:
  - url (str) – URL of the page to scrape.
  - options (ExtractorOptions, optional) – configuration callbacks.
- Returns:
  - list[MatrixizedTable] – extracted and matrixized tables.
ExtractorOptions
A frozen dataclass with the following fields (all optional):
| Field | Type | Default | Description |
|---|---|---|---|
| on_table_no_header | Callable[[Tag, Sequence[MatrixizedTable]], Any] | no‑op | Called when a table has no <thead> |
| on_table_no_body | Callable[[Tag, Sequence[MatrixizedTable]], Any] | no‑op | Called when a table has no <tbody> |
| on_multiply_table_headers | Callable[[Sequence[Tag]], Tag \| Never] | raises MultiplyTableHeaderError | Called when more than one <thead> is found; must return a single <thead> element. |
| on_multiply_table_bodies | Callable[[Sequence[Tag]], Tag \| Never] | raises MultiplyTableBodyError | Called when more than one <tbody> is found; must return a single <tbody> element. |
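As an illustration of the last two fields, a callback that resolves duplicates by keeping the first element might look like the sketch below. pick_first is a hypothetical name, and the commented-out wiring assumes the ExtractorOptions fields listed in the table.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def pick_first(candidates: Sequence[T]) -> T:
    """Resolve duplicate <thead>/<tbody> elements by keeping the first one."""
    if not candidates:
        raise ValueError("expected at least one candidate element")
    return candidates[0]


# Wiring it up (requires the library to be installed):
# opts = ExtractorOptions(
#     on_multiply_table_headers=pick_first,
#     on_multiply_table_bodies=pick_first,
# )
```

Keeping the first element is only one possible policy; a callback could equally pick the element with the most rows or raise its own exception.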
MatrixizedTable
@dataclass(frozen=True, kw_only=True)
class MatrixizedTable:
    head: list[list[Tag]]  # matrix of <th> tags
    body: list[list[Tag]]  # matrix of <td> tags
Exceptions
- NoTableHeaderError – raised when options.on_table_no_header is not overridden.
- NoTableBodyError – raised when options.on_table_no_body is not overridden.
- MultiplyTableHeaderError – default reaction to multiple <thead> elements.
- MultiplyTableBodyError – default reaction to multiple <tbody> elements.
All exceptions are exported from sveden_table_matrixizer.errors.
How It Works
- The page is fetched with aiohttp and parsed by BeautifulSoup.
- All <table> tags are collected.
- For each table:
  - The <thead> is located; if missing or duplicate, the appropriate callback is invoked.
  - The <tbody> is located similarly.
  - Header rows are expanded: each <th> with colspan/rowspan is replicated into the correct cells of a 2D list.
  - The same expansion is applied to body rows using <td> elements.
- A MatrixizedTable(head=..., body=...) is created and added to the result list.
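The expansion step can be sketched independently of the library. The code below is an illustrative approximation, not the library's actual implementation: Cell is a hypothetical stand-in for a bs4.Tag carrying colspan and rowspan values, and expand_spans is a hypothetical function name. It assumes reasonably well-formed markup and fills any position that no cell covers with None.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    """Stand-in for a <th>/<td> tag (hypothetical; the library uses bs4.Tag)."""
    text: str
    colspan: int = 1
    rowspan: int = 1


def expand_spans(rows: list[list[Cell]]) -> list[list[Optional[Cell]]]:
    """Duplicate spanning cells so every covered grid position holds a cell."""
    occupied: dict[tuple[int, int], Cell] = {}
    for r, row in enumerate(rows):
        c = 0
        for cell in row:
            # Skip positions already claimed by a rowspan from an earlier row.
            while (r, c) in occupied:
                c += 1
            # Clamp rowspan so a cell never extends past the last row.
            height = min(cell.rowspan, len(rows) - r)
            for dr in range(height):
                for dc in range(cell.colspan):
                    occupied[(r + dr, c + dc)] = cell
            c += cell.colspan
    width = 1 + max((col for _, col in occupied), default=-1)
    # Positions left uncovered by malformed markup are filled with None.
    return [[occupied.get((r, c)) for c in range(width)] for r in range(len(rows))]


header = [
    [Cell("Name", rowspan=2), Cell("Score", colspan=2)],
    [Cell("Min"), Cell("Max")],
]
for line in expand_spans(header):
    print([cell.text if cell else None for cell in line])
```

In this example the "Name" cell occupies the first column of both rows and "Score" occupies two columns of the first row, so the result is a rectangular 2×3 grid in which the same cell object appears at every position it spans.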
Dependencies
- Python ≥ 3.11
- aiohttp
- beautifulsoup4
License
This project is licensed under the MIT License – see the source repository for details.