A Polars IO plugin for reading MARC 21 files
This project provides a toolkit for efficiently processing bibliographic
records encoded in MARC 21, which is a popular file format used
to exchange bibliographic data between libraries. In particular, the
command line tool marc21 allows efficient filtering of records and
extraction of data into a rectangular schema. Since the extracted data
is in tabular form, it can be processed with popular frameworks such as
Polars or Tidyverse. In addition, the Python package polars-marc21
provides a Polars extension that allows you to use the query syntax to
create a DataFrame, without using the command line.
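For readers unfamiliar with the format, the sketch below shows how a MARC 21 record is laid out in a file (ISO 2709): a 24-byte leader, a directory of 12-byte entries, and the field data. It is a pure-Python illustration, not the parser used by marc21-rs. The helper names `build_record` and `parse_record` are hypothetical, and the sample record is made up for the example.

```python
FIELD_TERMINATOR = b"\x1e"
RECORD_TERMINATOR = b"\x1d"
SUBFIELD_DELIMITER = b"\x1f"


def build_record(fields):
    """Assemble a minimal record from (tag, raw field bytes) pairs."""
    directory, data = b"", b""
    for tag, value in fields:
        body = value + FIELD_TERMINATOR
        # Directory entry: tag (3), field length (4), start offset (5).
        directory += f"{tag}{len(body):04d}{len(data):05d}".encode("ascii")
        data += body
    base = 24 + len(directory) + 1  # leader + directory + its terminator
    total = base + len(data) + 1  # plus the record terminator
    leader = f"{total:05d}nz  a22{base:05d}n  4500".encode("ascii")
    return leader + directory + FIELD_TERMINATOR + data + RECORD_TERMINATOR


def parse_record(raw):
    """Decode one record into a list of (tag, value) pairs."""
    base = int(raw[12:17])  # base address of data, leader bytes 12-16
    directory = raw[24:base - 1]
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[0:3].decode("ascii")
        size, start = int(entry[3:7]), int(entry[7:12])
        body = raw[base + start:base + start + size - 1]  # drop terminator
        if tag < "010":
            # Control fields (00X) carry a plain value.
            fields.append((tag, body.decode("utf-8")))
        else:
            # Data fields: two indicator bytes, then coded subfields.
            subfields = [
                (chunk[:1].decode("ascii"), chunk[1:].decode("utf-8"))
                for chunk in body[2:].split(SUBFIELD_DELIMITER)
                if chunk
            ]
            fields.append((tag, subfields))
    return fields


# Illustrative record: a control number and one 075 data field.
sample = build_record([
    ("001", b"118540238"),
    ("075", b"  \x1fbp\x1f2gndgen"),
])
print(parse_record(sample))
```

The first five bytes of the leader hold the total record length, which is how a reader can step from one record to the next in a file.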
marc21-rs is developed by the Metadata Department of the German
National Library (DNB). It is used for data analysis and for automating
metadata workflows (data engineering) as part of automatic content
indexing.
The marc21 tool provides the following commands:
- concat: Concatenate records from multiple inputs (alias cat)
- count: Print the number of records in the input data (alias cnt)
- dedup: Remove duplicate records from the input
- describe: Creates a frequency table of all subfield codes
- filter: Filter records that fulfill a specified condition
- frequency: Compute a frequency table of values (alias freq)
- hash: Compute SHA-256 checksum of records
- invalid: Output invalid records that cannot be decoded
- partition: Partition records by values
- print: Print records in human readable format
- sample: Select a random permutation of records
- split: Split the input into chunks of a given size
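To give a feel for what commands such as count, hash, and dedup do conceptually, here is a minimal pure-Python sketch. It is not the marc21 implementation: the function names are hypothetical, records are split on the MARC 21 record terminator (0x1D) rather than by the leader's length field, and duplicates are assumed to be records with identical bytes, which may differ from the criterion the tool applies.

```python
import hashlib

RECORD_TERMINATOR = b"\x1d"


def iter_records(stream):
    """Yield raw records, each including its trailing terminator."""
    start = 0
    while (end := stream.find(RECORD_TERMINATOR, start)) != -1:
        yield stream[start:end + 1]
        start = end + 1


def count_records(stream):
    """Number of terminated records in the byte stream."""
    return sum(1 for _ in iter_records(stream))


def hash_records(stream):
    """SHA-256 hex digest of each record, in input order."""
    return [hashlib.sha256(record).hexdigest() for record in iter_records(stream)]


def dedup_records(stream):
    """Yield each distinct record once, keeping the first occurrence."""
    seen = set()
    for record in iter_records(stream):
        digest = hashlib.sha256(record).digest()
        if digest not in seen:
            seen.add(digest)
            yield record


# Placeholder payloads stand in for real MARC 21 records.
stream = b"rec-a\x1drec-b\x1drec-a\x1d"
print(count_records(stream))
print(list(dedup_records(stream)))
```

Hashing whole records keeps the set of seen items small even when the records themselves are large.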
The polars-marc21 package uses the query engine to transform MARC 21 records directly into a DataFrame:
>>> from polars_marc21 import scan_marc21
>>>
>>> filename = "DUMP.mrc.gz"
>>> query = "001, 075{ b | 2 == 'gndgen' }"
>>> header = "ppn,gndgen"
>>>
>>> df = scan_marc21(filename, query, header).collect()
>>> print(df)
shape: (7, 2)
┌───────────┬────────┐
│ ppn       ┆ gndgen │
│ ---       ┆ ---    │
│ str       ┆ str    │
╞═══════════╪════════╡
│ 118540238 ┆ p      │
│ 118572121 ┆ p      │
│ 118607626 ┆ p      │
│ 118632477 ┆ p      │
│ 040992020 ┆ u      │
│ 040992918 ┆ u      │
│ 040993396 ┆ u      │
└───────────┴────────┘
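One way to read the query in the example above is: take the value of control field 001, plus subfield b of every 075 field whose subfield 2 equals 'gndgen'. The sketch below spells out that reading in plain Python over hypothetical in-memory records. It does not use the polars-marc21 API, the record layout is an assumption made for illustration, and the row semantics of the actual query engine may differ.

```python
# Hypothetical records: control field 001 plus a list of 075 fields,
# each field given as a mapping from subfield code to value.
records = [
    {
        "001": "118540238",
        "075": [
            {"b": "p", "2": "gndgen"},
            {"b": "piz", "2": "gndspec"},
        ],
    },
    {
        "001": "040992020",
        "075": [{"b": "u", "2": "gndgen"}],
    },
]


def select_rows(records):
    """Yield (ppn, gndgen) pairs: 001 plus 075 $b where 075 $2 == 'gndgen'."""
    for record in records:
        for field in record.get("075", []):
            if field.get("2") == "gndgen" and "b" in field:
                yield record["001"], field["b"]


rows = list(select_rows(records))
print(rows)
```

The header string "ppn,gndgen" then supplies the column names for the two selected values.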
Check out the documentation to learn more about installing and using the tool.
Contributing
All contributors are required to "sign-off" their commits (using
git commit -s) to indicate that they have agreed to the Developer
Certificate of Origin.
This project has a strict no-AI / no-LLM policy. Please do not use large language models (LLMs) to create issues, patches, pull requests, or comments. Although English is the preferred language, you are welcome to communicate in your native language.
License
This project is licensed under the European Union Public License 1.2.
Download files
File details
Details for the file polars_marc21-0.1.0.tar.gz.
File metadata
- Download URL: polars_marc21-0.1.0.tar.gz
- Upload date:
- Size: 62.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.11.21 {"installer":{"name":"uv","version":"0.11.21","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 116f69fd55a9f30be9f80bca23ffa94a5a3da88b6ce8b9d7e5161680a6553984 |
| MD5 | df3c2536f3e4e9dfc8279e89c7904531 |
| BLAKE2b-256 | 685ea311071846c3503984c282ed960ff081060842e1b293c81b8cb35feb15c8 |
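Digests like the ones listed for this file can be checked locally after download. The following is a minimal sketch using only the Python standard library. The function name `file_digests` is hypothetical, MD5 is left out, and BLAKE2b-256 is taken to mean BLAKE2b with a 32-byte digest.

```python
import hashlib


def file_digests(path, chunk_size=1 << 20):
    """Return SHA256 and BLAKE2b-256 hex digests of the file at path."""
    sha256 = hashlib.sha256()
    blake2b = hashlib.blake2b(digest_size=32)
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha256.update(chunk)
            blake2b.update(chunk)
    return {
        "SHA256": sha256.hexdigest(),
        "BLAKE2b-256": blake2b.hexdigest(),
    }


# Usage, assuming the archive was saved in the current directory:
# digests = file_digests("polars_marc21-0.1.0.tar.gz")
# Compare digests["SHA256"] with the SHA256 value listed for the file.
```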
File details
Details for the file polars_marc21-0.1.0-cp39-abi3-manylinux_2_34_x86_64.whl.
File metadata
- Download URL: polars_marc21-0.1.0-cp39-abi3-manylinux_2_34_x86_64.whl
- Upload date:
- Size: 870.1 kB
- Tags: CPython 3.9+, manylinux: glibc 2.34+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.21 {"installer":{"name":"uv","version":"0.11.21","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Arch Linux","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5ea2b3fcb5363072a0bd3ec2082df9b79f81d52979650a5dd0c97faeff98cbc6 |
| MD5 | c046a66d12e8cf417616ee7a6649dff6 |
| BLAKE2b-256 | 775a61578003bc1daa61daceb4d34eddcd360187bcd447af9faad8a108a747f4 |