A Polars I/O plugin for reading MARC 21 files

Project description

This project provides a toolkit for efficiently processing bibliographic records encoded in MARC 21, a widely used format for exchanging bibliographic data between libraries. In particular, the command-line tool marc21 filters records and extracts data into a rectangular schema. Because the extracted data is tabular, it can be processed with popular frameworks such as Polars or the Tidyverse. In addition, the Python package polars-marc21 provides a Polars extension that lets you use the query syntax to create a DataFrame without using the command line.
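
For readers unfamiliar with the format, the following is a minimal pure-Python sketch of the MARC 21 exchange layout (a 24-byte leader, a directory of 12-byte entries, then the variable fields). It does not reflect how marc21-rs is implemented; the helper names `build_record` and `parse_record` are hypothetical, and the leader values are chosen only for illustration.

```python
FIELD_END = b"\x1e"   # terminates every field and the directory
RECORD_END = b"\x1d"  # terminates the record
SUBFIELD = b"\x1f"    # precedes each subfield code


def build_record(fields):
    """Serialize (tag, raw_bytes) pairs into one MARC 21 record (hypothetical helper)."""
    directory, body = b"", b""
    for tag, data in fields:
        data += FIELD_END
        # Directory entry: 3-char tag, 4-digit field length, 5-digit start offset.
        directory += f"{tag}{len(data):04d}{len(body):05d}".encode("ascii")
        body += data
    base = 24 + len(directory) + 1  # leader + directory + directory terminator
    total = base + len(body) + 1    # plus the record terminator
    # Leader: record length (0-4), illustrative codes, base address (12-16), entry map.
    leader = f"{total:05d}nz  a22{base:05d}n  4500".encode("ascii")
    return leader + directory + FIELD_END + body + RECORD_END


def parse_record(raw):
    """Decode one record into (tag, value) pairs (hypothetical helper)."""
    base = int(raw[12:17])            # base address of data from the leader
    directory = raw[24:base - 1]      # strip the directory terminator
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[:3].decode("ascii")
        length, start = int(entry[3:7]), int(entry[7:12])
        data = raw[base + start:base + start + length - 1]  # drop field terminator
        if tag.startswith("00"):
            # Control fields carry plain data without indicators or subfields.
            fields.append((tag, data.decode("utf-8")))
        else:
            # Data fields: two indicator bytes, then delimiter + code + value.
            chunks = data[2:].split(SUBFIELD)[1:]
            fields.append((tag, [(c[:1].decode("ascii"), c[1:].decode("utf-8")) for c in chunks]))
    return fields


record = build_record([("001", b"118540238"), ("075", b"  \x1fbp\x1f2gndgen")])
print(parse_record(record))
```

The sketch handles a single well-formed record only; real data needs error handling for truncated or inconsistent records, which is what the invalid command listed below is for.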

marc21-rs is developed by the Metadata Department of the German National Library (DNB). It is used for data analysis and for automating metadata workflows (data engineering) as part of automatic content indexing.

The marc21 tool provides the following commands:

  • concat — Concatenate records from multiple inputs (alias cat)
  • count — Print the number of records in the input data (alias cnt)
  • dedup — Remove duplicate records from the input
  • describe — Create a frequency table of all subfield codes
  • filter — Filter records that fulfill a specified condition
  • frequency — Compute a frequency table of values (alias freq)
  • hash — Compute SHA-256 checksum of records
  • invalid — Output invalid records that cannot be decoded
  • partition — Partition records by values
  • print — Print records in human readable format
  • sample — Select a random permutation of records
  • split — Split the input into chunks of a given size

The polars-marc21 package uses the query engine to transform MARC 21 records directly into a DataFrame:

>>> from polars_marc21 import scan_marc21
>>>
>>> filename = "DUMP.mrc.gz"
>>> query = "001, 075{ b | 2 == 'gndgen' }"
>>> header = "ppn,gndgen"
>>>
>>> df = scan_marc21(filename, query, header).collect()
>>> print(df)
shape: (7, 2)
┌───────────┬────────┐
│ ppn       ┆ gndgen │
│ ---       ┆ ---    │
│ str       ┆ str    │
╞═══════════╪════════╡
│ 118540238 ┆ p      │
│ 118572121 ┆ p      │
│ 118607626 ┆ p      │
│ 118632477 ┆ p      │
│ 040992020 ┆ u      │
│ 040992918 ┆ u      │
│ 040993396 ┆ u      │
└───────────┴────────┘
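
To make the query above easier to read, here is a rough pure-Python sketch of what it appears to select: field 001, plus subfield b of every 075 field whose subfield 2 equals 'gndgen'. This is one reading of the example, not the actual query engine. The in-memory record layout and the `extract` function are hypothetical, the non-matching 075 fields are made up for illustration, and how the real engine treats records without a match or with repeated subfields is not covered here.

```python
# Hypothetical in-memory records: control field 001 as a string,
# repeatable field 075 as a list of {subfield code: value} dicts.
RECORDS = [
    {
        "001": "118540238",
        "075": [
            {"b": "p", "2": "gndgen"},
            {"b": "piz", "2": "gndspec"},  # illustrative field, filtered out
        ],
    },
    {
        "001": "040992020",
        "075": [
            {"b": "u", "2": "gndgen"},
            {"b": "saz", "2": "gndspec"},  # illustrative field, filtered out
        ],
    },
]


def extract(records):
    """Yield one (ppn, gndgen) row per 075 field that passes the filter."""
    for record in records:
        ppn = record["001"]
        for field in record.get("075", []):
            # Mirrors the predicate `2 == 'gndgen'` inside the braces.
            if field.get("2") == "gndgen":
                yield ppn, field.get("b")


rows = list(extract(RECORDS))
print(rows)
```

Each resulting tuple corresponds to one row of the DataFrame, with the header string supplying the column names.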

Check out the documentation to learn more about installing and using the tool.

Contributing

All contributors are required to "sign-off" their commits (using git commit -s) to indicate that they have agreed to the Developer Certificate of Origin.

This project follows a strict no-AI / no-LLM policy. Please do not use large language models (LLMs) to create issues, patches, pull requests, or comments. Although English is the preferred language, you are welcome to communicate in your native language.

License

This project is licensed under the European Union Public License 1.2.

Download files

Source Distribution

polars_marc21-0.1.0.tar.gz (62.6 kB)

Uploaded Source

Built Distribution

polars_marc21-0.1.0-cp39-abi3-manylinux_2_34_x86_64.whl (870.1 kB)

Uploaded CPython 3.9+, manylinux: glibc 2.34+ x86-64

File details

Details for the file polars_marc21-0.1.0.tar.gz.

File metadata

  • Download URL: polars_marc21-0.1.0.tar.gz
  • Upload date:
  • Size: 62.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.11.21

File hashes

Hashes for polars_marc21-0.1.0.tar.gz
Algorithm Hash digest
SHA256 116f69fd55a9f30be9f80bca23ffa94a5a3da88b6ce8b9d7e5161680a6553984
MD5 df3c2536f3e4e9dfc8279e89c7904531
BLAKE2b-256 685ea311071846c3503984c282ed960ff081060842e1b293c81b8cb35feb15c8
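
A downloaded file can be checked against the SHA256 digest listed above. The following is a minimal sketch using only the Python standard library; the function names `sha256_of` and `verify` are hypothetical, and the file path in the usage note is an assumption about where the download was saved.

```python
import hashlib

# SHA256 digest listed above for polars_marc21-0.1.0.tar.gz.
EXPECTED_SHA256 = "116f69fd55a9f30be9f80bca23ffa94a5a3da88b6ce8b9d7e5161680a6553984"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_SHA256):
    """Return True if the file's digest matches the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

For example, `verify("polars_marc21-0.1.0.tar.gz")` should return True for an intact download saved under that name in the current directory.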

File details

Details for the file polars_marc21-0.1.0-cp39-abi3-manylinux_2_34_x86_64.whl.

File metadata

  • Download URL: polars_marc21-0.1.0-cp39-abi3-manylinux_2_34_x86_64.whl
  • Upload date:
  • Size: 870.1 kB
  • Tags: CPython 3.9+, manylinux: glibc 2.34+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.21

File hashes

Hashes for polars_marc21-0.1.0-cp39-abi3-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 5ea2b3fcb5363072a0bd3ec2082df9b79f81d52979650a5dd0c97faeff98cbc6
MD5 c046a66d12e8cf417616ee7a6649dff6
BLAKE2b-256 775a61578003bc1daa61daceb4d34eddcd360187bcd447af9faad8a108a747f4
