Scrape Marginados data from a WARC file

Marginados WARC Scraper

marginados-warc-scraper is a command-line tool that extracts tabular data from a Web archive (WARC) file that contains data on 'Marginados'. The WARC file was created using ArchiveWeb.page from recorded interactions with https://ctmn.colmex.mx/UI/Public/BusquedaSimple.aspx.

This tool creates a single tab-separated values (TSV) file with all search results that are in the WARC file, in the order that they were presented and browsed.
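The idea of turning recorded result pages into one TSV file can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: it assumes each recorded response is an HTML page whose results sit in plain `<tr>`/`<td>` table rows, and it leaves out reading those pages from the WARC file. The names `TableRowParser` and `pages_to_tsv` are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row.

    Assumption: results are rendered as simple, non-nested table rows.
    """

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, in document order
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Collapse whitespace so tabs and newlines cannot break the TSV.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows without <td> cells, such as header rows
                self.rows.append(self._row)
            self._row = None


def pages_to_tsv(html_pages):
    """Return one TSV string with the rows of all pages, in the order given."""
    all_rows = []
    for page in html_pages:
        parser = TableRowParser()
        parser.feed(page)
        parser.close()
        all_rows.extend(parser.rows)
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(all_rows)
    return out.getvalue()
```

Because the pages are processed in the order they are passed in, the output keeps the order in which the results were browsed, and no header row is written.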

It seemed easier to extract these data from a WARC file than to scrape them directly from the website, because the website uses POST requests with dynamic data to retrieve more search results. We did not attempt to reverse engineer these requests.

We believe extracting the data from the website is allowed by the publisher under the CC-BY-NC-ND license that covers the data. Our request for a copy of the dataset went unanswered.

Installation

pip install marginados-warc-scraper

Usage

This tool has very specific expectations about the input WARC file. We have used it on a WARC file exported from the ArchiveWeb.page desktop app. In the app, we recorded a search for all results on the website that lists Los Catálogos de textos marginados novohispanos. The recording included clicking through the search results, so that every page of rows was captured. At the end of the session, we exported the "recording" as a WARC file.

Run it like this (note that the command name is shorter than the package name):

marwar-scraper my-input.warc output.tsv

The output TSV file does not include a header row.
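Since there is no header row, column names have to be supplied when the file is read. A minimal sketch using Python's standard `csv` module is shown here. The column names and the sample rows are placeholders, because the actual columns are not documented in this description.

```python
import csv
import io

# Hypothetical column names: replace them with names matching the real columns.
COLUMNS = ["col_1", "col_2", "col_3"]


def read_results(tsv_file, columns=COLUMNS):
    """Read a header-less TSV file object into a list of dicts."""
    # Passing fieldnames makes DictReader treat the first line as data.
    reader = csv.DictReader(tsv_file, fieldnames=columns, delimiter="\t")
    return list(reader)


# Made-up sample data standing in for the tool's output file.
sample = io.StringIO("1\tPrimer texto\t1650\n2\tSegundo texto\t1701\n")
rows = read_results(sample)
print(rows[0]["col_2"])
```

For a real file, pass an open file object instead of the sample, for example `open("output.tsv", newline="", encoding="utf-8")`. The UTF-8 encoding is an assumption here.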

See marwar-scraper --help for the full usage.

License

marginados-warc-scraper was created by Ben Companjen at Leiden University Libraries' Centre for Digital Scholarship in 2023.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

