Scrape Marginados data from a WARC file

Marginados WARC Scraper

marginados-warc-scraper is a command-line tool that extracts tabular data from a Web archive (WARC) file that contains data on 'Marginados'. The WARC file was created using ArchiveWeb.page from recorded interactions with https://ctmn.colmex.mx/UI/Public/BusquedaSimple.aspx.

This tool creates a single tab-separated values (TSV) file with all search results that are in the WARC file, in the order that they were presented and browsed.

It seemed easier to extract these data from a WARC file than to scrape them directly from the website, because the website uses POST requests with dynamic data to retrieve more search results. We did not attempt to reverse engineer those requests.

We believe extracting the data from the website is allowed by the publisher under the CC-BY-NC-ND license that covers the data. Our request for a copy of the dataset went unanswered.

Installation

pip install marginados-warc-scraper

Usage

This tool has very specific expectations about the input WARC file. We have used it on a WARC file that was exported from the ArchiveWeb.page desktop app. In the app, we recorded a search for all results on the website that lists Los Catálogos de textos marginados novohispanos. The recording included clicking through every page of search results, so that all rows were captured. At the end of the session, we exported the "recording" as a WARC file.

It is expected that you use it like this (note the shorter name):

marwar-scraper my-input.warc output.tsv

The output TSV file does not include a header row.
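
Because there is no header row, every line of the output is data, including the first. As a minimal sketch of reading the file back, the following uses Python's standard csv module. The function name read_rows and the file name output.tsv are illustrative and not part of the tool. The sketch assumes UTF-8 encoding and standard csv quoting, neither of which is documented here, and it makes no assumption about the number or meaning of the columns.

```python
import csv
from pathlib import Path


def read_rows(path):
    """Return all rows of a headerless TSV file as lists of strings."""
    # newline="" lets the csv module handle line endings itself.
    # Assumption: the file is UTF-8 encoded.
    with open(path, newline="", encoding="utf-8") as handle:
        # Assumption: tab-delimited with the csv module's default quoting.
        return list(csv.reader(handle, delimiter="\t"))


if __name__ == "__main__":
    tsv_path = Path("output.tsv")  # hypothetical name, as in the usage example
    if tsv_path.exists():
        rows = read_rows(tsv_path)
        # The first row is already a search result, not a header.
        print(f"{len(rows)} rows read")
        if rows:
            print(f"{len(rows[0])} columns in the first row")
```

Rows come back in file order, which matches the order in which the search results were browsed during recording.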

See marwar-scraper --help for the full usage.

License

marginados-warc-scraper was created by Ben Companjen at Leiden University Libraries' Centre for Digital Scholarship in 2023.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

