Scrape Marginados data from a WARC file
Marginados WARC Scraper
marginados-warc-scraper is a command-line tool that extracts tabular data from a Web archive (WARC) file that contains data on 'Marginados'.
The WARC file was created using ArchiveWeb.page from recorded interactions with https://ctmn.colmex.mx/UI/Public/BusquedaSimple.aspx.
This tool creates a single tab-separated values (TSV) file with all search results that are in the WARC file, in the order that they were presented and browsed.
It seemed easier to extract these data from a WARC file than to scrape the website directly, because the website uses POST requests with dynamic data to retrieve more search results. We did not attempt to reverse engineer these requests.
We believe extracting the data from the website is allowed by the publisher under the CC-BY-NC-ND license that covers the data. Our request for a copy of the dataset went unanswered.
Installation
pip install marginados-warc-scraper
Usage
This tool has very specific expectations about the input WARC file. We have used it on a WARC file that was exported from the ArchiveWeb.page desktop app. In the app, we recorded a search for all results on the website that lists Los Catálogos de textos marginados novohispanos. It included clicking through search results, so that all pages of rows were recorded. At the end of the session, we exported the "recording" as a WARC file.
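To check what a recording contains before running the tool, the records in a WARC file can be listed. The following is a minimal sketch using only the standard library, not the tool's own implementation. It assumes an uncompressed WARC file with simple, unfolded headers; the names `iter_warc_records` and `make_record` are hypothetical, and the sample records are made up for the demonstration. For real archives, a dedicated WARC library is likely the more robust choice.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream.

    Assumption: headers are single-line "Name: value" pairs (no folding).
    Header names are lower-cased because WARC field names are case-insensitive.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines separate consecutive records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()
        # The payload is exactly Content-Length bytes long.
        payload = stream.read(int(headers["content-length"]))
        yield headers, payload


def make_record(warc_type, uri, payload):
    """Build a tiny, made-up WARC record for the demonstration below."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {warc_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + payload + b"\r\n\r\n"


# Hypothetical sample data; a real file would be opened with open(path, "rb").
sample = make_record("request", "https://example.org/search", b"POST") + make_record(
    "response", "https://example.org/search", b"<table></table>"
)

for headers, payload in iter_warc_records(io.BytesIO(sample)):
    print(headers["warc-type"], headers["warc-target-uri"], len(payload))
```

With a real recording, replace the in-memory sample with a file opened in binary mode. Response records for the search page are the ones that would carry the result rows.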
Run the tool like this (note the shorter command name):
marwar-scraper my-input.warc output.tsv
The output TSV file does not include a header row.
See marginados-warc-scraper --help for the full usage.
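Because the output has no header row, a program that reads it has to treat every line as data. The sketch below shows one way to load such a file with Python's `csv` module. It assumes plain tab-delimited text without quoting, which is a guess about the output format. The function name `read_tsv_rows` and the sample rows are hypothetical.

```python
import csv
import io


def read_tsv_rows(stream):
    """Return all rows of a header-less TSV stream as lists of strings."""
    # QUOTE_NONE: treat quote characters as ordinary text (an assumption).
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader]


# Made-up sample standing in for output.tsv; with a real file, use
# open("output.tsv", newline="", encoding="utf-8") instead.
sample = io.StringIO("1\tTítulo A\n2\tTítulo B\n")
rows = read_tsv_rows(sample)
print(len(rows))   # number of data rows; no line is skipped as a header
print(rows[0])     # first search result, in the order it was browsed
```

Column names are not part of the file, so any labels have to be supplied by the reader, for example after comparing the columns with the search results table on the website.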
License
marginados-warc-scraper was created by Ben Companjen at Leiden University Libraries' Centre for Digital Scholarship in 2023.
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.
Hashes for marginados_warc_scraper-1.0.1.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 6fe56302d854f72cbbe607eb78f3229f02c3b69a0d3e1c722e5720e4d80cba07
MD5 | 805594259d2341f61a19fdf0295e0e83
BLAKE2b-256 | 310b495a4bf88fc051cb0765ad135c4ae6f1414c90275ccdb9de1f837f150267
Hashes for marginados_warc_scraper-1.0.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | a08fa93223cfb92959e73584a537d2bfc070b149e5358565c4f507d1f4e2fa7f
MD5 | 4060bd08230c7f271a31d0bc2689cffa
BLAKE2b-256 | 8b1404a1cb3e564e696f7aab215dca1d72530507bfebb00ce5e5fbc4dc629ca0
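A downloaded distribution file can be checked against the SHA256 digests listed above. The following is a small sketch using Python's `hashlib`. The function names are hypothetical, and the expected value is the SHA256 digest listed for marginados_warc_scraper-1.0.1.tar.gz.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Digest copied from the table for marginados_warc_scraper-1.0.1.tar.gz.
EXPECTED_SHA256 = "6fe56302d854f72cbbe607eb78f3229f02c3b69a0d3e1c722e5720e4d80cba07"


def matches_published_digest(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals the expected value."""
    return sha256_of_file(path) == expected.lower()
```

Calling `matches_published_digest("marginados_warc_scraper-1.0.1.tar.gz")` on the downloaded source distribution should return `True` if the file is intact. Pass the wheel's digest as `expected` to check the built distribution instead.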