Project description


📜 The Archive Query Log

Mining Millions of Search Result Pages of Hundreds of Search Engines from 25 Years of Web Archives.

[Figure: t-SNE visualization of the queries in the Archive Query Log]

Start now by running your custom analysis/experiment, scraping your own query log, or just looking at our example files.

Integrations

Running Experiments on the AQL

The data in the Archive Query Log is highly sensitive (even though, in principle, everything can be re-crawled from the Wayback Machine). To ensure that custom experiments or analyses cannot leak sensitive data, we use TIRA as the platform for custom analyses/experiments (please get in touch if you have questions). In TIRA, you submit a Docker image that implements your experiment. Your software is then executed in a sandbox (without internet connection) to ensure that it does not leak sensitive information. After your software execution has finished, administrators will review your submission and unblind it so that you can access the outputs.
Please refer to our dedicated TIRA tutorial as a starting point for your experiments.

Crawling

To run the CLI and crawl a query log on your own machine, please refer to the instructions for single-machine deployments. If you instead want to scale up and run the crawling pipelines on a cluster, please refer to the instructions for cluster deployments.

Single-Machine (PyPI/Docker)

To run the Archive Query Log CLI on your machine, you can either use our PyPI package or the Docker image. (If you absolutely need to, you can also install the Python CLI from source or build the Docker image from source.)

Installation (PyPI)

First, install Python 3.10 and pipx (pipx lets you install the AQL CLI into an isolated virtual environment). Then, install the Archive Query Log CLI by running:

pipx install archive-query-log
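
If the pipx command is not available yet, one common way to install it is via pip (see the pipx documentation for alternatives; afterwards, re-run the command above):

python3 -m pip install --user pipx
python3 -m pipx ensurepath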

Now you can run the Archive Query Log CLI:

aql --help

Installation (Python from source)

First install Python 3.10, and clone this repository. From inside the repository directory, create a virtual environment and activate it:

python3.10 -m venv venv/
source venv/bin/activate

Install the Archive Query Log by running:

pip install -e .

Now you can run the Archive Query Log CLI:

aql --help

Installation (Docker)

You only need to install Docker.

Note: The commands below use the syntax of the PyPI installation. To run the same commands with the Docker installation, replace aql with docker run -it -v "$(pwd)"/config.override.yml:/workspace/config.override.yml ghcr.io/webis-de/archive-query-log, for example:

docker run -it -v "$(pwd)"/config.override.yml:/workspace/config.override.yml ghcr.io/webis-de/archive-query-log --help
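
If you use the Docker image frequently, you can, for example, define a shell alias so that the remaining commands read the same as with the PyPI installation (optional; Bash/Zsh syntax, using exactly the command shown above):

alias aql='docker run -it -v "$(pwd)"/config.override.yml:/workspace/config.override.yml ghcr.io/webis-de/archive-query-log'
aql --help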

Installation (Docker from source)

First install Docker, and clone this repository. From inside the repository directory, build the Docker image like this:

docker build -t aql .

Note: The commands below use the syntax of the PyPI installation. To run the same commands with the Docker installation, replace aql with docker run -it -v "$(pwd)"/config.override.yml:/workspace/config.override.yml aql, for example:

docker run -it -v "$(pwd)"/config.override.yml:/workspace/config.override.yml aql --help

Configuration

Crawling the Archive Query Log requires access to an Elasticsearch cluster and an S3-compatible object storage bucket. To configure access to Elasticsearch and S3, add a config.override.yml file in the current directory with the following contents, replacing the placeholders with your actual credentials:

es:
  host: "<HOST>"
  port: 9200
  username: "<USERNAME>"
  password: "<PASSWORD>"
s3:
  endpoint_url: "<URL>"
  bucket_name: archive-query-log
  access_key: "<KEY>"
  secret_key: "<KEY>"
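
Before starting a crawl, you can sanity-check the Elasticsearch credentials independently of the AQL CLI, for example by querying the cluster health endpoint with curl (adjust the scheme and port if your cluster is not served via HTTPS on 9200):

curl -u "<USERNAME>:<PASSWORD>" "https://<HOST>:9200/_cluster/health?pretty"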

Toy Example: Crawl ChatNoir SERPs from the Wayback Machine

The crawling pipeline of the Archive Query Log can best be understood by looking at a small toy example. Here, we want to crawl and parse SERPs of the ChatNoir search engine from the Wayback Machine.

Add an archive service

Add new web archive services (e.g., the Wayback Machine) to the AQL by running:

aql archives add

We maintain a list of compatible web archives below.

Compatible archives

The web archives below are known to be compatible with the Archive Query Log crawler and can be used to mine SERPs.

Name            | CDX API URL                            | Memento API URL
Wayback Machine | https://web.archive.org/cdx/search/cdx | https://web.archive.org/web/

Add a search provider

Add new search providers (e.g., Google) to the AQL by running:

aql providers add

A search provider can be any website that offers some kind of search functionality. Ideally, you should also look at common prefixes of the URLs of its search result pages (e.g., /search for Google). Narrowing down the URL prefixes helps avoid crawling too many captures that do not contain search results.
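
To spot-check which URL prefix a provider's SERPs share, you can, for instance, ask the Wayback Machine's CDX API (see the table of compatible archives above) for captures under a candidate prefix. This is only a manual exploration aid, not part of the AQL CLI:

curl "https://web.archive.org/cdx/search/cdx?url=google.com/search&matchType=prefix&limit=10"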

Refer to the import instructions below to import providers from the AQL-22 YAML file format.

Build source pairs

Once you have added at least one archive and one search provider, we want to crawl archived SERP captures for every combination of archive service and search provider. That is, we compute the cross product of the archives and the search providers' domains and URL prefixes (roughly: archive × provider). Start building these source pairs (i.e., archive–provider pairs) by running:

aql sources build

Running the command again after adding more archives or providers automatically creates the missing source pairs.

Fetch captures

For each source pair, we now fetch captures from the archive service that correspond to the provider's domain and URL prefix given in the source pair.

aql captures fetch

As before, running the command again after adding more source pairs automatically fetches the missing captures.
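
For the ChatNoir toy example, you can preview the kind of capture metadata such a fetch retrieves by querying the archive's CDX API manually (illustrative only; the exact requests the crawler issues may differ):

curl "https://web.archive.org/cdx/search/cdx?url=chatnoir.eu&matchType=domain&output=json&limit=5"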

Parse SERP URLs

Not every capture necessarily points to a search engine result page (SERP). But usually, SERPs contain the user query in the URL, so we can filter out non-SERP captures by parsing the URLs.

aql serps parse url-query

Parsing the query from the capture URL adds the SERPs to a new, more focused index that contains only SERPs. From the URL, we can also parse the SERP's page number and result offset, if available.

aql serps parse url-page
aql serps parse url-offset

All the above commands can be run in parallel, and they can be run multiple times to update the SERP index. Already parsed SERPs will be skipped.
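
On a single machine, you could, for example, launch the three URL parsers concurrently from a shell. This is just a convenience; on a cluster, the Helm chart (see below) schedules the crawling and parsing jobs for you:

aql serps parse url-query &
aql serps parse url-page &
aql serps parse url-offset &
wait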

Download SERP WARCs

Up to this point, we have only fetched the metadata of the captures, most prominently their URLs. However, the snippets of the SERPs are not contained in the metadata but only on the web pages themselves. So we need to download the actual web pages from the archive service.

aql serps download warc

This command will download the contents of each SERP to a WARC file that is stored in the configured S3 bucket. A pointer to the WARC file is stored in the SERP index so that we can quickly access a specific SERP's contents later.
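
To check that WARC files actually arrive in your bucket, you can list it with any S3-compatible client, for example the AWS CLI (bucket name and endpoint as configured above; depending on your storage, you may additionally need to set AWS_DEFAULT_REGION):

AWS_ACCESS_KEY_ID="<KEY>" AWS_SECRET_ACCESS_KEY="<KEY>" aws s3 ls "s3://archive-query-log/" --endpoint-url "<URL>" --recursive | head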

Parse SERP WARCs

From the WARC, we can again parse the query as it appears on the SERP.

aql serps parse serp-query

More importantly, we can parse the snippets of the SERP.

aql serps parse serp-snippets

Parsing the snippets from the SERP's WARC contents will also add the SERP's results to a new index.

Download result WARCs

To get the full text of each referenced result from the SERP, we need to download a capture of the result from the web archive. Intuitively, we would like to download a capture of the result at the exact same time as the SERP was captured. But often, web archives crawl the results later or not at all. We therefore search for the nearest captures before and after the SERP's timestamp and download these two captures for each result, if any could be found.

aql results download warc

This will again download the result's contents to a WARC file that is stored in the configured S3 bucket. A pointer to the WARC file is stored in the result index for random access to a specific result's contents.
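
Putting the toy example together, a full crawl boils down to the following sequence of commands (each step can be re-run later to pick up newly added archives, providers, or captures, as described above):

aql archives add               # register a web archive (e.g., the Wayback Machine)
aql providers add              # register a search provider (e.g., ChatNoir)
aql sources build              # build archive-provider source pairs
aql captures fetch             # fetch capture metadata from the archives
aql serps parse url-query      # identify SERPs by the query in the URL
aql serps parse url-page       # parse the page number from the URL
aql serps parse url-offset     # parse the result offset from the URL
aql serps download warc        # download SERP contents as WARC files
aql serps parse serp-query     # parse the query as it appears on the SERP
aql serps parse serp-snippets  # parse the result snippets from the SERP
aql results download warc      # download the referenced results as WARC files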

Import

We support automatically importing providers and parsers from the AQL-22 YAML file format (see data/selected-services.yaml). To import the providers and parsers from the AQL-22 YAML file, run:

aql providers import
aql parsers url-query import
aql parsers url-page import
aql parsers url-offset import
aql parsers warc-query import
aql parsers warc-snippets import

We also support importing a previous crawl of captures from the AQL-22 file system backend:

aql captures import aql-22

Lastly, we support importing all archives from the Archive-It web archive service:

aql archives import archive-it

Cluster (Helm/Kubernetes)

Running the Archive Query Log on a cluster is recommended for large-scale crawls. We provide a Helm chart that automatically starts crawling and parsing jobs for you and stores the results in an Elasticsearch cluster.

Installation

Just install Helm and configure kubectl for your cluster.

Configuration

Crawling the Archive Query Log requires access to an Elasticsearch cluster and an S3-compatible object storage bucket. Configure the Elasticsearch and S3 credentials in a values.override.yaml file like this:

elasticsearch:
  host: "<HOST>"
  port: 9200
  username: "<USERNAME>"
  password: "<PASSWORD>"
s3:
  endpoint_url: "<URL>"
  bucket_name: archive-query-log
  access_key: "<KEY>"
  secret_key: "<KEY>"

Deployment

Let's deploy the Helm chart on the cluster (we first test with --dry-run to check that everything works):

helm upgrade --install --values helm/archive-query-log/values.override.yaml --dry-run archive-query-log helm/archive-query-log

If everything worked and the output looks good, you can remove the --dry-run flag to actually deploy the chart.
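
After the real deployment, you can verify that the crawling and parsing jobs were created, for example with kubectl (the label selector below assumes the standard Helm release labels; adjust it to your chart if needed):

kubectl get cronjobs,jobs,pods -l app.kubernetes.io/instance=archive-query-log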

Uninstall

If you no longer need the chart, you can uninstall it:

helm uninstall archive-query-log

Citation

If you use the Archive Query Log dataset or the crawling code in your research, please cite the following paper describing the AQL and its use-cases:

Jan Heinrich Reimer, Sebastian Schmidt, Maik Fröbe, Lukas Gienapp, Harrisen Scells, Benno Stein, Matthias Hagen, and Martin Potthast. The Archive Query Log: Mining Millions of Search Result Pages of Hundreds of Search Engines from 25 Years of Web Archives. In Hsin-Hsi Chen et al., editors, 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023), pages 2848–2860, July 2023. ACM.

You can use the following BibTeX entry for citation:

@InProceedings{reimer:2023,
    author = {{Jan Heinrich} Reimer and Sebastian Schmidt and Maik Fr{\"o}be and Lukas Gienapp and Harrisen Scells and Benno Stein and Matthias Hagen and Martin Potthast},
    booktitle = {46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023)},
    doi = {10.1145/3539618.3591890},
    editor = {Hsin{-}Hsi Chen and Wei{-}Jou (Edward) Duh and Hen{-}Hsen Huang and Makoto P. Kato and Josiane Mothe and Barbara Poblete},
    ids = {potthast:2023u},
    isbn = {9781450394086},
    month = jul,
    numpages = 13,
    pages = {2848--2860},
    publisher = {ACM},
    site = {Taipei, Taiwan},
    title = {{The Archive Query Log: Mining Millions of Search Result Pages of Hundreds of Search Engines from 25 Years of Web Archives}},
    url = {https://dl.acm.org/doi/10.1145/3539618.3591890},
    year = 2023
}

Development

Refer to the local Python installation instructions to set up the development environment and install the dependencies.

Then, also install the test dependencies:

pip install -e .[tests]

After implementing a new feature, you should check the code format, inspect common lint errors, verify static typing, scan for security issues, and run all unit tests with the following commands:

flake8 archive_query_log  # Code format
pylint archive_query_log  # LINT errors
mypy archive_query_log    # Static typing
bandit -c pyproject.toml -r archive_query_log  # Security
pytest archive_query_log  # Unit tests

Add new tests for parsers

At the moment, our workflow for adding new tests for parsers goes like this:

  1. Select the number of tests to run per service and the number of services.
  2. Auto-generate unit tests and download the corresponding WARCs with generate_tests.py.
  3. Run the tests.
  4. Failing tests will open a diff editor with the approval file and a web browser tab with the Wayback URL.
  5. Use the web browser's dev tools to find the CSS paths of the query input field and the search results.
  6. Close the diffs and tabs and re-run the tests (see the pytest example below).
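
To re-run only the tests of a single provider after adjusting the CSS paths, you can use pytest's usual name filtering (the exact test names depend on what generate_tests.py produced, so the keyword below is only an example):

pytest archive_query_log -k chatnoir  # example keyword; pick the provider you are working on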

Contribute

If you've found an important search provider to be missing from this query log, please suggest it by creating an issue. We also very gratefully accept pull requests for adding search providers or new parser configurations!

If you're unsure about anything, post an issue or contact us.

We're happy to help!

License

This repository is released under the MIT license. Files in the data/ directory are exempt from this license. If you use the AQL in your research, we'd be glad if you'd cite us.

Abstract

The Archive Query Log (AQL) is a previously unused, comprehensive query log collected at the Internet Archive over the last 25 years. Its first version includes 356 million queries, 166 million search result pages, and 1.7 billion search results across 550 search providers. Although many query logs have been studied in the literature, the search providers that own them generally do not publish their logs to protect user privacy and vital business data. Of the few query logs publicly available, none combines size, scope, and diversity. The AQL is the first to do so, enabling research on new retrieval models and (diachronic) search engine analyses. Provided in a privacy-preserving manner, it promotes open research as well as more transparency and accountability in the search industry.



Download files

Download the file for your platform.

Source Distribution

archive-query-log-0.1.30.tar.gz (37.0 MB)


Built Distribution

archive_query_log-0.1.30-py3-none-any.whl (183.3 kB)


File details

Details for the file archive-query-log-0.1.30.tar.gz.

File metadata

  • Download URL: archive-query-log-0.1.30.tar.gz
  • Upload date:
  • Size: 37.0 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/4.0.2 CPython/3.11.6

File hashes

Hashes for archive-query-log-0.1.30.tar.gz

Algorithm   | Hash digest
SHA256      | c2d59e0b7af8932b3868c46ea2098bdc0b2659d87642b5923055a5f0e8aaaebf
MD5         | 0cbfb79cf2d08762eb8de191a6d382e2
BLAKE2b-256 | 96bb0d6dbbc424fe43ffb0088dab9fe3dd5d5a9a52dc950e8e4f17581de70977


File details

Details for the file archive_query_log-0.1.30-py3-none-any.whl.


File hashes

Hashes for archive_query_log-0.1.30-py3-none-any.whl

Algorithm   | Hash digest
SHA256      | ad17d621c39bc09045aa5329b8d62b9b628b3468a69e5ceeecd36e9e0c7235ac
MD5         | 2ada92a7218e7a1d323feeac5bcf84fb
BLAKE2b-256 | 128b0284cca3390190ab03f8fb0b4cacf21c8d4c60ddebc1d4933b41de5b4ef0

