Python package for collecting and analyzing webpages

web-observatory

Download Latest Version from PyPI

web-observatory is a Python package for collecting and analyzing webpages.

See here for extended examples of web-observatory in use.

Modules

start_project

Initializes a project directory

search_google

Searches Google for terms. Google Custom Search Engine credentials required.

google_process

Compiles results from multiple Google searches.

get_domains

Extracts the domain-level name from the URLs returned by Google searches (e.g. 'google' in www.google.com).
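As a rough illustration of the idea, and not the package's own code, the domain label can be pulled out of a URL with the standard library. The function name domain_label is hypothetical, and the "second-to-last label" rule is a simplifying assumption that fails for multi-part suffixes such as co.uk.

```python
from urllib.parse import urlparse


def domain_label(url: str) -> str:
    """Return the label just before the final suffix of a URL's host."""
    # urlparse only fills in the host when a scheme or '//' is present,
    # so bare hosts like 'www.google.com' get a '//' prefix first.
    host = urlparse(url if "//" in url else "//" + url).hostname or ""
    parts = host.split(".")
    # Naive assumption: the second-to-last label is the domain name.
    return parts[-2] if len(parts) >= 2 else host


print(domain_label("www.google.com"))              # google
print(domain_label("https://docs.python.org/3/"))  # python
```

For real-world data a public-suffix-aware parser would be more reliable than this heuristic.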

initialize_crawl

Initializes a Scrapy crawl on a set of domains. Returns a JSON file of URLs found through the crawl.

crawl_process

Processes the JSON output of a crawl into a pandas DataFrame.

crawl

Not yet implemented as a module. The crawl can instead be run from a Jupyter notebook cell with a command such as: !scrapy crawl digcon_crawler -O output.json --nolog

search_merge

Merges Google searches and crawl results.

get_versions

Gets historical versions of Twitter-searched URLs using the Internet Archive's Wayback Machine. Attempts to find the version of the page archived closest in time to when it was tweeted.
Uses the requests package to request each URL and resolve it to its "full" address rather than a redirect (e.g. bit.ly/12312). This helps in scraping.
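The "closest archived version" lookup can be sketched against the Wayback Machine's public availability endpoint. This is an assumption about how such a lookup might work, not a description of get_versions itself. The function names and the sample response body are hypothetical, and the example parses a canned response so that it runs without network access.

```python
import json
from urllib.parse import urlencode

WAYBACK_API = "https://archive.org/wayback/available"


def availability_url(page_url: str, tweeted_at: str) -> str:
    """Build the query URL; tweeted_at is a YYYYMMDDhhmmss timestamp."""
    return WAYBACK_API + "?" + urlencode({"url": page_url, "timestamp": tweeted_at})


def closest_snapshot(response_text: str):
    """Return (snapshot_url, timestamp) from a response body, or None."""
    closest = json.loads(response_text).get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"], closest["timestamp"]


# Hypothetical response body, shaped like the endpoint's JSON output.
sample = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20200101000000/http://example.com/",
            "timestamp": "20200101000000",
            "status": "200",
        }
    }
})

print(availability_url("example.com", "20200102120000"))
print(closest_snapshot(sample))
```

In live use, the redirect-resolution step described above could be done first, for example with requests.head(url, allow_redirects=True).url, and the resolved address passed to availability_url.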

initialize_scrape

Initializes the files needed to scrape URLs for their HTML.

scrape

Conducts the scrape of pages' HTML. Stores body text in a PostgreSQL database.

query

A set of methods for searching the PostgreSQL database of site text, including filtering empty results and counting specified search terms.
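The two operations named here, dropping empty results and counting search terms, can be illustrated on an in-memory dictionary instead of a database. This is a minimal sketch under that assumption. The name count_terms, the sample pages, and the whole-word, case-insensitive matching rule are all illustrative choices, not the package's documented behavior.

```python
import re


def count_terms(text: str, terms: list[str]) -> dict[str, int]:
    """Count case-insensitive whole-word matches of each term in text."""
    return {
        term: len(re.findall(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE))
        for term in terms
    }


# Hypothetical stand-in for rows of (url, body text) read from the database.
pages = {
    "https://example.org/a": "Open data and open source tools for data work.",
    "https://example.org/b": "",
}

# Filter out pages whose stored text is empty or whitespace only.
non_empty = {url: body for url, body in pages.items() if body.strip()}

counts = {url: count_terms(body, ["data", "open source"]) for url, body in non_empty.items()}
print(counts)
```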

ground_truth

Produces a sample of pages for verifying counts of terms.
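One way to draw such a verification sample is a seeded random draw, so that the same pages can be re-checked later. The sketch below assumes that approach. The name sample_pages and its parameters are hypothetical.

```python
import random


def sample_pages(urls, k, seed=0):
    """Return up to k URLs chosen at random, reproducibly for a given seed."""
    rng = random.Random(seed)  # local generator; leaves global state untouched
    urls = list(urls)
    return rng.sample(urls, min(k, len(urls)))


urls = [f"https://example.org/page{i}" for i in range(10)]
print(sample_pages(urls, 3, seed=42))
```

The sampled pages can then be read by hand and the manual counts compared against the automated ones.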

analyze_orgs

Calculates and visualizes averages and frequencies for each search term in the site text and summarizes by organization (domain).
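The per-organization summary can be pictured as a group-by followed by an average. The sketch below shows that step with the standard library only. The function name, the (organization, count) row shape, and the sample values are assumptions for illustration, and no plotting is attempted.

```python
from collections import defaultdict
from statistics import mean


def average_by_org(rows):
    """Map each organization to the mean term count across its pages.

    rows is an iterable of (organization, count) pairs, one per page.
    """
    grouped = defaultdict(list)
    for org, count in rows:
        grouped[org].append(count)
    return {org: mean(values) for org, values in grouped.items()}


# Hypothetical per-page counts of a single search term.
rows = [("epa", 4), ("epa", 2), ("noaa", 5)]
print(average_by_org(rows))
```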

analyze_term_correlations

Calculates and visualizes covariance metrics for specified search terms in the site text.
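For reference, the sample covariance and Pearson correlation of two terms' per-page counts can be computed as below. This is a generic sketch of the statistics involved, not the module's implementation. The term names and counts are made up, and both functions assume at least two pages and non-constant counts.

```python
from math import sqrt


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (n - 1 denominator)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical counts of two search terms across the same four pages.
climate = [0, 2, 4, 6]
energy = [1, 2, 3, 4]

print(covariance(climate, energy))
print(correlation(climate, energy))
```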

co_occurrence

Returns the pages that use two or more of the specified search terms.
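The filter described here can be sketched as follows, assuming per-page term counts are already available as a dictionary. The function name co_occurring, the minimum parameter, and the sample data are hypothetical.

```python
def co_occurring(counts_by_url, terms, minimum=2):
    """Return URLs whose text contains at least `minimum` of the given terms."""
    return [
        url
        for url, counts in counts_by_url.items()
        if sum(1 for term in terms if counts.get(term, 0) > 0) >= minimum
    ]


# Hypothetical per-page term counts.
counts_by_url = {
    "https://example.org/a": {"data": 2, "open source": 1},
    "https://example.org/b": {"data": 5, "open source": 0},
    "https://example.org/c": {"data": 0, "open source": 0},
}

print(co_occurring(counts_by_url, ["data", "open source"]))
```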

Issues and Development

See: web-observatory project

Download files

Source Distribution

web_observatory-1.2.2.tar.gz (26.1 kB view details)

Uploaded Source

Built Distribution

web_observatory-1.2.2-py3-none-any.whl (26.4 kB view details)

Uploaded Python 3

File details

Details for the file web_observatory-1.2.2.tar.gz.

File metadata

  • Download URL: web_observatory-1.2.2.tar.gz
  • Upload date:
  • Size: 26.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for web_observatory-1.2.2.tar.gz
Algorithm Hash digest
SHA256 8f9ef66afecec0340540971accf4a696c602b863195e8c969fb52ea70aaa5bd7
MD5 e26bc18d24c4f7274acc97fe16b475fb
BLAKE2b-256 5234f43d9f69dec8a2694b1aebab756c9b5b39308d176f264a306db28f4722e3

Provenance

The following attestation bundles were made for web_observatory-1.2.2.tar.gz:

Publisher: python-publish.yml on ericnost/web-observatory

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file web_observatory-1.2.2-py3-none-any.whl.

File hashes

Hashes for web_observatory-1.2.2-py3-none-any.whl
Algorithm Hash digest
SHA256 5b2d652acdabe17b5e08ca5f7d02360c412462b6bbee9fdc69ed1b9e75c38552
MD5 65581d0a529aa76ba8ccdd2e4fcb3061
BLAKE2b-256 3663f93f5cd5f73137e04e6d7ca2ed3c58e8e00b95a4df96c3d9db45cb6f7a65

Provenance

The following attestation bundles were made for web_observatory-1.2.2-py3-none-any.whl:

Publisher: python-publish.yml on ericnost/web-observatory

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
