Python package for collecting and analyzing webpages
web-observatory
web-observatory is a Python package for collecting and analyzing webpages.
See here for extended examples of web-observatory in use.
Modules
start_project
Initializes a project directory
search_google
Searches Google for terms. Google Custom Search Engine credentials required.
google_process
Compiles results from multiple Google searches.
get_domains
Extracts domain-level information from the URLs returned by Google searches (e.g. 'google' in www.google.com).
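As an illustration of the kind of extraction described above, here is a minimal sketch using only the standard library. The function name `domain_label` is hypothetical and is not part of web-observatory, and the sketch assumes a single-label suffix such as .com or .org; multi-part suffixes like .co.uk would need a public-suffix list.

```python
from urllib.parse import urlparse


def domain_label(url: str) -> str:
    """Return the label just before the final suffix, e.g. 'google' for www.google.com.

    Hypothetical helper for illustration only. Assumes a one-label suffix.
    """
    # urlparse only fills in the host when a scheme or a leading '//' is present.
    has_scheme = "://" in url or url.startswith("//")
    parsed = urlparse(url if has_scheme else "//" + url)
    host = (parsed.hostname or "").lower()
    labels = [part for part in host.split(".") if part]
    if len(labels) < 2:
        # Bare hosts such as 'localhost' have no suffix to strip.
        return host
    return labels[-2]
```

Calling `domain_label("https://www.google.com/search?q=x")` gives `'google'` under these assumptions.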
initialize_crawl
Initializes a Scrapy crawl on a set of domains. Returns a JSON file of URLs found through the crawl.
crawl_process
Processes the JSON output of a crawl into a pandas DataFrame.
crawl
Not implemented as a module yet, but it can be run from a notebook with a command like !scrapy crawl digcon_crawler -O output.json --nolog
search_merge
Merges Google searches and crawl results.
get_versions
Gets historical versions of Twitter-searched URLs using the Internet Archive's Wayback Machine. Attempts to find the version of the page archived closest in time to when it was tweeted.
Uses the requests package to request each URL and resolve it to its "full" address rather than a redirect or shortened link (e.g. bit.ly/12312). This helps in scraping.
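The closest-in-time lookup can be sketched against the Wayback Machine's public availability endpoint. This is an assumption about one way to do it, not a description of how get_versions is implemented; the function names `availability_query` and `closest_snapshot` are hypothetical. The sketch only builds the query and reads a response that has already been decoded from JSON, so it runs without network access.

```python
from datetime import datetime
from urllib.parse import urlencode

# Public Wayback Machine availability endpoint.
WAYBACK_AVAILABILITY = "https://archive.org/wayback/available"


def availability_query(url: str, tweeted_at: datetime) -> str:
    """Build a query asking for the snapshot closest to the tweet time."""
    params = {
        "url": url,
        # The endpoint expects a YYYYMMDDhhmmss timestamp.
        "timestamp": tweeted_at.strftime("%Y%m%d%H%M%S"),
    }
    return WAYBACK_AVAILABILITY + "?" + urlencode(params)


def closest_snapshot(payload: dict):
    """Return (snapshot_url, timestamp) from a decoded response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"], closest["timestamp"]
```

With the requests package, the query could be fetched with `requests.get(query).json()`, and a shortened link could be resolved beforehand with `requests.head(url, allow_redirects=True).url`.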
initialize_scrape
Initializes files to scrape URLs for their HTML.
scrape
Conducts the scrape of pages' HTML. Stores body text in a PostgreSQL database.
query
A set of methods for searching the PostgreSQL database of site text, including filtering empty results and counting specified search terms.
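The two operations named above, filtering empty results and counting search terms, can be illustrated in plain Python. This sketch works on an in-memory dictionary rather than a PostgreSQL database, and the names `count_terms` and `drop_empty` are hypothetical, not the package's API. Whole-word, case-insensitive matching is also an assumption.

```python
import re


def count_terms(text: str, terms: list[str]) -> dict[str, int]:
    """Count whole-word, case-insensitive occurrences of each term."""
    counts = {}
    for term in terms:
        # re.escape keeps punctuation in a term from being read as regex syntax.
        pattern = r"\b" + re.escape(term) + r"\b"
        counts[term] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


def drop_empty(pages: dict[str, str]) -> dict[str, str]:
    """Keep only pages whose text is non-empty after trimming whitespace."""
    return {url: text for url, text in pages.items() if text and text.strip()}
```

Matching on word boundaries means "metadata" does not count as a hit for "data"; a substring match would be the other reasonable choice.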
ground_truth
Produces a sample of pages for verifying counts of terms.
analyze_orgs
Calculates and visualizes averages and frequencies for each search term in the site text and summarizes by organization (domain).
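A per-organization summary of this kind can be sketched as follows. The input shape (one `(domain, term, count)` tuple per page and term), the name `summarize_by_org`, and the reading of "frequency" as the share of pages mentioning a term are all assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_org(rows):
    """Summarize term counts per (domain, term) pair.

    rows: iterable of (domain, term, count) tuples, one per page and term.
    """
    grouped = defaultdict(list)
    for domain, term, count in rows:
        grouped[(domain, term)].append(count)

    summary = {}
    for key, counts in grouped.items():
        summary[key] = {
            # Average number of mentions per page.
            "mean_count": mean(counts),
            # Fraction of pages that mention the term at least once.
            "share_of_pages": sum(1 for c in counts if c > 0) / len(counts),
        }
    return summary
```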
analyze_term_correlations
Calculates and visualizes covariance metrics for specified search terms in the site text.
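For reference, the sample covariance and Pearson correlation between the per-page counts of two terms can be computed as below. Which metrics the module actually reports is not stated here, so treat this as a sketch of the general idea; the function names are hypothetical.

```python
from math import sqrt


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences of counts."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    std_x = sqrt(covariance(xs, xs))
    std_y = sqrt(covariance(ys, ys))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance(xs, ys) / (std_x * std_y)
```

Here `xs` and `ys` would hold the counts of two search terms across the same pages, in the same page order.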
co_occurrence
Returns the pages that use two or more of the specified search terms.
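A co-occurrence filter along these lines might look like the sketch below. It uses a case-insensitive substring match over an in-memory dictionary of page text; both the matching rule and the name `co_occurring_pages` are assumptions rather than the package's behavior.

```python
def co_occurring_pages(pages: dict[str, str], terms: list[str], minimum: int = 2) -> list[str]:
    """Return URLs of pages whose text contains at least `minimum` distinct terms."""
    hits = []
    for url, text in pages.items():
        lowered = text.lower()
        # Distinct terms found on this page (substring match, case-insensitive).
        present = [term for term in terms if term.lower() in lowered]
        if len(present) >= minimum:
            hits.append(url)
    return hits
```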
Issues and Development
Download files
File details
Details for the file web_observatory-1.2.2.tar.gz.
File metadata
- Download URL: web_observatory-1.2.2.tar.gz
- Upload date:
- Size: 26.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8f9ef66afecec0340540971accf4a696c602b863195e8c969fb52ea70aaa5bd7 |
| MD5 | e26bc18d24c4f7274acc97fe16b475fb |
| BLAKE2b-256 | 5234f43d9f69dec8a2694b1aebab756c9b5b39308d176f264a306db28f4722e3 |
Provenance
The following attestation bundles were made for web_observatory-1.2.2.tar.gz:
Publisher: python-publish.yml on ericnost/web-observatory
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: web_observatory-1.2.2.tar.gz
- Subject digest: 8f9ef66afecec0340540971accf4a696c602b863195e8c969fb52ea70aaa5bd7
- Sigstore transparency entry: 539538983
- Sigstore integration time:
- Permalink: ericnost/web-observatory@d8deed2662bb8de25b5cb8769d14c09878c99a1e
- Branch / Tag: refs/tags/v1.2.2
- Owner: https://github.com/ericnost
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@d8deed2662bb8de25b5cb8769d14c09878c99a1e
- Trigger Event: release
File details
Details for the file web_observatory-1.2.2-py3-none-any.whl.
File metadata
- Download URL: web_observatory-1.2.2-py3-none-any.whl
- Upload date:
- Size: 26.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5b2d652acdabe17b5e08ca5f7d02360c412462b6bbee9fdc69ed1b9e75c38552 |
| MD5 | 65581d0a529aa76ba8ccdd2e4fcb3061 |
| BLAKE2b-256 | 3663f93f5cd5f73137e04e6d7ca2ed3c58e8e00b95a4df96c3d9db45cb6f7a65 |
Provenance
The following attestation bundles were made for web_observatory-1.2.2-py3-none-any.whl:
Publisher: python-publish.yml on ericnost/web-observatory
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: web_observatory-1.2.2-py3-none-any.whl
- Subject digest: 5b2d652acdabe17b5e08ca5f7d02360c412462b6bbee9fdc69ed1b9e75c38552
- Sigstore transparency entry: 539539008
- Sigstore integration time:
- Permalink: ericnost/web-observatory@d8deed2662bb8de25b5cb8769d14c09878c99a1e
- Branch / Tag: refs/tags/v1.2.2
- Owner: https://github.com/ericnost
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@d8deed2662bb8de25b5cb8769d14c09878c99a1e
- Trigger Event: release