Python package for collecting and analyzing webpages

Project description

observatory

See here for extended examples of observatory in use.

Modules

start_project

Initializes a project directory

search_google

Searches Google for terms. Google Custom Search Engine credentials required.

google_process

Compiles results from multiple Google searches.
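
As a rough illustration of what compiling results can involve, the sketch below merges several result lists and records which search terms surfaced each URL. It is not observatory's own code: the function name `compile_results` and the input shape (a dict mapping each search term to a list of URLs) are assumptions made for this example.

```python
def compile_results(results_by_term):
    """Merge per-term URL lists into one mapping of URL -> sorted list of terms.

    `results_by_term` is a hypothetical structure: {search_term: [url, ...]}.
    """
    compiled = {}
    for term, urls in results_by_term.items():
        for url in urls:
            # A URL returned by several searches is kept once,
            # together with every term that found it.
            compiled.setdefault(url, set()).add(term)
    return {url: sorted(terms) for url, terms in compiled.items()}


searches = {
    "solar": ["https://a.example/energy", "https://b.example/"],
    "wind": ["https://a.example/energy"],
}
print(compile_results(searches))
```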

get_domains

Extracts domain-level information from the URLs returned by Google searches (e.g. 'google' in www.google.com).
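
To make the 'google' in www.google.com example concrete, here is a minimal sketch using only the standard library. The helper name `domain_label` is hypothetical, and the rule it applies (take the label just before the top-level domain) is a simplifying assumption rather than the module's actual logic.

```python
from urllib.parse import urlparse


def domain_label(url):
    """Return a naive domain label, e.g. 'google' for www.google.com."""
    # urlparse only fills in the hostname when a scheme or '//' is present.
    if "//" not in url:
        url = "//" + url
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # Assumption: the label just before the top-level domain is the one wanted.
    # This is wrong for multi-part suffixes such as .co.uk.
    return labels[-2] if len(labels) >= 2 else host


print(domain_label("https://www.google.com/search?q=observatory"))
```

A production version would likely need a public-suffix list to handle domains such as example.co.uk correctly.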

initialize_crawl

Initializes a Scrapy crawl on a set of domains. Returns a JSON file of the URLs found through the crawl.

crawl_process

Processes the JSON output of a crawl into a pandas DataFrame.
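
The processing step might look roughly like the sketch below, which flattens crawl items into one row per URL. Scrapy's JSON feed export writes a JSON array of items, but the item fields used here (`domain` and `urls`) and the function name `crawl_rows` are assumptions, since the crawler's actual item schema is not documented above.

```python
import json


def crawl_rows(json_text):
    """Flatten crawl items into one row per (domain, url) pair, without duplicates."""
    rows, seen = [], set()
    for item in json.loads(json_text):
        # 'domain' and 'urls' are hypothetical field names.
        for url in item.get("urls", []):
            key = (item.get("domain"), url)
            if key not in seen:
                seen.add(key)
                rows.append({"domain": key[0], "url": url})
    return rows


sample = '[{"domain": "a.example", "urls": ["https://a.example/1", "https://a.example/1"]}]'
print(crawl_rows(sample))
```

A list of row dicts like this can be passed directly to `pandas.DataFrame(rows)` to build the table.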

crawl

Not implemented yet. `!scrapy crawl digcon_crawler -O output.json --nolog`

search_merge

Merges Google searches and crawl results.

get_versions

Gets historical versions of Twitter-searched URLs using the Internet Archive's Wayback Machine. Attempts to find the version of the page archived closest in time to when it was tweeted.
Uses the requests package to resolve each URL to its "full" address rather than a shortened redirect (e.g. bit.ly/12312), which helps in scraping.
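
One way to look up the snapshot closest to a given time is the Wayback Machine's availability API, which accepts a `url` and a `timestamp` in YYYYMMDDhhmmss form. The sketch below only builds the request URL and parses a response; it makes no network call, and whether get_versions uses this particular endpoint is an assumption. The function names are hypothetical.

```python
import json
from datetime import datetime
from urllib.parse import urlencode

WAYBACK_API = "https://archive.org/wayback/available"


def availability_query(url, tweeted_at):
    """Build an availability-API request for the snapshot closest to `tweeted_at`."""
    params = {"url": url, "timestamp": tweeted_at.strftime("%Y%m%d%H%M%S")}
    return WAYBACK_API + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Return the closest snapshot's URL from an API response, or None."""
    closest = json.loads(response_text).get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


# Illustrative response body in the shape the API returns.
sample_response = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20200115000000/http://example.com/",
            "timestamp": "20200115000000",
            "status": "200",
        }
    }
})

print(availability_query("example.com", datetime(2020, 1, 15, 12, 30, 0)))
print(closest_snapshot(sample_response))
```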

initialize_scrape

Initializes the files needed to scrape URLs for their HTML.

scrape

Conducts the scrape of pages' HTML. Stores body text in a PostgreSQL database.

query

A set of methods for searching the PostgreSQL database of site text, including filtering empty results and counting specified search terms.
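
The two operations named above, filtering empty results and counting search terms, can be sketched on in-memory text as follows. This works on a plain dict rather than a PostgreSQL table, and the function names and whole-word, case-insensitive matching rule are assumptions made for illustration.

```python
import re


def count_terms(text, terms):
    """Count case-insensitive whole-word occurrences of each term in `text`."""
    counts = {}
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        counts[term] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


def count_nonempty(pages, terms):
    """Skip pages with no body text, then count terms in the rest."""
    return {
        url: count_terms(text, terms)
        for url, text in pages.items()
        if text and text.strip()
    }


pages = {"a": "Solar power and solar panels", "b": "  ", "c": None}
print(count_nonempty(pages, ["solar", "wind"]))
```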

ground_truth

Produces a sample of pages for verifying counts of terms.
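
A verification sample is easiest to audit when it can be re-drawn exactly. The sketch below shows one way to do that with a seeded random generator; the function name, the `seed` parameter, and the choice of simple random sampling are assumptions, not a description of how ground_truth selects pages.

```python
import random


def verification_sample(urls, k, seed=0):
    """Draw a reproducible random sample of up to `k` URLs for manual checking."""
    rng = random.Random(seed)  # fixed seed so the same sample can be re-drawn
    pool = sorted(set(urls))   # sort so the result does not depend on input order
    return rng.sample(pool, min(k, len(pool)))


urls = ["https://a.example/%d" % i for i in range(10)]
print(verification_sample(urls, 3))
```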

analyze_orgs

Calculates and visualizes averages and frequencies for each search term in the site text and summarizes by organization (domain).
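
Summarizing by organization amounts to grouping per-page counts by domain and averaging within each group. A minimal sketch, assuming a hypothetical input of (domain, count) pairs for a single search term:

```python
from collections import defaultdict
from statistics import mean


def average_by_domain(page_counts):
    """Average a term's per-page count within each domain.

    `page_counts` is a hypothetical list of (domain, count) pairs, one per page.
    """
    grouped = defaultdict(list)
    for domain, count in page_counts:
        grouped[domain].append(count)
    return {domain: mean(counts) for domain, counts in grouped.items()}


print(average_by_domain([("a.example", 2), ("a.example", 4), ("b.example", 1)]))
```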

analzye_term_correlations

Calculates and visualizes covariance metrics for specified search terms in the site text.
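
The description does not say which metrics are computed. As an assumption, the sketch below shows two common ones, sample covariance and Pearson correlation, applied to per-page counts of two terms. The function names are hypothetical.

```python
from math import sqrt


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations.

    Raises ZeroDivisionError if either sequence is constant.
    """
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


solar_counts = [1, 2, 3]  # counts of one term on three pages
wind_counts = [2, 4, 6]   # counts of another term on the same pages
print(covariance(solar_counts, wind_counts), pearson(solar_counts, wind_counts))
```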

co_occurrence

Returns the pages that use two or more of the specified search terms.
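
A co-occurrence filter of this kind could be sketched as below. The function name, the `minimum` parameter, and the use of case-insensitive substring matching are assumptions for illustration.

```python
def co_occurring_pages(pages, terms, minimum=2):
    """Return URLs whose text contains at least `minimum` of the given terms."""
    matches = []
    for url, text in pages.items():
        lowered = text.lower()
        # Substring matching, so 'wind' also matches 'window'.
        found = sum(1 for term in terms if term.lower() in lowered)
        if found >= minimum:
            matches.append(url)
    return matches


site_text = {
    "https://a.example/1": "Solar and wind power",
    "https://a.example/2": "Solar only",
}
print(co_occurring_pages(site_text, ["solar", "wind"]))
```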

TBD

  • Documentation :(
  • Convert modules to methods of data classes
  • Add crawl module

Download files

Source Distribution

web_observatory-1.2.1.tar.gz (14.2 kB)

Uploaded Source

Built Distribution

web_observatory-1.2.1-py3-none-any.whl (13.9 kB)

Uploaded Python 3

File details

Details for the file web_observatory-1.2.1.tar.gz.

File metadata

  • Download URL: web_observatory-1.2.1.tar.gz
  • Size: 14.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/4.0.2 CPython/3.11.7

File hashes

Hashes for web_observatory-1.2.1.tar.gz
Algorithm Hash digest
SHA256 951cee6bb3f7fd6921daf6e93094c4140477e44c975f846a38d566d7110961df
MD5 bb1b3bea64dcaef6e978b3d4acd76839
BLAKE2b-256 6218a8f2bbf34898008d41715306245f9bfda541d6d2312cc4cf12f9f96a000e

File details

Details for the file web_observatory-1.2.1-py3-none-any.whl.

File hashes

Hashes for web_observatory-1.2.1-py3-none-any.whl
Algorithm Hash digest
SHA256 a3a779e1f6dbc1d743717b37db6c1c82e12488fc96ba0a08d197526c24ffabc9
MD5 85e264be7a08f043605fd4fd279bc834
BLAKE2b-256 ab26a294cd3dc4d4bda3945cb64f4dfb7c6d2d88bdb5365a0a9fa624220eadf8
