observatory

Python package for collecting and analyzing webpages

See here for extended examples of observatory in use.

Modules

`start_project`

Initializes a project directory

`search_google`

Searches Google for terms. Google Custom Search Engine credentials required.
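As a rough illustration of what such a search request involves, the sketch below builds a request URL for Google's Custom Search JSON API. It is not observatory's own code: the helper name `build_search_url` and the placeholder credentials are hypothetical, and only the URL is constructed (no request is sent).

```python
from urllib.parse import urlencode

# Public endpoint of the Custom Search JSON API.
CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(term: str, api_key: str, engine_id: str, start: int = 1) -> str:
    """Build a Custom Search request URL for one page of results (hypothetical helper)."""
    params = {
        "key": api_key,   # API key for the Custom Search JSON API
        "cx": engine_id,  # ID of the Custom Search Engine
        "q": term,        # search term
        "start": start,   # 1-based index of the first result to return
    }
    return f"{CSE_ENDPOINT}?{urlencode(params)}"


# Placeholder credentials; substitute real ones before sending a request.
url = build_search_url("open data", api_key="YOUR_API_KEY", engine_id="YOUR_ENGINE_ID")
```

Passing a larger `start` value would request later pages of results for the same term.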

`google_process`

Compiles results from multiple Google searches.

`get_domains`

Extracts domain-level information from the URLs returned by Google searches (e.g. 'google' in www.google.com).
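To make the 'google' in www.google.com example concrete, here is a minimal sketch using only the standard library. The function name `domain_label` is hypothetical and this is not necessarily how observatory does it. The heuristic simply takes the second-to-last label of the hostname, so it will misread multi-part suffixes such as `.co.uk`; a public-suffix-aware library such as tldextract would be needed for those.

```python
from urllib.parse import urlsplit


def domain_label(url: str) -> str:
    """Return the second-to-last hostname label, e.g. 'google' for www.google.com."""
    # urlsplit only recognises the host when the URL has a '//' prefix.
    if "//" not in url:
        url = "//" + url
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    # Naive heuristic: assumes a single-label suffix such as .com or .org.
    return labels[-2] if len(labels) >= 2 else host


urls = [
    "https://www.google.com/search?q=python",
    "www.google.com",
    "http://example.org/about",
]
labels = [domain_label(u) for u in urls]
```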

`initialize_crawl`

Initializes a Scrapy crawl on a set of domains. Returns a JSON file of URLs found through the crawl.

`crawl_process`

Processes the JSON output of a crawl into a pandas DataFrame.

`crawl`

Not implemented yet. `!scrapy crawl digcon_crawler -O output.json --nolog`

`search_merge`

Merges Google searches and crawl results.

`get_versions`

~~Gets historical versions of Twitter-searched urls using the Internet Archive's Wayback Machine. Attempts to find the version of the page archived closest in time to when it was tweeted.~~
Uses the requests package to request each URL and recover its "full" address rather than a redirect or shortened link (e.g. bit.ly/12312). This helps in scraping.

`initialize_scrape`

Initializes the files needed to scrape URLs for their HTML.

`scrape`

Scrapes the pages' HTML and stores the body text in a PostgreSQL database.

`query`

A set of methods for searching the PostgreSQL database of site text, including filtering empty results and counting specified search terms.
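The two operations named above, filtering empty results and counting search terms, can be sketched in memory as follows. This is an illustration under assumptions: the `pages` dictionary stands in for rows fetched from the database, `count_terms` is a hypothetical helper, and matching is case-insensitive on whole words, which may differ from what the package does.

```python
import re


def count_terms(text: str, terms: list[str]) -> dict[str, int]:
    """Count case-insensitive, whole-word occurrences of each term in text."""
    return {
        term: len(re.findall(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE))
        for term in terms
    }


# Stand-in for (url, body text) rows; made-up data for illustration only.
pages = {
    "https://example.org/a": "Climate policy and climate data.",
    "https://example.org/b": "",
    "https://example.org/c": "Open data portal.",
}
terms = ["climate", "data"]

# Drop pages whose scraped text is empty or whitespace-only.
non_empty = {url: text for url, text in pages.items() if text.strip()}

# Count every search term on each remaining page.
counts = {url: count_terms(text, terms) for url, text in non_empty.items()}
```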

`ground_truth`

Produces a sample of pages for verifying counts of terms.
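One simple way to draw such a verification sample is a seeded random draw, so that the same pages can be re-checked later. The sketch below assumes a plain list of URLs; `sample_pages` and its `seed` parameter are hypothetical, not part of observatory's documented interface.

```python
import random


def sample_pages(urls, k, seed=0):
    """Return up to k URLs drawn without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)  # local generator, leaves the global random state alone
    urls = list(urls)
    return rng.sample(urls, min(k, len(urls)))


# Made-up URLs for illustration only.
urls = [f"https://example.org/page{i}" for i in range(10)]
sample = sample_pages(urls, 3, seed=42)
```

The sampled pages can then be read by hand and the observed term counts compared with the automated ones.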

`analyze_orgs`

Calculates and visualizes averages and frequencies for each search term in the site text and summarizes by organization (domain).
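The per-organization summary can be pictured as a group-by over per-page term counts. The following sketch covers only the calculation, not the visualization, and its input shape, the `summarize_by_org` helper, and the two summary statistics are assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean

# (organization, per-page term counts) rows; made-up data for illustration only.
page_counts = [
    ("example", {"climate": 2, "data": 1}),
    ("example", {"climate": 0, "data": 3}),
    ("sample", {"climate": 1, "data": 0}),
]


def summarize_by_org(rows, term):
    """Return the mean count of term and the share of pages mentioning it, per organization."""
    by_org = defaultdict(list)
    for org, counts in rows:
        by_org[org].append(counts.get(term, 0))
    return {
        org: {
            "mean_count": mean(values),
            "share_of_pages": sum(v > 0 for v in values) / len(values),
        }
        for org, values in by_org.items()
    }


summary = summarize_by_org(page_counts, "climate")
```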

`analzye_term_correlations`

Calculates and visualizes covariance metrics for specified search terms in the site text.
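The description does not say which metrics are used, so the sketch below assumes two common ones, sample covariance and the Pearson correlation coefficient, computed over per-page counts of two terms. The function names and the data are hypothetical.

```python
from math import sqrt


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    denom = sqrt(covariance(xs, xs) * covariance(ys, ys))
    return covariance(xs, ys) / denom if denom else float("nan")


# Per-page counts of two search terms; made-up data for illustration only.
climate_counts = [1, 2, 3, 4]
data_counts = [2, 4, 6, 8]

cov = covariance(climate_counts, data_counts)
corr = pearson(climate_counts, data_counts)
```

Because the second series is an exact multiple of the first, the correlation here comes out at (approximately) 1; real term counts would typically be far noisier.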

`co_occurrence`

Returns the pages that use two or more of the specified search terms.
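A minimal sketch of that selection, assuming per-page term counts are already available as a dictionary keyed by URL (the `pages_with_cooccurrence` helper and the `min_terms` parameter are hypothetical):

```python
def pages_with_cooccurrence(counts, terms, min_terms=2):
    """Return URLs of pages on which at least min_terms of the given terms appear."""
    return [
        url
        for url, page_counts in counts.items()
        if sum(page_counts.get(term, 0) > 0 for term in terms) >= min_terms
    ]


# Per-page term counts; made-up data for illustration only.
counts = {
    "https://example.org/a": {"climate": 2, "data": 1, "policy": 0},
    "https://example.org/b": {"climate": 0, "data": 4, "policy": 0},
    "https://example.org/c": {"climate": 1, "data": 0, "policy": 3},
}

matches = pages_with_cooccurrence(counts, ["climate", "data", "policy"])
```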

TBD

Documentation :(
Convert modules to methods of data classes
Add crawl module

Project details

This version: web-observatory 1.2.1 (Jan 18, 2024)

Download files

Source Distribution

web_observatory-1.2.1.tar.gz (14.2 kB)

Uploaded Jan 18, 2024 Source

Built Distribution

web_observatory-1.2.1-py3-none-any.whl (13.9 kB)

Uploaded Jan 18, 2024 Python 3

File details

Details for the file web_observatory-1.2.1.tar.gz.

File metadata

Download URL: web_observatory-1.2.1.tar.gz
Upload date: Jan 18, 2024
Size: 14.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/4.0.2 CPython/3.11.7

File hashes

Hashes for web_observatory-1.2.1.tar.gz
Algorithm	Hash digest
SHA256	`951cee6bb3f7fd6921daf6e93094c4140477e44c975f846a38d566d7110961df`
MD5	`bb1b3bea64dcaef6e978b3d4acd76839`
BLAKE2b-256	`6218a8f2bbf34898008d41715306245f9bfda541d6d2312cc4cf12f9f96a000e`

File details

Details for the file web_observatory-1.2.1-py3-none-any.whl.

File metadata

Download URL: web_observatory-1.2.1-py3-none-any.whl
Upload date: Jan 18, 2024
Size: 13.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/4.0.2 CPython/3.11.7

File hashes

Hashes for web_observatory-1.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a3a779e1f6dbc1d743717b37db6c1c82e12488fc96ba0a08d197526c24ffabc9`
MD5	`85e264be7a08f043605fd4fd279bc834`
BLAKE2b-256	`ab26a294cd3dc4d4bda3945cb64f4dfb7c6d2d88bdb5365a0a9fa624220eadf8`
