Python package for collecting and analyzing webpages
observatory
See here for extended examples of observatory in use.
Modules
start_project
Initializes a project directory
search_google
Searches Google for the specified terms. Requires Google Custom Search Engine credentials.
google_process
Compiles results from multiple Google searches.
get_domains
Extracts domain-level information from the URLs returned by Google searches (e.g. 'google' in www.google.com).
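As an illustration of the kind of extraction described above, here is a minimal sketch using only the standard library. The function name `domain_label` is hypothetical and is not taken from the package, and the heuristic is deliberately naive: it assumes the last label of the host is the suffix, which fails for multi-part suffixes such as co.uk.

```python
from urllib.parse import urlparse


def domain_label(url: str) -> str:
    """Return the domain-level label of a URL's host (naive heuristic)."""
    # urlparse only fills in the host when a scheme or a leading '//' is present.
    if "://" not in url:
        url = "//" + url
    host = urlparse(url).hostname or ""
    # Drop empty labels and a leading 'www' so only meaningful labels remain.
    labels = [part for part in host.split(".") if part and part != "www"]
    # Assumption: the final label is the suffix (com, org, ...), so the
    # domain-level label is the one just before it.
    if len(labels) >= 2:
        return labels[-2]
    return labels[0] if labels else ""


print(domain_label("www.google.com"))
```

A production version would more likely rely on a public-suffix list so that hosts such as `example.co.uk` resolve to `example`.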
initialize_crawl
Initializes a Scrapy crawl on a set of domains. Returns a JSON file of the URLs found through the crawl.
crawl_process
Processes the JSON output of a crawl into a pandas DataFrame.
crawl
Not implemented yet. `!scrapy crawl digcon_crawler -O output.json --nolog`
search_merge
Merges Google searches and crawl results.
get_versions
Gets historical versions of Twitter-searched URLs using the Internet Archive's Wayback Machine. Attempts to find the version of the page archived closest in time to when it was tweeted.
Uses the requests package to request each URL and get its "full" address rather than a redirect (e.g. bit.ly/12312). This helps in scraping.
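The "closest in time" lookup can be sketched against the Wayback Machine's public availability endpoint. This is an assumption about how such a lookup might be done, not a description of what get_versions does internally; the function names `availability_query` and `closest_snapshot` are hypothetical. The sketch only builds the query URL and reads a decoded JSON response, so it makes no network calls itself.

```python
from datetime import datetime
from urllib.parse import urlencode

# Public availability endpoint of the Wayback Machine.
WAYBACK_AVAILABILITY = "https://archive.org/wayback/available"


def availability_query(url: str, tweeted_at: datetime) -> str:
    """Build a query asking for the snapshot closest to tweeted_at."""
    params = {
        "url": url,
        # The endpoint expects a YYYYMMDDhhmmss timestamp.
        "timestamp": tweeted_at.strftime("%Y%m%d%H%M%S"),
    }
    return WAYBACK_AVAILABILITY + "?" + urlencode(params)


def closest_snapshot(payload: dict):
    """Return (snapshot_url, timestamp) from a decoded response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"], closest["timestamp"]


print(availability_query("example.com", datetime(2020, 1, 2, 3, 4, 5)))
```

For the redirect step, one common approach with requests is `requests.head(url, allow_redirects=True).url`, which returns the final address after following redirects. Whether the package uses HEAD or GET requests is not stated here.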
initialize_scrape
Initializes the files needed to scrape URLs for their HTML.
scrape
Scrapes the pages' HTML and stores the body text in a PostgreSQL database.
query
A set of methods for searching the PostgreSQL database of site text, including filtering empty results and counting specified search terms.
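The two operations named above, filtering empty results and counting terms, can be illustrated in memory without a database. This is only a sketch under assumptions: the names `count_terms` and `non_empty` are hypothetical, rows are assumed to be `(url, text)` pairs, and matching is assumed to be case-insensitive on whole words, which may differ from the package's own rules.

```python
import re


def count_terms(text: str, terms: list[str]) -> dict[str, int]:
    """Count case-insensitive whole-word occurrences of each term in text."""
    counts = {}
    for term in terms:
        # re.escape keeps punctuation inside a term from acting as regex syntax.
        pattern = r"\b" + re.escape(term) + r"\b"
        counts[term] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


def non_empty(rows: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Drop (url, text) rows whose text is missing or only whitespace."""
    return [(url, text) for url, text in rows if text and text.strip()]


rows = non_empty([("a.org", "Data privacy and data rights."), ("b.org", "  ")])
for url, text in rows:
    print(url, count_terms(text, ["data", "privacy"]))
```

Whole-word matching means "Database" does not count as a hit for "data"; a substring match would be the looser alternative.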
ground_truth
Produces a sample of pages for verifying counts of terms.
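A reproducible sample for manual checking might be drawn as follows. This is a sketch, not the package's method: `sample_pages` is a hypothetical name, and simple random sampling with a fixed seed is an assumption about how the sample is chosen.

```python
import random


def sample_pages(urls, k, seed=0):
    """Return up to k distinct URLs, reproducibly for a given seed."""
    # Deduplicate and sort so the result does not depend on input order.
    unique = sorted(set(urls))
    rng = random.Random(seed)
    return rng.sample(unique, min(k, len(unique)))


pages = ["a.org/1", "a.org/2", "b.org/1", "b.org/2", "c.org/1"]
print(sample_pages(pages, 3, seed=42))
```

Fixing the seed lets a second reviewer regenerate the same pages when verifying term counts by hand.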
analyze_orgs
Calculates and visualizes averages and frequencies for each search term in the site text and summarizes by organization (domain).
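The per-organization summary for a single search term can be sketched with the standard library. The function name `summarize_by_org`, the input shape (one `(domain, count)` pair per page), and the reading of "frequency" as the share of pages with at least one hit are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_org(page_counts):
    """Summarize per-page term counts by domain.

    page_counts: iterable of (domain, count) pairs, one pair per page.
    """
    grouped = defaultdict(list)
    for domain, count in page_counts:
        grouped[domain].append(count)
    return {
        domain: {
            "pages": len(counts),
            "mean": mean(counts),
            # Share of the domain's pages that mention the term at least once.
            "share_with_term": sum(1 for n in counts if n > 0) / len(counts),
        }
        for domain, counts in grouped.items()
    }


print(summarize_by_org([("a.org", 0), ("a.org", 4), ("b.org", 3)]))
```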
analzye_term_correlations
Calculates and visualizes covariance metrics for specified search terms in the site text.
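For two terms counted across the same pages, the basic metrics can be computed as below. This is a sketch assuming sample covariance (dividing by n - 1) and the Pearson correlation; which metrics the module actually reports is not stated here, and the function names are hypothetical.

```python
from math import sqrt


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences of counts."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    sd_x, sd_y = sqrt(covariance(xs, xs)), sqrt(covariance(ys, ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance(xs, ys) / (sd_x * sd_y)


# Per-page counts of two terms, in the same page order.
term_a = [1, 2, 3]
term_b = [2, 4, 6]
print(covariance(term_a, term_b), pearson(term_a, term_b))
```

Covariance depends on the scale of the counts, so the correlation is usually the easier number to compare across term pairs.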
co_occurrence
Returns the pages that use two or more of the specified search terms.
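The co-occurrence filter can be sketched as a pass over per-page term counts. The name `co_occurring_pages` and the input shape (a dict mapping each URL to its term counts) are assumptions made for illustration.

```python
def co_occurring_pages(term_counts, terms, minimum=2):
    """Return (url, present_terms) for pages using at least `minimum` terms.

    term_counts: dict mapping url -> dict mapping term -> count on that page.
    """
    hits = []
    for url, counts in term_counts.items():
        # A term is present when it appears at least once on the page.
        present = [term for term in terms if counts.get(term, 0) > 0]
        if len(present) >= minimum:
            hits.append((url, present))
    return hits


pages = {
    "a.org/1": {"data": 3, "privacy": 1},
    "a.org/2": {"data": 2, "privacy": 0},
    "b.org/1": {"privacy": 5},
}
print(co_occurring_pages(pages, ["data", "privacy"]))
```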
TBD
- Documentation :(
- Convert modules to methods of data classes
- Add crawl module