# aswan

Data collection manager: collects and organizes data into a T1 data lake and T2 tables. Named after the Aswan Dam.
## Quickstart
```python
import aswan

config = aswan.AswanConfig.default_from_dir("imdb-env")
celeb_table = config.get_prod_table("person")
movie_table = config.get_prod_table("movie")
project = aswan.Project(config)  # this creates the env directories by default


@project.register_handler
class CelebHandler(aswan.UrlHandler):
    url_root = "https://www.imdb.com"

    def parse_soup(self, soup):
        return {
            "name": soup.find("h1").find("span").text.strip(),
            "dob": soup.find("div", id="name-born-info").find("time")["datetime"],
        }


@project.register_handler
class MovieHandler(aswan.UrlHandler):
    url_root = "https://www.imdb.com"

    def parse_soup(self, soup):
        cast_cells = soup.find("table", class_="cast_list").find_all(
            "td", class_="primary_photo"
        )
        for cast in cast_cells[:3]:
            link = cast.find("a")["href"]
            self.register_link_to_handler(link, CelebHandler)
        return {
            "title": soup.find("title").text.replace(" - IMDb", "").strip(),
            "summary": soup.find("div", class_="summary_text").text.strip(),
            "year": int(soup.find("span", id="titleYear").find("a").text),
        }
```
```python
# all this registering can be done simpler :)
project.register_t2_table(celeb_table)
project.register_t2_table(movie_table)


@project.register_t2_integrator
class MovieIntegrator(aswan.FlexibleDfParser):
    handlers = [MovieHandler]

    def get_t2_table(self):
        return movie_table


@project.register_t2_integrator
class CelebIntegrator(aswan.FlexibleDfParser):
    handlers = [CelebHandler]

    def get_t2_table(self):
        return celeb_table


def add_init_urls():
    movie_urls = [
        "https://www.imdb.com/title/tt1045772",
        "https://www.imdb.com/title/tt2543164",
    ]
    person_urls = ["https://www.imdb.com/name/nm0000190"]
    project.add_urls_to_handler(MovieHandler, movie_urls)
    project.add_urls_to_handler(CelebHandler, person_urls)


add_init_urls()

project.run(with_monitor_process=True)
```
```
2021-05-09 22:13.42 [info ] running function reset_surls env=prod function_batch=run_prep
...
2021-05-09 22:13.45 [info ] ray dashboard: http://127.0.0.1:8266
...
2021-05-09 22:13.45 [info ] monitor app at: http://localhost:6969
...
```
```python
movie_table.get_full_df()
```

|   | title | summary | year |
|---|---|---|---|
| 0 | I Love You Phillip Morris (2009) | A cop turns con man once he comes out of the c... | 2009 |
| 0 | Arrival (2016) | A linguist works with the military to communic... | 2016 |
```python
celeb_table.get_full_df()
```

|   | name | dob |
|---|---|---|
| 0 | Matthew McConaughey | 1969-11-4 |
| 0 | Leslie Mann | 1972-3-26 |
| 0 | Jeremy Renner | 1971-1-7 |
| 0 | Forest Whitaker | 1961-7-15 |
| 0 | Jim Carrey | 1962-1-17 |
| 0 | Amy Adams | 1974-8-20 |
| 0 | Ewan McGregor | 1971-3-31 |
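The frames returned by `get_full_df` are ordinary pandas DataFrames, so T2 tables slot straight into pandas workflows. A minimal sketch, assuming a hand-made slice of the celeb table above rather than data produced by an actual run:

```python
import pandas as pd

# Hypothetical sample mirroring two rows of the celeb table above
celeb_df = pd.DataFrame(
    {
        "name": ["Amy Adams", "Jeremy Renner"],
        "dob": ["1974-8-20", "1971-1-7"],
    }
)

# derive a birth-year column from the dob string for quick filtering
celeb_df["birth_year"] = celeb_df["dob"].str.split("-").str[0].astype(int)
born_before_1972 = celeb_df.loc[celeb_df["birth_year"] < 1972, "name"].tolist()
print(born_before_1972)  # ['Jeremy Renner']
```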
## Pre v0.0.0 laundry list

A few things will probably need to be separated out of the package:

- t2extractor
- scheduler
## TODO

- cleanup requirements
- s3, scp for push/pull
- selective push / pull
  - with possible nuking of remote archive
- cleaning local obj store (when envs blow up, ide dies)
- parsing/connection error confusion
  - also broken session thing
- conn session cpu requirement
- resource limits
- transferring / ignoring cookies
- lots of things with extractors
- template projects
  - oddsportal
    - updating thingy, based on latest match in season
  - footy
  - rotten
  - boxoffice
- oddsportal