# pymtattl

## Introduction

Download and store MTA Turnstile Data

Automate downloading turnstile entry/exit data from the MTA website. Save the data as text files, or write it directly to a SQLite or Postgres database. You can also restrict the download to a date range; the earliest files date back to 2010.

MTA Turnstile Data: http://web.mta.info/developers/turnstile.html


## Table of Contents

* [Installation](#installation)

* [Download Methods](#download-methods)

* [Text Files](#text-files)

* [SQLite Database](#sqlite-database)

* [Postgres Database](#postgres-database)

* [Caveats](#caveats)

* [To-Do](#to-do)

## Installation

    pip install pymtattl

## Requirements

* Written for Python 3! Feel free to test and contribute using Python 2!
* Requires bs4, pandas, psycopg2

## Download Methods

### Text Files

BaseDownloader: Download requested data as separate **text files**

    from pymtattl.download import BaseDownloader
    base_downloader = BaseDownloader(start=141018, end=None)
    dat_dir = base_downloader.download_to_txt(path='data', keep_urls=False)

* `start/end`: *integer or None*
  - Define the date range of data files to pull *(test with a small date range first, as downloading all files can be slow)*
  - Example (yymmdd) for 2014-10-18: `141018`

* `path`: *string*
  - An existing directory in which to save the downloaded data files
  - Can also be an empty string (to save under the current working directory) or a new folder name (e.g. 'data')

* `keep_urls`: *boolean*
  - If true, the retrieved URLs are saved to **data_urls.txt** under the provided directory

* Returns the data folder directory
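The `yymmdd` integers for `start` and `end` are easy to mistype by hand. A minimal sketch of building one from a `datetime.date`; the helper name `to_yymmdd` is hypothetical and not part of pymtattl:

```python
from datetime import date


def to_yymmdd(d: date) -> int:
    """Encode a date as a yymmdd integer, e.g. 2014-10-18 -> 141018."""
    # Hypothetical helper for illustration; pymtattl only needs the integer.
    return int(d.strftime("%y%m%d"))


start = to_yymmdd(date(2014, 10, 18))
print(start)  # 141018
```

The resulting value can then be passed as `start` or `end` to any of the downloaders.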

### SQLite Database

SqliteDownloader: Reformat data either from **local path** or directly downloaded from MTA website and save in a SQLite database

    from pymtattl.download import SqliteDownloader
    # provide database parameters
    pm = {'path': 'test',
          'dbname': 'testdb'}
    sqlite_downloader = SqliteDownloader(start=141018, end=None, dbparms=pm)
    # download data files and save to sqlite db
    sqlite_downloader.download_to_db(path='data', update=False)
    # write name_keys file to db
    sqlite_downloader.init_namekeys(path='data', update=False)

* Creates (if it does not exist) a SQLite database **testdb.db** under **~/test/** with 3 tables

  - **turnstile**: holds turnstile data
  - **name_keys**: a matching table to look up the station name given remote and booth
  - **file_names**: names of data files that are already in the **turnstile** table

* `start/end`: *integer or None*
  - Define the date range of data files to pull *(test with a small date range first, as downloading all files can be slow)*
  - Example (yymmdd) for 2014-10-18: `141018`

* `dbparms`: *dict*
  - `path`: path in which to create or find an existing SQLite database file
  - `dbname`: name of the database file to create, or to write to if it exists

* `path`: *string*
  - Local data folder path, if the data is already downloaded
  - Specify an existing directory or a new folder name to store downloaded text files
  - If there are no local data files, you can also choose to read directly from the MTA website and write to the database

* Returns the data folder directory
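After a run, it can be useful to confirm that the three tables exist. The sketch below uses only the standard-library `sqlite3` module and the built-in `sqlite_master` catalog. It runs against an in-memory database with empty stand-in tables, since the real column layout is not documented here; `list_tables` is a hypothetical helper, not a pymtattl function.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in a SQLite database, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Stand-in database: empty placeholder tables with the names listed above.
# The real tables created by pymtattl have their own columns.
conn = sqlite3.connect(":memory:")
for table in ("turnstile", "name_keys", "file_names"):
    conn.execute(f"CREATE TABLE {table} (placeholder TEXT)")

tables = list_tables(conn)
print(tables)  # ['file_names', 'name_keys', 'turnstile']
conn.close()
```

To inspect a real database, replace `":memory:"` with the path to the database file that the downloader created and drop the `CREATE TABLE` loop.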

### Postgres Database

PostgresDownloader: Reformat data either from **local path** or directly downloaded from MTA website and save in a Postgres database

    from pymtattl.download import PostgresDownloader
    # provide database parameters
    pm = {'dbname': '',
          'user': 'a',
          'password': 'b',
          'host': 'localhost',
          'port': '5432'}
    postgres_downloader = PostgresDownloader(start=141018, end=None, dbparms=pm)
    # download data files and save to postgres db
    postgres_downloader.download_to_db(path='data', update=False)
    # write name_keys file to db
    postgres_downloader.init_namekeys(path='data', update=False)

* Creates (if it does not exist) a Postgres database with 3 tables

  - **turnstile**: holds turnstile data
  - **name_keys**: a matching table to look up the station name given remote and booth
  - **file_names**: names of data files that are already in the **turnstile** table

* `start/end`: *integer or None*
  - Define the date range of data files to pull *(test with a small date range first, as downloading all files can be slow)*
  - Example (yymmdd) for 2014-10-18: `141018`

* `dbparms`: *dict*
  - `dbname`: name of the database to connect to; if it is an empty string or removed from the dict, you will be prompted for the name of a new database to **create**
  - `user`|`password`|`host`|`port`: parameters to connect to the Postgres instance

* `path`: *string*
  - Local data folder path, if the data is already downloaded
  - Specify an existing directory or a new folder name to store downloaded text files
  - If there are no local data files, you can also choose to read directly from the MTA website and write to the database
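To keep credentials out of scripts, the connection dict can be assembled from environment variables. This is a sketch only: `dbparms_from_env` is a hypothetical helper, and the use of the standard libpq variable names (`PGDATABASE`, `PGUSER`, `PGPASSWORD`, `PGHOST`, `PGPORT`) is an assumption, not something pymtattl reads on its own. The dict keys match the ones described above.

```python
import os


def dbparms_from_env(env=os.environ):
    """Build a dbparms dict from environment variables (hypothetical helper)."""
    return {
        # An empty dbname makes pymtattl prompt for a new database name.
        "dbname": env.get("PGDATABASE", ""),
        "user": env.get("PGUSER", ""),
        "password": env.get("PGPASSWORD", ""),
        "host": env.get("PGHOST", "localhost"),
        "port": env.get("PGPORT", "5432"),
    }


# Demo with an explicit mapping instead of the real environment.
pm = dbparms_from_env({"PGUSER": "a", "PGPASSWORD": "b"})
print(sorted(pm))  # ['dbname', 'host', 'password', 'port', 'user']
```

Calling `dbparms_from_env()` with no argument reads the real environment, and the result can be passed as `dbparms` to `PostgresDownloader`.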

## Caveats

* There are some known data issues; the affected rows are skipped while building the database

  - In Turnstile_120428.txt, one line has an empty ('') exit number
  - In Turnstile_120714.txt, the first few lines could not be parsed
  - It seems that date strings were recently reformatted to `mm/dd/yyyy` (03/20/2018)
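If you parse the text files yourself, the same issues need handling. The sketch below tries more than one date format and treats an unparseable counter as missing so the row can be skipped. The older `mm-dd-yy` format is an assumption about the earlier files, and both helper names are hypothetical; this is not how pymtattl itself is implemented.

```python
from datetime import datetime

# Newer files use mm/dd/yyyy (per the caveat above).
# The second format is an ASSUMPTION about older files.
DATE_FORMATS = ("%m/%d/%Y", "%m-%d-%y")


def parse_date(text):
    """Return a date for the first matching format, or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def parse_count(text):
    """Return an int counter value, or None for empty or malformed input."""
    try:
        return int(text.strip())
    except ValueError:
        return None


print(parse_date("03/20/2018"))  # 2018-03-20
print(parse_count(""))           # None
```

A caller can skip any row for which either helper returns `None`.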

## To-Do

* De-cumulate entry and exit numbers, and store data within selected date range into a new table

* A summary table (e.g. number of booths per station, average daily station entries/exits, ...) for the "cleaned" data table above

* More to come...
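As an illustration of the de-cumulation item, the turnstile counters are running totals, so per-interval counts come from differencing consecutive readings. A minimal sketch with made-up numbers; `decumulate` is a hypothetical helper and is not implemented in pymtattl:

```python
def decumulate(readings):
    """Turn a list of cumulative counter readings into per-interval counts."""
    return [curr - prev for prev, curr in zip(readings, readings[1:])]


# Made-up cumulative entry readings for one turnstile, in time order.
cumulative_entries = [1000, 1012, 1040, 1040, 1075]
per_interval = decumulate(cumulative_entries)
print(per_interval)  # [12, 28, 0, 35]
```

A counter that resets would produce a negative difference here, so a real implementation would need to detect and handle that case.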
