Production-ready downloader for Chicago Crime data (Socrata/SoQL) with resumable chunking and modular architecture.
📊 Chicago Crime Downloader — Command-Line Guide
🚀 Overview
The Chicago Crime Downloader is a production-ready, resumable command-line tool that fetches open crime data directly from the City of Chicago Open Data API (dataset ijzp-q8t2).
It improves on manual downloads and Kaggle dumps by providing automatic retries, structured manifests, and deterministic partitioning (daily, weekly, monthly), all from the command line.
Unlike typical one-shot CSV downloads, this tool is:
- ✅ Resumable — restarts exactly where it left off.
- 🧩 Modular — works in daily, weekly, or monthly windows.
- 🧠 Smart — includes preflight checks, structured logs, and JSON manifests.
- ⚙️ Configurable — supports CSV or Parquet, user agents, and API tokens.
- 🧱 Reproducible — every file has a checksum and metadata manifest.
🧑‍💻 Installation
1️⃣ Requirements
- Python 3.11+
- pip (latest)
- Optional: a Parquet engine (pyarrow or fastparquet)
2️⃣ Clone and install
CLI to download Chicago Crime data from Socrata with resumable chunking, manifests, and flexible layouts.
git clone https://github.com/<yourusername>/chicago-crime-downloader.git
cd chicago-crime-downloader
pip install -e .
This installs the console command:
chicago-crime-dl
or you can still run it directly as:
python data/download_data_v5.py
⚡ Quick Start
Example: Download a single day (CSV)
chicago-crime-dl --mode daily --start-date 2020-01-10 --end-date 2020-01-10 --out-root data/raw_daily --out-format csv
Output:
data/raw_daily/daily/2020-01-10/2020-01-10_chunk_0001.csv
data/raw_daily/daily/2020-01-10/2020-01-10_chunk_0001.manifest.json
🧭 Command-Line Reference
Basic Syntax
chicago-crime-dl [OPTIONS]
or
python data/download_data_v5.py [OPTIONS]
Key Options
| Option | Description | Example |
|---|---|---|
| --mode | One of full, monthly, weekly, or daily. | --mode daily |
| --start-date, --end-date | Restrict downloads to a date range (YYYY-MM-DD). | --start-date 2020-01-01 --end-date 2020-01-31 |
| --chunk-size | Number of rows per request (default: 50,000). | --chunk-size 100000 |
| --max-chunks | Limit chunks in one run (useful for testing). | --max-chunks 5 |
| --out-root | Output directory. | --out-root data/raw_daily |
| --out-format | csv or parquet. | --out-format parquet |
| --select | Comma-separated list of columns. | --select id,date,primary_type,latitude,longitude |
| --columns-file | Path to file listing columns (one per line). | --columns-file columns.txt |
| --layout | Directory layout: nested, mode-flat, flat, or ymd. | --layout nested |
| --preflight | Skips days with 0 rows (uses API count(1) precheck). | --preflight |
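To make the --preflight idea concrete, here is a minimal sketch of how a count(1) precheck for one day could be expressed as a SoQL query. It is not taken from the tool's source: the function names, the half-open time window, and the use of `date` as the timestamp column are assumptions for illustration. The endpoint is the one shown in the manifest example of this guide.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

# Endpoint as it appears in the manifest example of this guide.
ENDPOINT = "https://data.cityofchicago.org/resource/ijzp-q8t2.json"


def count_params(day: date, date_column: str = "date") -> dict[str, str]:
    """SoQL parameters asking only for the number of rows on one day.

    The column name and the half-open [day, day + 1) window are assumptions.
    """
    next_day = day + timedelta(days=1)
    where = (
        f"{date_column} >= '{day.isoformat()}T00:00:00' "
        f"AND {date_column} < '{next_day.isoformat()}T00:00:00'"
    )
    return {"$select": "count(1)", "$where": where}


def count_url(day: date) -> str:
    """Full request URL for the precheck (hypothetical helper)."""
    return f"{ENDPOINT}?{urlencode(count_params(day))}"


print(count_url(date(2020, 1, 10)))
```

A day whose count comes back as zero could then be skipped before any output folder is created.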
🗂️ Layout Options
| Layout | Example Output |
|---|---|
| nested (default) | data/raw_daily/daily/2020-01-10/2020-01-10_chunk_0001.csv |
| mode-flat | data/raw_daily/2020-01-10_chunk_0001.csv |
| flat | data/raw_daily_daily_2020-01-10_chunk_0001.csv |
| ymd | data/raw_daily/daily/2020/01/10/2020-01-10_chunk_0001.csv |
Automatic inference:
- If --out-root ends with the mode name (raw_daily → daily), uses mode-flat.
- Else defaults to nested.
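The four layouts in the table can be reproduced with a small path-building function. The sketch below is only an illustration derived from the example paths in the table. The function name and signature are hypothetical and the tool's real implementation may differ.

```python
from pathlib import PurePosixPath


def chunk_path(out_root: str, mode: str, day: str, chunk: int,
               layout: str = "nested", ext: str = "csv") -> str:
    """Return the output path for one chunk (hypothetical helper).

    `day` is a YYYY-MM-DD string; the file name pattern follows the
    examples in the layout table.
    """
    name = f"{day}_chunk_{chunk:04d}.{ext}"
    root = PurePosixPath(out_root)
    if layout == "nested":
        return str(root / mode / day / name)
    if layout == "mode-flat":
        return str(root / name)
    if layout == "flat":
        # The table shows root, mode, and file name joined by underscores.
        return f"{out_root}_{mode}_{name}"
    if layout == "ymd":
        year, month, dom = day.split("-")
        return str(root / mode / year / month / dom / name)
    raise ValueError(f"unknown layout: {layout}")


for layout in ("nested", "mode-flat", "flat", "ymd"):
    print(layout, chunk_path("data/raw_daily", "daily", "2020-01-10", 1, layout))
```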
🔐 API Tokens
For higher rate limits, export a Socrata token:
export SOC_APP_TOKEN="YOUR_APP_TOKEN"
# or
export SOCRATA_APP_TOKEN="YOUR_APP_TOKEN"
Without a token, the downloader still works, but with limited speed.
🧾 Output Manifest Example
Each data file has a sidecar manifest with metadata:
{
  "data_file": "2020-01-10_chunk_0001.csv",
  "rows": 1024,
  "sha256": "eb1a62d0...",
  "params": {"$limit": "50000", "$offset": "0"},
  "started_at": "2025-11-09T02:31:30",
  "duration_seconds": 1.42,
  "endpoint": "https://data.cityofchicago.org/resource/ijzp-q8t2.json",
  "version": 5
}
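A manifest like this makes it possible to check a downloaded chunk independently of the tool. The following sketch assumes that `sha256` holds the full hex digest of the data file's raw bytes and that `data_file` is relative to the manifest's own directory. Both are assumptions based on the example above, and the function names are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, block_size: int = 1 << 20) -> str:
    """Hash a file in blocks so large chunks are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_chunk(manifest_path: Path) -> bool:
    """Return True when the data file matches the checksum in its manifest."""
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    # Assumption: the data file sits next to its sidecar manifest.
    data_path = manifest_path.parent / manifest["data_file"]
    return sha256_of(data_path) == manifest["sha256"]
```

Calling `verify_chunk(Path("data/raw_daily/daily/2020-01-10/2020-01-10_chunk_0001.manifest.json"))` would then report whether that chunk is intact.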
🧩 Advanced Examples
1️⃣ Monthly mode
chicago-crime-dl --mode monthly --start-date 2020-01-01 --end-date 2020-12-31 --out-root data/raw_monthly
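As an illustration of what deterministic monthly partitioning can look like, the sketch below splits a date range into calendar-month windows clipped to the requested start and end. How the tool actually draws its window boundaries is not documented here, so treat the function and its clipping behavior as assumptions.

```python
from datetime import date, timedelta


def monthly_windows(start: date, end: date):
    """Yield (first_day, last_day) per calendar month, clipped to [start, end]."""
    cursor = start
    while cursor <= end:
        # First day of the month that follows the cursor.
        if cursor.month == 12:
            next_month = date(cursor.year + 1, 1, 1)
        else:
            next_month = date(cursor.year, cursor.month + 1, 1)
        yield cursor, min(end, next_month - timedelta(days=1))
        cursor = next_month


windows = list(monthly_windows(date(2020, 1, 1), date(2020, 12, 31)))
print(len(windows), windows[1])
```

Because the windows depend only on the start and end dates, rerunning the same command always produces the same partition.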
2️⃣ Weekly mode
chicago-crime-dl --mode weekly --start-date 2020-01-01 --end-date 2020-03-31 --out-root data/raw_weekly
3️⃣ Full historical data
chicago-crime-dl --mode full --out-root data/raw_full --out-format parquet
4️⃣ Resume after interruption
chicago-crime-dl --mode daily --start-date 2020-01-01 --end-date 2020-01-05 --out-root data/raw_daily
Rerunning the same command resumes automatically: chunks that already exist on disk are skipped.
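One plausible way to decide whether a chunk can be skipped on resume is to require both the data file and its sidecar manifest. This is a sketch under that assumption, since the tool's actual rule is not spelled out in this guide. The helper name is hypothetical.

```python
from pathlib import Path


def is_complete(data_file: Path) -> bool:
    """Treat a chunk as done only if the data file and its manifest both exist.

    Requiring the manifest is an assumption: it guards against a data file
    left behind by a run that was interrupted before finishing.
    """
    manifest = data_file.with_name(data_file.stem + ".manifest.json")
    return data_file.is_file() and manifest.is_file()
```

A download loop could call `is_complete(path)` before each request and move on to the next chunk when it returns True.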
5️⃣ Select only specific columns
chicago-crime-dl --mode daily --start-date 2020-02-01 --end-date 2020-02-01 --select id,date,primary_type,latitude,longitude
🧠 Why Use This Tool Instead of Manual Downloads?
| Feature | Manual CSV Download | Kaggle Dataset | This CLI Tool |
|---|---|---|---|
| Up-to-date | ❌ Static | ❌ Often outdated | ✅ Always current (direct API) |
| Resumable | ❌ No | ❌ No | ✅ Yes |
| Incremental | ❌ No | ❌ No | ✅ Daily / Weekly / Monthly windows |
| Custom Columns | ❌ No | ✅ Somewhat | ✅ Full SoQL $select support |
| Parallelization | ❌ Manual | ❌ Manual | ✅ Built-in window logic |
| Logging | ❌ None | ✅ Some | ✅ Full structured logs + manifests |
| Robustness | ❌ Fragile | ⚠️ | ✅ Retries + backoff + token auth |
| Integration | ❌ | ❌ | ✅ Perfect for ETL / Airflow / Kubeflow / ML pipelines |
This makes it ideal for data science pipelines, ETL automation, and reproducible analysis.
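The "Retries + backoff" entry in the comparison refers to a common pattern: wait a little longer after each failed request before trying again. The sketch below shows a generic form of exponential backoff with jitter. The exception class, function names, and default values are hypothetical and are not taken from the tool.

```python
import random
import time
from collections.abc import Callable


class RetryableError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. HTTP 429)."""


def backoff_delays(retries: int = 5, base: float = 1.0, cap: float = 60.0) -> list[float]:
    """Delays that double on each attempt, limited to `cap` seconds."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


def call_with_retries(fetch: Callable[[], object], retries: int = 5,
                      base: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> object:
    """Call `fetch`, sleeping with exponential backoff between failures."""
    for delay in backoff_delays(retries, base):
        try:
            return fetch()
        except RetryableError:
            # Jitter spreads out retries from concurrent clients.
            sleep(delay + random.uniform(0, delay / 2))
    return fetch()  # final attempt; any error now propagates to the caller


print(backoff_delays(4))
```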
🛠️ Troubleshooting
| Issue | Fix |
|---|---|
| 429 Too Many Requests | Tool waits and retries automatically (exponential backoff). |
| Empty folders | Enable --preflight to skip days with zero data. |
| Date format error | Use YYYY-MM-DD; tool will auto-fix invalid days (e.g. April 31 → April 30). |
| Parquet not written | Install an engine: pip install pyarrow or pip install fastparquet. |
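The date auto-fix mentioned in the table (April 31 → April 30) can be pictured as clamping the day to the last valid day of its month. The sketch below illustrates that idea with the standard library. The function names are hypothetical and the tool's own handling may differ in detail.

```python
import calendar
from datetime import date


def clamp_date(year: int, month: int, day: int) -> date:
    """Clamp `day` into the valid range for the given month."""
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(max(day, 1), last_day))


def parse_lenient(text: str) -> date:
    """Parse YYYY-MM-DD, correcting an out-of-range day instead of failing."""
    year, month, day = (int(part) for part in text.split("-"))
    return clamp_date(year, month, day)


print(parse_lenient("2020-04-31"))
```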
✅ Best Practices
- Always use an API token for stable throughput.
- Keep logs (--log-file) and manifests for reproducibility.
- For production, prefer the mode-flat layout for easier orchestration.
- Run tests regularly:
pytest -m unit -q
pytest -m integration -q
Author: Habib Bayo
License: MIT
Version: 5.0
Repository: https://github.com//chicago-crime-downloader