← Back to Projects
Data Engineering
CryptoPulse
Crypto Market ETL Pipeline + Dashboard

Problem
Crypto price data is public and fast-moving, but raw API responses are messy and unnormalised. Most people look at a chart and see a number — a data engineer sees a pipeline problem: how do you reliably ingest, clean, transform, and serve that data so it's always fresh and queryable?
Solution
- 1Ingest — Python fetches top-100 coin prices from CoinGecko's free API, validates each row with a pydantic model, writes a timestamped Parquet snapshot.
- 2Store — An idempotent loader appends the snapshot into DuckDB; duplicate timestamps are silently ignored via
INSERT OR IGNOREon a composite primary key. - 3Transform — dbt rebuilds a staging view and two mart tables (daily returns leaderboard, top gainers/losers), then runs 10 data-quality tests.
- 4Commit back — GitHub Actions commits the updated DuckDB file to the repo with
[skip ci]so the next cron run picks up fresh historical data. - 5Serve — A Streamlit dashboard reads from the mart tables and shows a movers section, a coin price history chart, and a full top-100 leaderboard.
Tech Stack
| Ingestion | Python · httpx · pydantic |
| Storage | DuckDB (file-based, committed to repo) |
| Transformation | dbt-duckdb (staging → marts pattern) |
| Orchestration | GitHub Actions schedule: cron (every 6h) |
| Dashboard | Streamlit on Streamlit Community Cloud |
| Tests | dbt unique · not_null · accepted_values |
| Docs | dbt docs site hosted on GitHub Pages |
Architecture
GitHub Actions (cron 0 */6 * * *) │ ├─ ingest.py ──► data/raw/prices_<UTC>.parquet │ │ ├─ load.py ──────────────► cryptopulse.duckdb │ │ │ ┌────────────────────┤ raw_prices (table) │ │ │ ├─ dbt run ────┤ stg_prices (view) │ │ │ │ ├── mart_daily_returns (table) │ └── mart_top_movers (table) │ └─ git commit ──► pushes cryptopulse.duckdb back [skip ci] Streamlit app reads marts → live public dashboard
Selected Lessons
- ›dbt staging → marts pattern. Separating raw-to-clean from clean-to-business-logic means one place to fix schema changes, not every downstream model.
- ›Idempotency by design. INSERT OR IGNORE on (id, ingested_at) makes the loader safe to re-run; the same lesson applies to every ETL job.
- ›Jinja in SQL comments is parsed. {{ ref(...) }} inside a comment creates a real dbt dependency and a false cycle error; always use plain text in comments.
- ›git pull --rebase needs a clean tree. In CI: commit first, then rebase on the remote, then push. Not the other way around.
- ›Windows zoneinfo. Python's zoneinfo module needs the tzdata PyPI package on Windows (Linux gets it from the OS); catching this early saved a broken GitHub Actions run.