Data Engineering

CryptoPulse

Crypto Market ETL Pipeline + Dashboard

Problem

Crypto price data is public and fast-moving, but raw API responses are messy and unnormalised. Most people look at a chart and see a number — a data engineer sees a pipeline problem: how do you reliably ingest, clean, transform, and serve that data so it's always fresh and queryable?

Solution

1Ingest — Python fetches top-100 coin prices from CoinGecko's free API, validates each row with a pydantic model, writes a timestamped Parquet snapshot.
2Store — An idempotent loader appends the snapshot into DuckDB; duplicate timestamps are silently ignored via INSERT OR IGNORE on a composite primary key.
3Transform — dbt rebuilds a staging view and two mart tables (daily returns leaderboard, top gainers/losers), then runs 10 data-quality tests.
4Commit back — GitHub Actions commits the updated DuckDB file to the repo with [skip ci] so the next cron run picks up fresh historical data.
5Serve — A Streamlit dashboard reads from the mart tables and shows a movers section, a coin price history chart, and a full top-100 leaderboard.

Tech Stack

Ingestion	Python · httpx · pydantic
Storage	DuckDB (file-based, committed to repo)
Transformation	dbt-duckdb (staging → marts pattern)
Orchestration	GitHub Actions schedule: cron (every 6h)
Dashboard	Streamlit on Streamlit Community Cloud
Tests	dbt unique · not_null · accepted_values
Docs	dbt docs site hosted on GitHub Pages

Architecture

GitHub Actions (cron 0 */6 * * *)
  │
  ├─ ingest.py ──► data/raw/prices_<UTC>.parquet
  │                        │
  ├─ load.py ──────────────► cryptopulse.duckdb
  │                                   │
  │              ┌────────────────────┤  raw_prices (table)
  │              │                    │
  ├─ dbt run ────┤         stg_prices (view)
  │              │                    │
  │              ├── mart_daily_returns (table)
  │              └── mart_top_movers   (table)
  │
  └─ git commit ──► pushes cryptopulse.duckdb back [skip ci]

Streamlit app reads marts → live public dashboard

Selected Lessons

›dbt staging → marts pattern. Separating raw-to-clean from clean-to-business-logic means one place to fix schema changes, not every downstream model.
›Idempotency by design. INSERT OR IGNORE on (id, ingested_at) makes the loader safe to re-run; the same lesson applies to every ETL job.
›Jinja in SQL comments is parsed. {{ ref(...) }} inside a comment creates a real dbt dependency and a false cycle error; always use plain text in comments.
›git pull --rebase needs a clean tree. In CI: commit first, then rebase on the remote, then push. Not the other way around.
›Windows zoneinfo. Python's zoneinfo module needs the tzdata PyPI package on Windows (Linux gets it from the OS); catching this early saved a broken GitHub Actions run.

← All Projects