Data Engineering
CryptoPulse
Crypto Market ETL Pipeline + Dashboard
Problem
Crypto price data is public and fast-moving, but raw API responses are messy and unnormalised.
Most people look at a chart and see a number — a data engineer sees a pipeline problem:
how do you reliably ingest, clean, transform, and serve that data so it's always fresh and queryable?
Solution
A fully automated data pipeline that runs every 6 hours with zero manual steps:
- Ingest — Python fetches top-100 coin prices from CoinGecko's free API (no key required), validates each row with a pydantic model, and writes a timestamped Parquet snapshot (see the first sketch after this list).
- Store — An idempotent loader appends the snapshot into DuckDB; duplicate timestamps are silently ignored via `INSERT OR IGNORE` on a composite primary key (see the second sketch after this list).
- Transform — dbt rebuilds a staging view and two mart tables (daily returns leaderboard, top gainers/losers), then runs 10 data-quality tests.
- Commit back — GitHub Actions commits the updated DuckDB file to the repo with `[skip ci]` so the next cron run picks up fresh historical data.
- Serve — A Streamlit dashboard reads from the mart tables and shows a movers section, a coin price-history chart, and a full top-100 leaderboard.
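For concreteness, here is a minimal sketch of the ingest step. The CoinGecko `/coins/markets` endpoint and its `vs_currency`/`per_page` parameters are real; the `Coin` fields, snapshot path, and filename format are assumptions for illustration, not the actual implementation.

```python
# Sketch of ingest.py: fetch, validate, snapshot. Field subset and paths are assumed.
from datetime import datetime, timezone
from pathlib import Path

import httpx
import pandas as pd
from pydantic import BaseModel

class Coin(BaseModel):
    id: str
    symbol: str
    current_price: float
    market_cap: int | None = None  # assumed subset of the API's many fields

def ingest() -> Path:
    # Top 100 coins by market cap from CoinGecko's free, keyless endpoint.
    resp = httpx.get(
        "https://api.coingecko.com/api/v3/coins/markets",
        params={"vs_currency": "usd", "order": "market_cap_desc",
                "per_page": 100, "page": 1},
        timeout=30,
    )
    resp.raise_for_status()

    now = datetime.now(timezone.utc)
    # Validate every row up front: a malformed row fails loudly here,
    # not three steps later inside dbt.
    rows = [Coin.model_validate(raw).model_dump() | {"ingested_at": now}
            for raw in resp.json()]

    out_dir = Path("data/raw")
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"prices_{now:%Y-%m-%dT%H-%M-%SZ}.parquet"
    pd.DataFrame(rows).to_parquet(path, index=False)
    return path

if __name__ == "__main__":
    print(ingest())
```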
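And a sketch of the idempotent load step. The `raw_prices` schema here is an assumption, but the mechanism is the one the pipeline relies on: `INSERT OR IGNORE` against a composite primary key makes re-running the same snapshot a no-op.

```python
# Sketch of load.py: append a snapshot into DuckDB, silently skipping duplicates.
import duckdb

def load(snapshot_path: str, db_path: str = "cryptopulse.duckdb") -> None:
    con = duckdb.connect(db_path)
    # One row per coin per ingestion timestamp; the composite key defines "duplicate".
    con.execute("""
        CREATE TABLE IF NOT EXISTS raw_prices (
            id VARCHAR,
            symbol VARCHAR,
            current_price DOUBLE,
            market_cap BIGINT,
            ingested_at TIMESTAMP,
            PRIMARY KEY (id, ingested_at)
        )
    """)
    # BY NAME matches Parquet columns to table columns; rows whose key
    # already exists are skipped, so the loader is safe to re-run.
    con.execute(
        "INSERT OR IGNORE INTO raw_prices BY NAME SELECT * FROM read_parquet(?)",
        [snapshot_path],
    )
    con.close()
```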
Tech stack
| Layer | Tools |
| --- | --- |
| Ingestion | Python · httpx · pydantic |
| Storage | DuckDB (file-based, committed to repo) |
| Transformation | dbt-duckdb (staging → marts pattern) |
| Orchestration | GitHub Actions `schedule: cron` (every 6h) |
| Dashboard | Streamlit on Streamlit Community Cloud |
| Tests | dbt `unique` · `not_null` · `accepted_values` |
| Docs | dbt docs site hosted on GitHub Pages |
Architecture
GitHub Actions (cron 0 */6 * * *)
│
├─ ingest.py ──► data/raw/prices_<UTC>.parquet
│ │
├─ load.py ──────────────► cryptopulse.duckdb
│ │
│ ┌────────────────────┤ raw_prices (table)
│ │ │
├─ dbt run ────┤ stg_prices (view)
│ │ │
│ ├── mart_daily_returns (table)
│ └── mart_top_movers (table)
│
└─ git commit ──► pushes cryptopulse.duckdb back [skip ci]
Streamlit app reads marts → live public dashboard
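A minimal sketch of that read path, assuming the column names shown here (`pct_change_24h`, `coin_id`, `day`, `close_price` are illustrative, not the real mart schema):

```python
# Sketch of the dashboard's read path; mart column names are assumptions.
import duckdb
import streamlit as st

con = duckdb.connect("cryptopulse.duckdb", read_only=True)

st.title("CryptoPulse")

# Movers section straight from the dbt mart.
movers = con.execute(
    "SELECT * FROM mart_top_movers ORDER BY pct_change_24h DESC"
).df()
st.subheader("Top movers")
st.dataframe(movers)

# Price-history chart for one selected coin.
coin = st.selectbox("Coin", movers["coin_id"].unique())
history = con.execute(
    "SELECT day, close_price FROM mart_daily_returns WHERE coin_id = ? ORDER BY day",
    [coin],
).df()
st.line_chart(history, x="day", y="close_price")
```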
Selected lessons
- dbt staging → marts pattern — separating raw-to-clean from clean-to-business-logic means one place to fix schema changes, not every downstream model.
- Idempotency by design — `INSERT OR IGNORE` on `(id, ingested_at)` makes the loader safe to re-run; the same lesson applies to every ETL job.
- Jinja in SQL comments is parsed — `{{ ref(...) }}` inside a comment creates a real dbt dependency and a false cycle error; always use plain text in comments.
- `git pull --rebase` needs a clean tree — in CI: commit first, then rebase on the remote, then push, not the other way around (see the first sketch after this list).
- Windows zoneinfo — Python's `zoneinfo` module needs the `tzdata` PyPI package on Windows (Linux gets it from the OS); catching this early saved a broken GitHub Actions run (see the second sketch after this list).
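The commit-back ordering from the `git pull --rebase` lesson, written as the shell steps a workflow would run (via `subprocess` to keep these examples in Python; the branch name and commit message are assumptions):

```python
# Sketch of the commit-back step: commit first, then rebase, then push.
import subprocess

def run(*cmd: str) -> None:
    subprocess.run(cmd, check=True)

run("git", "add", "cryptopulse.duckdb")
# 1. Commit first, so the working tree is clean before the rebase.
#    (A real job would guard with `git diff --quiet` when nothing changed,
#    since `git commit` exits nonzero with an empty stage.)
run("git", "commit", "-m", "chore: refresh data [skip ci]")
# 2. Rebase: replay the data commit on top of anything pushed since the run began.
run("git", "pull", "--rebase", "origin", "main")
# 3. Push last. Rebasing before committing fails on the dirty tree.
run("git", "push", "origin", "main")
```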
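And a small guard for the Windows `zoneinfo` lesson; `ZoneInfoNotFoundError` is the exception `zoneinfo` raises when no timezone database is available:

```python
# Fail fast when no tz database is present (Windows without the tzdata wheel).
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

try:
    UTC = ZoneInfo("UTC")
except ZoneInfoNotFoundError:
    # Linux and macOS ship an OS tz database; Windows does not, so the
    # tzdata PyPI package must be in requirements.txt for CI parity.
    raise SystemExit("No tz database found - pip install tzdata")
```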