Data Engineering
CryptoPulse
Crypto Market ETL Pipeline + Dashboard
[Screenshot: CryptoPulse dashboard]
Problem
Crypto price data is public and fast-moving, but raw API responses are messy and unnormalised. Most people look at a chart and see a number — a data engineer sees a pipeline problem: how do you reliably ingest, clean, transform, and serve that data so it's always fresh and queryable?
Solution
A fully automated data pipeline that runs every 6 hours with zero manual steps:
  1. Ingest — Python fetches top-100 coin prices from CoinGecko's free API (no key required), validates each row with a pydantic model, writes a timestamped Parquet snapshot.
  2. Store — An idempotent loader appends the snapshot into DuckDB; duplicate timestamps are silently ignored via INSERT OR IGNORE on a composite primary key.
  3. Transform — dbt rebuilds a staging view and two mart tables (daily returns leaderboard, top gainers/losers), then runs 10 data-quality tests.
  4. Commit back — GitHub Actions commits the updated DuckDB file to the repo with [skip ci] so the next cron run picks up fresh historical data.
  5. Serve — A Streamlit dashboard reads from the mart tables and shows a movers section, a coin price history chart, and a full top-100 leaderboard.
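The ingest step (fetch → validate → timestamped snapshot) can be sketched roughly as follows. This is a dependency-free illustration: the real pipeline validates with a pydantic model, and the field names here (`id`, `symbol`, `current_price`) and the helper `snapshot_path` are assumptions for the sketch, not the project's actual schema.

```python
# Sketch of the ingest step's validation + snapshot naming.
# A frozen dataclass stands in for the project's pydantic model.
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass(frozen=True)
class CoinPrice:
    id: str              # CoinGecko coin id, e.g. "bitcoin"
    symbol: str          # ticker, e.g. "btc"
    current_price: float

    def __post_init__(self):
        # Malformed API rows (empty ids, non-positive prices)
        # are rejected before anything is written to Parquet.
        if not self.id or not self.symbol:
            raise ValueError("missing coin id/symbol")
        if self.current_price <= 0:
            raise ValueError(f"bad price for {self.id!r}: {self.current_price}")

def snapshot_path(now: Optional[datetime] = None) -> str:
    # UTC-timestamped filename so each 6-hourly run writes a
    # distinct snapshot, e.g. data/raw/prices_20260101T060000Z.parquet
    now = now or datetime.now(timezone.utc)
    return f"data/raw/prices_{now:%Y%m%dT%H%M%S}Z.parquet"
```

Validation failing loudly at ingest (rather than silently at query time) is what keeps the downstream DuckDB tables and dbt tests trustworthy.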
Tech stack
Ingestion — Python · httpx · pydantic
Storage — DuckDB (file-based, committed to repo)
Transformation — dbt-duckdb (staging → marts pattern)
Orchestration — GitHub Actions `schedule: cron` (every 6h)
Dashboard — Streamlit on Streamlit Community Cloud
Tests — dbt `unique` · `not_null` · `accepted_values`
Docs — dbt docs site hosted on GitHub Pages
Architecture
GitHub Actions (cron 0 */6 * * *)
  │
  ├─ ingest.py ──► data/raw/prices_<UTC>.parquet
  │                        │
  ├─ load.py ──────────────► cryptopulse.duckdb
  │                                   │
  │              ┌────────────────────┤  raw_prices (table)
  │              │                    │
  ├─ dbt run ────┤         stg_prices (view)
  │              │                    │
  │              ├── mart_daily_returns (table)
  │              └── mart_top_movers   (table)
  │
  └─ git commit ──► pushes cryptopulse.duckdb back [skip ci]

Streamlit app reads marts → live public dashboard
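The "commit back" arrow above hides an ordering constraint worth spelling out: commit the refreshed DuckDB file first, then rebase on the remote, then push. A minimal sketch; the helper only builds the git command list rather than executing it, and the file name default and commit message are illustrative assumptions.

```python
# Sketch of the CI "commit back" step as an ordered list of git commands.
# Run each with subprocess.run(cmd, check=True) inside the workflow.
from typing import List

def commit_back_commands(db_file: str = "cryptopulse.duckdb") -> List[List[str]]:
    return [
        ["git", "add", db_file],
        # [skip ci] prevents this push from triggering another workflow run.
        ["git", "commit", "-m", f"chore: refresh {db_file} [skip ci]"],
        # git pull --rebase needs a clean tree; committing first guarantees it.
        ["git", "pull", "--rebase", "origin", "main"],
        ["git", "push", "origin", "main"],
    ]
```

Swapping the commit and rebase steps fails in CI whenever the tree is dirty, which is exactly the lesson noted below.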
Selected lessons
  • dbt staging → marts pattern — separating raw-to-clean from clean-to-business-logic means one place to fix schema changes, not every downstream model.
  • Idempotency by design — INSERT OR IGNORE on (id, ingested_at) makes the loader safe to re-run; the same lesson applies to every ETL job.
  • Jinja in SQL comments is parsed — {{ ref(...) }} inside a comment creates a real dbt dependency and a false cycle error; always use plain text in comments.
  • git pull --rebase needs a clean tree — in CI: commit first, then rebase on the remote, then push; not the other way around.
  • Windows zoneinfo — Python's zoneinfo module needs the tzdata PyPI package on Windows (Linux gets it from the OS); catching this early saved a broken GitHub Actions run.
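The idempotency lesson is small enough to demonstrate end to end. This sketch uses stdlib sqlite3 purely so it runs anywhere; the INSERT OR IGNORE syntax is the same one DuckDB accepts, and the table/column names mirror the pipeline's raw_prices table.

```python
# Idempotent loading via INSERT OR IGNORE on a composite primary key.
# Re-running the loader on an already-loaded snapshot changes nothing.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE raw_prices (
        id          TEXT,
        ingested_at TEXT,
        price       REAL,
        PRIMARY KEY (id, ingested_at)
    )
""")

snapshot = [
    ("bitcoin",  "2026-01-01T06:00:00Z", 43000.0),
    ("ethereum", "2026-01-01T06:00:00Z",  2300.0),
]

def load(rows):
    # Duplicate (id, ingested_at) pairs are silently skipped.
    con.executemany("INSERT OR IGNORE INTO raw_prices VALUES (?, ?, ?)", rows)

load(snapshot)   # first run inserts 2 rows
load(snapshot)   # re-run is a no-op
count = con.execute("SELECT COUNT(*) FROM raw_prices").fetchone()[0]
print(count)     # 2
```

Because the re-run is a no-op, a crashed or retried GitHub Actions job can simply run the loader again without corrupting history.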
© 2026 Jhames Andrew Macabata