technology stack
⚙️dbt 1.11 📊Power BI 🐍Python
📈yfinance API 🐘PostgreSQL 🐳Docker .NET 10 FastAPI 🤖Ollama / LLM 🍓Raspberry Pi 5
stocks_db · data lifecycle
From markets to insight

Every night, raw price data from global markets travels through seven acts — arriving as noise, leaving as clarity.

🌙
cron · bash
act 01
Wake up
At 1 am every night, a scheduled job starts the entire pipeline from sleep.
01:00 daily
📡
Python · yfinance · Docker
act 02
Collect
Four groups run one after another, each pulling their slice of NYSE, B3, crypto, and forex tickers via the yfinance API — the longest act by far.
~4 hrs · serialized
yfinance API
🗄️
PostgreSQL
act 03
Store
Fresh data lands in a PostgreSQL database alongside 86 years of history — 70 million rows and counting.
70M+ rows
🧹
dbt · SQL
act 04
Clean
Duplicate tickers are removed. Broken numbers are fixed. Only valid, meaningful records pass through.
−23.5M ghosts
⚗️
dbt · PostgreSQL
act 05
Enrich
Moving averages, Bollinger bands, MACD signals, and momentum indicators are computed for every ticker.
17+ indicators
📊
Power BI · .NET 10
act 06
Serve
Power BI picks up a fresh snapshot. The web app shows live charts. Historical data is always ready, cached and fast.
live · cached
🤖
Ollama · Python
act 07
Insight
An AI analyst reads the day's patterns, writes market briefs, and flags anomalies — all in plain language.
AI analyst
86 yrs
of market history
46.6M
clean rows every day
4 nodes
collecting in parallel
~4 hrs
extraction runtime
01:00
wakes up every night
technical pipeline · detailed view
end-to-end flow diagram
cron
0 1 * * *
sequential execution
Group 1
cluster-pi1
Group 2
cluster-pi2
Group 3
cluster-pi3
Group 4
cluster-pi4
NYSE · NASDAQ · B3 · Crypto · Forex · Japan · Brazil
Extract
🗄
integrated_stocks + stocks_info
public schema · PostgreSQL · 70M+ rows
Store
STAGING
dbt_stocks schema
stg_stocks
VIEW · stateless · feeds stg_stocks_cleansed
materialisation
VIEW
purpose
clean · filter · deduplicate
filter
S&P500 + Forex + Crypto + Intl
dedup
row_number() OVER (ticker, date)
removed
35,582 duplicate pairs
join
LEFT JOIN stocks_info ON ticker
Stage
CLEANSE
dbt_stocks schema
🧹
stg_stocks_cleansed
VIEW · NaN / Inf / overflow guard
materialisation
VIEW
trigger
Power BI System.Decimal error
nan detection
= 'NaN'::float8 (not !=)
overflow guard
|value| ≥ 1e14 → NULL
prices
NaN / Inf / overflow → NULL
volume
NaN / Inf / overflow → 0
Cleanse
MART
dbt_stocks schema
📊
stocks_analytics
TABLE · 46.6M rows · ~38 min build
materialisation
TABLE
pre_hook
SET LOCAL work_mem = '512MB'
post_hook
ANALYZE {{ this }}
index
UNIQUE (ticker, date)
indicators
MA 7/14/20/30 · MACD · deviations
optimisation
5 CTEs → 3 · −42% runtime
core_calcs
MA · EMA · lag/lead · max/min
macd_signal_calc
9-period signal · count
final
pct changes · histogram
Transform
💾
Import partitions
cached · 1940→last 3d
1940–2025 · cached
last 3 days · daily refresh
DirectQuery
live · today · PPU only
LIVE
Query engine
transparent merge
Hybrid
📈
Power BI
PPU · gateway
.NET 10 MVC
PBI Embedded
🤖
NL/SQL
Qwen2.5 · llama.cpp
Consume
Source rows
70M+
integrated_stocks
Output rows
46.6M
after deduplication
Build time
~38 min
down from 60 min
CTEs
5 → 3
optimisation fences removed
Date range
86 yrs
1940-01-03 → today

PostgreSQL NaN ≠ IEEE 754 NaN. Power BI raised DataSource.Error: Numeric NaN not supported by System.Decimal on window-function computed columns. PostgreSQL float8 NaN satisfies NaN = NaN → TRUE, so the IEEE 754 self-inequality guard (value != value) silently passes NaN through. Fix: test with value = 'NaN'::float8 in stg_stocks_cleansed, and apply safe_numeric() macro to every numeric(18,x) cast in stocks_analytics to catch NaN propagating through lag/lead arithmetic.

!

PostgreSQL 12+ treats each CTE as an optimisation fence — fully materialising it to disk before the next one reads it. The original view had 5 CTEs, causing 5 full passes over 70M rows. Collapsing stock_data + ema_calc + macd_line_calc into a single core_calcs CTE eliminated 2 redundant sort cycles. Combined with SET LOCAL work_mem = 512MB in the dbt pre_hook, this cut wall-clock time by 42%.

pipeline stages
cron — 0 1 * * *
run_stocks.sh · fires 1am daily · main-pi
Entry point for the full pipeline. The cron job on main-pi triggers run_stocks.sh which orchestrates extraction across all nodes and calls run_dbt.sh as its final step. All output appends to $LOG.
cronbashmain-pi · 192.168.0.242
stocks extractor
Python · Docker · 4 Pi nodes
Distributed extraction across Groups 1–4 using hostname-based load balancing. Each node pulls a subset of tickers from B3, NYSE, NASDAQ, Crypto, Forex markets. Each group overrides machine-id via /proc/sys/kernel/random/uuid to avoid provider fingerprinting. Writes to public.integrated_stocks.
cluster-pi1cluster-pi2cluster-pi3cluster-pi4yfinanceDocker
source tables
public schema · PostgreSQL · stocks_db
Three source tables consumed by dbt as {{ source('public', ...) }} references. integrated_stocks holds 70M+ OHLCV rows across global tickers. stocks_info is the sector/country dimension lookup. sp500_tickers drives the S&P 500 inclusion filter.
integrated_stocks · 70M+ rowsstocks_info · dimensionsp500_tickers · filter
dbt — stg_stocks
dbt_stocks schema · VIEW · stateless
Staging layer — clean, filter, deduplicate. Joins stocks_info for sector enrichment. Filters to S&P 500 tickers + Forex, Crypto, Indices, Weapons, Commodities + Japan, Brazil, USA. Uses row_number() OVER (PARTITION BY ticker, date ORDER BY open) to eliminate 35,582 duplicate ticker/date pairs. Removes 23.5M ghost rows from downstream.
VIEWdedup · row_number()sector filter−23.5M ghost rows
🧹
dbt — stg_stocks_cleansed
dbt_stocks schema · VIEW · NaN / Inf / overflow guard
Intermediate cleansing layer. Fixes Power BI DataSource.Error: Numeric NaN not supported by System.Decimal. Must use value = 'NaN'::float8 — not value != value.

Prices — NaN, ±Infinity, or |value| ≥ 1e14 → NULL
Volume — NaN or Infinity → 0

The safe_numeric() macro applies the same guard to every numeric(18,x) cast in stocks_analytics.
VIEWNaN → NULL±Inf → NULLoverflow → NULLsafe_numeric() macro
dbt — stocks_analytics
dbt_stocks schema · TABLE · 46.5M rows
Core analytics model. 3-CTE chain collapsed from original 5 to eliminate PostgreSQL optimisation fences. All 17+ window functions share PARTITION BY ticker ORDER BY date enabling a single sort pass. SET LOCAL work_mem = '512MB' via pre_hook. Post_hook runs ANALYZE. Unique index on (ticker, date).

core_calcs → MA 7/14/20/30, lag/lead, EMA 12/26, highest/lowest ever
macd_signal_calc → 9-period signal line, signal count
final → daily change %, deviations, intraday change, MACD line/signal/histogram
TABLE3 CTEswork_mem 512MBANALYZE post_hook~35 min build
Power BI
PPU · on-premises gateway
Connected via on-premises data gateway on main-pi-win (Windows 11 ARM64 VM · QEMU/KVM). Import mode with sector WHERE clause. Gateway watchdog restarts PBIEgwService every 5 min if down.
import modegateway watchdogmain-pi-win
.NET 10 MVC
Power BI Embedded · Pi 5
ASP.NET Core MVC app with Power BI Embedded, workspace/report browser, and orbital carousel UI. Containerised with Docker on Pi 5 · Ubuntu ARM64. Exposed via Cloudflare tunnel at linux-embedded.ricardo.expert.
Docker · ARM64Cloudflare tunnel
NL/SQL pipeline
Qwen2.5-Coder-14B · llama.cpp
Local natural language to SQL pipeline. Qwen2.5-Coder-14B sharded across two 16GB Pi nodes via llama.cpp RPC. FastAPI orchestration on Pi3, read-only PostgreSQL executor on Pi4.
llama.cpp RPCFastAPIPi3 + Pi4
1cron 1am
2extract
3write to postgres
4dbt run
5stocks_analytics rebuilt
6consumers query fresh data
next day
Extract
Store
Stage
Transform
Consume
power bi hybrid table — PPU
Import partitions cached · fast
1940 – 2015historical · never refreshed
2016 – 2020yearly partition
2021 – 2024yearly partition
Jan – Dec 2025monthly partition
Jan – Mar 2026last 3 days refreshed daily
Queries served from in-memory cache. Historical partitions locked after initial load.
DirectQuery partition live · PPU only
today● live · no cache
yesterdayreal-time window
2 days agoreal-time window
Every query hits dbt_stocks.stocks_analytics in PostgreSQL live. Always reflects the latest dbt run.
Power BI query engine — transparent merge
Import + DirectQuery results are merged automatically. Report visuals and DAX measures see one seamless table spanning 1940 to now. First refresh loads all 46.5M rows once. Every subsequent refresh touches only the last 3 days — seconds, not minutes.
Pipeline: stocks_db · dbt 1.11.7 · postgres adapter 1.10.0 Infra: Raspberry Pi 5 cluster · Ubuntu ARM64 · main-pi Ricardo Gondo · 2026