end-to-end flow diagram
cron
0 1 * * *
sequential execution
Group 1
cluster-pi1
Group 2
cluster-pi2
Group 3
cluster-pi3
Group 4
cluster-pi4
NYSE · NASDAQ · B3 · Crypto · Forex · Japan · Brazil
Extract
🗄
integrated_stocks + stocks_info
public schema · PostgreSQL · 70M+ rows
Store
STAGING
dbt_stocks schema
stg_stocks
VIEW · stateless · feeds stg_stocks_cleansed
materialisation
VIEW
purpose
clean · filter · deduplicate
filter
S&P500 + Forex + Crypto + Intl
dedup
row_number() OVER (ticker, date)
removed
35,582 duplicate pairs
join
LEFT JOIN stocks_info ON ticker
Stage
CLEANSE
dbt_stocks schema
🧹
stg_stocks_cleansed
VIEW · NaN / Inf / overflow guard
materialisation
VIEW
trigger
Power BI System.Decimal error
nan detection
= 'NaN'::float8 (not !=)
overflow guard
|value| ≥ 1e14 → NULL
prices
NaN / Inf / overflow → NULL
volume
NaN / Inf / overflow → 0
Cleanse
MART
dbt_stocks schema
📊
stocks_analytics
TABLE · 46.6M rows · ~38 min build
materialisation
TABLE
pre_hook
SET LOCAL work_mem = '512MB'
post_hook
ANALYZE {{ this }}
index
UNIQUE (ticker, date)
indicators
MA 7/14/20/30 · MACD · deviations
optimisation
5 CTEs → 3 · −42% runtime
core_calcs
MA · EMA · lag/lead · max/min
macd_signal_calc
9-period signal · count
final
pct changes · histogram
Transform
💾
Import partitions
cached · 1940→last 3d
1940–2025 · cached
last 3 days · daily refresh
DirectQuery
live · today · PPU only
LIVE
Query engine
transparent merge
Hybrid
📈
Power BI
PPU · gateway
.NET 10 MVC
PBI Embedded
🤖
NL/SQL
Qwen2.5 · llama.cpp
Consume
Source rows
70M+
integrated_stocks
Output rows
46.6M
after filtering + dedup
Build time
~38 min
down from 60 min
CTEs
5 → 3
optimisation fences removed
Date range
86 yrs
1940-01-03 → today

PostgreSQL's NaN semantics are not IEEE 754's. Power BI raised DataSource.Error: Numeric NaN not supported by System.Decimal on window-function computed columns. PostgreSQL treats float8 NaN as equal to itself (NaN = NaN → TRUE), so the IEEE 754 self-inequality guard (value != value) silently passes NaN through. Fix: test with value = 'NaN'::float8 in stg_stocks_cleansed, and apply the safe_numeric() macro to every numeric(18,x) cast in stocks_analytics to catch NaN propagating through lag/lead arithmetic.
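A minimal sketch of that guard as a dbt/Jinja macro — the source names safe_numeric() but not its body, so the argument names and defaults below are illustrative:

```sql
-- Hypothetical body for safe_numeric(); precision/scale defaults are assumptions.
{% macro safe_numeric(col, precision=18, scale=4) %}
    case
        -- NaN first: in PostgreSQL, NaN = NaN is TRUE (and NaN sorts above all values)
        when {{ col }} = 'NaN'::float8 then null
        when {{ col }} = 'Infinity'::float8 or {{ col }} = '-Infinity'::float8 then null
        -- numeric(18,4) tops out just below 1e14, so larger magnitudes would overflow
        when abs({{ col }}) >= 1e14 then null
        else {{ col }}::numeric({{ precision }}, {{ scale }})
    end
{% endmacro %}
```

In stg_stocks_cleansed the same CASE shape is applied per column, with volume falling back to 0 instead of NULL.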


Before PostgreSQL 12, every CTE was an optimisation fence — fully materialised (spilling to disk once it outgrew work_mem) before the next one could read it — and even on 12+, a CTE referenced more than once is materialised by default. The original view chained 5 CTEs, causing 5 full passes over 70M rows. Collapsing stock_data + ema_calc + macd_line_calc into a single core_calcs CTE eliminated 2 redundant sort cycles. Combined with SET LOCAL work_mem = '512MB' in the dbt pre_hook, this cut wall-clock time by 42%.
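The collapsed chain, sketched with stand-in columns — a simple moving-average stub replaces the real EMA-based MACD line, which needs more machinery than fits here:

```sql
with core_calcs as (   -- absorbs the former stock_data, ema_calc and macd_line_calc
    select
        ticker,
        date,
        close,
        avg(close) over w_ma7 as ma_7,                    -- real model: MA 7/14/20/30, EMA 12/26
        lag(close) over w     as prev_close,
        close - avg(close) over w_ma7 as macd_line_stub   -- stand-in for EMA12 - EMA26
    from {{ ref('stg_stocks_cleansed') }}
    window
        w     as (partition by ticker order by date),
        w_ma7 as (w rows between 6 preceding and current row)
),
macd_signal_calc as (
    select *,
           avg(macd_line_stub) over (partition by ticker order by date
                                     rows between 8 preceding and current row) as macd_signal
    from core_calcs
),
final as (
    select *, macd_line_stub - macd_signal as macd_histogram
    from macd_signal_calc
)
select * from final
```

Because every window shares the same PARTITION BY ticker ORDER BY date, the planner can satisfy all of core_calcs with one sort pass.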

pipeline stages
cron — 0 1 * * *
run_stocks.sh · fires 1am daily · main-pi
Entry point for the full pipeline. The cron job on main-pi triggers run_stocks.sh, which orchestrates extraction across all nodes and calls run_dbt.sh as its final step. All output appends to $LOG.
cron bash main-pi · 192.168.0.242
stocks extractor
Python · Docker · 4 Pi nodes
Distributed extraction across Groups 1–4 using hostname-based load balancing. Each node pulls a subset of tickers from B3, NYSE, NASDAQ, Crypto, Forex markets. Each group overrides machine-id via /proc/sys/kernel/random/uuid to avoid provider fingerprinting. Writes to public.integrated_stocks.
cluster-pi1 cluster-pi2 cluster-pi3 cluster-pi4 yfinance Docker
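The hostname-based split can be sketched like this — the function names and hashing scheme are illustrative, not the extractor's actual code; only the hostnames and the four-group layout come from the source:

```python
import hashlib

NODES = ["cluster-pi1", "cluster-pi2", "cluster-pi3", "cluster-pi4"]

def shard_index(ticker: str, n_nodes: int = len(NODES)) -> int:
    """Stable ticker -> group assignment; every node computes the same map."""
    digest = hashlib.sha256(ticker.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_nodes

def tickers_for(hostname: str, universe: list[str]) -> list[str]:
    """Subset of tickers this node pulls; the trailing digit of the
    hostname (cluster-pi1 .. cluster-pi4) selects the group."""
    group = int(hostname[-1]) - 1
    return [t for t in universe if shard_index(t) == group]
```

Hashing the ticker (rather than slicing an ordered list) keeps assignments stable when new tickers are added: the four shards are disjoint and together cover the whole universe, with no coordination between nodes.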
source tables
public schema · PostgreSQL · stocks_db
Three source tables consumed by dbt as {{ source('public', ...) }} references. integrated_stocks holds 70M+ OHLCV rows across global tickers. stocks_info is the sector/country dimension lookup. sp500_tickers drives the S&P 500 inclusion filter.
integrated_stocks · 70M+ rows stocks_info · dimension sp500_tickers · filter
dbt — stg_stocks
dbt_stocks schema · VIEW · stateless
Staging layer — clean, filter, deduplicate. Joins stocks_info for sector enrichment. Filters to S&P 500 tickers + Forex, Crypto, Indices, Weapons, Commodities + Japan, Brazil, USA. Uses row_number() OVER (PARTITION BY ticker, date ORDER BY open) to eliminate 35,582 duplicate ticker/date pairs caused by float precision artifacts in split-adjusted OTC tickers. Removes 23.5M ghost rows from downstream.
VIEW dedup · row_number() sector filter −23.5M ghost rows
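The dedup core of the view can be sketched as follows (column list abridged; the S&P 500/market filter is elided):

```sql
with ranked as (
    select
        s.ticker,
        s.date,
        s.open, s.high, s.low, s.close, s.volume,
        i.sector,
        row_number() over (partition by s.ticker, s.date
                           order by s.open) as rn
    from {{ source('public', 'integrated_stocks') }} s
    left join {{ source('public', 'stocks_info') }} i using (ticker)
)
select *
from ranked
where rn = 1   -- one row per (ticker, date); drops the 35,582 duplicate pairs
```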
🧹
dbt — stg_stocks_cleansed
dbt_stocks schema · VIEW · NaN / Inf / overflow guard
Intermediate cleansing layer between stg_stocks and stocks_analytics. Fixes Power BI DataSource.Error: Numeric NaN not supported by System.Decimal. PostgreSQL float8 NaN satisfies NaN = NaN → TRUE — the IEEE 754 self-inequality trick (value != value) does not work. Must use value = 'NaN'::float8.

Prices (open/high/low/close) — NaN, ±Infinity, or |value| ≥ 1e14 → NULL
Volume — NaN or Infinity → 0
Rate of return — NaN, ±Infinity, or overflow → NULL

The safe_numeric() macro applies the same guard to every numeric(18,x) cast in stocks_analytics, catching NaN that propagates through lag/lead window arithmetic on historical rows.
VIEW NaN → NULL ±Inf → NULL overflow → NULL safe_numeric() macro Power BI fix
dbt — stocks_analytics
dbt_stocks schema · TABLE · 46.6M rows
Core analytics model. The CTE chain was collapsed from 5 to 3 to eliminate redundant PostgreSQL CTE materialisation. All 17+ window functions share PARTITION BY ticker ORDER BY date, enabling a single sort pass in core_calcs. SET LOCAL work_mem = '512MB' via the pre_hook eliminates disk sort spills; the post_hook runs ANALYZE to update planner statistics. A unique index on (ticker, date) enforces the model's grain.

core_calcs → MA 7/14/20/30, lag/lead, EMA 12/26, highest/lowest ever
macd_signal_calc → 9-period signal line, signal count
final → daily change %, deviations, intraday change, MACD line/signal/histogram
TABLE 3 CTEs work_mem 512MB ANALYZE post_hook ~35 min build unique index (ticker, date)
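The hooks and index translate into a dbt config block along these lines — the index name is an assumption; the materialisation and both hooks come from the source:

```sql
{{ config(
    materialized = 'table',
    pre_hook  = "set local work_mem = '512MB'",   -- keep the big window sorts in memory
    post_hook = [
        "analyze {{ this }}",                     -- refresh planner statistics after the rebuild
        "create unique index if not exists stocks_analytics_ticker_date_idx on {{ this }} (ticker, date)"
    ]
) }}
```

SET LOCAL scopes the work_mem bump to the model's own transaction, so the 512MB grant never leaks to other sessions on the Pi.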
Power BI
PPU · on-premises gateway
Connected via the on-premises data gateway on main-pi-win (Windows 11 ARM64 VM · QEMU/KVM). Import mode with a sector WHERE clause. A gateway-watchdog scheduled task checks every 5 min and auto-restarts PBIEgwService if it is down. Windows Update active hours are configured to protect the 1am dbt run window.
import mode gateway watchdog main-pi-win
.NET 10 MVC
Power BI Embedded · Pi 5
ASP.NET Core MVC app with Power BI Embedded, workspace/report browser, and orbital carousel UI. Containerised with Docker on Pi 5 · Ubuntu ARM64. Exposed via Cloudflare tunnel at linux-embedded.ricardo.expert. Migrated to .NET 10.
Docker · ARM64 Cloudflare tunnel linux-embedded.ricardo.expert
NL/SQL pipeline
Qwen2.5-Coder-14B · llama.cpp
Local natural language to SQL pipeline. Qwen2.5-Coder-14B sharded across two 16GB Pi nodes via llama.cpp RPC. FastAPI orchestration on Pi3, read-only PostgreSQL executor on Pi4. Queries stocks_analytics directly.
llama.cpp RPC FastAPI Pi3 + Pi4
1. cron 1am
2. extract
3. write to postgres
4. dbt run
5. stocks_analytics rebuilt
6. consumers query fresh data
next day
Extract
Store
Stage
Transform
Consume
power bi hybrid table — PPU
Import partitions cached · fast
1940 – 2015 · historical · never refreshed
2016 – 2020 · yearly partition
2021 – 2024 · yearly partition
Jan – Dec 2025 · monthly partition
Jan – Mar 2026 · last 3 days refreshed daily
Queries served from in-memory cache. Historical partitions locked after initial load.
DirectQuery partition live · PPU only
today ● live · no cache
yesterday · real-time window
2 days ago · real-time window
Every query hits dbt_stocks.stocks_analytics in PostgreSQL live. Always reflects the latest dbt run. No stale data for today's rows.
Power BI query engine — transparent merge
Import + DirectQuery results are merged automatically. Report visuals and DAX measures see one seamless table spanning 1940 to now. First refresh loads all 46.6M rows once. Every subsequent refresh touches only the last 3 days — seconds, not minutes.