end-to-end flow diagram
cron
0 1 * * *
sequential execution
Group 1
cluster-pi1
Group 2
cluster-pi2
Group 3
cluster-pi3
Group 4
cluster-pi4
NYSE · NASDAQ · B3 · Crypto · Forex · Japan · Brazil
Extract
🗄
integrated_stocks + stocks_info
public schema · PostgreSQL · 70M+ rows
Store
STAGING
dbt_stocks schema
stg_stocks
VIEW · stateless · feeds stg_stocks_cleansed
materialisation
VIEW
purpose
clean · filter · deduplicate
filter
S&P500 + Forex + Crypto + Intl
dedup
row_number() OVER (ticker, date)
removed
35,582 duplicate pairs
join
LEFT JOIN stocks_info ON ticker
Stage
CLEANSE
dbt_stocks schema
🧹
stg_stocks_cleansed
VIEW · NaN / Inf / overflow guard
materialisation
VIEW
trigger
Power BI System.Decimal error
nan detection
= 'NaN'::float8 (not !=)
overflow guard
|value| ≥ 1e14 → NULL
prices
NaN / Inf / overflow → NULL
volume
NaN / Inf / overflow → 0
Cleanse
MART
dbt_stocks schema
📊
stocks_analytics
TABLE · 46.6M rows · ~38 min build
materialisation
TABLE
pre_hook
SET LOCAL work_mem = '512MB'
post_hook
ANALYZE {{ this }}
index
UNIQUE (ticker, date)
indicators
MA 7/14/20/30 · MACD · deviations
optimisation
5 CTEs → 3 · −42% runtime
core_calcs
MA · EMA · lag/lead · max/min
macd_signal_calc
9-period signal · count
final
pct changes · histogram
Transform
💾
Import partitions
cached · 1940→last 3d
1940–2025 · cached
last 3 days · daily refresh
DirectQuery
live · today · PPU only
LIVE
Query engine
transparent merge
Hybrid
📈
Power BI
PPU · gateway
.NET 10 MVC
PBI Embedded
🤖
NL/SQL
Qwen2.5 · llama.cpp
Consume
Source rows
70M+
integrated_stocks
Output rows
46.6M
after filtering + dedup
Build time
~38 min
down from 60 min
CTEs
5 → 3
optimisation fences removed
Date range
86 yrs
1940-01-03 → today

PostgreSQL's NaN semantics are not IEEE 754's. Power BI raised DataSource.Error: Numeric NaN not supported by System.Decimal on window-function computed columns. PostgreSQL treats float8 NaN as equal to itself (NaN = NaN → TRUE), so the IEEE 754 self-inequality guard (value != value) silently passes NaN through. Fix: test with value = 'NaN'::float8 in stg_stocks_cleansed, and apply the safe_numeric() macro to every numeric(18,x) cast in stocks_analytics to catch NaN propagating through lag/lead arithmetic.
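A minimal sketch of that guard as a dbt/Jinja macro — the source names safe_numeric() but not its body, so the argument names and defaults below are illustrative:

```sql
-- Hypothetical body for safe_numeric(); precision/scale defaults are assumptions.
{% macro safe_numeric(col, precision=18, scale=4) %}
    case
        -- NaN first: in PostgreSQL, NaN = NaN is TRUE (and NaN sorts above all values)
        when {{ col }} = 'NaN'::float8 then null
        when {{ col }} = 'Infinity'::float8 or {{ col }} = '-Infinity'::float8 then null
        -- numeric(18,4) tops out just below 1e14, so larger magnitudes would overflow
        when abs({{ col }}) >= 1e14 then null
        else {{ col }}::numeric({{ precision }}, {{ scale }})
    end
{% endmacro %}
```

In stg_stocks_cleansed the same CASE shape is applied per column, with volume falling back to 0 instead of NULL.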


Before PostgreSQL 12, every CTE was an optimisation fence — fully materialised (spilling to disk once it outgrew work_mem) before the next one could read it — and even on 12+, a CTE referenced more than once is materialised by default. The original view chained 5 CTEs, causing 5 full passes over 70M rows. Collapsing stock_data + ema_calc + macd_line_calc into a single core_calcs CTE eliminated 2 redundant sort cycles. Combined with SET LOCAL work_mem = '512MB' in the dbt pre_hook, this cut wall-clock time by 42%.
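The collapsed chain, sketched with stand-in columns — a simple moving-average stub replaces the real EMA-based MACD line, which needs more machinery than fits here:

```sql
with core_calcs as (   -- absorbs the former stock_data, ema_calc and macd_line_calc
    select
        ticker,
        date,
        close,
        avg(close) over w_ma7 as ma_7,                    -- real model: MA 7/14/20/30, EMA 12/26
        lag(close) over w     as prev_close,
        close - avg(close) over w_ma7 as macd_line_stub   -- stand-in for EMA12 - EMA26
    from {{ ref('stg_stocks_cleansed') }}
    window
        w     as (partition by ticker order by date),
        w_ma7 as (w rows between 6 preceding and current row)
),
macd_signal_calc as (
    select *,
           avg(macd_line_stub) over (partition by ticker order by date
                                     rows between 8 preceding and current row) as macd_signal
    from core_calcs
),
final as (
    select *, macd_line_stub - macd_signal as macd_histogram
    from macd_signal_calc
)
select * from final
```

Because every window shares the same PARTITION BY ticker ORDER BY date, the planner can satisfy all of core_calcs with one sort pass.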

pipeline stages
cron — 0 1 * * *
run_stocks.sh · fires 1am daily · main-pi
Entry point for the full pipeline. The cron job on main-pi triggers run_stocks.sh, which orchestrates extraction across all nodes and calls run_dbt.sh as its final step. All output appends to $LOG.
cron bash main-pi · 192.168.0.242
stocks extractor
Python · Docker · 4 Pi nodes
Distributed extraction across Groups 1–4 using hostname-based load balancing. Each node pulls a subset of tickers from B3, NYSE, NASDAQ, Crypto, Forex markets. Each group overrides machine-id via /proc/sys/kernel/random/uuid to avoid provider fingerprinting. Writes to public.integrated_stocks.
cluster-pi1 cluster-pi2 cluster-pi3 cluster-pi4 yfinance Docker
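The hostname-based split can be sketched like this — the function names and hashing scheme are illustrative, not the extractor's actual code; only the hostnames and the four-group layout come from the source:

```python
import hashlib

NODES = ["cluster-pi1", "cluster-pi2", "cluster-pi3", "cluster-pi4"]

def shard_index(ticker: str, n_nodes: int = len(NODES)) -> int:
    """Stable ticker -> group assignment; every node computes the same map."""
    digest = hashlib.sha256(ticker.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_nodes

def tickers_for(hostname: str, universe: list[str]) -> list[str]:
    """Subset of tickers this node pulls; the trailing digit of the
    hostname (cluster-pi1 .. cluster-pi4) selects the group."""
    group = int(hostname[-1]) - 1
    return [t for t in universe if shard_index(t) == group]
```

Hashing the ticker (rather than slicing an ordered list) keeps assignments stable when new tickers are added: the four shards are disjoint and together cover the whole universe, with no coordination between nodes.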
source tables
public schema · PostgreSQL · stocks_db
Three source tables consumed by dbt as {{ source('public', ...) }} references. integrated_stocks holds 70M+ OHLCV rows across global tickers. stocks_info is the sector/country dimension lookup. sp500_tickers drives the S&P 500 inclusion filter.
integrated_stocks · 70M+ rows stocks_info · dimension sp500_tickers · filter
dbt — stg_stocks
dbt_stocks schema · VIEW · stateless
Staging layer — clean, filter, deduplicate. Joins stocks_info for sector enrichment. Filters to S&P 500 tickers + Forex, Crypto, Indices, Weapons, Commodities + Japan, Brazil, USA. Uses row_number() OVER (PARTITION BY ticker, date ORDER BY open) to eliminate 35,582 duplicate ticker/date pairs caused by float precision artifacts in split-adjusted OTC tickers. Removes 23.5M ghost rows from downstream.
VIEW dedup · row_number() sector filter −23.5M ghost rows
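The dedup core of the view can be sketched as follows (column list abridged; the S&P 500/market filter is elided):

```sql
with ranked as (
    select
        s.ticker,
        s.date,
        s.open, s.high, s.low, s.close, s.volume,
        i.sector,
        row_number() over (partition by s.ticker, s.date
                           order by s.open) as rn
    from {{ source('public', 'integrated_stocks') }} s
    left join {{ source('public', 'stocks_info') }} i using (ticker)
)
select *
from ranked
where rn = 1   -- one row per (ticker, date); drops the 35,582 duplicate pairs
```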
🧹
dbt — stg_stocks_cleansed
dbt_stocks schema · VIEW · NaN / Inf / overflow guard
Intermediate cleansing layer between stg_stocks and stocks_analytics. Fixes Power BI DataSource.Error: Numeric NaN not supported by System.Decimal. PostgreSQL float8 NaN satisfies NaN = NaN → TRUE — the IEEE 754 self-inequality trick (value != value) does not work. Must use value = 'NaN'::float8.

Prices (open/high/low/close) — NaN, ±Infinity, or |value| ≥ 1e14 → NULL
Volume — NaN or Infinity → 0
Rate of return — NaN, ±Infinity, or overflow → NULL

The safe_numeric() macro applies the same guard to every numeric(18,x) cast in stocks_analytics, catching NaN that propagates through lag/lead window arithmetic on historical rows.
VIEW NaN → NULL ±Inf → NULL overflow → NULL safe_numeric() macro Power BI fix
dbt — stocks_analytics
dbt_stocks schema · TABLE · 46.6M rows
Core analytics model. The CTE chain was collapsed from 5 to 3 to eliminate redundant PostgreSQL CTE materialisation. All 17+ window functions share PARTITION BY ticker ORDER BY date, enabling a single sort pass in core_calcs. SET LOCAL work_mem = '512MB' via the pre_hook eliminates disk sort spills; the post_hook runs ANALYZE to update planner statistics. A unique index on (ticker, date) enforces the model's grain.

core_calcs → MA 7/14/20/30, lag/lead, EMA 12/26, highest/lowest ever
macd_signal_calc → 9-period signal line, signal count
final → daily change %, deviations, intraday change, MACD line/signal/histogram
TABLE 3 CTEs work_mem 512MB ANALYZE post_hook ~35 min build unique index (ticker, date)
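The hooks and index translate into a dbt config block along these lines — the index name is an assumption; the materialisation and both hooks come from the source:

```sql
{{ config(
    materialized = 'table',
    pre_hook  = "set local work_mem = '512MB'",   -- keep the big window sorts in memory
    post_hook = [
        "analyze {{ this }}",                     -- refresh planner statistics after the rebuild
        "create unique index if not exists stocks_analytics_ticker_date_idx on {{ this }} (ticker, date)"
    ]
) }}
```

SET LOCAL scopes the work_mem bump to the model's own transaction, so the 512MB grant never leaks to other sessions on the Pi.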
Power BI
PPU · on-premises gateway
Connected via the on-premises data gateway on main-pi-win (Windows 11 ARM64 VM · QEMU/KVM). Import mode with a sector WHERE clause. A gateway-watchdog scheduled task checks every 5 min and auto-restarts PBIEgwService if it is down. Windows Update active hours are configured to protect the 1am dbt run window.
import mode gateway watchdog main-pi-win
.NET 10 MVC
Power BI Embedded · Pi 5
ASP.NET Core MVC app with Power BI Embedded, workspace/report browser, and orbital carousel UI. Containerised with Docker on Pi 5 · Ubuntu ARM64. Exposed via Cloudflare tunnel at linux-embedded.ricardo.expert. Migrated to .NET 10.
Docker · ARM64 Cloudflare tunnel linux-embedded.ricardo.expert
NL/SQL pipeline
Qwen2.5-Coder-14B · llama.cpp
Local natural language to SQL pipeline. Qwen2.5-Coder-14B sharded across two 16GB Pi nodes via llama.cpp RPC. FastAPI orchestration on Pi3, read-only PostgreSQL executor on Pi4. Queries stocks_analytics directly.
llama.cpp RPC FastAPI Pi3 + Pi4
1. cron 1am
2. extract
3. write to postgres
4. dbt run
5. stocks_analytics rebuilt
6. consumers query fresh data
next day
Extract
Store
Stage
Transform
Consume
power bi hybrid table — PPU
Import partitions cached · fast
1940 – 2015 · historical · never refreshed
2016 – 2020 · yearly partition
2021 – 2024 · yearly partition
Jan – Dec 2025 · monthly partition
Jan – Mar 2026 · last 3 days refreshed daily
Queries served from in-memory cache. Historical partitions locked after initial load.
DirectQuery partition live · PPU only
today ● live · no cache
yesterday · real-time window
2 days ago · real-time window
Every query hits dbt_stocks.stocks_analytics in PostgreSQL live. Always reflects the latest dbt run. No stale data for today's rows.
Power BI query engine — transparent merge
Import + DirectQuery results are merged automatically. Report visuals and DAX measures see one seamless table spanning 1940 to now. First refresh loads all 46.6M rows once. Every subsequent refresh touches only the last 3 days — seconds, not minutes.