I ran into a practical limitation while working on ML feature engineering and multi-omics data. At some point, the problem stops being “how many rows” and becomes “how many columns”. Thousands, then tens of thousands, sometimes more.

What I observed in practice:

- Standard SQL databases usually cap out at roughly 1,000–1,600 columns.
- Columnar formats like Parquet can handle width, but typically require Spark or Python pipelines.
- OLAP engines are fast, but tend to assume relatively narrow schemas.
- Feature stores often work around this by exploding data into joins or multiple tables.

At extreme width, metadata handling, query planning, and even SQL parsing become bottlenecks.

I experimented with a different approach:

- no joins
- no transactions
- columns distributed instead of rows
- SELECT as the primary operation

With this design, it’s possible to run native SQL selects on tables with hundreds of thousands to millions of columns, with predictable (sub-second) latency when accessing a subset of columns. (A short Parquet column-projection example and a toy sketch of the column-per-file layout are at the end of this post.)

On a small cluster (2 servers, AMD EPYC, 128 GB RAM each), rough numbers look like:

- creating a 1M-column table: ~6 minutes
- inserting a single column with 1M values: ~2 seconds
- selecting ~60 columns over ~5,000 rows: ~1 second

I’m curious how others here approach ultra-wide datasets. Have you seen architectures that work cleanly at this width without resorting to heavy ETL or complex joins?
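On the Parquet point, column projection is what makes width manageable there; the catch is that you end up driving it from Python or Spark rather than plain SQL. A minimal pyarrow illustration (the file and column names are made up):

    import pyarrow.parquet as pq

    # Reading a wide Parquet file only pulls the column chunks you ask for,
    # so extreme width mostly costs metadata handling, not I/O.
    table = pq.read_table(
        "features.parquet",
        columns=["sample_id", "feature_000042", "feature_317001"],
    )
    print(table.num_rows, table.num_columns)  # only the 3 projected columns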
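And to make “columns distributed instead of rows” concrete, here is a deliberately simplified toy sketch of the layout idea, not the actual engine: each column is stored as its own object, so a select only touches the objects for the requested columns, and cost scales with how many columns you select rather than how wide the table is. The class, file format, and API below are invented purely for illustration:

    import json
    from pathlib import Path

    class WideTable:
        """Toy column-per-file layout: table_dir/<column>.json holds one column."""

        def __init__(self, table_dir):
            self.dir = Path(table_dir)
            self.dir.mkdir(parents=True, exist_ok=True)

        def insert_column(self, name, values):
            # Writing one column is independent of how many other columns exist.
            (self.dir / f"{name}.json").write_text(json.dumps(values))

        def select(self, columns, rows=slice(None)):
            # Cost is proportional to len(columns), not to total table width.
            out = {}
            for name in columns:
                values = json.loads((self.dir / f"{name}.json").read_text())
                out[name] = values[rows]
            return out

    # A “1M-column” table is just 1M independent objects; selecting 60 of them
    # reads 60 objects, which is why latency stays flat as width grows.
    t = WideTable("demo_table")
    t.insert_column("feature_000001", list(range(5000)))
    t.insert_column("feature_000002", [v * 2 for v in range(5000)])
    print(t.select(["feature_000001", "feature_000002"], slice(0, 3)))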