I built the first open-source FP8 linear solver in Python — 2-3x faster than cuBLAS

I'm a second-year CS student. Last week I published ssBlast, an open-source Python library that solves large linear systems 2-3x faster than cuBLAS using FP8 precision on consumer NVIDIA GPUs. Here's exactly how it works and why it's fast.

The Problem

Solving Ax = b (where A is a huge matrix) is one of the most common operations in scientific computing:

Weather prediction: 1,000,000 unknowns
Airplane simulation: 500,000 unknowns
Drug discovery: 100,000 unknowns

CPU solvers take hours. GPU solvers are faster, but existing tools either don't support FP8 or require C++ expertise.

Why FP8 is Faster

Floating-point formats differ in how many bytes they use to store each number:

FP64 = 8 bytes per number (very precise)
FP32 = 4 bytes per number
FP16 = 2 bytes per number
FP8 = 1 byte per number (rough)

Fewer bytes per number means less data to read from GPU memory, which means faster computation.

FP64: 128 MB for a 4000×4000 matrix
FP8: 16 MB for the same matrix (8x less!)

RTX 4050 FP8 Tensor Cores = ~330 TFLOPS R
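The memory figures quoted above are easy to check by hand. Here is a minimal sketch of that arithmetic in plain Python; it is not part of ssBlast, the helper name `matrix_megabytes` is made up for illustration, and it assumes a dense square matrix with 1 MB counted as 10^6 bytes.

```python
# Bytes used to store one number in each floating-point format.
BYTES_PER_ELEMENT = {"FP64": 8, "FP32": 4, "FP16": 2, "FP8": 1}


def matrix_megabytes(n, fmt):
    """Storage for a dense n x n matrix in decimal megabytes (hypothetical helper)."""
    return n * n * BYTES_PER_ELEMENT[fmt] / 1e6


if __name__ == "__main__":
    # Print the footprint of a 4000 x 4000 matrix in every format.
    for fmt in BYTES_PER_ELEMENT:
        print(f"{fmt}: {matrix_megabytes(4000, fmt):.0f} MB")
```

For n = 4000 this gives 128 MB in FP64 and 16 MB in FP8, matching the 8x ratio above. Keep in mind that the footprint only bounds how much data has to move; the actual speedup depends on the rest of the solver.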
