How I built a 39× compression pipeline with AES-256-GCM in Python (and why the dictionary is everything)

via Dev.to Python · Naveen Badiger

I store LLM training data. Every tool I found either compresses it or encrypts it; none did both. So I built QUANTUM-PULSE.

The pipeline

    payload → MsgPack → Zstd-L22 + corpus dict → AES-256-GCM → SHA3-256 Merkle

Step 1: MsgPack over JSON

Before compression, serializing with MsgPack instead of JSON shrinks the payload by ~22%:

    import msgpack

    # payload is the record to store (any MsgPack-serializable object)
    raw = msgpack.packb(payload, use_bin_type=True)
    # 22% smaller than json.dumps().encode(); better input = better downstream ratio

Step 2: The dictionary insight

Standard Zstd builds its probability model from scratch on every call. For training records that share the same schema, this is wasted work. Train a dictionary once and reuse it:

    import zstandard as zstd

    # corpus_samples: list of bytes objects (serialized records)
    # Train a 128 KiB dictionary on the first 200 samples
    dict_data = zstd.train_dictionary(131072, corpus_samples[:200])
    cctx = zstd.ZstdCompressor(level=22, dict_data=dict_data)
    compressed = cctx.compress(raw)

Result: 28.46× with the dictionary vs 14.64× for vanilla Zstd, a +94.4% improvement in ratio, and 29% faster. The dictionary retrains automatically every 24h via APScheduler as new data arrives.

Step 3:
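Returning to the dictionary idea from Step 2: the effect is easy to reproduce with the standard library alone. The sketch below is not the QUANTUM-PULSE code. It swaps in zlib's preset dictionary (`zdict`) as a stand-in for a trained Zstd dictionary, and the record layout and the names `make_record`, `compress`, `decompress`, and `shared_dict` are hypothetical, chosen only for illustration. It compresses one small record with and without a dictionary built from records that share its schema.

```python
import json
import zlib
from typing import Optional


def make_record(i: int) -> bytes:
    """Build one hypothetical training record; all records share a schema."""
    record = {
        "prompt": f"Question number {i}",
        "completion": f"Answer number {i}",
        "metadata": {"source": "synthetic", "language": "en", "split": "train"},
    }
    return json.dumps(record).encode("utf-8")


def compress(data: bytes, zdict: Optional[bytes] = None) -> bytes:
    """Deflate data, optionally priming the compressor with a preset dictionary."""
    if zdict is None:
        comp = zlib.compressobj(level=9)
    else:
        comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()


def decompress(blob: bytes, zdict: Optional[bytes] = None) -> bytes:
    """Inflate blob; the same dictionary must be supplied if one was used."""
    if zdict is None:
        decomp = zlib.decompressobj()
    else:
        decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


# Assumption: a handful of earlier records stands in for the training corpus.
shared_dict = b"".join(make_record(i) for i in range(20))

record = make_record(12345)
plain = compress(record)                 # no prior knowledge of the schema
primed = compress(record, shared_dict)   # schema already known via the dictionary

assert decompress(plain) == record
assert decompress(primed, shared_dict) == record
print(f"raw={len(record)}B  plain={len(plain)}B  with_dict={len(primed)}B")
```

The gap is largest for small records, where the repeated keys and fixed values make up most of the bytes. Note that the dictionary becomes part of the format: a blob compressed with it cannot be read back without the exact same bytes, so a pipeline that retrains its dictionary on a schedule would presumably need to keep older dictionaries around for older data.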

Continue reading on Dev.to Python
