Saturday Morning Inference
At work and at home it's Open Source LLMs every day!

Less prose this morning, more focus on home-lab and new OSS tech.

What is LoRA and why is it not LoRaWAN?

It's an acronym for "Low-Rank Adaptation", a fine-tuning technique for LLMs that freezes the original model weights and trains only small low-rank update matrices, which dramatically reduces the resources needed for post-training fine-tuning.
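As a rough mental model, here's a minimal sketch of the idea in PyTorch. This is illustrative only; the class name and hyperparameters are mine, not from any particular LoRA library.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer augmented with a trainable low-rank update.

    Instead of re-training the full weight W, LoRA learns two small matrices
    A (r x d_in) and B (d_out x r) so the effective weight becomes
    W + (alpha / r) * B @ A. Only A and B receive gradients, which is why
    fine-tuning needs far less memory than a full re-train.
    """

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the original weights
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Original projection plus the low-rank correction
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scale
```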


What is Auto-Round and why should we care?

Another neural-network optimization tool, this one for quantizing large language models down to a specific, lower precision ("still large, but optimized"). Intel released a Q4KS quantized version of the GPT-OSS-20B model on Hugging Face for those who want a high-performance model on lower-resource hardware.

  • This algorithm lets engineers take models originally intended for Very Large Systems With Lots of VRAM and run them on lower-spec systems, which are often optimized for tokens/sec per watt (Tk/s/W).
  • It also lets models that normally expect Nvidia GPUs run on alternate hardware (CPU, HPU, etc.), and supports multiple calibration datasets, mixed-bit quantization, and more (see the sketch after this list).
  • Reference: Intel Auto-Round Quantizer
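
For flavor, here is a minimal sketch of what quantizing a model with Auto-Round looks like, based on the project's published usage. The model id and exact argument names are assumptions on my part and may differ between library versions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

# Assumption: the Hugging Face model id for OpenAI's 20B OSS release
model_name = "openai/gpt-oss-20b"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 4-bit weight-only quantization with the library's default calibration data
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()

# Export the quantized checkpoint for inference on lower-spec hardware
autoround.save_quantized("./gpt-oss-20b-int4", format="auto_round")
```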

64GB vRAM GPUs vs Cerebras Cloud

The big news this week has been OpenAI's release of two OSS models, GPT-OSS-20B and GPT-OSS-120B, which are quite exciting for technical reasons I won't go into right now.

So, here's a quick comparison of running GPT-OSS-20B on a local-at-home GPU host vs. the inference cloud at work, where we're running the GPT-OSS-120B variant of the model (feel free to try that here: https://cloud.cerebras.ai).

First, a screenshot of the local system running Open-WebUI with Ollama backend. See the hover-box for the tokens/sec performance values.

Token results for GPT-OSS-20B running on a 64GB vRAM local@home system

And then we see the tokens per second for the Cerebras cluster, running the GPT-OSS-120B variant of the model on what is, in equivalent terms, millions of cores and multiple terabytes of vRAM.

Token results for GPT-OSS-120B running on the public Cerebras cluster

Cerebras UI Term | Corresponds to    | Explanation
Input tokens/s   | Prompt tokens/s   | Speed of processing the input/context tokens (prefill phase)
Output tokens/s  | Response tokens/s | Speed of generating new tokens sequentially (generation phase)
Total tokens/s   | Combined metric   | (Input + Output tokens) / total processing time
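
For clarity, here's how I read those three rows, as a toy calculation. The numbers below are made up for illustration, not pulled from either system's telemetry.

```python
def throughput(prompt_tokens: int, output_tokens: int,
               prefill_seconds: float, generation_seconds: float) -> dict:
    """Derive the three rates shown in the Cerebras UI from raw counts and timings."""
    total_seconds = prefill_seconds + generation_seconds
    return {
        "input_tokens_per_s": prompt_tokens / prefill_seconds,      # prefill phase
        "output_tokens_per_s": output_tokens / generation_seconds,  # generation phase
        "total_tokens_per_s": (prompt_tokens + output_tokens) / total_seconds,
    }

# Illustrative numbers only
print(throughput(prompt_tokens=1000, output_tokens=500,
                 prefill_seconds=0.05, generation_seconds=0.40))
```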

Using the same prompt query, here are the token rates for the local development system and the Cerebras cluster:

Prompt tokens/sec | Response tokens/sec | System      | Cores
541/s             | 36/s                | 4x A4000    | 44,352
22,858/s          | 1,394/s             | multi-CS-3  | millions

Big numbers! 😃
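
If you want to reproduce a rough version of this comparison yourself, both backends can be hit through an OpenAI-compatible chat API, so a crude wall-clock timing script looks roughly like the sketch below. The endpoints, model ids, and API key handling are assumptions, and this measures end-to-end generation throughput rather than the separate prefill/generation rates above.

```python
import time
from openai import OpenAI

# Assumptions: Ollama's OpenAI-compatible endpoint on its default port,
# and a Cerebras Cloud API key; model ids are illustrative.
targets = {
    "local (Ollama)": (OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"),
                       "gpt-oss:20b"),
    "Cerebras Cloud": (OpenAI(base_url="https://api.cerebras.ai/v1", api_key="YOUR_KEY"),
                       "gpt-oss-120b"),
}

prompt = "Explain low-rank adaptation in two sentences."
for name, (client, model) in targets.items():
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed = time.perf_counter() - start
    out_tokens = resp.usage.completion_tokens
    print(f"{name}: {out_tokens} tokens in {elapsed:.2f}s "
          f"~ {out_tokens / elapsed:.1f} tok/s")
```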
