>You don't pick an inference engine first. You pick a hardware strategy, a workload shape, and a serving model. The engine follows.

That is the most useful way to think about LLM inference engines.
Series note: This is Part 3 of my series on self-hosted LLMs and local AI.
Parts 1 and 2 explain the hardware capacity and bandwidth math.
This one explains the software layer that turns that hardware into usable inference.
## Engines
These tools serve different purposes and occupy different layers:
• Local portability
• Consumer CUDA
• Apple unified-memory workflows
• Quantized inference
• Production serving
• Distributed orchestration
• Vendor-optimized datacenter execution
A useful mental model:

The inference engine is not "the model." It is the traffic cop, memory manager, kernel dispatcher, scheduler, cache accountant, parallelism planner, API surface, and sometimes the deployment framework.
The best engine matches your memory hierarchy, interconnect, quantization format, latency and throughput targets, model architecture, and operational maturity.
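To make the "engine follows" idea concrete, here is a minimal sketch that maps a hardware strategy, a workload shape, and a serving model onto one of the layers listed above. Everything in it is an assumption made for illustration: the `Hardware` labels, the `Requirements` fields, the `SERVING_THRESHOLD` cutoff, and the rule order are all hypothetical, not a recommendation from any engine's documentation. Quantized inference is left out because this sketch treats it as a format choice rather than a deployment target.

```python
from dataclasses import dataclass
from enum import Enum


class Hardware(Enum):
    """Hardware strategy (hypothetical labels for this sketch)."""
    CPU_OR_MIXED = "cpu_or_mixed"
    CONSUMER_CUDA = "consumer_cuda"
    APPLE_SILICON = "apple_silicon"
    DATACENTER_GPU = "datacenter_gpu"


@dataclass(frozen=True)
class Requirements:
    hardware: Hardware          # hardware strategy
    concurrent_requests: int    # workload shape (rough proxy)
    nodes: int                  # serving model: 1 means a single machine


# Assumed cutoff between "personal use" and "serving"; tune for your case.
SERVING_THRESHOLD = 8

# For single-machine, low-concurrency use, the hardware decides the layer.
_SINGLE_USER_LAYER = {
    Hardware.CPU_OR_MIXED: "Local portability",
    Hardware.CONSUMER_CUDA: "Consumer CUDA",
    Hardware.APPLE_SILICON: "Apple unified-memory workflows",
    Hardware.DATACENTER_GPU: "Vendor-optimized datacenter execution",
}


def engine_layer(req: Requirements) -> str:
    """Return the engine layer to evaluate first, given the three choices."""
    if req.nodes > 1:
        # Multiple machines: orchestration concerns dominate.
        return "Distributed orchestration"
    if req.concurrent_requests >= SERVING_THRESHOLD:
        # Many simultaneous requests on one machine: scheduling dominates.
        return "Production serving"
    return _SINGLE_USER_LAYER[req.hardware]


if __name__ == "__main__":
    laptop = Requirements(Hardware.APPLE_SILICON, concurrent_requests=1, nodes=1)
    print(engine_layer(laptop))
```

The point of the sketch is the order of the questions, not the specific answers: the function never takes an engine name as input, only the constraints that narrow the field.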
