For non-LLMs, we have a wide variety of frameworks with Python interfaces for running models, often built to call C bindings. This makes it impractical to run models directly from languages like Rust: you’d need to implement an idiomatic layer for each model runtime, something already done for Python. Nvidia's Triton Inference Server covers a lot, but is it even designed for embedded use? And how feasible is adding custom logic?
For LLMs, it feels like the opposite. There’s a smaller set of frameworks (e.g., llama.cpp, vLLM), each supporting a wide range of models. This makes it relatively straightforward to integrate them into other languages like Go, as you only need to maintain a few idiomatic layers.
To me, it’s a no-brainer that Go or Rust will replace Python for serving LLMs. LLM serving is CPU-intensive, Python is generally slow, and the limited number of LLM runtimes simplifies the transition.
(1) Python’s dominance in AI inference is driving, and will continue to drive, more investment in improving Python for the many things it isn’t great at right now but that people want to do along with AI inference. We’ve actually seen a lot of that over the last few years, with physics engines and robotics simulation platforms for Python. Some of these are Python bindings for existing libraries written in other languages, but some are built in Python itself, e.g., via Taichi or Numba. Both can produce and execute GPU kernels from Python code, and Numba can also JIT-compile and parallelize (mostly numeric) Python code on CPU. This will also include investment in Python’s core and standard library to address pain points.
(2) The increasing importance of AI inference will at the same time drive more investment in AI inference libraries for non-Python platforms.
The relative balance between the progress of those two efforts will be a big factor in how much Python is used in inference going forward, for AI in general and for LLMs in particular.