Hacker News
Ask HN: Do you think Python will disappear for LLM inference?
2 points by astronautas on Jan 3, 2025 | 19 comments
For non-LLMs, we have a wide variety of frameworks with Python interfaces for running models, often built to call C bindings. This makes it impractical to run models directly from languages like Rust: you’d need to implement an idiomatic layer for each model runtime, something already done for Python. Nvidia's Triton covers a lot, but is it even designed for embedded use? And how feasible is adding custom logic?
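
As a rough illustration of what an "idiomatic layer" over C bindings looks like, here is a minimal sketch using Python's standard-library `ctypes` module to wrap libc's `strlen`. It assumes a POSIX system (Linux or macOS) where the C library can be loaded, and the wrapper name `c_strlen` is made up for this example. Real model runtimes expose far larger C APIs, so the per-runtime effort is much bigger than this.

```python
import ctypes
import ctypes.util

# Load the C standard library (assumes a POSIX system such as Linux or macOS).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(text: str) -> int:
    """Hypothetical idiomatic wrapper: takes a Python str, hides the C details."""
    return libc.strlen(text.encode("utf-8"))


print(c_strlen("llama"))
```

Every host language needs its own version of this declare-and-wrap step for each runtime, which is the maintenance cost the question is about.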

For LLMs, it feels like the opposite. There’s a smaller set of frameworks (e.g., llama.cpp, vllm), each supporting a wide range of models. This makes it relatively straightforward to integrate them into other languages like Go, as you only need to maintain a few idiomatic layers.

To me, it’s a no-brainer that Go or Rust will replace Python for serving LLMs. They’re CPU-intensive, Python is generally slow, and the limited number of LLM runtimes simplifies the transition.



On the main question, I don’t think Python will disappear for LLM inference soon. But I think there are two processes that will determine the longer-term trend, as AI inference gets built into more things:

(1) Python’s dominance in AI inference is driving, and will continue to drive, more investment in improving Python for lots of things that it isn’t great at right now that people want to do along with AI inference. We’ve actually seen a lot of that over the last few years, with physics engines and robotics simulation platforms for Python, some of which are Python bindings for existing libraries written in other languages, but some of which are built in Python (e.g., via Taichi or Numba, both of which can produce and execute GPU kernels from Python code, and the latter of which can also JIT and parallelize (mostly numeric) Python code on CPU). This will also include investment in Python’s core and standard library to address pain points.

(2) The increasing importance of AI inference will at the same time drive more investment in AI inference libraries for non-Python platforms.

The relative balance between the progress of those two efforts will be a big factor in how much Python is used in inference going forward, for AI in general, and for LLMs in particular.


The summer BERT came out I worked at a company whose models were not as good as what we have today. I was working on a Python-based framework for model training, but we would package the TensorFlow models, run them inside Scala, and expose them via a web service.

My guess is the CPU overhead of Python is not significant compared to running an LLM, but Python has limited facilities for dealing with concurrency. For a while I was into writing asyncio web servers, but I eventually found workloads (an image sorter running the wrong way on an ADSL connection: one process is thinking hard for 2 sec, meanwhile images are not downloading) that would tie them into knots. gunicorn, celery, and similar tools can handle parallelism with multiple processes, but if you have a 1GB model, each worker process typically holds its own copy and you waste a lot of memory.

In Java on the other hand you can have a 1GB model and it is shared by the threads and there is no drama.
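
For comparison, the memory-sharing half of this also works with Python threads. The sketch below is only an illustration: `MODEL_WEIGHTS` and `score` are hypothetical stand-ins for a real model and its inference call. It shows that all worker threads read the same in-memory object. It does not show parallel speedup, because with the GIL, pure-Python work in these threads still runs one thread at a time.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a large model: one list of weights held once in process memory.
# (Hypothetical; a real model would be loaded by an inference runtime.)
MODEL_WEIGHTS = [0.5] * 1_000_000


def score(request_id: int) -> tuple[int, int, float]:
    """Pretend inference: read the shared weights without copying them."""
    partial = sum(MODEL_WEIGHTS[request_id::100_000])
    return request_id, id(MODEL_WEIGHTS), partial


with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(score, range(8)))

# Every worker saw the same object, so the weights exist once in memory.
shared = {obj_id for _, obj_id, _ in results}
print(len(shared))
```

With a multi-process server, each worker would usually load its own copy of the weights instead, which is the waste described above.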

I wrote a chess program in Python that was good enough to beat my tester a few times last month and have been wanting to take it to a chess club but my tester tells me it needs to respect time control for that. Also I'd like to support a protocol like XBoard or UCI. Either way it is necessary that the comms thread can interrupt the thinking thread and that's dead easy to do in Java and a huge hassle in Python.
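
For what it's worth, the usual workaround in Python is cooperative: the thinking thread has to poll a flag, because there is no supported way to interrupt a thread from outside. A minimal sketch, assuming a hypothetical `think` search loop and using `time.sleep` as a placeholder for real search work:

```python
import threading
import time


def think(stop: threading.Event, best_move: list) -> None:
    """Pretend search: deepen one ply at a time until asked to stop."""
    depth = 0
    while not stop.is_set():
        depth += 1
        best_move[0] = f"move-from-depth-{depth}"  # placeholder result
        time.sleep(0.01)  # stands in for searching one more ply


def play_with_time_control(seconds: float) -> str:
    stop = threading.Event()
    best_move = [None]
    worker = threading.Thread(target=think, args=(stop, best_move))
    worker.start()
    # The comms side waits out the clock (or a "stop" command), then signals.
    time.sleep(seconds)
    stop.set()
    worker.join()
    return best_move[0]


print(play_with_time_control(0.1))
```

The catch is that the search only stops at the points where it checks `stop.is_set()`, so a real engine would need that check inside its inner loop.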

Sure, there are threads in Python, and if I wanted to screw around with alpha software there is the no-GIL Python, but remember this: when you're doing a project which has a high-risk or research component, it's a bad time to pick tools that require you to learn things. If you are good at Rust or Go I'd say go with that. But don't pick up a language because you heard somebody else thinks it's cool. A lot of people are running big and complex apps on Java but you don't hear about it so much.


Good perspective. No-GIL should make things better (shared memory parallelism), but it's not bulletproof.


If I could boot up Python with --no-gil I might give it a shot but it looks like a hassle to install a no-GIL Python now. And after all that hassle I'd expect it to take years for the details to get worked out.

One reason Java was successful with thread safety was its xenophobia. Sure you can load libs with JNI but it feels like putting your hand in a toilet. If you add threads to POSIX you will always have old libraries not built with threads in mind; if you start out with threads, the whole ecosystem is built on the assumption it should work right with threads.


No, it won't. Python is a scripting language that is more composable than Go or Rust and can even be optimized for inference just as well, at least in theory. LLM inference doesn't necessitate a strong type system which actually gives Python a bit of an advantage for less complex programs.

The only place where Go and Rust take the lead is in optimizing the non-AI code that you write. That's still a valuable advantage, but it's not going to displace the use case for Python on its own.


Indeed, my point was that Go and Rust could lead in optimizing the non-AI code, which often begs to be coupled with AI code (think guardrails).

Also, what's the benefit of Python then in this case? Ergonomically, Go isn't shabby; Rust is another story, though.


The intensive code is not running in Python; your assumptions are bad.


Sure, but what about the non-AI / business logic before and after inference? Think RAG calls, guardrails, ...? Or do they fly compared to LLM inference itself?


Probably depends on the specific case. Basic initialization is stupid fast compared to any stupid training exercise. If your training is going to take anything longer than a minute, you are probably very, very safe keeping init/postprocess in Python, and if you're doing anything intense, consider importing a faster way of doing it. Python is like a motherboard for soldering components.


Agree: it depends, always benchmark. But my question is rather generic, i.e., I am looking for a perspective.
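
On the "always benchmark" part, a minimal sketch of how one might measure what share of a request goes to pre-processing, inference, and post-processing. All three stage functions are hypothetical placeholders, and the `time.sleep` delay is an arbitrary stand-in, not a measurement of any real model:

```python
import time


def timed(fn, *args):
    """Return (result, elapsed_seconds) for one call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Hypothetical stages; swap in real pre-processing, model call, and post-processing.
def preprocess(prompt: str) -> str:
    return prompt.strip().lower()


def fake_inference(prompt: str) -> str:
    time.sleep(0.05)  # placeholder delay, not a measurement of any real model
    return prompt + " -> answer"


def postprocess(text: str) -> str:
    return text.upper()


def profile(prompt: str) -> dict:
    """Run the three stages once and return each stage's share of total time."""
    timings = {}
    text, timings["pre"] = timed(preprocess, prompt)
    text, timings["inference"] = timed(fake_inference, text)
    text, timings["post"] = timed(postprocess, text)
    total = sum(timings.values())
    return {stage: t / total for stage, t in timings.items()}


shares = profile("  What is RAG?  ")
for stage, share in shares.items():
    print(f"{stage}: {share:.1%}")
```

If the non-inference share turns out to be small with real stages plugged in, rewriting that glue in another language is unlikely to change much.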


OK, my perspective on pursuing something like this in Rust, continuing the motherboard analogy, is that you take on the full burden of developing everything in your laptop onto a single chip (sounds pretty awesome, really), but your boundaries can end up blurred as everything is just a single etch. Forcing boundaries by making components with hard-won decisions creates a marketplace of exchangeable components. Force everyone to agree on that API by putting it across a hard language boundary.

Your impact is measured by raw performance masked by time in market. Pure Rust or whatever will have higher raw performance but lower overall (my guess) impact because it misses out on a wider market. A Python API will in general have very slightly lower raw performance but much wider time in market.


can't disagree (a tradeoff).


It's getting very philosophical. Without a (perhaps impossibly complex/fraught) model of real-world programmers using all available languages doing just what you ask, there is no way to find the sweet spot. I've given some rules of thumb for why Python still rules and why things in general favor standard forms easily reviewable by humans and peppered with magic only understandable by geeks.

edit: you've removed a line while I was responding, something like: "maybe go is a sweet-spot"

I think if you are looking to replace Python as a runtime you'd maybe be better off arguing for safety. Python is much more easily corruptible, so if you don't trust the machine(s) you are training on and don't want to be misled by someone hacking your initializations, then arguably running a compiled program with certain guarantees is safer.


I agree Python is hardly replaceable. Btw, I only mean inference, not training. Training imho should stay pure Python; you can achieve mega throughput with it for batch processing.


> Btw, I only mean inference

D'oh! Yea that changes things. I would be considering UI integration from the inference side.


When you say RAG calls are you talking about requests to the (usually vector db) external datastore, or the repeated calls to the LLM?

“Guardrails” are often just calls to one or more (usually smaller) classification/moderation models.


call to external db, and then to llm with retrieved context.

also business rules, no?


> call to external db, and then to llm with retrieved context.

Right, neither of those is CPU-intensive in a different way than LLM inference itself (the latter is LLM inference itself).

> also business rules, no?

Business rules can vary quite a bit in content and complexity, but they tend to be either simple enough that they won’t impose much additional load, or complex enough that you are probably going to want to simply use an existing rules engine (many of which, regardless of their implementation language, have Python bindings), which is going to behave the same way no matter what language you call it from.
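
To make the shape of that request path concrete, here is a rough sketch of retrieval, a guardrail check, and the LLM call wired together with `asyncio`. Every function here is a hypothetical stub, with `asyncio.sleep` standing in for network calls to a datastore, a moderation model, and an LLM server. The point is only that the orchestrating code mostly waits on other services.

```python
import asyncio


async def retrieve(query: str) -> list[str]:
    """Hypothetical vector-store lookup; a real one would be a network call."""
    await asyncio.sleep(0.01)
    return [f"doc about {query}"]


async def guardrail_ok(text: str) -> bool:
    """Hypothetical moderation check; in practice often a small classifier model."""
    await asyncio.sleep(0.01)
    return "forbidden" not in text.lower()


async def generate(query: str, context: list[str]) -> str:
    """Hypothetical call to an LLM server (the actual inference happens there)."""
    await asyncio.sleep(0.05)
    return f"answer to {query!r} using {len(context)} document(s)"


async def handle(query: str) -> str:
    # Retrieval and the input guardrail are independent, so wait on both at once.
    context, allowed = await asyncio.gather(retrieve(query), guardrail_ok(query))
    if not allowed:
        return "request refused"
    return await generate(query, context)


print(asyncio.run(handle("python for inference")))
```

Under these assumptions the host language spends nearly all of its time awaiting I/O, so the language's own speed matters little for this part of the path.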


Fair points!



