AI Update - Mid 2025

Entai2965

Moderator
Elite Member
Dec 28, 2023
While I was writing another translation guide, it became necessary to revisit the state of AI to explain how the software interacts. That overview grew a bit large, so instead of posting it as a subsection, it makes more sense to post it here as its own article.

Background knowledge
- Read the wiki https://www.reddit.com/r/LocalLLaMA/wiki/wiki/
- Introduction to gguf https://huggingface.co/docs/hub/gguf and https://github.com/ggml-org/ggml/blob/master/docs/gguf.md All AI software worth using locally can read .gguf, the de facto standard for distributing LLMs.
- I also found this medium article to be very helpful. https://medium.com/thedeephub/50-open-source-options-for-running-llms-locally-db1ec6f5a54f
- This techtactician article has an oddly comprehensive introduction. https://techtactician.com/beginners-guide-to-local-llm-hardware-software

Summary
Here is a quick summary on the available software.

Software is broadly split into backend engines, frontend UIs, and software that tries to do both. A lot of the better software only does one or the other particularly well: it either emphasizes running the models well, like allowing split VRAM/RAM loading, or it focuses on providing a useful user interface for the end user, never both.

One example of this is ollama. ollama embeds llama.cpp to handle the core engine of running models and then exposes an OpenAI-compatible server. That allows the user to run any client that speaks the OpenAI API while benefiting from a project focused on providing sensible defaults for many llama.cpp-compatible models.
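As a rough sketch of what that looks like from the client side, the snippet below uses the openai Python package against ollama's local server. The base URL and port are ollama's defaults; the model name is a placeholder for whatever model has already been pulled.

```python
# Minimal sketch: any OpenAI-compatible client can talk to ollama's local server.
# Assumes ollama is running on its default port and a model has already been pulled.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # ollama's OpenAI-compatible endpoint
    api_key="ollama",                      # ignored by ollama, but the client requires a value
)

response = client.chat.completions.create(
    model="llama3.2",  # placeholder; use whatever model was pulled with `ollama pull`
    messages=[{"role": "user", "content": "Summarize what GGUF is in one sentence."}],
)
print(response.choices[0].message.content)
```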

AI Engines
These are mostly for programmers or highly technical users.

Engine focused
This is software for power users that want to tinker to get the best performance possible.
Note that llama.cpp also broadly fits into this category, excluding the automated configuration performed by these tools.
The main use of these is to expose an OpenAI API compatible endpoint for use with different UI software. They also tend to embed simple UIs for basic use.
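For example, once one of these engines is running, a frontend, or a quick script, can ask the standard /v1/models route what it is serving. The host and port below are assumptions for a locally running server.

```python
# Sketch: probe a locally running OpenAI-compatible engine for its model list.
# The host/port are assumptions; adjust to wherever the engine's server is listening.
import requests

resp = requests.get("http://127.0.0.1:8080/v1/models", timeout=5)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])
```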

UI focused
There are some projects focused purely on being a frontend. They may have beta-like support for running models with a slow compatibility-focused engine, but they are really intended and expected to be pointed at an OpenAI-compatible server running somewhere else.
The best way to find reputable frontends is to go to llama.cpp's description, https://github.com/ggml-org/llama.cpp#description , and look at its list of UIs for llama.cpp. There are more beyond that list, but it is a good starting point.

Hybrid tools
Hybrid tools combine an LLM engine backend with a frontend UI, sometimes a native desktop UI, and try to get both working really well. They often integrate model downloaders somewhere into the UI. They tend not to allow customizing how GGUF layers are split between VRAM and main memory, if they support splitting at all. In practice, the inability to tweak settings tends to mean very slow performance and poor model compatibility as well.
- LMStudio
  - closed source, not recommended
  - seems to work, despite being closed source
- gpt4all
  - It seems partly discontinued? Not recommended.
  - It does launch, but it does not support recent-ish models like gemma3.

All in one environments
This is an extension of the hybrid tools. These are more like frameworks that try to combine other tools, like both vllm and desktop software, to abstract away all of the details of running LLMs.
There is a saying in engineering about making things as simple as possible, but no simpler. These fall into that pitfall: they try to abstract away too much, which makes the software completely unusable for any purpose. If these work at all for anything, then you have found a unicorn and should be appropriately amazed by it.

Splitting GGUF layers
Splitting GGUF layers between VRAM and main memory lets a computer load certain models it would otherwise be unable to fit, and run them at the fastest speed the hardware allows. This feature is not particularly useful for interactive LLM use, because offloading layers to main memory dramatically slows down inference, nor for cloud inference providers, because they run $25k GPUs, or TPUs, with enough VRAM to hold the models outright. It is critical, however, for software development on desktop systems and for local batch processing, where processing data accurately matters more than speed. Support for this feature is growing over time but is still fairly abysmal. Here is some of the software that supports splitting GGUF layers (a short sketch follows the list).
- llama.cpp
- ollama
- KoboldCpp
- LMStudio
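
As one concrete sketch of layer splitting, llama.cpp's Python bindings (llama-cpp-python) expose the split directly through n_gpu_layers. The model path and layer count below are placeholders; the right number of layers depends on how much VRAM is free.

```python
# Sketch of VRAM/RAM layer splitting with llama-cpp-python (llama.cpp's Python bindings).
# n_gpu_layers controls how many transformer layers are offloaded to the GPU;
# the remainder stay in main memory. The path and layer count are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/example-model.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=20,   # offload 20 layers to VRAM, keep the rest in RAM
    n_ctx=4096,        # context window; adjust to taste
)

out = llm("Q: What does GGUF stand for?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```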

OpenAI's API
OpenAI, https://openai.com, (OAI) is the company that made ChatGPT, https://chatgpt.com. Contrary to their name, almost everything about them is proprietary. The notable exceptions are
- a well documented API, https://platform.openai.com/docs/quickstart
- client libraries for their API that work in a server-agnostic way, https://github.com/orgs/openai/repositories
- and a very small number of public models, notably their whisper series, which does speech recognition and transcription, https://github.com/openai/whisper

OpenAI documents and develops the de facto API, especially the Chat Completions endpoint, https://platform.openai.com/docs/api-reference/chat, that software developers code against. The OpenAI API allows clients and servers to communicate in a standardized way, which enhances compatibility among the various tools in the AI ecosystem. If a tool does not support the OpenAI API, then it is not really part of the ecosystem and has a strong proprietary vibe to it.
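To make the shape of the API concrete, here is a minimal sketch of a raw Chat Completions request sent over plain HTTP. The server URL, API key, and model name are placeholders for whatever OpenAI-compatible server is in use.

```python
# Sketch of a raw Chat Completions request against any OpenAI-compatible server.
# The URL, API key, and model name are placeholders.
import requests

payload = {
    "model": "some-model",
    "messages": [
        {"role": "system", "content": "You are a helpful translator."},
        {"role": "user", "content": "Translate 'こんにちは' to English."},
    ],
}
resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    headers={"Authorization": "Bearer not-needed-for-most-local-servers"},
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```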

Routers
Much of the software above exposes an OpenAI-compatible server, which allows end users to select their own OpenAI-compatible clients. Because of that server-client dynamic, the server running the LLM can be on the local computer, http://127.0.0.1/, somewhere on the local network, or somewhere on the internet.

On the internet, there are a lot of service providers that offer to run the models, presumably on their hardware, and allow clients to interact with them for a fee. They typically expose an OpenAI compatible API. So to communicate with them, use either one of the dedicated frontend clients, or use OpenAI's client libraries.

OpenRouter https://openrouter.ai is one online service that simplifies using online model providers. They support many models by aggregating the different providers that host a particular model and then dynamically routing any requests they receive among those providers.

Instead of going to Deepseek, https://www.deepseek.com/en, and using their API directly to run Deepseek's v3.1 model, https://huggingface.co/deepseek-ai/DeepSeek-V3.1, signing up for OpenRouter allows OpenRouter to forward the request to Deepseek.

https://openrouter.ai/deepseek/deepseek-chat-v3.1/providers

If Deepseek's API is ever down, or another AI provider can run that same model faster or cheaper, then OpenRouter will automatically route the request appropriately to increase speed or minimize cost. How and when to route requests can be configured on OpenRouter's settings page. They also support whitelisting/blacklisting providers, and there is a free tier.

"For free models, rate limits are determined by the credits that you have purchased. If you have purchased at least 10 credits, your free model rate limit will be 1000 requests per day. Otherwise, you will be rate limited to 50 free model API requests per day."

Other routers exist, like Hugging Face's inference providers, https://huggingface.co/inference/get-started, documented at https://huggingface.co/docs/inference-providers/index. Jan, https://www.jan.ai/docs/desktop/remote-models/huggingface, also has some extra instructions for working with huggingface's router.

LiteLLM https://docs.litellm.ai/docs seems to be a local version of OpenRouter.
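
A minimal sketch of that idea with LiteLLM's Python SDK is below. The "ollama/..." model string assumes a local ollama instance; other backends use their own prefixes.

```python
# Sketch: LiteLLM's completion() routes a single call format to many backends.
# The model string's prefix picks the provider; "ollama/llama3.2" assumes a local ollama install.
from litellm import completion

response = completion(
    model="ollama/llama3.2",  # placeholder; the prefix selects the backend provider
    messages=[{"role": "user", "content": "Say hello in Japanese."}],
)
print(response.choices[0].message.content)
```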
 
