● Mixture of experts (MoE), model fusion, and retrieval-augmented generation (RAG) are among the techniques under study in the effort to build AI models compact enough for deployment on smartphones and at the edge.
● However, smaller models with more limited operational capacity may nonetheless give rise to an energy rebound effect.
As researcher Sasha Luccioni pointed out in a June 2025 interview with Hello Future, the carbon footprint of today’s highly powerful large language models (LLMs) is an increasing cause for concern. Alternative approaches focusing on lighter and more specialised small language models could help mitigate this problem. “When we speak of AI, more often than not we are thinking of one monolithic model that can do everything, but this versatility comes at a cost.” When enhanced with techniques such as retrieval-augmented generation (RAG) and tool-calling, which enable them to interact with external resources, these smaller models can effectively manage customer support or power specific modules for personalized learning.
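To make the RAG idea concrete, here is a minimal, illustrative sketch: before answering, the system retrieves relevant passages from an external store and prepends them to the prompt given to a small model. The function names, the toy document store, and the word-overlap scoring are all assumptions for illustration; production systems use vector embeddings rather than keyword overlap.

```python
# Minimal retrieval-augmented generation (RAG) sketch.
# Scoring is simple word overlap; real systems use embedding similarity.

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents sharing the most words with the query."""
    q_words = set(query.lower().split())
    return sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )[:k]

def build_prompt(query: str, documents: list[str]) -> str:
    """Assemble the augmented prompt passed to a (small) language model."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "Refunds are processed within 14 days of the return request.",
    "Our stores are open Monday to Saturday, 9am to 7pm.",
    "Premium subscribers get priority customer support.",
]
print(build_prompt("How long do refunds take?", docs))
```

The small model never needs to memorise the refund policy; it only has to read the retrieved context, which is what makes a compact model viable for tasks like customer support.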
We can provide AI with the ability to remember past exchanges, much as humans do.
The potential and operational challenges of personalization
“However, it is important to point out,” adds Gwénolé Lecorvé, “that personalization and downsizing are quite distinct, even if they occasionally overlap.” Personalization does not necessarily involve creating a dedicated model for every user. Instead, it adds “overlays that adapt the model’s behaviour to specific themes, contexts, and users.” These additional layers guide the generation process while leaving the core model untouched.
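The overlay idea can be sketched very simply: the base model is never retrained; personalization is an extra layer composed with it at query time, here reduced to a per-user preamble. The names and the profile format are hypothetical, chosen only to show that the core model stays untouched.

```python
# Personalization-overlay sketch: the base system prompt (standing in for
# the frozen core model) is unchanged; a per-user overlay is layered on top.

BASE_SYSTEM = "You are a helpful assistant."  # the untouched core

def personalize(user_profile: dict) -> str:
    """Compose the fixed base behaviour with a user-specific overlay."""
    overlay = "; ".join(f"{k}: {v}" for k, v in user_profile.items())
    return f"{BASE_SYSTEM}\nUser preferences -> {overlay}"

print(personalize({"language": "French", "tone": "formal"}))
```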
The necessity for this approach stems from a significant hurdle: memory. “A model’s capacity is fixed. In other words, if it learns something new, it is likely that some of its existing knowledge or capabilities will be eroded.” Unlike with the human brain, we are still trying to work out which specific parts of a transformer model encode knowledge and skills. Turning to external knowledge sources, through techniques such as RAG and deep research, offers a way around this limitation.
By storing interaction histories (either in raw or encoded form), we can create a knowledge source that may be accessed using these techniques. “If we frequently interact with a model,” explains the researcher, “we can provide it with the ability to remember past exchanges, much as humans do.” This ability to remember takes the form of an external system that the model can tap into to better contextualize and personalize its responses.
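The external-memory system described here can be sketched as follows: past exchanges are stored outside the model and recalled by similarity to the new query, then injected into the context. The class name, storage format, and word-overlap recall heuristic are illustrative assumptions; a real system would encode exchanges as embeddings.

```python
# External conversation memory sketch: the model itself stores nothing;
# an outside store keeps (user, model) exchange pairs and recalls the
# most relevant ones for a new query.

class ConversationMemory:
    def __init__(self) -> None:
        self.exchanges: list[tuple[str, str]] = []  # (user turn, model turn)

    def remember(self, user: str, model: str) -> None:
        self.exchanges.append((user, model))

    def recall(self, query: str, k: int = 1) -> list[tuple[str, str]]:
        """Return the k past exchanges sharing the most words with the query."""
        q = set(query.lower().split())
        return sorted(
            self.exchanges,
            key=lambda e: len(q & set((e[0] + " " + e[1]).lower().split())),
            reverse=True,
        )[:k]

memory = ConversationMemory()
memory.remember("My daughter is learning Python.",
                "Great, I can suggest beginner exercises.")
memory.remember("I prefer answers in French.",
                "Noted, je répondrai en français.")
print(memory.recall("Any new Python exercises for my daughter?"))
```

Recalled exchanges would then be prepended to the prompt, exactly as retrieved documents are in RAG, so the model appears to “remember” without its weights ever changing.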
Modularity and hybrid architectures
As Gwénolé Lecorvé explains, there are other approaches that can be used to construct smaller and more specialized AI systems, notably mixture-of-experts (MoE) architectures. “Instead of relying on one large network, MoE combines several specialized or expert sub-models. Only a small number of these experts are activated for any single question, which reduces computing and memory requirements.” Llama 4 and certain versions of Qwen3, GPT-OSS and Mistral already make use of this approach.
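The key mechanism, sparse top-k routing, can be sketched in a few lines: a gating function scores every expert, but only the k best-scoring experts actually run, so compute scales with k rather than with the total number of experts. The toy scalar experts and linear gate below are illustrative stand-ins; in a transformer, the experts are feed-forward sub-networks and the gate is a learned layer.

```python
import math

# Mixture-of-experts sketch with top-k gating. Only k experts execute
# per input; their outputs are mixed by renormalized gate scores.

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(x: float, experts, gate_weights: list[float], k: int = 2) -> float:
    """Route x to the k best-scoring experts and mix their outputs."""
    scores = softmax([w * x for w in gate_weights])  # toy linear gate
    top = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    norm = sum(scores[i] for i in top)  # renormalize over the chosen experts
    return sum(scores[i] / norm * experts[i](x) for i in top)

experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x ** 2, lambda x: -x]
gate = [0.1, 0.9, 0.5, -0.3]
print(moe_forward(3.0, experts, gate, k=2))
```

With k=2, only two of the four experts run; the other two cost nothing, which is the source of the efficiency gains the researcher describes.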
However, the researcher warns that in practice the beguiling prospect of better power efficiency with MoE could ultimately give rise to an energy rebound effect. “There’s nothing to stop the deployment of quite large models, for example of up to 20 billion parameters, as MoE experts.” This is precisely the strategy championed by Yann LeCun for the JEPA (Joint Embedding Predictive Architecture) models, which are designed to understand and anticipate physical action in the real world. While such models may only require 62 hours of direct interaction data to learn how to navigate new situations, this ‘fine-tuning’ window relies on a massive foundation of pre-existing training that has already consumed vast quantities of data.
Along with this myriad of alternatives – many of which have yet to be fully implemented – there is also the possibility of model merging. As Gwénolé Lecorvé explains, this involves combining specialist models into more comprehensive systems. “It remains an experimental approach, because we don’t know how their internal parameters will interact. Merging two models is a bit like superimposing two drawings. You cannot be sure the result will be harmonious.”
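One simple form of merging can be sketched as weight averaging: the parameters of two models with the same architecture are linearly interpolated. The dictionary layout and parameter values below are purely illustrative; and, as the researcher notes, nothing guarantees the merged behaviour is coherent.

```python
# Model-merging sketch: element-wise linear interpolation of two
# parameter sets ("weight averaging"). Both models must share an
# architecture, i.e. the same parameter names and shapes.

def merge(params_a: dict, params_b: dict, alpha: float = 0.5) -> dict:
    """Return alpha * A + (1 - alpha) * B, parameter by parameter."""
    assert params_a.keys() == params_b.keys(), "architectures must match"
    return {
        name: [alpha * a + (1 - alpha) * b
               for a, b in zip(params_a[name], params_b[name])]
        for name in params_a
    }

model_a = {"layer1.weight": [2.0, -4.0, 8.0]}
model_b = {"layer1.weight": [6.0, 0.0, -2.0]}
print(merge(model_a, model_b))  # → {'layer1.weight': [4.0, -2.0, 3.0]}
```

The “superimposed drawings” caveat is visible even here: the averaged weights are well defined numerically, but there is no guarantee the resulting model behaves like either parent.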
As for the future?
The researcher believes the key to lean yet powerful LLMs lies in hybridity. “We could develop multi-tier systems that switch between different engines like hybrid cars: with small models for basic tasks, medium ones for intermediate levels of difficulty, and large models for complex requests.” Routing mechanisms to distribute workloads are already baked into the architecture of models like GPT-5. “Small models could locally process 90% of tasks on smartphones, while the remaining 10% of requests could be offloaded to more powerful models over the network.” It is an intelligent approach that could reduce the burden on systems, and one that now only needs to be implemented. As Gwénolé Lecorvé points out, we already have all the building blocks; we simply have to put them together. Naturally, “it will require expertise and craftsmanship,” but the way ahead is clear.
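The hybrid-car analogy can be sketched as a tiered router: a cheap difficulty estimate decides whether a request stays on the small on-device model or is escalated to a larger one. The difficulty heuristic, thresholds, and tier names below are all illustrative assumptions, not a description of any deployed system.

```python
# Multi-tier routing sketch: small model on-device for most requests,
# larger models only when a cheap heuristic flags the request as hard.

def estimate_difficulty(request: str) -> int:
    """Toy proxy: longer requests and explicit questions count as harder."""
    return len(request.split()) + 5 * request.count("?")

def route(request: str) -> str:
    """Pick a tier from the estimated difficulty (thresholds are arbitrary)."""
    d = estimate_difficulty(request)
    if d < 12:
        return "small-on-device"
    if d < 25:
        return "medium"
    return "large-remote"

print(route("What time is it?"))
print(route("Compare these two phone plans and tell me which is cheaper?"))
```

In a real deployment the router itself would typically be a small learned classifier, but the principle is the same: keep the bulk of traffic on the cheapest tier.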
Gwénolé Lecorvé