On language models and on writing about language models
Recently I was invited to write a guest post on Computer Weekly. They hoped for a topic related to Small Language Models, and I jumped at the chance as I had recently given an internal webinar on the topic of when to use Small or Large Language Models.
As I often do when I get excited, I decided to start without asking for instructions and wrote a bit over 3.5k words on the topic before asking them about the length. It took me around a day to write the long version, and almost double that to slim it down to 1k words. I think the end result is much better. This probably isn't surprising, especially if you've ever even browsed The Elements of Style. Brevity is often the best choice, and I love to drone on when I am writing.
So I think you will probably benefit more from reading the Computer Weekly article. But I wanted to share the 1.0 version with its 3.5k words as well, if only for posterity. If I ever move the blog to my own servers with more flexibility, I might include the interim 2k and 1.5k versions as well, but this should do for now.
Finally, the same core idea was also used for an internal blog post that I worked on together with our marketing team. It might be an interesting comparison point as well, if anyone is looking for examples of text evolution.
Anyway, here we go:
Language models, Small and Large
The continuous improvements in Large Language Model (LLM) capabilities have been dominating the weekly news cycle for the past few years at an ever-increasing pace. Both the advancements and the speed at which they occur have been truly remarkable ever since the publication of the transformer architecture in 2017 and the GPT-3 model in 2020.
One should not forget, however, that there is more to the world of language models than just the Large ones. Though Small Language Models (SLMs) get less media coverage, their capabilities have also been marching steadily forward, and they are nowadays part of the standard NLP toolbox.
Most choices in life are tradeoffs, and that includes the choice of a language model in a given project. There are many dimensions in which LLMs and SLMs differ in structure and usage, and it is important to be aware of these as they can guide us in choosing the best tool for the job. Our purpose here is to discuss both the differences and similarities of these language models, with a particular focus on where, when and why either one should be used.
Where do the differences lie?
We should first of all note that there is no official definition of what exactly counts as a large or a small language model, and any such division would be somewhat arbitrary. In practice, however, there rarely seems to be any confusion, and a natural division can be approached e.g. via the number of parameters the models have. When discussing Large Language Models, people usually refer to the family of GPT models, LLaMa, Gemini, Mistral, DeepSeek and the like. These tend to have parameter counts in the billions or hundreds of billions, with people often referring to e.g. the various LLaMa2 versions as the 7B, 13B or 70B models based on their parameter count. Small Language Models like BERT, RoBERTa, or BigBird, on the other hand, tend to live in the realm of at most some hundreds of millions of parameters. (We'll focus here on SLMs based on the so-called transformer architecture. In particular, we omit classical RNN or LSTM based models for the sake of brevity.)
Another immediate difference is that modern LLMs tend to be aimed at language generation, whereas most SLMs (at least those still in active use) tend to focus on language understanding tasks like classification or Named Entity Recognition. This trend is probably due to the versatility of high-performing language-generating AIs; any result expressible in e.g. JSON format can in theory also be produced via language generation. Training an LLM is very expensive, so the people involved tend to prefer creating systems that are as widely usable as possible. (Not to mention the current 'fashion' of everyone building GenAI in particular.)
We'll next look at some more specific dimensions along which these model types differ in practice.
Jack of all trades, master of none
We feel that many people do not truly appreciate the breadth of skill that modern LLMs possess. We often fixate on the mistakes they make when we use them in our particular narrow setting. We might e.g. notice the suboptimal algorithm choice, or a weird turn of phrase in a translation, forgetting that even if GPT-4o is less than perfect in our preferred (programming) language, it does reasonable work in all of them. It can write poetry in French, algorithms in Python, and Japanese cooking instructions in the format of a fantasy story for children. And all of this is done by one model. So even though there are very few, if any, specific fields where LLMs perform at a superhuman level, their breadth of skill is very much beyond any human realm of possibility.
This capability of attacking almost any issue in any field is naturally one of the main reasons why LLMs can be so quickly adapted to almost any new task that relies on using language. It does come at a cost, though. Indeed, if you are in the mood for some good food, would you prefer the restaurant whose menu offers a selection of five different dishes, or the restaurant whose menu is 50 pages long, containing everything from breakfast to burritos and from pretzels to pho? You might get the feeling that the kitchen that has specialized in just those five dishes will produce better results than the kitchen that will spin out anything and everything under the sun. Both restaurants still have their uses, though - if you have a group of people with very different food preferences, the latter might be the preferable choice.
There is a similar dichotomy in language models as well. The scope of things you can do with a given SLM is much narrower compared to the almost universal capabilities of a modern LLM, but this narrowed scope is compensated, among other things, by potentially much better performance. While an LLM can do an okay job on a given task on the fly, an SLM can be trained to be hyperfocused on that one particular task, running fast and at a fraction of the cost. This kind of cheap and fast inference at a high performance level can often be worth the reduced flexibility and increased development effort.
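To make this concrete, here is a minimal sketch of what using such a hyperfocused SLM looks like with the HuggingFace transformers library; the checkpoint name is a placeholder for whatever task-specific model you have fine-tuned or picked from the model hub:

```python
from transformers import pipeline

# Load a (hypothetical) sentiment classifier fine-tuned on top of a BERT-sized model.
classifier = pipeline(
    "text-classification",
    model="your-org/bert-finetuned-sentiment",  # placeholder checkpoint name
)

# A single forward pass: no token-by-token generation, just a label and a score.
print(classifier("The new release fixed every issue we reported."))
# e.g. [{'label': 'positive', 'score': 0.98}]
```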
Black boxes and Vantablack boxes
It is a truth universally acknowledged that deep neural networks can be very much opaque to human understanding. But as in many things, models can have different levels of transparency or opaqueness. Though it would be a very bold claim for anyone to say that they fully understand how e.g. the SLM named BERT functions with its 100M parameters, it would be wrong to say that we understand nothing. The BERT model in particular has been a sort of foundational base model for many other SLM variants, and there has been extensive academic research on it - see e.g. (Rogers, Kovaleva and Rumshisky 2020). The fact that we know so much about BERT owes to both the openness of the model and its size. Being able to run the model on "any" laptop, or to do limited training on household-level hardware, makes it a very accessible research target. In all honesty, we do not know how well e.g. OpenAI has been able to grok the inner workings of GPT-4o, but we doubt they would fare better than the full scientific community that has been studying the openly available models.
The compute cost for training e.g. a BERT model from scratch is in the ballpark of 10k euros at today's prices. This is not an insignificant amount of resources, but compared to the estimated (tens or hundreds of) millions that LLMs can require, it is much more palatable - especially for research groups with discounted access to high-performance computing. To give just a single example, some variants of the so-called ablation studies of language models are based on repeatedly training a language model while removing various parts of the model architecture. Based on the performance drops in various tasks, this yields crucial insights into the roles that the various system components play. Such studies are very much limited by the resource costs of training a model, and thus much more applicable to BERT than to GPT-4o.
Furthermore, less intensive task-specific training or domain fine-tuning of BERT can be run with less than a day of GPU time. This means that the behaviour of an SLM with respect to various training schemes can be probed and iterated at a much faster pace compared to larger models.
Finally, besides understanding the models, trust can often be crucial. Especially in the healthcare sector, many systems even adjacent to patient data need to be vetted and approved by medical professionals. A smaller model trained specifically on data related to the task at hand can be easier to trust, as we know (and control) exactly what data goes into training, and we can modify unwanted behaviour on the fly. Furthermore, the narrower scope of an SLM also yields better explainability - to understand even GPT-3 we need to understand how it works with all of human language, while for an SLM it might suffice to understand how it processes referral data related to hip replacements.
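As a rough sketch of what such a task-specific training run looks like with the HuggingFace Trainer API - the dataset and hyperparameters below are illustrative placeholders, not recommendations:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Any labelled text classification dataset works here; imdb is just a stand-in.
dataset = load_dataset("imdb")
tokenized = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True),
                        batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-task", num_train_epochs=3),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,  # also provides padding via the default data collator
)
trainer.train()  # typically hours, not weeks, on a single modern GPU
```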
Fitting to local machines or lifting to the clouds?
One of the major questions in any AI project is where to run your models. Contrasting small and large language models brings some particular considerations to the fore.
First of all, we don't always really have a choice. Not only are many of the top-tier LLMs closed source, but many of the larger LLM variants require massive computational capability. Setting up a local supercomputer is doable, but it requires considerable resources for maintenance and for setting up the system architecture to host these models. Here we also run into the issue of economies of scale: running an LLM locally might be particularly expensive if you only use the system 10% of the time. Large actors like OpenAI, on the other hand, will in any case need teams of high-performance computing specialists, and they can more easily put idle server time to other compute purposes.
Smaller LLM variants, and SLMs in particular, require vastly smaller resources, and the choice between a local environment and the cloud depends on the more classical factors - though we note that all of the major cloud providers nowadays offer solutions specifically aimed at various types of AI development and hosting. All that being said, compared to LLMs, most if not all SLMs can be run on very limited local hardware.
Resource consumption and sustainability
LLMs naturally use a lot more resources than SLMs, particularly in the training phase. This is one of the big reasons why the HuggingFace ecosystem is so beneficial: most transformer-based language models are trained in stages, with the base training being the most computationally costly. In the base training the model is essentially taught to model language in general, and in later training phases this is then focused more on domain-specific or task-specific types of text. HuggingFace provides a good way to share models at various phases of training. In particular, a massive reason for the success of the BERT model has been Google's sharing of the base-trained model. This means that instead of people all over the world spending weeks of time and thousands of euros to train their own copies of the model, they can all start from a pretrained starting point. This both speeds up all R&D activities and saves a considerable amount of natural resources. Some LLMs have also been released openly, but the trend of open sharing is sadly not as prevalent with them.
Inference is several orders of magnitude less computationally intensive than training, but it should also be noted that LLMs generate text one token at a time. This means that if an LLM needs to generate e.g. a piece of JSON that describes a classification result, it will run the model over and over again, producing the output JSON one token at a time. Most SLMs, on the other hand, will process the input in one single pass. (There are again nuances here; for a classification task one might use only the interim embedding vectors generated by an LLM and apply a smaller AI model to classify the text based on those embeddings. This would run the model just once while arguably still using an LLM.)
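As a minimal sketch of that "embed once, classify with a small head" pattern - the model, texts and labels below are placeholders for illustration:

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

checkpoint = "bert-base-uncased"  # the embeddings could equally come from a larger model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
encoder = AutoModel.from_pretrained(checkpoint)

def embed(texts):
    """One forward pass per batch: mean-pool token embeddings into one vector per text."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state
    return hidden.mean(dim=1).numpy()

# Tiny made-up dataset; in practice you would use real labelled data.
texts = ["Great service, very happy", "Terrible experience, want a refund"]
labels = [1, 0]

clf = LogisticRegression().fit(embed(texts), labels)
print(clf.predict(embed(["Quite happy with the end result"])))
```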
Hallucinations
The term "hallucination" is now a part of the standard vocabulary for anyone working with LLMs. Though we agree with (Hicks, Humphries and Slater 2024) that the choice of a word is not great, we will stick to using the term "hallucination" here as well.
As a rule of thumb, all AI-systems make mistakes. Their function tends to be statistical in nature, usually producing their outputs in the form of distributions rather than specific choices. But the certain types of errors that LLMs do, where they seem to "make up" facts, feel very different to most users when compared to more classical AI mistakes.
From a technical point of view, it is again hard to find a specific definition that would distinguish between a classical AI-error and an LLM hallucination. But we cannot deny that there is a very specific feel to it when an LLM model argues with us about the amount of "r"s in the word "Strawberry". Part of the difference is probably due the anthropomorphization of LLMs that seems to come naturally for us - we see something that talks with us like a human would and our brains can't help thinking about it in human terms.
Another reason for the flavor difference is probably related to the fact that since LLMs can generate anything, they can also generate any kinds of errors. When an SLM-based classifier makes a mistake, it means that it predicted class B instead of class A. We can then go and see what was the exact prediction distributions, and possibly probe on which parts of the text were most influencial in the decision. On the other hand when an LLM-based classifier hallucinates, it might not output a class name at all but a poem or, in a worse case, an SQL injection targetting your database. It's also natural to note here that even though the so called adversial attacks are a threat for essentially all AI-models, the attack surface for LLMs tends to be much bigger. They have been usually trained to understand any and all language, and can perform tasks ranging from email classification to creating code injections for cybersecurity attacks. Any fine-tuning we do to the model tends to only alter the superficial behaviour of the model, and the underlying capabilities (and thus dangers) remain.
We furthermore note that for SLMs the challenges here can be more easily mitigated. One of the double-edged swords of LLMs is that they tend to understand text from all avenues of life, which means that when they are used in a very specific context, a lot of effort might go into getting them to understand that they should now treat ideas only with respect to one narrow context. For SLMs this kind of restriction can be much easier, if plentiful data is available, as fine-tuning an SLM can affect the model behaviour in a much deeper level. In an extreme case, one might even train a domain-specific SLM from scratch. In the context of e.g. medical texts this can be particularly effective as the texts involved rarely contain fictional stories or lies that the model would learn how to mimic.
When working with any AI system, a standard part of the process is to prepare for mistakes. For LLM-based systems we need to account for a much wider class of errors, and usually put specific guard rails in place to sanitize the output.
When to use SLMs or LLMs?
Next we'll finally move on to some more direct ideas about which kinds of situations might call for SLMs or LLMs.
When to go big?
Most, if not all, things you can do with SLMs can be done with a modern LLM. And while an LLM will be more expensive to use at runtime, writing a prompt that says "Classify the following based on its sentiment: positive/neutral/negative." is much faster than training a custom SLM for the task - especially in situations where we have very little data and can easily boost the LLM's performance with e.g. so-called few-shot techniques. This is one of the big reasons we think that LLM-based systems are excellent especially in the discovery phase of a potential AI project. With an LLM-based approach we can cook up a quick estimate of how hard the problem seems to be. It might also turn out that the issue is simply too complex for SLMs to comprehend and that the full power of an LLM really is required for the task.
Another common use case for an LLM is when we have very little data available. The classical solution for low-resource situations is to try and find some related dataset, do a base training with that, and only then fine-tune the system for the particular task with whatever data we have. Most LLMs have already been base-trained on roughly all of the internet. Getting them to do a particular task can often be achieved by simply showing them a few examples through a method called few-shot prompting (a.k.a. in-context learning), which most likely works by simply helping the model understand which part of its massive skillbase it should apply in this case. Another option is to use few-shot prompting to turn the LLM into a training data generator, whose outputs can then be used for SLM training.
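As a small sketch of what few-shot prompting looks like in practice - the examples and labels here are made up purely for illustration:

```python
# A handful of labelled examples are placed directly in the prompt
# instead of being used to train or fine-tune a model.
examples = [
    ("The package arrived two weeks late.", "negative"),
    ("Exactly what I ordered, thanks!", "positive"),
    ("The manual was in German only.", "neutral"),
]

prompt = "Classify the sentiment of the text as positive, neutral or negative.\n\n"
for text, label in examples:
    prompt += f"Text: {text}\nSentiment: {label}\n\n"
prompt += "Text: The support chat solved my issue in minutes.\nSentiment:"

# `prompt` is then sent to whichever LLM API or local model is in use.
# The same pattern can be used to make the LLM generate labelled examples
# that later serve as training data for an SLM.
print(prompt)
```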
As mentioned before, the split between large and small language models tends to coincide with the split between generative and non-generative systems. And one major difference here is the scalability of inference-time compute for generative models. By this we refer to the notion that with generative models one can considerably boost the performance of the system by making it "think out loud" about the task before giving a final solution. This seems to provide a new scaling law for LLM usage, where we can get better results out of the systems at the cost of more resources spent on text generation. For SLMs focused on language understanding tasks this is not structurally possible; there is no way to throw more compute at inference time to improve the results.
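To illustrate the idea, here is a sketch of the same task phrased as a direct prompt and as a "think out loud" prompt; the wording is our own illustration, not a recipe from any particular vendor:

```python
question = ("A referral mentions surgery on 12 March and a follow-up "
            "6 weeks later. What is the follow-up date?")

# Cheapest option: ask for the answer directly.
direct_prompt = f"{question}\nAnswer with the date only."

# More inference-time compute: ask the model to reason before answering.
reasoning_prompt = (
    f"{question}\n"
    "Work through the calculation step by step, then give the final date "
    "on the last line, prefixed with 'Answer:'."
)

# The second prompt makes the model generate more tokens, i.e. spend more
# compute at inference time; an SLM classifier has no equivalent knob to turn.
```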
When to go small?
"Everything should be made as simple as possible, but no simpler!"
We strongly advocate that, in general, we should use the simplest tool that can reasonably solve the problem at hand. If we are trying to detect SSNs in a piece of text, a regular expression script will probably do. If we need to detect whether there is any PII in the text, maybe an SLM should be trained for the task. If the document needs to be assessed for "suspicious artefacts or signs of tampering", then maybe an LLM would be the tool of choice.
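The simplest end of that spectrum can be just a few lines; here is a minimal sketch using a US-style SSN pattern, which would of course need to be adapted to the identifier format used in your country:

```python
import re

# Matches the common US format NNN-NN-NNNN; purely illustrative.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def contains_ssn(text: str) -> bool:
    """Return True if the text contains something shaped like an SSN."""
    return bool(SSN_PATTERN.search(text))

print(contains_ssn("Patient SSN: 123-45-6789"))   # True
print(contains_ssn("No identifiers in this note."))  # False
```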
As we mentioned before, when we have enough data a hyperfocused SLM can outperform an LLM by a clear margin. So when the task is simple enough for an SLM and performance is a major concern, an SLM might be the optimal tool - though, again, this might require more work up front from a data scientist.
As discussed earlier, SLMs can also provide a higher degree of transparency into their inner workings, and their outputs, and thus error modes, are more restricted. In particular, when explainability and trust are crucial, auditing an SLM can be much simpler than trying to extract reasons for an LLM's behaviour. They can also be easier to install into secure or even air-gapped environments that might have low compute resources.
It's not either or!
Having to choose one or the other might be a false dichotomy. In many modern processes there are tasks of varying levels of difficulty: some routine and others more complex, with solutions requiring anything from classical algorithms to SLMs, LLMs, or even human intervention. This division of labor can be implemented in various ways: one might have some AI systems dedicated to deciding which class of issues a task belongs to, or one might have a flow where we start with the lighter tools and, if they cannot complete the task, it gets escalated to the higher-level systems. This kind of division of labor is, in fact, something that also happens inside most latest-generation LLMs! They tend to use so-called Mixture of Experts (MoE) architectures, where processing is routed at inference time to specialized sub-networks within various components.
This kind of division of labor can bring about the best of both worlds: we use simple, reliable tools whenever we can, but leverage the more powerful LLMs when needed. This can also bring great benefits in resource usage and thus in pricing. The tradeoff is that we are adding complexity to the whole system. An LLM is not simple, but neither is a system using several different SLMs and LLMs, so one should try to strike a balance between the system complexity and the component complexities involved.
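As a rough sketch of the escalation pattern described above - the helper functions are placeholders for whatever rule-based checks, SLM classifiers and LLM calls a real system would use:

```python
def rule_based_classifier(text):
    """Placeholder: e.g. a regex or keyword rule; None means 'not confident'."""
    return "spam" if "win a prize" in text.lower() else None

def slm_classifier(text):
    """Placeholder for a fine-tuned SLM returning (label, confidence)."""
    return ("not_spam", 0.55)

def llm_classifier(text):
    """Placeholder for an LLM call, used only as the last resort."""
    return "not_spam"

def handle_request(text: str) -> str:
    # 1. Cheapest option first: a simple rule.
    label = rule_based_classifier(text)
    if label is not None:
        return label
    # 2. A fast, cheap SLM, trusted only when it is confident.
    label, confidence = slm_classifier(text)
    if confidence > 0.9:
        return label
    # 3. Escalate the rest to the LLM (or, ultimately, to a human).
    return llm_classifier(text)

print(handle_request("Win a prize today!"))                   # handled by the rule
print(handle_request("Can you update my delivery address?"))  # escalated further
```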
In summary
Like most things in life, there is no clear-cut answer to when the various types of language models should be used. There are various dimensions like performance, inference cost, development cost, and opacity that should be considered. The best approach is to be as informed as you can about all the different aspects and to find the best compromise for the task you are trying to solve.
Finally, using the newest and flashiest system available is not a value in itself, but with paradigm shifts rolling out on an almost regular basis, it is good to keep your ear to the ground. You never know when the next announcement will help you make complex things simpler.