The open-source GenAI paradox and the real costs of "Free" LLMs

Explores the paradox of open-source Large Language Models (LLMs), highlighting hidden costs, infrastructure demands, and practical strategies for cost-effective AI deployment.

By Sadhli Roomy, Sumaya Siddiqui, Shahabuddin Akhon

The siren call of open-source Large Language Models (LLMs) is hard to ignore. Models like Meta's Llama, Mistral's offerings, and Google's T5 series are often touted as freely available, offering a seemingly accessible and cost-effective alternative to the often-pricey proprietary models developed by industry giants such as OpenAI and Anthropic. These models promise a future where cutting-edge AI is available to everyone. However, the practical reality of running these "free" models can paint a vastly different picture, often revealing substantial hidden costs. The central question remains: Are open-source LLMs genuinely the most economical option when we factor in all associated expenses? This blog post aims to delve deep into these concealed costs, explore the inherent paradox of accessibility versus practicality, and, crucially, identify strategies and pathways to achieve genuine cost optimisation in the dynamic world of open-source AI.

But first, let's establish what open-source LLMs actually are and what kinds of costs their deployments typically incur.

"Open-source drives innovation because it enables many more developers to build with new technology," Mark Zuckerberg once stated, highlighting the collaborative spirit often associated with open-source initiatives. Indeed, open-source LLMs are language models with publicly available source code, inviting users to freely access, utilise, modify, and distribute them. Prominent examples include Meta's Llama family, Google's T5 and Flan T5 models, and the innovative offerings from Mistral AI. These models are frequently promoted as a means to democratise AI, providing developers with unprecedented control, customisation, transparency, and a supportive community. They are, therefore, often positioned as low-cost or even free alternatives to proprietary systems. Benchmark results show Llama 3.1 405B closing the accuracy gap with its proprietary counterparts.

However, before diving in, it is essential to carefully examine the licenses governing their use. These licenses range from highly permissive, such as Apache 2.0, which allows commercial use, to more restrictive ones, such as Creative Commons NonCommercial variants, which prohibit commercial applications. Understanding these nuances is crucial to avoid legal pitfalls and ensure the model aligns with your project's requirements.

The uncomfortable truth is that running large open-source models demands significant computational resources, especially for behemoths like Llama 3.1, which boasts a staggering 405 billion parameters. The primary cost driver is GPU memory (VRAM), and the necessity of deploying multiple high-end GPUs to handle the colossal datasets and intricate calculations.

Consider this: running Llama 3.1 with 16-bit precision for a mere 100 concurrent users can necessitate approximately 2,430 GB of VRAM. This translates to needing 31 Nvidia H100 GPUs, each costing around $30,000, leading to a total hardware expenditure of approximately $930,000. Even when opting for lower precision levels, such as 8-bit or 4-bit, the expenses remain considerable. The need for extensive caching and the demand for efficient processing speeds, measured in tokens per second, further exacerbate the financial burden. This reality sharply contrasts with the relative simplicity and lower initial investment required when using cloud-based closed-source APIs, where the infrastructure is managed by the provider and you typically pay only for the API usage itself.
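The arithmetic above can be sketched in a few lines. This is a back-of-the-envelope estimate only: the bytes-per-parameter figures are standard, but the KV-cache budget, GPU VRAM capacity, and GPU price are illustrative assumptions, not vendor quotes.

```python
# Rough VRAM and hardware-cost estimate for self-hosting a large model.
# KV-cache size, GPU capacity, and GPU price below are assumptions.

def estimate_serving_cost(
    params_b: float,        # model size in billions of parameters
    bytes_per_param: float, # 2 for 16-bit, 1 for 8-bit, 0.5 for 4-bit
    kv_cache_gb: float,     # extra VRAM budget for KV cache (concurrency)
    gpu_vram_gb: int = 80,  # e.g. one Nvidia H100
    gpu_price_usd: int = 30_000,
):
    weights_gb = params_b * bytes_per_param  # 1B params at 1 byte ~= 1 GB
    total_gb = weights_gb + kv_cache_gb
    gpus = int(-(-total_gb // gpu_vram_gb))  # ceiling division
    return total_gb, gpus, gpus * gpu_price_usd

# Llama 3.1 405B at 16-bit, with a large KV-cache budget for ~100 users:
vram, gpus, cost = estimate_serving_cost(405, 2, kv_cache_gb=1620)
print(f"{vram:.0f} GB VRAM -> {gpus} GPUs -> ${cost:,}")
# -> 2430 GB VRAM -> 31 GPUs -> $930,000
```

The weight footprint alone (810 GB at 16-bit) already exceeds ten H100s; the KV cache needed for concurrent users is what pushes the bill toward seven figures.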

"Closed, off-the-shelf LLMs are high quality. They’re often far more accessible to the average developer," notes Eddie Aftandilian, a principal researcher at GitHub, underscoring the appeal of proprietary models. Indeed, closed-source LLMs, such as OpenAI's GPT series, Anthropic's Claude, and Google's Gemini, offer a convenient alternative through easily accessible APIs. These models incur costs primarily based on API usage, often with different rates for input and output tokens. While embracing these models means sacrificing some degree of flexibility and transparency, the operational costs are sometimes much lower, and they are generally more convenient and less complicated for developers. For example, one study found that while using GPT-4 for text summarisation was 18x more expensive than the largest Llama 2 model, the 70B parameter Llama 2 was only 10% more expensive than GPT-3.5 while often delivering superior performance. So, more cost-effective, right? We thought so too, until we came across the analysis below and tested our assumptions internally at Acme AI:

Open-source LLMs are often perceived as more cost-effective, but this isn't always true: some startups spend 50% to 100% more running Llama 2 than they would on GPT-3.5 Turbo. It's worth stressing that LLM costs are complicated, since a model can be expensive to run and fine-tune even when the model itself is free.

In a notable case, the founders of the chatbot startup Cypher tested Llama 2 in August, racking up $1,200 in costs. In contrast, running the same tests with GPT-3.5 Turbo cost just $5, underscoring a significant difference in operational expenses between the two models.

Furthermore, the superior accuracy of models like GPT-4 may result in long-term cost savings, since less computational effort may be needed to achieve the desired outcome.
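A simple break-even calculation helps make sense of comparisons like the one above. The token volumes and per-million-token rates below are hypothetical placeholders; substitute your provider's actual prices and your own hosting costs.

```python
# Break-even between pay-per-token API pricing and buying hardware.
# All rates and volumes below are illustrative assumptions.

def monthly_api_cost(tokens_in_m, tokens_out_m, rate_in, rate_out):
    """API cost given monthly token volume (in millions of tokens)."""
    return tokens_in_m * rate_in + tokens_out_m * rate_out

def breakeven_months(hardware_usd, monthly_api_usd, monthly_hosting_usd):
    """Months before owned hardware beats the API (staff costs ignored)."""
    saving = monthly_api_usd - monthly_hosting_usd
    return float("inf") if saving <= 0 else hardware_usd / saving

# 200M input + 50M output tokens/month at hypothetical $0.50/$1.50 rates:
api = monthly_api_cost(200, 50, rate_in=0.50, rate_out=1.50)   # 175.0
months = breakeven_months(930_000, api, monthly_hosting_usd=0)
print(f"API: ${api:.0f}/month, break-even after {months:,.0f} months")
```

At modest traffic the API is dramatically cheaper; self-hosting only starts to pay off at very high, sustained token volumes, and the calculation above doesn't even count power, cooling, or engineering time.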

The high costs associated with running large open-source models give rise to a significant paradox: the promise of accessibility is often contradicted by the practical realities of infrastructure requirements. While the models themselves are free to download, the sheer financial burden of deploying them effectively can be a massive barrier, especially for smaller organisations, individual developers, and startups lacking substantial capital.

The Artificial Analysis Leaderboard presents a detailed comparison of models from OpenAI, Cohere, Gemini, and Anthropic, evaluating critical aspects like context window, pricing per million tokens, latency, and model availability to support informed provider selection. It also includes a summary of cost structures for Llama 2 and Mistral open-source models offered by Replicate, Together AI, and RunPod, as outlined in the table below.

(Table: pricing comparison of proprietary LLMs vs. open-source LLMs)

This prompts us to consider the strategic reasons behind companies like Meta releasing such resource-intensive models. Is it a deliberate strategy to set industry standards and create a dependence on their ecosystem, thus bolstering the demand for their other services? Is it a move to indirectly drive the need for cloud-based services, where these models can be more easily managed and utilised? Or, is it a genuine attempt to democratise AI, while knowing full well that only well-funded entities can truly leverage its full potential? These questions bring to the forefront the concept of “open washing”, where the term "open source" is sometimes used more for marketing purposes and to create the appearance of openness, when in reality there are significant restrictions on the model's usage.

Fortunately, the narrative isn't entirely bleak, as there are viable strategies for achieving cost-effective open-source AI. Prioritising efficiency over sheer size is a pivotal step. The release of more efficient models, such as Llama 3.3, is a prime example, demonstrating that comparable performance can be attained with significantly fewer parameters, thus reducing infrastructure costs. For instance, Llama 3.3 is designed to offer performance similar to Meta's older 405B parameter model, but at a fraction of the cost and with a vastly reduced GPU load.

"Llama 3.3 delivers leading performance and quality across text-based use cases at a fraction of the inference cost," Meta's AI team stated, highlighting the importance of efficiency.

Furthermore, deploying fine-tuned or smaller models for specific tasks can be a highly effective method of minimising costs without compromising performance. Managed open-source services, often provided through cloud platforms, present a compelling alternative to self-hosting, allowing for the dynamic allocation of resources and making costs more predictable and accessible. Choosing the correct model size and precision level for your use case can also dramatically reduce costs, ensuring you're not overspending on resources you don't truly need.
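A quick way to sanity-check the "right size and precision" advice is to compare weight-only memory footprints across quantisation levels. These figures cover weights only; KV cache and activations add more on top, and the 70B model size here is just an example.

```python
# Weight-only memory footprint at different precision levels - a quick
# sanity check for whether a quantised model fits your available VRAM.
# Activations and KV cache require additional memory beyond this.

PRECISIONS = {"16-bit": 2.0, "8-bit": 1.0, "4-bit": 0.5}  # bytes/param

def weight_footprints(params_b: float) -> dict:
    """GB of VRAM needed for weights alone, per precision level."""
    return {name: params_b * b for name, b in PRECISIONS.items()}

for name, gb in weight_footprints(70).items():  # e.g. a 70B model
    print(f"70B @ {name}: {gb:.0f} GB")
```

Halving the precision halves the weight footprint: a 70B model drops from 140 GB at 16-bit (two 80 GB GPUs) to 35 GB at 4-bit (a single GPU), which is often the difference between a practical deployment and an unaffordable one.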

In conclusion, while the allure of "free" open-source LLMs is undeniable, it's crucial to recognise the illusion of cost savings once the complete picture is considered. While these models offer distinct advantages, such as control and customisation, the substantial infrastructure costs can often make them more expensive than closed-source alternatives. Any organisation or developer should therefore weigh all associated costs, including infrastructure, training, and operational overhead, when choosing the optimal model for their specific needs. The field of AI is rapidly evolving, so staying informed about new developments, especially more efficient models and cost-reduction strategies, is paramount for anyone looking to harness the transformative power of LLMs.
