Token 1.17: Deploying ML Models: Best practices feat. LLMs
Unless you are a researcher whose sole job is to beat benchmarks on some predefined dataset, you will want to deploy your model.
The spotlight in AI has recently shifted towards foundation models (FMs) and their subset, large language models (LLMs). Despite this trend, the cornerstone of machine learning success remains deployment: turning these sophisticated algorithms into practical, operational tools. In systematizing the knowledge about the newly developed FMOps infrastructure, we want to highlight that whether it's a traditional ML model or an advanced LLM, the deployment process shares many similarities. There are some additional considerations, of course, but a lot of what you've heard might just be hype ;)
The earlier you deploy the model, the earlier you become aware of production issues. As a result, you can tackle them before you are set to deploy the final version of your model. It is estimated that about 80% of ML models never make it to production! Today, we'll cut through the complexities, highlighting the key parallels and unique challenges in deploying both traditional and advanced models. This streamlined approach demystifies deployment, equipping you with best practices you can follow even before training an ML model, making it easier and faster to deploy.
In today’s Token:
How to choose the right model?
Where can I store my embeddings?
Are feature stores of any use?
How can we get the best performance from a chosen model for our use case?
Now that I know how to choose a given model, how do I choose the infrastructure to deploy it?
I have deployed my model. Can I sit back and relax?
Conclusion
Bonus resources
Let’s get started with how you can choose the right model.
How can I choose the right model?
Well, the answer is to use the simplest model that can get the job done. For instance, if you are working on a binary classification problem, start with logistic regression (see the sketch after this list). There are two benefits to using the simplest model:
Simpler models are explainable
Simpler models can be trained quickly, and the team can focus on other phases of the ML lifecycle. This reduces the time to deployment.
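As a quick illustration, here is a minimal sketch of such a baseline, assuming scikit-learn and a tabular dataset already loaded into `X` (features) and `y` (binary labels); the split and metric are just one reasonable choice, not a prescription:

```python
# A minimal logistic regression baseline with scikit-learn.
# Assumes X (features) and y (binary labels) are already loaded, e.g. from a CSV.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling + logistic regression: fast to train, and the coefficients are easy to inspect.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)

print("Baseline ROC-AUC:", roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1]))
```

A baseline like this is often good enough to ship a first version and, just as importantly, gives you a reference point to beat before you invest in anything fancier.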
For LLMs, this translates to using a good-enough existing model together with prompt engineering instead of fine-tuning or training a custom LLM. For tasks like information extraction, where the required answer is already present in the prompt itself, prompt engineering is very effective and far less time-consuming than fine-tuning or training a custom model.
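As a sketch of what that looks like, assuming an OpenAI-compatible chat API is available (the model name, document, and prompt here are purely illustrative):

```python
# Prompt-engineered information extraction: the answer is already in the input text,
# so a well-structured prompt with an off-the-shelf model is often enough.
from openai import OpenAI  # assumes the openai package is installed and an API key is configured

client = OpenAI()

document = "Invoice #4821 was issued on 2024-03-02 to Acme Corp for a total of $1,250.00."

prompt = (
    "Extract the invoice number, date, customer, and total from the text below. "
    "Respond as JSON with keys: invoice_number, date, customer, total.\n\n"
    f"Text: {document}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name; any capable chat model works
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # deterministic output is preferable for extraction tasks
)
print(response.choices[0].message.content)
```

No training loop, no labeled dataset, no custom serving stack: iterating on the prompt is usually the fastest path to something you can deploy and evaluate.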
Where can I store my embeddings?
Unlike neural network models such as CNNs and RNNs, which compute their embeddings internally, language models make it possible to compute embeddings separately and provide them as standalone artifacts. Models like Stable Diffusion, for instance, expect an embedding, i.e. a vector representation of the actual prompt, to compute the output. But can we store these embeddings like a normal text or image file?
Embeddings can be stored in a plain text or CSV file, but that is not the optimal place for them. A vector database, a specialized database built specifically for embeddings, supports operations such as querying, scaling, similarity search, and filtering directly on top of the embeddings. Hence, it is better to store embeddings in a vector database.
Another advantage of using a vector database is efficient search. Milvus, an open-source vector database, can help you save time and money via hardware-efficient indexing algorithms that can boost retrieval speed by up to 10 times.
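Here is a rough sketch of that workflow, assuming the `sentence-transformers` and `pymilvus` packages (the embedding model, collection name, and local database file are illustrative choices, not requirements):

```python
# Compute embeddings separately, then store and search them in a vector database (Milvus).
from sentence_transformers import SentenceTransformer
from pymilvus import MilvusClient

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model (384-dim vectors)
docs = ["How do I deploy a model?", "What is a feature store?", "Prompt engineering basics"]
vectors = encoder.encode(docs)

# Milvus Lite keeps the collection in a local file; a production setup would point at a Milvus server.
client = MilvusClient("embeddings_demo.db")
client.create_collection(collection_name="docs", dimension=vectors.shape[1])
client.insert(
    collection_name="docs",
    data=[{"id": i, "vector": vectors[i].tolist(), "text": docs[i]} for i in range(len(docs))],
)

# Similarity search: retrieve the stored documents closest to the query embedding.
query = encoder.encode(["model deployment"]).tolist()
hits = client.search(collection_name="docs", data=query, limit=2, output_fields=["text"])
print(hits)
```

The key point is that the database indexes the vectors for you, so nearest-neighbor queries stay fast as the collection grows, which is exactly what a CSV of embeddings cannot give you.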