Introduction
Peter F. Drucker, in his famous book “The Effective Executive”, coined the phrase “What gets measured, gets improved”. This saying is as popular in the Machine Learning (ML) world as it is in the corporate world. For instance, to know that you need to work on a model's accuracy, you first need to know that its accuracy is insufficient. And to know that, you need to measure it.
To improve our models, we need to gauge them across numerous facets. The monitoring metrics depend on the task at hand and overlap with those of conventional ML models. Depending on the task, you can still use metrics like the F1 score, accuracy, and precision to gauge the performance of LLMs (see the short sketch after the list below), but in addition to these metrics you will also need to take care of:
Safety measures: filtering outputs so the model does not spit out biased or conflicting content
Protection from adversarial attacks
Interpretability
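For the conventional metrics mentioned above, scikit-learn is a common way to compute them. The snippet below is a minimal sketch; the labels and predictions are made-up placeholders for a binary task (e.g., “is this response on-topic?”).

```python
from sklearn.metrics import accuracy_score, precision_score, f1_score

# Hypothetical ground-truth labels and model predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
```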
Failing to monitor LLMs could tarnish reputations and cause irrevocable damage to both the company using the model and the company that built it. So, what should you know about monitoring large language (and traditional) models?
In today’s Token, we cover:
Turns out things can get nasty really quickly with LLMs; how can I start monitoring my models and infrastructure?
A curated list of open-source tools that solve some of the most pressing problems with LLM monitoring and observability.
What would be the right KPIs to measure?
My model metrics look good, but the model is still not performant. What might be the issue?
How do I know my users are actually benefitting from the model and improved metrics?
Adversarial attacks 😱
Conclusion
Turns out things can get nasty really quickly with LLMs; how can I start monitoring my models and infrastructure?
In the ML world, hundreds of new tools emerge every week, and not all of them will be useful for your use case. Below, we have curated a list of open-source tools that solve some of the most pressing problems with LLM monitoring and observability:
A library for interpreting and visualizing LLM predictions, assisting in model explanation and debugging. It works for any model of your choice.
An open-source toolkit for monitoring Large Language Models (LLMs). Features include text quality and relevance assessment, hallucination checks, and sentiment and toxicity analysis.
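The toolkit itself is not named above, but the kind of checks it runs (sentiment, toxicity) can be approximated with off-the-shelf Hugging Face classifiers. This is only a rough sketch: the toxicity model name `unitary/toxic-bert` and the idea of screening each response are illustrative assumptions, not the toolkit's actual API.

```python
from transformers import pipeline

# Off-the-shelf classifiers standing in for the toolkit's built-in checks.
sentiment = pipeline("sentiment-analysis")
toxicity = pipeline("text-classification", model="unitary/toxic-bert")  # assumed model choice

def screen_response(text: str) -> dict:
    """Attach simple sentiment and toxicity scores to an LLM response."""
    return {
        "text": text,
        "sentiment": sentiment(text)[0],  # e.g. {"label": "NEGATIVE", "score": 0.98}
        "toxicity": toxicity(text)[0],    # e.g. {"label": "toxic", "score": 0.91}
    }

print(screen_response("This answer is unhelpful and rude."))
```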
Specifically designed for visualizing and interpreting transformer-based LLMs. Helps visualize attention in NLP models (BERT, GPT-2, BART, etc.).
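The attention weights such a tool visualizes can be pulled directly from a Hugging Face model by requesting them at inference time. The sketch below only extracts the tensors and leaves the plotting to whichever visualization tool you use; the model and input sentence are arbitrary.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load a small BERT model and ask it to return attention weights.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("Monitoring LLMs keeps surprises to a minimum.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(f"{len(outputs.attentions)} layers, per-layer shape: {tuple(outputs.attentions[0].shape)}")
print("Tokens:", tokens)
```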
SHAP (SHapley Additive exPlanations): A game-theoretic approach to explaining the output of any machine learning model. It also works with models from Hugging Face's transformers library.
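A minimal sketch of how that can look, assuming recent shap and transformers installs; the sentiment model and the input text are arbitrary choices for illustration.

```python
import shap
from transformers import pipeline

# Any text-classification pipeline works; sentiment analysis is just an example.
classifier = pipeline("sentiment-analysis", return_all_scores=True)

# SHAP wraps the pipeline and attributes the prediction to individual tokens.
explainer = shap.Explainer(classifier)
shap_values = explainer(["The model's answers got noticeably worse after the last update."])

print(shap_values)              # raw Explanation object with per-token contributions
# shap.plots.text(shap_values)  # interactive visualization in a notebook
```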
An extensible open-source toolkit that helps you examine, report, and mitigate discrimination and bias in machine learning models throughout the AI application lifecycle.
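Again, the toolkit is not named above, but one of the simplest bias checks such toolkits report, the disparate impact ratio, is easy to compute by hand. The predictions and group labels below are invented for illustration.

```python
import numpy as np

# Hypothetical model decisions (1 = favorable outcome) and a protected attribute
# (0 = unprivileged group, 1 = privileged group).
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])
group  = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

rate_unpriv = y_pred[group == 0].mean()
rate_priv = y_pred[group == 1].mean()

# Disparate impact: ratio of favorable-outcome rates. Values well below 1.0
# (a common rule of thumb is < 0.8) suggest the unprivileged group is disadvantaged.
print("Disparate impact:", rate_unpriv / rate_priv)
```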
An open-source monitoring toolkit for collecting and querying metrics from LLMs in real time. It integrates with tools like Prometheus and Elasticsearch to provide visualization and analysis of LLM metrics and logs.
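A minimal version of that collection pipeline, sketched with the Python prometheus_client package and an LLM call stubbed out as a hypothetical `call_llm` function (metric names and labels are illustrative). Prometheus would scrape the exposed endpoint, and dashboards handle the visualization side.

```python
import time
from prometheus_client import Counter, Histogram, start_http_server

# Metrics exposed for Prometheus to scrape; names and labels are assumptions.
REQUESTS = Counter("llm_requests_total", "Number of LLM requests", ["model"])
LATENCY = Histogram("llm_request_latency_seconds", "LLM request latency", ["model"])
TOKENS = Counter("llm_tokens_total", "Tokens generated by the LLM", ["model"])

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call."""
    time.sleep(0.1)
    return "stubbed response text"

def monitored_call(prompt: str, model: str = "demo-llm") -> str:
    REQUESTS.labels(model=model).inc()
    with LATENCY.labels(model=model).time():
        response = call_llm(prompt)
    TOKENS.labels(model=model).inc(len(response.split()))  # crude token count
    return response

if __name__ == "__main__":
    start_http_server(8000)  # metrics served at http://localhost:8000/metrics
    while True:
        monitored_call("What should I monitor?")
```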