Luna-2 is only available in the Enterprise tier of Galileo. Contact us to learn more and get started.
Overview
LLMs are powerful judges for evaluations, but as your application scales from tens or hundreds of traces a day to thousands or millions, they can fall short. Organizations that rely solely on LLMs as judges often incur major inference costs and don't get the low latency needed for real-time evaluations and runtime protection. In short:
- LLMs are expensive to run at scale.
- LLMs don't provide the latency needed, especially for runtime protection.
- LLMs are general purpose; even with CLHF to enhance the evaluation prompts, they can be less effective for your specific needs.
Luna-2 addresses each of these shortcomings:
- As an SLM, it is an order of magnitude cheaper to run than most LLMs.
- SLMs run an order of magnitude faster, enabling runtime protection.
- Luna-2 is fine-tuned for evaluations, giving out-of-the-box performance comparable to the top LLMs, and it can be further fine-tuned on your data to improve accuracy beyond any general-purpose LLM.
Performance and cost comparison
Comparison with different LLMs and content safety tools
| Model | Cost / 1M tokens | Accuracy (F1 score) | Avg latency | Max tokens |
|---|---|---|---|---|
| Luna-2 | $0.02 | 0.95 | 152ms | 128k |
| GPT 4o | $2.50 | 0.94 | 3,200ms | 128k |
| GPT 4o mini | $0.60 | 0.90 | 2,600ms | 128k |
| Azure Content Safety | $1.52 | 0.62 | 312ms | 3k |
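To make the cost gap concrete, here is a back-of-envelope sketch that applies the per-1M-token prices from the table above to an assumed workload. The trace volume and tokens-per-trace figures are illustrative assumptions, not benchmarks:

```python
# Per-1M-token prices taken from the comparison table above.
COST_PER_1M_TOKENS = {
    "Luna-2": 0.02,
    "GPT 4o": 2.50,
    "GPT 4o mini": 0.60,
}

def monthly_eval_cost(traces_per_day: int, tokens_per_trace: int,
                      price_per_1m: float, days: int = 30) -> float:
    """Estimated monthly cost of evaluating every trace."""
    total_tokens = traces_per_day * tokens_per_trace * days
    return total_tokens / 1_000_000 * price_per_1m

# Hypothetical workload: 100k traces/day at ~2K tokens each.
for model, price in COST_PER_1M_TOKENS.items():
    cost = monthly_eval_cost(100_000, 2_000, price)
    print(f"{model}: ${cost:,.2f}/month")
```

At this assumed volume the per-1M-token difference compounds to roughly $120/month for Luna-2 versus $15,000/month for GPT 4o.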
Latency vs compute requirements
These are the measured latencies for Luna-2 across a range of GPUs and request sizes.
H100 GPU
| Model | Small (500 tokens) | Medium (2K tokens) | Large (15K tokens) | Extra Large (100K tokens) |
|---|---|---|---|---|
| Luna-2 3B | 14ms | 35ms | 283ms | 4.4s |
| Luna-2 8B | 22ms | 66ms | 555ms | 7.52s |
A100 GPU
| Model | Small (500 tokens) | Medium (2K tokens) | Large (15K tokens) | Extra Large (100K tokens) |
|---|---|---|---|---|
| Luna-2 3B | 19ms | 60ms | 601ms | 12.5s |
| Luna-2 8B | 36ms | 138ms | 1.24s | 21.2s |
L40S GPU
| Model | Small (500 tokens) | Medium (2K tokens) | Large (15K tokens) | Extra Large (100K tokens) |
|---|---|---|---|---|
| Luna-2 3B | 14ms | 47ms | 564ms | 17.2s |
| Luna-2 8B | 31ms | 115ms | 1.06s | 29s |
L4 GPU
L4 GPUs are only supported for calculating metrics for Log streams and experiments. These GPUs are not supported for runtime protection.
| Model | Small (500 tokens) | Medium (2K tokens) | Large (15K tokens) | Extra Large (100K tokens) |
|---|---|---|---|---|
| Luna-2 3B | 63ms | 249ms | 4.1s | 154s |
| Luna-2 8B | 150ms | 580ms | 10.3s | 163s |
Actual latencies can vary significantly with system load (e.g., queries per second). Higher load can be absorbed by adding GPUs, at a corresponding increase in cost.
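One rough way to size a deployment from these latency tables is a Little's-law style estimate of GPUs needed for a target query rate. This sketch assumes requests are served one at a time per GPU at a fixed utilization target; real serving stacks batch requests, so treat the result only as a coarse starting point:

```python
import math

def gpus_needed(target_qps: float, latency_s: float,
                utilization: float = 0.7) -> int:
    """Rough capacity estimate: each GPU sustains about
    utilization / latency_s requests per second if requests
    are processed serially (no batching assumed)."""
    per_gpu_qps = utilization / latency_s
    return math.ceil(target_qps / per_gpu_qps)

# e.g. Luna-2 3B on an A100, medium (2K-token) requests at 60ms,
# targeting 50 queries per second.
print(gpus_needed(target_qps=50, latency_s=0.060))  # → 5
```

The 0.7 utilization factor is an assumed headroom margin, not a Galileo recommendation; adjust it for your own load profile.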
Technical details
Galileo’s Luna-2 metrics use fine-tuned Llama models (3B and 8B variants) to evaluate generative AI metrics. The technical process involves:
- Fine-Tuning: Base Llama models are fine-tuned with proprietary data for specific metric needs.
- Classification: Models output normalized log-probabilities of True/False tokens to determine metric accuracy.
- Optimized Infrastructure: Metrics are hosted on Galileo’s optimized inference engine with modern GPU hardware for low-latency, cost-effective evaluations. You can also self-host on-prem or on your own cloud infrastructure.
- Adapters for Custom Metrics: Lightweight adapters on a shared base model enhance scalability and minimize infrastructure overhead for additional metrics.
- Adaptability: These models are most effective when fine-tuned, requiring approximately 4,000 samples to adapt to customer-specific use cases.
- Efficiency and Cost-Effectiveness: Luna-2 models enable simultaneous evaluation of multiple metrics with low latency and reduced costs, ideal for real-time, high-scale deployments.
- Enhanced Accuracy: Luna-2 demonstrates at least a 10% accuracy increase over traditional BERT-based models, making it well suited for precise monitoring in production environments.
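The classification step above can be sketched as follows: the log-probabilities of the True/False tokens are normalized into a metric score (a softmax over the two candidate tokens). This is a minimal illustration of the technique, not Galileo's actual implementation:

```python
import math

def true_probability(logprob_true: float, logprob_false: float) -> float:
    """Normalize the log-probabilities of the 'True' and 'False'
    tokens into the probability that the metric holds."""
    p_true = math.exp(logprob_true)
    p_false = math.exp(logprob_false)
    return p_true / (p_true + p_false)

# Example log-probabilities from a model's output head (hypothetical).
score = true_probability(-0.1, -2.5)  # ≈ 0.917
```

Because the score is a normalized probability rather than a hard label, it can be thresholded or surfaced directly as a confidence value.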
Get started with Luna-2
If you are using the Enterprise tier of Galileo, follow these steps to use Galileo’s Luna-based metrics:
- Contact Galileo’s customer support or account management to begin onboarding.
- If you are using a Galileo-hosted instance, request L4 GPUs or higher, which are necessary for running Luna-2 models. Otherwise, you can deploy to your own infrastructure using L4 or higher GPUs.
- Review the provided documentation and model cards for details on latency, accuracy, and comparisons to BERT-based metrics.
- Provide Galileo with relevant labeled sample data for fine-tuning. Galileo can augment this with synthetic data if needed.
- Galileo will fine-tune the model and deploy it for you.
- Set up your experiments and Log streams to use Luna-based metrics.
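When preparing the labeled sample data mentioned above, a common interchange format is JSON Lines (one JSON object per line). The schema below is purely hypothetical and should be confirmed with Galileo during onboarding:

```python
import json

# Hypothetical record shape for the ~4,000 labeled samples used in
# fine-tuning; the actual schema is agreed with Galileo.
samples = [
    {
        "input": "What is the capital of France?",
        "output": "The capital of France is Paris.",
        "label": True,  # e.g. the response is correct/grounded
    },
]

with open("luna2_finetune_samples.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```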