Local Multi-LLM Testing & Performance Tracker
- September 22, 2025
Are you experimenting with multiple local large language models (LLMs) and struggling to track which one performs best? Multi-LLM Testing offers a structured way to evaluate and compare responses across different models running locally.
In this guide, we’ll walk through a real workflow setup for Multi-LLM Testing, showing you how to capture speed, readability, and performance metrics—all while keeping everything logged automatically in Google Sheets.
This is more than a test—it’s an AI workflow automation use case that helps you identify which model fits best for your tasks.

What is Local Multi-LLM Testing?
Multi-LLM Testing is the process of sending the same input to multiple local language models and then comparing their performance. Instead of guessing which LLM works best, this workflow measures response time, readability, and text complexity, so you can make data-driven decisions.
It’s a great example of AI workflow automation in action—reducing manual effort and enabling repeatable, structured benchmarking for LLMs.
How the Multi-LLM Testing Workflow Works
This workflow is designed with efficiency and analysis in mind. Let’s break it down step by step.
Trigger & Model Setup
The workflow starts when a chat message is received. It then fetches a list of available models from your local LM Studio server. Remember to update the base URL to match your LM Studio server’s IP as noted in the sticky notes.
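The model-discovery step can be sketched as follows. This is a minimal illustration, assuming the LM Studio server exposes an OpenAI-compatible `/v1/models` endpoint that returns an object with a `data` array of `{ id }` entries. The `BASE_URL` value and the function names `extractModelIds` and `listLocalModels` are illustrative assumptions, not part of the workflow itself.

```javascript
// Assumed base URL: replace with your LM Studio server's IP and port.
const BASE_URL = "http://localhost:1234";

// Pull model IDs out of an OpenAI-style list response: { data: [{ id }, ...] }.
function extractModelIds(payload) {
  const entries = Array.isArray(payload?.data) ? payload.data : [];
  return entries
    .map((entry) => entry.id)
    .filter((id) => typeof id === "string");
}

// Fetch the model list from the server (needs Node 18+ or a browser for fetch).
async function listLocalModels(baseURL = BASE_URL) {
  const res = await fetch(`${baseURL}/v1/models`);
  if (!res.ok) {
    throw new Error(`Model list request failed with status ${res.status}`);
  }
  return extractModelIds(await res.json());
}
```

Keeping the parsing in its own function makes it possible to check the response handling without a running server.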
Processing Chat Messages
Once triggered, the workflow automatically adds a system prompt that instructs the models to:
- Keep responses concise
- Ensure explanations are clear enough for a 5th grader
It then sends your prompt dynamically to one or more local models for Multi-LLM Testing.
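As a rough sketch of this step, the snippet below prepares one chat request body per model, each carrying the same system prompt and user prompt. It assumes the OpenAI-style `messages` format; the helper name `buildChatRequests` and the model IDs in the example are hypothetical.

```javascript
// System prompt that asks for concise, 5th-grader-friendly answers.
const SYSTEM_PROMPT =
  "Please keep the response concise and explain it so that a 5th grader can understand.";

// Build one chat-completions request body per model for the same user prompt.
function buildChatRequests(modelIds, userPrompt) {
  return modelIds.map((model) => ({
    model,
    messages: [
      { role: "system", content: SYSTEM_PROMPT },
      { role: "user", content: userPrompt },
    ],
  }));
}

// Example: the same question prepared for two hypothetical model IDs.
const exampleRequests = buildChatRequests(
  ["model-a", "model-b"],
  "Why is the sky blue?"
);
console.log(JSON.stringify(exampleRequests[0], null, 2));
```

Because every model receives an identical payload apart from the `model` field, differences in the responses can be attributed to the model rather than to the prompt.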
Timing & Analysis
The workflow measures:
- Start and end times → to calculate total response time
- Word and sentence counts
- Average word and sentence length
- Readability score using Flesch-Kincaid
This gives you a full analysis of how each model performs not only in speed but also in clarity.
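The measurements listed above could be computed along these lines. This is a sketch under stated assumptions: it interprets "Flesch-Kincaid" as the Flesch-Kincaid grade level formula, and it estimates syllables with a simple vowel-group heuristic, so scores are approximate. The names `timeCall`, `countSyllables`, and `analyzeText` are illustrative.

```javascript
// Measure how long an async call takes, in milliseconds.
async function timeCall(fn) {
  const start = Date.now();
  const result = await fn();
  return { result, elapsedMs: Date.now() - start };
}

// Rough syllable estimate: count groups of consecutive vowels (heuristic only).
function countSyllables(word) {
  const groups = word.toLowerCase().replace(/[^a-z]/g, "").match(/[aeiouy]+/g);
  return Math.max(1, groups ? groups.length : 0);
}

// Compute counts, averages, and an approximate Flesch-Kincaid grade level.
function analyzeText(text) {
  const words = text.trim().split(/\s+/).filter(Boolean);
  const sentences = text.split(/[.!?]+/).filter((s) => s.trim().length > 0);
  const wordCount = words.length;
  const sentenceCount = Math.max(1, sentences.length);
  // Count only letters and digits so punctuation does not inflate word length.
  const characters = words.join("").replace(/[^A-Za-z0-9]/g, "").length;
  const syllables = words.reduce((sum, w) => sum + countSyllables(w), 0);
  const avgSentenceLength = wordCount / sentenceCount;
  const avgWordLength = wordCount ? characters / wordCount : 0;
  // Grade level = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
  const fleschKincaidGrade = wordCount
    ? 0.39 * avgSentenceLength + 11.8 * (syllables / wordCount) - 15.59
    : 0;
  return {
    wordCount,
    sentenceCount,
    avgWordLength,
    avgSentenceLength,
    fleschKincaidGrade,
  };
}
```

A lower grade level means simpler text, which is the relevant direction when the system prompt asks for explanations a 5th grader can follow.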
Logging Results
Every interaction is automatically logged into a Google Sheet, including:
- The original prompt
- The model used
- Response details
- Timing data
- Readability and analysis metrics
This logging function allows long-term tracking and easy comparison between models.
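One way to prepare a log entry is to flatten each result into a row whose cells follow a fixed column order, as sketched below. The column names and the helper `toSheetRow` are assumptions for illustration; sending the row to Google Sheets is left to whatever integration your workflow tool provides.

```javascript
// Column order for the sheet; these header names are illustrative assumptions.
const COLUMNS = [
  "timestamp",
  "prompt",
  "model",
  "response",
  "responseTimeMs",
  "wordCount",
  "sentenceCount",
  "avgWordLength",
  "avgSentenceLength",
  "readability",
];

// Turn one result object into a flat row that matches COLUMNS.
function toSheetRow(result) {
  return COLUMNS.map((key) => {
    const value = result[key];
    if (value === undefined || value === null) return "";
    // Round non-integer numbers so the sheet stays readable.
    return typeof value === "number" && !Number.isInteger(value)
      ? Number(value.toFixed(2))
      : value;
  });
}

// Example with made-up values for a hypothetical model.
const exampleRow = toSheetRow({
  timestamp: new Date().toISOString(),
  prompt: "Why is the sky blue?",
  model: "model-a",
  response: "Sunlight bounces off the air.",
  responseTimeMs: 1234,
  wordCount: 5,
  sentenceCount: 1,
  avgWordLength: 4.8,
  avgSentenceLength: 5,
});
console.log(exampleRow);
```

A fixed column order keeps rows from different models aligned, which makes sorting and filtering in the sheet straightforward.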
Built-In Guidance & Tips
Sticky notes throughout the workflow provide best practices, such as:
- Installing LM Studio correctly
- Configuring model parameters
- Deleting old chats to prevent overlapping responses
This ensures smoother Multi-LLM Testing and consistent results.
Example Code for Multi-LLM Testing Workflow
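Below is the core logic of the Multi-LLM Testing workflow. Keep the overall structure as-is when implementing the process in your local LM Studio and workflow automation environment, and replace the base URL with your own server address. The helpers logToGoogleSheets and calculateFleschKincaid, and the input object, are not defined here and must be supplied by your environment.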
// Trigger: chat message received.
// `input.prompt` holds the user's message and comes from the workflow environment.

// Fetch available models from LM Studio
const baseURL = "http://YOUR_LM_STUDIO_SERVER_IP";
const models = await fetch(`${baseURL}/v1/models`).then(res => res.json());

// Add system prompt for simplicity
const systemPrompt = "Please keep the response concise and explain it so that a 5th grader can understand.";

// Send the prompt to each model.
// The OpenAI-compatible endpoint returns the model list under `data`.
for (const model of models.data) {
  const startTime = Date.now();
  const response = await fetch(`${baseURL}/v1/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: model.id,
      messages: [
        { role: "system", content: systemPrompt },
        { role: "user", content: input.prompt }
      ]
    }),
  }).then(res => res.json());
  const endTime = Date.now();

  // Analyze response
  const text = response.choices[0].message.content;
  // Split on any whitespace and ignore empty pieces; never divide by zero
  const words = Math.max(1, text.split(/\s+/).filter(Boolean).length);
  const sentences = Math.max(1, text.split(/[.!?]+/).filter(s => s.trim()).length);
  const avgWordLength = text.replace(/\s+/g, "").length / words;
  const avgSentenceLength = words / sentences;

  // Log results (example: Google Sheets integration)
  // logToGoogleSheets and calculateFleschKincaid are placeholders for your own helpers.
  logToGoogleSheets({
    model: model.id,
    prompt: input.prompt,
    response: text,
    responseTime: endTime - startTime, // milliseconds
    words,
    sentences,
    avgWordLength,
    avgSentenceLength,
    readability: calculateFleschKincaid(text),
  });
}
This example captures timing, readability, and response logging, making it easy to compare model performance.
Note: For the JSON template, please contact us and include the URL of this blog post.
Why Multi-LLM Testing Matters
Running multiple large language models locally without structured testing can be chaotic. With Multi-LLM Testing, you:
- Benchmark different models objectively
- Automate repetitive performance tracking
- Identify which model gives the best balance of speed and readability
- Improve productivity by letting automation handle the heavy lifting
This workflow transforms experimentation into actionable insights.
Conclusion
Local Multi-LLM Testing is not just about speed tests—it’s about creating a sustainable, automated workflow to optimize LLM performance. By combining timing, readability analysis, and automatic logging, you gain a powerful tool to guide your AI development.
Ready to try it out? Set up the workflow, connect your LM Studio server, and start comparing models today.
FAQs
1. What is Multi-LLM Testing and why is it useful?
Multi-LLM Testing is the process of running the same input across multiple local language models to compare response time, readability, and text complexity. It’s useful for choosing the best model for your workflows.
2. How does the Multi-LLM Testing workflow analyze responses?
The workflow calculates response time, word count, sentence length, and Flesch-Kincaid readability scores to measure both speed and clarity of different models.
3. Can I log Multi-LLM Testing results automatically?
Yes. The workflow automatically logs every response, along with timing and analysis metrics, into a Google Sheet for easy review and long-term tracking.