Local Multi-LLM Testing & Performance Tracker

Anjali
September 22, 2025

Are you experimenting with multiple local large language models (LLMs) and struggling to track which one performs best? Multi-LLM Testing offers a structured way to evaluate and compare responses across different models running locally.

In this guide, we’ll walk through a working setup for Multi-LLM Testing and show how to capture response time, readability, and text-complexity metrics, with every result logged automatically in Google Sheets.

This is more than a test—it’s an AI workflow automation use case that helps you identify which model fits best for your tasks.

What is Local Multi-LLM Testing?

Multi-LLM Testing is the process of sending the same input to multiple local language models and then comparing their performance. Instead of guessing which LLM works best, this workflow measures response time, readability, and text complexity, so you can make data-driven decisions.

It’s a great example of AI workflow automation in action—reducing manual effort and enabling repeatable, structured benchmarking for LLMs.

How the Multi-LLM Testing Workflow Works

This workflow is designed with efficiency and analysis in mind. Let’s break it down step by step.

Trigger & Model Setup

The workflow starts when a chat message is received. It then fetches the list of available models from your local LM Studio server. Remember to update the base URL to match your LM Studio server’s IP address and port, as described in the workflow’s sticky notes.
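
The model-listing step can be sketched as follows. This is a minimal sketch, assuming LM Studio's OpenAI-compatible /v1/models endpoint, which returns the models inside a "data" array. The address below uses LM Studio's default port 1234 on localhost, and the helper names extractModelIds and listModelIds are illustrative, not part of the workflow.

```javascript
// Assumed server address; replace with your LM Studio server's IP and port.
const BASE_URL = "http://localhost:1234";

// Pull model IDs out of an OpenAI-style list payload: { data: [{ id: "..." }, ...] }.
function extractModelIds(payload) {
  if (!payload || !Array.isArray(payload.data)) {
    throw new Error("Unexpected /v1/models response shape");
  }
  return payload.data.map((model) => model.id);
}

// Ask the server which models it currently exposes.
async function listModelIds(baseURL = BASE_URL) {
  const res = await fetch(`${baseURL}/v1/models`);
  if (!res.ok) {
    throw new Error(`Model list request failed with status ${res.status}`);
  }
  return extractModelIds(await res.json());
}
```

Keeping the parsing in its own function makes it easy to check against a saved response without a running server.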

Processing Chat Messages

Once triggered, the workflow automatically adds a system prompt that instructs the models to:

Keep responses concise
Ensure explanations are clear enough for a 5th grader

It then sends your prompt dynamically to one or more local models for Multi-LLM Testing.
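
As a rough illustration of this step, the sketch below builds one request body per model, each with the same system prompt and user prompt. It assumes the OpenAI-style chat format that LM Studio's /v1/chat/completions endpoint accepts. The function names and the exact prompt wording are illustrative.

```javascript
// System prompt paraphrasing the two instructions described above.
const SYSTEM_PROMPT =
  "Keep the response concise and explain it so that a 5th grader can understand.";

// Build the JSON body for a single chat completion request.
function buildChatRequest(modelId, userPrompt, systemPrompt = SYSTEM_PROMPT) {
  return {
    model: modelId,
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: userPrompt },
    ],
  };
}

// One request body per model, all sharing the same user prompt.
function buildRequestsForModels(modelIds, userPrompt) {
  return modelIds.map((id) => buildChatRequest(id, userPrompt));
}
```

Because every model receives an identical message list, differences in the logged metrics come from the models rather than from the prompt.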

Timing & Analysis

The workflow measures:

Start and end times → to calculate total response time
Word and sentence counts
Average word and sentence length
Readability score using Flesch-Kincaid

This gives you a full analysis of how each model performs not only in speed but also in clarity.
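
The text metrics listed above can be approximated with a few lines of JavaScript. This is a sketch under stated assumptions, not the workflow's exact code: it uses the standard Flesch-Kincaid grade level formula, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59, and estimates syllables by counting vowel groups, which is only a heuristic. The names analyzeText and countSyllables are illustrative.

```javascript
// Rough syllable estimate: count groups of consecutive vowels, at least one per word.
function countSyllables(word) {
  const groups = word.toLowerCase().match(/[aeiouy]+/g);
  return Math.max(1, groups ? groups.length : 0);
}

// Compute word/sentence counts, average lengths, and a Flesch-Kincaid grade level.
function analyzeText(text) {
  const words = text.match(/[A-Za-z0-9']+/g) || [];
  // Split on runs of terminal punctuation and ignore empty fragments.
  const sentences = text.split(/[.!?]+/).filter((s) => s.trim().length > 0);

  const wordCount = words.length;
  const sentenceCount = Math.max(1, sentences.length);
  const letters = words.reduce((sum, w) => sum + w.length, 0);
  const syllables = words.reduce((sum, w) => sum + countSyllables(w), 0);

  const avgWordLength = wordCount ? letters / wordCount : 0;
  const avgSentenceLength = wordCount / sentenceCount;
  const gradeLevel = wordCount
    ? 0.39 * avgSentenceLength + 11.8 * (syllables / wordCount) - 15.59
    : 0;

  return { wordCount, sentenceCount, avgWordLength, avgSentenceLength, gradeLevel };
}
```

A lower grade level means simpler text, so with a "5th grader" system prompt a score near 5 or below is the target. Decimal numbers and abbreviations containing periods will be counted as sentence breaks by this simple splitter.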

Logging Results

Every interaction is automatically logged into a Google Sheet, including:

The original prompt
The model used
Response details
Timing data
Readability and analysis metrics

This logging function allows long-term tracking and easy comparison between models.
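
To make the logged record concrete, here is a minimal sketch that flattens one result into a row of cell values. The column names and their order are hypothetical and should be matched to the header row of your own sheet. Appending the row is left to whatever Google Sheets integration your workflow uses.

```javascript
// Hypothetical column order; align it with your sheet's header row.
const COLUMNS = [
  "timestamp",
  "prompt",
  "model",
  "response",
  "responseTimeMs",
  "wordCount",
  "sentenceCount",
  "avgWordLength",
  "avgSentenceLength",
  "readability",
];

// Convert a result object into an array of cell values in column order.
// Missing fields become empty cells; numbers are rounded to two decimals.
function toSheetRow(record, columns = COLUMNS) {
  return columns.map((key) => {
    const value = record[key];
    if (value === undefined || value === null) return "";
    return typeof value === "number" ? Number(value.toFixed(2)) : value;
  });
}
```

Using a fixed column list keeps rows aligned even when a model returns an incomplete result.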

Built-In Guidance & Tips

Sticky notes throughout the workflow provide best practices, such as:

Installing LM Studio correctly
Configuring model parameters
Deleting old chats to prevent overlapping responses

This ensures smoother Multi-LLM Testing and consistent results.

Example Code for Multi-LLM Testing Workflow

Below is the core logic of the Multi-LLM Testing workflow as a JavaScript outline. Adjust the base URL for your LM Studio server. The names input.prompt, logToGoogleSheets, and calculateFleschKincaid stand in for the corresponding steps of your workflow automation environment.

// Trigger: Chat Message Received
// `input.prompt` holds the incoming chat message. `logToGoogleSheets` and
// `calculateFleschKincaid` are placeholders for the workflow's own steps.

// Fetch Available Models from LM Studio
// Include the port in the base URL (LM Studio's default is 1234).
const baseURL = "http://YOUR_LM_STUDIO_SERVER_IP";
const models = await fetch(`${baseURL}/v1/models`).then(res => res.json());

// Add System Prompt for Simplicity
const systemPrompt = "Please keep the response concise and explain it so that a 5th grader can understand.";

// Send Prompt to Each Model
// The OpenAI-compatible endpoint returns the model list under the `data` key.
for (const model of models.data) {
  const startTime = Date.now();
  const response = await fetch(`${baseURL}/v1/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: model.id,
      messages: [
        { role: "system", content: systemPrompt },
        { role: "user", content: input.prompt }
      ]
    }),
  }).then(res => res.json());
  const endTime = Date.now();

  // Analyze Response
  const text = response.choices[0].message.content;
  // Split on any whitespace and drop empty strings so extra spaces are not counted as words.
  const words = text.split(/\s+/).filter(Boolean).length;
  // Split on runs of terminal punctuation and ignore the empty fragment after the last one.
  const sentences = text.split(/[.!?]+/).filter(s => s.trim()).length || 1;
  const avgWordLength = text.replace(/\s+/g, "").length / words;
  const avgSentenceLength = words / sentences;

  // Log Results (Example: Google Sheets Integration)
  logToGoogleSheets({
    model: model.id,
    prompt: input.prompt,
    response: text,
    responseTime: endTime - startTime, // milliseconds
    words,
    sentences,
    avgWordLength,
    avgSentenceLength,
    readability: calculateFleschKincaid(text),
  });
}

This example captures timing, readability, and response logging, making it easy to compare model performance.

*Note: For the JSON template, please contact us and provide the blog URL.

Why Multi-LLM Testing Matters

Running multiple large language models locally without structured testing can be chaotic. With Multi-LLM Testing, you:

Benchmark different models objectively
Automate repetitive performance tracking
Identify which model gives the best balance of speed and readability
Improve productivity by letting automation handle the heavy lifting

This workflow transforms experimentation into actionable insights.
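
One way to turn the logged rows into a comparison is to average the metrics per model. The sketch below assumes each row has model, responseTimeMs, and readability fields. These field names are illustrative and should be adapted to your sheet's columns.

```javascript
// Average response time and readability per model, fastest model first.
function summarizeByModel(rows) {
  const totals = new Map();
  for (const row of rows) {
    const t = totals.get(row.model) || { runs: 0, timeMs: 0, readability: 0 };
    t.runs += 1;
    t.timeMs += row.responseTimeMs;
    t.readability += row.readability;
    totals.set(row.model, t);
  }
  return [...totals.entries()]
    .map(([model, t]) => ({
      model,
      runs: t.runs,
      avgTimeMs: t.timeMs / t.runs,
      avgReadability: t.readability / t.runs,
    }))
    .sort((a, b) => a.avgTimeMs - b.avgTimeMs);
}
```

Sorting by average time is only one choice. For a different balance of speed and readability, sort on avgReadability or on a weighted combination of the two.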

Conclusion

Local Multi-LLM Testing is not just about speed tests—it’s about creating a sustainable, automated workflow to optimize LLM performance. By combining timing, readability analysis, and automatic logging, you gain a powerful tool to guide your AI development.

Ready to try it out? Set up the workflow, connect your LM Studio server, and start comparing models today.

FAQs

1. What is Multi-LLM Testing and why is it useful?

Multi-LLM Testing is the process of running the same input across multiple local language models to compare their response time, readability, and text complexity. It’s useful for choosing the best model for your workflows.

2. How does the Multi-LLM Testing workflow analyze responses?

The workflow calculates response time, word and sentence counts, average word and sentence length, and a Flesch-Kincaid readability score to measure both the speed and the clarity of each model.

3. Can I log Multi-LLM Testing results automatically?

Yes. The workflow automatically logs every response, along with timing and analysis metrics, into a Google Sheet for easy review and long-term tracking.
