News

OpenPipe Mixture of Agents: Outperform GPT-4 at 1/25th the Cost

Kyle Corbitt and Saumya Gandhi

Jun 20, 2024

10 minutes

We’re excited to announce a new family of “Mixture of Agents” models optimized for generating synthetic training data. Using our Mixture of Agents (MoA) architecture, we’ve achieved SOTA results on both LMSYS’s Arena Hard Auto (score: 84.8) and AlpacaEval 2.0 (LC score: 68.4).

We’ve also benchmarked our MoA approach against GPT-4 variants on real-world OpenPipe customer tasks, and found completions from our MoA model were preferred over GPT-4 59.5% of the time (Claude 3 Opus as judge).

Finally, we’ve experimented with fine-tuning smaller Llama 3 models on synthetic data generated by MoA. We found that fine-tuned Llama 3 70B outperforms GPT-4 on 4/4 tasks, and even the much smaller fine-tuned Llama 3 8B outperforms GPT-4 on 3/4 tasks. Importantly, our fine-tuned Llama 3 8B models are also 3x faster and 25x cheaper to run, occupying an attractive point on the Pareto frontier of price vs performance!

We share more details of how we implemented and validated this approach below.

Background

LLMs perform better on many tasks when given “space to think.” Early results confirming this include techniques like Chain of Thought and Tree of Thoughts, both of which inspired our work here. More recently, we’ve also seen Together develop a similar Mixture of Agents approach that outperforms its constituent models.

Generalizing from these results, it’s clearly possible in at least some cases to improve model performance by dedicating more resources at inference time while holding the base model’s intelligence constant.[1] Our contribution is a concrete approach that achieves this in practice across many task types.

Model Design

Our Mixture of Agents model is designed as a drop-in replacement for GPT-4. That is, its input is a set of messages and tools, and its output is a chat completion object or stream. It is also base-model-agnostic—we have already deployed versions against GPT-4, GPT-4 Turbo and GPT-4o.[2]

Internally, we implement the following 3-prompt chain to construct the final completion:

  • Prompt 1 generates 3 candidate completions in parallel by calling the chosen base model with n=3 and a high temperature to promote output diversity.

  • Prompt 2 calls the base model again, passing in the original input along with the 3 candidate completions generated by prompt 1, and asks the LLM to review and critique each candidate.

  • Prompt 3 again passes the original input, the 3 candidate completions, and their critiques. Using this information, the base model generates a final completion that incorporates the best of all 3 candidates.

This flow is represented visually below.
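To make the chain concrete, here is a minimal sketch of the three prompts in Python. It assumes an OpenAI-compatible client; the prompt wording and helper names are illustrative, not our internal implementation.

```python
# Minimal sketch of the 3-prompt MoA chain (illustrative; prompt wording is hypothetical).
from openai import OpenAI

client = OpenAI()
BASE_MODEL = "gpt-4-turbo"  # any supported base model

def moa_completion(messages: list[dict]) -> str:
    # Prompt 1: generate 3 diverse candidates in one call (n=3, high temperature).
    candidates = client.chat.completions.create(
        model=BASE_MODEL, messages=messages, n=3, temperature=1.0
    )
    candidate_texts = [c.message.content for c in candidates.choices]

    # Prompt 2: pass the original input plus the candidates and ask for critiques.
    review_messages = messages + [{
        "role": "user",
        "content": "Here are 3 candidate responses:\n\n"
        + "\n\n".join(f"Candidate {i + 1}:\n{t}" for i, t in enumerate(candidate_texts))
        + "\n\nReview each candidate and critique its strengths and weaknesses.",
    }]
    critiques = client.chat.completions.create(
        model=BASE_MODEL, messages=review_messages
    ).choices[0].message.content

    # Prompt 3: pass the input, candidates, and critiques; generate the final completion.
    final_messages = review_messages + [
        {"role": "assistant", "content": critiques},
        {"role": "user", "content": "Using the candidates and critiques above, "
         "write the best possible final response to the original request."},
    ]
    final = client.chat.completions.create(model=BASE_MODEL, messages=final_messages)
    return final.choices[0].message.content
```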

Evaluating Response Quality

To comprehensively assess the quality of our MoA model, we benchmarked it across both open-source and private benchmarks.

Open Source Benchmarks

We chose two open-source benchmarks: Arena Hard Auto and AlpacaEval 2.0.

Arena Hard Auto consists of 500 challenging user queries filtered from the LMSYS Chatbot Arena. It also provides an automated evaluation pipeline that shows the highest correlation with and separability on Chatbot Arena rankings among automated benchmarks. To avoid bias from relying solely on an OpenAI model as the judge, we report separate results using both GPT-4 and Claude 3 Opus as judges. Under both judges, our MoA model outperformed GPT-4-Turbo, emerging as the best-performing model.

Results on Arena Hard Auto using GPT-4 as judge


Results on Arena Hard Auto using Claude 3 Opus as judge


AlpacaEval 2.0 includes 805 challenging prompts and a judging method that debiases results using length-controlled (LC) win rates, showing a Spearman correlation of 0.98 with Chatbot Arena. Our model outperformed all other models, including the recently released Together MoA model.


Private Benchmark Evaluations

We selected four tasks drawn from common categories on our platform: Summarization, Chat, and Data Extraction. With customer permission, we chose 200 random samples from each task to create our internal benchmark. Using our custom LLM-as-judge evaluations, calibrated through user feedback, we compared GPT-4-Turbo against our MoA model. Tests were run with both GPT-4-Turbo and Claude as the LLM judge. Our MoA model showed a 19.25% improvement over GPT-4-Turbo.


Human Evaluation

We acknowledge that LLM-as-judge alone is not a sufficient measure of performance. We selected a subset of 32 samples from our benchmark, evenly split between samples where the LLM judge preferred MoA and those where it preferred GPT-4-Turbo. We manually collected preference data on this subset from four highly trained human raters. Using this data, we adjusted the win rates of the LLM judge to more accurately reflect human judgment. Although the adjustments narrowed the margin by which MoA outperforms GPT-4-Turbo, the adjusted win rates still showed the MoA models outperforming the base models they were built on by 9%.

This comprehensive evaluation gives us confidence that our Mixture of Agents model offers significant improvements over existing models.


MoA and Fine-Tuning: Beating GPT-4 at 1/25th the Cost

When designing our Mixture of Agents architecture, we intentionally traded off both cost and latency in order to maximize generation quality. Concretely, it takes 3x as long to return a completion as calling the base model directly, and costs 3-4x as much.

This is because we believe the best use case for this kind of model is generating synthetic data for fine-tuning smaller models, and when generating synthetic data, quality matters above all else. Once you have generated a few hundred or a few thousand examples using MoA, you can use them to train a task-specific smaller model that preserves much of the quality while being far faster and cheaper to run.

If you are already using GPT-4 in production, the process might look something like this:

  1. Continue using GPT-4 in production while capturing completions with the OpenPipe SDK (a sketch of this step follows the list).

  2. Relabel your completions using our MoA model as the relabeling model.

  3. Fine-tune a smaller model on the relabeled completions.
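As a sketch of step 1, the OpenPipe SDK wraps the OpenAI client so production traffic is captured with a one-line change. The tag names below are hypothetical, and the exact parameters are worth confirming against the current SDK docs.

```python
# Capture production GPT-4 completions with the OpenPipe SDK's OpenAI-compatible client.
# Tag names are illustrative; confirm parameter details in the OpenPipe docs.
from openpipe import OpenAI

client = OpenAI(openpipe={"api_key": "opk_..."})  # your OpenPipe API key

completion = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Summarize this support ticket: ..."}],
    openpipe={"tags": {"prompt_id": "ticket_summarizer"}},  # tags group requests into a dataset
)
```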

We tested exactly this pipeline across four use-cases, training both Llama 3 70B and 8B variants on each use-case. We found that while the MoA outputs still outperformed our fine-tuned models on most tasks, the fine-tuned models in turn outperformed base GPT-4. Concretely, Llama 3 70B outperformed GPT-4 on 4/4 tasks, and Llama 3 8B outperformed GPT-4 on 3/4 tasks. Importantly, Llama 3 8B is also 1/25th the cost and 1/3rd the latency of GPT-4-Turbo.


By visualizing price-vs-performance, we can see that models fine-tuned on high-quality domain-specific outputs significantly outperform similarly-priced prompted models at every point on the Pareto frontier.

Using MoA Today

We’ll continue releasing improved variants incorporating new techniques and more models. But our initial family of MoA models is already available for use on the OpenPipe platform. There are two primary ways to take advantage of them:

First, you can call our OpenAI-compatible Chat Completions endpoint. Simply create an OpenPipe account, save your OpenAI key in your OpenPipe project settings and call our chat completions endpoint with one of the following model IDs:

  • moa-gpt-4-v1

  • moa-gpt-4-turbo-v1

  • moa-gpt-4o-v1
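For example, you can use the standard OpenAI Python SDK pointed at OpenPipe. The base URL shown is our best-guess assumption; check the OpenPipe docs for the exact endpoint.

```python
# Call the MoA chat completions endpoint with the OpenAI SDK.
# The base URL is assumed; verify it against the OpenPipe docs.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.openpipe.ai/api/v1",  # assumed OpenPipe API endpoint
    api_key="opk_...",  # your OpenPipe API key
)

completion = client.chat.completions.create(
    model="moa-gpt-4-turbo-v1",  # any of the MoA model IDs above
    messages=[{"role": "user", "content": "Draft a release announcement for our new feature."}],
)
print(completion.choices[0].message.content)
```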

Second, you can use them as a relabeling model. If you have an existing dataset or are tracking one via OpenPipe, you can relabel it using a Mixture of Agents model for a much stronger dataset that will lead to higher-quality fine-tuned models. You can find the full documentation here.

We’re very excited to share these results with the community and help push the SOTA forward for both prompted and fine-tuned models!

[1]: This generalization has many parallels with Aidan McLaughlin’s recent essay AI Search: The Bitter-er Lesson.

[2]: Our design also works with multiple different proposer models and we’ve already tested it by mixing GPT-4 with Claude Opus. However, we’ve chosen to first release the single-base-model variants for operational simplicity.


About OpenPipe

OpenPipe is the easiest way to train and deploy your own fine-tuned models. It only takes a few minutes to get started, and switching can cut your costs by 25x relative to OpenAI while delivering higher quality.
