Matryoshka
Learning to Drive Black-Box LLMs with LLMs

Georgia Institute of Technology
*Indicates Equal Contribution

Abstract

Despite the impressive generative abilities of black-box large language models (LLMs), their inherent opacity hinders further advancements in capabilities such as reasoning, planning, and personalization. Existing works aim to enhance LLM capabilities via domain-specific adaptation or in-context learning, which require additional training on accessible model parameters, an infeasible option for black-box LLMs. To address this challenge, we introduce Matryoshka, a lightweight white-box LLM controller that guides a large-scale black-box LLM generator by decomposing complex tasks into a series of intermediate outputs. Specifically, we consider the black-box LLM as an environment, with Matryoshka serving as a policy that provides intermediate guidance through prompts to drive the black-box LLM. Matryoshka is trained to steer the outputs of the black-box LLM toward alignment with preferences during iterative interaction, which enables controllable multi-turn generation and self-improvement in optimizing intermediate guidance. Empirical evaluations on three diverse tasks demonstrate that Matryoshka effectively enhances the capabilities of black-box LLMs in complex, long-horizon tasks, including reasoning, planning, and personalization. By leveraging this pioneering controller-generator framework to mitigate dependence on model parameters, Matryoshka provides a transparent and practical solution for improving black-box LLMs through controllable multi-turn generation using white-box LLMs.

Introduction

Existing research efforts for improving black-box LLM performance can be largely categorized into two main paradigms:

In-context learning (ICL)-based methods are designed to guide LLMs to exhibit specific capabilities or adhere to particular directives. These frameworks require meticulously constructed few-shot demonstrations or prompts for the LLM to emulate or follow, rather than fundamentally advancing its intrinsic capabilities.

Adapter-based methods exploit the inherent randomness in LLM generation: they produce multiple candidate outputs and then select those that best satisfy predetermined domain criteria. However, these approaches depend heavily on the intrinsic synthetic capabilities or built-in functionalities of the black-box LLM, so they may select a suboptimal candidate when all the generated options are less than ideal.
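The select-from-candidates pattern described above can be illustrated with a minimal sketch. This is not the implementation of any specific adapter-based method: `best_of_n`, `toy_sample`, and `toy_score` are hypothetical names, and the sampler and scorer are toy stand-ins for a black-box LLM call and a domain-specific criterion.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    sample: Callable[[str], str],
    score: Callable[[str], float],
    prompt: str,
    n: int = 4,
) -> Tuple[str, float]:
    """Draw n candidates from the generator and keep the highest-scoring one."""
    candidates: List[str] = [sample(prompt) for _ in range(n)]
    scored = [(score(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Hypothetical stand-ins: a real setup would call a black-box LLM API and a
# domain-specific verifier or reward model here.
rng = random.Random(0)


def toy_sample(prompt: str) -> str:
    return f"{prompt} -> answer {rng.randint(0, 9)}"


def toy_score(candidate: str) -> float:
    # Pretend that only the answer "7" is fully correct.
    return 1.0 if candidate.endswith("7") else 0.1


best, best_score = best_of_n(toy_sample, toy_score, "2 + 5", n=4)
print(best, best_score)
```

The selection step can only choose among what the generator produced: if every candidate scores poorly, the returned answer scores poorly too, which is the limitation noted above.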

We propose Matryoshka, a modular framework designed to enhance the advanced problem-solving capabilities of black-box LLMs via controllable multi-turn generations.

Experiments

Main Results

LaMP

ALFWorld

GSM

Matryoshka consistently surpasses existing baselines, achieving state-of-the-art performance across three domains: personalization, planning, and reasoning. On the LaMP benchmark, we use gpt-4o-mini as the black-box LLM generator, while on the GSM and ALFWorld benchmarks we use gpt-3.5-turbo, which demonstrates the versatility of our approach.

Plug-and-Play

We demonstrate that Matryoshka effectively applies the optimized white-box controller to other black-box models in a plug-and-play fashion, requiring no additional training. For example, in the LaMP benchmark, our white-box controller functions as a plug-in with black-box models like gpt-3.5-turbo and gemini-1.5-flash. Experimental results indicate that our controller significantly outperforms other baselines, with large performance gains on LaMP-3 and LaMP-4. Additionally, in solving mathematical problems on GSM8K, a Matryoshka model trained with gpt-3.5-turbo can be seamlessly applied to gpt-4o-mini without extra training costs.

Efficiency

LaMP

We present the accuracy and ROUGE-L curves for LaMP-2M and LaMP-4, with the x-axis representing the total number of profiles per user. We compare Matryoshka and PAG, using the same white-box controller and the black-box model gpt-4o-mini. On LaMP-2M, as the profile count increases, PAG’s performance deteriorates significantly, whereas Matryoshka maintains stable performance and surpasses PAG by an increasing margin. On LaMP-4, Matryoshka and PAG exhibit similar trends, but Matryoshka consistently outperforms PAG by a substantial and steady margin. These results demonstrate the efficacy of iterative guidance optimization (IGO) in enhancing the summarization capabilities of the white-box controller, especially when dealing with varying and large numbers of profiles.
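For readers unfamiliar with the metric, ROUGE-L scores a generated text by the longest common subsequence (LCS) of tokens it shares with a reference. The sketch below computes the F1 variant and assumes lowercased whitespace tokenization; the function names are illustrative, and the evaluation behind the reported curves may use a different tokenizer, stemming, or weighting.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 over whitespace tokens (an assumed, simplified tokenization)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens on the LCS
    recall = lcs / len(ref)      # share of reference tokens on the LCS
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("a b c d", "a c"))
```

In the printed example the LCS has length 2, so precision is 2/2, recall is 2/4, and the F1 score is 2/3.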

ALFWorld

We evaluate the performance of Matryoshka against the vanilla LLaMA3-8B-Instruct and AdaPlanner on ALFWorld, varying the number of closed-loop iterations M during the inference phase. As illustrated, Matryoshka achieves an accuracy exceeding 95% in the open-loop inference setting (M=1), significantly surpassing both AdaPlanner and LLaMA3-8B-Instruct. Furthermore, during an 8-iteration closed-loop inference, Matryoshka maintains the highest accuracy of 97%. These findings indicate that Matryoshka is capable of generating exceptionally high-quality plans, enabling the GPT model serving as the task executor to interact with the environment and complete tasks successfully without requiring closed-loop refinement.
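The difference between open-loop inference (M=1) and closed-loop inference (M>1) can be sketched as a retry loop in which environment feedback from a failed attempt is passed back to the planner. This is a simplified illustration, not the released evaluation code: `closed_loop_inference`, `plan_fn`, and `execute_fn` are hypothetical names, and the toy planner and executor stand in for the white-box controller and the black-box executor acting in the environment.

```python
from typing import Callable, Tuple

# Hypothetical interfaces (assumptions for this sketch):
#   plan_fn(task, feedback) -> plan text from the white-box controller
#   execute_fn(task, plan)  -> (success, feedback) from the black-box executor
PlanFn = Callable[[str, str], str]
ExecuteFn = Callable[[str, str], Tuple[bool, str]]


def closed_loop_inference(
    task: str, plan_fn: PlanFn, execute_fn: ExecuteFn, max_iters: int = 8
) -> Tuple[bool, int]:
    """Run up to max_iters plan/execute rounds; return (success, rounds used)."""
    feedback = ""
    for round_idx in range(1, max_iters + 1):
        plan = plan_fn(task, feedback)
        success, feedback = execute_fn(task, plan)
        if success:
            return True, round_idx
    return False, max_iters


def toy_plan(task: str, feedback: str) -> str:
    # Revise the plan only when the environment reported a problem.
    return "open fridge; take apple" if "closed" in feedback else "take apple"


def toy_execute(task: str, plan: str) -> Tuple[bool, str]:
    if plan.startswith("open fridge"):
        return True, "done"
    return False, "the fridge is closed"


print(closed_loop_inference("get apple", toy_plan, toy_execute, max_iters=1))  # (False, 1)
print(closed_loop_inference("get apple", toy_plan, toy_execute, max_iters=8))  # (True, 2)
```

With `max_iters=1` the loop gets a single attempt and no chance to use feedback, which corresponds to the open-loop setting. A planner whose first plan already succeeds gains little from additional rounds.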

Case Study

Here is a case study of Matryoshka on GSM-Hard, ALFWorld, and LaMP-4.

In GSM-Hard, the white-box controller is responsible for breaking down complex problems into simpler sub-problems, while the black-box generator produces executable code for each sub-problem.
In ALFWorld, the white-box controller decomposes tasks into executable plans, and the black-box generator provides an execution strategy for each plan.
In LaMP, the white-box controller summarizes each user's history, and the black-box generator creates suitable titles based on this summary.
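The division of labor in these three cases follows one pattern: the controller turns a task into intermediate guidance, and the generator is queried with that guidance. The sketch below illustrates the pattern for the GSM-Hard case only. The names `drive`, `Controller`, and `Generator`, the prompt layout, and the toy functions are assumptions made for illustration, not the project's API.

```python
from typing import Callable, List

Controller = Callable[[str], List[str]]  # task -> intermediate guidance steps
Generator = Callable[[str], str]         # prompt -> black-box LLM output


def drive(task: str, controller: Controller, generator: Generator) -> List[str]:
    """Ask the controller for guidance, then query the generator once per step."""
    outputs: List[str] = []
    for step in controller(task):
        context = "\n".join(outputs)
        prompt = f"Task: {task}\nPrevious outputs:\n{context}\nCurrent step: {step}"
        outputs.append(generator(prompt))
    return outputs


# Toy stand-ins: a real setup would call a trained white-box LLM and a
# black-box LLM API here.
def toy_controller(task: str) -> List[str]:
    return ["compute the price of 3 pens", "add the price of 1 notebook"]


def toy_generator(prompt: str) -> str:
    step = prompt.rsplit("Current step: ", 1)[1]
    return f"# code for: {step}"


results = drive("3 pens at $2 plus a $5 notebook", toy_controller, toy_generator)
print(results)
```

Because the generator is only reached through a prompt-in, text-out callable, a different black-box model can be passed in without changing the controller.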

BibTeX

@article{li2024matryoshka,
  title={Matryoshka: Learning to Drive Black-Box LLMs with LLMs},
  author={Li, Changhao and Zhuang, Yuchen and Qiang, Rushi and Sun, Haotian and Dai, Hanjun and Zhang, Chao and Dai, Bo},
  journal={arXiv preprint arXiv:2410.20749},
  year={2024}
}
