Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents
TL;DR Highlight
An agent framework decomposing complex queries into subtasks, routing easy ones to local SLMs and hard ones to cloud LLMs — cutting costs by 83% while maintaining accuracy.
Who Should Read
Developers running LLM agents on mobile/edge environments who want to reduce API costs and latency. Teams building on-device AI assistants or smartphone agents.
Core Mechanics
- Key insight: decomposing queries into subtasks reveals many easy tasks within complex requests that local SLMs can handle — use this to reduce cloud calls
- Models subtask dependencies as a DAG to run independent tasks in parallel — 66% time reduction vs sequential execution
- Attaches only an MLP adapter to Llama 3-8B for task difficulty classification — original model parameters untouched and removable
- Adapter training data auto-generated without human labeling: uses alpha-Tree algorithm with token probability uncertainty as difficulty indicator
- 7-benchmark experiment pairing GPT-4o (cloud LLM) with Llama 3-8B (local SLM): avg 83.57% API cost reduction, 66.12% time reduction, accuracy within 2% of the best baseline
- Adapter generalizability confirmed: near-identical performance when transferred to different benchmarks within math category
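The DAG-style parallel execution described above can be sketched with Python's standard library. This is an illustrative skeleton, not the paper's implementation: `solve` stands in for the actual SLM/LLM call for a subtask, and the task ids and dependency dict are hypothetical.

```python
from graphlib import TopologicalSorter
from concurrent.futures import ThreadPoolExecutor

def run_dag(subtasks, deps, solve):
    """Run independent subtasks in parallel while respecting dependencies.

    subtasks: {task_id: task_text}
    deps:     {task_id: set of prerequisite task_ids}
    solve:    callback (task_id, task_text, finished_results) -> answer
    """
    ts = TopologicalSorter(deps)
    ts.prepare()
    results = {}
    with ThreadPoolExecutor() as pool:
        while ts.is_active():
            # All tasks whose prerequisites are done can run concurrently.
            ready = list(ts.get_ready())
            futures = {t: pool.submit(solve, t, subtasks[t], dict(results))
                       for t in ready}
            for t, fut in futures.items():
                results[t] = fut.result()
                ts.done(t)
    return results
```

With e.g. 3 independent subtasks and one final aggregation step, the first three run in a single parallel wave, which is where the reported wall-clock savings come from.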
Evidence
- 7-benchmark avg: 83.57% API cost reduction, 66.12% time reduction vs highest accuracy baseline
- DROP benchmark: DoT accuracy 85% vs CoT (GPT-4o) 80% — cheaper at 0.32 cents vs 1.30 cents while more accurate
- alpha-Tree (n=1) SLM usage ratio 85.53%, success rate 99.44% vs zero-shot LLM evaluation (53.11%, 92.78%)
- Task decomposition independence: DoT 91.3% vs vanilla prompting 74.2% (MATH, 50 manual labels verified)
How to Apply
- In LLM agent pipelines: first decompose user queries into subtasks, check local model token probability distribution for each, and route only high-uncertainty ones to cloud APIs.
- Identify subtask dependencies and execute independent steps asynchronously in parallel — especially effective for complex workflows with 8+ subtasks, achieving 30%+ time savings.
- Add a small MLP head to on-device SLM and train a difficulty classifier with auto-generated training data from actual inference feedback — no labeling cost.
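The routing step in the first bullet can be sketched as follows. This assumes the local SLM exposes the per-token probabilities it assigned to its own greedy answer; the geometric-mean confidence score and the 0.8 threshold are illustrative choices, not values from the paper.

```python
import math

def route_subtask(token_probs, threshold=0.8):
    """Route a subtask based on the local SLM's token-level confidence.

    token_probs: probabilities the SLM assigned to each token of its own
    greedy answer (in practice, read from the model's logits).
    Low confidence -> high uncertainty -> send to the cloud LLM.
    """
    if not token_probs:
        return "cloud"
    # Geometric mean of token probabilities as a confidence score.
    confidence = math.exp(sum(math.log(p) for p in token_probs)
                          / len(token_probs))
    return "local" if confidence >= threshold else "cloud"
```

In a full pipeline this check runs once per subtask, so only the genuinely hard steps of a decomposed query incur an API call.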
Code Example
# Task Decomposition prompt pattern (based on paper Appendix C.1)
decompose_prompt = """
I will now give you a [problem type] question.
Please break this problem down into several easy-to-solve steps.
Examples:
[8 hand-crafted few-shot examples]
Now the question is: {user_query}
Please decompose it into easy-to-solve steps.
Answer Format:
To solve the question "{user_query}", we need to know:
"1. {step1}", "2. {step2}", "3. {step3}"...
"""
# Dependency Construction prompt pattern (based on Appendix C.2)
dependency_prompt = """
Given the following subtasks for: [{original_question}]
Subtasks:
{subtask_list}
Please list the dependencies in the format:
'Subproblem A [xxx] -> Subproblem B [xxx]'
indicating that Subproblem A must be completed before Subproblem B.
Answer format (strictly follow, no explanation):
Step_i [sub-problem i] -> Step_j [sub-problem j]
"""
# Sentence embedding prompt for difficulty estimation (based on Section 4.4)
# Induces the SLM to compress subtask semantics into a single token
embedding_prompt = 'This sentence: "{subtask_text}" means in one word: '
# → The hidden state of the last generated token is fed into the difficulty-classification MLP
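A minimal sketch of such a plug-and-play head, in plain Python for illustration: a two-layer MLP mapping the frozen backbone's last-token hidden state to P(hard). Dimensions and initialization are illustrative (Llama 3-8B's actual hidden size is 4096), and a real version would train these weights on the self-generated feedback data.

```python
import math
import random

random.seed(0)

class DifficultyHead:
    """MLP head attached to a frozen SLM; only these weights would train."""

    def __init__(self, d_model=16, d_hidden=8):
        self.W1 = [[random.gauss(0, 0.02) for _ in range(d_hidden)]
                   for _ in range(d_model)]
        self.b1 = [0.0] * d_hidden
        self.w2 = [random.gauss(0, 0.02) for _ in range(d_hidden)]
        self.b2 = 0.0

    def predict(self, h):
        """h: last-token hidden state (list of floats, length d_model)."""
        # Hidden layer with ReLU activation.
        z = [max(0.0, sum(h[i] * self.W1[i][j] for i in range(len(h)))
                 + self.b1[j])
             for j in range(len(self.b1))]
        logit = sum(zj * wj for zj, wj in zip(z, self.w2)) + self.b2
        return 1.0 / (1.0 + math.exp(-logit))  # P(subtask is hard)
```

Because the backbone's parameters are never touched, the head can be detached (or swapped per benchmark) without affecting the SLM itself.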
Original Abstract
The rapid expansion of web content has made on-device AI assistants indispensable for helping users manage the increasing complexity of online tasks. The emergent reasoning ability in large language models offers a promising path for next-generation on-device AI agents. However, deploying full-scale Large Language Models (LLMs) on resource-limited local devices is challenging. In this paper, we propose Division-of-Thoughts (DoT), a collaborative reasoning framework leveraging the synergy between locally deployed Smaller-scale Language Models (SLMs) and cloud-based LLMs. DoT leverages a Task Decomposer to elicit the inherent planning abilities in language models to decompose user queries into smaller sub-tasks, which allows hybrid language models to fully exploit their respective strengths. Besides, DoT employs a Task Scheduler to analyze the pair-wise dependency of sub-tasks and create a dependency graph, facilitating parallel reasoning of sub-tasks and the identification of key steps. To allocate the appropriate model based on the difficulty of sub-tasks, DoT leverages a Plug-and-Play Adapter, which is an additional task head attached to the SLM that does not alter the SLM's parameters. To boost the adapter's task allocation capability, we propose a self-reinforced training method that relies solely on task execution feedback. Extensive experiments on various benchmarks demonstrate that our DoT significantly reduces LLM costs while maintaining competitive reasoning accuracy. Specifically, DoT reduces the average reasoning time and API costs by 66.12% and 83.57%, while achieving comparable reasoning accuracy with the best baseline methods.