Fixing Llama 3 Returning Only Function Calls: A Guide to Taming Your Language Model

2025-03-17 03:32:32

Llama 3 only returns function calls? Tame your language model!

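While working with the Llama 3 Instruct model recently, I ran into a problem: no matter what I asked, it only returned function calls. Even a simple question like "Who are you?" did not get a normal reply. This was puzzling. Was the model only capable of making function calls?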

Root cause: how the chat template interacts with tools

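After some investigation, I found that the problem most likely lies in how the chat template interacts with tools. Llama 3 Instruct models are specially trained to understand and perform function calls. When we supply tools (for example, get_current_temperature in the code below), the model gives priority to using them when responding to user input.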

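The original code passed the tools parameter straight into tokenizer.apply_chat_template, so even an ordinary conversational question got a reply in function-call format. When tools are passed, the default Llama 3.1/3.2 chat template inserts the tool definitions into the prompt together with an instruction to respond with a JSON function call. In effect, this "over-constrains" the model.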

Solution: let the model breathe

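To fix this, we need to give the model some "freedom" so that it can tell when to make a function call and when to simply hold a normal conversation. Below are several possible approaches.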

1. Refined prompt engineering

The most direct approach is to refine the prompt and state our intent to the model more explicitly.

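How it works: by rewording the user prompt, we tell the model more clearly what we want it to do. For questions that do not need a tool, the phrasing and context of the question can steer the model toward a normal reply.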

Code example:

from transformers import AutoTokenizer, AutoModelForCausalLM
import os
import torch

# transformers builds each tool's JSON schema from its type hints and docstring;
# every argument needs a description under "Args:", or apply_chat_template raises an error.
def get_current_temperature(location: str, unit: str) -> float:
    """
    Get the current temperature at a location.

    Args:
        location: The location to get the temperature for, in the format "City, Country"
        unit: The unit to return the temperature in. (choices: ["celsius", "fahrenheit"])
    """
    return 22.  # Dummy value for the demo

def get_current_wind_speed(location: str) -> float:
    """
    Get the current wind speed in km/h at a given location.

    Args:
        location: The location to get the wind speed for, in the format "City, Country"
    """
    return 6.  # Dummy value for the demo

tools = [get_current_temperature, get_current_wind_speed]

# Suppress MPS log message (optional)
os.environ["TORCH_MPS_DEVICE"] = "1"

checkpoint = "models/Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, device_map="cpu")

def ask_llama(question, use_tools=False):
    messages = [
        {"role": "user", "content": question}
    ]

    # Pass tools to the chat template only when a function call is wanted
    if use_tools:
        inputs = tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, return_dict=True, return_tensors="pt")
    else:
        inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_dict=True, return_tensors="pt")
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    out = model.generate(**inputs, max_new_tokens=128)
    # Decode only the newly generated tokens, skipping the prompt
    return tokenizer.decode(out[0][len(inputs["input_ids"][0]):], skip_special_tokens=True)

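# Questions that don't need tools
print(ask_llama("Who are you? Please answer directly without calling any tools.", use_tools=False))
print(ask_llama("What are Apple devices?", use_tools=False))

# Question that needs a tool
print(ask_llama("What is the temperature in Paris, France?", use_tools=True))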

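Steps:

Define an ask_llama function whose use_tools parameter controls whether tools are passed in.
For questions that don't need a function call, set use_tools=False and append an explicit instruction to the question (e.g., "Please answer directly without calling any tools.").
For questions that need a function call, set use_tools=True.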

Tip: explicit prompts help prevent unexpected output from the model and make its behavior easier to control.
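A minimal sketch of packaging this prompt-engineering step into a reusable helper. The name build_messages, the hint wording, and the optional system prompt are assumptions for illustration, not part of the original code; the returned list could replace the hard-coded messages inside ask_llama. A text hint is only a soft signal, so it is most reliable when combined with not passing tools at all.

```python
from typing import Dict, List, Optional

# Assumed wording for the "don't call tools" hint; tune it for your own model.
DIRECT_ANSWER_HINT = "Answer directly in plain text. Do not call any tools."

def build_messages(question: str, direct_answer: bool = False,
                   system_prompt: Optional[str] = None) -> List[Dict[str, str]]:
    """Build a chat message list, optionally steering the model away from tool calls."""
    messages: List[Dict[str, str]] = []
    if system_prompt:
        # A system message sets the assistant's behavior for the whole conversation.
        messages.append({"role": "system", "content": system_prompt})
    # Append the explicit instruction only when a plain-text answer is wanted.
    content = f"{question} {DIRECT_ANSWER_HINT}" if direct_answer else question
    messages.append({"role": "user", "content": content})
    return messages
```

For example, build_messages("Who are you?", direct_answer=True) yields a single user message with the hint appended.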

2. Conditional tool injection

We can go a step further and inject the tools parameter only when a tool call is actually needed.

How it works: program logic decides whether to pass the tools parameter to apply_chat_template, and it is passed only when necessary.

Code example (a modification of the code above):

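# Building on the code above, only the calls change.

# Question that doesn't need tools
print(ask_llama("Who are you?", use_tools=False))  # No tools, no extra instruction

# Question that needs a tool
print(ask_llama("What is the temperature in Paris, France?", use_tools=True))  # Tools passed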

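Improvement: compared with approach 1, the prompt no longer carries an extra "don't call tools" instruction; the boolean use_tools alone decides whether the tools are passed to the model.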
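Instead of choosing use_tools by hand for every question, the decision can be made in code. The sketch below uses a naive keyword match; the keyword table and the names needs_tools and matching_tools are hypothetical, and a real system might use a classifier or a first model pass instead.

```python
from typing import Dict, List

# Hypothetical trigger words for each tool; extend or replace them for real use.
TOOL_KEYWORDS: Dict[str, List[str]] = {
    "get_current_temperature": ["temperature", "how hot", "how cold"],
    "get_current_wind_speed": ["wind"],
}

def matching_tools(question: str) -> List[str]:
    """Return the names of tools whose keywords appear in the question."""
    text = question.lower()
    return [name for name, words in TOOL_KEYWORDS.items()
            if any(word in text for word in words)]

def needs_tools(question: str) -> bool:
    """Decide whether to pass tools to the chat template for this question."""
    return bool(matching_tools(question))
```

Usage would look like ask_llama(q, use_tools=needs_tools(q)). Plain substring matching is crude (for instance, "wind" also matches "window"), so treat this only as a starting point.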

3. Managing tools in multi-turn conversations

If we need tools in a multi-turn conversation, we can control more precisely when they are injected.

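How it works: in a multi-turn conversation, decide dynamically whether to inject tools based on the conversation history and the current user input.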
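Deciding based on the history usually means knowing whether the model's last reply was a function call or plain text. A minimal sketch, assuming the model emits a single JSON object with "name" and "parameters" keys (the JSON format the default Llama 3.1/3.2 template asks for); parse_tool_call is a hypothetical helper, and "arguments" is also accepted as a fallback key.

```python
import json
from typing import Any, Dict, Optional

def parse_tool_call(text: str) -> Optional[Dict[str, Any]]:
    """Return {"name": ..., "arguments": {...}} if text is a JSON tool call, else None."""
    try:
        data = json.loads(text.strip())
    except json.JSONDecodeError:
        return None  # Not JSON at all: treat it as a normal text reply
    if not isinstance(data, dict) or "name" not in data:
        return None  # JSON, but not shaped like a function call
    # Prefer "parameters"; fall back to "arguments" if the model used that key.
    args = data.get("parameters", data.get("arguments", {}))
    if not isinstance(args, dict):
        return None
    return {"name": data["name"], "arguments": args}
```

A reply for which parse_tool_call returns None can be shown to the user directly; anything else needs to be executed before the conversation continues.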

Code example:

def ask_llama_multi_turn(messages, use_tools=False):
    # Same as ask_llama, but takes the whole conversation history
    if use_tools:
        inputs = tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, return_dict=True, return_tensors="pt")
    else:
        inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_dict=True, return_tensors="pt")

    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    out = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(out[0][len(inputs["input_ids"][0]):], skip_special_tokens=True)

# Conversation history
messages = []

# Turn 1: no tools needed
messages.append({"role": "user", "content": "Hello!"})
response = ask_llama_multi_turn(messages, use_tools=False)
messages.append({"role": "assistant", "content": response})
print(f"Assistant: {response}")

# Turn 2: tool needed
messages.append({"role": "user", "content": "What is the wind speed in London right now?"})
response = ask_llama_multi_turn(messages, use_tools=True)
# The response here is the function-call text; this demo stores it as-is.
# A real application would execute the call and append its result first.
messages.append({"role": "assistant", "content": response})
print(f"Assistant: {response}")

# Turn 3: no tools needed
messages.append({"role": "user", "content": "Thank you"})
response = ask_llama_multi_turn(messages, use_tools=False)
messages.append({"role": "assistant", "content": response})
print(f"Assistant: {response}")

Steps:

Maintain a messages list that stores the conversation history.
For each turn, decide from its content whether to set use_tools to True.
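The multi-turn example stores the raw function-call text as the assistant's reply. A sketch of the fuller round trip, assuming the message shapes described in the Hugging Face chat-templating docs for tool use: an assistant message carrying tool_calls, followed by a tool message with the result. TOOL_REGISTRY and run_tool_call are hypothetical names, and the stub mirrors get_current_wind_speed from the article.

```python
from typing import Any, Callable, Dict, List

def get_current_wind_speed(location: str) -> float:
    """Stub mirroring the article's tool: always returns 6 km/h."""
    return 6.0

# Hypothetical mapping from tool name to the Python function that implements it.
TOOL_REGISTRY: Dict[str, Callable[..., Any]] = {
    "get_current_wind_speed": get_current_wind_speed,
}

def run_tool_call(messages: List[Dict[str, Any]], call: Dict[str, Any]) -> Any:
    """Execute a parsed tool call and record both the call and its result in the history."""
    func = TOOL_REGISTRY[call["name"]]
    result = func(**call["arguments"])
    # Record the call itself as an assistant turn.
    messages.append({"role": "assistant",
                     "tool_calls": [{"type": "function", "function": call}]})
    # Record the result as a tool turn so the next generation can use it.
    messages.append({"role": "tool", "name": call["name"], "content": str(result)})
    return result
```

After run_tool_call, calling ask_llama_multi_turn(messages, use_tools=True) once more should let the model phrase a final answer from the tool result, instead of the raw JSON ending up in the history.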

4. (Advanced) Custom chat template

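If none of the approaches above fully meets your needs, you can try a custom chat template. The reason is that, when tools are passed, Llama 3's default chat template adds the tool definitions and an instruction to answer with a function call to the prompt.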

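How it works: by modifying the tokenizer's chat_template attribute, we control the format of the prompt the model receives and avoid injecting function-call content indiscriminately.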

Example (not a complete runnable example; it only illustrates the approach):

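# Note: this is only a sketch of the idea; adapt it to your own model and tokenizer.

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# A simplified Llama 3 chat template: it renders system/user/assistant turns
# and ignores the tools argument entirely, so no function-call instructions are injected.
custom_template = (
    "{{ bos_token }}"
    "{% for message in messages %}"
    "{% if message['role'] in ['system', 'user'] %}"
    "{{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' + message['content'] + '<|eot_id|>' }}"
    "{% elif message['role'] == 'assistant' and message['content'] is defined and message['content'] is not none %}"
    "{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' + message['content'] + '<|eot_id|>' }}"
    "{% endif %}"
    "{% endfor %}"
    # Without this block, add_generation_prompt=True would have no effect
    "{% if add_generation_prompt %}"
    "{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}"
    "{% endif %}"
)

tokenizer.chat_template = custom_template

# Test
messages = [
    {"role": "user", "content": "Hey, who are you?"}
]

inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_dict=True, return_tensors="pt")
# ... rest of the code (generate and decode as in ask_llama above) ...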
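To check what a template like this produces without loading the model, the Llama 3 header format can be mirrored in plain Python. This is a sketch under assumptions: render_llama3 is a hypothetical helper, it prepends the <|begin_of_text|> BOS token and, for add_generation_prompt, appends an open assistant header as the Llama 3 prompt format expects, and it handles no tools at all. Its output can be compared with tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True).

```python
from typing import Dict, List, Optional

BOS_TOKEN = "<|begin_of_text|>"  # Llama 3 beginning-of-text token

def render_llama3(messages: List[Dict[str, Optional[str]]],
                  add_generation_prompt: bool = True) -> str:
    """Render messages in the Llama 3 header format; tools are ignored."""
    parts = [BOS_TOKEN]
    for message in messages:
        role, content = message["role"], message.get("content")
        # Skip turns without text (e.g. assistant tool calls) and other roles such as "tool"
        if role in ("system", "user", "assistant") and content is not None:
            parts.append(f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>")
    if add_generation_prompt:
        # Open an assistant turn so the model starts generating its reply
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)
```

Keeping a plain-Python mirror like this makes it easy to unit-test template changes before assigning them to tokenizer.chat_template.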