Reinforcement Learning Debugging: Fixing an Action/Reward Mismatch (Connect Four in Practice)

2025-03-26 22:27:12

Resolving Mismatched Action and Reward Records in Reinforcement Learning Training (A Connect Four Case Study)

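I. The Problem: Action and Reward Counts Don't Line Up

You are batch-training a Connect Four AI with reinforcement learning and have hit something odd: the number of recorded rewards is sometimes one more than the number of recorded actions. It doesn't happen every time, but when it does, the action count is exactly one short of the reward count. In principle, every action the agent takes should have exactly one reward record.

You also mentioned that the number of recorded states is sometimes odd, even though it should be even, since you record a state before and after each action. Your sense is that the reward records are accurate and that some actions are being dropped.

Looking at your trimmed-down snippet, the problem lies mainly in how the data is recorded.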
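Before changing anything, it can help to state the expected invariants explicitly. The sketch below is a hypothetical helper (the name `check_episode_records` and the toy data are assumptions for illustration, not part of your code) that flags any game whose lists break "one reward per action" or "two states per action":

```python
from typing import List, Sequence


def check_episode_records(
    actions: Sequence[Sequence[object]],
    rewards: Sequence[Sequence[object]],
    states: Sequence[Sequence[object]],
) -> List[str]:
    """Return one message per inconsistency found in the per-game record lists."""
    problems = []
    for i, (a, r, s) in enumerate(zip(actions, rewards, states)):
        if len(r) != len(a):
            problems.append(f"game {i}: {len(a)} actions but {len(r)} rewards")
        if len(s) != 2 * len(a):
            problems.append(f"game {i}: {len(a)} actions but {len(s)} states")
    return problems


# Toy data: game 0 is consistent, game 1 has an extra reward and an odd state count.
batch_actions = [[3, 4], [2, 2, 5]]
batch_rewards = [[0.0, 160], [0.0, 0.0, 0.0, -160]]
batch_states = [["s0", "s1", "s2", "s3"],
                ["s0", "s1", "s2", "s3", "s4", "s5", "s6"]]

for message in check_episode_records(batch_actions, batch_rewards, batch_states):
    print(message)
```

Running a check like this at the end of every batch shows which games are affected, which is usually more informative than the raw lengths alone.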

# ... (code omitted) ...
for epoch in range(N_EPOCHS):
    replays = []
    for train_episode_id in range(0,N_TRAINING_EPISODES,EPISODE_BATCH_SIZE):
        # ... (initialize lists such as batch_states, batch_rewards, batch_actions) ...
        games = []
        # ... (initialize games and turns) ...

        # ... (get the indices of unfinished games, not_ended_games_indices) ...

        while len(not_ended_games_indices) > 0 and moves < 80:
            if len(agent_games) > 0:
                # ... (prepare the agent's input states) ...
                for i in agent_games:
                    state = games[i].to_tensor(agent_turns[i])
                    batch_states[i].append(state) # <--- record the state before the agent's move
                    input_states.append(state.unsqueeze(0).to(device))
                # ... (agent model prediction) ...
                for i in range(len(agent_games)):
                    # ... (select the action) ...
                    batch_actions[agent_games[i]].append(agent_response) # <--- record the agent's action
                    games[agent_games[i]].player_move(agent_turns[agent_games[i]], agent_response)
                moves+=1

            if len(opponent_games) > 0:
                # ... (prepare the opponent's input states) ...
                for i in opponent_games:
                    state = games[i].to_tensor(opponent_turns[i])
                    if moves > 0:
                        batch_states[i].append(state) # <--- record the state before the opponent's move, but only under this condition
                    input_states.append(state.unsqueeze(0).to(device))
                # ... (opponent model prediction) ...
                for i in range(len(opponent_games)):
                    # ... (select and execute the opponent's action) ...
                    games[opponent_games[i]].player_move(opponent_turns[opponent_games[i]], opponent_response)

                    # After the opponent moves, if the game is not over, compute and record an intermediate reward
                    if games[opponent_games[i]].check_winner() == 0 and not games[opponent_games[i]].check_tie() and moves < 80:
                        new_score = games[opponent_games[i]].player_score(agent_turns[opponent_games[i]]) - games[opponent_games[i]].player_score(opponent_turns[opponent_games[i]])
                        if moves > 0:
                            batch_rewards[opponent_games[i]].append((new_score - scores[opponent_games[i]]) * DISCOUNT_RATE ** (len(batch_states[opponent_games[i]])//2) ) # <--- record the intermediate reward
                    scores[opponent_games[i]] = new_score # update the score for the next delta (this variable name is easy to misread)
                moves+=1

            # Check for finished games
            for i in range(len(not_ended_games_indices)): # should this iterate over the values of not_ended_games_indices rather than range(len(...))?
                current_game_index = not_ended_games_indices[i] # assuming this yields the correct game index
                if games[current_game_index].check_winner()!=0 or games[current_game_index].check_tie() or moves >= 80:
                    if not recorded_endings[current_game_index]:
                        # Record the terminal reward
                        if games[current_game_index].check_winner() == agent_turns[current_game_index]:
                            batch_rewards[current_game_index].append(160) # <--- terminal reward for a win
                        elif games[current_game_index].check_tie():
                            batch_rewards[current_game_index].append(0) # <--- terminal reward for a tie
                        elif games[current_game_index].check_winner() == opponent_turns[current_game_index]:
                            batch_rewards[current_game_index].append(-160) # <--- terminal reward for a loss
                        recorded_endings[current_game_index] = True
                        # Record the terminal state
                        state = games[current_game_index].to_tensor(agent_turns[current_game_index]) # <-- whose perspective? agent_turns[...] is used here; worth confirming
                        batch_states[current_game_index].append(state) # <--- record the terminal state

            # ... (update the list of unfinished games) ...

        # ... (compute the adjusted rewards, batch_rewards_adjusted) ...
        for i in range(len(batch_states)):
            print(len(batch_rewards[i]),len(batch_actions[i]),len(batch_states[i]),len(batch_move_types[i]))
            # ... (build the replays data) ...
            # Note the indexing used here: batch_actions[i][j//2], batch_rewards_adjusted[i][j//2]
            # This implies the action and reward counts should be about half the state count
            for j in range(0,len(batch_states[i])-1,2):
                # ... (replays.append) ...

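II. Root Cause: Confused Timing and Content in Data Recording

Standard reinforcement learning training, especially methods built on experience replay, collects a sequence of (State, Action, Reward, Next_State, Done) five-tuples (or a similar structure). The key point is that each tuple captures one complete feedback loop that follows a single action by the agent:

1. The agent observes the environment in state S.
2. The agent chooses and executes action A.
3. The environment returns reward R (a direct or indirect result of executing A).
4. The environment moves to a new state S'.
5. A flag (Done) marks whether S' ends the episode.

Now compare this against your code's logic to see where things go wrong: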
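As a rough illustration of that loop, here is a minimal sketch built on a toy one-dimensional environment. `Transition`, `toy_step`, and `collect_episode` are names invented for this example; they do not come from your code or from any library.

```python
from typing import List, NamedTuple, Tuple


class Transition(NamedTuple):
    state: int
    action: int
    reward: float
    next_state: int
    done: bool


def toy_step(state: int, action: int) -> Tuple[int, float, bool]:
    """Toy environment: walk along a line; reaching position 3 ends the episode."""
    next_state = state + action
    done = next_state >= 3
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def collect_episode(start: int = 0, max_steps: int = 10) -> List[Transition]:
    """Run the observe/act/reward/next-state loop, logging one tuple per action."""
    transitions = []
    state = start
    for _ in range(max_steps):
        action = 1  # fixed policy for the sketch: always step right
        next_state, reward, done = toy_step(state, action)
        transitions.append(Transition(state, action, reward, next_state, done))
        if done:
            break
        state = next_state
    return transitions


episode = collect_episode()
print(len(episode), episode[-1])
```

Because every tuple is created at the moment an action resolves, the number of actions, rewards, and state pairs cannot drift apart.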

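Action recording (batch_actions): actions are recorded only inside the agent_games loop, so only the agent's own moves are logged. That is fine in itself, since we normally train only the agent. len(batch_actions[i]) therefore equals the number of moves the agent actually made in game i.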
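Reward recording (batch_rewards):
- Intermediate rewards: inside the opponent_games loop, after the opponent moves and only if the game is still running and moves > 0, the score change new_score - scores[...] is computed and appended to batch_rewards. This reward is therefore computed and recorded only after the opponent acts.
- Terminal reward: when the game ends in a win, a tie, or a loss, one large terminal reward (160, 0, or -160) is appended. If the game stops because the move cap is reached with no winner and no tie, none of the three branches fires and nothing is appended.
- The problem: the reward list mixes two kinds of reward recorded at different times. Intermediate rewards follow opponent moves, the terminal reward follows the end of the game, and neither is tied to an agent action, so len(batch_rewards[i]) is computed on a completely different basis from len(batch_actions[i]). Suppose the agent makes N moves and the opponent makes M. Every opponent move that leaves the game running appends one intermediate reward, and the ending appends one more, so the total is about M + 1 when the agent makes the last move and M when the opponent does. When the agent moves first, both cases happen to come out to N. When the opponent moves first, the opponent has already moved once before the agent has acted at all, and if that opening move also logs a reward the total becomes N + 1. The moves > 0 guard looks intended to skip that opening move, but moves appears to be a single counter shared by the whole batch, so once any other game in the batch has had an agent move, the guard is already open. That is a likely source of len(rewards) == len(actions) + 1, and it would explain why the mismatch shows up only in some games.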
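State recording (batch_states):
- The state before each agent move is recorded.
- The state before each opponent move is recorded, but only when moves > 0.
- When the game ends, one terminal state is recorded.
- The problem: this logic is hard to reason about. First, the moves > 0 guard can drop the state before the opponent's first move (when the opponent moves first). Second, mixing pre-move states with a terminal state makes the length of the list hard to predict, and odd lengths arise easily. In effect len(states) is roughly len(agent_moves) + len(opponent_moves) + 1 (the extra entry being the terminal state), not the expected 2 * len(agent_moves) (one pre-action and one post-action state per agent action). More importantly, the (State, Next_State) pair you need is built around an agent action A: S is the state before A, and S' is the state after A once the environment has evolved, which usually means after the opponent has replied.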
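Replay buffer construction:
- The loop for j in range(0, len(batch_states[i])-1, 2) assumes batch_states holds states in (S, S') pairs. Given the analysis above, the way batch_states is recorded does not guarantee that.
- It indexes batch_actions and batch_rewards_adjusted with j//2, which implicitly assumes that the action and reward counts are about half the state count and that the entries correspond in order. As shown above, rewards are not recorded in step with actions.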
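One way to sanity-check how the counts can drift apart is a toy counting model of the recording rules in the snippet. This is a sketch under assumptions, not a replay of your code: it assumes the game ends by a win, loss, or tie (not the move cap), that each opponent move that leaves the game running tries to append one intermediate reward, and that the `moves > 0` guard may or may not already be open when an opponent makes the opening move. The function name `count_records` and its parameters are invented for this example.

```python
from typing import Tuple


def count_records(agent_first: bool, agent_ends_game: bool,
                  agent_moves: int, opening_guard_open: bool) -> Tuple[int, int]:
    """Return (num_actions, num_rewards) for one game under the snippet's rules."""
    if agent_moves < 1:
        raise ValueError("agent_moves must be at least 1")
    # The opponent's move count follows from who opens and who moves last.
    if agent_first:
        opponent_moves = agent_moves - 1 if agent_ends_game else agent_moves
    else:
        opponent_moves = agent_moves if agent_ends_game else agent_moves + 1
    # Every opponent move that leaves the game running logs an intermediate reward.
    intermediate = opponent_moves if agent_ends_game else opponent_moves - 1
    # An opponent's opening move logs nothing only while `moves > 0` is still false.
    if not agent_first and not opening_guard_open:
        intermediate -= 1
    return agent_moves, intermediate + 1  # plus the single terminal reward


for agent_first in (True, False):
    for agent_ends in (True, False):
        for guard_open in (False, True):
            actions, rewards = count_records(agent_first, agent_ends, 5, guard_open)
            print(agent_first, agent_ends, guard_open, actions, rewards)
```

In this model the counts differ only for opponent-first games in which the guard is already open, which is consistent with a mismatch that appears in some games of a batch and not in others.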

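To summarize: the core problem is that the recording logic does not strictly follow the reinforcement learning (S, A, R, S') framework. Reward recording is decoupled from the agent's actions, state recording is ad hoc, and as a result the data fails to line up when training samples are built.

III. The Fix: Unify Everything Around the Agent's Perspective

The goal is for the recorded data to form the agent's experience (S, A, R, S') unambiguously. That means every record should be built around one decision made by the agent.

Option 1: Rework the data recording flow (recommended)

This is the most thorough fix. Adjust the logic inside the while loop so that only (S, A, R, S') tuples relevant to the agent are collected.

Rationale:

We only care about the agent's learning process. So we need to record: in some state S the agent chose action A, and what that eventually led to (reward R and next state S'). Here S' is usually the state after the agent has executed A and the environment (including the opponent) has responded. The reward R is likewise the (possibly delayed) return that A brought about.

Steps and code example:

Change the data structure: stop maintaining batch_states, batch_actions, and batch_rewards separately. Instead, keep one experience list per game that holds (S, A, R, S', Done) tuples.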

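# At the start of the epoch loop
all_replays = [] # stores every experience from every batch

# Inside the train_episode_id loop, during initialization
# batch_size: the number of games in this batch
agent_experiences = [[] for _ in range(batch_size)]
# agent_experiences[i] stores the agent's (S, A, R, S', Done) sequence for game i

Rework the main loop (while):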

# ... (initialization) ...
temp_agent_data = {} # holds the (S, A) of an agent move that is still waiting for its R and S'

while len(not_ended_games_indices) > 0 and moves < 80:
    current_agent_game_indices = list(agent_games) # games where it is the agent's turn this round
    current_opponent_game_indices = list(opponent_games)

    # === Agent turn ===
    if len(agent_games) > 0:
        states_before_agent_move = []
        input_states_tensor = []
        for i in agent_games:
            state = games[i].to_tensor(agent_turns[i])
            states_before_agent_move.append(state)
            input_states_tensor.append(state.unsqueeze(0).to(device))

        # Each entry already has a batch dimension from unsqueeze(0), so concatenate along dim 0
        agent_q_values = agent(torch.cat(input_states_tensor))
        agent_actions_taken = []

        for i_idx, game_idx in enumerate(current_agent_game_indices):
            action_probs = torch.softmax(agent_q_values[i_idx], dim=0)
            agent_response = torch.multinomial(action_probs, 1).item()
            # Note the move type before any fallback, so a random replacement is labeled "Rand"
            move_type = "AI" if games[game_idx].check_valid_move(agent_response) else "Rand"
            # ... (handle invalid moves by picking a random valid one) ...

            # Execute the action
            games[game_idx].player_move(agent_turns[game_idx], agent_response)
            agent_actions_taken.append(agent_response)

            # Store the agent's State and Action temporarily, pending Reward and Next_State
            temp_agent_data[game_idx] = {
                'state': states_before_agent_move[i_idx],
                'action': agent_response,
                'move_type': move_type
            }
        moves += 1 # counting only valid rounds may be more sensible here, depending on your needs

    # === Opponent turn ===
    if len(opponent_games) > 0:
        # ... (likewise: get states, predict actions, execute actions) ...
        # Note: the opponent's actions do not go into the training data, but they must be executed to advance the game
        for game_idx in current_opponent_game_indices:
            # ... opponent selects and takes action ...
            games[game_idx].player_move(opponent_turns[game_idx], opponent_response)
        moves += 1

    # === Resolve state transitions and rewards (the key step) ===
    # Assumes a game sits in either agent_games or opponent_games during one iteration,
    # so a pending (S, A) is resolved once the opponent has replied or the game has ended.
    for game_idx in list(temp_agent_data.keys()):
        current_game = games[game_idx]
        agent_id = agent_turns[game_idx]
        winner = current_game.check_winner()
        is_tie = current_game.check_tie()
        game_over = winner != 0 or is_tie or moves >= 80 # hitting the move cap also ends the game

        # The agent just moved and the game goes on: wait for the opponent's reply
        if not game_over and game_idx not in current_opponent_game_indices:
            continue

        data = temp_agent_data.pop(game_idx) # retrieve the stored S and A
        s = data['state']
        a = data['action']
        m_type = data['move_type']

        # Next State (S'): the state after the agent's action A and,
        # unless the game ended, after the opponent's reply
        s_prime = current_game.to_tensor(agent_id)

        # Reward (R) and Done flag
        reward = 0  # sparse rewards: 0 unless the game was won or lost
        done = game_over
        if winner == agent_id:
            reward = 160  # win
        elif winner != 0: # the opponent won
            reward = -160 # loss
        # Tie or move cap: reward stays 0 (a small penalty for the cap is another option)
        # For denser rewards, a score-based intermediate reward could be added while the game continues, e.g.:
        #   new_score = current_game.player_score(agent_id) - current_game.player_score(opponent_turns[game_idx])
        #   reward = new_score - previous_score_map[game_idx] # requires maintaining previous_score_map
        #   previous_score_map[game_idx] = new_score

        # Append the complete (S, A, R, S', Done) tuple to this game's experience list
        agent_experiences[game_idx].append({
            'state': s,
            'action': a,
            'reward': reward,
            'next_state': s_prime,
            'done': done,
            'm_type': m_type # only needed to tell random moves from AI moves
        })

        # If the game ended during this round (for example the agent won), mark it as ended
        if done and not recorded_endings[game_idx]:
            recorded_endings[game_idx] = True
            # No separate terminal state or reward needs to go into the batch_xxx lists

    # ... (update not_ended_games_indices, agent_games, opponent_games) ...
    # When updating these lists, make sure games that ended this round are removed correctly

# === After the batch of games has finished ===
# agent_experiences[i] now holds the complete (S, A, R, S', Done) sequence for game i
# Compute discounted rewards (e.g. Monte Carlo returns or GAE) and build the final replays list

for i in range(batch_size):
    game_trajectory = agent_experiences[i]
    if not game_trajectory: continue # the game ended immediately or produced no experience

    # Compute discounted returns (Monte Carlo here, similar to your earlier batch_rewards_adjusted)
    discounted_rewards = [0] * len(game_trajectory)
    cumulative_reward = 0
    # Work backwards from the last step
    for t in reversed(range(len(game_trajectory))):
        # The last step starts the accumulation from its own reward; every earlier
        # step adds its reward to the discounted return of the step after it
        # Note: when done=True, the value of next_state should in theory be 0
        # For Monte Carlo, simply accumulate the actual rewards
        cumulative_reward = game_trajectory[t]['reward'] + DISCOUNT_RATE * cumulative_reward
        discounted_rewards[t] = cumulative_reward

    # Add to the overall replay list
    for t in range(len(game_trajectory)):
        exp = game_trajectory[t]
        all_replays.append({
            'state': exp['state'],
            'action': exp['action'],
            'reward': discounted_rewards[t], # use the computed discounted return
            'new_state': exp['next_state'],
            'done': exp['done'],
            'm_type': exp['m_type']
        })

Finally, the replays list (now all_replays) is ready to be used for training:

loss = train_step(all_replays, ...)
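As a quick check on the backward accumulation, here is a small standalone sketch. The helper name `discounted_returns` and the discount factor 0.5 are chosen only to make the arithmetic easy to verify by hand; your DISCOUNT_RATE will differ.

```python
from typing import List, Sequence


def discounted_returns(rewards: Sequence[float], gamma: float) -> List[float]:
    """Monte Carlo returns: G_t = r_t + gamma * G_{t+1}, computed back to front."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Sparse rewards: nothing until the final winning move.
print(discounted_returns([0, 0, 160], gamma=0.5))  # [40.0, 80.0, 160.0]
```

With sparse rewards, earlier moves receive a geometrically smaller share of the terminal reward, and the list of returns has exactly one entry per agent action.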


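**Additional safety and debugging tips:**

*   **Off-by-one errors**: be especially careful when handling sequences and indices, where off-by-one mistakes creep in easily. Print the lengths and contents of intermediate results often to confirm them.
*   **State perspective**: make sure `game.to_tensor(player_id)` always builds the state tensor from the correct player's perspective, especially for `state` and `next_state`. Normally both should be from the agent's perspective.
*   **Reward design**: sparse rewards (given only when the game ends) are simple to implement but can make learning slow. If you use intermediate rewards (such as score changes), make sure they can reasonably be attributed to the agent's preceding action. Treating the score change after the opponent's move as the direct reward for the agent's previous action can mislead learning.
*   **Terminal state `S'`**: for the step that ends the episode (`done=True`), `next_state` (`S'`) should in theory not affect the target of the Q-value update, because no further action follows. Account for this when computing the TD target (for example, set the target to `R` directly).
*   **Libraries**: if the problem persists or you want to simplify development, consider a mature reinforcement learning library (such as Stable Baselines3 or Ray RLlib) and an environment interface (such as PettingZoo for multi-agent settings, or a custom Gym wrapper). They encapsulate much of the low-level experience collection and bookkeeping.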
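The point about the terminal `S'` can be made concrete with a one-step Q-learning target. This is a minimal sketch in plain Python; the function name `td_target` and its arguments are assumptions for illustration, and a real training step would do the same thing on batched tensors.

```python
from typing import Sequence


def td_target(reward: float, next_q_values: Sequence[float],
              done: bool, gamma: float) -> float:
    """One-step TD target: R if the episode ended, else R + gamma * max_a Q(S', a)."""
    if done:
        return float(reward)  # no bootstrapping from a terminal next state
    return reward + gamma * max(next_q_values)


# Terminal step: the Q-values predicted for S' are ignored.
print(td_target(160, [5.0, 9.0], done=True, gamma=0.9))
# Non-terminal step: bootstrap from the best next action.
print(td_target(0.0, [1.0, 2.0], done=False, gamma=0.5))
```

Storing `done` with each transition is what makes this masking possible later, which is one more reason to record complete tuples.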

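**Advanced techniques:**

*   **Generalized Advantage Estimation (GAE):** for Actor-Critic methods, computing the advantage with GAE usually works better than plain Monte Carlo discounted returns and reduces variance. It requires also storing the value network's output `V(S)` while collecting experience.
*   **Prioritized Experience Replay (PER):** if you use a replay buffer, consider PER. Experiences with a larger TD error (where the model's prediction is further off) get higher priority and are sampled for training more often, which speeds up learning.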
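For reference, the GAE recursion over a single trajectory can be sketched as follows. This is a plain-Python illustration, not tied to any library; `compute_gae`, `values` (the critic's `V(S_t)` for each step), and `last_value` (the critic's estimate for the state after the final step) are names assumed for the example.

```python
from typing import List, Sequence


def compute_gae(rewards: Sequence[float], values: Sequence[float],
                dones: Sequence[bool], last_value: float,
                gamma: float, lam: float) -> List[float]:
    """GAE: A_t = delta_t + gamma * lam * A_{t+1}, cut off at episode ends."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        mask = 0.0 if dones[t] else 1.0  # no bootstrapping past a terminal step
        delta = rewards[t] + gamma * next_value * mask - values[t]
        gae = delta + gamma * lam * mask * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages


# Two-step episode ending in a reward of 1, with a critic that predicts 0.5 everywhere.
print(compute_gae([0.0, 1.0], [0.5, 0.5], [False, True],
                  last_value=0.0, gamma=1.0, lam=1.0))
```

With `lam=1` this reduces to the Monte Carlo return minus `V(S_t)`, and with `lam=0` it reduces to the one-step TD error, so `lam` trades variance against bias.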

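### Option 2: Adjust how the existing code assembles its data (not recommended, error-prone)

Keep your existing `batch_states`, `batch_actions`, and `batch_rewards`, and try to adjust the logic that finally assembles `replays`. This is hard, because the recording itself is flawed. You would have to work out exactly which reward belongs to which action, and which state comes before or after which action.

**Why it is not recommended:**

This approach treats the symptom, not the cause. The data source (the way records are made) is logically flawed, so any later patching is likely to go wrong and makes the code harder to understand and maintain. The problem you ran into stems from exactly this mismatch between the data and the logic that processes it.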

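## IV. Summary and Recommendations

I strongly recommend **Option 1**, that is, **reworking the data recording flow**. The core ideas are:

1.  **Clear goal:** collect experience for the agent in the form `(State, Action, Reward, Next_State, Done)`.
2.  **Agent-centric:** every record is built around one decision by the agent. Record the state `S` before the agent's action and the chosen action `A`.
3.  **Deferred R and S':** wait for the environment (including the opponent) to respond, then determine the reward `R` for action `A`, the resulting state `S'`, and whether the episode has ended (`Done`).
4.  **Independent storage:** store each game's experience sequence separately, which makes later processing (such as computing discounted returns) easier.
5.  **Assemble last:** once a game (or a batch of games) has finished, post-process the collected sequences (for example with GAE or Monte Carlo returns), then add them to the final replay buffer `all_replays`.

Done this way, the correspondence between actions, rewards, and states stays clear and accurate, which resolves both the count mismatch and the odd number of states and lets your reinforcement learning training run on correct data.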
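Points 2 to 4 can be sketched as a small recorder that holds an agent move until its outcome is known. This is a minimal illustration under assumptions: the class `PendingTransitions` and its methods are hypothetical names, and the states are placeholder strings where your code would use tensors.

```python
from typing import Any, Dict, List


class PendingTransitions:
    """Holds each game's latest (S, A) until its reward and next state are known."""

    def __init__(self, num_games: int) -> None:
        self.pending: Dict[int, Dict[str, Any]] = {}
        self.experiences: List[List[Dict[str, Any]]] = [[] for _ in range(num_games)]

    def agent_moved(self, game_idx: int, state: Any, action: int) -> None:
        """Call right after the agent acts: remember S and A."""
        self.pending[game_idx] = {"state": state, "action": action}

    def resolve(self, game_idx: int, reward: float, next_state: Any, done: bool) -> None:
        """Call after the opponent replies or the game ends: complete the tuple."""
        entry = self.pending.pop(game_idx, None)
        if entry is None:
            return  # no agent action is waiting, e.g. the opponent opened the game
        entry.update(reward=reward, next_state=next_state, done=done)
        self.experiences[game_idx].append(entry)


recorder = PendingTransitions(num_games=1)
recorder.resolve(0, 0.0, "s1", False)   # opponent opens: nothing to record
recorder.agent_moved(0, "s1", 3)
recorder.resolve(0, 0.0, "s3", False)   # opponent has replied
recorder.agent_moved(0, "s3", 4)
recorder.resolve(0, 160, "s4", True)    # agent's move won the game
print(len(recorder.experiences[0]), recorder.experiences[0][-1])
```

Because a record is only ever created from a stored agent action, an opponent's opening move cannot produce an orphan reward, and each game ends up with exactly one complete tuple per agent action.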