[Reinforcement Learning] Solving the Pendulum Problem with Deep Deterministic Policy Gradient (DDPG) + PyTorch Code Walkthrough



    1. Introduction to the Pendulum Problem

    Action: how much to turn left or right, measured as a torque (force times lever arm). Range: [-2, 2].
    State: cos(theta), sin(theta), thetadot (angular velocity).
    Reward: roughly, the more upright the pendulum is, the higher the reward; the further it deviates, the lower the reward.
    Episode end: the episode terminates after 200 steps, so the goal is to collect as much reward as possible within those 200 steps.
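
    You can confirm these spaces directly from Gym. A minimal check (my own sketch, assuming a pre-0.26 gym install with Pendulum-v1 registered):

    import gym

    env = gym.make('Pendulum-v1')
    print(env.observation_space)  # Box(3,): [cos(theta), sin(theta), thetadot]
    print(env.action_space)       # Box(1,): torque in [-2.0, 2.0]
    print(env.action_space.low, env.action_space.high)
    env.close()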



    2. The Deep Deterministic Policy Gradient (DDPG) Algorithm

    For a detailed introduction to the Deep Deterministic Policy Gradient (DDPG) algorithm, see my earlier post: [EasyRL Study Notes] Chapter 12: Deep Deterministic Policy Gradient (DDPG).
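
    As a quick reference, these are the core updates that the update() method in the code below implements, where Q is the critic, mu the actor, Q' and mu' their target copies, gamma the discount factor, and tau the soft-update coefficient:

    critic target:  y = r + gamma * (1 - done) * Q'(s', mu'(s'))
    critic loss:    L_critic = MSE(Q(s, a), y)
    actor loss:     L_actor = -mean(Q(s, mu(s)))
    soft update:    theta_target <- tau * theta + (1 - tau) * theta_target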

    3. Python Code in Practice

    3.1 Setup Before Running

    Prepare an RL_Utils.py file; you can get its contents from another post of mine: [RL Utilities] Common Reinforcement Learning Utility Functions (Python Code).

    This step matters: the script below imports this RL_Utils.py file.
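
    If you do not have the original RL_Utils.py, note that the script relies on its wildcard import to provide numpy (as np), random, os, torch, pathlib.Path, and three helpers: save_args, save_results, and plot_rewards. A minimal stand-in with the same call signatures (my own sketch, not the original file, which does more) could look like this:

    # RL_Utils.py (minimal stand-in for the original file from the linked post)
    import json
    import os
    import random

    import matplotlib.pyplot as plt
    import numpy as np
    import torch
    from pathlib import Path


    def save_args(arg_dict, path):
        # dump the hyperparameter dict to disk as JSON
        Path(path).mkdir(parents=True, exist_ok=True)
        with open(os.path.join(path, 'args.json'), 'w') as f:
            json.dump(arg_dict, f, indent=2)


    def save_results(res_dic, tag, path):
        # save each entry of the result dict as a .npy file
        Path(path).mkdir(parents=True, exist_ok=True)
        for key, value in res_dic.items():
            np.save(os.path.join(path, f'{tag}_{key}.npy'), np.array(list(value)))


    def plot_rewards(rewards, arg_dict, path, tag):
        # plot the reward curve and optionally save/show it
        Path(path).mkdir(parents=True, exist_ok=True)
        plt.figure()
        plt.title(f"{tag} rewards of {arg_dict['algo_name']} on {arg_dict['env_name']}")
        plt.plot(rewards, label='rewards')
        plt.xlabel('episode')
        plt.legend()
        if arg_dict.get('save_fig'):
            plt.savefig(os.path.join(path, f'{tag}_rewards.png'))
        if arg_dict.get('show_fig'):
            plt.show()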


    3.2 Main Code

    import warnings
    
    warnings.filterwarnings("ignore")
    import argparse
    import datetime
    import time
    import torch.optim as optim
    import torch.nn.functional as F
    import gym
    from torch import nn
    
    # Change this to the path of your own RL_Utils.py file
    from Python.ReinforcementLearning.EasyRL.RL_Utils import *
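    # Note: this script assumes the pre-0.26 Gym API, where env.reset() returns only the observation,
    # env.step() returns (obs, reward, done, info), and env.seed() is available.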
    
    
    # Rescale actions from the actor's tanh range [-1, 1] to the environment's action bounds
    class NormalizedActions(gym.ActionWrapper):
        def action(self, action):
            low_bound = self.action_space.low
            upper_bound = self.action_space.high
            action = low_bound + (action + 1.0) * 0.5 * (upper_bound - low_bound)
            action = np.clip(action, low_bound, upper_bound)
            return action
    
        def reverse_action(self, action):
            low_bound = self.action_space.low
            upper_bound = self.action_space.high
            action = 2 * (action - low_bound) / (upper_bound - low_bound) - 1
            action = np.clip(action, low_bound, upper_bound)
            return action
    
    
    # Ornstein–Uhlenbeck noise
    class OUNoise(object):
        def __init__(self, action_space, mu=0.0, theta=0.15, max_sigma=0.3, min_sigma=0.3, decay_period=100000):
            self.mu = mu  # OU noise parameter: long-run mean
            self.theta = theta  # OU noise parameter: mean-reversion rate
            self.sigma = max_sigma  # OU noise parameter: current noise scale (starts at max_sigma)
            self.max_sigma = max_sigma
            self.min_sigma = min_sigma
            self.decay_period = decay_period
            self.n_actions = action_space.shape[0]
            self.low = action_space.low
            self.high = action_space.high
            self.reset()
    
        def reset(self):
            self.obs = np.ones(self.n_actions) * self.mu
    
        def evolve_obs(self):
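            # One discrete Ornstein–Uhlenbeck step (dt = 1): dx = theta * (mu - x) + sigma * N(0, 1)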
            x = self.obs
            dx = self.theta * (self.mu - x) + self.sigma * np.random.randn(self.n_actions)
            self.obs = x + dx
            return self.obs
    
        def get_action(self, action, t=0):
            ou_obs = self.evolve_obs()
            self.sigma = self.max_sigma - (self.max_sigma - self.min_sigma) * min(1.0, t / self.decay_period)  # sigma decays linearly over decay_period (a no-op with the defaults, since max_sigma == min_sigma)
            return np.clip(action + ou_obs, self.low, self.high)  # add the noise to the action, then clip to the action bounds
    
    
    # Experience replay buffer
    class ReplayBuffer:
        def __init__(self, capacity):
            self.capacity = capacity  # capacity of the replay buffer
            self.buffer = []  # underlying storage
            self.position = 0
    
        def push(self, state, action, reward, next_state, done):
            ''' The buffer works like a ring buffer: once the capacity is exceeded, the oldest transition is overwritten.
            '''
            if len(self.buffer) < self.capacity:
                self.buffer.append(None)
            self.buffer[self.position] = (state, action, reward, next_state, done)
            self.position = (self.position + 1) % self.capacity
    
        def sample(self, batch_size):
            batch = random.sample(self.buffer, batch_size)  # randomly sample a mini-batch of transitions
            state, action, reward, next_state, done = zip(*batch)  # unzip into states, actions, etc.
            return state, action, reward, next_state, done
    
        def __len__(self):
            ''' Return the number of transitions currently stored.
            '''
            return len(self.buffer)
    
    
    # Actor network (given a state, outputs an action)
    class Actor(nn.Module):
        def __init__(self, n_states, n_actions, hidden_dim, init_w=3e-3):
            super(Actor, self).__init__()
            self.linear1 = nn.Linear(n_states, hidden_dim)
            self.linear2 = nn.Linear(hidden_dim, hidden_dim)
            self.linear3 = nn.Linear(hidden_dim, n_actions)
    
            self.linear3.weight.data.uniform_(-init_w, init_w)
            self.linear3.bias.data.uniform_(-init_w, init_w)
    
        def forward(self, x):
            x = F.relu(self.linear1(x))
            x = F.relu(self.linear2(x))
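            # tanh bounds the raw action to [-1, 1]; the NormalizedActions wrapper rescales it to the env's range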
            x = torch.tanh(self.linear3(x))
            return x
    
    
    # Critic network (given a state-action pair, outputs a score)
    class Critic(nn.Module):
        def __init__(self, n_states, n_actions, hidden_dim, init_w=3e-3):
            super(Critic, self).__init__()
    
            self.linear1 = nn.Linear(n_states + n_actions, hidden_dim)
            self.linear2 = nn.Linear(hidden_dim, hidden_dim)
            self.linear3 = nn.Linear(hidden_dim, 1)
            # initialize the output layer with small random values
            self.linear3.weight.data.uniform_(-init_w, init_w)
            self.linear3.bias.data.uniform_(-init_w, init_w)
    
        def forward(self, state, action):
            # concatenate state and action along dimension 1
            x = torch.cat([state, action], 1)
            x = F.relu(self.linear1(x))
            x = F.relu(self.linear2(x))
            x = self.linear3(x)
            return x
    
    
    # Deep Deterministic Policy Gradient agent
    class DDPG:
        def __init__(self, n_states, n_actions, arg_dict):
            self.device = torch.device(arg_dict['device'])
            # DDPG trains four networks: the Q network, the target Q network, the policy network, and the target policy network
            self.critic = Critic(n_states, n_actions, arg_dict['hidden_dim']).to(self.device)
            self.actor = Actor(n_states, n_actions, arg_dict['hidden_dim']).to(self.device)
            self.target_critic = Critic(n_states, n_actions, arg_dict['hidden_dim']).to(self.device)
            self.target_actor = Actor(n_states, n_actions, arg_dict['hidden_dim']).to(self.device)
    
            # copy parameters to the target networks
            for target_param, param in zip(self.target_critic.parameters(), self.critic.parameters()):
                target_param.data.copy_(param.data)
            for target_param, param in zip(self.target_actor.parameters(), self.actor.parameters()):
                target_param.data.copy_(param.data)
    
            self.critic_optimizer = optim.Adam(
                self.critic.parameters(), lr=arg_dict['critic_lr'])
            self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=arg_dict['actor_lr'])
            self.memory = ReplayBuffer(arg_dict['memory_capacity'])
            self.batch_size = arg_dict['batch_size']
            self.soft_tau = arg_dict['soft_tau']  # soft-update coefficient
            self.gamma = arg_dict['gamma']
    
        def choose_action(self, state):
            state = torch.FloatTensor(state).unsqueeze(0).to(self.device)
            action = self.actor(state)
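            # [0, 0] extracts the single scalar action from the (1, n_actions) output; fine for Pendulum's 1-D action space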
            return action.detach().cpu().numpy()[0, 0]
    
        def update(self):
            if len(self.memory) < self.batch_size:  # skip the update until the memory holds at least one batch
                return
            # randomly sample a batch of transitions from the replay memory
            state, action, reward, next_state, done = self.memory.sample(self.batch_size)
            # convert to tensors
            state = torch.FloatTensor(np.array(state)).to(self.device)
            next_state = torch.FloatTensor(np.array(next_state)).to(self.device)
            action = torch.FloatTensor(np.array(action)).to(self.device)
            reward = torch.FloatTensor(reward).unsqueeze(1).to(self.device)
            done = torch.FloatTensor(np.float32(done)).unsqueeze(1).to(self.device)
    
            policy_loss = self.critic(state, self.actor(state))
            policy_loss = -policy_loss.mean()
            next_action = self.target_actor(next_state)
            target_value = self.target_critic(next_state, next_action.detach())
            expected_value = reward + (1.0 - done) * self.gamma * target_value
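            # clamping to (-inf, inf) is effectively a no-op; some implementations clamp the target to a finite range here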
            expected_value = torch.clamp(expected_value, -np.inf, np.inf)
    
            value = self.critic(state, action)
            value_loss = nn.MSELoss()(value, expected_value.detach())
    
            self.actor_optimizer.zero_grad()
            policy_loss.backward()
            self.actor_optimizer.step()
            self.critic_optimizer.zero_grad()
            value_loss.backward()
            self.critic_optimizer.step()
            # soft update of the target networks
            for target_param, param in zip(self.target_critic.parameters(), self.critic.parameters()):
                target_param.data.copy_(
                    target_param.data * (1.0 - self.soft_tau) +
                    param.data * self.soft_tau
                )
            for target_param, param in zip(self.target_actor.parameters(), self.actor.parameters()):
                target_param.data.copy_(
                    target_param.data * (1.0 - self.soft_tau) +
                    param.data * self.soft_tau
                )
    
        def save_model(self, path):
            Path(path).mkdir(parents=True, exist_ok=True)
            torch.save(self.actor.state_dict(), path + 'checkpoint.pt')
    
        def load_model(self, path):
            self.actor.load_state_dict(torch.load(path + 'checkpoint.pt'))
    

    # Training function
    def train(arg_dict, env, agent):
        # start timing
        startTime = time.time()
        print(f"Environment: {arg_dict['env_name']}, Algorithm: {arg_dict['algo_name']}, Device: {arg_dict['device']}")
        print("Starting to train the agent......")
        ou_noise = OUNoise(env.action_space)  # exploration noise added to actions
        rewards = []  # rewards of all episodes
        ma_rewards = []  # moving-average rewards of all episodes
        for i_ep in range(arg_dict['train_eps']):
            state = env.reset()
            ou_noise.reset()
            done = False
            ep_reward = 0
            i_step = 0
            while not done:
                if arg_dict['train_render']:
                    env.render()
                i_step += 1
                action = agent.choose_action(state)
                action = ou_noise.get_action(action, i_step)
                next_state, reward, done, _ = env.step(action)
                ep_reward += reward
                agent.memory.push(state, action, reward, next_state, done)
                agent.update()
                state = next_state
            if (i_ep + 1) % 10 == 0:
                print(f'Episode:{i_ep + 1}/{arg_dict["train_eps"]}, Reward:{ep_reward:.2f}')
            rewards.append(ep_reward)
            if ma_rewards:
                ma_rewards.append(0.9 * ma_rewards[-1] + 0.1 * ep_reward)
            else:
                ma_rewards.append(ep_reward)
        print('Training finished, time used: ' + str(time.time() - startTime) + " s")
        # close the environment
        env.close()
        return {'episodes': range(len(rewards)), 'rewards': rewards, 'ma_rewards': ma_rewards}
    
    
    # Testing function
    def test(arg_dict, env, agent):
        startTime = time.time()
        print("开始测试智能体......")
        print(f"环境名: {arg_dict['env_name']}, 算法名: {arg_dict['algo_name']}, Device: {arg_dict['device']}")
        rewards = []  # 记录所有回合的奖励
        ma_rewards = []  # 记录所有回合的滑动平均奖励
        for i_ep in range(arg_dict['test_eps']):
            state = env.reset()
            done = False
            ep_reward = 0
            i_step = 0
            while not done:
                if arg_dict['test_render']:
                    env.render()
                i_step += 1
                action = agent.choose_action(state)
                next_state, reward, done, _ = env.step(action)
                ep_reward += reward
                state = next_state
            rewards.append(ep_reward)
            if ma_rewards:
                ma_rewards.append(0.9 * ma_rewards[-1] + 0.1 * ep_reward)
            else:
                ma_rewards.append(ep_reward)
            print(f"Epside:{i_ep + 1}/{arg_dict['test_eps']}, Reward:{ep_reward:.1f}")
        print("测试结束 , 用时: " + str(time.time() - startTime) + " s")
        env.close()
        return {'episodes': range(len(rewards)), 'rewards': rewards}
    
    
    # Create the environment and the agent
    def create_env_agent(arg_dict):
        env = NormalizedActions(gym.make(arg_dict['env_name']))  # wrap the env so actions are rescaled to its bounds
        env.seed(arg_dict['seed'])  # random seed
        n_states = env.observation_space.shape[0]
        n_actions = env.action_space.shape[0]
        agent = DDPG(n_states, n_actions, arg_dict)
        return env, agent
    
    
    if __name__ == '__main__':
        # Avoid the error: OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized.
        os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
        # current script directory
        curr_path = os.path.dirname(os.path.abspath(__file__))
        # current timestamp
        curr_time = datetime.datetime.now().strftime("%Y_%m_%d-%H_%M_%S")
        # hyperparameter settings
        parser = argparse.ArgumentParser(description="hyper parameters")
        parser.add_argument('--algo_name', default='DDPG', type=str, help="name of algorithm")
        parser.add_argument('--env_name', default='Pendulum-v1', type=str, help="name of environment")
        parser.add_argument('--train_eps', default=300, type=int, help="episodes of training")
        parser.add_argument('--test_eps', default=20, type=int, help="episodes of testing")
        parser.add_argument('--gamma', default=0.99, type=float, help="discounted factor")
        parser.add_argument('--critic_lr', default=1e-3, type=float, help="learning rate of critic")
        parser.add_argument('--actor_lr', default=1e-4, type=float, help="learning rate of actor")
        parser.add_argument('--memory_capacity', default=8000, type=int, help="memory capacity")
        parser.add_argument('--batch_size', default=128, type=int)
        parser.add_argument('--target_update', default=2, type=int)
        parser.add_argument('--soft_tau', default=1e-2, type=float)
        parser.add_argument('--hidden_dim', default=256, type=int)
        parser.add_argument('--device', default='cuda', type=str, help="cpu or cuda")
        parser.add_argument('--seed', default=520, type=int, help="seed")
        parser.add_argument('--show_fig', default=False, type=bool, help="if show figure or not")
        parser.add_argument('--save_fig', default=True, type=bool, help="if save figure or not")
        parser.add_argument('--train_render', default=False, type=bool,
                            help="Whether to render the environment during training")
        parser.add_argument('--test_render', default=True, type=bool,
                            help="Whether to render the environment during testing")
        args = parser.parse_args()
        default_args = {'result_path': f"{curr_path}/outputs/{args.env_name}/{curr_time}/results/",
                        'model_path': f"{curr_path}/outputs/{args.env_name}/{curr_time}/models/",
                        }
        # merge the parsed arguments and default paths into one dict
        arg_dict = {**vars(args), **default_args}
        print("Algorithm parameter dict:", arg_dict)
    
        # create the environment and the agent
        env, agent = create_env_agent(arg_dict)
        # pass in the parameters, environment, and agent, then start training
        res_dic = train(arg_dict, env, agent)
        print("Result dict returned by the algorithm:", res_dic)
        # save the results
        agent.save_model(path=arg_dict['model_path'])
        save_args(arg_dict, path=arg_dict['result_path'])
        save_results(res_dic, tag='train', path=arg_dict['result_path'])
        plot_rewards(res_dic['rewards'], arg_dict, path=arg_dict['result_path'], tag="train")
    
        # =================================================================================================
        # create a fresh environment and agent for testing
        print("=" * 300)
        env, agent = create_env_agent(arg_dict)
        # load the saved agent
        agent.load_model(path=arg_dict['model_path'])
        res_dic = test(arg_dict, env, agent)
        save_results(res_dic, tag='test', path=arg_dict['result_path'])
        plot_rewards(res_dic['rewards'], arg_dict, path=arg_dict['result_path'], tag="test")
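
    To run the script, executing it directly is enough (the filename below is just an example); numeric and string flags can be overridden on the command line:

    python ddpg_pendulum.py
    python ddpg_pendulum.py --device cpu --train_eps 100 --hidden_dim 128

    One caveat: the boolean flags (show_fig, save_fig, train_render, test_render) are declared with type=bool, and argparse converts any non-empty string to True, so passing --test_render False will not actually disable rendering. The simplest workaround is to change the corresponding default value in the script.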
    

    3.3 Results

    Partial output:

    Environment: Pendulum-v1, Algorithm: DDPG, Device: cuda
    Starting to train the agent......
    Episode:10/300, Reward:-514.63
    Episode:20/300, Reward:-257.10
    Episode:30/300, Reward:-374.82
    Episode:40/300, Reward:-374.79
    Episode:50/300, Reward:-376.84
    Episode:60/300, Reward:-618.92
    Episode:70/300, Reward:-256.16
    Episode:80/300, Reward:-626.38
    Episode:90/300, Reward:-644.01
    Episode:100/300, Reward:-534.69
    Episode:110/300, Reward:-499.96
    Episode:120/300, Reward:-472.34
    Episode:130/300, Reward:-261.39
    Episode:140/300, Reward:-380.94
    Episode:150/300, Reward:-384.42
    Episode:160/300, Reward:-744.95
    Episode:170/300, Reward:-407.66
    Episode:180/300, Reward:-744.57
    Episode:190/300, Reward:-380.46
    Episode:200/300, Reward:-504.49
    Episode:210/300, Reward:-505.49
    Episode:220/300, Reward:-634.64
    Episode:230/300, Reward:-626.34
    Episode:240/300, Reward:-599.63
    Episode:250/300, Reward:-624.56
    Episode:260/300, Reward:-878.27
    Episode:270/300, Reward:-746.17
    Episode:280/300, Reward:-509.80
    Episode:290/300, Reward:-499.35
    Episode:300/300, Reward:-636.66
    Training finished, time used: 327.0484480857849 s
    ============================================================================================================================================================================================================================================================================================================
    Starting to test the agent......
    Environment: Pendulum-v1, Algorithm: DDPG, Device: cuda
    Episode:1/20, Reward:-24.8
    Episode:2/20, Reward:-149.3
    Episode:3/20, Reward:-25.1
    Episode:4/20, Reward:-147.2
    Episode:5/20, Reward:-138.2
    Episode:6/20, Reward:-139.3
    Episode:7/20, Reward:-142.8
    Episode:8/20, Reward:-143.2
    Episode:9/20, Reward:-26.7
    Episode:10/20, Reward:-24.8
    Episode:11/20, Reward:-143.6
    Episode:12/20, Reward:-253.4
    Episode:13/20, Reward:-141.3
    Episode:14/20, Reward:-337.7
    Episode:15/20, Reward:-338.8
    Episode:16/20, Reward:-147.8
    Episode:17/20, Reward:-25.3
    Episode:18/20, Reward:-253.8
    Episode:19/20, Reward:-150.2
    Episode:20/20, Reward:-145.0
    Testing finished, time used: 30.252137422561646 s
    


    3.4 Visualization Settings

    If you find rendering time-consuming, you can turn it off in the settings.
    Conversely, if you want to watch the training process rendered, you can enable that as well; the relevant switches are shown below.
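
    Concretely, these switches live in the argparse section of the script. For example, to skip rendering during testing and only save the reward curves, change the defaults like this:

    parser.add_argument('--show_fig', default=False, type=bool, help="if show figure or not")
    parser.add_argument('--save_fig', default=True, type=bool, help="if save figure or not")
    parser.add_argument('--train_render', default=False, type=bool,
                        help="Whether to render the environment during training")
    parser.add_argument('--test_render', default=False, type=bool,
                        help="Whether to render the environment during testing")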


  • Original article: https://blog.csdn.net/weixin_51545953/article/details/127771492