• Reinforcement Learning in Practice (1): Landing a Spacecraft with Stable Baselines3 DQN


    This article shows how to use the DQN algorithm from Stable Baselines3 to solve the LunarLander (spacecraft landing) problem.

    Experiment 1

    1. Import libraries and create the environment

    import gym
    from stable_baselines3 import DQN
    from stable_baselines3.common.evaluation import evaluate_policy
    
    # Create environment
    env = gym.make("LunarLander-v2")
    

    (1) Understanding the environment
    Once the environment is created, we can inspect its action and observation spaces via env.action_space and env.observation_space. We can also draw random samples with env.action_space.sample() and env.observation_space.sample() to see what concrete actions and observations look like.

    print(env.action_space)
    print(env.action_space.sample())
    
    print(env.observation_space)
    print(env.observation_space.sample())
    
    Discrete(4)
    3
    Box([-inf -inf -inf -inf -inf -inf -inf -inf], [inf inf inf inf inf inf inf inf], (8,), float32)
    [-0.39453888  0.88357323 -2.6758633   0.26985604 -0.31590447 -0.5141233
      1.2682225   0.7396759 ]
    

    Next, we can consult the official Gym documentation for more detail on the action and observation spaces. For LunarLander-v2, the four discrete actions are: do nothing, fire the left orientation engine, fire the main engine, and fire the right orientation engine. The 8-dimensional observation holds the lander's x and y coordinates, its linear velocities in x and y, its angle, its angular velocity, and two booleans indicating whether each leg is in contact with the ground.
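    For quick reference while debugging, this mapping can be kept in code (an illustrative snippet of our own, not part of the Gym API):

    # Hypothetical lookup table for LunarLander-v2's four discrete actions.
    ACTION_MEANINGS = {
        0: "do nothing",
        1: "fire left orientation engine",
        2: "fire main engine",
        3: "fire right orientation engine",
    }
    print(ACTION_MEANINGS[env.action_space.sample()])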

    2. Create the model


    # Instantiate the agent
    model = DQN("MlpPolicy",
                env,
                tensorboard_log="./logs",
                verbose=1)
    

    Parameters
    The most commonly used parameters are explained below (a short example follows the list):

    • policy – The policy model to use
      ① MlpPolicy: the default multi-layer-perceptron policy for DQN (an alias of DQNPolicy)
      ② CnnPolicy: policy class for DQN when using images as input
      ③ MultiInputPolicy: policy class for DQN when using dict observations as input

    • tensorboard_log – the log location for tensorboard (if None, no logging)

    • verbose – Verbosity level: 0 for no output, 1 for info messages (such as device or wrappers used), 2 for debug messages

    • policy_kwargs (Optional[Dict[str, Any]]) – additional arguments to be passed to the policy on creation

    • seed – Seed for the pseudo random generators

    • device – Device (cpu, cuda, …) on which the code should be run. Setting it to auto, the code will be run on the GPU if possible.
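
    As a minimal sketch (the values below are illustrative, not tuned), these options can be combined at instantiation:

    # Illustrative example: fix the seed, pick the device explicitly,
    # and pass extra arguments to the policy.
    model = DQN("MlpPolicy",
                env,
                tensorboard_log="./logs",
                policy_kwargs={"net_arch": [64, 64]},  # the default MLP size
                seed=0,            # seed the pseudo-random generators
                device="auto",     # run on the GPU if one is available
                verbose=1)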

    3. Train the model


    # Train the agent and display a progress bar
    model.learn(total_timesteps=int(5e5),
                tb_log_name="DQN2",
                progress_bar=True)
    # Save the agent
    model.save("dqn_lunar")
    del model  # delete trained model to demonstrate loading
    

    Parameters
    The most commonly used parameters are explained below (a short example follows the list):

    • total_timesteps – The total number of samples (env steps) to train on
    • log_interval (int) – The number of timesteps before logging.
    • tb_log_name (str) – the name of the run for TensorBoard logging
    • progress_bar (bool) – Display a progress bar using tqdm and rich.
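
    As a minimal sketch (the run name "DQN_demo" and step count are illustrative):

    # log_interval controls how often training stats are written to the log;
    # tb_log_name groups this run under its own name in TensorBoard.
    model.learn(total_timesteps=int(1e5),
                log_interval=10,
                tb_log_name="DQN_demo",
                progress_bar=True)

    The resulting curves can then be viewed with tensorboard --logdir ./logs.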

    4. Evaluate the model

    Method 1

    # Load the trained agent
    # model = DQN.load("dqn_lunar", env=env, print_system_info=True)
    model = DQN.load("dqn_lunar", env=env)
    # Evaluate the agent
    # NOTE: If you use wrappers with your environment that modify rewards,
    #       this will be reflected here. To evaluate with original rewards,
    #       wrap environment in a "Monitor" wrapper before other wrappers
    #       (a Monitor sketch follows below).
    mean_reward, std_reward = evaluate_policy(model,
                                              model.get_env(),
                                              render=True,
                                              n_eval_episodes=10)
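
    A quick way to report the outcome (a small usage sketch):

    # Print the evaluation summary.
    print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")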
    
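    As the NOTE above suggests, to evaluate with the original rewards we can wrap a fresh environment in a Monitor before any reward-modifying wrappers (a minimal sketch):

    from stable_baselines3.common.monitor import Monitor

    # Monitor records the unmodified episode rewards, so evaluate_policy
    # reports them even if other wrappers later change the reward.
    eval_env = Monitor(gym.make("LunarLander-v2"))
    mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10)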

    Method 2

    model = DQN.load("dqn_lunar", env=env)
    # Evaluate the agent by rolling out complete episodes manually
    episodes = 10
    for ep in range(episodes):
        obs = env.reset()
        done = False
        rewards = 0
        while not done:
            # action = env.action_space.sample()  # random-action baseline
            action, _states = model.predict(obs, deterministic=True)
            obs, reward, done, info = env.step(action)
            env.render()
            rewards += reward
        print(rewards)  # total reward for this episode
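
    Note that this loop uses the classic Gym API (gym < 0.26). Under the newer gym/gymnasium API, reset() returns (obs, info) and step() returns five values; a minimal sketch of the same loop in that case:

    obs, info = env.reset()
    done = False
    rewards = 0
    while not done:
        action, _states = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated  # the episode ends on either condition
        rewards += reward
    print(rewards)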
    
    

    5. Appendix

    Code

    import gym
    from stable_baselines3 import DQN
    from stable_baselines3.common.evaluation import evaluate_policy
    
    # Create environment
    env = gym.make("LunarLander-v2")
    print(env.action_space)
    print(env.action_space.sample())
    # do nothing, fire left orientation engine, fire main engine, fire right orientation engine.
    print(env.observation_space)
    print(env.observation_space.sample())
    # the coordinates of the lander in x & y, its linear velocities in x & y, its angle, its angular velocity, 
    # and two booleans that represent whether each leg is in contact with the ground or not.
    # Instantiate the agent
    model = DQN("MlpPolicy",
                env,
                tensorboard_log="./logs",
                verbose=1)
    # Train the agent and display a progress bar
    model.learn(total_timesteps=int(5e5),
                tb_log_name="DQN2",
                progress_bar=True)
    # Save the agent
    model.save("dqn_lunar")
    del model  # delete trained model to demonstrate loading
    
    # Load the trained agent
    # NOTE: if you have loading issue, you can pass `print_system_info=True`
    # to compare the system on which the model was trained vs the current one
    # model = DQN.load("dqn_lunar", env=env, print_system_info=True)
    model = DQN.load("dqn_lunar", env=env)
    # Evaluate the agent
    episodes = 10
    for ep in range(episodes):
        obs = env.reset()
        done = False
        rewards = 0
        while not done:
    #         action = env.action_space.sample()
            action, _states = model.predict(obs, deterministic=True)
        obs, reward, done, info = env.step(action)
    #         env.render()
            rewards += reward
        print(rewards)
    
    

    Result
    Clearly the lander does not perform well: after descending to a certain height it starts hovering instead of landing, which does not meet the requirement. We need to adjust the training parameters.

    Experiment 2

    1. Modify the model parameters

    (1) Increase the learning rate learning_rate (the SB3 DQN default is 1e-4) to speed up convergence.
    (2) Modify the network architecture.
    Network parameters can be passed through policy_kwargs; looking at the MlpPolicy parameters shows that net_arch controls the network architecture.

    Inspecting the source code shows that the default architecture is [64, 64], so we change it to [256, 256].

    # Instantiate the agent
    model = DQN("MlpPolicy",
                env,
                tensorboard_log="./logs",
                learning_rate=5e-4,
                policy_kwargs={"net_arch": [256, 256]},
                verbose=1)
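
    To check that the new architecture took effect, we can print the policy (model.policy is a standard SB3 attribute); the output should show hidden layers of 256 units:

    # Inspect the Q-network; expect Linear layers of width 256.
    print(model.policy)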
    

