<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Reinforcement Learning - neptune.ai</title>
	<atom:link href="https://neptune.ai/blog/category/reinforcement-learning/feed" rel="self" type="application/rss+xml" />
	<link>https://neptune.ai/blog/category/reinforcement-learning</link>
	<description>The experiment tracker for foundation model training.</description>
	<lastBuildDate>Tue, 06 May 2025 11:31:38 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	

<image>
	<url>https://i0.wp.com/neptune.ai/wp-content/uploads/2022/11/cropped-Signet-1.png?fit=32%2C32&#038;ssl=1</url>
	<title>Reinforcement Learning - neptune.ai</title>
	<link>https://neptune.ai/blog/category/reinforcement-learning</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">211928962</site>	<item>
		<title>Reinforcement Learning From Human Feedback (RLHF) For LLMs</title>
		<link>https://neptune.ai/blog/reinforcement-learning-from-human-feedback-for-llms</link>
		
		<dc:creator><![CDATA[Michał Oleszak]]></dc:creator>
		<pubDate>Thu, 12 Sep 2024 11:00:00 +0000</pubDate>
				<category><![CDATA[LLMOps]]></category>
		<category><![CDATA[Reinforcement Learning]]></category>
		<guid isPermaLink="false">https://neptune.ai/?p=40623</guid>

					<description><![CDATA[Reinforcement Learning from Human Feedback (RLHF) has turned out to be the key to unlocking the full potential of today&#8217;s large language models (LLMs). There is arguably no better evidence for this than OpenAI’s GPT-3 model. It was released back in 2020, but it was only its RLHF-trained version dubbed ChatGPT that became an overnight&#8230;]]></description>
										<content:encoded><![CDATA[
<section id="note-block_c397b7f1417876f600d81b210dfabc45"
         class="block-note c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-note__header">
            TL;DR        </h3>
    
    <div class="block-note__content">
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>Reinforcement Learning from Human Feedback (RLHF) unlocked the full potential of today&#8217;s large language models (LLMs).</p>
                                    </div>

            </div>
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>By integrating human judgment into the training process, RLHF ensures that models not only produce coherent and useful outputs but also align more closely with human values, preferences, and expectations.</p>
                                    </div>

            </div>
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>The RLHF process consists of three steps: collecting human feedback in the form of a preference dataset, training a reward model to mimic human preferences, and fine-tuning the LLM using the reward model. The last step is enabled by the Proximal Policy Optimization (PPO) algorithm.</p>
                                    </div>

            </div>
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>Alternatives to RLHF include Constitutional AI, where the model learns to critique itself whenever it fails to adhere to a predefined set of rules, and Reinforcement Learning from AI Feedback (RLAIF), in which off-the-shelf LLMs replace humans as preference data providers.</p>
                                    </div>

            </div>
            </div>


</section>



<p>Reinforcement Learning from Human Feedback (RLHF) has turned out to be the key to unlocking the full potential of today&#8217;s large language models (LLMs). There is arguably no better evidence for this than OpenAI’s GPT-3 model. It was released back in 2020, but it was only its RLHF-trained version dubbed ChatGPT that became an overnight sensation, capturing the attention of millions and setting a new standard for conversational AI.</p>



<p>Before RLHF, the LLM training process typically consisted of a pre-training stage in which the model learned the general structure of the language and a fine-tuning stage in which it learned to perform a specific task. By integrating human judgment as a third training stage, RLHF ensures that models not only produce coherent and useful outputs but also align more closely with human values, preferences, and expectations. It achieves this through a feedback loop where human evaluators rate or rank the model&#8217;s outputs, which is then used to adjust the model’s behavior.</p>



<p>This article explores the intricacies of RLHF. We will look at its importance for language modeling, analyze its inner workings in detail, and discuss the best practices for implementation.</p>



<h2 class="wp-block-heading" id="h-importance-of-rlhf-in-llms">Importance of RLHF in LLMs</h2>



<p>When analyzing the importance of RLHF to language modeling, one could approach it from two different perspectives.</p>



<p>On the one hand, this technique has emerged as a response to the limitations of traditional supervised fine-tuning, such as its reliance on static datasets that are often limited in scope, context, and diversity and that rarely capture broader human values, ethics, or social norms. Additionally, traditional fine-tuning often struggles with tasks that involve subjective judgment or ambiguity, where there may be multiple valid answers. In such cases, a model might favor one answer over another based on the training data, even if the alternative would be more appropriate in a given context. RLHF provides a way to lift some of these limitations.</p>



<p>On the other hand, however, RLHF represents a paradigm shift in the fine-tuning of LLMs. It forms a standalone, transformative change in the evolution of AI rather than a mere incremental improvement over existing methods.</p>



<p>Let’s look at it from the latter perspective first.</p>



<p>The paradigm shift brought about by RLHF lies in the integration of human feedback directly into the training loop, enabling models to better align with human values and preferences. This approach prioritizes dynamic model-human interactions over static training datasets. By incorporating human insights throughout the training process, RLHF ensures that models are more context-aware and capable of handling the complexities of natural language.</p>



<p>I now hear you asking: “But how is injecting the human into the loop better than the traditional fine-tuning in which we train the model in a supervised fashion on a static dataset? Can’t we simply pass human preferences to the model by constructing a fine-tuning data set based on these preferences?“ That’s a fair question.</p>



<p>Consider succinctness as a preference for a text summarizing model. We could <a href="/blog/llm-fine-tuning-and-model-selection-with-neptune-transformers" target="_blank" rel="noreferrer noopener">fine-tune a Large Language Model</a> on concise summaries by training it in a supervised manner on the set of input-output pairs where input is the original text and output is the desired summary.</p>



<p>The problem here is that <a href="/blog/llm-evaluation-text-summarization" target="_blank" rel="noreferrer noopener">different summaries can be equally good</a>, and different groups of people will have preferences as to what level of succinctness is optimal in different contexts. When relying solely on traditional supervised fine-tuning, the model might learn to generate concise summaries, but it won&#8217;t necessarily grasp the subtle balance between brevity and informativeness that different users might prefer. This is where RLHF offers a distinct advantage.<br><br>In RLHF, we train the model on the following data set:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" fetchpriority="high" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_1.png?resize=1200%2C628&#038;ssl=1" alt="In RLHF, we train the model on the following data set.
Each example consists of the long input text, two alternative summaries, and a label that signals which of the two was preferred by a human annotator. By directly passing human preference to the model via the label indicating the “better” output, we can ensure it aligns with it properly." class="wp-image-40636" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_1.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_1.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_1.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_1.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_1.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_1.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_1.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_1.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_1.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="(max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<div id="separator-block_6052665fbf8e2d919c61b6e27d3d9cf0"
         class="block-separator block-separator--20">
</div>



<p>Each example consists of the long input text, two alternative summaries, and a label that signals which of the two was preferred by a human annotator. By directly passing human preference to the model via the label indicating the “better” output, we can ensure it aligns with it properly.</p>



<p>Let’s discuss how this works in detail.</p>


    <a
        href="/blog/llm-evaluation-text-summarization"
        id="cta-box-related-link-block_aa4fd0e883b6a6cf5f71e2dbaaeac2e7"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related                </div>
            </div>

        
                    <h3 class="c-header" id="h-llm-evaluation-for-text-summarization">                LLM Evaluation For Text Summarization            </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h2 class="wp-block-heading" id="h-the-rlhf-process">The RLHF process</h2>



<p>The RLHF process consists of three steps:</p>



<ol class="wp-block-list">
<li>Collecting human feedback.</li>



<li>Training a reward model.</li>



<li>Fine-tuning the LLM using the reward model.</li>
</ol>



<p>The algorithm enabling the last step in the process is the Proximal Policy Optimization (PPO) algorithm.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_2.png?resize=1200%2C628&#038;ssl=1" alt="High-Level overview of Reinforcement Learning from Human Feedback (RLHF). A reward model is trained on a preference dataset that includes the input, alternative outputs, and a label indicating which of the outputs is preferable. The LLM is fine-tuned through reinforcement learning with Proximal Policy Optimization (PPO)." class="wp-image-40637" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_2.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_2.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_2.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_2.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_2.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_2.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_2.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_2.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_2.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="(max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">High-Level overview of Reinforcement Learning from Human Feedback (RLHF). A reward model is trained on a preference dataset that includes the input, alternative outputs, and a label indicating which of the outputs is preferable. The LLM is fine-tuned through reinforcement learning with Proximal Policy Optimization (PPO).</figcaption></figure>
</div>


<h3 class="wp-block-heading" id="h-collecting-human-feedback">Collecting human feedback</h3>



<p>The first step in RLHF is to collect human feedback in the so-called preference dataset. In its simplest form, each example in this dataset consists of a prompt, two different answers produced by the LLM as the response to this prompt, and an indicator for which of the two answers was deemed better by a human evaluator.</p>



<p>The specific dataset formats differ and are not too important. The schematic dataset shown above used four fields: Input text, Summary 1, Summary 2, and Preference. <a previewlistener="true" href="https://huggingface.co/datasets/Anthropic/hh-rlhf?row=41" target="_blank" rel="noreferrer noopener nofollow">Anthropic’s hh-rlhf dataset</a> uses a different format: two columns with the chosen and rejected version of a dialogue between a human and an AI assistant, where the prompt is the same in both cases.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" width="978" height="324" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_13.png?resize=978%2C324&#038;ssl=1" alt="An example entry from Anthropic’s hh-rlhf preference dataset. The left column contains the prompt and the better answer produced by the model. The right column contains the exact same prompt and the worse answer, as judged by a human evaluator." class="wp-image-40647" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_13.png?w=978&amp;ssl=1 978w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_13.png?resize=768%2C254&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_13.png?resize=200%2C66&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_13.png?resize=220%2C73&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_13.png?resize=120%2C40&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_13.png?resize=160%2C53&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_13.png?resize=300%2C99&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_13.png?resize=480%2C159&amp;ssl=1 480w" sizes="(max-width: 978px) 100vw, 978px" /><figcaption class="wp-element-caption">An example entry from Anthropic’s hh-rlhf preference dataset. The left column contains the prompt and the better answer produced by the model. The right column contains the exact same prompt and the worse answer, as judged by a human evaluator. | <a href="https://huggingface.co/datasets/Anthropic/hh-rlhf?row=2" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>Regardless of the format, the information contained in the human preference data set is always the same: It’s not that one answer is good and the other is bad. It’s that one, albeit imperfect, is preferred over the other – it’s all about <em>preference.</em></p>
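

<p>To make this concrete, a single record in such a preference dataset might look roughly like the sketch below. The field names are purely illustrative and not tied to any particular dataset or library:</p>



<pre class="wp-block-code"><code># A minimal, hypothetical preference record for the summarization example.
# Real datasets use different schemas, e.g., Anthropic's hh-rlhf stores full
# "chosen"/"rejected" dialogues instead of separate fields.
preference_example = {
    "input_text": "A very long article about reinforcement learning ...",
    "summary_1": "RLHF aligns LLMs with human preferences.",
    "summary_2": "This text talks about reinforcement learning and other topics.",
    "preferred": 1,  # the human annotator preferred summary_1
}</code></pre>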



<p>Now you might wonder why the labelers are asked to pick one of two responses instead of, say, scoring each response on a scale. The problem with scores is that they are subjective. Scores provided by different individuals, or even two scores from the same labeler but on different examples, are not comparable.</p>



<p>So how do the labelers decide which of the two responses to pick? This is arguably the most important nuance in RLHF. The labelers are offered specific instructions outlining the evaluation protocol. For example, they might be instructed to pick the answers that don’t use swear words, the ones that sound more friendly, or the ones that don’t offer any dangerous information. What the instructions tell the labelers to focus on is crucial to the RLHF-trained model, as it will only align with those human values that are contained within these instructions.</p>



<p>More advanced approaches to building a preference dataset might involve humans ranking more than two responses to the same prompt. Consider three different responses: A, B, and C.</p>



<p>Human annotators have ranked them as follows, where “1” is best, and “3” is worst:</p>



<ul class="wp-block-list">
<li>A &#8211; 2</li>



<li>B &#8211; 1</li>



<li>C &#8211; 3</li>
</ul>



<p>Out of these, we can create three pairs resulting in three training examples:</p>



<div id="medium-table-block_ed5be6dc68b79292c924a040df8f5f9f"
     class="block-medium-table c-table__outer-wrapper  aligncenter l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--0">

    <table class="c-table">
                    <thead class="c-table__head">
            <tr>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Preferred response                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Non-preferred response                        </div>
                    </td>
                            </tr>
            </thead>
        
        <tbody class="c-table__body">

                    
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>B</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>A</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>A</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>C</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>B</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>C</p>
                                                            </div>
                        </td>

                    
                </tr>

                    
        </tbody>
    </table>

</div>
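

<p>As a rough sketch, expanding such a ranking into pairwise training examples takes only a few lines of Python. The function and field names below are illustrative, not part of any specific library:</p>



<pre class="wp-block-code"><code>from itertools import combinations

def ranking_to_pairs(ranked):
    """Expand {response_id: rank} (1 = best) into (preferred, non-preferred) pairs."""
    pairs = []
    for a, b in combinations(ranked, 2):
        if ranked[a] == ranked[b]:
            continue  # a tie carries no preference signal
        preferred = min((a, b), key=ranked.get)  # the lower rank number wins
        rejected = max((a, b), key=ranked.get)
        pairs.append((preferred, rejected))
    return pairs

# The ranking from the example above: "1" is best, "3" is worst.
print(ranking_to_pairs({"A": 2, "B": 1, "C": 3}))
# [('B', 'A'), ('A', 'C'), ('B', 'C')]</code></pre>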



<h3 class="wp-block-heading" id="h-training-a-reward-model">Training a reward model</h3>



<p>Once we have our preference dataset in place, we can use it to train a reward model (RM).</p>



<p>The reward model is typically also an LLM, often encoder-only, such as BERT. During training, the RM receives three inputs from the preference dataset: the prompt, the winning response, and the losing response. It produces one output, called a reward, for each of the two responses:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_3.png?resize=1200%2C628&#038;ssl=1" alt="Training a reward model: the reward model is typically also an LLM, often encoder-only, such as BERT. During training, the RM receives three inputs from the preference dataset: the prompt, the winning response, and the losing response. It produces two outputs, called rewards, for each of the responses." class="wp-image-40638" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_3.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_3.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_3.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_3.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_3.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_3.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_3.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_3.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_3.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<div id="separator-block_6052665fbf8e2d919c61b6e27d3d9cf0"
         class="block-separator block-separator--20">
</div>



<p>The training objective is to maximize the reward difference between the winning and losing response. An often-used loss function is the <a href="/blog/cross-entropy-loss-and-its-applications-in-deep-learning" target="_blank" rel="noreferrer noopener">cross-entropy loss</a> between the two rewards.</p>



<p>This way, the reward model learns to distinguish between more and less preferred responses, effectively ranking them based on the preferences encoded in the dataset. As the model continues to train, it becomes better at predicting which responses will likely be preferred by a human evaluator.</p>
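

<p>As a sketch, this pairwise objective can be written in PyTorch roughly as follows. It assumes the reward model outputs a scalar reward per prompt-response pair; variable names are illustrative:</p>



<pre class="wp-block-code"><code>import torch.nn.functional as F

def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: push the winning reward above the losing one.

    reward_chosen / reward_rejected are tensors of shape (batch,) produced by
    the reward model for the preferred and non-preferred responses.
    """
    # Equivalent to binary cross-entropy on the reward difference: the modeled
    # probability that "chosen" beats "rejected" is sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()</code></pre>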



<p>Once trained, the reward model serves as a simple regressor predicting the reward value for the given prompt-completion pair:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_4.png?resize=1200%2C628&#038;ssl=1" alt="Once trained, the reward model serves as a simple regressor predicting the reward value for the given prompt-completion pair." class="wp-image-40639" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_4.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_4.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_4.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_4.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_4.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_4.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_4.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_4.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_4.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<h3 class="wp-block-heading" id="h-fine-tuning-the-llm-with-the-reward-model">Fine-tuning the LLM with the reward model</h3>



<p>The third and final RLHF stage is fine-tuning. This is where the reinforcement learning takes place.</p>



<p>The fine-tuning stage requires another dataset that is different from the preference dataset. It consists of prompts only, which should be similar to what we expect our LLM to deal with in production. Fine-tuning teaches the LLM to produce aligned responses <em>for these prompts.</em></p>



<p>Specifically, the goal of fine-tuning is to train the LLM to produce completions that maximize the rewards given by the reward model. The training loop looks as follows:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_6.png?resize=1200%2C628&#038;ssl=1" alt="Fine-tuning the LLM with the reward model: first, we pass a prompt from the training set to the LLM and generate a completion. Next, the prompt and the completion are passed to the reward model, which in turn predicts the reward. This reward is fed into an optimization algorithm such as PPO, which then adjusts the LLM’s weights in a direction resulting in a better RM-predicted reward for the given training example." class="wp-image-40641" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_6.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_6.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_6.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_6.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_6.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_6.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_6.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_6.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_6.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<div id="separator-block_6052665fbf8e2d919c61b6e27d3d9cf0"
         class="block-separator block-separator--20">
</div>



<p>First, we pass a prompt from the training set to the LLM and generate a completion. Next, the prompt and the completion are passed to the reward model, which in turn predicts the reward. This reward is fed into an optimization algorithm such as PPO (more about it in the next section), which then adjusts the LLM’s weights in a direction resulting in a better RM-predicted reward for the given training example (not unlike <a href="/blog/deep-learning-optimization-algorithms" target="_blank" rel="noreferrer noopener">gradient descent</a> in traditional deep learning).</p>
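

<p>In pseudocode-like Python, the loop can be sketched as follows. The helpers <code>generate_completion</code>, <code>compute_reward</code>, and <code>ppo_update</code> are hypothetical placeholders for whatever LLM, reward model, and PPO implementation you use, not a specific library API:</p>



<pre class="wp-block-code"><code># Conceptual sketch of the RLHF fine-tuning loop (not a specific library API).
for epoch in range(num_epochs):
    for prompt in prompt_dataset:
        # 1. The current policy (the LLM being trained) generates a completion.
        completion = generate_completion(policy_llm, prompt)

        # 2. The frozen reward model scores the prompt-completion pair.
        reward = compute_reward(reward_model, prompt, completion)

        # 3. PPO adjusts the LLM's weights toward higher-reward completions.
        ppo_update(policy_llm, prompt, completion, reward)</code></pre>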



<h3 class="wp-block-heading" id="h-proximal-policy-optimization-ppo">Proximal Policy Optimization (PPO)</h3>



<p>One of the most popular optimizers for RLHF is the Proximal Policy Optimization algorithm or PPO. Let’s unpack this mouthful.</p>



<p>In the reinforcement learning context, the term “policy” refers to the strategy used by an agent to decide its actions. In the RLHF world, the policy is the LLM we are training, which decides which tokens to generate in its responses. Hence, “policy optimization” means we are optimizing the LLM’s weights.</p>



<p>What about “proximal”? The term &#8220;proximal&#8221; refers to the key idea in PPO of making only small, controlled changes to the policy during training. This prevents an issue all too common in traditional policy gradient methods, where large updates to the policy can sometimes lead to significant performance drops.</p>



<h4 class="wp-block-heading">PPO under the hood</h4>



<p>The PPO loss function is composed of three components:</p>



<ul class="wp-block-list">
<li><strong>Policy Loss:</strong> PPO’s primary objective when improving the LLM.</li>



<li><strong>Value Loss:</strong> Used to train the value function, which estimates the future rewards from a given state. The value function allows for computing the advantage, which in turn is used to update the policy.</li>



<li><strong>Entropy Loss:</strong> Encourages exploration by penalizing certainty in the action distribution, allowing the LLM to remain creative.</li>
</ul>



<p>The total PPO loss can be expressed as:</p>



<section id="note-block_ee009d8b0f94e34b3ac09d9cb383168c"
         class="block-note c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

    
    <div class="block-note__content">
                    <div class="c-item c-item--wysiwyg_editor">

                
                
                <div class="c-item__content">

                                            <p>L_PPO = L_POLICY + a × L_VALUE + b × L_ENTROPY</p>
                                    </div>

            </div>
            </div>


</section>



<p>where a and b are weight hyperparameters.</p>



<p>The entropy loss component is just the entropy of the probability distribution over the next tokens during generation. We don’t want this entropy to be too small, as that would discourage diversity in the produced texts.</p>



<p>The value loss component is computed step-by-step as the LLM generates subsequent tokens. At each step, the value loss is the difference between the actual future total reward (based on the full completion) and its current-step approximation through the so-called value function. Reducing the value loss trains the value function to be more accurate, resulting in better future reward prediction.</p>



<p>In the policy loss component, we use the value function to predict future rewards over different possible completions (next tokens). Based on these, we can estimate the so-called advantage term, which captures how much better or worse one completion is compared to all possible completions.</p>



<p>If the advantage term for a given completion is positive, it means that increasing the probability of this particular completion being generated will lead to a higher reward and, thus, a better-aligned model. Hence, we should tweak the LLM’s parameters such that this probability is increased.</p>
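

<p>For intuition, here is a simplified PyTorch-style sketch of the composite PPO loss described above, combining the clipped policy objective with the value and entropy terms. The clipping constant and the weights a and b are illustrative defaults, not values from any particular implementation:</p>



<pre class="wp-block-code"><code>import torch
import torch.nn.functional as F

def ppo_loss(logprobs_new, logprobs_old, advantages, values_pred, returns,
             entropy, a=0.5, b=0.01, clip_eps=0.2):
    """Simplified composite PPO loss: L_POLICY + a * L_VALUE + b * L_ENTROPY."""
    # Policy loss: clipped probability-ratio surrogate objective.
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Value loss: how far the value function's predictions are from realized returns.
    value_loss = F.mse_loss(values_pred, returns)

    # Entropy loss: negative entropy, so minimizing it keeps the policy exploratory.
    entropy_loss = -entropy.mean()

    return policy_loss + a * value_loss + b * entropy_loss</code></pre>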



<h4 class="wp-block-heading">PPO alternatives</h4>



<p>PPO is not the only optimizer used for RLHF. With the current pace of AI research, new alternatives spring up like mushrooms. Let’s take a look at a few worth mentioning.</p>



<p><a href="https://arxiv.org/pdf/2305.18290" target="_blank" rel="noreferrer noopener nofollow">Direct Preference Optimization (DPO)</a> is based on an observation that the cross-entropy loss used to train the reward model in RLHF can be directly applied to fine-tune the LLM. DPO is more efficient than PPO and has been shown to yield better response quality.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_8.png?resize=1200%2C628&#038;ssl=1" alt="Comparison between Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO). DPO (right) requires fewer steps as it does not use the reward model, unlike PPO (left)." class="wp-image-40643" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_8.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_8.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_8.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_8.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_8.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_8.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_8.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_8.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_8.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Comparison between Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO). DPO (right) requires fewer steps as it does not use the reward model, unlike PPO (left). | Modified based on: <a href="https://arxiv.org/pdf/2305.18290" target="_blank" rel="noreferrer noopener nofollow">Source</a> </figcaption></figure>
</div>
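

<p>A simplified sketch of the DPO objective is shown below. It assumes the sequence-level log-probabilities of the chosen and rejected responses under the trained policy and under a frozen reference model are already computed; beta is a temperature hyperparameter:</p>



<pre class="wp-block-code"><code>import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Simplified DPO objective: prefer the chosen response directly,
    relative to a frozen reference model, without training a reward model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()</code></pre>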


<p>Another interesting alternative to PPO is <a href="https://arxiv.org/pdf/2310.13639" target="_blank" rel="noreferrer noopener nofollow">Contrastive Preference Learning (CPL)</a>. Its proponents argue that PPO’s assumption that human preferences are distributed according to reward is incorrect; recent work suggests that they instead follow regret. Similarly to DPO, CPL circumvents the need for training a reward model. It replaces it with a regret-based model of human preferences trained with a contrastive loss.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="818" height="251" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_12.png?resize=818%2C251&#038;ssl=1" alt="A comparison between traditional RLHF and Contrastive Preference Learning (CPL). CPL uses a regret-based model instead of a reward model." class="wp-image-40646" style="width:810px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_12.png?w=818&amp;ssl=1 818w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_12.png?resize=768%2C236&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_12.png?resize=200%2C61&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_12.png?resize=220%2C68&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_12.png?resize=120%2C37&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_12.png?resize=160%2C49&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_12.png?resize=300%2C92&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_12.png?resize=480%2C147&amp;ssl=1 480w" sizes="auto, (max-width: 818px) 100vw, 818px" /><figcaption class="wp-element-caption">A comparison between traditional RLHF and Contrastive Preference Learning (CPL). CPL uses a regret-based model instead of a reward model. | <a href="https://arxiv.org/pdf/2310.13639" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<h2 class="wp-block-heading" id="h-best-practices-for-rlhf">Best practices for RLHF</h2>



<p>Let’s go back to the vanilla PPO-based RLHF. Having gone through the RLHF training process on a conceptual level, we’ll now discuss a couple of best practices to follow when implementing RLHF and the tools that might come in handy.</p>



<h3 class="wp-block-heading" id="h-avoiding-reward-hacking">Avoiding reward hacking</h3>



<p><a href="https://en.wikipedia.org/wiki/Reward_hacking" target="_blank" rel="noreferrer noopener nofollow">Reward hacking</a> is a prevalent issue in reinforcement learning. It refers to a situation where the agent has learned to cheat the system, maximizing the reward by taking actions that don’t align with the original objective.</p>



<p>In the context of RLHF, reward hacking means that the training has converged to a particularly unlucky place in the loss surface where the generated responses lead to high rewards for some reason, but don’t make much sense to the user.</p>



<p>Luckily, there is a simple trick that helps prevent reward hacking. During fine-tuning, we take advantage of the initial, frozen copy of the LLM (as it was before RLHF training) and pass it the same prompt that we pass to the LLM instance we are training.</p>



<p>Then, we compute the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence" target="_blank" rel="noreferrer noopener nofollow">Kullback-Leibler Divergence</a> between the responses from the original model and the model under training. KL Divergence measures the dissimilarity between the two responses. We want the responses to stay rather similar to make sure that the updated model does not diverge too far from its starting version. Thus, we treat the KL Divergence value as a “reward penalty” and subtract it from the reward before passing it to the PPO optimizer.</p>



<p>Incorporating this anti-reward-hacking trick into our fine-tuning pipeline yields the following updated version of the previous figure:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_7.png?resize=1200%2C628&#038;ssl=1" alt="To prevent reward hacking, we pass the prompt to two instances of the LLM: the one being trained and its frozen version from before the training. Then, we compute the reward penalty as the KL Divergence between the two models’ outputs and add it to the reward. " class="wp-image-40642" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_7.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_7.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_7.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_7.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_7.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_7.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_7.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_7.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_7.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<div id="separator-block_6052665fbf8e2d919c61b6e27d3d9cf0"
         class="block-separator block-separator--20">
</div>



<p>To prevent reward hacking, we pass the prompt to two instances of the LLM: the one being trained and its frozen version from before the training. Then, we compute the reward penalty as the KL Divergence between the two models’ outputs and subtract it from the reward. This prevents the trained model from diverging too much from its initial version.</p>
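

<p>A minimal sketch of this reward shaping, assuming per-token log-probabilities of the generated completion are available from both the trained and the frozen model (the penalty weight beta is illustrative):</p>



<pre class="wp-block-code"><code>def kl_penalized_reward(rm_reward, policy_logprobs, ref_logprobs, beta=0.02):
    """Shape the reward with a KL penalty against the frozen reference model.

    policy_logprobs / ref_logprobs: per-token log-probabilities of the generated
    completion under the trained LLM and its frozen pre-RLHF copy, respectively.
    """
    # Simple per-sequence KL estimate between the trained policy and its frozen copy.
    kl = (policy_logprobs - ref_logprobs).sum()
    # The penalty lowers the reward when the trained model drifts too far away.
    return rm_reward - beta * kl</code></pre>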



<h3 class="wp-block-heading" id="h-scaling-human-feedback">Scaling human feedback</h3>



<p>As you might have noticed, the RLHF process has one bottleneck: the collection of human feedback in the form of the preference dataset is a slow manual process that needs to be repeated whenever alignment criteria (labelers’ instructions) change. Can we completely remove humans from the process?</p>



<p>We can certainly reduce their engagement, thus making the process more efficient. One approach to doing this is model self-supervision, or “<a href="https://arxiv.org/abs/2212.08073" target="_blank" rel="noreferrer noopener nofollow">Constitutional AI</a>.”</p>



<p>The central point is the Constitution, which consists of a set of rules that should govern the model’s behavior (think: “do not swear,” “be friendly,” etc.). A human <a href="https://en.wikipedia.org/wiki/Red_team" target="_blank" rel="noreferrer noopener nofollow">red team</a> then prompts the LLM to generate harmful or misaligned responses. Whenever they succeed, they ask the model to critique its own responses according to the constitution and revise them. Finally, the model is trained using the red team’s prompts and the model’s revised responses.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_9.png?resize=1200%2C628&#038;ssl=1" alt="An overview of Constitutional AI. In this approach, the model is asked to follow a set of guidelines (“constitution”) and learns to critique its own misaligned responses according to it. " class="wp-image-40644" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_9.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_9.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_9.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_9.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_9.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_9.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_9.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_9.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_9.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">An overview of Constitutional AI. In this approach, the model is asked to follow a set of guidelines (“constitution”) and learns to critique its own misaligned responses according to it. | Modified based on: <a previewlistener="true" href="https://arxiv.org/abs/2212.08073" target="_blank" rel="noreferrer noopener nofollow">source</a></figcaption></figure>
</div>
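

<p>The critique-and-revision loop can be sketched conceptually as follows. Here, <code>llm</code> is a hypothetical text-generation function and the constitution entries are made-up examples:</p>



<pre class="wp-block-code"><code># Conceptual sketch of the Constitutional AI critique-and-revision step.
# "llm" is a hypothetical text-generation function, not a specific API.
constitution = [
    "Do not provide dangerous or illegal information.",
    "Be friendly and respectful.",
]

def revise_response(llm, red_team_prompt):
    response = llm(red_team_prompt)
    for rule in constitution:
        critique = llm(
            f"Response: {response}\nDoes this response violate the rule "
            f"'{rule}'? If so, explain how."
        )
        response = llm(
            f"Response: {response}\nCritique: {critique}\n"
            "Rewrite the response so that it follows the rule."
        )
    # The (red_team_prompt, revised response) pair becomes a fine-tuning example.
    return response</code></pre>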


<p><a previewlistener="true" href="https://arxiv.org/pdf/2309.00267">Reinforcement Learning from AI Feedback (RLAIF)</a> is yet another way to eliminate the need for human feedback. In this approach, one simply uses an off-the-shelf LLM to provide preferences for the preference dataset.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_14.png?resize=1200%2C628&#038;ssl=1" alt="A comparison between RLAIF (top) and RLHF (bottom). In RLAIF, an off-the-shelf LLM takes the place of the human to generate feedback in the form of a preference dataset." class="wp-image-40671" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_14.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_14.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_14.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_14.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_14.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_14.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_14.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_14.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_14.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">A comparison between RLAIF (top) and RLHF (bottom). In RLAIF, an off-the-shelf LLM takes the place of the human to generate feedback in the form of a preference dataset. | Modified based on: <a previewlistener="true" href="https://arxiv.org/pdf/2309.00267">s</a><a previewlistener="true" href="https://arxiv.org/pdf/2309.00267" target="_blank" rel="noreferrer noopener nofollow">ource</a></figcaption></figure>
</div>
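

<p>Analogously, RLAIF-style preference collection can be sketched with a hypothetical <code>judge_llm</code> function standing in for the off-the-shelf model:</p>



<pre class="wp-block-code"><code># Conceptual sketch of RLAIF preference labeling with an LLM "judge".
# "judge_llm" is a hypothetical function, not a specific API.
def collect_ai_preference(judge_llm, prompt, response_a, response_b):
    verdict = judge_llm(
        f"Prompt: {prompt}\nResponse A: {response_a}\nResponse B: {response_b}\n"
        "Which response is more helpful and harmless? Answer with 'A' or 'B'."
    )
    a_wins = verdict.strip().upper().startswith("A")
    return {
        "prompt": prompt,
        "chosen": response_a if a_wins else response_b,
        "rejected": response_b if a_wins else response_a,
    }</code></pre>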


<h2 class="wp-block-heading" id="h-tooling-and-frameworks">Tooling and frameworks</h2>



<p>Let’s briefly examine some available tools and frameworks that facilitate RLHF implementation.</p>



<h3 class="wp-block-heading" id="h-data-collection">Data collection</h3>



<p>Don’t have your preference dataset yet? Two great platforms that facilitate its collection are Prolific and Mechanical Turk.</p>



<p><a href="https://www.prolific.com/rlhf" target="_blank" rel="noreferrer noopener nofollow">Prolific</a> is a platform for collecting human feedback at scale that is useful for gathering preference data through surveys and experiments. Amazon’s <a href="https://www.mturk.com/" target="_blank" rel="noreferrer noopener nofollow">Mechanical Turk</a> (MTurk) service allows for outsourcing data labeling tasks to a large pool of human workers, commonly used for obtaining labels for machine-learning models.</p>



<p>Prolific is known for having a more curated and diverse participant pool. The platform emphasizes quality and typically recruits reliable participants with a history of providing high-quality data. MTurk, on the other hand, has a more extensive and varied participant pool, but it can be less curated. This means there may be a broader range of participant quality.</p>



<h3 class="wp-block-heading" id="h-end-to-end-rlhf-frameworks">End-to-end RLHF frameworks</h3>



<p>If you are a Google Cloud Platform (GCP) user, you can very easily take advantage of their <a href="https://cloud.google.com/blog/products/ai-machine-learning/rlhf-on-google-cloud" target="_blank" rel="noreferrer noopener nofollow">Vertex AI RLHF pipeline</a>. It abstracts away the whole training logic; all you need to do is supply the preference dataset (to train the reward model) and the prompt dataset (for the RL-based fine-tuning) and execute the pipeline.</p>



<p>The disadvantage is that since the training logic is abstracted away, it’s not straightforward to <a href="https://cloud.google.com/vertex-ai/generative-ai/docs/models/tune-text-models-rlhf" target="_blank" rel="noreferrer noopener nofollow">make custom changes</a>. However, this is a great place to start if you are just beginning your RLHF adventure or don’t have the time or resources to build custom implementations.</p>



<p>Alternatively, check out <a href="https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-chat" target="_blank" rel="noreferrer noopener nofollow">DeepSpeed Chat</a>, Microsoft’s open-source system for training and deploying chat models using RLHF, providing tools for data collection, model training, and deployment.</p>



<h2 class="wp-block-heading" id="h-conclusion">Conclusion</h2>



<p>We have discussed how important the paradigm shift brought about by RLHF is to training language models, making them aligned with human preferences. We analyzed the three steps of the RLHF training pipeline: collecting human feedback, training the reward model, and fine-tuning the LLM. Next, we took a more detailed look at Proximal Policy Optimization, the algorithm behind RLHF, while mentioning some alternatives. Finally, we discussed how to avoid reward hacking using KL Divergence and how to reduce human engagement in the process with approaches such as Constitutional AI and RLAIF. We also reviewed a couple of tools facilitating RLHF implementation.</p>



<p>You are now well-equipped to fine-tune your own large language models with RLHF! If you do, let me know what you built!</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">40623</post-id>	</item>
		<item>
		<title>Continuous Control With Deep Reinforcement Learning</title>
		<link>https://neptune.ai/blog/continuous-control-with-deep-reinforcement-learning</link>
		
		<dc:creator><![CDATA[Piotr Januszewski]]></dc:creator>
		<pubDate>Fri, 22 Jul 2022 06:32:47 +0000</pubDate>
				<category><![CDATA[ML Model Development]]></category>
		<category><![CDATA[Reinforcement Learning]]></category>
		<guid isPermaLink="false">https://neptune.test/continuous-control-with-deep-reinforcement-learning/</guid>

					<description><![CDATA[This time I want to explore how deep reinforcement learning can be utilized e.g. making a humanoid model walk. This kind of task is a continuous control task. A solution to such a task differs from the one you might know and use to play Atari games, like Pong, with e.g. Deep Q-Network (DQN). I’ll&#8230;]]></description>
										<content:encoded><![CDATA[
<p>This time I want to explore how deep <a href="/blog/reinforcement-learning-basics-markov-chain-tree-search" target="_blank" rel="noreferrer noopener">reinforcement learning</a> can be utilized, e.g., to make a humanoid model walk. This kind of task is a continuous control task. Solving it differs from what you might know from playing Atari games, like Pong, with, e.g., a Deep Q-Network (DQN).</p>



<p>I’ll talk about what characterizes continuous control environments. Then, I’ll introduce the actor-critic architecture and show an example of a state-of-the-art actor-critic method, Soft Actor-Critic (SAC). Finally, we will dive into the code: I’ll briefly explain how it is implemented in the amazing SpinningUp framework. Let’s go!</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-what-is-continuous-control">What is continuous control?</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Continuous-Control-with-Deep-Reinforcement-Learning_4.gif?ssl=1" alt="Continuous Control with Deep Reinforcement Learning" class="wp-image-53267"/><figcaption class="wp-element-caption"><em>Continuous Control with Deep Reinforcement Learning | Source: <a href="https://openai.com/blog/roboschool/" target="_blank" rel="noreferrer noopener nofollow">Roboschool</a></em></figcaption></figure>
</div>


<p>Meet Humanoid. It is a three-dimensional bipedal robot environment. Its observations are 376-dimensional vectors that describe the kinematic properties of the robot. Its actions are 17-dimensional vectors that specify torques to be applied to the robot joints. The goal is to run forward as fast as possible… and not fall over.</p>



<p>The actions are continuous-valued vectors. This is very different from the fixed set of possible actions that you might know from Atari environments. It requires the policy to return not the scores, or qualities, of all possible actions, but a single action to be executed. A different policy output requires a different training strategy, which we will explore in the next section.</p>
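

<p>To see the difference concretely, here is a toy sketch with random numbers standing in for the networks’ outputs:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import numpy as np

rng = np.random.default_rng(0)
state = rng.normal(size=376)  # a Humanoid-like observation (376-dimensional)

# Discrete case (e.g., Atari): a Q-network scores every action and we take the argmax.
# Here, q_values stands in for the network's output over, say, 6 discrete actions.
q_values = rng.normal(size=6)
discrete_action = int(np.argmax(q_values))

# Continuous case (e.g., Humanoid): there are infinitely many actions, so the policy
# must output the 17-dimensional torque vector directly; there is nothing to argmax over.
continuous_action = np.tanh(rng.normal(size=17))  # stands in for a policy network's output

print(discrete_action, continuous_action.shape)
</pre>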



<h2 class="wp-block-heading" class="wp-block-heading" id="h-what-is-mujoco">What is MuJoCo?</h2>



<p><a href="http://www.mujoco.org/" target="_blank" rel="noreferrer noopener nofollow">MuJoCo</a> is a fast and accurate physics simulation engine aimed at research and development in robotics, biomechanics, graphics, and animation. OpenAI Gym, and the Humanoid environment it includes, utilizes it for simulating the environment dynamics. I wrote the whole post about installing and using it <a href="/blog/installing-mujoco-to-work-with-openai-gym-environments" target="_blank" rel="noreferrer noopener">here</a>. We won’t need this for the matter of this post.</p>



<section id="blog-intext-cta-block_71731d2ad6235493602c657ab0ae4f14" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" class="block-blog-intext-cta__header" id="h-check-also">Check also</h3>
    
            <p>  <a href="/blog/best-reinforcement-learning-tutorials-examples-projects-and-courses" target="_blank" rel="noopener">Best Reinforcement Learning Tutorials, Examples, Projects, and Courses</a></p>
<p>  <a href="/blog/the-best-tools-for-reinforcement-learning-in-python" target="_blank" rel="noopener">The Best Tools for Reinforcement Learning in Python You Actually Want to Try</a></p>
<p>  <a href="/blog/7-applications-of-reinforcement-learning-in-finance-and-trading" target="_blank" rel="noopener">7 Applications of Reinforcement Learning in Finance and Trading</a></p>
    
    </section>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-off-policy-actor-critic-methods">Off-policy actor-critic methods</h2>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Continuous-Control-with-Deep-Reinforcement-Learning_5.png?resize=768%2C296&#038;ssl=1" alt="Reinforcement Learning" class="wp-image-53266" width="768" height="296"/><figcaption class="wp-element-caption"><em>Reinforcement Learning | Source: Sutton &amp; Barto, Reinforcement Learning: An Introduction, 2nd edition</em></figcaption></figure>
</div>


<p>Let’s recap: Reinforcement learning (RL) is learning what to do — how to map situations to actions — to maximize some notion of cumulative reward. RL consists of an agent that, in order to learn, acts in an environment. The environment provides a response to each agent’s action that is fed back to the agent. A reward is used as a reinforcing signal and a state is used to condition the agent&#8217;s decisions.</p>



<p>The goal is to find an optimal policy. The policy tells the agent how it should behave in whatever state it finds itself in. It is the agent&#8217;s map for achieving the environment’s objective.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Continuous-Control-with-Deep-Reinforcement-Learning_1.png?resize=768%2C434&#038;ssl=1" alt="Reinforcement learning" class="wp-image-53270" width="768" height="434"/><figcaption class="wp-element-caption"><em>Reinforcement learning | Source: <a href="https://www.researchgate.net/figure/A-high-level-block-diagram-of-the-actor-critic-reinforcement-learning-architecture-is_fig7_321666649" target="_blank" rel="noreferrer noopener nofollow">Researchgate</a></em></figcaption></figure>
</div>


<p>The actor-critic architecture, depicted in the diagram above, divides the agent into two pieces, the <strong>Actor</strong> and the <strong>Critic</strong>.&nbsp;</p>



<ul class="wp-block-list">
<li>The <strong>Actor</strong> represents the policy – it learns this mapping from states to actions.&nbsp;</li>



<li>The <strong>Critic</strong> represents the Q-function – it learns to evaluate how good each action is in every possible state. You can see that the actor uses the critic evaluations for improving the policy.</li>
</ul>



<p>Why use such a construct? If you already know Q-Learning (<a href="https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0" target="_blank" rel="noreferrer noopener nofollow">here</a> you can learn about it), you know that training the Q-function can be useful for solving an RL task. The Q-function will tell you how good each action is in any state. You can then simply pick the best action. It’s easy when you have a fixed set of actions: you simply evaluate each and every one of them and take the best!&nbsp;</p>



<p>However, what do you do when the action is continuous? You can’t evaluate every possible value. You could evaluate some values and pick the best, but that creates its own problems, e.g., resolution – how many values, and which ones, should you evaluate? The actor is the answer to these problems. It approximates the argmax operator from the discrete case: it is simply trained to predict the best action we would get if we could evaluate every possible action with the critic. Below, we describe the example of Soft Actor-Critic (SAC).</p>
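

<p>As a rough illustration (not the SpinningUp code itself), here is how a simple actor-critic pair could look in TensorFlow 2 / Keras. The layer sizes are arbitrary, and the actor is shown as a deterministic tanh network for brevity; SAC’s actual actor is stochastic and outputs a Gaussian over actions.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import tensorflow as tf

obs_dim, act_dim = 3, 1  # Pendulum-v0 dimensions, used later in this post

def mlp(sizes, output_activation=None):
    layers = [tf.keras.layers.Dense(size, activation="relu") for size in sizes[:-1]]
    layers.append(tf.keras.layers.Dense(sizes[-1], activation=output_activation))
    return tf.keras.Sequential(layers)

# Actor: maps a state to an action, squashed to [-1, 1] with tanh.
actor = mlp([256, 256, act_dim], output_activation="tanh")

# Critic: maps a (state, action) pair to a single Q-value.
critic = mlp([256, 256, 1])

state = tf.random.normal((1, obs_dim))
action = actor(state)                                  # the action to execute
q_value = critic(tf.concat([state, action], axis=-1))  # how good the critic thinks it is
print(action.numpy(), q_value.numpy())
</pre>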



<h3 class="wp-block-heading" class="wp-block-heading" id="h-sac-in-pseudo-code">SAC in pseudo-code</h3>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Continuous-Control-with-Deep-Reinforcement-Learning_2.png?resize=900%2C977&#038;ssl=1" alt="" class="wp-image-53269" width="900" height="977"/><figcaption class="wp-element-caption"><em>SAC in pseudo-code | Source: <a href="https://spinningup.openai.com/en/latest/algorithms/sac.html" target="_blank" rel="noreferrer noopener nofollow">Spinning Up in Deep RL</a></em></figcaption></figure>
</div>


<p>SAC’s critic is trained off-policy, meaning it can reuse data collected by older, less-trained versions of the policy. The off-policy critic training in lines 11-13 utilizes a technique very similar to that of DQN, e.g., it uses a target Q-network to stabilize training. Being off-policy makes SAC more sample-efficient than on-policy methods like <a href="https://spinningup.openai.com/en/latest/algorithms/ppo.html" target="_blank" rel="noreferrer noopener nofollow">PPO</a> because we can construct an experience replay buffer in which each collected data sample can be reused for training multiple times – contrary to on-policy training, where data is discarded after only one update!</p>
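

<p>A minimal replay buffer could look like the sketch below. This is a simplified stand-in, not the buffer defined in the repo’s sac.py:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import numpy as np

class ReplayBuffer:
    """A fixed-size FIFO buffer of (state, action, reward, next state, done) transitions."""

    def __init__(self, obs_dim, act_dim, size):
        self.obs = np.zeros((size, obs_dim), dtype=np.float32)
        self.next_obs = np.zeros((size, obs_dim), dtype=np.float32)
        self.acts = np.zeros((size, act_dim), dtype=np.float32)
        self.rews = np.zeros(size, dtype=np.float32)
        self.done = np.zeros(size, dtype=np.float32)
        self.ptr, self.size, self.max_size = 0, 0, size

    def store(self, obs, act, rew, next_obs, done):
        self.obs[self.ptr] = obs
        self.acts[self.ptr] = act
        self.rews[self.ptr] = rew
        self.next_obs[self.ptr] = next_obs
        self.done[self.ptr] = done
        self.ptr = (self.ptr + 1) % self.max_size  # overwrite the oldest data when full
        self.size = min(self.size + 1, self.max_size)

    def sample_batch(self, batch_size=256):
        idxs = np.random.randint(0, self.size, size=batch_size)
        return dict(obs=self.obs[idxs], acts=self.acts[idxs], rews=self.rews[idxs],
                    next_obs=self.next_obs[idxs], done=self.done[idxs])
</pre>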



<p>You can see the critics and the replay buffer being initialized in line 1, alongside the target critics in line 2. We use two critics to fight the overestimation error described in the papers “Double Q-learning” and “Addressing Function Approximation Error in Actor-Critic Methods”; you can learn more about it <a href="https://spinningup.openai.com/en/latest/algorithms/td3.html" target="_blank" rel="noreferrer noopener nofollow">here</a>. Then, the data is collected and fed to the replay buffer in lines 4-8. The policy is updated in line 14, and the target networks are updated in line 15.</p>
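

<p>In code, the Bellman target used for the critic update (lines 11-13 of the pseudo-code) can be computed roughly as below. This is a NumPy sketch with random numbers standing in for the networks’ outputs, not the repo’s implementation:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import numpy as np

gamma, alpha = 0.99, 0.2  # discount factor and entropy temperature

def sac_target(rews, done, q1_targ_next, q2_targ_next, logp_next_action):
    """Bellman backup for the critics (lines 11-13 of the pseudo-code).

    q1_targ_next, q2_targ_next: target critics' values for the next state and an action
        sampled from the current policy.
    logp_next_action: log-probability of that sampled action, i.e., the entropy term.
    """
    min_q_targ = np.minimum(q1_targ_next, q2_targ_next)  # take the min of the two target critics
    return rews + gamma * (1.0 - done) * (min_q_targ - alpha * logp_next_action)

# Toy usage on a batch of 4 transitions:
batch = 4
y = sac_target(np.random.rand(batch), np.zeros(batch),
               np.random.rand(batch), np.random.rand(batch), -np.ones(batch))
print(y)
</pre>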



<p>You may have noticed that both the critic and the actor updates include some additional log terms. This is the max-entropy regularization that keeps the agent from exploiting its, possibly imperfect, knowledge too much and rewards exploration of promising actions. If you want to understand it in detail, I recommend you read <a href="https://spinningup.openai.com/en/latest/algorithms/sac.html" target="_blank" rel="noreferrer noopener nofollow">this</a> resource.</p>
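

<p>For completeness, the actor update from line 14 maximizes the entropy-regularized critic value of the actions the policy proposes. Minimizing the loss below is one common way to write it; again, this is a stand-in sketch rather than the repo’s code:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import numpy as np

alpha = 0.2  # entropy temperature

def actor_loss(q1_pi, q2_pi, logp_pi):
    """Policy loss over a batch of states (line 14 of the pseudo-code).

    q1_pi, q2_pi: the critics' values for actions freshly sampled from the policy.
    logp_pi: log-probabilities of those actions, i.e., the max-entropy bonus.
    """
    min_q = np.minimum(q1_pi, q2_pi)
    return np.mean(alpha * logp_pi - min_q)  # lower when Q is high and entropy is high

print(actor_loss(np.array([1.0, 2.0]), np.array([1.5, 1.8]), np.array([-1.0, -0.5])))
</pre>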



<h2 class="wp-block-heading" class="wp-block-heading" id="h-soft-actor-critic-in-code">Soft actor-critic in code</h2>



<p>We will work with the <a href="https://github.com/awarelab/spinningup_tf2" target="_blank" rel="noreferrer noopener nofollow">Spinning Up in Deep RL &#8211; TF2 implementation</a> framework. The installation instructions are in the repo README. Note that you don’t have to install MuJoCo for now. We will run an example Soft Actor-Critic agent on the Pendulum-v0 environment from the OpenAI Gym suite. Let’s jump into it!</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-the-pendulum-v0-environment">The Pendulum-v0 environment</h3>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Continuous-Control-with-Deep-Reinforcement-Learning_3.gif?resize=317%2C317&#038;ssl=1" alt="" class="wp-image-53268" width="317" height="317"/><figcaption class="wp-element-caption"><em>The pendulum-v0 environment | Source: <a href="https://keras-gym.readthedocs.io/en/stable/notebooks/pendulum/ppo.html" target="_blank" rel="noreferrer noopener nofollow">Keras-Gym &#8211; Pendulum with PPO</a></em></figcaption></figure>
</div>


<p>Pendulum-v0 is the continuous control environment where:</p>



<div id="separator-block_2ddea7b818877ee67b3a95482958a87f"
         class="block-separator block-separator--15">
</div>



<div id="medium-table-block_e1e65ec6365135045597c629fca28779"
     class="block-medium-table c-table__outer-wrapper  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset">

    <table class="c-table">
                    <thead class="c-table__head">
            <tr>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Actions                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            The torque of only one joint in one dimension                        </div>
                    </td>
                            </tr>
            </thead>
        
        <tbody class="c-table__body">

                    
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>Observations</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Three-dimensional vectors, where the first two dimensions represent the pendulum position – they are cos and sin of the pendulum angle – and the third dimension is the pendulum angle velocity</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>Goal</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Spin the pendulum to the straight-up position and remain vertical, with the least angular velocity, and the least effort (torque introduced with the actions)</p>
                                                            </div>
                        </td>

                    
                </tr>

                    
        </tbody>
    </table>

</div>



<div id="separator-block_2ddea7b818877ee67b3a95482958a87f"
         class="block-separator block-separator--15">
</div>



<p>You can think of it as a simplified model of more complex robots like Humanoid, which is built from many similar, but two- or three-dimensional, joints.</p>
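

<p>If you want to poke around the environment yourself, a quick look at its spaces and a random rollout could look like the snippet below (assuming Gym is installed; in newer Gym releases the environment was renamed Pendulum-v1):</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import gym

env = gym.make("Pendulum-v0")
print(env.observation_space)  # Box(3,): cos(angle), sin(angle), angular velocity
print(env.action_space)       # Box(1,): a single torque value

obs = env.reset()
episode_return = 0.0
for _ in range(200):
    action = env.action_space.sample()          # random torque, just to see the interface
    obs, reward, done, info = env.step(action)  # reward is always zero or negative here
    episode_return += reward
    if done:
        break
print("Random-policy return:", episode_return)
</pre>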



<h2 class="wp-block-heading" class="wp-block-heading" id="h-training-the-sac-agent">Training the SAC agent</h2>



<p>In the repo, the SAC agent code is <a href="https://github.com/awarelab/spinningup_tf2/tree/main/spinup_bis/algos/tf2/sac" target="_blank" rel="noreferrer noopener nofollow">here</a>. The core.py file includes the actor-critic models’ factory method and other utilities. The sac.py includes the replay buffer definition and the implementation of the training algorithm presented above. I recommend you look through it and try to map lines of the pseudo-code from above to the actual implementation in that file. Then, check with my list:</p>



<ul class="wp-block-list">
<li>the initialization from lines 1-2 of the pseudo-code is implemented in lines <a href="https://github.com/awarelab/spinningup_tf2/blob/f6fd8f679d76d43d44a2388549eb2c123cf0e743/spinup_bis/algos/tf2/sac/sac.py#L159" target="_blank" rel="noreferrer noopener nofollow">159</a>&#8211;<a href="https://github.com/awarelab/spinningup_tf2/blob/f6fd8f679d76d43d44a2388549eb2c123cf0e743/spinup_bis/algos/tf2/sac/sac.py#L179" target="_blank" rel="noreferrer noopener nofollow">179</a> in the sac.py,</li>



<li>the main loop from line 3 of the pseudo-code is implemented in line <a href="https://github.com/awarelab/spinningup_tf2/blob/f6fd8f679d76d43d44a2388549eb2c123cf0e743/spinup_bis/algos/tf2/sac/sac.py#L264" target="_blank" rel="noreferrer noopener nofollow">264</a> in the sac.py,</li>



<li>the data collection from lines 4-8 of the pseudo-code is implemented in lines <a href="https://github.com/awarelab/spinningup_tf2/blob/f6fd8f679d76d43d44a2388549eb2c123cf0e743/spinup_bis/algos/tf2/sac/sac.py#L270" target="_blank" rel="noreferrer noopener nofollow">270</a>&#8211;<a href="https://github.com/awarelab/spinningup_tf2/blob/f6fd8f679d76d43d44a2388549eb2c123cf0e743/spinup_bis/algos/tf2/sac/sac.py#L295" target="_blank" rel="noreferrer noopener nofollow">295</a> in the sac.py,</li>



<li>the update handling from lines 9-11 of the pseudo-code is implemented in lines <a href="https://github.com/awarelab/spinningup_tf2/blob/f6fd8f679d76d43d44a2388549eb2c123cf0e743/spinup_bis/algos/tf2/sac/sac.py#L298" target="_blank" rel="noreferrer noopener nofollow">298</a>&#8211;<a href="https://github.com/awarelab/spinningup_tf2/blob/f6fd8f679d76d43d44a2388549eb2c123cf0e743/spinup_bis/algos/tf2/sac/sac.py#L300" target="_blank" rel="noreferrer noopener nofollow">300</a> in the sac.py,</li>



<li>the parameters update from lines 12-15 of the pseudo-code is called in line <a href="https://github.com/awarelab/spinningup_tf2/blob/f6fd8f679d76d43d44a2388549eb2c123cf0e743/spinup_bis/algos/tf2/sac/sac.py#L301" target="_blank" rel="noreferrer noopener nofollow">301</a> and implemented in lines <a href="https://github.com/awarelab/spinningup_tf2/blob/f6fd8f679d76d43d44a2388549eb2c123cf0e743/spinup_bis/algos/tf2/sac/sac.py#L192" target="_blank" rel="noreferrer noopener nofollow">192</a>&#8211;<a href="https://github.com/awarelab/spinningup_tf2/blob/f6fd8f679d76d43d44a2388549eb2c123cf0e743/spinup_bis/algos/tf2/sac/sac.py#L240" target="_blank" rel="noreferrer noopener nofollow">240</a> in the sac.py,</li>



<li>and the rest of the code in the sac.py is mostly logging handling and some more boilerplate code.</li>
</ul>



<p>The example training in the Pendulum-v0 environment is implemented in run_example.py in the repo root. Simply run it with <code><span class="c-code-snippet">python run_example.py</span></code>. After 200,000 environment steps, the training will automatically finish and save the trained model in the ./out/checkpoint directory.</p>



<p>Below is an example log from the beginning and the end of the training. Note how AverageTestEpRet improved, from a large negative number to something much closer to zero, which is the maximum possible return. Returns are negative because the agent is penalized whenever the pendulum is not in the goal position: vertical, with zero angular velocity and zero torque.</p>



<p>The training took 482 seconds (around 8 minutes) on my MacBook with the Intel i5 processor.</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-before-training">Before training</h3>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">---------------------------------------
|      AverageEpRet |       <span class="hljs-number" style="color: teal;">-1.48e+03</span> |
|          StdEpRet |             <span class="hljs-number" style="color: teal;">334</span> |
|          MaxEpRet |            <span class="hljs-number" style="color: teal;">-973</span> |
|          MinEpRet |       <span class="hljs-number" style="color: teal;">-1.89e+03</span> |
|  AverageTestEpRet |        <span class="hljs-number" style="color: teal;">-1.8e+03</span> |
|      StdTestEpRet |             <span class="hljs-number" style="color: teal;">175</span> |
|      MaxTestEpRet |       <span class="hljs-number" style="color: teal;">-1.48e+03</span> |
|      MinTestEpRet |       <span class="hljs-number" style="color: teal;">-1.94e+03</span> |
|             EpLen |             <span class="hljs-number" style="color: teal;">200</span> |
|         TestEpLen |             <span class="hljs-number" style="color: teal;">200</span> |
| TotalEnvInteracts |           <span class="hljs-number" style="color: teal;">2e+03</span> |
|     AverageQ1Vals |       <span class="hljs-number" style="color: teal;">-4.46e+03</span> |
|         StdQ1Vals |         <span class="hljs-number" style="color: teal;">7.1e+04</span> |
|         MaxQ1Vals |           <span class="hljs-number" style="color: teal;">0.744</span> |
|         MinQ1Vals |           <span class="hljs-number" style="color: teal;">-63.3</span> |
|     AverageQ2Vals |       <span class="hljs-number" style="color: teal;">-4.46e+03</span> |
|         StdQ2Vals |        <span class="hljs-number" style="color: teal;">7.11e+04</span> |
|         MaxQ2Vals |            <span class="hljs-number" style="color: teal;">0.74</span> |
|         MinQ2Vals |           <span class="hljs-number" style="color: teal;">-63.5</span> |
|      AverageLogPi |           <span class="hljs-number" style="color: teal;">-35.2</span> |
|          StdLogPi |             <span class="hljs-number" style="color: teal;">562</span> |
|          MaxLogPi |            <span class="hljs-number" style="color: teal;">3.03</span> |
|          MinLogPi |           <span class="hljs-number" style="color: teal;">-8.33</span> |
|            LossPi |            <span class="hljs-number" style="color: teal;">17.4</span> |
|            LossQ1 |            <span class="hljs-number" style="color: teal;">2.71</span> |
|            LossQ2 |            <span class="hljs-number" style="color: teal;">2.13</span> |
|    StepsPerSecond |        <span class="hljs-number" style="color: teal;">4.98e+03</span> |
|              Time |             <span class="hljs-number" style="color: teal;">3.8</span> |
---------------------------------------
</pre>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-after-training">After training</h3>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">---------------------------------------
|      AverageEpRet |            <span class="hljs-number" style="color: teal;">-176</span> |
|          StdEpRet |            <span class="hljs-number" style="color: teal;">73.8</span> |
|          MaxEpRet |           <span class="hljs-number" style="color: teal;">-9.95</span> |
|          MinEpRet |            <span class="hljs-number" style="color: teal;">-250</span> |
|  AverageTestEpRet |            <span class="hljs-number" style="color: teal;">-203</span> |
|      StdTestEpRet |            <span class="hljs-number" style="color: teal;">55.3</span> |
|      MaxTestEpRet |            <span class="hljs-number" style="color: teal;">-129</span> |
|      MinTestEpRet |            <span class="hljs-number" style="color: teal;">-260</span> |
|             EpLen |             <span class="hljs-number" style="color: teal;">200</span> |
|         TestEpLen |             <span class="hljs-number" style="color: teal;">200</span> |
| TotalEnvInteracts |           <span class="hljs-number" style="color: teal;">2e+05</span> |
|     AverageQ1Vals |       <span class="hljs-number" style="color: teal;">-1.56e+04</span> |
|         StdQ1Vals |        <span class="hljs-number" style="color: teal;">2.48e+05</span> |
|         MaxQ1Vals |           <span class="hljs-number" style="color: teal;">-41.8</span> |
|         MinQ1Vals |            <span class="hljs-number" style="color: teal;">-367</span> |
|     AverageQ2Vals |       <span class="hljs-number" style="color: teal;">-1.56e+04</span> |
|         StdQ2Vals |        <span class="hljs-number" style="color: teal;">2.48e+05</span> |
|         MaxQ2Vals |           <span class="hljs-number" style="color: teal;">-42.9</span> |
|         MinQ2Vals |            <span class="hljs-number" style="color: teal;">-380</span> |
|      AverageLogPi |             <span class="hljs-number" style="color: teal;">475</span> |
|          StdLogPi |        <span class="hljs-number" style="color: teal;">7.57e+03</span> |
|          MaxLogPi |            <span class="hljs-number" style="color: teal;">7.26</span> |
|          MinLogPi |           <span class="hljs-number" style="color: teal;">-10.6</span> |
|            LossPi |            <span class="hljs-number" style="color: teal;">61.6</span> |
|            LossQ1 |            <span class="hljs-number" style="color: teal;">2.01</span> |
|            LossQ2 |            <span class="hljs-number" style="color: teal;">1.27</span> |
|    StepsPerSecond |        <span class="hljs-number" style="color: teal;">2.11e+03</span> |
|              Time |             <span class="hljs-number" style="color: teal;">482</span> |
---------------------------------------
</pre>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-visualizing-the-trained-policy">Visualizing the trained policy</h2>



<p>Now, with the trained model saved, we can run it and see how it does! Run this script: </p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">python run_policy.py --model_path ./out/checkpoint --env_name Pendulum-v0</pre>



<p>in the repo root. You’ll see your agent playing 10 episodes one after another! Isn’t it cool? Did your agent learn to perfectly align the pendulum vertically? Mine didn’t. You may try playing with the hyper-parameters in the run_example.py file (the agent’s function parameters) to make the agent find a better policy. Small hint: I observed that finishing the training earlier might help. All the hyper-parameters are defined in SAC’s docstring in the sac.py file.</p>



<p>You may wonder why each episode is different. It is because the initial conditions (the pendulum’s starting angle and velocity) are randomized each time the environment is reset and a new episode starts.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-conclusions">Conclusions</h2>



<p>The next step for you is to train SAC in some more complex environment like Humanoid or any other environment from the MuJoCo suite. <a href="/blog/installing-mujoco-to-work-with-openai-gym-environments" target="_blank" rel="noreferrer noopener"><em>Installing MuJoCo to Work With OpenAI Gym Environments</em></a> is the guide I wrote on how to install MuJoCo and get access to these complex environments. It also describes useful diagnostics to track. You can read more about logging these diagnostics in <a href="/blog/logging-in-reinforcement-learning-frameworks" target="_blank" rel="noreferrer noopener"><em>Logging in Reinforcement Learning Frameworks – What You Need to Know</em></a>. There are also other frameworks that implement algorithms that can solve the continuous control tasks. Read about them in this post: <a href="/blog/best-benchmarks-for-reinforcement-learning" target="_blank" rel="noreferrer noopener"><em>Best Benchmarks for Reinforcement Learning: The Ultimate List</em></a>. Thank you for your time and see you next time!</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">6026</post-id>	</item>
		<item>
		<title>Installing MuJoCo to Work With OpenAI Gym Environments</title>
		<link>https://neptune.ai/blog/installing-mujoco-to-work-with-openai-gym-environments</link>
		
		<dc:creator><![CDATA[Piotr Januszewski]]></dc:creator>
		<pubDate>Fri, 22 Jul 2022 06:13:25 +0000</pubDate>
				<category><![CDATA[Reinforcement Learning]]></category>
		<guid isPermaLink="false">https://neptune.test/installing-mujoco-to-work-with-openai-gym-environments/</guid>

					<description><![CDATA[In this article, I’ll show you how to install MuJoCo on your Mac/Linux machine in order to run continuous control environments from OpenAI’s Gym. These environments include classic ones like HalfCheetah, Hopper, Walker, Ant, and Humanoid and harder ones like object manipulation with a robotic arm or robotic hand dexterity. I’ll also discuss additional agent&#8230;]]></description>
										<content:encoded><![CDATA[
<p><strong>In this article, I’ll show you how to install MuJoCo on your Mac/Linux machine in order to run continuous control environments from OpenAI’s Gym</strong>. These environments include classic ones like HalfCheetah, Hopper, Walker, Ant, and Humanoid and harder ones like object manipulation with a robotic arm or robotic hand dexterity. I’ll also discuss additional agent diagnostics provided by the environments that you might not have considered before.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-how-do-you-get-mujoco">How do you get MuJoCo?</h2>



<p>You might wonder, what’s so special about installing MuJoCo that it needs a guide? Well, getting a license and properly installing it might be relatively easy, but <strong>the big problems start when you’re matching MuJoCo and OpenAI Gym versions, and installing the mujoco-py package</strong>. It took me many hours to get it right the first time I tried!</p>



<p>To save you the trouble, <strong>I’ll walk you through the installation process step by step.</strong> Then I’ll discuss some useful diagnostics to keep an eye on, and we’ll take a look at example diagnostics from Humanoid training. Finally, I’ll link the code that lets you train agents on MuJoCo tasks and watch the diagnostics using <a href="/" target="_blank" rel="noreferrer noopener">neptune.ai</a>. To start, I’ll give you a bit of context about MuJoCo and OpenAI Gym environments.</p>



<section
	id="i-box-block_80742bc86d7f24957bbbf5b0816c2024"
	class="block-i-box  l-margin__top--large l-margin__bottom--large">

			<header class="c-header">
			<img
				src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
				data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/i-box/header-icon.svg"
				width="24"
				height="24"
				class="c-header__icon lazyload"
				alt="">

			
            <h2 class="c-header__text animation " style='max-width: 100%;'   >
                 <strong>Editor&#8217;s note</strong>
            </h2>		</header>
	
	<div class="block-i-box__inner">
		

<p>Do you feel like experimenting with neptune.ai?</p>



<ul
    id="arrow-list-block_03a4c75bc770048538199c525af91875"
    class="block-arrow-list block-list-item--font-size-regular">
    

<li class="block-list-item ">
    <img loading="lazy" decoding="async"
        src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
        data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/list-item/arrow.svg"
        width="10"
        height="10"
        class="block-list-item__arrow lazyload"
        alt="">

    

<p>Request a <a href="/free-trial" target="_blank" rel="noreferrer noopener">free trial</a></p>


</li>



<li class="block-list-item ">
    <img loading="lazy" decoding="async"
        src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
        data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/list-item/arrow.svg"
        width="10"
        height="10"
        class="block-list-item__arrow lazyload"
        alt="">

    

<p>Play with a <a href="https://scale.neptune.ai/o/examples/org/LLM-Pretraining/reports/9e6a2cad-77e7-42df-9d64-28f07d37e908" target="_blank" rel="noreferrer noopener nofollow">live project</a></p>


</li>



<li class="block-list-item ">
    <img loading="lazy" decoding="async"
        src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
        data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/list-item/arrow.svg"
        width="10"
        height="10"
        class="block-list-item__arrow lazyload"
        alt="">

    

<p><a previewlistener="true" href="https://docs.neptune.ai/" target="_blank" rel="noreferrer noopener">See the docs</a>&nbsp;or watch a short&nbsp;<a href="/walkthrough" target="_blank" rel="noreferrer noopener">product demo (2 min)</a></p>


</li>


</ul>


	</div>

</section>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-mujoco-multi-joint-dynamics-with-contact">MuJoCo &#8211; <strong>Mu</strong>lti-<strong>Jo</strong>int dynamics with <strong>Co</strong>ntact</h2>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/MuJoCo-demo.png?ssl=1" alt="MuJoCo demo" class="wp-image-49346"/><figcaption class="wp-element-caption"><em>Source: </em><a href="http://www.mujoco.org/book/unity.html" target="_blank" rel="noreferrer noopener nofollow"><em>MuJoCo Plugin and Unity Integration</em></a></figcaption></figure>
</div>


<p>MuJoCo is a <strong>fast and accurate physics simulation engine</strong> aimed at research and development in robotics, biomechanics, graphics, and animation. It’s an engine, meaning it doesn’t provide ready-to-use models or environments to work with; rather, it <strong>runs environments </strong>(like those that OpenAI’s Gym offers).</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-what-is-openai-gym">What is OpenAI Gym?</h2>



<p>OpenAI Gym (or Gym for short) is a collection of environments. Some of them, generally referred to as continuous control environments, run on the MuJoCo engine. All the environments share two important characteristics:</p>



<ol class="wp-block-list">
<li>An agent observes vectors that describe the kinematic properties of the controlled robot. This means that the state space is continuous.</li>



<li>Agent actions are vectors, too, and they specify torques to be applied to the robot joints. This means that the action space is also continuous.</li>
</ol>



<p><strong>Gym MuJoCo environments include classic continuous control, objects manipulation with a robotic arm, and robotic hand (Shadow Hand) dexterity.</strong> There are multiple tasks available for training in these environments. Some of them are presented in the figures below. You can find details about all of them in the Gym <a href="https://gym.openai.com/envs/#mujoco" target="_blank" rel="noreferrer noopener nofollow">environments list</a>. <a href="https://openai.com/blog/ingredients-for-robotics-research/" target="_blank" rel="noreferrer noopener nofollow">This post</a> is especially useful for robo-arm and robo-hand environments. If you don’t know the Gym API yet, I encourage you to read the <a href="https://gymnasium.farama.org/" target="_blank" rel="noreferrer noopener nofollow">documentation</a> – the two short sections “Environments” and “Observations” should be enough to start.</p>
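

<p>Once everything is installed (the installation steps are below), you can verify both characteristics directly. For example:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import gym

env = gym.make("Humanoid-v2")  # requires MuJoCo and mujoco-py, installed below

# 1. Continuous state space: a Box of kinematic quantities.
print(type(env.observation_space), env.observation_space.shape)  # Box, (376,)

# 2. Continuous action space: a Box of joint torques.
print(type(env.action_space), env.action_space.shape)            # Box, (17,)
print(env.action_space.low[:3], env.action_space.high[:3])       # per-joint torque limits
</pre>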



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow"><div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Continuous-control-1.gif?ssl=1" alt="Continuous control " class="wp-image-49348"/></figure>
</div></div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow"><div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Continuous-control-2.gif?ssl=1" alt="Continuous control " class="wp-image-49350"/></figure>
</div></div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow"><div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Continuous-control-3.gif?ssl=1" alt="Continuous control " class="wp-image-49349"/></figure>
</div></div>
</div>



<p class="has-text-align-center has-small-font-size"><em>Classic continuous control &#8211; tasks from left to right: Walker2d, And, and Humanoid.<br>Source: </em><a href="https://openai.com/blog/roboschool/" target="_blank" rel="noreferrer noopener nofollow"><em>OpenAI Roboschool</em></a></p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Robotic-arm.gif?ssl=1" alt="Robotic arm " class="wp-image-49351"/><figcaption class="wp-element-caption"><em>Objects manipulation with a robotic arm &#8211; the pick and place task.<br>Source: </em><a href="https://jangirrishabh.github.io/2018/03/25/Overcoming-exploration-demos.html" target="_blank" rel="noreferrer noopener nofollow"><em>Overcoming exploration in RL from demos</em></a></figcaption></figure>
</div>

<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Hand-manipulation.gif?ssl=1" alt="Hand manipulation" class="wp-image-49353"/></figure>
</div>


<p class="has-text-align-center has-small-font-size"><em>Shadow Hand dexterity &#8211; the hand manipulate block task.<br>Source: </em><a href="https://gym.openai.com/envs/#robotics" target="_blank" rel="noreferrer noopener nofollow"><em>OpenAI Gym Robotics</em></a></p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-installing-mujoco-and-openai-gym">Installing MuJoCo and OpenAI Gym</h2>



<p>In this section, I’ll show you where to get the MuJoCo license, how to install everything required, and also how to troubleshoot a common macOS problem.</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-license">License</h3>



<p><strong>You can get a 30-day free trial on the </strong><a href="https://www.roboti.us/license.html" target="_blank" rel="noreferrer noopener nofollow"><strong>MuJoCo website</strong></a> or—if you’re a student—a free 1-year license for education. The license key will arrive in an email with your username and password. If you’re not a student, you might try to encourage the institution you work with to buy a license.</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-installing-mujoco-py">Installing mujoco-py</h3>



<p>Here are step-by-step instructions, and below I added some explanations and troubleshooting tips:</p>



<ol class="wp-block-list">
<li>Download the MuJoCo version 1.50 binaries for <a href="https://www.roboti.us/download/mjpro150_linux.zip" target="_blank" rel="noreferrer noopener nofollow">Linux</a> or <a href="https://www.roboti.us/download/mjpro150_osx.zip" target="_blank" rel="noreferrer noopener nofollow">macOS</a>.</li>



<li>Unzip the downloaded <code><span class="c-code-snippet">mjpro150</span></code> directory into <code><span class="c-code-snippet">~/.mujoco/mjpro150</span></code>, and place your license key (the <code><span class="c-code-snippet">mjkey.txt</span></code> file from your email) at <code><span class="c-code-snippet">~/.mujoco/mjkey.txt</span>.</code></li>



<li>Run <code><span class="c-code-snippet">pip3 install -U 'mujoco-py&lt;1.50.2,>=1.50.1'</span></code></li>



<li>Run <code><span class="c-code-snippet">python3 -c 'import mujoco_py'</span></code></li>
</ol>



<p>If you see warnings like <span class="c-code-snippet">objc[&#8230;]: Class GLFW&#8230; is implemented in both&#8230;</span>, then ignore them. If you’re on macOS and see <span class="c-code-snippet">clang: error: unsupported option &#8216;-fopenmp’</span>or any other compilation-related error, then go to the <span class="c-code-snippet">Troubleshooting</span> subsection. If you wonder why MuJoCo 1.5, then go to the <span class="c-code-snippet">Version</span> subsection. If you have no more concerns, then you can jump into Gym installation!</p>



<h4 class="wp-block-heading">Troubleshooting</h4>



<p>If, on macOS, the <span class="c-code-snippet">clang: error: unsupported option &#8216;-fopenmp’</span> error, or any other compiler-related error (e.g. from gcc if you have it installed), happened to you during the installation or when running <code><span class="c-code-snippet">python3 -c ‘import mujoco_py’</span></code>, then follow these steps:</p>



<p>1. Install <a href="https://brew.sh">brew</a> if you don’t have it already.</p>



<p>2. Uninstall <strong>all</strong> other compilers if you have some, e.g. run <code><span class="c-code-snippet">brew uninstall gcc</span></code>. You may need to run it a couple of times if you have more than one version.</p>



<p>3. Run <code><span class="c-code-snippet">brew install llvm boost hdf5</span></code></p>



<p>4. Add this to your <code><span class="c-code-snippet">.bashrc / .zshrc</span></code></p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">export PATH=<span class="hljs-string" style="color: rgb(221, 17, 68);">"/usr/local/opt/llvm/bin:$PATH"</span>
export CC=<span class="hljs-string" style="color: rgb(221, 17, 68);">"/usr/local/opt/llvm/bin/clang"</span>
export CXX=<span class="hljs-string" style="color: rgb(221, 17, 68);">"/usr/local/opt/llvm/bin/clang++"</span>
export CXX11=<span class="hljs-string" style="color: rgb(221, 17, 68);">"/usr/local/opt/llvm/bin/clang++"</span>
export CXX14=<span class="hljs-string" style="color: rgb(221, 17, 68);">"/usr/local/opt/llvm/bin/clang++"</span>
export CXX17=<span class="hljs-string" style="color: rgb(221, 17, 68);">"/usr/local/opt/llvm/bin/clang++"</span>
export CXX1X=<span class="hljs-string" style="color: rgb(221, 17, 68);">"/usr/local/opt/llvm/bin/clang++"</span>
export LDFLAGS=<span class="hljs-string" style="color: rgb(221, 17, 68);">"-L/usr/local/opt/llvm/lib"</span>
export CPPFLAGS=<span class="hljs-string" style="color: rgb(221, 17, 68);">"-I/usr/local/opt/llvm/include"</span></pre>



<p>5. Don&#8217;t forget to source your <code><span class="c-code-snippet">.bashrc / .zshrc</span></code> (e.g. relaunch your cmd) after editing it and make sure your python environment is activated.</p>



<p>6. Try to uninstall and install mujoco-py again.</p>



<p>See this <a href="https://github.com/openai/mujoco-py/issues/465#issuecomment-651124360" target="_blank" rel="noreferrer noopener nofollow">GitHub issue</a> for more information. You should also see the <span class="c-code-snippet">Troubleshooting</span> section of the <a href="https://github.com/openai/mujoco-py/tree/master#troubleshooting" target="_blank" rel="noreferrer noopener nofollow">mujoco-py README</a>.</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-version">Version</h3>



<p>Here we bump into the first trap! <strong>The newest OpenAI Gym doesn’t work with MuJoCo 2.0</strong>; see <a href="https://github.com/openai/gym/issues/1541">this GitHub issue</a> if you want to know the details. This is why you need to download the MuJoCo version 1.50 binaries. Alternatively, if you really need to use MuJoCo 2.0, you can download the MuJoCo 2.0 binaries for <a href="https://www.roboti.us/download/mujoco200_linux.zip">Linux</a> or <a href="https://www.roboti.us/download/mujoco200_macos.zip">OSX</a>, install the newest mujoco-py, and then install the last Gym that supports MuJoCo 2.0: <code><span class="c-code-snippet">pip install -U gym[all]==0.15.3</span></code></p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-installing-openai-gym-environments-tutorial">Installing OpenAI Gym Environments (tutorial)</h2>



<p>Here, it’s important to install the OpenAI Gym package with the “mujoco” and “robotics” extras or simply all extras:</p>



<ol class="wp-block-list">
<li>Run <span class="c-code-snippet"><code>pip3 install gym[mujoco,robotics]</code> or <code>pip3 install gym[all]</code></span></li>



<li>Check the installation by running:</li>
</ol>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">python3 -c <span class="hljs-string" style="color: rgb(221, 17, 68);">"import gym; env = gym.make('Humanoid-v2'); print('nIt is OKAY!' if env.reset() is not None else 'nSome problem here...')"</span>
</pre>



<p>If you see “It is OKAY!” printed at the end of the cmd, then it’s OKAY! Again, <strong>you can ignore warnings like <span class="c-code-snippet">objc[&#8230;]: Class GLFW&#8230; is implemented in both&#8230;</span></strong>.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-mujoco-diagnostics">MuJoCo diagnostics</h2>



<p>Now I’ll talk about useful metrics provided by the OpenAI Gym MuJoCo environments. They depend on the environment version, so I divide them into v2 and v3 diagnostics. You can access these metrics in the “info” dictionary returned by the environment step method: observation, reward, done, <strong>info</strong> = env.step(action). See the <a href="https://gymnasium.farama.org/" target="_blank" rel="noreferrer noopener nofollow">Gym documentation</a> for more. <strong>The table below presents the keys that allow you to access the metrics in the dictionary, along with short descriptions of the metrics</strong>.</p>
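

<p>For example, printing the Humanoid diagnostics during a short random rollout could look like this (a sketch; the exact keys for each environment are listed in the table below):</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import gym

env = gym.make("Humanoid-v2")
obs = env.reset()
for step in range(5):
    obs, reward, done, info = env.step(env.action_space.sample())
    # "info" holds the per-step diagnostics; for Humanoid-v2 these include
    # reward_linvel, reward_quadctrl, reward_impact, and reward_alive (see the table).
    print(step, {key: round(value, 3) for key, value in info.items()})
    if done:
        obs = env.reset()
</pre>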



<div id="separator-block_ad2e1e8a994e6bfc27dee8d0475e0b59"
         class="block-separator block-separator--10">
</div>



<div id="medium-table-block_23fa682b5d5ba5bb9c999f64a759c691"
     class="block-medium-table c-table__outer-wrapper  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset">

    <table class="c-table">
                    <thead class="c-table__head">
            <tr>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Name                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Version                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Key                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Description                        </div>
                    </td>
                            </tr>
            </thead>
        
        <tbody class="c-table__body">

                    
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;"><strong>HalfCheetah</strong></span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">v2 / v3</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">reward_run</span></p>
<p><span style="font-size: 15px;">reward_ctrl</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">The positive reward for the robot forward velocity.</span></p>
<p><span style="font-size: 15px;">The negative reward for the robot action vector magnitude.</span></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;"><strong>HalfCheetah</strong></span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">v3</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">x_position</span></p>
<p><span style="font-size: 15px;">x_velocity</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">Position in the X-axis.</span></p>
<p><span style="font-size: 15px;">Velocity in the X-axis (forward velocity).</span></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong><span style="font-size: 15px;">Hopper</span></strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">v3</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">x_position</span></p>
<p><span style="font-size: 15px;">x_velocity</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">Position in the X-axis.</span></p>
<p><span style="font-size: 15px;">Velocity in the X-axis (forward velocity).</span></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;"><strong>Walker2d</strong></span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">v3</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">x_position</span></p>
<p><span style="font-size: 15px;">x_velocity</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">Position in the X-axis.</span></p>
<p><span style="font-size: 15px;">Velocity in the X-axis (forward velocity).</span></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;"><strong>Ant</strong></span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">v2 / v3</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">reward_forward</span></p>
<p><span style="font-size: 15px;">reward_ctrl</span></p>
<p><span style="font-size: 15px;">reward_contact</span></p>
<p>&nbsp;</p>
<p><span style="font-size: 15px;">reward_survive</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">The positive reward for the robot forward velocity.</span></p>
<p><span style="font-size: 15px;">The negative reward for the robot action vector magnitude.</span></p>
<p><span style="font-size: 15px;">The negative reward for the contact force magnitude between the robot and the ground.</span></p>
<p><span style="font-size: 15px;">The constant positive reward at each time step when the robot is alive (until the end of an episode or the robot falls).</span></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;"><strong>Ant</strong></span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">v3</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">x_position</span></p>
<p><span style="font-size: 15px;">x_velocity</span></p>
<p><span style="font-size: 15px;">y_position</span></p>
<p><span style="font-size: 15px;">y_velocity</span></p>
<p><span style="font-size: 15px;">distance_from_origin</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">Position in the X-axis.</span></p>
<p><span style="font-size: 15px;">Velocity in the X-axis.</span></p>
<p><span style="font-size: 15px;">Position in the Y-axis.</span></p>
<p><span style="font-size: 15px;">Velocity in the Y-axis.</span></p>
<p><span style="font-size: 15px;">Distance from the robot starting position, (0, 0).</span></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;"><strong>Humanoid</strong></span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">v2 / v3</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">reward_linvel</span></p>
<p><span style="font-size: 15px;">reward_quadctrl</span></p>
<p><span style="font-size: 15px;">reward_impact</span></p>
<p>&nbsp;</p>
<p><span style="font-size: 15px;">reward_alive</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">The positive reward for the robot’s forward velocity.</span></p>
<p><span style="font-size: 15px;">The negative reward (penalty) for the magnitude of the robot’s action vector.</span></p>
<p><span style="font-size: 15px;">The negative reward (penalty) for the magnitude of the contact forces between the robot and the ground.</span></p>
<p><span style="font-size: 15px;">The constant positive reward granted at each time step while the robot is alive (until the episode ends or the robot falls).</span></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;"><strong>Humanoid</strong></span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">v3</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">x_position</span></p>
<p><span style="font-size: 15px;">x_velocity</span></p>
<p><span style="font-size: 15px;">y_position</span></p>
<p><span style="font-size: 15px;">y_velocity</span></p>
<p><span style="font-size: 15px;">distance_from_origin</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">Position in the X-axis.</span></p>
<p><span style="font-size: 15px;">Velocity in the X-axis.</span></p>
<p><span style="font-size: 15px;">Position in the Y-axis.</span></p>
<p><span style="font-size: 15px;">Velocity in the Y-axis.</span></p>
<p><span style="font-size: 15px;">Distance from the robot starting position, (0, 0).</span></p>
                                                            </div>
                        </td>

                    
                </tr>

                    
        </tbody>
    </table>

</div>



<div id="separator-block_74419ffcb8c39132a12a93ed3b58e106"
         class="block-separator block-separator--5">
</div>



<p class="has-text-align-center"><em>Table: The most useful metrics provided by the OpenAI Gym MuJoCo environments</em></p>



<div id="separator-block_65f73a6ab17eeac86beb6cea7147bb87"
         class="block-separator block-separator--20">
</div>



<p><strong>Reward components can be especially useful</strong> – for example, the forward velocity reward, since running forward is the very goal of these tasks. However, note that the absence of some metric from the info dictionary doesn’t mean the corresponding term isn’t part of the reward: the survival reward, say, is still added to the rewards of Hopper and Walker even though it isn’t reported there. For more nitty-gritty details like this, I encourage you to look into the code of the specific task on GitHub, e.g. <a href="https://github.com/openai/gym/blob/master/gym/envs/mujoco/walker2d_v3.py" target="_blank" rel="noreferrer noopener nofollow">Walker2d-v3</a>.</p>
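

<p>If you want to inspect these components yourself, the sketch below runs a few random steps in Ant-v3 and prints the reward terms from the table above. It is a minimal illustration (not code from this article), and it assumes the classic OpenAI Gym API that returns four values from <em>step()</em> plus a working MuJoCo installation.</p>



<pre class="wp-block-code"><code>import gym  # classic OpenAI Gym API; newer Gymnasium returns five values from step()

# Minimal sketch: take a few random steps in Ant-v3 and print the reward
# components exposed through the info dictionary (names from the table above).
env = gym.make("Ant-v3")
obs = env.reset()
for _ in range(5):
    action = env.action_space.sample()           # random action, just to produce transitions
    obs, reward, done, info = env.step(action)   # info carries the diagnostic metrics
    print(
        f"forward: {info['reward_forward']:+.3f}  "
        f"ctrl: {info['reward_ctrl']:+.3f}  "
        f"contact: {info['reward_contact']:+.3f}  "
        f"survive: {info['reward_survive']:+.3f}"
    )
    if done:
        obs = env.reset()
env.close()</code></pre>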



<p>Now, let’s take a look at example metric values on the Humanoid task.</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-humanoid-diagnostics">Humanoid diagnostics</h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/humanoid_velocity.png?ssl=1" alt="Humanoid velocity" class="wp-image-49360"/><figcaption class="wp-element-caption"><em>Comparison of velocities of three different DRL algorithms: SAC, SOP, and SUNRISE</em></figcaption></figure>
</div>


<p>The figure above compares velocities of three different DRL algorithms: <a href="https://arxiv.org/abs/1812.05905?ref=hackernoon.com" target="_blank" rel="noreferrer noopener nofollow">SAC</a>, <a href="https://arxiv.org/abs/1910.02208" target="_blank" rel="noreferrer noopener nofollow">SOP</a>, and <a href="https://arxiv.org/abs/2007.04938" target="_blank" rel="noreferrer noopener nofollow">SUNRISE</a>. The velocities are plotted for fully trained agents at different points of the episode. You can see that the SOP agent runs the fastest, which is the goal of this task. In the figures below we investigate the positions of the SAC agent at the end of episodes at different stages of training.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/MuJoCo-logs_AverageXPosition-2.png?ssl=1" alt="MuJoCo logs_AverageXPosition-2" class="wp-image-49361"/><figcaption class="wp-element-caption"><em>SAC final positions in the X-axis across training on the Humanoid task. | </em><a href="https://app.neptune.ai/piojanu/bayesian-exploration/e/BAY-1797/charts" target="_blank" rel="noreferrer noopener nofollow"><em>See in the Neptune app</em></a></figcaption></figure>
</div>

<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/MuJoCo-logs_AverageYPosition-2.png?ssl=1" alt="MuJoCo logs_AverageYPosition-2" class="wp-image-49362"/><figcaption class="wp-element-caption">SAC final positions in the Y-axis across training on the Humanoid task | <a href="https://app.neptune.ai/piojanu/bayesian-exploration/e/BAY-1797/charts" target="_blank" rel="noreferrer noopener nofollow">See in the Neptune app</a></figcaption></figure>
</div>


<p>You can see that this particular SAC agent runs in the negative X and positive Y direction and that, as training progresses, it gets further and further. Because the time it has before the end of the episode stays the same, this means it learns to run faster over the course of training. Note that the agent isn’t trained to run in any particular direction – it’s simply trained to run as fast as possible in whatever direction it picks. Different agents can therefore learn to run in different directions, and an agent can even change its running direction at some point during training, as shown in the figures below.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/MuCoJo-logs_AverageXPosition-3.png?ssl=1" alt="MuCoJo logs_AverageXPosition-3" class="wp-image-49364"/><figcaption class="wp-element-caption"><em>SAC final positions in the X-axis across training on the Humanoid task. It changes the run direction about one-third of the way through training | <a href="https://app.neptune.ai/piojanu/bayesian-exploration/e/BAY-1783/charts" target="_blank" rel="noreferrer noopener nofollow">See in the Neptune app</a></em></figcaption></figure>
</div>

<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/MuJoCo-logs_AverageYPosition-3.png?ssl=1" alt="MuJoCo logs_AverageYPosition-3" class="wp-image-49365"/><figcaption class="wp-element-caption"><em>SAC final positions in the Y-axis across training on the Humanoid task. It changes the run direction late in the training |<a href="https://app.neptune.ai/piojanu/bayesian-exploration/e/BAY-1783/charts" target="_blank" rel="noreferrer noopener nofollow"> See in the Neptune app</a></em></figcaption></figure>
</div>
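

<p>If you’d like to reproduce this kind of diagnostic for your own agents, here is a minimal sketch. It assumes the classic Gym API and a <em>policy(obs)</em> callable standing in for your trained agent (a hypothetical placeholder, not something defined in this article); the resulting positions can then be logged to Neptune or any other tracker once per evaluation phase.</p>



<pre class="wp-block-code"><code>import gym

def final_positions(policy, n_episodes=10):
    """Roll out a trained policy on Humanoid-v3 and return final (x, y) positions.

    `policy` is a hypothetical callable mapping an observation to an action;
    plug in your own trained agent here.
    """
    env = gym.make("Humanoid-v3")
    positions = []
    for _ in range(n_episodes):
        obs, done, info = env.reset(), False, {}
        while not done:
            obs, reward, done, info = env.step(policy(obs))
        # x_position / y_position come from the info dict, as listed in the table above.
        positions.append((info["x_position"], info["y_position"]))
    env.close()
    return positions</code></pre>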

    <a
        href="/blog/best-benchmarks-for-reinforcement-learning"
        id="cta-box-related-link-block_46db6bec6206e6ab562b8e3a9326714d"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related                </div>
            </div>

        
                    <h3 class="c-header" class="c-header" id="h-best-benchmarks-for-reinforcement-learning-the-ultimate-list">                Best Benchmarks for Reinforcement Learning: The Ultimate List            </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-conclusions">Conclusions</h2>



<p>Congratulations, you’ve got MuJoCo up and running! The natural next step is training agents in these environments – check out <a href="https://github.com/awarelab/spinningup_tf2" target="_blank" rel="noreferrer noopener nofollow">this repository</a>. It includes easy-to-understand implementations of DRL algorithms in modern TF2, based on the newcomer-friendly <a href="https://spinningup.openai.com" target="_blank" rel="noreferrer noopener nofollow">SpinningUp</a> codebase. Moreover, <strong>it includes the ability to log to the Neptune platform</strong>, which is very convenient for storing and analyzing training results! I use it in my research, and I encourage you to give it a try too.&nbsp;</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5532</post-id>	</item>
		<item>
		<title>Best Benchmarks for Reinforcement Learning: The Ultimate List</title>
		<link>https://neptune.ai/blog/best-benchmarks-for-reinforcement-learning</link>
		
		<dc:creator><![CDATA[Piotr Januszewski]]></dc:creator>
		<pubDate>Thu, 21 Jul 2022 15:04:41 +0000</pubDate>
				<category><![CDATA[Reinforcement Learning]]></category>
		<guid isPermaLink="false">https://neptune.test/best-benchmarks-for-reinforcement-learning/</guid>

					<description><![CDATA[In this post, I’ll share with you my library of environments that support training reinforcement learning (RL) agents. The basis for RL research, or even playing with or learning RL, is the environment. It’s where you run your algorithm to evaluate how good it is. We’re going to explore 23 different benchmarks, so I guarantee&#8230;]]></description>
										<content:encoded><![CDATA[
<p>In this post, I’ll share with you my library of environments that support training reinforcement learning (RL) agents. The basis for RL research, or even playing with or learning RL, is the environment. It’s where you run your algorithm to evaluate how good it is. We’re going to explore 23 different benchmarks, so I guarantee you’ll find something interesting!</p>



<p>But first, we’ll do a short introduction to what you should be looking for if you’re just starting with RL. Whatever your current level of knowledge, I recommend looking through the whole list. I hope it will motivate you to keep doing good work, and <strong>inspire you to start your own project in something different than standard benchmarks!</strong></p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-rule-of-thumb">Rule of thumb</h2>



<p>If you’re interested in algorithms specialized in <strong>discrete action spaces</strong> (PPO, DQN, Rainbow, &#8230;), where the action input can be, for example, buttons on the ATARI 2600 game controller, then you should look at the Atari environments in the <a href="https://gym.openai.com">OpenAI Gym</a>. These include Pong, Breakout, Space Invaders, Seaquest, and more.</p>



<p>On the other hand, if you’re more interested in algorithms specialized in <strong>continuous action spaces</strong> (DDPG, TD3, SAC, &#8230;), where the action input is, say, torque on the joints of a humanoid robot learning to walk, then you should look at the MuJoCo environments in the <a href="https://gym.openai.com" target="_blank" rel="noreferrer noopener nofollow">OpenAI Gym</a> and <a href="https://github.com/deepmind/dm_control" target="_blank" rel="noreferrer noopener nofollow">DeepMind Control Suite</a>. <a href="https://github.com/benelot/pybullet-gym" target="_blank" rel="noreferrer noopener nofollow">PyBullet Gymperium</a> is an unpaid alternative. Harder environments include Robotics in the <a href="https://gym.openai.com" target="_blank" rel="noreferrer noopener nofollow">OpenAI Gym</a>.</p>



<p>If you don’t know what you’re interested in yet, then I suggest playing around with <a href="https://gym.openai.com/envs/#classic_control" target="_blank" rel="noreferrer noopener nofollow">classic control</a> environments in the OpenAI Gym, and reading <a href="https://spinningup.openai.com" target="_blank" rel="noreferrer noopener nofollow">SpinningUp in Deep RL</a>.</p>
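

<p>If the classic control environments are where you start, the interaction loop is only a few lines. Below is a minimal random-agent sketch using the classic Gym API (the four-value <em>step()</em> signature); no learning happens here – it simply shows the interface you’ll build on.</p>



<pre class="wp-block-code"><code>import gym

# Minimal sketch of the classic Gym interaction loop on a classic-control task.
# The agent samples random actions; replace the sampling with your algorithm.
env = gym.make("CartPole-v1")
for episode in range(3):
    obs = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()          # your agent's action goes here
        obs, reward, done, info = env.step(action)  # classic 4-tuple API
        total_reward += reward
    print(f"episode {episode}: return {total_reward}")
env.close()</code></pre>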



<p>Enough introduction, let’s check out the benchmarks!</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-benchmarks">Benchmarks</h2>



<p>The first part of this section is just a list, in alphabetical order, of all 23 benchmarks. Further down, I add a bit of description from each benchmark’s creator to show you what it’s for.&nbsp;</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-list-of-rl-benchmarks">List of RL benchmarks</h3>



<ul class="wp-block-list">
<li><a href="#aihabitat" rel="nofollow">AI Habitat</a> &#8211; Virtual embodiment; Photorealistic &amp; efficient 3D simulator;</li>



<li><a href="#behavioursuite" rel="nofollow">Behaviour Suite</a> &#8211; Test core RL capabilities; Fundamental research; Evaluate generalization;</li>



<li><a href="#deepmind-control" rel="nofollow">DeepMind Control Suite</a> &#8211; Continuous control; Physics-based simulation; Creating environments;</li>



<li><a href="#deepmind-lab" rel="nofollow">DeepMind Lab</a> &#8211; 3D navigation; Puzzle-solving;</li>



<li><a href="#deepmind-memory" rel="nofollow">DeepMind Memory Task Suite</a> &#8211; Require memory; Evaluate generalization;</li>



<li><a href="#deepmind-psychlab" rel="nofollow">DeepMind Psychlab</a> &#8211; Require memory; Evaluate generalization;</li>



<li><a href="#googleresearch-football" rel="nofollow">Google Research Football</a> &#8211; Multi-task; Single-/Multi-agent; Creating environments;</li>



<li><a href="#metaworld" rel="nofollow">Meta-World</a> &#8211; Meta-RL; Multi-task;</li>



<li><a href="#minerl" rel="nofollow">MineRL</a> &#8211; Imitation learning; Offline RL; 3D navigation; Puzzle-solving;</li>



<li><a href="#multiagent" rel="nofollow">Multiagent emergence environments</a> &#8211; Multi-agent; Creating environments; Emergence behavior;</li>



<li><a href="#openai-gym" rel="nofollow">OpenAI Gym</a> &#8211; Continuous control; Physics-based simulation; Classic video games; RAM state as observations;</li>



<li><a href="#openai-gymretro" rel="nofollow">OpenAI Gym Retro</a> &#8211; Classic video games; RAM state as observations;</li>



<li><a href="#openspiel" rel="nofollow">OpenSpiel</a> &#8211; Classic board games; Search and planning; Single-/Multi-agent;</li>



<li><a href="#procgen" rel="nofollow">Procgen Benchmark</a> &#8211; Evaluate generalization; Procedurally-generated;</li>



<li><a href="#pybullet" rel="nofollow">PyBullet Gymperium</a> &#8211; Continuous control; Physics-based simulation; MuJoCo unpaid alternative;</li>



<li><a href="#realworld" rel="nofollow">Real-World Reinforcement Learning</a> &#8211; Continuous control; Physics-based simulation; Adversarial examples;</li>



<li><a href="#rlcard" rel="nofollow">RLCard</a> &#8211; Classic card games; Search and planning; Single-/Multi-agent;</li>



<li><a href="#rlunplugged" rel="nofollow">RL Unplugged</a> &#8211; Offline RL; Imitation learning; Datasets for the common benchmarks;</li>



<li><a href="#screeps" rel="nofollow">Screeps</a> &#8211; Compete with others; Sandbox; MMO for programmers;</li>



<li><a href="#serpentai" rel="nofollow">Serpent.AI &#8211; Game Agent Framework</a> &#8211; Turn ANY video game into the RL env;</li>



<li><a href="#starcraft" rel="nofollow">StarCraft II Learning Environment</a> &#8211; Rich action and observation spaces; Multi-agent; Multi-task;</li>



<li><a href="#unity" rel="nofollow">The Unity Machine Learning Agents Toolkit (ML-Agents)</a> &#8211; Create environments; Curriculum learning; Single-/Multi-agent; Imitation learning;</li>



<li><a href="#wordcraft" rel="nofollow">WordCraft</a> -Test core capabilities; Commonsense knowledge;<br></li>
</ul>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-a">A</h2>



<h3 class="wp-block-heading" id="aihabitat"><a href="https://aihabitat.org" target="_blank" rel="noreferrer noopener nofollow">AI Habitat</a></h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-benchmarks-AI-habitat.png?ssl=1" alt="RL benchmarks - AI habitat" class="wp-image-44458"/></figure>
</div>


<p>The embodiment hypothesis is the idea that “intelligence emerges in the interaction of an agent with an environment and as a result of sensorimotor activity”. <em>Habitat</em> is a simulation platform for research in Embodied AI.<br>Imagine walking up to a home robot and asking “Hey robot – can you go check if my laptop is on my desk? And if so, bring it to me”. Or asking an egocentric AI assistant (sitting on your smart glasses): “Hey – where did I last see my keys?”. AI Habitat enables training of such embodied AI agents (virtual robots and egocentric assistants) in a highly photorealistic &amp; efficient 3D simulator, before transferring the learned skills to reality.</p>



<p>It’ll be the best fit for you if you study intelligent systems with a physical or virtual embodiment.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-b">B</h2>



<h3 class="wp-block-heading" id="behavioursuite"><a href="https://github.com/deepmind/bsuite" target="_blank" rel="noreferrer noopener nofollow">Behaviour Suite</a></h3>



<p><em>bsuite</em> is a collection of carefully designed experiments that investigate the core capabilities of a reinforcement learning (RL) agent with two main objectives.</p>



<ul class="wp-block-list">
<li>To collect clear, informative, and scalable problems that capture key issues in the design of efficient and general learning algorithms.</li>



<li>To study agent behavior through their performance on these shared benchmarks.</li>
</ul>



<p>This library automates the evaluation and analysis of any agent on these benchmarks. It serves to facilitate reproducible, and accessible, research on the core issues in RL, and ultimately the design of superior learning algorithms.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-d">D</h2>



<h3 class="wp-block-heading" id="deepmind-control"><a href="https://github.com/deepmind/dm_control" target="_blank" rel="noreferrer noopener nofollow">DeepMind Control Suite</a></h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-benchmarks-Deepmind-control.png?ssl=1" alt="RL benchmarks - Deepmind control" class="wp-image-44459"/></figure>
</div>


<p>The <em>dm_control</em> software package is a collection of <strong>Python libraries and task suites</strong> for reinforcement learning agents in an articulated-body simulation. A MuJoCo wrapper provides convenient bindings to functions and data structures to create your own tasks.</p>



<p>Moreover, the Control Suite is a fixed set of tasks with a standardized structure, intended to serve as performance benchmarks. It includes classic tasks like HalfCheetah, Humanoid, Hopper, Walker, Graber, and more (see the picture). The Locomotion framework provides high-level abstractions and examples of locomotion tasks like soccer. A set of configurable manipulation tasks with a robot arm and snap-together bricks is also included.</p>



<p>An introductory tutorial for this package is available as a <a href="https://colab.research.google.com/github/deepmind/dm_control/blob/master/tutorial.ipynb" target="_blank" rel="noreferrer noopener nofollow">Colaboratory notebook</a>.</p>
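

<p>To give you a feel for the API, here is a minimal random-policy loop (a sketch along the lines of the package’s tutorial; it assumes dm_control and MuJoCo are installed):</p>



<pre class="wp-block-code"><code>import numpy as np
from dm_control import suite

# Minimal sketch: load a Control Suite task and step it with uniform random actions.
env = suite.load(domain_name="walker", task_name="walk")
action_spec = env.action_spec()

time_step = env.reset()
episode_return = 0.0
while not time_step.last():
    action = np.random.uniform(
        action_spec.minimum, action_spec.maximum, size=action_spec.shape
    )
    time_step = env.step(action)
    episode_return += time_step.reward
print("random-policy return:", episode_return)</code></pre>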



<h3 class="wp-block-heading" id="deepmind-lab"><a href="https://github.com/deepmind/lab" target="_blank" rel="noreferrer noopener nofollow">DeepMind Lab</a></h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-benchmarks-Deepmind-Lab.png?ssl=1" alt="RL benchmarks - Deepmind Lab" class="wp-image-44460"/></figure>
</div>


<p><em>DeepMind Lab</em> is a 3D learning environment based on Quake III Arena via ioquake3 and other open-source software. DeepMind Lab provides a suite of challenging 3D navigation and puzzle-solving tasks for learning agents. Its primary purpose is to act as a testbed for research in artificial intelligence, where agents have to act on visual observations.</p>



<h3 class="wp-block-heading" id="deepmind-memory"><a href="https://github.com/deepmind/dm_memorytasks" target="_blank" rel="noreferrer noopener nofollow">DeepMind Memory Task Suite</a></h3>



<p><em>The DeepMind Memory Task Suite</em> is a set of 13 diverse machine-learning tasks that require memory to solve. They are constructed to let us evaluate generalization performance on a memory-specific holdout set.</p>



<h3 class="wp-block-heading" id="deepmind-psychlab"><a href="https://github.com/deepmind/lab/tree/master/game_scripts/levels/contributed/psychlab" target="_blank" rel="noreferrer noopener nofollow">DeepMind Psychlab</a></h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-benchmarks-Deepmind-Psychlab.png?ssl=1" alt="RL benchmarks - Deepmind Psychlab" class="wp-image-44461"/></figure>
</div>


<p><em>Psychlab</em> is a simulated psychology laboratory inside the first-person 3D game world of DeepMind Lab. Psychlab enables implementations of classical laboratory psychological experiments so that they work with both human and artificial agents. Psychlab has a simple and flexible API that enables users to easily create their own tasks. As an example, the Psychlab includes several classical experimental paradigms including visual search, change detection, random dot motion discrimination, and multiple object tracking.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-g">G</h2>



<h3 class="wp-block-heading" id="googleresearch-football"><a href="https://github.com/google-research/football" target="_blank" rel="noreferrer noopener nofollow">Google Research Football</a></h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-benchmarks-Google-Football.png?ssl=1" alt="RL benchmarks - Google Football" class="wp-image-44462"/></figure>
</div>


<p><em>Google Research Football</em> is a novel RL environment where agents aim to master the world’s most popular sport &#8211; football! Modeled after popular football video games, the Football Environment provides an efficient physics-based 3D football simulation where agents control either one or all football players on their team, learn how to pass between them, and manage to overcome their opponent’s defense in order to score goals. The Football Environment provides a demanding set of research problems called Football Benchmarks, as well as the Football Academy, a set of progressively harder RL scenarios.<br>It’s perfect for multi-agent and multi-task research. It also allows you to create your own academy scenarios as well as completely new tasks using the simulator, based on the included examples.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-m">M</h2>



<h3 class="wp-block-heading" id="metaworld"><a href="https://github.com/rlworkgroup/metaworld" target="_blank" rel="noreferrer noopener nofollow">Meta-World</a></h3>


<div class="wp-block-image">
<figure class="aligncenter"><img decoding="async" src="https://lh5.googleusercontent.com/sYQFn5xtu2A64zQ3w75S7FnOHYbPAeVS64vO8FwzgoPz6dvTdRV7D2SkndVbKUrFj6ZgIRNsWjnaMdpU7BnthCh_Xef1ap-Xzmte-OhR9q1AtsmfCgW-HhRyKBC4vwQ6BSBAzxZB" alt="RL benchmarks - Metaworld"/></figure>
</div>


<p>Meta-reinforcement learning algorithms can enable robots to acquire new skills much more quickly, by leveraging prior experience to learn how to learn. <em>Meta-World</em> is an open-source simulated benchmark for meta-reinforcement learning and multi-task learning consisting of 50 distinct robotic manipulation tasks. The authors aim to provide task distributions that are sufficiently broad to evaluate meta-RL algorithms&#8217; generalization ability to new behaviors.</p>



<h3 class="wp-block-heading" id="minerl"><a href="https://minerl.io/docs/" target="_blank" rel="noreferrer noopener nofollow">MineRL</a></h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-benchmarks-MineRL.png?ssl=1" alt="RL benchmarks - MineRL" class="wp-image-44463"/></figure>
</div>


<p><em>MineRL</em> is a research project started at Carnegie Mellon University aimed at developing various aspects of artificial intelligence within Minecraft. In short, MineRL consists of two major components:</p>



<ul class="wp-block-list">
<li><a href="https://minerl.io/dataset/" target="_blank" rel="noreferrer noopener nofollow">MineRL-v0 Dataset</a> – One of the largest imitation learning datasets with over 60 million frames of recorded human player data. The dataset includes a set of environments that highlight many of the hardest problems in modern-day Reinforcement Learning: sparse rewards and hierarchical policies.</li>



<li><a href="https://minerl.io/docs/tutorials/index.html" target="_blank" rel="noreferrer noopener nofollow">minerl</a> – A rich python3 package for doing artificial intelligence research in Minecraft. This includes two major submodules: <em>minerl.env</em> – A growing set of OpenAI Gym environments in Minecraft and <em>minerl.data</em> – The main python module for experimenting with the MineRL-v0 dataset.</li>
</ul>



<h3 class="wp-block-heading" id="multiagent"><a href="https://github.com/openai/multi-agent-emergence-environments" target="_blank" rel="noreferrer noopener nofollow">Multiagent emergence environments</a></h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-benchmarks-Multiagent.png?ssl=1" alt="RL benchmarks - Multiagent" class="wp-image-44464"/></figure>
</div>


<p>Environment generation code for <a href="https://arxiv.org/abs/1909.07528" target="_blank" rel="noreferrer noopener nofollow">Emergent Tool Use From Multi-Agent Autocurricula</a>. It’s a fun paper, I highly recommend you read it. The authors observed agents discovering progressively more complex tool use while playing a simple game of hide-and-seek. Through training in the simulated hide-and-seek environment, agents build a series of six distinct strategies and counterstrategies. The self-supervised emergent complexity in this simple environment further suggests that multi-agent co-adaptation may one day produce extremely complex and intelligent behavior.</p>



<p>It uses “<a href="https://github.com/openai/mujoco-worldgen" target="_blank" rel="noreferrer noopener nofollow">Worldgen: Randomized MuJoCo environments</a>” which allows users to generate complex, heavily randomized environments. You should try it too, if you’re into creating your own environments!</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-o">O</h2>



<h3 class="wp-block-heading" id="openai-gym"><a href="https://gym.openai.com" target="_blank" rel="noreferrer noopener nofollow">OpenAI Gym</a></h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-benchmarks-OpenAI-gym.png?ssl=1" alt="RL benchmarks - OpenAI gym" class="wp-image-44465"/></figure>
</div>


<p><em>Gym</em>, besides being the most widely known benchmark, is an amazing toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking a simulated humanoid (requires MuJoCo, see <a href="https://github.com/benelot/pybullet-gym" target="_blank" rel="noreferrer noopener nofollow">PyBullet Gymperium</a> for the free alternative) to playing Atari games like Pong or Pinball. I personally use it in my research the most. It’s very easy to use and it’s kind of standard nowadays. You should get to know it well.</p>



<h3 class="wp-block-heading" id="openai-gymretro"><a href="https://github.com/openai/retro" target="_blank" rel="noreferrer noopener nofollow">OpenAI Gym Retro</a></h3>



<p><em>Gym Retro</em> can be thought of as an extension of the OpenAI Gym. It lets you turn classic video games into OpenAI Gym environments for reinforcement learning and comes with integrations for ~1000 games. It uses various emulators that support the Libretro API, making it fairly easy to add new emulators.</p>



<h3 class="wp-block-heading" id="openspiel"><a href="https://github.com/deepmind/open_spiel" target="_blank" rel="noreferrer noopener nofollow">OpenSpiel</a></h3>



<p><em>OpenSpiel</em> is a collection of environments and algorithms for research in general reinforcement learning and search/planning in games. OpenSpiel supports n-player (single- and multi- agent) zero-sum, cooperative and general-sum, one-shot and sequential, strictly turn-taking and simultaneous-move, perfect and imperfect information games, as well as traditional multiagent environments such as (partially- and fully- observable) grid worlds and social dilemmas. OpenSpiel also includes tools to analyze learning dynamics and other common evaluation metrics. Games are represented as procedural extensive-form games, with some natural extensions. The core API and games are implemented in C++ for efficiency and exposed to Python for your ease of use.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-p">P</h2>



<h3 class="wp-block-heading" id="procgen"><a href="https://github.com/openai/procgen" target="_blank" rel="noreferrer noopener nofollow">Procgen Benchmark</a></h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-benchmarks-Procgen.png?ssl=1" alt="RL benchmarks - Procgen" class="wp-image-44466"/></figure>
</div>


<p><em>Procgen Benchmark</em> consists of 16 unique environments designed to measure both sample efficiency and generalization in reinforcement learning. This benchmark is ideal for evaluating generalization since distinct training and test sets can be generated in each environment. This benchmark is also well-suited to evaluate sample efficiency since all environments pose diverse and compelling challenges for RL agents. The environments’ intrinsic diversity demands that agents learn robust policies; overfitting to narrow regions in state space will not suffice. Put differently, the ability to generalize becomes an integral component of success when agents are faced with ever-changing levels.</p>



<h3 class="wp-block-heading" id="pybullet"><a href="https://github.com/benelot/pybullet-gym" target="_blank" rel="noreferrer noopener nofollow">PyBullet Gymperium</a></h3>



<p><em>PyBullet Gymperium</em> is an open-source implementation of the OpenAI Gym MuJoCo environments and more. These are challenging continuous control environments like training a humanoid to walk. What’s cool about it is that it doesn’t require the user to install MuJoCo, a commercial physics engine that requires a paid license to run for longer than 30 days.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-r">R</h2>



<h3 class="wp-block-heading" id="realworld"><a href="https://github.com/google-research/realworldrl_suite" target="_blank" rel="noreferrer noopener nofollow">Real-World Reinforcement Learning</a></h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-benchmarks-Real-World.png?ssl=1" alt="RL benchmarks - Real World" class="wp-image-44467"/></figure>
</div>


<p>The <a href="https://arxiv.org/abs/1904.12901" target="_blank" rel="noreferrer noopener nofollow">Challenges of Real-World RL</a> paper identifies and describes a set of nine challenges that are currently preventing Reinforcement Learning (RL) agents from being utilized in real-world applications and products. It also describes an evaluation framework and a set of environments that can provide an evaluation of an RL algorithm’s potential applicability to real-world systems. It has since been followed up by the <a href="https://arxiv.org/pdf/2003.11881.pdf" target="_blank" rel="noreferrer noopener nofollow">An Empirical Investigation of the challenges of real-world reinforcement learning</a> paper, which implements eight of the nine described challenges and analyses their effects on various state-of-the-art RL algorithms.<br>This is the codebase used to perform this analysis and is also intended as a common platform for easily reproducible experimentation around these challenges. It is referred to as the <em>realworldrl-suite</em> (Real-World Reinforcement Learning (RWRL) Suite).</p>



<h3 class="wp-block-heading" id="rlcard"><a href="https://github.com/datamllab/rlcard" target="_blank" rel="noreferrer noopener nofollow">RLCard</a></h3>



<p><em>RLCard</em> is a toolkit for Reinforcement Learning (RL) in card games. It supports multiple card environments with easy-to-use interfaces. Games include Blackjack, UNO, Limit Texas Hold&#8217;em, and more! It also lets you create your own environments. The goal of RLCard is to bridge reinforcement learning and imperfect information games.</p>



<h3 class="wp-block-heading" id="rlunplugged"><a href="https://github.com/deepmind/deepmind-research/tree/master/rl_unplugged" target="_blank" rel="noreferrer noopener nofollow">RL Unplugged</a></h3>



<p><em>RL Unplugged</em> is a suite of benchmarks for offline reinforcement learning. It is designed to facilitate ease of use: it provides the datasets with a unified API, which makes it easy for practitioners to work with all data in the suite once a general pipeline has been established. It includes datasets for the most common benchmarks: Atari, DeepMind Locomotion, DeepMind Control Suite, Real-World RL, DeepMind Lab, and bsuite.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-s">S</h2>



<h3 class="wp-block-heading" id="screeps"><a href="https://screeps.com" target="_blank" rel="noreferrer noopener nofollow">Screeps</a></h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-benchmarks-Screeps.png?ssl=1" alt="RL benchmarks - Screeps" class="wp-image-44468"/></figure>
</div>


<p><em>Screeps</em> is a massive, multiplayer, online, real-time, strategy game (phwee, it’s a lot). Each player can create their own colony in a single persistent world shared by all the players. Such a colony can mine resources, build units, and conquer territory. As you conquer more territory, your influence in the game world grows, as well as your abilities to expand your footprint. However, it requires a lot of effort on your part, since multiple players may aim at the same territory. And most importantly, you build an AI that does all of it!</p>



<p>Screeps is developed for people with programming skills. Unlike some other RTS games, your units in Screeps can react to events without your participation – provided that you have programmed them properly.</p>



<h3 class="wp-block-heading" id="serpentai"><a href="https://github.com/SerpentAI/SerpentAI?utm_campaign=Revue%20newsletter&amp;utm_medium=Newsletter&amp;utm_source=The%20Wild%20Week%20in%20AI" target="_blank" rel="noreferrer noopener nofollow">Serpent.AI &#8211; Game Agent Framework</a></h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-benchmarks-Serpent-AI.png?ssl=1" alt="RL benchmarks - Serpent AI" class="wp-image-44469"/></figure>
</div>


<p>Serpent.AI is a simple yet powerful, novel framework to assist developers in the creation of game agents. Turn ANY video game you own into a sandbox environment ripe for experimentation, all with familiar Python code. For example, see this <a href="https://www.youtube.com/watch?v=rvnHikUJ9T0" target="_blank" rel="noreferrer noopener nofollow">autonomous driving agent in GTA</a>. The framework first and foremost provides a valuable tool for Machine Learning &amp; AI research. It also turns out to be ridiculously fun to use as a hobbyist (and dangerously addictive)!</p>



<h3 class="wp-block-heading" id="starcraft"><a href="https://github.com/deepmind/pysc2" target="_blank" rel="noreferrer noopener nofollow">StarCraft II Learning Environment</a></h3>



<p><em>PySC2</em> provides an interface for RL agents to interact with StarCraft 2, getting observations and sending actions. It exposes Blizzard Entertainment&#8217;s StarCraft II Machine Learning API as a Python RL Environment. This is a collaboration between DeepMind and Blizzard to develop StarCraft II into a rich environment for RL research. <em>PySC2</em> has many pre-configured mini-game maps for benchmarking the RL agents.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-t">T</h2>



<h3 class="wp-block-heading" id="unity"><a href="https://github.com/Unity-Technologies/ml-agents" target="_blank" rel="noreferrer noopener nofollow">The Unity Machine Learning Agents Toolkit (ML-Agents)</a></h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-benchmarks-Unity.png?ssl=1" alt="RL benchmarks - Unity" class="wp-image-44470"/></figure>
</div>


<p>It’s an open-source project that enables games and simulations to serve as environments for training intelligent agents. Unity provides implementations (based on PyTorch) of state-of-the-art algorithms to enable game developers and hobbyists to easily train intelligent agents for 2D, 3D, and VR/AR games. Researchers, however, can use the provided simple-to-use Python API to train Agents using reinforcement learning, imitation learning, neuroevolution, or any other methods! See for example <a href="https://github.com/Unity-Technologies/marathon-envs" target="_blank" rel="noreferrer noopener nofollow">Marathon Environments</a>.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-w">W</h2>



<h3 class="wp-block-heading" id="wordcraft"><a href="https://github.com/minqi/wordcraft" target="_blank" rel="noreferrer noopener nofollow">WordCraft</a></h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-benchmarks-Wordcraft.png?ssl=1" alt="RL benchmarks - Wordcraft" class="wp-image-44471"/></figure>
</div>


<p>This is the official Python implementation of <a href="https://larel-ws.github.io/assets/pdfs/wordcraft_an_environment_for_benchmarking_commonsense_agents.pdf" target="_blank" rel="noreferrer noopener nofollow">WordCraft: An Environment for Benchmarking Commonsense Agents</a>. The ability to quickly solve a wide range of real-world tasks requires a commonsense understanding of the world. To better enable research on agents making use of commonsense knowledge you should try WordCraft, an RL environment based on <a href="https://littlealchemy2.com" target="_blank" rel="noreferrer noopener nofollow">Little Alchemy 2</a>. Little Alchemy 2 is a fun and addictive game which allows players to combine elements to create even more elements. This lightweight environment is fast to run and built upon entities and relations inspired by real-world semantics.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-conclusion">Conclusion</h2>



<p>This concludes our list of RL benchmarks. I can’t really tell you which one you should pick. For some, the more classic benchmarks like OpenAI Gym or DM Control Suite described in the “Rule of thumb” section will be the best fit. For others, those won’t be enough, and they might want to jump into something less well-worn, like the Unity ML-Agents or Screeps.</p>



<p>Personally, I worked with Google Research Football (GRF) on one occasion, and it was fun to see how my agents learned to play football and score goals. At the moment, I work on more fundamental research and test my agents using the well-recognized OpenAI Gym MuJoCo environments, which is fun in other ways – like seeing that my method really works.</p>



<p>Whatever your choice, I hope this list helps you make your RL research more exciting!</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">4697</post-id>	</item>
		<item>
		<title>Markov Decision Process in Reinforcement Learning: Everything You Need to Know</title>
		<link>https://neptune.ai/blog/markov-decision-process-in-reinforcement-learning</link>
		
		<dc:creator><![CDATA[Andre Ye]]></dc:creator>
		<pubDate>Thu, 21 Jul 2022 13:22:57 +0000</pubDate>
				<category><![CDATA[ML Model Development]]></category>
		<category><![CDATA[Reinforcement Learning]]></category>
		<guid isPermaLink="false">https://neptune.test/markov-decision-process-in-reinforcement-learning/</guid>

					<description><![CDATA[Take a moment to locate the nearest big city around you. If you were to go there, how would you do it? Go by car, take a bus, take a train? Maybe ride a bike, or buy an airplane ticket? Making this choice, you incorporate probability into your decision-making process. Perhaps there’s a 70% chance&#8230;]]></description>
										<content:encoded><![CDATA[
<p>Take a moment to locate the nearest big city around you. If you were to go there, how would you do it? Go by car, take a bus, take a train? Maybe ride a bike, or buy an airplane ticket?</p>



<p>Making this choice, you <strong>incorporate probability into your decision-making process</strong>. Perhaps there’s a 70% chance of rain or a car crash, which can cause traffic jams. If your bike tire is old, it may break down &#8211; this is certainly a large probabilistic factor.&nbsp;</p>



<p>On the other hand, there are <strong>deterministic costs</strong> &#8211; for instance, the cost of gas or an airplane ticket &#8211; as well as deterministic rewards &#8211; like much faster travel times taking an airplane.</p>



<p>These types of problems &#8211; in which an agent must balance probabilistic and deterministic rewards and costs &#8211; are common in decision-making. <a href="https://towardsdatascience.com/introduction-to-reinforcement-learning-markov-decision-process-44c533ebf8da" target="_blank" rel="noreferrer noopener nofollow">Markov Decision Processes</a> are used to model these types of optimization problems, and can also be applied to more complex tasks in Reinforcement Learning.</p>



<div id="separator-block_5d6e95510d6e0ea64d9056d82ca33484"
         class="block-separator block-separator--5">
</div>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-defining-markov-decision-processes-in-machine-learning">Defining Markov Decision Processes in Machine Learning</h2>



<p>To illustrate a Markov Decision process, think about a dice game:</p>



<ul class="wp-block-list">
<li>Each round, you can either <strong>continue</strong> or <strong>quit</strong>.</li>



<li>If you <strong>quit</strong>, you receive $5 and the game ends.</li>



<li>If you <strong>continue</strong>, you receive $3 and roll a 6-sided die. If the die comes up as 1 or 2, the game ends. Otherwise, the game continues onto the next round.</li>
</ul>



<p>There is a clear trade-off here: we can give up a guaranteed extra $2 (quitting for $5 instead of continuing for $3) in exchange for the chance to roll the die and continue to the next round.</p>



<p>To create an MDP to model this game, first we need to define a few things:</p>



<ul class="wp-block-list">
<li>A <em>state </em>is a status that the agent (decision-maker) can hold. In the dice game, the agent can either be <em>in the game</em> or <em>out of the game</em>.&nbsp;</li>



<li>An <em>action</em> is a movement the agent can choose. It moves the agent between states, with certain penalties or rewards.</li>



<li><em>Transition probabilities</em> describe the probability of ending up in a state s’ (s prime) given an action <em>a</em>. These are often denoted as a function <em>P</em>(<em>s</em>, <em>a</em>, <em>s</em>’) that outputs the probability of ending up in <em>s’</em> given current state <em>s</em> and action <em>a</em>.<br>For example, <em>P</em>(<em>s</em>=playing the game, <em>a</em>=choose to continue playing, <em>s’</em>=not playing the game) is ⅓, since there is a two-sixths (one-third) chance of losing the dice roll.</li>



<li><em>Rewards</em> are given depending on the action. The reward for continuing the game is $3, whereas the reward for quitting is $5. The ‘overall’ reward is to be optimized.</li>
</ul>



<p>We can formally describe a Markov Decision Process as <em>m</em> = (<em>S</em>, <em>A,</em> <em>P</em>, <em>R</em>, gamma), where:</p>



<ul class="wp-block-list">
<li><em>S</em> represents the set of all states.</li>



<li><em>A</em> represents the set of possible actions.</li>



<li><em>P</em> represents the transition probabilities.</li>



<li><em>R</em> represents the rewards.</li>



<li>Gamma is known as the discount factor (more on this later).</li>
</ul>



<p>The goal of the MDP <em>m</em> is to find a policy, often denoted as pi, that yields the optimal long-term reward. Policies are simply a mapping of each state <em>s</em> to a distribution of actions <em>a</em>. For each state <em>s</em>, the agent should take action <em>a</em> with a certain probability. Alternatively, policies can also be deterministic (i.e. the agent <em>will</em> take action <em>a</em> in state <em>s</em>).</p>
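

<p>To make this concrete, the dice game above can be written down as a handful of plain Python dictionaries. This is just one possible encoding (the state and action names are arbitrary labels of mine, not from any library):</p>



<pre class="wp-block-code"><code># The dice-game MDP written out explicitly.
# Numbers follow the description above; the state/action names are arbitrary labels.
states = ["in", "out"]                          # "in" = still playing, "out" = game over
actions = {"in": ["stay", "quit"], "out": []}

# P[(s, a)] maps next states to their probabilities.
P = {
    ("in", "quit"): {"out": 1.0},
    ("in", "stay"): {"in": 2 / 3, "out": 1 / 3},   # the die shows 1 or 2 with prob 1/3
}

# R[(s, a)] is the immediate reward for taking action a in state s.
R = {
    ("in", "quit"): 5.0,
    ("in", "stay"): 3.0,
}

gamma = 1.0   # discount factor; the worked example later in the article also uses 1</code></pre>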



<p>Our Markov Decision Process would look like the graph below. An agent traverses the graph’s two states by making decisions and following probabilities.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Markov-Decision-Process.png?ssl=1" alt="Markov Decision Process" class="wp-image-31228"/></figure>
</div>


<p>It’s important to mention the <em>Markov Property</em>, which applies not only to Markov Decision Processes but anything Markov-related (like a Markov Chain).&nbsp;</p>



<p><strong>It states that the next state can be determined solely by the current state &#8211; no ‘memory’ is necessary.</strong> This applies to how the agent traverses the Markov Decision Process, but note that optimization methods use previous learning to fine tune policies. This is not a violation of the Markov property, which only applies to the <em>traversal</em> of an MDP.</p>



<div id="separator-block_5d6e95510d6e0ea64d9056d82ca33484"
         class="block-separator block-separator--5">
</div>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-the-bellman-equation-dynamic-programming">The Bellman equation &amp; dynamic programming</h2>



<p>The Bellman Equation is central to Markov Decision Processes. It outlines a framework for determining the optimal expected reward at a state <em>s</em> by answering the question: “what is the maximum reward an agent can receive if they make the optimal action now and for all future decisions?”</p>



<p>Although versions of the Bellman Equation can become fairly complicated, fundamentally most of them can be boiled down to this form:</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Bellman-Equation.png?ssl=1" alt="Bellman Equation" class="wp-image-31230"/></figure>
</div>


<p>It is a relatively common-sense idea, put into formulaic terms. Notice the role gamma &#8211; which is between 0 and 1 (inclusive) &#8211; plays in determining the optimal reward. If gamma is set to 0, the V(s’) term is completely canceled out and the model only cares about the immediate reward.&nbsp;</p>



<p>On the other hand, if gamma is set to 1, the model weights potential future rewards just as much as it weights immediate rewards. The optimal value of gamma is usually somewhere between 0 and 1, such that the value of farther-out rewards has diminishing effects.</p>



<p>Let’s use the Bellman equation to determine how much money we could receive in the dice game. We can choose between two choices, so our expanded equation will look like max(choice 1’s reward, choice 2’s reward).&nbsp;</p>



<p>Choice 1 &#8211; quitting &#8211; yields a reward of $5.&nbsp;</p>



<p>On the other hand, choice 2 yields a reward of $3, plus a two-thirds chance of continuing to the next stage, in which the decision can be made again (we are calculating the expected return). We add a discount factor gamma in front of terms involving the value of s’ (the next state).</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Markov-decision-equation.png?resize=512%2C104&#038;ssl=1" alt="Markov decision equation" class="wp-image-31232" style="width:512px;height:104px" width="512" height="104"/></figure>
</div>


<p>This equation is recursive, but it will inevitably converge to one value, given that each successive term is multiplied by ⅔ (the probability of continuing), even with the maximum gamma of 1.&nbsp;</p>



<p>At some point, it will not be profitable to continue staying in the game. Let’s calculate four iterations of this, with a gamma of 1, to keep things simple and to calculate the total long-term optimal reward.</p>



<p>At each step, we can either quit and receive an extra $5 in expected value, or stay and receive an extra $3 in expected value. Each new round, the expected value of staying is multiplied by two-thirds, since even if the agent chooses to stay, there is only a two-thirds probability of the game continuing.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Markov-Decision-Process-graph.png?resize=768%2C428&#038;ssl=1" alt="Markov Decision Process graph" class="wp-image-31234" style="width:768px;height:428px" width="768" height="428"/></figure>
</div>


<p>Here, the decimal values are computed, and we find that (with our current number of iterations) we can expect to get $7.8 if we follow the best choices.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Markov-Decision-Process-graph-2.png?resize=768%2C424&#038;ssl=1" alt="Markov Decision Process graph" class="wp-image-31235" style="width:768px;height:424px" width="768" height="424"/></figure>
</div>


<p>Here, we calculated the best profit manually, so our result is only an approximation: we terminated the calculation after only four rounds.&nbsp;</p>



<p>If we were to continue computing expected values for several dozen more rows, we would find that the optimal value is actually higher. In order to compute this efficiently with a program, you would need to use a specialized data structure.&nbsp;</p>



<p>Plus, in order to be efficient, we don’t want to calculate each expected value independently, but in relation to previous ones. The solution: Dynamic Programming.</p>



<p>Richard Bellman, of the Bellman Equation, coined the term Dynamic Programming, and it’s used to compute problems that can be broken down into subproblems. For example, the expected value for choosing Stay &gt; Stay &gt; Stay &gt; Quit can be found by calculating the value of Stay &gt; Stay &gt; Stay first.&nbsp;</p>



<p>These pre-computations would be stored in a two-dimensional array, where the row represents either the state [In] or [Out], and the column represents the iteration.&nbsp;We can write rules that relate each cell in the table to a previously precomputed cell (this diagram doesn’t include gamma).</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Markov-Decision-Process-diagram.png?resize=512%2C193&#038;ssl=1" alt="Markov Decision Process diagram" class="wp-image-31239" style="width:512px;height:193px" width="512" height="193"/></figure>
</div>


<p>Then, the solution is simply the largest value in the array after computing enough iterations. Through dynamic programming, computing the expected value &#8211; a key component of Markov Decision Processes and methods like Q-Learning &#8211; becomes efficient.</p>
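


<p>As a condensed sketch of this idea &#8211; assuming the dice game above, where quitting pays $5 and staying pays $3 with a two-thirds chance of playing another round &#8211; we can apply the Bellman update repeatedly and reuse the previously computed value instead of re-expanding the whole decision tree:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"># Value iteration for the dice game: V = max(quit, stay)
# quit: reward of 5, game ends
# stay: reward of 3, then a 2/3 chance of being back in the game
gamma = 1.0
V = 0.0                        # value of being in the game, initially unknown
for _ in range(50):
    V = max(5, 3 + gamma * (2/3) * V)
print(round(V, 2))             # prints 9.0
</pre>



<p>With gamma = 1, the value settles at $9, which is why the $7.8 we computed by hand after only four rounds is an underestimate.</p>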



<div id="separator-block_5d6e95510d6e0ea64d9056d82ca33484"
         class="block-separator block-separator--5">
</div>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-q-learning-markov-decision-process-reinforcement-learning">Q-learning: Markov Decision Process + Reinforcement Learning</h2>



<p>Let’s think about a different simple game, in which the agent (the circle) must navigate a grid in order to maximize the rewards for a given number of iterations.</p>



<p>There are seven types of blocks:&nbsp;</p>



<ul class="wp-block-list">
<li>-2 punishment,</li>



<li>-5 punishment,&nbsp;</li>



<li>-1 punishment,&nbsp;</li>



<li>+1 reward,&nbsp;</li>



<li>+10 reward,&nbsp;</li>



<li>block that moves the agent to space A1 or B3 with equal probability,</li>



<li>empty blocks.&nbsp;</li>
</ul>



<p>Note that this is an <a href="https://www.mathworks.com/help/reinforcement-learning/ug/train-reinforcement-learning-agent-in-mdp-environment.html" target="_blank" rel="noreferrer noopener nofollow">MDP </a>in grid form &#8211; there are 9 states, and each connects to the states around it. The game terminates if the agent accumulates a punishment of -5 or less, or a reward of 5 or more.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/MDP-grid.png?resize=512%2C237&#038;ssl=1" alt="MDP grid" class="wp-image-31240" style="width:512px;height:237px" width="512" height="237"/></figure>
</div>


<p>In Q-learning, we don’t know the transition probabilities &#8211; they aren’t explicitly defined in the model. Instead, the model must learn them, along with the reward landscape, by interacting with the environment.&nbsp;</p>



<p>This makes Q-learning suitable in scenarios where explicit probabilities and values are unknown. If they are known, then you might not need to use Q-learning.&nbsp;</p>



<p>In our game, we know the probabilities, rewards, and penalties because we are strictly defining them. But if, say, we are training a robot to navigate a complex landscape, we wouldn’t be able to hard-code the rules of physics; using Q-learning or another reinforcement learning method would be appropriate.</p>



<p>Each step of the way, the model will update its learnings in a Q-table. The table below, which stores possible state-action pairs, reflects <em>current</em> known information about the system, which will be used to drive future decisions.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Q-table.png?ssl=1" alt="Q table" class="wp-image-31242"/></figure>
</div>


<p>Each of the cells contains a Q-value, which represents the expected value of the system given that the current action is taken. (Does this sound familiar? It should &#8211; this is the Bellman Equation again!)<strong>&nbsp;</strong></p>



<p>All values in the table begin at 0 and are updated iteratively. Note that there is no state for A3 because the agent cannot control their movement from that point.</p>



<p>To update the Q-table, the agent begins by choosing an action. It cannot move up or down, but if it moves right, it suffers a penalty of -5, and the game terminates. The Q-table can be updated accordingly.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Q-table-2.png?ssl=1" alt="Q table" class="wp-image-31244"/></figure>
</div>


<p>When the agent traverses the environment for the second time, it considers its options. Given the current Q-table, it can either move right or down. Moving right yields a loss of -5, compared to moving down, currently set at 0.&nbsp;</p>



<p>For the sake of simulation, let’s imagine that the agent travels along the path indicated below, and ends up at C1, terminating the game with a reward of 10. We can then fill in the reward that the agent received for each action they took along the way.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Q-table-3.png?ssl=1" alt="Q table" class="wp-image-31245"/></figure>
</div>


<p>Obviously, this Q-table is incomplete. Even if the agent moves down from A1 to A2, there is no guarantee that it will receive a reward of 10. After enough iterations, the agent should have traversed the environment to the point where values in the Q-table tell us the best and worst decisions to make at every location.&nbsp;</p>



<p>This example is a simplification of how Q-values are actually updated, which involves the Bellman Equation discussed above. In practice, each update blends the newly observed reward with the value already stored in the Q-table (weighted by a learning rate), while gamma discounts the contribution of estimated future rewards, so more recent and accurate information gradually overrides older estimates.</p>
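


<p>For reference, here is a minimal sketch of the standard Q-learning update rule. The state names, the action set, and the values of the learning rate alpha and the discount gamma are placeholders for illustration; they are not the exact grid or tables pictured above:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"># Q-learning update: Q(s, a) is nudged towards  r + gamma * max over a' of Q(s', a')
alpha, gamma = 0.1, 0.9        # learning rate and discount factor (illustrative values)
Q = {}                         # Q-table: (state, action) pairs map to values, defaulting to 0

def update(state, action, reward, next_state, actions=("up", "down", "left", "right")):
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)

update("A1", "down", 0, "A2")      # a neutral move
update("A2", "right", 10, "B2")    # a move that reaches the +10 reward
print(Q[("A2", "right")])          # 1.0 = 0 + 0.1 * (10 + 0.9*0 - 0)
# Repeated sweeps propagate the +10 back towards earlier states such as A1.
</pre>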



<p>It’s important to note the <strong>exploration vs exploitation trade-off</strong> here. If the agent traverses the correct path towards the goal but ends up, for some reason, at an unlucky penalty, it will record that negative value in the Q-table and associate every move it took with this penalty.&nbsp;</p>



<p>If the agent is purely ‘exploitative’ &#8211; it always seeks to maximize direct immediate gain &#8211; it may never dare to take a step in the direction of that path.&nbsp;</p>



<p>Alternatively, if an agent finds a path to a small reward, a purely exploitative agent will simply follow that path every time and ignore any other path, even one that leads to a much larger reward.&nbsp;</p>



<p>By allowing the agent to ‘explore’ more, it can focus less on choosing the optimal path and more on collecting information. This is usually implemented by injecting some randomness into the agent’s decision process.&nbsp;</p>



<p>However, a purely ‘explorative’ agent is also useless and inefficient &#8211; it will take paths that clearly lead to large penalties and can take up valuable computing time.&nbsp;</p>



<p>It’s good practice to incorporate an intermediate mix of randomness, such that the agent bases its decisions on previous discoveries, but still has opportunities to try less-explored paths.</p>
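


<p>A common way to implement this intermediate mix is an <em>epsilon-greedy</em> rule: with a small probability the agent picks a random action, and otherwise it picks the action with the highest current Q-value. A minimal sketch (the value of epsilon below is arbitrary):</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    # With probability epsilon, explore: pick a random action.
    if random.random() &lt; epsilon:
        return random.choice(actions)
    # Otherwise exploit: pick the action with the highest known Q-value.
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
</pre>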



<p>A more sophisticated way of incorporating the exploration-exploitation trade-off is <em>simulated annealing</em> &#8211; a name borrowed from metallurgy, where annealing is the controlled heating and cooling of metals.&nbsp;</p>



<p>Instead of fixing a constant that determines how explorative or exploitative the agent is, simulated annealing starts with heavy exploration and becomes more exploitative over time, as the agent gathers more information.&nbsp;</p>



<p>This method has shown enormous success in discrete problems like the Travelling Salesman Problem, so it also applies well to Markov Decision Processes.&nbsp;</p>



<p>Because simulated annealing begins with high exploration, it can broadly gauge which solutions are promising and which are not. As it becomes more exploitative, it directs its attention towards the promising regions, eventually closing in on the best solution it has found in a computationally efficient way.</p>
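


<p>A much simpler cousin of this idea is to let the exploration rate decay over the course of training &#8211; not full simulated annealing, but the same spirit of starting out explorative and ending up exploitative. A small sketch with arbitrary schedule values:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">def exploration_rate(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    # Linearly anneal from heavy exploration to mostly exploitation.
    fraction = min(step / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)

print(exploration_rate(0))         # 1.0: explore almost every move
print(exploration_rate(5_000))     # 0.525: halfway through the schedule
print(exploration_rate(20_000))    # 0.05: mostly exploit what has been learned
</pre>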



<div id="separator-block_5d6e95510d6e0ea64d9056d82ca33484"
         class="block-separator block-separator--5">
</div>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-summary">Summary</h2>



<p>Let’s wrap up what we explored in this article:&nbsp;</p>



<p>A Markov Decision Process (MDP) is used to model decisions that can have both probabilistic and deterministic rewards and punishments.</p>



<p>MDPs have five core elements:&nbsp;</p>



<ul class="wp-block-list">
<li>S, a set of possible states for an agent to be in,&nbsp;</li>



<li>A, a set of possible actions an agent can take at a particular state,</li>



<li>R, the rewards for making an action A at state S,&nbsp;</li>



<li>P, the probabilities for transitioning to a new state S’ after taking action A at original state S,&nbsp;</li>



<li>gamma, which controls how far-looking the Markov Decision Process agent will be.</li>
</ul>



<p>All Markov Processes, including MDPs, must follow the <em>Markov Property</em>, which states that the next state depends only on the current state, not on the sequence of states that preceded it.</p>



<p>The Bellman Equation determines the maximum reward an agent can receive if they make the optimal decision at the current state and at all following states. It defines the value of the current state recursively as being the maximum possible value of the current state reward, plus the value of the next state.</p>



<p>Dynamic programming utilizes a grid structure to store previously computed values and builds upon them to compute new values. It can be used to efficiently calculate the value of a policy and to solve not only Markov Decision Processes, but many other recursive problems.</p>



<p>Q-Learning is the learning of Q-values in an environment, which often resembles a Markov Decision Process. It is suitable in cases where the specific probabilities, rewards, and penalties are not completely known, as the agent traverses the environment repeatedly to learn the best strategy by itself.</p>



<p>Hope you enjoyed exploring these topics with me. Thank you for reading!</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">3025</post-id>	</item>
		<item>
		<title>Model-Based and Model-Free Reinforcement Learning: Pytennis Case Study</title>
		<link>https://neptune.ai/blog/model-based-and-model-free-reinforcement-learning-pytennis-case-study</link>
		
		<dc:creator><![CDATA[Elisha Odemakinde]]></dc:creator>
		<pubDate>Thu, 21 Jul 2022 13:21:34 +0000</pubDate>
				<category><![CDATA[ML Model Development]]></category>
		<category><![CDATA[Reinforcement Learning]]></category>
		<guid isPermaLink="false">https://neptune.test/model-based-and-model-free-reinforcement-learning-pytennis-case-study/</guid>

					<description><![CDATA[Reinforcement learning is a field of Artificial Intelligence in which you build an intelligent system that learns from its environment through interaction and evaluates what it learns in real-time.&#160; A good example of this is self-driving cars, or when DeepMind built what we know today as AlphaGo, AlphaStar, and AlphaZero.&#160; AlphaZero is a program built&#8230;]]></description>
										<content:encoded><![CDATA[
<p>Reinforcement learning is a field of Artificial Intelligence in which you build an intelligent system that learns from its environment through interaction and evaluates what it learns in real-time.&nbsp;</p>



<p>A good example of this is self-driving cars, or when DeepMind built what we know today as AlphaGo, AlphaStar, and AlphaZero.&nbsp;</p>



<p>AlphaZero is a program built to master the games of chess, shogi and go (AlphaGo is the first program that beat a human Go master). AlphaStar plays the video game StarCraft II.</p>



<p>In this article, we’ll compare model-free vs model-based reinforcement learning. Along the way, we will explore:</p>



<ol class="wp-block-list">
<li>Fundamental concepts of Reinforcement Learning<br>a) Markov decision processes / Q-Value / Q-Learning / Deep Q Network</li>



<li>Difference between model-based and model-free reinforcement learning.</li>



<li>Discrete mathematical approach to playing tennis &#8211; model-free reinforcement learning.</li>



<li>Tennis game using Deep Q Network &#8211; model-based reinforcement learning.</li>



<li>Comparison/Evaluation</li>



<li>References to learn more</li>
</ol>



<section id="blog-intext-cta-block_60039cf3b10bb0ecbb2c4d588a4f117b" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" class="block-blog-intext-cta__header" id="h-see-related-articles">SEE RELATED ARTICLES</h3>
    
            <p>  <a href="/blog/best-reinforcement-learning-tutorials-examples-projects-and-courses" target="_blank" rel="noreferrer noopener">7 Applications of Reinforcement Learning in Finance and Trading</a><br />
  <a href="/blog/reinforcement-learning-applications" target="_blank" rel="noreferrer noopener">10 Real-Life Applications of Reinforcement Learning</a><br />
  <a href="/blog/best-reinforcement-learning-tutorials-examples-projects-and-courses" target="_blank" rel="noreferrer noopener">Best Reinforcement Learning Tutorials, Examples, Projects, and Courses</a></p>
    
    </section>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-fundamental-concepts-of-reinforcement-learning">Fundamental concepts of Reinforcement Learning</h2>



<p>Any reinforcement learning problem includes the following elements:</p>



<ol class="wp-block-list">
<li><strong>Agent</strong> &#8211; the program controlling the object of concern (for instance, a robot).</li>



<li><strong>Environment</strong> &#8211; this defines the outside world programmatically. Everything the agent(s) interacts with is part of the environment. It’s built for the agent to make it seem like a real-world case. It’s needed to assess the performance of an agent, meaning whether it will do well once implemented in a real-world application.</li>



<li><strong>Rewards</strong> &#8211; this gives us a score of how the algorithm performs with respect to the environment. In the simplest case, it’s represented as 1 or 0: ‘1’ means the policy network made the right move, ‘0’ means a wrong move. In other words, rewards represent gains and losses.</li>



<li><strong>Policy</strong> &#8211; the algorithm used by the agent to decide its actions. This is the part that can be model-based or model-free.</li>
</ol>



<p>Every problem that needs an RL solution starts with simulating an environment for the agent. Next, you build a policy network that guides the actions of the agent. The agent can then evaluate the policy based on whether its corresponding action resulted in a gain or a loss.</p>



<p>The policy is our main discussion point for this article. Policy can be model-based or model-free. When building, our concern is how to optimize the policy network via policy gradient (PG).&nbsp;</p>



<p>PG algorithms directly try to optimize the policy to increase rewards. To understand these algorithms, we must take a look at Markov decision processes (MDP).</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-markov-decision-processes-q-value-q-learning-deep-q-network">Markov decision processes / Q-Value / Q-Learning / Deep Q Network</h3>



<p>MDP is a process with a fixed number of states, and it randomly evolves from one state to another at each step. The probability for it to evolve from state A to state B is fixed.</p>



<p>A lot of Reinforcement Learning problems with discrete actions are modeled as <strong>Markov decision processes</strong>, with the agent having no initial clue on the next transition state. The agent also has no idea on the rewarding principle, so it has to explore all possible states to begin to decode how to adjust to a perfect rewarding system. This will lead us to what we call Q Learning.</p>



<p>The <strong>Q-Learning algorithm</strong> is adapted from the <strong>Q-Value</strong> Iteration algorithm, for situations where the agent has no prior knowledge of preferred states and rewarding principles. Q-Values can be defined as estimates of the optimal value of taking a given action in a given state of an MDP.&nbsp;</p>



<p>It is often said that Q-Learning doesn’t scale well to large (or even medium) MDPs with many states and actions. The solution is to approximate the Q-Value of any state-action pair (s,a). This is called Approximate Q-Learning.&nbsp;</p>



<p>DeepMind proposed the use of deep neural networks, which work much better, especially for complex problems &#8211; without the use of any feature engineering. A deep neural network used to estimate Q-Values is called a <strong>deep Q-network (DQN).</strong> Using DQN for approximated Q-learning is called Deep Q-Learning.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-difference-between-model-based-and-model-free-reinforcement-learning">Difference between model-based and model-free Reinforcement Learning</h2>



<p>RL algorithms can be mainly divided into two categories &#8211; <strong>model-based and model-free</strong>. </p>



<p><strong>Model-based</strong>, as it sounds, has an agent trying to understand its environment and creating a model for it based on its interactions with this environment. In such a system, preferences take priority over the consequences of the actions, i.e., the greedy agent will always try to perform an action that gets the maximum reward irrespective of what that action may cause.</p>



<p>On the other hand, <strong>model-free algorithms</strong> seek to learn the consequences of their actions through experience via algorithms such as Policy Gradient, Q-Learning, etc. In other words, such an algorithm will carry out an action multiple times and will adjust the policy (the strategy behind its actions) for optimal rewards, based on the outcomes.</p>



<p>Think of it this way: if the agent can predict the reward for some action before actually performing it &#8211; thereby planning what it should do &#8211; the algorithm is model-based. If it actually needs to carry out the action to see what happens and learn from it, it is model-free.</p>



<p>This results in different applications for these two classes. For example, a model-based approach may be the perfect fit for playing chess or for a robotic arm in the assembly line of a product, where the environment is static and getting the task done most efficiently is our main concern. However, in the case of real-world applications such as self-driving cars, a model-based approach might prompt the car to run over a pedestrian to reach its destination in less time (maximum reward), but a model-free approach would make the car wait till the road is clear (the optimal way out).</p>



<p>To better understand this, we will explain everything with an example. In the example, <strong>we’ll build model-free and model-based RL for tennis games</strong>. To build the model, we need an environment for the policy to get implemented. However, we won’t build the environment in this article; we’ll import one to use for our program.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-pytennis-environment">Pytennis environment</h2>



<p>We’ll use the Pytennis environment to build a model-free and model-based RL system.</p>



<p>A tennis game requires the following:</p>



<ol class="wp-block-list">
<li>2 players which implies 2 agents.</li>



<li>A tennis lawn &#8211; main environment.</li>



<li>A single tennis ball.</li>



<li>Movement of the agents left-right (or right-left direction).&nbsp;</li>
</ol>



<p>The Pytennis environment specifications are:</p>



<ol class="wp-block-list">
<li>There are 2 agents (2 players) with a ball.</li>



<li>There’s a tennis field of dimension (x, y) &#8211; (300, 500)</li>



<li>The ball moves in a straight line: agent A decides a target point between x1 (0) and x2 (300) on side B (Agent B’s side), and the ball is then displayed at 50 intermediate positions at an FPS of 20, so it travels in a straight line from source to destination. The same applies to agent B.</li>



<li>Movement of Agent A and Agent B is bound between x1 = 100 and x2 = 600.</li>



<li>Movement of the ball is bound along the y-axis (y1 = 100 to y2 = 600).</li>



<li>Movement of the ball is bound along the x-axis (x1 = 100 to x2 = 600).</li>
</ol>



<p>Pytennis is an environment that mimics real-life tennis situations. As shown below, the image on the left is a <a href="https://youtu.be/iUYxZ2tYKHw" target="_blank" rel="noreferrer noopener nofollow">model-free</a> Pytennis game, and the one on the right is <a href="https://youtu.be/FCwGNRiq9SY" target="_blank" rel="noreferrer noopener nofollow">model-based</a>.&nbsp;</p>



<div class="wp-block-columns are-vertically-aligned-center is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow">
<figure class="wp-block-image size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/pytennis-model-free-1.png?ssl=1" alt="pytennis model free" class="wp-image-31095"/></figure>
</div>



<div class="wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow">
<figure class="wp-block-image size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/pytennis-model-based-1.png?ssl=1" alt="pytennis model based" class="wp-image-31098"/></figure>
</div>
</div>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-discrete-mathematical-approach-to-playing-tennis-model-free-reinforcement-learning">Discrete mathematical approach to playing tennis &#8211; model-free Reinforcement Learning</h2>



<p>Why “discrete mathematical approach to playing tennis”? Because this method is a logical implementation of the Pytennis environment.&nbsp;</p>



<p>The code below shows us the implementation of the ball movement on the lawn. You can find the source code <a href="https://github.com/elishatofunmi/pytennis-Discrete-Mathematics-Approach-" target="_blank" rel="noreferrer noopener nofollow">here</a>.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> time
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> numpy <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> np
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> pygame
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> sys
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#import seaborn as sns</span>

<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> pygame.locals <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> *
pygame.init()


<span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">Network</span>:</span>
   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">__init__</span><span class="hljs-params">(self, xmin, xmax, ymin, ymax)</span>:</span>
       <span class="hljs-string" style="color: rgb(221, 17, 68);">"""
       xmin: 150,
       xmax: 450,
       ymin: 100,
       ymax: 600
       """</span>

       self.StaticDiscipline = {
           <span class="hljs-string" style="color: rgb(221, 17, 68);">'xmin'</span>: xmin,
           <span class="hljs-string" style="color: rgb(221, 17, 68);">'xmax'</span>: xmax,
           <span class="hljs-string" style="color: rgb(221, 17, 68);">'ymin'</span>: ymin,
           <span class="hljs-string" style="color: rgb(221, 17, 68);">'ymax'</span>: ymax
       }

   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">network</span><span class="hljs-params">(self, xsource, ysource=<span class="hljs-number" style="color: teal;">100</span>, Ynew=<span class="hljs-number" style="color: teal;">600</span>, divisor=<span class="hljs-number" style="color: teal;">50</span>)</span>:</span>  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># ysource will always be 100</span>
       <span class="hljs-string" style="color: rgb(221, 17, 68);">"""
       For Network A
       ysource: will always be 100
       xsource: will always be between xmin and xmax (static discipline)
       For Network B
       ysource: will always be 600
       xsource: will always be between xmin and xmax (static discipline)
       """</span>

       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">while</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>:
           ListOfXsourceYSource = []
           Xnew = np.random.choice([i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(
               self.StaticDiscipline[<span class="hljs-string" style="color: rgb(221, 17, 68);">'xmin'</span>], self.StaticDiscipline[<span class="hljs-string" style="color: rgb(221, 17, 68);">'xmax'</span>])], <span class="hljs-number" style="color: teal;">1</span>)
           <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Ynew = np.random.choice([i for i in range(self.StaticDiscipline['ymin'], self.StaticDiscipline['ymax'])], 1)</span>

           source = (xsource, ysource)
           target = (Xnew[<span class="hljs-number" style="color: teal;">0</span>], Ynew)

           <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Slope and intercept</span>
           slope = (ysource - Ynew)/(xsource - Xnew[<span class="hljs-number" style="color: teal;">0</span>])
           intercept = ysource - (slope*xsource)
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> (slope != np.inf) <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">and</span> (intercept != np.inf):
               <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">break</span>
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
               <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">continue</span>

       <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#print(source, target)</span>
       <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># randomly select 50 new values along the slope between xsource and xnew (monotonically decreasing/increasing)</span>
       XNewList = [xsource]

       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> xsource &lt; Xnew:
           differences = Xnew[<span class="hljs-number" style="color: teal;">0</span>] - xsource
           increment = differences / divisor
           newXval = xsource
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(divisor):

               newXval += increment
               XNewList.append(int(newXval))
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
           differences = xsource - Xnew[<span class="hljs-number" style="color: teal;">0</span>]
           decrement = differences / divisor
           newXval = xsource
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(divisor):

               newXval -= decrement
               XNewList.append(int(newXval))

       <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># determine the values of y, from the new values of x, using y= mx + c</span>
       yNewList = []
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> XNewList:
           findy = (slope * i) + intercept  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># y = mx + c</span>
           yNewList.append(int(findy))

       ListOfXsourceYSource = [(x, y) <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> x, y <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> zip(XNewList, yNewList)]

       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> XNewList, yNewList
</pre>



<p>Here is how this works once the networks are initialized (Network A for Agent A and Network B for Agent B):</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Testing</span>
net = Network(<span class="hljs-number" style="color: teal;">150</span>, <span class="hljs-number" style="color: teal;">450</span>, <span class="hljs-number" style="color: teal;">100</span>, <span class="hljs-number" style="color: teal;">600</span>)
NetworkA = net.network(<span class="hljs-number" style="color: teal;">300</span>, ysource=<span class="hljs-number" style="color: teal;">100</span>, Ynew=<span class="hljs-number" style="color: teal;">600</span>)  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Network A</span>
NetworkB = net.network(<span class="hljs-number" style="color: teal;">200</span>, ysource=<span class="hljs-number" style="color: teal;">600</span>, Ynew=<span class="hljs-number" style="color: teal;">100</span>)  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Network B</span>
</pre>



<p>Each network is bounded by the directions of ball movement. Network A represents Agent A, which defines the movement of the ball from Agent A to any position between 100 and 300 along the x-axis at Agent B. This also applies to Network B (Agent B).</p>



<p>When the network is started, the .network method discretely generates 50 y-points (between y1 = 100 and y2 = 600) and the corresponding x-points (between x1, the current location of the ball on Agent A’s side, and a randomly selected point x2 on Agent B’s side) for Network A. The same applies to Network B (Agent B).&nbsp;</p>



<p>To automate the movement of each agent, the opposing agent has to move in a corresponding direction with respect to the ball. This is done by setting the x position of the opposing agent to the x position of the ball, as in the code below.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">playerax = ballx <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#When Agent A plays.</span>

playerbx = ballx <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#When Agent B plays.</span></pre>



<p>Meanwhile, the source agent has to move back to its default position from its current position. The code below illustrates this.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">DefaultToPosition</span><span class="hljs-params">(x1, x2=<span class="hljs-number" style="color: teal;">300</span>, divisor=<span class="hljs-number" style="color: teal;">50</span>)</span>:</span>
   XNewList = []
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> x1 &lt; x2:
       differences = x2 - x1
       increment = differences / divisor
       newXval = x1
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(divisor):
           newXval += increment
           XNewList.append(int(np.floor(newXval)))

   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
       differences = x1 - x2
       decrement = differences / divisor
       newXval = x1
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(divisor):
           newXval -= decrement
           XNewList.append(int(np.floor(newXval)))
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> XNewList
</pre>



<p>Now, to make the agents play with each other recursively, this has to run in a loop. After every 50 counts (a 50-frame display of the ball), the opposing player becomes the next player. The code below puts all of it together in a loop.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">main</span><span class="hljs-params">()</span>:</span>
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">while</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>:
       display()
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> nextplayer == <span class="hljs-string" style="color: rgb(221, 17, 68);">'A'</span>:
           <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># playerA should play</span>
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> count == <span class="hljs-number" style="color: teal;">0</span>:
               <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#playerax = lastxcoordinate</span>
               NetworkA = net.network(
                   lastxcoordinate, ysource=<span class="hljs-number" style="color: teal;">100</span>, Ynew=<span class="hljs-number" style="color: teal;">600</span>)  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Network A</span>
               out = DefaultToPosition(lastxcoordinate)

               <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># update lastxcoordinate</span>

               bally = NetworkA[<span class="hljs-number" style="color: teal;">1</span>][count]
               playerax = ballx <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#When Agent A plays.</span>
               count += <span class="hljs-number" style="color: teal;">1</span>
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#                 soundObj = pygame.mixer.Sound('sound/sound.wav')</span>
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#                 soundObj.play()</span>
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#                 time.sleep(0.3)</span>
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#                 soundObj.stop()</span>
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
               ballx = NetworkA[<span class="hljs-number" style="color: teal;">0</span>][count]
               bally = NetworkA[<span class="hljs-number" style="color: teal;">1</span>][count]
               playerbx = ballx
               playerax = out[count]
               count += <span class="hljs-number" style="color: teal;">1</span>

           <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># let playerB play after 50 new coordinate of ball movement</span>
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> count == <span class="hljs-number" style="color: teal;">49</span>:
               count = <span class="hljs-number" style="color: teal;">0</span>
               nextplayer = <span class="hljs-string" style="color: rgb(221, 17, 68);">'B'</span>
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
               nextplayer = <span class="hljs-string" style="color: rgb(221, 17, 68);">'A'</span>

       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
           <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># playerB can play</span>
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> count == <span class="hljs-number" style="color: teal;">0</span>:
               <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#playerbx = lastxcoordinate</span>
               NetworkB = net.network(
                   lastxcoordinate, ysource=<span class="hljs-number" style="color: teal;">600</span>, Ynew=<span class="hljs-number" style="color: teal;">100</span>)  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Network B</span>
               out = DefaultToPosition(lastxcoordinate)

               <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># update lastxcoordinate</span>
               bally = NetworkB[<span class="hljs-number" style="color: teal;">1</span>][count]
               playerbx = ballx
               count += <span class="hljs-number" style="color: teal;">1</span>

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#                 soundObj = pygame.mixer.Sound('sound/sound.wav')</span>
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#                 soundObj.play()</span>
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#                 time.sleep(0.3)</span>
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#                 soundObj.stop()</span>
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
               ballx = NetworkB[<span class="hljs-number" style="color: teal;">0</span>][count]
               bally = NetworkB[<span class="hljs-number" style="color: teal;">1</span>][count]
               playerbx = out[count]
               playerax = ballx
               count += <span class="hljs-number" style="color: teal;">1</span>
           <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># update lastxcoordinate</span>

           <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># let playerA play after 50 new coordinate of ball movement</span>
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> count == <span class="hljs-number" style="color: teal;">49</span>:
               count = <span class="hljs-number" style="color: teal;">0</span>
               nextplayer = <span class="hljs-string" style="color: rgb(221, 17, 68);">'A'</span>
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
               nextplayer = <span class="hljs-string" style="color: rgb(221, 17, 68);">'B'</span>

       <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># CHECK BALL MOVEMENT</span>
       DISPLAYSURF.blit(PLAYERA, (playerax, <span class="hljs-number" style="color: teal;">50</span>))
       DISPLAYSURF.blit(PLAYERB, (playerbx, <span class="hljs-number" style="color: teal;">600</span>))
       DISPLAYSURF.blit(ball, (ballx, bally))

       <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># update last coordinate</span>
       lastxcoordinate = ballx

       pygame.display.update()
       fpsClock.tick(FPS)

       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> event <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> pygame.event.get():

           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> event.type == QUIT:
               pygame.quit()
               sys.exit()
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span></pre>



<p>And this is basic model-free reinforcement learning. It’s model-free because you need no form of learning or modelling for the 2 agents to play simultaneously and accurately.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-tennis-game-using-deep-q-network-model-based-reinforcement-learning">Tennis game using Deep Q Network &#8211; model-based Reinforcement Learning</h2>



<p>A typical example of model-based reinforcement learning is the Deep Q Network. Source code to this work is available <a href="https://github.com/elishatofunmi/pytennis-Deep-Q-Network-DQN-">here</a>.&nbsp;</p>



<p>The code below illustrates the Deep Q Network, which is the model architecture for this work.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> keras <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> Sequential, layers
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> keras.optimizers <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> Adam
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> keras.layers <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> Dense
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> collections <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> deque
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> numpy <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> np



<span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">DQN</span>:</span>
   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">__init__</span><span class="hljs-params">(self)</span>:</span>
       self.learning_rate = <span class="hljs-number" style="color: teal;">0.001</span>
       self.momentum = <span class="hljs-number" style="color: teal;">0.95</span>
       self.eps_min = <span class="hljs-number" style="color: teal;">0.1</span>
       self.eps_max = <span class="hljs-number" style="color: teal;">1.0</span>
       self.eps_decay_steps = <span class="hljs-number" style="color: teal;">2000000</span>
       self.replay_memory_size = <span class="hljs-number" style="color: teal;">500</span>
       self.replay_memory = deque([], maxlen=self.replay_memory_size)
       n_steps = <span class="hljs-number" style="color: teal;">4000000</span> <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># total number of training steps</span>
       self.training_start = <span class="hljs-number" style="color: teal;">10000</span> <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># start training after 10,000 game iterations</span>
       self.training_interval = <span class="hljs-number" style="color: teal;">4</span> <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># run a training step every 4 game iterations</span>
       self.save_steps = <span class="hljs-number" style="color: teal;">1000</span> <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># save the model every 1,000 training steps</span>
       self.copy_steps = <span class="hljs-number" style="color: teal;">10000</span> <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># copy online DQN to target DQN every 10,000 training steps</span>
       self.discount_rate = <span class="hljs-number" style="color: teal;">0.99</span>
       self.skip_start = <span class="hljs-number" style="color: teal;">90</span> <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Skip the start of every game (it's just waiting time).</span>
       self.batch_size = <span class="hljs-number" style="color: teal;">100</span>
       self.iteration = <span class="hljs-number" style="color: teal;">0</span> <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># game iterations</span>
       self.done = <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span> <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># env needs to be reset</span>




       self.model = self.DQNmodel()

       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span>



   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">DQNmodel</span><span class="hljs-params">(self)</span>:</span>
       model = Sequential()
       model.add(Dense(<span class="hljs-number" style="color: teal;">64</span>, input_shape=(<span class="hljs-number" style="color: teal;">1</span>,), activation=<span class="hljs-string" style="color: rgb(221, 17, 68);">'relu'</span>))
       model.add(Dense(<span class="hljs-number" style="color: teal;">64</span>, activation=<span class="hljs-string" style="color: rgb(221, 17, 68);">'relu'</span>))
       model.add(Dense(<span class="hljs-number" style="color: teal;">10</span>, activation=<span class="hljs-string" style="color: rgb(221, 17, 68);">'softmax'</span>))
       model.compile(loss=<span class="hljs-string" style="color: rgb(221, 17, 68);">'categorical_crossentropy'</span>, optimizer=Adam(lr=self.learning_rate))
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> model


   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">sample_memories</span><span class="hljs-params">(self, batch_size)</span>:</span>
       indices = np.random.permutation(len(self.replay_memory))[:batch_size]
       cols = [[], [], [], [], []] <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># state, action, reward, next_state, continue</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> idx <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> indices:
           memory = self.replay_memory[idx]
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> col, value <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> zip(cols, memory):
               col.append(value)
       cols = [np.array(col) <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> col <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> cols]
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> (cols[<span class="hljs-number" style="color: teal;">0</span>], cols[<span class="hljs-number" style="color: teal;">1</span>], cols[<span class="hljs-number" style="color: teal;">2</span>].reshape(<span class="hljs-number" style="color: teal;">-1</span>, <span class="hljs-number" style="color: teal;">1</span>), cols[<span class="hljs-number" style="color: teal;">3</span>],cols[<span class="hljs-number" style="color: teal;">4</span>].reshape(<span class="hljs-number" style="color: teal;">-1</span>, <span class="hljs-number" style="color: teal;">1</span>))


   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">epsilon_greedy</span><span class="hljs-params">(self, q_values, step)</span>:</span>
       self.epsilon = max(self.eps_min, self.eps_max - (self.eps_max-self.eps_min) * step/self.eps_decay_steps)
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> np.random.rand() &lt; self.epsilon:
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> np.random.randint(<span class="hljs-number" style="color: teal;">10</span>) <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># random action</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> np.argmax(q_values) <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># optimal action</span>
</pre>



<p>In this case, we need a policy network to control the movement of each agent as they move along the x-axis. Since the positions span a wide range (from x1 = 100 to x2 = 300), we can’t reasonably have a model that predicts over 200 separate states.&nbsp;</p>



<p>To simplify this problem, we can split the range between x1 and x2 into 10 states / 10 actions, and define an upper and lower bound for each state.</p>



<p><strong>Note that we have 10 actions, because from a state there are 10 possibilities.</strong></p>



<p>The code below illustrates the definition of both upper and lower bounds for each state.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">evaluate_state_from_last_coordinate</span><span class="hljs-params">(self, c)</span>:</span>
       <span class="hljs-string" style="color: rgb(221, 17, 68);">"""
       cmax: 450
       cmin: 150

       c definately will be between 150 and 450.
       state0 - (150 - 179)
       state1 - (180 - 209)
       state2 - (210 - 239)
       state3 - (240 - 269)
       state4 - (270 - 299)
       state5 - (300 - 329)
       state6 - (330 - 359)
       state7 - (360 - 389)
       state8 - (390 - 419)
       state9 - (420 - 450)
       """</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> c &gt;= <span class="hljs-number" style="color: teal;">150</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">and</span> c &lt;= <span class="hljs-number" style="color: teal;">179</span>:
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> <span class="hljs-number" style="color: teal;">0</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> c &gt;= <span class="hljs-number" style="color: teal;">180</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">and</span> c &lt;= <span class="hljs-number" style="color: teal;">209</span>:
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> <span class="hljs-number" style="color: teal;">1</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> c &gt;= <span class="hljs-number" style="color: teal;">210</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">and</span> c &lt;= <span class="hljs-number" style="color: teal;">239</span>:
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> <span class="hljs-number" style="color: teal;">2</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> c &gt;= <span class="hljs-number" style="color: teal;">240</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">and</span> c &lt;= <span class="hljs-number" style="color: teal;">269</span>:
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> <span class="hljs-number" style="color: teal;">3</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> c &gt;= <span class="hljs-number" style="color: teal;">270</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">and</span> c &lt;= <span class="hljs-number" style="color: teal;">299</span>:
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> <span class="hljs-number" style="color: teal;">4</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> c &gt;= <span class="hljs-number" style="color: teal;">300</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">and</span> c &lt;= <span class="hljs-number" style="color: teal;">329</span>:
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> <span class="hljs-number" style="color: teal;">5</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> c &gt;= <span class="hljs-number" style="color: teal;">330</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">and</span> c &lt;= <span class="hljs-number" style="color: teal;">359</span>:
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> <span class="hljs-number" style="color: teal;">6</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> c &gt;= <span class="hljs-number" style="color: teal;">360</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">and</span> c &lt;= <span class="hljs-number" style="color: teal;">389</span>:
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> <span class="hljs-number" style="color: teal;">7</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> c &gt;= <span class="hljs-number" style="color: teal;">390</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">and</span> c &lt;= <span class="hljs-number" style="color: teal;">419</span>:
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> <span class="hljs-number" style="color: teal;">8</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> c &gt;= <span class="hljs-number" style="color: teal;">420</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">and</span> c &lt;= <span class="hljs-number" style="color: teal;">450</span>:
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> <span class="hljs-number" style="color: teal;">9</span>
</pre>
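

<p>For readability, the same coordinate-to-state mapping can be written more compactly. A minimal, equivalent sketch (each state is a 30-unit-wide band starting at 150, clamped so that 450 still falls into state 9):</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">def evaluate_state_from_last_coordinate(self, c):
    # c is the ball's x-coordinate, guaranteed to lie between 150 and 450;
    # integer-divide the offset from 150 by the band width of 30
    return min((c - 150) // 30, 9)
</pre>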



<p>The Deep Neural Network (DNN) used experimentally for this work has one input neuron (representing the previous state), two hidden layers of 64 neurons each, and an output layer of 10 neurons (a softmax selection over the 10 discrete states). This is shown below:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">DQNmodel</span><span class="hljs-params">(self)</span>:</span>
       model = Sequential()
       model.add(Dense(<span class="hljs-number" style="color: teal;">64</span>, input_shape=(<span class="hljs-number" style="color: teal;">1</span>,), activation=<span class="hljs-string" style="color: rgb(221, 17, 68);">'relu'</span>))
       model.add(Dense(<span class="hljs-number" style="color: teal;">64</span>, activation=<span class="hljs-string" style="color: rgb(221, 17, 68);">'relu'</span>))
       model.add(Dense(<span class="hljs-number" style="color: teal;">10</span>, activation=<span class="hljs-string" style="color: rgb(221, 17, 68);">'softmax'</span>))
       model.compile(loss=<span class="hljs-string" style="color: rgb(221, 17, 68);">'categorical_crossentropy'</span>, optimizer=Adam(lr=self.learning_rate))
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> model
</pre>



<p>Now that we have a DQN model that predicts the agent’s next state/action, and the Pytennis environment already handles the ball’s movement in a straight line, let’s write a function that carries out an agent’s action based on the DQN model’s prediction of its next state.&nbsp;</p>



<p>The detailed code below illustrates how Agent A decides where to direct the ball on Agent B’s side (and vice versa). It also evaluates whether Agent B was able to receive the ball.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">randomVal</span><span class="hljs-params">(self, action)</span>:</span>
       <span class="hljs-string" style="color: rgb(221, 17, 68);">"""
       cmax: 450
       cmin: 150

        c will definitely be between 150 and 450.
       state0 - (150 - 179)
       state1 - (180 - 209)
       state2 - (210 - 239)
       state3 - (240 - 269)
       state4 - (270 - 299)
       state5 - (300 - 329)
       state6 - (330 - 359)
       state7 - (360 - 389)
       state8 - (390 - 419)
       state9 - (420 - 450)
       """</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> action == <span class="hljs-number" style="color: teal;">0</span>:
           val = np.random.choice([i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">150</span>, <span class="hljs-number" style="color: teal;">180</span>)])
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> action == <span class="hljs-number" style="color: teal;">1</span>:
           val = np.random.choice([i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">180</span>, <span class="hljs-number" style="color: teal;">210</span>)])
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> action == <span class="hljs-number" style="color: teal;">2</span>:
           val = np.random.choice([i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">210</span>, <span class="hljs-number" style="color: teal;">240</span>)])
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> action == <span class="hljs-number" style="color: teal;">3</span>:
           val = np.random.choice([i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">240</span>, <span class="hljs-number" style="color: teal;">270</span>)])
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> action == <span class="hljs-number" style="color: teal;">4</span>:
           val = np.random.choice([i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">270</span>, <span class="hljs-number" style="color: teal;">300</span>)])
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> action == <span class="hljs-number" style="color: teal;">5</span>:
           val = np.random.choice([i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">300</span>, <span class="hljs-number" style="color: teal;">330</span>)])
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> action == <span class="hljs-number" style="color: teal;">6</span>:
           val = np.random.choice([i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">330</span>, <span class="hljs-number" style="color: teal;">360</span>)])
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> action == <span class="hljs-number" style="color: teal;">7</span>:
           val = np.random.choice([i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">360</span>, <span class="hljs-number" style="color: teal;">390</span>)])
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> action == <span class="hljs-number" style="color: teal;">8</span>:
           val = np.random.choice([i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">390</span>, <span class="hljs-number" style="color: teal;">420</span>)])
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
           val = np.random.choice([i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">420</span>, <span class="hljs-number" style="color: teal;">450</span>)])
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> val

   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">stepA</span><span class="hljs-params">(self, action, count=<span class="hljs-number" style="color: teal;">0</span>)</span>:</span>
       <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># playerA should play</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> count == <span class="hljs-number" style="color: teal;">0</span>:
           self.NetworkA = self.net.network(
               self.ballx, ysource=<span class="hljs-number" style="color: teal;">100</span>, Ynew=<span class="hljs-number" style="color: teal;">600</span>)  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Network A</span>
           self.bally = self.NetworkA[<span class="hljs-number" style="color: teal;">1</span>][count]
           self.ballx = self.NetworkA[<span class="hljs-number" style="color: teal;">0</span>][count]

           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> self.GeneralReward == <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>:
               self.playerax = self.randomVal(action)
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
               self.playerax = self.ballx


<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#             soundObj = pygame.mixer.Sound('sound/sound.wav')</span>
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#             soundObj.play()</span>
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#             time.sleep(0.4)</span>
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#             soundObj.stop()</span>

       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
           self.ballx = self.NetworkA[<span class="hljs-number" style="color: teal;">0</span>][count]
           self.bally = self.NetworkA[<span class="hljs-number" style="color: teal;">1</span>][count]

       obsOne = self.evaluate_state_from_last_coordinate(
           int(self.ballx))  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># last state of the ball</span>
       obsTwo = self.evaluate_state_from_last_coordinate(
           int(self.playerbx))  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># evaluate player bx</span>
       diff = np.abs(self.ballx - self.playerbx)
       obs = obsTwo
       reward = self.evaluate_action(diff)
       done = <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>
       info = str(diff)

       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> obs, reward, done, info


   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">evaluate_action</span><span class="hljs-params">(self, diff)</span>:</span>

       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> (int(diff) &lt;= <span class="hljs-number" style="color: teal;">30</span>):
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">False</span>
</pre>



<p>From the code above, the function stepA is executed whenever Agent A has to play. While playing, Agent A uses the next action predicted by the DQN to estimate the target position x2 on Agent B’s side, starting from the ball’s current position x1 on its own side, and uses the ball-trajectory network provided by the Pytennis environment to make its move.&nbsp;</p>



<p>Agent A, for example, obtains a precise point x2 on Agent B’s side by using the function <strong>randomVal</strong>, shown above, to randomly select a coordinate x2 within the range corresponding to the action chosen by the DQN.&nbsp;</p>
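

<p>For instance, a minimal usage sketch (assuming <strong>env</strong> is an instance of the environment class above; the variable name is hypothetical):</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"># action 3 corresponds to state3, i.e. a random x-coordinate between 240 and 269
x2 = env.randomVal(3)
</pre>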



<p>Finally, stepA evaluates Agent B’s response to the target point x2 using the function <strong>evaluate_action</strong>, which decides whether Agent B should be rewarded or penalized (a reward if the ball lands within 30 units of the player). Everything described here for Agent A playing to Agent B applies equally for Agent B playing to Agent A (the same code with different variable names).</p>



<p>Now that we have the policy, reward, environment, states, and actions correctly defined, we can go ahead and let the two agents play the game against each other in a loop.&nbsp;</p>



<p>The code below shows how the agents take turns, each turn lasting 50 ball displays (counts 0 through 49). Note that for each ball display, the DQN makes a decision on where to toss the ball for the next agent to play.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">while</span> iteration &lt; iterations:

           self.display()
           self.randNumLabelA = self.myFontA.render(
               <span class="hljs-string" style="color: rgb(221, 17, 68);">'A (Win): '</span>+str(self.updateRewardA) + <span class="hljs-string" style="color: rgb(221, 17, 68);">', A(loss): '</span>+str(self.lossA), <span class="hljs-number" style="color: teal;">1</span>, self.BLACK)
           self.randNumLabelB = self.myFontB.render(
               <span class="hljs-string" style="color: rgb(221, 17, 68);">'B (Win): '</span>+str(self.updateRewardB) + <span class="hljs-string" style="color: rgb(221, 17, 68);">', B(loss): '</span> + str(self.lossB), <span class="hljs-number" style="color: teal;">1</span>, self.BLACK)
           self.randNumLabelIter = self.myFontIter.render(
               <span class="hljs-string" style="color: rgb(221, 17, 68);">'Iterations: '</span>+str(self.updateIter), <span class="hljs-number" style="color: teal;">1</span>, self.BLACK)

           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> nextplayer == <span class="hljs-string" style="color: rgb(221, 17, 68);">'A'</span>:

               <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> count == <span class="hljs-number" style="color: teal;">0</span>:
                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Online DQN evaluates what to do</span>
                   q_valueA = self.AgentA.model.predict([stateA])
                   actionA = self.AgentA.epsilon_greedy(q_valueA, iteration)

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Online DQN plays</span>
                   obsA, rewardA, doneA, infoA = self.stepA(
                       action=actionA, count=count)
                   next_stateA = actionA

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Let's memorize what just happened</span>
                   self.AgentA.replay_memory.append(
                       (stateA, actionA, rewardA, next_stateA, <span class="hljs-number" style="color: teal;">1.0</span> - doneA))
                   stateA = next_stateA

               <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> count == <span class="hljs-number" style="color: teal;">49</span>:

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Online DQN evaluates what to do</span>
                   q_valueA = self.AgentA.model.predict([stateA])
                   actionA = self.AgentA.epsilon_greedy(q_valueA, iteration)
                   obsA, rewardA, doneA, infoA = self.stepA(
                       action=actionA, count=count)
                   next_stateA = actionA

                   self.updateRewardA += rewardA
                   self.computeLossA(rewardA)

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Let's memorize what just happened</span>
                   self.AgentA.replay_memory.append(
                       (stateA, actionA, rewardA, next_stateA, <span class="hljs-number" style="color: teal;">1.0</span> - doneA))

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># restart the game if player A fails to get the ball, and let B start the game</span>
                   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> rewardA == <span class="hljs-number" style="color: teal;">0</span>:
                       self.restart = <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>
                       time.sleep(<span class="hljs-number" style="color: teal;">0.5</span>)
                       nextplayer = <span class="hljs-string" style="color: rgb(221, 17, 68);">'B'</span>
                       self.GeneralReward = <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">False</span>
                   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
                       self.restart = <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">False</span>
                       self.GeneralReward = <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Sample memories and use the target DQN to produce the target Q-Value</span>
                   X_state_val, X_action_val, rewards, X_next_state_val, continues = (
                       self.AgentA.sample_memories(self.AgentA.batch_size))
                   next_q_values = self.AgentA.model.predict(
                       [X_next_state_val])
                   max_next_q_values = np.max(
                       next_q_values, axis=<span class="hljs-number" style="color: teal;">1</span>, keepdims=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)
                   y_val = rewards + continues * self.AgentA.discount_rate * max_next_q_values

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Train the online DQN</span>
                   self.AgentA.model.fit(X_state_val, tf.keras.utils.to_categorical(
                       X_next_state_val, num_classes=<span class="hljs-number" style="color: teal;">10</span>), verbose=<span class="hljs-number" style="color: teal;">0</span>)

                   nextplayer = <span class="hljs-string" style="color: rgb(221, 17, 68);">'B'</span>
                   self.updateIter += <span class="hljs-number" style="color: teal;">1</span>

                   count = <span class="hljs-number" style="color: teal;">0</span>
                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># evaluate A</span>

               <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Online DQN evaluates what to do</span>
                   q_valueA = self.AgentA.model.predict([stateA])
                   actionA = self.AgentA.epsilon_greedy(q_valueA, iteration)

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Online DQN plays</span>
                   obsA, rewardA, doneA, infoA = self.stepA(
                       action=actionA, count=count)
                   next_stateA = actionA

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Let's memorize what just happened</span>
                   self.AgentA.replay_memory.append(
                       (stateA, actionA, rewardA, next_stateA, <span class="hljs-number" style="color: teal;">1.0</span> - doneA))
                   stateA = next_stateA

               <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> nextplayer == <span class="hljs-string" style="color: rgb(221, 17, 68);">'A'</span>:
                   count += <span class="hljs-number" style="color: teal;">1</span>
               <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
                   count = <span class="hljs-number" style="color: teal;">0</span>

           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
               <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> count == <span class="hljs-number" style="color: teal;">0</span>:
                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Online DQN evaluates what to do</span>
                   q_valueB = self.AgentB.model.predict([stateB])
                   actionB = self.AgentB.epsilon_greedy(q_valueB, iteration)

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Online DQN plays</span>
                   obsB, rewardB, doneB, infoB = self.stepB(
                       action=actionB, count=count)
                   next_stateB = actionB

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Let's memorize what just happened</span>
                   self.AgentB.replay_memory.append(
                       (stateB, actionB, rewardB, next_stateB, <span class="hljs-number" style="color: teal;">1.0</span> - doneB))
                   stateB = next_stateB

               <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> count == <span class="hljs-number" style="color: teal;">49</span>:

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Online DQN evaluates what to do</span>
                   q_valueB = self.AgentB.model.predict([stateB])
                   actionB = self.AgentB.epsilon_greedy(q_valueB, iteration)

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Online DQN plays</span>
                    obsB, rewardB, doneB, infoB = self.stepB(
                        action=actionB, count=count)
                   next_stateB = actionB

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Let's memorize what just happened</span>
                   self.AgentB.replay_memory.append(
                       (stateB, actionB, rewardB, next_stateB, <span class="hljs-number" style="color: teal;">1.0</span> - doneB))

                   stateB = next_stateB
                   self.updateRewardB += rewardB
                   self.computeLossB(rewardB)

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># restart the game if player A fails to get the ball, and let B start the game</span>
                   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> rewardB == <span class="hljs-number" style="color: teal;">0</span>:
                       self.restart = <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>
                       time.sleep(<span class="hljs-number" style="color: teal;">0.5</span>)
                       self.GeneralReward = <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">False</span>
                       nextplayer = <span class="hljs-string" style="color: rgb(221, 17, 68);">'A'</span>
                   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
                       self.restart = <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">False</span>
                       self.GeneralReward = <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Sample memories and use the target DQN to produce the target Q-Value</span>
                   X_state_val, X_action_val, rewards, X_next_state_val, continues = (
                       self.AgentB.sample_memories(self.AgentB.batch_size))
                   next_q_values = self.AgentB.model.predict(
                       [X_next_state_val])
                   max_next_q_values = np.max(
                       next_q_values, axis=<span class="hljs-number" style="color: teal;">1</span>, keepdims=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)
                   y_val = rewards + continues * self.AgentB.discount_rate * max_next_q_values

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Train the online DQN</span>
                   self.AgentB.model.fit(X_state_val, tf.keras.utils.to_categorical(
                       X_next_state_val, num_classes=<span class="hljs-number" style="color: teal;">10</span>), verbose=<span class="hljs-number" style="color: teal;">0</span>)

                   nextplayer = <span class="hljs-string" style="color: rgb(221, 17, 68);">'A'</span>
                   self.updateIter += <span class="hljs-number" style="color: teal;">1</span>
                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># evaluate B</span>

               <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Online DQN evaluates what to do</span>
                   q_valueB = self.AgentB.model.predict([stateB])
                   actionB = self.AgentB.epsilon_greedy(q_valueB, iteration)

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Online DQN plays</span>
                   obsB, rewardB, doneB, infoB = self.stepB(
                       action=actionB, count=count)
                   next_stateB = actionB

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Let's memorize what just happened</span>
                   self.AgentB.replay_memory.append(
                       (stateB, actionB, rewardB, next_stateB, <span class="hljs-number" style="color: teal;">1.0</span> - doneB))
                    stateB = next_stateB

               <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> nextplayer == <span class="hljs-string" style="color: rgb(221, 17, 68);">'B'</span>:
                   count += <span class="hljs-number" style="color: teal;">1</span>
               <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
                   count = <span class="hljs-number" style="color: teal;">0</span>

           iteration += <span class="hljs-number" style="color: teal;">1</span>
</pre>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-comparison-evaluation">Comparison/Evaluation</h2>



<p>Having played this game with both the model-free and the model-based approach, here are some differences to be aware of:</p>



<div id="separator-block_e30ebb678c4ba8377841bb9b0f568051"
         class="block-separator block-separator--10">
</div>



<div id="medium-table-block_4f8b1641178df01ef01ce7d5e521a2de"
     class="block-medium-table c-table__outer-wrapper  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset">

    <table class="c-table">
                    <thead class="c-table__head">
            <tr>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            s/n                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Model-free                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Model-based                        </div>
                    </td>
                            </tr>
            </thead>
        
        <tbody class="c-table__body">

                    
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>1</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>rewards are not accounted for (since this is automated, reward = 1)</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>rewards are accounted for</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>2</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>no modelling (no decision policy is required)</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>modelling is required (policy network)</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>3</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>this doesn’t require the use of initial states to predict the next state</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>this requires the use of initial states to predict the next state using the policy network</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>4</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>the rate of missing the ball with respect to time is zero</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>the rate of missing the ball with respect to time approaches zero</p>
                                                            </div>
                        </td>

                    
                </tr>

                    
        </tbody>
    </table>

</div>



<div id="separator-block_084b1a962f175a81255dd373244522d0"
         class="block-separator block-separator--25">
</div>



<p>If you’re interested, the videos below show these two techniques in action playing tennis games:</p>



<p>1. Model-free </p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="A reinforcement learning based tennis game - Discrete mathematics approach" width="500" height="281" src="https://www.youtube.com/embed/iUYxZ2tYKHw?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p>2. Model-based</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="An RL implementation of pytennis using Deep Q Network (DQN) - early stage of Learning" width="500" height="281" src="https://www.youtube.com/embed/FCwGNRiq9SY?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-conclusion">Conclusion</h2>



<p>Tennis might be simple compared to self-driving cars, but hopefully this example showed you a few things about RL that you didn’t know.&nbsp;</p>



<p>The main difference between model-free and model-based RL here is the policy network, which model-based RL requires and model-free RL does not.&nbsp;</p>



<p>It’s also worth noting that model-based RL often needs a massive amount of training time before the DNN learns the states reliably, without getting them wrong.</p>



<p>But every technique has its advantages and drawbacks; choosing the right one depends on what exactly you need your program to do.&nbsp;</p>



<p>Thanks for reading. I’ve left a few additional references below for you to follow if you want to explore this topic more.</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-references">References</h3>



<ol class="wp-block-list">
<li>AlphaGo documentary: <a href="https://www.youtube.com/watch?v=WXuK6gekU1Y" target="_blank" rel="noreferrer noopener nofollow">https://www.youtube.com/watch?v=WXuK6gekU1Y</a></li>



<li>List of reinforcement learning environments: <a href="https://medium.com/@mauriciofadelargerich/reinforcement-learning-environments-cff767bc241f" target="_blank" rel="noreferrer noopener nofollow">https://medium.com/@mauriciofadelargerich/reinforcement-learning-environments-cff767bc241f</a> </li>



<li>Create your own reinforcement learning environment: <a href="https://towardsdatascience.com/create-your-own-reinforcement-learning-environment-beb12f4151ef" target="_blank" rel="noreferrer noopener nofollow">https://towardsdatascience.com/create-your-own-reinforcement-learning-environment-beb12f4151ef</a> </li>



<li>Types of RL Environments: <a href="https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781838649777/1/ch01lvl1sec14/types-of-rl-environment" target="_blank" rel="noreferrer noopener nofollow">https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781838649777/1/ch01lvl1sec14/types-of-rl-environment</a></li>



<li>Model-based Deep Q Network: <a href="https://github.com/elishatofunmi/pytennis-Deep-Q-Network-DQN-" target="_blank" rel="noreferrer noopener nofollow">https://github.com/elishatofunmi/pytennis-Deep-Q-Network-DQN</a></li>



<li>Discrete mathematics approach youtube video: <a href="https://youtu.be/iUYxZ2tYKHw" target="_blank" rel="noreferrer noopener nofollow">https://youtu.be/iUYxZ2tYKHw</a></li>



<li>Deep Q Network approach YouTube video: <a href="https://youtu.be/FCwGNRiq9SY" target="_blank" rel="noreferrer noopener nofollow">https://youtu.be/FCwGNRiq9SY</a></li>



<li>Model-free discrete mathematics implementation: <a href="https://github.com/elishatofunmi/pytennis-Discrete-Mathematics-Approach-" target="_blank" rel="noreferrer noopener nofollow">https://github.com/elishatofunmi/pytennis-Discrete-Mathematics-Approach-</a></li>



<li>Hands-on Machine Learning with scikit-learn and TensorFlow: <a href="https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291" target="_blank" rel="noreferrer noopener nofollow">https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291</a></li>
</ol>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">3004</post-id>	</item>
		<item>
		<title>The Best Tools for Reinforcement Learning in Python You Actually Want to Try</title>
		<link>https://neptune.ai/blog/the-best-tools-for-reinforcement-learning-in-python</link>
		
		<dc:creator><![CDATA[Vladimir Lyashenko]]></dc:creator>
		<pubDate>Thu, 21 Jul 2022 13:13:10 +0000</pubDate>
				<category><![CDATA[ML Tools]]></category>
		<category><![CDATA[Reinforcement Learning]]></category>
		<guid isPermaLink="false">https://neptune.test/the-best-tools-for-reinforcement-learning-in-python/</guid>

					<description><![CDATA[Nowadays, Deep Reinforcement Learning (RL) is one of the hottest topics in the Data Science community. The fast development of RL has resulted in the growing demand for easy to understand and convenient to use RL tools. In recent years, plenty of RL libraries have been developed. These libraries were designed to have all the&#8230;]]></description>
										<content:encoded><![CDATA[
<p>Nowadays, <a href="https://medium.com/ai%C2%B3-theory-practice-business/reinforcement-learning-part-1-a-brief-introduction-a53a849771cf" target="_blank" rel="noreferrer noopener nofollow">Deep Reinforcement Learning</a> (RL) is one of the hottest topics in the Data Science community. The fast development of RL has resulted in a growing demand for easy-to-understand and convenient-to-use RL tools.</p>



<p>In recent years, plenty of RL libraries have been developed. These libraries were designed to have all the necessary tools to both implement and test <strong>Reinforcement Learning</strong> models.</p>



<p>Still, they differ quite a lot. That’s why it is important to pick a library that will be quick, reliable, and relevant for your RL task.</p>



<p>In this article we will cover:</p>



<ul class="wp-block-list">
<li>Criteria for choosing <strong>Deep Reinforcement Learning</strong> library,</li>



<li>RL libraries: <strong>Pyqlearning</strong>, <strong>KerasRL</strong>, <strong>Tensorforce</strong>, <strong>RL_Coach</strong>, <strong>TFAgents</strong>, <strong>MAME RL</strong>, <strong>MushroomRL</strong>.</li>
</ul>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-tools.png?ssl=1" alt="RL tools" class="wp-image-30814"/></figure>
</div>


<h2 class="wp-block-heading" class="wp-block-heading" id="h-python-libraries-for-reinforcement-learning">Python libraries for Reinforcement Learning</h2>



<p>There are a lot of RL libraries, so choosing the right one for your case might be a complicated task. We need to form criteria to evaluate each library.</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-criteria"><strong>Criteria</strong></h3>



<p>Each RL library in this article will be analyzed based on the following criteria:</p>



<ol class="wp-block-list">
<li>Number of <strong>state-of-the-art</strong> (<strong>SOTA</strong>) RL algorithms implemented &#8211; the most important criterion, in my opinion</li>



<li>Official<strong> documentation, availability of simple tutorials</strong> and examples</li>



<li><strong>Readable code that is easy to customize</strong>&nbsp;</li>



<li>Number of <strong>supported</strong> environments &#8211; a crucial decision factor for a <strong>Reinforcement Learning</strong> library</li>



<li><strong>Logging and tracking tools</strong> support &#8211; for example, Neptune or TensorBoard</li>



<li><strong>Vectorized environment</strong> (<strong>VE</strong>) feature &#8211; a way to run multiple copies of the environment in parallel (often in separate processes), so your agent experiences far more situations per unit of time than with a single environment; see the short sketch after this list</li>



<li><strong>Regular updates </strong>&#8211; RL develops quite rapidly and you want to use up-to-date technologies</li>
</ol>
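

<p>To make the vectorized environment idea concrete, here is a minimal sketch using OpenAI Gym’s vectorized API (assuming a classic Gym version that provides <strong>gym.vector.make</strong>); the libraries below ship their own wrappers around the same idea:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import gym

# four CartPole environments stepped in lockstep
envs = gym.vector.make("CartPole-v1", num_envs=4)

observations = envs.reset()                  # batched observations, one row per environment
for _ in range(100):
    actions = envs.action_space.sample()     # one action per environment
    observations, rewards, dones, infos = envs.step(actions)
envs.close()
</pre>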



<p>We will talk about the following libraries:</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-kerasrl">KerasRL</h2>



<p><a href="https://github.com/keras-rl/keras-rl" target="_blank" rel="noreferrer noopener nofollow"><strong>KerasRL</strong></a> is a <strong>Deep Reinforcement Learning </strong>Python library. It implements some state-of-the-art RL algorithms, and seamlessly integrates with <strong>Deep Learning</strong> library <strong><a href="/integrations/keras" target="_blank" rel="noreferrer noopener">Keras</a></strong>.</p>



<p>Moreover, <strong>KerasRL</strong> works with <a href="https://gym.openai.com" target="_blank" rel="noreferrer noopener nofollow">OpenAI Gym</a> out of the box. This means you can evaluate and play around with different algorithms quite easily.</p>



<p>To install <strong>KerasRL</strong> simply use a pip command:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">pip install keras-rl</pre>



<p>Let’s see if <strong>KerasRL</strong> fits the criteria:</p>



<ol class="wp-block-list">
<li>Number of <strong>SOTA </strong>RL algorithms implemented</li>
</ol>



<p>As of today <strong>KerasRL</strong> has the following algorithms implemented:</p>



<ul class="wp-block-list">
<li><strong>Deep Q-Learning</strong> (<strong>DQN</strong>) and its improvements (<strong>Double</strong> and <strong>Dueling</strong>)</li>



<li><strong>Deep Deterministic Policy Gradient </strong>(<strong>DDPG</strong>)</li>



<li><strong>Continuous DQN</strong> (<strong>CDQN</strong> or <strong>NAF</strong>)</li>



<li><strong>Cross-Entropy Method</strong> (<strong>CEM</strong>)</li>



<li><strong>Deep SARSA</strong></li>
</ul>



<p>As you may have noticed, <strong>KerasRL</strong> misses two important agents: Actor-Critic Methods and Proximal Policy Optimization (PPO).</p>



<ol start="2" class="wp-block-list">
<li>Official documentation, availability of tutorials and examples</li>
</ol>



<p>The code is easy to read and it’s full of comments, which is quite useful. Still, the documentation seems incomplete as it misses the explanation of parameters and tutorials. Also, practical examples leave much to be desired.</p>



<ol start="3" class="wp-block-list">
<li>Readable code that is easy to customize&nbsp;</li>
</ol>



<p>Very easy. All you need to do is to create a new agent following the example and then add it to <strong>rl.agents</strong>.</p>



<ol start="4" class="wp-block-list">
<li>Number of supported environments</li>
</ol>



<p><strong>KerasRL </strong>was made to work only with <strong>OpenAI Gym</strong>. Therefore you need to modify the agent if you want to use any other environment.</p>



<ol start="5" class="wp-block-list">
<li>Logging and tracking tools support</li>
</ol>



<p>Logging and tracking tools support is not implemented. Nevertheless, you can use <a href="/" target="_blank" rel="noreferrer noopener">neptune.ai </a>to track your experiments.</p>
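

<p>For example, a minimal sketch of logging per-episode rewards to Neptune (assuming a recent <strong>neptune</strong> client, that your API token is set in the environment, and that the project name and the training function are placeholders):</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import neptune

# assumes NEPTUNE_API_TOKEN is set and the project exists (placeholder name)
run = neptune.init_run(project="your-workspace/rl-experiments")
run["parameters"] = {"algorithm": "DQN", "learning_rate": 1e-3}

for episode in range(100):
    episode_reward = train_one_episode()      # hypothetical training function
    run["train/episode_reward"].append(episode_reward)

run.stop()
</pre>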



<ol start="6" class="wp-block-list">
<li>Vectorized environment feature</li>
</ol>



<p>Includes a vectorized environment feature.</p>



<ol start="7" class="wp-block-list">
<li>Regular updates</li>
</ol>



<p>The library seems not to be maintained anymore as the last updates were more than a year ago.</p>



<p>To sum up, <strong>KerasRL</strong> has a good set of implementations. Unfortunately, it lacks valuable features such as visualization tools, newer architectures, and regular updates. You should probably use another library.&nbsp;</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-pyqlearning">Pyqlearning</h2>



<p><a href="https://pypi.org/project/pyqlearning/" target="_blank" rel="noreferrer noopener nofollow">Pyqlearning</a> is a Python library to implement RL. It focuses on <strong>Q-Learning</strong> and <strong>multi-agent Deep Q-Network.</strong><br><br><strong>Pyqlearning</strong> provides components for designers, not for end user state-of-the-art black boxes. Thus, this library is a tough one to use. You can use it to design the information search algorithm, for example, GameAI or web crawlers.</p>



<p>To install <strong>Pyqlearning</strong> simply use a pip command:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">pip install pyqlearning</pre>



<p>Let’s see if <strong>Pyqlearning</strong> fits the criteria:</p>



<ol class="wp-block-list">
<li>Number of <strong>SOTA </strong>RL algorithms implemented</li>
</ol>



<p>As of today <strong>Pyqlearning</strong> has the following algorithms implemented:</p>



<ul class="wp-block-list">
<li><strong>Deep Q-Learning</strong> (<strong>DQN</strong>) and its improvements (<strong>Epsilon Greedy</strong> and <strong>Boltzmann</strong>)</li>
</ul>



<p>As you may have noticed, <strong>Pyqlearning</strong> has only one important agent. The library leaves much to be desired.</p>



<ol start="2" class="wp-block-list">
<li>Official documentation, availability of tutorials and examples</li>
</ol>



<p><strong>Pyqlearning </strong>has a couple of examples for various tasks and two tutorials featuring Maze Solving and the pursuit-evasion game by <strong>Deep Q-Network</strong>. You may find them in the <a href="https://code.accel-brain.com/Reinforcement-Learning/" target="_blank" rel="noreferrer noopener nofollow">official documentation</a>. The documentation seems incomplete as it focuses on the math, and not the library’s description and usage.</p>



<ol start="3" class="wp-block-list">
<li>Readable code that is easy to customize&nbsp;</li>
</ol>



<p><strong>Pyqlearning</strong> is an open-source library. Source code can be found on <a href="https://github.com/chimera0/accel-brain-code/tree/master/Reinforcement-Learning" target="_blank" rel="noreferrer noopener nofollow">Github</a>. The code lacks comments. It may be a complicated task to customize it. Still, the tutorials might help.</p>



<ol start="4" class="wp-block-list">
<li>Number of supported environments</li>
</ol>



<p>Since the library is environment-agnostic, it’s relatively easy to use it with any environment.</p>



<ol start="5" class="wp-block-list">
<li>Logging and tracking tools support</li>
</ol>



<p>The author uses a simple<strong> logging</strong> package in the tutorials. <strong>Pyqlearning</strong> does not support other logging and tracking tools, for example, <strong><a href="/vs/tensorboard" target="_blank" rel="noreferrer noopener">TensorBoard</a></strong>.&nbsp;</p>



<ol start="6" class="wp-block-list">
<li>Vectorized environment feature</li>
</ol>



<p>Pyqlearning does not support the vectorized environment feature.</p>



<ol start="7" class="wp-block-list">
<li>Regular updates</li>
</ol>



<p>The library is maintained. The last update was made two months ago. Still, the development process seems to be a slow-going one.</p>



<p>To sum up, <strong>Pyqlearning</strong> leaves much to be desired. It is not a library that you will use commonly. Thus, you should probably use something else.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-tensorforce">Tensorforce</h2>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Reinforce.png?ssl=1" alt="Reinforce" class="wp-image-30816" style="width:200px;height:157px"/></figure>
</div>


<p><a href="https://github.com/tensorforce/tensorforce" target="_blank" rel="noreferrer noopener nofollow">Tensorforce</a> is an open-source <strong>Deep </strong>RL library built on Google’s <strong>Tensorflow</strong> framework. It’s straightforward in its usage and has a potential to be one of the best <strong>Reinforcement Learning</strong> libraries.</p>



<p><strong>Tensorforce</strong> has key design choices that differentiate it from other RL libraries:</p>



<ul class="wp-block-list">
<li>Modular component-based design: Feature implementations, above all, tend to be as generally applicable and configurable as possible.</li>



<li>Separation of RL algorithm and application: Algorithms are agnostic to the type and structure of inputs (states/observations) and outputs (actions/decisions), as well as the interaction with the application environment.</li>
</ul>



<p>To install <strong>Tensorforce</strong> simply use a pip command:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">pip install tensorforce</pre>



<p>Let’s see if <strong>Tensorforce</strong> fits the criteria:</p>



<ol class="wp-block-list">
<li>Number of <strong>SOTA </strong>RL algorithms implemented</li>
</ol>



<p>As of today, <strong>Tensorforce</strong> has the following set of algorithms implemented:</p>



<ul class="wp-block-list">
<li><strong>Deep Q-Learning</strong> (<strong>DQN</strong>) and its improvements (<strong>Double</strong> and <strong>Dueling</strong>)</li>



<li><strong>Vanilla Policy Gradient</strong> (<strong>PG</strong>)</li>



<li><strong>Deep Deterministic Policy Gradient </strong>(<strong>DDPG</strong>)</li>



<li><strong>Continuous DQN</strong> (<strong>CDQN</strong> or <strong>NAF</strong>)</li>



<li><strong>Actor Critic </strong>(<strong>A2C and A3C</strong>)</li>



<li><strong>Trust Region Policy Optimization</strong> (<strong>TRPO</strong>)</li>



<li><strong>Proximal Policy Optimization</strong> (<strong>PPO</strong>)</li>
</ul>



<p>As you may have noticed, <strong>Tensorforce</strong> lacks a <strong>Soft Actor-Critic</strong> (<strong>SAC</strong>) implementation. Apart from that, the coverage is excellent.</p>



<ol start="2" class="wp-block-list">
<li>Official documentation, availability of tutorials and examples</li>
</ol>



<p>It is quite easy to start using <strong>Tensorforce</strong> thanks to the variety of simple examples and tutorials. The <a href="https://tensorforce.readthedocs.io/en/latest/index.html" target="_blank" rel="noreferrer noopener nofollow">official documentation</a> seems complete and convenient to navigate through.</p>



<ol start="3" class="wp-block-list">
<li>Readable code that is easy to customize</li>
</ol>



<p><strong>Tensorforce</strong> benefits from its modular design. Each part of the architecture, for example, networks, models, and runners, is distinct, so you can easily modify them. However, the code lacks comments, and that could be a problem.</p>



<ol start="4" class="wp-block-list">
<li>Number of supported environments</li>
</ol>



<p><strong>Tensorforce</strong> works with multiple environments, for example, <strong>OpenAI Gym</strong>, <strong>OpenAI Retro</strong> and <strong>DeepMind Lab</strong>. It also has documentation to help you plug into other environments.</p>



<ol start="5" class="wp-block-list">
<li>Logging and tracking tools support</li>
</ol>



<p>The library supports <strong>TensorBoard</strong> and other logging/tracking tools.</p>



<ol start="6" class="wp-block-list">
<li>Vectorized environment feature</li>
</ol>



<p><strong>Tensorforce</strong> supports the vectorized environment feature.</p>



<ol start="7" class="wp-block-list">
<li>Regular updates</li>
</ol>



<p><strong>Tensorforce</strong> is regularly updated. The last update was just a few weeks ago.</p>



<p>To sum up, <strong>Tensorforce</strong> is a powerful RL tool. It is up-to-date and has all necessary documentation for you to start working with it.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-rl_coach">RL_Coach</h2>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-coach.png?ssl=1" alt="RL coach" class="wp-image-30818" style="width:256px;height:235px"/></figure>
</div>


<p><a href="https://github.com/NervanaSystems/coach" target="_blank" rel="noreferrer noopener nofollow">Reinforcement Learning Coach</a> (<strong>Coach</strong>) by Intel AI Lab is a Python RL framework containing many state-of-the-art algorithms.&nbsp;</p>



<p>It exposes a set of easy-to-use APIs for experimenting with new RL algorithms. The components of the library, for example, algorithms, environments, and neural network architectures, are modular. Thus, extending and reusing existing components is fairly painless.</p>



<p>To install <strong>Coach</strong> simply use a pip command.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">pip install rl_coach</pre>



<p>Still, you should check the <a href="https://github.com/NervanaSystems/coach#installation" target="_blank" rel="noreferrer noopener nofollow">official installation tutorial</a> as a few prerequisites are required.</p>
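

<p>Coach is typically driven through presets. A minimal sketch of running a built-in preset from the command line (assuming the <strong>coach</strong> entry point is on your path and that the <strong>CartPole_DQN</strong> preset name exists in your version):</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"># run the bundled CartPole DQN preset; -r renders the environment
coach -p CartPole_DQN -r</pre>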



<p>Let’s see if <strong>Coach</strong> fits the criteria:</p>



<ol class="wp-block-list">
<li>Number of <strong>SOTA </strong>RL algorithms implemented</li>
</ol>



<p>As of today, <strong>RL_Coach</strong> has the following set of algorithms implemented:</p>



<ul class="wp-block-list">
<li><a href="https://nervanasystems.github.io/coach/components/agents/policy_optimization/ac.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Actor-Critic</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/policy_optimization/acer.html" target="_blank" rel="noreferrer noopener nofollow"><strong>ACER</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/imitation/bc.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Behavioral Cloning</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/value_optimization/bs_dqn.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Bootstrapped DQN</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/value_optimization/categorical_dqn.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Categorical DQN</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/imitation/cil.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Conditional Imitation Learning</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/policy_optimization/cppo.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Clipped Proximal Policy Optimization</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/policy_optimization/ddpg.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Deep Deterministic Policy Gradient</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/other/dfp.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Direct Future Prediction</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/value_optimization/double_dqn.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Double DQN</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/value_optimization/dqn.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Deep Q Networks</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/value_optimization/dueling_dqn.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Dueling DQN</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/value_optimization/mmc.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Mixed Monte Carlo</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/value_optimization/n_step.html" target="_blank" rel="noreferrer noopener nofollow"><strong>N-Step Q Learning</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/value_optimization/naf.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Normalized Advantage Functions</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/value_optimization/nec.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Neural Episodic Control</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/value_optimization/pal.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Persistent Advantage Learning</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/policy_optimization/pg.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Policy Gradient</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/policy_optimization/ppo.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Proximal Policy Optimization</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/value_optimization/rainbow.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Rainbow</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/value_optimization/qr_dqn.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Quantile Regression DQN</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/policy_optimization/sac.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Soft Actor-Critic</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/policy_optimization/td3.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Twin Delayed Deep Deterministic Policy Gradient</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/policy_optimization/wolpertinger.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Wolpertinger</strong></a></li>
</ul>



<p>As you may have noticed, <strong>RL_Coach </strong>has a variety of algorithms. It’s the most complete library of all covered in this article.</p>



<ol start="2" class="wp-block-list">
<li>Official documentation, availability of tutorials and examples</li>
</ol>



<p>The <a href="https://nervanasystems.github.io/coach/index.html" target="_blank" rel="noreferrer noopener nofollow">documentation</a> is complete. Also, <strong>RL_Coach</strong> has a set of valuable <a href="https://github.com/NervanaSystems/coach/tree/master/tutorials" target="_blank" rel="noreferrer noopener nofollow">tutorials</a>. It will be easy for newcomers to start working with it.&nbsp;</p>



<ol start="3" class="wp-block-list">
<li>Readable code that is easy to customize&nbsp;</li>
</ol>



<p><strong>RL_Coach</strong> is an open-source library. It benefits from a modular design, but the code lacks comments. Customizing it may be a complicated task.</p>



<ol start="4" class="wp-block-list">
<li>Number of supported environments</li>
</ol>



<p><strong>Coach</strong> supports the following environments:</p>



<ul class="wp-block-list">
<li><strong>OpenAI Gym</strong></li>



<li><strong>ViZDoom</strong></li>



<li><strong>Roboschool</strong></li>



<li><strong>GymExtensions</strong></li>



<li><strong>PyBullet</strong></li>



<li><strong>CARLA</strong></li>



<li><strong>And other</strong></li>
</ul>



<p>For more information including installation and usage instructions please refer to <a href="https://github.com/NervanaSystems/coach#supported-environments" target="_blank" rel="noreferrer noopener nofollow">official documentation</a>.</p>



<ol start="5" class="wp-block-list">
<li>Logging and tracking tools support</li>
</ol>



<p><strong>Coach </strong>supports various logging and tracking tools. It even has its own <a href="https://nervanasystems.github.io/coach/dashboard.html" target="_blank" rel="noreferrer noopener nofollow">visualization dashboard</a>.</p>



<ol start="6" class="wp-block-list">
<li>Vectorized environment feature</li>
</ol>



<p><strong>RL_Coach</strong> supports Vectorized environment feature. For usage instructions please refer to the <a href="https://nervanasystems.github.io/coach/dist_usage.html" target="_blank" rel="noreferrer noopener nofollow">documentation</a>.</p>



<ol start="7" class="wp-block-list">
<li>Regular updates</li>
</ol>



<p>The library seems to be maintained. However, the last major update was almost a year ago.</p>



<p>To sum up, <strong>RL_Coach</strong> has a comprehensive, up-to-date set of algorithms implemented, and it’s newcomer-friendly. I would strongly recommend <strong>Coach</strong>.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-tfagents">TFAgents</h2>



<p><a href="https://github.com/tensorflow/agents" target="_blank" rel="noreferrer noopener nofollow">TFAgents</a> is a Python library designed to make implementing, deploying, and testing RL algorithms easier. It has a modular structure and provides well-tested components that can be easily modified and extended.</p>



<p><strong>TFAgents</strong> is currently under active development, but even the current set of components makes it<strong> </strong>the most promising RL library.</p>



<p>To install <strong>TFAgents</strong> simply use a pip command:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">pip install tf-agents</pre>



<p>Let’s see if <strong>TFAgents</strong> fits the criteria:</p>



<ol class="wp-block-list">
<li>Number of <strong>SOTA </strong>RL algorithms implemented</li>
</ol>



<p>As of today, <strong>TFAgents</strong> has the following set of algorithms implemented:</p>



<ul class="wp-block-list">
<li><strong>Deep Q-Learning</strong> (<strong>DQN</strong>) and its improvements (<strong>Double</strong>)</li>



<li><strong>Deep Deterministic Policy Gradient </strong>(<strong>DDPG</strong>)</li>



<li><strong>TD3</strong></li>



<li><strong>REINFORCE</strong></li>



<li><strong>Proximal Policy Optimization</strong> (<strong>PPO</strong>)</li>



<li><strong>Soft Actor Critic</strong> (<strong>SAC</strong>)&nbsp;</li>
</ul>



<p>Overall, <strong>TFAgents</strong> has a great set of algorithms implemented.</p>



<ol start="2" class="wp-block-list">
<li>Official documentation, availability of tutorials and examples</li>
</ol>



<p><strong>TFAgents</strong> has a series of tutorials on each major component. Still, the <a href="https://www.tensorflow.org/agents/api_docs/python/tf_agents" target="_blank" rel="noreferrer noopener nofollow">official documentation</a> seems incomplete; I would even say it is barely there. The tutorials and simple examples do their job, but the lack of well-written documentation is a major disadvantage.</p>



<ol start="3" class="wp-block-list">
<li>Readable code that is easy to customize</li>
</ol>



<p>The code is full of comments and the implementations are very clean. <strong>TFAgents</strong> seems to have the best library code.</p>



<ol start="4" class="wp-block-list">
<li>Number of supported environments</li>
</ol>



<p>The library is environment-agnostic, which makes it easy to plug into almost any environment.</p>



<ol start="5" class="wp-block-list">
<li>Logging and tracking tools support</li>
</ol>



<p>Logging and tracking tools are supported.</p>



<ol start="6" class="wp-block-list">
<li>Vectorized environment feature</li>
</ol>



<p>Vectorized environment is supported.</p>



<ol start="7" class="wp-block-list">
<li>Regular updates</li>
</ol>



<p>As mentioned above, <strong>TFAgents</strong> is currently under active development. The last update was made just a couple of days ago.</p>



<p>To sum up, <strong>TFAgents</strong> is a very promising library. It already has all necessary tools to start working with it. I wonder what it will look like when the development is over.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-stable-baselines">Stable Baselines</h2>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Stable-Baselines.png?ssl=1" alt="Stable Baselines" class="wp-image-30820" style="width:308px;height:269px"/></figure>
</div>


<p><a href="https://github.com/hill-a/stable-baselines" target="_blank" rel="noreferrer noopener nofollow">Stable Baselines</a> is a set of improved implementations of <strong>Reinforcement Learning</strong> (RL) algorithms based on <a href="https://github.com/openai/baselines" rel="nofollow">OpenAI Baselines</a>. The <strong>OpenAI Baselines</strong> library was not good. That’s why <strong>Stable Baselines</strong> was created.</p>



<p><strong>Stable Baselines</strong> features a unified structure for all algorithms, a visualization tool, and excellent documentation.</p>



<p>To install <strong>Stable Baselines</strong> simply use a pip command.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">pip install story-baselines</pre>



<p>Still, you should check the <a href="https://stable-baselines.readthedocs.io/en/master/guide/install.html" target="_blank" rel="noreferrer noopener nofollow">official installation tutorial</a> as a few prerequisites are required.</p>
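

<p>A minimal quick start, roughly following the README example (assuming the TensorFlow-based Stable Baselines 2.x and a Gym CartPole environment), looks like this:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import gym
from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv

# Stable Baselines algorithms expect a (possibly single-copy) vectorized environment
env = DummyVecEnv([lambda: gym.make('CartPole-v1')])

# Train a PPO agent with the default MLP policy
model = PPO2('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=10000)

# Run the trained policy for a few steps
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)</pre>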



<p>Let’s see if <strong>Stable Baselines</strong> fits the criteria:</p>



<ol class="wp-block-list">
<li>Number of <strong>SOTA </strong>RL algorithms implemented</li>
</ol>



<p>As of today, <strong>Stable Baselines </strong>has the following set of algorithms implemented:</p>



<ul class="wp-block-list">
<li><strong>A2C</strong></li>



<li><strong>ACER</strong></li>



<li><strong>ACKTR</strong></li>



<li><strong>DDPG</strong></li>



<li><strong>DQN</strong></li>



<li><strong>HER</strong></li>



<li><strong>GAIL</strong></li>



<li><strong>PPO1 </strong>and <strong>PPO2</strong></li>



<li><strong>SAC</strong></li>



<li><strong>TD3</strong></li>



<li><strong>TRPO</strong></li>
</ul>



<p>Overall, <strong>Stable Baselines </strong>has a great set of algorithms implemented.</p>



<ol start="2" class="wp-block-list">
<li>Official documentation, availability of tutorials and examples</li>
</ol>



<p>The <a href="https://stable-baselines.readthedocs.io/en/master/guide/rl.html" target="_blank" rel="noreferrer noopener nofollow">documentation</a> is complete and excellent. The set of tutorials and examples is also really helpful.</p>



<ol start="3" class="wp-block-list">
<li>Readable code that is easy to customize&nbsp;</li>
</ol>



<p>On the other hand, modifying the code can be tricky. But because <strong>Stable Baselines</strong> provides a lot of useful comments in the code and awesome documentation, the modification process is less complex than it would be otherwise.</p>



<ol start="4" class="wp-block-list">
<li>Number of supported environments</li>
</ol>



<p><strong>Stable Baselines</strong> provides good <a href="https://stable-baselines.readthedocs.io/en/master/guide/custom_env.html" target="_blank" rel="noreferrer noopener nofollow">documentation</a> about how to plug in your custom environment; however, the environment needs to follow the <strong>OpenAI Gym</strong> interface.</p>



<ol start="5" class="wp-block-list">
<li>Logging and tracking tools support</li>
</ol>



<p><strong>Stable Baselines</strong> has <strong>TensorBoard</strong> support built in.</p>



<ol start="6" class="wp-block-list">
<li>Vectorized environment feature</li>
</ol>



<p>Vectorized environment feature is supported by a majority of the algorithms. Please check the <a href="https://stable-baselines.readthedocs.io/en/master/guide/algos.html" target="_blank" rel="noreferrer noopener nofollow">documentation</a> in case you want to learn more.</p>
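

<p>As a sketch of what that looks like in practice (assuming A2C, which supports multiprocessing out of the box), several environment copies can be stepped in parallel worker processes:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import gym
from stable_baselines import A2C
from stable_baselines.common.vec_env import SubprocVecEnv

def make_env(env_id, rank, seed=0):
    # Each subprocess builds and seeds its own copy of the environment
    def _init():
        env = gym.make(env_id)
        env.seed(seed + rank)
        return env
    return _init

if __name__ == '__main__':
    # Four CartPole copies stepped in parallel worker processes
    env = SubprocVecEnv([make_env('CartPole-v1', i) for i in range(4)])
    model = A2C('MlpPolicy', env, verbose=1)
    model.learn(total_timesteps=25000)</pre>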



<ol start="7" class="wp-block-list">
<li>Regular updates</li>
</ol>



<p>The last major updates were made almost two years ago, but the library is maintained as the documentation is regularly updated.</p>



<p>To sum up, <strong>Stable Baselines</strong> is a library with a great set of algorithms and awesome documentation. You should consider using it as your RL tool.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-mushroomrl">MushroomRL</h2>



<p><a href="http://mushroomrl.readthedocs.io/en/latest/" target="_blank" rel="noreferrer noopener nofollow">MushroomRL</a> is a Python <strong>Reinforcement Learning</strong> library whose modularity allows you to use well-known Python libraries for tensor computation and RL benchmarks.&nbsp;</p>



<p>It enables RL experiments by providing both classical RL algorithms and deep RL algorithms. The idea behind MushroomRL is to offer the majority of RL algorithms behind a common interface, so you can run them without doing too much extra work.&nbsp;</p>



<p>To install <strong>MushroomRL</strong> simply use a pip command.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">pip install mushroom_rl</pre>



<p>Let’s see if <strong>MushroomRL</strong> fits the criteria:</p>



<ol class="wp-block-list">
<li>Number of <strong>SOTA </strong>RL algorithms implemented</li>
</ol>



<p>As of today, <strong>MushroomRL </strong>has the following set of algorithms implemented:</p>



<ul class="wp-block-list">
<li><strong>Q-Learning</strong></li>



<li><strong>SARSA</strong></li>



<li><strong>FQI</strong></li>



<li><strong>DQN</strong></li>



<li><strong>DDPG</strong></li>



<li><strong>SAC</strong></li>



<li><strong>TD3</strong></li>



<li><strong>TRPO</strong></li>



<li><strong>PPO</strong></li>
</ul>



<p>Overall, <strong>MushroomRL </strong>has everything you need to work on RL tasks.</p>



<ol start="2" class="wp-block-list">
<li>Official documentation, availability of tutorials and examples</li>
</ol>



<p>The <a href="http://mushroomrl.readthedocs.io/en/latest/" target="_blank" rel="noreferrer noopener nofollow">official documentation</a> seems incomplete. It misses valuable tutorials, and simple examples leave much to be desired.</p>



<ol start="3" class="wp-block-list">
<li>Readable code that is easy to customize&nbsp;</li>
</ol>



<p>The code lacks comments and parameter descriptions, which makes it really hard to customize, although <strong>MushroomRL</strong> never positioned itself as a library that is easy to customize.</p>



<ol start="4" class="wp-block-list">
<li>Number of supported environments</li>
</ol>



<p><strong>MushroomRL</strong> supports the following environments:</p>



<ul class="wp-block-list">
<li><strong>OpenAI Gym</strong></li>



<li><strong>DeepMind Control Suite</strong></li>



<li><strong>MuJoCo</strong></li>
</ul>



<p>For more information including installation and usage instructions please refer to <a href="https://mushroomrl.readthedocs.io/en/latest/source/mushroom_rl.environments.html" target="_blank" rel="noreferrer noopener nofollow">official documentation</a>.</p>



<ol start="5" class="wp-block-list">
<li>Logging and tracking tools support</li>
</ol>



<p>MushroomRL supports various logging and tracking tools. I would recommend using TensorBoard as the most popular one.</p>



<ol start="6" class="wp-block-list">
<li>Vectorized environment feature</li>
</ol>



<p>Vectorized environment feature is supported.</p>



<ol start="7" class="wp-block-list">
<li>Regular updates</li>
</ol>



<p>The library is maintained. The last updates were made just a few weeks ago.</p>



<p>To sum up,<strong> MushroomRL</strong> has a good set of algorithms implemented. Still, it is missing the tutorials and examples that are crucial when you start working with a new library.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-rllib">RLlib </h2>



<p>“RLlib is an open-source library for reinforcement learning that offers both high scalability and a unified API for a variety of applications. RLlib natively supports TensorFlow, TensorFlow Eager, and PyTorch, but most of its internals are framework agnostic.” ~ <a href="https://docs.ray.io/en/master/rllib.html" target="_blank" rel="noreferrer noopener nofollow">Website</a></p>



<ol class="wp-block-list">
<li>Number of state-of-the-art (SOTA) RL algorithms implemented<br>RLlib implements them ALL! <em>PPO?</em> It’s there. <em>A2C and A3C?</em> Yep. <em>DDPG, TD3, SAC?</em> Of course! <em>DQN, Rainbow, APEX???</em> Yes, in many shapes and flavours! <em>Evolution Strategies, IMPALA,</em> <em>Dreamer, R2D2, APPO, AlphaZero, SlateQ, LinUCB, LinTS, MADDPG, QMIX, …</em> Stop it! I’m not sure if you’re making these acronyms up. Nonetheless, yes, RLlib has them ALL. See the full list <a href="https://docs.ray.io/en/master/rllib-algorithms.html#available-algorithms-overview">here</a>.</li>



<li>Official documentation, availability of simple tutorials and examples<br>RLlib has comprehensive documentation with many examples. Its code is also well commented.</li>



<li>Readable code that is easy to customize<br>It’s easiest to customize RLlib with callbacks. Although RLlib is open-sourced and you can edit the code, it’s not a straightforward thing to do. RLlib codebase is quite complicated because of its size and many layers of abstractions. <a href="https://docs.ray.io/en/master/rllib-concepts.html">Here</a> is a guide that should help you with that if you want to e.g. add a new algorithm.</li>



<li>Number of supported environments<br>RLlib works with several different types of environments, including OpenAI Gym, user-defined, multi-agent, and also batched environments. <a href="https://docs.ray.io/en/master/rllib-env.html" target="_blank" rel="noreferrer noopener nofollow">Here</a> you’ll find more.</li>



<li>Logging and tracking tools support<br>RLlib has extensive logging features. RLlib will print logs to the standard output (command line). You can also access the logs (and manage jobs) in <a href="https://docs.ray.io/en/master/ray-dashboard.html" target="_blank" rel="noreferrer noopener nofollow">Ray Dashboard</a>. In <a href="/blog/logging-in-reinforcement-learning-frameworks" target="_blank" rel="noreferrer noopener">this post</a>, I described how to extend RLlib logging to send metrics to Neptune. It also describes different logging techniques. I highly recommend reading it!</li>



<li>Vectorized environment (VE) feature<br>Yes, see <a href="https://docs.ray.io/en/master/rllib-env.html#vectorized" target="_blank" rel="noreferrer noopener nofollow">here</a>. Moreover, it’s possible to distribute the training among multiple compute nodes e.g. on the cluster.</li>



<li>Regular updates<br>RLlib is maintained and actively developed.</li>
</ol>



<p>From my experience, RLlib is a very powerful framework that covers many applications and at the same time remains quite easy to use. That being said, because of the many layers of abstractions, it’s really hard to extend with your code as it’s hard to find where you should even put your code! That’s why I would recommend it for developers that look for training the models for production and not for researchers that have to rapidly change algorithms and implement new features.</p>
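

<p>For reference, a minimal training loop with RLlib (a sketch assuming a pre-2.0 Ray release that still exposes the ray.rllib.agents Trainer API, and the CartPole-v0 Gym environment) can be as short as:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init()

# The Trainer bundles the algorithm, rollout workers, and config
trainer = PPOTrainer(env='CartPole-v0', config={'num_workers': 2})

# Each call to train() runs one training iteration and returns metrics
for _ in range(10):
    result = trainer.train()
    print(result['episode_reward_mean'])</pre>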



<h2 class="wp-block-heading" class="wp-block-heading" id="h-dopamine">Dopamine </h2>



<p>“Dopamine is a research framework for fast prototyping of reinforcement learning algorithms. It aims to fill the need for a small, easily grokked codebase in which users can freely experiment with wild ideas (speculative research).” ~ <a href="https://github.com/google/dopamine" target="_blank" rel="noreferrer noopener nofollow">GitHub</a></p>



<ol class="wp-block-list">
<li>Number of state-of-the-art (SOTA) RL algorithms implemented<br>It focuses on supporting the state-of-the-art, single-GPU DQN, Rainbow, C51, and IQN agents. Their Rainbow agent implements the three components identified as most important by Hessel et al.:
<ol class="wp-block-list">
<li>n-step Bellman updates (see e.g. Mnih et al., 2016)</li>



<li>Prioritized experience replay (Schaul et al., 2015)</li>



<li>Distributional reinforcement learning (C51; Bellemare et al., 2017)</li>
</ol>
</li>



<li>Official documentation, availability of simple tutorials and examples<br>Concise documentation is available in the GitHub repo <a href="https://github.com/google/dopamine/tree/master/docs" target="_blank" rel="noreferrer noopener nofollow">here</a>. It’s not a very popular framework, so it may lack tutorials. However, the authors provide <a href="https://github.com/google/dopamine/tree/master/dopamine/colab" target="_blank" rel="noreferrer noopener nofollow">colabs</a> with many examples of training and visualization.</li>



<li>Readable code that is easy to customize<br>The authors’ design principles are:
<ol class="wp-block-list">
<li>Easy experimentation: Make it easy for new users to run benchmark experiments.</li>



<li>Flexible development: Make it easy for new users to try out research ideas.</li>



<li>Compact and reliable: Provide implementations for a few, battle-tested algorithms.</li>



<li>Reproducible: Facilitate reproducibility in results. In particular, their setup follows the recommendations given by Machado et al. (2018).</li>
</ol>
</li>



<li>Number of supported environments<br>It’s mainly intended for Atari 2600 game-playing. It supports OpenAI Gym.</li>



<li>Logging and tracking tools support<br>It supports TensorBoard logging and provides some other visualization tools, presented in <a href="https://github.com/google/dopamine/tree/master/dopamine/colab" target="_blank" rel="noreferrer noopener nofollow">colabs</a>, like recording video of an agent play and seaborn plotting.</li>



<li>Vectorized environment (VE) feature<br>No vectorized environments support.</li>



<li>Regular updates<br>Dopamine is maintained.</li>
</ol>



<p>If you look for a customizable framework with well-tested DQN based algorithms, then this may be your pick. Under the hood, it runs using TensorFlow or JAX.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-spinningup">SpinningUp</h2>



<p>“While fantastic repos like garage, Baselines, and rllib make it easier for researchers who are already in the field to make progress, they build algorithms into frameworks in ways that involve many non-obvious choices and trade-offs, which makes them hard to learn from. [&#8230;] The algorithm implementations in the Spinning Up repo are designed to be:</p>



<ul class="wp-block-list">
<li>as simple as possible while still being reasonably good,</li>



<li>and highly consistent with each other to expose fundamental similarities between algorithms.</li>
</ul>



<p>They are almost completely self-contained, with virtually no common code shared between them (except for logging, saving, loading, and MPI utilities), so that an interested person can study each algorithm separately without having to dig through an endless chain of dependencies to see how something is done. The implementations are patterned so that they come as close to pseudocode as possible, to minimize the gap between theory and code.” ~ <a href="https://spinningup.openai.com/en/latest/user/introduction.html#what-this-is" target="_blank" rel="noreferrer noopener nofollow">Website</a></p>



<ol class="wp-block-list">
<li>Number of state-of-the-art (SOTA) RL algorithms implemented<br>VPG, PPO, TRPO, DDPG, TD3, SAC</li>



<li>Official documentation, availability of simple tutorials and examples<br>Great documentation and education materials with multiple examples.</li>



<li>Readable code that is easy to customize<br>This code is highly readable. From my experience, it’s the most readable framework out there. Every algorithm is contained in its own two well-commented files, which also makes it as easy as possible to modify. On the other hand, it’s harder to maintain for the same reason: if you add something to one algorithm, you have to manually add it to the others too.</li>



<li>Number of supported environments<br>It supports the OpenAI Gym environments out of the box and relies on its API. So you can extend it to use other environments that conform to this API.</li>



<li>Logging and tracking tools support<br>It has a light logger that prints metrics to the standard output (cmd) and saves them to a file. I’ve written a <a href="https://neptune.ai/blog/logging-in-reinforcement-learning-frameworks">post</a> on how to add Neptune support to SpinningUp.</li>



<li>Vectorized environment (VE) feature<br>No vectorized environments support.</li>



<li>Regular updates<br>SpinningUp is maintained.</li>
</ol>



<p>Although it was created as an educational resource, the code’s simplicity and state-of-the-art results make it a perfect framework for quickly prototyping your research ideas. I use it in my own research and even implement new algorithms in it using the same code structure. <a href="https://github.com/awarelab/spinningup_tf2" target="_blank" rel="noreferrer noopener nofollow">Here</a> you can find a port of the SpinningUp code to TensorFlow v2 by me and my colleagues from AwareLab.</p>
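

<p>For illustration, launching a run from Python (a sketch assuming the PyTorch variant of the algorithms and the LunarLander-v2 Gym environment) is as simple as calling the algorithm function:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import gym
import torch
from spinup import ppo_pytorch as ppo

# Spinning Up algorithms take a function that builds the environment
env_fn = lambda: gym.make('LunarLander-v2')

# Actor-critic sizes and logging are passed as keyword arguments
ac_kwargs = dict(hidden_sizes=[64, 64], activation=torch.nn.ReLU)
logger_kwargs = dict(output_dir='runs/ppo_lunarlander', exp_name='ppo_lunarlander')

ppo(env_fn=env_fn, ac_kwargs=ac_kwargs, steps_per_epoch=4000, epochs=50,
    logger_kwargs=logger_kwargs)</pre>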



<h2 class="wp-block-heading" class="wp-block-heading" id="h-garage">garage</h2>



<p>“garage is a toolkit for developing and evaluating reinforcement learning algorithms, and an accompanying library of state-of-the-art implementations built using that toolkit. [&#8230;] The most important feature of garage is its comprehensive automated unit test and benchmarking suite, which helps ensure that the algorithms and modules in garage maintain state-of-the-art performance as the software changes.” ~ <a href="https://github.com/rlworkgroup/garage" target="_blank" rel="noreferrer noopener nofollow">GitHub</a></p>



<ol class="wp-block-list">
<li>Number of state-of-the-art (SOTA) RL algorithms implemented<br>All major RL algorithms (VPG, PPO, TRPO, DQN, DDPG, TD3, SAC, &#8230;), with their multi-task versions (MT-PPO, MT-TRPO, MT-SAC), meta-RL algorithms (Task embedding, MAML, PEARL, RL2, &#8230;), evolutional strategy algorithms (CEM, CMA-ES), and behavioural cloning.</li>



<li>Official documentation, availability of simple tutorials and examples<br>Comprehensive documentation included with many examples and some tutorials of e.g. how to add a new environment or implement a new algorithm.</li>



<li>Readable code that is easy to customize<br>It’s created as a flexible and structured tool for developing, experimenting and evaluating algorithms. It provides a scaffold for adding new methods.</li>



<li>Number of supported environments<br>Garage supports a variety of external environment libraries for different RL training purposes including OpenAI Gym, DeepMind DM Control, MetaWorld, and PyBullet. You should be able to easily <a href="https://garage.readthedocs.io/en/latest/user/implement_env.html" target="_blank" rel="noreferrer noopener nofollow">add your own environments</a>.</li>



<li>Logging and tracking tools support<br>The garage logger supports many outputs including std. output (cmd), plain text files, CSV files, and TensorBoard.</li>



<li>Vectorized environment (VE) feature<br>It supports vectorized environments and even allows one to distribute the training on the cluster.&nbsp;</li>



<li>Regular updates<br>garage is maintained.</li>
</ol>



<p>garage is similar to RLlib. It’s a big framework with distributed execution, supporting many additional features, like Docker, that go beyond simple training and monitoring. If such a tool is something you need, e.g. in a production environment, then I would recommend comparing it with RLlib and picking the one you like more.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-acme">Acme </h2>



<p>“Acme is a library of reinforcement learning (RL) agents and agent building blocks. Acme strives to expose simple, efficient, and readable agents, that serve both as reference implementations of popular algorithms and as strong baselines, while still providing enough flexibility to do novel research. The design of Acme also attempts to provide multiple points of entry to the RL problem at differing levels of complexity.” ~ <a href="https://github.com/deepmind/acme" target="_blank" rel="noreferrer noopener nofollow">GitHub</a></p>



<ol class="wp-block-list">
<li>Number of state-of-the-art (SOTA) RL algorithms implemented<br>It includes algorithms for continuous control (DDPG, D4PG, MPO, Distributional MPO, Multi-Objective MPO), discrete control (DQN, IMPALA, R2D2), learning from demonstrations (DQfD, R2D3), planning and learning (AlphaZero), and behavioural cloning.</li>



<li>Official documentation, availability of simple tutorials and examples<br>Documentation is rather sparse, but there are many examples and jupyter notebook tutorials available in the repo.</li>



<li>Readable code that is easy to customize<br>Code is easy to read but requires one to learn its structure first. It is easy to customize and add your own agents.</li>



<li>Number of supported environments<br>The Acme environment loop assumes an environment instance that implements the DeepMind Environment API. So any environment from DeepMind will work flawlessly (e.g. DM Control). It also provides a wrapper for the OpenAI Gym environments and the OpenSpiel RL environment loop. If your environment implements the OpenAI or DeepMind API, then you shouldn’t have problems with plugging it in.</li>



<li>Logging and tracking tools support<br>It includes a basic logger and supports printing to the standard output (cmd) and saving to CSV files. I’ve written a <a href="/blog/logging-in-reinforcement-learning-frameworks" target="_blank" rel="noreferrer noopener">post</a> on how to add Neptune support to Acme.</li>



<li>Vectorized environment (VE) feature<br>No vectorized environments support.</li>



<li>Regular updates<br>Acme is maintained and actively developed.</li>
</ol>



<p>Acme is simple like SpinningUp but a tier higher when it comes to the use of abstraction. This makes it easier to maintain (code is more reusable) but, on the other hand, harder to find the exact spot in the implementation you should change when tinkering with an algorithm. It supports both TensorFlow v2 and JAX, with the latter being an interesting option as <a href="https://deepmind.com/blog/article/using-jax-to-accelerate-our-research" target="_blank" rel="noreferrer noopener nofollow">JAX has been gaining traction recently</a>.</p>
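

<p>As an illustration of those entry points, training in Acme boils down to pairing an agent with an environment in an EnvironmentLoop. The sketch below is modeled on the quickstart materials and assumes the TensorFlow DQN agent, Sonnet, and a Gym CartPole environment; treat it as indicative rather than definitive:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import acme
from acme import specs, wrappers
from acme.agents.tf import dqn
import gym
import sonnet as snt

# Wrap a Gym environment so it exposes the dm_env API that Acme expects
environment = wrappers.GymWrapper(gym.make('CartPole-v0'))
environment = wrappers.SinglePrecisionWrapper(environment)
environment_spec = specs.make_environment_spec(environment)

# A small MLP Q-network; the agent adds replay, learning, and acting around it
network = snt.Sequential([
    snt.Flatten(),
    snt.nets.MLP([64, 64, environment_spec.actions.num_values]),
])
agent = dqn.DQN(environment_spec=environment_spec, network=network)

# The environment loop runs the usual act / observe / update cycle
loop = acme.EnvironmentLoop(environment, agent)
loop.run(num_episodes=100)</pre>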



<h1 class="wp-block-heading" class="wp-block-heading" id="h-coax">coax</h1>



<p>“Coax is a modular Reinforcement Learning (RL) python package for solving OpenAI Gym environments with JAX-based function approximators. [&#8230;] The primary thing that sets coax apart from other packages is that it is designed to align with the core RL concepts, not with the high-level concept of an agent. This makes coax more modular and user-friendly for RL researchers and practitioners.” ~ <a href="https://coax.readthedocs.io/en/latest/" target="_blank" rel="noreferrer noopener nofollow">Website</a></p>



<ol class="wp-block-list">
<li>Number of state-of-the-art (SOTA) RL algorithms implemented<br>It implements classical RL algorithms (SARSA, Q-Learning), value-based deep RL algorithms (Soft Q-Learning, DQN, Prioritized Experience Replay DQN, Ape-X DQN), and policy gradient methods (VPG, PPO, A2C, DDPG, TD3).</li>



<li>Official documentation, availability of simple tutorials and examples<br>Clear, if sometimes confusing, documentation with many code examples and algorithm explanations. It also includes tutorials for running training on the Pong, CartPole, FrozenLake, and Pendulum environments.</li>



<li>Readable code that is easy to customize<br>Other RL frameworks often hide structure that you (the RL practitioner) are interested in. Coax makes the network architecture take the center stage, so you can define your own forward-pass function. Moreover, the design of coax is agnostic of the details of your training loop. You decide how and when you update your function approximators.</li>



<li>Number of supported environments<br>Coax mostly focuses on OpenAI Gym environments. However, you should be able to extend it to other environments that implement this API.</li>



<li>Logging and tracking tools support<br>It utilizes the Python logging module.</li>



<li>Vectorized environment (VE) feature<br>No vectorized environments support.</li>



<li>Regular updates<br>coax is maintained.</li>
</ol>



<p>I would recommend coax for education purposes. If you want to plug-n-play with nitty-gritty details of RL algorithms, this is a good tool to do this. It’s also built around JAX, which may be a plus in itself (<a href="https://moocaholic.medium.com/jax-a13e83f49897" target="_blank" rel="noreferrer noopener nofollow">because of hype around it</a>).</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-surreal">SURREAL</h2>



<p>“Our goal is to make Deep Reinforcement Learning accessible to everyone. We introduce Surreal, an open-source, reproducible, and scalable distributed reinforcement learning framework. Surreal provides a high-level abstraction for building distributed reinforcement learning algorithms.” ~ <a href="https://surreal.stanford.edu" target="_blank" rel="noreferrer noopener nofollow">Website</a></p>



<ol class="wp-block-list">
<li>Number of state-of-the-art (SOTA) RL algorithms implemented<br>It focuses on distributed deep RL algorithms. As of now, the authors have implemented their distributed variants of PPO and DDPG.</li>



<li>Official documentation, availability of simple tutorials and examples<br>It provides basic documentation in the repo on installing, running, and customizing the algorithms. However, it lacks code examples and tutorials.</li>



<li>Readable code that is easy to customize<br>The code structure can frighten one away; it’s not something for newcomers. That being said, the code includes docstrings and is readable.</li>



<li>Number of supported environments<br>It supports OpenAI Gym and DM Control environments, as well as Robosuite (the Surreal Robotics Suite), a standardized and accessible robot manipulation benchmark built on the MuJoCo physics engine.</li>



<li>Logging and tracking tools support<br>It includes specialized logging tools for the distributed environment that also allow you to record videos of agents playing.</li>



<li>Vectorized environment (VE) feature<br>No vectorized environments support. However, it allows one to distribute the training on the cluster.</li>



<li>Regular updates<br>It doesn’t seem to be maintained anymore.</li>
</ol>



<p>I include this framework on the list mostly for reference. If you develop a distributed RL algorithm, you may learn a thing or two from this repo, e.g. how to manage work on a cluster. Nevertheless, there are better options to build on, like RLlib or garage.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-final-thoughts">Final thoughts</h2>



<p>In this article, we have figured out what to look out for when choosing RL tools, which RL libraries are out there, and what features they have.</p>



<p>To my knowledge, the best publicly available libraries are <strong>Tensorforce</strong>, <strong>Stable Baselines</strong>, and <strong>RL_Coach</strong>. You should consider picking one of them as your RL tool. All of them can be considered up-to-date, have a great set of algorithms implemented, and provide valuable tutorials as well as complete documentation. If you want to experiment with different algorithms, you should use <strong>RL_Coach</strong>. For other tasks, please consider using either <strong>Stable Baselines</strong> or <strong>Tensorforce</strong>.</p>



<p>Hopefully, with this information, you will have no problems choosing the RL library for your next project.</p>



<section id="note-block_d2c31b43d167a50fe5a52d9719b6ca73"
         class="block-note c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-note__header">
            Note:        </h3>
    
    <div class="block-note__content">
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>Libraries KerasRL, Tensorforce, Pyqlearning, RL_Coach, TFAgents, Stable Baselines, and MushroomRL were described by Vladimir Lyashenko. </p>
                                    </div>

            </div>
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>Libraries RLlib, Dopamine, SpinningUp, garage, Acme, coax, and SURREAL were described by Piotr Januszewski.</p>
                                    </div>

            </div>
            </div>


</section>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2944</post-id>	</item>
		<item>
		<title>7 Applications of Reinforcement Learning in Finance and Trading</title>
		<link>https://neptune.ai/blog/7-applications-of-reinforcement-learning-in-finance-and-trading</link>
		
		<dc:creator><![CDATA[Soumo Chatterjee]]></dc:creator>
		<pubDate>Thu, 21 Jul 2022 10:31:22 +0000</pubDate>
				<category><![CDATA[Reinforcement Learning]]></category>
		<guid isPermaLink="false">https://neptune.test/7-applications-of-reinforcement-learning-in-finance-and-trading/</guid>

					<description><![CDATA[In this article, we will explore 7 real world trading and finance applications where reinforcement learning is used to get a performance boost. Ok but before we move on to the nitty gritty of this article let’s define a few concepts that I will use later.&#160; For starters let’s quickly define reinforcement learning: A learning&#8230;]]></description>
										<content:encoded><![CDATA[
<p>In this article, we will explore 7 real world trading and finance applications where reinforcement learning is used to get a performance boost.</p>



<p>Ok but before we move on to the nitty gritty of this article let’s define a few concepts that I will use later.&nbsp;</p>



<p>For starters let’s quickly define reinforcement learning:</p>



<blockquote class="wp-block-quote is-style-default is-layout-flow wp-block-quote-is-layout-flow">
<p>A learning process in which an agent interacts with its environment through trial and error to reach a defined goal, in such a way that the agent maximizes the cumulative reward it collects and minimizes the penalties the environment gives for wrong steps made on the way to that goal.</p>
</blockquote>



<p>Cool, now a few keywords that I will use a lot:</p>



<ol class="wp-block-list">
<li><strong>Deep Reinforcement Learning (DRL):</strong> Algorithms that employ deep learning to approximate value or policy functions that are at the core of reinforcement learning.</li>



<li><strong>Policy Gradient Reinforcement Learning Technique:</strong> Approach used in solving reinforcement learning problems. Policy gradient methods target modeling and optimizing the policy function directly.&nbsp;</li>



<li><strong>Deep Q Learning:</strong> Using a neural network to approximate the <strong>Q</strong>-value function. In classic (tabular) Q-learning, the Q-value function is an exact matrix of state-action values that the agent can “refer to” to maximize its reward in the long run; Deep Q Learning replaces this table with a neural network (see the short sketch right after this list).</li>



<li><strong>Gated Recurrent Unit (GRU):</strong> Special type of Recurrent Neural Network, implemented with the help of a gating mechanism.</li>



<li><strong>Gated Deep Q Learning strategy:</strong> Combination of Deep Q Learning with GRU.</li>



<li><strong>Gated Policy Gradient strategy:</strong> Combination of Policy gradient technique with GRU.</li>



<li><strong>Deep Recurrent Q Network:</strong> Combination of Recurrent Neural networks with the Q Learning technique.</li>
</ol>
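

<p>To make the Q-learning terms above more concrete, here is a tiny, framework-free sketch of the tabular Q-value update that Deep Q Learning approximates with a neural network. The environment object and its step method are hypothetical placeholders, only there for illustration:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import numpy as np

n_states, n_actions = 10, 3        # sizes of a toy, discretized problem
alpha, gamma, epsilon = 0.1, 0.99, 0.1

# The "matrix" of Q-values that a tabular agent refers to
Q = np.zeros((n_states, n_actions))

def q_learning_step(env, state):
    # Epsilon-greedy action selection: explore with probability epsilon
    if np.random.binomial(1, epsilon):
        action = np.random.randint(n_actions)
    else:
        action = int(np.argmax(Q[state]))

    # env is a placeholder for your (discretized) trading environment
    next_state, reward, done = env.step(action)

    # Temporal-difference update toward reward + discounted best future value
    target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
    Q[state, action] += alpha * (target - Q[state, action])
    return next_state, done</pre>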



<p>OK, now we’re ready to check out how reinforcement learning is used to maximize profits in the finance world.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-1-trading-bots-with-reinforcement-learning">1. Trading bots with Reinforcement Learning</h2>



<p>Bots powered with reinforcement learning can learn from the trading and stock market environment by interacting with it. They use trial and error to optimize their learning strategy based on the characteristics of each and every stock listed in the stock market.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/trading-bots.png?resize=512%2C346&#038;ssl=1" alt="trading bots" class="wp-image-29919" style="width:512px;height:346px" width="512" height="346"/><figcaption class="wp-element-caption"><em>Image by <a href="https://pixabay.com/pl/users/manfredsteger-1848497/" target="_blank" rel="noreferrer noopener nofollow">Manfred Steger</a> | Source: <a href="https://pixabay.com/pl/vectors/piksel-chen-techbot-teach-o-bot-3947912/" target="_blank" rel="noreferrer noopener nofollow">Pixabay</a></em></figcaption></figure>
</div>


<p>There are a few big advantages to this approach:</p>



<ul class="wp-block-list">
<li>saves time</li>



<li>trading bots can trade on a 24hrs timeline basis</li>



<li>trading gets diversified across all industries</li>
</ul>



<p>As an example, you can check out the <a href="https://github.com/pskrunner14/trading-bot" target="_blank" rel="noreferrer noopener nofollow">Stock Trading Bot using Deep Q-Learning</a> project. The idea here was to create a trading bot using the Deep Q Learning technique, and tests show that a trained bot is capable of deciding whether to buy or sell at a single point in time, given a set of stocks to trade on.</p>



<p>Please note that this project does not account for transaction costs, trade execution efficiency, etc., so it can’t be expected to perform outstandingly in the real world. Plus, training is done on a CPU due to its sequential nature.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/trading-bot-visualization.png?ssl=1" alt="trading bot visualization" class="wp-image-29920"/><figcaption class="wp-element-caption"><a href="https://github.com/pskrunner14/trading-bot" target="_blank" rel="noreferrer noopener nofollow"><em>Source</em></a></figcaption></figure>
</div>


<h2 class="wp-block-heading" class="wp-block-heading" id="h-2-chatbot-based-reinforcement-learning">2. Chatbot-based Reinforcement Learning</h2>



<p>Chatbots are generally trained with the help of <strong>sequence to sequence modelling</strong>, but adding reinforcement learning to the mix can have big advantages for stock trading and finance:</p>



<ul class="wp-block-list">
<li>Chatbots can act as brokers and offer real-time quotes to their user operators.</li>



<li>Conversational UI-based chatbots can help customers resolve their issues instead of someone from the staff or from the backend support team. This saves time and relieves the support staff of repetitive tasks, letting them concentrate on more complicated issues.</li>



<li>Chatbots can also give suggestions on opening and closing sales values within trading hours.</li>
</ul>



<p>The <a href="https://github.com/pochih/RL-Chatbot" target="_blank" rel="noreferrer noopener nofollow">Deep Reinforcement Learning Chatbot</a> project shows a chatbot implementation based on reinforcement learning, achieved with the Policy gradient technique.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/chatbot-results.png?ssl=1" alt="chatbot results" class="wp-image-29923"/><figcaption class="wp-element-caption"><a href="https://github.com/pochih/RL-Chatbot" target="_blank" rel="noreferrer noopener nofollow"><em>Source</em></a></figcaption></figure>
</div>


<h2 class="wp-block-heading" class="wp-block-heading" id="h-3-risk-optimization-in-peer-to-peer-lending-with-reinforcement-learning">3. Risk optimization in peer-to-peer lending with Reinforcement Learning</h2>



<p>P2P lending is a way of providing individuals and businesses with loans through online services. These online services do the job of matching borrowers with lenders (investors).</p>



<p>In these types of online marketplaces, reinforcement learning comes in handy. Specifically it can be used to:</p>



<ul class="wp-block-list">
<li>Analyze borrowers’ credit scores to reduce risk.</li>



<li>Predicting annualized returns: since these online platforms have low overhead, lenders can expect higher returns compared to savings and investment products offered by banks.</li>



<li>It can also help estimate the likelihood that the borrower will be able to meet their debt obligations.</li>
</ul>



<p>The <a href="https://github.com/as6140/peervest" target="_blank" rel="noreferrer noopener nofollow">Peer-to-Peer Lending Robo-Advisor Using a Neural Network</a> project is an online lending platform built with a Neural Network. It doesn’t use reinforcement learning, but you can see that it’s just the kind of trial &amp; error scenario where RL would make perfect sense.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/online-lending-platform.png?ssl=1" alt="online lending platform" class="wp-image-29925"/><figcaption class="wp-element-caption"><a href="https://github.com/as6140/peervest" target="_blank" rel="noreferrer noopener nofollow"><em>Source</em></a></figcaption></figure>
</div>


<h2 class="wp-block-heading" class="wp-block-heading" id="h-4-portfolio-management-with-deep-reinforcement-learning">4. Portfolio Management with Deep Reinforcement Learning</h2>



<p>Portfolio Management means taking your client’s assets, putting them into stocks, and managing them on a continuous basis to help the client achieve their financial goals. With the help of Deep Policy Network Reinforcement Learning, the allocation of assets can be optimized over time.&nbsp;</p>



<p>In this case, the benefits of deep reinforcement learning are:</p>



<ul class="wp-block-list">
<li>It enhances the efficiency and success rates of human managers.</li>



<li>It decreases organizational risk.</li>



<li>It increases Return on Investments (ROI) in terms of organizational profit.&nbsp;</li>
</ul>



<p><a href="https://github.com/selimamrouni/Deep-Portfolio-Management-Reinforcement-Learning" target="_blank" rel="noreferrer noopener nofollow">Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem</a> &#8211; this project shows an implementation of portfolio management with Deep Policy Network Reinforcement Learning.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-portfolio-management.png?resize=768%2C443&#038;ssl=1" alt="RL portfolio management" class="wp-image-29928" style="width:768px;height:443px" width="768" height="443"/><figcaption class="wp-element-caption"><em><a href="https://github.com/selimamrouni/Deep-Portfolio-Management-Reinforcement-Learning" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h2 class="wp-block-heading" class="wp-block-heading" id="h-5-price-setting-strategies-with-reinforcement-learning">5. Price setting strategies with Reinforcement Learning</h2>



<p>Complexity and dynamic price changes are the biggest challenges in modeling stock prices. To capture these properties, Gated Recurrent Unit (GRU) networks work well with reinforcement learning, providing advantages such as:</p>



<ul class="wp-block-list">
<li>Extracting informative financial features which can represent the intrinsic character of a stock.</li>



<li>Helping to decide the stop loss and stop profit during trading.</li>
</ul>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-price-setting.jpg?resize=512%2C264&#038;ssl=1" alt="RL price setting" class="wp-image-29930" style="width:512px;height:264px" width="512" height="264"/><figcaption class="wp-element-caption"><em>Photo by Olya Kobruseva | Source: Pexels</em></figcaption></figure>
</div>


<p>To support the above statements, the <a href="https://arxiv.org/ftp/arxiv/papers/1803/1803.03916.pdf" target="_blank" rel="noreferrer noopener nofollow">Deep reinforcement learning for time series: playing idealized trading games</a> paper shows which performs best out of Stacked Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM) units, Convolutional Neural Network (CNN), and Multi-Layer Perceptron (MLP).&nbsp;</p>



<p><strong>The GRU-based agents used to model Q values show the best overall performance in the Univariate game to capture a wave-like price time series.</strong></p>



<p>The two techniques with which reinforcement learning can be applied with GRU are:</p>



<ul class="wp-block-list">
<li>Gated Deep Q Learning Strategy</li>



<li>Gated Policy Gradient Strategy</li>
</ul>



<p>To understand these techniques better, you can check out this article: <a href="https://www.sciencedirect.com/science/article/pii/S0020025520304692?dgcid=rss_sd_all" target="_blank" rel="noreferrer noopener nofollow">Adaptive stock trading strategies with deep reinforcement learning methods</a>.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-6-recommendation-systems-with-reinforcement-learning">6. Recommendation systems with Reinforcement Learning</h2>



<p>When it comes to online trading platforms, recommendation systems based on reinforcement learning techniques can be a gamechanger. These systems can help in recommending the right stocks to users while trading.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-recommendation-systems.jpg?resize=512%2C266&#038;ssl=1" alt="RL recommendation systems" class="wp-image-29932" style="width:512px;height:266px" width="512" height="266"/><figcaption class="wp-element-caption"><em>Photo by <a href="https://www.pexels.com/@thisisengineering?utm_content=attributionCopyText&amp;utm_medium=referral&amp;utm_source=pexels" target="_blank" rel="noreferrer noopener nofollow">ThisIsEngineering</a> | Source: <a href="https://www.pexels.com/photo/code-projected-over-woman-3861969/?utm_content=attributionCopyText&amp;utm_medium=referral&amp;utm_source=pexels" target="_blank" rel="noreferrer noopener nofollow">Pexels</a></em></figcaption></figure>
</div>


<p>Reinforcement learning helps to choose the best stock or mutual fund after being trained on a number of stocks, ultimately leading to better ROI.</p>



<p>The advantages here can be:</p>



<ul class="wp-block-list">
<li>Engaging existing users by providing lifelong stock-picking recommendations based on the users&#8217; behaviour on the platform.</li>



<li>Helping beginners by suggesting good stocks to trade.</li>



<li>Making it easier to decide which stocks to pick.</li>
</ul>



<p>The <a href="https://github.com/doncat99/StockRecommendSystem" rel="noreferrer noopener nofollow" target="_blank">StockRecommendSystem</a> project shows an implementation of a system like this.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-7-maximizing-profit-with-minimum-capital-investments">7. Maximizing profit with minimum capital investments</h2>



<p>If we combine all of the above points, we could get an automated system constructed to achieve high returns, while keeping the investments as low as possible.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-maximizing-profit.jpg?resize=512%2C248&#038;ssl=1" alt="RL maximizing profit" class="wp-image-29934" style="width:512px;height:248px" width="512" height="248"/><figcaption class="wp-element-caption"><em>Photo by <a href="https://www.pexels.com/@karolina-grabowska?utm_content=attributionCopyText&amp;utm_medium=referral&amp;utm_source=pexels" target="_blank" rel="noreferrer noopener nofollow">Karolina Grabowska</a> | Source: <a href="https://www.pexels.com/photo/hands-holding-us-dollar-bills-4968630/?utm_content=attributionCopyText&amp;utm_medium=referral&amp;utm_source=pexels" target="_blank" rel="noreferrer noopener nofollow">Pexels</a></em></figcaption></figure>
</div>


<p>An agent can be trained with the help of reinforcement learning to take a minimal amount of capital from any source and allocate it to stocks in a way that aims to multiply the ROI in the future.</p>



<p>Nowadays, RL agents have been able to learn trading strategies that outperform the simple buy-and-sell strategies people used to apply. This can be achieved by framing trading as a Markov Decision Process (MDP) and using a Deep Recurrent Q Network (DRQN). A good resource to understand this concept is <a href="https://arxiv.org/pdf/1507.06527.pdf" target="_blank" rel="noreferrer noopener nofollow"><strong>Deep Recurrent Q-Learning for Partially Observable MDPs</strong></a><strong>.</strong></p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-proceed-with-caution">Proceed with caution</h2>



<p>It’s important to add that many of the projects we listed are essentially made for fun. They’re trained on past data and not backtested properly. Under unseen conditions (for example, the COVID-era market shock), the downside risk is much larger than the model expects.</p>



<p>The market is a complicated system, and it’s hard for machine learning models to understand stocks from historical data alone. ML-based trading strategies can perform impressively, but they can also drain your savings. So take these projects with a grain of salt.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-conclusion">Conclusion</h2>



<p>Reinforcement learning has always been somewhat underrated. By showing finance and trading use cases of RL in this article, I want to raise awareness of how useful RL can be and give new learners and existing developers a motivating path to explore this domain further. It’s a fascinating topic!</p>


]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2741</post-id>	</item>
		<item>
		<title>Best Reinforcement Learning Tutorials, Examples, Projects, and Courses</title>
		<link>https://neptune.ai/blog/best-reinforcement-learning-tutorials-examples-projects-and-courses</link>
		
		<dc:creator><![CDATA[Krissanawat Kaewsanmua]]></dc:creator>
		<pubDate>Thu, 21 Jul 2022 10:29:12 +0000</pubDate>
				<category><![CDATA[Reinforcement Learning]]></category>
		<guid isPermaLink="false">https://neptune.test/best-reinforcement-learning-tutorials-examples-projects-and-courses/</guid>

					<description><![CDATA[In reinforcement learning, your system learns how to interact intuitively with the environment by basically doing stuff and watching what happens &#8211; but obviously, there’s a lot more to it. If you’re interested in RL, this article will provide you with a ton of new content to explore this concept. A lot of work has&#8230;]]></description>
										<content:encoded><![CDATA[
<p>In reinforcement learning, your system learns how to interact intuitively with the environment by basically doing stuff and watching what happens &#8211; but obviously, there’s a lot more to it.</p>



<p>If you’re interested in RL, this article will provide you with a ton of new content to explore this concept. A lot of work has been done with reinforcement learning in the past few years, and I’ve collected some of the most interesting articles, videos, and use cases presenting different concepts, approaches, and methods.</p>



<p>In this list, you’ll find:&nbsp;</p>



<ul class="wp-block-list">
<li>reinforcement learning tutorials,&nbsp;</li>



<li>examples of where to apply reinforcement learning,</li>



<li>interesting reinforcement learning projects,</li>



<li>courses to master reinforcement learning.</li>
</ul>



<p>All this content will help you go from RL newbie to RL pro.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-reinforcement-learning-tutorials">Reinforcement learning tutorials</h2>



<p><strong>1. <a href="https://cai.tools.sap/blog/the-future-with-reinforcement-learning-part-1/" target="_blank" rel="noreferrer noopener nofollow">RL with Mario Bros</a></strong> &#8211; Learn about reinforcement learning in this unique tutorial based on one of the most popular arcade games of all time &#8211; Super Mario.</p>



<p><strong>2. <a href="https://medium.com/machine-learning-for-humans/reinforcement-learning-6eacf258b265" target="_blank" rel="noreferrer noopener nofollow">Machine Learning for Humans: Reinforcement Learning</a></strong> &#8211; This tutorial is part of an ebook titled &#8216;Machine Learning for Humans&#8217;. It explains the core concept of reinforcement learning. There are numerous examples, guidance on the next step to follow in the future of reinforcement learning algorithms, and an easy-to-follow figurative explanation.</p>



<p><strong>3. <a href="https://medium.com/free-code-camp/an-introduction-to-reinforcement-learning-4339519de419" target="_blank" rel="noreferrer noopener nofollow">An introduction to Reinforcement Learning</a></strong> &#8211; There’s a lot of knowledge here, explained with much clarity and enthusiasm. It starts with an overview of reinforcement learning with its processes and tasks, explores different approaches to reinforcement learning, and ends with a fundamental introduction of deep reinforcement learning.</p>



<p><strong>4. <a href="https://blog.insightdatascience.com/reinforcement-learning-from-scratch-819b65f074d8" target="_blank" rel="noreferrer noopener nofollow">Reinforcement Learning from scratch</a> </strong>&#8211; This article will take you through the author’s process of learning RL from scratch. The author has a lot of knowledge of deep reinforcement learning from working at Unity Technologies. Even beginners will be able to understand his overview of the core concepts of reinforcement learning.</p>



<p><strong>5. <a href="https://towardsdatascience.com/deep-reinforcement-learning-for-automated-stock-trading-f1dad0126a02" target="_blank" rel="noreferrer noopener nofollow">Deep Reinforcement Learning for Automated Stock Trading</a> </strong>&#8211; Here you’ll find a solution to a stock trading strategy using reinforcement learning, which optimizes the investment process and maximizes the return on investment. The article includes a proper explanation of three combined algorithms: Proximal Policy Optimization (PPO), Advantage Actor-Critic (A2C), and Deep Deterministic Policy Gradient (DDPG). The best of each algorithm is coordinated to provide a solution to optimized stock trading strategies.</p>



<p><strong>6. <a href="https://towardsdatascience.com/applications-of-reinforcement-learning-in-real-world-1a94955bcd12" target="_blank" rel="noreferrer noopener nofollow">Applications of Reinforcement Learning in Real World</a> </strong>&#8211; Explore how reinforcement learning frameworks are undervalued when it comes to devising decision-making models. A detailed study of RL applications in real-world projects, explaining what a reinforcement learning framework is, and listing its use-cases in real-world environments. It narrows down the applications to 8 areas of learning, consisting of topics like machine learning, deep learning, computer games, and more. The author also explores the relationship of RL with other disciplines and discusses the future of RL.</p>



<p><strong>7. <a href="https://github.com/yandexdataschool/Practical_RL" target="_blank" rel="noreferrer noopener nofollow">Practical RL</a> </strong>&#8211; This GitHub repo is an open-source course on reinforcement learning, taught on several college campuses. The repo is maintained to support online students with the option of two locales &#8211; Russian and English. The course features services like chat rooms, gradings, FAQs, feedback forms, and a virtual course environment. The course syllabus covers everything from the basics of RL to discussing and implementing different models, methods, and much more.</p>



<p><strong>8. </strong><a href="https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0#.78km20i8r" target="_blank" rel="noreferrer noopener nofollow"><strong>Simple Reinforcement Learning with Tensorflow Part 0: Q-Learning with Tables and Neural Networks</strong></a> &#8211; The first part of a tutorial series about reinforcement learning with TensorFlow. The author explores Q-learning, one of the main families of RL algorithms. The simple tabular look-up version of the algorithm is implemented first, followed by detailed guidance on implementing the same approach with a neural network in TensorFlow &#8211; definitely worth your interest.</p>
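<p>To see what the tabular look-up version boils down to before opening the tutorial, here is a minimal, self-contained sketch (not taken from the tutorial itself): Q-learning on a tiny hand-rolled chain environment, with the learning rate, discount factor, and exploration rate chosen arbitrarily for brevity:</p>



<pre class="wp-block-code"><code>import random

# Tiny deterministic chain: states 0..4, actions 0 = left, 1 = right.
# Reaching state 4 pays reward +1 and ends the episode.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1

def step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # the look-up table

for _ in range(500):                                # episodes
    state, done = 0, False
    while not done:
        # epsilon-greedy: mostly pick the best table entry, sometimes explore
        greedy = max(range(N_ACTIONS), key=lambda a: Q[state][a])
        action = random.choices(
            [greedy, random.randrange(N_ACTIONS)], weights=[1 - EPSILON, EPSILON]
        )[0]
        next_state, reward, done = step(state, action)
        # Q-learning update: move Q(s, a) toward r + gamma * max Q(s', .)
        target = reward + GAMMA * max(Q[next_state])
        Q[state][action] += ALPHA * (target - Q[state][action])
        state = next_state

print(Q)   # action 1 ("right") should dominate in every state</code></pre>



<p>The tutorial&#8217;s TensorFlow version replaces the table <code>Q</code> with a small neural network that approximates the same values.</p>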



<h2 class="wp-block-heading" class="wp-block-heading" id="h-examples-of-where-to-apply-reinforcement-learning">Examples of where to apply reinforcement learning</h2>



<p><strong>1. <a href="https://blog.insightdatascience.com/using-reinforcement-learning-to-design-a-better-rocket-engine-4dfd1770497a" target="_blank" rel="noreferrer noopener nofollow">Rocket engineering</a></strong> &#8211; Explore how reinforcement learning is used in the field of rocket engine development. You’ll find a lot of valuable information on the use of machine learning in manufacturing industries. See why reinforcement learning is favored over other machine learning algorithms when it comes to manufacturing rocket engines.</p>



<p><strong>2. <a href="https://traffic-signal-control.github.io/" target="_blank" rel="noreferrer noopener nofollow">Traffic Light Control</a> </strong>&#8211; This site provides multiple research papers and project examples that highlight the use of core reinforcement learning and deep reinforcement learning in traffic light control. It has tutorials, datasets, and relevant example papers that use RL as a backbone so that you can make a new finding of your own.</p>



<p><strong>3. <a href="https://www.mlq.ai/ai-in-advertising/" target="_blank" rel="noreferrer noopener nofollow">Marketing and advertising</a></strong> &#8211; See how to make an AI system learn from a pre-existing dataset which may be infeasible or unavailable, and how to make AI learn in real-time by creating advertising content. This is where they have made use of reinforcement learning.<a href="https://medium.com/@deepthiraghvendra96/reinforcement-learning-in-marketing-2d5b29f3ed8c">&nbsp;</a></p>



<p><strong>4. <a href="https://medium.com/@deepthiraghvendra96/reinforcement-learning-in-marketing-2d5b29f3ed8c">Reinforcement Learning in Marketing | by Deepthi A R</a></strong> &#8211; This example focuses on the changing business dynamics to which marketers need to adapt. The AI equipped with a reinforcement learning scheme can learn from real-time changes and help devise a proper marketing strategy. This article highlights the changing business environment as a problem and reinforcement learning as a solution to it.</p>



<p><strong>5. <a href="https://www.youtube.com/watch?v=luzOblzznIc" target="_blank" rel="noreferrer noopener nofollow">Robotics</a></strong> &#8211; This video demonstrates the use of reinforcement learning in robotics. The aim is to show the implementation of autonomous reinforcement learning agents for robotics. A prime example of using reinforcement learning in robotics.</p>



<p><strong>6. <a href="https://medium.com/inside-machine-learning/recommendation-systems-using-reinforcement-learning-de6379eecfde" target="_blank" rel="noreferrer noopener nofollow">Recommendation</a> </strong>&#8211; Recommendation systems are widely used in eCommerce and business sites for product advertisement. There’s always a recommendation section displayed in many popular platforms such as YouTube, Google, etc. The ability of AI to learn from real-time user interactions, and then suggest them content, would not have been possible without reinforcement learning. This article shows the use of reinforcement learning algorithms and practical implementations in recommendation systems.</p>



<p><strong>7. <a href="https://www.mygreatlearning.com/blog/reinforcement-learning-in-healthcare/" target="_blank" rel="noreferrer noopener nofollow">Healthcare</a></strong> &#8211; Healthcare is a huge industry with many state-of-the-art technologies bound to it, where the use of AI is not new. The main question here is how to optimize AI in healthcare, and make it learn based on real-time experiences. This is where reinforcement learning comes in. Reinforcement learning has undeniable value for healthcare, with its ability to regulate ultimate behaviors. With RL, healthcare systems can provide more detailed and accurate treatment at reduced costs.</p>



<p><strong>8. <a href="https://venturebeat.com/2020/06/30/researchers-combine-reinforcement-learning-and-nlp-to-escape-a-grue-monster/" target="_blank" rel="noreferrer noopener nofollow">NLP</a> </strong>&#8211; This article shows the use of reinforcement learning in combination with Natural Language Processing to beat a question and answer adventure game. This example might be an inspiration for learners engaged in Natural Language Processing and gaming solutions.</p>



<p><strong>9. <a href="https://towardsdatascience.com/deep-reinforcement-learning-for-automated-stock-trading-f1dad0126a02" target="_blank" rel="noreferrer noopener nofollow">Trading</a></strong> &#8211; Deep reinforcement learning is a force to reckon with when it comes to the stock trading market. The example here demonstrates how deep reinforcement learning techniques can be used to analyze the stock trading market, and provide proper investment reports. Only an AI equipped with reinforcement learning can provide accurate stock market reports.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-interesting-reinforcement-learning-projects">Interesting reinforcement learning projects</h2>



<p><strong>1. <a href="https://github.com/carla-simulator/carla" target="_blank" rel="noreferrer noopener nofollow">CARLA</a></strong><a href="https://github.com/carla-simulator/carla" target="_blank" rel="noreferrer noopener nofollow"> </a>&#8211; CARLA is an open-source simulator for autonomous driving research. The main objective of CARLA is to support the development, training, and validation of autonomous driving systems. With a package of open-source code and protocols, CARLA provides digital assets that are free to use. The CARLA eco-system also integrates code for running Conditional Reinforcement Learning models, with standalone GUI, to enhance maps with traffic lights and traffic signs information.</p>



<p><strong>2. <a href="https://github.com/yenchenlin/DeepLearningFlappyBird" target="_blank" rel="noreferrer noopener nofollow">Deep Learning Flappy Bird</a></strong> &#8211; If you want to learn about deep Q learning algorithms in an interesting way, then this GitHub repo is for you. The project uses a Deep Q-Network to learn how to play Flappy Bird. It follows the concept of the Deep Q learning algorithm which is in the family of reinforcement learning.</p>



<p><strong>3. <a href="https://github.com/tensorforce/tensorforce" target="_blank" rel="noreferrer noopener nofollow">Tensorforce</a></strong><a href="https://github.com/tensorforce/tensorforce" target="_blank" rel="noreferrer noopener nofollow"> </a>&#8211; This project delivers an open-source deep reinforcement learning framework specialized in modular flexible library design and direct usability for applications in research and practice. It is built on top of Google&#8217;s Tensorflow framework. It houses high-level design implementation such as modular component-based design, separation of RL algorithm and application, and full-on TensorFlow models.</p>



<p><strong>4. <a href="https://github.com/ray-project/ray" target="_blank" rel="noreferrer noopener nofollow">Ray</a></strong><a href="https://github.com/ray-project/ray" target="_blank" rel="noreferrer noopener nofollow"> </a>&#8211; Ray’s main objective is to provide universal APIs for building distributed applications. This project makes use of the RLlib package, which is a scalable Reinforcement Learning library that accelerates machine learning workloads.</p>



<p><strong>5. <a href="https://github.com/janhuenermann/neurojs" target="_blank" rel="noreferrer noopener nofollow">Neurojs</a> </strong>&#8211; JavaScript is popular, and a must for developing websites. What if you need to incorporate reinforcement learning in your JS web project? Say hello to Neurojs, a JavaScript framework for deep learning in the browser using reinforcement learning. It can also perform some neural network tasks as well.</p>



<p><strong>6. <a href="https://github.com/aleju/mario-ai" target="_blank" rel="noreferrer noopener nofollow">Mario AI</a> </strong>&#8211; This one will definitely grab your interest if you are looking for a project with reinforcement learning algorithms for simulating games. Mario AI&nbsp;offers a coding implementation to train a model that plays the first level of Super Mario World automatically, using only raw pixels as the input. The algorithm applied is a deep Q-learning algorithm in the family of reinforcement learning algorithms.</p>



<p><strong>7. <a href="https://github.com/samre12/deep-trading-agent" target="_blank" rel="noreferrer noopener nofollow">Deep Trading Agent</a> </strong>&#8211; Open-source project offering a deep reinforcement learning based trading agent for Bitcoin. The project makes use of the DeepSense Network for Q function approximation. The goal is to simplify the trading process using a reinforcement learning algorithm optimizing the Deep Q-learning agent. It can be a great source of knowledge.</p>



<p><strong>8. <a href="https://github.com/evilsocket/pwnagotchi" target="_blank" rel="noreferrer noopener nofollow">Pwnagotchi </a></strong>&#8211; This project will blow your mind if you are into cracking Wifi networks using deep reinforcement learning techniques. Pwnagotchi is a system that learns from its surrounding Wi-Fi environment to maximize the crackable WPA key material it captures. Unlike most reinforcement learning-based systems, Pwnagotchi amplifies its parameters over time to get better at cracking WiFi networks in the environments you expose it to.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-courses-to-master-reinforcement-learning">Courses to master reinforcement learning&nbsp;</h2>



<p><strong>1.</strong> <strong><a href="https://click.linksynergy.com/deeplink?id=vedj0cWlu2Y&amp;mid=40328&amp;u1=ddreinforcementlearning1&amp;murl=https%3A%2F%2Fwww.coursera.org%2Fspecializations%2Freinforcement-learning" target="_blank" rel="noreferrer noopener nofollow">Reinforcement Learning Specialization (Coursera)</a></strong> &#8211; One of the best courses available in the market. With a total rating of 4.8 stars and 21000+ students already enrolled, this course will help you master the concepts of reinforcement learning. You will learn how to implement a complete RL solution and take note of its application to solve real-world problems. By the end of this course,&nbsp; you will be able to formalize tasks as a reinforcement learning problem and its due solutions, understand the concepts of RL algorithms, and how RL fits under the broader umbrella of machine learning.</p>



<p><strong>2.</strong> <strong><a href="https://click.linksynergy.com/deeplink?id=vedj0cWlu2Y&amp;mid=39197&amp;u1=ddreinforcementlearning4&amp;murl=https%3A%2F%2Fwww.udemy.com%2Fcourse%2Fartificial-intelligence-reinforcement-learning-in-python%2F" rel="noreferrer noopener nofollow" target="_blank">Reinforcement Learning in Python (Udemy)</a> </strong>&#8211;<strong> </strong>This is a premium course offered by Udemy at the price of 29.99 USD. It has a rating of 4.5 stars overall with more than 39,000 learners enrolled. This course is a learning playground for those who are seeking to implement an AI solution with reinforcement learning engaged in Python programming. Through theoretical and practical implementations, you will learn to apply gradient-based supervised machine learning methods to reinforcement learning, programming implementations of numerous reinforcement learning algorithms, and also know the relationship between RL and psychology.</p>



<p><strong>3.</strong> <strong><a href="https://click.linksynergy.com/deeplink?id=vedj0cWlu2Y&amp;mid=40328&amp;u1=ddreinforcementlearning5&amp;murl=https%3A%2F%2Fwww.coursera.org%2Flearn%2Fpractical-rl" target="_blank" rel="noreferrer noopener nofollow">Practical Reinforcement Learning (Coursera)</a></strong> &#8211; With a rating of 4.2,&nbsp; and 37,000+learners, this course is the essential section of the Advanced Machine Learning Specialization. You are guaranteed to get knowledge of practical implementation of RL algorithms. You’ll get insights on the foundations of RL methods, and using neural network technologies for RL. One interesting part is training neural networks to play games on their own using RL.</p>



<p><strong>4. <a href="https://www.pluralsight.com/courses/understanding-algorithms-reinforcement-learning?aid=7010a000002LUv7AAG&amp;promo=&amp;utm_source=non_branded&amp;utm_medium=digital_paid_search_google&amp;utm_campaign=XYZ_APAC_Dynamic&amp;utm_content=&amp;gclid=Cj0KCQjwoJX8BRCZARIsAEWBFMJrW7gzrS94r_hfE0HJkb2JcGiOCPoL0SfrvNZSvGaYD-U9GJZKkdwaAjQFEALw_wcB" target="_blank" rel="noreferrer noopener nofollow">Understanding Algorithms for Reinforcement Learning</a></strong> &#8211; If you are a total beginner in the field of Reinforcement learning then this might be the best course for you. With an overall rating of 4.0 stars and a duration of nearly 3 hours in the PluralSight platform, this course can be a quick way to get yourself started with reinforcement learning algorithms. You’ll get deep information on algorithms for reinforcement learning, basic principles of reinforcement learning algorithms, RL taxonomy, and RL family algorithms such as Q-learning and SARSA.</p>



<p><strong>5. <a href="http://www.anrdoezrs.net/links/7964208/type/dlg/sid/ddreinforcementlearning/https://www.udacity.com/course/reinforcement-learning--ud600" rel="noreferrer noopener nofollow" target="_blank">Reinforcement Learning by Georgia Tech (Udacity)</a></strong> &#8211; One of the best free courses available, offered by Georgia Tech through the Udacity platform. The course is formulated for those seeking to understand the world of Machine learning and Artificial Intelligence from a theoretical perspective. It provides rich insights into recent research on reinforcement learning, which will help you explore automated decision-making models.</p>



<p><strong>6. <a href="http://web.stanford.edu/class/cs234/index.html" target="_blank" rel="noreferrer noopener nofollow">Reinforcement Learning Winter (Stanford Education)</a></strong> &#8211; This course is provided by Stanford University as a winter session. There are some basic requirements for the course, such as Python programming proficiency, knowledge of linear algebra and calculus, basics of statistics and probability, and basics of machine learning. This course provides state of the art lectures. In the end, you will be able to define key features of RL, applications of RL on real-world problems, coding implementations of RL algorithms, and have deep knowledge of RL algorithms. This course is suited for those seeking advanced-level learning resources on the RL ecosystem.</p>



<p><strong>7. <a href="https://click.linksynergy.com/deeplink?id=Fh5UMknfYAU&amp;mid=39197&amp;u1=coursesityBlog&amp;murl=https%3A%2F%2Fwww.udemy.com%2Fcourse%2Fdeep-reinforcement-learning-in-python%2F" rel="noreferrer noopener nofollow" target="_blank">Advanced AI: Deep Reinforcement Learning with Python</a></strong> &#8211; If you are looking for a high-level advanced course on Reinforcement learning, then this is no doubt the best course available in the Udemy platform for you. This is a premium course with a price tag of 29.99 USD, a rating of 4.6 stars, entertaining more than 32,000 students across the world. It is not just about reinforcement learning at the foundation level, but also deep reinforcement learning with its practical implementation using Python programming. The practical implementations of deep learning agents, Q-learning algorithms, deep neural networks, RBF networks, convolutional neural networks with deep Q-learning are the prime grabs of this course.</p>



<p><strong>8. <a href="https://click.linksynergy.com/deeplink?id=BuGceriufQM&amp;mid=40328&amp;u1=coursesityBlog&amp;murl=https%3A%2F%2Fwww.coursera.org%2Flearn%2Fpractical-rl" rel="noreferrer noopener nofollow" target="_blank">Practical Reinforcement Learning</a></strong> &#8211; Another popular course offered by Coursera, best for those looking for practical knowledge of reinforcement learning. It has a total rating of 4.2 stars with more than 37,000 students already enrolled.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-what-are-you-waiting-for-start-learning">What are you waiting for? Start learning!</h2>



<p>Hopefully, these resources will help you get a deep understanding of reinforcement learning, and its practical applications in the real world.</p>



<p>RL is a fascinating part of machine learning, and it’s worth spending your time on it to master it. Good luck!</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2673</post-id>	</item>
		<item>
		<title>10 Real-Life Applications of Reinforcement Learning</title>
		<link>https://neptune.ai/blog/reinforcement-learning-applications</link>
		
		<dc:creator><![CDATA[Derrick Mwiti]]></dc:creator>
		<pubDate>Thu, 21 Jul 2022 10:09:07 +0000</pubDate>
				<category><![CDATA[Reinforcement Learning]]></category>
		<guid isPermaLink="false">https://neptune.test/reinforcement-learning-applications/</guid>

					<description><![CDATA[In Reinforcement Learning (RL), agents are trained on a reward and punishment mechanism. The agent is rewarded for correct moves and punished for the wrong ones. In doing so, the agent tries to minimize wrong moves and maximize the right ones.&#160; In this article, we’ll look at some of the real-world applications of reinforcement learning.&#8230;]]></description>
										<content:encoded><![CDATA[
<p>In <strong>Reinforcement Learning (RL)</strong>, agents are trained on a <strong>reward</strong> and <strong>punishment</strong> mechanism. The <strong>agent</strong> is rewarded for correct moves and punished for the wrong ones. In doing so, the agent tries to minimize wrong moves and maximize the right ones.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="1200" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/12/Reinforcement-Learning-RL.png?resize=1200%2C1200&#038;ssl=1" alt="In Reinforcement Learning (RL), agents are trained on a reward and punishment mechanism. The agent is rewarded for correct moves and punished for the wrong ones. In doing so, the agent tries to minimize wrong moves and maximize the right ones." class="wp-image-43002" style="width:460px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/12/Reinforcement-Learning-RL.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/12/Reinforcement-Learning-RL.png?resize=768%2C768&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/12/Reinforcement-Learning-RL.png?resize=200%2C200&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/12/Reinforcement-Learning-RL.png?resize=220%2C220&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/12/Reinforcement-Learning-RL.png?resize=120%2C120&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/12/Reinforcement-Learning-RL.png?resize=88%2C88&amp;ssl=1 88w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/12/Reinforcement-Learning-RL.png?resize=44%2C44&amp;ssl=1 44w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/12/Reinforcement-Learning-RL.png?resize=160%2C160&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/12/Reinforcement-Learning-RL.png?resize=300%2C300&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/12/Reinforcement-Learning-RL.png?resize=480%2C480&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/12/Reinforcement-Learning-RL.png?resize=1020%2C1020&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/12/Reinforcement-Learning-RL.png?resize=100%2C100&amp;ssl=1 100w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Modified based on: <a href="https://upload.wikimedia.org/wikipedia/commons/1/1b/Reinforcement_learning_diagram.svg" target="_blank" rel="noreferrer noopener nofollow"><em>source</em></a></figcaption></figure>
</div>
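<p>As a tiny, self-contained illustration of this reward-and-punishment loop (a toy example, not code from any of the applications below), consider a two-armed bandit where one arm pays off more often than the other. The agent keeps a running value estimate per arm and gradually learns to prefer the better one:</p>



<pre class="wp-block-code"><code>import random

# Two-armed bandit: arm 1 pays off more often than arm 0. The agent is
# rewarded (+1) for good pulls and punished (-1) for bad ones, and it
# gradually learns to prefer the better arm.
def pull(arm):
    p_win = 0.8 if arm == 1 else 0.3
    return random.choices([1.0, -1.0], weights=[p_win, 1 - p_win])[0]

values, counts = [0.0, 0.0], [0, 0]      # running estimate of each arm's value
for _ in range(1000):
    best = values.index(max(values))     # exploit the best-looking arm...
    arm = random.choices([best, random.randrange(2)], weights=[0.9, 0.1])[0]  # ...but explore 10% of the time
    reward = pull(arm)
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # incremental average

print(values)   # values[1] should end up clearly higher than values[0]</code></pre>



<p>Every application below follows the same loop, just with far richer states, actions, and reward signals.</p>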


<p>In this article, we’ll look at some of the real-world applications of reinforcement learning.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-applications-in-self-driving-cars">Applications in self-driving cars</h2>



<p>Various papers have proposed<a href="https://arxiv.org/pdf/2002.00444.pdf" target="_blank" rel="noreferrer noopener nofollow"> Deep Reinforcement Learning</a> for <strong>autonomous driving</strong>. In self-driving cars, there are various aspects to consider, such as speed limits at various places, drivable zones, avoiding collisions — just to mention a few.&nbsp;</p>



<p>Some of the autonomous driving tasks where reinforcement learning could be applied include trajectory optimization, motion planning, dynamic pathing, controller optimization, and scenario-based learning policies for highways.&nbsp;</p>



<p>For example, parking can be achieved by learning automatic parking policies. Lane changing can be achieved using<a href="https://towardsdatascience.com/simple-reinforcement-learning-q-learning-fcddc4b6fe56" target="_blank" rel="noreferrer noopener nofollow"> Q-Learning</a> while overtaking can be implemented by learning an overtaking policy while avoiding collision and maintaining a steady speed thereafter.</p>



<p><a href="https://aws.amazon.com/fr/deepracer/" target="_blank" rel="noreferrer noopener nofollow">AWS DeepRacer</a> is an autonomous racing car that has been designed to test out RL in a physical track. It uses cameras to visualize the runway and a reinforcement learning model to control the throttle and direction.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Robot.png?ssl=1" alt="" class="wp-image-20511" style="width:747px;height:596px"/><figcaption class="wp-element-caption"><em><a href="https://aws.amazon.com/fr/deepracer/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>Wayve.ai has successfully applied reinforcement learning to teach a car to <strong>drive in a day.</strong> They used a deep reinforcement learning algorithm to tackle the lane-following task. Their network architecture was a deep network with 4 convolutional layers and 3 fully connected layers. The example below shows the lane-following task. The image in the middle represents the driver&#8217;s perspective.</p>



<figure class="wp-block-video aligncenter"><video height="188" style="aspect-ratio: 750 / 188;" width="750" autoplay loop muted src="https://neptune.ai/wp-content/uploads/2022/11/10-Real-Life-Applications-of-Reinforcement-Learning.mp4"></video><figcaption class="wp-element-caption"><a href="https://wayve.ai/blog/learning-to-drive-in-a-day-with-reinforcement-learning/" target="_blank" rel="noreferrer noopener nofollow"><strong>Source</strong></a></figcaption></figure>



<section id="blog-intext-cta-block_50ee1eec857b887852464c4b4c9a0fcc" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" class="block-blog-intext-cta__header" id="h-read-more">Read more </h3>
    
            <p><a href="/blog/self-driving-cars-with-convolutional-neural-networks-cnn" target="_blank" rel="noopener">Self-Driving Cars With Convolutional Neural Networks (CNN)</a></p>
    
    </section>



<div id="separator-block_54e14ae7492eec50bbf3d8fd1970aff2"
         class="block-separator block-separator--15">
</div>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-industry-automation-with-reinforcement-learning">Industry automation with Reinforcement Learning</h2>



<p>In industry, reinforcement learning-based <strong>robots</strong> are used to perform various tasks. Apart from being more efficient than human beings, these robots can also perform tasks that would be dangerous for people.</p>



<p>A great example is the use of AI agents by <a href="https://deepmind.com/blog/article/safety-first-ai-autonomous-data-centre-cooling-and-industrial-control" target="_blank" rel="noreferrer noopener nofollow">DeepMind to cool Google Data Centers</a>. This led to a 40% reduction in <strong>energy spending</strong>. The centers are now fully controlled by the AI system without the need for human intervention, although data center experts still supervise it. The system works in the following way:</p>



<ul class="wp-block-list">
<li>Taking snapshots of data from the data centers every five minutes and feeding this to deep neural networks</li>



<li>Predicting how different combinations of actions will affect future energy consumption</li>



<li>Identifying actions that will lead to minimal power consumption while maintaining a set standard of safety criteria&nbsp;</li>



<li>Sending these actions to the data center for implementation</li>
</ul>



<p>The actions are then verified by the local control system.</p>
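<p>A rough sketch of this loop is shown below. Every name and the toy energy model are hypothetical placeholders rather than any real DeepMind or Google API; the point is only the step that picks the action with the lowest predicted energy use:</p>



<pre class="wp-block-code"><code># A minimal sketch of the control loop described above. All names and the toy
# "energy model" are hypothetical placeholders, not a real DeepMind or Google
# API; the point is only the pick-the-lowest-predicted-energy step.

def get_snapshot():
    return {"outside_temp": 18.0, "load": 0.7}            # fake sensor data

def predict_energy(snapshot, setting):
    # stand-in for the deep neural network's energy prediction
    return snapshot["load"] * 100 + abs(setting["pump_speed"] - 0.5) * 40

def is_safe(snapshot, setting):
    return round(setting["pump_speed"], 1) != 0.1          # toy safety criterion

candidate_settings = [{"pump_speed": s / 10} for s in range(1, 10)]

snapshot = get_snapshot()                                  # 1. take a snapshot
scored = [(predict_energy(snapshot, s), s)                 # 2. predict energy use
          for s in candidate_settings
          if is_safe(snapshot, s)]                         # 3. keep safe actions only
best_cost, best_setting = min(scored, key=lambda item: item[0])   # 4. minimal power
print(best_cost, best_setting)   # sent to the data center, verified by the local control system</code></pre>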



<h2 class="wp-block-heading" class="wp-block-heading" id="h-reinforcement-learning-applications-in-trading-and-finance">Reinforcement Learning applications in trading and finance</h2>



<p>Supervised <a href="https://arxiv.org/ftp/arxiv/papers/1803/1803.03916.pdf" target="_blank" rel="noreferrer noopener nofollow"><strong>time series</strong></a> models can be used for predicting future sales as well as predicting <strong>stock prices</strong>. However, these models don’t determine the action to take at a particular stock price. Enter Reinforcement Learning (RL). An RL agent can decide on such a task: whether to hold, buy, or sell. The RL model is evaluated against market benchmarks to ensure that it&#8217;s performing optimally.</p>



<p>This automation brings consistency into the process, unlike previous methods where analysts had to make every single decision. <a href="https://medium.com/inside-machine-learning/reinforcement-learning-the-business-use-case-part-2-c175740999" target="_blank" rel="noreferrer noopener nofollow">IBM</a>, for example, has a sophisticated reinforcement learning-based platform that makes <strong>financial trades</strong>. It computes the reward function based on the loss or profit of every financial transaction.</p>
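<p>A reward of this kind is straightforward to express. The function below is a generic illustration, not IBM&#8217;s actual platform; the optional fee term is an extra assumption added here:</p>



<pre class="wp-block-code"><code>def trade_reward(entry_price, exit_price, position, fee_per_trade=0.0):
    """Reward for one closed trade: its profit or loss in currency units.
    position is +1 for a long trade and -1 for a short trade."""
    pnl = (exit_price - entry_price) * position
    return pnl - fee_per_trade

# Example: bought at 100, sold at 104, paying 0.5 in fees.
print(trade_reward(100.0, 104.0, position=1, fee_per_trade=0.5))   # 3.5</code></pre>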



<h2 class="wp-block-heading" class="wp-block-heading" id="h-reinforcement-learning-in-nlp-natural-language-processing">Reinforcement Learning in NLP (Natural Language Processing)</h2>



<p>In NLP, RL can be used in <strong>text summarization</strong>, <strong>question answering,</strong> and <strong>machine translation </strong>just to mention a few.&nbsp;</p>



<p>The authors of this <a href="https://homes.cs.washington.edu/~eunsol/papers/acl17eunsol.pdf" target="_blank" rel="noreferrer noopener nofollow">paper</a>, Eunsol Choi, Daniel Hewlett, and Jakob Uszkoreit, propose an RL-based approach for question answering over long texts. Their method works by first selecting a few sentences from the document that are relevant for answering the question. A slow RNN is then employed to produce answers from the selected sentences.</p>


<div class="wp-block-image">
<figure class="aligncenter is-resized"><img decoding="async" src="https://lh5.googleusercontent.com/DfwsYUqts8aaoO5f2OAcvB2x9pfk7P4uHbdKzNo6YmJSxRATnUOjHtNTw8G0ZKuBRH6IErycI05xq-JHh8GGaqXKQyHxSxW4A5RBwbq3nSE2zIKeAZtBrs-ovmb_rUnL9KM5Zq3W" alt="" style="width:462px;height:379px"/><figcaption class="wp-element-caption"><a href="https://homes.cs.washington.edu/~eunsol/papers/acl17eunsol.pdf" target="_blank" rel="noreferrer noopener nofollow"><em>Source</em></a></figcaption></figure>
</div>


<p>A combination of supervised and reinforcement learning is used for <a href="https://arxiv.org/pdf/1705.04304.pdf" target="_blank" rel="noreferrer noopener nofollow">abstractive text summarization in this paper</a> by Romain Paulus, Caiming Xiong &amp; Richard Socher. Their goal is to solve the problems that attentional, RNN-based encoder-decoder models face when <strong>summarizing</strong> longer documents. The authors propose a neural network with a novel intra-attention mechanism that attends over the input and the continuously generated output separately. Their training method is a combination of standard supervised word prediction and reinforcement learning.</p>


<div class="wp-block-image">
<figure class="aligncenter is-resized"><img decoding="async" src="https://lh3.googleusercontent.com/_4nAg1u50WPusd5jjuO6N3agTPuQn2GKBBYSUljs-d9PLUjjFZ3OWhZCkCLliyqX1y8lmBa4GJmz_TRtc61gH979pSgs1pBzJ-0YuZFOde0Hu7-D8qH3Oei_r1yDDTuhV7IDpHwr" alt="" style="width:828px;height:470px"/><figcaption class="wp-element-caption"><em><a href="https://arxiv.org/pdf/1705.04304.pdf" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>
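<p>The way the two objectives are blended can be sketched in a few lines. The snippet below is a simplified illustration of such a mixed loss, with toy numbers standing in for the real token log-probabilities and ROUGE-based rewards:</p>



<pre class="wp-block-code"><code>import torch

def mixed_summarization_loss(logp_sampled, reward_sampled, reward_greedy,
                             ml_loss, gamma=0.98):
    """Sketch of a mixed objective: a self-critical policy-gradient term
    blended with the usual teacher-forced maximum-likelihood loss.
    gamma and all reward values are placeholders supplied by the caller."""
    # Encourage sampled summaries that beat the greedy baseline (e.g. in ROUGE).
    rl_loss = (reward_greedy - reward_sampled) * logp_sampled.sum()
    return gamma * rl_loss + (1.0 - gamma) * ml_loss

# Toy numbers only, to show how the two terms combine.
logp = torch.tensor([-2.1, -0.7, -1.3])        # log-probs of the sampled tokens
loss = mixed_summarization_loss(logp, reward_sampled=0.42, reward_greedy=0.37,
                                ml_loss=torch.tensor(3.0))
print(loss)</code></pre>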


<p>On the side of machine translation, <a href="http://users.umiacs.umd.edu/~jbg/docs/2014_emnlp_simtrans.pdf" target="_blank" rel="noreferrer noopener nofollow">authors from the University of Colorado and the University of Maryland</a>, propose a reinforcement learning based approach to simultaneous <strong>machine translation</strong>. The interesting thing about this work is that it has the ability to learn when to trust the predicted words and uses RL to determine when to wait for more input. </p>


<div class="wp-block-image">
<figure class="aligncenter is-resized"><img decoding="async" src="https://lh3.googleusercontent.com/rh-ZDbESVu96CqAz3q41EHJ3oC1mo2ARE7tHT0arpuFCHC1jwQ0wJ2RpawrAQJh1OaHRD6jRD6mflPBFXhQg9CHY_6UKMFdNJdGYzNZkh-Zq-3O3chnXVt177AocTRDkiuai9rer" alt="" style="width:870px;height:488px"/><figcaption class="wp-element-caption"><em><a href="http://users.umiacs.umd.edu/~jbg/docs/2014_emnlp_simtrans.pdf" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>Researchers from Stanford University, Ohio State University, and Microsoft Research have proposed deep RL for <a href="https://arxiv.org/pdf/1606.01541.pdf" target="_blank" rel="noreferrer noopener nofollow">dialogue generation</a>. Deep RL can be used to model future rewards in a <strong>chatbot dialogue.</strong> Conversations are simulated using two virtual agents. Policy gradient methods are used to reward sequences that contain important conversation attributes such as coherence, informativity, and ease of answering.</p>


<div class="wp-block-image">
<figure class="aligncenter is-resized"><img decoding="async" src="https://lh4.googleusercontent.com/WmQq-UQbtHLVD_WABBT1s-z1UCH1iQZEQGaA_6XTa2Qg7yOZHKFGcCRJsGi1o4QyplIAQVwftsx9WJGSatQ7-sK9HfTkdtjNiyjUfZLS2EU3QMqcp3o9c-nEjDc8nLeWHlu6xoAW" alt="" style="width:875px;height:416px"/><figcaption class="wp-element-caption"><em><a href="https://arxiv.org/pdf/1606.01541.pdf" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>More NLP applications can be found <a href="https://github.com/adityathakker/awesome-rl-nlp" target="_blank" rel="noreferrer noopener nofollow">here</a> or <a href="https://www.future-processing.com/blog/the-future-of-natural-language-processing/" target="_blank" rel="noreferrer noopener">here</a>.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-reinforcement-learning-applications-in-healthcare">Reinforcement Learning applications in healthcare</h2>



<p>In healthcare, patients can <strong>receive treatment</strong> based on policies learned by RL systems. RL is able to find optimal policies from previous experience without needing prior information about the mathematical model of the biological system. This makes the approach more applicable in healthcare than many other control-based systems.</p>



<p>RL in healthcare is categorized as <a href="https://arxiv.org/pdf/1908.08796.pdf" target="_blank" rel="noreferrer noopener nofollow">dynamic treatment regimes (DTRs)</a> in chronic disease or critical care, automated medical diagnosis, and other general domains.</p>


<div class="wp-block-image">
<figure class="aligncenter"><img decoding="async" src="https://lh5.googleusercontent.com/ZWo0AlDOtPVZWCZqq0qNFk8pHcQo2qDvodE5Xm0lDx0wPjfb-HVXasBwWKRIJ4LEXrDxwjBmx8_sBb0xO5M90cmCigYC0kEl9dUgg3A4TCtCXyaiC227UVvhEdKFlUjU2UVxnYIy" alt=""/><figcaption class="wp-element-caption"><a href="https://arxiv.org/pdf/1908.08796.pdf" target="_blank" rel="noreferrer noopener nofollow"><em>Source</em></a></figcaption></figure>
</div>


<p>In DTRs, the input is a set of clinical observations and assessments of a patient, similar to states in RL; the outputs are the treatment options for every stage. Applying RL to DTRs is advantageous because it can determine time-dependent decisions for the best treatment of a patient at a specific point in time.</p>



<p>The <a href="https://arxiv.org/pdf/1908.08796.pdf" target="_blank" rel="noreferrer noopener nofollow">use of RL in healthcare </a>also enables improvement of long-term outcomes by factoring the delayed effects of treatments.&nbsp;</p>



<p>RL has also been used for the discovery and generation of optimal DTRs for chronic diseases.&nbsp;</p>



<p>You can dive deeper into RL applications in healthcare by exploring this<a href="https://arxiv.org/pdf/1908.08796.pdf" target="_blank" rel="noreferrer noopener nofollow"> paper</a>. </p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-reinforcement-learning-applications-in-engineering">Reinforcement Learning applications in engineering</h2>



<p>On the engineering frontier, Facebook has developed an <strong>open-source reinforcement learning platform</strong> &#8211; <a href="https://engineering.fb.com/ml-applications/horizon/" target="_blank" rel="noreferrer noopener nofollow">Horizon</a>. The platform uses reinforcement learning to optimize large-scale production systems. Facebook has used Horizon internally:</p>



<ul class="wp-block-list">
<li>to personalize suggestions</li>



<li>deliver more meaningful notifications to users</li>



<li>optimize video streaming quality.&nbsp;</li>
</ul>



<p>Horizon also contains workflows for:</p>



<ul class="wp-block-list">
<li>simulated environments</li>



<li>a distributed platform for data preprocessing</li>



<li>training and exporting models in production.&nbsp;</li>
</ul>



<p>A classic example of reinforcement learning in video display is serving a user a low or high bit rate video based on the state of the video buffers and estimates from other machine learning systems.&nbsp;</p>
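<p>As a toy illustration of that decision (hypothetical code, not Horizon&#8217;s actual API), the state can be a coarse bucketing of the buffer level and the bandwidth estimate, and a learned value table picks the bitrate:</p>



<pre class="wp-block-code"><code># Hypothetical sketch of the bitrate decision, not Horizon's actual API.
# The state combines the playback buffer level with a bandwidth estimate
# coming from another ML system; a learned value table picks the bitrate.

ACTIONS = ["low_bitrate", "high_bitrate"]

def choose_bitrate(q_table, buffer_seconds, est_bandwidth_mbps):
    # discretize the state into coarse buckets (an illustrative assumption)
    state = (min(int(buffer_seconds) // 10, 2), min(int(est_bandwidth_mbps) // 5, 2))
    q_values = q_table.get(state, [0.0, 0.0])
    best = max(range(len(ACTIONS)), key=lambda a: q_values[a])
    return ACTIONS[best]

# Toy table: with a healthy buffer and decent bandwidth, prefer the high bitrate.
q_table = {(2, 1): [0.1, 0.9], (0, 0): [0.8, 0.2]}
print(choose_bitrate(q_table, buffer_seconds=25.0, est_bandwidth_mbps=8.0))</code></pre>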



<p>Horizon is capable of handling production-like concerns such as:</p>



<ul class="wp-block-list">
<li>deploying at scale</li>



<li>feature normalization</li>



<li>distributed learning</li>



<li>serving and handling datasets with high-dimensional data and thousands of feature types.&nbsp;</li>
</ul>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-reinforcement-learning-in-news-recommendation">Reinforcement Learning in news recommendation</h2>



<p>User preferences can change frequently, so <a href="http://www.personal.psu.edu/~gjz5038/paper/www2018_reinforceRec/www2018_reinforceRec.pdf" target="_blank" rel="noreferrer noopener nofollow"><strong>recommending news</strong></a> to users based on past reviews and likes can quickly become obsolete. With reinforcement learning, the system can instead track the reader’s return behavior.</p>



<p>Construction of such a system would involve obtaining news features, reader features, context features, and reader-news interaction features. News features include, but are not limited to, the content, headline, and publisher. Reader features describe how the reader interacts with the content, e.g., clicks and shares. Context features cover aspects such as the timing and freshness of the news. A reward is then defined based on these user behaviors, for example as sketched below.</p>
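<p>The exact reward design is up to the system builder. The sketch below uses made-up weights for clicks, shares, and return visits; it is an illustrative assumption, not something taken from the linked paper:</p>



<pre class="wp-block-code"><code># A hedged sketch of how such a reward could be assembled from user behaviour.
# The weights and the "return within a day" bonus are illustrative assumptions.

def recommendation_reward(clicked, shared, returned_within_a_day,
                          w_click=1.0, w_share=2.0, w_return=3.0):
    reward = 0.0
    if clicked:
        reward += w_click
    if shared:
        reward += w_share
    if returned_within_a_day:           # the return behavior the article mentions
        reward += w_return
    return reward

print(recommendation_reward(clicked=True, shared=False, returned_within_a_day=True))   # 4.0</code></pre>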



<h2 class="wp-block-heading" class="wp-block-heading" id="h-reinforcement-learning-in-gaming">Reinforcement Learning in gaming&nbsp;</h2>



<p>Let’s look at an application on the <strong>gaming</strong> frontier, specifically <strong>AlphaGo Zero</strong>. Using reinforcement learning, AlphaGo Zero was able to learn the game of Go from scratch. It learned by playing against itself. After 40 days of self-training, AlphaGo Zero was able to outperform the version of AlphaGo known as <em>Master</em>, which had defeated <a href="https://deepmind.com/alphago-china" target="_blank" rel="noreferrer noopener nofollow">world number one Ke Jie</a>. It only used the black and white stones on the board as input features and a single neural network. A simple tree search that relies on this single neural network is used to evaluate positions and sample moves, without any <a href="https://en.wikipedia.org/wiki/Monte_Carlo_method" target="_blank" rel="noreferrer noopener nofollow">Monte Carlo</a> rollouts.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-real-time-bidding-reinforcement-learning-applications-in-marketing-and-advertising">Real-time bidding— Reinforcement Learning applications in marketing and advertising</h2>



<p>In this<a href="https://arxiv.org/pdf/1802.09756.pdf" target="_blank" rel="noreferrer noopener nofollow"> paper</a>, the authors propose <strong>real-time bidding</strong> with multi-agent reinforcement learning. The handling of a large number of advertisers is dealt with using a clustering method and assigning each cluster a strategic bidding agent. To balance the trade-off between the competition and cooperation among advertisers, a Distributed Coordinated Multi-Agent Bidding (DCMAB) is proposed.&nbsp;</p>



<p>In marketing, the ability to accurately target an individual is very crucial. This is because the right targets obviously lead to a high return on investment. The study in this paper was based on<a href="http://taobao.com"> Taobao</a> — the largest e-commerce platform in China. The proposed method outperforms the state-of-the-art single-agent reinforcement learning approaches.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-reinforcement-learning-in-robotics-manipulation">Reinforcement Learning in robotics manipulation</h2>



<p>The use of deep learning and reinforcement learning<a href="https://ai.googleblog.com/2018/06/scalable-deep-reinforcement-learning.html" target="_blank" rel="noreferrer noopener nofollow"> can train robots</a> that have the ability to grasp various objects — even those unseen during training. This can, for example, be used in building products in an assembly line.&nbsp;</p>



<p>This is achieved by combining large-scale distributed optimization with a variant of <a href="https://en.wikipedia.org/wiki/Q-learning" target="_blank" rel="noreferrer noopener nofollow">deep Q-Learning</a> called <a href="https://arxiv.org/abs/1806.10293" rel="nofollow">QT-Opt</a>. QT-Opt’s support for continuous action spaces makes it well suited to robotics problems. A model is first trained offline and then deployed and fine-tuned on the real robot.</p>



<p>Google AI applied this approach to<strong> robotics grasping</strong> where 7 real-world robots ran for 800 robot hours in a 4-month period.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter is-resized"><img decoding="async" src="https://lh6.googleusercontent.com/0a1J03d46EEXQYMq9bJQJsm5v9O-0i6qCxucr2UO0yrNybD3ByGqLBUh0KcZvtG62WUQo0FxF2VperyzWgFTQ6WO5FoRBsR5iLrnCe40v0DL-1CvUrWoC_b7AdflG_ttah5VRD3K" alt="" style="width:804px;height:451px"/><figcaption class="wp-element-caption"><a href="https://www.youtube.com/watch?v=W4joe3zzglU" target="_blank" rel="noreferrer noopener"><em>Source</em></a></figcaption></figure>
</div>


<p>In <a href="https://www.youtube.com/watch?v=W4joe3zzglU" target="_blank" rel="noreferrer noopener nofollow">this experiment</a>, the QT-Opt approach succeeds in 96% of the grasp attempts across 700 trials grasps on objects that were previously unseen. Google AI’s previous method had a 78% success rate.&nbsp;</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-final-thoughts">Final thoughts</h2>



<p>While reinforcement learning is still a very active research area, significant progress has been made to advance the field and apply it in real life.</p>



<p>In this article, we have barely scratched the surface as far as application areas of reinforcement learning are concerned. Hopefully, this has sparked some curiosity that will drive you to dive a little deeper into this area. If you want to learn more, check out this <a href="https://github.com/aikorea/awesome-rl" target="_blank" rel="noreferrer noopener nofollow">awesome repo</a> (no pun intended) and <a href="https://github.com/dennybritz/reinforcement-learning" target="_blank" rel="noreferrer noopener nofollow">this one</a> as well.</p>



]]></content:encoded>
					
		
		<enclosure url="https://neptune.ai/wp-content/uploads/2022/11/10-Real-Life-Applications-of-Reinforcement-Learning.mp4" length="359487" type="video/mp4" />

		<post-id xmlns="com-wordpress:feed-additions:1">3637</post-id>	</item>
	</channel>
</rss>