<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Reinforcement Learning - neptune.ai</title>
	<atom:link href="https://neptune.ai/blog/category/reinforcement-learning/feed" rel="self" type="application/rss+xml" />
	<link>https://neptune.ai/blog/category/reinforcement-learning</link>
	<description>The experiment tracker for foundation model training.</description>
	<lastBuildDate>Tue, 06 May 2025 11:31:38 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	

<image>
	<url>https://i0.wp.com/neptune.ai/wp-content/uploads/2022/11/cropped-Signet-1.png?fit=32%2C32&#038;ssl=1</url>
	<title>Reinforcement Learning - neptune.ai</title>
	<link>https://neptune.ai/blog/category/reinforcement-learning</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">211928962</site>	<item>
		<title>Reinforcement Learning From Human Feedback (RLHF) For LLMs</title>
		<link>https://neptune.ai/blog/reinforcement-learning-from-human-feedback-for-llms</link>
		
		<dc:creator><![CDATA[Michał Oleszak]]></dc:creator>
		<pubDate>Thu, 12 Sep 2024 11:00:00 +0000</pubDate>
				<category><![CDATA[LLMOps]]></category>
		<category><![CDATA[Reinforcement Learning]]></category>
		<guid isPermaLink="false">https://neptune.ai/?p=40623</guid>

					<description><![CDATA[Reinforcement Learning from Human Feedback (RLHF) has turned out to be the key to unlocking the full potential of today&#8217;s large language models (LLMs). There is arguably no better evidence for this than OpenAI’s GPT-3 model. It was released back in 2020, but it was only its RLHF-trained version dubbed ChatGPT that became an overnight&#8230;]]></description>
										<content:encoded><![CDATA[
<section id="note-block_c397b7f1417876f600d81b210dfabc45"
         class="block-note c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-note__header">
            TL;DR        </h3>
    
    <div class="block-note__content">
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>Reinforcement Learning from Human Feedback (RLHF) unlocked the full potential of today&#8217;s large language models (LLMs).</p>
                                    </div>

            </div>
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>By integrating human judgment into the training process, RLHF ensures that models not only produce coherent and useful outputs but also align more closely with human values, preferences, and expectations.</p>
                                    </div>

            </div>
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>The RLHF process consists of three steps: collecting human feedback in the form of a preference dataset, training a reward model to mimic human preferences, and fine-tuning the LLM using the reward model. The last step is enabled by the Proximal Policy Optimization (PPO) algorithm.</p>
                                    </div>

            </div>
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>Alternatives to RLHF include Constitutional AI, where the model learns to critique itself whenever it fails to adhere to a predefined set of rules, and Reinforcement Learning from AI Feedback (RLAIF), in which off-the-shelf LLMs replace humans as preference data providers.</p>
                                    </div>

            </div>
            </div>


</section>



<p>Reinforcement Learning from Human Feedback (RLHF) has turned out to be the key to unlocking the full potential of today&#8217;s large language models (LLMs). There is arguably no better evidence for this than OpenAI’s GPT-3 model. It was released back in 2020, but it was only its RLHF-trained version dubbed ChatGPT that became an overnight sensation, capturing the attention of millions and setting a new standard for conversational AI.</p>



<p>Before RLHF, the LLM training process typically consisted of a pre-training stage in which the model learned the general structure of the language and a fine-tuning stage in which it learned to perform a specific task. By integrating human judgment as a third training stage, RLHF ensures that models not only produce coherent and useful outputs but also align more closely with human values, preferences, and expectations. It achieves this through a feedback loop where human evaluators rate or rank the model&#8217;s outputs, which is then used to adjust the model’s behavior.</p>



<p>This article explores the intricacies of RLHF. We will look at its importance for language modeling, analyze its inner workings in detail, and discuss the best practices for implementation.</p>



<h2 class="wp-block-heading" id="h-importance-of-rlhf-in-llms">Importance of RLHF in LLMs</h2>



<p>When analyzing the importance of RLHF to language modeling, one could approach it from two different perspectives.</p>



<p>On the one hand, this technique has emerged as a response to the limitations of traditional supervised fine-tuning, such as its reliance on static datasets that are often limited in scope, context, and diversity and that rarely capture broader human values, ethics, or social norms. Additionally, traditional fine-tuning often struggles with tasks that involve subjective judgment or ambiguity, where there may be multiple valid answers. In such cases, a model might favor one answer over another based on the training data, even if the alternative would be more appropriate in a given context. RLHF provides a way to lift some of these limitations.</p>



<p>On the other hand, however, RLHF represents a paradigm shift in the fine-tuning of LLMs. It forms a standalone, transformative change in the evolution of AI rather than a mere incremental improvement over existing methods.</p>



<p>Let’s look at it from the latter perspective first.</p>



<p>The paradigm shift brought about by RLHF lies in the integration of human feedback directly into the training loop, enabling models to better align with human values and preferences. This approach prioritizes dynamic model-human interactions over static training datasets. By incorporating human insights throughout the training process, RLHF ensures that models are more context-aware and capable of handling the complexities of natural language.</p>



<p>I now hear you asking: “But how is injecting the human into the loop better than the traditional fine-tuning in which we train the model in a supervised fashion on a static dataset? Can’t we simply pass human preferences to the model by constructing a fine-tuning data set based on these preferences?“ That’s a fair question.</p>



<p>Consider succinctness as a preference for a text summarizing model. We could <a href="/blog/llm-fine-tuning-and-model-selection-with-neptune-transformers" target="_blank" rel="noreferrer noopener">fine-tune a Large Language Model</a> on concise summaries by training it in a supervised manner on the set of input-output pairs where input is the original text and output is the desired summary.</p>



<p>The problem here is that <a href="/blog/llm-evaluation-text-summarization" target="_blank" rel="noreferrer noopener">different summaries can be equally good</a>, and different groups of people will have preferences as to what level of succinctness is optimal in different contexts. When relying solely on traditional supervised fine-tuning, the model might learn to generate concise summaries, but it won&#8217;t necessarily grasp the subtle balance between brevity and informativeness that different users might prefer. This is where RLHF offers a distinct advantage.<br><br>In RLHF, we train the model on the following data set:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" fetchpriority="high" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_1.png?resize=1200%2C628&#038;ssl=1" alt="In RLHF, we train the model on the following data set.
Each example consists of the long input text, two alternative summaries, and a label that signals which of the two was preferred by a human annotator. By directly passing human preference to the model via the label indicating the “better” output, we can ensure it aligns with it properly." class="wp-image-40636" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_1.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_1.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_1.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_1.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_1.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_1.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_1.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_1.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_1.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="(max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<div id="separator-block_6052665fbf8e2d919c61b6e27d3d9cf0"
         class="block-separator block-separator--20">
</div>



<p>Each example consists of the long input text, two alternative summaries, and a label that signals which of the two was preferred by a human annotator. By directly passing human preference to the model via the label indicating the “better” output, we can ensure it aligns with it properly.</p>



<p>Let’s discuss how this works in detail.</p>


    <a
        href="/blog/llm-evaluation-text-summarization"
        id="cta-box-related-link-block_aa4fd0e883b6a6cf5f71e2dbaaeac2e7"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related                </div>
            </div>

        
                    <h3 class="c-header" id="h-llm-evaluation-for-text-summarization">                LLM Evaluation For Text Summarization            </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h2 class="wp-block-heading" id="h-the-rlhf-process">The RLHF process</h2>



<p>The RLHF process consists of three steps:</p>



<ol class="wp-block-list">
<li>Collecting human feedback.</li>



<li>Training a reward model.</li>



<li>Fine-tuning the LLM using the reward model.</li>
</ol>



<p>The algorithm enabling the last step in the process is the Proximal Policy Optimization (PPO) algorithm.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_2.png?resize=1200%2C628&#038;ssl=1" alt="High-Level overview of Reinforcement Learning from Human Feedback (RLHF). A reward model is trained on a preference dataset that includes the input, alternative outputs, and a label indicating which of the outputs is preferable. The LLM is fine-tuned through reinforcement learning with Proximal Policy Optimization (PPO)." class="wp-image-40637" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_2.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_2.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_2.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_2.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_2.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_2.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_2.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_2.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_2.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="(max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">High-Level overview of Reinforcement Learning from Human Feedback (RLHF). A reward model is trained on a preference dataset that includes the input, alternative outputs, and a label indicating which of the outputs is preferable. The LLM is fine-tuned through reinforcement learning with Proximal Policy Optimization (PPO).</figcaption></figure>
</div>


<h3 class="wp-block-heading" id="h-collecting-human-feedback">Collecting human feedback</h3>



<p>The first step in RLHF is to collect human feedback in the so-called preference dataset. In its simplest form, each example in this dataset consists of a prompt, two different answers produced by the LLM as the response to this prompt, and an indicator for which of the two answers was deemed better by a human evaluator.</p>



<p>The specific dataset formats differ and are not too important. The schematic dataset shown above used four fields: Input text, Summary 1, Summary 2, and Preference. <a previewlistener="true" href="https://huggingface.co/datasets/Anthropic/hh-rlhf?row=41" target="_blank" rel="noreferrer noopener nofollow">Anthropic’s hh-rlhf dataset</a> uses a different format: two columns with the chosen and rejected version of a dialogue between a human and an AI assistant, where the prompt is the same in both cases.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" width="978" height="324" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_13.png?resize=978%2C324&#038;ssl=1" alt="An example entry from Anthropic’s hh-rlhf preference dataset. The left column contains the prompt and the better answer produced by the model. The right column contains the exact same prompt and the worse answer, as judged by a human evaluator." class="wp-image-40647" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_13.png?w=978&amp;ssl=1 978w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_13.png?resize=768%2C254&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_13.png?resize=200%2C66&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_13.png?resize=220%2C73&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_13.png?resize=120%2C40&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_13.png?resize=160%2C53&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_13.png?resize=300%2C99&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_13.png?resize=480%2C159&amp;ssl=1 480w" sizes="(max-width: 978px) 100vw, 978px" /><figcaption class="wp-element-caption">An example entry from Anthropic’s hh-rlhf preference dataset. The left column contains the prompt and the better answer produced by the model. The right column contains the exact same prompt and the worse answer, as judged by a human evaluator. | <a href="https://huggingface.co/datasets/Anthropic/hh-rlhf?row=2" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>Regardless of the format, the information contained in the human preference data set is always the same: It’s not that one answer is good and the other is bad. It’s that one, albeit imperfect, is preferred over the other – it’s all about <em>preference.</em></p>
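

<p>To make this concrete, a single record in such a preference dataset might look roughly like the sketch below. The field names are purely illustrative and not tied to any particular dataset or library:</p>



<pre class="wp-block-code"><code># A minimal, hypothetical preference record for the summarization example.
# Real datasets use different schemas, e.g., Anthropic's hh-rlhf stores full
# "chosen"/"rejected" dialogues instead of separate fields.
preference_example = {
    "input_text": "A very long article about reinforcement learning ...",
    "summary_1": "RLHF aligns LLMs with human preferences.",
    "summary_2": "This text talks about reinforcement learning and other topics.",
    "preferred": 1,  # the human annotator preferred summary_1
}</code></pre>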



<p>Now you might wonder why the labelers are asked to pick one of two responses instead of, say, scoring each response on a scale. The problem with scores is that they are subjective. Scores provided by different individuals, or even two scores from the same labeler but on different examples, are not comparable.</p>



<p>So how do the labelers decide which of the two responses to pick? This is arguably the most important nuance in RLHF. The labelers are offered specific instructions outlining the evaluation protocol. For example, they might be instructed to pick the answers that don’t use swear words, the ones that sound more friendly, or the ones that don’t offer any dangerous information. What the instructions tell the labelers to focus on is crucial to the RLHF-trained model, as it will only align with those human values that are contained within these instructions.</p>



<p>More advanced approaches to building a preference dataset might involve humans ranking more than two responses to the same prompt. Consider three different responses: A, B, and C.</p>



<p>Human annotators have ranked them as follows, where “1” is best, and “3” is worst:</p>



<ul class="wp-block-list">
<li>A &#8211; 2</li>



<li>B &#8211; 1</li>



<li>C &#8211; 3</li>
</ul>



<p>Out of these, we can create three pairs resulting in three training examples:</p>



<div id="medium-table-block_ed5be6dc68b79292c924a040df8f5f9f"
     class="block-medium-table c-table__outer-wrapper  aligncenter l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--0">

    <table class="c-table">
                    <thead class="c-table__head">
            <tr>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Preferred response                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Non-preferred response                        </div>
                    </td>
                            </tr>
            </thead>
        
        <tbody class="c-table__body">

                    
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>B</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>A</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>A</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>C</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>B</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>C</p>
                                                            </div>
                        </td>

                    
                </tr>

                    
        </tbody>
    </table>

</div>
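

<p>As a rough sketch, expanding such a ranking into pairwise training examples takes only a few lines of Python. The function and field names below are illustrative, not part of any specific library:</p>



<pre class="wp-block-code"><code>from itertools import combinations

def ranking_to_pairs(ranked):
    """Expand {response_id: rank} (1 = best) into (preferred, non-preferred) pairs."""
    pairs = []
    for a, b in combinations(ranked, 2):
        if ranked[a] == ranked[b]:
            continue  # a tie carries no preference signal
        preferred = min((a, b), key=ranked.get)  # the lower rank number wins
        rejected = max((a, b), key=ranked.get)
        pairs.append((preferred, rejected))
    return pairs

# The ranking from the example above: "1" is best, "3" is worst.
print(ranking_to_pairs({"A": 2, "B": 1, "C": 3}))
# [('B', 'A'), ('A', 'C'), ('B', 'C')]</code></pre>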



<h3 class="wp-block-heading" id="h-training-a-reward-model">Training a reward model</h3>



<p>Once we have our preference dataset in place, we can use it to train a reward model (RM).</p>



<p>The reward model is typically also an LLM, often encoder-only, such as BERT. During training, the RM receives three inputs from the preference dataset: the prompt, the winning response, and the losing response. It produces one output, called a reward, for each of the two responses:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_3.png?resize=1200%2C628&#038;ssl=1" alt="Training a reward model: the reward model is typically also an LLM, often encoder-only, such as BERT. During training, the RM receives three inputs from the preference dataset: the prompt, the winning response, and the losing response. It produces two outputs, called rewards, for each of the responses." class="wp-image-40638" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_3.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_3.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_3.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_3.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_3.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_3.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_3.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_3.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_3.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<div id="separator-block_6052665fbf8e2d919c61b6e27d3d9cf0"
         class="block-separator block-separator--20">
</div>



<p>The training objective is to maximize the reward difference between the winning and losing response. An often-used loss function is the <a href="/blog/cross-entropy-loss-and-its-applications-in-deep-learning" target="_blank" rel="noreferrer noopener">cross-entropy loss</a> between the two rewards.</p>



<p>This way, the reward model learns to distinguish between more and less preferred responses, effectively ranking them based on the preferences encoded in the dataset. As the model continues to train, it becomes better at predicting which responses will likely be preferred by a human evaluator.</p>
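

<p>As a sketch, this pairwise objective can be written in PyTorch roughly as follows. It assumes the reward model outputs a scalar reward per prompt-response pair; variable names are illustrative:</p>



<pre class="wp-block-code"><code>import torch.nn.functional as F

def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: push the winning reward above the losing one.

    reward_chosen / reward_rejected are tensors of shape (batch,) produced by
    the reward model for the preferred and non-preferred responses.
    """
    # Equivalent to binary cross-entropy on the reward difference: the modeled
    # probability that "chosen" beats "rejected" is sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()</code></pre>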



<p>Once trained, the reward model serves as a simple regressor predicting the reward value for the given prompt-completion pair:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_4.png?resize=1200%2C628&#038;ssl=1" alt="Once trained, the reward model serves as a simple regressor predicting the reward value for the given prompt-completion pair." class="wp-image-40639" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_4.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_4.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_4.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_4.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_4.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_4.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_4.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_4.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_4.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<h3 class="wp-block-heading" id="h-fine-tuning-the-llm-with-the-reward-model">Fine-tuning the LLM with the reward model</h3>



<p>The third and final RLHF stage is fine-tuning. This is where the reinforcement learning takes place.</p>



<p>The fine-tuning stage requires another dataset that is different from the preference dataset. It consists of prompts only, which should be similar to what we expect our LLM to deal with in production. Fine-tuning teaches the LLM to produce aligned responses <em>for these prompts.</em></p>



<p>Specifically, the goal of fine-tuning is to train the LLM to produce completions that maximize the rewards given by the reward model. The training loop looks as follows:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_6.png?resize=1200%2C628&#038;ssl=1" alt="Fine-tuning the LLM with the reward model: first, we pass a prompt from the training set to the LLM and generate a completion. Next, the prompt and the completion are passed to the reward model, which in turn predicts the reward. This reward is fed into an optimization algorithm such as PPO, which then adjusts the LLM’s weights in a direction resulting in a better RM-predicted reward for the given training example." class="wp-image-40641" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_6.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_6.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_6.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_6.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_6.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_6.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_6.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_6.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_6.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<div id="separator-block_6052665fbf8e2d919c61b6e27d3d9cf0"
         class="block-separator block-separator--20">
</div>



<p>First, we pass a prompt from the training set to the LLM and generate a completion. Next, the prompt and the completion are passed to the reward model, which in turn predicts the reward. This reward is fed into an optimization algorithm such as PPO (more about it in the next section), which then adjusts the LLM’s weights in a direction resulting in a better RM-predicted reward for the given training example (not unlike <a href="/blog/deep-learning-optimization-algorithms" target="_blank" rel="noreferrer noopener">gradient descent</a> in traditional deep learning).</p>
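

<p>In pseudocode-like Python, the loop can be sketched as follows. The helpers <code>generate_completion</code>, <code>compute_reward</code>, and <code>ppo_update</code> are hypothetical placeholders for whatever LLM, reward model, and PPO implementation you use, not a specific library API:</p>



<pre class="wp-block-code"><code># Conceptual sketch of the RLHF fine-tuning loop (not a specific library API).
for epoch in range(num_epochs):
    for prompt in prompt_dataset:
        # 1. The current policy (the LLM being trained) generates a completion.
        completion = generate_completion(policy_llm, prompt)

        # 2. The frozen reward model scores the prompt-completion pair.
        reward = compute_reward(reward_model, prompt, completion)

        # 3. PPO adjusts the LLM's weights toward higher-reward completions.
        ppo_update(policy_llm, prompt, completion, reward)</code></pre>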



<h3 class="wp-block-heading" id="h-proximal-policy-optimization-ppo">Proximal Policy Optimization (PPO)</h3>



<p>One of the most popular optimizers for RLHF is the Proximal Policy Optimization algorithm or PPO. Let’s unpack this mouthful.</p>



<p>In the reinforcement learning context, the term “policy” refers to the strategy used by an agent to decide its actions. In the RLHF world, the policy is the LLM we are training, which decides which tokens to generate in its responses. Hence, “policy optimization” means we are optimizing the LLM’s weights.</p>



<p>What about “proximal”? The term &#8220;proximal&#8221; refers to the key idea in PPO of making only small, controlled changes to the policy during training. This prevents an issue all too common in traditional policy gradient methods, where large updates to the policy can sometimes lead to significant performance drops.</p>



<h4 class="wp-block-heading">PPO under the hood</h4>



<p>The PPO loss function is composed of three components:</p>



<ul class="wp-block-list">
<li><strong>Policy Loss:</strong> PPO’s primary objective when improving the LLM.</li>



<li><strong>Value Loss:</strong> Used to train the value function, which estimates the future rewards from a given state. The value function allows for computing the advantage, which in turn is used to update the policy.</li>



<li><strong>Entropy Loss:</strong> Encourages exploration by penalizing certainty in the action distribution, allowing the LLM to remain creative.</li>
</ul>



<p>The total PPO loss can be expressed as:</p>



<section id="note-block_ee009d8b0f94e34b3ac09d9cb383168c"
         class="block-note c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

    
    <div class="block-note__content">
                    <div class="c-item c-item--wysiwyg_editor">

                
                
                <div class="c-item__content">

                                            <p>L_PPO = L_POLICY + a × L_VALUE + b × L_ENTROPY</p>
                                    </div>

            </div>
            </div>


</section>



<p>where a and b are weight hyperparameters.</p>



<p>The entropy loss component is just the entropy of the probability distribution over the next tokens during generation. We don’t want this entropy to be too small, as that would discourage diversity in the produced texts.</p>



<p>The value loss component is computed step-by-step as the LLM generates subsequent tokens. At each step, the value loss is the difference between the actual future total reward (based on the full completion) and its current-step approximation through the so-called value function. Reducing the value loss trains the value function to be more accurate, resulting in better future reward prediction.</p>



<p>In the policy loss component, we use the value function to predict future rewards over different possible completions (next tokens). Based on these, we can estimate the so-called advantage term, which captures how much better or worse one completion is compared to all possible completions.</p>



<p>If the advantage term for a given completion is positive, it means that increasing the probability of this particular completion being generated will lead to a higher reward and, thus, a better-aligned model. Hence, we should tweak the LLM’s parameters such that this probability is increased.</p>
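

<p>For intuition, here is a simplified PyTorch-style sketch of the composite PPO loss described above, combining the clipped policy objective with the value and entropy terms. The clipping constant and the weights a and b are illustrative defaults, not values from any particular implementation:</p>



<pre class="wp-block-code"><code>import torch
import torch.nn.functional as F

def ppo_loss(logprobs_new, logprobs_old, advantages, values_pred, returns,
             entropy, a=0.5, b=0.01, clip_eps=0.2):
    """Simplified composite PPO loss: L_POLICY + a * L_VALUE + b * L_ENTROPY."""
    # Policy loss: clipped probability-ratio surrogate objective.
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Value loss: how far the value function's predictions are from realized returns.
    value_loss = F.mse_loss(values_pred, returns)

    # Entropy loss: negative entropy, so minimizing it keeps the policy exploratory.
    entropy_loss = -entropy.mean()

    return policy_loss + a * value_loss + b * entropy_loss</code></pre>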



<h4 class="wp-block-heading">PPO alternatives</h4>



<p>PPO is not the only optimizer used for RLHF. With the current pace of AI research, new alternatives spring up like mushrooms. Let’s take a look at a few worth mentioning.</p>



<p><a href="https://arxiv.org/pdf/2305.18290" target="_blank" rel="noreferrer noopener nofollow">Direct Preference Optimization (DPO)</a> is based on an observation that the cross-entropy loss used to train the reward model in RLHF can be directly applied to fine-tune the LLM. DPO is more efficient than PPO and has been shown to yield better response quality.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_8.png?resize=1200%2C628&#038;ssl=1" alt="Comparison between Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO). DPO (right) requires fewer steps as it does not use the reward model, unlike PPO (left)." class="wp-image-40643" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_8.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_8.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_8.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_8.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_8.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_8.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_8.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_8.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_8.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Comparison between Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO). DPO (right) requires fewer steps as it does not use the reward model, unlike PPO (left). | Modified based on: <a href="https://arxiv.org/pdf/2305.18290" target="_blank" rel="noreferrer noopener nofollow">Source</a> </figcaption></figure>
</div>
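

<p>A simplified sketch of the DPO objective is shown below. It assumes the sequence-level log-probabilities of the chosen and rejected responses under the trained policy and under a frozen reference model are already computed; beta is a temperature hyperparameter:</p>



<pre class="wp-block-code"><code>import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Simplified DPO objective: prefer the chosen response directly,
    relative to a frozen reference model, without training a reward model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()</code></pre>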


<p>Another interesting alternative to PPO is <a href="https://arxiv.org/pdf/2310.13639" target="_blank" rel="noreferrer noopener nofollow">Contrastive Preference Learning (CPL)</a>. Its proponents argue that PPO’s assumption that human preferences are distributed according to reward is incorrect; recent work suggests that they instead follow regret. Similarly to DPO, CPL circumvents the need for training a reward model. It replaces it with a regret-based model of human preferences trained with a contrastive loss.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="818" height="251" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_12.png?resize=818%2C251&#038;ssl=1" alt="A comparison between traditional RLHF and Contrastive Preference Learning (CPL). CPL uses a regret-based model instead of a reward model." class="wp-image-40646" style="width:810px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_12.png?w=818&amp;ssl=1 818w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_12.png?resize=768%2C236&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_12.png?resize=200%2C61&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_12.png?resize=220%2C68&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_12.png?resize=120%2C37&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_12.png?resize=160%2C49&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_12.png?resize=300%2C92&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_12.png?resize=480%2C147&amp;ssl=1 480w" sizes="auto, (max-width: 818px) 100vw, 818px" /><figcaption class="wp-element-caption">A comparison between traditional RLHF and Contrastive Preference Learning (CPL). CPL uses a regret-based model instead of a reward model. | <a href="https://arxiv.org/pdf/2310.13639" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<h2 class="wp-block-heading" id="h-best-practices-for-rlhf">Best practices for RLHF</h2>



<p>Let’s go back to the vanilla PPO-based RLHF. Having gone through the RLHF training process on a conceptual level, we’ll now discuss a couple of best practices to follow when implementing RLHF and the tools that might come in handy.</p>



<h3 class="wp-block-heading" id="h-avoiding-reward-hacking">Avoiding reward hacking</h3>



<p><a href="https://en.wikipedia.org/wiki/Reward_hacking" target="_blank" rel="noreferrer noopener nofollow">Reward hacking</a> is a prevalent issue in reinforcement learning. It refers to a situation where the agent has learned to cheat the system, maximizing the reward by taking actions that don’t align with the original objective.</p>



<p>In the context of RLHF, reward hacking means that the training has converged to a particularly unlucky place in the loss surface where the generated responses lead to high rewards for some reason, but don’t make much sense to the user.</p>



<p>Luckily, there is a simple trick that helps prevent reward hacking. During fine-tuning, we take advantage of the initial, frozen copy of the LLM (as it was before RLHF training) and pass it the same prompt that we pass to the LLM instance we are training.</p>



<p>Then, we compute the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence" target="_blank" rel="noreferrer noopener nofollow">Kullback-Leibler Divergence</a> between the responses from the original model and the model under training. KL Divergence measures the dissimilarity between the two responses. We want the responses to stay rather similar to make sure that the updated model does not diverge too far from its starting version. Thus, we treat the KL Divergence value as a “reward penalty” and subtract it from the reward before passing it to the PPO optimizer.</p>



<p>Incorporating this anti-reward-hacking trick into our fine-tuning pipeline yields the following updated version of the previous figure:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_7.png?resize=1200%2C628&#038;ssl=1" alt="To prevent reward hacking, we pass the prompt to two instances of the LLM: the one being trained and its frozen version from before the training. Then, we compute the reward penalty as the KL Divergence between the two models’ outputs and add it to the reward. " class="wp-image-40642" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_7.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_7.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_7.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_7.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_7.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_7.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_7.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_7.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_7.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<div id="separator-block_6052665fbf8e2d919c61b6e27d3d9cf0"
         class="block-separator block-separator--20">
</div>



<p>To prevent reward hacking, we pass the prompt to two instances of the LLM: the one being trained and its frozen version from before the training. Then, we compute the reward penalty as the KL Divergence between the two models’ outputs and subtract it from the reward. This prevents the trained model from diverging too much from its initial version.</p>
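

<p>A minimal sketch of this reward shaping, assuming per-token log-probabilities of the generated completion are available from both the trained and the frozen model (the penalty weight beta is illustrative):</p>



<pre class="wp-block-code"><code>def kl_penalized_reward(rm_reward, policy_logprobs, ref_logprobs, beta=0.02):
    """Shape the reward with a KL penalty against the frozen reference model.

    policy_logprobs / ref_logprobs: per-token log-probabilities of the generated
    completion under the trained LLM and its frozen pre-RLHF copy, respectively.
    """
    # Simple per-sequence KL estimate between the trained policy and its frozen copy.
    kl = (policy_logprobs - ref_logprobs).sum()
    # The penalty lowers the reward when the trained model drifts too far away.
    return rm_reward - beta * kl</code></pre>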



<h3 class="wp-block-heading" id="h-scaling-human-feedback">Scaling human feedback</h3>



<p>As you might have noticed, the RLHF process has one bottleneck: the collection of human feedback in the form of the preference dataset is a slow manual process that needs to be repeated whenever alignment criteria (labelers’ instructions) change. Can we completely remove humans from the process?</p>



<p>We can certainly reduce their engagement, thus making the process more efficient. One approach to doing this is model self-supervision, or “<a href="https://arxiv.org/abs/2212.08073" target="_blank" rel="noreferrer noopener nofollow">Constitutional AI</a>.”</p>



<p>The central point is the Constitution, which consists of a set of rules that should govern the model’s behavior (think: “do not swear,” “be friendly,” etc.). A human <a href="https://en.wikipedia.org/wiki/Red_team" target="_blank" rel="noreferrer noopener nofollow">red team</a> then prompts the LLM to generate harmful or misaligned responses. Whenever they succeed, they ask the model to critique its own responses according to the constitution and revise them. Finally, the model is trained using the red team’s prompts and the model’s revised responses.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_9.png?resize=1200%2C628&#038;ssl=1" alt="An overview of Constitutional AI. In this approach, the model is asked to follow a set of guidelines (“constitution”) and learns to critique its own misaligned responses according to it. " class="wp-image-40644" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_9.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_9.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_9.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_9.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_9.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_9.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_9.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_9.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_9.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">An overview of Constitutional AI. In this approach, the model is asked to follow a set of guidelines (“constitution”) and learns to critique its own misaligned responses according to it. | Modified based on: <a previewlistener="true" href="https://arxiv.org/abs/2212.08073" target="_blank" rel="noreferrer noopener nofollow">source</a></figcaption></figure>
</div>
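

<p>The critique-and-revision loop can be sketched conceptually as follows. Here, <code>llm</code> is a hypothetical text-generation function and the constitution entries are made-up examples:</p>



<pre class="wp-block-code"><code># Conceptual sketch of the Constitutional AI critique-and-revision step.
# "llm" is a hypothetical text-generation function, not a specific API.
constitution = [
    "Do not provide dangerous or illegal information.",
    "Be friendly and respectful.",
]

def revise_response(llm, red_team_prompt):
    response = llm(red_team_prompt)
    for rule in constitution:
        critique = llm(
            f"Response: {response}\nDoes this response violate the rule "
            f"'{rule}'? If so, explain how."
        )
        response = llm(
            f"Response: {response}\nCritique: {critique}\n"
            "Rewrite the response so that it follows the rule."
        )
    # The (red_team_prompt, revised response) pair becomes a fine-tuning example.
    return response</code></pre>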


<p><a previewlistener="true" href="https://arxiv.org/pdf/2309.00267">Reinforcement Learning from AI Feedback (RLAIF)</a> is yet another way to eliminate the need for human feedback. In this approach, one simply uses an off-the-shelf LLM to provide preferences for the preference dataset.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_14.png?resize=1200%2C628&#038;ssl=1" alt="A comparison between RLAIF (top) and RLHF (bottom). In RLAIF, an off-the-shelf LLM takes the place of the human to generate feedback in the form of a preference dataset." class="wp-image-40671" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_14.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_14.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_14.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_14.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_14.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_14.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_14.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_14.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_14.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">A comparison between RLAIF (top) and RLHF (bottom). In RLAIF, an off-the-shelf LLM takes the place of the human to generate feedback in the form of a preference dataset. | Modified based on: <a previewlistener="true" href="https://arxiv.org/pdf/2309.00267">s</a><a previewlistener="true" href="https://arxiv.org/pdf/2309.00267" target="_blank" rel="noreferrer noopener nofollow">ource</a></figcaption></figure>
</div>
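

<p>Analogously, RLAIF-style preference collection can be sketched with a hypothetical <code>judge_llm</code> function standing in for the off-the-shelf model:</p>



<pre class="wp-block-code"><code># Conceptual sketch of RLAIF preference labeling with an LLM "judge".
# "judge_llm" is a hypothetical function, not a specific API.
def collect_ai_preference(judge_llm, prompt, response_a, response_b):
    verdict = judge_llm(
        f"Prompt: {prompt}\nResponse A: {response_a}\nResponse B: {response_b}\n"
        "Which response is more helpful and harmless? Answer with 'A' or 'B'."
    )
    a_wins = verdict.strip().upper().startswith("A")
    return {
        "prompt": prompt,
        "chosen": response_a if a_wins else response_b,
        "rejected": response_b if a_wins else response_a,
    }</code></pre>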


<h2 class="wp-block-heading" id="h-tooling-and-frameworks">Tooling and frameworks</h2>



<p>Let’s briefly examine some available tools and frameworks that facilitate RLHF implementation.</p>



<h3 class="wp-block-heading" id="h-data-collection">Data collection</h3>



<p>Don’t have your preference dataset yet? Two great platforms that facilitate its collection are Prolific and Mechanical Turk.</p>



<p><a href="https://www.prolific.com/rlhf" target="_blank" rel="noreferrer noopener nofollow">Prolific</a> is a platform for collecting human feedback at scale that is useful for gathering preference data through surveys and experiments. Amazon’s <a href="https://www.mturk.com/" target="_blank" rel="noreferrer noopener nofollow">Mechanical Turk</a> (MTurk) service allows for outsourcing data labeling tasks to a large pool of human workers, commonly used for obtaining labels for machine-learning models.</p>



<p>Prolific is known for having a more curated and diverse participant pool. The platform emphasizes quality and typically recruits reliable participants with a history of providing high-quality data. MTurk, on the other hand, has a more extensive and varied participant pool, but it can be less curated. This means there may be a broader range of participant quality.</p>



<h3 class="wp-block-heading" id="h-end-to-end-rlhf-frameworks">End-to-end RLHF frameworks</h3>



<p>If you are a Google Cloud Platform (GCP) user, you can very easily take advantage of their <a href="https://cloud.google.com/blog/products/ai-machine-learning/rlhf-on-google-cloud" target="_blank" rel="noreferrer noopener nofollow">Vertex AI RLHF pipeline</a>. It abstracts away the whole training logic; all you need to do is supply the preference dataset (to train the reward model) and the prompt dataset (for the RL-based fine-tuning) and execute the pipeline.</p>



<p>The disadvantage is that since the training logic is abstracted away, it’s not straightforward to <a href="https://cloud.google.com/vertex-ai/generative-ai/docs/models/tune-text-models-rlhf" target="_blank" rel="noreferrer noopener nofollow">make custom changes</a>. However, this is a great place to start if you are just beginning your RLHF adventure or don’t have the time or resources to build custom implementations.</p>



<p>Alternatively, check out <a href="https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-chat" target="_blank" rel="noreferrer noopener nofollow">DeepSpeed Chat</a>, Microsoft’s open-source system for training and deploying chat models using RLHF, providing tools for data collection, model training, and deployment.</p>



<h2 class="wp-block-heading" id="h-conclusion">Conclusion</h2>



<p>We have discussed how important the paradigm shift brought about by RLHF is to training language models, making them aligned with human preferences. We analyzed the three steps of the RLHF training pipeline: collecting human feedback, training the reward model, and fine-tuning the LLM. Next, we took a more detailed look at Proximal Policy Optimization, the algorithm behind RLHF, while mentioning some alternatives. Finally, we discussed how to avoid reward hacking using KL Divergence and how to reduce human engagement in the process with approaches such as Constitutional AI and RLAIF. We also reviewed a couple of tools facilitating RLHF implementation.</p>



<p>You are now well-equipped to fine-tune your own large language models with RLHF! If you do, let me know what you built!</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">40623</post-id>	</item>
		<item>
		<title>Continuous Control With Deep Reinforcement Learning</title>
		<link>https://neptune.ai/blog/continuous-control-with-deep-reinforcement-learning</link>
		
		<dc:creator><![CDATA[Piotr Januszewski]]></dc:creator>
		<pubDate>Fri, 22 Jul 2022 06:32:47 +0000</pubDate>
				<category><![CDATA[ML Model Development]]></category>
		<category><![CDATA[Reinforcement Learning]]></category>
		<guid isPermaLink="false">https://neptune.test/continuous-control-with-deep-reinforcement-learning/</guid>

					<description><![CDATA[This time I want to explore how deep reinforcement learning can be utilized e.g. making a humanoid model walk. This kind of task is a continuous control task. A solution to such a task differs from the one you might know and use to play Atari games, like Pong, with e.g. Deep Q-Network (DQN). I’ll&#8230;]]></description>
										<content:encoded><![CDATA[
<p>This time I want to explore how deep <a href="/blog/reinforcement-learning-basics-markov-chain-tree-search" target="_blank" rel="noreferrer noopener">reinforcement learning</a> can be utilized, e.g., to make a humanoid model walk. This kind of task is a continuous control task. Solving it differs from what you might know from playing Atari games, like Pong, with, e.g., a Deep Q-Network (DQN).</p>



<p>I’ll talk about what characterizes continuous control environments. Then, I’ll introduce the actor-critic architecture and show an example of a state-of-the-art actor-critic method, Soft Actor-Critic (SAC). Finally, we will dive into the code: I’ll briefly explain how it is implemented in the amazing SpinningUp framework. Let’s go!</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-what-is-continuous-control">What is continuous control?</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Continuous-Control-with-Deep-Reinforcement-Learning_4.gif?ssl=1" alt="Continuous Control with Deep Reinforcement Learning" class="wp-image-53267"/><figcaption class="wp-element-caption"><em>Continuous Control with Deep Reinforcement Learning | Source: <a href="https://openai.com/blog/roboschool/" target="_blank" rel="noreferrer noopener nofollow">Roboschool</a></em></figcaption></figure>
</div>


<p>Meet Humanoid. It is a three-dimensional bipedal robot environment. Its observations are 376-dimensional vectors that describe the kinematic properties of the robot. Its actions are 17-dimensional vectors that specify torques to be applied to the robot joints. The goal is to run forward as fast as possible… and not fall over.</p>



<p>The actions are continuous-valued vectors. This is very different from the fixed set of possible actions that you might know from Atari environments. It requires the policy to return not the scores, or qualities, of all possible actions, but a single action to be executed. A different policy output requires a different training strategy, which we will explore in the next section.</p>
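

<p>To see the difference concretely, here is a toy sketch with random numbers standing in for the networks’ outputs:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import numpy as np

rng = np.random.default_rng(0)
state = rng.normal(size=376)  # a Humanoid-like observation (376-dimensional)

# Discrete case (e.g., Atari): a Q-network scores every action and we take the argmax.
# Here, q_values stands in for the network's output over, say, 6 discrete actions.
q_values = rng.normal(size=6)
discrete_action = int(np.argmax(q_values))

# Continuous case (e.g., Humanoid): there are infinitely many actions, so the policy
# must output the 17-dimensional torque vector directly; there is nothing to argmax over.
continuous_action = np.tanh(rng.normal(size=17))  # stands in for a policy network's output

print(discrete_action, continuous_action.shape)
</pre>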



<h2 class="wp-block-heading" class="wp-block-heading" id="h-what-is-mujoco">What is MuJoCo?</h2>



<p><a href="http://www.mujoco.org/" target="_blank" rel="noreferrer noopener nofollow">MuJoCo</a> is a fast and accurate physics simulation engine aimed at research and development in robotics, biomechanics, graphics, and animation. OpenAI Gym, and the Humanoid environment it includes, utilizes it for simulating the environment dynamics. I wrote the whole post about installing and using it <a href="/blog/installing-mujoco-to-work-with-openai-gym-environments" target="_blank" rel="noreferrer noopener">here</a>. We won’t need this for the matter of this post.</p>



<section id="blog-intext-cta-block_71731d2ad6235493602c657ab0ae4f14" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" class="block-blog-intext-cta__header" id="h-check-also">Check also</h3>
    
            <p>  <a href="/blog/best-reinforcement-learning-tutorials-examples-projects-and-courses" target="_blank" rel="noopener">Best Reinforcement Learning Tutorials, Examples, Projects, and Courses</a></p>
<p>  <a href="/blog/the-best-tools-for-reinforcement-learning-in-python" target="_blank" rel="noopener">The Best Tools for Reinforcement Learning in Python You Actually Want to Try</a></p>
<p>  <a href="/blog/7-applications-of-reinforcement-learning-in-finance-and-trading" target="_blank" rel="noopener">7 Applications of Reinforcement Learning in Finance and Trading</a></p>
    
    </section>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-off-policy-actor-critic-methods">Off-policy actor-critic methods</h2>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Continuous-Control-with-Deep-Reinforcement-Learning_5.png?resize=768%2C296&#038;ssl=1" alt="Reinforcement Learning" class="wp-image-53266" width="768" height="296"/><figcaption class="wp-element-caption"><em>Reinforcement Learning | Source: Sutton &amp; Barto, Reinforcement Learning: An Introduction, 2nd edition</em></figcaption></figure>
</div>


<p>Let’s recap: Reinforcement learning (RL) is learning what to do — how to map situations to actions — to maximize some notion of cumulative reward. RL consists of an agent that, in order to learn, acts in an environment. The environment provides a response to each agent’s action that is fed back to the agent. A reward is used as a reinforcing signal and a state is used to condition the agent&#8217;s decisions.</p>



<p>The goal is to find an optimal policy. The policy tells the agent how it should behave in whatever state it finds itself in. It is the agent&#8217;s map for achieving the environment’s objective.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Continuous-Control-with-Deep-Reinforcement-Learning_1.png?resize=768%2C434&#038;ssl=1" alt="Reinforcement learning" class="wp-image-53270" width="768" height="434"/><figcaption class="wp-element-caption"><em>Reinforcement learning | Source: <a href="https://www.researchgate.net/figure/A-high-level-block-diagram-of-the-actor-critic-reinforcement-learning-architecture-is_fig7_321666649" target="_blank" rel="noreferrer noopener nofollow">Researchgate</a></em></figcaption></figure>
</div>


<p>The actor-critic architecture, depicted in the diagram above, divides the agent into two pieces, the <strong>Actor</strong> and the <strong>Critic</strong>.&nbsp;</p>



<ul class="wp-block-list">
<li>The <strong>Actor</strong> represents the policy – it learns this mapping from states to actions.&nbsp;</li>



<li>The <strong>Critic</strong> represents the Q-function – it learns to evaluate how good each action is in every possible state. You can see that the actor uses the critic evaluations for improving the policy.</li>
</ul>



<p>Why use such a construct? If you already know Q-Learning (<a href="https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0" target="_blank" rel="noreferrer noopener nofollow">here</a> you can learn about it), you know that training the Q-function can be useful for solving an RL task. The Q-function will tell you how good each action is in any state. You can then simply pick the best action. It’s easy when you have a fixed set of actions: you simply evaluate each and every one of them and take the best!&nbsp;</p>



<p>However, what do you do when the action is continuous? You can’t evaluate every possible value. You could evaluate some values and pick the best, but that creates its own problems, e.g., resolution – how many values, and which ones, should you evaluate? The actor is the answer to these problems. It approximates the argmax operator from the discrete case: it is simply trained to predict the best action we would get if we could evaluate every possible action with the critic. Below, we describe the example of Soft Actor-Critic (SAC).</p>
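

<p>As a rough illustration (not the SpinningUp code itself), here is how a simple actor-critic pair could look in TensorFlow 2 / Keras. The layer sizes are arbitrary, and the actor is shown as a deterministic tanh network for brevity; SAC’s actual actor is stochastic and outputs a Gaussian over actions.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import tensorflow as tf

obs_dim, act_dim = 3, 1  # Pendulum-v0 dimensions, used later in this post

def mlp(sizes, output_activation=None):
    layers = [tf.keras.layers.Dense(size, activation="relu") for size in sizes[:-1]]
    layers.append(tf.keras.layers.Dense(sizes[-1], activation=output_activation))
    return tf.keras.Sequential(layers)

# Actor: maps a state to an action, squashed to [-1, 1] with tanh.
actor = mlp([256, 256, act_dim], output_activation="tanh")

# Critic: maps a (state, action) pair to a single Q-value.
critic = mlp([256, 256, 1])

state = tf.random.normal((1, obs_dim))
action = actor(state)                                  # the action to execute
q_value = critic(tf.concat([state, action], axis=-1))  # how good the critic thinks it is
print(action.numpy(), q_value.numpy())
</pre>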



<h3 class="wp-block-heading" class="wp-block-heading" id="h-sac-in-pseudo-code">SAC in pseudo-code</h3>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Continuous-Control-with-Deep-Reinforcement-Learning_2.png?resize=900%2C977&#038;ssl=1" alt="" class="wp-image-53269" width="900" height="977"/><figcaption class="wp-element-caption"><em>SAC in pseudo-code | Source: <a href="https://spinningup.openai.com/en/latest/algorithms/sac.html" target="_blank" rel="noreferrer noopener nofollow">Spinning Up in Deep RL</a></em></figcaption></figure>
</div>


<p>SAC’s critic is trained off-policy, meaning it can reuse data collected by older, less-trained versions of the policy. The off-policy critic training in lines 11-13 utilizes a technique very similar to that of DQN, e.g., it uses a target Q-network to stabilize training. Being off-policy makes SAC more sample-efficient than on-policy methods like <a href="https://spinningup.openai.com/en/latest/algorithms/ppo.html" target="_blank" rel="noreferrer noopener nofollow">PPO</a> because we can construct an experience replay buffer in which each collected data sample can be reused for training multiple times – contrary to on-policy training, where data is discarded after only one update!</p>
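

<p>A minimal replay buffer could look like the sketch below. This is a simplified stand-in, not the buffer defined in the repo’s sac.py:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import numpy as np

class ReplayBuffer:
    """A fixed-size FIFO buffer of (state, action, reward, next state, done) transitions."""

    def __init__(self, obs_dim, act_dim, size):
        self.obs = np.zeros((size, obs_dim), dtype=np.float32)
        self.next_obs = np.zeros((size, obs_dim), dtype=np.float32)
        self.acts = np.zeros((size, act_dim), dtype=np.float32)
        self.rews = np.zeros(size, dtype=np.float32)
        self.done = np.zeros(size, dtype=np.float32)
        self.ptr, self.size, self.max_size = 0, 0, size

    def store(self, obs, act, rew, next_obs, done):
        self.obs[self.ptr] = obs
        self.acts[self.ptr] = act
        self.rews[self.ptr] = rew
        self.next_obs[self.ptr] = next_obs
        self.done[self.ptr] = done
        self.ptr = (self.ptr + 1) % self.max_size  # overwrite the oldest data when full
        self.size = min(self.size + 1, self.max_size)

    def sample_batch(self, batch_size=256):
        idxs = np.random.randint(0, self.size, size=batch_size)
        return dict(obs=self.obs[idxs], acts=self.acts[idxs], rews=self.rews[idxs],
                    next_obs=self.next_obs[idxs], done=self.done[idxs])
</pre>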



<p>You can see the critics and the replay buffer being initialized in line 1, alongside the target critics in line 2. We use two critics to fight the overestimation error described in the papers “Double Q-learning” and “Addressing Function Approximation Error in Actor-Critic Methods”; you can learn more about it <a href="https://spinningup.openai.com/en/latest/algorithms/td3.html" target="_blank" rel="noreferrer noopener nofollow">here</a>. Then, the data is collected and fed to the replay buffer in lines 4-8. The policy is updated in line 14, and the target networks are updated in line 15.</p>
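

<p>In code, the Bellman target used for the critic update (lines 11-13 of the pseudo-code) can be computed roughly as below. This is a NumPy sketch with random numbers standing in for the networks’ outputs, not the repo’s implementation:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import numpy as np

gamma, alpha = 0.99, 0.2  # discount factor and entropy temperature

def sac_target(rews, done, q1_targ_next, q2_targ_next, logp_next_action):
    """Bellman backup for the critics (lines 11-13 of the pseudo-code).

    q1_targ_next, q2_targ_next: target critics' values for the next state and an action
        sampled from the current policy.
    logp_next_action: log-probability of that sampled action, i.e., the entropy term.
    """
    min_q_targ = np.minimum(q1_targ_next, q2_targ_next)  # take the min of the two target critics
    return rews + gamma * (1.0 - done) * (min_q_targ - alpha * logp_next_action)

# Toy usage on a batch of 4 transitions:
batch = 4
y = sac_target(np.random.rand(batch), np.zeros(batch),
               np.random.rand(batch), np.random.rand(batch), -np.ones(batch))
print(y)
</pre>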



<p>You may have noticed that both the critic and the actor updates include some additional log terms. This is the max-entropy regularization that keeps the agent from exploiting its, possibly imperfect, knowledge too much and rewards exploration of promising actions. If you want to understand it in detail, I recommend you read <a href="https://spinningup.openai.com/en/latest/algorithms/sac.html" target="_blank" rel="noreferrer noopener nofollow">this</a> resource.</p>
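

<p>For completeness, the actor update from line 14 maximizes the entropy-regularized critic value of the actions the policy proposes. Minimizing the loss below is one common way to write it; again, this is a stand-in sketch rather than the repo’s code:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import numpy as np

alpha = 0.2  # entropy temperature

def actor_loss(q1_pi, q2_pi, logp_pi):
    """Policy loss over a batch of states (line 14 of the pseudo-code).

    q1_pi, q2_pi: the critics' values for actions freshly sampled from the policy.
    logp_pi: log-probabilities of those actions, i.e., the max-entropy bonus.
    """
    min_q = np.minimum(q1_pi, q2_pi)
    return np.mean(alpha * logp_pi - min_q)  # lower when Q is high and entropy is high

print(actor_loss(np.array([1.0, 2.0]), np.array([1.5, 1.8]), np.array([-1.0, -0.5])))
</pre>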



<h2 class="wp-block-heading" class="wp-block-heading" id="h-soft-actor-critic-in-code">Soft actor-critic in code</h2>



<p>We will work with the <a href="https://github.com/awarelab/spinningup_tf2" target="_blank" rel="noreferrer noopener nofollow">Spinning Up in Deep RL &#8211; TF2 implementation</a> framework. The installation instructions are in the repo README. Note that you don’t have to install MuJoCo for now. We will run an example Soft Actor-Critic agent on the Pendulum-v0 environment from the OpenAI Gym suite. Let’s jump into it!</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-the-pendulum-v0-environment">The Pendulum-v0 environment</h3>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Continuous-Control-with-Deep-Reinforcement-Learning_3.gif?resize=317%2C317&#038;ssl=1" alt="" class="wp-image-53268" width="317" height="317"/><figcaption class="wp-element-caption"><em>The pendulum-v0 environment | Source: <a href="https://keras-gym.readthedocs.io/en/stable/notebooks/pendulum/ppo.html" target="_blank" rel="noreferrer noopener nofollow">Keras-Gym &#8211; Pendulum with PPO</a></em></figcaption></figure>
</div>


<p>Pendulum-v0 is the continuous control environment where:</p>



<div id="separator-block_2ddea7b818877ee67b3a95482958a87f"
         class="block-separator block-separator--15">
</div>



<div id="medium-table-block_e1e65ec6365135045597c629fca28779"
     class="block-medium-table c-table__outer-wrapper  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset">

    <table class="c-table">
                    <thead class="c-table__head">
            <tr>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Actions                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            The torque of only one joint in one dimension                        </div>
                    </td>
                            </tr>
            </thead>
        
        <tbody class="c-table__body">

                    
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>Observations</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Three-dimensional vectors, where the first two dimensions represent the pendulum position – they are cos and sin of the pendulum angle – and the third dimension is the pendulum angle velocity</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>Goal</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Spin the pendulum to the straight-up position and remain vertical, with the least angular velocity, and the least effort (torque introduced with the actions)</p>
                                                            </div>
                        </td>

                    
                </tr>

                    
        </tbody>
    </table>

</div>



<div id="separator-block_2ddea7b818877ee67b3a95482958a87f"
         class="block-separator block-separator--15">
</div>



<p>You can think of it as a simplified model of more complex robots like Humanoid, which is built from many similar, but two- or three-dimensional, joints.</p>
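

<p>If you want to poke around the environment yourself, a quick look at its spaces and a random rollout could look like the snippet below (assuming Gym is installed; in newer Gym releases the environment was renamed Pendulum-v1):</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import gym

env = gym.make("Pendulum-v0")
print(env.observation_space)  # Box(3,): cos(angle), sin(angle), angular velocity
print(env.action_space)       # Box(1,): a single torque value

obs = env.reset()
episode_return = 0.0
for _ in range(200):
    action = env.action_space.sample()          # random torque, just to see the interface
    obs, reward, done, info = env.step(action)  # reward is always zero or negative here
    episode_return += reward
    if done:
        break
print("Random-policy return:", episode_return)
</pre>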



<h2 class="wp-block-heading" class="wp-block-heading" id="h-training-the-sac-agent">Training the SAC agent</h2>



<p>In the repo, the SAC agent code is <a href="https://github.com/awarelab/spinningup_tf2/tree/main/spinup_bis/algos/tf2/sac" target="_blank" rel="noreferrer noopener nofollow">here</a>. The core.py file includes the actor-critic models’ factory method and other utilities. The sac.py includes the replay buffer definition and the implementation of the training algorithm presented above. I recommend you look through it and try to map lines of the pseudo-code from above to the actual implementation in that file. Then, check with my list:</p>



<ul class="wp-block-list">
<li>the initialization from lines 1-2 of the pseudo-code is implemented in lines <a href="https://github.com/awarelab/spinningup_tf2/blob/f6fd8f679d76d43d44a2388549eb2c123cf0e743/spinup_bis/algos/tf2/sac/sac.py#L159" target="_blank" rel="noreferrer noopener nofollow">159</a>&#8211;<a href="https://github.com/awarelab/spinningup_tf2/blob/f6fd8f679d76d43d44a2388549eb2c123cf0e743/spinup_bis/algos/tf2/sac/sac.py#L179" target="_blank" rel="noreferrer noopener nofollow">179</a> in the sac.py,</li>



<li>the main loop from line 3 of the pseudo-code is implemented in line <a href="https://github.com/awarelab/spinningup_tf2/blob/f6fd8f679d76d43d44a2388549eb2c123cf0e743/spinup_bis/algos/tf2/sac/sac.py#L264" target="_blank" rel="noreferrer noopener nofollow">264</a> in the sac.py,</li>



<li>the data collection from lines 4-8 of the pseudo-code is implemented in lines <a href="https://github.com/awarelab/spinningup_tf2/blob/f6fd8f679d76d43d44a2388549eb2c123cf0e743/spinup_bis/algos/tf2/sac/sac.py#L270" target="_blank" rel="noreferrer noopener nofollow">270</a>&#8211;<a href="https://github.com/awarelab/spinningup_tf2/blob/f6fd8f679d76d43d44a2388549eb2c123cf0e743/spinup_bis/algos/tf2/sac/sac.py#L295" target="_blank" rel="noreferrer noopener nofollow">295</a> in the sac.py,</li>



<li>the update handling from lines 9-11 of the pseudo-code is implemented in lines <a href="https://github.com/awarelab/spinningup_tf2/blob/f6fd8f679d76d43d44a2388549eb2c123cf0e743/spinup_bis/algos/tf2/sac/sac.py#L298" target="_blank" rel="noreferrer noopener nofollow">298</a>&#8211;<a href="https://github.com/awarelab/spinningup_tf2/blob/f6fd8f679d76d43d44a2388549eb2c123cf0e743/spinup_bis/algos/tf2/sac/sac.py#L300" target="_blank" rel="noreferrer noopener nofollow">300</a> in the sac.py,</li>



<li>the parameters update from lines 12-15 of the pseudo-code is called in line <a href="https://github.com/awarelab/spinningup_tf2/blob/f6fd8f679d76d43d44a2388549eb2c123cf0e743/spinup_bis/algos/tf2/sac/sac.py#L301" target="_blank" rel="noreferrer noopener nofollow">301</a> and implemented in lines <a href="https://github.com/awarelab/spinningup_tf2/blob/f6fd8f679d76d43d44a2388549eb2c123cf0e743/spinup_bis/algos/tf2/sac/sac.py#L192" target="_blank" rel="noreferrer noopener nofollow">192</a>&#8211;<a href="https://github.com/awarelab/spinningup_tf2/blob/f6fd8f679d76d43d44a2388549eb2c123cf0e743/spinup_bis/algos/tf2/sac/sac.py#L240" target="_blank" rel="noreferrer noopener nofollow">240</a> in the sac.py,</li>



<li>and the rest of the code in the sac.py is mostly logging handling and some more boilerplate code.</li>
</ul>



<p>The example training in the Pendulum-v0 environment is implemented in run_example.py in the repo root. Simply run it with <code><span class="c-code-snippet">python run_example.py</span></code>. After 200,000 environment steps, the training will automatically finish and save the trained model in the ./out/checkpoint directory.</p>



<p>Below is an example log from the beginning and the end of the training. Note how AverageTestEpRet improved, from a large negative number to something much closer to zero, which is the maximum possible return. Returns are negative because the agent is penalized whenever the pendulum is not in the goal position: vertical, with zero angular velocity and zero torque.</p>



<p>The training took 482 seconds (around 8 minutes) on my MacBook with the Intel i5 processor.</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-before-training">Before training</h3>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">---------------------------------------
|      AverageEpRet |       <span class="hljs-number" style="color: teal;">-1.48e+03</span> |
|          StdEpRet |             <span class="hljs-number" style="color: teal;">334</span> |
|          MaxEpRet |            <span class="hljs-number" style="color: teal;">-973</span> |
|          MinEpRet |       <span class="hljs-number" style="color: teal;">-1.89e+03</span> |
|  AverageTestEpRet |        <span class="hljs-number" style="color: teal;">-1.8e+03</span> |
|      StdTestEpRet |             <span class="hljs-number" style="color: teal;">175</span> |
|      MaxTestEpRet |       <span class="hljs-number" style="color: teal;">-1.48e+03</span> |
|      MinTestEpRet |       <span class="hljs-number" style="color: teal;">-1.94e+03</span> |
|             EpLen |             <span class="hljs-number" style="color: teal;">200</span> |
|         TestEpLen |             <span class="hljs-number" style="color: teal;">200</span> |
| TotalEnvInteracts |           <span class="hljs-number" style="color: teal;">2e+03</span> |
|     AverageQ1Vals |       <span class="hljs-number" style="color: teal;">-4.46e+03</span> |
|         StdQ1Vals |         <span class="hljs-number" style="color: teal;">7.1e+04</span> |
|         MaxQ1Vals |           <span class="hljs-number" style="color: teal;">0.744</span> |
|         MinQ1Vals |           <span class="hljs-number" style="color: teal;">-63.3</span> |
|     AverageQ2Vals |       <span class="hljs-number" style="color: teal;">-4.46e+03</span> |
|         StdQ2Vals |        <span class="hljs-number" style="color: teal;">7.11e+04</span> |
|         MaxQ2Vals |            <span class="hljs-number" style="color: teal;">0.74</span> |
|         MinQ2Vals |           <span class="hljs-number" style="color: teal;">-63.5</span> |
|      AverageLogPi |           <span class="hljs-number" style="color: teal;">-35.2</span> |
|          StdLogPi |             <span class="hljs-number" style="color: teal;">562</span> |
|          MaxLogPi |            <span class="hljs-number" style="color: teal;">3.03</span> |
|          MinLogPi |           <span class="hljs-number" style="color: teal;">-8.33</span> |
|            LossPi |            <span class="hljs-number" style="color: teal;">17.4</span> |
|            LossQ1 |            <span class="hljs-number" style="color: teal;">2.71</span> |
|            LossQ2 |            <span class="hljs-number" style="color: teal;">2.13</span> |
|    StepsPerSecond |        <span class="hljs-number" style="color: teal;">4.98e+03</span> |
|              Time |             <span class="hljs-number" style="color: teal;">3.8</span> |
---------------------------------------
</pre>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-after-training">After training</h3>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">---------------------------------------
|      AverageEpRet |            <span class="hljs-number" style="color: teal;">-176</span> |
|          StdEpRet |            <span class="hljs-number" style="color: teal;">73.8</span> |
|          MaxEpRet |           <span class="hljs-number" style="color: teal;">-9.95</span> |
|          MinEpRet |            <span class="hljs-number" style="color: teal;">-250</span> |
|  AverageTestEpRet |            <span class="hljs-number" style="color: teal;">-203</span> |
|      StdTestEpRet |            <span class="hljs-number" style="color: teal;">55.3</span> |
|      MaxTestEpRet |            <span class="hljs-number" style="color: teal;">-129</span> |
|      MinTestEpRet |            <span class="hljs-number" style="color: teal;">-260</span> |
|             EpLen |             <span class="hljs-number" style="color: teal;">200</span> |
|         TestEpLen |             <span class="hljs-number" style="color: teal;">200</span> |
| TotalEnvInteracts |           <span class="hljs-number" style="color: teal;">2e+05</span> |
|     AverageQ1Vals |       <span class="hljs-number" style="color: teal;">-1.56e+04</span> |
|         StdQ1Vals |        <span class="hljs-number" style="color: teal;">2.48e+05</span> |
|         MaxQ1Vals |           <span class="hljs-number" style="color: teal;">-41.8</span> |
|         MinQ1Vals |            <span class="hljs-number" style="color: teal;">-367</span> |
|     AverageQ2Vals |       <span class="hljs-number" style="color: teal;">-1.56e+04</span> |
|         StdQ2Vals |        <span class="hljs-number" style="color: teal;">2.48e+05</span> |
|         MaxQ2Vals |           <span class="hljs-number" style="color: teal;">-42.9</span> |
|         MinQ2Vals |            <span class="hljs-number" style="color: teal;">-380</span> |
|      AverageLogPi |             <span class="hljs-number" style="color: teal;">475</span> |
|          StdLogPi |        <span class="hljs-number" style="color: teal;">7.57e+03</span> |
|          MaxLogPi |            <span class="hljs-number" style="color: teal;">7.26</span> |
|          MinLogPi |           <span class="hljs-number" style="color: teal;">-10.6</span> |
|            LossPi |            <span class="hljs-number" style="color: teal;">61.6</span> |
|            LossQ1 |            <span class="hljs-number" style="color: teal;">2.01</span> |
|            LossQ2 |            <span class="hljs-number" style="color: teal;">1.27</span> |
|    StepsPerSecond |        <span class="hljs-number" style="color: teal;">2.11e+03</span> |
|              Time |             <span class="hljs-number" style="color: teal;">482</span> |
---------------------------------------
</pre>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-visualizing-the-trained-policy">Visualizing the trained policy</h2>



<p>Now, with the trained model saved, we can run it and see how it does! Run this script: </p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">python run_policy.py --model_path ./out/checkpoint --env_name Pendulum-v0</pre>



<p>in the repo root. You’ll see your agent playing 10 episodes one after another! Isn’t it cool? Did your agent learn to perfectly align the pendulum vertically? Mine didn’t. You may try playing with the hyper-parameters in the run_example.py file (the agent’s function parameters) to make the agent find a better policy. Small hint: I observed that finishing the training earlier might help. All the hyper-parameters are defined in SAC’s docstring in the sac.py file.</p>



<p>You may wonder why each episode is different. It is because the initial conditions (the pendulum’s starting angle and velocity) are randomized each time the environment is reset and a new episode starts.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-conclusions">Conclusions</h2>



<p>The next step for you is to train SAC in some more complex environment like Humanoid or any other environment from the MuJoCo suite. <a href="/blog/installing-mujoco-to-work-with-openai-gym-environments" target="_blank" rel="noreferrer noopener"><em>Installing MuJoCo to Work With OpenAI Gym Environments</em></a> is the guide I wrote on how to install MuJoCo and get access to these complex environments. It also describes useful diagnostics to track. You can read more about logging these diagnostics in <a href="/blog/logging-in-reinforcement-learning-frameworks" target="_blank" rel="noreferrer noopener"><em>Logging in Reinforcement Learning Frameworks – What You Need to Know</em></a>. There are also other frameworks that implement algorithms that can solve the continuous control tasks. Read about them in this post: <a href="/blog/best-benchmarks-for-reinforcement-learning" target="_blank" rel="noreferrer noopener"><em>Best Benchmarks for Reinforcement Learning: The Ultimate List</em></a>. Thank you for your time and see you next time!</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">6026</post-id>	</item>
		<item>
		<title>Installing MuJoCo to Work With OpenAI Gym Environments</title>
		<link>https://neptune.ai/blog/installing-mujoco-to-work-with-openai-gym-environments</link>
		
		<dc:creator><![CDATA[Piotr Januszewski]]></dc:creator>
		<pubDate>Fri, 22 Jul 2022 06:13:25 +0000</pubDate>
				<category><![CDATA[Reinforcement Learning]]></category>
		<guid isPermaLink="false">https://neptune.test/installing-mujoco-to-work-with-openai-gym-environments/</guid>

					<description><![CDATA[In this article, I’ll show you how to install MuJoCo on your Mac/Linux machine in order to run continuous control environments from OpenAI’s Gym. These environments include classic ones like HalfCheetah, Hopper, Walker, Ant, and Humanoid and harder ones like object manipulation with a robotic arm or robotic hand dexterity. I’ll also discuss additional agent&#8230;]]></description>
										<content:encoded><![CDATA[
<p><strong>In this article, I’ll show you how to install MuJoCo on your Mac/Linux machine in order to run continuous control environments from OpenAI’s Gym</strong>. These environments include classic ones like HalfCheetah, Hopper, Walker, Ant, and Humanoid and harder ones like object manipulation with a robotic arm or robotic hand dexterity. I’ll also discuss additional agent diagnostics provided by the environments that you might not have considered before.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-how-do-you-get-mujoco">How do you get MuJoCo?</h2>



<p>You might wonder, what’s so special about installing MuJoCo that it needs a guide? Well, getting a license and properly installing it might be relatively easy, but <strong>the big problems start when you’re matching MuJoCo and OpenAI Gym versions, and installing the mujoco-py package</strong>. It took me many hours to get it right the first time I tried!</p>



<p>To save you the trouble, <strong>I’ll walk you through the installation process step by step.</strong> Then I’ll discuss some useful diagnostics to keep an eye on, and we’ll take a look at example diagnostics from Humanoid training. Finally, I’ll link the code that lets you train agents on MuJoCo tasks and watch the diagnostics using <a href="/" target="_blank" rel="noreferrer noopener">neptune.ai</a>. To start, I’ll give you a bit of context about MuJoCo and OpenAI Gym environments.</p>



<section
	id="i-box-block_80742bc86d7f24957bbbf5b0816c2024"
	class="block-i-box  l-margin__top--large l-margin__bottom--large">

			<header class="c-header">
			<img
				src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
				data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/i-box/header-icon.svg"
				width="24"
				height="24"
				class="c-header__icon lazyload"
				alt="">

			
            <h2 class="c-header__text animation " style='max-width: 100%;'   >
                 <strong>Editor&#8217;s note</strong>
            </h2>		</header>
	
	<div class="block-i-box__inner">
		

<p>Do you feel like experimenting with neptune.ai?</p>



<ul
    id="arrow-list-block_03a4c75bc770048538199c525af91875"
    class="block-arrow-list block-list-item--font-size-regular">
    

<li class="block-list-item ">
    <img loading="lazy" decoding="async"
        src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
        data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/list-item/arrow.svg"
        width="10"
        height="10"
        class="block-list-item__arrow lazyload"
        alt="">

    

<p>Request a <a href="/free-trial" target="_blank" rel="noreferrer noopener">free trial</a></p>


</li>



<li class="block-list-item ">
    <img loading="lazy" decoding="async"
        src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
        data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/list-item/arrow.svg"
        width="10"
        height="10"
        class="block-list-item__arrow lazyload"
        alt="">

    

<p>Play with a <a href="https://scale.neptune.ai/o/examples/org/LLM-Pretraining/reports/9e6a2cad-77e7-42df-9d64-28f07d37e908" target="_blank" rel="noreferrer noopener nofollow">live project</a></p>


</li>



<li class="block-list-item ">
    <img loading="lazy" decoding="async"
        src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
        data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/list-item/arrow.svg"
        width="10"
        height="10"
        class="block-list-item__arrow lazyload"
        alt="">

    

<p><a previewlistener="true" href="https://docs.neptune.ai/" target="_blank" rel="noreferrer noopener">See the docs</a>&nbsp;or watch a short&nbsp;<a href="/walkthrough" target="_blank" rel="noreferrer noopener">product demo (2 min)</a></p>


</li>


</ul>


	</div>

</section>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-mujoco-multi-joint-dynamics-with-contact">MuJoCo &#8211; <strong>Mu</strong>lti-<strong>Jo</strong>int dynamics with <strong>Co</strong>ntact</h2>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/MuJoCo-demo.png?ssl=1" alt="MuJoCo demo" class="wp-image-49346"/><figcaption class="wp-element-caption"><em>Source: </em><a href="http://www.mujoco.org/book/unity.html" target="_blank" rel="noreferrer noopener nofollow"><em>MuJoCo Plugin and Unity Integration</em></a></figcaption></figure>
</div>


<p>MuJoCo is a <strong>fast and accurate physics simulation engine</strong> aimed at research and development in robotics, biomechanics, graphics, and animation. It’s an engine, meaning it doesn’t provide ready-to-use models or environments to work with; rather, it <strong>runs environments </strong>(like those that OpenAI’s Gym offers).</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-what-is-openai-gym">What is OpenAI Gym?</h2>



<p>OpenAI Gym (or Gym for short) is a collection of environments. Some of them, generally referred to as continuous control environments, run on the MuJoCo engine. All the environments share two important characteristics:</p>



<ol class="wp-block-list">
<li>An agent observes vectors that describe the kinematic properties of the controlled robot. This means that the state space is continuous.</li>



<li>Agent actions are vectors, too, and they specify torques to be applied to the robot joints. This means that the action space is also continuous.</li>
</ol>



<p><strong>Gym MuJoCo environments include classic continuous control, objects manipulation with a robotic arm, and robotic hand (Shadow Hand) dexterity.</strong> There are multiple tasks available for training in these environments. Some of them are presented in the figures below. You can find details about all of them in the Gym <a href="https://gym.openai.com/envs/#mujoco" target="_blank" rel="noreferrer noopener nofollow">environments list</a>. <a href="https://openai.com/blog/ingredients-for-robotics-research/" target="_blank" rel="noreferrer noopener nofollow">This post</a> is especially useful for robo-arm and robo-hand environments. If you don’t know the Gym API yet, I encourage you to read the <a href="https://gymnasium.farama.org/" target="_blank" rel="noreferrer noopener nofollow">documentation</a> – the two short sections “Environments” and “Observations” should be enough to start.</p>
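

<p>Once everything is installed (the installation steps are below), you can verify both characteristics directly. For example:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import gym

env = gym.make("Humanoid-v2")  # requires MuJoCo and mujoco-py, installed below

# 1. Continuous state space: a Box of kinematic quantities.
print(type(env.observation_space), env.observation_space.shape)  # Box, (376,)

# 2. Continuous action space: a Box of joint torques.
print(type(env.action_space), env.action_space.shape)            # Box, (17,)
print(env.action_space.low[:3], env.action_space.high[:3])       # per-joint torque limits
</pre>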



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow"><div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Continuous-control-1.gif?ssl=1" alt="Continuous control " class="wp-image-49348"/></figure>
</div></div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow"><div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Continuous-control-2.gif?ssl=1" alt="Continuous control " class="wp-image-49350"/></figure>
</div></div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow"><div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Continuous-control-3.gif?ssl=1" alt="Continuous control " class="wp-image-49349"/></figure>
</div></div>
</div>



<p class="has-text-align-center has-small-font-size"><em>Classic continuous control &#8211; tasks from left to right: Walker2d, And, and Humanoid.<br>Source: </em><a href="https://openai.com/blog/roboschool/" target="_blank" rel="noreferrer noopener nofollow"><em>OpenAI Roboschool</em></a></p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Robotic-arm.gif?ssl=1" alt="Robotic arm " class="wp-image-49351"/><figcaption class="wp-element-caption"><em>Objects manipulation with a robotic arm &#8211; the pick and place task.<br>Source: </em><a href="https://jangirrishabh.github.io/2018/03/25/Overcoming-exploration-demos.html" target="_blank" rel="noreferrer noopener nofollow"><em>Overcoming exploration in RL from demos</em></a></figcaption></figure>
</div>

<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Hand-manipulation.gif?ssl=1" alt="Hand manipulation" class="wp-image-49353"/></figure>
</div>


<p class="has-text-align-center has-small-font-size"><em>Shadow Hand dexterity &#8211; the hand manipulate block task.<br>Source: </em><a href="https://gym.openai.com/envs/#robotics" target="_blank" rel="noreferrer noopener nofollow"><em>OpenAI Gym Robotics</em></a></p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-installing-mujoco-and-openai-gym">Installing MuJoCo and OpenAI Gym</h2>



<p>In this section, I’ll show you where to get the MuJoCo license, how to install everything required, and also how to troubleshoot a common macOS problem.</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-license">License</h3>



<p><strong>You can get a 30-day free trial on the </strong><a href="https://www.roboti.us/license.html" target="_blank" rel="noreferrer noopener nofollow"><strong>MuJoCo website</strong></a> or—if you’re a student—a free 1-year license for education. The license key will arrive in an email with your username and password. If you’re not a student, you might try to encourage the institution you work with to buy a license.</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-installing-mujoco-py">Installing mujoco-py</h3>



<p>Here are step-by-step instructions, and below I added some explanations and troubleshooting tips:</p>



<ol class="wp-block-list">
<li>Download the MuJoCo version 1.50 binaries for <a href="https://www.roboti.us/download/mjpro150_linux.zip" target="_blank" rel="noreferrer noopener nofollow">Linux</a> or <a href="https://www.roboti.us/download/mjpro150_osx.zip" target="_blank" rel="noreferrer noopener nofollow">macOS</a>.</li>



<li>Unzip the downloaded <code><span class="c-code-snippet">mjpro150</span></code> directory into <code><span class="c-code-snippet">~/.mujoco/mjpro150</span></code>, and place your license key (the <code><span class="c-code-snippet">mjkey.txt</span></code> file from your email) at <code><span class="c-code-snippet">~/.mujoco/mjkey.txt</span>.</code></li>



<li>Run <code><span class="c-code-snippet">pip3 install -U 'mujoco-py&lt;1.50.2,>=1.50.1'</span></code></li>



<li>Run <code><span class="c-code-snippet">python3 -c 'import mujoco_py'</span></code></li>
</ol>



<p>If you see warnings like <span class="c-code-snippet">objc[&#8230;]: Class GLFW&#8230; is implemented in both&#8230;</span>, then ignore them. If you’re on macOS and see <span class="c-code-snippet">clang: error: unsupported option &#8216;-fopenmp’</span>or any other compilation-related error, then go to the <span class="c-code-snippet">Troubleshooting</span> subsection. If you wonder why MuJoCo 1.5, then go to the <span class="c-code-snippet">Version</span> subsection. If you have no more concerns, then you can jump into Gym installation!</p>



<h4 class="wp-block-heading">Troubleshooting</h4>



<p>If, on macOS, the <span class="c-code-snippet">clang: error: unsupported option &#8216;-fopenmp’</span> error, or any other compiler-related error (e.g. from gcc if you have it installed), happened to you during the installation or when running <code><span class="c-code-snippet">python3 -c ‘import mujoco_py’</span></code>, then follow these steps:</p>



<p>1. Install <a href="https://brew.sh">brew</a> if you don’t have it already.</p>



<p>2. Uninstall <strong>all</strong> other compilers if you have some, e.g. run <code><span class="c-code-snippet">brew uninstall gcc</span></code>. You may need to run it a couple of times if you have more than one version.</p>



<p>3. Run <code><span class="c-code-snippet">brew install llvm boost hdf5</span></code></p>



<p>4. Add this to your <code><span class="c-code-snippet">.bashrc / .zshrc</span></code></p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">export PATH=<span class="hljs-string" style="color: rgb(221, 17, 68);">"/usr/local/opt/llvm/bin:$PATH"</span>
export CC=<span class="hljs-string" style="color: rgb(221, 17, 68);">"/usr/local/opt/llvm/bin/clang"</span>
export CXX=<span class="hljs-string" style="color: rgb(221, 17, 68);">"/usr/local/opt/llvm/bin/clang++"</span>
export CXX11=<span class="hljs-string" style="color: rgb(221, 17, 68);">"/usr/local/opt/llvm/bin/clang++"</span>
export CXX14=<span class="hljs-string" style="color: rgb(221, 17, 68);">"/usr/local/opt/llvm/bin/clang++"</span>
export CXX17=<span class="hljs-string" style="color: rgb(221, 17, 68);">"/usr/local/opt/llvm/bin/clang++"</span>
export CXX1X=<span class="hljs-string" style="color: rgb(221, 17, 68);">"/usr/local/opt/llvm/bin/clang++"</span>
export LDFLAGS=<span class="hljs-string" style="color: rgb(221, 17, 68);">"-L/usr/local/opt/llvm/lib"</span>
export CPPFLAGS=<span class="hljs-string" style="color: rgb(221, 17, 68);">"-I/usr/local/opt/llvm/include"</span></pre>



<p>5. Don&#8217;t forget to source your <code><span class="c-code-snippet">.bashrc / .zshrc</span></code> (e.g. relaunch your cmd) after editing it and make sure your python environment is activated.</p>



<p>6. Try to uninstall and install mujoco-py again.</p>



<p>See this <a href="https://github.com/openai/mujoco-py/issues/465#issuecomment-651124360" target="_blank" rel="noreferrer noopener nofollow">GitHub issue</a> for more information. You should also see the <span class="c-code-snippet">Troubleshooting</span> section of the <a href="https://github.com/openai/mujoco-py/tree/master#troubleshooting" target="_blank" rel="noreferrer noopener nofollow">mujoco-py README</a>.</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-version">Version</h3>



<p>Here we bump into the first trap! <strong>The newest OpenAI Gym doesn’t work with MuJoCo 2.0</strong>; see <a href="https://github.com/openai/gym/issues/1541">this GitHub issue</a> if you want to know the details. This is why you need to download the MuJoCo version 1.50 binaries. Alternatively, if you really need to use MuJoCo 2.0, you can download the MuJoCo 2.0 binaries for <a href="https://www.roboti.us/download/mujoco200_linux.zip">Linux</a> or <a href="https://www.roboti.us/download/mujoco200_macos.zip">OSX</a>, install the newest mujoco-py, and then install the last Gym that supports MuJoCo 2.0: <code><span class="c-code-snippet">pip install -U gym[all]==0.15.3</span></code></p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-installing-openai-gym-environments-tutorial">Installing OpenAI Gym Environments (tutorial)</h2>



<p>Here, it’s important to install the OpenAI Gym package with the “mujoco” and “robotics” extras or simply all extras:</p>



<ol class="wp-block-list">
<li>Run <span class="c-code-snippet"><code>pip3 install gym[mujoco,robotics]</code> or <code>pip3 install gym[all]</code></span></li>



<li>Check the installation by running:</li>
</ol>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">python3 -c <span class="hljs-string" style="color: rgb(221, 17, 68);">"import gym; env = gym.make('Humanoid-v2'); print('nIt is OKAY!' if env.reset() is not None else 'nSome problem here...')"</span>
</pre>



<p>If you see “It is OKAY!” printed at the end of the cmd, then it’s OKAY! Again, <strong>you can ignore warnings like <span class="c-code-snippet">objc[&#8230;]: Class GLFW&#8230; is implemented in both&#8230;</span></strong>.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-mujoco-diagnostics">MuJoCo diagnostics</h2>



<p>Now I’ll talk about useful metrics provided by the OpenAI Gym MuJoCo environments. They depend on the environment version, so I divide them into v2 and v3 diagnostics. You can access these metrics in the “info” dictionary returned by the environment step method: observation, reward, done, <strong>info</strong> = env.step(action). See the <a href="https://gymnasium.farama.org/" target="_blank" rel="noreferrer noopener nofollow">Gym documentation</a> for more. <strong>The table below presents the keys that allow you to access the metrics in the dictionary, along with short descriptions of the metrics</strong>.</p>
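

<p>For example, printing the Humanoid diagnostics during a short random rollout could look like this (a sketch; the exact keys for each environment are listed in the table below):</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import gym

env = gym.make("Humanoid-v2")
obs = env.reset()
for step in range(5):
    obs, reward, done, info = env.step(env.action_space.sample())
    # "info" holds the per-step diagnostics; for Humanoid-v2 these include
    # reward_linvel, reward_quadctrl, reward_impact, and reward_alive (see the table).
    print(step, {key: round(value, 3) for key, value in info.items()})
    if done:
        obs = env.reset()
</pre>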



<div id="separator-block_ad2e1e8a994e6bfc27dee8d0475e0b59"
         class="block-separator block-separator--10">
</div>



<div id="medium-table-block_23fa682b5d5ba5bb9c999f64a759c691"
     class="block-medium-table c-table__outer-wrapper  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset">

    <table class="c-table">
                    <thead class="c-table__head">
            <tr>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Name                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Version                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Key                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Description                        </div>
                    </td>
                            </tr>
            </thead>
        
        <tbody class="c-table__body">

                    
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;"><strong>HalfCheetah</strong></span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">v2 / v3</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">reward_run</span></p>
<p><span style="font-size: 15px;">reward_ctrl</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">The positive reward for the robot forward velocity.</span></p>
<p><span style="font-size: 15px;">The negative reward for the robot action vector magnitude.</span></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;"><strong>HalfCheetah</strong></span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">v3</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">x_position</span></p>
<p><span style="font-size: 15px;">x_velocity</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">Position in the X-axis.</span></p>
<p><span style="font-size: 15px;">Velocity in the X-axis (forward velocity).</span></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong><span style="font-size: 15px;">Hopper</span></strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">v3</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">x_position</span></p>
<p><span style="font-size: 15px;">x_velocity</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">Position in the X-axis.</span></p>
<p><span style="font-size: 15px;">Velocity in the X-axis (forward velocity).</span></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;"><strong>Walker2d</strong></span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">v3</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">x_position</span></p>
<p><span style="font-size: 15px;">x_velocity</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">Position in the X-axis.</span></p>
<p><span style="font-size: 15px;">Velocity in the X-axis (forward velocity).</span></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;"><strong>Ant</strong></span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">v2 / v3</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">reward_forward</span></p>
<p><span style="font-size: 15px;">reward_ctrl</span></p>
<p><span style="font-size: 15px;">reward_contact</span></p>
<p>&nbsp;</p>
<p><span style="font-size: 15px;">reward_survive</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">The positive reward for the robot forward velocity.</span></p>
<p><span style="font-size: 15px;">The negative reward for the robot action vector magnitude.</span></p>
<p><span style="font-size: 15px;">The negative reward for the contact force magnitude between the robot and the ground.</span></p>
<p><span style="font-size: 15px;">The constant positive reward at each time step when the robot is alive (until the end of an episode or the robot falls).</span></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;"><strong>Ant</strong></span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">v3</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">x_position</span></p>
<p><span style="font-size: 15px;">x_velocity</span></p>
<p><span style="font-size: 15px;">y_position</span></p>
<p><span style="font-size: 15px;">y_velocity</span></p>
<p><span style="font-size: 15px;">distance_from_origin</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">Position in the X-axis.</span></p>
<p><span style="font-size: 15px;">Velocity in the X-axis.</span></p>
<p><span style="font-size: 15px;">Position in the Y-axis.</span></p>
<p><span style="font-size: 15px;">Velocity in the Y-axis.</span></p>
<p><span style="font-size: 15px;">Distance from the robot starting position, (0, 0).</span></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;"><strong>Humanoid</strong></span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">v2 / v3</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">reward_linvel</span></p>
<p><span style="font-size: 15px;">reward_quadctrl</span></p>
<p><span style="font-size: 15px;">reward_impact</span></p>
<p>&nbsp;</p>
<p><span style="font-size: 15px;">reward_alive</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">The positive reward for the robot’s forward velocity.</span></p>
<p><span style="font-size: 15px;">The negative reward (penalty) for the magnitude of the robot’s action vector.</span></p>
<p><span style="font-size: 15px;">The negative reward (penalty) for the magnitude of the contact forces between the robot and the ground.</span></p>
<p><span style="font-size: 15px;">The constant positive reward granted at each time step while the robot is alive (until the episode ends or the robot falls).</span></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;"><strong>Humanoid</strong></span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">v3</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">x_position</span></p>
<p><span style="font-size: 15px;">x_velocity</span></p>
<p><span style="font-size: 15px;">y_position</span></p>
<p><span style="font-size: 15px;">y_velocity</span></p>
<p><span style="font-size: 15px;">distance_from_origin</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-size: 15px;">Position in the X-axis.</span></p>
<p><span style="font-size: 15px;">Velocity in the X-axis.</span></p>
<p><span style="font-size: 15px;">Position in the Y-axis.</span></p>
<p><span style="font-size: 15px;">Velocity in the Y-axis.</span></p>
<p><span style="font-size: 15px;">Distance from the robot starting position, (0, 0).</span></p>
                                                            </div>
                        </td>

                    
                </tr>

                    
        </tbody>
    </table>

</div>



<div id="separator-block_74419ffcb8c39132a12a93ed3b58e106"
         class="block-separator block-separator--5">
</div>



<p class="has-text-align-center"><em>Table: The most useful metrics provided by the OpenAI Gym MuJoCo environments</em></p>



<div id="separator-block_65f73a6ab17eeac86beb6cea7147bb87"
         class="block-separator block-separator--20">
</div>



<p><strong>Reward components can be especially useful</strong> – for example, the forward velocity reward, since running forward is the very goal of these tasks. However, note that the absence of some metric from the info dictionary doesn’t mean the corresponding term isn’t part of the reward: the survival reward, say, is still added to the rewards of Hopper and Walker even though it isn’t reported there. For more nitty-gritty details like this, I encourage you to look into the code of the specific task on GitHub, e.g. <a href="https://github.com/openai/gym/blob/master/gym/envs/mujoco/walker2d_v3.py" target="_blank" rel="noreferrer noopener nofollow">Walker2d-v3</a>.</p>
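

<p>If you want to inspect these components yourself, the sketch below runs a few random steps in Ant-v3 and prints the reward terms from the table above. It is a minimal illustration (not code from this article), and it assumes the classic OpenAI Gym API that returns four values from <em>step()</em> plus a working MuJoCo installation.</p>



<pre class="wp-block-code"><code>import gym  # classic OpenAI Gym API; newer Gymnasium returns five values from step()

# Minimal sketch: take a few random steps in Ant-v3 and print the reward
# components exposed through the info dictionary (names from the table above).
env = gym.make("Ant-v3")
obs = env.reset()
for _ in range(5):
    action = env.action_space.sample()           # random action, just to produce transitions
    obs, reward, done, info = env.step(action)   # info carries the diagnostic metrics
    print(
        f"forward: {info['reward_forward']:+.3f}  "
        f"ctrl: {info['reward_ctrl']:+.3f}  "
        f"contact: {info['reward_contact']:+.3f}  "
        f"survive: {info['reward_survive']:+.3f}"
    )
    if done:
        obs = env.reset()
env.close()</code></pre>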



<p>Now, let’s take a look at example metric values on the Humanoid task.</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-humanoid-diagnostics">Humanoid diagnostics</h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/humanoid_velocity.png?ssl=1" alt="Humanoid velocity" class="wp-image-49360"/><figcaption class="wp-element-caption"><em>Comparison of velocities of three different DRL algorithms: SAC, SOP, and SUNRISE</em></figcaption></figure>
</div>


<p>The figure above compares velocities of three different DRL algorithms: <a href="https://arxiv.org/abs/1812.05905?ref=hackernoon.com" target="_blank" rel="noreferrer noopener nofollow">SAC</a>, <a href="https://arxiv.org/abs/1910.02208" target="_blank" rel="noreferrer noopener nofollow">SOP</a>, and <a href="https://arxiv.org/abs/2007.04938" target="_blank" rel="noreferrer noopener nofollow">SUNRISE</a>. The velocities are plotted for fully trained agents at different points of the episode. You can see that the SOP agent runs the fastest, which is the goal of this task. In the figures below we investigate the positions of the SAC agent at the end of episodes at different stages of training.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/MuJoCo-logs_AverageXPosition-2.png?ssl=1" alt="MuJoCo logs_AverageXPosition-2" class="wp-image-49361"/><figcaption class="wp-element-caption"><em>SAC final positions in the X-axis across training on the Humanoid task. | </em><a href="https://app.neptune.ai/piojanu/bayesian-exploration/e/BAY-1797/charts" target="_blank" rel="noreferrer noopener nofollow"><em>See in the Neptune app</em></a></figcaption></figure>
</div>

<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/MuJoCo-logs_AverageYPosition-2.png?ssl=1" alt="MuJoCo logs_AverageYPosition-2" class="wp-image-49362"/><figcaption class="wp-element-caption">SAC final positions in the Y-axis across training on the Humanoid task | <a href="https://app.neptune.ai/piojanu/bayesian-exploration/e/BAY-1797/charts" target="_blank" rel="noreferrer noopener nofollow">See in the Neptune app</a></figcaption></figure>
</div>


<p>You can see that this particular SAC agent runs in the negative X and positive Y direction and that, as training progresses, it gets further and further. Because the time it has before the end of the episode stays the same, this means it learns to run faster over the course of training. Note that the agent isn’t trained to run in any particular direction – it’s simply trained to run as fast as possible in whatever direction it picks. Different agents can therefore learn to run in different directions, and an agent can even change its running direction at some point during training, as shown in the figures below.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/MuCoJo-logs_AverageXPosition-3.png?ssl=1" alt="MuCoJo logs_AverageXPosition-3" class="wp-image-49364"/><figcaption class="wp-element-caption"><em>SAC final positions in the X-axis across training on the Humanoid task. It changes the run direction about one-third of the way through training | <a href="https://app.neptune.ai/piojanu/bayesian-exploration/e/BAY-1783/charts" target="_blank" rel="noreferrer noopener nofollow">See in the Neptune app</a></em></figcaption></figure>
</div>

<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/MuJoCo-logs_AverageYPosition-3.png?ssl=1" alt="MuJoCo logs_AverageYPosition-3" class="wp-image-49365"/><figcaption class="wp-element-caption"><em>SAC final positions in the Y-axis across training on the Humanoid task. It changes the run direction late in the training |<a href="https://app.neptune.ai/piojanu/bayesian-exploration/e/BAY-1783/charts" target="_blank" rel="noreferrer noopener nofollow"> See in the Neptune app</a></em></figcaption></figure>
</div>
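

<p>If you’d like to reproduce this kind of diagnostic for your own agents, here is a minimal sketch. It assumes the classic Gym API and a <em>policy(obs)</em> callable standing in for your trained agent (a hypothetical placeholder, not something defined in this article); the resulting positions can then be logged to Neptune or any other tracker once per evaluation phase.</p>



<pre class="wp-block-code"><code>import gym

def final_positions(policy, n_episodes=10):
    """Roll out a trained policy on Humanoid-v3 and return final (x, y) positions.

    `policy` is a hypothetical callable mapping an observation to an action;
    plug in your own trained agent here.
    """
    env = gym.make("Humanoid-v3")
    positions = []
    for _ in range(n_episodes):
        obs, done, info = env.reset(), False, {}
        while not done:
            obs, reward, done, info = env.step(policy(obs))
        # x_position / y_position come from the info dict, as listed in the table above.
        positions.append((info["x_position"], info["y_position"]))
    env.close()
    return positions</code></pre>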

    <a
        href="/blog/best-benchmarks-for-reinforcement-learning"
        id="cta-box-related-link-block_46db6bec6206e6ab562b8e3a9326714d"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related                </div>
            </div>

        
                    <h3 class="c-header" class="c-header" id="h-best-benchmarks-for-reinforcement-learning-the-ultimate-list">                Best Benchmarks for Reinforcement Learning: The Ultimate List            </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-conclusions">Conclusions</h2>



<p>Congratulations, you’ve got MuJoCo up and running! The natural next step is training agents in these environments – check out <a href="https://github.com/awarelab/spinningup_tf2" target="_blank" rel="noreferrer noopener nofollow">this repository</a>. It includes easy-to-understand implementations of DRL algorithms in modern TF2, based on the newcomer-friendly <a href="https://spinningup.openai.com" target="_blank" rel="noreferrer noopener nofollow">SpinningUp</a> codebase. Moreover, <strong>it includes the ability to log to the Neptune platform</strong>, which is very convenient for storing and analyzing training results! I use it in my research, and I encourage you to give it a try too.&nbsp;</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5532</post-id>	</item>
		<item>
		<title>Best Benchmarks for Reinforcement Learning: The Ultimate List</title>
		<link>https://neptune.ai/blog/best-benchmarks-for-reinforcement-learning</link>
		
		<dc:creator><![CDATA[Piotr Januszewski]]></dc:creator>
		<pubDate>Thu, 21 Jul 2022 15:04:41 +0000</pubDate>
				<category><![CDATA[Reinforcement Learning]]></category>
		<guid isPermaLink="false">https://neptune.test/best-benchmarks-for-reinforcement-learning/</guid>

					<description><![CDATA[In this post, I’ll share with you my library of environments that support training reinforcement learning (RL) agents. The basis for RL research, or even playing with or learning RL, is the environment. It’s where you run your algorithm to evaluate how good it is. We’re going to explore 23 different benchmarks, so I guarantee&#8230;]]></description>
										<content:encoded><![CDATA[
<p>In this post, I’ll share with you my library of environments that support training reinforcement learning (RL) agents. The basis for RL research, or even playing with or learning RL, is the environment. It’s where you run your algorithm to evaluate how good it is. We’re going to explore 23 different benchmarks, so I guarantee you’ll find something interesting!</p>



<p>But first, we’ll do a short introduction to what you should be looking for if you’re just starting with RL. Whatever your current level of knowledge, I recommend looking through the whole list. I hope it will motivate you to keep doing good work, and <strong>inspire you to start your own project in something different than standard benchmarks!</strong></p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-rule-of-thumb">Rule of thumb</h2>



<p>If you’re interested in algorithms specialized in <strong>discrete action spaces</strong> (PPO, DQN, Rainbow, &#8230;), where the action input can be, for example, buttons on the ATARI 2600 game controller, then you should look at the Atari environments in the <a href="https://gym.openai.com">OpenAI Gym</a>. These include Pong, Breakout, Space Invaders, Seaquest, and more.</p>



<p>On the other hand, if you’re more interested in algorithms specialized in <strong>continuous action spaces</strong> (DDPG, TD3, SAC, &#8230;), where the action input is, say, torque on the joints of a humanoid robot learning to walk, then you should look at the MuJoCo environments in the <a href="https://gym.openai.com" target="_blank" rel="noreferrer noopener nofollow">OpenAI Gym</a> and <a href="https://github.com/deepmind/dm_control" target="_blank" rel="noreferrer noopener nofollow">DeepMind Control Suite</a>. <a href="https://github.com/benelot/pybullet-gym" target="_blank" rel="noreferrer noopener nofollow">PyBullet Gymperium</a> is an unpaid alternative. Harder environments include Robotics in the <a href="https://gym.openai.com" target="_blank" rel="noreferrer noopener nofollow">OpenAI Gym</a>.</p>



<p>If you don’t know what you’re interested in yet, then I suggest playing around with <a href="https://gym.openai.com/envs/#classic_control" target="_blank" rel="noreferrer noopener nofollow">classic control</a> environments in the OpenAI Gym, and reading <a href="https://spinningup.openai.com" target="_blank" rel="noreferrer noopener nofollow">SpinningUp in Deep RL</a>.</p>
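

<p>If the classic control environments are where you start, the interaction loop is only a few lines. Below is a minimal random-agent sketch using the classic Gym API (the four-value <em>step()</em> signature); no learning happens here – it simply shows the interface you’ll build on.</p>



<pre class="wp-block-code"><code>import gym

# Minimal sketch of the classic Gym interaction loop on a classic-control task.
# The agent samples random actions; replace the sampling with your algorithm.
env = gym.make("CartPole-v1")
for episode in range(3):
    obs = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()          # your agent's action goes here
        obs, reward, done, info = env.step(action)  # classic 4-tuple API
        total_reward += reward
    print(f"episode {episode}: return {total_reward}")
env.close()</code></pre>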



<p>Enough introduction, let’s check out the benchmarks!</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-benchmarks">Benchmarks</h2>



<p>The first part of this section is just a list, in alphabetical order, of all 23 benchmarks. Further down, I add a bit of description from each benchmark’s creator to show you what it’s for.&nbsp;</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-list-of-rl-benchmarks">List of RL benchmarks</h3>



<ul class="wp-block-list">
<li><a href="#aihabitat" rel="nofollow">AI Habitat</a> &#8211; Virtual embodiment; Photorealistic &amp; efficient 3D simulator;</li>



<li><a href="#behavioursuite" rel="nofollow">Behaviour Suite</a> &#8211; Test core RL capabilities; Fundamental research; Evaluate generalization;</li>



<li><a href="#deepmind-control" rel="nofollow">DeepMind Control Suite</a> &#8211; Continuous control; Physics-based simulation; Creating environments;</li>



<li><a href="#deepmind-lab" rel="nofollow">DeepMind Lab</a> &#8211; 3D navigation; Puzzle-solving;</li>



<li><a href="#deepmind-memory" rel="nofollow">DeepMind Memory Task Suite</a> &#8211; Require memory; Evaluate generalization;</li>



<li><a href="#deepmind-psychlab" rel="nofollow">DeepMind Psychlab</a> &#8211; Require memory; Evaluate generalization;</li>



<li><a href="#googleresearch-football" rel="nofollow">Google Research Football</a> &#8211; Multi-task; Single-/Multi-agent; Creating environments;</li>



<li><a href="#metaworld" rel="nofollow">Meta-World</a> &#8211; Meta-RL; Multi-task;</li>



<li><a href="#minerl" rel="nofollow">MineRL</a> &#8211; Imitation learning; Offline RL; 3D navigation; Puzzle-solving;</li>



<li><a href="#multiagent" rel="nofollow">Multiagent emergence environments</a> &#8211; Multi-agent; Creating environments; Emergence behavior;</li>



<li><a href="#openai-gym" rel="nofollow">OpenAI Gym</a> &#8211; Continuous control; Physics-based simulation; Classic video games; RAM state as observations;</li>



<li><a href="#openai-gymretro" rel="nofollow">OpenAI Gym Retro</a> &#8211; Classic video games; RAM state as observations;</li>



<li><a href="#openspiel" rel="nofollow">OpenSpiel</a> &#8211; Classic board games; Search and planning; Single-/Multi-agent;</li>



<li><a href="#procgen" rel="nofollow">Procgen Benchmark</a> &#8211; Evaluate generalization; Procedurally-generated;</li>



<li><a href="#pybullet" rel="nofollow">PyBullet Gymperium</a> &#8211; Continuous control; Physics-based simulation; MuJoCo unpaid alternative;</li>



<li><a href="#realworld" rel="nofollow">Real-World Reinforcement Learning</a> &#8211; Continuous control; Physics-based simulation; Adversarial examples;</li>



<li><a href="#rlcard" rel="nofollow">RLCard</a> &#8211; Classic card games; Search and planning; Single-/Multi-agent;</li>



<li><a href="#rlunplugged" rel="nofollow">RL Unplugged</a> &#8211; Offline RL; Imitation learning; Datasets for the common benchmarks;</li>



<li><a href="#screeps" rel="nofollow">Screeps</a> &#8211; Compete with others; Sandbox; MMO for programmers;</li>



<li><a href="#serpentai" rel="nofollow">Serpent.AI &#8211; Game Agent Framework</a> &#8211; Turn ANY video game into the RL env;</li>



<li><a href="#starcraft" rel="nofollow">StarCraft II Learning Environment</a> &#8211; Rich action and observation spaces; Multi-agent; Multi-task;</li>



<li><a href="#unity" rel="nofollow">The Unity Machine Learning Agents Toolkit (ML-Agents)</a> &#8211; Create environments; Curriculum learning; Single-/Multi-agent; Imitation learning;</li>



<li><a href="#wordcraft" rel="nofollow">WordCraft</a> -Test core capabilities; Commonsense knowledge;<br></li>
</ul>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-a">A</h2>



<h3 class="wp-block-heading" id="aihabitat"><a href="https://aihabitat.org" target="_blank" rel="noreferrer noopener nofollow">AI Habitat</a></h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-benchmarks-AI-habitat.png?ssl=1" alt="RL benchmarks - AI habitat" class="wp-image-44458"/></figure>
</div>


<p>The embodiment hypothesis is the idea that “intelligence emerges in the interaction of an agent with an environment and as a result of sensorimotor activity”. <em>Habitat</em> is a simulation platform for research in Embodied AI.<br>Imagine walking up to a home robot and asking “Hey robot – can you go check if my laptop is on my desk? And if so, bring it to me”. Or asking an egocentric AI assistant (sitting on your smart glasses): “Hey – where did I last see my keys?”. AI Habitat enables training of such embodied AI agents (virtual robots and egocentric assistants) in a highly photorealistic &amp; efficient 3D simulator, before transferring the learned skills to reality.</p>



<p>It’ll be the best fit for you if you study intelligent systems with a physical or virtual embodiment.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-b">B</h2>



<h3 class="wp-block-heading" id="behavioursuite"><a href="https://github.com/deepmind/bsuite" target="_blank" rel="noreferrer noopener nofollow">Behaviour Suite</a></h3>



<p><em>bsuite</em> is a collection of carefully designed experiments that investigate the core capabilities of a reinforcement learning (RL) agent with two main objectives.</p>



<ul class="wp-block-list">
<li>To collect clear, informative, and scalable problems that capture key issues in the design of efficient and general learning algorithms.</li>



<li>To study agent behavior through their performance on these shared benchmarks.</li>
</ul>



<p>This library automates the evaluation and analysis of any agent on these benchmarks. It serves to facilitate reproducible, and accessible, research on the core issues in RL, and ultimately the design of superior learning algorithms.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-d">D</h2>



<h3 class="wp-block-heading" id="deepmind-control"><a href="https://github.com/deepmind/dm_control" target="_blank" rel="noreferrer noopener nofollow">DeepMind Control Suite</a></h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-benchmarks-Deepmind-control.png?ssl=1" alt="RL benchmarks - Deepmind control" class="wp-image-44459"/></figure>
</div>


<p>The <em>dm_control</em> software package is a collection of <strong>Python libraries and task suites</strong> for reinforcement learning agents in an articulated-body simulation. A MuJoCo wrapper provides convenient bindings to functions and data structures to create your own tasks.</p>



<p>Moreover, the Control Suite is a fixed set of tasks with a standardized structure, intended to serve as performance benchmarks. It includes classic tasks like HalfCheetah, Humanoid, Hopper, Walker, Graber, and more (see the picture). The Locomotion framework provides high-level abstractions and examples of locomotion tasks like soccer. A set of configurable manipulation tasks with a robot arm and snap-together bricks is also included.</p>



<p>An introductory tutorial for this package is available as a <a href="https://colab.research.google.com/github/deepmind/dm_control/blob/master/tutorial.ipynb" target="_blank" rel="noreferrer noopener nofollow">Colaboratory notebook</a>.</p>
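

<p>To give you a feel for the API, here is a minimal random-policy loop (a sketch along the lines of the package’s tutorial; it assumes dm_control and MuJoCo are installed):</p>



<pre class="wp-block-code"><code>import numpy as np
from dm_control import suite

# Minimal sketch: load a Control Suite task and step it with uniform random actions.
env = suite.load(domain_name="walker", task_name="walk")
action_spec = env.action_spec()

time_step = env.reset()
episode_return = 0.0
while not time_step.last():
    action = np.random.uniform(
        action_spec.minimum, action_spec.maximum, size=action_spec.shape
    )
    time_step = env.step(action)
    episode_return += time_step.reward
print("random-policy return:", episode_return)</code></pre>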



<h3 class="wp-block-heading" id="deepmind-lab"><a href="https://github.com/deepmind/lab" target="_blank" rel="noreferrer noopener nofollow">DeepMind Lab</a></h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-benchmarks-Deepmind-Lab.png?ssl=1" alt="RL benchmarks - Deepmind Lab" class="wp-image-44460"/></figure>
</div>


<p><em>DeepMind Lab</em> is a 3D learning environment based on Quake III Arena via ioquake3 and other open-source software. DeepMind Lab provides a suite of challenging 3D navigation and puzzle-solving tasks for learning agents. Its primary purpose is to act as a testbed for research in artificial intelligence, where agents have to act on visual observations.</p>



<h3 class="wp-block-heading" id="deepmind-memory"><a href="https://github.com/deepmind/dm_memorytasks" target="_blank" rel="noreferrer noopener nofollow">DeepMind Memory Task Suite</a></h3>



<p><em>The DeepMind Memory Task Suite</em> is a set of 13 diverse machine-learning tasks that require memory to solve. They are constructed to let us evaluate generalization performance on a memory-specific holdout set.</p>



<h3 class="wp-block-heading" id="deepmind-psychlab"><a href="https://github.com/deepmind/lab/tree/master/game_scripts/levels/contributed/psychlab" target="_blank" rel="noreferrer noopener nofollow">DeepMind Psychlab</a></h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-benchmarks-Deepmind-Psychlab.png?ssl=1" alt="RL benchmarks - Deepmind Psychlab" class="wp-image-44461"/></figure>
</div>


<p><em>Psychlab</em> is a simulated psychology laboratory inside the first-person 3D game world of DeepMind Lab. Psychlab enables implementations of classical laboratory psychological experiments so that they work with both human and artificial agents. Psychlab has a simple and flexible API that enables users to easily create their own tasks. As an example, the Psychlab includes several classical experimental paradigms including visual search, change detection, random dot motion discrimination, and multiple object tracking.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-g">G</h2>



<h3 class="wp-block-heading" id="googleresearch-football"><a href="https://github.com/google-research/football" target="_blank" rel="noreferrer noopener nofollow">Google Research Football</a></h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-benchmarks-Google-Football.png?ssl=1" alt="RL benchmarks - Google Football" class="wp-image-44462"/></figure>
</div>


<p><em>Google Research Football</em> is a novel RL environment where agents aim to master the world’s most popular sport &#8211; football! Modeled after popular football video games, the Football Environment provides an efficient physics-based 3D football simulation where agents control either one or all football players on their team, learn how to pass between them, and manage to overcome their opponent’s defense in order to score goals. The Football Environment provides a demanding set of research problems called Football Benchmarks, as well as the Football Academy, a set of progressively harder RL scenarios.<br>It’s perfect for multi-agent and multi-task research. It also allows you to create your own academy scenarios as well as completely new tasks using the simulator, based on the included examples.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-m">M</h2>



<h3 class="wp-block-heading" id="metaworld"><a href="https://github.com/rlworkgroup/metaworld" target="_blank" rel="noreferrer noopener nofollow">Meta-World</a></h3>


<div class="wp-block-image">
<figure class="aligncenter"><img decoding="async" src="https://lh5.googleusercontent.com/sYQFn5xtu2A64zQ3w75S7FnOHYbPAeVS64vO8FwzgoPz6dvTdRV7D2SkndVbKUrFj6ZgIRNsWjnaMdpU7BnthCh_Xef1ap-Xzmte-OhR9q1AtsmfCgW-HhRyKBC4vwQ6BSBAzxZB" alt="RL benchmarks - Metaworld"/></figure>
</div>


<p>Meta-reinforcement learning algorithms can enable robots to acquire new skills much more quickly, by leveraging prior experience to learn how to learn. <em>Meta-World</em> is an open-source simulated benchmark for meta-reinforcement learning and multi-task learning consisting of 50 distinct robotic manipulation tasks. The authors aim to provide task distributions that are sufficiently broad to evaluate meta-RL algorithms&#8217; generalization ability to new behaviors.</p>



<h3 class="wp-block-heading" id="minerl"><a href="https://minerl.io/docs/" target="_blank" rel="noreferrer noopener nofollow">MineRL</a></h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-benchmarks-MineRL.png?ssl=1" alt="RL benchmarks - MineRL" class="wp-image-44463"/></figure>
</div>


<p><em>MineRL</em> is a research project started at Carnegie Mellon University aimed at developing various aspects of artificial intelligence within Minecraft. In short, MineRL consists of two major components:</p>



<ul class="wp-block-list">
<li><a href="https://minerl.io/dataset/" target="_blank" rel="noreferrer noopener nofollow">MineRL-v0 Dataset</a> – One of the largest imitation learning datasets with over 60 million frames of recorded human player data. The dataset includes a set of environments that highlight many of the hardest problems in modern-day Reinforcement Learning: sparse rewards and hierarchical policies.</li>



<li><a href="https://minerl.io/docs/tutorials/index.html" target="_blank" rel="noreferrer noopener nofollow">minerl</a> – A rich python3 package for doing artificial intelligence research in Minecraft. This includes two major submodules: <em>minerl.env</em> – A growing set of OpenAI Gym environments in Minecraft and <em>minerl.data</em> – The main python module for experimenting with the MineRL-v0 dataset.</li>
</ul>



<h3 class="wp-block-heading" id="multiagent"><a href="https://github.com/openai/multi-agent-emergence-environments" target="_blank" rel="noreferrer noopener nofollow">Multiagent emergence environments</a></h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-benchmarks-Multiagent.png?ssl=1" alt="RL benchmarks - Multiagent" class="wp-image-44464"/></figure>
</div>


<p>Environment generation code for <a href="https://arxiv.org/abs/1909.07528" target="_blank" rel="noreferrer noopener nofollow">Emergent Tool Use From Multi-Agent Autocurricula</a>. It’s a fun paper, I highly recommend you read it. The authors observed agents discovering progressively more complex tool use while playing a simple game of hide-and-seek. Through training in the simulated hide-and-seek environment, agents build a series of six distinct strategies and counterstrategies. The self-supervised emergent complexity in this simple environment further suggests that multi-agent co-adaptation may one day produce extremely complex and intelligent behavior.</p>



<p>It uses “<a href="https://github.com/openai/mujoco-worldgen" target="_blank" rel="noreferrer noopener nofollow">Worldgen: Randomized MuJoCo environments</a>” which allows users to generate complex, heavily randomized environments. You should try it too, if you’re into creating your own environments!</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-o">O</h2>



<h3 class="wp-block-heading" id="openai-gym"><a href="https://gym.openai.com" target="_blank" rel="noreferrer noopener nofollow">OpenAI Gym</a></h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-benchmarks-OpenAI-gym.png?ssl=1" alt="RL benchmarks - OpenAI gym" class="wp-image-44465"/></figure>
</div>


<p><em>Gym</em>, besides being the most widely known benchmark, is an amazing toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking a simulated humanoid (requires MuJoCo, see <a href="https://github.com/benelot/pybullet-gym" target="_blank" rel="noreferrer noopener nofollow">PyBullet Gymperium</a> for the free alternative) to playing Atari games like Pong or Pinball. I personally use it in my research the most. It’s very easy to use and it’s kind of standard nowadays. You should get to know it well.</p>



<h3 class="wp-block-heading" id="openai-gymretro"><a href="https://github.com/openai/retro" target="_blank" rel="noreferrer noopener nofollow">OpenAI Gym Retro</a></h3>



<p><em>Gym Retro</em> can be thought of as an extension of the OpenAI Gym. It lets you turn classic video games into OpenAI Gym environments for reinforcement learning and comes with integrations for ~1000 games. It uses various emulators that support the Libretro API, making it fairly easy to add new emulators.</p>



<h3 class="wp-block-heading" id="openspiel"><a href="https://github.com/deepmind/open_spiel" target="_blank" rel="noreferrer noopener nofollow">OpenSpiel</a></h3>



<p><em>OpenSpiel</em> is a collection of environments and algorithms for research in general reinforcement learning and search/planning in games. OpenSpiel supports n-player (single- and multi- agent) zero-sum, cooperative and general-sum, one-shot and sequential, strictly turn-taking and simultaneous-move, perfect and imperfect information games, as well as traditional multiagent environments such as (partially- and fully- observable) grid worlds and social dilemmas. OpenSpiel also includes tools to analyze learning dynamics and other common evaluation metrics. Games are represented as procedural extensive-form games, with some natural extensions. The core API and games are implemented in C++ for efficiency and exposed to Python for your ease of use.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-p">P</h2>



<h3 class="wp-block-heading" id="procgen"><a href="https://github.com/openai/procgen" target="_blank" rel="noreferrer noopener nofollow">Procgen Benchmark</a></h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-benchmarks-Procgen.png?ssl=1" alt="RL benchmarks - Procgen" class="wp-image-44466"/></figure>
</div>


<p><em>Procgen Benchmark</em> consists of 16 unique environments designed to measure both sample efficiency and generalization in reinforcement learning. This benchmark is ideal for evaluating generalization since distinct training and test sets can be generated in each environment. This benchmark is also well-suited to evaluate sample efficiency since all environments pose diverse and compelling challenges for RL agents. The environments’ intrinsic diversity demands that agents learn robust policies; overfitting to narrow regions in state space will not suffice. Put differently, the ability to generalize becomes an integral component of success when agents are faced with ever-changing levels.</p>



<h3 class="wp-block-heading" id="pybullet"><a href="https://github.com/benelot/pybullet-gym" target="_blank" rel="noreferrer noopener nofollow">PyBullet Gymperium</a></h3>



<p><em>PyBullet Gymperium</em> is an open-source implementation of the OpenAI Gym MuJoCo environments and more. These are challenging continuous control environments like training a humanoid to walk. What’s cool about it is that it doesn’t require the user to install MuJoCo, a commercial physics engine that requires a paid license to run for longer than 30 days.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-r">R</h2>



<h3 class="wp-block-heading" id="realworld"><a href="https://github.com/google-research/realworldrl_suite" target="_blank" rel="noreferrer noopener nofollow">Real-World Reinforcement Learning</a></h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-benchmarks-Real-World.png?ssl=1" alt="RL benchmarks - Real World" class="wp-image-44467"/></figure>
</div>


<p>The <a href="https://arxiv.org/abs/1904.12901" target="_blank" rel="noreferrer noopener nofollow">Challenges of Real-World RL</a> paper identifies and describes a set of nine challenges that are currently preventing Reinforcement Learning (RL) agents from being utilized in real-world applications and products. It also describes an evaluation framework and a set of environments that can provide an evaluation of an RL algorithm’s potential applicability to real-world systems. It has since been followed up by the <a href="https://arxiv.org/pdf/2003.11881.pdf" target="_blank" rel="noreferrer noopener nofollow">An Empirical Investigation of the challenges of real-world reinforcement learning</a> paper, which implements eight of the nine described challenges and analyses their effects on various state-of-the-art RL algorithms.<br>This is the codebase used to perform this analysis and is also intended as a common platform for easily reproducible experimentation around these challenges. It is referred to as the <em>realworldrl-suite</em> (Real-World Reinforcement Learning (RWRL) Suite).</p>



<h3 class="wp-block-heading" id="rlcard"><a href="https://github.com/datamllab/rlcard" target="_blank" rel="noreferrer noopener nofollow">RLCard</a></h3>



<p><em>RLCard</em> is a toolkit for Reinforcement Learning (RL) in card games. It supports multiple card environments with easy-to-use interfaces. Games include Blackjack, UNO, Limit Texas Hold&#8217;em, and more! It also lets you create your own environments. The goal of RLCard is to bridge reinforcement learning and imperfect information games.</p>



<h3 class="wp-block-heading" id="rlunplugged"><a href="https://github.com/deepmind/deepmind-research/tree/master/rl_unplugged" target="_blank" rel="noreferrer noopener nofollow">RL Unplugged</a></h3>



<p><em>RL Unplugged</em> is a suite of benchmarks for offline reinforcement learning. It is designed to facilitate ease of use: it provides the datasets with a unified API, which makes it easy for practitioners to work with all data in the suite once a general pipeline has been established. It includes datasets for the most common benchmarks: Atari, DeepMind Locomotion, DeepMind Control Suite, Real-World RL, DeepMind Lab, and bsuite.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-s">S</h2>



<h3 class="wp-block-heading" id="screeps"><a href="https://screeps.com" target="_blank" rel="noreferrer noopener nofollow">Screeps</a></h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-benchmarks-Screeps.png?ssl=1" alt="RL benchmarks - Screeps" class="wp-image-44468"/></figure>
</div>


<p><em>Screeps</em> is a massive, multiplayer, online, real-time, strategy game (phwee, it’s a lot). Each player can create their own colony in a single persistent world shared by all the players. Such a colony can mine resources, build units, and conquer territory. As you conquer more territory, your influence in the game world grows, as well as your abilities to expand your footprint. However, it requires a lot of effort on your part, since multiple players may aim at the same territory. And most importantly, you build an AI that does all of it!</p>



<p>Screeps is developed for people with programming skills. Unlike some other RTS games, your units in Screeps can react to events without your participation – provided that you have programmed them properly.</p>



<h3 class="wp-block-heading" id="serpentai"><a href="https://github.com/SerpentAI/SerpentAI?utm_campaign=Revue%20newsletter&amp;utm_medium=Newsletter&amp;utm_source=The%20Wild%20Week%20in%20AI" target="_blank" rel="noreferrer noopener nofollow">Serpent.AI &#8211; Game Agent Framework</a></h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-benchmarks-Serpent-AI.png?ssl=1" alt="RL benchmarks - Serpent AI" class="wp-image-44469"/></figure>
</div>


<p>Serpent.AI is a simple yet powerful, novel framework to assist developers in the creation of game agents. Turn ANY video game you own into a sandbox environment ripe for experimentation, all with familiar Python code. For example, see this <a href="https://www.youtube.com/watch?v=rvnHikUJ9T0" target="_blank" rel="noreferrer noopener nofollow">autonomous driving agent in GTA</a>. The framework first and foremost provides a valuable tool for Machine Learning &amp; AI research. It also turns out to be ridiculously fun to use as a hobbyist (and dangerously addictive)!</p>



<h3 class="wp-block-heading" id="starcraft"><a href="https://github.com/deepmind/pysc2" target="_blank" rel="noreferrer noopener nofollow">StarCraft II Learning Environment</a></h3>



<p><em>PySC2</em> provides an interface for RL agents to interact with StarCraft 2, getting observations and sending actions. It exposes Blizzard Entertainment&#8217;s StarCraft II Machine Learning API as a Python RL Environment. This is a collaboration between DeepMind and Blizzard to develop StarCraft II into a rich environment for RL research. <em>PySC2</em> has many pre-configured mini-game maps for benchmarking the RL agents.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-t">T</h2>



<h3 class="wp-block-heading" id="unity"><a href="https://github.com/Unity-Technologies/ml-agents" target="_blank" rel="noreferrer noopener nofollow">The Unity Machine Learning Agents Toolkit (ML-Agents)</a></h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-benchmarks-Unity.png?ssl=1" alt="RL benchmarks - Unity" class="wp-image-44470"/></figure>
</div>


<p>It’s an open-source project that enables games and simulations to serve as environments for training intelligent agents. Unity provides implementations (based on PyTorch) of state-of-the-art algorithms to enable game developers and hobbyists to easily train intelligent agents for 2D, 3D, and VR/AR games. Researchers, however, can use the provided simple-to-use Python API to train Agents using reinforcement learning, imitation learning, neuroevolution, or any other methods! See for example <a href="https://github.com/Unity-Technologies/marathon-envs" target="_blank" rel="noreferrer noopener nofollow">Marathon Environments</a>.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-w">W</h2>



<h3 class="wp-block-heading" id="wordcraft"><a href="https://github.com/minqi/wordcraft" target="_blank" rel="noreferrer noopener nofollow">WordCraft</a></h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-benchmarks-Wordcraft.png?ssl=1" alt="RL benchmarks - Wordcraft" class="wp-image-44471"/></figure>
</div>


<p>This is the official Python implementation of <a href="https://larel-ws.github.io/assets/pdfs/wordcraft_an_environment_for_benchmarking_commonsense_agents.pdf" target="_blank" rel="noreferrer noopener nofollow">WordCraft: An Environment for Benchmarking Commonsense Agents</a>. The ability to quickly solve a wide range of real-world tasks requires a commonsense understanding of the world. To better enable research on agents making use of commonsense knowledge you should try WordCraft, an RL environment based on <a href="https://littlealchemy2.com" target="_blank" rel="noreferrer noopener nofollow">Little Alchemy 2</a>. Little Alchemy 2 is a fun and addictive game which allows players to combine elements to create even more elements. This lightweight environment is fast to run and built upon entities and relations inspired by real-world semantics.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-conclusion">Conclusion</h2>



<p>This concludes our list of RL benchmarks. I can’t really tell you which one you should pick. For some, the more classic benchmarks like OpenAI Gym or DM Control Suite described in the “Rule of thumb” section will be the best fit. For others, those won’t be enough, and they might want to jump into something less well-worn, like the Unity ML-Agents or Screeps.</p>



<p>Personally, I worked with Google Research Football (GRF) on one occasion, and it was fun to see how my agents learned to play football and score goals. At the moment, I work on more fundamental research and test my agents using the well-recognized OpenAI Gym MuJoCo environments, which is fun in other ways – like seeing that my method really works.</p>



<p>Whatever your choice, I hope this list helps you make your RL research more exciting!</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">4697</post-id>	</item>
		<item>
		<title>Markov Decision Process in Reinforcement Learning: Everything You Need to Know</title>
		<link>https://neptune.ai/blog/markov-decision-process-in-reinforcement-learning</link>
		
		<dc:creator><![CDATA[Andre Ye]]></dc:creator>
		<pubDate>Thu, 21 Jul 2022 13:22:57 +0000</pubDate>
				<category><![CDATA[ML Model Development]]></category>
		<category><![CDATA[Reinforcement Learning]]></category>
		<guid isPermaLink="false">https://neptune.test/markov-decision-process-in-reinforcement-learning/</guid>

					<description><![CDATA[Take a moment to locate the nearest big city around you. If you were to go there, how would you do it? Go by car, take a bus, take a train? Maybe ride a bike, or buy an airplane ticket? Making this choice, you incorporate probability into your decision-making process. Perhaps there’s a 70% chance&#8230;]]></description>
										<content:encoded><![CDATA[
<p>Take a moment to locate the nearest big city around you. If you were to go there, how would you do it? Go by car, take a bus, take a train? Maybe ride a bike, or buy an airplane ticket?</p>



<p>Making this choice, you <strong>incorporate probability into your decision-making process</strong>. Perhaps there’s a 70% chance of rain or a car crash, which can cause traffic jams. If your bike tire is old, it may break down &#8211; this is certainly a large probabilistic factor.&nbsp;</p>



<p>On the other hand, there are <strong>deterministic costs</strong> &#8211; for instance, the cost of gas or an airplane ticket &#8211; as well as deterministic rewards &#8211; like much faster travel times taking an airplane.</p>



<p>These types of problems &#8211; in which an agent must balance probabilistic and deterministic rewards and costs &#8211; are common in decision-making. <a href="https://towardsdatascience.com/introduction-to-reinforcement-learning-markov-decision-process-44c533ebf8da" target="_blank" rel="noreferrer noopener nofollow">Markov Decision Processes</a> are used to model these types of optimization problems, and can also be applied to more complex tasks in Reinforcement Learning.</p>



<div id="separator-block_5d6e95510d6e0ea64d9056d82ca33484"
         class="block-separator block-separator--5">
</div>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-defining-markov-decision-processes-in-machine-learning">Defining Markov Decision Processes in Machine Learning</h2>



<p>To illustrate a Markov Decision process, think about a dice game:</p>



<ul class="wp-block-list">
<li>Each round, you can either <strong>continue</strong> or <strong>quit</strong>.</li>



<li>If you <strong>quit</strong>, you receive $5 and the game ends.</li>



<li>If you <strong>continue</strong>, you receive $3 and roll a 6-sided die. If the die comes up as 1 or 2, the game ends. Otherwise, the game continues onto the next round.</li>
</ul>



<p>There is a clear trade-off here: we can give up a guaranteed extra $2 (quitting for $5 instead of continuing for $3) in exchange for the chance to roll the die and continue to the next round.</p>



<p>To create an MDP to model this game, first we need to define a few things:</p>



<ul class="wp-block-list">
<li>A <em>state </em>is a status that the agent (decision-maker) can hold. In the dice game, the agent can either be <em>in the game</em> or <em>out of the game</em>.&nbsp;</li>



<li>An <em>action</em> is a movement the agent can choose. It moves the agent between states, with certain penalties or rewards.</li>



<li><em>Transition probabilities</em> describe the probability of ending up in a state s’ (s prime) given an action <em>a</em>. These are often denoted as a function <em>P</em>(<em>s</em>, <em>a</em>, <em>s</em>’) that outputs the probability of ending up in <em>s’</em> given current state <em>s</em> and action <em>a</em>.<br>For example, <em>P</em>(<em>s</em>=playing the game, <em>a</em>=choose to continue playing, <em>s’</em>=not playing the game) is ⅓, since there is a two-sixths (one-third) chance of losing the dice roll.</li>



<li><em>Rewards</em> are given depending on the action. The reward for continuing the game is $3, whereas the reward for quitting is $5. The ‘overall’ reward is to be optimized.</li>
</ul>



<p>We can formally describe a Markov Decision Process as <em>m</em> = (<em>S</em>, <em>A,</em> <em>P</em>, <em>R</em>, gamma), where:</p>



<ul class="wp-block-list">
<li><em>S</em> represents the set of all states.</li>



<li><em>A</em> represents the set of possible actions.</li>



<li><em>P</em> represents the transition probabilities.</li>



<li><em>R</em> represents the rewards.</li>



<li>Gamma is known as the discount factor (more on this later).</li>
</ul>



<p>The goal of the MDP <em>m</em> is to find a policy, often denoted as pi, that yields the optimal long-term reward. Policies are simply a mapping of each state <em>s</em> to a distribution of actions <em>a</em>. For each state <em>s</em>, the agent should take action <em>a</em> with a certain probability. Alternatively, policies can also be deterministic (i.e. the agent <em>will</em> take action <em>a</em> in state <em>s</em>).</p>
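

<p>To make this concrete, the dice game above can be written down as a handful of plain Python dictionaries. This is just one possible encoding (the state and action names are arbitrary labels of mine, not from any library):</p>



<pre class="wp-block-code"><code># The dice-game MDP written out explicitly.
# Numbers follow the description above; the state/action names are arbitrary labels.
states = ["in", "out"]                          # "in" = still playing, "out" = game over
actions = {"in": ["stay", "quit"], "out": []}

# P[(s, a)] maps next states to their probabilities.
P = {
    ("in", "quit"): {"out": 1.0},
    ("in", "stay"): {"in": 2 / 3, "out": 1 / 3},   # the die shows 1 or 2 with prob 1/3
}

# R[(s, a)] is the immediate reward for taking action a in state s.
R = {
    ("in", "quit"): 5.0,
    ("in", "stay"): 3.0,
}

gamma = 1.0   # discount factor; the worked example later in the article also uses 1</code></pre>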



<p>Our Markov Decision Process would look like the graph below. An agent traverses the graph’s two states by making decisions and following probabilities.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Markov-Decision-Process.png?ssl=1" alt="Markov Decision Process" class="wp-image-31228"/></figure>
</div>


<p>It’s important to mention the <em>Markov Property</em>, which applies not only to Markov Decision Processes but anything Markov-related (like a Markov Chain).&nbsp;</p>



<p><strong>It states that the next state can be determined solely by the current state &#8211; no ‘memory’ is necessary.</strong> This applies to how the agent traverses the Markov Decision Process, but note that optimization methods use previous learning to fine tune policies. This is not a violation of the Markov property, which only applies to the <em>traversal</em> of an MDP.</p>



<div id="separator-block_5d6e95510d6e0ea64d9056d82ca33484"
         class="block-separator block-separator--5">
</div>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-the-bellman-equation-dynamic-programming">The Bellman equation &amp; dynamic programming</h2>



<p>The Bellman Equation is central to Markov Decision Processes. It outlines a framework for determining the optimal expected reward at a state <em>s</em> by answering the question: “what is the maximum reward an agent can receive if they make the optimal action now and for all future decisions?”</p>



<p>Although versions of the Bellman Equation can become fairly complicated, fundamentally most of them can be boiled down to this form:</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Bellman-Equation.png?ssl=1" alt="Bellman Equation" class="wp-image-31230"/></figure>
</div>


<p>It is a relatively common-sense idea, put into formulaic terms. Notice the role gamma &#8211; which is between 0 and 1 (inclusive) &#8211; plays in determining the optimal reward. If gamma is set to 0, the V(s’) term is completely canceled out and the model only cares about the immediate reward.&nbsp;</p>



<p>On the other hand, if gamma is set to 1, the model weights potential future rewards just as much as it weights immediate rewards. The optimal value of gamma is usually somewhere between 0 and 1, such that the value of farther-out rewards has diminishing effects.</p>



<p>Let’s use the Bellman equation to determine how much money we could receive in the dice game. We can choose between two choices, so our expanded equation will look like max(choice 1’s reward, choice 2’s reward).&nbsp;</p>



<p>Choice 1 &#8211; quitting &#8211; yields a reward of $5.&nbsp;</p>



<p>On the other hand, choice 2 yields a reward of $3, plus a two-thirds chance of continuing to the next stage, in which the decision can be made again (we are calculating the expected return). We add a discount factor gamma in front of terms involving the value of s’ (the next state).</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Markov-decision-equation.png?resize=512%2C104&#038;ssl=1" alt="Markov decision equation" class="wp-image-31232" style="width:512px;height:104px" width="512" height="104"/></figure>
</div>


<p>This equation is recursive, but it will inevitably converge to one value, given that each successive term is multiplied by ⅔ (the probability of continuing), even with the maximum gamma of 1.&nbsp;</p>



<p>At some point, it will not be profitable to continue staying in the game. Let’s calculate four iterations of this, with a gamma of 1, to keep things simple and to calculate the total long-term optimal reward.</p>



<p>At each step, we can either quit and receive an extra $5 in expected value, or stay and receive an extra $3 in expected value. Each new round, the expected value of staying is multiplied by two-thirds, since even if the agent chooses to stay, there is only a two-thirds probability of the game continuing.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Markov-Decision-Process-graph.png?resize=768%2C428&#038;ssl=1" alt="Markov Decision Process graph" class="wp-image-31234" style="width:768px;height:428px" width="768" height="428"/></figure>
</div>


<p>Here, the decimal values are computed, and we find that (with our current number of iterations) we can expect to get $7.8 if we follow the best choices.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Markov-Decision-Process-graph-2.png?resize=768%2C424&#038;ssl=1" alt="Markov Decision Process graph" class="wp-image-31235" style="width:768px;height:424px" width="768" height="424"/></figure>
</div>


<p>Here, we calculated the best profit manually, so our result is only an approximation: we terminated the calculation after only four rounds.&nbsp;</p>



<p>If we were to continue computing expected values for several dozen more rows, we would find that the optimal value is actually higher. In order to compute this efficiently with a program, you would need to use a specialized data structure.&nbsp;</p>



<p>Plus, in order to be efficient, we don’t want to calculate each expected value independently, but in relation to previous ones. The solution: Dynamic Programming.</p>



<p>Richard Bellman, of the Bellman Equation, coined the term Dynamic Programming, and it’s used to compute problems that can be broken down into subproblems. For example, the expected value for choosing Stay &gt; Stay &gt; Stay &gt; Quit can be found by calculating the value of Stay &gt; Stay &gt; Stay first.&nbsp;</p>



<p>These pre-computations would be stored in a two-dimensional array, where the row represents either the state [In] or [Out], and the column represents the iteration.&nbsp;We can write rules that relate each cell in the table to a previously precomputed cell (this diagram doesn’t include gamma).</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Markov-Decision-Process-diagram.png?resize=512%2C193&#038;ssl=1" alt="Markov Decision Process diagram" class="wp-image-31239" style="width:512px;height:193px" width="512" height="193"/></figure>
</div>


<p>Then, the solution is simply the largest value in the array after computing enough iterations. Through dynamic programming, computing the expected value &#8211; a key component of Markov Decision Processes and methods like Q-Learning &#8211; becomes efficient.</p>
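


<p>As a condensed sketch of this idea &#8211; assuming the dice game above, where quitting pays $5 and staying pays $3 with a two-thirds chance of playing another round &#8211; we can apply the Bellman update repeatedly and reuse the previously computed value instead of re-expanding the whole decision tree:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"># Value iteration for the dice game: V = max(quit, stay)
# quit: reward of 5, game ends
# stay: reward of 3, then a 2/3 chance of being back in the game
gamma = 1.0
V = 0.0                        # value of being in the game, initially unknown
for _ in range(50):
    V = max(5, 3 + gamma * (2/3) * V)
print(round(V, 2))             # prints 9.0
</pre>



<p>With gamma = 1, the value settles at $9, which is why the $7.8 we computed by hand after only four rounds is an underestimate.</p>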



<div id="separator-block_5d6e95510d6e0ea64d9056d82ca33484"
         class="block-separator block-separator--5">
</div>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-q-learning-markov-decision-process-reinforcement-learning">Q-learning: Markov Decision Process + Reinforcement Learning</h2>



<p>Let’s think about a different simple game, in which the agent (the circle) must navigate a grid in order to maximize the rewards for a given number of iterations.</p>



<p>There are seven types of blocks:&nbsp;</p>



<ul class="wp-block-list">
<li>-2 punishment,</li>



<li>-5 punishment,&nbsp;</li>



<li>-1 punishment,&nbsp;</li>



<li>+1 reward,&nbsp;</li>



<li>+10 reward,&nbsp;</li>



<li>block that moves the agent to space A1 or B3 with equal probability,</li>



<li>empty blocks.&nbsp;</li>
</ul>



<p>Note that this is an <a href="https://www.mathworks.com/help/reinforcement-learning/ug/train-reinforcement-learning-agent-in-mdp-environment.html" target="_blank" rel="noreferrer noopener nofollow">MDP </a>in grid form &#8211; there are 9 states, and each connects to the states around it. The game terminates if the agent accumulates a punishment of -5 or less, or a reward of 5 or more.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/MDP-grid.png?resize=512%2C237&#038;ssl=1" alt="MDP grid" class="wp-image-31240" style="width:512px;height:237px" width="512" height="237"/></figure>
</div>


<p>In Q-learning, we don’t know the transition probabilities &#8211; they aren’t explicitly defined in the model. Instead, the model must learn them, along with the reward landscape, by interacting with the environment.&nbsp;</p>



<p>This makes Q-learning suitable in scenarios where explicit probabilities and values are unknown. If they are known, then you might not need to use Q-learning.&nbsp;</p>



<p>In our game, we know the probabilities, rewards, and penalties because we are strictly defining them. But if, say, we are training a robot to navigate a complex landscape, we wouldn’t be able to hard-code the rules of physics; using Q-learning or another reinforcement learning method would be appropriate.</p>



<p>Each step of the way, the model will update its learnings in a Q-table. The table below, which stores possible state-action pairs, reflects <em>current</em> known information about the system, which will be used to drive future decisions.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Q-table.png?ssl=1" alt="Q table" class="wp-image-31242"/></figure>
</div>


<p>Each of the cells contains a Q-value, which represents the expected value of the system given that the current action is taken. (Does this sound familiar? It should &#8211; this is the Bellman Equation again!)<strong>&nbsp;</strong></p>



<p>All values in the table begin at 0 and are updated iteratively. Note that there is no state for A3 because the agent cannot control their movement from that point.</p>



<p>To update the Q-table, the agent begins by choosing an action. It cannot move up or down, but if it moves right, it suffers a penalty of -5, and the game terminates. The Q-table can be updated accordingly.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Q-table-2.png?ssl=1" alt="Q table" class="wp-image-31244"/></figure>
</div>


<p>When the agent traverses the environment for the second time, it considers its options. Given the current Q-table, it can either move right or down. Moving right yields a loss of -5, compared to moving down, currently set at 0.&nbsp;</p>



<p>For the sake of simulation, let’s imagine that the agent travels along the path indicated below, and ends up at C1, terminating the game with a reward of 10. We can then fill in the reward that the agent received for each action they took along the way.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Q-table-3.png?ssl=1" alt="Q table" class="wp-image-31245"/></figure>
</div>


<p>Obviously, this Q-table is incomplete. Even if the agent moves down from A1 to A2, there is no guarantee that it will receive a reward of 10. After enough iterations, the agent should have traversed the environment to the point where values in the Q-table tell us the best and worst decisions to make at every location.&nbsp;</p>



<p>This example is a simplification of how Q-values are actually updated, which involves the Bellman Equation discussed above. In practice, each update blends the newly observed reward with the value already stored in the Q-table (weighted by a learning rate), while gamma discounts the contribution of estimated future rewards, so more recent and accurate information gradually overrides older estimates.</p>
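


<p>For reference, here is a minimal sketch of the standard Q-learning update rule. The state names, the action set, and the values of the learning rate alpha and the discount gamma are placeholders for illustration; they are not the exact grid or tables pictured above:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"># Q-learning update: Q(s, a) is nudged towards  r + gamma * max over a' of Q(s', a')
alpha, gamma = 0.1, 0.9        # learning rate and discount factor (illustrative values)
Q = {}                         # Q-table: (state, action) pairs map to values, defaulting to 0

def update(state, action, reward, next_state, actions=("up", "down", "left", "right")):
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)

update("A1", "down", 0, "A2")      # a neutral move
update("A2", "right", 10, "B2")    # a move that reaches the +10 reward
print(Q[("A2", "right")])          # 1.0 = 0 + 0.1 * (10 + 0.9*0 - 0)
# Repeated sweeps propagate the +10 back towards earlier states such as A1.
</pre>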



<p>It’s important to note the <strong>exploration vs exploitation trade-off</strong> here. If the agent traverses the correct path towards the goal but ends up, for some reason, at an unlucky penalty, it will record that negative value in the Q-table and associate every move it took with this penalty.&nbsp;</p>



<p>If the agent is purely ‘exploitative’ &#8211; it always seeks to maximize direct immediate gain &#8211; it may never dare to take a step in the direction of that path.&nbsp;</p>



<p>Alternatively, if an agent finds a path to a small reward, a purely exploitative agent will simply follow that path every time and ignore any other path, even one that leads to a much larger reward.&nbsp;</p>



<p>By allowing the agent to ‘explore’ more, it can focus less on choosing the optimal path and more on collecting information. This is usually implemented by injecting some randomness into the agent’s decision process.&nbsp;</p>



<p>However, a purely ‘explorative’ agent is also useless and inefficient &#8211; it will take paths that clearly lead to large penalties and can take up valuable computing time.&nbsp;</p>



<p>It’s good practice to incorporate an intermediate mix of randomness, such that the agent bases its decisions on previous discoveries, but still has opportunities to try less-explored paths.</p>
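


<p>A common way to implement this intermediate mix is an <em>epsilon-greedy</em> rule: with a small probability the agent picks a random action, and otherwise it picks the action with the highest current Q-value. A minimal sketch (the value of epsilon below is arbitrary):</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    # With probability epsilon, explore: pick a random action.
    if random.random() &lt; epsilon:
        return random.choice(actions)
    # Otherwise exploit: pick the action with the highest known Q-value.
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
</pre>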



<p>A more sophisticated way of incorporating the exploration-exploitation trade-off is <em>simulated annealing</em> &#8211; a name borrowed from metallurgy, where annealing is the controlled heating and cooling of metals.&nbsp;</p>



<p>Instead of fixing a constant that determines how explorative or exploitative the agent is, simulated annealing starts with heavy exploration and becomes more exploitative over time, as the agent gathers more information.&nbsp;</p>



<p>This method has shown enormous success in discrete problems like the Travelling Salesman Problem, so it also applies well to Markov Decision Processes.&nbsp;</p>



<p>Because simulated annealing begins with high exploration, it can broadly gauge which solutions are promising and which are not. As it becomes more exploitative, it directs its attention towards the promising regions, eventually closing in on the best solution it has found in a computationally efficient way.</p>
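


<p>A much simpler cousin of this idea is to let the exploration rate decay over the course of training &#8211; not full simulated annealing, but the same spirit of starting out explorative and ending up exploitative. A small sketch with arbitrary schedule values:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">def exploration_rate(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    # Linearly anneal from heavy exploration to mostly exploitation.
    fraction = min(step / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)

print(exploration_rate(0))         # 1.0: explore almost every move
print(exploration_rate(5_000))     # 0.525: halfway through the schedule
print(exploration_rate(20_000))    # 0.05: mostly exploit what has been learned
</pre>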



<div id="separator-block_5d6e95510d6e0ea64d9056d82ca33484"
         class="block-separator block-separator--5">
</div>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-summary">Summary</h2>



<p>Let’s wrap up what we explored in this article:&nbsp;</p>



<p>A Markov Decision Process (MDP) is used to model decisions that can have both probabilistic and deterministic rewards and punishments.</p>



<p>MDPs have five core elements:&nbsp;</p>



<ul class="wp-block-list">
<li>S, a set of possible states for an agent to be in,&nbsp;</li>



<li>A, a set of possible actions an agent can take at a particular state,</li>



<li>R, the rewards for making an action A at state S,&nbsp;</li>



<li>P, the probabilities for transitioning to a new state S’ after taking action A at original state S,&nbsp;</li>



<li>gamma, which controls how far-looking the Markov Decision Process agent will be.</li>
</ul>



<p>All Markov Processes, including MDPs, must follow the <em>Markov Property</em>, which states that the next state depends only on the current state, not on the sequence of states that preceded it.</p>



<p>The Bellman Equation determines the maximum reward an agent can receive if they make the optimal decision at the current state and at all following states. It defines the value of the current state recursively as being the maximum possible value of the current state reward, plus the value of the next state.</p>



<p>Dynamic programming utilizes a grid structure to store previously computed values and builds upon them to compute new values. It can be used to efficiently calculate the value of a policy and to solve not only Markov Decision Processes, but many other recursive problems.</p>



<p>Q-Learning is the learning of Q-values in an environment, which often resembles a Markov Decision Process. It is suitable in cases where the specific probabilities, rewards, and penalties are not completely known, as the agent traverses the environment repeatedly to learn the best strategy by itself.</p>



<p>Hope you enjoyed exploring these topics with me. Thank you for reading!</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">3025</post-id>	</item>
		<item>
		<title>Model-Based and Model-Free Reinforcement Learning: Pytennis Case Study</title>
		<link>https://neptune.ai/blog/model-based-and-model-free-reinforcement-learning-pytennis-case-study</link>
		
		<dc:creator><![CDATA[Elisha Odemakinde]]></dc:creator>
		<pubDate>Thu, 21 Jul 2022 13:21:34 +0000</pubDate>
				<category><![CDATA[ML Model Development]]></category>
		<category><![CDATA[Reinforcement Learning]]></category>
		<guid isPermaLink="false">https://neptune.test/model-based-and-model-free-reinforcement-learning-pytennis-case-study/</guid>

					<description><![CDATA[Reinforcement learning is a field of Artificial Intelligence in which you build an intelligent system that learns from its environment through interaction and evaluates what it learns in real-time.&#160; A good example of this is self-driving cars, or when DeepMind built what we know today as AlphaGo, AlphaStar, and AlphaZero.&#160; AlphaZero is a program built&#8230;]]></description>
										<content:encoded><![CDATA[
<p>Reinforcement learning is a field of Artificial Intelligence in which you build an intelligent system that learns from its environment through interaction and evaluates what it learns in real-time.&nbsp;</p>



<p>A good example of this is self-driving cars, or when DeepMind built what we know today as AlphaGo, AlphaStar, and AlphaZero.&nbsp;</p>



<p>AlphaZero is a program built to master the games of chess, shogi and go (AlphaGo is the first program that beat a human Go master). AlphaStar plays the video game StarCraft II.</p>



<p>In this article, we’ll compare model-free vs model-based reinforcement learning. Along the way, we will explore:</p>



<ol class="wp-block-list">
<li>Fundamental concepts of Reinforcement Learning<br>a) Markov decision processes / Q-Value / Q-Learning / Deep Q Network</li>



<li>Difference between model-based and model-free reinforcement learning.</li>



<li>Discrete mathematical approach to playing tennis &#8211; model-free reinforcement learning.</li>



<li>Tennis game using Deep Q Network &#8211; model-based reinforcement learning.</li>



<li>Comparison/Evaluation</li>



<li>References to learn more</li>
</ol>



<section id="blog-intext-cta-block_60039cf3b10bb0ecbb2c4d588a4f117b" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" class="block-blog-intext-cta__header" id="h-see-related-articles">SEE RELATED ARTICLES</h3>
    
            <p>  <a href="/blog/best-reinforcement-learning-tutorials-examples-projects-and-courses" target="_blank" rel="noreferrer noopener">7 Applications of Reinforcement Learning in Finance and Trading</a><br />
  <a href="/blog/reinforcement-learning-applications" target="_blank" rel="noreferrer noopener">10 Real-Life Applications of Reinforcement Learning</a><br />
  <a href="/blog/best-reinforcement-learning-tutorials-examples-projects-and-courses" target="_blank" rel="noreferrer noopener">Best Reinforcement Learning Tutorials, Examples, Projects, and Courses</a></p>
    
    </section>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-fundamental-concepts-of-reinforcement-learning">Fundamental concepts of Reinforcement Learning</h2>



<p>Any reinforcement learning problem includes the following elements:</p>



<ol class="wp-block-list">
<li><strong>Agent</strong> &#8211; the program controlling the object of concern (for instance, a robot).</li>



<li><strong>Environment</strong> &#8211; this defines the outside world programmatically. Everything the agent(s) interacts with is part of the environment. It’s built for the agent to make it seem like a real-world case. It’s needed to assess the performance of an agent, meaning whether it will do well once implemented in a real-world application.</li>



<li><strong>Rewards</strong> &#8211; this gives us a score of how the algorithm performs with respect to the environment. In the simplest case, it’s represented as 1 or 0: ‘1’ means the policy network made the right move, ‘0’ means a wrong move. In other words, rewards represent gains and losses.</li>



<li><strong>Policy</strong> &#8211; the algorithm used by the agent to decide its actions. This is the part that can be model-based or model-free.</li>
</ol>



<p>Every problem that needs an RL solution starts with simulating an environment for the agent. Next, you build a policy network that guides the actions of the agent. The agent can then evaluate the policy based on whether its corresponding action resulted in a gain or a loss.</p>



<p>The policy is our main discussion point for this article. Policy can be model-based or model-free. When building, our concern is how to optimize the policy network via policy gradient (PG).&nbsp;</p>



<p>PG algorithms directly try to optimize the policy to increase rewards. To understand these algorithms, we must take a look at Markov decision processes (MDP).</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-markov-decision-processes-q-value-q-learning-deep-q-network">Markov decision processes / Q-Value / Q-Learning / Deep Q Network</h3>



<p>MDP is a process with a fixed number of states, and it randomly evolves from one state to another at each step. The probability for it to evolve from state A to state B is fixed.</p>



<p>A lot of Reinforcement Learning problems with discrete actions are modeled as <strong>Markov decision processes</strong>, with the agent having no initial clue on the next transition state. The agent also has no idea on the rewarding principle, so it has to explore all possible states to begin to decode how to adjust to a perfect rewarding system. This will lead us to what we call Q Learning.</p>



<p>The <strong>Q-Learning algorithm</strong> is adapted from the <strong>Q-Value</strong> Iteration algorithm, for situations where the agent has no prior knowledge of preferred states and rewarding principles. Q-Values can be defined as estimates of the optimal value of taking a given action in a given state of an MDP.&nbsp;</p>



<p>It is often said that Q-Learning doesn’t scale well to large (or even medium) MDPs with many states and actions. The solution is to approximate the Q-Value of any state-action pair (s,a). This is called Approximate Q-Learning.&nbsp;</p>



<p>DeepMind proposed the use of deep neural networks, which work much better, especially for complex problems &#8211; without the use of any feature engineering. A deep neural network used to estimate Q-Values is called a <strong>deep Q-network (DQN).</strong> Using DQN for approximated Q-learning is called Deep Q-Learning.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-difference-between-model-based-and-model-free-reinforcement-learning">Difference between model-based and model-free Reinforcement Learning</h2>



<p>RL algorithms can be mainly divided into two categories &#8211; <strong>model-based and model-free</strong>. </p>



<p><strong>Model-based</strong>, as it sounds, has an agent trying to understand its environment and creating a model for it based on its interactions with this environment. In such a system, preferences take priority over the consequences of the actions, i.e., the greedy agent will always try to perform an action that gets the maximum reward irrespective of what that action may cause.</p>



<p>On the other hand, <strong>model-free algorithms</strong> seek to learn the consequences of their actions through experience via algorithms such as Policy Gradient, Q-Learning, etc. In other words, such an algorithm will carry out an action multiple times and will adjust the policy (the strategy behind its actions) for optimal rewards, based on the outcomes.</p>



<p>Think of it this way: if the agent can predict the reward for some action before actually performing it &#8211; thereby planning what it should do &#8211; the algorithm is model-based. If it actually needs to carry out the action to see what happens and learn from it, it is model-free.</p>



<p>This results in different applications for these two classes. For example, a model-based approach may be the perfect fit for playing chess or for a robotic arm in the assembly line of a product, where the environment is static and getting the task done most efficiently is our main concern. However, in the case of real-world applications such as self-driving cars, a model-based approach might prompt the car to run over a pedestrian to reach its destination in less time (maximum reward), but a model-free approach would make the car wait till the road is clear (the optimal way out).</p>



<p>To better understand this, we will explain everything with an example. In the example, <strong>we’ll build model-free and model-based RL for tennis games</strong>. To build the model, we need an environment for the policy to get implemented. However, we won’t build the environment in this article; we’ll import one to use for our program.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-pytennis-environment">Pytennis environment</h2>



<p>We’ll use the Pytennis environment to build a model-free and model-based RL system.</p>



<p>A tennis game requires the following:</p>



<ol class="wp-block-list">
<li>2 players which implies 2 agents.</li>



<li>A tennis lawn &#8211; main environment.</li>



<li>A single tennis ball.</li>



<li>Movement of the agents left-right (or right-left direction).&nbsp;</li>
</ol>



<p>The Pytennis environment specifications are:</p>



<ol class="wp-block-list">
<li>There are 2 agents (2 players) with a ball.</li>



<li>There’s a tennis field of dimension (x, y) &#8211; (300, 500)</li>



<li>The ball moves in a straight line: agent A decides a target point between x1 (0) and x2 (300) on side B (Agent B’s side), and the ball is then displayed at 50 intermediate positions at an FPS of 20, so it travels in a straight line from source to destination. The same applies to agent B.</li>



<li>Movement of Agent A and Agent B is bound between x1 = 100 and x2 = 600.</li>



<li>Movement of the ball is bound along the y-axis (y1 = 100 to y2 = 600).</li>



<li>Movement of the ball is bound along the x-axis (x1 = 100 to x2 = 600).</li>
</ol>



<p>Pytennis is an environment that mimics real-life tennis situations. As shown below, the image on the left is a <a href="https://youtu.be/iUYxZ2tYKHw" target="_blank" rel="noreferrer noopener nofollow">model-free</a> Pytennis game, and the one on the right is <a href="https://youtu.be/FCwGNRiq9SY" target="_blank" rel="noreferrer noopener nofollow">model-based</a>.&nbsp;</p>



<div class="wp-block-columns are-vertically-aligned-center is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow">
<figure class="wp-block-image size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/pytennis-model-free-1.png?ssl=1" alt="pytennis model free" class="wp-image-31095"/></figure>
</div>



<div class="wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow">
<figure class="wp-block-image size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/pytennis-model-based-1.png?ssl=1" alt="pytennis model based" class="wp-image-31098"/></figure>
</div>
</div>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-discrete-mathematical-approach-to-playing-tennis-model-free-reinforcement-learning">Discrete mathematical approach to playing tennis &#8211; model-free Reinforcement Learning</h2>



<p>Why “discrete mathematical approach to playing tennis”? Because this method is a logical implementation of the Pytennis environment.&nbsp;</p>



<p>The code below shows us the implementation of the ball movement on the lawn. You can find the source code <a href="https://github.com/elishatofunmi/pytennis-Discrete-Mathematics-Approach-" target="_blank" rel="noreferrer noopener nofollow">here</a>.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> time
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> numpy <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> np
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> pygame
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> sys
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#import seaborn as sns</span>

<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> pygame.locals <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> *
pygame.init()


<span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">Network</span>:</span>
   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">__init__</span><span class="hljs-params">(self, xmin, xmax, ymin, ymax)</span>:</span>
       <span class="hljs-string" style="color: rgb(221, 17, 68);">"""
       xmin: 150,
       xmax: 450,
       ymin: 100,
       ymax: 600
       """</span>

       self.StaticDiscipline = {
           <span class="hljs-string" style="color: rgb(221, 17, 68);">'xmin'</span>: xmin,
           <span class="hljs-string" style="color: rgb(221, 17, 68);">'xmax'</span>: xmax,
           <span class="hljs-string" style="color: rgb(221, 17, 68);">'ymin'</span>: ymin,
           <span class="hljs-string" style="color: rgb(221, 17, 68);">'ymax'</span>: ymax
       }

   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">network</span><span class="hljs-params">(self, xsource, ysource=<span class="hljs-number" style="color: teal;">100</span>, Ynew=<span class="hljs-number" style="color: teal;">600</span>, divisor=<span class="hljs-number" style="color: teal;">50</span>)</span>:</span>  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># ysource will always be 100</span>
       <span class="hljs-string" style="color: rgb(221, 17, 68);">"""
       For Network A
       ysource: will always be 100
       xsource: will always be between xmin and xmax (static discipline)
       For Network B
       ysource: will always be 600
       xsource: will always be between xmin and xmax (static discipline)
       """</span>

       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">while</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>:
           ListOfXsourceYSource = []
           Xnew = np.random.choice([i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(
               self.StaticDiscipline[<span class="hljs-string" style="color: rgb(221, 17, 68);">'xmin'</span>], self.StaticDiscipline[<span class="hljs-string" style="color: rgb(221, 17, 68);">'xmax'</span>])], <span class="hljs-number" style="color: teal;">1</span>)
           <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Ynew = np.random.choice([i for i in range(self.StaticDiscipline['ymin'], self.StaticDiscipline['ymax'])], 1)</span>

           source = (xsource, ysource)
           target = (Xnew[<span class="hljs-number" style="color: teal;">0</span>], Ynew)

           <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Slope and intercept</span>
           slope = (ysource - Ynew)/(xsource - Xnew[<span class="hljs-number" style="color: teal;">0</span>])
           intercept = ysource - (slope*xsource)
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> (slope != np.inf) <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">and</span> (intercept != np.inf):
               <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">break</span>
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
               <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">continue</span>

       <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#print(source, target)</span>
       <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># randomly select 50 new values along the slope between xsource and xnew (monotonically decreasing/increasing)</span>
       XNewList = [xsource]

       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> xsource &lt; Xnew:
           differences = Xnew[<span class="hljs-number" style="color: teal;">0</span>] - xsource
           increment = differences / divisor
           newXval = xsource
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(divisor):

               newXval += increment
               XNewList.append(int(newXval))
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
           differences = xsource - Xnew[<span class="hljs-number" style="color: teal;">0</span>]
           decrement = differences / divisor
           newXval = xsource
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(divisor):

               newXval -= decrement
               XNewList.append(int(newXval))

       <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># determine the values of y, from the new values of x, using y= mx + c</span>
       yNewList = []
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> XNewList:
           findy = (slope * i) + intercept  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># y = mx + c</span>
           yNewList.append(int(findy))

       ListOfXsourceYSource = [(x, y) <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> x, y <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> zip(XNewList, yNewList)]

       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> XNewList, yNewList
</pre>



<p>Here is how this works once the networks are initialized (Network A for Agent A and Network B for Agent B):</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Testing</span>
net = Network(<span class="hljs-number" style="color: teal;">150</span>, <span class="hljs-number" style="color: teal;">450</span>, <span class="hljs-number" style="color: teal;">100</span>, <span class="hljs-number" style="color: teal;">600</span>)
NetworkA = net.network(<span class="hljs-number" style="color: teal;">300</span>, ysource=<span class="hljs-number" style="color: teal;">100</span>, Ynew=<span class="hljs-number" style="color: teal;">600</span>)  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Network A</span>
NetworkB = net.network(<span class="hljs-number" style="color: teal;">200</span>, ysource=<span class="hljs-number" style="color: teal;">600</span>, Ynew=<span class="hljs-number" style="color: teal;">100</span>)  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Network B</span>
</pre>



<p>Each network is bounded by the directions of ball movement. Network A represents Agent A, which defines the movement of the ball from Agent A to any position between 100 and 300 along the x-axis at Agent B. This also applies to Network B (Agent B).</p>



<p>When the network is started, the .network method discretely generates 50 y-points (between y1 = 100 and y2 = 600) and the corresponding x-points (between x1, the current location of the ball on Agent A’s side, and a randomly selected point x2 on Agent B’s side) for Network A. The same applies to Network B (Agent B).&nbsp;</p>



<p>To automate the movement of each agent, the opposing agent has to move in a corresponding direction with respect to the ball. This is done by setting the x position of the opposing agent to the x position of the ball, as in the code below.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">playerax = ballx <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#When Agent A plays.</span>

playerbx = ballx <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#When Agent B plays.</span></pre>



<p>Meanwhile, the source agent has to move back to its default position from its current position. The code below illustrates this.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">DefaultToPosition</span><span class="hljs-params">(x1, x2=<span class="hljs-number" style="color: teal;">300</span>, divisor=<span class="hljs-number" style="color: teal;">50</span>)</span>:</span>
   XNewList = []
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> x1 &lt; x2:
       differences = x2 - x1
       increment = differences / divisor
       newXval = x1
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(divisor):
           newXval += increment
           XNewList.append(int(np.floor(newXval)))

   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
       differences = x1 - x2
       decrement = differences / divisor
       newXval = x1
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(divisor):
           newXval -= decrement
           XNewList.append(int(np.floor(newXval)))
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> XNewList
</pre>



<p>Now, to make the agents play with each other recursively, this has to run in a loop. After every 50 counts (a 50-frame display of the ball), the opposing player becomes the next player. The code below puts all of it together in a loop.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">main</span><span class="hljs-params">()</span>:</span>
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">while</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>:
       display()
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> nextplayer == <span class="hljs-string" style="color: rgb(221, 17, 68);">'A'</span>:
           <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># playerA should play</span>
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> count == <span class="hljs-number" style="color: teal;">0</span>:
               <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#playerax = lastxcoordinate</span>
               NetworkA = net.network(
                   lastxcoordinate, ysource=<span class="hljs-number" style="color: teal;">100</span>, Ynew=<span class="hljs-number" style="color: teal;">600</span>)  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Network A</span>
               out = DefaultToPosition(lastxcoordinate)

               <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># update lastxcoordinate</span>

               bally = NetworkA[<span class="hljs-number" style="color: teal;">1</span>][count]
               playerax = ballx <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#When Agent A plays.</span>
               count += <span class="hljs-number" style="color: teal;">1</span>
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#                 soundObj = pygame.mixer.Sound('sound/sound.wav')</span>
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#                 soundObj.play()</span>
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#                 time.sleep(0.3)</span>
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#                 soundObj.stop()</span>
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
               ballx = NetworkA[<span class="hljs-number" style="color: teal;">0</span>][count]
               bally = NetworkA[<span class="hljs-number" style="color: teal;">1</span>][count]
               playerbx = ballx
               playerax = out[count]
               count += <span class="hljs-number" style="color: teal;">1</span>

           <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># let playerB play after 50 new coordinate of ball movement</span>
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> count == <span class="hljs-number" style="color: teal;">49</span>:
               count = <span class="hljs-number" style="color: teal;">0</span>
               nextplayer = <span class="hljs-string" style="color: rgb(221, 17, 68);">'B'</span>
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
               nextplayer = <span class="hljs-string" style="color: rgb(221, 17, 68);">'A'</span>

       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
           <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># playerB can play</span>
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> count == <span class="hljs-number" style="color: teal;">0</span>:
               <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#playerbx = lastxcoordinate</span>
               NetworkB = net.network(
                   lastxcoordinate, ysource=<span class="hljs-number" style="color: teal;">600</span>, Ynew=<span class="hljs-number" style="color: teal;">100</span>)  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Network B</span>
               out = DefaultToPosition(lastxcoordinate)

               <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># update lastxcoordinate</span>
               bally = NetworkB[<span class="hljs-number" style="color: teal;">1</span>][count]
               playerbx = ballx
               count += <span class="hljs-number" style="color: teal;">1</span>

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#                 soundObj = pygame.mixer.Sound('sound/sound.wav')</span>
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#                 soundObj.play()</span>
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#                 time.sleep(0.3)</span>
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#                 soundObj.stop()</span>
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
               ballx = NetworkB[<span class="hljs-number" style="color: teal;">0</span>][count]
               bally = NetworkB[<span class="hljs-number" style="color: teal;">1</span>][count]
               playerbx = out[count]
               playerax = ballx
               count += <span class="hljs-number" style="color: teal;">1</span>
           <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># update lastxcoordinate</span>

           <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># let playerA play after 50 new coordinate of ball movement</span>
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> count == <span class="hljs-number" style="color: teal;">49</span>:
               count = <span class="hljs-number" style="color: teal;">0</span>
               nextplayer = <span class="hljs-string" style="color: rgb(221, 17, 68);">'A'</span>
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
               nextplayer = <span class="hljs-string" style="color: rgb(221, 17, 68);">'B'</span>

       <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># CHECK BALL MOVEMENT</span>
       DISPLAYSURF.blit(PLAYERA, (playerax, <span class="hljs-number" style="color: teal;">50</span>))
       DISPLAYSURF.blit(PLAYERB, (playerbx, <span class="hljs-number" style="color: teal;">600</span>))
       DISPLAYSURF.blit(ball, (ballx, bally))

       <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># update last coordinate</span>
       lastxcoordinate = ballx

       pygame.display.update()
       fpsClock.tick(FPS)

       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> event <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> pygame.event.get():

           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> event.type == QUIT:
               pygame.quit()
               sys.exit()
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span></pre>



<p>And this is basic model-free reinforcement learning. It’s model-free because you need no form of learning or modelling for the 2 agents to play simultaneously and accurately.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-tennis-game-using-deep-q-network-model-based-reinforcement-learning">Tennis game using Deep Q Network &#8211; model-based Reinforcement Learning</h2>



<p>A typical example of model-based reinforcement learning is the Deep Q Network. Source code to this work is available <a href="https://github.com/elishatofunmi/pytennis-Deep-Q-Network-DQN-">here</a>.&nbsp;</p>



<p>The code below illustrates the Deep Q Network, which is the model architecture for this work.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> keras <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> Sequential, layers
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> keras.optimizers <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> Adam
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> keras.layers <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> Dense
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> collections <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> deque
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> numpy <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> np



<span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">DQN</span>:</span>
   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">__init__</span><span class="hljs-params">(self)</span>:</span>
       self.learning_rate = <span class="hljs-number" style="color: teal;">0.001</span>
       self.momentum = <span class="hljs-number" style="color: teal;">0.95</span>
       self.eps_min = <span class="hljs-number" style="color: teal;">0.1</span>
       self.eps_max = <span class="hljs-number" style="color: teal;">1.0</span>
       self.eps_decay_steps = <span class="hljs-number" style="color: teal;">2000000</span>
       self.replay_memory_size = <span class="hljs-number" style="color: teal;">500</span>
       self.replay_memory = deque([], maxlen=self.replay_memory_size)
       n_steps = <span class="hljs-number" style="color: teal;">4000000</span> <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># total number of training steps</span>
       self.training_start = <span class="hljs-number" style="color: teal;">10000</span> <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># start training after 10,000 game iterations</span>
       self.training_interval = <span class="hljs-number" style="color: teal;">4</span> <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># run a training step every 4 game iterations</span>
       self.save_steps = <span class="hljs-number" style="color: teal;">1000</span> <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># save the model every 1,000 training steps</span>
       self.copy_steps = <span class="hljs-number" style="color: teal;">10000</span> <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># copy online DQN to target DQN every 10,000 training steps</span>
       self.discount_rate = <span class="hljs-number" style="color: teal;">0.99</span>
       self.skip_start = <span class="hljs-number" style="color: teal;">90</span> <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Skip the start of every game (it's just waiting time).</span>
       self.batch_size = <span class="hljs-number" style="color: teal;">100</span>
       self.iteration = <span class="hljs-number" style="color: teal;">0</span> <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># game iterations</span>
       self.done = <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span> <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># env needs to be reset</span>




       self.model = self.DQNmodel()

       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span>



   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">DQNmodel</span><span class="hljs-params">(self)</span>:</span>
       model = Sequential()
       model.add(Dense(<span class="hljs-number" style="color: teal;">64</span>, input_shape=(<span class="hljs-number" style="color: teal;">1</span>,), activation=<span class="hljs-string" style="color: rgb(221, 17, 68);">'relu'</span>))
       model.add(Dense(<span class="hljs-number" style="color: teal;">64</span>, activation=<span class="hljs-string" style="color: rgb(221, 17, 68);">'relu'</span>))
       model.add(Dense(<span class="hljs-number" style="color: teal;">10</span>, activation=<span class="hljs-string" style="color: rgb(221, 17, 68);">'softmax'</span>))
       model.compile(loss=<span class="hljs-string" style="color: rgb(221, 17, 68);">'categorical_crossentropy'</span>, optimizer=Adam(lr=self.learning_rate))
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> model


   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">sample_memories</span><span class="hljs-params">(self, batch_size)</span>:</span>
       indices = np.random.permutation(len(self.replay_memory))[:batch_size]
       cols = [[], [], [], [], []] <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># state, action, reward, next_state, continue</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> idx <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> indices:
           memory = self.replay_memory[idx]
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> col, value <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> zip(cols, memory):
               col.append(value)
       cols = [np.array(col) <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> col <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> cols]
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> (cols[<span class="hljs-number" style="color: teal;">0</span>], cols[<span class="hljs-number" style="color: teal;">1</span>], cols[<span class="hljs-number" style="color: teal;">2</span>].reshape(<span class="hljs-number" style="color: teal;">-1</span>, <span class="hljs-number" style="color: teal;">1</span>), cols[<span class="hljs-number" style="color: teal;">3</span>],cols[<span class="hljs-number" style="color: teal;">4</span>].reshape(<span class="hljs-number" style="color: teal;">-1</span>, <span class="hljs-number" style="color: teal;">1</span>))


   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">epsilon_greedy</span><span class="hljs-params">(self, q_values, step)</span>:</span>
       self.epsilon = max(self.eps_min, self.eps_max - (self.eps_max-self.eps_min) * step/self.eps_decay_steps)
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> np.random.rand() &lt; self.epsilon:
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> np.random.randint(<span class="hljs-number" style="color: teal;">10</span>) <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># random action</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> np.argmax(q_values) <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># optimal action</span>
</pre>



<p>In this case, we need a policy network to control the movement of each agent as they move along the x-axis. Since the positions span a wide range (from x1 = 100 to x2 = 300), we can’t reasonably have a model that predicts over 200 separate states.&nbsp;</p>



<p>To simplify this problem, we can split the range between x1 and x2 into 10 states / 10 actions, and define an upper and lower bound for each state.</p>



<p><strong>Note that we have 10 actions, because from a state there are 10 possibilities.</strong></p>



<p>The code below illustrates the definition of both upper and lower bounds for each state.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">evaluate_state_from_last_coordinate</span><span class="hljs-params">(self, c)</span>:</span>
       <span class="hljs-string" style="color: rgb(221, 17, 68);">"""
       cmax: 450
       cmin: 150

       c definately will be between 150 and 450.
       state0 - (150 - 179)
       state1 - (180 - 209)
       state2 - (210 - 239)
       state3 - (240 - 269)
       state4 - (270 - 299)
       state5 - (300 - 329)
       state6 - (330 - 359)
       state7 - (360 - 389)
       state8 - (390 - 419)
       state9 - (420 - 450)
       """</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> c &gt;= <span class="hljs-number" style="color: teal;">150</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">and</span> c &lt;= <span class="hljs-number" style="color: teal;">179</span>:
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> <span class="hljs-number" style="color: teal;">0</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> c &gt;= <span class="hljs-number" style="color: teal;">180</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">and</span> c &lt;= <span class="hljs-number" style="color: teal;">209</span>:
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> <span class="hljs-number" style="color: teal;">1</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> c &gt;= <span class="hljs-number" style="color: teal;">210</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">and</span> c &lt;= <span class="hljs-number" style="color: teal;">239</span>:
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> <span class="hljs-number" style="color: teal;">2</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> c &gt;= <span class="hljs-number" style="color: teal;">240</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">and</span> c &lt;= <span class="hljs-number" style="color: teal;">269</span>:
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> <span class="hljs-number" style="color: teal;">3</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> c &gt;= <span class="hljs-number" style="color: teal;">270</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">and</span> c &lt;= <span class="hljs-number" style="color: teal;">299</span>:
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> <span class="hljs-number" style="color: teal;">4</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> c &gt;= <span class="hljs-number" style="color: teal;">300</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">and</span> c &lt;= <span class="hljs-number" style="color: teal;">329</span>:
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> <span class="hljs-number" style="color: teal;">5</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> c &gt;= <span class="hljs-number" style="color: teal;">330</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">and</span> c &lt;= <span class="hljs-number" style="color: teal;">359</span>:
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> <span class="hljs-number" style="color: teal;">6</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> c &gt;= <span class="hljs-number" style="color: teal;">360</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">and</span> c &lt;= <span class="hljs-number" style="color: teal;">389</span>:
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> <span class="hljs-number" style="color: teal;">7</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> c &gt;= <span class="hljs-number" style="color: teal;">390</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">and</span> c &lt;= <span class="hljs-number" style="color: teal;">419</span>:
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> <span class="hljs-number" style="color: teal;">8</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> c &gt;= <span class="hljs-number" style="color: teal;">420</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">and</span> c &lt;= <span class="hljs-number" style="color: teal;">450</span>:
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> <span class="hljs-number" style="color: teal;">9</span>
</pre>
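

<p>For readability, the same coordinate-to-state mapping can be written more compactly. A minimal, equivalent sketch (each state is a 30-unit-wide band starting at 150, clamped so that 450 still falls into state 9):</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">def evaluate_state_from_last_coordinate(self, c):
    # c is the ball's x-coordinate, guaranteed to lie between 150 and 450;
    # integer-divide the offset from 150 by the band width of 30
    return min((c - 150) // 30, 9)
</pre>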



<p>The Deep Neural Network (DNN) used experimentally for this work has one input neuron (representing the previous state), two hidden layers of 64 neurons each, and an output layer of 10 neurons (a softmax selection over the 10 discrete states). This is shown below:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">DQNmodel</span><span class="hljs-params">(self)</span>:</span>
       model = Sequential()
       model.add(Dense(<span class="hljs-number" style="color: teal;">64</span>, input_shape=(<span class="hljs-number" style="color: teal;">1</span>,), activation=<span class="hljs-string" style="color: rgb(221, 17, 68);">'relu'</span>))
       model.add(Dense(<span class="hljs-number" style="color: teal;">64</span>, activation=<span class="hljs-string" style="color: rgb(221, 17, 68);">'relu'</span>))
       model.add(Dense(<span class="hljs-number" style="color: teal;">10</span>, activation=<span class="hljs-string" style="color: rgb(221, 17, 68);">'softmax'</span>))
       model.compile(loss=<span class="hljs-string" style="color: rgb(221, 17, 68);">'categorical_crossentropy'</span>, optimizer=Adam(lr=self.learning_rate))
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> model
</pre>



<p>Now that we have a DQN model that predicts the agent’s next state/action, and the Pytennis environment already handles the ball’s movement in a straight line, let’s write a function that carries out an agent’s action based on the DQN model’s prediction of its next state.&nbsp;</p>



<p>The detailed code below illustrates how Agent A decides where to direct the ball on Agent B’s side (and vice versa). It also evaluates whether Agent B was able to receive the ball.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">randomVal</span><span class="hljs-params">(self, action)</span>:</span>
       <span class="hljs-string" style="color: rgb(221, 17, 68);">"""
       cmax: 450
       cmin: 150

        c will definitely be between 150 and 450.
       state0 - (150 - 179)
       state1 - (180 - 209)
       state2 - (210 - 239)
       state3 - (240 - 269)
       state4 - (270 - 299)
       state5 - (300 - 329)
       state6 - (330 - 359)
       state7 - (360 - 389)
       state8 - (390 - 419)
       state9 - (420 - 450)
       """</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> action == <span class="hljs-number" style="color: teal;">0</span>:
           val = np.random.choice([i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">150</span>, <span class="hljs-number" style="color: teal;">180</span>)])
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> action == <span class="hljs-number" style="color: teal;">1</span>:
           val = np.random.choice([i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">180</span>, <span class="hljs-number" style="color: teal;">210</span>)])
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> action == <span class="hljs-number" style="color: teal;">2</span>:
           val = np.random.choice([i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">210</span>, <span class="hljs-number" style="color: teal;">240</span>)])
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> action == <span class="hljs-number" style="color: teal;">3</span>:
           val = np.random.choice([i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">240</span>, <span class="hljs-number" style="color: teal;">270</span>)])
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> action == <span class="hljs-number" style="color: teal;">4</span>:
           val = np.random.choice([i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">270</span>, <span class="hljs-number" style="color: teal;">300</span>)])
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> action == <span class="hljs-number" style="color: teal;">5</span>:
           val = np.random.choice([i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">300</span>, <span class="hljs-number" style="color: teal;">330</span>)])
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> action == <span class="hljs-number" style="color: teal;">6</span>:
           val = np.random.choice([i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">330</span>, <span class="hljs-number" style="color: teal;">360</span>)])
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> action == <span class="hljs-number" style="color: teal;">7</span>:
           val = np.random.choice([i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">360</span>, <span class="hljs-number" style="color: teal;">390</span>)])
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> action == <span class="hljs-number" style="color: teal;">8</span>:
           val = np.random.choice([i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">390</span>, <span class="hljs-number" style="color: teal;">420</span>)])
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
           val = np.random.choice([i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">420</span>, <span class="hljs-number" style="color: teal;">450</span>)])
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> val

   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">stepA</span><span class="hljs-params">(self, action, count=<span class="hljs-number" style="color: teal;">0</span>)</span>:</span>
       <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># playerA should play</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> count == <span class="hljs-number" style="color: teal;">0</span>:
           self.NetworkA = self.net.network(
               self.ballx, ysource=<span class="hljs-number" style="color: teal;">100</span>, Ynew=<span class="hljs-number" style="color: teal;">600</span>)  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Network A</span>
           self.bally = self.NetworkA[<span class="hljs-number" style="color: teal;">1</span>][count]
           self.ballx = self.NetworkA[<span class="hljs-number" style="color: teal;">0</span>][count]

           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> self.GeneralReward == <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>:
               self.playerax = self.randomVal(action)
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
               self.playerax = self.ballx


<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#             soundObj = pygame.mixer.Sound('sound/sound.wav')</span>
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#             soundObj.play()</span>
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#             time.sleep(0.4)</span>
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#             soundObj.stop()</span>

       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
           self.ballx = self.NetworkA[<span class="hljs-number" style="color: teal;">0</span>][count]
           self.bally = self.NetworkA[<span class="hljs-number" style="color: teal;">1</span>][count]

       obsOne = self.evaluate_state_from_last_coordinate(
           int(self.ballx))  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># last state of the ball</span>
       obsTwo = self.evaluate_state_from_last_coordinate(
           int(self.playerbx))  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># evaluate player bx</span>
       diff = np.abs(self.ballx - self.playerbx)
       obs = obsTwo
       reward = self.evaluate_action(diff)
       done = <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>
       info = str(diff)

       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> obs, reward, done, info


   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">evaluate_action</span><span class="hljs-params">(self, diff)</span>:</span>

       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> (int(diff) &lt;= <span class="hljs-number" style="color: teal;">30</span>):
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">False</span>
</pre>



<p>From the code above, the function stepA is executed whenever Agent A has to play. While playing, Agent A uses the next action predicted by the DQN to estimate the target position x2 on Agent B’s side, starting from the ball’s current position x1 on its own side, and uses the ball-trajectory network provided by the Pytennis environment to make its move.&nbsp;</p>



<p>Agent A, for example, obtains a precise point x2 on Agent B’s side by using the function <strong>randomVal</strong>, shown above, to randomly select a coordinate x2 within the range corresponding to the action chosen by the DQN.&nbsp;</p>
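

<p>For instance, a minimal usage sketch (assuming <strong>env</strong> is an instance of the environment class above; the variable name is hypothetical):</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"># action 3 corresponds to state3, i.e. a random x-coordinate between 240 and 269
x2 = env.randomVal(3)
</pre>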



<p>Finally, stepA evaluates Agent B’s response to the target point x2 using the function <strong>evaluate_action</strong>, which decides whether Agent B should be rewarded or penalized (a reward if the ball lands within 30 units of the player). Everything described here for Agent A playing to Agent B applies equally for Agent B playing to Agent A (the same code with different variable names).</p>



<p>Now that we have the policy, reward, environment, states, and actions correctly defined, we can go ahead and let the two agents play the game against each other in a loop.&nbsp;</p>



<p>The code below shows how the agents take turns, each turn lasting 50 ball displays (counts 0 through 49). Note that for each ball display, the DQN makes a decision on where to toss the ball for the next agent to play.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">while</span> iteration &lt; iterations:

           self.display()
           self.randNumLabelA = self.myFontA.render(
               <span class="hljs-string" style="color: rgb(221, 17, 68);">'A (Win): '</span>+str(self.updateRewardA) + <span class="hljs-string" style="color: rgb(221, 17, 68);">', A(loss): '</span>+str(self.lossA), <span class="hljs-number" style="color: teal;">1</span>, self.BLACK)
           self.randNumLabelB = self.myFontB.render(
               <span class="hljs-string" style="color: rgb(221, 17, 68);">'B (Win): '</span>+str(self.updateRewardB) + <span class="hljs-string" style="color: rgb(221, 17, 68);">', B(loss): '</span> + str(self.lossB), <span class="hljs-number" style="color: teal;">1</span>, self.BLACK)
           self.randNumLabelIter = self.myFontIter.render(
               <span class="hljs-string" style="color: rgb(221, 17, 68);">'Iterations: '</span>+str(self.updateIter), <span class="hljs-number" style="color: teal;">1</span>, self.BLACK)

           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> nextplayer == <span class="hljs-string" style="color: rgb(221, 17, 68);">'A'</span>:

               <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> count == <span class="hljs-number" style="color: teal;">0</span>:
                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Online DQN evaluates what to do</span>
                   q_valueA = self.AgentA.model.predict([stateA])
                   actionA = self.AgentA.epsilon_greedy(q_valueA, iteration)

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Online DQN plays</span>
                   obsA, rewardA, doneA, infoA = self.stepA(
                       action=actionA, count=count)
                   next_stateA = actionA

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Let's memorize what just happened</span>
                   self.AgentA.replay_memory.append(
                       (stateA, actionA, rewardA, next_stateA, <span class="hljs-number" style="color: teal;">1.0</span> - doneA))
                   stateA = next_stateA

               <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> count == <span class="hljs-number" style="color: teal;">49</span>:

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Online DQN evaluates what to do</span>
                   q_valueA = self.AgentA.model.predict([stateA])
                   actionA = self.AgentA.epsilon_greedy(q_valueA, iteration)
                   obsA, rewardA, doneA, infoA = self.stepA(
                       action=actionA, count=count)
                   next_stateA = actionA

                   self.updateRewardA += rewardA
                   self.computeLossA(rewardA)

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Let's memorize what just happened</span>
                   self.AgentA.replay_memory.append(
                       (stateA, actionA, rewardA, next_stateA, <span class="hljs-number" style="color: teal;">1.0</span> - doneA))

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># restart the game if player A fails to get the ball, and let B start the game</span>
                   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> rewardA == <span class="hljs-number" style="color: teal;">0</span>:
                       self.restart = <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>
                       time.sleep(<span class="hljs-number" style="color: teal;">0.5</span>)
                       nextplayer = <span class="hljs-string" style="color: rgb(221, 17, 68);">'B'</span>
                       self.GeneralReward = <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">False</span>
                   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
                       self.restart = <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">False</span>
                       self.GeneralReward = <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Sample memories and use the target DQN to produce the target Q-Value</span>
                   X_state_val, X_action_val, rewards, X_next_state_val, continues = (
                       self.AgentA.sample_memories(self.AgentA.batch_size))
                   next_q_values = self.AgentA.model.predict(
                       [X_next_state_val])
                   max_next_q_values = np.max(
                       next_q_values, axis=<span class="hljs-number" style="color: teal;">1</span>, keepdims=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)
                   y_val = rewards + continues * self.AgentA.discount_rate * max_next_q_values

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Train the online DQN</span>
                   self.AgentA.model.fit(X_state_val, tf.keras.utils.to_categorical(
                       X_next_state_val, num_classes=<span class="hljs-number" style="color: teal;">10</span>), verbose=<span class="hljs-number" style="color: teal;">0</span>)

                   nextplayer = <span class="hljs-string" style="color: rgb(221, 17, 68);">'B'</span>
                   self.updateIter += <span class="hljs-number" style="color: teal;">1</span>

                   count = <span class="hljs-number" style="color: teal;">0</span>
                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># evaluate A</span>

               <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Online DQN evaluates what to do</span>
                   q_valueA = self.AgentA.model.predict([stateA])
                   actionA = self.AgentA.epsilon_greedy(q_valueA, iteration)

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Online DQN plays</span>
                   obsA, rewardA, doneA, infoA = self.stepA(
                       action=actionA, count=count)
                   next_stateA = actionA

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Let's memorize what just happened</span>
                   self.AgentA.replay_memory.append(
                       (stateA, actionA, rewardA, next_stateA, <span class="hljs-number" style="color: teal;">1.0</span> - doneA))
                   stateA = next_stateA

               <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> nextplayer == <span class="hljs-string" style="color: rgb(221, 17, 68);">'A'</span>:
                   count += <span class="hljs-number" style="color: teal;">1</span>
               <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
                   count = <span class="hljs-number" style="color: teal;">0</span>

           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
               <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> count == <span class="hljs-number" style="color: teal;">0</span>:
                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Online DQN evaluates what to do</span>
                   q_valueB = self.AgentB.model.predict([stateB])
                   actionB = self.AgentB.epsilon_greedy(q_valueB, iteration)

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Online DQN plays</span>
                   obsB, rewardB, doneB, infoB = self.stepB(
                       action=actionB, count=count)
                   next_stateB = actionB

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Let's memorize what just happened</span>
                   self.AgentB.replay_memory.append(
                       (stateB, actionB, rewardB, next_stateB, <span class="hljs-number" style="color: teal;">1.0</span> - doneB))
                   stateB = next_stateB

               <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">elif</span> count == <span class="hljs-number" style="color: teal;">49</span>:

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Online DQN evaluates what to do</span>
                   q_valueB = self.AgentB.model.predict([stateB])
                   actionB = self.AgentB.epsilon_greedy(q_valueB, iteration)

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Online DQN plays</span>
                    obsB, rewardB, doneB, infoB = self.stepB(
                        action=actionB, count=count)
                   next_stateB = actionB

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Let's memorize what just happened</span>
                   self.AgentB.replay_memory.append(
                       (stateB, actionB, rewardB, next_stateB, <span class="hljs-number" style="color: teal;">1.0</span> - doneB))

                   stateB = next_stateB
                   self.updateRewardB += rewardB
                   self.computeLossB(rewardB)

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># restart the game if player A fails to get the ball, and let B start the game</span>
                   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> rewardB == <span class="hljs-number" style="color: teal;">0</span>:
                       self.restart = <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>
                       time.sleep(<span class="hljs-number" style="color: teal;">0.5</span>)
                       self.GeneralReward = <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">False</span>
                       nextplayer = <span class="hljs-string" style="color: rgb(221, 17, 68);">'A'</span>
                   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
                       self.restart = <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">False</span>
                       self.GeneralReward = <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Sample memories and use the target DQN to produce the target Q-Value</span>
                   X_state_val, X_action_val, rewards, X_next_state_val, continues = (
                       self.AgentB.sample_memories(self.AgentB.batch_size))
                   next_q_values = self.AgentB.model.predict(
                       [X_next_state_val])
                   max_next_q_values = np.max(
                       next_q_values, axis=<span class="hljs-number" style="color: teal;">1</span>, keepdims=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)
                   y_val = rewards + continues * self.AgentB.discount_rate * max_next_q_values

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Train the online DQN</span>
                   self.AgentB.model.fit(X_state_val, tf.keras.utils.to_categorical(
                       X_next_state_val, num_classes=<span class="hljs-number" style="color: teal;">10</span>), verbose=<span class="hljs-number" style="color: teal;">0</span>)

                   nextplayer = <span class="hljs-string" style="color: rgb(221, 17, 68);">'A'</span>
                   self.updateIter += <span class="hljs-number" style="color: teal;">1</span>
                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># evaluate B</span>

               <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Online DQN evaluates what to do</span>
                   q_valueB = self.AgentB.model.predict([stateB])
                   actionB = self.AgentB.epsilon_greedy(q_valueB, iteration)

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Online DQN plays</span>
                   obsB, rewardB, doneB, infoB = self.stepB(
                       action=actionB, count=count)
                   next_stateB = actionB

                   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Let's memorize what just happened</span>
                   self.AgentB.replay_memory.append(
                       (stateB, actionB, rewardB, next_stateB, <span class="hljs-number" style="color: teal;">1.0</span> - doneB))
                    stateB = next_stateB

               <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> nextplayer == <span class="hljs-string" style="color: rgb(221, 17, 68);">'B'</span>:
                   count += <span class="hljs-number" style="color: teal;">1</span>
               <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
                   count = <span class="hljs-number" style="color: teal;">0</span>

           iteration += <span class="hljs-number" style="color: teal;">1</span>
</pre>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-comparison-evaluation">Comparison/Evaluation</h2>



<p>Having played this game with both the model-free and the model-based approach, here are some differences to be aware of:</p>



<div id="separator-block_e30ebb678c4ba8377841bb9b0f568051"
         class="block-separator block-separator--10">
</div>



<div id="medium-table-block_4f8b1641178df01ef01ce7d5e521a2de"
     class="block-medium-table c-table__outer-wrapper  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset">

    <table class="c-table">
                    <thead class="c-table__head">
            <tr>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            s/n                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Model-free                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Model-based                        </div>
                    </td>
                            </tr>
            </thead>
        
        <tbody class="c-table__body">

                    
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>1</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>rewards are not accounted for (since this is automated, reward = 1)</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>rewards are accounted for</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>2</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>no modelling (no decision policy is required)</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>modelling is required (policy network)</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>3</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>this doesn’t require the use of initial states to predict the next state</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>this requires the use of initial states to predict the next state using the policy network</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>4</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>the rate of missing the ball with respect to time is zero</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>the rate of missing the ball with respect to time approaches zero</p>
                                                            </div>
                        </td>

                    
                </tr>

                    
        </tbody>
    </table>

</div>



<div id="separator-block_084b1a962f175a81255dd373244522d0"
         class="block-separator block-separator--25">
</div>



<p>If you’re interested, the videos below show these two techniques in action playing tennis games:</p>



<p>1. Model-free </p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="A reinforcement learning based tennis game - Discrete mathematics approach" width="500" height="281" src="https://www.youtube.com/embed/iUYxZ2tYKHw?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p>2. Model-based</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="An RL implementation of pytennis using Deep Q Network (DQN) - early stage of Learning" width="500" height="281" src="https://www.youtube.com/embed/FCwGNRiq9SY?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-conclusion">Conclusion</h2>



<p>Tennis might be simple compared to self-driving cars, but hopefully this example showed you a few things about RL that you didn’t know.&nbsp;</p>



<p>The main difference between model-free and model-based RL here is the policy network, which model-based RL requires and model-free RL does not.&nbsp;</p>



<p>It’s also worth noting that model-based RL often needs a massive amount of training time before the DNN learns the states reliably, without getting them wrong.</p>



<p>But every technique has its advantages and drawbacks; choosing the right one depends on what exactly you need your program to do.&nbsp;</p>



<p>Thanks for reading. I’ve left a few additional references below for you to follow if you want to explore this topic more.</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-references">References</h3>



<ol class="wp-block-list">
<li>AlphaGo documentary: <a href="https://www.youtube.com/watch?v=WXuK6gekU1Y" target="_blank" rel="noreferrer noopener nofollow">https://www.youtube.com/watch?v=WXuK6gekU1Y</a></li>



<li>List of reinforcement learning environments: <a href="https://medium.com/@mauriciofadelargerich/reinforcement-learning-environments-cff767bc241f" target="_blank" rel="noreferrer noopener nofollow">https://medium.com/@mauriciofadelargerich/reinforcement-learning-environments-cff767bc241f</a> </li>



<li>Create your own reinforcement learning environment: <a href="https://towardsdatascience.com/create-your-own-reinforcement-learning-environment-beb12f4151ef" target="_blank" rel="noreferrer noopener nofollow">https://towardsdatascience.com/create-your-own-reinforcement-learning-environment-beb12f4151ef</a> </li>



<li>Types of RL Environments: <a href="https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781838649777/1/ch01lvl1sec14/types-of-rl-environment" target="_blank" rel="noreferrer noopener nofollow">https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781838649777/1/ch01lvl1sec14/types-of-rl-environment</a></li>



<li>Model-based Deep Q Network: <a href="https://github.com/elishatofunmi/pytennis-Deep-Q-Network-DQN-" target="_blank" rel="noreferrer noopener nofollow">https://github.com/elishatofunmi/pytennis-Deep-Q-Network-DQN</a></li>



<li>Discrete mathematics approach youtube video: <a href="https://youtu.be/iUYxZ2tYKHw" target="_blank" rel="noreferrer noopener nofollow">https://youtu.be/iUYxZ2tYKHw</a></li>



<li>Deep Q Network approach YouTube video: <a href="https://youtu.be/FCwGNRiq9SY" target="_blank" rel="noreferrer noopener nofollow">https://youtu.be/FCwGNRiq9SY</a></li>



<li>Model-free discrete mathematics implementation: <a href="https://github.com/elishatofunmi/pytennis-Discrete-Mathematics-Approach-" target="_blank" rel="noreferrer noopener nofollow">https://github.com/elishatofunmi/pytennis-Discrete-Mathematics-Approach-</a></li>



<li>Hands-on Machine Learning with scikit-learn and TensorFlow: <a href="https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291" target="_blank" rel="noreferrer noopener nofollow">https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291</a></li>
</ol>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">3004</post-id>	</item>
		<item>
		<title>The Best Tools for Reinforcement Learning in Python You Actually Want to Try</title>
		<link>https://neptune.ai/blog/the-best-tools-for-reinforcement-learning-in-python</link>
		
		<dc:creator><![CDATA[Vladimir Lyashenko]]></dc:creator>
		<pubDate>Thu, 21 Jul 2022 13:13:10 +0000</pubDate>
				<category><![CDATA[ML Tools]]></category>
		<category><![CDATA[Reinforcement Learning]]></category>
		<guid isPermaLink="false">https://neptune.test/the-best-tools-for-reinforcement-learning-in-python/</guid>

					<description><![CDATA[Nowadays, Deep Reinforcement Learning (RL) is one of the hottest topics in the Data Science community. The fast development of RL has resulted in the growing demand for easy to understand and convenient to use RL tools. In recent years, plenty of RL libraries have been developed. These libraries were designed to have all the&#8230;]]></description>
										<content:encoded><![CDATA[
<p>Nowadays, <a href="https://medium.com/ai%C2%B3-theory-practice-business/reinforcement-learning-part-1-a-brief-introduction-a53a849771cf" target="_blank" rel="noreferrer noopener nofollow">Deep Reinforcement Learning</a> (RL) is one of the hottest topics in the Data Science community. The fast development of RL has resulted in a growing demand for easy-to-understand and convenient-to-use RL tools.</p>



<p>In recent years, plenty of RL libraries have been developed. These libraries were designed to have all the necessary tools to both implement and test <strong>Reinforcement Learning</strong> models.</p>



<p>Still, they differ quite a lot. That’s why it is important to pick a library that will be quick, reliable, and relevant for your RL task.</p>



<p>In this article we will cover:</p>



<ul class="wp-block-list">
<li>Criteria for choosing <strong>Deep Reinforcement Learning</strong> library,</li>



<li>RL libraries: <strong>Pyqlearning</strong>, <strong>KerasRL</strong>, <strong>Tensorforce</strong>, <strong>RL_Coach</strong>, <strong>TFAgents</strong>, <strong>MAME RL</strong>, <strong>MushroomRL</strong>.</li>
</ul>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-tools.png?ssl=1" alt="RL tools" class="wp-image-30814"/></figure>
</div>


<h2 class="wp-block-heading" class="wp-block-heading" id="h-python-libraries-for-reinforcement-learning">Python libraries for Reinforcement Learning</h2>



<p>There are a lot of RL libraries, so choosing the right one for your case might be a complicated task. We need to form criteria to evaluate each library.</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-criteria"><strong>Criteria</strong></h3>



<p>Each RL library in this article will be analyzed based on the following criteria:</p>



<ol class="wp-block-list">
<li>Number of <strong>state-of-the-art</strong> (<strong>SOTA</strong>) RL algorithms implemented &#8211; the most important criterion, in my opinion</li>



<li>Official<strong> documentation, availability of simple tutorials</strong> and examples</li>



<li><strong>Readable code that is easy to customize</strong>&nbsp;</li>



<li>Number of <strong>supported</strong> environments &#8211; a crucial decision factor for a <strong>Reinforcement Learning</strong> library</li>



<li><strong>Logging and tracking tools</strong> support &#8211; for example, Neptune or TensorBoard</li>



<li><strong>Vectorized environment</strong> (<strong>VE</strong>) feature &#8211; a way to run multiple copies of the environment in parallel (often in separate processes), so your agent experiences far more situations per unit of time than with a single environment; see the short sketch after this list</li>



<li><strong>Regular updates </strong>&#8211; RL develops quite rapidly and you want to use up-to-date technologies</li>
</ol>
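

<p>To make the vectorized environment idea concrete, here is a minimal sketch using OpenAI Gym’s vectorized API (assuming a classic Gym version that provides <strong>gym.vector.make</strong>); the libraries below ship their own wrappers around the same idea:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import gym

# four CartPole environments stepped in lockstep
envs = gym.vector.make("CartPole-v1", num_envs=4)

observations = envs.reset()                  # batched observations, one row per environment
for _ in range(100):
    actions = envs.action_space.sample()     # one action per environment
    observations, rewards, dones, infos = envs.step(actions)
envs.close()
</pre>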



<p>We will talk about the following libraries:</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-kerasrl">KerasRL</h2>



<p><a href="https://github.com/keras-rl/keras-rl" target="_blank" rel="noreferrer noopener nofollow"><strong>KerasRL</strong></a> is a <strong>Deep Reinforcement Learning </strong>Python library. It implements some state-of-the-art RL algorithms, and seamlessly integrates with <strong>Deep Learning</strong> library <strong><a href="/integrations/keras" target="_blank" rel="noreferrer noopener">Keras</a></strong>.</p>



<p>Moreover, <strong>KerasRL</strong> works with <a href="https://gym.openai.com" target="_blank" rel="noreferrer noopener nofollow">OpenAI Gym</a> out of the box. This means you can evaluate and play around with different algorithms quite easily.</p>



<p>To install <strong>KerasRL</strong> simply use a pip command:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">pip install keras-rl</pre>



<p>Let’s see if <strong>KerasRL</strong> fits the criteria:</p>



<ol class="wp-block-list">
<li>Number of <strong>SOTA </strong>RL algorithms implemented</li>
</ol>



<p>As of today <strong>KerasRL</strong> has the following algorithms implemented:</p>



<ul class="wp-block-list">
<li><strong>Deep Q-Learning</strong> (<strong>DQN</strong>) and its improvements (<strong>Double</strong> and <strong>Dueling</strong>)</li>



<li><strong>Deep Deterministic Policy Gradient </strong>(<strong>DDPG</strong>)</li>



<li><strong>Continuous DQN</strong> (<strong>CDQN</strong> or <strong>NAF</strong>)</li>



<li><strong>Cross-Entropy Method</strong> (<strong>CEM</strong>)</li>



<li><strong>Deep SARSA</strong></li>
</ul>



<p>As you may have noticed, <strong>KerasRL</strong> misses two important agents: Actor-Critic Methods and Proximal Policy Optimization (PPO).</p>



<ol start="2" class="wp-block-list">
<li>Official documentation, availability of tutorials and examples</li>
</ol>



<p>The code is easy to read and it’s full of comments, which is quite useful. Still, the documentation seems incomplete as it misses the explanation of parameters and tutorials. Also, practical examples leave much to be desired.</p>



<ol start="3" class="wp-block-list">
<li>Readable code that is easy to customize&nbsp;</li>
</ol>



<p>Very easy. All you need to do is to create a new agent following the example and then add it to <strong>rl.agents</strong>.</p>



<ol start="4" class="wp-block-list">
<li>Number of supported environments</li>
</ol>



<p><strong>KerasRL </strong>was made to work only with <strong>OpenAI Gym</strong>. Therefore you need to modify the agent if you want to use any other environment.</p>



<ol start="5" class="wp-block-list">
<li>Logging and tracking tools support</li>
</ol>



<p>Logging and tracking tools support is not implemented. Nevertheless, you can use <a href="/" target="_blank" rel="noreferrer noopener">neptune.ai </a>to track your experiments.</p>
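

<p>For example, a minimal sketch of logging per-episode rewards to Neptune (assuming a recent <strong>neptune</strong> client, that your API token is set in the environment, and that the project name and the training function are placeholders):</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import neptune

# assumes NEPTUNE_API_TOKEN is set and the project exists (placeholder name)
run = neptune.init_run(project="your-workspace/rl-experiments")
run["parameters"] = {"algorithm": "DQN", "learning_rate": 1e-3}

for episode in range(100):
    episode_reward = train_one_episode()      # hypothetical training function
    run["train/episode_reward"].append(episode_reward)

run.stop()
</pre>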



<ol start="6" class="wp-block-list">
<li>Vectorized environment feature</li>
</ol>



<p>Includes a vectorized environment feature.</p>



<ol start="7" class="wp-block-list">
<li>Regular updates</li>
</ol>



<p>The library seems not to be maintained anymore as the last updates were more than a year ago.</p>



<p>To sum up, <strong>KerasRL</strong> has a good set of implementations. Unfortunately, it lacks valuable features such as visualization tools, newer architectures, and regular updates. You should probably use another library.&nbsp;</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-pyqlearning">Pyqlearning</h2>



<p><a href="https://pypi.org/project/pyqlearning/" target="_blank" rel="noreferrer noopener nofollow">Pyqlearning</a> is a Python library to implement RL. It focuses on <strong>Q-Learning</strong> and <strong>multi-agent Deep Q-Network.</strong><br><br><strong>Pyqlearning</strong> provides components for designers, not for end user state-of-the-art black boxes. Thus, this library is a tough one to use. You can use it to design the information search algorithm, for example, GameAI or web crawlers.</p>



<p>To install <strong>Pyqlearning</strong> simply use a pip command:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">pip install pyqlearning</pre>



<p>Let’s see if <strong>Pyqlearning</strong> fits the criteria:</p>



<ol class="wp-block-list">
<li>Number of <strong>SOTA </strong>RL algorithms implemented</li>
</ol>



<p>As of today <strong>Pyqlearning</strong> has the following algorithms implemented:</p>



<ul class="wp-block-list">
<li><strong>Deep Q-Learning</strong> (<strong>DQN</strong>) and its improvements (<strong>Epsilon Greedy</strong> and <strong>Boltzmann</strong>)</li>
</ul>



<p>As you may have noticed, <strong>Pyqlearning</strong> has only one important agent. The library leaves much to be desired.</p>



<ol start="2" class="wp-block-list">
<li>Official documentation, availability of tutorials and examples</li>
</ol>



<p><strong>Pyqlearning </strong>has a couple of examples for various tasks and two tutorials featuring Maze Solving and the pursuit-evasion game by <strong>Deep Q-Network</strong>. You may find them in the <a href="https://code.accel-brain.com/Reinforcement-Learning/" target="_blank" rel="noreferrer noopener nofollow">official documentation</a>. The documentation seems incomplete as it focuses on the math, and not the library’s description and usage.</p>



<ol start="3" class="wp-block-list">
<li>Readable code that is easy to customize&nbsp;</li>
</ol>



<p><strong>Pyqlearning</strong> is an open-source library. Source code can be found on <a href="https://github.com/chimera0/accel-brain-code/tree/master/Reinforcement-Learning" target="_blank" rel="noreferrer noopener nofollow">Github</a>. The code lacks comments. It may be a complicated task to customize it. Still, the tutorials might help.</p>



<ol start="4" class="wp-block-list">
<li>Number of supported environments</li>
</ol>



<p>Since the library is environment-agnostic, it’s relatively easy to use it with any environment.</p>



<ol start="5" class="wp-block-list">
<li>Logging and tracking tools support</li>
</ol>



<p>The author uses a simple<strong> logging</strong> package in the tutorials. <strong>Pyqlearning</strong> does not support other logging and tracking tools, for example, <strong><a href="/vs/tensorboard" target="_blank" rel="noreferrer noopener">TensorBoard</a></strong>.&nbsp;</p>



<ol start="6" class="wp-block-list">
<li>Vectorized environment feature</li>
</ol>



<p>Pyqlearning does not support the vectorized environment feature.</p>



<ol start="7" class="wp-block-list">
<li>Regular updates</li>
</ol>



<p>The library is maintained. The last update was made two months ago. Still, the development process seems to be a slow-going one.</p>



<p>To sum up, <strong>Pyqlearning</strong> leaves much to be desired. It is not a library that you will use commonly. Thus, you should probably use something else.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-tensorforce">Tensorforce</h2>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Reinforce.png?ssl=1" alt="Reinforce" class="wp-image-30816" style="width:200px;height:157px"/></figure>
</div>


<p><a href="https://github.com/tensorforce/tensorforce" target="_blank" rel="noreferrer noopener nofollow">Tensorforce</a> is an open-source <strong>Deep </strong>RL library built on Google’s <strong>Tensorflow</strong> framework. It’s straightforward in its usage and has a potential to be one of the best <strong>Reinforcement Learning</strong> libraries.</p>



<p><strong>Tensorforce</strong> has key design choices that differentiate it from other RL libraries:</p>



<ul class="wp-block-list">
<li>Modular component-based design: Feature implementations, above all, tend to be as generally applicable and configurable as possible.</li>



<li>Separation of RL algorithm and application: Algorithms are agnostic to the type and structure of inputs (states/observations) and outputs (actions/decisions), as well as the interaction with the application environment.</li>
</ul>



<p>To install <strong>Tensorforce</strong> simply use a pip command:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">pip install tensorforce</pre>



<p>Let’s see if <strong>Tensorforce</strong> fits the criteria:</p>



<ol class="wp-block-list">
<li>Number of <strong>SOTA </strong>RL algorithms implemented</li>
</ol>



<p>As of today, <strong>Tensorforce</strong> has the following set of algorithms implemented:</p>



<ul class="wp-block-list">
<li><strong>Deep Q-Learning</strong> (<strong>DQN</strong>) and its improvements (<strong>Double</strong> and <strong>Dueling</strong>)</li>



<li><strong>Vanilla Policy Gradient</strong> (<strong>PG</strong>)</li>



<li><strong>Deep Deterministic Policy Gradient </strong>(<strong>DDPG</strong>)</li>



<li><strong>Continuous DQN</strong> (<strong>CDQN</strong> or <strong>NAF</strong>)</li>



<li><strong>Actor Critic </strong>(<strong>A2C and A3C</strong>)</li>



<li><strong>Trust Region Policy Optimization</strong> (<strong>TRPO</strong>)</li>



<li><strong>Proximal Policy Optimization</strong> (<strong>PPO</strong>)</li>
</ul>



<p>As you may have noticed, <strong>Tensorforce</strong> lacks a <strong>Soft Actor-Critic</strong> (<strong>SAC</strong>) implementation. Apart from that, the coverage is excellent.</p>



<ol start="2" class="wp-block-list">
<li>Official documentation, availability of tutorials and examples</li>
</ol>



<p>It is quite easy to start using <strong>Tensorforce</strong> thanks to the variety of simple examples and tutorials. The <a href="https://tensorforce.readthedocs.io/en/latest/index.html" target="_blank" rel="noreferrer noopener nofollow">official documentation</a> seems complete and convenient to navigate through.</p>



<ol start="3" class="wp-block-list">
<li>Readable code that is easy to customize</li>
</ol>



<p><strong>Tensorforce</strong> benefits from its modular design. Each part of the architecture, for example, networks, models, and runners, is distinct, so you can easily modify them. However, the code lacks comments, and that could be a problem.</p>



<ol start="4" class="wp-block-list">
<li>Number of supported environments</li>
</ol>



<p><strong>Tensorforce</strong> works with multiple environments, for example, <strong>OpenAI Gym</strong>, <strong>OpenAI Retro</strong> and <strong>DeepMind Lab</strong>. It also has documentation to help you plug into other environments.</p>



<ol start="5" class="wp-block-list">
<li>Logging and tracking tools support</li>
</ol>



<p>The library supports <strong>TensorBoard</strong> and other logging/tracking tools.</p>



<ol start="6" class="wp-block-list">
<li>Vectorized environment feature</li>
</ol>



<p><strong>Tensorforce</strong> supports the vectorized environment feature.</p>



<ol start="7" class="wp-block-list">
<li>Regular updates</li>
</ol>



<p><strong>Tensorforce</strong> is regularly updated. The last update was just a few weeks ago.</p>



<p>To sum up, <strong>Tensorforce</strong> is a powerful RL tool. It is up-to-date and has all necessary documentation for you to start working with it.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-rl_coach">RL_Coach</h2>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-coach.png?ssl=1" alt="RL coach" class="wp-image-30818" style="width:256px;height:235px"/></figure>
</div>


<p><a href="https://github.com/NervanaSystems/coach" target="_blank" rel="noreferrer noopener nofollow">Reinforcement Learning Coach</a> (<strong>Coach</strong>) by Intel AI Lab is a Python RL framework containing many state-of-the-art algorithms.&nbsp;</p>



<p>It exposes a set of easy-to-use APIs for experimenting with new RL algorithms. The components of the library, for example, algorithms, environments, and neural network architectures, are modular. Thus, extending and reusing existing components is fairly painless.</p>



<p>To install <strong>Coach</strong> simply use a pip command.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">pip install rl_coach</pre>



<p>Still, you should check the <a href="https://github.com/NervanaSystems/coach#installation" target="_blank" rel="noreferrer noopener nofollow">official installation tutorial</a> as a few prerequisites are required.</p>
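

<p>Coach is typically driven through presets. A minimal sketch of running a built-in preset from the command line (assuming the <strong>coach</strong> entry point is on your path and that the <strong>CartPole_DQN</strong> preset name exists in your version):</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"># run the bundled CartPole DQN preset; -r renders the environment
coach -p CartPole_DQN -r</pre>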



<p>Let’s see if <strong>Coach</strong> fits the criteria:</p>



<ol class="wp-block-list">
<li>Number of <strong>SOTA </strong>RL algorithms implemented</li>
</ol>



<p>As of today, <strong>RL_Coach</strong> has the following set of algorithms implemented:</p>



<ul class="wp-block-list">
<li><a href="https://nervanasystems.github.io/coach/components/agents/policy_optimization/ac.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Actor-Critic</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/policy_optimization/acer.html" target="_blank" rel="noreferrer noopener nofollow"><strong>ACER</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/imitation/bc.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Behavioral Cloning</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/value_optimization/bs_dqn.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Bootstrapped DQN</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/value_optimization/categorical_dqn.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Categorical DQN</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/imitation/cil.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Conditional Imitation Learning</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/policy_optimization/cppo.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Clipped Proximal Policy Optimization</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/policy_optimization/ddpg.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Deep Deterministic Policy Gradient</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/other/dfp.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Direct Future Prediction</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/value_optimization/double_dqn.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Double DQN</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/value_optimization/dqn.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Deep Q Networks</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/value_optimization/dueling_dqn.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Dueling DQN</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/value_optimization/mmc.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Mixed Monte Carlo</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/value_optimization/n_step.html" target="_blank" rel="noreferrer noopener nofollow"><strong>N-Step Q Learning</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/value_optimization/naf.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Normalized Advantage Functions</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/value_optimization/nec.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Neural Episodic Control</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/value_optimization/pal.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Persistent Advantage Learning</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/policy_optimization/pg.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Policy Gradient</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/policy_optimization/ppo.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Proximal Policy Optimization</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/value_optimization/rainbow.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Rainbow</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/value_optimization/qr_dqn.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Quantile Regression DQN</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/policy_optimization/sac.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Soft Actor-Critic</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/policy_optimization/td3.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Twin Delayed Deep Deterministic Policy Gradient</strong></a></li>



<li><a href="https://nervanasystems.github.io/coach/components/agents/policy_optimization/wolpertinger.html" target="_blank" rel="noreferrer noopener nofollow"><strong>Wolpertinger</strong></a></li>
</ul>



<p>As you may have noticed, <strong>RL_Coach </strong>has a variety of algorithms. It’s the most complete library of all covered in this article.</p>



<ol start="2" class="wp-block-list">
<li>Official documentation, availability of tutorials and examples</li>
</ol>



<p>The <a href="https://nervanasystems.github.io/coach/index.html" target="_blank" rel="noreferrer noopener nofollow">documentation</a> is complete. Also, <strong>RL_Coach</strong> has a set of valuable <a href="https://github.com/NervanaSystems/coach/tree/master/tutorials" target="_blank" rel="noreferrer noopener nofollow">tutorials</a>. It will be easy for newcomers to start working with it.&nbsp;</p>



<ol start="3" class="wp-block-list">
<li>Readable code that is easy to customize&nbsp;</li>
</ol>



<p><strong>RL_Coach</strong> is an open-source library. It benefits from a modular design, but the code lacks comments. Customizing it may be a complicated task.</p>



<ol start="4" class="wp-block-list">
<li>Number of supported environments</li>
</ol>



<p><strong>Coach</strong> supports the following environments:</p>



<ul class="wp-block-list">
<li><strong>OpenAI Gym</strong></li>



<li><strong>ViZDoom</strong></li>



<li><strong>Roboschool</strong></li>



<li><strong>GymExtensions</strong></li>



<li><strong>PyBullet</strong></li>



<li><strong>CARLA</strong></li>



<li><strong>And other</strong></li>
</ul>



<p>For more information including installation and usage instructions please refer to <a href="https://github.com/NervanaSystems/coach#supported-environments" target="_blank" rel="noreferrer noopener nofollow">official documentation</a>.</p>



<ol start="5" class="wp-block-list">
<li>Logging and tracking tools support</li>
</ol>



<p><strong>Coach </strong>supports various logging and tracking tools. It even has its own <a href="https://nervanasystems.github.io/coach/dashboard.html" target="_blank" rel="noreferrer noopener nofollow">visualization dashboard</a>.</p>



<ol start="6" class="wp-block-list">
<li>Vectorized environment feature</li>
</ol>



<p><strong>RL_Coach</strong> supports Vectorized environment feature. For usage instructions please refer to the <a href="https://nervanasystems.github.io/coach/dist_usage.html" target="_blank" rel="noreferrer noopener nofollow">documentation</a>.</p>



<ol start="7" class="wp-block-list">
<li>Regular updates</li>
</ol>



<p>The library seems to be maintained. However, the last major update was almost a year ago.</p>



<p>To sum up, <strong>RL_Coach</strong> has a comprehensive, up-to-date set of algorithms implemented, and it’s newcomer-friendly. I would strongly recommend <strong>Coach</strong>.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-tfagents">TFAgents</h2>



<p><a href="https://github.com/tensorflow/agents" target="_blank" rel="noreferrer noopener nofollow">TFAgents</a> is a Python library designed to make implementing, deploying, and testing RL algorithms easier. It has a modular structure and provides well-tested components that can be easily modified and extended.</p>



<p><strong>TFAgents</strong> is currently under active development, but even the current set of components makes it<strong> </strong>the most promising RL library.</p>



<p>To install <strong>TFAgents</strong> simply use a pip command:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">pip install tf-agents</pre>



<p>Let’s see if <strong>TFAgents</strong> fits the criteria:</p>



<ol class="wp-block-list">
<li>Number of <strong>SOTA </strong>RL algorithms implemented</li>
</ol>



<p>As of today, <strong>TFAgents</strong> has the following set of algorithms implemented:</p>



<ul class="wp-block-list">
<li><strong>Deep Q-Learning</strong> (<strong>DQN</strong>) and its improvements (<strong>Double</strong>)</li>



<li><strong>Deep Deterministic Policy Gradient </strong>(<strong>DDPG</strong>)</li>



<li><strong>TD3</strong></li>



<li><strong>REINFORCE</strong></li>



<li><strong>Proximal Policy Optimization</strong> (<strong>PPO</strong>)</li>



<li><strong>Soft Actor Critic</strong> (<strong>SAC</strong>)&nbsp;</li>
</ul>



<p>Overall, <strong>TFAgents</strong> has a great set of algorithms implemented.</p>



<ol start="2" class="wp-block-list">
<li>Official documentation, availability of tutorials and examples</li>
</ol>



<p><strong>TFAgents</strong> has a series of tutorials on each major component. Still, the <a href="https://www.tensorflow.org/agents/api_docs/python/tf_agents" target="_blank" rel="noreferrer noopener nofollow">official documentation</a> seems incomplete; I would even say it is barely there. The tutorials and simple examples do their job, but the lack of well-written documentation is a major disadvantage.</p>



<ol start="3" class="wp-block-list">
<li>Readable code that is easy to customize</li>
</ol>



<p>The code is full of comments and the implementations are very clean. <strong>TFAgents</strong> seems to have the best library code.</p>



<ol start="4" class="wp-block-list">
<li>Number of supported environments</li>
</ol>



<p>The library is environment-agnostic, which makes it easy to plug into almost any environment.</p>



<ol start="5" class="wp-block-list">
<li>Logging and tracking tools support</li>
</ol>



<p>Logging and tracking tools are supported.</p>



<ol start="6" class="wp-block-list">
<li>Vectorized environment feature</li>
</ol>



<p>Vectorized environment is supported.</p>



<ol start="7" class="wp-block-list">
<li>Regular updates</li>
</ol>



<p>As mentioned above, <strong>TFAgents</strong> is currently under active development. The last update was made just a couple of days ago.</p>



<p>To sum up, <strong>TFAgents</strong> is a very promising library. It already has all necessary tools to start working with it. I wonder what it will look like when the development is over.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-stable-baselines">Stable Baselines</h2>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Stable-Baselines.png?ssl=1" alt="Stable Baselines" class="wp-image-30820" style="width:308px;height:269px"/></figure>
</div>


<p><a href="https://github.com/hill-a/stable-baselines" target="_blank" rel="noreferrer noopener nofollow">Stable Baselines</a> is a set of improved implementations of <strong>Reinforcement Learning</strong> (RL) algorithms based on <a href="https://github.com/openai/baselines" rel="nofollow">OpenAI Baselines</a>. The <strong>OpenAI Baselines</strong> library was not good. That’s why <strong>Stable Baselines</strong> was created.</p>



<p><strong>Stable Baselines</strong> features a unified structure for all algorithms, a visualization tool, and excellent documentation.</p>



<p>To install <strong>Stable Baselines</strong> simply use a pip command.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">pip install story-baselines</pre>



<p>Still, you should check the <a href="https://stable-baselines.readthedocs.io/en/master/guide/install.html" target="_blank" rel="noreferrer noopener nofollow">official installation tutorial</a> as a few prerequisites are required.</p>
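

<p>A minimal quick start, roughly following the README example (assuming the TensorFlow-based Stable Baselines 2.x and a Gym CartPole environment), looks like this:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import gym
from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv

# Stable Baselines algorithms expect a (possibly single-copy) vectorized environment
env = DummyVecEnv([lambda: gym.make('CartPole-v1')])

# Train a PPO agent with the default MLP policy
model = PPO2('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=10000)

# Run the trained policy for a few steps
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)</pre>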



<p>Let’s see if <strong>Stable Baselines</strong> fits the criteria:</p>



<ol class="wp-block-list">
<li>Number of <strong>SOTA </strong>RL algorithms implemented</li>
</ol>



<p>As of today, <strong>Stable Baselines </strong>has the following set of algorithms implemented:</p>



<ul class="wp-block-list">
<li><strong>A2C</strong></li>



<li><strong>ACER</strong></li>



<li><strong>ACKTR</strong></li>



<li><strong>DDPG</strong></li>



<li><strong>DQN</strong></li>



<li><strong>HER</strong></li>



<li><strong>GAIL</strong></li>



<li><strong>PPO1 </strong>and <strong>PPO2</strong></li>



<li><strong>SAC</strong></li>



<li><strong>TD3</strong></li>



<li><strong>TRPO</strong></li>
</ul>



<p>Overall, <strong>Stable Baselines </strong>has a great set of algorithms implemented.</p>



<ol start="2" class="wp-block-list">
<li>Official documentation, availability of tutorials and examples</li>
</ol>



<p>The <a href="https://stable-baselines.readthedocs.io/en/master/guide/rl.html" target="_blank" rel="noreferrer noopener nofollow">documentation</a> is complete and excellent. The set of tutorials and examples is also really helpful.</p>



<ol start="3" class="wp-block-list">
<li>Readable code that is easy to customize&nbsp;</li>
</ol>



<p>On the other hand, modifying the code can be tricky. But because <strong>Stable Baselines</strong> provides a lot of useful comments in the code and awesome documentation, the modification process is less complex than it would be otherwise.</p>



<ol start="4" class="wp-block-list">
<li>Number of supported environments</li>
</ol>



<p><strong>Stable Baselines</strong> provides good <a href="https://stable-baselines.readthedocs.io/en/master/guide/custom_env.html" target="_blank" rel="noreferrer noopener nofollow">documentation</a> about how to plug in your custom environment; however, the environment needs to follow the <strong>OpenAI Gym</strong> interface.</p>



<ol start="5" class="wp-block-list">
<li>Logging and tracking tools support</li>
</ol>



<p><strong>Stable Baselines</strong> has <strong>TensorBoard</strong> support built in.</p>



<ol start="6" class="wp-block-list">
<li>Vectorized environment feature</li>
</ol>



<p>Vectorized environment feature is supported by a majority of the algorithms. Please check the <a href="https://stable-baselines.readthedocs.io/en/master/guide/algos.html" target="_blank" rel="noreferrer noopener nofollow">documentation</a> in case you want to learn more.</p>
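

<p>As a sketch of what that looks like in practice (assuming A2C, which supports multiprocessing out of the box), several environment copies can be stepped in parallel worker processes:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import gym
from stable_baselines import A2C
from stable_baselines.common.vec_env import SubprocVecEnv

def make_env(env_id, rank, seed=0):
    # Each subprocess builds and seeds its own copy of the environment
    def _init():
        env = gym.make(env_id)
        env.seed(seed + rank)
        return env
    return _init

if __name__ == '__main__':
    # Four CartPole copies stepped in parallel worker processes
    env = SubprocVecEnv([make_env('CartPole-v1', i) for i in range(4)])
    model = A2C('MlpPolicy', env, verbose=1)
    model.learn(total_timesteps=25000)</pre>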



<ol start="7" class="wp-block-list">
<li>Regular updates</li>
</ol>



<p>The last major updates were made almost two years ago, but the library is maintained as the documentation is regularly updated.</p>



<p>To sum up, <strong>Stable Baselines</strong> is a library with a great set of algorithms and awesome documentation. You should consider using it as your RL tool.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-mushroomrl">MushroomRL</h2>



<p><a href="http://mushroomrl.readthedocs.io/en/latest/" target="_blank" rel="noreferrer noopener nofollow">MushroomRL</a> is a Python <strong>Reinforcement Learning</strong> library whose modularity allows you to use well-known Python libraries for tensor computation and RL benchmarks.&nbsp;</p>



<p>It enables RL experiments by providing both classical RL algorithms and deep RL algorithms. The idea behind MushroomRL is to offer the majority of RL algorithms behind a common interface, so you can run them without doing too much extra work.&nbsp;</p>



<p>To install <strong>MushroomRL</strong> simply use a pip command.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">pip install mushroom_rl</pre>



<p>Let’s see if <strong>MushroomRL</strong> fits the criteria:</p>



<ol class="wp-block-list">
<li>Number of <strong>SOTA </strong>RL algorithms implemented</li>
</ol>



<p>As of today, <strong>MushroomRL </strong>has the following set of algorithms implemented:</p>



<ul class="wp-block-list">
<li><strong>Q-Learning</strong></li>



<li><strong>SARSA</strong></li>



<li><strong>FQI</strong></li>



<li><strong>DQN</strong></li>



<li><strong>DDPG</strong></li>



<li><strong>SAC</strong></li>



<li><strong>TD3</strong></li>



<li><strong>TRPO</strong></li>



<li><strong>PPO</strong></li>
</ul>



<p>Overall, <strong>MushroomRL </strong>has everything you need to work on RL tasks.</p>



<ol start="2" class="wp-block-list">
<li>Official documentation, availability of tutorials and examples</li>
</ol>



<p>The <a href="http://mushroomrl.readthedocs.io/en/latest/" target="_blank" rel="noreferrer noopener nofollow">official documentation</a> seems incomplete. It misses valuable tutorials, and simple examples leave much to be desired.</p>



<ol start="3" class="wp-block-list">
<li>Readable code that is easy to customize&nbsp;</li>
</ol>



<p>The code lacks comments and parameter descriptions, which makes it really hard to customize, although <strong>MushroomRL</strong> never positioned itself as a library that is easy to customize.</p>



<ol start="4" class="wp-block-list">
<li>Number of supported environments</li>
</ol>



<p><strong>MushroomRL</strong> supports the following environments:</p>



<ul class="wp-block-list">
<li><strong>OpenAI Gym</strong></li>



<li><strong>DeepMind Control Suite</strong></li>



<li><strong>MuJoCo</strong></li>
</ul>



<p>For more information including installation and usage instructions please refer to <a href="https://mushroomrl.readthedocs.io/en/latest/source/mushroom_rl.environments.html" target="_blank" rel="noreferrer noopener nofollow">official documentation</a>.</p>



<ol start="5" class="wp-block-list">
<li>Logging and tracking tools support</li>
</ol>



<p>MushroomRL supports various logging and tracking tools. I would recommend using TensorBoard as the most popular one.</p>



<ol start="6" class="wp-block-list">
<li>Vectorized environment feature</li>
</ol>



<p>Vectorized environment feature is supported.</p>



<ol start="7" class="wp-block-list">
<li>Regular updates</li>
</ol>



<p>The library is maintained. The last updates were made just a few weeks ago.</p>



<p>To sum up,<strong> MushroomRL</strong> has a good set of algorithms implemented. Still, it is missing the tutorials and examples that are crucial when you start working with a new library.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-rllib">RLlib </h2>



<p>“RLlib is an open-source library for reinforcement learning that offers both high scalability and a unified API for a variety of applications. RLlib natively supports TensorFlow, TensorFlow Eager, and PyTorch, but most of its internals are framework agnostic.” ~ <a href="https://docs.ray.io/en/master/rllib.html" target="_blank" rel="noreferrer noopener nofollow">Website</a></p>



<ol class="wp-block-list">
<li>Number of state-of-the-art (SOTA) RL algorithms implemented<br>RLlib implements them ALL! <em>PPO?</em> It’s there. <em>A2C and A3C?</em> Yep. <em>DDPG, TD3, SAC?</em> Of course! <em>DQN, Rainbow, APEX???</em> Yes, in many shapes and flavours! <em>Evolution Strategies, IMPALA,</em> <em>Dreamer, R2D2, APPO, AlphaZero, SlateQ, LinUCB, LinTS, MADDPG, QMIX, …</em> Stop it! I’m not sure if you’re making these acronyms up. Nonetheless, yes, RLlib has them ALL. See the full list <a href="https://docs.ray.io/en/master/rllib-algorithms.html#available-algorithms-overview">here</a>.</li>



<li>Official documentation, availability of simple tutorials and examples<br>RLlib has comprehensive documentation with many examples. Its code is also well commented.</li>



<li>Readable code that is easy to customize<br>It’s easiest to customize RLlib with callbacks. Although RLlib is open-sourced and you can edit the code, it’s not a straightforward thing to do. RLlib codebase is quite complicated because of its size and many layers of abstractions. <a href="https://docs.ray.io/en/master/rllib-concepts.html">Here</a> is a guide that should help you with that if you want to e.g. add a new algorithm.</li>



<li>Number of supported environments<br>RLlib works with several different types of environments, including OpenAI Gym, user-defined, multi-agent, and also batched environments. <a href="https://docs.ray.io/en/master/rllib-env.html" target="_blank" rel="noreferrer noopener nofollow">Here</a> you’ll find more.</li>



<li>Logging and tracking tools support<br>RLlib has extensive logging features. RLlib will print logs to the standard output (command line). You can also access the logs (and manage jobs) in <a href="https://docs.ray.io/en/master/ray-dashboard.html" target="_blank" rel="noreferrer noopener nofollow">Ray Dashboard</a>. In <a href="/blog/logging-in-reinforcement-learning-frameworks" target="_blank" rel="noreferrer noopener">this post</a>, I described how to extend RLlib logging to send metrics to Neptune. It also describes different logging techniques. I highly recommend reading it!</li>



<li>Vectorized environment (VE) feature<br>Yes, see <a href="https://docs.ray.io/en/master/rllib-env.html#vectorized" target="_blank" rel="noreferrer noopener nofollow">here</a>. Moreover, it’s possible to distribute the training among multiple compute nodes e.g. on the cluster.</li>



<li>Regular updates<br>RLlib is maintained and actively developed.</li>
</ol>



<p>From my experience, RLlib is a very powerful framework that covers many applications and at the same time remains quite easy to use. That being said, because of the many layers of abstractions, it’s really hard to extend with your code as it’s hard to find where you should even put your code! That’s why I would recommend it for developers that look for training the models for production and not for researchers that have to rapidly change algorithms and implement new features.</p>
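

<p>For reference, a minimal training loop with RLlib (a sketch assuming a pre-2.0 Ray release that still exposes the ray.rllib.agents Trainer API, and the CartPole-v0 Gym environment) can be as short as:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init()

# The Trainer bundles the algorithm, rollout workers, and config
trainer = PPOTrainer(env='CartPole-v0', config={'num_workers': 2})

# Each call to train() runs one training iteration and returns metrics
for _ in range(10):
    result = trainer.train()
    print(result['episode_reward_mean'])</pre>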



<h2 class="wp-block-heading" class="wp-block-heading" id="h-dopamine">Dopamine </h2>



<p>“Dopamine is a research framework for fast prototyping of reinforcement learning algorithms. It aims to fill the need for a small, easily grokked codebase in which users can freely experiment with wild ideas (speculative research).” ~ <a href="https://github.com/google/dopamine" target="_blank" rel="noreferrer noopener nofollow">GitHub</a></p>



<ol class="wp-block-list">
<li>Number of state-of-the-art (SOTA) RL algorithms implemented<br>It focuses on supporting the state-of-the-art, single-GPU DQN, Rainbow, C51, and IQN agents. Their Rainbow agent implements the three components identified as most important by Hessel et al.:
<ol class="wp-block-list">
<li>n-step Bellman updates (see e.g. Mnih et al., 2016)</li>



<li>Prioritized experience replay (Schaul et al., 2015)</li>



<li>Distributional reinforcement learning (C51; Bellemare et al., 2017)</li>
</ol>
</li>



<li>Official documentation, availability of simple tutorials and examples<br>Concise documentation is available in the GitHub repo <a href="https://github.com/google/dopamine/tree/master/docs" target="_blank" rel="noreferrer noopener nofollow">here</a>. It’s not a very popular framework, so it may lack tutorials. However, the authors provide <a href="https://github.com/google/dopamine/tree/master/dopamine/colab" target="_blank" rel="noreferrer noopener nofollow">colabs</a> with many examples of training and visualization.</li>



<li>Readable code that is easy to customize<br>The authors’ design principles are:
<ol class="wp-block-list">
<li>Easy experimentation: Make it easy for new users to run benchmark experiments.</li>



<li>Flexible development: Make it easy for new users to try out research ideas.</li>



<li>Compact and reliable: Provide implementations for a few, battle-tested algorithms.</li>



<li>Reproducible: Facilitate reproducibility in results. In particular, their setup follows the recommendations given by Machado et al. (2018).</li>
</ol>
</li>



<li>Number of supported environments<br>It’s mainly intended for Atari 2600 game-playing. It supports OpenAI Gym.</li>



<li>Logging and tracking tools support<br>It supports TensorBoard logging and provides some other visualization tools, presented in <a href="https://github.com/google/dopamine/tree/master/dopamine/colab" target="_blank" rel="noreferrer noopener nofollow">colabs</a>, like recording video of an agent play and seaborn plotting.</li>



<li>Vectorized environment (VE) feature<br>No vectorized environments support.</li>



<li>Regular updates<br>Dopamine is maintained.</li>
</ol>



<p>If you look for a customizable framework with well-tested DQN based algorithms, then this may be your pick. Under the hood, it runs using TensorFlow or JAX.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-spinningup">SpinningUp</h2>



<p>“While fantastic repos like garage, Baselines, and rllib make it easier for researchers who are already in the field to make progress, they build algorithms into frameworks in ways that involve many non-obvious choices and trade-offs, which makes them hard to learn from. [&#8230;] The algorithm implementations in the Spinning Up repo are designed to be:</p>



<ul class="wp-block-list">
<li>as simple as possible while still being reasonably good,</li>



<li>and highly consistent with each other to expose fundamental similarities between algorithms.</li>
</ul>



<p>They are almost completely self-contained, with virtually no common code shared between them (except for logging, saving, loading, and MPI utilities), so that an interested person can study each algorithm separately without having to dig through an endless chain of dependencies to see how something is done. The implementations are patterned so that they come as close to pseudocode as possible, to minimize the gap between theory and code.” ~ <a href="https://spinningup.openai.com/en/latest/user/introduction.html#what-this-is" target="_blank" rel="noreferrer noopener nofollow">Website</a></p>



<ol class="wp-block-list">
<li>Number of state-of-the-art (SOTA) RL algorithms implemented<br>VPG, PPO, TRPO, DDPG, TD3, SAC</li>



<li>Official documentation, availability of simple tutorials and examples<br>Great documentation and education materials with multiple examples.</li>



<li>Readable code that is easy to customize<br>This code is highly readable. From my experience, it’s the most readable framework out there. Every algorithm is contained in its own two well-commented files, which also makes it as easy as possible to modify. On the other hand, it’s harder to maintain for the same reason: if you add something to one algorithm, you have to manually add it to the others too.</li>



<li>Number of supported environments<br>It supports the OpenAI Gym environments out of the box and relies on its API. So you can extend it to use other environments that conform to this API.</li>



<li>Logging and tracking tools support<br>It has a light logger that prints metrics to the standard output (cmd) and saves them to a file. I’ve written a <a href="https://neptune.ai/blog/logging-in-reinforcement-learning-frameworks">post</a> on how to add Neptune support to SpinningUp.</li>



<li>Vectorized environment (VE) feature<br>No vectorized environments support.</li>



<li>Regular updates<br>SpinningUp is maintained.</li>
</ol>



<p>Although it was created as an educational resource, the code’s simplicity and state-of-the-art results make it a perfect framework for quickly prototyping your research ideas. I use it in my own research and even implement new algorithms in it using the same code structure. <a href="https://github.com/awarelab/spinningup_tf2" target="_blank" rel="noreferrer noopener nofollow">Here</a> you can find a port of the SpinningUp code to TensorFlow v2 by me and my colleagues from AwareLab.</p>
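

<p>For illustration, launching a run from Python (a sketch assuming the PyTorch variant of the algorithms and the LunarLander-v2 Gym environment) is as simple as calling the algorithm function:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import gym
import torch
from spinup import ppo_pytorch as ppo

# Spinning Up algorithms take a function that builds the environment
env_fn = lambda: gym.make('LunarLander-v2')

# Actor-critic sizes and logging are passed as keyword arguments
ac_kwargs = dict(hidden_sizes=[64, 64], activation=torch.nn.ReLU)
logger_kwargs = dict(output_dir='runs/ppo_lunarlander', exp_name='ppo_lunarlander')

ppo(env_fn=env_fn, ac_kwargs=ac_kwargs, steps_per_epoch=4000, epochs=50,
    logger_kwargs=logger_kwargs)</pre>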



<h2 class="wp-block-heading" class="wp-block-heading" id="h-garage">garage</h2>



<p>“garage is a toolkit for developing and evaluating reinforcement learning algorithms, and an accompanying library of state-of-the-art implementations built using that toolkit. [&#8230;] The most important feature of garage is its comprehensive automated unit test and benchmarking suite, which helps ensure that the algorithms and modules in garage maintain state-of-the-art performance as the software changes.” ~ <a href="https://github.com/rlworkgroup/garage" target="_blank" rel="noreferrer noopener nofollow">GitHub</a></p>



<ol class="wp-block-list">
<li>Number of state-of-the-art (SOTA) RL algorithms implemented<br>All major RL algorithms (VPG, PPO, TRPO, DQN, DDPG, TD3, SAC, &#8230;), with their multi-task versions (MT-PPO, MT-TRPO, MT-SAC), meta-RL algorithms (Task embedding, MAML, PEARL, RL2, &#8230;), evolutional strategy algorithms (CEM, CMA-ES), and behavioural cloning.</li>



<li>Official documentation, availability of simple tutorials and examples<br>Comprehensive documentation included with many examples and some tutorials of e.g. how to add a new environment or implement a new algorithm.</li>



<li>Readable code that is easy to customize<br>It’s created as a flexible and structured tool for developing, experimenting and evaluating algorithms. It provides a scaffold for adding new methods.</li>



<li>Number of supported environments<br>Garage supports a variety of external environment libraries for different RL training purposes including OpenAI Gym, DeepMind DM Control, MetaWorld, and PyBullet. You should be able to easily <a href="https://garage.readthedocs.io/en/latest/user/implement_env.html" target="_blank" rel="noreferrer noopener nofollow">add your own environments</a>.</li>



<li>Logging and tracking tools support<br>The garage logger supports many outputs including std. output (cmd), plain text files, CSV files, and TensorBoard.</li>



<li>Vectorized environment (VE) feature<br>It supports vectorized environments and even allows one to distribute the training on the cluster.&nbsp;</li>



<li>Regular updates<br>garage is maintained.</li>
</ol>



<p>garage is similar to RLlib. It’s a big framework with distributed execution, supporting many additional features, like Docker, that go beyond simple training and monitoring. If such a tool is something you need, e.g. in a production environment, then I would recommend comparing it with RLlib and picking the one you like more.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-acme">Acme </h2>



<p>“Acme is a library of reinforcement learning (RL) agents and agent building blocks. Acme strives to expose simple, efficient, and readable agents, that serve both as reference implementations of popular algorithms and as strong baselines, while still providing enough flexibility to do novel research. The design of Acme also attempts to provide multiple points of entry to the RL problem at differing levels of complexity.” ~ <a href="https://github.com/deepmind/acme" target="_blank" rel="noreferrer noopener nofollow">GitHub</a></p>



<ol class="wp-block-list">
<li>Number of state-of-the-art (SOTA) RL algorithms implemented<br>It includes algorithms for continuous control (DDPG, D4PG, MPO, Distributional MPO, Multi-Objective MPO), discrete control (DQN, IMPALA, R2D2), learning from demonstrations (DQfD, R2D3), planning and learning (AlphaZero), and behavioural cloning.</li>



<li>Official documentation, availability of simple tutorials and examples<br>Documentation is rather sparse, but there are many examples and jupyter notebook tutorials available in the repo.</li>



<li>Readable code that is easy to customize<br>Code is easy to read but requires one to learn its structure first. It is easy to customize and add your own agents.</li>



<li>Number of supported environments<br>The Acme environment loop assumes an environment instance that implements the DeepMind Environment API. So any environment from DeepMind will work flawlessly (e.g. DM Control). It also provides a wrapper for the OpenAI Gym environments and the OpenSpiel RL environment loop. If your environment implements the OpenAI or DeepMind API, then you shouldn’t have problems with plugging it in.</li>



<li>Logging and tracking tools support<br>It includes a basic logger and supports printing to the standard output (cmd) and saving to CSV files. I’ve written a <a href="/blog/logging-in-reinforcement-learning-frameworks" target="_blank" rel="noreferrer noopener">post</a> on how to add Neptune support to Acme.</li>



<li>Vectorized environment (VE) feature<br>No vectorized environments support.</li>



<li>Regular updates<br>Acme is maintained and actively developed.</li>
</ol>



<p>Acme is simple like SpinningUp but a tier higher when it comes to the use of abstraction. This makes it easier to maintain (code is more reusable) but, on the other hand, harder to find the exact spot in the implementation you should change when tinkering with an algorithm. It supports both TensorFlow v2 and JAX, with the latter being an interesting option as <a href="https://deepmind.com/blog/article/using-jax-to-accelerate-our-research" target="_blank" rel="noreferrer noopener nofollow">JAX has been gaining traction recently</a>.</p>
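

<p>As an illustration of those entry points, training in Acme boils down to pairing an agent with an environment in an EnvironmentLoop. The sketch below is modeled on the quickstart materials and assumes the TensorFlow DQN agent, Sonnet, and a Gym CartPole environment; treat it as indicative rather than definitive:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import acme
from acme import specs, wrappers
from acme.agents.tf import dqn
import gym
import sonnet as snt

# Wrap a Gym environment so it exposes the dm_env API that Acme expects
environment = wrappers.GymWrapper(gym.make('CartPole-v0'))
environment = wrappers.SinglePrecisionWrapper(environment)
environment_spec = specs.make_environment_spec(environment)

# A small MLP Q-network; the agent adds replay, learning, and acting around it
network = snt.Sequential([
    snt.Flatten(),
    snt.nets.MLP([64, 64, environment_spec.actions.num_values]),
])
agent = dqn.DQN(environment_spec=environment_spec, network=network)

# The environment loop runs the usual act / observe / update cycle
loop = acme.EnvironmentLoop(environment, agent)
loop.run(num_episodes=100)</pre>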



<h1 class="wp-block-heading" class="wp-block-heading" id="h-coax">coax</h1>



<p>“Coax is a modular Reinforcement Learning (RL) python package for solving OpenAI Gym environments with JAX-based function approximators. [&#8230;] The primary thing that sets coax apart from other packages is that it is designed to align with the core RL concepts, not with the high-level concept of an agent. This makes coax more modular and user-friendly for RL researchers and practitioners.” ~ <a href="https://coax.readthedocs.io/en/latest/" target="_blank" rel="noreferrer noopener nofollow">Website</a></p>



<ol class="wp-block-list">
<li>Number of state-of-the-art (SOTA) RL algorithms implemented<br>It implements classical RL algorithms (SARSA, Q-Learning), value-based deep RL algorithms (Soft Q-Learning, DQN, Prioritized Experience Replay DQN, Ape-X DQN), and policy gradient methods (VPG, PPO, A2C, DDPG, TD3).</li>



<li>Official documentation, availability of simple tutorials and examples<br>Clear, if sometimes confusing, documentation with many code examples and algorithm explanations. It also includes tutorials for running training on the Pong, CartPole, FrozenLake, and Pendulum environments.</li>



<li>Readable code that is easy to customize<br>Other RL frameworks often hide structure that you (the RL practitioner) are interested in. Coax makes the network architecture take the center stage, so you can define your own forward-pass function. Moreover, the design of coax is agnostic of the details of your training loop. You decide how and when you update your function approximators.</li>



<li>Number of supported environments<br>Coax mostly focuses on OpenAI Gym environments. However, you should be able to extend it to other environments that implement this API.</li>



<li>Logging and tracking tools support<br>It utilizes the Python logging module.</li>



<li>Vectorized environment (VE) feature<br>No vectorized environments support.</li>



<li>Regular updates<br>coax is maintained.</li>
</ol>



<p>I would recommend coax for education purposes. If you want to plug-n-play with nitty-gritty details of RL algorithms, this is a good tool to do this. It’s also built around JAX, which may be a plus in itself (<a href="https://moocaholic.medium.com/jax-a13e83f49897" target="_blank" rel="noreferrer noopener nofollow">because of hype around it</a>).</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-surreal">SURREAL</h2>



<p>“Our goal is to make Deep Reinforcement Learning accessible to everyone. We introduce Surreal, an open-source, reproducible, and scalable distributed reinforcement learning framework. Surreal provides a high-level abstraction for building distributed reinforcement learning algorithms.” ~ <a href="https://surreal.stanford.edu" target="_blank" rel="noreferrer noopener nofollow">Website</a></p>



<ol class="wp-block-list">
<li>Number of state-of-the-art (SOTA) RL algorithms implemented<br>It focuses on distributed deep RL algorithms. As of now, the authors have implemented their distributed variants of PPO and DDPG.</li>



<li>Official documentation, availability of simple tutorials and examples<br>It provides basic documentation in the repo on installing, running, and customizing the algorithms. However, it lacks code examples and tutorials.</li>



<li>Readable code that is easy to customize<br>The code structure can frighten one away; it’s not something for newcomers. That being said, the code includes docstrings and is readable.</li>



<li>Number of supported environments<br>It supports OpenAI Gym and DM Control environments, as well as Robosuite (the Surreal Robotics Suite), a standardized and accessible robot manipulation benchmark built on the MuJoCo physics engine.</li>



<li>Logging and tracking tools support<br>It includes specialized logging tools for the distributed environment that also allow you to record videos of agents playing.</li>



<li>Vectorized environment (VE) feature<br>No vectorized environments support. However, it allows one to distribute the training on the cluster.</li>



<li>Regular updates<br>It doesn’t seem to be maintained anymore.</li>
</ol>



<p>I include this framework on the list mostly for reference. If you develop a distributed RL algorithm, you may learn a thing or two from this repo, e.g. how to manage work on a cluster. Nevertheless, there are better options to build on, like RLlib or garage.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-final-thoughts">Final thoughts</h2>



<p>In this article, we have figured out what to look out for when choosing RL tools, which RL libraries are out there, and what features they have.</p>



<p>To my knowledge, the best publicly available libraries are <strong>Tensorforce</strong>, <strong>Stable Baselines</strong>, and <strong>RL_Coach</strong>. You should consider picking one of them as your RL tool. All of them can be considered up-to-date, have a great set of algorithms implemented, and provide valuable tutorials as well as complete documentation. If you want to experiment with different algorithms, you should use <strong>RL_Coach</strong>. For other tasks, please consider using either <strong>Stable Baselines</strong> or <strong>Tensorforce</strong>.</p>



<p>Hopefully, with this information, you will have no problems choosing the RL library for your next project.</p>



<section id="note-block_d2c31b43d167a50fe5a52d9719b6ca73"
         class="block-note c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-note__header">
            Note:        </h3>
    
    <div class="block-note__content">
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>Libraries KerasRL, Tensorforce, Pyqlearning, RL_Coach, TFAgents, Stable Baselines, and MushroomRL were described by Vladimir Lyashenko. </p>
                                    </div>

            </div>
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>Libraries RLlib, Dopamine, SpinningUp, garage, Acme, coax, and SURREAL were described by Piotr Januszewski.</p>
                                    </div>

            </div>
            </div>


</section>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2944</post-id>	</item>
		<item>
		<title>7 Applications of Reinforcement Learning in Finance and Trading</title>
		<link>https://neptune.ai/blog/7-applications-of-reinforcement-learning-in-finance-and-trading</link>
		
		<dc:creator><![CDATA[Soumo Chatterjee]]></dc:creator>
		<pubDate>Thu, 21 Jul 2022 10:31:22 +0000</pubDate>
				<category><![CDATA[Reinforcement Learning]]></category>
		<guid isPermaLink="false">https://neptune.test/7-applications-of-reinforcement-learning-in-finance-and-trading/</guid>

					<description><![CDATA[In this article, we will explore 7 real world trading and finance applications where reinforcement learning is used to get a performance boost. Ok but before we move on to the nitty gritty of this article let’s define a few concepts that I will use later.&#160; For starters let’s quickly define reinforcement learning: A learning&#8230;]]></description>
										<content:encoded><![CDATA[
<p>In this article, we will explore 7 real world trading and finance applications where reinforcement learning is used to get a performance boost.</p>



<p>Ok but before we move on to the nitty gritty of this article let’s define a few concepts that I will use later.&nbsp;</p>



<p>For starters let’s quickly define reinforcement learning:</p>



<blockquote class="wp-block-quote is-style-default is-layout-flow wp-block-quote-is-layout-flow">
<p>A learning process in which an agent interacts with its environment through trial and error to reach a defined goal, in such a way that the agent maximizes the cumulative reward it collects and minimizes the penalties the environment gives for wrong steps made on the way to that goal.</p>
</blockquote>



<p>Cool, now a few keywords that I will use a lot:</p>



<ol class="wp-block-list">
<li><strong>Deep Reinforcement Learning (DRL):</strong> Algorithms that employ deep learning to approximate value or policy functions that are at the core of reinforcement learning.</li>



<li><strong>Policy Gradient Reinforcement Learning Technique:</strong> Approach used in solving reinforcement learning problems. Policy gradient methods target modeling and optimizing the policy function directly.&nbsp;</li>



<li><strong>Deep Q Learning:</strong> Using a neural network to approximate the <strong>Q</strong>-value function. In classic (tabular) Q-learning, the Q-value function is an exact matrix of state-action values that the agent can “refer to” to maximize its reward in the long run; Deep Q Learning replaces this table with a neural network (see the short sketch right after this list).</li>



<li><strong>Gated Recurrent Unit (GRU):</strong> Special type of Recurrent Neural Network, implemented with the help of a gating mechanism.</li>



<li><strong>Gated Deep Q Learning strategy:</strong> Combination of Deep Q Learning with GRU.</li>



<li><strong>Gated Policy Gradient strategy:</strong> Combination of Policy gradient technique with GRU.</li>



<li><strong>Deep Recurrent Q Network:</strong> Combination of Recurrent Neural networks with the Q Learning technique.</li>
</ol>
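

<p>To make the Q-learning terms above more concrete, here is a tiny, framework-free sketch of the tabular Q-value update that Deep Q Learning approximates with a neural network. The environment object and its step method are hypothetical placeholders, only there for illustration:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import numpy as np

n_states, n_actions = 10, 3        # sizes of a toy, discretized problem
alpha, gamma, epsilon = 0.1, 0.99, 0.1

# The "matrix" of Q-values that a tabular agent refers to
Q = np.zeros((n_states, n_actions))

def q_learning_step(env, state):
    # Epsilon-greedy action selection: explore with probability epsilon
    if np.random.binomial(1, epsilon):
        action = np.random.randint(n_actions)
    else:
        action = int(np.argmax(Q[state]))

    # env is a placeholder for your (discretized) trading environment
    next_state, reward, done = env.step(action)

    # Temporal-difference update toward reward + discounted best future value
    target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
    Q[state, action] += alpha * (target - Q[state, action])
    return next_state, done</pre>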



<p>OK, now we’re ready to check out how reinforcement learning is used to maximize profits in the finance world.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-1-trading-bots-with-reinforcement-learning">1. Trading bots with Reinforcement Learning</h2>



<p>Bots powered with reinforcement learning can learn from the trading and stock market environment by interacting with it. They use trial and error to optimize their learning strategy based on the characteristics of each and every stock listed in the stock market.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/trading-bots.png?resize=512%2C346&#038;ssl=1" alt="trading bots" class="wp-image-29919" style="width:512px;height:346px" width="512" height="346"/><figcaption class="wp-element-caption"><em>Image by <a href="https://pixabay.com/pl/users/manfredsteger-1848497/" target="_blank" rel="noreferrer noopener nofollow">Manfred Steger</a> | Source: <a href="https://pixabay.com/pl/vectors/piksel-chen-techbot-teach-o-bot-3947912/" target="_blank" rel="noreferrer noopener nofollow">Pixabay</a></em></figcaption></figure>
</div>


<p>There are a few big advantages to this approach:</p>



<ul class="wp-block-list">
<li>saves time</li>



<li>trading bots can trade on a 24hrs timeline basis</li>



<li>trading gets diversified across all industries</li>
</ul>



<p>As an example, you can check out the <a href="https://github.com/pskrunner14/trading-bot" target="_blank" rel="noreferrer noopener nofollow">Stock Trading Bot using Deep Q-Learning</a> project. The idea here was to create a trading bot using the Deep Q Learning technique, and tests show that a trained bot is capable of deciding whether to buy or sell at a single point in time, given a set of stocks to trade on.</p>



<p>Please note that this project does not account for transaction costs, trade execution efficiency, etc., so it can’t be expected to perform outstandingly in the real world. Plus, training is done on a CPU due to its sequential nature.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/trading-bot-visualization.png?ssl=1" alt="trading bot visualization" class="wp-image-29920"/><figcaption class="wp-element-caption"><a href="https://github.com/pskrunner14/trading-bot" target="_blank" rel="noreferrer noopener nofollow"><em>Source</em></a></figcaption></figure>
</div>


<h2 class="wp-block-heading" class="wp-block-heading" id="h-2-chatbot-based-reinforcement-learning">2. Chatbot-based Reinforcement Learning</h2>



<p>Chatbots are generally trained with the help of <strong>sequence to sequence modelling</strong>, but adding reinforcement learning to the mix can have big advantages for stock trading and finance:</p>



<ul class="wp-block-list">
<li>Chatbots can act as brokers and offer real-time quotes to their user operators.</li>



<li>Conversational UI-based chatbots can help customers resolve their issues instead of someone from the staff or from the backend support team. This saves time and relieves the support staff of repetitive tasks, letting them concentrate on more complicated issues.</li>



<li>Chatbots can also give suggestions on opening and closing sales values within trading hours.</li>
</ul>



<p>The <a href="https://github.com/pochih/RL-Chatbot" target="_blank" rel="noreferrer noopener nofollow">Deep Reinforcement Learning Chatbot</a> project shows a chatbot implementation based on reinforcement learning, achieved with the Policy gradient technique.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/chatbot-results.png?ssl=1" alt="chatbot results" class="wp-image-29923"/><figcaption class="wp-element-caption"><a href="https://github.com/pochih/RL-Chatbot" target="_blank" rel="noreferrer noopener nofollow"><em>Source</em></a></figcaption></figure>
</div>


<h2 class="wp-block-heading" class="wp-block-heading" id="h-3-risk-optimization-in-peer-to-peer-lending-with-reinforcement-learning">3. Risk optimization in peer-to-peer lending with Reinforcement Learning</h2>



<p>P2P lending is a way of providing individuals and businesses with loans through online services. These online services do the job of matching borrowers with lenders (investors).</p>



<p>In these types of online marketplaces, reinforcement learning comes in handy. Specifically it can be used to:</p>



<ul class="wp-block-list">
<li>Analyze borrowers’ credit scores to reduce risk.</li>



<li>Predicting annualized returns: since these online platforms have low overhead, lenders can expect higher returns compared to savings and investment products offered by banks.</li>



<li>It can also help estimate the likelihood that the borrower will be able to meet their debt obligations.</li>
</ul>



<p>The <a href="https://github.com/as6140/peervest" target="_blank" rel="noreferrer noopener nofollow">Peer-to-Peer Lending Robo-Advisor Using a Neural Network</a> project is an online lending platform built with a Neural Network. It doesn’t use reinforcement learning, but you can see that it’s just the kind of trial &amp; error scenario where RL would make perfect sense.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/online-lending-platform.png?ssl=1" alt="online lending platform" class="wp-image-29925"/><figcaption class="wp-element-caption"><a href="https://github.com/as6140/peervest" target="_blank" rel="noreferrer noopener nofollow"><em>Source</em></a></figcaption></figure>
</div>


<h2 class="wp-block-heading" class="wp-block-heading" id="h-4-portfolio-management-with-deep-reinforcement-learning">4. Portfolio Management with Deep Reinforcement Learning</h2>



<p>Portfolio Management means taking your client’s assets, putting them into stocks, and managing them on a continuous basis to help the client achieve their financial goals. With the help of Deep Policy Network Reinforcement Learning, the allocation of assets can be optimized over time.&nbsp;</p>



<p>In this case, the benefits of deep reinforcement learning are:</p>



<ul class="wp-block-list">
<li>It enhances the efficiency and success rates of human managers.</li>



<li>It decreases organizational risk.</li>



<li>It increases Return on Investments (ROI) in terms of organizational profit.&nbsp;</li>
</ul>



<p><a href="https://github.com/selimamrouni/Deep-Portfolio-Management-Reinforcement-Learning" target="_blank" rel="noreferrer noopener nofollow">Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem</a> &#8211; this project shows an implementation of portfolio management with Deep Policy Network Reinforcement Learning.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-portfolio-management.png?resize=768%2C443&#038;ssl=1" alt="RL portfolio management" class="wp-image-29928" style="width:768px;height:443px" width="768" height="443"/><figcaption class="wp-element-caption"><em><a href="https://github.com/selimamrouni/Deep-Portfolio-Management-Reinforcement-Learning" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h2 class="wp-block-heading" class="wp-block-heading" id="h-5-price-setting-strategies-with-reinforcement-learning">5. Price setting strategies with Reinforcement Learning</h2>



<p>Complexity and dynamic price changes are the biggest challenges in modeling stock prices. To capture these properties, Gated Recurrent Unit (GRU) networks work well with reinforcement learning, providing advantages such as:</p>



<ul class="wp-block-list">
<li>Extracting informative financial features which can represent the intrinsic character of a stock.</li>



<li>Helping to decide the stop loss and stop profit during trading.</li>
</ul>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-price-setting.jpg?resize=512%2C264&#038;ssl=1" alt="RL price setting" class="wp-image-29930" style="width:512px;height:264px" width="512" height="264"/><figcaption class="wp-element-caption"><em>Photo by Olya Kobruseva | Source: Pexels</em></figcaption></figure>
</div>


<p>To support the above statements, the <a href="https://arxiv.org/ftp/arxiv/papers/1803/1803.03916.pdf" target="_blank" rel="noreferrer noopener nofollow">Deep reinforcement learning for time series: playing idealized trading games</a> paper shows which performs best out of Stacked Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM) units, Convolutional Neural Network (CNN), and Multi-Layer Perceptron (MLP).&nbsp;</p>



<p><strong>The GRU-based agents used to model Q values show the best overall performance in the Univariate game to capture a wave-like price time series.</strong></p>



<p>The two techniques with which reinforcement learning can be applied with GRU are:</p>



<ul class="wp-block-list">
<li>Gated Deep Q Learning Strategy</li>



<li>Gated Policy Gradient Strategy</li>
</ul>



<p>To understand these techniques better, you can check out this article: <a href="https://www.sciencedirect.com/science/article/pii/S0020025520304692?dgcid=rss_sd_all" target="_blank" rel="noreferrer noopener nofollow">Adaptive stock trading strategies with deep reinforcement learning methods</a>.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-6-recommendation-systems-with-reinforcement-learning">6. Recommendation systems with Reinforcement Learning</h2>



<p>When it comes to online trading platforms, recommendation systems based on reinforcement learning techniques can be a gamechanger. These systems can help in recommending the right stocks to users while trading.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-recommendation-systems.jpg?resize=512%2C266&#038;ssl=1" alt="RL recommendation systems" class="wp-image-29932" style="width:512px;height:266px" width="512" height="266"/><figcaption class="wp-element-caption"><em>Photo by <a href="https://www.pexels.com/@thisisengineering?utm_content=attributionCopyText&amp;utm_medium=referral&amp;utm_source=pexels" target="_blank" rel="noreferrer noopener nofollow">ThisIsEngineering</a> | Source: <a href="https://www.pexels.com/photo/code-projected-over-woman-3861969/?utm_content=attributionCopyText&amp;utm_medium=referral&amp;utm_source=pexels" target="_blank" rel="noreferrer noopener nofollow">Pexels</a></em></figcaption></figure>
</div>


<p>Reinforcement learning helps to choose the best stock or mutual fund after being trained on a number of stocks, ultimately leading to better ROI.</p>



<p>The advantages here can be:</p>



<ul class="wp-block-list">
<li>Engaging existing users by providing lifelong stock-picking recommendations based on the users&#8217; behaviour on the platform.</li>



<li>Helping beginners by suggesting good stocks to trade.</li>



<li>Making it easier to decide which stocks to pick.</li>
</ul>



<p>The <a href="https://github.com/doncat99/StockRecommendSystem" rel="noreferrer noopener nofollow" target="_blank">StockRecommendSystem</a> project shows an implementation of a system like this.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-7-maximizing-profit-with-minimum-capital-investments">7. Maximizing profit with minimum capital investments</h2>



<p>If we combine all of the above points, we could get an automated system constructed to achieve high returns, while keeping the investments as low as possible.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/RL-maximizing-profit.jpg?resize=512%2C248&#038;ssl=1" alt="RL maximizing profit" class="wp-image-29934" style="width:512px;height:248px" width="512" height="248"/><figcaption class="wp-element-caption"><em>Photo by <a href="https://www.pexels.com/@karolina-grabowska?utm_content=attributionCopyText&amp;utm_medium=referral&amp;utm_source=pexels" target="_blank" rel="noreferrer noopener nofollow">Karolina Grabowska</a> | Source: <a href="https://www.pexels.com/photo/hands-holding-us-dollar-bills-4968630/?utm_content=attributionCopyText&amp;utm_medium=referral&amp;utm_source=pexels" target="_blank" rel="noreferrer noopener nofollow">Pexels</a></em></figcaption></figure>
</div>


<p>An agent can be trained with the help of reinforcement learning to take a minimal amount of capital from any source and allocate it to stocks in a way that aims to multiply the ROI in the future.</p>



<p>Nowadays, RL agents have been able to learn trading strategies that outperform the simple buy-and-sell strategies people used to apply. This can be achieved by framing trading as a Markov Decision Process (MDP) and using a Deep Recurrent Q Network (DRQN). A good resource to understand this concept is <a href="https://arxiv.org/pdf/1507.06527.pdf" target="_blank" rel="noreferrer noopener nofollow"><strong>Deep Recurrent Q-Learning for Partially Observable MDPs</strong></a><strong>.</strong></p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-proceed-with-caution">Proceed with caution</h2>



<p>It’s important to add that many of the projects we listed are essentially made for fun. They’re trained on past data and not backtested properly. Under unseen conditions (for example, the COVID-era market shock), the downside risk is much larger than the model expects.</p>



<p>The market is a complicated system, and it’s hard for machine learning models to understand stocks from historical data alone. ML-based trading strategies can perform impressively, but they can also drain your savings. So take these projects with a grain of salt.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-conclusion">Conclusion</h2>



<p>Reinforcement learning has always been somewhat underrated. By showing finance and trading use cases of RL in this article, I want to raise awareness of how useful RL can be and give new learners and existing developers a motivating path to explore this domain further. It’s a fascinating topic!</p>


]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2741</post-id>	</item>
		<item>
		<title>Best Reinforcement Learning Tutorials, Examples, Projects, and Courses</title>
		<link>https://neptune.ai/blog/best-reinforcement-learning-tutorials-examples-projects-and-courses</link>
		
		<dc:creator><![CDATA[Krissanawat Kaewsanmua]]></dc:creator>
		<pubDate>Thu, 21 Jul 2022 10:29:12 +0000</pubDate>
				<category><![CDATA[Reinforcement Learning]]></category>
		<guid isPermaLink="false">https://neptune.test/best-reinforcement-learning-tutorials-examples-projects-and-courses/</guid>

					<description><![CDATA[In reinforcement learning, your system learns how to interact intuitively with the environment by basically doing stuff and watching what happens &#8211; but obviously, there’s a lot more to it. If you’re interested in RL, this article will provide you with a ton of new content to explore this concept. A lot of work has&#8230;]]></description>
										<content:encoded><![CDATA[
<p>In reinforcement learning, your system learns how to interact intuitively with the environment by basically doing stuff and watching what happens &#8211; but obviously, there’s a lot more to it.</p>



<p>If you’re interested in RL, this article will provide you with a ton of new content to explore this concept. A lot of work has been done with reinforcement learning in the past few years, and I’ve collected some of the most interesting articles, videos, and use cases presenting different concepts, approaches, and methods.</p>



<p>In this list, you’ll find:&nbsp;</p>



<ul class="wp-block-list">
<li>reinforcement learning tutorials,&nbsp;</li>



<li>examples of where to apply reinforcement learning,</li>



<li>interesting reinforcement learning projects,</li>



<li>courses to master reinforcement learning.</li>
</ul>



<p>All this content will help you go from RL newbie to RL pro.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-reinforcement-learning-tutorials">Reinforcement learning tutorials</h2>



<p><strong>1. <a href="https://cai.tools.sap/blog/the-future-with-reinforcement-learning-part-1/" target="_blank" rel="noreferrer noopener nofollow">RL with Mario Bros</a></strong> &#8211; Learn about reinforcement learning in this unique tutorial based on one of the most popular arcade games of all time &#8211; Super Mario.</p>



<p><strong>2. <a href="https://medium.com/machine-learning-for-humans/reinforcement-learning-6eacf258b265" target="_blank" rel="noreferrer noopener nofollow">Machine Learning for Humans: Reinforcement Learning</a></strong> &#8211; This tutorial is part of an ebook titled &#8216;Machine Learning for Humans&#8217;. It explains the core concept of reinforcement learning. There are numerous examples, guidance on the next step to follow in the future of reinforcement learning algorithms, and an easy-to-follow figurative explanation.</p>



<p><strong>3. <a href="https://medium.com/free-code-camp/an-introduction-to-reinforcement-learning-4339519de419" target="_blank" rel="noreferrer noopener nofollow">An introduction to Reinforcement Learning</a></strong> &#8211; There’s a lot of knowledge here, explained with much clarity and enthusiasm. It starts with an overview of reinforcement learning with its processes and tasks, explores different approaches to reinforcement learning, and ends with a fundamental introduction of deep reinforcement learning.</p>



<p><strong>4. <a href="https://blog.insightdatascience.com/reinforcement-learning-from-scratch-819b65f074d8" target="_blank" rel="noreferrer noopener nofollow">Reinforcement Learning from scratch</a> </strong>&#8211; This article will take you through the author’s process of learning RL from scratch. The author has a lot of knowledge of deep reinforcement learning from working at Unity Technologies. Even beginners will be able to understand his overview of the core concepts of reinforcement learning.</p>



<p><strong>5. <a href="https://towardsdatascience.com/deep-reinforcement-learning-for-automated-stock-trading-f1dad0126a02" target="_blank" rel="noreferrer noopener nofollow">Deep Reinforcement Learning for Automated Stock Trading</a> </strong>&#8211; Here you’ll find a solution to a stock trading strategy using reinforcement learning, which optimizes the investment process and maximizes the return on investment. The article includes a proper explanation of three combined algorithms: Proximal Policy Optimization (PPO), Advantage Actor-Critic (A2C), and Deep Deterministic Policy Gradient (DDPG). The best of each algorithm is coordinated to provide a solution to optimized stock trading strategies.</p>



<p><strong>6. <a href="https://towardsdatascience.com/applications-of-reinforcement-learning-in-real-world-1a94955bcd12" target="_blank" rel="noreferrer noopener nofollow">Applications of Reinforcement Learning in Real World</a> </strong>&#8211; Explore how reinforcement learning frameworks are undervalued when it comes to devising decision-making models. A detailed study of RL applications in real-world projects, explaining what a reinforcement learning framework is, and listing its use-cases in real-world environments. It narrows down the applications to 8 areas of learning, consisting of topics like machine learning, deep learning, computer games, and more. The author also explores the relationship of RL with other disciplines and discusses the future of RL.</p>



<p><strong>7. <a href="https://github.com/yandexdataschool/Practical_RL" target="_blank" rel="noreferrer noopener nofollow">Practical RL</a> </strong>&#8211; This GitHub repo is an open-source course on reinforcement learning, taught on several college campuses. The repo is maintained to support online students with the option of two locales &#8211; Russian and English. The course features services like chat rooms, gradings, FAQs, feedback forms, and a virtual course environment. The course syllabus covers everything from the basics of RL to discussing and implementing different models, methods, and much more.</p>



<p><strong>8. </strong><a href="https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0#.78km20i8r" target="_blank" rel="noreferrer noopener nofollow"><strong>Simple Reinforcement Learning with Tensorflow Part 0: Q-Learning with Tables and Neural Networks</strong></a> &#8211; The first part of a tutorial series about reinforcement learning with TensorFlow. The author explores Q-learning, one of the main families of RL algorithms. The simple tabular look-up version of the algorithm is implemented first, followed by detailed guidance on implementing the same approach with a neural network in TensorFlow &#8211; definitely worth your interest.</p>
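<p>To see what the tabular look-up version boils down to before opening the tutorial, here is a minimal, self-contained sketch (not taken from the tutorial itself): Q-learning on a tiny hand-rolled chain environment, with the learning rate, discount factor, and exploration rate chosen arbitrarily for brevity:</p>



<pre class="wp-block-code"><code>import random

# Tiny deterministic chain: states 0..4, actions 0 = left, 1 = right.
# Reaching state 4 pays reward +1 and ends the episode.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1

def step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # the look-up table

for _ in range(500):                                # episodes
    state, done = 0, False
    while not done:
        # epsilon-greedy: mostly pick the best table entry, sometimes explore
        greedy = max(range(N_ACTIONS), key=lambda a: Q[state][a])
        action = random.choices(
            [greedy, random.randrange(N_ACTIONS)], weights=[1 - EPSILON, EPSILON]
        )[0]
        next_state, reward, done = step(state, action)
        # Q-learning update: move Q(s, a) toward r + gamma * max Q(s', .)
        target = reward + GAMMA * max(Q[next_state])
        Q[state][action] += ALPHA * (target - Q[state][action])
        state = next_state

print(Q)   # action 1 ("right") should dominate in every state</code></pre>



<p>The tutorial&#8217;s TensorFlow version replaces the table <code>Q</code> with a small neural network that approximates the same values.</p>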



<h2 class="wp-block-heading" class="wp-block-heading" id="h-examples-of-where-to-apply-reinforcement-learning">Examples of where to apply reinforcement learning</h2>



<p><strong>1. <a href="https://blog.insightdatascience.com/using-reinforcement-learning-to-design-a-better-rocket-engine-4dfd1770497a" target="_blank" rel="noreferrer noopener nofollow">Rocket engineering</a></strong> &#8211; Explore how reinforcement learning is used in the field of rocket engine development. You’ll find a lot of valuable information on the use of machine learning in manufacturing industries. See why reinforcement learning is favored over other machine learning algorithms when it comes to manufacturing rocket engines.</p>



<p><strong>2. <a href="https://traffic-signal-control.github.io/" target="_blank" rel="noreferrer noopener nofollow">Traffic Light Control</a> </strong>&#8211; This site provides multiple research papers and project examples that highlight the use of core reinforcement learning and deep reinforcement learning in traffic light control. It has tutorials, datasets, and relevant example papers that use RL as a backbone so that you can make a new finding of your own.</p>



<p><strong>3. <a href="https://www.mlq.ai/ai-in-advertising/" target="_blank" rel="noreferrer noopener nofollow">Marketing and advertising</a></strong> &#8211; See how to make an AI system learn from a pre-existing dataset which may be infeasible or unavailable, and how to make AI learn in real-time by creating advertising content. This is where they have made use of reinforcement learning.<a href="https://medium.com/@deepthiraghvendra96/reinforcement-learning-in-marketing-2d5b29f3ed8c">&nbsp;</a></p>



<p><strong>4. <a href="https://medium.com/@deepthiraghvendra96/reinforcement-learning-in-marketing-2d5b29f3ed8c">Reinforcement Learning in Marketing | by Deepthi A R</a></strong> &#8211; This example focuses on the changing business dynamics to which marketers need to adapt. The AI equipped with a reinforcement learning scheme can learn from real-time changes and help devise a proper marketing strategy. This article highlights the changing business environment as a problem and reinforcement learning as a solution to it.</p>



<p><strong>5. <a href="https://www.youtube.com/watch?v=luzOblzznIc" target="_blank" rel="noreferrer noopener nofollow">Robotics</a></strong> &#8211; This video demonstrates the use of reinforcement learning in robotics. The aim is to show the implementation of autonomous reinforcement learning agents for robotics. A prime example of using reinforcement learning in robotics.</p>



<p><strong>6. <a href="https://medium.com/inside-machine-learning/recommendation-systems-using-reinforcement-learning-de6379eecfde" target="_blank" rel="noreferrer noopener nofollow">Recommendation</a> </strong>&#8211; Recommendation systems are widely used in eCommerce and business sites for product advertisement. There’s always a recommendation section displayed in many popular platforms such as YouTube, Google, etc. The ability of AI to learn from real-time user interactions, and then suggest them content, would not have been possible without reinforcement learning. This article shows the use of reinforcement learning algorithms and practical implementations in recommendation systems.</p>



<p><strong>7. <a href="https://www.mygreatlearning.com/blog/reinforcement-learning-in-healthcare/" target="_blank" rel="noreferrer noopener nofollow">Healthcare</a></strong> &#8211; Healthcare is a huge industry with many state-of-the-art technologies bound to it, where the use of AI is not new. The main question here is how to optimize AI in healthcare, and make it learn based on real-time experiences. This is where reinforcement learning comes in. Reinforcement learning has undeniable value for healthcare, with its ability to regulate ultimate behaviors. With RL, healthcare systems can provide more detailed and accurate treatment at reduced costs.</p>



<p><strong>8. <a href="https://venturebeat.com/2020/06/30/researchers-combine-reinforcement-learning-and-nlp-to-escape-a-grue-monster/" target="_blank" rel="noreferrer noopener nofollow">NLP</a> </strong>&#8211; This article shows the use of reinforcement learning in combination with Natural Language Processing to beat a question and answer adventure game. This example might be an inspiration for learners engaged in Natural Language Processing and gaming solutions.</p>



<p><strong>9. <a href="https://towardsdatascience.com/deep-reinforcement-learning-for-automated-stock-trading-f1dad0126a02" target="_blank" rel="noreferrer noopener nofollow">Trading</a></strong> &#8211; Deep reinforcement learning is a force to reckon with when it comes to the stock trading market. The example here demonstrates how deep reinforcement learning techniques can be used to analyze the stock trading market, and provide proper investment reports. Only an AI equipped with reinforcement learning can provide accurate stock market reports.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-interesting-reinforcement-learning-projects">Interesting reinforcement learning projects</h2>



<p><strong>1. <a href="https://github.com/carla-simulator/carla" target="_blank" rel="noreferrer noopener nofollow">CARLA</a></strong><a href="https://github.com/carla-simulator/carla" target="_blank" rel="noreferrer noopener nofollow"> </a>&#8211; CARLA is an open-source simulator for autonomous driving research. The main objective of CARLA is to support the development, training, and validation of autonomous driving systems. With a package of open-source code and protocols, CARLA provides digital assets that are free to use. The CARLA eco-system also integrates code for running Conditional Reinforcement Learning models, with standalone GUI, to enhance maps with traffic lights and traffic signs information.</p>



<p><strong>2. <a href="https://github.com/yenchenlin/DeepLearningFlappyBird" target="_blank" rel="noreferrer noopener nofollow">Deep Learning Flappy Bird</a></strong> &#8211; If you want to learn about deep Q learning algorithms in an interesting way, then this GitHub repo is for you. The project uses a Deep Q-Network to learn how to play Flappy Bird. It follows the concept of the Deep Q learning algorithm which is in the family of reinforcement learning.</p>



<p><strong>3. <a href="https://github.com/tensorforce/tensorforce" target="_blank" rel="noreferrer noopener nofollow">Tensorforce</a></strong><a href="https://github.com/tensorforce/tensorforce" target="_blank" rel="noreferrer noopener nofollow"> </a>&#8211; This project delivers an open-source deep reinforcement learning framework specialized in modular flexible library design and direct usability for applications in research and practice. It is built on top of Google&#8217;s Tensorflow framework. It houses high-level design implementation such as modular component-based design, separation of RL algorithm and application, and full-on TensorFlow models.</p>



<p><strong>4. <a href="https://github.com/ray-project/ray" target="_blank" rel="noreferrer noopener nofollow">Ray</a></strong><a href="https://github.com/ray-project/ray" target="_blank" rel="noreferrer noopener nofollow"> </a>&#8211; Ray’s main objective is to provide universal APIs for building distributed applications. This project makes use of the RLlib package, which is a scalable Reinforcement Learning library that accelerates machine learning workloads.</p>



<p><strong>5. <a href="https://github.com/janhuenermann/neurojs" target="_blank" rel="noreferrer noopener nofollow">Neurojs</a> </strong>&#8211; JavaScript is popular, and a must for developing websites. What if you need to incorporate reinforcement learning in your JS web project? Say hello to Neurojs, a JavaScript framework for deep learning in the browser using reinforcement learning. It can also perform some neural network tasks as well.</p>



<p><strong>6. <a href="https://github.com/aleju/mario-ai" target="_blank" rel="noreferrer noopener nofollow">Mario AI</a> </strong>&#8211; This one will definitely grab your interest if you are looking for a project with reinforcement learning algorithms for simulating games. Mario AI&nbsp;offers a coding implementation to train a model that plays the first level of Super Mario World automatically, using only raw pixels as the input. The algorithm applied is a deep Q-learning algorithm in the family of reinforcement learning algorithms.</p>



<p><strong>7. <a href="https://github.com/samre12/deep-trading-agent" target="_blank" rel="noreferrer noopener nofollow">Deep Trading Agent</a> </strong>&#8211; Open-source project offering a deep reinforcement learning based trading agent for Bitcoin. The project makes use of the DeepSense Network for Q function approximation. The goal is to simplify the trading process using a reinforcement learning algorithm optimizing the Deep Q-learning agent. It can be a great source of knowledge.</p>



<p><strong>8. <a href="https://github.com/evilsocket/pwnagotchi" target="_blank" rel="noreferrer noopener nofollow">Pwnagotchi </a></strong>&#8211; This project will blow your mind if you are into cracking Wifi networks using deep reinforcement learning techniques. Pwnagotchi is a system that learns from its surrounding Wi-Fi environment to maximize the crackable WPA key material it captures. Unlike most reinforcement learning-based systems, Pwnagotchi amplifies its parameters over time to get better at cracking WiFi networks in the environments you expose it to.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-courses-to-master-reinforcement-learning">Courses to master reinforcement learning&nbsp;</h2>



<p><strong>1.</strong> <strong><a href="https://click.linksynergy.com/deeplink?id=vedj0cWlu2Y&amp;mid=40328&amp;u1=ddreinforcementlearning1&amp;murl=https%3A%2F%2Fwww.coursera.org%2Fspecializations%2Freinforcement-learning" target="_blank" rel="noreferrer noopener nofollow">Reinforcement Learning Specialization (Coursera)</a></strong> &#8211; One of the best courses available in the market. With a total rating of 4.8 stars and 21000+ students already enrolled, this course will help you master the concepts of reinforcement learning. You will learn how to implement a complete RL solution and take note of its application to solve real-world problems. By the end of this course,&nbsp; you will be able to formalize tasks as a reinforcement learning problem and its due solutions, understand the concepts of RL algorithms, and how RL fits under the broader umbrella of machine learning.</p>



<p><strong>2.</strong> <strong><a href="https://click.linksynergy.com/deeplink?id=vedj0cWlu2Y&amp;mid=39197&amp;u1=ddreinforcementlearning4&amp;murl=https%3A%2F%2Fwww.udemy.com%2Fcourse%2Fartificial-intelligence-reinforcement-learning-in-python%2F" rel="noreferrer noopener nofollow" target="_blank">Reinforcement Learning in Python (Udemy)</a> </strong>&#8211;<strong> </strong>This is a premium course offered by Udemy at the price of 29.99 USD. It has a rating of 4.5 stars overall with more than 39,000 learners enrolled. This course is a learning playground for those who are seeking to implement an AI solution with reinforcement learning engaged in Python programming. Through theoretical and practical implementations, you will learn to apply gradient-based supervised machine learning methods to reinforcement learning, programming implementations of numerous reinforcement learning algorithms, and also know the relationship between RL and psychology.</p>



<p><strong>3.</strong> <strong><a href="https://click.linksynergy.com/deeplink?id=vedj0cWlu2Y&amp;mid=40328&amp;u1=ddreinforcementlearning5&amp;murl=https%3A%2F%2Fwww.coursera.org%2Flearn%2Fpractical-rl" target="_blank" rel="noreferrer noopener nofollow">Practical Reinforcement Learning (Coursera)</a></strong> &#8211; With a rating of 4.2,&nbsp; and 37,000+learners, this course is the essential section of the Advanced Machine Learning Specialization. You are guaranteed to get knowledge of practical implementation of RL algorithms. You’ll get insights on the foundations of RL methods, and using neural network technologies for RL. One interesting part is training neural networks to play games on their own using RL.</p>



<p><strong>4. <a href="https://www.pluralsight.com/courses/understanding-algorithms-reinforcement-learning?aid=7010a000002LUv7AAG&amp;promo=&amp;utm_source=non_branded&amp;utm_medium=digital_paid_search_google&amp;utm_campaign=XYZ_APAC_Dynamic&amp;utm_content=&amp;gclid=Cj0KCQjwoJX8BRCZARIsAEWBFMJrW7gzrS94r_hfE0HJkb2JcGiOCPoL0SfrvNZSvGaYD-U9GJZKkdwaAjQFEALw_wcB" target="_blank" rel="noreferrer noopener nofollow">Understanding Algorithms for Reinforcement Learning</a></strong> &#8211; If you are a total beginner in the field of Reinforcement learning then this might be the best course for you. With an overall rating of 4.0 stars and a duration of nearly 3 hours in the PluralSight platform, this course can be a quick way to get yourself started with reinforcement learning algorithms. You’ll get deep information on algorithms for reinforcement learning, basic principles of reinforcement learning algorithms, RL taxonomy, and RL family algorithms such as Q-learning and SARSA.</p>



<p><strong>5. <a href="http://www.anrdoezrs.net/links/7964208/type/dlg/sid/ddreinforcementlearning/https://www.udacity.com/course/reinforcement-learning--ud600" rel="noreferrer noopener nofollow" target="_blank">Reinforcement Learning by Georgia Tech (Udacity)</a></strong> &#8211; One of the best free courses available, offered by Georgia Tech through the Udacity platform. The course is formulated for those seeking to understand the world of Machine learning and Artificial Intelligence from a theoretical perspective. It provides rich insights into recent research on reinforcement learning, which will help you explore automated decision-making models.</p>



<p><strong>6. <a href="http://web.stanford.edu/class/cs234/index.html" target="_blank" rel="noreferrer noopener nofollow">Reinforcement Learning Winter (Stanford Education)</a></strong> &#8211; This course is provided by Stanford University as a winter session. There are some basic requirements for the course, such as Python programming proficiency, knowledge of linear algebra and calculus, basics of statistics and probability, and basics of machine learning. This course provides state of the art lectures. In the end, you will be able to define key features of RL, applications of RL on real-world problems, coding implementations of RL algorithms, and have deep knowledge of RL algorithms. This course is suited for those seeking advanced-level learning resources on the RL ecosystem.</p>



<p><strong>7. <a href="https://click.linksynergy.com/deeplink?id=Fh5UMknfYAU&amp;mid=39197&amp;u1=coursesityBlog&amp;murl=https%3A%2F%2Fwww.udemy.com%2Fcourse%2Fdeep-reinforcement-learning-in-python%2F" rel="noreferrer noopener nofollow" target="_blank">Advanced AI: Deep Reinforcement Learning with Python</a></strong> &#8211; If you are looking for a high-level advanced course on Reinforcement learning, then this is no doubt the best course available in the Udemy platform for you. This is a premium course with a price tag of 29.99 USD, a rating of 4.6 stars, entertaining more than 32,000 students across the world. It is not just about reinforcement learning at the foundation level, but also deep reinforcement learning with its practical implementation using Python programming. The practical implementations of deep learning agents, Q-learning algorithms, deep neural networks, RBF networks, convolutional neural networks with deep Q-learning are the prime grabs of this course.</p>



<p><strong>8. <a href="https://click.linksynergy.com/deeplink?id=BuGceriufQM&amp;mid=40328&amp;u1=coursesityBlog&amp;murl=https%3A%2F%2Fwww.coursera.org%2Flearn%2Fpractical-rl" rel="noreferrer noopener nofollow" target="_blank">Practical Reinforcement Learning</a></strong> &#8211; Another popular course offered by Coursera, best for those looking for practical knowledge of reinforcement learning. It has a total rating of 4.2 stars with more than 37,000 students already enrolled.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-what-are-you-waiting-for-start-learning">What are you waiting for? Start learning!</h2>



<p>Hopefully, these resources will help you get a deep understanding of reinforcement learning, and its practical applications in the real world.</p>



<p>RL is a fascinating part of machine learning, and it’s worth spending your time on it to master it. Good luck!</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2673</post-id>	</item>
		<item>
		<title>10 Real-Life Applications of Reinforcement Learning</title>
		<link>https://neptune.ai/blog/reinforcement-learning-applications</link>
		
		<dc:creator><![CDATA[Derrick Mwiti]]></dc:creator>
		<pubDate>Thu, 21 Jul 2022 10:09:07 +0000</pubDate>
				<category><![CDATA[Reinforcement Learning]]></category>
		<guid isPermaLink="false">https://neptune.test/reinforcement-learning-applications/</guid>

					<description><![CDATA[In Reinforcement Learning (RL), agents are trained on a reward and punishment mechanism. The agent is rewarded for correct moves and punished for the wrong ones. In doing so, the agent tries to minimize wrong moves and maximize the right ones.&#160; In this article, we’ll look at some of the real-world applications of reinforcement learning.&#8230;]]></description>
										<content:encoded><![CDATA[
<p>In <strong>Reinforcement Learning (RL)</strong>, agents are trained on a <strong>reward</strong> and <strong>punishment</strong> mechanism. The <strong>agent</strong> is rewarded for correct moves and punished for the wrong ones. In doing so, the agent tries to minimize wrong moves and maximize the right ones.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="1200" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/12/Reinforcement-Learning-RL.png?resize=1200%2C1200&#038;ssl=1" alt="In Reinforcement Learning (RL), agents are trained on a reward and punishment mechanism. The agent is rewarded for correct moves and punished for the wrong ones. In doing so, the agent tries to minimize wrong moves and maximize the right ones." class="wp-image-43002" style="width:460px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/12/Reinforcement-Learning-RL.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/12/Reinforcement-Learning-RL.png?resize=768%2C768&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/12/Reinforcement-Learning-RL.png?resize=200%2C200&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/12/Reinforcement-Learning-RL.png?resize=220%2C220&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/12/Reinforcement-Learning-RL.png?resize=120%2C120&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/12/Reinforcement-Learning-RL.png?resize=88%2C88&amp;ssl=1 88w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/12/Reinforcement-Learning-RL.png?resize=44%2C44&amp;ssl=1 44w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/12/Reinforcement-Learning-RL.png?resize=160%2C160&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/12/Reinforcement-Learning-RL.png?resize=300%2C300&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/12/Reinforcement-Learning-RL.png?resize=480%2C480&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/12/Reinforcement-Learning-RL.png?resize=1020%2C1020&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/12/Reinforcement-Learning-RL.png?resize=100%2C100&amp;ssl=1 100w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Modified based on: <a href="https://upload.wikimedia.org/wikipedia/commons/1/1b/Reinforcement_learning_diagram.svg" target="_blank" rel="noreferrer noopener nofollow"><em>source</em></a></figcaption></figure>
</div>
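<p>As a tiny, self-contained illustration of this reward-and-punishment loop (a toy example, not code from any of the applications below), consider a two-armed bandit where one arm pays off more often than the other. The agent keeps a running value estimate per arm and gradually learns to prefer the better one:</p>



<pre class="wp-block-code"><code>import random

# Two-armed bandit: arm 1 pays off more often than arm 0. The agent is
# rewarded (+1) for good pulls and punished (-1) for bad ones, and it
# gradually learns to prefer the better arm.
def pull(arm):
    p_win = 0.8 if arm == 1 else 0.3
    return random.choices([1.0, -1.0], weights=[p_win, 1 - p_win])[0]

values, counts = [0.0, 0.0], [0, 0]      # running estimate of each arm's value
for _ in range(1000):
    best = values.index(max(values))     # exploit the best-looking arm...
    arm = random.choices([best, random.randrange(2)], weights=[0.9, 0.1])[0]  # ...but explore 10% of the time
    reward = pull(arm)
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # incremental average

print(values)   # values[1] should end up clearly higher than values[0]</code></pre>



<p>Every application below follows the same loop, just with far richer states, actions, and reward signals.</p>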


<p>In this article, we’ll look at some of the real-world applications of reinforcement learning.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-applications-in-self-driving-cars">Applications in self-driving cars</h2>



<p>Various papers have proposed<a href="https://arxiv.org/pdf/2002.00444.pdf" target="_blank" rel="noreferrer noopener nofollow"> Deep Reinforcement Learning</a> for <strong>autonomous driving</strong>. In self-driving cars, there are various aspects to consider, such as speed limits at various places, drivable zones, avoiding collisions — just to mention a few.&nbsp;</p>



<p>Some of the autonomous driving tasks where reinforcement learning could be applied include trajectory optimization, motion planning, dynamic pathing, controller optimization, and scenario-based learning policies for highways.&nbsp;</p>



<p>For example, parking can be achieved by learning automatic parking policies. Lane changing can be achieved using<a href="https://towardsdatascience.com/simple-reinforcement-learning-q-learning-fcddc4b6fe56" target="_blank" rel="noreferrer noopener nofollow"> Q-Learning</a> while overtaking can be implemented by learning an overtaking policy while avoiding collision and maintaining a steady speed thereafter.</p>



<p><a href="https://aws.amazon.com/fr/deepracer/" target="_blank" rel="noreferrer noopener nofollow">AWS DeepRacer</a> is an autonomous racing car that has been designed to test out RL in a physical track. It uses cameras to visualize the runway and a reinforcement learning model to control the throttle and direction.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Robot.png?ssl=1" alt="" class="wp-image-20511" style="width:747px;height:596px"/><figcaption class="wp-element-caption"><em><a href="https://aws.amazon.com/fr/deepracer/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>Wayve.ai has successfully applied reinforcement learning to teach a car to <strong>drive in a day.</strong> They used a deep reinforcement learning algorithm to tackle the lane-following task. Their network architecture was a deep network with 4 convolutional layers and 3 fully connected layers. The example below shows the lane-following task. The image in the middle represents the driver&#8217;s perspective.</p>



<figure class="wp-block-video aligncenter"><video height="188" style="aspect-ratio: 750 / 188;" width="750" autoplay loop muted src="https://neptune.ai/wp-content/uploads/2022/11/10-Real-Life-Applications-of-Reinforcement-Learning.mp4"></video><figcaption class="wp-element-caption"><a href="https://wayve.ai/blog/learning-to-drive-in-a-day-with-reinforcement-learning/" target="_blank" rel="noreferrer noopener nofollow"><strong>Source</strong></a></figcaption></figure>



<section id="blog-intext-cta-block_50ee1eec857b887852464c4b4c9a0fcc" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" class="block-blog-intext-cta__header" id="h-read-more">Read more </h3>
    
            <p><a href="/blog/self-driving-cars-with-convolutional-neural-networks-cnn" target="_blank" rel="noopener">Self-Driving Cars With Convolutional Neural Networks (CNN)</a></p>
    
    </section>



<div id="separator-block_54e14ae7492eec50bbf3d8fd1970aff2"
         class="block-separator block-separator--15">
</div>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-industry-automation-with-reinforcement-learning">Industry automation with Reinforcement Learning</h2>



<p>In industry, reinforcement learning-based <strong>robots</strong> are used to perform various tasks. Apart from being more efficient than human beings, these robots can also perform tasks that would be dangerous for people.</p>



<p>A great example is the use of AI agents by <a href="https://deepmind.com/blog/article/safety-first-ai-autonomous-data-centre-cooling-and-industrial-control" target="_blank" rel="noreferrer noopener nofollow">DeepMind to cool Google Data Centers</a>. This led to a 40% reduction in <strong>energy spending</strong>. The centers are now fully controlled by the AI system without the need for human intervention, although data center experts still supervise it. The system works in the following way:</p>



<ul class="wp-block-list">
<li>Taking snapshots of data from the data centers every five minutes and feeding this to deep neural networks</li>



<li>Predicting how different combinations of actions will affect future energy consumption</li>



<li>Identifying actions that will lead to minimal power consumption while maintaining a set standard of safety criteria&nbsp;</li>



<li>Sending these actions to the data center for implementation</li>
</ul>



<p>The actions are then verified by the local control system.</p>
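<p>A rough sketch of this loop is shown below. Every name and the toy energy model are hypothetical placeholders rather than any real DeepMind or Google API; the point is only the step that picks the action with the lowest predicted energy use:</p>



<pre class="wp-block-code"><code># A minimal sketch of the control loop described above. All names and the toy
# "energy model" are hypothetical placeholders, not a real DeepMind or Google
# API; the point is only the pick-the-lowest-predicted-energy step.

def get_snapshot():
    return {"outside_temp": 18.0, "load": 0.7}            # fake sensor data

def predict_energy(snapshot, setting):
    # stand-in for the deep neural network's energy prediction
    return snapshot["load"] * 100 + abs(setting["pump_speed"] - 0.5) * 40

def is_safe(snapshot, setting):
    return round(setting["pump_speed"], 1) != 0.1          # toy safety criterion

candidate_settings = [{"pump_speed": s / 10} for s in range(1, 10)]

snapshot = get_snapshot()                                  # 1. take a snapshot
scored = [(predict_energy(snapshot, s), s)                 # 2. predict energy use
          for s in candidate_settings
          if is_safe(snapshot, s)]                         # 3. keep safe actions only
best_cost, best_setting = min(scored, key=lambda item: item[0])   # 4. minimal power
print(best_cost, best_setting)   # sent to the data center, verified by the local control system</code></pre>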



<h2 class="wp-block-heading" class="wp-block-heading" id="h-reinforcement-learning-applications-in-trading-and-finance">Reinforcement Learning applications in trading and finance</h2>



<p>Supervised <a href="https://arxiv.org/ftp/arxiv/papers/1803/1803.03916.pdf" target="_blank" rel="noreferrer noopener nofollow"><strong>time series</strong></a> models can be used for predicting future sales as well as predicting <strong>stock prices</strong>. However, these models don’t determine the action to take at a particular stock price. Enter Reinforcement Learning (RL). An RL agent can decide on such a task: whether to hold, buy, or sell. The RL model is evaluated against market benchmarks to ensure that it&#8217;s performing optimally.</p>



<p>This automation brings consistency into the process, unlike previous methods where analysts had to make every single decision. <a href="https://medium.com/inside-machine-learning/reinforcement-learning-the-business-use-case-part-2-c175740999" target="_blank" rel="noreferrer noopener nofollow">IBM</a>, for example, has a sophisticated reinforcement learning-based platform that makes <strong>financial trades</strong>. It computes the reward function based on the loss or profit of every financial transaction.</p>
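<p>A reward of this kind is straightforward to express. The function below is a generic illustration, not IBM&#8217;s actual platform; the optional fee term is an extra assumption added here:</p>



<pre class="wp-block-code"><code>def trade_reward(entry_price, exit_price, position, fee_per_trade=0.0):
    """Reward for one closed trade: its profit or loss in currency units.
    position is +1 for a long trade and -1 for a short trade."""
    pnl = (exit_price - entry_price) * position
    return pnl - fee_per_trade

# Example: bought at 100, sold at 104, paying 0.5 in fees.
print(trade_reward(100.0, 104.0, position=1, fee_per_trade=0.5))   # 3.5</code></pre>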



<h2 class="wp-block-heading" class="wp-block-heading" id="h-reinforcement-learning-in-nlp-natural-language-processing">Reinforcement Learning in NLP (Natural Language Processing)</h2>



<p>In NLP, RL can be used in <strong>text summarization</strong>, <strong>question answering,</strong> and <strong>machine translation </strong>just to mention a few.&nbsp;</p>



<p>The authors of this <a href="https://homes.cs.washington.edu/~eunsol/papers/acl17eunsol.pdf" target="_blank" rel="noreferrer noopener nofollow">paper</a>, Eunsol Choi, Daniel Hewlett, and Jakob Uszkoreit, propose an RL-based approach for question answering over long texts. Their method works by first selecting a few sentences from the document that are relevant for answering the question. A slow RNN is then employed to produce answers from the selected sentences.</p>


<div class="wp-block-image">
<figure class="aligncenter is-resized"><img decoding="async" src="https://lh5.googleusercontent.com/DfwsYUqts8aaoO5f2OAcvB2x9pfk7P4uHbdKzNo6YmJSxRATnUOjHtNTw8G0ZKuBRH6IErycI05xq-JHh8GGaqXKQyHxSxW4A5RBwbq3nSE2zIKeAZtBrs-ovmb_rUnL9KM5Zq3W" alt="" style="width:462px;height:379px"/><figcaption class="wp-element-caption"><a href="https://homes.cs.washington.edu/~eunsol/papers/acl17eunsol.pdf" target="_blank" rel="noreferrer noopener nofollow"><em>Source</em></a></figcaption></figure>
</div>


<p>A combination of supervised and reinforcement learning is used for <a href="https://arxiv.org/pdf/1705.04304.pdf" target="_blank" rel="noreferrer noopener nofollow">abstractive text summarization in this paper</a> by Romain Paulus, Caiming Xiong &amp; Richard Socher. Their goal is to solve the problems that attentional, RNN-based encoder-decoder models face when <strong>summarizing</strong> longer documents. The authors propose a neural network with a novel intra-attention mechanism that attends over the input and the continuously generated output separately. Their training method is a combination of standard supervised word prediction and reinforcement learning.</p>


<div class="wp-block-image">
<figure class="aligncenter is-resized"><img decoding="async" src="https://lh3.googleusercontent.com/_4nAg1u50WPusd5jjuO6N3agTPuQn2GKBBYSUljs-d9PLUjjFZ3OWhZCkCLliyqX1y8lmBa4GJmz_TRtc61gH979pSgs1pBzJ-0YuZFOde0Hu7-D8qH3Oei_r1yDDTuhV7IDpHwr" alt="" style="width:828px;height:470px"/><figcaption class="wp-element-caption"><em><a href="https://arxiv.org/pdf/1705.04304.pdf" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>
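<p>The way the two objectives are blended can be sketched in a few lines. The snippet below is a simplified illustration of such a mixed loss, with toy numbers standing in for the real token log-probabilities and ROUGE-based rewards:</p>



<pre class="wp-block-code"><code>import torch

def mixed_summarization_loss(logp_sampled, reward_sampled, reward_greedy,
                             ml_loss, gamma=0.98):
    """Sketch of a mixed objective: a self-critical policy-gradient term
    blended with the usual teacher-forced maximum-likelihood loss.
    gamma and all reward values are placeholders supplied by the caller."""
    # Encourage sampled summaries that beat the greedy baseline (e.g. in ROUGE).
    rl_loss = (reward_greedy - reward_sampled) * logp_sampled.sum()
    return gamma * rl_loss + (1.0 - gamma) * ml_loss

# Toy numbers only, to show how the two terms combine.
logp = torch.tensor([-2.1, -0.7, -1.3])        # log-probs of the sampled tokens
loss = mixed_summarization_loss(logp, reward_sampled=0.42, reward_greedy=0.37,
                                ml_loss=torch.tensor(3.0))
print(loss)</code></pre>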


<p>On the side of machine translation, <a href="http://users.umiacs.umd.edu/~jbg/docs/2014_emnlp_simtrans.pdf" target="_blank" rel="noreferrer noopener nofollow">authors from the University of Colorado and the University of Maryland</a>, propose a reinforcement learning based approach to simultaneous <strong>machine translation</strong>. The interesting thing about this work is that it has the ability to learn when to trust the predicted words and uses RL to determine when to wait for more input. </p>


<div class="wp-block-image">
<figure class="aligncenter is-resized"><img decoding="async" src="https://lh3.googleusercontent.com/rh-ZDbESVu96CqAz3q41EHJ3oC1mo2ARE7tHT0arpuFCHC1jwQ0wJ2RpawrAQJh1OaHRD6jRD6mflPBFXhQg9CHY_6UKMFdNJdGYzNZkh-Zq-3O3chnXVt177AocTRDkiuai9rer" alt="" style="width:870px;height:488px"/><figcaption class="wp-element-caption"><em><a href="http://users.umiacs.umd.edu/~jbg/docs/2014_emnlp_simtrans.pdf" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>Researchers from Stanford University, Ohio State University, and Microsoft Research have proposed deep RL for <a href="https://arxiv.org/pdf/1606.01541.pdf" target="_blank" rel="noreferrer noopener nofollow">dialogue generation</a>. Deep RL can be used to model future rewards in a <strong>chatbot dialogue.</strong> Conversations are simulated using two virtual agents. Policy gradient methods are used to reward sequences that contain important conversation attributes such as coherence, informativity, and ease of answering.</p>


<div class="wp-block-image">
<figure class="aligncenter is-resized"><img decoding="async" src="https://lh4.googleusercontent.com/WmQq-UQbtHLVD_WABBT1s-z1UCH1iQZEQGaA_6XTa2Qg7yOZHKFGcCRJsGi1o4QyplIAQVwftsx9WJGSatQ7-sK9HfTkdtjNiyjUfZLS2EU3QMqcp3o9c-nEjDc8nLeWHlu6xoAW" alt="" style="width:875px;height:416px"/><figcaption class="wp-element-caption"><em><a href="https://arxiv.org/pdf/1606.01541.pdf" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>More NLP applications can be found <a href="https://github.com/adityathakker/awesome-rl-nlp" target="_blank" rel="noreferrer noopener nofollow">here</a> or <a href="https://www.future-processing.com/blog/the-future-of-natural-language-processing/" target="_blank" rel="noreferrer noopener">here</a>.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-reinforcement-learning-applications-in-healthcare">Reinforcement Learning applications in healthcare</h2>



<p>In healthcare, patients can <strong>receive treatment</strong> based on policies learned by RL systems. RL is able to find optimal policies from previous experience without needing prior information about the mathematical model of the biological system. This makes the approach more applicable in healthcare than many other control-based systems.</p>



<p>RL in healthcare is categorized as <a href="https://arxiv.org/pdf/1908.08796.pdf" target="_blank" rel="noreferrer noopener nofollow">dynamic treatment regimes (DTRs)</a> in chronic disease or critical care, automated medical diagnosis, and other general domains.</p>


<div class="wp-block-image">
<figure class="aligncenter"><img decoding="async" src="https://lh5.googleusercontent.com/ZWo0AlDOtPVZWCZqq0qNFk8pHcQo2qDvodE5Xm0lDx0wPjfb-HVXasBwWKRIJ4LEXrDxwjBmx8_sBb0xO5M90cmCigYC0kEl9dUgg3A4TCtCXyaiC227UVvhEdKFlUjU2UVxnYIy" alt=""/><figcaption class="wp-element-caption"><a href="https://arxiv.org/pdf/1908.08796.pdf" target="_blank" rel="noreferrer noopener nofollow"><em>Source</em></a></figcaption></figure>
</div>


<p>In DTRs, the input is a set of clinical observations and assessments of a patient, similar to states in RL; the outputs are the treatment options for every stage. Applying RL to DTRs is advantageous because it can determine time-dependent decisions for the best treatment of a patient at a specific point in time.</p>



<p>The <a href="https://arxiv.org/pdf/1908.08796.pdf" target="_blank" rel="noreferrer noopener nofollow">use of RL in healthcare </a>also enables improvement of long-term outcomes by factoring the delayed effects of treatments.&nbsp;</p>



<p>RL has also been used for the discovery and generation of optimal DTRs for chronic diseases.&nbsp;</p>



<p>You can dive deeper into RL applications in healthcare by exploring this<a href="https://arxiv.org/pdf/1908.08796.pdf" target="_blank" rel="noreferrer noopener nofollow"> paper</a>. </p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-reinforcement-learning-applications-in-engineering">Reinforcement Learning applications in engineering</h2>



<p>On the engineering frontier, Facebook has developed an <strong>open-source reinforcement learning platform</strong> &#8211; <a href="https://engineering.fb.com/ml-applications/horizon/" target="_blank" rel="noreferrer noopener nofollow">Horizon</a>. The platform uses reinforcement learning to optimize large-scale production systems. Facebook has used Horizon internally:</p>



<ul class="wp-block-list">
<li>to personalize suggestions</li>



<li>deliver more meaningful notifications to users</li>



<li>optimize video streaming quality.&nbsp;</li>
</ul>



<p>Horizon also contains workflows for:</p>



<ul class="wp-block-list">
<li>simulated environments</li>



<li>a distributed platform for data preprocessing</li>



<li>training and exporting models in production.&nbsp;</li>
</ul>



<p>A classic example of reinforcement learning in video display is serving a user a low or high bit rate video based on the state of the video buffers and estimates from other machine learning systems.&nbsp;</p>
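<p>As a toy illustration of that decision (hypothetical code, not Horizon&#8217;s actual API), the state can be a coarse bucketing of the buffer level and the bandwidth estimate, and a learned value table picks the bitrate:</p>



<pre class="wp-block-code"><code># Hypothetical sketch of the bitrate decision, not Horizon's actual API.
# The state combines the playback buffer level with a bandwidth estimate
# coming from another ML system; a learned value table picks the bitrate.

ACTIONS = ["low_bitrate", "high_bitrate"]

def choose_bitrate(q_table, buffer_seconds, est_bandwidth_mbps):
    # discretize the state into coarse buckets (an illustrative assumption)
    state = (min(int(buffer_seconds) // 10, 2), min(int(est_bandwidth_mbps) // 5, 2))
    q_values = q_table.get(state, [0.0, 0.0])
    best = max(range(len(ACTIONS)), key=lambda a: q_values[a])
    return ACTIONS[best]

# Toy table: with a healthy buffer and decent bandwidth, prefer the high bitrate.
q_table = {(2, 1): [0.1, 0.9], (0, 0): [0.8, 0.2]}
print(choose_bitrate(q_table, buffer_seconds=25.0, est_bandwidth_mbps=8.0))</code></pre>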



<p>Horizon is capable of handling production-like concerns such as:</p>



<ul class="wp-block-list">
<li>deploying at scale</li>



<li>feature normalization</li>



<li>distributed learning</li>



<li>serving and handling datasets with high-dimensional data and thousands of feature types.&nbsp;</li>
</ul>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-reinforcement-learning-in-news-recommendation">Reinforcement Learning in news recommendation</h2>



<p>User preferences can change frequently, so <a href="http://www.personal.psu.edu/~gjz5038/paper/www2018_reinforceRec/www2018_reinforceRec.pdf" target="_blank" rel="noreferrer noopener nofollow"><strong>recommending news</strong></a> to users based on past reviews and likes can quickly become obsolete. With reinforcement learning, the system can instead track the reader’s return behavior.</p>



<p>Construction of such a system would involve obtaining news features, reader features, context features, and reader-news interaction features. News features include, but are not limited to, the content, headline, and publisher. Reader features describe how the reader interacts with the content, e.g., clicks and shares. Context features cover aspects such as the timing and freshness of the news. A reward is then defined based on these user behaviors, for example as sketched below.</p>
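<p>The exact reward design is up to the system builder. The sketch below uses made-up weights for clicks, shares, and return visits; it is an illustrative assumption, not something taken from the linked paper:</p>



<pre class="wp-block-code"><code># A hedged sketch of how such a reward could be assembled from user behaviour.
# The weights and the "return within a day" bonus are illustrative assumptions.

def recommendation_reward(clicked, shared, returned_within_a_day,
                          w_click=1.0, w_share=2.0, w_return=3.0):
    reward = 0.0
    if clicked:
        reward += w_click
    if shared:
        reward += w_share
    if returned_within_a_day:           # the return behavior the article mentions
        reward += w_return
    return reward

print(recommendation_reward(clicked=True, shared=False, returned_within_a_day=True))   # 4.0</code></pre>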



<h2 class="wp-block-heading" class="wp-block-heading" id="h-reinforcement-learning-in-gaming">Reinforcement Learning in gaming&nbsp;</h2>



<p>Let’s look at an application on the <strong>gaming</strong> frontier, specifically <strong>AlphaGo Zero</strong>. Using reinforcement learning, AlphaGo Zero was able to learn the game of Go from scratch. It learned by playing against itself. After 40 days of self-training, AlphaGo Zero was able to outperform the version of AlphaGo known as <em>Master</em>, which had defeated <a href="https://deepmind.com/alphago-china" target="_blank" rel="noreferrer noopener nofollow">world number one Ke Jie</a>. It only used the black and white stones on the board as input features and a single neural network. A simple tree search that relies on this single neural network is used to evaluate positions and sample moves, without any <a href="https://en.wikipedia.org/wiki/Monte_Carlo_method" target="_blank" rel="noreferrer noopener nofollow">Monte Carlo</a> rollouts.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-real-time-bidding-reinforcement-learning-applications-in-marketing-and-advertising">Real-time bidding— Reinforcement Learning applications in marketing and advertising</h2>



<p>In this<a href="https://arxiv.org/pdf/1802.09756.pdf" target="_blank" rel="noreferrer noopener nofollow"> paper</a>, the authors propose <strong>real-time bidding</strong> with multi-agent reinforcement learning. The handling of a large number of advertisers is dealt with using a clustering method and assigning each cluster a strategic bidding agent. To balance the trade-off between the competition and cooperation among advertisers, a Distributed Coordinated Multi-Agent Bidding (DCMAB) is proposed.&nbsp;</p>



<p>In marketing, the ability to accurately target an individual is very crucial. This is because the right targets obviously lead to a high return on investment. The study in this paper was based on<a href="http://taobao.com"> Taobao</a> — the largest e-commerce platform in China. The proposed method outperforms the state-of-the-art single-agent reinforcement learning approaches.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-reinforcement-learning-in-robotics-manipulation">Reinforcement Learning in robotics manipulation</h2>



<p>The use of deep learning and reinforcement learning<a href="https://ai.googleblog.com/2018/06/scalable-deep-reinforcement-learning.html" target="_blank" rel="noreferrer noopener nofollow"> can train robots</a> that have the ability to grasp various objects — even those unseen during training. This can, for example, be used in building products in an assembly line.&nbsp;</p>



<p>This is achieved by combining large-scale distributed optimization with a variant of <a href="https://en.wikipedia.org/wiki/Q-learning" target="_blank" rel="noreferrer noopener nofollow">deep Q-Learning</a> called <a href="https://arxiv.org/abs/1806.10293" rel="nofollow">QT-Opt</a>. QT-Opt’s support for continuous action spaces makes it well suited to robotics problems. A model is first trained offline and then deployed and fine-tuned on the real robot.</p>



<p>Google AI applied this approach to<strong> robotics grasping</strong> where 7 real-world robots ran for 800 robot hours in a 4-month period.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter is-resized"><img decoding="async" src="https://lh6.googleusercontent.com/0a1J03d46EEXQYMq9bJQJsm5v9O-0i6qCxucr2UO0yrNybD3ByGqLBUh0KcZvtG62WUQo0FxF2VperyzWgFTQ6WO5FoRBsR5iLrnCe40v0DL-1CvUrWoC_b7AdflG_ttah5VRD3K" alt="" style="width:804px;height:451px"/><figcaption class="wp-element-caption"><a href="https://www.youtube.com/watch?v=W4joe3zzglU" target="_blank" rel="noreferrer noopener"><em>Source</em></a></figcaption></figure>
</div>


<p>In <a href="https://www.youtube.com/watch?v=W4joe3zzglU" target="_blank" rel="noreferrer noopener nofollow">this experiment</a>, the QT-Opt approach succeeds in 96% of the grasp attempts across 700 trials grasps on objects that were previously unseen. Google AI’s previous method had a 78% success rate.&nbsp;</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-final-thoughts">Final thoughts</h2>



<p>While reinforcement learning is still a very active research area, significant progress has been made to advance the field and apply it in real life.</p>



<p>In this article, we have barely scratched the surface as far as application areas of reinforcement learning are concerned. Hopefully, this has sparked some curiosity that will drive you to dive a little deeper into this area. If you want to learn more, check out this <a href="https://github.com/aikorea/awesome-rl" target="_blank" rel="noreferrer noopener nofollow">awesome repo</a> (no pun intended) and <a href="https://github.com/dennybritz/reinforcement-learning" target="_blank" rel="noreferrer noopener nofollow">this one</a> as well.</p>



]]></content:encoded>
					
		
		<enclosure url="https://neptune.ai/wp-content/uploads/2022/11/10-Real-Life-Applications-of-Reinforcement-Learning.mp4" length="359487" type="video/mp4" />

		<post-id xmlns="com-wordpress:feed-additions:1">3637</post-id>	</item>
	</channel>
</rss>