Reinforcement Learning: Robot Learns to Walk Without Human Intervention


A self-governing robot has long been a dream for people. Google researchers have developed an autonomous robot that teaches itself to walk without being explicitly programmed with a walking routine. The robot uses reinforcement learning (RL), an area of machine learning, to learn to walk by itself. Human trial-and-error learning is often described in similar terms.

There are different types of learning in machine learning, mainly supervised, unsupervised and reinforcement learning. In unsupervised learning, the model is trained on unlabelled data, so there are no known outputs to learn from. Because the data is unlabelled, the model has to find structure on its own, for example by grouping similar samples into clusters, so this is more closely related to clustering than to classification.

In supervised learning, the model learns to predict outputs from labelled examples, that is, from previous experience where the correct answer is known. Labelled data is what lets these models solve many real-world problems. Reinforcement learning is somewhat similar to supervised learning in that it also learns from previous experience, but it has some extra features, which are described in detail below.

What Is Reinforcement Learning?

“Reinforcement learning (RL) is learning what to do, how to map situations to actions, so as to maximize a reward signal.

The learners are not told which actions to take, but instead must discover which actions yield the most reward by trying them.”

Neural Networks & Statistical Learning

We humans take many actions, and some of them turn out to be mistakes. For example, we sip hot tea and burn our tongue. We learn from the mistake, so next time we wait for the tea to cool before sipping. Making mistakes, learning from them and correcting them the next time is reinforcement learning in action.

In reinforcement learning, an agent interacts with an environment. The environment produces a state or observation, which is passed to the agent, and the agent responds with an action. Each time the agent takes a good action, it receives a reward, as defined by a reward function.

The four main terms in reinforcement learning are explained below.


Agent

The agent receives a set of observations and decides which actions to take. The agent is rewarded for taking good actions and gets negative rewards for taking bad actions.


Environment

The environment is the place where the agent learns and acts. It defines the task that the agent should perform. In other words, wherever the agent performs its task is called the environment.


State

The state is the observation that the agent gets from the environment. The observation or state is the link between the agent and the environment. Some environments are only partially observable, meaning the agent does not get a full observation of the environment's state.


Reward

The reward tells the agent how it should behave. It is the feedback the environment returns after the agent performs a specific action. A reward can be positive or negative depending on the action.

How does Reinforcement Learning work?

Fig1: The working of Reinforcement Learning

Let’s take the common scenario of teaching a dog to sit and stand on its master’s command. Here the dog is the agent, and the place where the dog is being trained is the environment. When the master gives the “stand” command and the dog stands, the dog is given a bone as a reward. This is called a positive reward. If the dog doesn’t stand, it loses a bone, which is called a negative reward.
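The dog example can be sketched in a few lines of Python. This is only an illustrative toy, not how any real system is trained: the function names, the +1/-1 reward values and the occasional random exploration (epsilon-greedy) are all assumptions made for the sketch.

```python
import random

COMMANDS = ["sit", "stand"]   # states: the command the master gives
ACTIONS = ["sit", "stand"]    # actions: what the dog can do


def reward(command, action):
    """Assumed reward: +1 (a bone) if the dog obeys, -1 (a bone lost) otherwise."""
    return 1 if action == command else -1


def train(episodes=500, epsilon=0.1, alpha=0.5, seed=0):
    """Learn an estimated value for every (command, action) pair by trial and error."""
    rng = random.Random(seed)
    q = {(c, a): 0.0 for c in COMMANDS for a in ACTIONS}
    for _ in range(episodes):
        command = rng.choice(COMMANDS)          # the environment emits a state
        if rng.random() < epsilon:              # explore: try a random action
            action = rng.choice(ACTIONS)
        else:                                   # exploit: best action found so far
            action = max(ACTIONS, key=lambda a: q[(command, a)])
        r = reward(command, action)             # the environment returns a reward
        # Move the estimate a fraction alpha of the way towards the observed reward.
        q[(command, action)] += alpha * (r - q[(command, action)])
    return q


def best_action(q, command):
    """The action with the highest estimated value for this command."""
    return max(ACTIONS, key=lambda a: q[(command, a)])
```

After training, `best_action(train(), "stand")` picks the action that earned bones for that command. Nothing in the code tells the dog what “stand” means; the behavior comes only from the rewards.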

Reinforcement Learning Algorithm

There are three ways to implement a reinforcement learning algorithm:

  1. Model-based method
  2. Value-based method
  3. Policy-based method
Fig 2: Classification of the reinforcement learning algorithm
  • Model-based method: The agent learns a model of the environment by taking actions and observing the outcome, that is, the next state and the reward. The model then predicts the outcomes of future actions in the environment.
  • Value-based method: A value function is a function of states, or of state-action pairs, that estimates how good it is for the agent to be in a given state (or to take a given action in that state). “How good” is measured in terms of expected return, the total reward the agent can expect to collect from that point on. Because the expected return depends on how the agent acts, and the way the agent acts is determined by its policy, value functions are defined with respect to policies. There are two main value functions: the state-value function and the action-value function.
  • Policy-based method: A policy is a function that maps a given state to the probabilities of selecting each possible action in that state. We use pi to denote a policy. When speaking about a policy, we say that an agent “follows a policy”.

For example, if an agent follows a policy pi at time t, then pi(a|s) is the probability that A_t = a given that S_t = s. This means that, at time t, under policy pi, the probability of taking action a in state s is pi(a|s). For each state s in S, pi(·|s) is a probability distribution over the actions a in A(s).
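To make pi(a|s) concrete, here is a small sketch of a policy stored as a table of probabilities. The state names, action names and numbers are made up for illustration. The only real requirement is that the probabilities for each state are non-negative and sum to 1.

```python
import random

# Hypothetical policy table: pi[s][a] is the probability of action a in state s.
pi = {
    "command_sit":   {"sit": 0.9, "stand": 0.1},
    "command_stand": {"sit": 0.2, "stand": 0.8},
}


def check_policy(policy, tol=1e-9):
    """Verify that each state's entries form a probability distribution."""
    for state, dist in policy.items():
        assert all(p >= 0 for p in dist.values()), state
        assert abs(sum(dist.values()) - 1.0) < tol, state


def sample_action(policy, state, rng=random):
    """Draw an action a with probability pi(a|s) for the given state s."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]
```

An agent that “follows” this policy calls `sample_action(pi, state)` at every time step. Over many steps in the state `"command_sit"` it would choose `"sit"` about 90% of the time.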

Types of Reinforcement Learning

Operant conditioning describes the relationship between behavior and its consequences. Consequences come as either reinforcement or punishment, and each of these is divided into two types: positive and negative.


Positive Reinforcement

This works by providing the individual with a reinforcing stimulus after the desired behavior is displayed, thereby making the behavior more likely to continue in the future. For example, a boy receives 50 rupees for each subject in which he scores above 80.


Negative Reinforcement

This happens when a certain stimulus is withdrawn after a specific behavior has been displayed. Removing the unpleasant stimulus increases the probability that the behavior will happen again in the future. For example, Bob presses a button that makes a loud alarm stop.

How to Train your robot?

The goal of the control system is to determine the right actions to apply to a system so that it produces the desired behavior. The main difficulties in training a robot are gathering enough data and keeping the process safe. These are addressed by using a multi-task training procedure, an automatic reset controller and a safety-constrained reinforcement learning framework.

The high-level goal is to make a two-legged robot walk. The action is to move the body and legs of the robot correctly. The rotational force applied at a joint to move it is called torque, so each action is a set of torques. Torque is applied at the left and right ankles, the left and right knees and the left and right hips, so an action consists of 6 torques. These actions are applied to the environment, which returns the following state: body position, body velocity, body angle and angular rate, joint angles and rates, contact forces and the commanded torques.
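The action and state described above can be written down as plain data structures. The sketch below is an assumption about how one might lay them out. The field names, the tuple shapes and the torque limit are hypothetical and do not come from any particular robot or library.

```python
from dataclasses import dataclass, astuple
from typing import Tuple


@dataclass
class Action:
    """One torque per actuated joint (6 in total). Units are assumed to be N·m."""
    left_ankle: float
    right_ankle: float
    left_knee: float
    right_knee: float
    left_hip: float
    right_hip: float

    def clipped(self, limit: float) -> "Action":
        """Return a copy with every torque limited to [-limit, +limit]."""
        return Action(*(max(-limit, min(limit, t)) for t in astuple(self)))


@dataclass
class State:
    """What the environment reports back after an action is applied."""
    body_position: Tuple[float, ...]
    body_velocity: Tuple[float, ...]
    body_angle: Tuple[float, ...]
    body_angular_rate: Tuple[float, ...]
    joint_angles: Tuple[float, ...]       # one entry per joint
    joint_rates: Tuple[float, ...]        # one entry per joint
    contact_forces: Tuple[float, ...]     # for example, one per foot
    commanded_torques: Tuple[float, ...]  # the last action that was applied
```

Clipping the torques before sending them to the motors is one simple way to respect actuator limits. Whether the robot in the article does this is not stated.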

The Reward Function

Forward velocity: If the robot takes a correct step forward, it is rewarded with some value. It is not enough that the robot moves forward and doesn’t fall. We want it to take simple steps rather than hopping, and for that we have to keep the robot at walking height.

We also want to avoid dragging a leg and to make both legs do the same amount of work, so we minimize actuator effort. Finally, the robot shouldn’t stray from the path. The robot should be rewarded for meeting each of these criteria.
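The criteria above can be combined into a single number. The sketch below shows one possible way to do it, as a weighted sum. The weights, the target height and the small bonus for staying upright are assumptions chosen for illustration, not values from the Google work.

```python
def walking_reward(forward_velocity, body_height, lateral_offset, torques,
                   target_height=1.0, w_vel=1.0, w_height=5.0,
                   w_lateral=2.0, w_effort=0.01, alive_bonus=0.1):
    """Reward for one time step of walking (all weights are illustrative).

    forward_velocity: speed along the path, rewarded directly
    body_height:      penalized when it departs from walking height (no hopping)
    lateral_offset:   sideways distance from the path, penalized
    torques:          the 6 commanded torques, penalized to limit actuator effort
    """
    effort = sum(t * t for t in torques)
    return (w_vel * forward_velocity
            - w_height * (body_height - target_height) ** 2
            - w_lateral * lateral_offset ** 2
            - w_effort * effort
            + alive_bonus)          # small bonus for every step without falling
```

With these weights, a step at walking height with no wasted effort, such as `walking_reward(1.0, 1.0, 0.0, [0.0] * 6)`, scores higher than the same step taken while crouching, drifting sideways or straining the motors.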

How can the robot sense obstacles on its path?

The robot needs a sensor that can see its surroundings. A LiDAR is a laser-based range sensor that returns thousands of distance measurements per scan. We mount a LiDAR on the robot and use its readings to detect obstacles.
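As a rough sketch of how range readings could be turned into an obstacle signal, the function below checks whether any reading inside a forward-facing cone is closer than a threshold. The scan layout (a list of distances at evenly spaced angles), the parameter names and the default values are assumptions. A real LiDAR driver will have its own format.

```python
import math


def obstacle_ahead(ranges, angle_min, angle_increment,
                   max_distance=1.0, half_cone=math.radians(30)):
    """Return True if any reading inside the forward cone is closer than max_distance.

    ranges:          distances in meters, one per beam
    angle_min:       angle of the first beam in radians (0 is straight ahead)
    angle_increment: angular step between consecutive beams in radians
    """
    for i, r in enumerate(ranges):
        angle = angle_min + i * angle_increment
        # Readings of inf or NaN (no return) fail this comparison and are ignored.
        if abs(angle) <= half_cone and 0.0 < r < max_distance:
            return True
    return False
```

The result could be fed into the state, or used to add a penalty to the reward when the robot walks towards something.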

Key takeaways

  • Reinforcement Learning can solve complicated problems.
  • Q-learning and SARSA (State – Action – Reward – State – Action) are two commonly used reinforcement learning algorithms.
  • Reinforcement learning is used to build computer game players and is also applied in speech recognition, voice recognition and many other AI-related projects.
  • Reinforcement learning is used in business planning, data processing and other machine learning applications.
  • It helps in aircraft control and in the motion control of robots.
  • Reinforcement learning helps identify which situations call for an action.
  • It also helps find the behavior that yields the highest reward.
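The two algorithms named in the takeaways differ mainly in how they estimate the value of the next step. The sketch below shows the standard tabular update rules. The dictionary-based Q-table and the default learning rate (alpha) and discount factor (gamma) are illustrative choices.

```python
def q_learning_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q-learning: bootstrap from the best action available in the next state."""
    best_next = max(q.get((s_next, a2), 0.0) for a2 in actions)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (r + gamma * best_next - old)


def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """SARSA: bootstrap from the action the agent actually takes next."""
    old = q.get((s, a), 0.0)
    target = r + gamma * q.get((s_next, a_next), 0.0)
    q[(s, a)] = old + alpha * (target - old)
```

SARSA uses the full tuple (state, action, reward, next state, next action), which is where its name comes from. Q-learning ignores the next action and assumes the best one will be taken.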


  • The choice of parameters can affect the speed of the learning process.
  • A realistic environment can be only partially observable and can change over time.
  • Too much reinforcement can degrade the results.

Latest Projects Undertaken

Neolix is a Chinese self-driving vehicle that also uses reinforcement learning. It is used to transport medicine and goods to people infected with the coronavirus (COVID-19). It has a camera sensor that captures the environment in which it is driving. These vehicles are essentially ordinary robots equipped with cameras.


Deep reinforcement learning (deep RL) has turned out to be a promising way to develop different kinds of control policies for robots autonomously. Robots can perform some tasks more accurately and consistently than humans. The techniques used for walking robots can be extended to self-driving cars and other vehicles in the future. During the coronavirus outbreak, when people could not go out to buy groceries and medicines, robots and self-driving cars were used to deliver them.
