Bachelor's Thesis, 2021
29 pages, Grade: 1.3
1 Introduction
1.1 Outline
2 Basics of MDPs and RL
2.1 Markov Decision Processes
2.1.1 Markov Process
2.2 Value Function
2.3 Policy Iteration
2.4 Reinforcement Learning
2.4.1 Monte Carlo Learning
2.4.2 Temporal Difference Learning
3 Cleaning Robot Application
3.1 Introduction
3.2 Solving via Value Iteration
3.3 Solving via Monte Carlo Learning
3.4 Solving via Q-Learning
3.5 Comparison of Results
4 Discussion
This thesis investigates the mathematical foundations of Markov Decision Processes (MDPs) and Reinforcement Learning (RL) algorithms. The primary goal is to evaluate different learning strategies—specifically Value Iteration, Monte Carlo Learning, and Q-Learning—by applying them to a practical simulation of a cleaning robot navigating an apartment to locate dust accumulation.
3.5 Comparison of Results
As the previous sections showed, all three methods for solving the given MDP (reaching the room where the dust accumulates) converge to the optimal policy π*. Since they have different requirements and operate differently, we briefly review the pros and cons of each method.
With Value Iteration we achieved quite good results, and convergence to the optimal solution did not incur high computational cost. On the other hand, we had to provide the agent with the dynamics, rewards, and terminal states of the environment in advance. Such a method is called model-based: the agent knows everything about the environment before it starts solving a problem in it. Consequently, Value Iteration becomes useless when we are given an environment about which we know nothing.
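The core of Value Iteration can be sketched in a few lines. The corridor-shaped 7-room MDP below is an illustrative assumption, not the thesis's exact apartment layout: rooms 0–6 sit in a row, the agent can move left or right, and reaching the terminal room 6 (the dust location) yields reward 1.

```python
# Minimal Value Iteration sketch on a hypothetical 7-room corridor MDP.
# Rooms 0..6; actions: move left/right; room 6 is terminal with reward 1.
def value_iteration(n_states=7, gamma=0.9, tol=1e-8):
    V = [0.0] * n_states
    terminal = n_states - 1
    while True:
        delta = 0.0
        for s in range(n_states):
            if s == terminal:
                continue
            # Bellman optimality backup: best one-step lookahead value
            candidates = []
            for ns in (max(s - 1, 0), min(s + 1, terminal)):
                r = 1.0 if ns == terminal else 0.0
                candidates.append(r + gamma * V[ns])
            best = max(candidates)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:          # stop once the values have converged
            return V
```

Note that the transition model and rewards are hard-coded into the loop: this is exactly the model-based dependence discussed above.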
Then we introduced Monte Carlo Learning. Here, learning means gathering experience by interacting with the surroundings, which removes the need for prior knowledge of the environment. Learning methods of this kind are model-free: the agent needs no model of the MDP.
MC methods work by simulating complete episodes that follow a specified policy π and averaging the observed returns afterwards (offline learning). The main disadvantage of MC is that an episode must finish before the state values can be updated, which is inefficient in large or infinite environments. Moreover, MC estimates have high variance, which makes them sensitive to the initial values; this can result in high computational cost to obtain quality outcomes.
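The episode-then-average structure can be sketched as first-visit MC prediction. The same hypothetical 7-room corridor as above is assumed (an illustrative stand-in for the thesis's apartment), evaluated under a uniformly random policy:

```python
import random

def mc_first_visit(episodes=2000, n_states=7, gamma=0.9, seed=0):
    """First-visit Monte Carlo prediction under a uniformly random policy
    on a hypothetical 7-room corridor; room 6 is terminal with reward 1."""
    rng = random.Random(seed)
    terminal = n_states - 1
    returns = [[] for _ in range(n_states)]
    for _ in range(episodes):
        s = rng.randrange(terminal)            # random non-terminal start
        trajectory = []                        # (state, reward) pairs
        while s != terminal:                   # must finish the whole episode
            ns = max(s - 1, 0) if rng.random() < 0.5 else s + 1
            trajectory.append((s, 1.0 if ns == terminal else 0.0))
            s = ns
        first = {}                             # index of each state's first visit
        for i, (st, _) in enumerate(trajectory):
            first.setdefault(st, i)
        G = 0.0                                # discounted return, built backwards
        for i in range(len(trajectory) - 1, -1, -1):
            st, r = trajectory[i]
            G = r + gamma * G
            if first[st] == i:                 # record first-visit returns only
                returns[st].append(G)
    # the value estimate is the average observed return per state
    return [sum(rs) / len(rs) if rs else 0.0 for rs in returns]
```

The inner `while` loop makes the drawback concrete: no value is updated until the episode terminates, and the averages converge slowly because individual returns vary widely.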
The last learning method presented was Q-Learning, a special case of temporal difference (TD) methods. TD methods can learn from incomplete episodes (online), and they have lower variance than MC methods. Additionally, Q-Learning is off-policy: the policy that generates behaviour can differ from the policy being learned, so it is possible to involve more than one policy in order to improve the learning procedure.
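A tabular Q-Learning sketch on the same hypothetical corridor makes both properties visible: updates happen after every single step (online), and the behaviour policy (ε-greedy) differs from the target policy (greedy) used in the update, which is exactly the off-policy property.

```python
import random

def q_learning(episodes=3000, n_states=7, alpha=0.1, gamma=0.9,
               epsilon=0.1, seed=0):
    """Tabular Q-Learning sketch on a hypothetical 7-room corridor.
    Actions: 0 = left, 1 = right; room 6 is terminal with reward 1."""
    rng = random.Random(seed)
    terminal = n_states - 1
    Q = [[0.0, 0.0] for _ in range(n_states)]

    def greedy(qs):                      # greedy action, ties broken at random
        m = max(qs)
        return rng.choice([a for a, q in enumerate(qs) if q == m])

    for _ in range(episodes):
        s = rng.randrange(terminal)
        while s != terminal:
            # epsilon-greedy behaviour policy explores with probability epsilon
            a = rng.randrange(2) if rng.random() < epsilon else greedy(Q[s])
            ns = max(s - 1, 0) if a == 0 else s + 1
            r = 1.0 if ns == terminal else 0.0
            # off-policy update: the target uses the greedy value max_a Q(ns, a),
            # regardless of which action the behaviour policy takes next
            Q[s][a] += alpha * (r + gamma * max(Q[ns]) - Q[s][a])
            s = ns
    return Q
```

Unlike the MC sketch, each transition updates Q immediately; no episode has to finish first.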
Introduction: Provides a high-level motivation for reinforcement learning using a simple maze analogy and outlines the two-part structure of the thesis.
Basics of MDPs and RL: Establishes the theoretical framework including Markov processes, the Bellman equation, and foundational learning algorithms.
Cleaning Robot Application: Demonstrates the practical implementation and comparison of Value Iteration, Monte Carlo Learning, and Q-Learning in a simulated home environment.
Discussion: Reviews the findings, highlights the commonalities between the algorithms, and identifies future directions such as Deep Q-Learning.
Reinforcement Learning, Markov Decision Processes, Value Iteration, Monte Carlo Learning, Q-Learning, Policy Iteration, Bellman Equation, Model-based Learning, Model-free Learning, Agent, Reward Signal, Discount Factor, Off-policy Learning, Temporal Difference, Simulation.
The thesis focuses on explaining Markov Decision Processes (MDPs) and their application through Reinforcement Learning (RL) algorithms, shifting from theoretical foundations to a practical implementation.
The thesis presents three primary methods: Value Iteration, Monte Carlo Learning, and Q-Learning.
The goal is to demonstrate how these algorithms can be used to derive an optimal policy for an agent, specifically a cleaning robot, to navigate an environment and find a target destination (the location of dust).
The author emphasizes a mathematical point of view for understanding these algorithms, rather than focusing purely on computer science implementations.
The main body covers the mathematical definitions of MDPs, the transition matrix, state-value and action-value functions, the Bellman expectation equation, and the subsequent implementation of these concepts in Python for a simulated robot.
Key terms include Reinforcement Learning, MDP, Q-Learning, Monte Carlo, Value Iteration, and optimal policy.
Value Iteration is model-based because the agent requires prior knowledge of the environment's dynamics, specifically the transition probabilities and reward functions, to compute the optimal results.
Q-Learning is highlighted for being an off-policy algorithm, having lower variance than Monte Carlo methods, and the ability to learn online from incomplete episodes.
The environment is modeled as a grid-based apartment with 7 rooms, where the agent receives a reward of 1 when reaching the child's room (where dust is present) and 0 or -1 elsewhere.
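This reward structure can be sketched as a simple lookup table. The room names below are illustrative assumptions (the thesis's exact room topology is not reproduced here); only the reward pattern of 1, 0, and -1 comes from the text.

```python
# Hypothetical reward table for a 7-room apartment; the room names are
# made up for illustration. Only the child's room (dust location) pays 1.
REWARDS = {
    "kitchen": 0, "hallway": 0, "bathroom": 0, "living_room": 0,
    "bedroom": 0,
    "storage": -1,       # an assumed penalized room, per the 0-or-(-1) pattern
    "childs_room": 1,    # dust accumulates here (terminal state)
}

def reward(room: str) -> int:
    """Return the immediate reward for entering the given room."""
    return REWARDS[room]
```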

