
Markov decision process


Similar to reinforcement learning, a learning automata algorithm also has the advantage of solving the problem when the probabilities or rewards are unknown. The difference between learning automata and Q-learning is that the former technique omits the memory of Q-values, but updates the action probability directly to find the learning result. Learning automata is a learning scheme with a rigorous proof of convergence.

In many cases it is difficult to represent the transition probability distributions explicitly. In such cases, a simulator can be used to model the MDP implicitly by providing samples from the transition distributions. One common form of implicit MDP model is an episodic environment simulator that can be started from an initial state and yields a subsequent state and reward every time it receives an action input. In this manner, trajectories of states, actions, and rewards, often called episodes, may be produced.

In the ergodic continuous-time model, although the decision maker can make a decision at any time in the current state, there is no benefit in taking multiple actions. It is better to take an action only at the time when the system is transitioning from the current state to another state. Under some conditions, if the optimal value function {\displaystyle V^{*}} is independent of the state {\displaystyle i}, the optimal average reward can be characterized by a linear program; its dual (D-LP) is:

{\displaystyle {\begin{aligned}{\text{Maximize}}\quad &\sum _{i\in S}\sum _{a\in A(i)}R(i,a)y(i,a)\\{\text{s.t.}}\quad &\sum _{i\in S}\sum _{a\in A(i)}q(j\mid i,a)y(i,a)=0\quad \forall j\in S,\\&\sum _{i\in S}\sum _{a\in A(i)}y(i,a)=1,\\&y(i,a)\geq 0\qquad \forall a\in A(i){\text{ and }}\forall i\in S\end{aligned}}}
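To make the D-LP concrete, the following is a minimal sketch that solves it numerically for a small two-state, two-action continuous-time MDP. The generator rates q and reward rates R are made-up illustrative numbers, not taken from the article; y(i,a) is the stationary state-action frequency, and scipy is assumed to be available.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical 2-state, 2-action continuous-time MDP.
# q[i, a, j] is the transition rate from state i to state j under action a
# (each row over j sums to zero, as for any generator matrix).
q = np.array([[[-1.0, 1.0], [-2.0, 2.0]],
              [[1.0, -1.0], [3.0, -3.0]]])
R = np.array([[5.0, 10.0],    # R[i, a]: reward rate in state i under action a
              [-1.0, 2.0]])

n_s, n_a = R.shape
c = -R.flatten()  # linprog minimizes, so negate to maximize sum R(i,a) y(i,a)

# Balance constraints sum_{i,a} q(j | i, a) y(i, a) = 0 for every state j,
# plus the normalization sum_{i,a} y(i, a) = 1.
A_eq = np.zeros((n_s + 1, n_s * n_a))
for j in range(n_s):
    for i in range(n_s):
        for a in range(n_a):
            A_eq[j, i * n_a + a] = q[i, a, j]
A_eq[n_s, :] = 1.0
b_eq = np.zeros(n_s + 1)
b_eq[n_s] = 1.0

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * (n_s * n_a))
y = res.x.reshape(n_s, n_a)
print("optimal average reward:", -res.fun)
print("greedy action per state:", y.argmax(axis=1))
```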
Reinforcement learning can solve Markov decision processes without explicit specification of the transition probabilities, which are instead needed to perform policy iteration. In this setting, transition probabilities and rewards must be learned from experience, i.e. by letting an agent interact with the MDP for a given number of steps. Both on a theoretical and on a practical level, effort is put into maximizing the sample efficiency, i.e. minimizing the number of samples needed to learn a policy whose performance is ε-close to the optimal one (due to the stochastic nature of the process, learning the optimal policy with a finite number of samples is, in general, impossible).
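As an illustration of learning from interaction only, here is a minimal tabular Q-learning sketch. The env object with reset() and step(a) methods is a hypothetical interface assumed for this example; the transition model itself is never consulted.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    """Learn Q from sampled transitions only; no transition matrix needed."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(Q[s].argmax())
            s_next, r, done = env.step(a)
            # temporal-difference update toward the Bellman target
            target = r + gamma * Q[s_next].max() * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q  # greedy policy: Q.argmax(axis=1)
```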
These model classes form a hierarchy of information content: an explicit model trivially yields a generative model through sampling from the distributions, and repeated application of a generative model yields an episodic simulator. In the opposite direction, it is only possible to learn approximate models through regression.
Reinforcement learning utilizes the MDP framework to model the interaction between a learning agent and its environment. In this framework, the interaction is characterized by states, actions, and rewards. The MDP framework is designed to provide a simplified representation of key elements of artificial intelligence challenges.
The algorithm has two steps, (1) a value update and (2) a policy update, which are repeated in some order for all the states until no further changes take place. Both recursively update a new estimation of the optimal policy and state value using an older estimation of those values.
Their order depends on the variant of the algorithm; one can also do them for all states at once, or state by state, and more often to some states than others. As long as no state is permanently excluded from either of the steps, the algorithm will eventually arrive at the correct solution.
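A compact way to see both steps together is the value-iteration sketch below, for a hypothetical finite MDP given as explicit arrays P[a, s, s'] and R[a, s, s'] (this array layout is assumed for illustration, not fixed by the article).

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P[a, s, s2]: transition probabilities; R[a, s, s2]: rewards.
    Repeats the value update until no state changes by more than tol,
    then reads off the greedy policy (the policy update)."""
    n_a, n_s, _ = P.shape
    V = np.zeros(n_s)
    while True:
        # value update: Q[a, s] = sum_s2 P[a, s, s2] * (R[a, s, s2] + gamma * V[s2])
        Q = (P * (R + gamma * V)).sum(axis=2)
        V_new = Q.max(axis=0)
        if np.abs(V_new - V).max() < tol:
            break
        V = V_new
    policy = Q.argmax(axis=0)  # policy update
    return V, policy
```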
If the state space and action space are finite, we could use linear programming to find the optimal policy, which was one of the earliest approaches applied. Here we only consider the ergodic model, which means our continuous-time MDP becomes an ergodic continuous-time Markov chain under a stationary policy.
The algorithms in this section apply to MDPs with finite state and action spaces and explicitly given transition probabilities and reward functions, but the basic concepts may be extended to handle other problem classes, for example using function approximation.

However, due to the curse of dimensionality, the size of the problem representation is often exponential in the number of state and action variables, limiting exact solution techniques to problems that have a compact representation. In practice, online planning techniques such as Monte Carlo tree search can find useful solutions in larger problems, and, in theory, it is possible to construct online planning algorithms that can find an arbitrarily near-optimal policy with no computational complexity dependence on the size of the state space.
The terminology and notation for MDPs are not entirely settled. There are two main streams: one focuses on maximization problems from contexts like economics, using the terms action, reward, and value, and calling the discount factor β or γ, while the other focuses on minimization problems from engineering and navigation, using the terms control, cost, and cost-to-go, and calling the discount factor α.
In comparison to discrete-time Markov decision processes, continuous-time Markov decision processes can better model the decision-making process for a system that has continuous dynamics, i.e., the system dynamics are defined by ordinary differential equations (ODEs). These kinds of applications arise in queueing systems, epidemic processes, and population processes.
A particular MDP may have multiple distinct optimal policies. Because of the Markov property, it can be shown that the optimal policy is a function of the current state, as assumed above.
Compared to an episodic simulator, a generative model has the advantage that it can yield data from any state, not only those encountered in a trajectory.
1565: 6237: 6192: 6172: 5938: 5912:
4291: 4235: 4077: 3768:
Constrained Markov decision processes (CMDPs) are extensions to Markov decision processes (MDPs). There are three fundamental differences between MDPs and CMDPs.
3729: 3631: 3487: 3467: 3357: 2648: 2556: 2488: 2335: 2236: 2170: 2125: 2105: 2033: 1611: 1539: 1473: 1076: 1052: 975: 955: 911: 866: 399: 379: 359: 290: 237: 201: 6667:, while the other focuses on minimization problems from engineering and navigation, using the terms control, cost, cost-to-go, and calling the discount factor 3613:
In this variant, the steps are preferentially applied to states which are in some way important – whether based on the algorithm (there were large changes in {\displaystyle V} or {\displaystyle \pi } around those states recently) or based on use (those states are near the starting state, or otherwise of interest to the person or program using the algorithm).
Instead of repeating step two to convergence, it may be formulated and solved as a set of linear equations. These equations are merely obtained by making {\displaystyle s=s'} in the step two equation.
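A minimal sketch of this direct approach, under the same assumed P[a, s, s'] and R[a, s, s'] array layout as above: policy evaluation becomes one linear solve instead of an iteration.

```python
import numpy as np

def evaluate_policy(P, R, policy, gamma=0.9):
    """Solve the step-two (policy evaluation) equations directly as the
    linear system V = R_pi + gamma * P_pi V, instead of iterating."""
    n_s = P.shape[1]
    # Transition matrix and expected one-step reward under the fixed policy.
    P_pi = np.array([P[policy[s], s] for s in range(n_s)])
    R_pi = np.array([(P[policy[s], s] * R[policy[s], s]).sum()
                     for s in range(n_s)])
    # (I - gamma * P_pi) V = R_pi
    return np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)
```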
In this way, Markov decision processes could be generalized from monoids (categories with one object) to arbitrary categories. One can call the result {\displaystyle ({\mathcal {C}},F:{\mathcal {C}}\to \mathbf {Dist} )} a context-dependent Markov decision process, because moving from one object to another in {\displaystyle {\mathcal {C}}} changes the set of available actions and the set of possible states.
5275: 2180:. The type of model available for a particular MDP plays a significant role in determining which solution algorithms are appropriate. For example, the 455: 2241: 2535:
The standard family of algorithms to calculate optimal policies for finite state and action MDPs requires storage for two arrays indexed by state: value {\displaystyle V}, which contains real values, and policy {\displaystyle \pi }, which contains actions. At the end of the algorithm, {\displaystyle \pi } will contain the solution and {\displaystyle V(s)} will contain the discounted sum of the rewards to be earned (on average) by following that solution from state {\displaystyle s}.
636:
for every measurable {\displaystyle S'\subseteq S}. In case the state space is discrete, the integral is intended with respect to the counting measure, so that the latter simplifies as {\displaystyle P_{a}(s,s')=\Pr(s_{t+1}=s'\mid s_{t}=s,a_{t}=a)}; in case {\displaystyle S\subseteq \mathbb {R} ^{d}}, the integral is usually intended with respect to the Lebesgue measure.
5946: 1130:
The objective is to choose a policy {\displaystyle \pi } that will maximize some cumulative function of the random rewards, typically the expected discounted sum over a potentially infinite horizon:

{\displaystyle E\left[\sum _{t=0}^{\infty }\gamma ^{t}R_{a_{t}}(s_{t},s_{t+1})\right]}

(where we choose {\displaystyle a_{t}=\pi (s_{t})}, i.e. actions given by the policy, and the expectation is taken over {\displaystyle s_{t+1}\sim P_{a_{t}}(s_{t},s_{t+1})}). Here {\displaystyle \gamma } is the discount factor satisfying {\displaystyle 0\leq \gamma \leq 1}, which is usually close to {\displaystyle 1} (for example, {\displaystyle \gamma =1/(1+r)} for some discount rate {\displaystyle r}). A lower discount factor motivates the decision maker to favor taking actions early, rather than postponing them indefinitely.
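The following sketch estimates this discounted objective for a fixed policy by Monte Carlo rollout; the env object with reset() and step(a) is the same hypothetical interface assumed in the earlier Q-learning example, and the truncation horizon stands in for the infinite sum.

```python
def discounted_return(env, policy, gamma=0.95, horizon=1000):
    """Monte Carlo estimate of E[sum_t gamma^t R_t] for a fixed policy."""
    s = env.reset()
    total, discount = 0.0, 1.0
    for _ in range(horizon):  # truncate the potentially infinite horizon
        a = policy[s]
        s, r, done = env.step(a)
        total += discount * r
        discount *= gamma
        if done:
            break
    return total
```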
Another application of MDPs in machine learning theory is called learning automata. This is also one type of reinforcement learning if the environment is stochastic. The first detailed learning automata paper is the survey by Narendra and Thathachar (1974), in which learning automata were described explicitly as finite state automata.
3760:
The solution above assumes that the state {\displaystyle s} is known when the action is to be taken; otherwise {\displaystyle \pi (s)} cannot be calculated. When this assumption is not true, the problem is called a partially observable Markov decision process, or POMDP.
7479: 5885:
Reinforcement learning is an interdisciplinary area of machine learning and optimal control that has, as its main objective, finding an approximately optimal policy for MDPs where transition probabilities and rewards are unknown.
3833:
Like the discrete-time Markov decision processes, in continuous-time Markov decision processes the agent aims at finding the optimal policy which could maximize the expected cumulated reward. The only difference with the standard case stays in the fact that, due to the continuous nature of the time variable, the sum is replaced by an integral:

{\displaystyle \max \operatorname {E} _{\pi }\left[\int _{0}^{\infty }\gamma ^{t}r(s(t),a(t))\,dt\;{\Big |}\;s_{0}\right]}

where {\displaystyle 0\leq \gamma <1.}
639: 7920: 7863: 7829: 7808: 7728: 7285: 1541:). A lower discount factor motivates the decision maker to favor taking actions early, rather than postpone them indefinitely. 3819: 3605:), step one is performed once, and then step two is repeated several times. Then step one is again performed once and so on. 2974:{\displaystyle \pi (s):=\operatorname {argmax} _{a}\left\{\sum _{s'}P_{a}(s,s')\left(R_{a}(s,s')+\gamma V(s')\right)\right\}} 38: 5274:
In continuous-time MDPs, if the state space and action space are continuous, the optimal criterion could be found by solving the Hamilton–Jacobi–Bellman (HJB) partial differential equation. In order to discuss the HJB equation, we need to reformulate our problem as

{\displaystyle {\begin{aligned}V(s(0),0)={}&\max _{a(t)=\pi (s(t))}\int _{0}^{T}r(s(t),a(t))\,dt+D[s(T)]\\{\text{s.t.}}\quad &{\frac {ds(t)}{dt}}=f[s(t),a(t)]\end{aligned}}}

where {\displaystyle D(\cdot )} is the terminal reward function, {\displaystyle s(t)} is the system state vector and {\displaystyle a(t)} is the system control vector we try to find.
115:
Example of a simple MDP with three states (green circles) and two actions (orange circles), with two rewards (orange arrows)
3560: 1779: 1300: 3844: 7944: 7119:. Wiley series in probability and mathematical statistics. Applied probability and statistics section. New York: Wiley. 3559:
Thus, repeating step two to convergence can be interpreted as solving the linear equations by relaxation.
6510: 7954: 7431:
van Nunen, J. A. E. E. (1976). "A set of successive approximation methods for discounted Markovian decision problems". Zeitschrift für Operations Research. 20 (5): 203–208. doi:10.1007/bf01920264.
7404:
Puterman, M. L.; Shin, M. C. (1978). "Modified Policy Iteration Algorithms for Discounted Markov Decision Problems". Management Science. 24 (11): 1127–1137. doi:10.1287/mnsc.24.11.1127.
2421:
{\displaystyle P_{a}(s,s')} is the transition of the system, which in this case is going to be deterministic and driven by the laws of mechanics.
7939: 5920:
For the purpose of this section, it is useful to define a further function, which corresponds to taking the action {\displaystyle a} and then continuing optimally (or according to whatever policy one currently has):

{\displaystyle Q(s,a)=\sum _{s'}P_{a}(s,s')(R_{a}(s,s')+\gamma V(s')).}

While this function is also unknown, experience during learning is based on {\displaystyle (s,a)} pairs (together with the outcome {\displaystyle s'}; that is, "I was in state {\displaystyle s} and I tried doing {\displaystyle a} and {\displaystyle s'} happened"). Thus, one has an array {\displaystyle Q} and uses experience to update it directly. This is known as Q-learning.
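When the model is known, this function can be computed directly from V. A minimal sketch, again under the assumed P[a, s, s'] and R[a, s, s'] layout:

```python
import numpy as np

def q_from_value(P, R, V, gamma=0.9):
    """Compute Q(s, a) = sum_s' P_a(s, s') * (R_a(s, s') + gamma * V(s'))
    for every state-action pair."""
    return (P * (R + gamma * V)).sum(axis=2).T  # shape (n_states, n_actions)
```

The greedy policy with respect to V is then simply q_from_value(P, R, V).argmax(axis=1).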
3329:{\displaystyle V_{i+1}(s):=\max _{a}\left\{\sum _{s'}P_{a}(s,s')\left(R_{a}(s,s')+\gamma V_{i}(s')\right)\right\},} 42: 7949: 763: 7079: 2630:
1417: 7223:. Adaptive computation and machine learning series (2nd ed.). Cambridge, Massachusetts: The MIT Press. 7043: 3681: 3677: 2185: 77: 7819: 7146:
Schneider, S.; Wagner, D. H. (1957). "Error detection in redundant systems". Papers presented at the February 26-28, 1957, western joint computer conference: Techniques for reliability. IRE-AIEE-ACM '57 (Western). New York, NY, USA: Association for Computing Machinery: 115–121. doi:10.1145/1455567.1455587.
3790:
The method of Lagrange multipliers applies to CMDPs. Many Lagrangian-based algorithms have been developed.
2205: 7673: 7466: 7033: 5874: 5869: 2529: 73: 3987: 2524:
Solutions for MDPs with finite state and action spaces may be found through a variety of methods such as dynamic programming.
6275: 3810:
In discrete-time Markov decision processes, decisions are made at discrete time intervals. However, for continuous-time Markov decision processes, decisions can be made at any time the decision maker chooses.
46: 3501:
Lloyd Shapley's 1953 paper on stochastic games included as a special case the value iteration method for MDPs, but this was recognized only later on.
1730: 1251: 125: 4207:{\displaystyle g\geq R(i,a)+\sum _{j\in S}q(j\mid i,a)h(j)\quad \forall i\in S{\text{ and }}a\in A(i)} 2038: 7317: 6931: 2188:
The type of model available for a particular MDP plays a significant role in determining which solution algorithms are appropriate. For example, the dynamic programming algorithms described in the next section require an explicit model, and Monte Carlo tree search requires a generative model (or an episodic simulator that can be copied at any state), whereas most reinforcement learning algorithms require only an episodic simulator.
1894: 1478: 7678: 6629: 6466: 4296: 4240: 985:
The goal in a Markov decision process is to find a good "policy" for the decision maker: a function {\displaystyle \pi } that specifies the action {\displaystyle \pi (s)} that the decision maker will choose when in state {\displaystyle s}. Once a Markov decision process is combined with a policy in this way, this fixes the action for each state and the resulting combination behaves like a Markov chain (since the action chosen in state {\displaystyle s} is completely determined by {\displaystyle \pi (s)}).
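The reduction to a Markov chain is mechanical; a two-line sketch under the same assumed P[a, s, s'] layout:

```python
import numpy as np

def induced_chain(P, policy):
    """Fixing a deterministic policy turns the MDP into a Markov chain:
    P_pi[s, s'] = P[policy[s], s, s']."""
    n_s = P.shape[1]
    return np.array([P[policy[s], s] for s in range(n_s)])
```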
7712: 7665: 7074: 7053: 6352: 6271: 5892: 3780: 2525: 2181: 2177: 608: 214: 53: 5671:
where {\displaystyle f(\cdot )} shows how the state vector changes over time. The Hamilton–Jacobi–Bellman equation is as follows:

{\displaystyle 0={\frac {\partial V(t,s)}{\partial t}}+\max _{a}\left(r(t,s,a)+{\frac {\partial V(t,s)}{\partial s}}f(t,s,a)\right)}
7448: 7392: 7364: 7178: 6977: 6885: 6785: 6389:
The states of such an automaton correspond to the states of a "discrete-state discrete-parameter Markov process". At each time step t = 0, 1, 2, 3, ..., the automaton reads an input from its environment, updates P(t) to P(t + 1) by A, randomly chooses a successor state according to the probabilities P(t + 1), and outputs the corresponding action. The automaton's environment, in turn, reads the action and sends the next input to the automaton.
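For concreteness, here is a sketch of one classical action-probability update, the linear reward-inaction scheme; the learning-rate parameter theta is an assumption for illustration and is not fixed by the article.

```python
import numpy as np

def linear_reward_inaction(p, action, reward, theta=0.1):
    """One update of the classic L_{R-I} scheme: on a favourable response,
    shift probability mass toward the chosen action; on an unfavourable
    one, leave p unchanged."""
    if reward:
        p = p * (1 - theta)
        p[action] += theta
    return p / p.sum()  # guard against rounding drift

# usage: p = np.ones(3) / 3; p = linear_reward_inaction(p, action=1, reward=True)
```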
3827: 3776: 3011: 2426: 2377: 1944: 1570: 1391: 804: 297: 69: 6834: 5227: 4955: 2490:
{\displaystyle R_{a}(s,s')} is {\displaystyle 1} if the pole is up after the transition, zero otherwise. Therefore, this function only depends on {\displaystyle s'} in this specific case.
7474: 7148:. IRE-AIEE-ACM '57 (Western). New York, NY, USA: Association for Computing Machinery: 115–121. 5645: 5558: 3589:
Policy iteration is usually slower than value iteration for a large number of possible states.
848:
{\displaystyle R_{a}(s,s')} is the immediate reward (or expected immediate reward) received after transitioning from state {\displaystyle s} to state {\displaystyle s'} due to action {\displaystyle a}.
7916: 7856: 7846: 7825: 7804: 7765: 7724: 7691: 7372: 7345: 7281: 7224: 7198: 7157: 7120: 6257: 2340: 1909: 204: 96: 6424: 3734: 3419: 3095: 3037: 1081: 1008: 7683: 7618: 7536: 7524: 7496: 7488: 7440: 7413: 7335: 7325: 7257: 7190: 7149: 7084: 7058: 6500: 6263: 5878: 5192: 4920: 4885: 3670: 3666: 3586:
3498: 3490: 2808:{\displaystyle V(s):=\sum _{s'}P_{\pi (s)}(s,s')\left(R_{\pi (s)}(s,s')+\gamma V(s')\right)} 2004: 1619: 1136: 796: 7899: 7612: 6100: 5837: 4035: 3566:
This variant has the advantage that there is a definite stopping condition: when the array {\displaystyle \pi } does not change in the course of applying step 1 to all states, the algorithm is completed.
3388: 1870: 248: 7525:"A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes" 7470: 7246:"A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes" 7069: 6460: 5882: 5805: 5616: 5587: 3823: 3815: 3799: 3694: 3662: 3636: 3569: 3531: 3124: 3066: 3017: 2604: 2584: 2564: 1113: 988: 920: 100: 7867: 3362: 429: 7321: 6197: 6132: 2493: 2212:
An example of an MDP is the Pole-Balancing model, which comes from classic control theory.
2130: 1893:
where {\displaystyle H} is the time horizon. Compared to the previous objective, the latter one is more used in learning theory.
1547: 871: 404: 111: 7340: 7305: 7038: 6390: 6222: 6177: 6157: 5923: 5832: 4276: 4220: 4062: 3714: 3673: 3616: 3472: 3452: 3413: 3342: 2633: 2541: 2473: 2320: 2221: 2155: 2110: 2090: 2018: 1596: 1524: 1458: 1061: 1037: 960: 940: 896: 851: 384: 364: 344: 275: 222: 186: 84:, the management of uncertainty and nondeterminism, and the pursuit of explicit goals. 17: 7194: 1941:
7933: 7717: 7594: 7301: 3494: 92: 7580:
Ding, Dongsheng; Zhang, Kaiqing; Jovanovic, Mihailo; Basar, Tamer (2020). "Natural policy gradient primal-dual method for constrained Markov decision processes". Advances in Neural Information Processing Systems.
4952:
{\displaystyle y(i,a)} is a feasible solution to the D-LP if {\displaystyle y(i,a)} is nonnegative and satisfies the constraints in the D-LP problem. A feasible solution {\displaystyle y^{*}(i,a)} to the D-LP is said to be an optimal solution if

{\displaystyle \sum _{i\in S}\sum _{a\in A(i)}R(i,a)y^{*}(i,a)\geq \sum _{i\in S}\sum _{a\in A(i)}R(i,a)y(i,a)}

for all feasible solutions {\displaystyle y(i,a)} to the D-LP. Once we have found the optimal solution {\displaystyle y^{*}(i,a)}, we can use it to establish the optimal policies.
7595:"Risk-aware path planning using hierarchical constrained Markov Decision Processes" 7452: 7094: 5940:
5792:{\displaystyle 0=\max _{u}(r(t,s,a)+{\frac {\partial V(t,s)}{\partial x}}f(t,s,a))} 1055: 88: 56:
Originating from operations research in the 1950s, MDPs have since gained recognition in a variety of fields, including ecology, economics, healthcare, telecommunications and reinforcement learning.
7637: 3665:
Algorithms for finding optimal policies with time complexity polynomial in the size of the problem representation exist for finite MDPs. Thus, decision problems based on MDPs are in computational complexity class P.
95:. The "Markov" in "Markov decision process" refers to the underlying structure of 7910: 7798: 7141: 6488: 598:{\displaystyle \Pr(s_{t+1}\in S'\mid s_{t}=s,a_{t}=a)=\int _{S'}P_{a}(s,s')ds',} 65: 7310:
Shapley, Lloyd (1953). "Stochastic Games". Proceedings of the National Academy of Sciences of the United States of America. 39 (10): 1095–1100. Bibcode:1953PNAS...39.1095S. doi:10.1073/pnas.39.10.1095.
2308:{\displaystyle (\theta ,{\dot {\theta }},x,{\dot {x}})\subset \mathbb {R} ^{4}} 1544:
Another possible, but strictly related, objective that is commonly used is the
7687: 7622: 7541: 7262: 7245: 7089: 6240: 4025: 3489:
2012: 2008: 7695: 7202: 7417: 7330: 7153: 6397:= 0,1,2,3,..., the automaton reads an input from its environment, updates P( 6097:
81: 61: 7349: 6087:{\displaystyle \ Q(s,a)=\sum _{s'}P_{a}(s,s')(R_{a}(s,s')+\gamma V(s')).\ } 5278:. In order to discuss the HJB equation, we need to reformulate our problem 3798:
There are a number of applications for CMDPs. They have recently been used in motion planning scenarios in robotics.
2035:
In algorithms that are expressed using pseudocode, {\displaystyle G} is often used to represent a generative model. For example, the expression {\displaystyle s',r\gets G(s,a)} might denote the action of sampling from the generative model, where {\displaystyle s} and {\displaystyle a} are the current state and action, and {\displaystyle s'} and {\displaystyle r} are the new state and reward.
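A generative model is easy to construct from an explicit model; the sketch below wraps the assumed P[a, s, s'] and R[a, s, s'] arrays behind exactly this sampling interface.

```python
import numpy as np

def make_generative_model(P, R, rng=np.random.default_rng()):
    """Wrap explicit arrays P[a, s, s'] and R[a, s, s'] as a generative
    model G: the caller supplies (s, a) and receives a sampled (s', r)."""
    def G(s, a):
        s_next = rng.choice(P.shape[2], p=P[a, s])
        return s_next, R[a, s, s_next]
    return G
```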
7578:
7492: 6504: 3772:
There are multiple costs incurred after applying an action instead of one. CMDPs are solved with linear programs only, and dynamic programming does not work. The final policy depends on the starting state.
2315:
{\displaystyle S} is the set of ordered tuples {\displaystyle (\theta ,{\dot {\theta }},x,{\dot {x}})\subset \mathbb {R} ^{4}} given by the pole angle, the angular velocity, the position of the cart and its speed.
3525:
(Policy iteration was invented by Howard to optimize Sears catalogue mailing, which he had been optimizing using value iteration.)
2184:
7764:(Dover paperback ed.). Princeton, NJ: Princeton University Press. 7501: 7444: 1776:, i.e. actions given by the policy). And the expectation is taken over 1297:, i.e. actions given by the policy). And the expectation is taken over 57: 6409:, randomly chooses a successor state according to the probabilities P( 7393:"Comments on the Origin and Application of Markov Decision Processes" 6650:
6274:
4029: 6615:{\displaystyle ({\mathcal {C}},F:{\mathcal {C}}\to \mathbf {Dist} )} 2372:, corresponding to applying a force on the left (right) on the cart. 2200: 2087:
6671:. In addition, the notation for the transition probability varies. 27:
Mathematical model for sequential decision making under uncertainty
7117:
Puterman, Martin L. (1994). Markov decision processes: discrete stochastic dynamic programming. Wiley Series in Probability and Mathematical Statistics. New York: Wiley. ISBN 978-0-471-61977-2.
3522: 2199: 120: 7363:
Kallenberg, Lodewijk (2002). "Finite state and action MDPs". In Feinberg, Eugene A.; Shwartz, Adam (eds.). Handbook of Markov decision processes: methods and applications. Boston, MA: Kluwer. ISBN 978-0-7923-7459-6.
452:. In general, this probability transition is defined to satisfy 7668:; Thathachar, M. A. L. (1974). "Learning Automata – A Survey". 7369:
1613:
753:{\displaystyle P_{a}(s,s')=\Pr(s_{t+1}=s'\mid s_{t}=s,a_{t}=a)} 7523:
Kearns, Michael; Mansour, Yishay; Ng, Andrew (2002). "A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes". Machine Learning. 49: 193–208. doi:10.1023/A:1017932429737.
2204:
Pole Balancing example (rendering of the environment from the Open AI gym benchmark).
6635: 6587: 6571: 6516: 6472: 6239:
5276:
3870: 1567:
Another possible, but strictly related, objective that is commonly used is the {\displaystyle H}-step return. This time, instead of using a discount factor {\displaystyle \gamma }, the agent is interested only in the first {\displaystyle H} steps of the process, with each reward having the same weight:

{\displaystyle E\left[\sum _{t=0}^{H-1}R_{a_{t}}(s_{t},s_{t+1})\right]}
937:
A policy function {\displaystyle \pi } is a (potentially probabilistic) mapping from the state space ({\displaystyle S}) to the action space ({\displaystyle A}).
7824:. Stochastic Modelling and Applied Probability. Springer. 213:. The state space may be discrete or continuous, like the 80:
These elements encompass the understanding of cause and effect, the management of uncertainty and nondeterminism, and the pursuit of explicit goals.
6882:
In addition, the transition probability {\displaystyle P_{a}(s,s')} is sometimes written {\displaystyle \Pr(s'\mid s,a)}, {\displaystyle \Pr(s,a,s')} or {\displaystyle p_{s's}(a)}.
7278:
Reinforcement Learning: Theory and Python Implementation. Beijing: China Machine Press. 2019. p. 44. ISBN 9787111631774.
5802:
We could solve the equation to find the optimal control {\displaystyle a(t)}, which could give us the optimal value function {\displaystyle V^{*}}.
1900:
A policy that maximizes the function above is called an optimal policy and is usually denoted {\displaystyle \pi ^{*}}.
292:). As for state, this set may be discrete or continuous. 7638:"Multi-agent reinforcement learning: a critical survey" 4329:, we could use the following linear programming model: 3092:
In value iteration, the policy array is not used; instead, the value of {\displaystyle \pi (s)} is calculated within {\displaystyle V(s)} whenever it is needed. Substituting the calculation of {\displaystyle \pi (s)} into the calculation of {\displaystyle V(s)} gives the combined step:

{\displaystyle V_{i+1}(s):=\max _{a}\left\{\sum _{s'}P_{a}(s,s')\left(R_{a}(s,s')+\gamma V_{i}(s')\right)\right\},}

where {\displaystyle i} is the iteration number. Value iteration starts at {\displaystyle i=0} with {\displaystyle V_{0}} as a guess of the value function. It then iterates, repeatedly computing {\displaystyle V_{i+1}} for all states {\displaystyle s}, until {\displaystyle V} converges with the left-hand side equal to the right-hand side (which is the "Bellman equation" for this problem).
2581:, which contains actions. At the end of the algorithm, 795:, the integral is usually intended with respect to the 341:
{\displaystyle P_{a}(s,s')} is, on an intuitive level, the probability that action {\displaystyle a} in state {\displaystyle s} at time {\displaystyle t} will lead to state {\displaystyle s'} at time {\displaystyle t+1}.
7179:"Dynamic programming and stochastic control processes" 7244:
6980: 6934: 6888: 6837: 6788: 6632: 6565: 6513: 6469: 6427: 6385:: Φ → α which generates the output at each time step. 6225: 6200: 6180: 6160: 6135: 6103: 5949: 5926: 5895: 5840: 5808: 5680: 5648: 5619: 5590: 5561: 5287: 5230: 5224:
5195: 5003: 4958: 4923: 4888: 4514: 4343: 4299: 4279: 4243: 4223: 4088: 4065: 4038: 3990: 3847: 3737: 3717: 3639: 3619: 3572: 3534: 3475: 3455: 3422: 3391: 3365: 3345: 3159: 3127: 3098: 3069: 3040: 3020: 2824: 2663: 2636: 2607: 2587: 2567: 2544: 2496: 2476: 2429: 2380: 2343: 2323: 2244: 2224: 2158: 2133: 2113: 2093: 2041: 2021: 1947: 1912: 1873: 1782: 1733: 1622: 1599: 1573: 1550: 1527: 1481: 1461: 1420: 1394: 1303: 1254: 1139: 1116: 1084: 1064: 1040: 1011: 991: 963: 943: 923: 899: 874: 854: 807: 766: 642: 611: 458: 432: 407: 387: 367: 347: 300: 278: 251: 225: 189: 128: 7582:. Advances in Neural Information Processing Systems. 1857:{\displaystyle s_{t+1}\sim P_{a_{t}}(s_{t},s_{t+1})} 1378:{\displaystyle s_{t+1}\sim P_{a_{t}}(s_{t},s_{t+1})} 7707: 7705: 7601:. IEEE International Conference. pp. 297, 303. 5266:, we can use it to establish the optimal policies. 3974:{\displaystyle \max \operatorname {E} _{\pi }\left} 3359:is the iteration number. Value iteration starts at 91:, a concept developed by the Russian mathematician 7743: 7716: 7670:IEEE Transactions on Systems, Man, and Cybernetics 7280:. Beijing: China Machine Press. 2019. p. 44. 7013: 6966: 6920: 6867: 6821: 6642: 6614: 6540: 6479: 6451: 6421:Other than the rewards, a Markov decision process 6231: 6211: 6186: 6166: 6146: 6121: 6086: 5932: 5904: 5853: 5823: 5791: 5663: 5634: 5605: 5576: 5545: 5270:Continuous space: Hamilton–Jacobi–Bellman equation 5258: 5216: 5178: 4986: 4944: 4909: 4872: 4492: 4321: 4285: 4265: 4229: 4206: 4071: 4051: 4008: 3973: 3752: 3723: 3645: 3625: 3578: 3551: 3481: 3461: 3441: 3404: 3377: 3351: 3328: 3142: 3113: 3084: 3055: 3026: 2973: 2807: 2642: 2622: 2593: 2573: 2550: 2507: 2482: 2462: 2413: 2364: 2329: 2307: 2230: 2164: 2144: 2119: 2099: 2079: 2027: 2007:in the context of statistical classification.) In 1980: 1925: 1885: 1856: 1768: 1719: 1605: 1585: 1559: 1533: 1513: 1467: 1447: 1406: 1377: 1289: 1240: 1122: 1099: 1070: 1046: 1034:that the decision maker will choose when in state 1026: 997: 969: 949: 929: 905: 885: 860: 840: 787: 752: 628: 597: 444: 418: 393: 373: 353: 333: 284: 264: 231: 195: 172: 6541:{\displaystyle {\mathcal {A}}\to \mathbf {Dist} } 4994:to the D-LP is said to be an optimal solution if 6935: 6889: 6378:), the current input, and the current state, and 5688: 5329: 4293:satisfying the above equation. In order to find 4028:continuous-time Markov chain under a stationary 3848: 3189: 676: 459: 6626:, because moving from one object to another in 3818:, i.e., the system dynamics is defined by 3786:The final policy depends on the starting state. 3731:is known when action is to be taken; otherwise 2192:algorithms require only an episodic simulator. 7887:Introduction to stochastic dynamic programming 4019:Discrete space: Linear programming formulation 41:or stochastic control problem, is a model for 7636:Shoham, Y.; Powers, R.; Grenager, T. (2003). 7475:"The Complexity of Markov Decision Processes" 7219:Sutton, Richard S.; Barto, Andrew G. (2018). 5642:is the system control vector we try to find. 3661:Algorithms for finding optimal policies with 8: 7593:Feyzabadi, S.; Carpin, S. (18–22 Aug 2014). 7049:Partially observable Markov decision process 3822:(ODEs). These kind of applications raise in 3707:Partially observable Markov decision process 3602: 3034:function is not used; instead, the value of 2359: 2344: 1593:, the agent is interested only in the first 7140:Schneider, S.; Wagner, D. H. (1957-02-26). 3794:Natural policy gradient primal-dual method. 
3004:harv error: no target: CITEREFBellman1957 ( 2189: 788:{\displaystyle S\subseteq \mathbb {R} ^{d}} 272:is the set of actions available from state 7797:Feinberg, E.A.; Shwartz, A., eds. (2002). 3950: 3711:The solution above assumes that the state 3515:harv error: no target: CITEREFHoward1960 ( 7821:Continuous-Time Markov Decision Processes 7790:Finite state Markovian decision processes 7677: 7614:Continuous-Time Markov Decision Processes 7599:Automation Science and Engineering (CASE) 7540: 7500: 7339: 7329: 7261: 6985: 6979: 6933: 6887: 6842: 6836: 6793: 6787: 6634: 6633: 6631: 6624:context-dependent Markov decision process 6595: 6586: 6585: 6570: 6569: 6564: 6524: 6515: 6514: 6512: 6471: 6470: 6468: 6426: 6224: 6199: 6179: 6159: 6134: 6102: 6026: 5993: 5978: 5948: 5925: 5894: 5845: 5839: 5807: 5727: 5691: 5679: 5647: 5618: 5589: 5560: 5465: 5457: 5422: 5380: 5375: 5332: 5322: 5288: 5286: 5235: 5229: 5194: 5115: 5099: 5071: 5028: 5012: 5004: 5002: 4963: 4957: 4922: 4887: 4849: 4750: 4734: 4638: 4622: 4609: 4548: 4532: 4519: 4515: 4513: 4467: 4451: 4450: 4381: 4363: 4348: 4344: 4342: 4313: 4302: 4301: 4298: 4278: 4257: 4246: 4245: 4242: 4222: 4181: 4120: 4087: 4079:, we will have the following inequality: 4064: 4043: 4037: 3989: 3960: 3943: 3892: 3882: 3877: 3855: 3846: 3812:continuous-time Markov decision processes 3736: 3716: 3638: 3618: 3598: 3571: 3533: 3474: 3454: 3427: 3421: 3416:. It then iterates, repeatedly computing 3396: 3390: 3364: 3344: 3293: 3257: 3222: 3207: 3192: 3164: 3158: 3126: 3097: 3068: 3039: 3019: 2912: 2877: 2862: 2844: 2823: 2742: 2698: 2683: 2662: 2635: 2606: 2586: 2566: 2543: 2495: 2475: 2434: 2428: 2385: 2379: 2342: 2322: 2299: 2295: 2294: 2276: 2275: 2255: 2254: 2243: 2223: 2157: 2132: 2112: 2092: 2040: 2020: 1952: 1946: 1917: 1911: 1872: 1839: 1826: 1811: 1806: 1787: 1781: 1757: 1738: 1732: 1696: 1683: 1668: 1663: 1658: 1646: 1635: 1621: 1598: 1572: 1549: 1526: 1491: 1480: 1460: 1448:{\displaystyle 0\leq \ \gamma \ \leq \ 1} 1419: 1393: 1360: 1347: 1332: 1327: 1308: 1302: 1278: 1259: 1253: 1217: 1204: 1189: 1184: 1174: 1169: 1163: 1152: 1138: 1115: 1083: 1063: 1039: 1010: 990: 962: 942: 922: 898: 873: 853: 812: 806: 779: 775: 774: 765: 735: 716: 686: 647: 641: 610: 555: 540: 518: 499: 469: 457: 431: 406: 386: 366: 346: 305: 299: 277: 256: 250: 224: 188: 161: 148: 127: 6673: 6319:} of possible outputs, or actions, with 5916:Reinforcement Learning for discrete MDPs 110: 7901:Reinforcement Learning: An Introduction 7842:Control Techniques for Complex Networks 7781:Dynamic Programming and Optimal Control 7221:Reinforcement learning: an introduction 7107: 6552:of states and the probability function 3806:Continuous-time Markov decision process 2999: 7142:"Error detection in redundant systems" 6262:Another application of MDP process in 4917:is a feasible solution to the D-LP if 3510: 2127:are the current state and action, and 87:The name comes from its connection to 7818:Guo, X.; Hernández-Lerma, O. (2009). 7800:Handbook of Markov Decision Processes 7715:; Thathachar, Mandayam A. L. (1989). 7645:Technical Report, Stanford University 7565:Constrained Markov decision processes 3764:Constrained Markov decision processes 7: 7898:Sutton, R. S.; Barto, A. G. (2017). 7214: 7212: 6330:an initial state probability vector 1110:The objective is to choose a policy 7912:A First Course in Stochastic Models 7433:Zeitschrift fĂĽr Operations Research 6219:happened"). 
5613:is the system state vector, 1455:, which is usually close to 1078:is completely determined by 7861:Appendix contains abridged 7014:{\displaystyle p_{s's}(a).} 6921:{\displaystyle \Pr(s,a,s')} 6822:{\displaystyle P_{a}(s,s')} 6358:which after each time step 6154:; that is, "I was in state 4333:Primal linear program(P-LP) 4217:If there exists a function 2463:{\displaystyle R_{a}(s,s')} 2414:{\displaystyle P_{a}(s,s')} 1981:{\displaystyle P_{a}(s,s')} 1586:{\displaystyle \ \gamma \ } 1407:{\displaystyle \ \gamma \ } 841:{\displaystyle R_{a}(s,s')} 334:{\displaystyle P_{a}(s,s')} 7971: 7783:. Vol. 2. MA: Athena. 6868:{\displaystyle p_{ss'}(a)} 6255: 5867: 5259:{\displaystyle y^{*}(i,a)} 5189:for all feasible solution 4987:{\displaystyle y^{*}(i,a)} 3826:, epidemic processes, and 3704: 1005:that specifies the action 43:sequential decision making 39:stochastic dynamic program 7877:Markov Decision Processes 7875:Puterman., M. L. (1994). 7760:Bellman., R. E. (2003) . 7688:10.1109/TSMC.1974.5408453 7623:10.1007/978-3-642-02547-1 7567:. Vol. 7. CRC Press. 5664:{\displaystyle f(\cdot )} 5577:{\displaystyle D(\cdot )} 4504:Dual linear program(D-LP) 3593:Modified policy iteration 3150:gives the combined step: 2215:In this example, we have 2011:that are expressed using 7367:; Shwartz, Adam (eds.). 7080:Mabinogion sheep problem 4059:is independent of state 3657:Computational complexity 3603:Puterman & Shin 1978 3121:into the calculation of 2365:{\displaystyle \{-1,1\}} 1926:{\displaystyle \pi ^{*}} 7542:10.1023/A:1017932429737 7467:Papadimitriou, Christos 7418:10.1287/mnsc.24.11.1127 7331:10.1073/pnas.39.10.1095 7263:10.1023/A:1017932429737 7183:Information and Control 7154:10.1145/1455567.1455587 7044:Quantum finite automata 6831:transition probability 6782:transition probability 6452:{\displaystyle (S,A,P)} 3802:scenarios in robotics. 3753:{\displaystyle \pi (s)} 3682:Monte Carlo tree search 3678:curse of dimensionality 3442:{\displaystyle V_{i+1}} 3114:{\displaystyle \pi (s)} 3056:{\displaystyle \pi (s)} 3010:, which is also called 2186:Monte Carlo tree search 1906:and is usually denoted 1521:for some discount rate 1100:{\displaystyle \pi (s)} 1027:{\displaystyle \pi (s)} 78:artificial intelligence 31:Markov decision process 18:Markov Decision Process 7803:. Boston, MA: Kluwer. 7779:Bertsekas, D. (1995). 7672:. SMC-4 (4): 323–334. 7563:Altman, Eitan (1999). 7034:Probabilistic automata 7015: 6968: 6922: 6869: 6823: 6644: 6616: 6542: 6481: 6453: 6283:a stochastic automaton 6233: 6213: 6188: 6168: 6148: 6123: 6088: 5934: 5906: 5875:Reinforcement learning 5870:Reinforcement learning 5864:Reinforcement learning 5855: 5825: 5793: 5665: 5636: 5607: 5578: 5547: 5260: 5218: 5217:{\displaystyle y(i,a)} 5180: 4988: 4946: 4945:{\displaystyle y(i,a)} 4911: 4910:{\displaystyle y(i,a)} 4874: 4494: 4323: 4287: 4267: 4231: 4208: 4073: 4053: 4010: 3975: 3775:CMDPs are solved with 3754: 3725: 3697:with only one player. 3676:. However, due to the 3647: 3627: 3580: 3553: 3483: 3463: 3443: 3406: 3379: 3353: 3330: 3144: 3115: 3086: 3057: 3028: 2975: 2809: 2644: 2624: 2595: 2575: 2552: 2530:function approximation 2515:in this specific case. 
2509: 2484: 2464: 2415: 2366: 2331: 2309: 2232: 2209: 2190:reinforcement learning 2166: 2146: 2121: 2101: 2081: 2029: 1982: 1927: 1887: 1858: 1770: 1721: 1720:{\displaystyle E\left} 1657: 1607: 1587: 1561: 1535: 1515: 1469: 1449: 1408: 1379: 1291: 1242: 1241:{\displaystyle E\left} 1168: 1124: 1101: 1072: 1048: 1028: 999: 981:Optimization objective 971: 951: 931: 907: 887: 862: 842: 789: 754: 630: 599: 446: 420: 395: 375: 355: 335: 286: 266: 233: 197: 174: 116: 99:that still follow the 74:reinforcement learning 7909:Tijms., H.C. (2003). 7493:10.1287/moor.12.3.441 7016: 6969: 6923: 6870: 6824: 6654:Alternative notations 6645: 6617: 6548:encodes both the set 6543: 6482: 6454: 6393:". At each time step 6276:finite state automata 6270:paper is surveyed by 6234: 6214: 6189: 6169: 6149: 6124: 6122:{\displaystyle (s,a)} 6089: 5935: 5907: 5856: 5854:{\displaystyle V^{*}} 5826: 5794: 5666: 5637: 5608: 5579: 5548: 5261: 5219: 5181: 4989: 4947: 4912: 4875: 4495: 4324: 4288: 4273:will be the smallest 4268: 4232: 4209: 4074: 4054: 4052:{\displaystyle V^{*}} 4011: 3976: 3755: 3726: 3701:Partial observability 3648: 3628: 3581: 3554: 3509:In policy iteration ( 3493:" for this problem). 3484: 3464: 3444: 3407: 3405:{\displaystyle V_{0}} 3380: 3354: 3331: 3145: 3116: 3087: 3063:is calculated within 3058: 3029: 2976: 2810: 2645: 2625: 2596: 2576: 2553: 2510: 2485: 2465: 2416: 2367: 2332: 2310: 2233: 2206:Open AI gym benchmark 2203: 2167: 2147: 2122: 2102: 2082: 2030: 1983: 1928: 1888: 1886:{\displaystyle \ H\ } 1859: 1771: 1722: 1631: 1608: 1588: 1562: 1536: 1516: 1470: 1450: 1409: 1380: 1292: 1243: 1148: 1125: 1102: 1073: 1049: 1029: 1000: 972: 952: 932: 908: 888: 863: 843: 790: 755: 631: 600: 447: 421: 396: 376: 356: 336: 287: 267: 265:{\displaystyle A_{s}} 234: 207:of states called the 198: 175: 114: 7884:Ross, S. M. (1983). 7870:on 18 December 2012. 7864:"Meyn & Tweedie" 7839:Meyn, S. P. (2007). 7713:Narendra, Kumpati S. 7256:(193–208): 193–208. 6978: 6932: 6886: 6835: 6786: 6630: 6563: 6511: 6491:with generating set 6467: 6425: 6223: 6198: 6178: 6158: 6133: 6101: 5947: 5924: 5893: 5838: 5824:{\displaystyle a(t)} 5806: 5678: 5646: 5635:{\displaystyle a(t)} 5617: 5606:{\displaystyle s(t)} 5588: 5559: 5285: 5228: 5193: 5001: 4956: 4921: 4886: 4512: 4341: 4297: 4277: 4241: 4221: 4086: 4063: 4036: 3988: 3845: 3828:population processes 3735: 3715: 3646:{\displaystyle \pi } 3637: 3617: 3609:Prioritized sweeping 3579:{\displaystyle \pi } 3570: 3552:{\displaystyle s=s'} 3532: 3473: 3453: 3420: 3389: 3363: 3343: 3157: 3143:{\displaystyle V(s)} 3125: 3096: 3085:{\displaystyle V(s)} 3067: 3038: 3027:{\displaystyle \pi } 3018: 2998:In value iteration ( 2822: 2661: 2634: 2623:{\displaystyle V(s)} 2605: 2594:{\displaystyle \pi } 2585: 2574:{\displaystyle \pi } 2565: 2542: 2494: 2474: 2427: 2378: 2341: 2321: 2242: 2222: 2156: 2131: 2111: 2091: 2039: 2019: 1945: 1910: 1871: 1780: 1731: 1620: 1597: 1571: 1548: 1525: 1479: 1459: 1418: 1392: 1301: 1252: 1137: 1123:{\displaystyle \pi } 1114: 1082: 1062: 1038: 1009: 998:{\displaystyle \pi } 989: 961: 941: 930:{\displaystyle \pi } 921: 897: 872: 852: 805: 764: 640: 609: 456: 430: 405: 385: 365: 345: 298: 276: 249: 223: 187: 126: 7945:Dynamic programming 7788:Derman, C. (1970). 7762:Dynamic Programming 7365:Feinberg, Eugene A. 
7322:1953PNAS...39.1095S 7075:Recursive economics 7054:Dynamic programming 6772:discounting factor 6766:discounting factor 6742:is the negative of 6719:is the negative of 6353:computable function 6293:of possible inputs, 5385: 3887: 3816:continuous dynamics 3781:dynamic programming 3378:{\displaystyle i=0} 2526:dynamic programming 2182:dynamic programming 957:) to action space ( 445:{\displaystyle t+1} 401:will lead to state 215:set of real numbers 54:operations research 7955:Stochastic control 7445:10.1007/bf01920264 7406:Management Science 7306:"Stochastic Games" 7011: 6964: 6918: 6865: 6819: 6640: 6612: 6538: 6477: 6449: 6229: 6212:{\displaystyle s'} 6209: 6184: 6174:and I tried doing 6164: 6147:{\displaystyle s'} 6144: 6119: 6084: 5988: 5930: 5902: 5851: 5821: 5789: 5696: 5661: 5632: 5603: 5574: 5543: 5541: 5371: 5370: 5256: 5214: 5176: 5174: 5135: 5110: 5048: 5023: 4984: 4942: 4907: 4870: 4868: 4770: 4745: 4658: 4633: 4568: 4543: 4490: 4488: 4392: 4319: 4283: 4263: 4227: 4204: 4131: 4069: 4049: 4006: 3971: 3873: 3750: 3721: 3643: 3623: 3576: 3549: 3479: 3459: 3439: 3412:as a guess of the 3402: 3375: 3349: 3326: 3217: 3197: 3140: 3111: 3082: 3053: 3024: 3012:backward induction 2971: 2872: 2805: 2693: 2640: 2620: 2591: 2571: 2548: 2508:{\displaystyle s'} 2505: 2480: 2460: 2411: 2362: 2327: 2305: 2228: 2210: 2162: 2145:{\displaystyle s'} 2142: 2117: 2097: 2077: 2025: 1978: 1923: 1883: 1854: 1766: 1717: 1603: 1583: 1560:{\displaystyle H-} 1557: 1531: 1511: 1465: 1445: 1404: 1375: 1287: 1238: 1120: 1097: 1068: 1044: 1024: 995: 967: 947: 927: 917:A policy function 903: 886:{\displaystyle s'} 883: 858: 838: 785: 750: 626: 595: 442: 419:{\displaystyle s'} 416: 391: 371: 351: 331: 282: 262: 229: 193: 170: 117: 70:telecommunications 7940:Optimal decisions 7893:. Academic press. 7852:978-0-521-88441-9 7792:. Academic Press. 7771:978-0-486-42809-3 7723:. Prentice Hall. 7412:(11): 1127–1137. 7378:978-0-7923-7459-6 7316:(10): 1095–1100. 7230:978-0-262-03924-6 7163:978-1-4503-7861-1 7126:978-0-471-61977-2 6880: 6879: 6507:. Then a functor 6268:learning automata 6258:Learning automata 6252:Learning automata 6232:{\displaystyle Q} 6187:{\displaystyle a} 6167:{\displaystyle s} 6083: 5974: 5952: 5933:{\displaystyle a} 5760: 5687: 5492: 5460: 5328: 5111: 5095: 5024: 5008: 4852: 4746: 4730: 4634: 4618: 4612: 4544: 4528: 4522: 4377: 4366: 4351: 4310: 4286:{\displaystyle g} 4254: 4230:{\displaystyle h} 4184: 4116: 4072:{\displaystyle i} 3724:{\displaystyle s} 3667:decision problems 3626:{\displaystyle V} 3497:'s 1953 paper on 3482:{\displaystyle V} 3462:{\displaystyle s} 3352:{\displaystyle i} 3203: 3188: 2858: 2679: 2643:{\displaystyle s} 2551:{\displaystyle V} 2483:{\displaystyle 1} 2330:{\displaystyle A} 2284: 2263: 2231:{\displaystyle S} 2165:{\displaystyle r} 2120:{\displaystyle a} 2100:{\displaystyle s} 2028:{\displaystyle G} 1994:may be produced. 1882: 1876: 1727:(where we choose 1606:{\displaystyle H} 1582: 1576: 1534:{\displaystyle r} 1468:{\displaystyle 1} 1441: 1435: 1429: 1403: 1397: 1248:(where we choose 1071:{\displaystyle s} 1047:{\displaystyle s} 970:{\displaystyle A} 950:{\displaystyle S} 906:{\displaystyle a} 861:{\displaystyle s} 394:{\displaystyle t} 374:{\displaystyle s} 354:{\displaystyle a} 285:{\displaystyle s} 232:{\displaystyle A} 196:{\displaystyle S} 97:state transitions 52:Originating from 37:), also called a 16:(Redirected from 7962: 7950:Markov processes 7926: 7905: 7894: 7892: 7880: 7871: 7866:. Archived from 7860: 7859:on 19 June 2010. 7855:. 
Archived from 7835: 7814: 7793: 7784: 7775: 7747: 7741: 7735: 7734: 7722: 7709: 7700: 7699: 7681: 7662: 7656: 7655: 7653: 7652: 7642: 7633: 7627: 7626: 7609: 7603: 7602: 7590: 7584: 7583: 7575: 7569: 7568: 7560: 7554: 7553: 7551: 7549: 7544: 7529:Machine Learning 7520: 7514: 7513: 7511: 7509: 7504: 7471:Tsitsiklis, John 7463: 7457: 7456: 7428: 7422: 7421: 7401: 7395: 7389: 7383: 7382: 7360: 7354: 7353: 7343: 7333: 7298: 7292: 7291: 7274: 7268: 7267: 7265: 7250:Machine Learning 7241: 7235: 7234: 7216: 7207: 7206: 7174: 7168: 7167: 7137: 7131: 7130: 7112: 7085:Stochastic games 7059:Bellman equation 7020: 7018: 7017: 7012: 6998: 6997: 6993: 6973: 6971: 6970: 6965: 6948: 6927: 6925: 6924: 6919: 6914: 6874: 6872: 6871: 6866: 6855: 6854: 6853: 6828: 6826: 6825: 6820: 6815: 6798: 6797: 6775: 6769: 6759: 6753: 6745: 6741: 6736: 6730: 6722: 6718: 6713: 6707: 6697: 6691: 6674: 6670: 6666: 6662: 6649: 6647: 6646: 6641: 6639: 6638: 6621: 6619: 6618: 6613: 6608: 6591: 6590: 6575: 6574: 6547: 6545: 6544: 6539: 6537: 6520: 6519: 6501:Kleisli category 6486: 6484: 6483: 6478: 6476: 6475: 6458: 6456: 6455: 6450: 6264:machine learning 6238: 6236: 6235: 6230: 6218: 6216: 6215: 6210: 6208: 6193: 6191: 6190: 6185: 6173: 6171: 6170: 6165: 6153: 6151: 6150: 6145: 6143: 6128: 6126: 6125: 6120: 6093: 6091: 6090: 6085: 6081: 6071: 6048: 6031: 6030: 6015: 5998: 5997: 5987: 5986: 5950: 5939: 5937: 5936: 5931: 5911: 5909: 5908: 5903: 5879:machine learning 5860: 5858: 5857: 5852: 5850: 5849: 5830: 5828: 5827: 5822: 5798: 5796: 5795: 5790: 5761: 5759: 5751: 5728: 5695: 5670: 5668: 5667: 5662: 5641: 5639: 5638: 5633: 5612: 5610: 5609: 5604: 5583: 5581: 5580: 5575: 5552: 5550: 5549: 5544: 5542: 5493: 5491: 5483: 5466: 5461: 5458: 5384: 5379: 5369: 5323: 5265: 5263: 5262: 5257: 5240: 5239: 5223: 5221: 5220: 5215: 5185: 5183: 5182: 5177: 5175: 5134: 5109: 5076: 5075: 5047: 5022: 4993: 4991: 4990: 4985: 4968: 4967: 4951: 4949: 4948: 4943: 4916: 4914: 4913: 4908: 4879: 4877: 4876: 4871: 4869: 4853: 4850: 4801: 4769: 4744: 4726: 4657: 4632: 4613: 4610: 4567: 4542: 4523: 4520: 4499: 4497: 4496: 4491: 4489: 4391: 4367: 4364: 4352: 4349: 4328: 4326: 4325: 4320: 4318: 4317: 4312: 4311: 4303: 4292: 4290: 4289: 4284: 4272: 4270: 4269: 4264: 4262: 4261: 4256: 4255: 4247: 4236: 4234: 4233: 4228: 4213: 4211: 4210: 4205: 4185: 4182: 4130: 4078: 4076: 4075: 4070: 4058: 4056: 4055: 4050: 4048: 4047: 4015: 4013: 4012: 4007: 3980: 3978: 3977: 3972: 3970: 3966: 3965: 3964: 3955: 3951: 3897: 3896: 3886: 3881: 3860: 3859: 3824:queueing systems 3759: 3757: 3756: 3751: 3730: 3728: 3727: 3722: 3671:complexity class 3652: 3650: 3649: 3644: 3632: 3630: 3629: 3624: 3585: 3583: 3582: 3577: 3558: 3556: 3555: 3550: 3548: 3520: 3505:Policy iteration 3499:stochastic games 3491:Bellman equation 3488: 3486: 3485: 3480: 3468: 3466: 3465: 3460: 3448: 3446: 3445: 3440: 3438: 3437: 3411: 3409: 3408: 3403: 3401: 3400: 3384: 3382: 3381: 3376: 3358: 3356: 3355: 3350: 3335: 3333: 3332: 3327: 3322: 3318: 3317: 3313: 3309: 3298: 3297: 3279: 3262: 3261: 3244: 3227: 3226: 3216: 3215: 3196: 3175: 3174: 3149: 3147: 3146: 3141: 3120: 3118: 3117: 3112: 3091: 3089: 3088: 3083: 3062: 3060: 3059: 3054: 3033: 3031: 3030: 3025: 3009: 2989:Notable variants 2980: 2978: 2977: 2972: 2970: 2966: 2965: 2961: 2957: 2934: 2917: 2916: 2899: 2882: 2881: 2871: 2870: 2849: 2848: 2814: 2812: 2811: 2806: 2804: 2800: 2796: 2773: 2756: 2755: 2729: 2712: 2711: 2692: 2691: 2649: 2647: 2646: 2641: 2629: 2627: 2626: 2621: 2600: 2598: 2597: 2592: 2580: 2578: 2577: 2572: 2557: 2555: 2554: 2549: 
2514: 2512: 2511: 2506: 2504: 2489: 2487: 2486: 2481: 2469: 2467: 2466: 2461: 2456: 2439: 2438: 2420: 2418: 2417: 2412: 2407: 2390: 2389: 2371: 2369: 2368: 2363: 2336: 2334: 2333: 2328: 2314: 2312: 2311: 2306: 2304: 2303: 2298: 2286: 2285: 2277: 2265: 2264: 2256: 2237: 2235: 2234: 2229: 2171: 2169: 2168: 2163: 2151: 2149: 2148: 2143: 2141: 2126: 2124: 2123: 2118: 2106: 2104: 2103: 2098: 2086: 2084: 2083: 2078: 2049: 2034: 2032: 2031: 2026: 2005:generative model 2000:generative model 1987: 1985: 1984: 1979: 1974: 1957: 1956: 1937:Simulator models 1932: 1930: 1929: 1924: 1922: 1921: 1892: 1890: 1889: 1884: 1880: 1874: 1863: 1861: 1860: 1855: 1850: 1849: 1831: 1830: 1818: 1817: 1816: 1815: 1798: 1797: 1775: 1773: 1772: 1767: 1762: 1761: 1743: 1742: 1726: 1724: 1723: 1718: 1716: 1712: 1711: 1707: 1706: 1688: 1687: 1675: 1674: 1673: 1672: 1656: 1645: 1612: 1610: 1609: 1604: 1592: 1590: 1589: 1584: 1580: 1574: 1566: 1564: 1563: 1558: 1540: 1538: 1537: 1532: 1520: 1518: 1517: 1512: 1495: 1474: 1472: 1471: 1466: 1454: 1452: 1451: 1446: 1439: 1433: 1427: 1413: 1411: 1410: 1405: 1401: 1395: 1384: 1382: 1381: 1376: 1371: 1370: 1352: 1351: 1339: 1338: 1337: 1336: 1319: 1318: 1296: 1294: 1293: 1288: 1283: 1282: 1264: 1263: 1247: 1245: 1244: 1239: 1237: 1233: 1232: 1228: 1227: 1209: 1208: 1196: 1195: 1194: 1193: 1179: 1178: 1167: 1162: 1129: 1127: 1126: 1121: 1106: 1104: 1103: 1098: 1077: 1075: 1074: 1069: 1053: 1051: 1050: 1045: 1033: 1031: 1030: 1025: 1004: 1002: 1001: 996: 976: 974: 973: 968: 956: 954: 953: 948: 936: 934: 933: 928: 912: 910: 909: 904: 893:, due to action 892: 890: 889: 884: 882: 867: 865: 864: 859: 847: 845: 844: 839: 834: 817: 816: 797:Lebesgue measure 794: 792: 791: 786: 784: 783: 778: 759: 757: 756: 751: 740: 739: 721: 720: 708: 697: 696: 669: 652: 651: 635: 633: 632: 627: 619: 604: 602: 601: 596: 591: 577: 560: 559: 550: 549: 548: 523: 522: 504: 503: 491: 480: 479: 451: 449: 448: 443: 425: 423: 422: 417: 415: 400: 398: 397: 392: 380: 378: 377: 372: 360: 358: 357: 352: 340: 338: 337: 332: 327: 310: 309: 291: 289: 288: 283: 271: 269: 268: 263: 261: 260: 245:(alternatively, 238: 236: 235: 230: 202: 200: 199: 194: 179: 177: 176: 171: 166: 165: 153: 152: 82:cause and effect 21: 7970: 7969: 7965: 7964: 7963: 7961: 7960: 7959: 7930: 7929: 7923: 7908: 7897: 7890: 7883: 7874: 7862: 7853: 7838: 7832: 7817: 7811: 7796: 7787: 7778: 7772: 7759: 7756: 7754:Further reading 7751: 7750: 7742: 7738: 7731: 7711: 7710: 7703: 7679:10.1.1.295.2280 7666:Narendra, K. S. 
7664: 7663: 7659: 7650: 7648: 7640: 7635: 7634: 7630: 7611: 7610: 7606: 7592: 7591: 7587: 7577: 7576: 7572: 7562: 7561: 7557: 7547: 7545: 7522: 7521: 7517: 7507: 7505: 7465: 7464: 7460: 7430: 7429: 7425: 7403: 7402: 7398: 7390: 7386: 7379: 7362: 7361: 7357: 7300: 7299: 7295: 7288: 7276: 7275: 7271: 7243: 7242: 7238: 7231: 7218: 7217: 7210: 7176: 7175: 7171: 7164: 7139: 7138: 7134: 7127: 7114: 7113: 7109: 7104: 7099: 7070:Optimal control 7029: 7023: 6986: 6981: 6976: 6975: 6941: 6930: 6929: 6907: 6884: 6883: 6846: 6838: 6833: 6832: 6808: 6789: 6784: 6783: 6773: 6767: 6757: 6751: 6743: 6739: 6734: 6728: 6720: 6716: 6711: 6705: 6695: 6689: 6677:in this article 6668: 6664: 6660: 6656: 6628: 6627: 6561: 6560: 6509: 6508: 6465: 6464: 6461:Category theory 6423: 6422: 6419: 6346: 6340: 6318: 6312: 6305: 6299: 6260: 6254: 6249: 6221: 6220: 6201: 6196: 6195: 6176: 6175: 6156: 6155: 6136: 6131: 6130: 6099: 6098: 6064: 6041: 6022: 6008: 5989: 5979: 5945: 5944: 5922: 5921: 5918: 5891: 5890: 5883:optimal control 5872: 5866: 5841: 5836: 5835: 5804: 5803: 5752: 5729: 5676: 5675: 5644: 5643: 5615: 5614: 5586: 5585: 5557: 5556: 5540: 5539: 5484: 5467: 5463: 5454: 5453: 5324: 5283: 5282: 5272: 5231: 5226: 5225: 5191: 5190: 5173: 5172: 5067: 4999: 4998: 4959: 4954: 4953: 4919: 4918: 4884: 4883: 4867: 4866: 4851: and  4799: 4798: 4724: 4723: 4614: 4606: 4605: 4524: 4510: 4509: 4487: 4486: 4369: 4360: 4359: 4354: 4339: 4338: 4300: 4295: 4294: 4275: 4274: 4244: 4239: 4238: 4219: 4218: 4183: and  4084: 4083: 4061: 4060: 4039: 4034: 4033: 4021: 3986: 3985: 3956: 3888: 3872: 3869: 3868: 3864: 3851: 3843: 3842: 3808: 3800:motion planning 3777:linear programs 3766: 3733: 3732: 3713: 3712: 3709: 3703: 3695:stochastic game 3691: 3663:time complexity 3659: 3635: 3634: 3615: 3614: 3611: 3595: 3568: 3567: 3541: 3530: 3529: 3514: 3507: 3471: 3470: 3451: 3450: 3449:for all states 3423: 3418: 3417: 3392: 3387: 3386: 3361: 3360: 3341: 3340: 3302: 3289: 3272: 3253: 3252: 3248: 3237: 3218: 3208: 3202: 3198: 3160: 3155: 3154: 3123: 3122: 3094: 3093: 3065: 3064: 3036: 3035: 3016: 3015: 3003: 2996: 2994:Value iteration 2991: 2950: 2927: 2908: 2907: 2903: 2892: 2873: 2863: 2857: 2853: 2840: 2820: 2819: 2789: 2766: 2738: 2737: 2733: 2722: 2694: 2684: 2659: 2658: 2632: 2631: 2603: 2602: 2583: 2582: 2563: 2562: 2540: 2539: 2522: 2497: 2492: 2491: 2472: 2471: 2449: 2430: 2425: 2424: 2400: 2381: 2376: 2375: 2339: 2338: 2319: 2318: 2293: 2240: 2239: 2220: 2219: 2198: 2176:models through 2154: 2153: 2134: 2129: 2128: 2109: 2108: 2089: 2088: 2042: 2037: 2036: 2017: 2016: 2001: 1992: 1967: 1948: 1943: 1942: 1939: 1913: 1908: 1907: 1904: 1895:Learning Theory 1869: 1868: 1835: 1822: 1807: 1802: 1783: 1778: 1777: 1753: 1734: 1729: 1728: 1692: 1679: 1664: 1659: 1630: 1626: 1618: 1617: 1595: 1594: 1569: 1568: 1546: 1545: 1523: 1522: 1477: 1476: 1457: 1456: 1416: 1415: 1390: 1389: 1356: 1343: 1328: 1323: 1304: 1299: 1298: 1274: 1255: 1250: 1249: 1213: 1200: 1185: 1180: 1170: 1147: 1143: 1135: 1134: 1112: 1111: 1080: 1079: 1060: 1059: 1036: 1035: 1007: 1006: 987: 986: 983: 959: 958: 939: 938: 919: 918: 895: 894: 875: 870: 869: 850: 849: 827: 808: 803: 802: 773: 762: 761: 731: 712: 701: 682: 662: 643: 638: 637: 612: 607: 606: 584: 570: 551: 541: 536: 514: 495: 484: 465: 454: 453: 428: 427: 408: 403: 402: 383: 382: 363: 362: 343: 342: 320: 301: 296: 295: 274: 273: 252: 247: 246: 243: 221: 220: 211: 185: 184: 157: 144: 124: 123: 109: 101:Markov property 49:are uncertain. 
28: 23: 22: 15: 12: 11: 5: 7968: 7966: 7958: 7957: 7952: 7947: 7942: 7932: 7931: 7928: 7927: 7921: 7906: 7895: 7881: 7872: 7851: 7836: 7830: 7815: 7809: 7794: 7785: 7776: 7770: 7755: 7752: 7749: 7748: 7736: 7729: 7701: 7657: 7628: 7604: 7585: 7570: 7555: 7515: 7458: 7439:(5): 203–208. 7423: 7396: 7384: 7377: 7355: 7302:Shapley, Lloyd 7293: 7286: 7269: 7236: 7229: 7208: 7189:(3): 228–239. 7169: 7162: 7132: 7125: 7106: 7105: 7103: 7100: 7098: 7097: 7092: 7087: 7082: 7077: 7072: 7067: 7062: 7056: 7051: 7046: 7041: 7039:Odds algorithm 7036: 7030: 7028: 7025: 7010: 7007: 7004: 7001: 6996: 6992: 6989: 6984: 6963: 6960: 6957: 6954: 6951: 6947: 6944: 6940: 6937: 6917: 6913: 6910: 6906: 6903: 6900: 6897: 6894: 6891: 6878: 6877: 6875: 6864: 6861: 6858: 6852: 6849: 6845: 6841: 6829: 6818: 6814: 6811: 6807: 6804: 6801: 6796: 6792: 6779: 6778: 6776: 6770: 6763: 6762: 6760: 6754: 6747: 6746: 6737: 6731: 6724: 6723: 6714: 6708: 6701: 6700: 6698: 6692: 6685: 6684: 6681: 6678: 6655: 6652: 6637: 6611: 6607: 6604: 6601: 6598: 6594: 6589: 6584: 6581: 6578: 6573: 6568: 6536: 6533: 6530: 6527: 6523: 6518: 6474: 6463:. Namely, let 6448: 6445: 6442: 6439: 6436: 6433: 6430: 6418: 6415: 6391:Markov process 6387: 6386: 6379: 6349: 6344: 6338: 6328: 6314: 6310: 6307: 6301: 6297: 6294: 6256:Main article: 6253: 6250: 6248: 6245: 6228: 6207: 6204: 6183: 6163: 6142: 6139: 6118: 6115: 6112: 6109: 6106: 6095: 6094: 6080: 6077: 6074: 6070: 6067: 6063: 6060: 6057: 6054: 6051: 6047: 6044: 6040: 6037: 6034: 6029: 6025: 6021: 6018: 6014: 6011: 6007: 6004: 6001: 5996: 5992: 5985: 5982: 5977: 5973: 5970: 5967: 5964: 5961: 5958: 5955: 5929: 5917: 5914: 5901: 5898: 5868:Main article: 5865: 5862: 5848: 5844: 5833:value function 5820: 5817: 5814: 5811: 5800: 5799: 5788: 5785: 5782: 5779: 5776: 5773: 5770: 5767: 5764: 5758: 5755: 5750: 5747: 5744: 5741: 5738: 5735: 5732: 5726: 5723: 5720: 5717: 5714: 5711: 5708: 5705: 5702: 5699: 5694: 5690: 5686: 5683: 5660: 5657: 5654: 5651: 5631: 5628: 5625: 5622: 5602: 5599: 5596: 5593: 5573: 5570: 5567: 5564: 5554: 5553: 5538: 5535: 5532: 5529: 5526: 5523: 5520: 5517: 5514: 5511: 5508: 5505: 5502: 5499: 5496: 5490: 5487: 5482: 5479: 5476: 5473: 5470: 5464: 5456: 5455: 5452: 5449: 5446: 5443: 5440: 5437: 5434: 5431: 5428: 5425: 5421: 5418: 5415: 5412: 5409: 5406: 5403: 5400: 5397: 5394: 5391: 5388: 5383: 5378: 5374: 5368: 5365: 5362: 5359: 5356: 5353: 5350: 5347: 5344: 5341: 5338: 5335: 5331: 5327: 5325: 5321: 5318: 5315: 5312: 5309: 5306: 5303: 5300: 5297: 5294: 5291: 5290: 5271: 5268: 5255: 5252: 5249: 5246: 5243: 5238: 5234: 5213: 5210: 5207: 5204: 5201: 5198: 5187: 5186: 5171: 5168: 5165: 5162: 5159: 5156: 5153: 5150: 5147: 5144: 5141: 5138: 5133: 5130: 5127: 5124: 5121: 5118: 5114: 5108: 5105: 5102: 5098: 5094: 5091: 5088: 5085: 5082: 5079: 5074: 5070: 5066: 5063: 5060: 5057: 5054: 5051: 5046: 5043: 5040: 5037: 5034: 5031: 5027: 5021: 5018: 5015: 5011: 5007: 5006: 4983: 4980: 4977: 4974: 4971: 4966: 4962: 4941: 4938: 4935: 4932: 4929: 4926: 4906: 4903: 4900: 4897: 4894: 4891: 4881: 4880: 4865: 4862: 4859: 4856: 4848: 4845: 4842: 4839: 4836: 4833: 4830: 4826: 4823: 4820: 4817: 4814: 4811: 4808: 4805: 4802: 4800: 4797: 4794: 4791: 4788: 4785: 4782: 4779: 4776: 4773: 4768: 4765: 4762: 4759: 4756: 4753: 4749: 4743: 4740: 4737: 4733: 4729: 4727: 4725: 4722: 4719: 4716: 4713: 4710: 4706: 4703: 4700: 4697: 4694: 4691: 4688: 4685: 4682: 4679: 4676: 4673: 4670: 4667: 4664: 4661: 4656: 4653: 4650: 4647: 4644: 4641: 4637: 4631: 4628: 4625: 4621: 4617: 4615: 4608: 4607: 4604: 4601: 4598: 4595: 4592: 
4589: 4586: 4583: 4580: 4577: 4574: 4571: 4566: 4563: 4560: 4557: 4554: 4551: 4547: 4541: 4538: 4535: 4531: 4527: 4525: 4518: 4517: 4506: 4505: 4501: 4500: 4485: 4482: 4479: 4476: 4473: 4470: 4466: 4463: 4460: 4457: 4454: 4449: 4446: 4443: 4440: 4437: 4434: 4431: 4428: 4425: 4422: 4419: 4416: 4413: 4410: 4407: 4404: 4401: 4398: 4395: 4390: 4387: 4384: 4380: 4376: 4373: 4370: 4362: 4361: 4358: 4355: 4347: 4346: 4335: 4334: 4316: 4309: 4306: 4282: 4260: 4253: 4250: 4226: 4215: 4214: 4203: 4200: 4197: 4194: 4191: 4188: 4180: 4177: 4174: 4171: 4167: 4164: 4161: 4158: 4155: 4152: 4149: 4146: 4143: 4140: 4137: 4134: 4129: 4126: 4123: 4119: 4115: 4112: 4109: 4106: 4103: 4100: 4097: 4094: 4091: 4068: 4046: 4042: 4020: 4017: 4005: 4002: 3999: 3996: 3993: 3982: 3981: 3969: 3963: 3959: 3954: 3949: 3946: 3942: 3939: 3936: 3933: 3930: 3927: 3924: 3921: 3918: 3915: 3912: 3909: 3906: 3903: 3900: 3895: 3891: 3885: 3880: 3876: 3871: 3867: 3863: 3858: 3854: 3850: 3807: 3804: 3796: 3795: 3788: 3787: 3784: 3783:does not work. 3773: 3765: 3762: 3749: 3746: 3743: 3740: 3720: 3705:Main article: 3702: 3699: 3690: 3687: 3658: 3655: 3642: 3622: 3610: 3607: 3599:van Nunen 1976 3594: 3591: 3575: 3547: 3544: 3540: 3537: 3506: 3503: 3478: 3458: 3436: 3433: 3430: 3426: 3414:value function 3399: 3395: 3374: 3371: 3368: 3348: 3337: 3336: 3325: 3321: 3316: 3312: 3308: 3305: 3301: 3296: 3292: 3288: 3285: 3282: 3278: 3275: 3271: 3268: 3265: 3260: 3256: 3251: 3247: 3243: 3240: 3236: 3233: 3230: 3225: 3221: 3214: 3211: 3206: 3201: 3195: 3191: 3187: 3184: 3181: 3178: 3173: 3170: 3167: 3163: 3139: 3136: 3133: 3130: 3110: 3107: 3104: 3101: 3081: 3078: 3075: 3072: 3052: 3049: 3046: 3043: 3023: 2995: 2992: 2990: 2987: 2982: 2981: 2969: 2964: 2960: 2956: 2953: 2949: 2946: 2943: 2940: 2937: 2933: 2930: 2926: 2923: 2920: 2915: 2911: 2906: 2902: 2898: 2895: 2891: 2888: 2885: 2880: 2876: 2869: 2866: 2861: 2856: 2852: 2847: 2843: 2839: 2836: 2833: 2830: 2827: 2816: 2815: 2803: 2799: 2795: 2792: 2788: 2785: 2782: 2779: 2776: 2772: 2769: 2765: 2762: 2759: 2754: 2751: 2748: 2745: 2741: 2736: 2732: 2728: 2725: 2721: 2718: 2715: 2710: 2707: 2704: 2701: 2697: 2690: 2687: 2682: 2678: 2675: 2672: 2669: 2666: 2639: 2619: 2616: 2613: 2610: 2590: 2570: 2547: 2521: 2518: 2517: 2516: 2503: 2500: 2479: 2459: 2455: 2452: 2448: 2445: 2442: 2437: 2433: 2422: 2410: 2406: 2403: 2399: 2396: 2393: 2388: 2384: 2373: 2361: 2358: 2355: 2352: 2349: 2346: 2326: 2316: 2302: 2297: 2292: 2289: 2283: 2280: 2274: 2271: 2268: 2262: 2259: 2253: 2250: 2247: 2227: 2197: 2194: 2161: 2140: 2137: 2116: 2096: 2076: 2073: 2070: 2067: 2064: 2061: 2058: 2055: 2052: 2048: 2045: 2024: 1999: 1990: 1977: 1973: 1970: 1966: 1963: 1960: 1955: 1951: 1938: 1935: 1920: 1916: 1903:optimal policy 1902: 1879: 1865: 1864: 1853: 1848: 1845: 1842: 1838: 1834: 1829: 1825: 1821: 1814: 1810: 1805: 1801: 1796: 1793: 1790: 1786: 1765: 1760: 1756: 1752: 1749: 1746: 1741: 1737: 1715: 1710: 1705: 1702: 1699: 1695: 1691: 1686: 1682: 1678: 1671: 1667: 1662: 1655: 1652: 1649: 1644: 1641: 1638: 1634: 1629: 1625: 1602: 1579: 1556: 1553: 1530: 1510: 1507: 1504: 1501: 1498: 1494: 1490: 1487: 1484: 1475:(for example, 1464: 1444: 1438: 1432: 1426: 1423: 1400: 1386: 1385: 1374: 1369: 1366: 1363: 1359: 1355: 1350: 1346: 1342: 1335: 1331: 1326: 1322: 1317: 1314: 1311: 1307: 1286: 1281: 1277: 1273: 1270: 1267: 1262: 1258: 1236: 1231: 1226: 1223: 1220: 1216: 1212: 1207: 1203: 1199: 1192: 1188: 1183: 1177: 1173: 1166: 1161: 1158: 1155: 1151: 1146: 1142: 1119: 1096: 1093: 1090: 1087: 1067: 1043: 1023: 1020: 1017: 
Algorithms

Solutions for MDPs with finite state and action spaces may be found through a variety of methods such as dynamic programming, assuming the transition probabilities and rewards are explicitly given. The standard family of algorithms requires storage for two arrays indexed by state: the value V, which contains real values, and the policy π, which contains actions. At the end of the algorithm, π contains the solution and V(s) contains the discounted sum of the rewards to be earned, on average, by following that solution from state s. The value update and the policy update are

    V(s) := ∑_{s′} P_{π(s)}(s, s′) ( R_{π(s)}(s, s′) + γ V(s′) )

    π(s) := argmax_a { ∑_{s′} P_a(s, s′) ( R_a(s, s′) + γ V(s′) ) }.

Notable variants

Value iteration

In value iteration (Bellman 1957), also called backward induction, the π array is not used; instead, the value of π(s) is calculated within V(s) whenever it is needed. Substituting the calculation of π(s) into the calculation of V(s) gives the combined step

    V_{i+1}(s) := max_a { ∑_{s′} P_a(s, s′) ( R_a(s, s′) + γ V_i(s′) ) },

where i is the iteration number. Value iteration starts at i = 0 with some guess V_0 of the value function and repeats the update for all states s until V converges, with the left-hand side equal to the right-hand side (which is the Bellman equation for this problem). Lloyd Shapley's 1953 paper on stochastic games included, as a special case, the value iteration method for MDPs, but this was recognized only later.
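The combined update can be implemented directly. A minimal sketch over the explicit dictionary model used above (the stopping tolerance and the in-place, state-by-state update order are implementation choices, not prescribed by the source):

    def value_iteration(S, actions, P, R, gamma=0.9, tol=1e-8):
        """S: list of states; actions(s): available actions. Returns (V, pi)."""
        V = {s: 0.0 for s in S}  # initial guess V_0

        def q(s, a):  # one-step lookahead value of action a in state s
            return sum(p * (R[s][a][s2] + gamma * V[s2]) for s2, p in P[s][a])

        while True:
            delta = 0.0
            for s in S:  # sweep over all states
                best = max(q(s, a) for a in actions(s))
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < tol:  # V_{i+1} is (numerically) equal to V_i
                break
        pi = {s: max(actions(s), key=lambda a: q(s, a)) for s in S}
        return V, pi

    # V, pi = value_iteration(["low", "high"], lambda s: list(P[s]), P, R)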
Policy iteration

In policy iteration (Howard 1960), the value update is performed once, then the policy update is performed once, and the two are repeated until the policy converges. (Policy iteration was invented by Howard to optimize Sears catalogue mailing, which he had been doing using value iteration.) Instead of repeating the value update until convergence, the evaluation step may be formulated and solved as a system of linear equations, since the value of a fixed policy satisfies the linear fixed-point equation V = R_π + γ P_π V. This variant has the advantage of a definite stopping condition: when the array π does not change in the course of applying the policy update to all states, the algorithm is completed. Policy iteration is usually slower than value iteration for a large number of possible states.

Modified policy iteration

In modified policy iteration (van Nunen 1976), the evaluation step is not run to convergence; instead, a few value updates are performed for the current policy, followed by a policy update, and so on.

Prioritized sweeping

In this variant, the steps are preferentially applied to states that are in some way important, whether based on the algorithm (there were large changes in V or π around those states recently) or based on use (those states are near the starting state, or otherwise of interest to the person or program using the algorithm).
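Because evaluating a fixed policy is a linear problem, each evaluation step can be solved exactly. A sketch using numpy (the array layout, with P of shape (actions, states, states) and R of shape (actions, states), is an assumption made here for compactness, not the source's notation):

    import numpy as np

    def policy_iteration(P, R, gamma=0.9):
        """P[a, s, s2]: transition matrices; R[a, s]: expected one-step reward.
        Returns (V, pi) with pi an integer action index per state."""
        n_actions, n_states, _ = P.shape
        pi = np.zeros(n_states, dtype=int)  # arbitrary initial policy
        while True:
            # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
            P_pi = P[pi, np.arange(n_states)]
            R_pi = R[pi, np.arange(n_states)]
            V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
            # Policy improvement: greedy one-step lookahead on V.
            Q = R + gamma * P @ V  # shape (n_actions, n_states)
            new_pi = Q.argmax(axis=0)
            if np.array_equal(new_pi, pi):  # definite stopping condition
                return V, pi
            pi = new_pi

    # Invented toy data:
    # P = np.array([[[0.9, 0.1], [0.2, 0.8]],
    #               [[0.5, 0.5], [0.1, 0.9]]])  # P[a, s, s2], rows sum to 1
    # R = np.array([[1.0, 0.0], [0.0, 2.0]])    # R[a, s]
    # V, pi = policy_iteration(P, R)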
Continuous-time Markov decision process

In discrete-time Markov decision processes, decisions are made at discrete time intervals; in a continuous-time MDP, decisions can be made at any time the decision maker chooses, which better models systems with continuous dynamics. If the state space and action space are continuous, the optimal criterion can be found by solving the Hamilton–Jacobi–Bellman (HJB) partial differential equation. Consider the problem

    V(s(0), 0) = max_{a(t) = π(s(t))} { ∫_0^T r(s(t), a(t)) dt + D[s(T)] }    subject to    ds(t)/dt = f[t, s(t), a(t)],

where D(⋅) is the terminal reward function, s(t) is the system state, a(t) is the control, and f(⋅) describes how the state evolves over time. The optimal value function satisfies the HJB equation

    0 = ∂V(t, s)/∂t + max_a { r(t, s, a) + (∂V(t, s)/∂s) f(t, s, a) }

with boundary condition V(T, s) = D[s]; an optimal control is obtained as the maximizing a. When instead the state space and action space are finite (the ergodic model discussed above), linear programming applies. If the optimal average reward is independent of the initial state, it is characterized as the smallest g for which some function h satisfies

    g ≥ R(i, a) + ∑_{j∈S} q(j ∣ i, a) h(j)    for all i ∈ S and a ∈ A(i),

where q(j ∣ i, a) denotes the transition rate from state i to state j under action a. Minimizing g subject to these constraints is a linear program, and its dual, the program over the occupation measure y(i, a) stated earlier, has an optimal solution y∗(i, a) from which an optimal stationary policy can be read off.
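The dual program over the occupation measure y(i, a) is an ordinary linear program, so an off-the-shelf solver applies. A sketch using scipy.optimize.linprog (the two-state generator q and the reward rates are invented toy data; flattening y(i, a) into one vector is an implementation choice):

    import numpy as np
    from scipy.optimize import linprog

    # q[i, a, j]: transition rates q(j | i, a); each row over j sums to 0.
    q = np.array([[[-1.0,  1.0], [-2.0,  2.0]],
                  [[ 0.5, -0.5], [ 1.5, -1.5]]])
    R = np.array([[1.0, 0.0],   # R[i, a]: reward rate for action a in state i
                  [0.0, 2.0]])
    S, A = R.shape

    c = -R.flatten()  # linprog minimizes, so negate to maximize sum R(i,a)y(i,a)
    A_eq = np.zeros((S + 1, S * A))
    for j in range(S):  # balance constraints: sum_{i,a} q(j|i,a) y(i,a) = 0
        for i in range(S):
            for a in range(A):
                A_eq[j, i * A + a] = q[i, a, j]
    A_eq[S, :] = 1.0  # normalization: sum_{i,a} y(i,a) = 1
    b_eq = np.zeros(S + 1)
    b_eq[S] = 1.0

    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0.0, None)] * (S * A))
    y = res.x.reshape(S, A)     # optimal occupation measure y*(i, a)
    print(y, y.argmax(axis=1))  # a stationary policy read off greedily from y*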
7696:0018-9472 7674:CiteSeerX 7203:0019-9958 6950:∣ 6593:→ 6522:→ 6056:γ 5976:∑ 5900:− 5897:ε 5847:∗ 5754:∂ 5731:∂ 5656:⋅ 5569:⋅ 5373:∫ 5349:π 5237:∗ 5120:∈ 5113:∑ 5104:∈ 5097:∑ 5093:≥ 5073:∗ 5033:∈ 5026:∑ 5017:∈ 5010:∑ 4965:∗ 4861:∈ 4855:∀ 4835:∈ 4829:∀ 4822:≥ 4755:∈ 4748:∑ 4739:∈ 4732:∑ 4715:∈ 4709:∀ 4669:∣ 4643:∈ 4636:∑ 4627:∈ 4620:∑ 4553:∈ 4546:∑ 4537:∈ 4530:∑ 4472:∈ 4459:∈ 4453:∀ 4430:≥ 4403:∣ 4386:∈ 4379:∑ 4375:− 4315:∗ 4308:¯ 4259:∗ 4252:¯ 4190:∈ 4176:∈ 4170:∀ 4142:∣ 4125:∈ 4118:∑ 4093:≥ 4045:∗ 3998:γ 3995:≤ 3920:π 3890:γ 3884:∞ 3875:∫ 3862:⁡ 3857:π 3739:π 3641:π 3574:π 3287:γ 3205:∑ 3100:π 3042:π 3022:π 2942:γ 2860:∑ 2851:⁡ 2826:π 2781:γ 2744:π 2700:π 2681:∑ 2589:π 2569:π 2348:− 2291:⊂ 2282:˙ 2261:˙ 2258:θ 2249:θ 2057:← 1919:∗ 1915:π 1800:∼ 1748:π 1651:− 1633:∑ 1578:γ 1555:− 1483:γ 1437:≤ 1431:γ 1425:≤ 1399:γ 1321:∼ 1269:π 1172:γ 1165:∞ 1150:∑ 1118:π 1086:π 1013:π 993:π 925:π 868:to state 771:⊆ 710:∣ 621:⊆ 538:∫ 493:∣ 482:∈ 361:in state 180:, where: 62:economics 7879:. Wiley. 7473:(1987). 7350:16589380 7304:(1953). 7027:See also 6991:′ 6946:′ 6912:′ 6851:′ 6813:′ 6694:control 6683:comment 6405:+ 1) by 6334:(0) = ≪ 6313:, ..., α 6300:, ..., Φ 6272:Narendra 6206:′ 6141:′ 6069:′ 6046:′ 6013:′ 5984:′ 4521:Maximize 4350:Minimize 3546:′ 3469:, until 3307:′ 3277:′ 3242:′ 3213:′ 2955:′ 2932:′ 2897:′ 2868:′ 2794:′ 2771:′ 2727:′ 2689:′ 2502:′ 2454:′ 2405:′ 2139:′ 2047:′ 1991:episodes 1972:′ 880:′ 832:′ 706:′ 667:′ 617:′ 589:′ 575:′ 546:′ 489:′ 426:at time 413:′ 381:at time 325:′ 47:outcomes 7453:5167748 7341:1063912 7318:Bibcode 6756:policy 6750:policy 6704:reward 6688:action 6503:of the 6401:) to P( 4237:, then 4026:ergodic 2196:Example 58:ecology 7919:  7849:  7828:  7807:  7768:  7727:  7694:  7676:  7647:: 1–13 7451:  7375:  7348:  7338:  7284:  7227:  7201:  7160:  7123:  6774:α 6768:γ 6758:μ 6752:π 6727:value 6669:α 6665:γ 6661:β 6495:. Let 6348:(0) ≫, 6289:a set 6082:  5951:  4030:policy 3984:where 3835:policy 3339:where 3014:, the 2842:argmax 2560:policy 1881:  1875:  1867:where 1581:  1575:  1440:  1434:  1428:  1402:  1396:  1388:where 7891:(PDF) 7641:(PDF) 7487:(3). 7449:S2CID 6710:cost 3523:Sears 2537:value 203:is a 121:tuple 45:when 7917:ISBN 7847:ISBN 7826:ISBN 7805:ISBN 7766:ISBN 7725:ISBN 7692:ISSN 7550:2023 7510:2023 7373:ISBN 7346:PMID 7282:ISBN 7225:ISBN 7199:ISSN 7158:ISBN 7121:ISBN 6497:Dist 6194:and 5881:and 5459:s.t. 4611:s.t. 4001:< 3517:help 3385:and 3006:help 2152:and 2107:and 72:and 7684:doi 7619:doi 7537:doi 7497:hdl 7489:doi 7441:doi 7414:doi 7336:PMC 7326:doi 7258:doi 7191:doi 7150:doi 6663:or 5689:max 5330:max 4365:s.t 3849:max 3633:or 3190:max 2470:is 2337:is 1107:). 977:). 205:set 35:MDP 7936:: 7704:^ 7690:. 7682:. 7643:. 7617:. 7597:. 7535:. 7533:49 7531:. 7527:. 7495:. 7485:12 7483:. 7477:. 7469:; 7447:. 7437:20 7435:. 7410:24 7408:. 7344:. 7334:. 7324:. 7314:39 7312:. 7308:. 7254:49 7252:. 7248:. 7211:^ 7197:. 7185:. 7181:. 7156:. 7144:. 6936:Pr 6928:, 6890:Pr 6622:a 6556:. 6351:a 6323:≤ 6243:. 4004:1. 3830:. 3601:; 3563:. 3186::= 2838::= 2677::= 2650:. 2532:. 2015:, 1897:. 677:Pr 460:Pr 68:, 64:, 60:, 7925:. 7834:. 7813:. 7774:. 7733:. 7698:. 7686:: 7654:. 7625:. 7621:: 7552:. 7539:: 7512:. 7499:: 7491:: 7455:. 7443:: 7420:. 7416:: 7381:. 7352:. 7328:: 7320:: 7290:. 7266:. 7260:: 7233:. 7205:. 7193:: 7187:1 7166:. 7152:: 7129:. 7009:. 
Alternative notations

There are multiple notations for MDPs in use. Besides the reinforcement-learning notation used in this article, the following notation from operations research and control is common:

    in this article       alternative           comment
    action a              control u
    reward R              cost g                g is the negative of R
    value V               cost-to-go J          J is the negative of V
    policy π              policy μ
    discount factor γ     discount factor α

In addition, the transition probability is variously denoted P_a(s, s′), Pr(s, a, s′), Pr(s′ ∣ s, a), or p_{s′s}(a).

Category-theoretic interpretation

Other than the rewards, a Markov decision process (S, A, P) can be understood in terms of category theory. Namely, let 𝒜 denote the free monoid with generating set A, and let Dist denote the Kleisli category of the Giry monad. Then a functor 𝒜 → Dist encodes both the set S of states and the probability function P. In this way, Markov decision processes can be generalized from monoids (categories with one object) to arbitrary categories; the result (C, F : C → Dist) may be called a context-dependent Markov decision process, because moving from one object to another in C changes the set of available actions and the set of possible states.
References

Bellman, R. (1957). Dynamic Programming. Princeton, NJ: Princeton University Press.
Shapley, Lloyd (1953). "Stochastic Games". Proceedings of the National Academy of Sciences. 39 (10): 1095–1100.
Howard, Ronald A. (1960). Dynamic Programming and Markov Processes. Cambridge, MA: The M.I.T. Press.
van Nunen, J. A. E. E. (1976). "A set of successive approximation methods for discounted Markovian decision problems". Zeitschrift für Operations Research. 20 (5): 203–208.
Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. New York: Wiley.
