
Model-Free Algorithm and Regret Analysis for MDPs with Long-Term Constraints


Abstract: In the optimization of dynamical systems, the variables typically have constraints. Such problems can be modeled as a constrained Markov Decision Process (CMDP). This paper considers a model-free approach to the problem, where the transition probabilities are not known. In the presence of long-term (or average) constraints, the agent has to choose a policy that maximizes the long-term average reward while satisfying the average constraints in each episode. The key challenge with long-term constraints is that the optimal policy is not deterministic in general, and thus standard Q-learning approaches cannot be directly used. This paper uses concepts from constrained optimization and Q-learning to propose an algorithm for CMDPs with long-term constraints. For any $\gamma \in (0, \frac{1}{2})$, the proposed algorithm is shown to achieve an $O(T^{1/2+\gamma})$ regret bound for the obtained reward and an $O(T^{1-\gamma/2})$ regret bound for the constraint violation, where $T$ is the total number of steps. We note that these are the first results on regret analysis for MDPs with long-term constraints where the transition probabilities are not known a priori.
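
The two bounds pull in opposite directions as $\gamma$ varies: a smaller $\gamma$ tightens the reward regret and loosens the constraint-violation bound. As a worked illustration only (this balancing choice is an assumption made here for exposition, not a recommendation stated in the abstract), equating the two exponents gives

$$\frac{1}{2} + \gamma = 1 - \frac{\gamma}{2} \;\Longrightarrow\; \frac{3\gamma}{2} = \frac{1}{2} \;\Longrightarrow\; \gamma = \frac{1}{3},$$

which lies in the admissible range $(0, \frac{1}{2})$ and yields $O(T^{5/6})$ for both the reward regret and the constraint violation. Both bounds are sublinear in $T$ for every $\gamma$ in that range.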
