Behavior cloning methods, largely based on the principles of supervised learning, show fairly good results. But their main problem remains the search for ideal role models, which are sometimes very difficult to collect. In turn, reinforcement learning methods are able to work with non-optimal raw data. At the same time, they can find suboptimal policies to achieve the goal. However, when searching for an optimal policy, we often encounter an optimization problem that is more relevant in high-dimensional ...