V. CASE STUDY AND EXPERIMENTS

We require an abstract representation of the human’s commands as a strategy in order to use our synthesis approach in a shared control scenario. We now discuss how such strategies may be obtained using inverse reinforcement learning and report on case study results.

A. Experimental setting
We consider two scenarios, the first of which is the wheelchair scenario from Fig. 1. We model the wheelchair scenario inside an interactive Python environment. In the second scenario, we use a tool called AMASE¹, which is used to simulate multi-unmanned aerial vehicle (UAV) missions.

¹ https://github.com/afrl-rq/OpenAMASE
Its graphical user interface allows humans to send commands to one or multiple vehicles at run time. It includes three main programs: a simulator, a data playback tool, and a scenario setup tool.

We use the model checker PRISM [19] to verify whether the computed strategies satisfy the specification. We use the LP solver Gurobi [14] to check the feasibility of the LP problems given in Section IV. We also implemented the greedy approach for strategy repair from [15].
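As an illustration of how such feasibility checks can be carried out with gurobipy, the following minimal sketch builds a small LP and inspects the solver status; the variables and constraints are placeholders, not the actual repair LP of Section IV.

```python
import gurobipy as gp
from gurobipy import GRB

# Minimal sketch of an LP feasibility check with gurobipy; the variables and
# constraints below are illustrative placeholders, not the repair LP of Sec. IV.
m = gp.Model("feasibility-check")
m.Params.OutputFlag = 0                 # silence solver output

x = m.addVars(3, lb=0.0, name="x")      # e.g., perturbation variables of a strategy
m.addConstr(x.sum() == 1.0)             # e.g., a probability-distribution constraint
m.addConstr(x[0] - x[1] <= 0.15)        # e.g., a bound on the allowed deviation

# No objective is set: we only ask whether the constraints are satisfiable.
m.optimize()
print("feasible" if m.Status == GRB.OPTIMAL else "infeasible")
```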
B. Data collection

We asked five participants to accomplish tasks in the wheelchair scenario. The goal is to move the wheelchair to a target cell in the gridworld while never occupying the same cell as the moving obstacle.
Similarly, three participants performed the surveillance task in the AMASE environment.

From the data obtained from each participant, we compute an individual randomized human strategy σ_h via maximum-entropy inverse reinforcement learning (MEIRL) [28]. Reference [16] uses inverse reinforcement learning to reason about the human’s commands in a shared control scenario from the human’s demonstrations. However, that approach lacks formal guarantees on the robot’s execution.
In [25], inverse reinforcement learning is used to distinguish human intents with respect to different tasks. We show the work flow of the case study in Figure 4.

Fig. 4. The setting of the case study for the shared control simulation.
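To make the MEIRL step concrete, the sketch below shows a minimal maximum-entropy IRL loop in the style of [28]: a soft value iteration produces a stochastic policy for the current reward weights, a forward pass computes expected feature counts, and the weights are updated toward the participants’ empirical feature counts. The MDP interface, feature map, and hyperparameters are illustrative assumptions, not the exact implementation used in the case study.

```python
import numpy as np
from scipy.special import logsumexp

def maxent_irl(P, features, demos, horizon=50, lr=0.1, iters=100):
    """Minimal MEIRL sketch; illustrative only, not the case-study implementation.

    P:         transition tensor, shape (S, A, S), P[s, a, t] = Pr(t | s, a)
    features:  state feature matrix, shape (S, F)
    demos:     list of demonstrated trajectories, each a list of state indices
    Returns a randomized strategy of shape (S, A).
    """
    S, A, _ = P.shape
    theta = np.zeros(features.shape[1])

    # Empirical feature expectation and empirical initial-state distribution.
    f_emp = np.mean([features[traj].sum(axis=0) for traj in demos], axis=0)
    p0 = np.bincount([traj[0] for traj in demos], minlength=S).astype(float)
    p0 /= p0.sum()

    for _ in range(iters):
        r = features @ theta                    # state rewards for current weights

        # Soft (maximum-entropy) value iteration -> stochastic policy.
        v = np.zeros(S)
        for _ in range(horizon):
            q = r[:, None] + P @ v              # Q[s, a]
            v = logsumexp(q, axis=1)            # softmax backup
        policy = np.exp(q - v[:, None])         # rows sum to one

        # Forward pass: expected state visitation counts under the policy.
        d = p0.copy()
        visits = np.zeros(S)
        for _ in range(horizon):
            visits += d
            d = np.einsum("s,sa,sat->t", d, policy, P)

        # Gradient step: match empirical and expected feature counts.
        f_exp = visits @ features
        theta += lr * (f_emp - f_exp)

    return policy
```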
In our setting, we denote each sample as one particular command of the participant, and we assume that the participant issues the command to satisfy the specification. Under this assumption, we can bound the probability of a possible deviation from the actual intent with respect to the number of samples using Hoeffding’s inequality for the resulting strategy; see [27] for details. Using these bounds, we can determine the required number of commands to obtain an approximation of a typical human behavior. The probability of a possible deviation from the human behavior is given by O(exp(−nε²)), where n is the number of commands from the human and ε is the upper bound on the deviation between the probability of satisfying the specification with the true human strategy and the probability obtained with the strategy computed by inverse reinforcement learning. For example, to ensure, with probability at least 0.99, an upper bound of ε = 0.05 on the deviation of the probability of satisfying the specification, we require 1060 demonstrations from the human.
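As a quick sanity check on that number, a two-sided Hoeffding bound of the form 2·exp(−2nε²) ≤ δ gives n ≥ ln(2/δ)/(2ε²); the exact constants depend on the bound used in [27], but this standard form already reproduces the figure quoted above.

```python
import math

# Back-of-the-envelope check, assuming the standard two-sided Hoeffding bound
# 2 * exp(-2 * n * eps**2) <= delta for [0, 1]-bounded probability estimates.
eps = 0.05    # allowed deviation of the satisfaction probability
delta = 0.01  # failure probability, i.e., 1 - 0.99 confidence
n = math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))
print(n)      # 1060, matching the number of demonstrations quoted in the text
```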
We design the blending function by assigning a lower weight to the human strategy at states where the human strategy yields a lower probability of reaching the target set. Using this function, we create the autonomy strategy σ_a and pass it (together with the blending function) back to the environment. Note that the blended strategy σ_ha satisfies the specification by Theorem 1.
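A minimal sketch of this design choice is given below, assuming a linear blending of the two strategies and assuming that the probability of reaching the target set from each state under σ_h is available (e.g., from PRISM); the specific weighting rule is illustrative rather than the exact function used in the experiments.

```python
import numpy as np

def blending_weights(reach_prob_h):
    """Illustrative blending function b: give the human less weight in states
    where sigma_h reaches the target set with lower probability.
    reach_prob_h has shape (S,)."""
    return np.clip(reach_prob_h, 0.0, 1.0)   # b(s) in [0, 1]

def blend(sigma_h, sigma_a, b):
    """Linear blending sketch:
    sigma_ha(s, a) = b(s) * sigma_h(s, a) + (1 - b(s)) * sigma_a(s, a)."""
    return b[:, None] * sigma_h + (1.0 - b[:, None]) * sigma_a
```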
C. Gridworld

The size of the gridworld in Fig. 1 is variable, and we generate a number of randomly moving (e.g., the vacuum cleaner) and stationary obstacles.
An agent (e.g., the wheelchair) moves in the gridworld according to the commands from a human. For the gridworld scenario, we construct an MDP whose states represent the positions of the agent and the obstacles. The actions in the MDP induce changes in the position of the agent.

The safety specification requires that the agent reaches a target cell while not crashing into an obstacle, with a certain probability λ ∈ [0, 1]; formally, P≥λ(¬crash U target).
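For illustration, a state of such an MDP can be encoded as the pair of agent and obstacle positions, with actions moving the agent deterministically and the obstacle moving uniformly at random; the sketch below is a simplified construction under these assumptions, not the exact model used in the experiments.

```python
import itertools

# Simplified sketch of the gridworld MDP (illustrative assumptions: the agent
# moves deterministically, the single obstacle moves uniformly at random).
N = 8                                    # grid size, e.g., the 8x8 instance
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0), "stay": (0, 0)}

def clamp(cell, move):
    """Apply a move and keep the result inside the grid."""
    x, y = cell[0] + move[0], cell[1] + move[1]
    return (min(max(x, 0), N - 1), min(max(y, 0), N - 1))

cells = list(itertools.product(range(N), range(N)))
states = list(itertools.product(cells, cells))   # (agent position, obstacle position)

def transitions(state, action):
    """Pr(next state | state, action): the obstacle picks one of its moves uniformly."""
    agent, obstacle = state
    agent_next = clamp(agent, MOVES[action])
    succ = [clamp(obstacle, m) for m in MOVES.values()]
    return {(agent_next, o): succ.count(o) / len(succ) for o in set(succ)}
```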
First, we report results for one particular participant in a gridworld scenario with an 8×8 grid and one moving obstacle. The resulting MDP has 2304 states. We compute the human strategy using MEIRL with features, i.e., components of the human’s cost function, given by the distance to the obstacle and the distance to the goal state.
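A hypothetical feature map consistent with that description might look as follows; the use of Manhattan distances and the absence of scaling are assumptions, as the exact features are not specified here.

```python
import numpy as np

def gridworld_features(state, goal):
    """Hypothetical MEIRL features: distances to the obstacle and to the goal."""
    (ax, ay), (ox, oy) = state                         # (agent position, obstacle position)
    dist_obstacle = abs(ax - ox) + abs(ay - oy)        # Manhattan distance to obstacle
    dist_goal = abs(ax - goal[0]) + abs(ay - goal[1])  # Manhattan distance to goal
    return np.array([dist_obstacle, dist_goal], dtype=float)
```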
We instantiate the safety specification with λ = 0.7, which means the target should be reached with a probability of at least 0.7. The human strategy σ_h induces a probability of 0.546 of satisfying the specification; that is, it does not satisfy the specification.

We compute the repaired strategy σ_ha using the greedy (LP-based) approach and the QCP approach, and both strategies satisfy the specification by inducing a probability of satisfaction larger than λ. On the one hand, the maximum deviation between the human strategy σ_h and σ_ha is 0.15 with the LP approach, which implies that the strategies of the human and the autonomy protocol deviate by at most 0.15 over all states and actions.
On the other hand, the maximum deviation between the human strategy σ_h and the blended strategy σ_ha is 0.03 with the QCP approach. The results show that the QCP approach computes a blended strategy that is more similar to the human strategy than the one computed with the LP approach.

Fig. 5. Graphical representation of the obtained human, blended, and autonomy strategies in the grid: (a) strategy σ_h, (b) strategy σ_ha, (c) strategy σ_a.

TABLE I. SCALABILITY RESULTS FOR THE GRIDWORLD EXAMPLE.
grid     states    trans.    LP synth.   δ_LP   QCP synth.   δ_QCP
8×8       2,304    36,864        14.12   0.15        31.49    0.03
10×10     3,600    57,600        23.80   0.24        44.61    0.04
12×12    14,400   230,400       250.78   0.33       452.27    0.05

Finally, to assess the scalability of our approach, consider Table I. We generated MDPs for gridworlds with different numbers of states and obstacles. We list the number of states in the MDP (labeled “states”) and the number of transitions (labeled “trans.”). We report the time that the synthesis process took with the LP approach and the QCP approach (labeled “LP synth.” and “QCP synth.”), which includes the time for solving the LPs or QCPs, measured in seconds. For the LP approach, it also includes the model checking times using PRISM. To assess the optimality of the synthesis, we list the maximal deviation between the repaired strategy and the human strategy for the LP and QCP approaches (labeled “δ_LP” and “δ_QCP”). In all examples, we observe that the QCP approach yields autonomy strategies with less deviation from the human strategy while having computation times similar to those of the LP approach.
D. UAV mission planning

Similar to the gridworld scenario, we generate an MDP in which the states denote the positions of the agents and the obstacles in an AMASE scenario. Consider the AMASE scenario in Fig. 6.
In this scenario, the specification, or mission, of the agent (blue UAV) is to keep surveilling the green regions (labeled w1, w2, w3) while avoiding restricted operating zones (labeled ROZ1, ROZ2) and enemy agents (purple and green UAVs). As we consider reachability problems, we asked the participants to visit the regions in sequence, i.e., the first region, then the second, and then the third.
After visiting the third region, the task is to visit the first region again to continue the surveillance.

For example, if the last visited region is w3, then the safety specification in this scenario is P≥λ((¬crash ∧ ¬ROZ) U target), where ROZ denotes entering one of the ROZ areas and target denotes visiting w1.

We synthesize the autonomy protocol on the AMASE scenario with two enemy agents, which induces an MDP with 15,625 states.
Fig. 6. An example of an AMASE scenario: (a) the AMASE simulator, (b) the GUI of AMASE.

We use the same blending function and the same threshold λ = 0.7 as in the gridworld example.
The features used to compute the human strategy with MEIRL are given by the distances to the closest ROZ, to the enemy agents, and to the target region.

The human strategy σ_h induces a probability of 0.163 of satisfying the specification, so, as in the gridworld example, it violates the specification. Similar to the gridworld example, we compute the repaired strategy σ_ha with the LP and the QCP approach, and both strategies satisfy the specification.
On the one hand, the maximum deviation between σ_h and σ_ha is 0.42 with the LP approach, which means the strategies of the human and the autonomy protocol differ significantly in some states of the MDP. On the other hand, the QCP approach yields a repaired strategy σ_ha that is more similar to the human strategy σ_h, with a maximum deviation of 0.06. The time of the synthesis procedure with the LP approach is 481.31 seconds,
and the computation time with the QCP approach is 749.18 seconds, showing the trade-off between the LP approach and the QCP approach. We see that the LP approach can compute a feasible solution slightly faster; however, the resulting blended strategy may be less similar to the human strategy than the one computed by the QCP approach.
VI. CONCLUSION AND CRITIQUE

We introduced a formal approach to synthesizing an autonomy protocol in a shared control setting subject to probabilistic temporal logic specifications. The proposed approach utilizes inverse reinforcement learning to compute an abstraction of a human’s behavior as a randomized strategy in a Markov decision process. We designed an autonomy protocol such that the robot behavior satisfies safety and performance specifications given in probabilistic temporal logic. We also ensured that the resulting robot behavior is as similar as possible to the behavior induced by the human’s commands. We synthesized the robot behavior using quasiconvex programming.
We showed the practical usability of our approach through case studies involving autonomous wheelchair navigation and unmanned aerial vehicle planning.

There are a number of limitations and also possible extensions of the proposed approach. First of all, we computed a globally optimal strategy by bisection, which requires checking the feasibility of a number of linear programming problems. A convex formulation of the shared control synthesis problem would make computing the globally optimal strategy more efficient.
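A compact sketch of this bisection scheme is shown below; `feasible` stands for an assumed oracle that builds and solves the repair LP of Section IV for a candidate deviation bound, so the interface and tolerance are illustrative rather than the exact procedure used in our implementation.

```python
def bisect_min_deviation(feasible, lo=0.0, hi=1.0, tol=1e-3):
    """Sketch of bisection to approximate the smallest achievable deviation.

    feasible(delta) -> bool is assumed to solve the repair LP of Section IV with
    the maximum allowed deviation fixed to delta and report whether it is feasible.
    """
    best = None
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if feasible(mid):      # a repaired strategy within deviation mid exists
            best, hi = mid, mid
        else:                  # need to allow a larger deviation
            lo = mid
    return best
```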
We assumed that the human’s commands are consistent throughout the whole execution, i.e.,
the human issues each command in order to satisfy the specification. This assumption also implies that the human does not take the robot’s assistance into account while providing commands; in particular, the human does not adapt the strategy to the assistance. It may be possible to extend the approach to handle inconsistent commands by utilizing additional side information, such as the task specifications.
Finally, in order to generalize the proposed approach to other task domains, it is worth exploring transfer learning [21] techniques. Such techniques would allow us to handle different scenarios without having to relearn the human strategy from the human’s commands.