
Synthesis of Provably Correct Autonomy Protocols for Shared Control

Murat Cubuktepe, Nils Jansen, Mohammed Alsiekh, Ufuk Topcu

Abstract: We develop algorithms for the synthesis of shared control protocols subject to probabilistic temporal logic specifications. More specifically, we introduce a modeling formalism in which both a human and an autonomy protocol can issue commands to a robot towards performing a certain task. These commands are blended into a joint input to the robot. We model the interaction between the human and the robot as a Markov decision process that incorporates the potential randomness in decisions caused by factors such as complexity of the task specifications and imperfect interfaces. Using inverse reinforcement learning, we obtain an abstraction of the human's behavior as a so-called randomized strategy.

We design the autonomy protocol to ensure that the robot behavior resulting from the blending of the commands satisfies given safety and performance specifications in probabilistic temporal logic. Additionally, the resulting strategies generate behavior as similar to the behavior induced by the human's commands as possible. We formulate the underlying problem as a quasiconvex optimization problem, which is solved by checking feasibility for a number of linear programming problems. We assess the applicability and the scalability of the approach through case studies involving autonomous wheelchair navigation and unmanned aerial vehicle mission planning.


I. INTRODUCTION

We study the problem of shared control, where a robot is to execute a task according to the goals of a human operator while adhering to additional safety and performance requirements. Applications of such human-robot interaction include remotely operated semi-autonomous wheelchairs [13], robotic teleoperation [16], and human-in-the-loop unmanned aerial vehicle mission planning [9]. Basically, the human operator issues a command through an input interface, which maps the command directly to an action for the robot. The problem is that a sequence of such actions may fail to accomplish the task at hand, due to limitations of the interface or failure of the human in comprehending the complexity of the problem.

Therefore, a so-called autonomy protocol provides assistance for the human in order to complete the task according to the given requirements.

At the heart of the shared control problem is the design of an autonomy protocol. In the literature, there are two main directions, based on either switching the control authority between human and autonomy protocol [26], or on blending their commands towards joined inputs for the robot [7], [15].

M. Cubuktepe and U. Topcu are with the Department of Aerospace Engineering and Engineering Mechanics, University of Texas at Austin, 201 E 24th St, Austin, TX 78712, USA. Nils Jansen is with the Department of Software Science, Radboud University Nijmegen, Comeniuslaan 4, 6525 HP Nijmegen, Netherlands. Mohammed Alsiekh is with the Institute for Computational Engineering and Sciences, University of Texas at Austin, 201 E 24th St, Austin, TX 78712, USA. Email: ({mcubuktepe,malsiekh,utopcu}@utexas.edu, [email protected] ru.nl).

One approach to switching the authority first determines the desired goal of the human operator with high confidence, and then assists towards exactly this goal [8], [18]. In [12], switching the control authority between the human and the autonomy protocol ensures satisfaction of specifications that are formally expressed in temporal logic. In general, the switching of authority may cause a decrease in the satisfaction of the human, who usually prefers to retain as much control as possible [17].

The second direction is to provide another command in addition to the one of the human operator. To introduce a more flexible trade-off between the human's control authority and the level of autonomous assistance, both commands are then blended to form a joined input for the robot. A blending function determines the emphasis that is put on the autonomy protocol in the blending, that is, regulating the amount of assistance provided to the human. Switching of authority can be seen as a special case of blending, as the blending function may assign full control to the autonomy protocol or to the human. In general, putting more emphasis on the autonomy protocol in blending leads to greater accuracy in accomplishing the task [6], [7], [20].

However, as before, humans may prefer to retain control of the robot [16], [17]. None of the existing blending approaches provide formal correctness guarantees that go beyond statistical confidence bounds. Correctness here refers to ensuring safety and optimizing performance according to the given requirements. Our goal is the design of an autonomy protocol that admits formal correctness while rendering the robot behavior as close to the human's inputs as possible.

We model the behavior of the robot as a Markov decision process (MDP) [23], which captures the robot's actions inside a potentially stochastic environment.

Problem formulations with MDPs typically focus on maximizing an expected reward (or, minimizing the expected cost). However, such formulations may not be sufficient to ensure safety or performance guarantees in a task that includes a human operator. Therefore, we design the autonomy protocol to ensure that the resulting behavior is formally verified with respect to probabilistic temporal logic specifications. Such verification problems have been extensively studied for MDPs [2].

Take as an example a scenario involving a semi-autonomous wheelchair [13] whose navigation has to account for a randomly moving autonomous vacuum cleaner, see Fig. 1.

The wheelchair needs to navigate to the exit of a room, and the vacuum cleaner moves in the room according to a probabilistic transition function. The goal of the wheelchair is to reach the exit without crashing into the vacuum cleaner. The human may not fully perceive the motion of the vacuum cleaner.

Note that the human's commands, depicted with the solid red line in Fig. 1(a), may cause the wheelchair to crash into the vacuum cleaner. The autonomy protocol provides another set of commands, which is indicated by the solid red line in Fig. 1(b), to carry out the task safely without crashing. However, the autonomy protocol's commands deviate highly from the commands of the human. The two sets of commands are then blended into a new set of commands, depicted using the dashed red line in Fig. 1(b). The blended commands perform the task safely while generating behavior as similar to the behavior induced by the human's commands as possible.

Fig. 1. A wheelchair in a shared control setting. (a) Autonomy perspective. (b) Human perspective.

In [15], we formulated the problem of designing the autonomy protocol as a nonlinear programming problem.

However, solving nonlinear programs is generally intractable [3]. Therefore, we proposed a greedy algorithm that iteratively repairs the human strategy such that the specifications are satisfied without guaranteeing optimality, based on [22]. Here, we propose an alternative approach for the blending of the two strategies. We follow the approach of repairing the strategy of the human to compute an autonomy protocol. We ensure that the resulting behavior induced by the repaired strategy (1) deviates minimally from the human strategy and (2) satisfies given temporal logic specifications after blending.

We formally define the problem as a quasiconvex optimization problem, which can be solved efficiently by checking feasibility of a number of convex optimization problems [4].

A human may be uncertain about which command to issue in order to accomplish a task. Moreover, a typical interface used to parse the human's commands, such as a brain-computer interface, is inherently imperfect. To capture such uncertainties and imperfections in the human's decisions, we introduce randomness to the commands issued by humans. It may not be possible to blend two different deterministic commands. If the human's command is "up" and the autonomy protocol's command is "right", we cannot blend these two commands to obtain another deterministic command. By introducing randomness to the commands of the human and the autonomy protocol, we therefore ensure that the blending is always well-defined.

In what follows, we call a formal interpretation of a sequence of the human's commands the human strategy, and the sequence of commands issued by the autonomy protocol the autonomy strategy.

The question remains how to obtain the human strategy in the first place. It may be unrealistic that a human can comprehend an MDP that models a realistic scenario. To this end, we create a virtual simulation environment that captures the behavior of the MDP. We ask humans to participate in two case studies to collect data about typical human behavior.

We use inverse reinforcement learning to get a formal interpretation as a strategy based on the human's inputs [1], [28]. We model a typical shared control scenario based on autonomous wheelchair navigation [13] in our first case study. In our second case study, we consider an unmanned aerial vehicle mission planning scenario, where the human operator is to patrol certain regions while keeping away from enemy aerial vehicles.

In summary, the main contribution of this paper is to synthesize the autonomy protocol such that the resulting blended or repaired strategy meets all given specifications while only minimally deviating from the human strategy. We introduce all formal foundations that we need in Section II. We provide an overview of the general shared control concept in Section III. We present the shared control synthesis problem and provide a solution based on linear programming in Section IV. We indicate the applicability and scalability of our approach on experiments in Section V and draw a short conclusion and critique of our approach in Section VI.

II. PRELIMINARIES

In this section, we introduce the required formal models and specifications that we use to synthesize the autonomy protocol, and we give a short example illustrating the main concepts.

1) Markov Decision Processes: A probability distribution over a finite set $X$ is a function $\mu : X \rightarrow [0,1] \subseteq \mathbb{R}$ with $\sum_{x \in X} \mu(x) = \mu(X) = 1$. The set of all distributions on $X$ is $\mathit{Distr}(X)$.

Definition 1 (MDP). A Markov decision process (MDP) $\mathcal{M} = (S, s_I, \mathcal{A}, \mathcal{P})$ has a finite set $S$ of states, an initial state $s_I \in S$, a finite set $\mathcal{A}$ of actions, and a transition probability function $\mathcal{P} : S \times \mathcal{A} \rightarrow \mathit{Distr}(S)$.
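For concreteness, the sketch below shows one way such an MDP could be encoded in Python; the nested-dictionary layout and the helper name `validate_mdp` are illustrative choices on our part, not part of the formalism.

```python
from typing import Dict, Hashable

State = Hashable
Action = Hashable
# P[s][a][s'] = probability of moving from s to s' when taking action a in s.
Transitions = Dict[State, Dict[Action, Dict[State, float]]]

def validate_mdp(states, initial, actions, P: Transitions, tol=1e-9):
    """Check that P(s, a, .) is a probability distribution for every state-action pair."""
    assert initial in states
    for s in states:
        for a in actions:
            dist = P[s][a]
            assert all(t in states for t in dist)
            assert abs(sum(dist.values()) - 1.0) < tol, f"P({s},{a},.) does not sum to 1"

# A two-state toy MDP: 'stay' loops, 'go' moves to s1 with probability 0.9.
states = {"s0", "s1"}
actions = {"stay", "go"}
P = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s1": 0.9, "s0": 0.1}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s1": 1.0}},
}
validate_mdp(states, "s0", actions, P)
```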

MDPs have nondeterministic choices of actions at the states; the successors are determined probabilistically via the associated probability distribution. We assume that all actions are available at each state and that the MDP contains no deadlock states. A cost function $\mathcal{C} : S \times \mathcal{A} \rightarrow \mathbb{R}_{\geq 0}$ associates cost to transitions. If there is a single action available at each state, the MDP reduces to a discrete-time Markov chain (MC). To define a probability measure and expected cost on MDPs, so-called strategies resolve the choices of actions.

Definition 2 (Strategy). A randomized strategy for an MDP $\mathcal{M}$ is a function $\sigma : S \rightarrow \mathit{Distr}(\mathcal{A})$. If $\sigma(s, \alpha) = 1$ for some $\alpha \in \mathcal{A}$ and $\sigma(s, \alpha') = 0$ for all $\alpha' \in \mathcal{A} \setminus \{\alpha\}$, the strategy is deterministic.

The set of all strategies over $\mathcal{M}$ is $\mathit{Str}^{\mathcal{M}}$. Resolving all the nondeterminism for an MDP $\mathcal{M}$ with a strategy $\sigma \in \mathit{Str}^{\mathcal{M}}$ yields an induced Markov chain $\mathcal{M}^\sigma$.

Fig. 2. MDP $\mathcal{M}$ with target state $s_2$ and the induced MC for the strategy $\sigma_{\mathrm{unif}}$. (a) MDP $\mathcal{M}$. (b) Induced MC.

Definition 3 (Induced MC). For an MDP $\mathcal{M} = (S, s_I, \mathcal{A}, \mathcal{P})$ and a strategy $\sigma \in \mathit{Str}^{\mathcal{M}}$, the MC induced by $\mathcal{M}$ and $\sigma$ is $\mathcal{M}^\sigma = (S, s_I, \mathcal{A}, \mathcal{P}^\sigma)$, where
\[
\mathcal{P}^\sigma(s, s') = \sum_{\alpha \in \mathcal{A}(s)} \sigma(s, \alpha) \cdot \mathcal{P}(s, \alpha, s') \quad \text{for all } s, s' \in S.
\]
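The induced MC of Definition 3 is straightforward to compute; the sketch below does so for the dictionary-encoded transition function used above (the representation is our assumption, not prescribed by the paper).

```python
from typing import Dict, Hashable

State = Hashable
Action = Hashable

def induce_mc(P: Dict[State, Dict[Action, Dict[State, float]]],
              sigma: Dict[State, Dict[Action, float]]) -> Dict[State, Dict[State, float]]:
    """P^sigma(s, s') = sum_a sigma(s, a) * P(s, a, s')."""
    P_sigma: Dict[State, Dict[State, float]] = {}
    for s, action_dists in P.items():
        row: Dict[State, float] = {}
        for a, dist in action_dists.items():
            weight = sigma[s].get(a, 0.0)
            for s_next, p in dist.items():
                row[s_next] = row.get(s_next, 0.0) + weight * p
        P_sigma[s] = row
    return P_sigma

# Example: a uniformly randomizing strategy over two actions at every state.
P = {"s0": {"a": {"s1": 0.6, "s2": 0.4}, "b": {"s1": 0.4, "s2": 0.6}},
     "s1": {"a": {"s1": 1.0}, "b": {"s1": 1.0}},
     "s2": {"a": {"s2": 1.0}, "b": {"s2": 1.0}}}
sigma_unif = {s: {"a": 0.5, "b": 0.5} for s in P}
print(induce_mc(P, sigma_unif)["s0"])  # {'s1': 0.5, 's2': 0.5}
```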

The occupancy measure of a strategy $\sigma$ gives the expected number of times an action is taken at a state in the MDP. In our formulation, we use the occupancy measure of a strategy to compute an autonomy protocol.

Definition 4 (Occupancy Measure). The occupancy measure $x_\sigma$ of a strategy $\sigma$ is defined as
\[
x_\sigma(s, \alpha) = \mathbb{E}\left[\sum_{t=0}^{\infty} P(\alpha_t = \alpha \mid s_t = s)\right], \tag{1}
\]
where $s_t$ and $\alpha_t$ denote the state and action in the MDP at time step $t$. The occupancy measure $x_\sigma(s, \alpha)$ is the expected number of times the action $\alpha$ is taken at state $s$ under the strategy $\sigma$.

2) Specifications: A probabilistic reachability specification $P_{\leq \lambda}(\lozenge T)$ with threshold $\lambda \in [0,1] \cap \mathbb{Q}$ and set of target states $T \subseteq S$ restricts the probability of reaching $T$ from $s_I$ in $\mathcal{M}$ to be at most $\lambda$. Similarly, an expected cost property $E_{\leq \kappa}(\lozenge G)$ restricts the expected cost of reaching the set $G \subseteq S$ of goal states by an upper bound $\kappa \in \mathbb{Q}$. Intuitively, some bad states in $T$ should only be reached with probability at most $\lambda$ (safety specification), while the expected cost to reach the goal states in $G$ should be below $\kappa$ (performance specification). We also use until properties of the form $\Pr_{\geq \lambda}(\neg T\ \mathcal{U}\ G)$, which require that the probability of reaching $G$ while not reaching $T$ beforehand is at least $\lambda$.

The synthesis problem is to find a particular strategy $\sigma$ for an MDP $\mathcal{M}$ such that given specifications $\varphi_1, \ldots, \varphi_n$ are satisfied in the induced MC $\mathcal{M}^\sigma$, written $\sigma \models \varphi_1, \ldots, \varphi_n$.

Example 1. Fig. 2(a) depicts an MDP $\mathcal{M}$ with initial state $s_0$. States $s_0$ and $s_1$ have choices between actions $a$ or $b$ and $c$ or $d$, respectively. Action $a$, for example, has a probabilistic choice between $s_1$ and $s_3$ with probabilities 0.6 and 0.4. For the self-loops at $s_2$, $s_3$, and $s_4$ we omit actions.

For a safety specification $\varphi = P_{\leq 0.21}(\lozenge s_2)$, the deterministic strategy $\sigma_1 \in \mathit{Str}^{\mathcal{M}}$ with $\sigma_1(s_0, a) = 1$ and $\sigma_1(s_1, c) = 1$ induces the probability 0.36 of reaching $s_2$. Therefore, the specification is not satisfied, see the induced MC in Fig. 2(b). Likewise, the randomized strategy $\sigma_{\mathrm{unif}} \in \mathit{Str}^{\mathcal{M}}$ with $\sigma_{\mathrm{unif}}(s_0, a) = \sigma_{\mathrm{unif}}(s_0, b) = 0.5$ and $\sigma_{\mathrm{unif}}(s_1, c) = \sigma_{\mathrm{unif}}(s_1, d) = 0.5$ violates the specification, as the probability of reaching $s_2$ is 0.25. However, the deterministic strategy $\sigma_{\mathrm{safe}} \in \mathit{Str}^{\mathcal{M}}$ with $\sigma_{\mathrm{safe}}(s_0, b) = 1$ and $\sigma_{\mathrm{safe}}(s_1, d) = 1$ induces the probability 0.16, thus $\sigma_{\mathrm{safe}} \models \varphi$.
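As a sanity check, the probabilities in Example 1 follow directly from the induced MCs, assuming the transition probabilities suggested by Fig. 2(a) (action $a$: 0.6 to $s_1$, 0.4 to $s_3$; action $b$: 0.4 to $s_1$, 0.6 to $s_3$; and analogously 0.6/0.4 for $c$ and 0.4/0.6 for $d$ at $s_1$):
\[
\begin{aligned}
\Pr^{\sigma_1}(\lozenge s_2) &= 0.6 \cdot 0.6 = 0.36,\\
\Pr^{\sigma_{\mathrm{unif}}}(\lozenge s_2) &= (0.5 \cdot 0.6 + 0.5 \cdot 0.4)\,(0.5 \cdot 0.6 + 0.5 \cdot 0.4) = 0.5 \cdot 0.5 = 0.25,\\
\Pr^{\sigma_{\mathrm{safe}}}(\lozenge s_2) &= 0.4 \cdot 0.4 = 0.16.
\end{aligned}
\]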

III. CONCEPTUAL DESCRIPTION OF SHARED CONTROL

We now detail the general shared control concept adopted in this paper. Consider the setting in Fig. 3.

Fig. 3. Shared control architecture. The human strategy and the autonomy strategy each issue commands, which the blending function $b$ combines into a blended command that is executed by the robot; the autonomy protocol takes the formal model $\mathcal{M}_r$, the specifications $\varphi_1, \ldots, \varphi_n$, and the human strategy as inputs.

As inputs, we have a set of task specifications, a model $\mathcal{M}_r$ for the robot behavior, and a blending function $b$. The given robot task is described by certain performance and safety specifications $\varphi_1, \ldots, \varphi_n$. For example, it may not be safe to take the shortest route because there may be too many obstacles on that route.

In order to satisfy performance considerations, the robot should prefer to take the shortest route possible while not violating the safety specifications. We model the behavior of the robot inside a stochastic environment as an MDP $\mathcal{M}_r$.

In general, a human issues a set of commands for the robot to execute. It is unrealistic that a human can comprehend an MDP that models a realistic scenario in the first place. Indeed, a human will likely have difficulties interpreting a large number of possibilities and the associated probabilities of paths and payoffs [11], and it may be impractical for the human to provide the human strategy to the autonomy protocol, due to the possibly large state space of the MDP.

Therefore, we compute a human strategy $\sigma_h$ as an abstraction of a sequence of the human's commands, which we obtain using inverse reinforcement learning [1], [28].

We design an autonomy protocol that provides another strategy $\sigma_a$, which we call the autonomy strategy. Then, we blend the two strategies according to the blending function $b$ into the blended strategy $\sigma_{ha}$. The blending function reflects preference over the human strategy or the autonomy strategy. We ensure that the blended strategy deviates minimally from the human strategy.

At runtime, we can then blend commands of the human with commands of the autonomy strategy. The resulting "blended" commands will induce the same behavior as the blended strategy $\sigma_{ha}$, and the specifications are satisfied. This procedure requires checking feasibility of a number of linear programming problems, which can be expensive, but it is very efficient to implement at run time.

The shared control synthesis problem is then the synthesis of the repaired strategy such that, for the repaired strategy $\sigma_{ha}$, it holds that $\sigma_{ha} \models \varphi_1, \ldots, \varphi_n$ while minimally deviating from $\sigma_h$. The deviation between the human strategy $\sigma_h$ and the repaired strategy $\sigma_{ha}$ is measured by the maximal difference between the two strategies.

IV. SYNTHESIS OF THE AUTONOMY PROTOCOL

A. Strategy blending

Given the human strategy $\sigma_h \in \mathit{Str}^{\mathcal{M}_r}$ and the autonomy strategy $\sigma_a \in \mathit{Str}^{\mathcal{M}_r}$, a blending function computes a weighted composition of the two strategies by favoring one or the other strategy in each state of the MDP [16], [6], [7].

It is argued that blending reflects the confidence in how well the autonomy protocol is able to assist with respect to the human's task [7]. Put differently, the actions of the human should be assigned low confidence by the blending function if the actions may lead to a violation of the specifications. Recall Fig. 1 and the example in the introduction. In the cells of the gridworld where some actions may result in a collision with the vacuum cleaner with high probability, it makes sense to assign a high weight to the autonomy strategy.

We pick the blending function as a state-dependent function that weighs the confidence in both the human strategy and the autonomy strategy at each state of an MDP $\mathcal{M}_r$ [16], [6], [7].

Definition 5 (Linear blending).

Given an MDP $\mathcal{M}_r = (S, s_I, \mathcal{A}, \mathcal{P})$, two strategies $\sigma_h, \sigma_a \in \mathit{Str}^{\mathcal{M}_r}$, and a blending function $b : S \rightarrow [0,1]$, the blended strategy $\sigma_{ha} \in \mathit{Str}^{\mathcal{M}_r}$ is, for all states $s \in S$ and actions $\alpha \in \mathcal{A}$,
\[
\sigma_{ha}(s, \alpha) = b(s) \cdot \sigma_h(s, \alpha) + (1 - b(s)) \cdot \sigma_a(s, \alpha).
\]

Note that the blended strategy $\sigma_{ha}$ is a well-defined randomized strategy. For each $s \in S$, the value $b(s)$ represents the "weight" of $\sigma_h$ at $s$, meaning how much emphasis the blending function puts on the human strategy at state $s$. For example, referring back to Fig. 1, the critical cells of the gridworld correspond to certain states of the MDP $\mathcal{M}_r$. At these states, we assign a very low confidence in the human strategy. For instance, at such a state $s \in S$, we might have $b(s) = 0.1$, meaning the blended strategy in state $s$ puts more emphasis on the autonomy strategy $\sigma_a$.
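Definition 5 is a per-state convex combination of the two action distributions; a minimal sketch follows (the function and variable names are ours, not from the paper).

```python
from typing import Dict, Hashable

State = Hashable
Action = Hashable
Strategy = Dict[State, Dict[Action, float]]

def blend(sigma_h: Strategy, sigma_a: Strategy, b: Dict[State, float]) -> Strategy:
    """sigma_ha(s, a) = b(s) * sigma_h(s, a) + (1 - b(s)) * sigma_a(s, a)."""
    sigma_ha: Strategy = {}
    for s in sigma_h:
        actions = set(sigma_h[s]) | set(sigma_a[s])
        sigma_ha[s] = {a: b[s] * sigma_h[s].get(a, 0.0)
                          + (1.0 - b[s]) * sigma_a[s].get(a, 0.0)
                       for a in actions}
    return sigma_ha

# The "up" vs. "right" example from the introduction, with b(s) = 0.1 as above:
sigma_h = {"s": {"up": 1.0, "right": 0.0}}
sigma_a = {"s": {"up": 0.0, "right": 1.0}}
print(blend(sigma_h, sigma_a, {"s": 0.1}))  # {'up': 0.1, 'right': 0.9} (key order may vary)
```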

B. Strategy repair

In this section, we describe our approach to the shared control synthesis problem. We formulate the specific problem of repairing the human strategy based on quasiconvex programming, which can be solved by checking feasibility of a number of convex (here even linear) programming problems.

We show that the repaired strategy satisfies the task specifications while deviating minimally from the human strategy. Finally, we compute the autonomy strategy $\sigma_a$ that may then be subject to blending.

We start by formalizing the concept of strategy perturbation. Then, we introduce the dual linear programming (LP) formulation to synthesize a strategy in an MDP. Finally, we present an LP formulation of the strategy repair problem and show that the solution indeed satisfies the specifications while deviating minimally from the human's strategy.

1) Perturbation of strategies: As mentioned in the introduction, the blended strategy should deviate minimally from the human strategy. To measure the quantity of such a deviation, we introduce the concept of perturbation, which was used in [5].

To modify a (randomized) strategy, we employ additive perturbation by increasing or decreasing the probabilities of action choices in each state. We also ensure that, for each state, the resulting strategy is a well-defined distribution over the actions.

Definition 6 (Strategy perturbation). Given an MDP $\mathcal{M}_r$ and a strategy $\sigma \in \mathit{Str}^{\mathcal{M}_r}$, an (additive) perturbation $\delta$ is a function $\delta : S \times \mathcal{A} \rightarrow [-1, 1]$ with
\[
\sum_{\alpha \in \mathcal{A}} \delta(s, \alpha) = 0 \quad \forall s \in S.
\]
The perturbation value at state $s$ for action $\alpha$ is $\delta(s, \alpha)$. Overloading the notation, the perturbed strategy $\sigma(\delta)$ is given by
\[
\sigma(\delta)(s, \alpha) = \sigma(s, \alpha) + \delta(s, \alpha) \quad \forall s \in S,\ \alpha \in \mathcal{A}. \tag{2}
\]

2) Dual linear programming formulation for MDPs: In this section, we refer to the LP formulation to compute a policy that maximizes the probability of satisfying a reachability specification $P_{\leq \lambda}(\lozenge T)$ in an MDP [24], [10]. The LP formulation builds on the set $\mathit{Var}$ of variables including the following:

• $x(s, \alpha) \in [0, \infty)$ for each $s \in S \setminus T$ and $\alpha \in \mathcal{A}$ defines the occupancy measure of a state-action pair.

\[
\text{maximize} \quad \sum_{s \in S \setminus T} \sum_{\alpha \in \mathcal{A}} \beta(s, \alpha)\, x(s, \alpha) \tag{3}
\]
\[
\text{subject to} \quad \forall s \in S \setminus T.\quad \sum_{\alpha \in \mathcal{A}} x(s, \alpha) = \sum_{s' \in S \setminus T} \sum_{\alpha \in \mathcal{A}} \mathcal{P}(s', \alpha, s)\, x(s', \alpha) + \iota_s \tag{4}
\]
\[
\sum_{s \in S \setminus T} \sum_{\alpha \in \mathcal{A}} \beta(s, \alpha)\, x(s, \alpha) \leq \lambda \tag{5}
\]
where $\beta(s, \alpha) = \sum_{s' \in T} \mathcal{P}(s, \alpha, s')$, and $\iota_s = 1$ if $s = s_I$ and $\iota_s = 0$ if $s \neq s_I$.
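As an illustration of how the LP in (3)–(5) can be assembled, the sketch below builds it for a dictionary-encoded MDP and solves it with SciPy's `linprog`; the paper itself uses Gurobi, and the encoding details (variable ordering, solver choice, function name) are our own assumptions.

```python
from itertools import product
from typing import Dict, Hashable, Set
import numpy as np
from scipy.optimize import linprog

State = Hashable
Action = Hashable

def reachability_lp(P, s_init: State, T: Set[State], lam: float):
    """Assemble and solve the LP (3)-(5): maximize sum beta(s,a) x(s,a)
    subject to the flow constraints (4) and the threshold constraint (5)."""
    states = [s for s in P if s not in T]
    actions = sorted({a for s in states for a in P[s]}, key=str)
    idx = {sa: i for i, sa in enumerate(product(states, actions))}
    n = len(idx)

    # beta(s, a) = sum over target successors of P(s, a, s')
    beta = np.zeros(n)
    for (s, a), i in idx.items():
        beta[i] = sum(p for s_next, p in P[s].get(a, {}).items() if s_next in T)

    # Flow constraints (4): one equality per state s in S \ T.
    A_eq = np.zeros((len(states), n))
    b_eq = np.zeros(len(states))
    for r, s in enumerate(states):
        for a in actions:
            A_eq[r, idx[(s, a)]] += 1.0                        # outgoing occupancy
        for (s_prev, a), i in idx.items():
            A_eq[r, i] -= P[s_prev].get(a, {}).get(s, 0.0)     # incoming occupancy
        b_eq[r] = 1.0 if s == s_init else 0.0                  # iota_s

    # Objective (3) as a minimization; threshold constraint (5) as A_ub x <= lambda.
    res = linprog(c=-beta, A_ub=beta.reshape(1, -1), b_ub=[lam],
                  A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n, method="highs")
    x = {sa: res.x[i] for sa, i in idx.items()}

    # Recover the strategy by normalizing occupancies, cf. Eq. (6) below.
    sigma = {s: {} for s in states}
    for s in states:
        total = sum(x[(s, a)] for a in actions)
        for a in actions:
            sigma[s][a] = x[(s, a)] / total if total > 0 else 1.0 / len(actions)
    return x, sigma, res
```

Gurobi, as used in the experiments, would replace the `linprog` call; the constraint assembly is the same.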

For any optimal solution $x$ to the LP in (3)–(5),
\[
\sigma(s, \alpha) = \frac{x(s, \alpha)}{\sum_{\alpha' \in \mathcal{A}} x(s, \alpha')} \tag{6}
\]
is an optimal strategy, and $x$ is the occupancy measure of $\sigma$; see [24] and [10] for details. After finding an optimal strategy to the LP in (3)–(5), we can compute the probability of satisfying the specification by
\[
\sum_{s \in S \setminus T} \sum_{\alpha \in \mathcal{A}} \beta(s, \alpha)\, x(s, \alpha). \tag{7}
\]

3) Strategy repair using quasiconvex programming: Given the human strategy $\sigma_h \in \mathit{Str}^{\mathcal{M}_r}$, the aim of the autonomy protocol is to compute the blended strategy, or the repaired strategy $\sigma_{ha}$, that induces a behavior similar to the human strategy while satisfying the specifications.

We compute the repaired strategy by perturbing the human strategy, as introduced in Definition 6. As in Definition 6, we perturb the human strategy $\sigma_h$ to the repaired strategy $\sigma_{ha}$ by
\[
\forall s \in S,\ \alpha \in \mathcal{A}.\quad \sigma_{ha}(s, \alpha) = \sigma_h(s, \alpha) + \delta(s, \alpha). \tag{8}
\]
We cannot add this constraint to the LP in (3)–(5) because the variables in the LP are the occupancy measures of the strategy. To perturb the strategy, we use the definition of the occupancy measure to translate the constraint in (8) into the constraint
\[
\forall s \in S,\ \alpha \in \mathcal{A}.\quad \frac{x_{ha}(s, \alpha)}{\sum_{\alpha' \in \mathcal{A}} x_{ha}(s, \alpha')} = \sigma_h(s, \alpha) + \delta(s, \alpha), \tag{9}
\]
or equivalently into the constraint
\[
\forall s \in S,\ \alpha \in \mathcal{A}.\quad x_{ha}(s, \alpha) = \sum_{\alpha' \in \mathcal{A}} x_{ha}(s, \alpha')\,\bigl(\sigma_h(s, \alpha) + \delta(s, \alpha)\bigr). \tag{10}
\]
Note that the constraint in (10) is not a linear constraint because of the multiplication of the perturbation variables $\delta$ and the occupancy measure variables $x_{ha}$.

Since we are interested in minimizing the maximal deviation, we can assign a common variable $\bar\delta \in [0, 1]$ for all state-action pairs to put an upper bound on the deviation by
\[
\forall s \in S,\ \alpha \in \mathcal{A}.\quad \Bigl|\,x_{ha}(s, \alpha) - \sum_{\alpha' \in \mathcal{A}} x_{ha}(s, \alpha')\,\sigma_h(s, \alpha)\Bigr| \leq \bar\delta \sum_{\alpha' \in \mathcal{A}} x_{ha}(s, \alpha'). \tag{11}
\]
The constraint in (11) is also not a linear constraint. In fact, it is a quadratic constraint due to the multiplication of $\bar\delta$ and $x_{ha}$. However, the feasible set of policies scales monotonically with $\bar\delta$, i.e., more policies are feasible with increasing $\bar\delta$. Note that for a fixed $\bar\delta$, the above constraint is linear, and therefore it is a convex constraint.
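Concretely, for a fixed $\bar\delta$ the absolute value in (11) can be written as the two linear inequalities (a standard reformulation, spelled out here for completeness):
\[
\begin{aligned}
x_{ha}(s, \alpha) - \sum_{\alpha' \in \mathcal{A}} x_{ha}(s, \alpha')\,\sigma_h(s, \alpha) &\leq \bar\delta \sum_{\alpha' \in \mathcal{A}} x_{ha}(s, \alpha'),\\
-\,x_{ha}(s, \alpha) + \sum_{\alpha' \in \mathcal{A}} x_{ha}(s, \alpha')\,\sigma_h(s, \alpha) &\leq \bar\delta \sum_{\alpha' \in \mathcal{A}} x_{ha}(s, \alpha'),
\end{aligned}
\]
both of which are linear in $x_{ha}$ once $\bar\delta$ is a constant.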

We show that the formulation for the shared control synthesis problem is given by a quasiconvex optimization problem (QCP). The QCP formulation incorporates the occupancy variables for the repaired strategy $\sigma_{ha}$ and an upper bound $\bar\delta$ for the perturbation of the human strategy. We show the formulation for a single reachability specification $\varphi = P_{\leq \lambda}(\lozenge T)$. Then, we discuss how to add additional safety and expected cost specifications to the QCP formulation.

The QCP formulation builds on the set $\mathit{Var}$ of variables including the following:

• $x_{ha}(s, \alpha) \in [0, \infty)$ for each $s \in S \setminus T$ and $\alpha \in \mathcal{A}$ defines the occupancy measure of a state-action pair for the repaired strategy.
• $\bar\delta \in [0, 1]$ gives the upper bound of the perturbation of the strategy $\sigma_h$.

The shared control synthesis problem can be formulated as the following QCP:
\[
\text{minimize} \quad \bar\delta \tag{12}
\]
\[
\text{subject to} \quad \forall s \in S \setminus T.\quad \sum_{\alpha \in \mathcal{A}} x_{ha}(s, \alpha) = \sum_{s' \in S \setminus T} \sum_{\alpha \in \mathcal{A}} \mathcal{P}(s', \alpha, s)\, x_{ha}(s', \alpha) + \iota_s \tag{13}
\]
\[
\sum_{s \in S \setminus T} \sum_{\alpha \in \mathcal{A}} \beta(s, \alpha)\, x_{ha}(s, \alpha) \leq \lambda \tag{14}
\]
\[
\forall s \in S,\ \alpha \in \mathcal{A}.\quad \Bigl|\,x_{ha}(s, \alpha) - \sum_{\alpha' \in \mathcal{A}} x_{ha}(s, \alpha')\,\sigma_h(s, \alpha)\Bigr| \leq \bar\delta \sum_{\alpha' \in \mathcal{A}} x_{ha}(s, \alpha'). \tag{15}
\]

Let us walk through the optimization problem (12)–(15). We minimize the infinity norm of the perturbation in (12). The constraints in (13) ensure that the resulting strategy $\sigma_{ha}$ obtained from the occupancy variables $x_{ha}$ is well-defined. The constraint in (14) restricts the probability of reaching the set of target states $T$ to be smaller than or equal to $\lambda$, to satisfy the specification $P_{\leq \lambda}(\lozenge T)$. In (15), we perturb the human strategy $\sigma_h$ to the repaired strategy $\sigma_{ha}$ as in Definition 6 by putting an upper bound on the maximal perturbation. The problem in (12)–(15) is not a convex optimization problem because of the nonconvex constraint in (15).

However, for a fixed $\bar\delta$, the problem in (12)–(15) is an LP, and the set of feasible policies scales monotonically in $\bar\delta$. Therefore, the problem in (12)–(15) is a quasiconvex optimization problem. We can solve the QCP in (12)–(15) by employing a bisection method on $\bar\delta$. A method to solve quasiconvex optimization problems is given in [4, Algorithm 4.1].

We adopt Algorithm 4.1 in [4] as follows:

Algorithm 1: Bisection method for the quasiconvex optimization problem.
  given $l = 0$, $u = 1$, tolerance $\epsilon > 0$.
  repeat
    1. Set $\bar\delta := (l + u)/2$.
    2. Solve the convex feasibility problem in (13)–(15).
    3. If the problem in (13)–(15) is feasible, $u := \bar\delta$; else $l := \bar\delta$.
  until $u - l \leq \epsilon$.

The above algorithm requires exactly $\lceil \log_2(1/\epsilon) \rceil$ iterations, and we can compute a policy that induces a minimal deviation up to any given accuracy $\epsilon$.

Theorem 1. The strategy obtained from the QCP in (12)–(15) satisfies the task specifications, and it is optimal with respect to the maximal deviation.

Proof. From a satisfying assignment to the constraints in (12)–(15), we can compute a policy that satisfies the specification using (6). Using Algorithm 1, we can compute the repaired policy $\sigma_{ha}$ that deviates minimally from the human strategy $\sigma_h$, up to $\epsilon$ accuracy, in $\lceil \log_2(1/\epsilon) \rceil$ iterations. Therefore, by solving the QCP (12)–(15), we can compute an optimal policy for the shared control synthesis problem.
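Algorithm 1 only needs a feasibility oracle for the fixed-$\bar\delta$ LP; the loop itself is a few lines. The sketch below assumes a user-supplied callback `feasible(delta_bar)` that builds and checks the LP (13)–(15) for the given bound (for instance with Gurobi or SciPy); the callback and its name are ours, not from the paper.

```python
import math

def bisect_min_deviation(feasible, eps: float = 1e-3):
    """Algorithm 1: bisection on the deviation bound delta_bar in [0, 1].

    `feasible(delta_bar)` must return True iff the LP (13)-(15) with the
    perturbation bound fixed to delta_bar has a feasible solution.
    """
    lo, hi = 0.0, 1.0
    iterations = math.ceil(math.log2(1.0 / eps))  # exactly ceil(log2(1/eps)) iterations
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if feasible(mid):
            hi = mid   # a smaller deviation might still be feasible
        else:
            lo = mid   # must allow a larger deviation
    return hi          # smallest feasible bound found, up to accuracy eps

# Toy usage: pretend the LP is feasible exactly when delta_bar >= 0.15.
print(round(bisect_min_deviation(lambda d: d >= 0.15, eps=1e-4), 4))  # ~0.15
```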

We note that the solution given by the QCP in (12)–(15) computes the minimally deviating repaired strategy $\sigma_{ha}$ using strategy repair. On the one hand, reference [15] considers repairing the strategy with a greedy approach; that approach requires solving a possibly unbounded number of LPs to compute a feasible strategy that is not necessarily optimal. On the other hand, we only need to check feasibility of a bounded number of LPs to compute an optimal strategy. Note that we do not compute the autonomy strategy $\sigma_a$ with the QCP in (12)–(15) directly. After computing the repaired strategy $\sigma_{ha}$, we compute the autonomy strategy $\sigma_a$ according to Definition 5.

4) Additional specifications: The QCP in (12)–(15) computes an optimal policy for a single reachability specification.

Suppose that, in addition to the reachability specification $\varphi = P_{\leq \lambda}(\lozenge T)$, we are given another reachability specification $\varphi' = P_{\leq \lambda'}(\lozenge T')$ with $T' \neq T$. We can handle this specification by adding
\[
\sum_{s \in S \setminus T'} \sum_{\alpha \in \mathcal{A}} \beta'(s, \alpha)\, x(s, \alpha) \leq \lambda' \tag{16}
\]
to the QCP in (12)–(15), where $\beta'(s, \alpha) = \sum_{s' \in T'} \mathcal{P}(s, \alpha, s')$. The constraint in (16) ensures that the probability of reaching $T'$ is less than $\lambda'$.

We handle an expected cost specification $E_{\leq \kappa}(\lozenge G)$ for $G \subseteq S$ by adding
\[
\sum_{s \in S \setminus G} \sum_{\alpha \in \mathcal{A}} \mathcal{C}(s, \alpha)\, x(s, \alpha) \leq \kappa \tag{17}
\]
to the QCP in (12)–(15). Similar to the probability computation for reachability specifications, the expected cost of reaching $G$ is given by
\[
\sum_{s \in S \setminus G} \sum_{\alpha \in \mathcal{A}} \mathcal{C}(s, \alpha)\, x(s, \alpha). \tag{18}
\]

V. CASE STUDY AND EXPERIMENTS

Defining a formal synthesis approach to the shared control scenario requires an abstract representation of the human's desires as a strategy. We now discuss how such strategies may be obtained using inverse reinforcement learning and report on case study results.

A. Experimental setting

We consider two scenarios, the first of which is the wheelchair scenario from Fig. 1. We model the wheelchair scenario inside an interactive Python environment.

In the second scenario, we use a tool called AMASE (https://github.com/afrl-rq/OpenAMASE), which is used to simulate multi-unmanned aerial vehicle (UAV) missions. Its graphical user interface allows humans to send commands to one or multiple vehicles at run time. It includes three main programs: a simulator, a data playback tool, and a scenario setup tool.

We use the probabilistic model checker PRISM [19] to verify whether the computed strategies satisfy the specification. We use the LP solver Gurobi [14] to check the feasibility of the LP problems given in Section IV. We also implemented the greedy approach for strategy repair in [15].

B. Data collection

We asked five participants to accomplish tasks in the wheelchair scenario. The goal is moving the wheelchair to a target cell in the gridworld while never being in the same cell as the moving obstacle. Similarly, three participants performed the surveillance task in the AMASE environment.

From the data obtained from each participant, we compute an individual randomized human strategy $\sigma_h$ via maximum-entropy inverse reinforcement learning (MEIRL) [28]. Reference [16] uses inverse reinforcement learning to reason about human behavior in a shared control scenario (though without any formal guarantees) from human demonstrations. In [25], inverse reinforcement learning is used to distinguish human intents with respect to different tasks. We show the workflow of the case study in Fig. 4.

Fig. 4. The setting of the case study for the shared control simulation.

In our setting, we denote each sample as one particular command of the participant, and we assume that the participant issues the command to satisfy the specification. Under this assumption, we can bound the probability of a possible deviation from the actual intent with respect to the number of samples using Hoeffding's inequality for the resulting strategy; see [27] for details. Using these bounds, we can determine the required number of commands to get an approximation of a typical human behavior. The probability of a possible deviation from the human behavior is given by $O(\exp(-n\epsilon^2))$, where $n$ is the number of commands from the human and $\epsilon$ is the upper bound on the deviation between the probability of satisfying the specification with the true human strategy and the probability obtained by the strategy that is computed by inverse reinforcement learning. For example, to ensure an upper bound $\epsilon = 0.05$ on the deviation of the probability of satisfying the specification with probability 0.99, the required amount of samples is 1060.
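The figure of 1060 samples is consistent with the two-sided Hoeffding bound $2\exp(-2n\epsilon^2)$; using that form (our assumption, since the text only states the asymptotic $O(\exp(-n\epsilon^2))$):
\[
2\exp(-2n\epsilon^2) \leq 1 - 0.99 = 0.01
\quad\Longleftrightarrow\quad
n \geq \frac{\ln(200)}{2\epsilon^2} = \frac{\ln(200)}{2 \cdot 0.05^2} \approx 1059.7,
\]
so $n = 1060$ samples suffice.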

We design the blending function as follows: at states where the human strategy yields a lower probability of satisfying the specification, we assign a lower confidence in the human strategy. Using this function, we create the autonomy strategy $\sigma_a$ and pass it (together with the blending function) back to the environment. Note that the blended strategy $\sigma_{ha}$ satisfies the specification by Theorem 1.

C. Gridworld

The size of the gridworld in Fig. 1 is variable, and we generate a number of randomly moving (e.g., the vacuum cleaner) and stationary obstacles.

An agent (e.g., the wheelchair) moves in the gridworld according to the commands from a human. For the gridworld scenario, we construct an MDP in which the states represent the positions of the agent and the obstacles. The actions in the MDP induce changes in the position of the agent.
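One possible state encoding for such a gridworld MDP is sketched below. The grid size, the obstacle dynamics (uniform random moves), and all names are illustrative assumptions rather than the exact model used in the experiments; indeed, the experiments report 2304 states for the 8 × 8 instance, so the construction used there evidently differs from this naive product encoding.

```python
from itertools import product

N = 8                                    # grid is N x N (assumption)
CELLS = list(product(range(N), range(N)))
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def clamp(cell):
    x, y = cell
    return (min(max(x, 0), N - 1), min(max(y, 0), N - 1))

def transitions(state, action):
    """A state is (agent_cell, obstacle_cell); the agent moves deterministically,
    the obstacle moves to a uniformly random neighboring cell (an assumption)."""
    agent, obstacle = state
    dx, dy = MOVES[action]
    new_agent = clamp((agent[0] + dx, agent[1] + dy))
    neighbors = [clamp((obstacle[0] + mx, obstacle[1] + my)) for mx, my in MOVES.values()]
    dist = {}
    for cell in neighbors:
        key = (new_agent, cell)
        dist[key] = dist.get(key, 0.0) + 1.0 / len(neighbors)
    return dist                           # maps successor states to probabilities

states = list(product(CELLS, CELLS))      # agent position x obstacle position
print(len(states), sum(transitions(((0, 0), (3, 3)), "right").values()))  # 4096 1.0
```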

The safety specification requires that the agent reach a target cell while not crashing into an obstacle with a certain probability $\lambda \in [0,1]$, formally $P_{\geq \lambda}(\neg \mathit{crash}\ \mathcal{U}\ \mathit{target})$.

First, we report results for one particular participant in a gridworld scenario with an 8 × 8 grid and one moving obstacle. The resulting MDP has 2304 states. We compute the human strategy using MEIRL with features, e.g., the components of the cost function of the human, given by the distance to the obstacle and the distance to the goal state.

We instantiate the safety specification with $\lambda = 0.7$, which means the target should be reached with at least a probability of 0.7. The human strategy $\sigma_h$ induces a probability of 0.546 of satisfying the specification; that is, it does not satisfy the specification.

We compute the repaired strategy $\sigma_{ha}$ using the greedy and the QCP approach, and both strategies satisfy the specification by inducing a probability of satisfying the specification larger than $\lambda$.

On the one hand, the maximum deviation between the human strategy $\sigma_h$ and $\sigma_{ha}$ is 0.15 with the LP approach, which implies that the strategies of the human and the autonomy protocol deviate by at most 15% over all states and actions. On the other hand, the maximum deviation between the human strategy $\sigma_h$ and the blended strategy $\sigma_{ha}$ is 0.03 with the QCP approach.

The results show that the QCP approach computes a blended strategy that is more similar to the human strategy compared to the LP approach.

Fig. 5. Graphical representation of the obtained human, blended, and autonomy strategy in the grid: (a) strategy $\sigma_h$, (b) strategy $\sigma_{ha}$, (c) strategy $\sigma_a$.

TABLE I
SCALABILITY RESULTS FOR THE GRIDWORLD EXAMPLE.

  grid      states    trans.    LP synth. (s)   δ_LP   QCP synth. (s)   δ_QCP
  8 × 8      2,304    36,864            14.12   0.15            31.49    0.03
  10 × 10    3,600    57,600            23.80   0.24            44.61    0.04
  12 × 12   14,400   230,400           250.78   0.33           452.27    0.05

To finally assess the scalability of our approach, consider Table I. We generated MDPs for different gridworlds with different numbers of states and numbers of obstacles. We list the number of states in the MDP (labeled as "states") and the number of transitions (labeled as "trans."). We report the time that the synthesis process took with the LP approach and the QCP approach (labeled as "LP synth." and "QCP synth."), which includes the time of solving the LPs or QCPs, measured in seconds. It also includes the model checking times using PRISM for the LP approach. To represent the optimality of the synthesis, we list the maximal deviation from the human strategy for the LP and QCP approach (labeled as "δ_LP" and "δ_QCP"). In all of the examples, we observe that the QCP approach yields autonomy strategies with less deviation from the human strategy while having computation times similar to the LP approach.

D. UAV mission planning

Similar to the gridworld scenario, we generate an MDP with states corresponding to the positions of the agents and the obstacles for the AMASE scenario. Consider the AMASE scenario in Fig. 6.

In this scenario, the specification, or the mission of the agent (blue UAV), is to keep surveilling the green regions (labeled as $w_1$, $w_2$, $w_3$) while avoiding restricted operating zones (labeled as ROZ1, ROZ2) and enemy agents (purple and green UAVs). As we consider reachability problems, we asked the participants to visit the regions in a sequence, i.e., visiting the first region, then the second, and then the third region. After visiting the third region, the task is to visit the first region again to perform the surveillance.

For example, if the last visited region is $w_3$, then the safety specification in this scenario is $P_{\geq \lambda}((\neg \mathit{crash} \wedge \neg \mathit{ROZ})\ \mathcal{U}\ \mathit{target})$, where $\mathit{ROZ}$ denotes visiting the ROZ areas and $\mathit{target}$ denotes visiting $w_1$.

We synthesize the autonomy protocol for the AMASE scenario with two enemy agents, which induces an MDP with 15625 states. We use the same blending function and the same threshold $\lambda = 0.7$ as in the gridworld example.

Fig. 6. An example of an AMASE scenario: (a) the AMASE simulator; (b) the GUI of AMASE.

The features used to compute the human strategy with MEIRL are given by the distances to the closest ROZ, to the enemy agents, and to the target region.

The human strategy $\sigma_h$ induces a probability of 0.163 of satisfying the specification, and it violates the specification as in the gridworld example. Similar to the gridworld example, we compute the repaired strategy $\sigma_{ha}$ with the LP and the QCP approach, and both strategies satisfy the specification. On the one hand, the maximum deviation between $\sigma_h$ and $\sigma_{ha}$ is 0.42 with the LP approach, which means the strategies of the human and the autonomy protocol are significantly different in some states of the MDP. On the other hand, the QCP approach yields a repaired strategy $\sigma_{ha}$ that is more similar to the human strategy $\sigma_h$, with a maximum deviation of 0.06.

The time of the synthesis procedure with the LP approach is 481.31 seconds, and the computation time with the QCP approach is 749.18 seconds, showing the trade-off between the LP approach and the QCP approach. We see that the LP approach can compute a feasible solution slightly faster, but the resulting blended strategy may be less similar to the human strategy compared to the QCP approach.

VI. CONCLUSION AND CRITIQUE

We introduced a formal approach to synthesize an autonomy protocol in a shared control setting subject to probabilistic temporal logic specifications. The proposed approach utilizes inverse reinforcement learning to compute an abstraction of a human's behavior as a randomized strategy in a Markov decision process. We designed an autonomy protocol to ensure that the resulting robot behavior satisfies safety and performance specifications in probabilistic temporal logic. We also ensured that the resulting robot behavior is as similar to the behavior induced by the human's commands as possible. We synthesized the robot behavior using quasiconvex programming.

We showed the practical usability of our approach through case studies involving autonomous wheelchair navigation and unmanned aerial vehicle planning.

There are a number of limitations and also possible extensions of the proposed approach. First of all, we computed a globally optimal strategy by bisection, which requires checking feasibility of a number of linear programming problems. A convex formulation of the shared control synthesis problem would make computing the globally optimal strategy more efficient.

We assumed that the human's commands are consistent through the whole execution, i.e.,

the human issues each command to satisfy the specification. This assumption also implies that the human does not consider assistance from the robot while providing commands, and in particular, the human does not adapt the strategy to the assistance. It may be possible to extend the approach to handle non-consistent commands by utilizing additional side information, such as the task specifications.

Finally, in order to generalize the proposed approach to other task domains, it is worthwhile to explore transfer learning [21] techniques.

Such techniques will allow us to handle different scenarios without relearning the human strategy from the human's commands.

REFERENCES

[1] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In ICML, page 1. ACM, 2004.

[2] Christel Baier and Joost-Pieter Katoen. Principles of Model Checking. The MIT Press, 2008.
[3] Mihir Bellare and Phillip Rogaway. The complexity of approximating a nonlinear program. In Complexity in Numerical Optimization, pages 16–32. World Scientific, 1993.
[4] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.
[5] Taolue Chen, Yuan Feng, David S. Rosenblum, and Guoxin Su. Perturbation analysis in verification of discrete-time Markov chains. In CONCUR, volume 8704 of LNCS, pages 218–233. Springer, 2014.
[6] Anca D. Dragan and Siddhartha S. Srinivasa. Formalizing assistive teleoperation. In Robotics: Science and Systems, 2012.
[7] Anca D. Dragan and Siddhartha S. Srinivasa. A policy-blending formalism for shared control. I. J. Robotic Res., 32(7):790–805, 2013.
[8] Andrew Fagg, Michael Rosenstein, Robert Platt, and Roderic Grupen. Extracting user intent in mixed initiative teleoperator control. In Intelligent Systems Technical Conference, page 6309, 2004.
[9] Lu Feng, Clemens Wiltsche, Laura Humphrey, and Ufuk Topcu. Synthesis of human-in-the-loop control protocols for autonomous systems. IEEE Transactions on Automation Science and Engineering, 13(2):450–462, 2016.
[10] Vojtěch Forejt, Marta Kwiatkowska, Gethin Norman, David Parker, and Hongyang Qu. Quantitative multi-objective verification for probabilistic systems. In TACAS, pages 112–127. Springer, 2011.
[11] Roland Fryer and Matthew O. Jackson. A categorical model of cognition and biased decision making. The B.E. Journal of Theoretical Economics, 8(1).
[12] Jie Fu and Ufuk Topcu. Synthesis of shared autonomy policies with temporal logic specifications. IEEE Transactions on Automation Science and Engineering, 13(1):7–17, 2016.
[13] F. Galán, M. Nuttin, E. Lew, P. W. Ferrez, G. Vanacker, J. Philips, and J. del R. Millán. A brain-actuated wheelchair: Asynchronous and non-invasive brain-computer interfaces for continuous control of robots. Clinical Neurophysiology, 119(9):2159–2169, 2008.
[14] Gurobi Optimization, Inc. Gurobi optimizer reference manual. http://www.gurobi.com, 2013.
[15] Nils Jansen, Murat Cubuktepe, and Ufuk Topcu. Synthesis of shared control protocols with provable safety and performance guarantees. In ACC, pages 1866–1873. IEEE, 2017.
[16] Shervin Javdani, J. Andrew Bagnell, and Siddhartha Srinivasa. Shared autonomy via hindsight optimization. In Robotics: Science and Systems, 2015.
[17] Dae-Jin Kim, Rebekah Hazlett-Knudsen, Heather Culver-Godfrey, Greta Rucks, Tara Cunningham, David Portee, John Bricout, Zhao Wang, and Aman Behal. How autonomy impacts performance and satisfaction: Results from a study with spinal cord injured subjects using an assistive robot. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 42(1):2–14, 2012.
[18] Jonathan Kofman, Xianghai Wu, Timothy J. Luu, and Siddharth Verma. Teleoperation of a robot manipulator using a vision-based human-robot interface. IEEE Transactions on Industrial Electronics, 52(5):1206–1219, 2005.
[19] Marta Kwiatkowska, Gethin Norman, and David Parker. PRISM 4.0: Verification of probabilistic real-time systems. In CAV, volume 6806 of LNCS, pages 585–591. Springer, 2011.
[20] Adam Leeper, Kaijen Hsiao, Matei Ciocarlie, Leila Takayama, and David Gossow. Strategies for human-in-the-loop robotic grasping. In HRI, pages 1–8. IEEE, 2012.
[21] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[22] Shashank Pathak, Erika Ábrahám, Nils Jansen, Armando Tacchella, and Joost-Pieter Katoen. A greedy approach for the efficient repair of stochastic models. In NFM, volume 9058 of LNCS, pages 295–309. Springer, 2015.
[23] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, 1994.
[24] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
[25] Constantin A. Rothkopf and Dana H. Ballard. Modular inverse reinforcement learning for visuomotor behavior. Biological Cybernetics, 107(4):477–490, 2013.
[26] Jian Shen, Javier Ibanez-Guzman, Teck Chew Ng, and Boon Seng Chew. A collaborative-shared control system with safe obstacle avoidance capability. In Robotics, Automation and Mechatronics, volume 1, pages 119–123. IEEE, 2004.
[27] Brian D. Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. 2010.
[28] Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. 2008.
