OK, our next

speaker is Dan Russo from Columbia Business School. And he got his PhD at

Stanford, and was a postdoc at Microsoft

Research for a while. And then you were a faculty member at the Kellogg Business School as well before joining Columbia, after a couple of years of bouncing around. So we look forward to your talk. DANIEL RUSSO: Thanks. All right, well, I’m really excited to be here. I’m excited that the

last couple of talks have set the stage

well for this one. OK so I’m Daniel Russo. I’ll present some work

that’s joint, I guess– most of it– with collaborators

of mine from graduate school. So my PhD advisor and

labmates of mine– Ian Osband and Zheng Wen– as well as the most recent

work, which is solo authored. I guess, in the last

couple of years, I’ve been bouncing around

working on different things. And I’m returning kind of to

think about some, I think, pretty deep open questions

that were left open with some of my earlier work. Now one thing to sort of

set the stage here, sort of to have the right

orientation, I think in CS theory or

information theory, there’s a way we usually

proceed where we carefully define a problem class. It’s the first thing we do. And then we work very

hard to understand the fundamental

limits of how well you can do in that problem class. And then potentially, even kind of as a side note, you mention the algorithm that attains those limits. I’m working in the

opposite order here. I’m going to try to think of an

algorithmic design principle, and then work really

hard to understand when this thing works well, and

how far I can push the theory to understand that

design principle. And I think you should

think of the pinnacle. The absolute dream is

that you have something like bootstrapping or boosting

that has strong supporting theory, but is useful

to practitioners far beyond exactly

where the theory fits. So this video is kind of the

only catchy video I’ll have. And it does not compare

to Ben’s at all. But it’s going to capture

sort of the main challenge with kind of popular

RL algorithms that you download

and run on OpenAI Gym that we’re after fixing here. And so this is a twist on

maybe one of the first things you would download and

run on OpenAI Gym, which is the cart-pole balancing task. You try to balance the pendulum

by exerting forces on the cart. And usually, you start

with a pole upright, and then a bunch of

simple algorithms make you think you know

how to solve this task. Now this is a twist

where the pole is going to start upside down. And we’re going to see how

well our algorithms work. All right, so in

particular, I’m going to contrast two

kinds of algorithms. One uses Epsilon-Greedy

exploration. So it follows its best estimate of the value of each action at a certain time, and then kind of randomly dithers left or right.
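Just to fix ideas, here is a minimal sketch of that kind of dithering rule; the two-action setup and the numbers are illustrative assumptions, not details from the talk:

    import numpy as np

    def epsilon_greedy_action(q_estimates, epsilon, rng):
        # Follow the best current value estimate, but with probability
        # epsilon take a uniformly random action (the "dither").
        if rng.random() < epsilon:
            return int(rng.integers(len(q_estimates)))
        return int(np.argmax(q_estimates))

    rng = np.random.default_rng(0)
    a = epsilon_greedy_action(np.array([0.1, -0.3]), epsilon=0.1, rng=rng)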

There’s some cost for exerting energy here. So eventually, it learns to

just keep the pole still. So what happens

after 1,000 episodes? Now this is a kind of

similarly simple algorithm, performing what we’ll call deep

exploration instead, right? At first, it’s kind of

just going to flail around. But after 500

episodes, it’s learned to move very fast

to the left, stop, use the momentum to

swing the pole up and try to keep it balanced. And after 1,000, it’s really

kind of solved this task now– AUDIENCE: So is there

representation in here? I mean, somehow, you have

the four state variables? DANIEL RUSSO: All right,

the feature representation, you have angle, angular velocity, momentum. There are a few. AUDIENCE: Just

differential equations? DANIEL RUSSO: Yeah, there’s

something like that. I have to look at exactly

what was used in this case. But there’s linear

function approximation with kind of the right

physics features. I should check. So we’re after, at some point, bounds. But I’m really after the division between these two types of behavior. In the red, we have

the reward you’re getting under Epsilon-Greedy. And it just flatlines. You really don’t

solve the task at all. So the task is somehow

impossible with that method. And in green, we have this

deep exploration method. And it’s possible. How do we get this to work? Sort of one thing we’re

going to try to build on is that there’s been

just decades of research into doing value function

approximation, with kind of different textbooks

providing a nice treatment of a whole bunch of tricks– including [INAUDIBLE]–

for doing this. And so I’m going

to sort of imagine that these things work well. And maybe that’s a

leap, in some cases. And I’m going to

ask, given that we have kind of a methodology

for doing that, how do we layer on top

methods for collecting data efficiently? So here’s the kind of template. And here’s what was

followed in that video. Well, first, we’re going

to randomly perturb the training data. And I’ll go back for

a second in a moment, and explain what I mean there. So at each episode, rather

than fit to the training data we’ve actually

observed, we’re going to fit to some randomly

perturbed version of that. Then we’re going to apply

an existing value function approximation method

to that perturbed data. And then we’re going to behave

greedily under that value function approximation. Now I do think that

this is an idea where, probably, if I proposed it a

long time ago in the controls community, before kind of the

intellectual progress that led to this idea, it probably would have seemed pretty crazy. Even to me, it seems pretty crazy. And it’s sort of been a journey to land at this– a journey involving many people. So in this particular

case, we just added Gaussian noise to the

observed rewards in the past, and injected kind of

some prior randomness to deal with the fact

that kind of at episode 1, there’s nothing to

induce exploration. There’s no observed

data to perturb. The off-the-shelf value function estimation method that’s now going to be applied is called least-squares value iteration, which does a recursive least-squares estimation, similar to Bellman backups. OK so far?
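Putting the template together, here is a rough sketch of what this fitting stage might look like with linear features; the interface, noise scale, and regularization constant are my illustrative choices, not the exact code behind the video:

    import numpy as np

    def rlsvi_fit(data, phi, d, H, num_actions, sigma=1.0, lam=1.0, rng=None):
        # data[h]: list of (s, a, r, s_next) transitions observed at step h.
        # phi(s, a): d-dimensional feature vector for a state-action pair.
        rng = rng or np.random.default_rng()
        theta = [np.zeros(d) for _ in range(H + 1)]  # theta[H] stays zero
        for h in reversed(range(H)):
            A = lam * np.eye(d)
            b = np.zeros(d)
            for (s, a, r, s_next) in data[h]:
                x = phi(s, a)
                v_next = max(phi(s_next, ap) @ theta[h + 1]
                             for ap in range(num_actions))
                # Perturb the observed reward with Gaussian noise before
                # taking the Bellman backup as the regression target.
                y = r + rng.normal(0.0, sigma) + v_next
                A += np.outer(x, x)
                b += y * x
            # Regularize toward a random prior draw, which supplies the
            # prior randomness when there is no data to perturb yet.
            theta_prior = rng.normal(0.0, sigma / np.sqrt(lam), size=d)
            theta[h] = np.linalg.solve(A, b + lam * theta_prior)
        return theta

Behaving greedily then means picking the argmax over a of phi(s, a) @ theta[h] at each step h of the episode.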

All right, the main insight– if you can call it this, and it takes some work to start to formalize it– is that somehow, this kind of scheme synthesizes a sophisticated kind of deep exploration with value function-based generalization schemes. So the outline for

the talk is to try to figure out what I’m talking

about in all these points in a slightly more precise way. So what is the kind of

sophisticated exploration that we’re after? In particular, how

do I kind of have a language for describing what’s

going on in the carpool video? What kind of

exploration was that? And how is it different? I’m going to talk

about sanity checks for this kind of exploration. And in particular, I’ll flash

Ben Recht’s linearization principle here, and

try to say, let’s have a linear benchmark that

is an exploration benchmark– unlike, I think, linear

quadratic control. Now I’m going to discuss, then,

where the algorithmic idea comes from. I think, out of

the blue, I don’t know how you would come up with

this idea, at least, to me. And I’ll tell you

sort of the path to understanding this is a

scalable version of Thompson sampling. At the end, I’ll

flash some progress in providing a rigorous

theory justifying what’s going on here, and leading

into the open questions that I’m kind of puzzling

about these days. So what is deep exploration? Well, here is a simple

kind of a thing you might see in classic RL work. We have a rat in a maze. They know that there

is cheese here. And they know kind

of half of the maze, and what it looks like. Part of the maze,

they’ve never been to. And they believe that

it either contains tasty rat cake, or this trap. Now you should

think of this kind of like the cart-pole example. There are going to be

many, many episodes. So my mouse is going to

repeatedly, each day, get started in

this maze, and then have many time steps at which

they try to navigate the maze. And then they restart. So if you’re going to

do this many times, it’s worthwhile,

probably, to figure out whether you’re going to

get the cake every day from here on out. So do you make a

beeline for the cheese? Or do you make a beeline

for the question mark? Now if you’re optimizing

just your immediate reward in this game, that’s some

kind of myopic optimization. We all know not to do that. Instead, we do

dynamic programming. I would say, dynamic programming

is like going for the cheese here. You look ahead many steps at

what reward you’d like to get. And you make a beeline

for that reward. Now myopic exploration

is the kind of thing that is being built in

with Epsilon-Greedy. At each step, you try to

think about maybe testing out a new action. Deep exploration is kind of

thinking ahead, understanding that you’d like to

check out the cake here. And in an intelligent

way, navigating the part of the

environment you already understand to get to the cake. Now as I’ll mention in the

next slide, in simple settings, there are– there’s kind of been decades

of research understanding how to do this. But it’s still kind

of an open question how to do it in a scalable way. There’s a reason, despite

all of the advances in kind of understanding the sample

complexity of RL, that people are not really using this stuff

when they download OpenAI Gym. You don’t think so, Chave? AUDIENCE: I feel

like it’s awesome. DANIEL RUSSO: I want

to mention that– OK– in addition to

this being something you see with a

mouse in the maze, I think it is a ubiquitous

phenomenon in real life. It’s a common thing that if

you’re following one strategy and you deviate from it for

one kind of time instant, that’s not enough to figure out

whether alternative strategies are effective. And to come up with examples

of that in your mind, you just have to think

of examples of strategies that only work if

they’re employed for an extended period of time. People talk about

applying RL in medicine. Well, in medicine, most

alternative treatments would only be useful if you

continued sticking to them. You can’t try

chemotherapy for one day and revert back and

expect that experiment to have somehow taught you that

chemotherapy is a good idea. It won’t. It’ll teach you it’s a bad idea. And I think you find this

all throughout the marketing literature and other

kinds of places. Now there’s a large

theory, kind of. This is, in some

ways, just a branding of the idea of what underlies

the efficient RL literature. There’s been a long literature. I think, kind of

the ideas were first probably pioneered in this

paper by Kearns and Singh, that understands the importance

of looking ahead and planning to learn something many time steps into the future. And usually, this is formalized. Kind of a check for this is

that expected regret is somehow polynomial in the number of

parameters you’d like to learn. Usually here, this is something

like a polynomial in the number of states and actions. Whereas these kinds of myopic exploration schemes, in the worst case, are exponential. But I think it’s fair to

say the focus has really been on tabular models. Now we have a lot

of people trying to extend this stuff to

richer parameterized models. And I think that this

talk is about one idea for a paradigm

that might lead to a principled,

scalable approach. Done with that part. So now I want to discuss

sort of a linear sanity check for deep exploration. This is my favorite

toy class of problems for sanity checking RL methods. But it’s not one I hear

maybe all that much about in this kind of event. So we have here, Ben’s

linearization principle, which I totally agree

with in some sense– that if our algorithms, when we

check them on a linear model, are doing crazy things,

that’s something we should worry about. Now I don’t think solving the

linear task is somehow proof that it’s the right algorithm. But failing miserably

on it should at least raise concerns about

the types of tasks it’s going to fail on

in complex environments. I’d say, linear

quadratic control really crystallizes

issues with feedback and stability. It doesn’t really crystallize

issues with exploration. But I’ll argue that

the shortest path problem might be the simplest

case where methods start really breaking. It also has the benefit that it’s a special case of the online linear optimization literature. So there are strong benchmarks. So here’s the problem. Each day, I commute

from home to work. I start at V1. I want to get to V12. And along the routes, there are these unknown theta parameters: the edge weights encode the average time to travel along a particular edge. Now one thing you’ll

find is that if you take certain popular

algorithms and specialize them to this case, they do

actually do crazy things. So I don’t think

you’ll hear anything like this in this room

over the next few days. But the kind of

thing that I find confusing when I go

through the literature is there are lots

of people claiming to generate exploration by

using stochastic policies– meaning, that often, I

have something explicitly like an entropy constraint

that says that at every state, I’m going to make sure that

my action is sufficiently randomized. Something I want to highlight

is that if you have constraints like that in your

policy, you provably are going to fall flat on your

face in some of these problems. The reason is kind of captured

well in this simple example. Suppose that I have a

long chain like this. I start at this top node. The blue denotes these

kind of short backroads. And there’s a long, long–

and I’m going to kind of scale the graph– a long sequence of

back roads leading to one of two highways,

the green or the orange. I start out thinking

orange is probably fastest. So the greedy policy just

goes straight for orange. In fact, it goes down and over. Now a human being would look

ahead and navigate forward. Someday, they’d try

out the green highway. A simple kind of observation

is that to reach the green, you have to go right, and then

go right, and then go right, and then continue to go right. If you are randomizing

your actions, the odds that you go right

consecutively that many times without ever going down

are decaying exponentially. So the chance that you randomize has to be almost– has to be as small as 1 over the number of edges in order to ever reach the end.
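Here is that calculation as a quick numerical sketch, with illustrative numbers; with two actions, a planner that always wants to go right still goes right at each step only with probability 1 - epsilon/2:

    for n in (10, 100, 1000):
        for eps in (0.1, 1.0 / n):
            p = (1.0 - eps / 2.0) ** n
            print(f"n={n:5d}  eps={eps:.4f}  P(reach green) = {p:.3g}")

With eps = 0.1 and n = 1000 the success probability is around 5e-23; only with eps on the order of 1/n does it stay bounded away from zero.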

I’m sure this makes sense, but it’s one simple sanity check that already is enough to

break a fair amount of what you’re seeing proposed. Questions so far? AUDIENCE: So, just so I

can understand the example and sort of generalize

it in my head, if I expect the trajectory of the optimal policy to take n steps, I don’t want my epsilon to be much greater than 1 over n. DANIEL RUSSO: I think that yes– AUDIENCE: Did I get it right? DANIEL RUSSO: I think that’s

about right, is it, somehow– AUDIENCE: [INAUDIBLE] AUDIENCE: So

somehow, if you need to do things that are kind of depth first– if the problem really involves some depth-first searching– you don’t want to be having these things that are causing constant branching. DANIEL RUSSO: I think

if you have some depth, then your randomization

has to be pretty small, to sort of have even a hope. Even if you– even if the part of your algorithm that is planning– the non-random part– is very intelligent, you need the random part to be

very small for deep problems. OK, so let’s start by thinking

about some algorithm template that we might use

on this problem, and then think

about how it extends to more complex problems. So I’m going to

build an algorithm by thinking about a Bayesian

version of the online shortest path, where I don’t know

the edge weights encoded by the vector theta. But I start with prior

beliefs over them. Now as observations

come in, I sort of observe these noisy travel times

on the edges I have traversed. And I’m going to update

my posterior beliefs. I, at least, am going

to require the ability to generate some kind of

accurate posterior sample. So the way I think about

this is, let’s imagine I know how to do inference

and ask, given that ability to do inference, how should

I select actions sequentially across these episodes? All right, so just to build

toward this algorithm, let’s first think about Greedy. Greedy, what I do

is at each episode, I compute the posterior mean

travel time along each edge. I compute the shortest path

under those edge weights. And I follow that path and

update my posterior beliefs. The issue with this,

as most people know, is that you can get stuck. You continue playing this

path, learning nothing, and you don’t try out

the other possible paths. Now another thing to note is

that there are exponentially many paths in many of these problems. So anything that requires

kind of enumerating the set of all paths and

somehow crossing them off is not going to work

well computationally. So this is one

simple idea that’s then generated a

lot of interest, and it’s Thompson sampling. I am going to, in each

episode, build a graph by sampling edge weights from

the posterior distribution, given my observations

up to this time. Now naturally, what

you should think happens is that

edges that you’ve traversed many,

many times, you’re pretty much filling in

the posterior mean there. There’s very low variance

in these samples. But edges that you

haven’t visited much, there’s a lot of variance

in what you sample. Now what I’m going to do is follow the shortest path under the sampled edge weights. And I update my beliefs. So that’s one episode.
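As a minimal sketch of one episode of that loop– with a conjugate Gaussian model for the edge weights as an illustrative assumption, and where observe_travel_time is a hypothetical stand-in for the environment:

    import numpy as np
    import networkx as nx

    def thompson_sampling_episode(G, mu, var, source, target, noise_var, rng):
        # mu[e], var[e]: posterior mean and variance of each edge's mean
        # travel time, keyed by the edge tuple e = (u, v).
        # 1. Inference box: sample every edge weight from the posterior.
        for e in G.edges:
            G.edges[e]["weight"] = max(rng.normal(mu[e], np.sqrt(var[e])), 1e-9)
        # 2. Optimization box: shortest path under the sampled weights.
        path = nx.shortest_path(G, source, target, weight="weight")
        # 3. Traverse it, observe noisy travel times, update beliefs.
        for e in zip(path, path[1:]):
            y = observe_travel_time(e)  # hypothetical environment call
            precision = 1.0 / var[e] + 1.0 / noise_var
            mu[e] = (mu[e] / var[e] + y / noise_var) / precision
            var[e] = 1.0 / precision
        return path

Greedy is the same loop with step 1 replaced by the posterior mean itself; the randomness in step 1 is what keeps this from getting stuck.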

The next episode, I start again. I have new posterior beliefs. I draw a new sample from those beliefs. And this sample’s different for two reasons. One is that I have

another observation, and the other is that

I’m randomizing again. Now what I do is I compute

and follow the shortest path under that new sample that’s

likely to be something different. Did I do that properly? No? That’s likely to be something

different, taking you through a different

part of the graph. Yeah, question? AUDIENCE: What is

your overall goal? Is it to identify

the shortest path? DANIEL RUSSO: It’s not. So here, I would say that

my goal is something more like cumulative regret. So I’m trying to do well

in an online setting. But often, you can flip

these things around. And if you have an algorithm

with low enough regret, then only some polynomial

number of samples are required to identify

a near optimal path. Now I’ve been talking about

that stuff for a long time. But I realized I

never really went through the next step of

crystallizing what I think is so nice about that strategy. And from an intellectual

perspective, at least– and I think also from

a practical one– it’s this kind of nice

separation that a priori, it wasn’t obvious to me that

this should be possible. So on the one hand,

we’re doing inference. At least, we’re

generating a sample from a model in a manner

that kind of reflects the degree of uncertainty. We’re sampling edge

weights in a way that reflects our uncertainty

about those edge weights. And there’s just an

enormous set of methods from the statistics literature

for doing that kind of thing. Now on the other hand, we’re

finding a shortest path. And an enormous

amount of effort has come in to developing different

methods for doing that. So we have this kind

of optimization box. And you can sort of imagine that

there’s one team at Microsoft with experts in optimization. And there’s one

team with experts in drawing approximate posterior

samples or doing bootstrapping, or something. And they don’t even

know each other. But they each build their own

module into this algorithm. And Thompson

sampling just somehow makes the two work together. And if you think about

other kinds of strategies you might build

for this, usually, to get them to work,

you need to understand both inference and optimization,

and get them to work together in a more intelligent way. So does it work? Well, you can sanity check

this through simulation. So here’s a version

of this shortest path problem on a binomial bridge. So this is that thing extended to have 20 stages rather than 6– about 185,000 paths.
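To see where that count comes from: a path through a binomial bridge is determined by which 10 of the 20 stages move up, so the number of paths is 20 choose 10 = 184,756, or roughly 185,000.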

And you can at least check that Thompson sampling does much better than kind of any

version of Epsilon-Greedy however you tune the epsilon. Yeah? AUDIENCE: I’m a bit

confused at the [INAUDIBLE]. Because, have you

seen some coupling? Because, essentially–

let’s say, if I think of all the edges having the same prior, you need to visit them all at some point, right? You need to find

the shortest path. But if you want to find the

shortest path from x to d, you need to visit

a lot of edges. DANIEL RUSSO: You

need to visit– to find the true

exact shortest path, you have to visit

every edge, I guess. I’m not clear how many times

you have to visit every edge. AUDIENCE: I’d say once. DANIEL RUSSO: Once? Yes. AUDIENCE: That makes sense. At least once. DANIEL RUSSO: At

least once, yeah. AUDIENCE: So you’re trying

to get an algorithm which is better than this [INAUDIBLE]? Would you have had an

additional assumption? DANIEL RUSSO: No. No, I’m not– I’m not. Instead, I’m thinking about

one broad algorithmic idea that at least would

manage to do as well as visiting every edge once. So I think the issue, I

guess, with myopic exploration algorithms– let’s say

Epsilon-Greedy in this– is that it does not quickly

visit every edge once. It would take

exponentially many steps to even visit each edge once. AUDIENCE: [INAUDIBLE]. So as long as you

didn’t [INAUDIBLE]. AUDIENCE: Because counting something is basic [INAUDIBLE], it’s just random over the [INAUDIBLE]. DANIEL RUSSO: I think

it’s a method that gets– sort of gets the same benefits as R-max. Good. So I’m– yeah, I think

that the view here instead, is that this is one algorithmic

idea that works much more broadly. It has many benefits, in that

the two things you plugged in are an inference engine

and an optimization engine. If you specialize it to the

case where R-max works, it seems to do about

the same thing. Yeah? AUDIENCE: So you are saying

you didn’t believe in this? That this separation

is perfectly complete, and it just works like that? Or, are there some cases when

this is not quite working that well? So if there’s a

separation, like we’re talking about [INAUDIBLE]. And optimization is

doing it all the same. And you [INAUDIBLE]. You only have to worry

about each and every box individually. And then Thompson sampling

[INAUDIBLE] randomized. AUDIENCE: The

Thompson sampling is on the optimization problems. AUDIENCE: It’s like, they’re

nicely working together. And [INAUDIBLE]. DANIEL RUSSO: I

actually do think that for a very broad class

of problems, this works. But it’s not all. In some ways, then you

need a lot of thought to understand for what class

of online learning problems does this explore the right way,

and for what class does it not? AUDIENCE: It seems to me

that this idea is very highly coupled with [INAUDIBLE]. DANIEL RUSSO: Great. AUDIENCE: And sometimes,

optimism doesn’t work. DANIEL RUSSO: Great. And in fact, I think that

probably, maybe the theme of my PhD thesis was trying

to say, when optimism works, Thompson sampling works. And there’s a big

class of problems that we should worry about, that

have a feature where optimism doesn’t work. And then you should

do something else. Yeah? AUDIENCE: When you

say Thompson sampling, you mean a Bayesian model

in the realizable case. So you’re using the Bayesian

model as the right model? DANIEL RUSSO: I’ll

end with that being one of these open questions. And– AUDIENCE: [INAUDIBLE] DANIEL RUSSO: Thank you. OK, now, does it work? Well, there’s also

some intuition here. So there’s simulation that

says something worked. There’s intuition that, I guess, it’ll have to do something like R-max, for people who know

what this means. There’s intuition that it ought

to solve this kind of problem. Now for this, let’s note that there are equivalent perspectives on what this algorithm does. One perspective is that it samples from a posterior over models– like samples of the vector of edge weights– and optimizes. That’s sort of the perspective

that a practitioner likes to think about. I think for proving theorems,

the second perspective is probably the more useful one. What this algorithm’s

done, in a sense, is it’s written down this

exponentially long list, encoding all possible

paths you could follow. It has associated with

each one a number, so sort of defining a probability

vector over this list. And each entry of that vector

is the posterior probability that it is the shortest path. Now if you think about

this example I constructed, I was worried about

a case where there’s many, many paths

through this graph. And I think there’s

a decent chance that the shortest one goes along

the top and then follows green. How do I reason that

Thompson sampling has to follow that particular path? Well, if there’s a decent chance

that the shortest path includes the green edge, there’s

got to be a decent chance that the algorithm

samples the green edge. That’s one kind of way, just in words, to work through in your mind why this algorithm cannot get stuck in those deep

exploration traps. All right, so does it work? Some theory. And to my mind, the

most satisfying theory for this algorithm right

now is in the case where posterior inference works well. There’s gaps between upper and

lower bounds in some worst case senses. All right, so what kind

of thing can you do? Well, this is an information-theoretic regret bound. It says that the expected regret, under graphs drawn from the true prior, scales at worst like this. Now the entropy here– this is the entropy under the prior of the true shortest path– reflects that if you start out with a strong belief about which path is shortest, you should do better.
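The formula itself is on the slide rather than in the transcript, but the published information-theoretic bound of this form for a d-dimensional linear problem– which I believe is what is being shown– reads

    E[Regret(T)] <= sqrt( (d / 2) * H(A*) * T )

where A* is the optimal path, H(A*) is its entropy under the prior, T is the number of episodes, and d is, for the shortest path problem, the number of edges.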

Now that entropy term is bounded by the logarithm of the number of paths. And the whole thing on the right-hand side scales with the number of edges. The number of edges

is much, much smaller than the number of

paths in the problem. So the fact that this scales

linearly in the number of edges is, in some ways,

proof that it’s doing this kind of deep exploration. Now personally, I think

the information theory is aesthetically beautiful. There’s been some work

extending this kind of approach to many different

kinds of models, and I think, now covering

most of the most popular online learning settings. And I’ll flag here,

one big thing in red that I hope someone

will help me crack, which is, I think we need

new ideas to understand an information theoretic kind

of bound like this for tabular MDPs. Yeah? AUDIENCE: So the entropy of the

prior over the shortest path– is it equal to, or smaller than, the entropy of the prior over model classes, over models? This is just a case

of shortest paths, not stochastic shortest paths. DANIEL RUSSO: Let

me think about that. AUDIENCE: So your

stochastic shortest paths would be– would require the prior over models to have higher entropy. DANIEL RUSSO: I should

think about that for a second offline. But in a lot of cases

that I think of, the entropy there being over

policies instead of over– or, say, over actions

instead of over models, does mean twisting

the algorithm. But I have to think about

the shortest path one when I’m not on video. AUDIENCE: Yeah, this is

over the shortest paths. DANIEL RUSSO: Let

me think about it. Let me think about it offline. Yeah? AUDIENCE: Yeah,

I’m trying to make sense of the regret definition,

the Bayesian regret. In a way, you define the

Bayesian [INAUDIBLE] algorithm. And there, you also use it to

define the Bayesian regret. Because of that, you can

plot a lot of things. And look how nice they look. But is there a little

bit of mutual dependence between the input of the

algorithm and the performance metric that you

tried to include? DANIEL RUSSO: So,

yeah, I think so. Now to me, the dependence– if you want to be

totally worst case, then you can only justify

using Thompson sampling with a very flat prior,

which, to me, I don’t think you always should do. So I do think there’s room for

having a theory that confirms sort of that if you

get the prior right, you get benefit

from that as well. AUDIENCE: You can also just change your analysis to have a guarantee based on the outcome that’s there, without needing a Bayesian setting– like an oracle bound that, if your true answer is closer to your [INAUDIBLE]

you’re better off. It’s more just an

analysis aspect. DANIEL RUSSO: So

far, that’s very hard to do in problems

where you have, say, like an exponential number

of essentially actions here. So in principle, you

may be able to do that. But it’s sort of an open

question to get it to work the way it works in, sort of, a classification setting. So I think that’s true. I think what you should

think of this as saying is, if you believe the Bayesian

inference gives you some sort of

accurate estimation, gives you an accurate

kind of inference, things work very well here. And I’ll be going back to

sort of the big challenge, and what I’m about

to present is we’re moving away from that world. So I’m going to try

to rush through here. Let’s take that

shortest path thing and move it to episodic

reinforcement learning. In the first steps, nothing’s

really going to change. Now I think of my agent, who

is going to repeatedly interact with some environment for

some number of time steps. They apply a policy. They observe an outcome,

which is this trajectory. And they associate

with that, some reward, which is the sum of per time

step rewards over that horizon. This is my bad 1980s stick

figure drawing of this. I think it highlights,

maybe, that this is very much like some kind

of online learning or bandit problem where the action

space is the set of policies. Now if you want to do, say,

model-based tabular learning– where you begin with a prior

over the unknown transition kernel, the unknown

reward model– everything kind of

extends gracefully. So what is Thompson sampling then? Sample an MDP from the posterior and apply the optimal policy for that MDP. You do that throughout the episode, and then you reset.
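Here is a minimal sketch of one episode of that scheme– often called posterior sampling for reinforcement learning (PSRL)– with illustrative Dirichlet and Gaussian conjugate priors:

    import numpy as np

    def psrl_episode(counts, reward_sums, H, rng):
        # counts[s, a, s']: observed transition counts so far;
        # reward_sums[s, a]: summed observed rewards (crude Gaussian model).
        S, A, _ = counts.shape
        n = counts.sum(axis=2)
        # 1. Sample an MDP from the posterior.
        P = np.array([[rng.dirichlet(1.0 + counts[s, a]) for a in range(A)]
                      for s in range(S)])
        R = rng.normal(reward_sums / (n + 1.0), 1.0 / np.sqrt(n + 1.0))
        # 2. Solve the sampled MDP by finite-horizon value iteration.
        Q = np.zeros((H + 1, S, A))
        for h in reversed(range(H)):
            Q[h] = R + P @ Q[h + 1].max(axis=1)
        # 3. Apply the optimal policy for the sample for the whole episode.
        return Q[:H].argmax(axis=2)  # policy[h, s] = action

After the episode, the observed transitions and rewards update counts and reward_sums, and the next episode resamples; the episode-long commitment to one sample is what makes this deep exploration rather than dithering.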

Now you can kind of do some math, get polynomial regret bounds– again, ensuring there’s some kind of deep exploration going on if you do this. I’ll emphasize that one reason

to be excited about this is kind of beyond

worst case analyses. Often, when you

simulate this stuff, it works markedly better. I’m kind of picking

on a straw man here because this is

an algorithm that’s really designed just

as a proof of concept, just to make regret

analyses work, all right? Well, we kind of

copied their proof to get a bound for

Thompson sampling. You can see that

the bound affects the algorithm in this case. And regret is enormous

relative to what you get from an algorithm

where the analysis is really just separate from

the algorithm design. Now this has the same kind

of separation property, where you’re saying, if

I can optimize an MDP and I can do inference

over the model for the MDP, things seem to work. The issue is that for realistic

MDPs, inference is very hard. It’s very hard to write down a

model class for the transition dynamics and do

appropriate inference. And optimization

is also very hard– curse of dimensionality. So how do we develop scalable

approximations to this idea? Well, we can think of this

Thompson sampling, or posterior sampling algorithm, as sampling

from a posterior over MDPs, and then optimizing. But one thing we could

also think of it as doing is that, as it optimizes, it recursively generates value

function estimates. So each sampled MDP has an associated value function. So you could think of this as sampling from a distribution

over value functions, and then behaving greedily with

respect to that distribution. If you take that

view of things, it’s natural to try to generalize

across states and actions by using a parameterized

approximation to the value function itself. So that’s what we come up

with as a template here. We take some kind of

deterministic learning algorithm, like DQN or

least-squares value iteration. And usually what this does is regularize towards zero.

it regularizes towards zero. And it takes in a

sequence of data and spits out an estimate

for the parameter of a parameterized value

function approximation. The main insight

here is that you can generate meaningful

approximations to kind of a posterior sample

over value functions by applying the same

algorithm with noise, noise injected

into, in this case, the rewards you have observed. And much like when you

start a Bayesian agent and they have no data, they’re randomizing a lot, we should have some

kind of prior randomness in the algorithm. And that’s why we regularize

the value function to some prior point. I’m sort of racing here, to make

sure I get to the punchline. So let me point out that

it’s not such a crazy idea. If you take H equals 1– so you have a one-horizon problem– fitting your linear model to noise is exactly the same thing as generating a posterior sample.
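Spelled out as a sketch, in my notation: for Bayesian linear regression with prior theta ~ N(0, (sigma^2/lam) I) and observation noise N(0, sigma^2), solving a regularized least-squares problem on noise-perturbed targets, pulled toward a random prior point, yields an exact posterior sample:

    import numpy as np

    def posterior_sample_via_perturbation(X, y, sigma, lam, rng):
        n, d = X.shape
        y_tilde = y + rng.normal(0.0, sigma, size=n)            # perturb targets
        theta0 = rng.normal(0.0, sigma / np.sqrt(lam), size=d)  # prior draw
        A = X.T @ X + lam * np.eye(d)
        # Regularize toward the random prior point instead of toward zero.
        theta = np.linalg.solve(A, X.T @ y_tilde + lam * theta0)
        # theta ~ N(A^{-1} X^T y, sigma^2 A^{-1}): exactly the posterior.
        return theta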

But this view of things seems to generalize better to problems where you can’t do posterior inference, and where you’re applying complex nonlinear function approximation. There are variants of this

that do bootstrapping, rather than adding Gaussian noise. And empirically, those

work actually even better. And I’ll just close

by saying that there’s this whole mystery

around showing that this kind of

approximation technique works. And the real challenge

is sort of the thing I’ve been getting

from the audience, that this is no

longer truly Bayesian. You can’t have a posterior

over value functions. A value function doesn’t

even define a likelihood. So getting the math

to work out and going beyond sort of intuition has

been kind of deep and tricky. The current state of affairs is

that there is a simple sanity check, kind of a

polynomial worst-case bound for tabular MDPs. And we are currently working on showing– and I believe– that that extends to a notion of a linear MDP, much like what Mengdi presented earlier. But there’s kind of a lot of– and I’ll race to the end here– there are a lot of deep questions

surrounding understanding whether this Thompson

sampling scheme works with approximate inference– kind of, if you

don’t believe you have exact posterior

samples, but you think you’ve added,

instead, a lot of noise– toward understanding how this

works with bootstrapping. And toward overall,

kind of confirming that this grand principle I

advertised at the beginning, that you apply a

function approximator to purposefully noise-perturbed data– that it’s generating this kind of sophisticated exploration. And I think there are a

lot of deep questions that I hope the audience helps

with, to confirm why that is, and when that’s the

case, and when it fails. Thanks. [APPLAUSE] Yeah, Shawn? AUDIENCE: So that plot, where

you have UCRL and the Thompson sampling, that was with

[INAUDIBLE] prior– DANIEL RUSSO: [INAUDIBLE] AUDIENCE: So the disappointing

thing of Thompson sampling is that it’s not apparent to

me it’s any different than just UCB, but random over the

confidence intervals. So for example, do

you know that maybe– DANIEL RUSSO: Just so

I make sure I follow, what do you mean by– AUDIENCE: So look at– so, the UCRL, that was

an upper confidence algorithm. And you could just look at the confidence bounds and then sample randomly over the– DANIEL RUSSO: Like,

exponential weights over those? AUDIENCE: Yeah, not

exponential, just uniform over the intervals. And we know, just independently,

that that works super well. And the reason

that could be nice is, like, when you do

[INAUDIBLE], maybe this is what you were alluding to

[INAUDIBLE] technical results at the end. But maybe [INAUDIBLE] will

give you something poly-time for the tabular setting. But it might, like,

hopefully, skew things. Whereas if you were just random

over the confidence interval? Because that,

almost certainly, is poly-time in the tabular thing. And I suspect,

like those curves– like, if those curves weren’t

right on top of each other, then I think you’ve chosen your

MDP in a very peculiar manner to skew things. DANIEL RUSSO: I

think instead, I’ve chosen a good thing to

pick on, an algorithm that doesn’t work well. AUDIENCE: UCRL, like Alex’s

things were interesting because he was showing that

the interesting algorithm’s close to the one

that should work. And I’m concerned that this

UCRL variant you’re using looks much worse than if you

just sample it uniformly. Like, Kevin and Rob, they have

a bunch of these works showing that if you just sort of random

over the confidence intervals, that has– like, it’s much better

than UCB in practice. DANIEL RUSSO: I think the reason

is, to me, the way I explained that plot is that the confidence

bounds used in that algorithm are extremely optimistic. They’re much more

conservative than real. AUDIENCE: But the

point is, in practice, and talking to

practitioners, just uniform over exactly those confidence

intervals might close that gap. And I think that’s

an important– DANIEL RUSSO: Good. I can go test that. But I’m– if you look at how big

of those confidence intervals are, how optimistic you’re

being in that algorithm, I don’t believe uniform

would work, would close it. So to me, instead,

I think what’s happening is in a

lot of cases, it’s hard to come up with the

best confidence bounds. AUDIENCE: We can take that. I’d be curious if you tried

it, because I’m betting they will be very close. DANIEL RUSSO: OK,

let’s make a bet.