2019 ADSI Summer Workshop: Algorithmic Foundations of Learning and Control, Daniel Russo

OK, our next
speaker is Dan Russo from Columbia Business School. And he got his PhD at
Stanford, and was a postdoc at Microsoft
Research for a while. And then you were a faculty
at the Kellogg Business School as well before joining Columbia. So we look forward to your talk. After a couple years
of bouncing around. Thanks. All right, well, I’m
really excited to be here. I’m excited that the
last couple of talks have set the stage
well for this one. OK so I’m Daniel Russo. I’ll present some work
that’s joint, I guess– most of it– with collaborators
of mine from graduate school. So my PhD advisor and
labmates of mine– Ian Osband and Zheng Wen– as well as the most recent
work, which is solo authored. I guess, in the last
couple of years, I’ve been bouncing around
working on different things. And I’m returning kind of to
think about some, I think, pretty deep open questions
that were left open with some of my earlier work. Now one thing to sort of
set the stage here, sort of to have the right
orientation, I think in CS theory or
information theory, there’s a way we usually
proceed where we carefully define a problem class. It’s the first thing we do. And then we work very
hard to understand the fundamental
limits of how well you do in that problem class. And then potentially, even
kind of as a side note, you mention the algorithm that attains those limits. I’m working in the
opposite order here. I’m going to try to think of an
algorithmic design principle, and then work really
hard to understand when this thing works well, and
how far I can push the theory to understand that
design principle. And I think you should
think of the pinnacle. The absolute dream is
that you have something like bootstrapping or boosting
that has strong supporting theory, but is useful
to practitioners far beyond exactly
where the theory fits. So this video is kind of the
only catchy video I’ll have. And it does not compare
to Ben’s at all. But it’s going to capture
sort of the main challenge with kind of popular
RL algorithms that you download
and run on OpenAI Gym that we’re after fixing here. And so this is a twist on
maybe one of the first things you would download and
run on OpenAI Gym, which is balancing a cart-pole. You try to balance the pendulum
by exerting forces on the cart. And usually, you start
with a pole upright, and then a bunch of
simple algorithms make you think you know
how to solve this task. Now this is a twist
where the pole is going to start upside down. And we’re going to see how
well our algorithms work. All right, so in
particular, I’m going to contrast two
kinds of algorithms. One uses Epsilon-Greedy
exploration. So it follows its best
estimate of the value of the action at a
certain time, and then kind of randomly
dithers left or right. There’s some cost for
exerting energy here. So eventually, it learns to
just keep the pole still. So what happens
after 1,000 episodes? Now this is a kind of
similarly simple algorithm, performing what we’ll call deep
exploration instead, right? At first, it’s kind of
just going to flail around. But after 500
episodes, it’s learned to move very fast
to the left, stop, use the momentum to
swing the pole up and try to keep it balanced. And after 1,000, it’s really
kind of solved this task. now– AUDIENCE: So is there
representation in here? I mean, somehow, you have
the four state variables? DANIEL RUSSO: All right,
the feature representation, you have angle, angular velocity, momentum. There’s a few. AUDIENCE: Just
differential equations? DANIEL RUSSO: Yeah, there’s
something like that. I have to look at exactly
what was used in this case. But there’s linear
function approximation with kind of the right
physics features. I should check. So we’re after, at some point, bounds. But I’m really after the division between these two types of behavior. In the red, we have
the reward you’re getting under Epsilon-Greedy. And it just flatlines. You really don’t
solve the task at all. So the task is somehow
impossible with that method. And in green, we have this
deep exploration method. And it’s possible. How do we get this to work? Sort of one thing we’re
going to try to build on is that there’s been
just decades of research into doing value function
approximation, with kind of different textbooks
providing a nice treatment of a whole bunch of tricks– including [INAUDIBLE]—-
for doing this. And so I’m going
to sort of imagine that these things work well. And maybe that’s a
leap, in some cases. And I’m going to
ask, given that we have kind of a methodology
for doing that, how do we layer on top
methods for collecting data efficiently? So here’s the kind of template. And here’s what was
followed in that video. Well, first, we’re going
to randomly perturb the training data. And I’ll go back for
a second in a moment, and explain what I mean there. So at each episode, rather
than fit to the training data we’ve actually
observed, we’re going to fit to some randomly
perturbed version of that. Then we’re going to apply
an existing value function approximation method
to that perturbed data. And then we’re going to behave
greedily under that value function approximation. Now I do think that
this is an idea where, probably, if I proposed it a
long time ago in the controls community, before kind of the
intellectual progress that led to this idea, it
probably seems pretty crazy. To me, I think it
seems pretty crazy. And it’s sort of been a
journey to land at this– a journey involving many people. So in this particular
case, we just added Gaussian noise to the
observed rewards in the past, and injected kind of
some prior randomness to deal with the fact
that kind of at episode 1, there’s nothing to
induce exploration. There’s no observed
data to perturb. The off-the-shelf value
function estimation that’s now going to be applied is called least-squares value iteration, which kind of does a recursive least-squares estimation, similar to Bellman backups.
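(To make the template concrete, here is a minimal sketch of one fitting pass of this kind of randomized least-squares value iteration in Python. The feature map phi, the noise scale sigma, and the regularization weight lam are illustrative assumptions of the sketch, not the exact configuration behind the cart-pole video.)

```python
import numpy as np

def randomized_lsvi(data, phi, actions, d, H, sigma=1.0, lam=1.0, rng=None):
    """One fitting pass of the template above: perturb observed rewards with
    Gaussian noise, regularize toward a freshly drawn random prior point, and
    fit value functions backward in time with least squares.

    data[h]   -- list of (s, a, r, s_next) transitions observed at step h
    phi(s, a) -- d-dimensional feature map (an assumption of this sketch)
    """
    rng = rng or np.random.default_rng()
    theta = [np.zeros(d) for _ in range(H + 1)]   # theta[H]: terminal value is zero
    for h in reversed(range(H)):
        prior_theta = rng.normal(size=d)          # injected prior randomness
        if not data[h]:
            theta[h] = prior_theta                # no data yet: the prior draw drives exploration
            continue
        X, y = [], []
        for (s, a, r, s_next) in data[h]:
            noisy_r = r + sigma * rng.normal()    # randomly perturb the training data
            next_v = max(phi(s_next, b) @ theta[h + 1] for b in actions)  # Bellman backup
            X.append(phi(s, a))
            y.append(noisy_r + next_v)
        X, y = np.asarray(X), np.asarray(y)
        # Regularized least squares, pulled toward the random prior point
        theta[h] = np.linalg.solve(X.T @ X + lam * np.eye(d),
                                   X.T @ y + lam * prior_theta)
    return theta  # behave greedily: at step h in state s, pick argmax over a of phi(s, a) @ theta[h]
```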
OK so far? All right, the main insight– if you can call it this–
is that this– and it takes some work to kind of
start to formalize this insight. It’s saying that somehow,
this kind of scheme synthesizes a sophisticated
kind of deep exploration with value function-based
generalization schemes. So the outline for
the talk is to try to figure out what I’m talking
about in all these points in a slightly more precise way. So what is the kind of
sophisticated exploration that we’re after? In particular, how
do I kind of have a language for describing what’s
going on in the cart-pole video? What kind of
exploration was that? And how is it different? I’m going to talk
about sanity checks for this kind of exploration. And in particular, I’ll flash
Ben Recht’s linearization principle here, and
try to say, let’s have a linear benchmark that
is an exploration benchmark– unlike, I think, linear
quadratic control. Now I’m going to discuss, then,
where the algorithmic idea comes from. I think, out of
the blue, I don’t know how you would come up with
this idea, at least, to me. And I’ll tell you
sort of the path to understanding this is a
scalable version of Thompson sampling. At the end, I’ll
flash some progress in providing a rigorous
theory justifying what’s going on here, and leading
into the open questions that I’m kind of puzzling
about these days. So what is deep exploration? Well, here is a simple
kind of a thing you might see in classic RL work. We have a rat in a maze. They know that there
is cheese here. And they know kind
of half of the maze, and what it looks like. Part of the maze,
they’ve never been to. And they believe that
it either contains tasty rat cake, or this trap. Now you should
think of this kind of like the cart-pole example. There are going to be
many, many episodes. So my mouse is going to
repeatedly, each day, get started at
this maze, and then have many time steps at which
they try to navigate the maze. And then they restart. So if you’re going to
do this many times, it’s worthwhile,
probably, to figure out whether you’re going to
get the cake every day from here on out. So do you make a
beeline for the cheese? Or do you make a beeline
for the question mark? Now if you’re optimizing
just your immediate reward in this game, that’s some
kind of myopic optimization. We all know not to do that. Instead, we do
dynamic programming. I would say, dynamic programming
is like going for the cheese here. You look ahead many steps at
what reward you’d like to get. And you make a beeline
for that reward. Now myopic exploration
is the kind of thing that is being built in
with Epsilon-Greedy. At each step, you try to
think about maybe testing out a new action. Deep exploration is kind of
thinking ahead, understanding that you’d like to
check out the cake here. And in an intelligent
way, navigating the part of the
environment you already understand to get to the cake. Now as I’ll mention in the
next slide, in simple settings, there are– there’s kind of been decades
of research understanding how to do this. But it’s still kind
of an open question how to do it in a scalable way. There’s a reason, despite
all of the advances kind of understanding the sample complexity of RL, that people are not really using this stuff
when they download OpenAI Gym. You don’t think so, Chave? AUDIENCE: I feel
like it’s awesome. DANIEL RUSSO: I want
to mention that– OK– in addition to
this being something you see with a
mouse in the maze, I think it is a ubiquitous
phenomenon in real life. It’s a common thing that if
you’re following one strategy and you deviated from it for
one kind of time instant, that’s not enough to figure out
whether alternative strategies are effective. And to come up with examples
of that in your mind, you just have to think
of examples of strategies that only work if
they’re employed for an extended period of time. People talk about
applying RL in medicine. Well, in medicine, most
alternative treatments would only be useful if you
continued sticking to them. You can’t try
chemotherapy for one day and revert back and
expect that experiment to have somehow taught you that
chemotherapy is a good idea. It won’t. It’ll teach you it’s a bad idea. And I think you find this
all throughout the marketing literature and other
kinds of places. Now there’s a large
theory, kind of. This is, in some
ways, just a branding of the idea of what underlies
the efficient RL literature. There’s been a long literature. I think, kind of
the ideas were first probably pioneered in this
paper by Kearns and Singh, that understands the importance
of looking ahead and planning to learn something many time steps into the future. And usually, this is formalized. Kind of a check for this is
that expected regret is somehow polynomial in the number of
parameters you’d like to learn. Usually here, this is something
like a polynomial in the number of states and actions. Whereas these kinds of myopic exploration schemes, in the worst case, are exponential. But I think it’s fair to
say the focus has really been on tabular models. Now we have a lot
of people trying to extend this stuff to
richer parameterized models. And I think that this
talk is about one idea for a paradigm
that might lead to a principled,
scalable approach. Done with that part. So now I want to discuss
sort of a linear sanity check for deep exploration. This is my favorite
toy class of problems for sanity checking RL methods. But it’s not one I hear
maybe all that much about in this kind of event. So we have here, Ben’s
linearization principle, which I totally agree
with in some sense– that if our algorithms, when we
check them on a linear model, are doing crazy things,
that’s something we should worry about. Now I don’t think solving the
linear task is somehow proof that it’s the right algorithm. But failing miserably
on it should at least raise concerns about
the types of tasks it’s going to fail on
in complex environments. I’d say, linear
quadratic control really crystallizes
issues with kind of feedback and stability. It doesn’t really crystallize
issues with exploration. But I’ll argue that
the shortest path problem might be the simplest
case where methods start really breaking. It’s also got some
benefits that it’s also a special case of the online
linear optimization literature. So there’s strong benchmarks. So here’s the problem. Each day, I commute
from home to work. I start at V1. I want to get to V12. And along the routes are these unknown theta parameters: the edge weights encode the average time to travel along a particular edge. Now one thing you’ll
find is that if you take certain popular
algorithms and specialize them to this case, they do
actually do crazy things. So I don’t think
you’ll hear anything like this in this room
over the next few days. But the kind of
thing that I find confusing when I go
through the literature is there are lots
of people claiming to generate exploration by
using stochastic policies– meaning, that often, I
have something explicitly like an entropy constraint
that says that at every state, I’m going to make sure that
my action is sufficiently randomized. Something I want to highlight
is that if you have constraints like that in your
policy, you provably are going to fall flat on your
face in some of these problems. The reason is kind of captured
well in this simple example. Suppose that I have a
long chain like this. I start at this top node. The blue denotes these
kind of short backroads. And there’s a long, long–
and I’m going to kind of scale the graph– a long sequence of
back roads leading to one of two highways,
the green or the orange. I start out thinking
orange is probably fastest. So the greedy policy just
goes straight for orange. In fact, it goes down and over. Now a human being would look
ahead and navigate forward. Someday, they’d try
out the green highway. A simple kind of observation
is that to reach the green, you have to go right, and then
go right, and then go right, and then continue to go right. If you are randomizing
your actions, the odds that you go right
consecutively that many times without ever going down
are decaying exponentially. So the chance that you randomize has to be almost– has to be as small as 1 over the number of edges in order to ever reach the end.
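(A back-of-the-envelope version of that point, under the simplifying assumption of a chain that needs $H$ consecutive deliberate moves to reach the highways and an agent that dithers with probability $\varepsilon$ at each step:
\[
\Pr\bigl[\text{no random deviation for } H \text{ consecutive steps}\bigr] = (1-\varepsilon)^{H} \approx e^{-\varepsilon H},
\]
which is non-negligible only when $\varepsilon \lesssim 1/H$. Conversely, if reaching the unexplored region required $H$ consecutive non-greedy moves, the chance of getting there by dithering alone would be at most $(\varepsilon/2)^{H}$, exponentially small in the depth for any fixed $\varepsilon$.)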
I’m sure this makes sense, but it’s one simple sanity check that already is enough to break a fair amount of what you’re seeing proposed. Questions so far? AUDIENCE: So, just so I
can understand the example and sort of generalize
it in my head, if I expect the trajectory
of the optimal policy to take some number of steps, I don’t want my epsilon to be much greater than 1 over that number of steps. DANIEL RUSSO: I think that yes– AUDIENCE: Did I get it right? DANIEL RUSSO: I think that’s
about right, is it, somehow– AUDIENCE: [INAUDIBLE] AUDIENCE: So
somehow, if you need to do things that are kind of
depth first, I can promise, it really involves some
depth-first searching. You don’t want to be
having these things that are causing constant branching. DANIEL RUSSO: I think
if you have some depth, then your randomization
has to be pretty small, to sort of have even a hope. Even if you– even if the
part of your algorithm that is planning doing the non-random
part is very intelligent, you need the random part to be
very small for deep problems. OK, so let’s start by thinking
about some algorithm template that we might use
on this problem, and then think
about how it extends to more complex problems. So I’m going to
build an algorithm by thinking about a Bayesian
version of the online shortest path, where I don’t know
the edge weights encoded by the vector theta. But I start with prior
beliefs over them. Now as observations
come in, I sort of observe these noisy travel times
on the edges I have traversed. And I’m going to update
my posterior beliefs. I, at least, am going
to require the ability to generate some kind of
accurate posterior sample. So the way I think about
this is, let’s imagine I know how to do inference
and ask, given our own ability to do inference, how should
I select actions sequentially across these episodes? All right, so just to build
toward this algorithm, let’s first think about Greedy. Greedy, what I do
is at each episode, I compute the posterior mean
travel time along each edge. I compute the shortest path
under those edge weights. And I follow that path and
update my posterior beliefs. The issue with this,
as most people know, is that you can get stuck. You continue playing this
path, learning nothing, and you don’t try out the other possible paths. Now another thing to note is that there are exponentially many paths in many of these problems. So anything that requires
kind of enumerating the set of all paths and
somehow crossing them off is not going to work
well computationally. So this is one
simple idea that’s then generated a
lot of interest, and it’s Thompson sampling. I am going to, in each
episode, build a graph by sampling edge weights from
the posterior distribution, given my observations
up to this time. Now naturally, what
you should think happens is that
edges that you’ve traversed many,
many times, you’re pretty much filling in
the posterior mean there. There’s very low variance
in these samples. But edges that you
haven’t visited much, there’s a lot of variance
in what you sample. Now what I’m going to do
is follow the shortest path under the sampled edge weights. And I update my beliefs. So that’s one episode. The next episode, I start again. I have new posterior beliefs. I draw a new sample
from those beliefs. And this sample’s
different for two reasons. One is that I have
another observation, and the other is that
I’m randomizing again. Now what I do is I compute
and follow the shortest path under that new sample. Did I do that properly? No? That’s likely to be something different, taking you through a different part of the graph.
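(A minimal sketch of that loop, assuming independent Gaussian posteriors on the edge weights purely for illustration; solve_shortest_path and observe_travel_times are hypothetical stand-ins for the optimization routine and the environment, not names from the talk.)

```python
import numpy as np

def thompson_sampling_shortest_path(edges, solve_shortest_path, observe_travel_times,
                                    n_episodes, prior_mean=1.0, prior_var=1.0,
                                    noise_var=1.0, rng=None):
    """Sketch of Thompson sampling for the online shortest path problem."""
    rng = rng or np.random.default_rng()
    mean = {e: prior_mean for e in edges}   # posterior mean of each edge weight
    var = {e: prior_var for e in edges}     # posterior variance of each edge weight
    for _ in range(n_episodes):
        # Inference step: sample a graph from the current posterior.
        sampled = {e: rng.normal(mean[e], np.sqrt(var[e])) for e in edges}
        # Optimization step: find the shortest path under the sampled weights.
        path = solve_shortest_path(sampled)
        # Traverse the path and observe a noisy travel time on each edge.
        for e, y in observe_travel_times(path):
            precision = 1.0 / var[e] + 1.0 / noise_var   # conjugate Gaussian update
            mean[e] = (mean[e] / var[e] + y / noise_var) / precision
            var[e] = 1.0 / precision
    return mean, var
```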
Yeah, question? AUDIENCE: What is your overall goal? Is it to identify
the shortest path? DANIEL RUSSO: It’s not. So here, I would say that
my goal is something more like cumulative regret. So I’m trying to do well
in an online setting. But often, you can flip
these things around. And if you have an algorithm
with low enough regret, then only some polynomial
number of samples are required to identify a near optimal path.
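(One standard way to make that flip precise, offered here as a hedged aside: if the expected cumulative regret satisfies $\mathbb{E}[\mathrm{Regret}(T)] \le C\sqrt{T}$, then the path played in an episode chosen uniformly at random from the first $T$ has expected suboptimality at most
\[
\frac{\mathbb{E}[\mathrm{Regret}(T)]}{T} \le \frac{C}{\sqrt{T}},
\]
which drops below any target $\epsilon$ once $T \ge (C/\epsilon)^2$, a polynomial number of episodes.)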
Now I’ve been talking about that stuff for a long time. But I realized I
never really went through the next step of
crystallizing what I think is so nice about that strategy. And from an intellectual
perspective, at least– and I think also from
a practical one– it’s this kind of nice
separation that a priori, it wasn’t obvious to me that
this should be possible. So on the one hand,
we’re doing inference. At least, we’re
generating a sample from a model in a manner
that kind of reflects the degree of uncertainty. We’re sampling edge
weights in a way that reflects our uncertainty
about those edge weights. And there’s just an
enormous set of methods from the statistics literature
for doing that kind of thing. Now on the other hand, we’re
finding a shortest path. And an enormous
amount of effort has gone into developing different
methods for doing that. So we have this kind
of optimization box. And you can sort of imagine that
there’s one team at Microsoft with experts in optimization. And there’s one
team with experts in drawing approximate posterior
samples or doing bootstrapping, or something. And they don’t even
know each other. But they each build their own
module into this algorithm. And Thompson
sampling just somehow makes the two work together. And if you think about
other kinds of strategies you might build
for this, usually, to get them to work,
you need to understand both inference and optimization,
and get them to work together in a more intelligent way. So does it work? Well, you can sanity check
this through simulation. So here’s a version
of this shortest path problem on a binomial bridge. So this thing extended to
have 20 stages rather than 6, so 185,000 paths. And you can at least check
that Thompson sampling does much better than kind of any
version of Epsilon-Greedy however you tune the epsilon. Yeah? AUDIENCE: I’m a bit
confused at the [INAUDIBLE].. Because, have you
seen some coupling? Because, essentially–
let’s say, if I think of all the edges having the same prior, you need to visit each of them at some point, right? You need to find
the shortest path. But if you want to find the
shortest path from x to d, you need to visit
a lot of edges. DANIEL RUSSO: You
need to visit– to find the true
exact shortest path, you have to visit
every edge, I guess. I’m not clear how many times
you have to visit every edge. AUDIENCE: I’d say once. DANIEL RUSSO: Once? Yes. AUDIENCE: That makes sense. At least once. DANIEL RUSSO: At
least once, yeah. AUDIENCE: So you’re trying
to get an algorithm which is better than this [INAUDIBLE]? Would you have had an
additional assumption? DANIEL RUSSO: No. No, I’m not– I’m not. Instead, I’m thinking about
one broad algorithmic idea that at least would
manage to do as well as visiting every edge once. So I think the issue, I
guess, with myopic exploration algorithms– let’s say
Epsilon-Greedy in this– is that it does not quickly
visit every edge once. It would take
exponentially many steps to even visit each edge once. AUDIENCE: [INAUDIBLE]. So as long as you
didn’t [INAUDIBLE].. AUDIENCE: Because counting
something is basic [INAUDIBLE],, it’s just random
over the [INAUDIBLE].. DANIEL RUSSO: I think
it’s a method that gets– sort of gets the same benefits as RMAX. Good. So I’m– yeah, I think
that the view here instead, is that this is one algorithmic
idea that works much more broadly. It has many benefits, in that
the two things you plugged in are an inference engine
and an optimization engine. If you specialize it to the
case where RMAX works, it seems to do about
the same thing. Yeah? AUDIENCE: So you are saying
you didn’t believe in this? That this separation
is perfectly complete, and it just works like that? Or, are there some cases when
this is not quite working that well? So if there’s a
separation, like we’re talking about [INAUDIBLE]. And optimization is
doing it all the same. And you [INAUDIBLE]. You only have to worry
about each and every box individually. And then Thompson sampling
[INAUDIBLE] randomized. AUDIENCE: The
Thompson sampling is on the optimization problems. AUDIENCE: It’s like, they’re
nicely working together. And [INAUDIBLE]. DANIEL RUSSO: I
actually do think that for a very broad class
of problems, this works. But it’s not all. In some ways, then you
need a lot of thought to understand for what class
of online learning problems does this explore the right way,
and for what class does it not? AUDIENCE: It seems to me
that this idea is very highly coupled with [INAUDIBLE]. DANIEL RUSSO: Great. AUDIENCE: And sometimes,
optimism doesn’t work. DANIEL RUSSO: Great. And in fact, I think that
probably, maybe the theme of my PhD thesis was trying
to say, when optimism works, Thompson sampling works. And there’s a big
class of problems that we should worry about, that
have a feature where optimism doesn’t work. And then you should
do something else. Yeah? AUDIENCE: When you
say Thompson sampling, you mean a Bayesian model
in the realizable case. So you’re using the Bayesian
model as the right model? DANIEL RUSSO: I’ll
end with that being one of these open questions. And– AUDIENCE: [INAUDIBLE] DANIEL RUSSO: Thank you. OK, now, does it work? Well, there’s also
some intuition here. So there’s simulation that
says something worked. There’s intuition
that, I guess, it’ll do something like RMAX, for people who know what that means. There’s intuition that it ought
to solve this kind of problem. Now for this, let’s note that
there’s equivalent perspectives on what this algorithm does. One perspective is that it
samples from a posterior over models, like samples
of vector of edge weights, and optimizes. That’s sort of the perspective
that a practitioner likes to think about. I think for proving theorems,
the second perspective is probably the more useful one. What this algorithm’s
done, in a sense, is it’s written down this
exponentially long list, encoding all possible
paths you could follow. It has associated with
each one a number, so sort of defining a probability
vector over this list. And each entry of that vector is the posterior probability that it is the shortest path.
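(In symbols, this second perspective is just probability matching: writing $A^*$ for the optimal path, $A_t$ for the path played in episode $t$, and $\mathcal{F}_t$ for the history so far,
\[
\Pr\bigl(A_t = a \mid \mathcal{F}_t\bigr) = \Pr\bigl(A^* = a \mid \mathcal{F}_t\bigr) \quad \text{for every path } a,
\]
so the algorithm plays each path exactly as often as it currently believes that path is the shortest one.)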
Now if you think about this example I constructed, I was worried about
a case where there’s many, many paths
through this graph. And I think there’s
a decent chance that the shortest one goes along
the top and then follows green. How do I reason that
Thompson sampling has to follow that particular path? Well, if there’s a decent chance
that the shortest path includes the green edge, there’s
got to be a decent chance that the algorithm
samples the green edge. That’s one kind of way, just in words, to work through in your mind why this algorithm cannot get stuck in those deep
exploration traps. All right, so does it work? Some theory. And to my mind, the
most satisfying theory for this algorithm right
now is in the case where posterior inference works well. There’s gaps between upper and
lower bounds in some worst case senses. All right, so what kind
of thing can you do? Well, this is an information
theoretic regret bound. It says that the expected
regret under graphs drawn from the true prior scales at worst, like this.
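(The slide itself isn’t visible in the transcript; as a reconstruction, the information-theoretic analysis of Thompson sampling gives a bound of roughly this shape for linear problems like this one, with $A^*$ the optimal path, $H(A^*)$ its entropy under the prior, $d$ the number of edges, and $T$ the number of episodes:
\[
\mathbb{E}\bigl[\mathrm{Regret}(T)\bigr] \le \sqrt{\tfrac{1}{2}\, d\, H(A^*)\, T}, \qquad H(A^*) \le \log(\#\text{paths}).
\]
Treat the constants as indicative rather than as the exact expression on the slide.)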
Now the entropy of the– this here is the entropy under the prior
of the true shortest path, reflecting that if you start
out with a strong belief about which path is shortest,
you should do better. Now that entropy term is
bounded by the logarithm of the number of paths. And the whole thing
on the right hand side scales with the number of edges. The number of edges
is much, much smaller than the number of
paths in the problem. So the fact that this scales
linearly in the number of edges is, in some ways,
proof that it’s doing this kind of deep exploration. Now personally, I think
the information theory is aesthetically beautiful. There’s been some work
extending this kind of approach to many different
kinds of models, and I think, now covering
most of the most popular online learning settings. And I’ll flag here,
one big thing in red that I hope someone
will help me crack, which is, I think we need
new ideas to understand an information theoretic kind
of bound like this for tabular MDPS. Yeah? AUDIENCE: So the entropy of
prior over the shortest path. So does it equal to smaller
than the entropy over the prior over model classes, over models? This is just a case
of shortest paths, not stochastic shortest paths. DANIEL RUSSO: Let
me think about that. AUDIENCE: So your
stochastic shortest paths would be– would model
the prior over modelling to have higher entropy. DANIEL RUSSO: I should
think about that for a second offline. But in a lot of cases
that I think of, the entropy there being over
policies instead of over– or, say, over actions
instead of over models, does mean twisting
the algorithm. But I have to think about
the shortest path one when I’m not on video. AUDIENCE: Yeah, this is
over the shortest paths. DANIEL RUSSO: Let
me think about it. Let me think about it offline. Yeah? AUDIENCE: Yeah,
I’m trying to make sense of the regret definition,
the Bayesian regret. In a way, you define the
Bayesian [INAUDIBLE] algorithm. And there, you also use it to
define the Bayesian regret. Because of that, you can
plot a lot of things. And look how nice they look. But is there a little
bit of mutual dependence between the input of the
algorithm and the performance metric that you
tried to include? DANIEL RUSSO: So,
yeah, I think so. Now to me, the dependence– if you want to be
totally worst case, then you can only justify
using Thompson sampling with a very flat prior,
which, to me, I don’t think you always should do. So I do think there’s room for
having a theory that confirms sort of that if you
get the prior right, you get benefit
from that as well. AUDIENCE: You can also
just change your analysis to have a regret bound based on the outcome that’s there, without needing a Bayesian setting– like a bound that says, if your true answer is closer to your [INAUDIBLE]
you’re better off. It’s more just an
analysis aspect. DANIEL RUSSO: So
far, that’s very hard to do in problems
where you have, say, like an exponential number
of essentially actions here. So in principle, you
may be able to do that. But it’s sort of an open
question to get it to work. The way it works in sort of,
like, classification setting. So I think that’s true. I think what you should
think of this as saying is, if you believe the Bayesian
inference gives you some sort of
accurate estimation, gives you an accurate
kind of inference, things work very well here. And I’ll be going back to
sort of the big challenge, and what I’m about
to present is we’re moving away from that world. So I’m going to try
to rush through here. Let’s take that
shortest path thing and move it to episodic
reinforcement learning. In the first steps, nothing’s
really going to change. Now I think of my agent, who
is going to repeatedly interact with some environment for
some number of time steps. They apply a policy. They observe an outcome,
which is this trajectory. And they associate
with that, some reward, which is the sum of per time
step rewards over that horizon. This is my bad 1980s stick
figure drawing of this. I think it highlights,
maybe, that this is very much like some kind
of online learning or bandit problem where the action
space is the set of policies. Now if you want to do, say,
model-based tabular learning– where you begin with a prior
over the unknown transition kernel, the unknown
reward model– everything kind of
extends gracefully. So what is Thompson
sampling then? Sample an MDP from the posterior
and apply the optimal policy for that MDP. You do that throughout the episode, you reset.
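(A minimal sketch of that sample-then-solve loop for a tabular MDP, assuming a Dirichlet prior over transitions and a simple Gaussian model for mean rewards; these modeling choices are illustrative, not necessarily the ones behind the plots discussed here.)

```python
import numpy as np

def psrl_episode(trans_counts, reward_sums, reward_counts, H, rng=None):
    """One episode of posterior sampling for RL: sample an MDP from the
    posterior, solve it by backward induction, and return the greedy policy.

    trans_counts[s, a, s']      -- observed transition counts
    reward_sums, reward_counts  -- per (s, a) sufficient statistics for rewards
    """
    rng = rng or np.random.default_rng()
    S, A, _ = trans_counts.shape
    # Sample transition probabilities and mean rewards from the posterior.
    P = np.array([[rng.dirichlet(trans_counts[s, a] + 1.0) for a in range(A)]
                  for s in range(S)])                                  # P[s, a, s']
    R = rng.normal(reward_sums / np.maximum(reward_counts, 1.0),
                   1.0 / np.sqrt(reward_counts + 1.0))                 # R[s, a]
    # Solve the sampled MDP over horizon H by backward induction.
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = R + P @ V                  # Q[s, a] = R[s, a] + sum over s' of P[s, a, s'] V[s']
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy   # follow policy[h, s] for the episode, then update the counts
```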
Now you can kind of do some math, get polynomial regret bounds. Again, ensure there’s some kind
of deep exploration going on if you do this. Emphasize that one reason
to be excited about this is kind of beyond
worst case analyses. Often, when you
simulate this stuff, it works markedly better. I’m kind of picking
on a straw man here because this is
an algorithm that’s really designed just
as a proof of concept, just to make regret
analyses work, all right? Well, we kind of
copied their proof to get a bound for
Thompson sampling. You can see that
the bound affects the algorithm in this case. And regret is enormous
relative to what you get from an algorithm
where the analysis is really just separate from
the algorithm design. Now this has the same kind
of separation property, where you’re saying, if
I can optimize an MDP and I can do inference
over the model for the MDP, things seem to work. The issue is that for realistic
MDPs, inference is very hard. It’s very hard to write down a
model class for the transition dynamics and do
appropriate inference. And optimization
is also very hard– the curse of dimensionality. So how do we develop scalable
approximations to this idea? Well, we can think of this
Thompson sampling, or posterior sampling algorithm, as sampling
from a posterior over MDPs, and then optimizing. But one thing we could
also think of it as doing, as well as it optimizes,
it recursively generates value
function estimates. So each sample of MDP has an
associated value function. So you could think
of this as sampling from a distribution
over value functions, and then behaving greedily with
respect to that distribution. If you take that
view of things, it’s natural to try to generalize
across states and actions by using a parameterized
approximation to the value function itself. So that’s what we come up
with as a template here. We take some kind of
deterministic learning algorithm, like DQN or
least-squares value iteration. And usually what this does,
it regularizes towards zero. And it takes in a
sequence of data and spits out an estimate
for the parameter of a parameterized value
function approximation. The main insight
here is that you can generate meaningful
approximations to kind of a posterior sample
over value functions by applying the same
algorithm with noise, noise injected
into, in this case, the rewards you have observed. And much like when you
start a Bayesian agent and they have no data,
they’re randomizing a lot. We should have some
kind of prior randomness in the algorithm. And that’s why we regularize
the value function to some prior point. I’m sort of racing here, to make
sure I get to the punchline. So let me point out that
it’s not such a crazy idea. If you take H equals 1– so you have a one
horizon problem– fitting your linear
model to noise is exactly the same thing as generating a posterior sample.
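(A sketch of why that equivalence holds, under conjugate-Gaussian assumptions that are this aside’s and not stated explicitly in the talk: with prior $\theta \sim N(0, \lambda^{-1} I)$ and data $y = X\theta + w$, $w \sim N(0, \sigma^2 I)$, the posterior is $N(\mu, \Sigma)$ with $\Sigma = (X^\top X/\sigma^2 + \lambda I)^{-1}$ and $\mu = \Sigma X^\top y/\sigma^2$. Fitting the linear model to freshly noised data,
\[
\hat{\theta} = \arg\min_{\theta}\; \tfrac{1}{\sigma^2}\bigl\| (y + \tilde{w}) - X\theta \bigr\|^2 + \lambda \bigl\| \theta - \tilde{\theta}_0 \bigr\|^2
= \Sigma\Bigl( \tfrac{1}{\sigma^2} X^\top (y + \tilde{w}) + \lambda\, \tilde{\theta}_0 \Bigr),
\]
with independent draws $\tilde{w} \sim N(0, \sigma^2 I)$ and $\tilde{\theta}_0 \sim N(0, \lambda^{-1} I)$, yields a Gaussian $\hat{\theta}$ with mean $\mu$ and covariance $\Sigma\,(X^\top X/\sigma^2 + \lambda I)\,\Sigma = \Sigma$, i.e. exactly a draw from the posterior.)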
But this view of things seems to generalize better to problems where you can’t do
posterior inference. And you sort of are applying
complex nonlinear function approximation. There are variants of this that do bootstrapping, rather than adding Gaussian noise. And empirically, those work actually even better. And I’ll just close
by saying that there’s this whole mystery
around showing that this kind of
approximation technique works. And the real challenge
is sort of the thing I’ve been getting
from the audience, that this is no
longer truly Bayesian. You can’t have a posterior
over value functions. A value function doesn’t
even define a likelihood. So getting the math
to work out and going beyond sort of intuition has
been kind of deep and tricky. The current state of affairs is
that there is a simple sanity check, kind of a
polynomial worst-case bound for tabular MDPs. And we are currently working. And I believe that that extends
to a notion of a linear MDP, much like what Mengdi
presented earlier. But there’s kind of a lot of– and I’ll race to the end here– there’s a lot of deep questions
surrounding understanding whether this Thompson
sampling scheme works with approximate inference– kind of, if you
don’t believe you have exact posterior
samples, but you think you’ve added,
instead, a lot of noise– toward understanding how this
works with bootstrapping. And toward overall,
kind of confirming that this grand principle I
advertised at the beginning– that you apply a function approximator to purposefully noise-perturbed data– is generating this kind of sophisticated exploration. And I think there’s a
lot of deep questions that I hope the audience helps
with, to confirm why that is, and when that’s the
case, and when it fails. Thanks. [APPLAUSE] Yeah, Shawn? AUDIENCE: So that plot, where
you have UCRL and the Thompson sampling, that was with
[INAUDIBLE] prior– DANIEL RUSSO: [INAUDIBLE] AUDIENCE: So the disappointing
thing of Thompson sampling is that it’s not apparent to
me it’s any different than just UCB, but random over the
confidence intervals. So for example, do
you know that maybe– DANIEL RUSSO: Just so
I make sure I follow, what do you mean by– AUDIENCE: So look at– so, the UCRL, that was
an upper confidence algorithm. And you could just look at the confidence bounds and then sample randomly over the– DANIEL RUSSO: Like,
exponential weights over those? AUDIENCE: Yeah, not
exponential, just uniform over the intervals. And we know, just independently,
that that works super well. And the reason
that could be nice is, like, when you do
[INAUDIBLE],, maybe this is what you were alluding to
[INAUDIBLE] technical results at the end. But maybe [INAUDIBLE] will
give you something poly-time for the tabular setting. But it might, like,
hopefully, skew things. Whereas if you were just random
over the confidence interval? Because that,
almost certainly, is poly-time in the tabular thing. And I suspect,
like those curves– like, if those curves weren’t
right on top of each other, then I think you’ve chosen your
MDP in a very peculiar manner to skew things. DANIEL RUSSO: I
think instead, I’ve chosen a good thing to
pick on, an algorithm that doesn’t work well. AUDIENCE: UCRL, like Alex’s
things were interesting because he was showing that
the interesting algorithm’s close to the one
that should work. And I’m concerned that this
UCRL variant you’re using looks much worse than if you
just sample it uniformly. Like, Kevin and Rob, they have
a bunch of these works showing that if you just sort of random
over the confidence intervals, that has– like, it’s much better
than UCB in practice. DANIEL RUSSO: I think the reason
is, to me, the way I explained that plot is that the confidence
bounds used in that algorithm are extremely optimistic. They’re much more
conservative than the real uncertainty. AUDIENCE: But the
point is, in practice, and talking to
practitioners, just uniform over exactly those confidence
intervals might close that gap. And I think that’s
an important– DANIEL RUSSO: Good. I can go test that. But I’m– if you look at how big
of those confidence intervals are, how optimistic you’re
being in that algorithm, I don’t believe uniform
would work, would close it. So to me, instead,
I think what’s happening is in a
lot of cases, it’s hard to come up with the
best confidence balance. AUDIENCE: We can take that. I’d be curious if you tried
it, because I’m betting they will be very close. DANIEL RUSSO: OK,
let’s make a bet.
