Birge J.R., Louveaux F. Introduction to Stochastic Programming


384 8 Evaluating and Approximating Expectations

This result also extends directly to nonconvex functions, as we mentioned earlier.

In terms of stochastic programming computations, the most useful result may be

(c), which implies convergence of optima for approximating distributions. Actually

achieving optimality for each approximation may be time-consuming. One might,

therefore, be interested in achieving convergence of subdifferentials. This may allow

suboptimization for each approximating distribution.

In the case of closed convexity, Wets showed in Theorem 3 of Wets [1980a] that if $g, g^\nu : \Re^n \to \Re \cup \{+\infty\}$, $\nu = 1, 2, \ldots$, are closed convex functions and $\{g^\nu\}$ epi-converges to $g$, then the graphs of the subdifferentials of $g^\nu$ converge to the graph of the subdifferential of $g$, i.e., for any convergent sequence $\{(x^\nu, u^\nu) : u^\nu \in \partial g^\nu(x^\nu)\}$ with $(x, u)$ as its limit, one has $u \in \partial g(x)$; for any $(x, u)$ with $u \in \partial g(x)$, there exists at least one such sequence $\{(x^\nu, u^\nu) : u^\nu \in \partial g^\nu(x^\nu)\}$ converging to it.

However, in general, it is not true that

$$\partial g(x) = \lim_{\nu \to \infty} \partial g^\nu(x) \tag{6.6}$$

even if $x \in \operatorname{int}(\operatorname{dom}(g))$ (see Exercise 2). However, if $g$ is G-differentiable at $x$, (6.6) is true. This is the following result from Birge and Qi [1995].

Theorem 13. Suppose that $g, g^\nu : \Re^n \to \Re \cup \{+\infty\}$, $\nu = 1, 2, \ldots$, are closed convex functions and $\{g^\nu\}$ epi-converges to $g$. Suppose further that $g$ is G-differentiable at $x$. Then

$$\nabla g(x) = \lim_{\nu \to \infty} \partial g^\nu(x). \tag{6.7}$$

In fact, for any $x \in \operatorname{int}(\operatorname{dom}(g))$, there exists $\nu_x$ such that for any $\nu \ge \nu_x$, $\partial g^\nu(x)$ is nonempty, and $\{\partial g^\nu(x) : \nu \ge \nu_x\}$ is bounded. Thus, for any $x \in \operatorname{int}(\operatorname{dom}(g))$, the right-hand side of (6.7) is nonempty and always contained in the left-hand side of (6.7). But equality does not necessarily hold by our example. We also state the following result (Corollary 2.5 of Birge and Qi [1995]).

Corollary 14. Suppose the conditions of Theorem 12 hold and that $g(\cdot, \xi)$ is convex for each $\xi \in \Xi$. Then for $D = \operatorname{dom}(E(g(\cdot)))$, in addition to results (a)–(c) in Theorem 12,

(d) there is a Lebesgue zero-measure set $D_1 \subseteq D$ such that $E(g(x))$ is G-differentiable on $D \setminus D_1$, $E(g(x))$ is not G-differentiable on $D_1$, and for each $x \in D \setminus D_1$,

$$\lim_{\nu \to \infty} \partial E^\nu(g(x)) = \nabla E(g(x));$$

(e) for each $x \in D$,

$$\partial E(g(x)) = \{ \lim_{\nu \to \infty} u^\nu : u^\nu \in \partial E^\nu(g(x^\nu)),\ x^\nu \to x \}.$$

8.6 General Convergence Properties 385

Proof: By closed convexity of $g(\cdot, \xi)$, $E^\nu(g(x))$ are also closed convex for all $\nu$. Now (d) follows from Theorem 13 and the differentiability property of convex functions, and (e) follows from Theorem 3 of Wets [1980a].

Many other results are possible using Theorem 13 and results on epi-convergence. As an example, we consider convergence of sampled problem minima following King and Wets [1991]. Let $P^\nu$ be an empirical measure derived from an independent series of random observations $\{\xi^1, \ldots, \xi^\nu\}$, each with common distribution $P$. Then, for all $x$,

$$E^\nu(g(x)) = \frac{1}{\nu} \sum_{i=1}^{\nu} g(x, \xi^i).$$
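The empirical expectation above is just a sample mean, which the following minimal sketch illustrates. The integrand `g(x, xi) = |x - xi|` and the uniform distribution on $[0, 1]$ are assumptions chosen only because the exact expectation at $x = 0.5$ is known to be $0.25$; they do not come from the text.

```python
import random


def empirical_expectation(g, x, samples):
    """Return E^nu(g(x)) = (1/nu) * sum_i g(x, xi^i) for the given observations."""
    return sum(g(x, xi) for xi in samples) / len(samples)


def g(x, xi):
    # Hypothetical convex integrand, used only for illustration.
    return abs(x - xi)


rng = random.Random(0)
samples = [rng.random() for _ in range(20000)]  # xi^i drawn from Uniform[0, 1]

estimate = empirical_expectation(g, 0.5, samples)
exact = 0.25  # E|0.5 - xi| for xi uniform on [0, 1]
print(estimate, exact)
```

As the number of observations grows, the estimate should settle near the exact value, which is the pointwise behavior underlying the epi-convergence results discussed here.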

Let $(\Xi, \mathcal{A}, P)$ be a probability space completed with respect to $P$. A closed-valued multifunction $G$ mapping $\Xi$ to $\Re^n$ is called measurable if for all closed subsets $C \subseteq \Re^n$, one has

$$G^{-1}(C) := \{ \xi \in \Xi : G(\xi) \cap C \ne \emptyset \} \in \mathcal{A}.$$

In the following, "with probability one" refers to the sampling probability measure on $\{\xi^1, \ldots, \xi^\nu, \ldots\}$ that is consistent with $P$ (see King and Wets [1991] for details). Applying Theorem 2.3 of King and Wets [1991] and Corollary 14, we have the following.

Corollary 15. Suppose for each $\xi \in \Xi$, $g(\cdot, \xi)$ is closed convex and the epigraphical multifunction $\xi \to \operatorname{epi} g(\cdot, \xi)$ is measurable. Let $E^\nu(g(x))$ be calculated by (6.2). If there exists $\bar{x} \in \operatorname{dom}(E^\nu(g(x)))$ and a measurable selection $\bar{u}(\xi) \in \partial g(\bar{x}, \xi)$ with $\int \|\bar{u}(\xi)\| P(d\xi)$ finite, then the conclusions of Corollary 14 hold with probability one.

King and Wets [1991] applied their results to the two-stage stochastic program with fixed recourse repeated here as

$$\begin{aligned} \min\ & c^T x + \int Q(x, \xi) P(d\xi) \\ \text{s. t. } & Ax = b, \\ & x \ge 0, \end{aligned} \tag{6.8}$$

where $x \in \Re^{n_1}$ and

$$Q(x, \xi) = \inf \{ q(\xi)^T y \mid Wy = h(\xi) - T(\xi) x,\ y \in \Re^{n_2}_+ \}. \tag{6.9}$$

It is a fixed recourse problem because $W$ is deterministic. Combining their Theorem 3.1 with our Corollary 14, we have the following.

Corollary 16. Suppose that the stochastic program (6.8) has fixed recourse (6.9) and that for all $i$, $j$, $k$, the random variables $q_i h_j$ and $q_i T_{jk}$ have finite first moments. If there exists a feasible point $\bar{x}$ of (6.9) with the objective function of (6.9) finite, then the conclusions of Corollary 14 hold with probability one for

$$g(x, \xi) = c^T x + Q(x, \xi) + \delta(x),$$

where

$$\delta(x) = \begin{cases} 0 & \text{if } Ax = b,\ x \ge 0, \\ +\infty & \text{otherwise.} \end{cases}$$

By Theorem 3.1 of King and Wets [1991], one may solve the approximation problem

$$\begin{aligned} \min\ & c^T x + \frac{1}{\nu} \sum_{i=1}^{\nu} Q(x, \xi^i) \\ \text{s. t. } & Ax = b, \\ & x \ge 0, \end{aligned} \tag{6.10}$$

instead of solving (6.8). If the solution of (6.10) converges as $\nu$ tends to infinity, then the limiting point is a solution of (6.8). Alternatively, by Corollary 16, one may directly solve (6.8) with a nonlinear programming method and use

$$c^T x + \frac{1}{\nu} \sum_{i=1}^{\nu} Q(x, \xi^i) \quad \text{and} \quad c + \frac{1}{\nu} \sum_{i=1}^{\nu} \partial Q(x, \xi^i)$$

as approximate objective function values and subdifferentials of (6.8) with $\nu = \nu(k)$ at the $k$th step. Notice that $-u^T T(\xi^i) \in \partial Q(x, \xi^i)$ if and only if $u$ is an optimal dual solution of (6.9) with $\xi = \xi^i$. In this way, one may directly solve the original problem using the subgradients $-u^T T(\xi)$ and the probability that each $u$ is optimal (equivalently that the corresponding basis is primal feasible). The calculation is therefore reduced to obtaining the probability of satisfying a system of linear inequalities, which can be approximated well (see Prékopa [1988] and Section 8.4). This procedure may allow computation without calculating the actual objective value, which may involve a more difficult multiple integral.

These results give some general idea about the uses of approximations in stochastic programming. We can also introduce approximating functions, $g^\nu$, such that $g^\nu$ converges to $g$ pointwise in $D$. Similar convergence results are also obtained there. The general rule is that approximating distribution functions that converge in distribution (even with probability one) to the true distribution function lead to convergence of optima and, for differentiable points, convergence of subgradients.

Exercises

1. Prove that if $g^\nu$ epi-converges to $g$ and $x^*$ is a limit point of $\{x^\nu\}$, where $x^\nu \in \operatorname{argmin} g^\nu = \{x \mid g^\nu(x) \le \inf g^\nu\}$, then $x^* \in \operatorname{argmin} g$.

2. Construct an example where $g^\nu$ epi-converges to $g$ but $\partial g(x) \ne \lim_{\nu} \partial g^\nu(x)$.

3. Consider the basic bounding method in Section 8.2. Suppose that $\Xi$ is compact and that for any $\epsilon > 0$, there exists some $\nu_\epsilon$ such that for all $\nu \ge \nu_\epsilon$, $\operatorname{diam} S^l \le \epsilon$ for all $S^l \in \mathcal{S}^\nu$. Show that this implies that $P^\nu$ converges to $P$ in distribution.

Chapter 9

Monte Carlo Methods

Each function value in a stochastic program can involve a multidimensional integral in extremely high dimensions. Because Monte Carlo simulation appears to offer the best possibilities for higher dimensions (see, e.g., Deák [1988] and Asmussen and Glynn [2007]), it seems to be the natural choice for use in stochastic programs. In this chapter, we describe some of the basic approaches built on sampling methods. The key feature is the use of statistical estimates to obtain confidence intervals on results. Some of the material uses probability measure theory, which is necessary to develop the analytical results.

To build on our earlier emphasis on decomposition algorithms, Section 9.1 begins this discussion with a description of the basic sampling approximation, the sample-average approximation, and then considers its use within the L-shaped method. We first consider possibilities for estimating the cuts in this method using a large number of samples for each cut. Section 9.2 then considers the stochastic decomposition method (Higle and Sen [1991b]) that forms many cuts with few additional samples on each iteration. Section 9.3 considers methods based on the stochastic quasi-gradient, which can be viewed as a generalization of the steepest descent method. These approaches have a wide variety of applications that extend beyond stochastic programming. In Section 9.4, we consider extensions of Monte Carlo methods to include analytical evaluations exploiting problem structure in probabilistic constraint estimation and empirical sample information for methods that may use updated information in dynamic problems. Section 9.5 describes basic theoretical results for the statistical analysis of stochastic programs and, in particular, for the sample-average approximation. We describe asymptotic properties and large-deviation bounds for optimal values and solutions to those problems.

J.R. Birge and F. Louveaux, Introduction to Stochastic Programming, Springer Series in Operations Research and Financial Engineering, DOI 10.1007/978-1-4614-0237-4, © Springer Science+Business Media, LLC 2011


9.1 Sample Average Approximation and Importance Sampling in

the L -Shaped Method

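The most direct sampling approach to the two-stage stochastic program is to replace the recourse function, $Q(x)$, by a Monte Carlo estimate,

$$Q^\nu(x) = \frac{1}{\nu} \sum_{k=1}^{\nu} Q(x, \xi^k), \tag{1.1}$$

where $\xi^1, \ldots, \xi^\nu$ are random samples of the random vector $\xi$. This then yields the sample average approximation (SAA) problem for the general two-stage problem as:

$$\min_{x \in X}\ f^1(x) + \frac{1}{\nu} \sum_{k=1}^{\nu} Q(x, \xi^k), \tag{1.2}$$

where $X$ represents the feasibility set as, for example, in the nonlinear program in (3.4.1). For a stochastic linear program, we can then write (1.2) as:

$$\begin{aligned} \min\ & c^T x + \frac{1}{\nu} \sum_{k=1}^{\nu} q_k^T y_k \\ \text{s. t. } & Ax = b, \\ & T_k x + W y_k = h_k, \quad k = 1, \ldots, \nu, \\ & x \ge 0,\ y_k \ge 0, \quad k = 1, \ldots, \nu. \end{aligned} \tag{1.3}$$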
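A small sketch can make the sample average approximation concrete. It assumes a hypothetical one-dimensional simple recourse problem (shortage cost `Q_PLUS`, surplus cost `Q_MINUS`, first-stage cost `C`, and $\xi$ uniform on $[0, 1]$); none of these data come from the text. For this assumed problem the SAA objective is piecewise linear and convex with breakpoints at the samples, so it can be minimized by checking the sample points, and the true minimizer is $x = 0.5$.

```python
import random

C, Q_PLUS, Q_MINUS = 1.0, 3.0, 1.0  # hypothetical cost data


def recourse(x, xi):
    """Q(x, xi): pay Q_PLUS per unit short and Q_MINUS per unit over."""
    return Q_PLUS * max(xi - x, 0.0) + Q_MINUS * max(x - xi, 0.0)


def saa_objective(x, samples):
    """First-stage cost plus the sample average of the recourse cost."""
    return C * x + sum(recourse(x, xi) for xi in samples) / len(samples)


def solve_saa(samples):
    """Minimize the SAA objective over its breakpoints.

    The slope is C - Q_PLUS < 0 below the smallest sample and
    C + Q_MINUS > 0 above the largest, so a minimizer is a sample point.
    """
    return min(samples, key=lambda x: saa_objective(x, samples))


rng = random.Random(0)
for nu in (10, 100, 500):
    samples = [rng.random() for _ in range(nu)]
    x_nu = solve_saa(samples)
    print(nu, x_nu)
```

The printed solutions should drift toward $0.5$ as the sample size grows, which is the behavior that the convergence theory makes precise.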

As we show in Section 9.5, by increasing the sample size $\nu$, solutions to (1.3) converge to an optimal solution of the two-stage stochastic program (3.1.2). A disadvantage of solving (1.3) completely for each $\nu$ using any algorithm is that some effort might be wasted on optimizing when the approximation is not accurate. An approach to avoid these problems is to use sampling within another algorithm without complete optimization. In this section, we describe this process for the L-shaped method, which often works well for discrete distributions. To ensure that the process makes efficient use of the sample information, we first describe a version using importance sampling to reduce variance in deriving each cut based on a large sample (see Dantzig and Glynn [1990]). In the following section, we consider an approach that uses a single sample stream to derive many cuts that eventually drop away as iteration numbers increase (Higle and Sen [1991b]).

The general approach is to sample $Q$ to construct cuts in the L-shaped method to obtain an approximate solution to (3.1.2). Using a crude Monte Carlo sample of $\xi$, however, may result in high variance for the sample values $Q(x, \xi^i)$, slowing convergence or leading to biased results. Instead, to reduce the variance of the sample values, we use the importance sampling (see, e.g., Rubinstein [1981] and Deák [1990]) variance-reduction technique to concentrate samples where they provide the most information.


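If we use a crude Monte Carlo estimate, $\xi^1, \ldots, \xi^\nu$, then, given an iterate $x^s$, the result is a recourse function estimate, $Q^\nu(x^s) = \frac{1}{\nu} \sum_{i=1}^{\nu} Q(x^s, \xi^i)$, and a corresponding estimate of the gradient, $\nabla Q(x^s)$, as $\bar{\pi}^s_\nu = \frac{1}{\nu} \sum_{i=1}^{\nu} \pi^s_i$, where $\pi^s_i \in \partial Q(x^s, \xi^i)$.

Now, for $Q$ convex in $x$, one obtains

$$Q(x, \xi^i) \ge Q(x^s, \xi^i) + (\pi^s_i)^T (x - x^s) \tag{1.4}$$

for all $x$. We also have that

$$Q^\nu(x) = \frac{1}{\nu} \sum_{i=1}^{\nu} Q(x, \xi^i) \ge Q^\nu(x^s) + (\bar{\pi}^s_\nu)^T (x - x^s) = LB^s_\nu(x), \tag{1.5}$$

where, by the central limit theorem, $\sqrt{\nu}$ times the right-hand side is asymptotically normally distributed with a mean value,

$$\sqrt{\nu}\, \bigl( Q(x^s) + \nabla Q(x^s)^T (x - x^s) \bigr), \tag{1.6}$$

which is a lower bound on $\sqrt{\nu}\, Q(x)$, and a variance, $\rho^s(x)$.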
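The construction of a sampled cut can be sketched as follows, again under an assumed one-dimensional simple recourse function with hypothetical costs and uniform samples. The sketch averages recourse values and subgradients at an iterate `x_s` and then checks numerically that the resulting linear function lies below the sample-average recourse function built from the same samples, as convexity guarantees.

```python
import random

Q_PLUS, Q_MINUS = 3.0, 1.0  # hypothetical recourse costs


def recourse(x, xi):
    """Hypothetical convex recourse value Q(x, xi)."""
    return Q_PLUS * max(xi - x, 0.0) + Q_MINUS * max(x - xi, 0.0)


def subgradient(x, xi):
    """One element of the subdifferential of Q(., xi) at x."""
    return -Q_PLUS if xi >= x else Q_MINUS


def sampled_cut(x_s, samples):
    """Average recourse value and average subgradient at the iterate x_s."""
    nu = len(samples)
    q_bar = sum(recourse(x_s, xi) for xi in samples) / nu
    pi_bar = sum(subgradient(x_s, xi) for xi in samples) / nu
    return q_bar, pi_bar


def lower_bound(cut, x_s, x):
    """Value of the sampled cut at x."""
    q_bar, pi_bar = cut
    return q_bar + pi_bar * (x - x_s)


rng = random.Random(0)
samples = [rng.random() for _ in range(500)]
x_s = 0.3
cut = sampled_cut(x_s, samples)

grid = [i / 20 for i in range(21)]
gaps = [
    sum(recourse(x, xi) for xi in samples) / len(samples) - lower_bound(cut, x_s, x)
    for x in grid
]
print(min(gaps))  # nonnegative up to rounding: the cut supports the sample average
```

Note that the cut supports the sample-average function, not necessarily the true expected recourse function; the difference is the sampling error discussed next.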

Note that the cut placed on $Q(x)$ as the right-hand side of (1.5) is a support of $Q$ with some error,

$$Q(x) \ge Q^\nu(x^s) + (\bar{\pi}^s_\nu)^T (x - x^s) - \epsilon^s(x), \tag{1.7}$$

where $\epsilon^s(x)$ is an error term with zero mean and variance equal to $\frac{1}{\nu} \rho^s(x)$. Of course, the error term is not known. At iteration $s$, the L-shaped method involves the solution of:

$$\begin{aligned} \min\ & c^T x + \theta \\ \text{s. t. } & Ax = b, \\ & D_l x \ge d_l, \quad l = 1, \ldots, r, \\ & E_l x + \theta \ge e_l, \quad l = 1, \ldots, s, \\ & x \ge 0, \end{aligned} \tag{1.8}$$

where $D_l$ is a feasibility cut as in (5.1.7)–(5.1.8), $E_l = -(\bar{\pi}^l_\nu)^T$, and $e_l = Q^\nu(x^l) + (\bar{\pi}^l_\nu)^T (-x^l)$, where we count iterations only when a finite $Q^\nu(x^s)$ is found. Note that the generation of feasibility cuts occurs whenever $\xi^i$ is sampled and $Q(x^s, \xi^i)$ is $\infty$.

We suppose that (1.8) is solved to yield $x^{s+1}$ and $\theta^{s+1}$, where

$$\theta^{s+1} = \max_{l} \{ e_l - E_l x^{s+1} \}, \tag{1.9}$$

where each $e_l - E_l x^{s+1}$ can be viewed as a sample from a normally distributed random variable with mean at most $Q(x^{s+1})$ and variance at most $\frac{1}{\nu} \max_l \rho^l(x^{s+1})$. Note that $\theta^{s+1}$ is a maximum of these random variables so, if the samples are taken independently on each iteration $s$, the solution of (1.8) has a bias that may skew results for large $s$. Confidence intervals can, however, be developed based on certain assumptions about the functions and the supports. Alternatively, the same sample set, $\xi^1, \ldots, \xi^\nu$, can be used on each iteration so that the L-shaped method iterations solve (1.2) for the given sample with the theory of sample average approximations providing convergence results (see Section 9.5).
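The bias from taking a maximum of independently sampled cuts can be seen in a deliberately simplified simulation. It assumes that every cut, evaluated at one fixed point, is an unbiased sample mean of the same quantity (here $E|0.5 - \xi| = 0.25$ for $\xi$ uniform on $[0, 1]$); the sample size, number of cuts, and number of replications are arbitrary illustrative choices.

```python
import random
import statistics

TRUE_VALUE = 0.25  # E|0.5 - xi| for xi uniform on [0, 1]


def noisy_cut_value(rng, nu):
    """Sample mean of nu draws of |0.5 - xi|: an unbiased estimate of 0.25."""
    return sum(abs(0.5 - rng.random()) for _ in range(nu)) / nu


rng = random.Random(1)
nu, n_cuts, reps = 50, 20, 500

single_values = []  # one cut value per replication
max_values = []     # maximum over n_cuts independent cut values
for _ in range(reps):
    values = [noisy_cut_value(rng, nu) for _ in range(n_cuts)]
    single_values.append(values[0])
    max_values.append(max(values))

avg_single = statistics.mean(single_values)
avg_max = statistics.mean(max_values)
print(avg_single, avg_max, TRUE_VALUE)
```

Each individual estimate averages out close to the true value, while the average of the maxima sits visibly above it. Reusing one fixed sample set for all cuts removes this effect for the sampled problem, at the price of solving only the sample average approximation.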

If the variances of the sample estimates are sufficiently small, one can stop with a high confidence solution. Other approaches may also be used. Infanger [1991] makes several assumptions that can lead to tight confidence intervals on the optimal value and allow solutions of large problems (see, e.g., Dantzig and Infanger [1991]). As indicated earlier, variances and any form of confidence interval may be quite large when crude Monte Carlo samples are used. Importance sampling can, however, reduce the variance substantially (see Dantzig and Glynn [1990]).

In importance sampling, the goal is to replace a sample using the distribution of $\xi$ with one that uses an alternative distribution that places more weight in the areas of importance. To see this, suppose that $\xi$ has a density $f(\xi)$ over $\Xi$ so that we are trying to find:

$$Q(x) = \int_{\Xi} Q(x, \xi) f(\xi)\, d\xi. \tag{1.10}$$

The crude Monte Carlo technique generates each sample $\xi^i$ according to the distribution given by density $f$.

In importance sampling, a new probability density $g(\xi)$ is introduced that is somewhat similar to $Q(x, \xi)$ and such that $g(\xi) > 0$ whenever $Q(x, \xi) f(\xi) \ne 0$. We then generate samples $\xi^i$ according to this distribution while writing the integral as:

$$Q(x) = \int_{\Xi} \frac{Q(x, \xi) f(\xi)}{g(\xi)}\, g(\xi)\, d\xi. \tag{1.11}$$

In this case, we generate random samples of $\frac{Q(x, \xi) f(\xi)}{g(\xi)}$ from the distribution with density $g(\xi)$. Note that if $g(\xi) = \frac{Q(x, \xi) f(\xi)}{Q(x)}$, then every sample $\xi^{imp}$ under importance sampling yields an importance sampling expectation, $Q^\nu_{imp}(x) = Q(x)$. Of course, if we could generate samples from the density $\frac{Q(x, \xi) f(\xi)}{Q(x)}$, we would already know $Q(x)$. We can, however, use approximations such as the sublinear approximations in Section 8.5 that may be close to $Q(x)$ and should result in lower variances for $Q^\nu_{imp}$ over $Q^\nu$. This approximation is the approach suggested in Infanger [1991].
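The variance reduction can be illustrated with a minimal one-dimensional sketch. All specifics are assumptions made for illustration: the integrand `recourse(x, xi) = (xi - x)**2` stands in for $Q(x, \xi)$, the original density is uniform on $[0, 1]$, and the importance density $g(\xi) = 2\xi$ is only roughly proportional to the integrand at $x = 0$, where the exact expectation is $1/3$.

```python
import math
import random
import statistics


def recourse(x, xi):
    # Hypothetical stand-in for Q(x, xi), used only for illustration.
    return (xi - x) ** 2


def f(xi):
    return 1.0  # original density: uniform on [0, 1]


def g(xi):
    return 2.0 * xi  # assumed importance density on (0, 1]


def sample_from_g(rng):
    # Inverse transform: G(xi) = xi**2, so xi = sqrt(u).
    # 1 - random() lies in (0, 1], which keeps g(xi) strictly positive.
    return math.sqrt(1.0 - rng.random())


rng = random.Random(0)
x, nu = 0.0, 20000

crude = [recourse(x, rng.random()) for _ in range(nu)]

weighted = []
for _ in range(nu):
    xi = sample_from_g(rng)
    weighted.append(recourse(x, xi) * f(xi) / g(xi))

crude_mean, imp_mean = statistics.mean(crude), statistics.mean(weighted)
crude_var, imp_var = statistics.variance(crude), statistics.variance(weighted)
print(crude_mean, imp_mean)
print(crude_var, imp_var)
```

Both estimators target the same expectation, but the weighted samples vary much less because the ratio of the integrand to `g` is flatter than the integrand itself. A density exactly proportional to the integrand would give zero variance, which is the unattainable ideal described above.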

In the sublinear approximation approach, the approximating density $g(\xi)$ is chosen as

$$g(\xi) = \frac{\sum_{i=1}^{m_2} \psi_{I(i)}(T_{i\cdot} x, h_i)\, f(\xi)}{\Psi_I(Tx)}, \tag{1.12}$$

where $g$ may also depend on $x$. Using this construction, much lower variances can result in comparison to the crude Monte Carlo approach. One complication is, however, in generating a random sample from the density in (1.12). The general technique for generating such random vectors is to generate sequentially from the marginal distributions conditionally, first choosing $\xi_1$ with the first marginal, $g_1(\xi_1) = \int g(\xi_1, \xi_2, \ldots, \xi_N)\, d\xi_2 \cdots d\xi_N$. Then, sequentially, $\xi_i$ is chosen with density $g_i(\xi_i \mid \xi_1, \ldots, \xi_{i-1})$. Remember that in each case, a random sample with density $g_i(\xi_i)$ on an interval $\Xi_i$ of $\Re$ can be found by choosing a uniform random sample $u$ from $[0, 1]$ and then taking $\xi_i$ such that $G(\xi_i) = u$, where $G(x) = \int_{-\infty}^{x} g_i(\xi_i)\, d\xi_i$.
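Sequential sampling from a first marginal and then a conditional density can be sketched with a density whose distribution functions invert in closed form. The density $g(\xi_1, \xi_2) = \xi_1 + \xi_2$ on the unit square is an assumption chosen for that convenience and is not the density of (1.12). Its first marginal is $\xi_1 + \tfrac{1}{2}$, and the conditional density of $\xi_2$ given $\xi_1$ is $(\xi_1 + \xi_2)/(\xi_1 + \tfrac{1}{2})$.

```python
import math
import random
import statistics


def sample_first(u):
    """Solve G1(xi1) = xi1**2 / 2 + xi1 / 2 = u for xi1 in [0, 1]."""
    return (-1.0 + math.sqrt(1.0 + 8.0 * u)) / 2.0


def sample_second(u, xi1):
    """Solve (xi1 * xi2 + xi2**2 / 2) / (xi1 + 1/2) = u for xi2 in [0, 1]."""
    return -xi1 + math.sqrt(xi1 ** 2 + 2.0 * u * (xi1 + 0.5))


def sample_pair(rng):
    """Draw (xi1, xi2) with joint density xi1 + xi2 on the unit square."""
    xi1 = sample_first(rng.random())
    xi2 = sample_second(rng.random(), xi1)
    return xi1, xi2


rng = random.Random(0)
pairs = [sample_pair(rng) for _ in range(50000)]

mean_first = statistics.mean(p[0] for p in pairs)       # exact value: 7/12
mean_product = statistics.mean(p[0] * p[1] for p in pairs)  # exact value: 1/3
print(mean_first, mean_product)
```

When the distribution functions cannot be inverted analytically, the same scheme still applies with a numerical root finder in place of the closed-form inverses.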

Example 1

Consider Example 1 of Section 8.2 with $x_1 = x_2 = x$. We consider both the crude Monte Carlo approach and the importance sampling using the sublinear approximation for $g(\xi)$. In this case, $g(\xi)$ is actually chosen to depend on $x$ as $g_x(\xi)$ defined by:

$$g_x(\xi) = \frac{|x - \xi_1| + |x - \xi_2|}{\int_0^1 \int_0^1 \bigl[ |x - \xi_1| + |x - \xi_2| \bigr]\, d\xi_1\, d\xi_2}. \tag{1.13}$$

For comparison, we first consider the L-shaped method with $\xi^i$ chosen by crude Monte Carlo from the original uniform density on $[0, 1] \times [0, 1]$ and by the importance sampling method with distribution $g_x(\xi)$ in (1.13). The results appear in Figure 1 for the solution $x^s$ at each iteration $s$ of the crude Monte Carlo and importance sampling L-shaped method with $\nu = 500$ on each L-shaped iteration. The figures show up to 101 L-shaped iterations, which involve more than 50,000 recourse problem solutions.
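As a worked check, and assuming the importance density in this example is proportional to $|x - \xi_1| + |x - \xi_2|$ on the unit square with $0 \le x \le 1$, the normalizing constant and the first marginal needed for sequential sampling can be written in closed form:

```latex
\begin{align*}
\int_0^1 |x - \xi_1| \, d\xi_1
  &= \frac{x^2}{2} + \frac{(1 - x)^2}{2} = x^2 - x + \tfrac{1}{2}, \\
\int_0^1 \!\! \int_0^1 \bigl[ |x - \xi_1| + |x - \xi_2| \bigr] \, d\xi_1 \, d\xi_2
  &= 2 \left( x^2 - x + \tfrac{1}{2} \right) = 2x^2 - 2x + 1, \\
g_{x,1}(\xi_1) = \int_0^1 g_x(\xi_1, \xi_2) \, d\xi_2
  &= \frac{|x - \xi_1| + x^2 - x + \tfrac{1}{2}}{2x^2 - 2x + 1}.
\end{align*}
```

For instance, at $x = \sqrt{2} - 1$ the normalizing constant is $9 - 6\sqrt{2} \approx 0.515$.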

In Figure 1, the crude Monte Carlo iteration values $x^s$ appear as x(crude) while the importance sampling iterations appear as x(imp). We also include the optimal solution $x^* = \sqrt{2} - 1$ on the graph. Note that x(imp) is very close to $x^*$ from just over 40 iterations while x(crude) does not appear to approach this accuracy within 100 iterations. Note that x(imp) begins to deteriorate after 80 iterations as the accumulation of cuts increases the probability that some cuts are actually above $Q(x)$. If each cut is generated independently, this adds a bias to the results since the expectation of the outer linearization is the expectation of the maximum of a set of random approximations, which is greater than the maximum of the expectations of those cuts in an exact procedure. This problem is reduced but not eliminated with importance sampling. As a remedy, a fixed set of samples can be used to obtain convergence for that sample set and then checked for convergence using sequential sampling procedures as discussed in Section 9.5.

The advantage of importance sampling can also be seen in Figure 2, which compares the optimal value $Q(x^*)$ with sample values, $Q^\nu(x^s)$, with crude Monte Carlo denoted as Q(crude) and $Q^\nu_{imp}(x^s)$ with importance sampling denoted as Q(imp). Note that the crude Monte Carlo values have a much wider variance, in fact, double the variance of the importance sampling results. Also note that in both sampling methods, the estimates have a mean close to the optimal value after 40 iterations.