batch manufacturing

Stochastic Processes

Posted by Birdie on January 15, 2024

Objectives

Write programs to solve the batch manufacturing problem

  • Formulate and solve both discounted and average cost problems
  • Implement both value iteration and policy iteration algorithms
  • Take the parameters c, K, n, p, α as inputs

Experiments and analysis

  • Run the code with different parameter inputs to show the performance of value iteration and policy iteration
  • For the discounted problem, consider Exercise 7.8
  • For the average cost problem, write code to find the threshold by stationary analysis
  • Compare the optimal policies of the discounted problem with different values of α and the average cost problem

Background

Batch Manufacturing

At each time period, the manufacturer receives an order for her product with probability p and receives no order with probability 1 − p.

At any period, she has a choice of processing all the unfilled orders in a batch or processing none at all. The maximum number of orders that can remain unfilled is n.

The cost per unfilled order at each time period is c > 0, and the setup cost to process the unfilled orders is K > 0.

The manufacturer wants to find a processing policy that minimizes the total expected discounted cost with discount factor α < 1.

Discounted Problem

Discounted problems (0 < α < 1):

$$\lim_{N\to\infty} \mathop{E}_{w_k,\,k=0,1,\dots}\left\{\sum_{k=0}^{N-1}\alpha^k g\big(x_k,\mu_k(x_k),w_k\big)\right\}$$

  • α is the discount factor
  • g(x_k, μ_k(x_k), w_k) is the cost incurred in state x_k when applying action μ_k(x_k) under the random disturbance w_k

Value Iteration

Given initial values $J_0(1),\dots,J_0(n)$, compute $J_k(i)$ with the dynamic programming recursion

$$J_{k+1}(i)=\min_{u\in U(i)}\Big[g(i,u)+\alpha\sum_{j=1}^{n}p_{ij}(u)J_k(j)\Big],\quad\forall i$$

Eventually $J_k(i)$ converges to $J^*(i)$ for every state $i$.
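
Before specializing to the batch manufacturing model, here is a minimal generic sketch of this recursion, assuming the MDP is given as a list of transition matrices P[u] and stage-cost vectors g[u] (the names and representation are my own, not taken from this report's code):

import numpy as np

def value_iteration(P, g, alpha, eps=1e-6):
    """Generic discounted value iteration (a sketch, not the report's code).

    P[u] is an (S, S) transition matrix and g[u] an (S,) stage-cost vector
    for action u; alpha in (0, 1) is the discount factor."""
    S = P[0].shape[0]
    J = np.zeros(S)
    while True:
        # Q[i, u] = g(i, u) + alpha * sum_j p_ij(u) * J(j)
        Q = np.stack([g[u] + alpha * P[u] @ J for u in range(len(P))], axis=1)
        J_new = Q.min(axis=1)
        if np.max(np.abs(J_new - J)) < eps:
            return J_new, Q.argmin(axis=1)   # estimate of J* and a greedy policy
        J = J_new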

Policy Iteration

By the Bellman equation, the optimal value function at state $i$ satisfies

$$J^*(i)=\min_{u\in U(i)}\Big[g(i,u)+\alpha\sum_{j=1}^{n}p_{ij}(u)J^*(j)\Big],\quad\forall i$$

For a given policy $\mu$, its cost $J_\mu(1),\dots,J_\mu(n)$ is the solution of

$$J_\mu(i)=g\big(i,\mu(i)\big)+\alpha\sum_{j=1}^{n}p_{ij}\big(\mu(i)\big)J_\mu(j),\quad\forall i$$

Given initial values $J_0(1),\dots,J_0(n)$, it can be computed with the recursion

$$J_{k+1}(i)=g\big(i,\mu(i)\big)+\alpha\sum_{j=1}^{n}p_{ij}\big(\mu(i)\big)J_k(j),\quad\forall i$$

which converges to $J_\mu(i)$ for every state.

A stationary policy $\mu$ is optimal if and only if $\mu(i)$ attains the minimum in the Bellman equation for every state $i$.

The policy improvement step is therefore

$$\mu^{k+1}(i)=\arg\min_{u\in U(i)}\Big[g(i,u)+\alpha\sum_{j=1}^{n}p_{ij}(u)J_{\mu^k}(j)\Big],\quad\forall i$$

which generates a sequence of improving policies and terminates with an optimal policy.
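
For comparison, a hedged sketch of policy iteration in the same generic representation; here policy evaluation solves the linear system (I − αP_μ)J_μ = g_μ exactly instead of iterating it, which is an equivalent alternative to the iterative evaluation used later in this report. State-dependent action constraints are ignored for brevity.

import numpy as np

def policy_iteration(P, g, alpha):
    """Generic discounted policy iteration with exact policy evaluation."""
    S, U = P[0].shape[0], len(P)
    mu = np.zeros(S, dtype=int)          # initial policy: action 0 everywhere
    while True:
        # Policy evaluation: solve (I - alpha * P_mu) J_mu = g_mu
        P_mu = np.array([P[mu[i]][i] for i in range(S)])
        g_mu = np.array([g[mu[i]][i] for i in range(S)])
        J = np.linalg.solve(np.eye(S) - alpha * P_mu, g_mu)
        # Policy improvement: greedy one-step lookahead on J_mu
        Q = np.stack([g[u] + alpha * P[u] @ J for u in range(U)], axis=1)
        mu_new = Q.argmin(axis=1)
        if np.array_equal(mu_new, mu):
            return J, mu                 # optimal cost and an optimal policy
        mu = mu_new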

Average Cost Problem

Average cost per stage problems

limN1NEwkk=0,1,{N1k=0g(xk,μk(xk),wk)}
  • g(xk,μk(xk),wk)表示从状态 xk 执行动作 μk(xk) 到达状态wk的代价

Value Iteration

The Bellman equation is:

$$\lambda+h(i)=\min_{u\in U(i)}\Big[g(i,u)+\sum_{j=1}^{n}p_{ij}(u)h(j)\Big],\quad\forall i$$

The unknowns are $\lambda$ and $h(i)$: $\lambda$ is the optimal average cost, and $h(i)$ is the differential cost of state $i$, i.e. the gap between the optimal cost of that state and the optimal average cost.

Given $h_k(i)$ and a fixed reference state $s$, compute

$$\lambda_k=\min_{u\in U(s)}\Big[g(s,u)+\sum_{j=1}^{n}p_{sj}(u)h_k(j)\Big]$$

$$h_{k+1}(i)=\min_{u\in U(i)}\Big[g(i,u)+\sum_{j=1}^{n}p_{ij}(u)h_k(j)\Big]-\lambda_k$$
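
A hedged sketch of this relative value iteration in the same generic representation as before (helper names are my own, not the report's code):

import numpy as np

def relative_value_iteration(P, g, s=0, eps=1e-8, max_iter=100000):
    """Generic relative value iteration for the average-cost problem.

    s is the fixed reference state; returns an estimate of (lambda, h)."""
    S, U = P[0].shape[0], len(P)
    h = np.zeros(S)
    for _ in range(max_iter):
        Q = np.stack([g[u] + P[u] @ h for u in range(U)], axis=1)
        lam = Q[s].min()              # lambda_k computed at the fixed state s
        h_new = Q.min(axis=1) - lam   # new differential costs
        if np.max(np.abs(h_new - h)) < eps:
            break
        h = h_new
    return lam, h_new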

Policy Iteration

At iteration $k$, fix the policy $\mu^k$ and evaluate it:

$$\lambda^k+h^k(i)=g\big(i,\mu^k(i)\big)+\sum_{j=1}^{n}p_{ij}\big(\mu^k(i)\big)h^k(j),\quad\forall i,\qquad h^k(n)=0$$

Policy improvement:

$$\mu^{k+1}(i)=\arg\min_{u\in U(i)}\Big[g(i,u)+\sum_{j=1}^{n}p_{ij}(u)h^k(j)\Big],\quad\forall i$$

The algorithm terminates when $\lambda^{k+1}=\lambda^k$ and $h^{k+1}(i)=h^k(i)$ for all $i$.
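
The evaluation equations above, together with the normalization $h^k(n)=0$, form a linear system that can also be solved directly. A minimal sketch, assuming the Markov chain under the fixed policy is unichain so the system is nonsingular (function and variable names are my own):

import numpy as np

def evaluate_policy_average(P_mu, g_mu):
    """Solve lambda + h(i) = g_mu(i) + sum_j p_ij h(j) with h(S-1) = 0.

    P_mu is the (S, S) transition matrix under the fixed policy and g_mu the
    (S,) stage costs; the unknowns are h(0), ..., h(S-2) and lambda."""
    S = P_mu.shape[0]
    A = np.zeros((S, S))
    A[:, :S - 1] = np.eye(S)[:, :S - 1] - P_mu[:, :S - 1]   # coefficients of h(j)
    A[:, S - 1] = 1.0                                        # coefficient of lambda
    x = np.linalg.solve(A, g_mu)
    h = np.append(x[:S - 1], 0.0)                            # enforce h(S-1) = 0
    return x[S - 1], h                                       # (lambda, h)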

Method

Markov decision process model

  • State $i\in\{0,1,\dots,n\}$: number of unfilled orders
  • Action $u\in\{0,1\}$: process (1) or not (0)

    $u\in\{0,1\}$ if $i<n$;  $u=1$ if $i=n$
  • State transition probabilities $p_{ij}(u)$:

    $p_{i1}(1)=p_{i(i+1)}(0)=p$, $p_{i0}(1)=p_{ii}(0)=1-p$ for $i<n$;  $p_{n1}(1)=p$, $p_{n0}(1)=1-p$
  • Per-stage cost:

    $g(i,1)=K$, $g(i,0)=c\,i$
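
To make the model concrete, the following sketch builds the transition matrices and per-stage costs defined above for given p, n, c, K (function and variable names are my own; this is not part of the report's code):

import numpy as np

def build_batch_mdp(p, n, c, K):
    """Transition matrices P[u] and stage costs g[u], u = 0 (wait) or 1 (process).

    States 0..n count the unfilled orders; processing clears the batch and a
    new order then arrives with probability p."""
    S = n + 1
    P = [np.zeros((S, S)), np.zeros((S, S))]
    g = [np.zeros(S), np.zeros(S)]
    for i in range(S):
        # u = 1 (process): move to 0 or 1 unfilled orders, pay the setup cost K
        P[1][i, 0], P[1][i, 1] = 1 - p, p
        g[1][i] = K
        if i < n:
            # u = 0 (wait): stay at i or grow to i + 1, pay the holding cost c*i
            P[0][i, i], P[0][i, i + 1] = 1 - p, p
            g[0][i] = c * i
        else:
            # At i = n only processing is admissible; mirror u = 1 so P[0] stays stochastic
            P[0][i] = P[1][i]
            g[0][i] = g[1][i]
    return P, g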

For the Discounted Problem

Value Iteration

Initial values

$$J_0(i)=0,\quad\forall i$$

Value iteration

$$J_{k+1}(i)=\min\big[K+\alpha(1-p)J_k(0)+\alpha p\,J_k(1),\;\; c\,i+\alpha(1-p)J_k(i)+\alpha p\,J_k(i+1)\big],\quad i=0,1,\dots,n-1$$

$$J_{k+1}(n)=K+\alpha(1-p)J_k(0)+\alpha p\,J_k(1)$$

Convergence:

$$J^*(i)=\lim_{k\to\infty}J_k(i)$$

Policy Iteration

Initial policy

$$\mu_0(i)=1,\quad\forall i$$

Policy evaluation (iterated until convergence for the fixed policy $\mu_k$)

$$J_{\mu_k}^{(m+1)}(i)=\begin{cases}K+\alpha(1-p)J_{\mu_k}^{(m)}(0)+\alpha p\,J_{\mu_k}^{(m)}(1), & \mu_k(i)=1\\ c\,i+\alpha(1-p)J_{\mu_k}^{(m)}(i)+\alpha p\,J_{\mu_k}^{(m)}(i+1), & \mu_k(i)=0\end{cases}\qquad i=0,1,\dots,n-1$$

$$J_{\mu_k}^{(m+1)}(n)=K+\alpha(1-p)J_{\mu_k}^{(m)}(0)+\alpha p\,J_{\mu_k}^{(m)}(1)$$

Policy improvement

$$\mathrm{cost}_{\mathrm{process}}(i)=K+\alpha(1-p)J_{\mu_k}(0)+\alpha p\,J_{\mu_k}(1)$$

$$\mathrm{cost}_{\mathrm{unprocess}}(i)=c\,i+\alpha(1-p)J_{\mu_k}(i)+\alpha p\,J_{\mu_k}(i+1)$$

$$\mu_{k+1}(i)=\begin{cases}1, & \mathrm{cost}_{\mathrm{process}}(i)<\mathrm{cost}_{\mathrm{unprocess}}(i)\\ 0, & \text{otherwise}\end{cases}\qquad i=0,\dots,n-1,\qquad \mu_{k+1}(n)=1$$

For the Average Cost Problem

The Bellman equation is:

$$\lambda+h(i)=\min\big[K+(1-p)h(0)+p\,h(1),\;\; c\,i+(1-p)h(i)+p\,h(i+1)\big],\quad i=0,1,\dots,n-1$$

$$\lambda+h(n)=K+(1-p)h(0)+p\,h(1)$$

Value Iteration

Initial values

$$h_0(i)=0,\quad\forall i$$

The reference state is fixed to 0: every renewal cycle starts from and returns to state 0.

Value iteration (since $K>0$, the minimum at state 0 is always attained by not processing)

$$\lambda_k=(1-p)h_k(0)+p\,h_k(1)$$

$$h_{k+1}(i)=\min\big[K+(1-p)h_k(0)+p\,h_k(1),\;\; c\,i+(1-p)h_k(i)+p\,h_k(i+1)\big]-\lambda_k,\quad i=0,1,\dots,n-1$$

$$h_{k+1}(n)=K+(1-p)h_k(0)+p\,h_k(1)-\lambda_k$$

Convergence:

$$h^*(i)=\lim_{k\to\infty}h_k(i),\qquad \lambda^*=\lim_{k\to\infty}\lambda_k$$

Policy Iteration

Initial policy

$$\mu_0(i)=1,\quad\forall i$$

Policy evaluation (iterated until convergence for the fixed policy $\mu_k$)

$$\lambda^{(m)}=(1-p)h^{(m)}(0)+p\,h^{(m)}(1)$$

$$h^{(m+1)}(i)=\begin{cases}K+(1-p)h^{(m)}(0)+p\,h^{(m)}(1)-\lambda^{(m)}, & \mu_k(i)=1\\ c\,i+(1-p)h^{(m)}(i)+p\,h^{(m)}(i+1)-\lambda^{(m)}, & \mu_k(i)=0\end{cases}\qquad i=0,1,\dots,n-1$$

$$h^{(m+1)}(n)=K+(1-p)h^{(m)}(0)+p\,h^{(m)}(1)-\lambda^{(m)}$$

Policy improvement

$$\mathrm{cost}_{\mathrm{process}}(i)=K+(1-p)h_{\mu_k}(0)+p\,h_{\mu_k}(1)$$

$$\mathrm{cost}_{\mathrm{unprocess}}(i)=c\,i+(1-p)h_{\mu_k}(i)+p\,h_{\mu_k}(i+1)$$

$$\mu_{k+1}(i)=\begin{cases}1, & \mathrm{cost}_{\mathrm{process}}(i)<\mathrm{cost}_{\mathrm{unprocess}}(i)\\ 0, & \text{otherwise}\end{cases}\qquad i=0,\dots,n-1,\qquad \mu_{k+1}(n)=1$$

Deriving the policy from the value function

Normally one would extract the policy greedily from a state-action (Q) value function. In this experiment, however, value iteration yields a state value function, which only measures how good each state is. Since the batch manufacturing problem offers only two decisions, process and unprocess, the greedy reading is: if not processing can still lead to a better state, choose unprocess; otherwise, process. Therefore, once the value function is obtained, every state whose value attains the maximum is assigned the decision process, and all other states are assigned unprocess.
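
An equivalent cross-check is a one-step greedy lookahead on the converged table, which is exactly the comparison used in the policy improvement steps above. A small sketch (the helper name is my own, not part of the report's code):

def greedy_policy(value, p, n, c, K, alpha=1.0):
    """One-step greedy lookahead on a converged value / differential-cost table.

    With alpha < 1 this is the discounted comparison; with alpha = 1 it is the
    average-cost comparison (a constant shift of the table cancels on both sides)."""
    policy = []
    for i in range(n):
        process_cost = K + alpha * ((1 - p) * value[0] + p * value[1])
        unprocess_cost = c * i + alpha * ((1 - p) * value[i] + p * value[i + 1])
        policy.append(1 if process_cost < unprocess_cost else 0)
    policy.append(1)   # state n must process
    return policy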

Implementation

Value Iteration for the Discounted Problem

def _iteration_discounted(self, p, n, c, K, alpha):
    # Current value table, initialised to the zero vector
    value = [0] * (n + 1)
    # Value table from the previous iteration, initialised to the zero vector
    last_value = [0] * (n + 1)
    # Iteration counter
    turns = 0
    while True:
        # One more iteration
        turns = turns + 1
        # Update value[0 ... n-1] with the value iteration equation
        for i in range(n):
            value[i] = min(
                K + alpha * (1 - p) * last_value[0] + alpha * p * last_value[1],
                c * i + alpha * (1 - p) * last_value[i] + alpha * p * last_value[i + 1]
            )
        # Update value[n]: processing is the only admissible action
        value[n] = K + alpha * (1 - p) * last_value[0] + alpha * p * last_value[1]
        # Converged once the two vectors differ by less than EPS componentwise
        if is_terminal(value, last_value, Value.EPS):
            break
        last_value = value.copy()
    
    return {
        "Iteration turns" : turns,
        "Value function" : format_float(value),
        "Policy" : self._get_policy(value)
    }

Policy Iteration for the Discounted Problem

# Policy evaluation: compute the value table of the given policy
def _get_value_discounted(self, p, n, c, K, alpha, policy):
    value = [0] * (n + 1)
    last_value = [0] * (n + 1)
    turns = 0
    while True:
        turns = turns + 1
        for i in range(n):
            value[i] = (
                K + alpha * (1 - p) * last_value[0] + alpha * p * last_value[1] if policy[i] else
                c * i + alpha * (1 - p) * last_value[i] + alpha * p * last_value[i + 1]
            )
        value[n] = K + alpha * (1 - p) * last_value[0] + alpha * p * last_value[1]
        if is_terminal(value, last_value, Policy.EPS):
            break
        last_value = value.copy()
    
    return value, turns

# Policy improvement: policy iteration
def _iteration_discounted(self, p, n, c, K, alpha):
    # Policy of the current round (for now, the initial policy)
    policy = [0] * n + [1]
    # Policy of the previous round
    last_policy = [0] * n + [1]
    # Counter of policy improvement rounds
    turns = 0
    # Total number of policy evaluation sweeps
    value_turns = 0
    while True:
        # One more round
        turns = turns + 1
        # Evaluate the previous policy to obtain its value table
        value, turns_i = self._get_value_discounted(p, n, c, K, alpha, last_policy)
        value_turns += turns_i

        # Compute the improved policy from the value table
        for i in range(n):
            # Cost if we process
            process_cost = K + alpha * (1 - p) * value[0] + alpha * p * value[1]
            # Cost if we do not process
            unprocess_cost = c * i + alpha * (1 - p) * value[i] + alpha * p * value[i + 1]
            # Pick the action with the smaller cost
            policy[i] = 1 if process_cost < unprocess_cost else 0
        # The action set of state n is {1}
        policy[n] = 1

        # Terminate when the policy no longer changes
        if policy == last_policy:
            break
        last_policy = policy.copy()

    return {
        "Policy improvement iteration turns" : turns,
        "Calucating value function total iteration turns" : value_turns,
        "Value function" : format_float(value),
        "Policy" : policy
    }

Value Iteration for the Average Cost Problem

def _iteration_average(self, p, n, c, K):
    # Differential cost table of the current iteration
    value_h = [0] * (n + 1)
    # Differential cost table of the previous iteration
    last_value_h = [0] * (n + 1)
    # Average cost estimate of the previous iteration
    last_lambda_k = 0
    # Iteration counter
    turns = 0
    while True:
        # One more iteration
        turns = turns + 1
        # Fix state 0 and compute the average cost estimate
        lambda_k = min(
            K + (1 - p) * last_value_h[0] + p * last_value_h[1],
            (1 - p) * last_value_h[0] + p * last_value_h[1]
        )
        # Update the differential costs of states 0 ... n-1
        for i in range(n):
            value_h[i] = min(
                K + (1 - p) * last_value_h[0] + p * last_value_h[1],
                c * i + (1 - p) * last_value_h[i] + p * last_value_h[i + 1]
            ) - lambda_k
        # Update the differential cost of state n
        value_h[n] = K + (1 - p) * last_value_h[0] + p * last_value_h[1] - lambda_k
        # Terminate once both the differential costs and lambda have converged
        if is_terminal(value_h, last_value_h, Value.EPS) and abs(last_lambda_k - lambda_k) < Value.EPS:
            break
        # Roll over to the next iteration
        last_value_h = value_h.copy()
        last_lambda_k = lambda_k
    
    value = [i + lambda_k for i in value_h]
    
    return {
        "Iteration turns" : turns,
        "Value function" : format_float(value),
        "Policy" : self._get_policy(value),
        "Optimal average cost" : format_float(lambda_k)
    }

Policy Iteration for the Average Cost Problem

# Policy evaluation: compute the average cost and the differential cost table of the given policy
def _get_value_average(self, p, n, c, K, policy):
    value_h = [0] * (n + 1)
    last_value_h = [0] * (n + 1)
    last_lambda_k = 0
    turns = 0
    while True:
        turns = turns + 1
        lambda_k = min(
            K + (1 - p) * last_value_h[0] + p * last_value_h[1],
            (1 - p) * last_value_h[0] + p * last_value_h[1]
        )
        for i in range(n):
            value_h[i] = (
                K + (1 - p) * last_value_h[0] + p * last_value_h[1] if policy[i] else
                c * i + (1 - p) * last_value_h[i] + p * last_value_h[i + 1]
            ) - lambda_k
        value_h[n] = K + (1 - p) * last_value_h[0] + p * last_value_h[1] - lambda_k
        if is_terminal(value_h, last_value_h, Policy.EPS) and abs(last_lambda_k - lambda_k) < Policy.EPS:
            break
        last_value_h = value_h.copy()
        last_lambda_k = lambda_k
    
    return lambda_k, value_h, turns

# Policy improvement: policy iteration
def _iteration_average(self, p, n, c, K):
    # Policy of the current round
    policy = [1] * (n + 1)
    # Policy of the previous round
    last_policy = [1] * (n + 1)
    # Counter of policy improvement rounds
    turns = 0
    # Total number of policy evaluation sweeps
    value_turns = 0
    while True:
        # One more round
        turns = turns + 1
        # Evaluate the previous policy: average cost and differential cost table
        lambda_k, value, turns_i = self._get_value_average(p, n, c, K, last_policy)
        value_turns += turns_i

        # Policy improvement
        for i in range(n):
            process_cost = K + (1 - p) * value[0] + p * value[1]
            unprocess_cost = c * i + (1 - p) * value[i] + p * value[i + 1]
            policy[i] = 1 if process_cost < unprocess_cost else 0
        policy[n] = 1

        # Terminate when the policy no longer changes
        if policy == last_policy:
            break
        last_policy = policy.copy()

    # Shift the differential costs by lambda to report the value function
    value_f = [i + lambda_k for i in value]

    return {
        "Policy improvement iteration turns" : turns,
        "Calucating value function total iteration turns" : value_turns,
        "Value function" : format_float(value_f),
        "Policy" : policy,
        "Optimal average cost" : format_float(lambda_k)
    }

Utility functions

# Check that every element of the lists x and y differs by less than EPS
def is_terminal(x, y, EPS):
    if x is None or y is None:
        return False

    for (xi, yi) in zip(x, y):
        if abs(xi - yi) > EPS:
            return False
    return True

# Round a list or a float to 4 decimal places
def format_float(v):
    if type(v) == type(list()):
        return [float('{:.4f}'.format(i)) for i in v]
    else:
        return round(v, 4)

# Derive the policy from the value function
def _get_policy(self, value):
    max_value = max(value)
    policy = [0 if i < max_value - Value.EPS else 1 for i in value]
    return policy

Experiments

The commands used to run the experiments are listed in the appendix.

Discounted Problem

Comparing different values of n

Used value iteration to solve batch manufacturing problem, formulate with discounted problem:
p = 0.5, n = 8, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 95
    Value function : [14.6242, 17.8742, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 95
    Value function : [14.6242, 17.8742, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 12, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 95
    Value function : [14.6242, 17.8742, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Used policy iteration to solve batch manufacturing problem, formulate with discounted problem:
p = 0.5, n = 8, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 297
    Value function : [14.6241, 17.8741, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1]

p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 297
    Value function : [14.6241, 17.8741, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

p = 0.5, n = 12, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 297
    Value function : [14.6241, 17.8741, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

On the one hand, as long as the other parameters are fixed and the processing threshold lies below n, the value function and the policy do not change as n grows. On the other hand, with this set of parameters value iteration and policy iteration yield the same value function and policy; although policy iteration needs fewer policy improvement rounds, the total number of policy evaluation sweeps it performs exceeds the number of iterations value iteration needs.

Comparing different values of K

Used value iteration to solve batch manufacturing problem, formulate with discounted problem:
p = 0.5, n = 10, c = 1.0, K = 3.0, alpha = 0.9.
    Iteration turns : 91
    Value function : [10.5741, 12.9241, 13.5741, 13.5741, 13.5741, 13.5741, 13.5741, 13.5741, 13.5741, 13.5741, 13.5741]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 95
    Value function : [14.6242, 17.8742, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 8.0, alpha = 0.9.
    Iteration turns : 98
    Value function : [18.358, 22.4377, 25.2018, 26.358, 26.358, 26.358, 26.358, 26.358, 26.358, 26.358, 26.358]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
Used policy iteration to solve batch manufacturing problem, formulate with discounted problem:
p = 0.5, n = 10, c = 1.0, K = 3.0, alpha = 0.9.
    Policy improvement iteration turns : 4
    Calucating value function total iteration turns : 378
    Value function : [10.5741, 12.9241, 13.5741, 13.5741, 13.5741, 13.5741, 13.5741, 13.5741, 13.5741, 13.5741, 13.5741]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 297
    Value function : [14.6241, 17.8741, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 8.0, alpha = 0.9.
    Policy improvement iteration turns : 4
    Calucating value function total iteration turns : 407
    Value function : [18.358, 22.4377, 25.2018, 26.358, 26.358, 26.358, 26.358, 26.358, 26.358, 26.358, 26.358]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

When K changes, the two methods still agree; although policy iteration needs fewer policy improvement rounds, its total number of policy evaluation sweeps exceeds the number of iterations required by value iteration.

As K increases, the value of every state increases, and the resulting optimal policy lets orders accumulate for more periods before processing, which matches intuition.

Comparing different values of c

Used value iteration to solve batch manufacturing problem, formulate with discounted problem:
p = 0.5, n = 10, c = 0.5, K = 5.0, alpha = 0.9.
    Iteration turns : 92
    Value function : [10.319, 12.6123, 14.3042, 15.2609, 15.319, 15.319, 15.319, 15.319, 15.319, 15.319, 15.319]
    Policy : [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 95
    Value function : [14.6242, 17.8742, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 2.0, K = 5.0, alpha = 0.9.
    Iteration turns : 97
    Value function : [19.1242, 23.3742, 24.1242, 24.1242, 24.1242, 24.1242, 24.1242, 24.1242, 24.1242, 24.1242, 24.1242]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Used policy iteration to solve batch manufacturing problem, formulate with discounted problem:
p = 0.5, n = 10, c = 0.5, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 4
    Calucating value function total iteration turns : 387
    Value function : [10.3191, 12.6124, 14.3042, 15.2609, 15.3191, 15.3191, 15.3191, 15.3191, 15.3191, 15.3191, 15.3191]
    Policy : [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 297
    Value function : [14.6241, 17.8741, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 2.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 4
    Calucating value function total iteration turns : 398
    Value function : [19.1242, 23.3742, 24.1242, 24.1242, 24.1242, 24.1242, 24.1242, 24.1242, 24.1242, 24.1242, 24.1242]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

When c changes, the two methods still agree; again, policy iteration needs fewer policy improvement rounds but performs more policy evaluation sweeps in total than value iteration needs iterations.

As c increases, the value of every state increases, and the resulting optimal policy processes orders after fewer periods of waiting, which matches intuition.

Comparing different values of α

Used value iteration to solve batch manufacturing problem, formulate with discounted problem:
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.0.
    Iteration turns : 2
    Value function : [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0]
    Policy : [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.1.
    Iteration turns : 6
    Value function : [0.0617, 1.1728, 2.2839, 3.3935, 4.4769, 5.0617, 5.0617, 5.0617, 5.0617, 5.0617, 5.0617]
    Policy : [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.5.
    Iteration turns : 16
    Value function : [0.9615, 2.8845, 4.6538, 5.9615, 5.9615, 5.9615, 5.9615, 5.9615, 5.9615, 5.9615, 5.9615]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 95
    Value function : [14.6242, 17.8742, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Used policy iteration to solve batch manufacturing problem, formulate with discounted problem:
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.0.
    Policy improvement iteration turns : 2
    Calucating value function total iteration turns : 4
    Value function : [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0]
    Policy : [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.1.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 18
    Value function : [0.0617, 1.1728, 2.2839, 3.3935, 4.4769, 5.0617, 5.0617, 5.0617, 5.0617, 5.0617, 5.0617]
    Policy : [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.5.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 49
    Value function : [0.9615, 2.8846, 4.6538, 5.9615, 5.9615, 5.9615, 5.9615, 5.9615, 5.9615, 5.9615, 5.9615]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 297
    Value function : [14.6241, 17.8741, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

These experiments show that, for the discounted problem, value iteration and policy iteration produce identical results and differ only in the number of iterations.

The closer α is to 1, the more weight is placed on future costs; the closer α is to 0, the less. As α increases,

  • the previous iterate has a larger influence on the current one, so convergence slows down and the number of iterations grows;
  • the previous iterate is folded into the current value function with less attenuation, so the converged value function becomes larger;
  • the policy weighs long-term costs more heavily, and since processing eventually pays off most as orders accumulate, the processing decision point moves to smaller states;
  • with α = 1 the costs would keep accumulating and the iteration would never terminate.

Comparing different values of p

Used value iteration to solve batch manufacturing problem, formulate with discounted problem:
p = 0.1, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 85
    Value function : [4.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991]
    Policy : [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 95
    Value function : [14.6242, 17.8742, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.9, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 98
    Value function : [21.1872, 23.803, 25.5072, 26.1872, 26.1872, 26.1872, 26.1872, 26.1872, 26.1872, 26.1872, 26.1872]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
Used policy iteration to solve batch manufacturing problem, formulate with discounted problem:
p = 0.1, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 284
    Value function : [4.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991]
    Policy : [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 297
    Value function : [14.6241, 17.8741, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.9, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 4
    Calucating value function total iteration turns : 401
    Value function : [21.1872, 23.8031, 25.5072, 26.1872, 26.1872, 26.1872, 26.1872, 26.1872, 26.1872, 26.1872, 26.1872]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

These experiments again show that, for the discounted problem, value iteration and policy iteration produce identical results and differ only in the number of iterations.

As p increases, an order arrives with higher probability, so

  • waiting becomes more costly each period, yet a new order is also likely to arrive at the next period, which makes processing at the next period more worthwhile, so the processing decision point is pushed to larger states;
  • because the decision point moves back, the whole value function increases;
  • and because the value function increases, more iterations are needed to converge.

Average Cost Problem

Comparing different values of n

Used value iteration to solve batch manufacturing problem, formulate with average_cost problem:
p = 0.5, n = 8, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 6
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 6
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75
    
p = 0.5, n = 12, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 6
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75
Used policy iteration to solve batch manufacturing problem, formulate with average_cost problem:
p = 0.5, n = 8, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 5
    Calucating value function total iteration turns : 100
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 5
    Calucating value function total iteration turns : 100
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75
    
p = 0.5, n = 12, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 5
    Calucating value function total iteration turns : 100
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75

A few observations:

  • when n already exceeds the waiting window, increasing n changes neither the results of value iteration and policy iteration nor the number of iterations;
  • the obtained value function confirms that the state values form a nondecreasing sequence, as expected, and value[0] equals the optimal average cost, also as expected;
  • comparing the two methods, value iteration and policy iteration yield the same value function and policy;
  • in terms of iteration counts, policy iteration is computationally more expensive.

Comparing different values of K

Used value iteration to solve batch manufacturing problem, formulate with average_cost problem:
p = 0.5, n = 10, c = 1.0, K = 3.0, alpha = 0.9.
    Iteration turns : 5
    Value function : [1.25, 3.75, 4.25, 4.25, 4.25, 4.25, 4.25, 4.25, 4.25, 4.25, 4.25]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.25
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 6
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75
    
p = 0.5, n = 10, c = 1.0, K = 8.0, alpha = 0.9.
    Iteration turns : 18
    Value function : [2.3333, 7.0001, 9.6667, 10.3333, 10.3333, 10.3333, 10.3333, 10.3333, 10.3333, 10.3333, 10.3333]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 2.3333
Used policy iteration to solve batch manufacturing problem, formulate with average_cost problem:
p = 0.5, n = 10, c = 1.0, K = 3.0, alpha = 0.9.
    Policy improvement iteration turns : 4
    Calucating value function total iteration turns : 42
    Value function : [1.25, 3.75, 4.25, 4.25, 4.25, 4.25, 4.25, 4.25, 4.25, 4.25, 4.25]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.25
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 5
    Calucating value function total iteration turns : 100
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75
    
p = 0.5, n = 10, c = 1.0, K = 8.0, alpha = 0.9.
    Policy improvement iteration turns : 6
    Calucating value function total iteration turns : 251
    Value function : [2.3334, 7.0001, 9.6667, 10.3334, 10.3334, 10.3334, 10.3334, 10.3334, 10.3334, 10.3334, 10.3334]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 2.3334

A few observations:

  • When K changes, the two methods still agree; although policy iteration needs fewer policy improvement rounds, its total number of policy evaluation sweeps exceeds the number of iterations required by value iteration.
  • As K increases, the value of every state increases, and the optimal policy lets orders accumulate for more periods before processing, which matches intuition.
  • As K increases, the computation cost grows and convergence takes longer.

Comparing different values of c

Used value iteration to solve batch manufacturing problem, formulate with average_cost problem:
p = 0.5, n = 10, c = 0.5, K = 5.0, alpha = 0.9.
    Iteration turns : 18
    Value function : [1.3333, 4.0, 5.6667, 6.3333, 6.3333, 6.3333, 6.3333, 6.3333, 6.3333, 6.3333, 6.3333]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.3333
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 6
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75
    
p = 0.5, n = 10, c = 2.0, K = 5.0, alpha = 0.9.
    Iteration turns : 5
    Value function : [2.25, 6.75, 7.25, 7.25, 7.25, 7.25, 7.25, 7.25, 7.25, 7.25, 7.25]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 2.25
Used policy iteration to solve batch manufacturing problem, formulate with average_cost problem:
p = 0.5, n = 10, c = 0.5, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 4
    Calucating value function total iteration turns : 240
    Value function : [1.375, 4.1251, 5.875, 6.625, 6.375, 6.375, 6.375, 6.375, 6.375, 6.375, 6.375]
    Policy : [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.375
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 4
    Calucating value function total iteration turns : 96
    Value function : [1.8334, 5.5001, 7.1667, 6.8334, 6.8334, 6.8334, 6.8334, 6.8334, 6.8334, 6.8334, 6.8334]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.8334
    
p = 0.5, n = 10, c = 2.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 4
    Calucating value function total iteration turns : 28
    Value function : [2.25, 6.75, 7.25, 7.25, 7.25, 7.25, 7.25, 7.25, 7.25, 7.25, 7.25]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 2.25

As c increases, the value of every state increases, and the resulting optimal policy processes orders after fewer periods of waiting, which matches intuition.

Comparing different values of p

Used value iteration to solve batch manufacturing problem, formulate with average_cost problem:
p = 0.1, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 8
    Value function : [0.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5]
    Policy : [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 0.5
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 6
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75
    
p = 0.9, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 62
    Value function : [2.5, 5.2777, 6.9444, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 2.5
Used policy iteration to solve batch manufacturing problem, formulate with average_cost problem:
p = 0.1, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 231
    Value function : [0.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5]
    Policy : [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 0.5
    
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 5
    Calucating value function total iteration turns : 100
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75
    
    
p = 0.9, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 4
    Calucating value function total iteration turns : 338
    Value function : [2.5, 5.2778, 6.9444, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 2.5

These experiments show that, for the average cost problem, value iteration and policy iteration also produce identical results, though the iteration counts differ.

As p increases, an order arrives with higher probability, so

  • waiting becomes more costly each period, yet a new order is also likely to arrive at the next period, which makes processing at the next period more worthwhile, so the processing decision point is pushed to larger states;
  • because the decision point moves back, the whole value function increases;
  • and because the value function increases, more iterations are needed to converge.

Summary

The detailed results, comparisons and analysis are given in the experiments above.

  • For the discounted problem, value iteration and policy iteration produce identical results and differ only in computation cost; the latter is more expensive because, besides policy improvement, it must also evaluate each policy, and policy evaluation is relatively costly;
  • For the average cost problem, value iteration and policy iteration also produce identical results, and again the latter is more expensive;
  • Increasing n has no effect on the results as long as the threshold lies within the range of n;
  • Increasing c increases the value of every state and reduces the computation cost; the optimal policy processes orders after fewer periods of waiting, which matches intuition;
  • Increasing K increases the value of every state and the computation cost; the optimal policy lets orders accumulate for more periods, which matches intuition;
  • Increasing α increases the influence of the previous iterate on the current one, so convergence slows down and the number of iterations grows; the previous iterate is folded into the current value function with less attenuation, so the converged value function is larger; the policy weighs long-term costs more heavily, so the processing decision point moves to smaller states;
  • Increasing p makes waiting more costly each period, but a new order is also likely to arrive next period, which makes processing at the next period more worthwhile, so the processing decision point is pushed to larger states; as a result the value function increases overall and more iterations are needed to converge.

Appendix

Code

main.py

import argparse

from algorithm import Value, Policy

if __name__ == '__main__':

    parser = argparse.ArgumentParser(description='Program to solve batch manufacturing problem.')

    parser.add_argument('--p', type=float, nargs='+', default=[0.5], help='Probability of receiving an order for the product.')
    parser.add_argument('--n', type=int, nargs='+', default=[5], help='The maximum number of orders that can remain unfilled.')
    parser.add_argument('--c', type=float, nargs='+', default=[1.0], help='The cost per unfilled order at each time period.')
    parser.add_argument('--K', type=float, nargs='+', default=[3.0], help='The setup cost to process the unfilled orders.')
    parser.add_argument('--alpha', type=float, nargs='+', default=[0.8], help='Discount factor.')
    
    parser.add_argument('--problem', type=str, nargs='+', choices=['discounted', 'average_cost'], default=['discounted', 'average_cost'], help='Problem in [discounted, average_cost].')
    parser.add_argument('--algorithm', type=str, choices=['value', 'policy'], nargs='+', default=['value', 'policy'], help='Algorithm in [value, policy] iteration.')
    
    args = parser.parse_args()

    for algorithm in args.algorithm:

        for problem in args.problem:

            method = {
                'value' : Value(problem),
                'policy' : Policy(problem)
            }.get(algorithm)
            print('Used {} iteration to solve batch manufacturing problem, formulate with {} problem:'.format(algorithm, problem))

            for p in args.p:
                for n in args.n:
                    for c in args.c:
                        for K in args.K:
                            for alpha in args.alpha:
                                result = method.iteration(p, n, c, K, alpha)
                                print('p = {}, n = {}, c = {}, K = {}, alpha = {}.'.format(p, n, c, K, alpha))
                                for key, value in result.items():
                                    print(key, ":", value)
                                print("\n")

utils.py

def is_terminal(x, y, EPS):
    
    if x is None or y is None:
        return False

    for (xi, yi) in zip(x, y):
        if abs(xi - yi) > EPS:
            return False
    return True

def format_float(v):
    if type(v) == type(list()):
        return [float('{:.4f}'.format(i)) for i in v]
    else:
        return round(v, 4)

algorithm.py

from utils import is_terminal, format_float

class Value:

    EPS = 1e-4

    def __init__(self, problem):
        self.problem = problem

    def _get_policy(self, value):
        max_value = max(value)
        policy = [0 if i < max_value - Value.EPS else 1 for i in value]
        return policy
    
    def _iteration_discounted(self, p, n, c, K, alpha):
        value = [0] * (n + 1)
        last_value = [0] * (n + 1)
        turns = 0
        while True:
            turns = turns + 1
            for i in range(n):
                value[i] = min(
                    K + alpha * (1 - p) * last_value[0] + alpha * p * last_value[1],
                    c * i + alpha * (1 - p) * last_value[i] + alpha * p * last_value[i + 1]
                )
            value[n] = K + alpha * (1 - p) * last_value[0] + alpha * p * last_value[1]
            if is_terminal(value, last_value, Value.EPS):
                break
            last_value = value.copy()
        
        return {
            "Iteration turns" : turns,
            "Value function" : format_float(value),
            "Policy" : self._get_policy(value)
        }
    
    def _iteration_average(self, p, n, c, K):
        value_h = [0] * (n + 1)
        last_value_h = [0] * (n + 1)
        lambda_k = last_lambda_k = 0
        turns = 0
        while True:
            turns = turns + 1
            lambda_k = min(
                K + (1 - p) * last_value_h[0] + p * last_value_h[1],
                (1 - p) * last_value_h[0] + p * last_value_h[1]
            )
            for i in range(n):
                value_h[i] = min(
                    K + (1 - p) * last_value_h[0] + p * last_value_h[1],
                    c * i + (1 - p) * last_value_h[i] + p * last_value_h[i + 1]
                ) - lambda_k
            value_h[n] = K + (1 - p) * last_value_h[0] + p * last_value_h[1] - lambda_k
            if is_terminal(value_h, last_value_h, Value.EPS) and abs(last_lambda_k - lambda_k) < Value.EPS:
                break
            last_value_h = value_h.copy()
            last_lambda_k = lambda_k
        
        value = [i + lambda_k for i in value_h]
        
        return {
            "Iteration turns" : turns,
            "Value function" : format_float(value),
            "Policy" : self._get_policy(value),
            "Optimal average cost" : format_float(lambda_k)
        }


    def iteration(self, p, n, c, K, alpha):
        return {
            'discounted' : self._iteration_discounted(p, n, c, K, alpha),
            'average_cost' : self._iteration_average(p, n, c, K)
        }.get(self.problem)


class Policy:

    EPS = 1e-4

    def __init__(self, problem):
        self.problem = problem
    
    def _get_value_discounted(self, p, n, c, K, alpha, policy):
        value = [0] * (n + 1)
        last_value = [0] * (n + 1)
        turns = 0
        while True:
            turns = turns + 1
            for i in range(n):
                value[i] = (
                    K + alpha * (1 - p) * last_value[0] + alpha * p * last_value[1] if policy[i] else
                    c * i + alpha * (1 - p) * last_value[i] + alpha * p * last_value[i + 1]
                )
            value[n] = K + alpha * (1 - p) * last_value[0] + alpha * p * last_value[1]
            if is_terminal(value, last_value, Policy.EPS):
                break
            last_value = value.copy()
        
        return value, turns
    
    def _get_value_average(self, p, n, c, K, policy):
        value_h = [0] * (n + 1)
        last_value_h = [0] * (n + 1)
        last_lambda_k = 0
        turns = 0
        while True:
            turns = turns + 1
            lambda_k = min(
                K + (1 - p) * last_value_h[0] + p * last_value_h[1],
                (1 - p) * last_value_h[0] + p * last_value_h[1]
            )
            for i in range(n):
                value_h[i] = (
                    K + (1 - p) * last_value_h[0] + p * last_value_h[1] if policy[i] else
                    c * i + (1 - p) * last_value_h[i] + p * last_value_h[i + 1]
                ) - lambda_k
            value_h[n] = K + (1 - p) * last_value_h[0] + p * last_value_h[1] - lambda_k
            if is_terminal(value_h, last_value_h, Policy.EPS) and abs(last_lambda_k - lambda_k) < Policy.EPS:
                break
            last_value_h = value_h.copy()
            last_lambda_k = lambda_k
        
        return lambda_k, value_h, turns
    
    def _iteration_discounted(self, p, n, c, K, alpha):
        policy = [1] * (n + 1)
        last_policy = [1] * (n + 1)
        turns = 0
        value_turns = 0
        while True:
            turns = turns + 1
            value, turns_i = self._get_value_discounted(p, n, c, K, alpha, last_policy)
            value_turns += turns_i

            for i in range(n):
                process_cost = K + alpha * (1 - p) * value[0] + alpha * p * value[1]
                unprocess_cost = c * i + alpha * (1 - p) * value[i] + alpha * p * value[i + 1]
                policy[i] = 1 if process_cost < unprocess_cost else 0
            policy[n] = 1

            if policy == last_policy:
                break
            last_policy = policy.copy()

        return {
            "Policy improvement iteration turns" : turns,
            "Calucating value function total iteration turns" : value_turns,
            "Value function" : format_float(value),
            "Policy" : policy
        }
    
    def _iteration_average(self, p, n, c, K):
        policy = [1] * (n + 1)
        last_policy = [1] * (n + 1)
        turns = 0
        value_turns = 0
        while True:
            turns = turns + 1
            lambda_k, value, turns_i = self._get_value_average(p, n, c, K, last_policy)
            value_turns += turns_i

            for i in range(n):
                process_cost = K + (1 - p) * value[0] + p * value[1]
                unprocess_cost = c * i + (1 - p) * value[i] + p * value[i + 1]
                policy[i] = 1 if process_cost < unprocess_cost else 0
            policy[n] = 1

            if last_policy == policy:
                break
            last_policy = policy.copy()
        
        value_f = [i + lambda_k for i in value]

        return {
            "Policy improvement iteration turns" : turns,
            "Calucating value function total iteration turns" : value_turns,
            "Value function" : format_float(value_f),
            "Policy" : policy,
            "Optimal average cost" : format_float(lambda_k)
        }

    def iteration(self, p, n, c, K, alpha):
        return {
            'discounted' : self._iteration_discounted(p, n, c, K, alpha),
            'average_cost' : self._iteration_average(p, n, c, K)
        }.get(self.problem)

Commands

Discounted problem experiments

Command 1.1: comparing different n

python main.py --n 8 10 12 --K 5 --c 1 --alpha 0.9 --p 0.5 --problem discounted --algorithm value policy

Command 1.2: comparing different K

python main.py --n 10 --K 3 5 8 --c 1 --alpha 0.9 --p 0.5 --problem discounted --algorithm value policy

Command 1.3: comparing different c

python main.py --n 10 --K 5 --c 0.5 1 2 --alpha 0.9 --p 0.5 --problem discounted --algorithm value policy

Command 1.4: comparing different alpha

python main.py --n 10 --K 5 --c 1 --alpha 0 0.1 0.5 0.9 --p 0.5 --problem discounted --algorithm value policy

Command 1.5: comparing different p

python main.py --n 10 --K 5 --c 1 --alpha 0.9 --p 0.1 0.5 0.9 --problem discounted --algorithm value policy

Average cost problem experiments

Command 2.1: comparing different n

python main.py --n 8 10 12 --K 5 --c 1 --alpha 0.9 --p 0.5 --problem average_cost --algorithm value policy

Command 2.2: comparing different K

python main.py --n 10 --K 3 5 8 --c 1 --alpha 0.9 --p 0.5 --problem average_cost --algorithm value policy

Command 2.3: comparing different c

python main.py --n 10 --K 5 --c 0.5 1 2 --alpha 0.9 --p 0.5 --problem average_cost --algorithm value policy

Command 2.4: comparing different p

python main.py --n 10 --K 5 --c 1 --alpha 0.9 --p 0.1 0.5 0.9 --problem average_cost --algorithm value policy

