Batch Manufacturing

Stochastic Processes

Posted by Birdie on January 15, 2024

Objectives

Write programs to solve the batch manufacturing problem

  • Formulate and solve both discounted and average cost problems
  • Implement both the value iteration and policy iteration algorithms
  • Take the parameters $c, K, n, p, \alpha$ as inputs

Experiments and analysis

  • Run the code with different parameter inputs to show the performance of value iteration and policy iteration
  • For the discounted problem, consider Exercise 7.8
  • For the average cost problem, write code to find the threshold by stationary analysis
  • Compare the optimal policies of the discounted problem with different values of $\alpha$ and of the average cost problem

Background

Batch Manufacturing

At each time period, a manufacturer receives an order for her product with probability $p$ and receives no order with probability $1 - p$.

At any period, she can either process all the unfilled orders in a batch or process none of them. The maximum number of orders that can remain unfilled is $n$.

The cost per unfilled order at each time period is $c > 0$, and the setup cost to process the unfilled orders is $K > 0$.

The manufacturer wants to find a processing policy that minimizes the total expected cost with discount factor $\alpha < 1$.
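
The dynamics above are simple enough to simulate directly. As a quick sanity check (not part of the experiments below), the discounted cost of any threshold policy can be estimated by Monte Carlo. This is a minimal sketch; the function name simulate_discounted_cost, the truncation horizon and the episode count are illustrative choices:

import random

def simulate_discounted_cost(threshold, p, n, c, K, alpha,
                             horizon=500, episodes=10000, seed=0):
    # Estimate the expected discounted cost of the policy
    # "process as soon as the number of unfilled orders reaches threshold".
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        i, discount, cost = 0, 1.0, 0.0
        for _ in range(horizon):
            process = (i >= threshold) or (i == n)   # processing is forced when the buffer is full
            cost += discount * (K if process else c * i)
            i = 0 if process else i                  # processing clears the backlog
            if rng.random() < p:                     # a new order arrives with probability p
                i += 1
            discount *= alpha
        total += cost
    return total / episodes

With p = 0.5, c = 1.0, K = 5.0, alpha = 0.9 and threshold 2, the estimate should land close to the $J^\ast(0) \approx 14.62$ reported in the experiments below.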

Discounted Problem

Discounted problems $(0<\alpha<1)$

\[\lim _{N \rightarrow \infty} \underset{\substack{w_k \\ k=0,1, \ldots}}{E}\left\lbrace\sum_{k=0}^{N-1} \alpha^k g\left(x_k, \mu_k\left(x_k\right), w_k\right)\right\rbrace\]
  • $\alpha$ is the discount factor
  • $g(x_k, \mu_k\left(x_k\right), w_k)$ is the cost incurred at stage $k$ when action $\mu_k(x_k)$ is taken in state $x_k$ and the random disturbance is $w_k$

Value Iteration

Given initial values $J_0(1), \ldots, J_0(n)$, compute $J_k(i)$ by the dynamic programming recursion

\[J_{k+1}(i)=\min _{u \in U(i)}\left[g(i, u)+\alpha \sum_{j=1}^n p_{i j}(u) J_k(j)\right], \forall i\]

As $k \rightarrow \infty$, $J_k(i)$ converges to $J^\ast(i)$ for every state $i$.
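
In matrix form this update is only a few lines of code. The sketch below is generic: it assumes the transition matrices P[u] (one per action, each of shape (S, S)) and the stage-cost vectors g[u] are given as numpy arrays, and actions that are not admissible in some state can be handled by giving them an infinite stage cost there:

import numpy as np

def value_iteration(P, g, alpha, eps=1e-4, max_iters=100000):
    # Generic discounted value iteration.
    # P: dict action -> (S, S) transition matrix; g: dict action -> (S,) stage-cost vector.
    S = next(iter(g.values())).shape[0]
    J = np.zeros(S)
    for _ in range(max_iters):
        # Bellman update: J_{k+1}(i) = min_u [ g(i, u) + alpha * sum_j p_ij(u) J_k(j) ]
        Q = np.stack([g[u] + alpha * (P[u] @ J) for u in P])
        J_next = Q.min(axis=0)
        if np.max(np.abs(J_next - J)) < eps:
            return J_next
        J = J_next
    return J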

Policy Iteration

By the Bellman equation, the optimal value of state $i$ satisfies:

\[J^\ast(i)=\min _{u \in U(i)}\left[g(i, u)+\alpha \sum_{j=1}^n p_{i j}(u) J^\ast(j)\right], \forall i\]

For a given policy $\mu$, the corresponding costs $J_\mu(1), \cdots, J_\mu(n)$ satisfy

\[J_\mu(i)=g(i, \mu(i))+\alpha \sum_{j=1}^n p_{i j}(\mu(i)) J_\mu(j), \forall i\]

Given initial values $J_0(1), \ldots, J_0(n)$, they can be computed iteratively as

\[J_{k+1}(i)=g(i, \mu(i))+\alpha \sum_{j=1}^n p_{i j}(\mu(i)) J_k(j), \forall i\]

As $k \rightarrow \infty$, $J_k(i)$ converges to $J_{\mu}(i)$ for every state $i$.
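
Because the evaluation equation above is linear in $J_\mu$, for a small state space it can also be solved exactly in one step instead of being iterated. A minimal sketch, assuming P_mu and g_mu (the transition matrix and stage-cost vector induced by $\mu$) are given as numpy arrays:

import numpy as np

def evaluate_policy_exact(P_mu, g_mu, alpha):
    # Solve (I - alpha * P_mu) J_mu = g_mu for the discounted cost of a fixed policy.
    S = g_mu.shape[0]
    return np.linalg.solve(np.eye(S) - alpha * P_mu, g_mu)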

A stationary policy $\mu$ is optimal if and only if $\mu(i)$ attains the minimum in the Bellman equation for every state $i$.

The policy improvement step is therefore

\[\mu^{k+1}(i)=\arg \min _{u \in U(i)}\left[g(i, u)+\alpha \sum_{j=1}^n p_{i j}(u) J_{\mu^k}(j)\right], \forall i\]

This generates a sequence of improving policies and terminates with an optimal policy.

Average Cost Problem

Average cost per stage problems

\[\lim _{N \rightarrow \infty} \frac{1}{N} \underset{\substack{w_k \\ k=0,1, \ldots}}{E}\left\lbrace\sum_{k=0}^{N-1} g\left(x_k, \mu_k\left(x_k\right), w_k\right)\right\rbrace\]
  • $g(x_k, \mu_k\left(x_k\right), w_k)$ is the cost incurred at stage $k$ when action $\mu_k(x_k)$ is taken in state $x_k$ and the random disturbance is $w_k$

Value Iteration

The Bellman equation is:

\[\lambda^\ast+h^\ast(i)=\min _{u \in U(i)}\left[g(i, u)+\sum_{j=1}^n p_{i j}(u) h^\ast(j)\right], \forall i\]

The unknowns are $\lambda^\ast$ and $h^\ast(i)$: $\lambda^\ast$ is the optimal average cost, and $h^\ast(i)$ is the differential cost of state $i$, i.e. the difference between the optimal cost of that state and the optimal average cost.

Given $h_k(i)$ and a fixed reference state $s$, compute

\[\lambda_k=\min _{u \in U(s)}\left[g(s, u)+\sum_{j=1}^n p_{s j}(u) h_k(j)\right]\]

\[h_{k+1}(i)=\min _{u \in U(i)}\left[g(i, u)+\sum_{j=1}^n p_{i j}(u) h_k(j)\right]-\lambda_k\]
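
A compact sketch of this relative value iteration, under the same assumptions as before (generic P[u] and g[u] numpy arrays) and with state 0 playing the role of the fixed state $s$:

import numpy as np

def relative_value_iteration(P, g, eps=1e-6, max_iters=100000, ref_state=0):
    # Average-cost value iteration: returns (lambda, h) once both change by less than eps.
    S = next(iter(g.values())).shape[0]
    h = np.zeros(S)
    lam = 0.0
    for _ in range(max_iters):
        Q = np.stack([g[u] + P[u] @ h for u in P])   # one-step costs for every (action, state)
        lam_next = Q[:, ref_state].min()             # lambda_k, taken at the reference state
        h_next = Q.min(axis=0) - lam_next
        if np.max(np.abs(h_next - h)) < eps and abs(lam_next - lam) < eps:
            return lam_next, h_next
        h, lam = h_next, lam_next
    return lam, h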

Policy Iteration

At iteration $k$, fix the policy $\mu^k$ and evaluate it by solving

\[\begin{gathered} \lambda^k+h^k(i)=g\left(i, \mu^k(i)\right)+\sum_{j=1}^n p_{i j}\left(\mu^k(i)\right) h^k(j), \forall i \\ h^k(n)=0 \end{gathered}\]

Policy improvement:

\[\mu^{k+1}(i)=\arg \min _{u \in U(i)}\left[g(i, u)+\sum_{j=1}^n p_{i j}(u) h^k(j)\right], \forall i\]

The algorithm terminates when $\lambda^{k+1}=\lambda^k$ and $h^{k+1}(i)=h^k(i)$ for all $i$.
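
For a fixed policy, these evaluation equations together with the normalization $h^k(n)=0$ form a square linear system in $\lambda^k$ and $h^k$, so they can also be solved directly. A minimal sketch, assuming P_mu and g_mu (the transition matrix and stage-cost vector under $\mu^k$) are given as numpy arrays:

import numpy as np

def evaluate_policy_average_exact(P_mu, g_mu):
    # Solve  lambda + h(i) = g(i) + sum_j p_ij h(j)  for all i, with h(last state) = 0.
    # Unknown vector: x = [lambda, h(0), ..., h(S-1)].
    S = g_mu.shape[0]
    A = np.zeros((S + 1, S + 1))
    b = np.zeros(S + 1)
    A[:S, 0] = 1.0                  # lambda appears in every evaluation equation
    A[:S, 1:] = np.eye(S) - P_mu    # h(i) - sum_j p_ij h(j)
    b[:S] = g_mu
    A[S, S] = 1.0                   # normalization: h of the last state is zero
    x = np.linalg.solve(A, b)
    return x[0], x[1:]              # (average cost lambda, differential costs h)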

Method

Markov decision process formulation

  • State $i \in\lbrace0,1, \cdots, n\rbrace$ : number of unfilled orders
  • Action $u \in\lbrace0,1\rbrace$ : process (1) or not (0)

    \[u \in\lbrace0,1\rbrace, \text { if } i<n ; \quad u=1 \text {, if } i=n\]
  • State Transition $p_{i j}(u)$ :

    \[\begin{gathered} p_{i 1}(1)=p_{i(i+1)}(0)=p, \quad p_{i 0}(1)=p_{i i}(0)=1-p, \quad i<n \\ p_{n 1}(1)=p, \quad p_{n 0}(1)=1-p \end{gathered}\]
  • Per-stage cost

    \[g(i, 1)=K, \quad g(i, 0)=c i\]
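
These transition probabilities and stage costs can be assembled mechanically into the arrays expected by the generic sketches in the previous section. The helper name build_mdp is illustrative, and the forced action at $i=n$ is encoded by giving action 0 an infinite stage cost there:

import numpy as np

def build_mdp(p, n, c, K):
    # Transition matrices P[u] and stage costs g[u] for u in {0: do not process, 1: process}.
    S = n + 1                                # states 0, 1, ..., n unfilled orders
    P = {0: np.zeros((S, S)), 1: np.zeros((S, S))}
    g = {0: np.full(S, np.inf), 1: np.full(S, float(K))}
    for i in range(S):
        P[1][i, 1] = p                       # after processing, a new order arrives ...
        P[1][i, 0] = 1 - p                   # ... or it does not
        if i < n:                            # not processing is only admissible for i < n
            g[0][i] = c * i
            P[0][i, i + 1] = p
            P[0][i, i] = 1 - p
    return P, g

Running the generic value iteration sketch above on build_mdp(0.5, 10, 1.0, 5.0) with alpha = 0.9 should reproduce the optimal values reported in the experiments, e.g. $J^\ast(0) \approx 14.62$.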

Under the Discounted Problem

Value Iteration

Initial values

\[J_0(i) = 0,\forall i\]

Value iteration

\[\begin{aligned} J_{k+1}(i)= & \min \left[K+\alpha(1-p) J_k(0)+\alpha p J_k(1),\right. \\ & \left.\; c i+\alpha(1-p) J_k(i)+\alpha p J_k(i+1)\right], \quad i=0,1, \cdots, n-1, \\ J_{k+1}(n)= & K+\alpha(1-p) J_k(0)+\alpha p J_k(1) \end{aligned}\]

Convergence:

\[J^\ast(i)=\lim _{k \rightarrow \infty} J_k(i)\]

Policy Iteration

Initial values

\[\mu^0(i)=1, \forall i\]

Policy evaluation

\[\begin{aligned} J_{\mu^{k+1}}(i)&=\left\lbrace \begin{aligned} K+\alpha(1-p) J_{\mu^k}(0)+\alpha p J_{\mu^k}(1), \quad \mu^k(i)=1\\ c i+\alpha(1-p) J_{\mu^k}(i)+\alpha p J_{\mu^k}(i+1), \quad \mu^k(i)=0\\ \end{aligned} \right . \quad i=0,1, \cdots, n-1,\\ J_{\mu^{k+1}}(n)&= K+\alpha(1-p) J_{\mu^k}(0)+\alpha p J_{\mu^k}(1) \end{aligned}\]

Policy improvement

\[\begin{aligned} \text{cost}_{process}(i) &= K+\alpha(1-p) J_{\mu^k}(0)+ \alpha p J_{\mu^k}(1)\\ \text{cost}_{unprocess}(i) &= c i+\alpha(1-p) J_{\mu^k}(i)+ \alpha p J_{\mu^k}(i+1)\\ \mu^{k+1}(i)&=\left\lbrace \begin{aligned} 0, \quad \text{cost}_{process}(i)\geq\text{cost}_{unprocess}(i)\\ 1, \quad \text{cost}_{process}(i)<\text{cost}_{unprocess}(i)\\ \end{aligned} \right. \quad \text{ if } i = 0, \cdots, n-1\\ \mu^{k+1}(n)&=1 \end{aligned}\]

Under the Average Cost Problem

The Bellman equation is:

\[\begin{aligned} \lambda^\ast+h^\ast(i)= & \min \left[K+(1-p) h^\ast(0)+p h^\ast(1),\right. \\ & \left.\; c i+(1-p) h^\ast(i)+p h^\ast(i+1)\right], \quad i=0,1, \cdots, n-1 \\ \lambda^\ast+h^\ast(n)= & K+(1-p) h^\ast(0)+p h^\ast(1) \end{aligned}\]

Value Iteration

Initial values

\[J_0(i) = 0,\forall i\]

State $0$ is used as the fixed reference state: under any policy, every cycle starts from state $0$ and eventually returns to it.

Value iteration

\[\begin{aligned} \lambda_{k}=&(1 - p) J_k(0) + p J_k(1)\\ J_{k+1}(i)= & \min \left[K+(1-p) J_k(0)+ p J_k(1),\right. \\ & \left.\; c i+(1-p) J_k(i)+ p J_k(i+1)\right]-\lambda_{k}, \quad i=0,1, \cdots, n-1, \\ J_{k+1}(n)= & K+(1-p) J_k(0)+ p J_k(1)-\lambda_{k} \end{aligned}\]

Convergence:

\[J^\ast(i)=\lim _{k \rightarrow \infty} J_k(i),\lambda^\ast=\lim _{k \rightarrow \infty} \lambda_k\]

Policy Iteration

Initial values

\[\mu^0(i)=1, \forall i\]

Policy evaluation

\[\begin{aligned} \lambda_{k}&=(1 - p) J_{\mu^k}(0) + p J_{\mu^k}(1)\\ J_{\mu^{k+1}}(i)&=\left\lbrace \begin{aligned} K+(1-p) J_{\mu^k}(0)+ p J_{\mu^k}(1)-\lambda_{k}, \quad \mu^k(i)=1\\ c i+(1-p) J_{\mu^k}(i)+ p J_{\mu^k}(i+1)-\lambda_{k}, \quad \mu^k(i)=0\\ \end{aligned} \right . \quad i=0,1, \cdots, n-1,\\ J_{\mu^{k+1}}(n)&= K+(1-p) J_{\mu^k}(0)+ p J_{\mu^k}(1)-\lambda_{k} \end{aligned}\]

Policy improvement

\[\begin{aligned} \text{cost}_{process}(i) &= K+(1-p) J_{\mu^k}(0)+ p J_{\mu^k}(1)+\lambda_{k}\\ \text{cost}_{unprocess}(i) &= c i+(1-p) J_{\mu^k}(i)+ p J_{\mu^k}(i+1)+\lambda_{k}\\ \mu^{k+1}(i)&=\left\lbrace \begin{aligned} 0, \quad \text{cost}_{process}(i)\geq\text{cost}_{unprocess}(i)\\ 1, \quad \text{cost}_{process}(i)<\text{cost}_{unprocess}(i)\\ \end{aligned} \right. \quad \text{ if } i = 0, \cdots, n-1\\ \mu^{k+1}(n)&=1 \end{aligned}\]

Deriving the policy from the value function

Normally the policy would be obtained greedily from an action-value function. In this experiment, however, value iteration returns a state-value function, which only measures how good each state is. Since the batch manufacturing problem has just two actions, process and unprocess, a greedy argument still applies: if unprocessing leads to a better state we choose unprocess, otherwise we process. Therefore, once the value function has converged, every state whose value attains the maximum is assigned the process action, and all other states are assigned unprocess.
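
A more direct cross-check, different from the maximum-value heuristic just described, is to do one Bellman lookahead on the converged value function and take the cheaper action in every state. A sketch for the discounted case (the function name is illustrative); up to ties between the two actions, it should agree with the heuristic above:

def greedy_policy_discounted(value, p, c, K, alpha):
    # Extract a policy by one Bellman lookahead on the converged value function.
    n = len(value) - 1
    policy = []
    for i in range(n):
        process_cost = K + alpha * ((1 - p) * value[0] + p * value[1])
        unprocess_cost = c * i + alpha * ((1 - p) * value[i] + p * value[i + 1])
        policy.append(1 if process_cost < unprocess_cost else 0)
    policy.append(1)        # at i = n processing is the only admissible action
    return policy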

Implementation

Value Iteration for the Discounted Problem

def _iteration_discounted(self, p, n, c, K, alpha):
    # Current value function table, initialized to the zero vector
    value = [0] * (n + 1)
    # Previous value function table, initialized to the zero vector
    last_value = [0] * (n + 1)
    # Iteration counter
    turns = 0
    while True:
        # Increment the iteration counter
        turns = turns + 1
        # Update value[0...n-1] according to the update equation
        for i in range(n):
            value[i] = min(
                K + alpha * (1 - p) * last_value[0] + alpha * p * last_value[1],
                c * i + alpha * (1 - p) * last_value[i] + alpha * p * last_value[i + 1]
            )
        # Update value[n] according to the update equation
        value[n] = K + alpha * (1 - p) * last_value[0] + alpha * p * last_value[1]
        # Check convergence: the two vectors differ elementwise by less than EPS
        if is_terminal(value, last_value, Value.EPS):
            break
        last_value = value.copy()
    
    return {
        "Iteration turns" : turns,
        "Value function" : format_float(value),
        "Policy" : self._get_policy(value)
    }
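
Assuming this method is part of the Value class from algorithm.py in the appendix, a single run looks like this; the printed policy is the one reported for these parameters in the experiments below:

from algorithm import Value

result = Value('discounted').iteration(p=0.5, n=10, c=1.0, K=5.0, alpha=0.9)
print(result['Policy'])    # [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]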

Policy Iteration for the Discounted Problem

# Policy evaluation: compute the value table for the given policy
def _get_value_discounted(self, p, n, c, K, alpha, policy):
    value = [0] * (n + 1)
    last_value = [0] * (n + 1)
    turns = 0
    while True:
        turns = turns + 1
        for i in range(n):
            value[i] = (
                K + alpha * (1 - p) * last_value[0] + alpha * p * last_value[1] if policy[i] else
                c * i + alpha * (1 - p) * last_value[i] + alpha * p * last_value[i + 1]
            )
        value[n] = K + alpha * (1 - p) * last_value[0] + alpha * p * last_value[1]
        if is_terminal(value, last_value, Policy.EPS):
            break
        last_value = value.copy()
    
    return value, turns

# Policy improvement: the outer policy iteration loop
def _iteration_discounted(self, p, n, c, K, alpha):
    # Current policy, initialized to the all-process policy
    policy = [1] * (n + 1)
    # Previous round's policy
    last_policy = [1] * (n + 1)
    # Policy improvement iteration counter
    turns = 0
    # Total number of value evaluation sweeps
    value_turns = 0
    while True:
        # Increment the iteration counter
        turns = turns + 1
        # Evaluate the previous policy to obtain its value table
        value, turns_i = self._get_value_discounted(p, n, c, K, alpha, last_policy)
        value_turns += turns_i

        # Improve the policy greedily with respect to the value table
        for i in range(n):
            # Cost if we process
            process_cost = K + alpha * (1 - p) * value[0] + alpha * p * value[1]
            # Cost if we do not process
            unprocess_cost = c * i + alpha * (1 - p) * value[i] + alpha * p * value[i + 1]
            # Pick the action with the smaller cost
            policy[i] = 1 if process_cost < unprocess_cost else 0
        # The only admissible action at state n is 1
        policy[n] = 1

        # Check termination: the policy no longer changes
        if policy == last_policy:
            break
        last_policy = policy.copy()

    return {
        "Policy improvement iteration turns" : turns,
        "Calucating value function total iteration turns" : value_turns,
        "Value function" : format_float(value),
        "Policy" : policy
    }

Value Iteration for the Average Cost Problem

def _iteration_average(self, p, n, c, K):
    # Current differential value function
    value_h = [0] * (n + 1)
    # Previous differential value function
    last_value_h = [0] * (n + 1)
    # Previous estimate of the average cost
    last_lambda_k = 0
    # Iteration counter
    turns = 0
    while True:
        # Increment the iteration counter
        turns = turns + 1
        # With state 0 as the fixed reference state, compute the average cost estimate
        lambda_k = min(
            K + (1 - p) * last_value_h[0] + p * last_value_h[1],
            (1 - p) * last_value_h[0] + p * last_value_h[1]
        )
        # Update the differential values of states 0...n-1
        for i in range(n):
            value_h[i] = min(
                K + (1 - p) * last_value_h[0] + p * last_value_h[1],
                c * i + (1 - p) * last_value_h[i] + p * last_value_h[i + 1]
            ) - lambda_k
        # Update the differential value of state n
        value_h[n] = K + (1 - p) * last_value_h[0] + p * last_value_h[1] - lambda_k
        # Check termination: both h and lambda have converged
        if is_terminal(value_h, last_value_h, Value.EPS) and abs(last_lambda_k - lambda_k) < Value.EPS:
            break
        # Carry the current iterates over to the next round
        last_value_h = value_h.copy()
        last_lambda_k = lambda_k
    
    value = [i + lambda_k for i in value_h]
    
    return {
        "Iteration turns" : turns,
        "Value function" : format_float(value),
        "Policy" : self._get_policy(value),
        "Optimal average cost" : format_float(lambda_k)
    }

Policy Iteration for the Average Cost Problem

# Policy evaluation: compute the average cost and the differential value table for the given policy
def _get_value_average(self, p, n, c, K, policy):
    value_h = [0] * (n + 1)
    last_value_h = [0] * (n + 1)
    last_lambda_k = 0
    turns = 0
    while True:
        turns = turns + 1
        lambda_k = min(
            K + (1 - p) * last_value_h[0] + p * last_value_h[1],
            (1 - p) * last_value_h[0] + p * last_value_h[1]
        )
        for i in range(n):
            value_h[i] = (
                K + (1 - p) * last_value_h[0] + p * last_value_h[1] if policy[i] else
                c * i + (1 - p) * last_value_h[i] + p * last_value_h[i + 1]
            ) - lambda_k
        value_h[n] = K + (1 - p) * last_value_h[0] + p * last_value_h[1] - lambda_k
        if is_terminal(value_h, last_value_h, Policy.EPS) and abs(last_lambda_k - lambda_k) < Policy.EPS:
            break
        last_value_h = value_h.copy()
        last_lambda_k = lambda_k
    
    return lambda_k, value_h, turns

# Policy improvement: the outer policy iteration loop
def _iteration_average(self, p, n, c, K):
    # Initialize the current policy
    policy = [1] * (n + 1)
    # Initialize the previous policy
    last_policy = [1] * (n + 1)
    # Policy improvement iteration counter
    turns = 0
    # Total number of value evaluation sweeps
    value_turns = 0
    while True:
        # Increment the iteration counter
        turns = turns + 1
        # Evaluate the previous policy: compute its average cost and differential value table
        lambda_k, value, turns_i = self._get_value_average(p, n, c, K, last_policy)
        value_turns += turns_i

        # Policy improvement
        for i in range(n):
            process_cost = K + (1 - p) * value[0] + p * value[1]
            unprocess_cost = c * i + (1 - p) * value[i] + p * value[i + 1]
            policy[i] = 1 if process_cost < unprocess_cost else 0
        policy[n] = 1

        # Check termination: the policy no longer changes
        if policy == last_policy:
            break
        last_policy = policy.copy()

    value_f = [i + lambda_k for i in value]

    return {
        "Policy improvement iteration turns" : turns,
        "Calucating value function total iteration turns" : value_turns,
        "Value function" : format_float(value_f),
        "Policy" : policy,
        "Optimal average cost" : format_float(lambda_k)
    }

Utility functions

# Check that every pair of corresponding elements of lists x and y differs by less than EPS
def is_terminal(x, y, EPS):
    if x is None or y is None:
        return False

    for (xi, yi) in zip(x, y):
        if abs(xi - yi) > EPS:
            return False
    return True

# Round a list or a single float to 4 decimal places
def format_float(v):
    if type(v) == type(list()):
        return [float('{:.4f}'.format(i)) for i in v]
    else:
        return round(v, 4)

# Derive the policy from the value function
def _get_policy(self, value):
    max_value = max(value)
    policy = [0 if i < max_value - Value.EPS else 1 for i in value]
    return policy
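
As a concrete illustration, applying _get_policy to the converged value function of the discounted experiment with p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9 (reported below) picks out exactly the states whose value has saturated:

from algorithm import Value

value = [14.6242, 17.8742] + [19.6242] * 9        # converged values from that experiment
print(Value('discounted')._get_policy(value))     # [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]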

Experiments

The commands used for the experiments are listed in the appendix.

Discounted Problem

Comparing different $n$

Used value iteration to solve batch manufacturing problem, formulate with discounted problem:
p = 0.5, n = 8, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 95
    Value function : [14.6242, 17.8742, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 95
    Value function : [14.6242, 17.8742, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 12, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 95
    Value function : [14.6242, 17.8742, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Used policy iteration to solve batch manufacturing problem, formulate with discounted problem:
p = 0.5, n = 8, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 297
    Value function : [14.6241, 17.8741, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1]

p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 297
    Value function : [14.6241, 17.8741, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

p = 0.5, n = 12, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 297
    Value function : [14.6241, 17.8741, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

On the one hand, as long as the other parameters stay fixed and the processing threshold is smaller than $n$, the value function and the policy do not change as $n$ grows. On the other hand, with this set of parameters value iteration and policy iteration return the same value function and policy; although policy iteration needs fewer policy improvement rounds, its total number of policy evaluation sweeps is larger than the number of sweeps value iteration needs.

Comparing different $K$

Used value iteration to solve batch manufacturing problem, formulate with discounted problem:
p = 0.5, n = 10, c = 1.0, K = 3.0, alpha = 0.9.
    Iteration turns : 91
    Value function : [10.5741, 12.9241, 13.5741, 13.5741, 13.5741, 13.5741, 13.5741, 13.5741, 13.5741, 13.5741, 13.5741]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 95
    Value function : [14.6242, 17.8742, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 8.0, alpha = 0.9.
    Iteration turns : 98
    Value function : [18.358, 22.4377, 25.2018, 26.358, 26.358, 26.358, 26.358, 26.358, 26.358, 26.358, 26.358]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
Used policy iteration to solve batch manufacturing problem, formulate with discounted problem:
p = 0.5, n = 10, c = 1.0, K = 3.0, alpha = 0.9.
    Policy improvement iteration turns : 4
    Calucating value function total iteration turns : 378
    Value function : [10.5741, 12.9241, 13.5741, 13.5741, 13.5741, 13.5741, 13.5741, 13.5741, 13.5741, 13.5741, 13.5741]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 297
    Value function : [14.6241, 17.8741, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 8.0, alpha = 0.9.
    Policy improvement iteration turns : 4
    Calucating value function total iteration turns : 407
    Value function : [18.358, 22.4377, 25.2018, 26.358, 26.358, 26.358, 26.358, 26.358, 26.358, 26.358, 26.358]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

When $K$ changes, the two methods still produce identical results; again, policy iteration needs fewer policy improvement rounds, but its total number of policy evaluation sweeps exceeds the number of sweeps value iteration needs.

As $K$ increases, the value of every state increases, and the optimal policy lets orders accumulate for more periods before processing, which matches intuition.

Comparing different $c$

Used value iteration to solve batch manufacturing problem, formulate with discounted problem:
p = 0.5, n = 10, c = 0.5, K = 5.0, alpha = 0.9.
    Iteration turns : 92
    Value function : [10.319, 12.6123, 14.3042, 15.2609, 15.319, 15.319, 15.319, 15.319, 15.319, 15.319, 15.319]
    Policy : [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 95
    Value function : [14.6242, 17.8742, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 2.0, K = 5.0, alpha = 0.9.
    Iteration turns : 97
    Value function : [19.1242, 23.3742, 24.1242, 24.1242, 24.1242, 24.1242, 24.1242, 24.1242, 24.1242, 24.1242, 24.1242]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Used policy iteration to solve batch manufacturing problem, formulate with discounted problem:
p = 0.5, n = 10, c = 0.5, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 4
    Calucating value function total iteration turns : 387
    Value function : [10.3191, 12.6124, 14.3042, 15.2609, 15.3191, 15.3191, 15.3191, 15.3191, 15.3191, 15.3191, 15.3191]
    Policy : [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 297
    Value function : [14.6241, 17.8741, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 2.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 4
    Calucating value function total iteration turns : 398
    Value function : [19.1242, 23.3742, 24.1242, 24.1242, 24.1242, 24.1242, 24.1242, 24.1242, 24.1242, 24.1242, 24.1242]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

When $c$ changes, the two methods still produce identical results; again, policy iteration needs fewer policy improvement rounds, but its total number of policy evaluation sweeps exceeds the number of sweeps value iteration needs.

As $c$ increases, the value of every state increases, and the optimal policy holds orders for fewer periods before processing, which matches intuition.

Comparing different $\alpha$

Used value iteration to solve batch manufacturing problem, formulate with discounted problem:
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.0.
    Iteration turns : 2
    Value function : [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0]
    Policy : [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.1.
    Iteration turns : 6
    Value function : [0.0617, 1.1728, 2.2839, 3.3935, 4.4769, 5.0617, 5.0617, 5.0617, 5.0617, 5.0617, 5.0617]
    Policy : [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.5.
    Iteration turns : 16
    Value function : [0.9615, 2.8845, 4.6538, 5.9615, 5.9615, 5.9615, 5.9615, 5.9615, 5.9615, 5.9615, 5.9615]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 95
    Value function : [14.6242, 17.8742, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Used policy iteration to solve batch manufacturing problem, formulate with discounted problem:
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.0.
    Policy improvement iteration turns : 2
    Calucating value function total iteration turns : 4
    Value function : [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0]
    Policy : [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.1.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 18
    Value function : [0.0617, 1.1728, 2.2839, 3.3935, 4.4769, 5.0617, 5.0617, 5.0617, 5.0617, 5.0617, 5.0617]
    Policy : [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.5.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 49
    Value function : [0.9615, 2.8846, 4.6538, 5.9615, 5.9615, 5.9615, 5.9615, 5.9615, 5.9615, 5.9615, 5.9615]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 297
    Value function : [14.6241, 17.8741, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

These experiments show that, for the discounted problem, value iteration and policy iteration arrive at the same value functions and essentially the same policies, differing only in iteration counts. (The one visible difference is at $\alpha=0$, where both actions cost exactly 5 in state 5, so the two methods break the tie differently.)

The closer $\alpha$ is to 1, the more weight future costs receive; the closer it is to 0, the less they matter. As $\alpha$ increases,

  • the previous iterate has a larger influence on the current one, so convergence slows down and the number of iterations grows;
  • the previous iterate is folded into the current value function with less attenuation, so the converged value function becomes larger;
  • the policy puts more weight on the long run, where processing pays off the most, so the states at which processing starts move earlier (the threshold decreases);
  • when $\alpha=1$, the costs keep accumulating and the iteration never terminates.

Comparing different $p$

Used value iteration to solve batch manufacturing problem, formulate with discounted problem:
p = 0.1, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 85
    Value function : [4.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991]
    Policy : [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 95
    Value function : [14.6242, 17.8742, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.9, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 98
    Value function : [21.1872, 23.803, 25.5072, 26.1872, 26.1872, 26.1872, 26.1872, 26.1872, 26.1872, 26.1872, 26.1872]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
Used policy iteration to solve batch manufacturing problem, formulate with discounted problem:
p = 0.1, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 284
    Value function : [4.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991]
    Policy : [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 297
    Value function : [14.6241, 17.8741, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.9, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 4
    Calucating value function total iteration turns : 401
    Value function : [21.1872, 23.8031, 25.5072, 26.1872, 26.1872, 26.1872, 26.1872, 26.1872, 26.1872, 26.1872, 26.1872]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

These experiments again show that, for the discounted problem, value iteration and policy iteration produce the same results and differ only in the number of iterations.

As $p$ increases, an order arrives with higher probability, so

  • waiting becomes more attractive: another order is likely to arrive in the next period and make the eventual batch more worthwhile, so the state at which processing starts moves later;
  • because the threshold moves later, the value function increases across all states;
  • and because the values are larger, more iterations are needed to converge.

Average Cost Problem

Comparing different $n$

Used value iteration to solve batch manufacturing problem, formulate with average_cost problem:
p = 0.5, n = 8, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 6
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 6
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75
    
p = 0.5, n = 12, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 6
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75
Used policy iteration to solve batch manufacturing problem, formulate with average_cost problem:
p = 0.5, n = 8, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 5
    Calucating value function total iteration turns : 100
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 5
    Calucating value function total iteration turns : 100
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75
    
p = 0.5, n = 12, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 5
    Calucating value function total iteration turns : 100
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75

We can observe the following:

  • once $n$ already exceeds the processing threshold, increasing $n$ does not change the results of value iteration or policy iteration, nor even the number of iterations;
  • the reported value function confirms that the state values form a non-decreasing sequence, as expected, and that value[0] equals the optimal average cost, also as expected;
  • comparing the two methods, value iteration and policy iteration produce the same value function and policy;
  • in terms of iteration counts, policy iteration requires more computation.

Comparing different $K$

Used value iteration to solve batch manufacturing problem, formulate with average_cost problem:
p = 0.5, n = 10, c = 1.0, K = 3.0, alpha = 0.9.
    Iteration turns : 5
    Value function : [1.25, 3.75, 4.25, 4.25, 4.25, 4.25, 4.25, 4.25, 4.25, 4.25, 4.25]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.25
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 6
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75
    
p = 0.5, n = 10, c = 1.0, K = 8.0, alpha = 0.9.
    Iteration turns : 18
    Value function : [2.3333, 7.0001, 9.6667, 10.3333, 10.3333, 10.3333, 10.3333, 10.3333, 10.3333, 10.3333, 10.3333]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 2.3333
Used policy iteration to solve batch manufacturing problem, formulate with average_cost problem:
p = 0.5, n = 10, c = 1.0, K = 3.0, alpha = 0.9.
    Policy improvement iteration turns : 4
    Calucating value function total iteration turns : 42
    Value function : [1.25, 3.75, 4.25, 4.25, 4.25, 4.25, 4.25, 4.25, 4.25, 4.25, 4.25]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.25
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 5
    Calucating value function total iteration turns : 100
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75
    
p = 0.5, n = 10, c = 1.0, K = 8.0, alpha = 0.9.
    Policy improvement iteration turns : 6
    Calucating value function total iteration turns : 251
    Value function : [2.3334, 7.0001, 9.6667, 10.3334, 10.3334, 10.3334, 10.3334, 10.3334, 10.3334, 10.3334, 10.3334]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 2.3334

We can observe the following:

  • when $K$ changes, the two methods still produce identical results; although policy iteration needs fewer policy improvement rounds, its total number of policy evaluation sweeps exceeds the number of sweeps value iteration needs;
  • as $K$ increases, the value of every state increases, and the optimal policy lets orders accumulate for more periods before processing, which matches intuition;
  • as $K$ increases, the computation becomes more expensive and convergence takes longer.

Comparing different $c$

Used value iteration to solve batch manufacturing problem, formulate with average_cost problem:
p = 0.5, n = 10, c = 0.5, K = 5.0, alpha = 0.9.
    Iteration turns : 18
    Value function : [1.3333, 4.0, 5.6667, 6.3333, 6.3333, 6.3333, 6.3333, 6.3333, 6.3333, 6.3333, 6.3333]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.3333
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 6
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75
    
p = 0.5, n = 10, c = 2.0, K = 5.0, alpha = 0.9.
    Iteration turns : 5
    Value function : [2.25, 6.75, 7.25, 7.25, 7.25, 7.25, 7.25, 7.25, 7.25, 7.25, 7.25]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 2.25
Used policy iteration to solve batch manufacturing problem, formulate with average_cost problem:
p = 0.5, n = 10, c = 0.5, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 4
    Calucating value function total iteration turns : 240
    Value function : [1.375, 4.1251, 5.875, 6.625, 6.375, 6.375, 6.375, 6.375, 6.375, 6.375, 6.375]
    Policy : [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.375
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 4
    Calucating value function total iteration turns : 96
    Value function : [1.8334, 5.5001, 7.1667, 6.8334, 6.8334, 6.8334, 6.8334, 6.8334, 6.8334, 6.8334, 6.8334]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.8334
    
p = 0.5, n = 10, c = 2.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 4
    Calucating value function total iteration turns : 28
    Value function : [2.25, 6.75, 7.25, 7.25, 7.25, 7.25, 7.25, 7.25, 7.25, 7.25, 7.25]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 2.25

As $c$ increases, the value of every state increases, and the optimal policy holds orders for fewer periods before processing, which matches intuition.

Comparing different $p$

Used value iteration to solve batch manufacturing problem, formulate with average_cost problem:
p = 0.1, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 8
    Value function : [0.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5]
    Policy : [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 0.5
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 6
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75
    
p = 0.9, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 62
    Value function : [2.5, 5.2777, 6.9444, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 2.5
Used policy iteration to solve batch manufacturing problem, formulate with average_cost problem:
p = 0.1, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 231
    Value function : [0.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5]
    Policy : [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 0.5
    
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 5
    Calucating value function total iteration turns : 100
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75
    
    
p = 0.9, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 4
    Calucating value function total iteration turns : 338
    Value function : [2.5, 5.2778, 6.9444, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 2.5

These experiments show that, for the average cost problem, value iteration and policy iteration produce the same results, although their iteration counts also differ.

As $p$ increases, an order arrives with higher probability, so

  • waiting becomes more attractive: another order is likely to arrive in the next period and make the eventual batch more worthwhile, so the state at which processing starts moves later;
  • because the threshold moves later, the value function increases across all states;
  • and because the values are larger, more iterations are needed to converge.

Summary

The detailed numerical results, comparisons, and analysis are given in the experiments above.

  • For the discounted problem, value iteration and policy iteration produce the same results and differ only in computational cost; the latter is more expensive because each round requires not only a policy improvement step but also a policy evaluation, and the evaluation is relatively costly;
  • For the average cost problem, value iteration and policy iteration likewise produce the same results, and again policy iteration is computationally more expensive;
  • Increasing $n$ does not affect the results as long as the threshold stays within the range of $n$;
  • Increasing $c$ increases the value of every state, reduces the computational cost, and the optimal policy holds orders for fewer periods before processing, which matches intuition;
  • Increasing $K$ increases the value of every state, increases the computational cost, and the optimal policy lets orders accumulate for more periods before processing, which matches intuition;
  • Increasing $\alpha$ strengthens the influence of the previous iterate on the current one, so convergence slows down and more iterations are needed; the previous iterate is folded into the current value function with less attenuation, so the converged value function grows, and the policy puts more weight on the long run, where processing pays off most, so the processing threshold moves to a smaller state;
  • Increasing $p$ makes waiting more attractive, since another order is likely to arrive and make the eventual batch more worthwhile, so the processing threshold moves to a larger state; as a result the value function increases across all states and more iterations are needed to converge.

Appendix

Code

main.py

import argparse

from algorithm import Value, Policy

if __name__ == '__main__':

    parser = argparse.ArgumentParser(description='Program to solve batch manufacturing problem.')

    parser.add_argument('--p', type=float, nargs='+', default=[0.5], help='Probability of receiving an order for the product.')
    parser.add_argument('--n', type=int, nargs='+', default=[5], help='The maximum number of orders that can remain unfilled.')
    parser.add_argument('--c', type=float, nargs='+', default=[1.0], help='The cost per unfilled order at each time period.')
    parser.add_argument('--K', type=float, nargs='+', default=[3.0], help='The setup cost to process the unfilled orders.')
    parser.add_argument('--alpha', type=float, nargs='+', default=[0.8], help='Discount factor.')
    
    parser.add_argument('--problem', type=str, nargs='+', choices=['discounted', 'average_cost'], default=['discounted', 'average_cost'], help='Problem in [discounted, average_cost].')
    parser.add_argument('--algorithm', type=str, choices=['value', 'policy'], nargs='+', default=['value', 'policy'], help='Algorithm in [value, policy] iteration.')
    
    args = parser.parse_args()

    for algorithm in args.algorithm:

        for problem in args.problem:

            method = {
                'value' : Value(problem),
                'policy' : Policy(problem)
            }.get(algorithm)
            print('Used {} iteration to solve batch manufacturing problem, formulate with {} problem:'.format(algorithm, problem))

            for p in args.p:
                for n in args.n:
                    for c in args.c:
                        for K in args.K:
                            for alpha in args.alpha:
                                result = method.iteration(p, n, c, K, alpha)
                                print('p = {}, n = {}, c = {}, K = {}, alpha = {}.'.format(p, n, c, K, alpha))
                                for key, value in result.items():
                                    print(key, ":", value)
                                print("\n")

utils.py

def is_terminal(x, y, EPS):
    
    if x is None or y is None:
        return False

    for (xi, yi) in zip(x, y):
        if abs(xi - yi) > EPS:
            return False
    return True

def format_float(v):
    if type(v) == type(list()):
        return [float('{:.4f}'.format(i)) for i in v]
    else:
        return round(v, 4)

algorithm.py

from utils import is_terminal, format_float

class Value:

    EPS = 1e-4

    def __init__(self, problem):
        self.problem = problem

    def _get_policy(self, value):
        max_value = max(value)
        policy = [0 if i < max_value - Value.EPS else 1 for i in value]
        return policy
    
    def _iteration_discounted(self, p, n, c, K, alpha):
        value = [0] * (n + 1)
        last_value = [0] * (n + 1)
        turns = 0
        while True:
            turns = turns + 1
            for i in range(n):
                value[i] = min(
                    K + alpha * (1 - p) * last_value[0] + alpha * p * last_value[1],
                    c * i + alpha * (1 - p) * last_value[i] + alpha * p * last_value[i + 1]
                )
            value[n] = K + alpha * (1 - p) * last_value[0] + alpha * p * last_value[1]
            if is_terminal(value, last_value, Value.EPS):
                break
            last_value = value.copy()
        
        return {
            "Iteration turns" : turns,
            "Value function" : format_float(value),
            "Policy" : self._get_policy(value)
        }
    
    def _iteration_average(self, p, n, c, K):
        value_h = [0] * (n + 1)
        last_value_h = [0] * (n + 1)
        lambda_k = last_lambda_k = 0
        turns = 0
        while True:
            turns = turns + 1
            lambda_k = min(
                K + (1 - p) * last_value_h[0] + p * last_value_h[1],
                (1 - p) * last_value_h[0] + p * last_value_h[1]
            )
            for i in range(n):
                value_h[i] = min(
                    K + (1 - p) * last_value_h[0] + p * last_value_h[1],
                    c * i + (1 - p) * last_value_h[i] + p * last_value_h[i + 1]
                ) - lambda_k
            value_h[n] = K + (1 - p) * last_value_h[0] + p * last_value_h[1] - lambda_k
            if is_terminal(value_h, last_value_h, Value.EPS) and abs(last_lambda_k - lambda_k) < Value.EPS:
                break
            last_value_h = value_h.copy()
            last_lambda_k = lambda_k
        
        value = [i + lambda_k for i in value_h]
        
        return {
            "Iteration turns" : turns,
            "Value function" : format_float(value),
            "Policy" : self._get_policy(value),
            "Optimal average cost" : format_float(lambda_k)
        }


    def iteration(self, p, n, c, K, alpha):
        return {
            'discounted' : self._iteration_discounted(p, n, c, K, alpha),
            'average_cost' : self._iteration_average(p, n, c, K)
        }.get(self.problem)


class Policy:

    EPS = 1e-4

    def __init__(self, problem):
        self.problem = problem
    
    def _get_value_discounted(self, p, n, c, K, alpha, policy):
        value = [0] * (n + 1)
        last_value = [0] * (n + 1)
        turns = 0
        while True:
            turns = turns + 1
            for i in range(n):
                value[i] = (
                    K + alpha * (1 - p) * last_value[0] + alpha * p * last_value[1] if policy[i] else
                    c * i + alpha * (1 - p) * last_value[i] + alpha * p * last_value[i + 1]
                )
            value[n] = K + alpha * (1 - p) * last_value[0] + alpha * p * last_value[1]
            if is_terminal(value, last_value, Policy.EPS):
                break
            last_value = value.copy()
        
        return value, turns
    
    def _get_value_average(self, p, n, c, K, policy):
        value_h = [0] * (n + 1)
        last_value_h = [0] * (n + 1)
        last_lambda_k = 0
        turns = 0
        while True:
            turns = turns + 1
            lambda_k = min(
                K + (1 - p) * last_value_h[0] + p * last_value_h[1],
                (1 - p) * last_value_h[0] + p * last_value_h[1]
            )
            for i in range(n):
                value_h[i] = (
                    K + (1 - p) * last_value_h[0] + p * last_value_h[1] if policy[i] else
                    c * i + (1 - p) * last_value_h[i] + p * last_value_h[i + 1]
                ) - lambda_k
            value_h[n] = K + (1 - p) * last_value_h[0] + p * last_value_h[1] - lambda_k
            if is_terminal(value_h, last_value_h, Policy.EPS) and abs(last_lambda_k - lambda_k) < Policy.EPS:
                break
            last_value_h = value_h.copy()
            last_lambda_k = lambda_k
        
        return lambda_k, value_h, turns
    
    def _iteration_discounted(self, p, n, c, K, alpha):
        policy = [1] * (n + 1)
        last_policy = [1] * (n + 1)
        turns = 0
        value_turns = 0
        while True:
            turns = turns + 1
            value, turns_i = self._get_value_discounted(p, n, c, K, alpha, last_policy)
            value_turns += turns_i

            for i in range(n):
                process_cost = K + alpha * (1 - p) * value[0] + alpha * p * value[1]
                unprocess_cost = c * i + alpha * (1 - p) * value[i] + alpha * p * value[i + 1]
                policy[i] = 1 if process_cost < unprocess_cost else 0
            policy[n] = 1

            if policy == last_policy:
                break
            last_policy = policy.copy()

        return {
            "Policy improvement iteration turns" : turns,
            "Calucating value function total iteration turns" : value_turns,
            "Value function" : format_float(value),
            "Policy" : policy
        }
    
    def _iteration_average(self, p, n, c, K):
        policy = [1] * (n + 1)
        last_policy = [1] * (n + 1)
        turns = 0
        value_turns = 0
        while True:
            turns = turns + 1
            lambda_k, value, turns_i = self._get_value_average(p, n, c, K, last_policy)
            value_turns += turns_i

            for i in range(n):
                process_cost = K + (1 - p) * value[0] + p * value[1]
                unprocess_cost = c * i + (1 - p) * value[i] + p * value[i + 1]
                policy[i] = 1 if process_cost < unprocess_cost else 0
            policy[n] = 1

            if last_policy == policy:
                break
            last_policy = policy.copy()
        
        value_f = [i + lambda_k for i in value]

        return {
            "Policy improvement iteration turns" : turns,
            "Calucating value function total iteration turns" : value_turns,
            "Value function" : format_float(value_f),
            "Policy" : policy,
            "Optimal average cost" : format_float(lambda_k)
        }

    def iteration(self, p, n, c, K, alpha):
        return {
            'discounted' : self._iteration_discounted(p, n, c, K, alpha),
            'average_cost' : self._iteration_average(p, n, c, K)
        }.get(self.problem)

Commands

Discounted problem experiments

Command 1.1: comparing different n

python main.py --n 8 10 12 --K 5 --c 1 --alpha 0.9 --p 0.5 --problem discounted --algorithm value policy

Command 1.2: comparing different K

python main.py --n 10 --K 3 5 8 --c 1 --alpha 0.9 --p 0.5 --problem discounted --algorithm value policy

Command 1.3: comparing different c

python main.py --n 10 --K 5 --c 0.5 1 2 --alpha 0.9 --p 0.5 --problem discounted --algorithm value policy

Command 1.4: comparing different alpha

python main.py --n 10 --K 5 --c 1 --alpha 0 0.1 0.5 0.9 --p 0.5 --problem discounted --algorithm value policy

Command 1.5: comparing different p

python main.py --n 10 --K 5 --c 1 --alpha 0.9 --p 0.1 0.5 0.9 --problem discounted --algorithm value policy

Average cost problem experiments

Command 2.1: comparing different n

python main.py --n 8 10 12 --K 5 --c 1 --alpha 0.9 --p 0.5 --problem average_cost --algorithm value policy

Command 2.2: comparing different K

python main.py --n 10 --K 3 5 8 --c 1 --alpha 0.9 --p 0.5 --problem average_cost --algorithm value policy

Command 2.3: comparing different c

python main.py --n 10 --K 5 --c 0.5 1 2 --alpha 0.9 --p 0.5 --problem average_cost --algorithm value policy

Command 2.4: comparing different p

python main.py --n 10 --K 5 --c 1 --alpha 0.9 --p 0.1 0.5 0.9 --problem average_cost --algorithm value policy