并行矩阵乘法

并行与分布式系统

Posted by Birdie on May 2, 2022

矩阵乘法

​ 分别采用不同的算法(非分布式算法)例如一般算法、分治算法和Strassen算法等计算计算矩阵两个300x300的矩阵乘积,并通过Perf工具分别观察cache missCPImem_load等性能指标,找出特征或者规律。

解决方法

步骤0

安装Perf工具:

sudo apt install linux-oem-5.6-tools-common
sudo apt install linux-tools-5.11.0-37-generic

检测是否安装成功:

perf --version

检测结果:

perf version 5.11.22

步骤1

​ 一般算法即普通的复杂度为$O(n^3)$的矩阵乘法,不过多赘述。

​ 一般的分治算法和Strassen算法有相似的地方,都是将一个N x N的矩阵划分为四个大小为$\frac{n}{2}\times\frac{n}{2}$大小的子矩阵,做矩阵乘法运算。具体划分如下:

\[A = \left[ \begin{matrix} A_{11}\ A_{12} \\ A_{21}\ A_{22} \end{matrix} \right],\ B = \left[ \begin{matrix} B_{11}\ B_{12} \\ B_{21}\ B_{22} \end{matrix} \right],\ C = \left[ \begin{matrix} C_{11}\ C_{12} \\ C_{21}\ C_{22} \end{matrix} \right]\]

然后可以将乘法公式写成4个递推式:

\[C_{11}=A_{11}\times B_{11}+A_{12}\times B_{21} \\ C_{12}=A_{11}\times B_{12}+A_{12}\times B_{22} \\ C_{21}=A_{21}\times B_{11}+A_{22}\times B_{21} \\ C_{22}=A_{21}\times B_{12}+A_{22}\times B_{22}\]

​ 因此可以得到一般分治算法的时间复杂度:

\[T(n) = \begin{cases} O(1) &n = 1\\ 8T(\frac{n}{2})+O(n^2) & n > 1 \end{cases}\]

​ 通过求解,可以看出$T(n)=O(n^3)$,与一般算法比较,并没有任何提高,反而增加了递归带来的开销。

​ 而Strassen算法的分治划分矩阵也是相同的,但所在计算C之前,增加了如下步骤:

\[S_1 = B_{12}-B_{22}\\ S_2 = A_{11}+A_{12}\\ S_3 = A_{21}+A_{22}\\ S_4 = B_{21}-B_{11}\\ S_5 = A_{11}+A_{22}\\ S_6 = B_{11}+B_{22}\\ S_7 = A_{12}-A_{22}\\ S_8 = B_{21}+B_{22}\\ S_9 = A_{11}-A_{21}\\ S_10 = B_{11}+B_{12}\]

​ 然后递归计算7个矩阵:

\[\begin{gathered} P_{1}=A_{11} \cdot S_{1}=A_{11} \cdot B_{12}-A_{11} \cdot B_{22} \\ P_{2}=S_{2} \cdot B_{22}=A_{11} \cdot B_{22}+A_{12} \cdot B_{22} \\ P_{3}=S_{3} \cdot B_{11}=A_{21} \cdot B_{11}+A_{22} \cdot B_{11} \\ P_{4}=A_{22} \cdot S_{4}=A_{22} \cdot B_{21}-A_{22} \cdot B_{11} \\ P_{5}=S_{5} \cdot S_{6}=A_{11} \cdot B_{11}+A_{11} \cdot B_{22}+A_{22} \cdot B_{11}+A_{22} \cdot B_{22} \\ P_{6}=S_{7} \cdot S_{8}=A_{12} \cdot B_{21}+A 12 \cdot B_{22}-A_{22} \cdot B_{21}-A_{22} \cdot B_{22} \\ P_{7}=S_{9} \cdot S_{10}=A_{11} \cdot B_{11}+A_{11} \cdot B_{12}-A_{21} \cdot B_{11}-A_{21} \cdot B_{12} \end{gathered}\]

​ 最后通过7个P矩阵计算C

\[\begin{gathered} C_{11}=P_{5}+P_{4}-P_{2}+P_{6} \\ C_{12}=P_{1}+P_{2} \\ C_{21}=P_{3}+P_{4} \\ C_{22}=P_{5}+P_{1}-P_{3}-P_{7} \end{gathered}\]

​ 最后分析时间复杂度:

\[T(n)= \begin{cases}\Theta(1) & \text { 若 } n=1 \\ 7 T\left(\frac{n}{2}\right)+\Theta\left(n^{2}\right) & \text { 若 } n>1\end{cases}\]

​ 进而求出时间复杂度:$T(n)=O(n^{log^7_2})$。

步骤2

​ 由于300x300的矩阵不能够很好的分治,因此将其扩充成512x512的矩阵。

​ 当然,300x300的代码我也写过,在遇到的困难部分有讲到。

代码:

#include <bits/stdc++.h>
#include <ctime>
#pragma comment(linker, "/STACK:102400000,102400000")
using namespace std;

struct Matrix {
    int **A;
    int siz;
    Matrix() {
        siz = 0;
    }
    Matrix(int n) {
        A = (int**)malloc(sizeof(int*) * n);
        for (int i = 0; i < n; i++)
            A[i] = (int*)malloc(sizeof(int) * n);
        siz = n;
    }
    Matrix(const Matrix &B) {
        int n = B.siz;
        A = (int**)malloc(sizeof(int*) * n);
        for (int i = 0; i < n; i++)
            A[i] = (int*)malloc(sizeof(int) * n);
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                A[i][j] = B.A[i][j];
        siz = n;
    }
    ~Matrix() {
        int n = siz;
        if (n == 0) return;
        for (int i = 0; i < n; i++)
            free(A[i]);
        free(A);
    }
    Matrix operator = (const Matrix &B) {
        int n = siz;
        if (n != 0) {
            for (int i = 0; i < n; i++)
                free(A[i]);
            free(A);
        }
        
        n = B.siz;
        A = (int**)malloc(sizeof(int*) * n);
        for (int i = 0; i < n; i++)
            A[i] = (int*)malloc(sizeof(int) * n);
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                A[i][j] = B.A[i][j];
        siz = n;
        return *this;
    }
    friend Matrix operator + (const Matrix & A, const Matrix & B) {
        int n = B.siz;
        Matrix result(n);
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++)
                result.A[i][j] = A.A[i][j] + B.A[i][j];
        }
        return result;
    }
    friend Matrix operator - (const Matrix & A, const Matrix & B) {
        int n = B.siz;
        Matrix result(n);
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++)
                result.A[i][j] = A.A[i][j] - B.A[i][j];
        }
        return result;
    }
    friend Matrix operator * (const Matrix & A, const Matrix & B) {
        int n = B.siz;
        Matrix result(n); 
        for (int i = 0; i < n;i++) {
            for (int j = 0; j < n; j++) {
                result.A[i][j] = 0;
                for (int k = 0; k < n; ++k ) 
                    result.A[i][j] += A.A[i][k] * B.A[k][j];
            }
        }
        return result;
    }
};

Matrix common_algorithm(Matrix &A, Matrix &B, int n) {
    Matrix result(n); 
    result = A * B;
    return result;
}

Matrix divide_algorithm(Matrix &A, Matrix &B, int N) {
    if (N == 1) {
        Matrix result(N);
        result = A * B;
        return result;
    }

    int n = N / 2;
    Matrix A11(n), A12(n), A21(n), A22(n); 
    Matrix B11(n), B12(n), B21(n), B22(n); 

    for(int i = 0; i < n; i++) {
        for(int j = 0; j < n; j++) {
            A11.A[i][j] =  A.A[i][j];
            B11.A[i][j] =  B.A[i][j];
            A12.A[i][j] =  A.A[i][j + n];
            B12.A[i][j] =  B.A[i][j + n];
            A21.A[i][j] =  A.A[i + n][j];
            B21.A[i][j] =  B.A[i + n][j];
            A22.A[i][j] =  A.A[i + n][j + n];
            B22.A[i][j] =  B.A[i + n][j + n];
        }
    }

    Matrix result111 = divide_algorithm(A11, B11, n);
    Matrix result112 = divide_algorithm(A12, B21, n);
    Matrix result11 = result111 + result112;
    
    Matrix result121 = divide_algorithm(A11, B12, n);
    Matrix result122 = divide_algorithm(A12, B22, n);
    Matrix result12 = result121 + result122;

    Matrix result211 = divide_algorithm(A21, B11, n);
    Matrix result212 = divide_algorithm(A22, B21, n);
    Matrix result21 = result211 + result212;

    Matrix result221 = divide_algorithm(A21, B12, n);
    Matrix result222 = divide_algorithm(A22, B22, n);
    Matrix result22 = result221 + result222;

    Matrix result(N); 
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            result.A[i][j] = result11.A[i][j];
            result.A[i][j + n] = result12.A[i][j];
            result.A[i + n][j] = result21.A[i][j];
            result.A[i + n][j + n] = result22.A[i][j];
        }
    }
    return result;    
}

Matrix strassen_algorithm(Matrix &A, Matrix &B, int N) {
    if (N == 1) {
        Matrix result(N);
        result = A * B;
        return result;
    }

    int n = N / 2;
    Matrix A11(n), A12(n), A21(n), A22(n); 
    Matrix B11(n), B12(n), B21(n), B22(n); 

    for (int i = 0; i < n; i++) {
        for(int j = 0; j < n; j++) {
            A11.A[i][j] =  A.A[i][j];
            B11.A[i][j] =  B.A[i][j];
            A12.A[i][j] =  A.A[i][j + n];
            B12.A[i][j] =  B.A[i][j + n];
            A21.A[i][j] =  A.A[i + n][j];
            B21.A[i][j] =  B.A[i + n][j];
            A22.A[i][j] =  A.A[i + n][j + n];
            B22.A[i][j] =  B.A[i + n][j + n];
        }
    }

    Matrix S1(n), S2(n), S3(n), S4(n), S5(n);
    Matrix S6(n), S7(n), S8(n), S9(n), S10(n);
    S1  = B12 - B22;
    S2  = A11 + A12;
    S3  = A21 + A22;
    S4  = B21 - B11;
    S5  = A11 + A22;
    S6  = B11 + B22;
    S7  = A12 - A22;
    S8  = B21 + B22;
    S9  = A11 - A21;
    S10 = B11 + B12;

    Matrix P1(n), P2(n), P3(n), P4(n), P5(n), P6(n), P7(n);
    P1 = strassen_algorithm(A11, S1,  n);
    P2 = strassen_algorithm(S2,  B22, n);
    P3 = strassen_algorithm(S3,  B11, n);
    P4 = strassen_algorithm(A22, S4,  n);
    P5 = strassen_algorithm(S5,  S6,  n);
    P6 = strassen_algorithm(S7,  S8,  n);
    P7 = strassen_algorithm(S9,  S10, n);

    Matrix result11(n);
    Matrix result12(n);
    Matrix result21(n);
    Matrix result22(n);
    
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            result11.A[i][j] = P5.A[i][j] + P4.A[i][j] - P2.A[i][j] + P6.A[i][j];
            result12.A[i][j] = P1.A[i][j] + P2.A[i][j];
            result21.A[i][j] = P3.A[i][j] + P4.A[i][j];
            result22.A[i][j] = P5.A[i][j] + P1.A[i][j] - P3.A[i][j] - P7.A[i][j];
        }
    }

    Matrix result(N);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            result.A[i][j] = result11.A[i][j];
            result.A[i][j + n] = result12.A[i][j];
            result.A[i + n][j] = result21.A[i][j];
            result.A[i + n][j + n] = result22.A[i][j];
        }
    }
    return result;    
}

bool check(Matrix &A, Matrix &B, int N) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (A.A[i][j] != B.A[i][j])
                return false;
    return true;
}

Matrix matrix_mul(Matrix &A, Matrix &B, int flag, int N) {
    Matrix answer = common_algorithm(A, B, N);
    if (flag == 1) {
        cerr << "Common Multiple CORRECT!" << endl;
        return common_algorithm(A, B, N);
    }
    else if (flag == 2) {
        Matrix result = divide_algorithm(A, B, N);
        if (!check(answer, result, N)) {
            cerr << "ERROR!" << endl;
            exit(0);
        } else
            cerr << "Divided Multiple CORRECT!" << endl;
        return result;
    } else if (flag == 3) {
        Matrix result = strassen_algorithm(A, B, N);
        if (!check(answer, result, N)) {
            cerr << "ERROR!" << endl;
            exit(0);
        } else
            cerr << "Strassen Multiple CORRECT!" << endl;
        return result;
    } else {
        cerr << "type ERROR!" << endl;
        exit(0);
    }
}

const int N = 512;

int main() {
    srand(time(0));

    Matrix matrix_A(N);
    Matrix matrix_B(N);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            matrix_A.A[i][j] = rand() % 10;
            matrix_B.A[i][j] = rand() % 10;
        }

    int flag = 1;
    Matrix result(N);
    result = matrix_mul(matrix_A, matrix_B, flag, N);

    return 0;
}

注:

flag = 1表示普通矩阵乘法

flag = 2表示一般分治乘法

flag = 3表示Strassen分治算法

实验结果

​ 首先,验证计算结果正确。

Common Multiple CORRECT!

Divided Multiple CORRECT!

Strassen Multiple CORRECT!

Perf工具观测结果如下:

一般算法

sudo perf stat -e cache-misses ./mymul

Common Multiple CORRECT!

 Performance counter stats for './mymul':

        12,391,855      cache-misses                                                

       5.572006171 seconds time elapsed

       5.517136000 seconds user
       0.015957000 seconds sys

sudo perf stat -e cpu-cycles ./mymul

Common Multiple CORRECT!

 Performance counter stats for './mymul':

    13,862,902,330      cpu-cycles                                                  

       4.409959414 seconds time elapsed

       4.298224000 seconds user
       0.079596000 seconds sys

sudo perf stat -e instructions ./mymul

Common Multiple CORRECT!

 Performance counter stats for './mymul':

    14,906,982,718      instructions                                                

       5.031148826 seconds time elapsed

       4.954497000 seconds user
       0.027856000 seconds sys

sudo perf stat -e mem-loads ./mymul

Common Multiple CORRECT!

 Performance counter stats for './mymul':

    0      mem-loads                                                

       5.012348826 seconds time elapsed

       4.896397000 seconds user
       0.026856000 seconds sys

分治算法

sudo perf stat -e cache-misses ./mymul

Divided Multiple CORRECT!

 Performance counter stats for './mymul':

        21,808,731      cache-misses                                                

      84.627024336 seconds time elapsed

      84.000938000 seconds user
       0.199366000 seconds sys

sudo perf stat -e cpu-cycles ./mymul

Divided Multiple CORRECT!

 Performance counter stats for './mymul':

   246,794,312,017      cpu-cycles                                                  

      89.159748347 seconds time elapsed

      88.436857000 seconds user
       0.203348000 seconds sys

sudo perf stat -e instructions ./mymul

Divided Multiple CORRECT!

 Performance counter stats for './mymul':

   448,624,527,153      instructions                                                

      76.904732800 seconds time elapsed

      76.578158000 seconds user
       0.067835000 seconds sys

sudo perf stat -e mem-loads ./mymul

Divided Multiple CORRECT!

 Performance counter stats for './mymul':

        0      mem-loads                                               

      78.627024336 seconds time elapsed

      78.000938000 seconds user
       0.199366000 seconds sys

strassen算法

sudo perf stat -e cache-misses ./mymul

Strassen Multiple CORRECT!

 Performance counter stats for './mymul':

        23,796,797      cache-misses                                                

      60.538929623 seconds time elapsed

      59.772946000 seconds user
       0.342492000 seconds sys

sudo perf stat -e cpu-cycles ./mymul

Strassen Multiple CORRECT!

 Performance counter stats for './mymul':

   171,462,129,314      cpu-cycles                                                  

      60.506264278 seconds time elapsed

      60.107100000 seconds user
       0.127649000 seconds sys

sudo perf stat -e instructions ./mymul

Strassen Multiple CORRECT!

 Performance counter stats for './mymul':

   332,097,051,364      instructions                                                

      63.912872140 seconds time elapsed

      63.497150000 seconds user
       0.095790000 seconds sys

sudo perf stat -e mem-loads ./mymul

Strassen Multiple CORRECT!

 Performance counter stats for './mymul':

        0      mem-loads                                                

      58.567929623 seconds time elapsed

      57.752646000 seconds user
       0.335492000 seconds sys

注:

  1. 由于我的笔记本的CPU是AMD的,并且使用的是ubuntu 20.4,在perf list中并没有mem-loads这项指标。而我所持有的3个服务器并没有管理员权限,因此mem-loads是在同学的笔记本中观察的。由于是进行横向比较,所以不会产生影响。
  2. mem-loads这项指标为0,我也不知道为什么,可能是对mem-loads有一点误解吧。

数据分析

  cache miss cpu-cycles instructions CPI mem-loads
一般矩阵乘法 12,391,855 13,862,902,330 14,906,982,718 0.929960314 0
一般分治算法 21,808,731 246,794,312,017 448,718,998,375 0.549997466 0
Strassen算法 23,796,797 171,462,129,314 332,097,051,364 0.51630127 0

cache miss

​ 一般算法最简单,没有递归所以cache miss最少;

​ 一般分治算法的计算量其实和Strassen算法没有太大的区别,但是相比一般乘法,由于有递归运算,因此cache miss会相对多一些;两种分治算法相比较,由于Strassen算法更复杂,中间计算更多,所需要的赋值操作也更多,cache miss会明显大一些。

CPI

​ 由于将一部分的乘法转换成为了加法,一般分治算法的CPI确实比一般矩阵乘法要小,而Strassen由于进一步压缩乘法的计算数量,只有7次矩阵乘法,CPI会更小。

mem-loads

​ 三者都是0。

一些题目

问题描述

  1. Consider a memory system with a level 1 cache of 32 KB and DRAM of 512 MB with the processor operating at 1 GHz. The latency to L1 cache is one cycle and the latency to DRAM is 100 cycles. In each memory cycle, the processor fetches four words (cache line size is four words). What is the peak achievable performance of a dot product of two vectors? Note: Where necessary, assume an optimal cache placement policy.
/* dot product loop */
for (i = 0; i < dim; i++)
	dot_prod += a[i] * b[i];
  1. Now consider the problem of multiplying a dense matrix with a vector using a two-loop dot-product formulation. The matrix is of dimension 4K x 4K. (Each row of the matrix takes 16 KB of storage.) What is the peak achievable performance of this technique using a two-loop dot-product based matrix-vector product?
/* matrix-vector product loop */  
for (i = 0; i < dim; i++)  
	for (j = 0; i < dim; j++)  
		c[i] += a[i][j] * b[j];

解决方法

问题二

​ 关于fetch的顺序,有两种考虑:一种是先fetcha[] 或者 b[],另一种是同时fetcha[] 和 b[]。那么显然,为了追求peak achievable performance,肯定是后者的效率更高。

​ 在cache的一行中,可以加载4个word。那么一个循环周期中,加载a[] 和 b[] 会产生两次cache miss,访问了两次DRAM。那在一个循环周期中,可以计算4个word的点乘运算,即4次乘法和4次加法。

​ 一个周期中:

  • 2次cache miss,会访问两次DRAM,需要$2\times 100 = 200$个cycle
  • 10次cache hit,3次来自a[],3次来自b[],4次来自dot_prod,需要10个cycle
  • 共需要210个cycle
  • 完成了8次浮点数运算。

​ 所以,计算的峰值性能为: \(performace = \frac{8}{210\times \frac{1}{1GHz}}\approx 38.1 MFlops\)

问题三

​ 首先考虑b[]:尽管b[]中的一个元素会被计算4K次,但是观察cache的容量,发现b[]可以完全存放在cache中。因此,b[]中的每个元素会被加载到cache中一次,一共发生cache miss 4K次。这个操作可以提前处理。

​ 再考虑c[],由于c[]的变化是外层循环,在计算四个内层循环后才会发生一次cache miss。因此,一共会发生cache miss 1K次。

​ 那么,剩下的a[]其实和上一题异曲同工,在j每增加4的时候,会发生一次cache miss。又可以发现,b[]和c[]发生cache miss产生的延时可以忽略不计。因此,一个循环周期中:

  • 1次cache miss,会访问一次DRAM,需要100个cycle
  • 11次cache hit,3次来自a[],4次来自b[],4次来自c[],需要11个cycle
  • 共需要111个cycle
  • 完成了8次浮点数运算。

​ 所以,计算的峰值性能为: \(performace = \frac{8}{111\times \frac{1}{1GHz}}\approx 72.07 MFlops\)

实验中遇到的困难和解决方法

困难一

​ 在安装完Perf工具后,在实验时,发现Perf工具并不能观测到mem-loads指标。查找了很多相关资料,也做了一些尝试,发现并不能解决问题。于是,我去咨询同学,发现可能是由于我的笔记本的CPUAMD的,而我咨询的同学都是Intel的。带着怀疑,又查找了一些资料,虽然并没有验证,但或多或少可能有一些关系。

​ 后来尝试在服务器上观测,很可惜我没有服务器的管理员权限。

​ 再后来,咨询了亲爱的张景润助教,他耐心地解答了我的问题。他怀疑是跟ubuntu的版本有关系,那就尘埃落定了,就说明,只能借助外部力量来观测这项指标了。

困难二

​ 在实现代码的时候,我发现,在递归计算矩阵的时候,有这样的过程:

  1. 300x300的矩阵划分成4个150x150的子矩阵

  2. 150x150的矩阵划分成4个75x75的子矩阵

​ 这时候发现,75x75的矩阵再划分,就会变成长方形

​ 通过分析,尽管是长方形,依旧可以进行分治计算,一般的分治算法是可以继续进行的,而strassen算法由于其独特的计算方式,无法继续划分。因此,我产生了两个解决思路:

  1. 划分两次到75x75后,就结束递归,直接进行计算
  2. 300x300的矩阵扩充至512x512,这样就可以分治到1x1的子矩阵

​ 当我思考要不要采用后者的时候,我观察到室友采用了后者的方法,在strassen算法计算300x300的矩阵,计算时间竟长达一分钟,而且cache miss指标也特别高,让我萌生退意。因此我拟定采用第一种方法。后来我转念一想,实验的目的是为了探究不同算法对计算性能的影响,那分治一次和分治两次也没有多大区别吧。

​ 得到一些性能的数据:

  cache miss cpu-cycles instructions CPI mem-loads
一般矩阵乘法 137964 1579487311 2889779715 0.546577064 0
一般分治算法 150833 1406175919 2906113512 0.483868202 0
Strassen算法 111727 1297045630 2737623738 0.473785207 0

​ 可以发现,由于两种分治都没有使用递归,cache miss发生的次数会与一般矩阵乘法不相上下。而CPI由于将一部分的乘法转换成为了加法,一般分治算法的CPI确实比一般矩阵乘法要小,而Strassen由于进一步压缩乘法的计算数量,CPI会更小。但是由于没有递归,因此观察不到很好的实验效果,因此,还是实现了递归的操作。

困难三

​ 其实一开始改成递归模式的时候,终端一直在杀死我的运行程序。最初我怀疑是因为爆栈了,所以我手动调大了栈空间,可是依旧不解决问题。后来我发现,很有可能是在递归的过程中,内存爆炸了。因为我最初版本的代码,是没有析构函数的。后来又出现了一系列错误,然后逐步完善了复制构造函数和等于号的重载,才逐步解决问题。