并行计算中国科学技术大学计算机科学与技术系国家高性能计算中心(合肥)2004年12月第三篇并行数值算法第八章基本通讯操作第九章稠密矩阵运算第十章线性方程组的求解第十一章快速傅里叶变换第九章稠密矩阵运算9.1矩阵的划分9.2矩阵转置9.3矩阵-向量乘法9.4矩阵乘法9.1矩阵的划分9.1.1带状划分9.1.2棋盘划分国家高性能计算中心(合肥)52020/1/10带状划分16×16阶矩阵,p=4列块带状划分行循环带状划分PPPP0481215913261014371115PPPP(b)(a)图9.1012332100123456789101112131415国家高性能计算中心(合肥)62020/1/10带状划分示例:p=3,27×27矩阵的3种带状划分9.1矩阵的划分9.1.1带状划分9.1.2棋盘划分国家高性能计算中心(合肥)82020/1/10棋盘划分8×8阶矩阵,p=16块棋盘划分循环棋盘划分(b)(a)图9.261014711154812591361014(0,0)(1,0)(2,0)(3,0)(5,0)(4,0)(7,0)(6,0)(0,1)(1,1)(2,1)(3,1)(5,1)(4,1)(7,1)(6,1)(0,2)(1,2)(2,2)(3,2)(5,2)(4,2)(7,2)(6,2)(0,3)(1,3)(0,4)(1,4)(2,3)(3,3)(2,4)(3,4)(5,3)(4,3)(5,4)(4,4)(7,3)(6,3)(7,4)(6,4)(0,5)(1,5)(0,6)(1,6)(2,5)(3,5)(2,6)(3,6)(5,5)(4,5)(5,6)(4,6)(7,5)(6,5)(7,6)(6,6)(0,7)(1,7)(2,7)(3,7)(5,7)(4,7)(7,7)(6,7)(0,0)(2,0)(3,0)(5,0)(4,0)(7,0)(6,0)(1,0)(6,1)(1,1)(0,4)(1,4)(2,1)(3,1)(2,4)(3,4)(5,1)(4,1)(5,4)(4,4)(7,1)(7,4)(6,4)(0,1)(1,2)(0,5)(1,5)(2,2)(3,2)(2,5)(3,5)(5,2)(4,2)(5,5)(4,5)(7,2)(6,2)(7,5)(6,5)(0,2)(1,3)(0,6)(1,6)(2,3)(2,6)(3,6)(5,3)(4,3)(5,6)(4,6)(7,3)(6,3)(7,6)(6,6)(0,3)(3,3)(0,7)(1,7)(2,7)(3,7)(5,7)(4,7)(7,7)(6,7)3711150481215913PPPPPPPP23012PPPPPPPPPPPPPPPPPPPPPPPP国家高性能计算中心(合肥)92020/1/10棋盘划分示例:p=4,16×16矩阵的3种棋盘划分第九章稠密矩阵运算9.1矩阵的划分9.2矩阵转置9.3矩阵-向量乘法9.4矩阵乘法9.2矩阵转置9.2.1棋盘划分的矩阵转置9.2.2带状划分的矩阵转置国家高性能计算中心(合肥)122020/1/10棋盘划分的矩阵转置网孔连接情形1:p=n2。通讯步转置后P(3,0)(1,0)(1,2)(1,3)(2,0)(2,1)(2,3)P(3,1)P(3,2)PPPP(b)(a)图9.3(0,0)(0,1)(0,2)(0,3)(1,0)(1,1)(1,2)(1,3)(2,0)(2,1)(2,2)(2,3)(3,0)(3,1)(3,2)(3,3)(0,1)(0,2)(0,3)P(0,0)(1,1)(2,2)(3,3)371115261014159131204837111526141019135PPPPPPPPPPPPPPPPPPPPPPPPP12480国家高性能计算中心(合肥)132020/1/10棋盘划分的矩阵转置情形2:pn2。-划分:-算法:①按mesh连接进行块转置(不同处理器间)②进行块内转置(同一处理器内)通讯步转置后PPPPPPPPPPPP(b)(a)图9.4PPPPPPPPPPPPPPPP0481215913261014371115(0,0)(1,0)(0,2)(1,2)(1,4)(0,4)(1,6)(0,6)(0,1)(1,1)(0,3)(1,3)(1,5)(0,5)(1,7)(0,7)(2,0)(3,0)(2,2)(3,2)(3,4)(2,4)(3,6)(2.6)(2,1)(3,1)(2,3)(3,3)(3,5)(2,5)(3,7)(2,7)(4,0)(5,0)(4,2)(5,2)(5,4)(4,4)(5,6)(4,6)(4,1)(5,1)(4,3)(5,3)(5,5)(4,5)(5,7)(4,7)(6,0)(7,0)(6,2)(7,2)(7,4)(6,4)(7,6)(6,6)(6,1)(7,1)(6,3)(7,3)(7,5)(6,5)(7,7)(6,7)48120PP15913P261014P371115(1,0)(2,0)(3,0)(5,0)(4,0)(7,0)(6,0)(0,0)(0,1)(1,1)(2,1)(3,1)(5,1)(4,1)(7,1)(6,1)(0,2)(1,2)(2,2)(3,2)(5,2)(4,2)(7,2)(6,2)(0,3)(1,3)(2,3)(3,3)(5,3)(4,3)(7,3)(6,3)(0,4)(1,4)(2,4)(3,4)(5,4)(4,4)(7,4)(6,4)(0,5)(1,5)(2,5)(3,5)(5,5)(4,5)(7,5)(6,5)(0,6)(1,6)(2,6)(3,6)(5,6)(4,6)(7,6)(6,6)(0,7)(1,7)(2,7)(3,7)(5,7)(4,7)(7,7)(6,7)子块划分成p个大小为Apnpnnn通讯)/(2//2pnttpws计算pn2//2运行时间pntptpnTwsp/22222国家高性能计算中心(合肥)142020/1/10棋盘划分的矩阵转置超立方连接划分:算法:①②对Aij递归应用①进行转置,直至分块矩阵的元素处于同一处理器;③进行同一处理器的内部转置。运行时间:子块划分成p个大小为ApnpnnnppnttpnppnttpnppnttpnTwswswsplog)2log)22//log)22222222(递归步:,(选路:,内部转置(2212211122211211AAAAAAAAA转置为=将国家高性能计算中心(合肥)152020/1/10棋盘划分的矩阵转置超立方连接:示例(0,5)(1,5)PPP(2,5)(3,5)PPPP(5,5)(4,5)PPPP(7,5)(6,5)P(0,0)(0,1)(1,0)(1,1)(0,2)(0,3)(1,2)(1,3)P(0,4)(0,5)(1,4)(1,5)P(0,6)(0,7)(1,6)(1,7)P(2,0)(2,1)(3,0)(3,1)P(2,2)(2,3)(3,2)(3,3)P(2,4)(2,5)(3,4)(3,5)P(2,6)(2,7)(3,6)(3,7)(5,0)(5,1)(4,0)(4,1)(5,2)(5,3)(4,2)(4,3)P(5,4)(5,5)(4,4)(4,5)P(5,6)(5,7)(4,6)(4,7)P(7,0)(7,1)(6,0)(6,1)P(7,2)(7,3)(6,2)(6,3)P(7,4)(7,5)(6,4)(6,5)P(7,6)(7,7)(6,6)(6,7)(b)(a)图9.6048121592610143711150841219513621014731115(0,0)(1,0)(2,0)(3,0)(5,0)(4,0)(7,0)(6,0)(0,1)(1,1)(2,1)(3,1)(5,1)(4,1)(7,1)(6,1)PPPPPPPP(0,2)(1,2)(2,2)(3,2)(5,2)(4,2)(7,2)(6,2)(0,3)(1,3)(2,3)(3,3)(5,3)(4,3)13(7,3)(6,3)(0,4)(1,4)(2,4)(3,4)(5,4)(4,4)(7,4)(6,4)(0,6)(1,6)(2,6)(3,6)(5,6)(4,6)(7,6)(6,6)(0,7)(1,7)(2,7)(3,7)(5,7)(4,7)(7,7)(6,7)9.2矩阵转置9.2.1棋盘划分的矩阵转置9.2.2带状划分的矩阵转置国家高性能计算中心(合肥)172020/1/10带状划分的矩阵转置划分:An×n分成p个(n/p)×n大小的带算法:①Pi有p-1个(n/p)×(n/p)大小子块发送到另外p-1个处理器中;②每个处理器本地交换相应的元素PPPPn图9.70123第九章稠密矩阵运算9.1矩阵的划分9.2矩阵转置9.3矩阵-向量乘法9.4矩阵乘法9.3矩阵-向量乘法9.3.1带状划分的矩阵-向量乘法9.3.2棋盘划分的矩阵-向量乘法国家高性能计算中心(合肥)202020/1/10带状划分的矩阵-向量乘法划分(行带状划分):Pi存放xi和ai,0,ai,1,…,ai,n-1,并输出yi算法:对p=n情形①每个Pi向其他处理器播送xi(多到多播送);②每个Pi计算;注:对pn情形,算法中Pi要播送X中相应的n/p个分量(1)超立方连接的计算时间(2)网孔连接的计算时间充分大时项是多到多的播送时间后项是乘法时间,前pntptpnptpnptpnTwswsp//log21//)1(log22充分大时项是多到多的播送时间后项是乘法时间,前pntptpnptpntppnTwswsp//)1(221//)1()1(222国家高性能计算中心(合肥)212020/1/10带状划分的矩阵-向量乘法示例矩阵A01p-1n/p向量x处理器01p-1n(b)(a)0000011111p-1矩阵A01p-1向量y(d)(c)图9.8P0P1Pp-1P0P1Pp-1P0P1Pp-1P0P1Pp-1p-1p-1p-1p-19.3矩阵-向量乘法9.3.1带状划分的矩阵-向量乘法9.3.2棋盘划分的矩阵-向量乘法国家高性能计算中心(合肥)232020/1/10棋盘划分的矩阵-向量乘法划分(块棋盘划分):Pij存放ai,j,xi置入Pi,i中算法:对p=n2情形①每个Pi,i向Pj,i播送xi(一到多播送);②按行方向进行乘-加与积累运算,最后一列Pi,n-1收集的结果为yi;注:对pn2情形,p个处理器排成的二维网孔,算法中Pi,i向Pj,i播送X中相应的个分量(1)网孔连接的计算时间Tp(CT):.X中相应分量置入Pi,i的通讯时间:.按列一到多播送时间:.按行单点积累的时间:ptptpnptpnThwsp3loglog2pppn/pttpnthws)1(log)(ptptpnthws)1(log)(ptptpnthws国家高性能计算中心(合肥)242020/1/10棋盘划分的矩阵-向量乘法示例矩阵A向量xn(d)(c)图9.9P0P1Pn/矩阵A向量yP1P0√p-1P(b)(a)√pn/√p-1√pPp-1Pp-1PP√p2√pPP√p2√p国家高性能计算中心(合肥)252020/1/10带状与棋盘划分比较以网孔链接为例网孔上带状划分的运行时间网孔上棋盘划分的运行时间棋盘划分要比带状划分快。)5.9()1(22wspntptpnT)6.9(3loglog2ptptpnptpnThwsp第九章稠密矩阵运算9.1矩阵的划分9.2矩阵转置9.3矩阵-向量乘法9.4矩阵乘法9.4矩阵乘法9.4.1简单并行分块乘法9.4.2Cannon乘法9.4.3Fox乘法9.4.4Systolic乘法9.4.5DNS乘法国家高性能计算中心(合肥)282020/1/10矩阵乘法符号及定义