Solutions6Chapter6SolutionsS-36.1Thereisnosinglerightanswerforthisquestion.Thepurposeistogetstudentstothinkaboutparallelismpresentintheirdailylives.Theanswershouldhaveatleast10activitiesidentified.6.1.1Anyreasonableansweriscorrecthere.6.1.2Anyreasonableansweriscorrecthere.6.1.3Anyreasonableansweriscorrecthere.6.1.4Thestudentisaskedtoquantifythesavingsduetoparallelism.Theanswershouldconsidertheamountofoverlapprovidedthroughparallelismandshouldbelessthanorequalto(ifnoparallelismwaspossible)totheoriginaltimecomputedifeachactivitywascarriedoutserially.6.26.2.1Forthissetofresources,wecanpipelinethepreparation.Weassumethatwedonothavetoreheattheovenforeachcake.PreheatOvenMixingredientsinbowlforCake1FillcakepanwithcontentsofbowlandbakeCake1.MixingredientsforCake2inbowl.FinishbakingCake1.Emptycakepan.FillcakepanwithbowlcontentsforCake2andbakeCake2.MixingredientsinbowlforCake3.FinishbakingCake2.Emptycakepan.FillcakepanwithbowlcontentsforCake3andbakeCake3.FinishbakingCake3.Emptycakepan.6.2.2Nowwehave3bowls,3cakepansand3mixers.WewillnamethemA,B,andC.PreheatOvenMixincredientsinbowlAforCake1FillcakepanAwithcontentsofbowlAandbakeforCake1.MixingredientsforCake2inbowlA.FinishbakingCake1.EmptycakepanA.FillcakepanAwithcontentsofbowlAforCake2.MixingredientsinbowlAforCake3.FinishingbakingCake2.EmptycakepanA.FillcakepanAwithcontentsofbowlAforCake3.S-4Chapter6SolutionsFinishbakingCake3.EmptycakepanA.Thepointhereisthatwecannotcarryoutanyoftheseitemsinparallelbecauseweeitherhaveonepersondoingthework,orwehavelimitedcapacityinouroven.6.2.3Eachstepcanbedoneinparallelforeachcake.Thetimetobake1cake,2cakesor3cakesisexactlythesame.6.2.4Theloopcomputationisequivalenttothestepsinvolvedtomakeonecake.Giventhatwehavemultipleprocessors(orovensandcooks),wecanexecuteinstructions(orcookmultiplecakes)inparallel.Theinstructionsintheloop(orcookingsteps)mayhavesomedependenciesonpriorinstructions(orcookingsteps)intheloopbody(cookingasinglecake).Data-levelparallelismoccurswhenloopiterationsareindependent(i.e.,noloopcarrieddependencies).Task-levelparallelismincludesanyinstructionsthatcanbecomputedonparallelexecutionunits,aresimilartotheindependentoperationsinvolvedinmakingmultiplecakes.6.36.3.1Whilebinarysearchhasverygoodserialperformance,itisdifficulttoparallelizewithoutmodifyingthecode.SopartAaskstocomputethespeedupfactor,butincreasingXbeyond2or3shouldhavenobenefits.Whilewecanperformthecomparisonoflowandhighononecore,thecomputationformidonasecondcore,andthecomparisonforA[mid]onathirdcore,withoutsomerestructuringorspeculativeexecution,wewillnotobtainanyspeedup.Theanswershouldincludeagraph,showingthatnospeedupisobtainedafterthevaluesof1,2,or3(thisvaluedependssomewhatontheassumptionmade)forY.6.3.2Inthisquestion,wesuggestthatwecanincreasethenumberofcores(toeachthenumberofarrayelements).Again,giventhecurrentcode,wereallycannotobtainanybenefitfromtheseextracores.ButifwecreatethreadstocomparetheNelementstothevalueXandperformtheseinparallel,thenwecangetidealspeedup(Ytimesspeedup),andthecomparisoncanbecompletedintheamountoftimetoperformasinglecomparison.6.4.Thisproblemillustratesthatsomecomputationscanbedoneinparallelifserialcodeisrestructured.Butmoreimportantly,wemaywanttoprovideforSIMDoperationsinourISA,andallowfordata-levelparallelismwhenperformingthesameoperationonmultipledataitems.Chapter6SolutionsS-56.4.1Thisisastraightforwardcomputation.Thefirstinstructionisexecutedonce,andtheloopbodyisexecuted998times.Version1—17,965cyclesVersion2—22,955cyclesVersion3—20,959cycles6.4.2ArrayelementsD[j]andD[j1]willhaveloopcarrieddependencies.Thesewill$f4inthecurrentiterationand$f0inthenextiteration.6.4.3Thisisaverychallengingproblemandtherearemanypossibleimplementationsforthesolution.Thepreferredsolutionwilltrytoutilizethetwonodesbyunrollingtheloop4times(thisalreadygivesyouasubstantialspeedupbyeliminatingmanyloopincrement,branchandloadinstructions).Theloopbodyrunningonnode1wouldlooksomethinglikethis(thecodeisnotthemostefficientcodesequence):addiu$s1,$zero,996l.d$f0,–16($s0)l.d$f2,–8($s0)loop:add.d$f4,$f2,$f0add.d$f6,$f4,$f2Send(2,$f4)Send(2,$f6)s.d$f4,0($s0)s.d$f6,8($s0)Receive($f8)add.d$f10,$f8,$f6add.d$f0,$f10,$f8Send(2,$f10)Send(2,$f0)s.d.$f8,16($s0)s.d$f10,24($s0)s.d$f032($s0)Receive($f2)s.d$f240($s0)addiu$s0,$s0,48bne$s0,$s1,loopadd.d$f4,$f2,$f0add.d$f6,$f4,$f2add.d$f10,$f8,$f6s.d$f4,0($s0)s.d$f6,8($s0)s.d$f8,16($s0)S-6Chapter6SolutionsThecodeonnode2wouldlooksomethinglikethis:addiu$s2,$zero,0loop:Receive($f12)Receive($f14)add.d$f16,$f14,$f12Send(1,$f16)Receive($f12)Receive($f14)add.d$f16,$f14,$f12Send(1,$f16)Receive($f12)Receive($f14)add.d$f16,$f14,$f12Send(1,$f16)Receive($f12)Receive($f14)add.d$f16,$f14,$f12Send(1,$f16)addiu$s2,$s2,1bne$s2,83,loopBasicallyNode1wouldcompute4addseachloopiteration,andNode2wouldcompute4adds.Thelooptakes1463cycles,whichismuchbetterthancloseto18K.Buttheunrolledloopwouldrunfastergiventhecurrentsendinstructionlatency.6.4.4Theloopnetworkwouldneedtorespondwithinasinglecycletoobtainaspeedup.Thisillustrateswhyusingdistributedmessagepassingisd