Chapter 6

Volume Ray Casting on CUDA

The performance of graphics processors (GPUs) is improving at a rapid rate, almost doubling every year. Such an evolution has been made possible because the GPU is specialized for highly parallel, compute-intensive applications, primarily graphics rendering, and is thus designed so that more transistors are devoted to computation rather than to caching and branch prediction units. Because compute-intensive applications have high arithmetic intensity (the ratio of arithmetic operations to memory operations), memory latency on GPUs can be hidden with computation instead of caches. In addition, since the same instructions are executed on many data elements in parallel, sophisticated flow control units such as the branch prediction units found in CPUs are not needed on GPUs to the same degree.

Although the performance of 3D graphics rendering achieved by dedicating graphics hardware to it far exceeds what is achievable with the CPU alone, graphics programmers have, up to now, had to give up programmability in exchange for speed. They were limited to using a fixed set of graphics operations. On the other hand, instead of using GPUs, images for films and videos are rendered with off-line rendering systems that use general-purpose CPUs to render a frame in hours, because general-purpose CPUs give graphics programmers a great deal of flexibility to create rich effects. The generality and flexibility of CPUs are what the GPU has been missing until very recently.

In order to reduce this gap, graphics hardware designers have continuously introduced more programmability through several generations of GPUs. Up until 2000, no programmability was supported in GPUs. In 2001, however, vertex-level programmability started to appear, and in 2002, pixel-level programmability also started being provided on GPUs such as NVIDIA's GeForce FX family and ATI's Radeon 9700 series. This level of programmability gives programmers considerably more configurability by making it possible to specify a sequence of instructions for both the vertex and fragment processors.

However, accessing the computational power of GPUs for non-graphics applications or for global illumination rendering such as ray tracing often requires ingenious effort. One reason is that GPUs could only be programmed through a graphics API such as OpenGL, which imposes a significant overhead on non-graphics applications. Programmers had to express their algorithms in terms of inadequate APIs, which sometimes required heroic efforts to make efficient use of the GPU. Another reason is the limited writing capability of the GPU: a GPU program could gather data elements from any part of memory, but could not scatter data to arbitrary locations, which removes much of the programming flexibility available on the CPU.

In order to overcome these limitations, NVIDIA has developed a new hardware and software architecture, called CUDA (Compute Unified Device Architecture), for issuing and managing computations on the GPU as a data-parallel computing device that does not require mapping instructions to a graphics API [NVI07]. CUDA provides general memory access, and thus a GPU program is now allowed to read from and write to any location in memory.

In order to harness the power of the CUDA architecture, we need new design strategies and techniques that fully utilize the new features of the architecture. CUDA is tailored for data-parallel computations and thus is not well suited for other types of computations. Moreover, the current version of CUDA requires programmers to understand the specific architecture details in order to achieve the desired performance gains. Programs written without careful attention to these details are very likely to perform poorly.

In this chapter, we explore the application of our streaming model, which was introduced in the previous chapter for the Cell processor, to the CUDA architecture. Since the model is designed for heterogeneous compute resource environments, it is also well suited for the combined CPU and CUDA environment. Our basic strategy in the streaming model is the same as in the case of the Cell processor: we assign the worklist generation to the first stage (CPU) and the actual rendering work to the second stage (CUDA), with data movement streamlined through the two stages. The key is that we carefully match the performance of the two stages so that the two processes are completely overlapped and no stage has to wait for input from the other stage.
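As a rough illustration of this arrangement, the sketch below double-buffers the worklist on the host. Because a CUDA kernel launch returns control to the CPU immediately, the host can build the worklist for the next frame while the device renders the current one. The WorkItem structure, build_worklist(), and render_worklist() are hypothetical placeholders standing in for the empty space skipping stage and the ray casting kernel; this is a minimal sketch of the overlap under those assumptions, not the implementation evaluated in this chapter.

```cuda
// Hypothetical sketch of the two-stage pipeline: the CPU builds the worklist
// for frame f+1 while the GPU renders frame f.  WorkItem, build_worklist()
// and render_worklist() are illustrative placeholders.

#include <cuda_runtime.h>
#include <vector>

struct WorkItem { int tile_x, tile_y; };     // one screen tile to ray cast

// Placeholder kernel: one block per work item, one thread per pixel of a tile.
__global__ void render_worklist(const WorkItem* items, float* image, int width)
{
    WorkItem it = items[blockIdx.x];
    int px = it.tile_x * blockDim.x + threadIdx.x;
    int py = it.tile_y * blockDim.y + threadIdx.y;
    image[py * width + px] = 1.0f;           // real code would cast a ray here
}

// Placeholder CPU stage: in the real system this is where hierarchical
// empty space skipping decides which tiles actually need rendering.
std::vector<WorkItem> build_worklist(int /*frame*/, int tiles_x, int tiles_y)
{
    std::vector<WorkItem> items;
    for (int y = 0; y < tiles_y; ++y)
        for (int x = 0; x < tiles_x; ++x)
            items.push_back({x, y});
    return items;
}

int main()
{
    const int width = 512, height = 512, tile = 16;
    const int tiles_x = width / tile, tiles_y = height / tile;
    const int num_frames = 4;

    float* d_image = nullptr;
    WorkItem* d_items = nullptr;
    cudaMalloc(&d_image, width * height * sizeof(float));
    cudaMalloc(&d_items, tiles_x * tiles_y * sizeof(WorkItem));

    std::vector<WorkItem> current = build_worklist(0, tiles_x, tiles_y);

    for (int f = 0; f < num_frames; ++f) {
        cudaMemcpy(d_items, current.data(), current.size() * sizeof(WorkItem),
                   cudaMemcpyHostToDevice);

        // The launch is asynchronous: the CPU is free to prepare the next
        // frame's worklist while the GPU renders the current one.
        dim3 block(tile, tile);
        render_worklist<<<static_cast<unsigned>(current.size()), block>>>(
            d_items, d_image, width);

        if (f + 1 < num_frames)
            current = build_worklist(f + 1, tiles_x, tiles_y);

        cudaDeviceSynchronize();             // the two stages rejoin here
    }

    cudaFree(d_items);
    cudaFree(d_image);
    return 0;
}
```

In this reading, matching the performance of the two stages amounts to keeping build_worklist() no slower than the rendering kernel, so that the synchronization point at the end of each iteration never leaves the GPU waiting for input.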
Our scheme features the following. First, we essentially remove the overhead caused by traversing the hierarchical data structure by overlapping the empty space skipping process with the actual rendering process. Second, our algorithms are carefully tailored to take into account the CUDA architecture's unique details, such as the concept of a warp and the local shared memory, in order to achieve high performance. Last, the ray casting performance is 1.5 times better than that of the Cell processor with only a third of the lines of code of the Cell implementation, and 15 times better than that of the Intel Xeon processor.

6.1 The CUDA Architecture Overview

The CUDA (Compute Unified Device Architecture) hardware model has a set of SIMD multiprocessors, as shown in Figure 6.1. Each multiprocessor has a small local shared memory, a constant cache, a texture cache, and a set of processors. At any given clock, every processor in the multiprocessor executes the same instruction. For example, the NVIDIA GeForce 8800 GTX architecture is comprised of 16 multiprocessors, each with 8 streaming processors, for a total of 128 processors.

Figure 6.2 shows the CUDA programming model. CUDA allows programmers to use the C language instead of graphics APIs such as OpenGL and Direct3D. In CUDA, the GPU is a compute device that can execute a very high number of threads in parallel.
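The sketch below is an illustrative example of this programming model rather than code from this dissertation: a short kernel written in C that is executed by one thread per input element, with the threads of each block cooperating through the multiprocessor's on-chip shared memory.

```cuda
// Illustrative CUDA example (not from the thesis): each block sums 256 input
// elements using the multiprocessor's shared memory, one thread per element.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void block_sum(const float* in, float* out)
{
    __shared__ float partial[256];           // on-chip shared memory per block

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;
    partial[tid] = in[idx];
    __syncthreads();

    // Tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = partial[0];        // one partial sum per block
}

int main()
{
    const int n = 1 << 20, threads = 256, blocks = n / threads;
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));  // all-zero input for the demo

    // 4096 blocks of 256 threads: roughly a million threads in flight.
    block_sum<<<blocks, threads>>>(d_in, d_out);
    cudaDeviceSynchronize();

    float h = 0.0f;
    cudaMemcpy(&h, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("block 0 sum = %f\n", h);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Because all threads of a block, and hence of the warps the block is divided into, follow the same instruction stream, a regular data-parallel structure such as this maps naturally onto the SIMD multiprocessors described above.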