Chapter 6: Volume Ray Casting on CUDA

The performance of graphics processors (GPUs) is improving at a rapid rate, almost doubling every year. Such an evolution has been made possible because the GPU is specialized for highly parallel compute-intensive applications, primarily graphics rendering, and is thus designed so that more transistors are devoted to computation rather than to caching and branch prediction units. Because compute-intensive applications have high arithmetic intensity (the ratio of arithmetic operations to memory operations), the memory latency can be hidden with computations instead of caches on GPUs. In addition, since the same instructions are executed on many data elements in parallel, sophisticated flow control units such as the branch prediction units in CPUs are not required on GPUs to the same degree.

Although the performance of 3D graphics rendering achieved by dedicating graphics hardware to it far exceeds the performance achievable from the CPU alone, graphics programmers had, up to now, to give up programmability in exchange for speed: they were limited to using a fixed set of graphics operations. On the other hand, instead of using GPUs, images for films and videos are rendered using off-line rendering systems that use general-purpose CPUs to render a frame in hours, because general-purpose CPUs give graphics programmers a great deal of flexibility to create rich effects. The generality and flexibility of CPUs are what the GPU had been missing until very recently. In order to reduce this gap, graphics hardware designers have continuously introduced more programmability through several generations of GPUs. Up until 2000, no programmability was supported in GPUs. In 2001, vertex-level programmability started to appear, and in 2002, pixel-level programmability also started being provided on GPUs such as NVIDIA's GeForce FX family and ATI's Radeon 9700 series. This level of programmability gives programmers considerably more configurability by making it possible to specify a sequence of instructions for both the vertex and fragment processors.

However, accessing the computational power of GPUs for non-graphics applications, or for global illumination rendering such as ray tracing, often requires ingenious effort. One reason is that GPUs could only be programmed through a graphics API such as OpenGL, which imposes a significant overhead on non-graphics applications. Programmers had to express their algorithms in terms of these inadequate APIs, which sometimes required heroic efforts to make efficient use of the GPU. Another reason is the limited writing capability of the GPU: a GPU program could gather data elements from any part of memory, but could not scatter data to arbitrary locations, which removes much of the programming flexibility available on the CPU.

In order to overcome the above limitations, NVIDIA has developed a new hardware and software architecture, called CUDA (Compute Unified Device Architecture), for issuing and managing computations on the GPU as a data-parallel computing device without mapping instructions to a graphics API [NVI07]. CUDA provides a general memory access feature, so a GPU program is now allowed to read from and write to any location in memory.

In order to harness the power of the CUDA architecture, we need new design strategies and techniques that fully utilize the new features of the architecture. CUDA is tailored for data-parallel computations and thus is not well suited for other types of computation. Moreover, the current version of CUDA requires programmers to understand the specific architecture details in order to achieve the desired performance gains; programs written without careful attention to these details are very likely to perform poorly.

In this chapter, we explore the application of our streaming model, which was introduced in the previous chapter for the Cell processor, to the CUDA architecture. Since the model is designed for heterogeneous compute-resource environments, it is also well suited for the combined CPU and CUDA environment. Our basic strategy in the streaming model is the same as in the case of the Cell processor: we assign the worklist generation to the first stage (CPU) and the actual rendering work to the second stage (CUDA), with data movement streamlined through the two stages. The key is that we carefully match the performance of the two stages so that the two processes are completely overlapped and no stage has to wait for input from the other stage.
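To make the two-stage pipeline concrete, the following is a minimal sketch, not the actual code of this chapter, of how a CPU worklist-generation stage can be overlapped with a CUDA rendering stage using double-buffered pinned host memory, asynchronous copies, and CUDA streams. The names WorkItem, generateWorklist, and renderWorklist are illustrative placeholders.

    #include <cuda_runtime.h>
    #include <cstdio>

    struct WorkItem { int tileX, tileY; };   // e.g., a screen tile that survived empty-space skipping

    // Hypothetical CPU stage: walk the hierarchical data structure and emit only
    // non-empty work. Here it simply fabricates maxItems placeholder items.
    static int generateWorklist(int batch, WorkItem* out, int maxItems) {
        for (int i = 0; i < maxItems; ++i) {
            out[i].tileX = batch;
            out[i].tileY = i;
        }
        return maxItems;
    }

    // Hypothetical CUDA stage: one thread block renders one work item.
    __global__ void renderWorklist(const WorkItem* items, int n, float* frame) {
        int item = blockIdx.x;
        if (item >= n) return;
        // ... cast rays for items[item] and composite into frame ...
        if (threadIdx.x == 0) frame[item] = (float)items[item].tileY;   // placeholder work
    }

    int main() {
        const int maxItems = 4096, numBatches = 8;
        WorkItem* hostList[2]; WorkItem* devList[2]; float* devFrame;
        cudaStream_t stream[2];
        for (int b = 0; b < 2; ++b) {
            cudaMallocHost((void**)&hostList[b], maxItems * sizeof(WorkItem));  // pinned, for async copies
            cudaMalloc((void**)&devList[b], maxItems * sizeof(WorkItem));
            cudaStreamCreate(&stream[b]);
        }
        cudaMalloc((void**)&devFrame, maxItems * sizeof(float));

        for (int batch = 0; batch < numBatches; ++batch) {
            int buf = batch & 1;                      // double buffering: alternate the two buffers
            cudaStreamSynchronize(stream[buf]);       // wait until this buffer's previous batch is done
            int n = generateWorklist(batch, hostList[buf], maxItems);           // stage 1: CPU
            cudaMemcpyAsync(devList[buf], hostList[buf], n * sizeof(WorkItem),
                            cudaMemcpyHostToDevice, stream[buf]);
            renderWorklist<<<n, 64, 0, stream[buf]>>>(devList[buf], n, devFrame);  // stage 2: GPU
            // The CPU returns immediately and starts generating the next worklist,
            // so the two stages run concurrently when their throughputs are matched.
        }
        cudaDeviceSynchronize();

        for (int b = 0; b < 2; ++b) {
            cudaFreeHost(hostList[b]); cudaFree(devList[b]); cudaStreamDestroy(stream[b]);
        }
        cudaFree(devFrame);
        printf("rendered %d batches\n", numBatches);
        return 0;
    }

With two buffers, the CPU fills one worklist while the GPU renders the other, which is exactly the overlap described above; in practice the batch size would be tuned so that the two stages take roughly the same time.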
Our scheme features the following. First, we essentially remove the overhead caused by traversing the hierarchical data structure by overlapping the empty space skipping process with the actual rendering process. Second, our algorithms are carefully tailored to take into account the CUDA architecture's unique details, such as the concept of a warp and the local shared memory, to achieve high performance. Last, the ray casting performance is 1.5 times better than that of the Cell processor, with only a third of the lines of code of the Cell processor implementation, and 15 times better than that of the Intel Xeon processor.

6.1 The CUDA Architecture Overview

The CUDA (Compute Unified Device Architecture) hardware model has a set of SIMD multiprocessors, as shown in Figure 6.1. Each multiprocessor has a small local shared memory, a constant cache, a texture cache, and a set of processors. At any given clock, every processor in the multiprocessor executes the same instruction. For example, the NVIDIA GeForce 8800 GTX architecture is comprised of 16 multiprocessors, and each multiprocessor has 8 streaming processors, for a total of 128 processors.

Figure 6.2 shows the CUDA programming model. CUDA allows programmers to use the C language to program the device instead of graphics APIs such as OpenGL and Direct3D. In CUDA, the GPU is a compute device that can execute a very high number of threads in parallel.
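As a rough illustration of how this hardware model surfaces in CUDA C (a sketch with illustrative sizes, not code from this chapter), the kernel below is launched as a grid of thread blocks: each block is scheduled on one multiprocessor, its threads execute in SIMD groups (warps), and the block's working set can be staged in the multiprocessor's small local shared memory.

    #include <cuda_runtime.h>
    #include <cstdio>

    #define BLOCK_SIZE 256   // illustrative block size, a multiple of the 32-thread warp

    // The block cooperatively stages a tile of the input in shared memory, then
    // every thread scales one element; all threads of a warp issue the same instruction.
    __global__ void scaleKernel(const float* in, float* out, float s, int n) {
        __shared__ float tile[BLOCK_SIZE];
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) tile[threadIdx.x] = in[idx];
        __syncthreads();                               // all warps in the block wait here
        if (idx < n) out[idx] = s * tile[threadIdx.x];
    }

    int main() {
        const int n = 1 << 20;
        float *dIn, *dOut;
        cudaMalloc((void**)&dIn,  n * sizeof(float));
        cudaMalloc((void**)&dOut, n * sizeof(float));
        int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;    // the grid covers all n elements
        scaleKernel<<<blocks, BLOCK_SIZE>>>(dIn, dOut, 2.0f, n);
        cudaDeviceSynchronize();
        printf("launched %d blocks of %d threads\n", blocks, BLOCK_SIZE);
        cudaFree(dIn); cudaFree(dOut);
        return 0;
    }

The hardware distributes these blocks across the available multiprocessors, which is why CUDA programs are typically written in terms of many more blocks than there are multiprocessors on the device.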
