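JOURNAL OF ENGINEERING STUDIES, 6(3): 266-274, Sep. 2014
Special issue: "Fundamental Theory and Key Technologies in Big Data Processing"
DOI: 10.3724/SP.J.1224.2014.00266

Key Technologies and Practice of Big Data Platforms for Scientific Research

CHENG Yaodong (程耀东), CHEN Gang (陈刚)
([…] 100049)

Received: 2013-10-08; revised: 2013-11-25
Funding: […] (11205179)
About the authors: […] (1977–) […], E-mail: chyd@ihep.ac.cn; […] (1961–) […]

Abstract: Taking data processing in high-energy physics as an example, this paper first analyzes the challenges that big data platforms supporting scientific research face in storage and processing capacity, data transfer, and data sharing, and shows that existing technologies can no longer keep up with the rapidly growing demand for data processing. It then presents a typical architecture for a scientific big data platform and discusses its key technologies, including data acquisition and cleaning, data storage, data processing, data transfer, and data sharing and security, together with the state of research or the mainstream systems for each. Finally, it describes the design approach and implementation framework of the open big data platform for scientific research at the Institute of High Energy Physics, Chinese Academy of Sciences. The platform attempts to address several problems in the current development of big data, such as insufficient data openness and cross-domain integration and the low performance of cross-region data transfer, so as to unlock the value of data and lower the barrier to building applications.

Keywords: big data; data storage; parallel data processing; open platform

CLC number: O57    Document code: A    Article ID: 1674-4969(2014)03-0266-09

Introduction

[…] LHC (Large Hadron Collider) […] 25 PB [1] […] 1 mm³ […] 1 PB […] IDC (International Data Corporation) […] 2011 […] 200 PB [2] […] 2012 […] IDC […] 2010 […] ZB (1 ZB = 10^6 PB) […] 2020 […] 50 [3] […] 1 […]

Figure 1. Global data growth (source: IDC)

1 Scientific Big Data and the Current State of Its Computing Platforms

1.1 […]

[…] 17 […] LHC […] BESIII […] 2012 […] 200 PB […] 1000 PB […]

1.2 […]

[…] LHC […] LHC […] European Organization for Nuclear Research (CERN) [4] […] LHC […] ALICE (A Large Ion Collider Experiment) […] ATLAS (A Toroidal LHC ApparatuS) […] CMS (Compact Muon Solenoid) […] LHCb (Large Hadron Collider beauty experiment) […] 2 […] LHC […] LHC […] 2009 […] PB […] 2012 […] 200 PB […] LHC […]

Figure 2. The four main experiments of the LHC

[…] LHC […] 20 […] LHC […] LHC […] WLCG (Worldwide LHC Computing Grid) [1] […] 3 […] LHC […] CERN […] Tier0 […] Tier1 […] Tier2 […] Tier3 […] LHC […] LHC […] LHC […] WLCG […] 20025 […] CPU […] LHC […]

Figure 3. WLCG grid architecture

1.3 […]

[…] BESIII […] BEPCII […] 2.0~4.6 GeV […] BEPCII […] BESIII [5] […] BESIII […] BESIII […] 3.6 PB […] 1.8 PB […] BESIII […] BESIII […] 10 PB […] BESIII […] 15 […] BESIII […] 400 […] BESIII […] BESIII […] BESIII […] BESIII […] GRASS (Grid-enabled Advanced Storage System) [6] […] Lustre [7] […] 2012 […] 3 PB […] 25 GB/s […] BESIII […] BESIII [8] […] EMI (European Middleware Initiative) […] GOS (Grid Operating System) […] Dirac […] BESIII […]

1.4 […]

[…]

2 Key Technologies of Big Data Platforms for Scientific Research

2.1 […]

[…] 4 […] IT […] Geant4 […] Gaudi […]

Figure 4. Basic architecture of a big data platform for scientific research

[…] FLUENT […] Mahout [9] […]

2.2 […]

[…] LHC […] ATLAS (A Toroidal LHC ApparatuS) […] 40 MHz […] 1 PB/s […] ATLAS […] trigger system […] ATLAS […] 320 MB/s [10] […] 5 […] LHC […] LHC […] 25 PB […]

Figure 5. Data acquisition and filtering flow of the ATLAS experiment [10]

[…] web […] deep web [11] […] 90% […] Gartner […] 1000 […] 25% […] 1%~30% […] 13.6%~81% [12] […]

2.3 […]

[…] POSIX […] Lustre […] Gluster […] GPFS […] ISILON […] 70% […] Lustre [13] […] POSIX API […] Google File System (GFS) [14] […] HDFS (Hadoop Distributed File System) [15] […] HDFS […] PB […] CASTOR [16] […] dCache [17] […] SSD […] SATA […] flashcache [18] […] flashcachegroup [19] […]

2.4 […]

[…] [20] […] high throughput computing (HTC) […] MPI (Message Passing Interface) [21] […] MapReduce [22] […] open source […] Storm [23] […] S4 [24] […] StreamBase [25] […] MPI […] IBM Platform LSF [26] […] Condor [27] […] Torque/PBS [28] […] MapReduce […] 2004 […] Hadoop […] MapReduce […] Hive [29] […] Pig [30] […] Sawzall [31] […] MapReduce […] MapReduce […] Hadoop […]

2.5 […]

[…] 80 […] FTS (File Transfer Service) [32] […] Phedex [33] […] Software Defined Network (SDN) […] LHC Open Network Environment (LHCONE) [34] […]

2.6 […]

[…] k-anonymity […] l-Diversity […] t-Closeness […] F-Anonymity […] 2006 […] Dwork […] Differential Privacy [35] […]

3 An Open Big Data Platform for Scientific Research in Practice

[…] 66 […] IaaS […] DaaS […] PaaS […] SaaS […]

Figure 6. Architecture of the open big data platform for scientific research at the Institute of High Energy Physics, Chinese Academy of Sciences

[…] DaaS […] PaaS […] SaaS […] Chinese High Energy Physics Data Transfer Network (CHEPDTN) […] SDN […] IPv4 […] IPv6 […]

4 Summary

[…] [36] […]

References

[1] Worldwide LHC Computing Grid. Home[EB/OL]. (2013)[2013-08-30].
[2] […]. […][J]. […], 2012, 8(9): 8-15.
[3] Gantz J, Reinsel D. The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East: United States[J/OL]. IDC iView: IDC Analyze the Future, 2012.
[4] CERN. The Large Hadron Collider[EB/OL]. (2013)[2013-08-30].
[5] IHEP. BESIII Experiment[EB/OL]. (2013)[2013-08-30].
[6] […], […], […], […]. GRASS […][J]. […], 2011, 31(9): 969-972.
[7] […], […], […]. Lustre […] BES […][J]. […], 2010, 30(12): 1574-1578.
[8] Deng Z Y, Li W D, Lin L, et al. Experience of BESIII Data Production with Local Cluster and Distributed Computing Model[J]. Journal of Physics: Conference Series, 2012, 396(3): 032031.
[9] Mahout. Scalable machine learning and data mining[EB/OL]. [2013-08-30].
[10] Wenaus T. Challenges of the LHC: Computing[R/OL]. 20th Anniversary of the Winter Aspen Physics Conferences. (2005-02-19)[2013-08-30].
[11] Bergman M K. The Deep Web: Surfacing Hidden Value[J]. Journal of Electronic Publishing, 2001, 7(1): 7-10.
[12] […], […], […]. […][J]. […], 2012, 8(9): 22-30.
[13] Schwan P. Lustre: Building a File System for 1000-Node Clusters[C]//Proceedings of the 2003 Linux Symposium. 2003.
[14] Ghemawat S, Gobioff H, Leung S T. The Google File System[C]//ACM SIGOPS Operating Systems Review. ACM, 2003, 37(5): 29-43.
[15] Shvachko K, Kuang H, Radia S, et al. The Hadoop Distributed File System[C]//Proceedings of the IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST). IEEE, 2010: 1-10.
[16] CASTOR: CERN Advanced STORage manager. Home[EB/OL]. (2013)[2013-08-30].
[17] Fuhrmann P. dCache, the Commodity Cache[C]//IEEE/NASA Goddard Conference on Mass Storage Systems and Technologies (MSST). IEEE, 2004: 171-175.
[18] Facebook Inc. Facebook/FlashCache[EB/OL]. (2013)[2013-08-30].
[19] GitHub Inc. Lihuiba/flashcachegroup (fcg)[EB/OL]. (2013)[2013-08-30].
[20] Raman R, Livny M, Solomon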
M. Matchmaking: Distributed Resource Management for High Throughput Computing[C]//Proceedings of the Seventh International Symposium on High Performance Distributed Computing, 1998. IEEE, 1998: 140-146.
[21] Gropp W, Lusk E, Doss N, et al. A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard[J]. Parallel Computing, 1996, 22(6): 789-828.
[22] Dean J, Ghemawat S. MapReduce: Simplified Data Processing on Large Clusters[J]. Communications of the ACM, 2008, 51(1): 107-113.
[23] Storm: Distributed and Fault-Tolerant Realtime Computation[EB/OL]. (2013)[2013-08-30].
[24] Neumeyer L, Robbins B, Nair A, et al. S4: Distributed stream computing platform[C]//Inte