Research on Data Cleaning in the Process of Building Data Warehouse

(1. …, 210093) (2. …, 210003)

Abstract: The paper discusses research on data cleaning in the process of building a data warehouse. It introduces the types of dirty data and their causes, the research status of data cleaning at home and abroad, and the definition and objects of data cleaning. Algorithms for detecting and processing abnormal data at the attribute level and the record level are emphasized. The weaknesses of data cleaning are clarified, and future research topics in data cleaning are discussed.

Keywords: data cleaning; dirty data; outlier data detection; duplicate record detection

CLC number: G302; TP391    Document code: A    Article ID: 1003-6938(2013)05-0022-07

* … (71273126) … 2013 … (CXZZ13_0070).
Received: 2013-09-27

1
… [1] … "garbage in, garbage out" [2] … [3] … OLAP … [4] … 50 … [5] …: (1) …; (2) …; (3) …; (4) …; (5) … [3] … IBM InfoSphere QualityStage [6-7] … [8] … [9], [10], [11]; [12], [13], [14], [15]; [16], [17], [18], [19], [20-22], [23], [24], [25], [26] …

2
… [27] … 1 …
  "2012-13-01"
  AGE=32, BITHDAY=85-5-15
  Customer1(customerid=1213, idcard=12442); customer2(customerid=1214, idcard=12442)
  Max institutionid=25; Institutionid=26
  … = …, …, 1977-5-16
  Institution / institation
  458925
  1: … = Nanjing Unverisity, … = 210093; 2: … = Nanjin Unver, … = 210093
  1: … = 2300 (…); 2: … = 2300 (…)
  1: …: 2500 (…); 2: …: 15000 (…)
  1: …: 2500 (…); 2: …: 15000 (…)
  1: …: 2500 (…); 2: …: 2500 (…)

3
3.1
… [28] …
3.2
(1) …: ① … (Radio Frequency Identification, RFID): RFID … [29-30] … RFID … [31] … [32-33]. ② Web: … [34] … Google PageRank [35], IBM Clever HITS [36]; MSN VIPS [37]. ③ … ④ … OCR …
(2) …

4
4.1
… 1 … [38] … 2 … (1) …; (2) … (Binning) … (regression) …; (3) … [39] … 3 … [6] …: (1) …; (2) …; (3) …
4.2
… [38] … [25] … Smith-Waterman [25] … [40-41] … Cosine [41-42] … 4 [28] …: … [41][43], … (Sorted-Neighborhood Method, SNM) [41][44], … (Multi-Pass Sorted-Neighborhood, MPN) [41][44] … 5 [28] …: ① … [45] …; ② Fuzzy Match/merge … Smith-Waterman … Cosine …
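The Sorted-Neighborhood Method (SNM) named above can be illustrated with a minimal sketch. It assumes the usual formulation: build a sort key for each record, sort on that key, and compare each record only with the others inside a sliding window of size w, so that N records need roughly w×N comparisons instead of N². The first two records reuse the "Nanjing Unverisity" / "Nanjin Unver" values with postcode 210093 from the dirty-data examples; the field names `name` and `zip`, the third record, the sort key, and the 0.6 threshold are hypothetical choices made only for illustration, and edit distance stands in for whichever field-matching measure is actually preferred.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def sort_key(record: dict) -> str:
    # Hypothetical key: postcode followed by the first three letters of the name.
    return record["zip"] + record["name"][:3].lower()


def sorted_neighborhood(records, w=3, threshold=0.6):
    """Return index pairs (into `records`) flagged as likely duplicates."""
    order = sorted(range(len(records)), key=lambda i: sort_key(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        # Compare only with the w - 1 records that precede it in sorted order.
        for j in order[max(0, pos - w + 1):pos]:
            name_i = records[i]["name"].lower()
            name_j = records[j]["name"].lower()
            if similarity(name_i, name_j) >= threshold:
                pairs.append(tuple(sorted((i, j))))
    return pairs


if __name__ == "__main__":
    records = [
        {"name": "Nanjing Unverisity", "zip": "210093"},
        {"name": "Nanjin Unver", "zip": "210093"},
        {"name": "Southeast University", "zip": "210096"},  # hypothetical record
    ]
    print(sorted_neighborhood(records, w=3))
```

A single pass depends heavily on the chosen key: two duplicates whose keys sort far apart never share a window. Running several passes with different keys and small windows, then merging the flagged pairs, is the idea behind the multi-pass variant (MPN).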
… SNM … w …; … w×N … w … MPN …

5
…: (1) …; (2) …; (3) …; (4) …; (5) …; (6) …
…: (1) …; (2) …; (3) …; (4) …; (5) …; (6) …

References:
[1] William H. Inmon. … (4) [M]. …: …, 2006: 20.
[2] Lee M, Lu H, Ling T W, et al. Cleansing data for mining and warehousing [A]. Proceedings of the 10th International Conference on Database and Expert Systems Applications [C]. 1999: 751-760.
[3] Jiawei Han, Micheline Kamber, Jian Pei. Data Mining: Concepts and Techniques [M]. …: … (…), 2012: 84, 92-99, 543-572.
[4] Dasu T, Johnson T. Exploratory data mining and data cleaning [M]. John Wiley, 2003.
[5] Galhardas H, Florescu D. An Extensible Framework for Data Cleaning [A]. Proceedings of the 16th IEEE International Conference on Data Engineering [C]. San Diego, California, 2000: 312-312.
[6] … [J]. …, 2012, 48(12): 121-129.
[7] … [J]. …, 2002, 29(1): 118-121.
[8] … [J]. …, 2001, 24(1): 69-77.
[9] … [J]. …, 2008, 29(4): 726-729.
[10] … [J]. …, 2011, 47(30): 127-131.
[11] … [J]. …, 2005, 42(12): 2206-2212.
[12] … [J]. …, 2003, 39(17): 184-187.
[13] … XML … [J]. …, 2009, 26(1): 172-174.
[14] … [J]. …, 2005, (3): 292-296.
[15] … [J]. …, 2003, (3): 95-96, 183.
[16] … SCI … [J]. …, 2010, 28(5): 741-746.
[17] … [J]. …, 2011, 37(20): 191-193.
[18] … Token … [J]. …, 2009, 26(11): 43-45, 53.
[19] … [J]. …, 2007, 30(1): 93-96.
[20] … RFID … [J]. …, 2012, 33(10): 2158-2163.
[21] … RFID … [J]. …, 2010, 21(4): 632-643.
[22] … RFID … [J]. … (…), 2009, 30(1): 34-37.
[23] … RFID … [J]. …, 2011, 29(3): 435-442.
[24] … RFID … [J]. …, 2011, 32(9): 1794-1799.
[25] … RFID … [J]. …, 2012, 34(7): 24-27, 36.
[26] … RFID … [J]. …, 2011, 38(10A): 22-25.
[27] … [J]. …, 2002, 13(11): 2076-2082.
[28] … [J]. …, 2007, (12): 50-56.
[29] Sullivan L. RFID implementation challenges persist, all this time later [J]. InformationWeek, 2005, 1059: 34-40.
[30] Jeffery S R, Garofalakis M N, Franklin M J. Adaptive cleaning for RFID data streams [A]. Proceedings of Very Large Data Bases [C]. Seoul, Korea, 2006: 163-174.
[31] Derakhshan R, Orlowska M E, Li X. RFID data management: challenges and opportunities [A]. Proceedings of 2007 IEEE International Conference on RFID [C]. Gaylord Texan, USA, 2007: 175-182.
[32] Song Baoyan, Qin Pengfei, Wang Hao, et al. bSpace: a data cleaning approach for RFID data streams based on virtual spatial granularity [A]. 2009 9th International Conference on Hybrid Intelligent System. IEEE Computer Society [C]. 2009: 252-256.
[33] Ziekow H, Ivantysynova L. A probabilistic approach for cleaning RFID data [A]. ICDE Workshop [C]. 2008.
[34] … [J]. …, 2006, 20(3): 70-77.
[35] Sergey Brin, Lawrence Page. The anatomy of a large-scale hypertextual Web search engine [J]. Computer Networks and ISDN Systems, 1998, 30(7): 107-117.
[36] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment [J]. Journal of the ACM, 1999, 46(5): 604-632.
[37] Deng Cai, Shipeng Yu, Ji-Rong Wen, Wei-Ying Ma. VIPS: a Vision-based Page Segmentation Algorithm [R]. Microsoft Technical Report (MSR-TR-2003-79), 2003.
[38] … [D]. …: …, 2005.
[39] … [J]. …, 2004, 24(5): 116-19.
[40] Masek W, Paterson M. A Faster Algorithm Computing String Edit Distance [J]. Journal of Computer System Science, 1980, (20): 18-31.
[41] … [D]. …: …, 2004.
[42] Salton G, McGill M J. Introduction to Modern Information Retrieval [M]. New York: McGraw-Hill Book Co, 1983.
[43] Monge A, Elkan C. The Field Matching Problem: Algorithms and Applications [A]. Proceedings of the 2nd International Conference of Knowledge Discovery and Data Mining [C]. Portland, Oregon, 1996.
[44] Hernandez M, Stolfo S. Real World Data is Dirty: Data Cleansing and t