Big Data Management and Data Quality: Approaches in the U.S. Financial Industry
汪时奇 (Ph.D.)

• Processing speed
• Capacity limits
• Data quality

Overview
• Data = information (not merely a collection of numbers)
• Data science ≈ information science
• Why study big data?
  – Because prices of related products (e.g. hard disks, memory, CPUs) are falling exponentially
  – Because of the information explosion
  – Because big data creates many new problems
• Big data research is multidisciplinary (IT, DM, BI, BA, …)
• Industry approaches to big data problems (see below)

1. Database strategy
• 1.1 Database (DB) performance
• 1.2 DB space

1.1 DB performance
• Auditing
  – 2 tables: a small active one & a huge passive one
• Partition
• Index (good/bad; Cluster; Global/Local)
• Lock type (when to apply row locks)
• Transaction: 1-phase or 2-phase
• Normalization
• Internal optimization (e.g. execution plan; hints in Oracle)
• Use constraints (e.g. Check) in place of triggers
• Tricks (e.g. date functions; search the small table first; …)

1.2 DB space
• Space arrangement for even distribution (e.g. 1 huge table uses a few data files)
• Cleaning procedure with defragmentation
• Partition design with a cleaning plan

2. Applications (software) (Java example)
• Use a high-level language (e.g. Java or C#)
• 2.1 Memory
• 2.2 Disk/network space
• 2.3 Performance
• 2.4 Maintainability

2.1 Memory
• Minimize creation and coexistence of big objects
• GC (Garbage Collection), or null big objects once they are out of scope
  – Choose an appropriate GC type
  – gc()
• Try to split one big object into smaller objects
• Use a mutable class for frequently changed big objects (e.g. StringBuilder instead of String)

2.2 Disk/network space
• Smart clean and archive processes
  e.g. archive zipped old or unused files to low-speed network space, and delete very old files from that space
• Smart logging settings
  – e.g. log4j size rolling
  – e.g. avoid duplicated or trivial logging info
• Monitor space usage

2.3 Performance
• Avoid redundant processing (in big loops); maximize reuse
• Multi-threading
• DB access
• Logging: avoid slow options (e.g. line #)

2.4 Maintainability
• SOA principles
  Loose coupling, reusability, granularity, modularity, composability, componentization, interoperability, …
• JEE patterns (DAO, DTO, Business Delegate, …)
• Design patterns (23) and MVC
  – Creational
  – Structural
  – Behavioral (e.g. Visitor)
• OOP principles
  – Abstraction, encapsulation, polymorphism, …
  – Open/Closed

3. Data quality control
• 3.1 Business
• 3.2 Process
  A. Failover & DR (Disaster Recovery)
  B. QA (Quality Assurance) (see 软件质量管理点滴, "Notes on Software Quality Management", for details)
  C. UAT (User Acceptance Test)
• 3.3 Technology

3.1 Business
A. Reduce manual work; increase automation
B. Complete the approval system for manual work
   E.g. move from 1-level to 2-level or 3-level approval
C. Extend viewpoints to confirm data quality
D. Reduce redundant systems (e.g. due to mergers, due to vendors)
E. Schedule Cleansing (see details)
F. Enhance Reconciliation (see details)
G. Build Trust level (see details)
H. Try to cover all rare cases

3.1.E Cleansing
• When
  – At system merge
  – At major change
• How
  – Develop detection applications
  – Deliver mismatch reports to IT & business
  – Find solutions on both the IT & business sides

3.1.F Reconciliation
• Where
  – 1+ subsystems hold data for the same content.
  – 1+ subsystems have independent data-change functionality.
• What
  – Run & improve the reconciliation application routinely.
  – Categorize reports by urgency.
  – Analyze reports.
  – Debug or adjust business rules, or apply Cleansing.

3.1.G Trust level
• When
  – There are 1+ fixed data inputs
  – Inputs are independent
  – Final details must be decided from the inputs
• How (based on)
  – Provider level (for a detailed data group)
  – Data history
  – Samples: Bloomberg, Reuters, Telekurs, DTCC, …; Moody's, S&P, Fitch.

3.2.A Failover & DR
• Failover
  – DB: 2+ at different locations; real-time replication
  – App
    • Active-Active: Cluster with Load Balancing
    • Active-Passive
      – Auto (via SAN)
      – Manual + Auto
• DR
  – DB: e.g. daily, hourly, or real-time replication
  – App: Manual switch

3.3 Technology
• DB design
  – Constraint 'Check' (for sensitive table values)
  – Normalization (to reduce duplication)
  – Validation processes (to find conflicting data)
• Application design
  – Data integrity check
    • E.g. cryptographic signature
    • E.g. CRC check
  – Data display (e.g. Excel dropping leading zeros, or showing dates as numbers)
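
The CRC check listed under application design can be sketched as follows. This is a minimal illustration, not a method taken from the slides: it assumes the sender stores a CRC-32 value alongside each record and the receiver recomputes it. The function names crc32_of and verify and the sample record are hypothetical; zlib.crc32 is from the Python standard library.

```python
import zlib


def crc32_of(payload: bytes) -> int:
    """Return the CRC-32 checksum of the payload as an unsigned integer."""
    return zlib.crc32(payload)


def verify(payload: bytes, expected_crc: int) -> bool:
    """Recompute the checksum on the receiving side and compare it."""
    return crc32_of(payload) == expected_crc


# Hypothetical record; the sender computes the checksum before transfer.
record = b"ACCT=0012345;AMT=100.00"
checksum = crc32_of(record)

# The same bytes pass the check; a record altered in transit does not.
corrupted = b"ACCT=0012346;AMT=100.00"
print(verify(record, checksum))
print(verify(corrupted, checksum))
```

A CRC only guards against accidental corruption. Where deliberate tampering is a concern, the cryptographic signature mentioned in the same list is the appropriate tool.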