Privacy Protection in the Big Data Era: Challenges and Opportunities
Dr. Zhenjie Zhang
Advanced Digital Sciences Center, University of Illinois

Outline
• Why privacy?
• Privacy Attacking Examples
• Old models and limitations
  – K-anonymity
  – L-diversity
• Differential Privacy

Publishing sensitive data about individuals.
• Medical research
  – What treatments have the best outcomes?
  – How can we recognize the onset of disease earlier?
  – Are certain drugs better for certain phenotypes?
• Web search
  – What are people really looking for when they search?
  – How can we give them the most authoritative answers?
• Public health
  – Where are our outbreaks of unpleasant diseases?
  – What behavior patterns or patient characteristics are correlated with these diseases?

Today, access to these data sets is usually strictly controlled.
Only available:
• Inside the company/agency that collected the data
• Or after signing a legal contract
  – Click streams, taxi data
• Or in very coarse-grained summaries
  – Public health
• Or after a very long wait
  – US Census data details
• Or with definite privacy issues
  – US Census reports, the AOL click stream, old NIH dbGaP summary tables, Enron email
• Or with IRB (Institutional Review Board) approval
  – dbGaP summary tables
Society would benefit if we could publish some useful form of the data, without having to worry about privacy.

Why is access so strictly controlled?
No one should learn who had which disease.
[Table: “Microdata”]

What if we “de-identify” the records by removing names?
[Figure: the de-identified table is published]

We can re-identify people, absolutely or probabilistically
[Figure: the published table is joined with a voter registration list (“background knowledge”) on the quasi-identifier (QI) attributes]

87% of Americans can be uniquely identified by {zip code, gender, date of birth} (actually 63% [Golle 06]).
Latanya Sweeney [International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002] used this approach to re-identify the medical record of an ex-governor of Massachusetts.

Real query logs can be very useful to CS researchers. But click history can uniquely identify a person.
Log format: AnonID, Query, QueryTime, ItemRank, domain name clicked
What the New York Times did:
  – Find all log entries for AOL user 4417749
  – Multiple queries for businesses and services in Lilburn, GA (population 11K)
  – Several queries for Jarrett Arnold
    • Lilburn has 14 people with the last name Arnold
  – NYT contacts them, finds out AOL user 4417749 is Thelma Arnold

Just because data looks hard to re-identify doesn’t mean it is. [Narayanan and Shmatikov, Oakland 08]
In 2006, the Netflix movie rental service offered a $1,000,000 prize for improving their movie recommendation service.
Training data: ~100M ratings of 18K movies from ~500K randomly selected customers, plus dates
Only 10% of their data; slightly perturbed

              High School Musical 1   High School Musical 2   High School Musical 3   Twilight
Customer #1   4                       5                       5                       ?

We can re-identify a Netflix rater if we know just a little bit about her
• 8 movie ratings (≤ 2 wrong, dates ±2 weeks) re-identify 99% of raters
• 2 ratings, ±3 days, re-identify 68% of raters
  – Relatively few candidates for the other 32% (especially with movies outside the top 100)
• Even a handful of IMDB comments allows Netflix re-identification, in many cases
  – Of 50 IMDB users, 2 were re-identified with very high probability, one from ratings, one from dates

The Netflix attack works because the data are sparse and dissimilar, with a long tail.
Considering just movies rated, for 90% of records there isn’t a single other record that is more than 30% similar

Why should we care about this innocuous data set?
• All movie ratings → political and religious opinions, sexual orientation, …
• Everything bought in a store → private life details
• Every doctor visit → private life details
“One customer … sued Netflix, saying she thought her rental history could reveal that she was a lesbian before she was ready to tell everyone.”

It is becoming routine for medical studies to include a genetic component.
Genome-wide association studies (GWAS) aim to identify the correlation between diseases, e.g., diabetes, and the patient’s DNA, by comparing people with and without the disease.
GWAS papers usually include detailed correlation statistics.
Our attack: uncover the identities of the patients in a GWAS
  – For studies of up to moderate size and for a significant fraction of people, we can determine whether a specific person has participated in a particular study within 10 seconds, with high confidence!
Example paper: A genome-wide association study identifies novel risk loci for type 2 diabetes, Nature 445, 881-885 (22 February 2007)

GWAS papers usually include detailed correlation statistics.
[Figure: human DNA with SNP1, SNP2, …, SNP3, SNP4, SNP5, and diabetes]
SNPs 2 and 3 are linked, and so are SNPs 4 and 5. SNPs 1, 3, and 4 are associated with diabetes.
Publish: linkage disequilibrium between these SNP pairs.
Publish: p-values of these SNP-disease pairs.

Privacy attacks can use SNP-disease association.
Idea [Homer et al. PLoS Genet. ’08, Jacobs et al. Nature ’09]:
  – Obtain aggregate SNP info from the published p-values (1)
  – Obtain a sample DNA of the target individual (2)
  – Obtain the aggregate SNP info of a ref. population (3)
  – Compare (1), (2), (3)
[Figure: per-SNP values (SNP1 … SNP5) for three vectors: aggregate DNA of patients in a study, DNA of an individual, and aggregate DNA of a reference population (background knowledge)]

Privacy attacks can use both SNP-disease and SNP-SNP associations.
Idea [Wang et al., CCS ’09]:
  – Model the patients’ SNPs as a matrix of unknowns
  – Obtain column sums from the published p-values
  – Obtain pair-wise column dot-products from the published LDs
  – Solve the matrix using integer programming

Patient   SNP1   SNP2   SNP3   SNP4   SNP5
1         x11    x12    x13    x14    x15
2         x21    x22    x23    x24    x25
3         x31    x32    x33    x34    x35

Each SNP can only be 0 or 1 (with a dominance model)
x11 + x21 + x31 = 2
x13·x14 + x23·x24 + x33·x34