RobustDe-anonymizationofLargeDatasets(HowtoBreakAnonymityoftheNetflixPrizeDataset)ArvindNarayananandVitalyShmatikovTheUniversityofTexasatAustinFebruary5,2008AbstractWepresentanewclassofstatisticalde-anonymizationattacksagainsthigh-dimensionalmicro-data,suchasindividualpreferences,recommendations,transactionrecordsandsoon.Ourtechniquesarerobusttoperturbationinthedataandtoleratesomemistakesintheadversary’sbackgroundknowledge.Weapplyourde-anonymizationmethodologytotheNetflixPrizedataset,whichcontainsanonymousmovieratingsof500,000subscribersofNetflix,theworld’slargestonlinemovierentalservice.Wedemonstratethatanadversarywhoknowsonlyalittlebitaboutanindividualsubscribercaneasilyidentifythissubscriber’srecordinthedataset.UsingtheInternetMovieDatabaseasthesourceofbackgroundknowledge,wesuccessfullyidentifiedtheNetflixrecordsofknownusers,uncoveringtheirapparentpoliticalpreferencesandotherpotentiallysensitiveinformation.1IntroductionDatasetscontaining“micro-data,”thatis,informationaboutspecificindividuals,areincreasinglybecomingpublic—bothinresponseto“opengovernment”laws,andtosupportdataminingresearch.Somedatasetsincludelegallyprotectedinformationsuchashealthhistories;otherscontainindividualpreferences,pur-chases,andtransactions,whichmanypeoplemayviewasprivateorsensitive.Privacyrisksofpublishingmicro-dataarewell-known.Evenifidentifyinginformationsuchasnames,addresses,andSocialSecuritynumbershasbeenremoved,theadversarycanusecontextualandback-groundknowledge,aswellascross-correlationwithpubliclyavailabledatabases,tore-identifyindividualdatarecords.Famousre-identificationattacksincludede-anonymizationofaMassachusettshospitaldis-chargedatabasebyjoiningitwithwithapublicvoterdatabase[22],de-anonymizationofindividualDNAsequences[19],andprivacybreachescausedby(ostensiblyanonymized)AOLsearchdata[12].Micro-dataarecharacterizedbyhighdimensionalityandsparsity.Informally,micro-datarecordscontainmanyattributes,eachofwhichcanbeviewedasadimension(anattributecanbethoughtofasacolumninadatabaseschema).Sparsitymeansthatapairofrandomrecordsarelocatedfarapartinthemulti-dimensionalspacedefinedbytheattributes.Thissparsityisempiricallywell-established[6,4,16]andrelatedtothe“fattail”phenomenon:individualtransactionandpreferencerecordstendtoincludestatisticallyrareattributes.Ourcontributions.Wepresentaverygeneralclassofstatisticalde-anonymizationalgorithmswhichdemonstratethefundamentallimitsofprivacyinpublicmicro-data.Wethenshowhowthesemethodscanbeusedinpracticetode-anonymizetheNetflixPrizedataset,a500,000-recordpublicdataset.1arXiv:cs/0610105v2[cs.CR]22Nov2007Ourfirstcontributionisarigorousformalmodelforprivacybreachesinanonymizedmicro-data(sec-tion3).Wepresenttwodefinitions,onebasedontheprobabilityofsuccessfulde-anonymization,theotherontheamountofinformationrecoveredaboutthetarget.Unlikepreviouswork[22],wedonotassumeapriorithattheadversary’sknowledgeislimitedtoafixedsetof“quasi-identifier”attributes.Ourmodelthusencompassesamuchbroaderclassofde-anonymizationattacksthansimplecross-databasecorrelation.Oursecondcontributionisageneralde-anonymizationalgorithm(section4).Underverymildas-sumptionsaboutthedistributionfromwhichtherecordsaredrawn,theadversarywithasmallamountofbackgroundknowledgeaboutanindividualcanuseittoidentify,withhighprobability,thisindividual’srecordintheanonymizeddatasetandtolearnallanonymouslyreleasedinformationabouthimorher,in-cludingsensitiveattributes.Forsparsedatasets,suchasmostreal-worlddatasetsofindividualtransactions,preferences,andrecommendations,verylittlebackgroundknowledgeisneeded(asfewas5-10attributesinourcasestudy).Ourde-anonymizationalgorithmisrobusttoimprecisionoftheadversary’sbackgroundknowledgeandtosanitizationorperturbationthatmayhavebeenappliedtothedatapriortorelease.Itworksevenifonlyasubsetoftheoriginaldatasethasbeenpublished.OurthirdcontributionisapracticalanalysisoftheNetflixPrizedataset,containinganonymizedmovieratingsof500,000Netflixsubscribers(section5).Netflix—theworld’slargestonlinemovierentalservice—publishedthisdatasettosupporttheNetflixPrizedataminingcontest.Wedemonstratethatanadversarywhoknowsonlyalittlebitaboutanindividualsubscribercaneasilyidentifyhisorherrecordifitispresentinthedataset,or,attheveryleast,identifyasmallsetofrecordswhichincludethesubscriber’srecord.Theadversary’sbackgroundknowledgeneednotbeprecise,e.g.,thedatesmayonlybeknowntotheadversarywitha14-dayerror,theratingsmaybeknownonlyapproximately,andsomeoftheratingsanddatesmayevenbecompletelywrong.Becauseouralgorithmisrobust,ifituniquelyidentifiesarecordinthepublisheddataset,withhighprobabilitythisidentificationisnotafalsepositive,eventhoughthedatasetcontainstherecordsofonly18ofallNetflixsubscribers(asoftheendof2005,whichisthe“cutoff”dateofthedataset),2RelatedworkUnlikestatisticaldatabases[26,1,3,7,5],micro-datadatasetscontainactualrecordsofindividualsevenafteranonymization.Apopularapproachtomicro-dataprivacyisk-anonymity[24,23,8].Thedatapub-lishermustdetermineinadvancewhichoftheattributesareavailabletotheadversary(thesearecalled“quasi-identifiers”),andwhich