How To Break Anonymity of the Netflix Prize Datase

mebious
1 ℃
2020-07-02

整理文档很辛苦，赏杯茶钱您下走！

还剩 ... 页未读，继续阅读 >>

免费阅读已结束，点击下载阅读编辑剩下 ... 页

阅读已结束，您可以下载文档离线阅读编辑

资源描述

RobustDe-anonymizationofLargeDatasets(HowtoBreakAnonymityoftheNetﬂixPrizeDataset)ArvindNarayananandVitalyShmatikovTheUniversityofTexasatAustinFebruary5,2008AbstractWepresentanewclassofstatisticalde-anonymizationattacksagainsthigh-dimensionalmicro-data,suchasindividualpreferences,recommendations,transactionrecordsandsoon.Ourtechniquesarerobusttoperturbationinthedataandtoleratesomemistakesintheadversary’sbackgroundknowledge.Weapplyourde-anonymizationmethodologytotheNetﬂixPrizedataset,whichcontainsanonymousmovieratingsof500,000subscribersofNetﬂix,theworld’slargestonlinemovierentalservice.Wedemonstratethatanadversarywhoknowsonlyalittlebitaboutanindividualsubscribercaneasilyidentifythissubscriber’srecordinthedataset.UsingtheInternetMovieDatabaseasthesourceofbackgroundknowledge,wesuccessfullyidentiﬁedtheNetﬂixrecordsofknownusers,uncoveringtheirapparentpoliticalpreferencesandotherpotentiallysensitiveinformation.1IntroductionDatasetscontaining“micro-data,”thatis,informationaboutspeciﬁcindividuals,areincreasinglybecomingpublic—bothinresponseto“opengovernment”laws,andtosupportdataminingresearch.Somedatasetsincludelegallyprotectedinformationsuchashealthhistories;otherscontainindividualpreferences,pur-chases,andtransactions,whichmanypeoplemayviewasprivateorsensitive.Privacyrisksofpublishingmicro-dataarewell-known.Evenifidentifyinginformationsuchasnames,addresses,andSocialSecuritynumbershasbeenremoved,theadversarycanusecontextualandback-groundknowledge,aswellascross-correlationwithpubliclyavailabledatabases,tore-identifyindividualdatarecords.Famousre-identiﬁcationattacksincludede-anonymizationofaMassachusettshospitaldis-chargedatabasebyjoiningitwithwithapublicvoterdatabase[22],de-anonymizationofindividualDNAsequences[19],andprivacybreachescausedby(ostensiblyanonymized)AOLsearchdata[12].Micro-dataarecharacterizedbyhighdimensionalityandsparsity.Informally,micro-datarecordscontainmanyattributes,eachofwhichcanbeviewedasadimension(anattributecanbethoughtofasacolumninadatabaseschema).Sparsitymeansthatapairofrandomrecordsarelocatedfarapartinthemulti-dimensionalspacedeﬁnedbytheattributes.Thissparsityisempiricallywell-established[6,4,16]andrelatedtothe“fattail”phenomenon:individualtransactionandpreferencerecordstendtoincludestatisticallyrareattributes.Ourcontributions.Wepresentaverygeneralclassofstatisticalde-anonymizationalgorithmswhichdemonstratethefundamentallimitsofprivacyinpublicmicro-data.Wethenshowhowthesemethodscanbeusedinpracticetode-anonymizetheNetﬂixPrizedataset,a500,000-recordpublicdataset.1arXiv:cs/0610105v2[cs.CR]22Nov2007Ourﬁrstcontributionisarigorousformalmodelforprivacybreachesinanonymizedmicro-data(sec-tion3).Wepresenttwodeﬁnitions,onebasedontheprobabilityofsuccessfulde-anonymization,theotherontheamountofinformationrecoveredaboutthetarget.Unlikepreviouswork[22],wedonotassumeapriorithattheadversary’sknowledgeislimitedtoaﬁxedsetof“quasi-identiﬁer”attributes.Ourmodelthusencompassesamuchbroaderclassofde-anonymizationattacksthansimplecross-databasecorrelation.Oursecondcontributionisageneralde-anonymizationalgorithm(section4).Underverymildas-sumptionsaboutthedistributionfromwhichtherecordsaredrawn,theadversarywithasmallamountofbackgroundknowledgeaboutanindividualcanuseittoidentify,withhighprobability,thisindividual’srecordintheanonymizeddatasetandtolearnallanonymouslyreleasedinformationabouthimorher,in-cludingsensitiveattributes.Forsparsedatasets,suchasmostreal-worlddatasetsofindividualtransactions,preferences,andrecommendations,verylittlebackgroundknowledgeisneeded(asfewas5-10attributesinourcasestudy).Ourde-anonymizationalgorithmisrobusttoimprecisionoftheadversary’sbackgroundknowledgeandtosanitizationorperturbationthatmayhavebeenappliedtothedatapriortorelease.Itworksevenifonlyasubsetoftheoriginaldatasethasbeenpublished.OurthirdcontributionisapracticalanalysisoftheNetﬂixPrizedataset,containinganonymizedmovieratingsof500,000Netﬂixsubscribers(section5).Netﬂix—theworld’slargestonlinemovierentalservice—publishedthisdatasettosupporttheNetﬂixPrizedataminingcontest.Wedemonstratethatanadversarywhoknowsonlyalittlebitaboutanindividualsubscribercaneasilyidentifyhisorherrecordifitispresentinthedataset,or,attheveryleast,identifyasmallsetofrecordswhichincludethesubscriber’srecord.Theadversary’sbackgroundknowledgeneednotbeprecise,e.g.,thedatesmayonlybeknowntotheadversarywitha14-dayerror,theratingsmaybeknownonlyapproximately,andsomeoftheratingsanddatesmayevenbecompletelywrong.Becauseouralgorithmisrobust,ifituniquelyidentiﬁesarecordinthepublisheddataset,withhighprobabilitythisidentiﬁcationisnotafalsepositive,eventhoughthedatasetcontainstherecordsofonly18ofallNetﬂixsubscribers(asoftheendof2005,whichisthe“cutoff”dateofthedataset),2RelatedworkUnlikestatisticaldatabases[26,1,3,7,5],micro-datadatasetscontainactualrecordsofindividualsevenafteranonymization.Apopularapproachtomicro-dataprivacyisk-anonymity[24,23,8].Thedatapub-lishermustdetermineinadvancewhichoftheattributesareavailabletotheadversary(thesearecalled“quasi-identiﬁers”),andwhich