Using Protein-Protein Interactions for Refining Gene Networks Estimated from Microarray Data by Bayesian Networks
N. Nariai, S. Kim, S. Imoto, and S. Miyano
Pacific Symposium on Biocomputing 9:336-347 (2004)

USING PROTEIN-PROTEIN INTERACTIONS FOR REFINING GENE NETWORKS ESTIMATED FROM MICROARRAY DATA BY BAYESIAN NETWORKS

N. NARIAI, S. KIM, S. IMOTO, S. MIYANO
Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo, 108-8639, Japan

We propose a statistical method to estimate gene networks from DNA microarray data and protein-protein interactions. Because physical interactions between proteins or multiprotein complexes are likely to regulate biological processes, using only mRNA expression data is not sufficient for estimating a gene network accurately. Our method adds knowledge about protein-protein interactions to the estimation method of gene networks under a Bayesian statistical framework. In the estimated gene network, a protein complex is modeled as a virtual node based on principal component analysis. We show the effectiveness of the proposed method through the analysis of Saccharomyces cerevisiae cell cycle data. The proposed method improves the accuracy of the estimated gene networks, and successfully identifies some biological facts.

1 Introduction

The complete DNA sequences of many organisms, such as yeast, mouse, and human, have recently become available. Genome sequences specify the gene expressions that produce proteins of living cells, but how the biological system as a whole really works is still unknown. Currently, a large amount of gene expression data and protein-protein (p-p) interaction data has been collected from high-throughput analyses, and estimating gene networks from these data has become an important topic in systems biology.

Several methods have been proposed for estimating gene networks from microarray data by using Boolean networks [1,30], differential equation models [3,7], and Bayesian networks [8,9,12,13,14,15,16,22]. However, using only microarray data is not sufficient for estimating gene networks accurately, because the information contained in microarray data is limited by the number of arrays, their quality, noise and experimental errors. Therefore, the use of other biological knowledge together with microarray data is key to extracting more reliable information. Hartemink et al. [13] previously recognized this idea and proposed a method that uses localization data combined with microarray data for estimating a gene network. There are other works combining microarray data with biological knowledge, such as DNA sequences of promoter elements [23,32] and transcriptional bindings of regulators [26,27,29].

In this paper, we propose a statistical method for estimating gene networks from microarray data and p-p interactions by using a Bayesian network model. We extract 9,030 physical interactions from the MIPS database [21] to add knowledge about p-p interactions to the estimation method of gene networks. If multiple genes form a protein complex, then it is natural to treat them as one variable in the estimated gene network. In addition, in the estimated gene network, a protein complex is modeled as a virtual node based on principal component analysis. That is, the protein complexes are dynamically found and modeled by the proposed method while we estimate a gene network.

Previously, Segal et al. [28] proposed a method for identifying pathways from microarray data and p-p interaction data. Our method differs in that we model protein complexes directly in the Bayesian network model, with the aim of refining the estimated gene network. Our method can also decide, based on our criterion, whether to form a protein complex.

We evaluate our method through the analysis of Saccharomyces cerevisiae cell cycle gene expression data [31]. First, we estimated three gene networks: by microarray data alone, by p-p interactions alone, and by our method. Then, we compared them with the gene network compiled by KEGG for evaluation. We show that the accuracy of the estimated gene network is improved by our approach. Second, among 350 cell cycle related genes, we found 34 gene pairs as protein complexes. Judging from biological databases and the existing literature, most of them are indeed likely to form protein complexes. Third, we show an example of using additional information, “phase”, together with the microarray data and p-p interactions for estimating a more meaningful gene network.

2 Bayesian Network Model with Protein Complex

Bayesian networks (BNs) are a type of graphical model that represents relationships between variables. That is, for each variable there is a probability distribution function whose definition depends on the edges leading into the variable. A BN is a directed acyclic graph (DAG) encoding the Markov assumption that each variable is independent of its non-descendants, given just its parents. In the context of BNs, a gene is regarded as a random variable and shown as a node in the graph, and a relationship between the gene and its parents is represented by the conditional probability. Thus, the joint probability of all genes can be decomposed as the product of the conditional probabilities.

Suppose that we have $n$ sets of microarray data $\{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n\}$ of $p$ genes. A BN model is then written as

$$f(x_{i1}, \ldots, x_{ip} \mid \theta_G) = \prod_{j=1}^{p} f_j(x_{ij} \mid \boldsymbol{p}_{ij}, \theta_j),$$

where $\boldsymbol{p}_{ij}$ is the parent observation vector of the $j$th gene (gene$_j$) measured by the $i$th array. For example, if gene$_2$ and gene$_3$ are parents of gene$_1$, we set $\boldsymbol{p}_{i1} = (x_{i2}, x_{i3})^T$. If we ignore the information of p-p interactions, the relationship between $x_{ij}$ and $\boldsymbol{p}_{ij}$ can be modeled by using a nonparametric additive regression model [14,16]

$$x_{ij} = \sum_{k} m_{jk}\bigl(p^{(j)}_{ik}\bigr) + \varepsilon_{ij}, \qquad i = 1, \ldots, n; \; j = 1, \ldots, p, \tag{1}$$

where $p^{(j)}_{ik}$ is the $k$th element of $\boldsymbol{p}_{ij}$, $m_{jk}$ is a regression function and $\varepsilon_{ij}$ is a random variable with