Bayesian Regression Analysis in the “Large p, Small n” Paradigm with Application in DNA Microarray Studies

Mike West†, Joseph R Nevins, Jeffrey R Marks, Rainer Spang & Harry Zuzan
Duke University
Current draft: July 31st 2000 (May 2000 original)

Summary. Statistical modelling and inference problems in which sample sizes are substantially smaller than the number of available and potentially interesting predictors (explanatory variables) abound in applied science and medicine. These “Large p, Small n” problems pose challenges to standard statistical methods and demand new concepts and models for regression and classification. Our motivating applied context is in functional genomics; more specifically, in studies of phenotyping clinical or physiological outcomes in which the predictors are measured expression levels of large numbers of genes based on high-density DNA microarrays. In a canonical framework of binary regression, we discuss (a) issues of regression modelling utilising singular-value decompositions of design matrices that are massively rank deficient, (b) the imperatives for careful, informative prior specifications on high-dimension regression parameters, (c) the development of new classes of structured prior distributions for this problem, and (d) the development of appropriate computational methods and modes of posterior inference for regression estimation and predictive inference for out-of-sample classification. The latter enterprise is fundamental to genomic phenotyping applications. We study and exemplify the new statistical methodology in a problem of breast cancer phenotyping using DNA microarray expression profiles as predictors, and in discrimination of leukemia types.

Keywords: Bayesian regression analysis, binary regression, dimension reduction, gene expression profiles, DNA microarrays, high-dimensional covariates, regression prediction, singular value decompositions

†Institute of Statistics and Decision Sciences, Duke University, Durham NC 27708-0251, USA.

Regression models with large sets of higher-order interactions between predictor variables are an obvious context, though here we focus on the simpler paradigm in which $n$ is really very small compared to $p$, so that the opportunities for identifying interactions are limited. Functional genomics provides a
motivating application of simply critical importance: large-scale gene expression profiling using DNA microarray data (Golub et al, 1999). The problem is exemplified and highlighted in phenotyping studies, where the central focus is on relating measured gene expression profiles to clinical and physiological outcomes. Challenging questions of modelling and analysis arise due to the high-dimensionality of the gene expression profile. Our main example here comes from a current Duke project in breast cancer phenotyping: linking the measured expression of large numbers of genes to clinical outcomes in breast cancer. First examples involve only two defined possible outcomes so leading to a binary regression format. Typically, we will have available rather small numbers of individual tumour tissue samples from which to produce the RNA required to hybridise to the DNA microarrays that deliver the genetic expression measures; hence the “small n.” Coupled with this, the number of genes so fingerprinted is, with current array technologies, in the several or tens of thousands, hence the “large p.” In the numerical example here, $n = 27$ and $p = 7129$. Further details and examples will be reported in West et al (2000). We further explore and illustrate our approach in analyses of leukemia data from a recent study of Golub et al (1999), where the model-based approach is extremely effective in out-of-sample predictive discrimination.

To address the modelling and analysis challenges, we develop a novel approach to Bayesian regression analysis, focusing on the binary regression context. In this framework, we
• utilise singular-value decompositions of matrices of measured values of large numbers of predictors across samples, generating factor representations and possibly massive dimension reduction to summary “super-predictors” of use in exploratory analyses;
• introduce classes of novel prior distributions for large regression parameters to reflect the dependence and singularity structure evident in likelihood functions based on large numbers of predictors, and that utilise the singular-value structure of the design matrices to induce a potentially massive reduction in the parameter space relevant to posterior computation; and, hence,
• develop easily implemented and standard MCMC methods for binary regression models to produce posterior inferences on the high-dimensional regression parameter, and consequent evaluation of out-of-sample predictive utility in probabilistic classification of new cases.

The resulting analysis and methodology are illustrated via some analysis summaries in the breast cancer phenotyping context, and in the leukemia discrimination problem.

2. Binary regression and latent normal linear models

Consider the standard binary regression context in which binary responses $z_1, \ldots, z_n$ are assumed to be described by a probit regression on a set of $p$ predictors. That is, independently across cases $i = 1, \ldots, n$, each $z_i$ is a binary outcome with

$$\Pr(z_i = 1 \mid \beta) = \Phi(x_i' \beta) \tag{1}$$

where $x_i$ is the vector of $p$ predictor values for case $i$; $\beta$ is the $p$-vector regression parameter to be inferred, and $\Phi(\cdot)$ is the standard normal cum