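Mathematical Statistics and Data Analysis
John A. Rice
University of California, Berkeley

Arrangement of the Course
Chapter 1 Summarizing Data
Chapter 2 Comparing Two Samples
Chapter 3 The Analysis of Variance
Chapter 4 Linear Least Squares

What should we learn?
Mathematics (Statistics)
English
Computer (SAS) (Statistical Analysis System)

Chapter 1 Summarizing Data
Methods Based on the Cumulative Distribution Function
Histograms, Density Curves and Stem-and-Leaf Plots
Measures of Location
Measures of Dispersion

The Empirical Cumulative Distribution Function (ecdf)
Suppose that $x_1, x_2, \dots, x_n$ is a batch of numbers. The ecdf is defined as
\[ F_n(x) = \frac{1}{n}\,\#(x_i \le x). \]
Denote the ordered batch of numbers by $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$; then the ecdf can be expressed as
\[ F_n(x) = \begin{cases} 0, & x < x_{(1)} \\ \dfrac{k}{n}, & x_{(k)} \le x < x_{(k+1)} \\ 1, & x \ge x_{(n)} \end{cases} \]
A variant with divisor $n+1$ is
\[ F_n(x) = \begin{cases} 0, & x < x_{(1)} \\ \dfrac{k}{n+1}, & x_{(k)} \le x < x_{(k+1)} \\ \dfrac{n}{n+1}, & x \ge x_{(n)} \end{cases} \]

Properties of the Empirical Cumulative Distribution Function
Let $x_1, x_2, \dots, x_n$ be a random sample from a distribution with cdf $F$.
Theorem 1
\[ E(F_n(x)) = F(x), \qquad \mathrm{Var}(F_n(x)) = \frac{1}{n}\,F(x)\,(1 - F(x)). \]
Theorem 2
\[ P\left( \lim_{n \to \infty} \max_x |F_n(x) - F(x)| = 0 \right) = 1. \]
That is, $F_n(x)$ tends to $F(x)$ simultaneously for all $x$ with probability one.

Examples
Plot the ecdf of this batch of numbers: 1, 14, 10, 9, 11, 9
SAS dataset: beeswax.sas
Solutions > Analysis > Interactive Data Analysis (Find beeswax.sas from Work) > Analyze > Distribution > Output > Cumulative Distribution function > Empirical

The Survival Function
If $T$ denotes time until failure or death with cdf $F$, the survival function is defined as
\[ S(t) = P(T > t) = 1 - F(t), \]
which is simply the probability that the lifetime will be longer than $t$. The empirical survival function is given by
\[ S_n(t) = 1 - F_n(t), \]
where $F_n(t)$ is the ecdf of a sample of observations of the random variable $T$.

The Hazard Function
The hazard function is defined as
\[ h(t) = \frac{f(t)}{1 - F(t)} = \frac{F'(t)}{1 - F(t)} = -\frac{d}{dt} \log S(t), \]
which is the instantaneous rate of mortality of an individual alive at time $t$. The log of the empirical survival function is defined as
\[ \log S_n(t) = \begin{cases} 0, & t < t_{(1)} \\ \log\left(1 - \dfrac{k}{n+1}\right), & t_{(k)} \le t < t_{(k+1)} \\ \log\left(1 - \dfrac{n}{n+1}\right), & t \ge t_{(n)} \end{cases} \]

Example
Calculate the hazard function for the exponential distribution:
\[ F(t) = \begin{cases} 1 - e^{-\lambda t}, & t \ge 0 \\ 0, & t < 0 \end{cases} \]
Let $f$ denote the density function and $h$ the hazard function of a nonnegative random variable. Show that
\[ f(t) = h(t)\, e^{-\int_0^t h(s)\, ds}. \]

Quantile-Quantile Plots
The $p$th quantile of the distribution $F$ is the value $x_p$ such that
\[ F(x_p) = p \quad \text{or} \quad x_p = F^{-1}(p). \]
[Figure: graph of $F(x)$ against $x$, marking $p$ and the $p$th quantile $x_p$.]

The empirical quantile of data
For the given sample $x_1, x_2, \dots, x_n$, the ecdf (with divisor $n+1$) satisfies $F_n(x) = \frac{k}{n+1}$ for $x_{(k)} \le x < x_{(k+1)}$; in particular $F_n(x_{(k)}) = \frac{k}{n+1}$.
Thus the $\frac{k}{n+1}$th quantile of the data is assigned to $x_{(k)}$.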
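The ecdf and the $k/(n+1)$ empirical quantiles can also be checked outside SAS. The following is a minimal Python sketch, not part of the course's SAS workflow, applied to the example batch 1, 14, 10, 9, 11, 9. The helper names `ecdf` and `empirical_quantiles` are hypothetical and introduced only for this illustration.

```python
from bisect import bisect_right


def ecdf(batch):
    """Return the function F_n, where F_n(x) = #(x_i <= x) / n."""
    ordered = sorted(batch)
    n = len(ordered)

    def F_n(x):
        # bisect_right gives the number of ordered values that are <= x
        return bisect_right(ordered, x) / n

    return F_n


def empirical_quantiles(batch):
    """Pair each order statistic x_(k) with its quantile level k / (n + 1)."""
    ordered = sorted(batch)
    n = len(ordered)
    return [(k / (n + 1), x) for k, x in enumerate(ordered, start=1)]


batch = [1, 14, 10, 9, 11, 9]
F_n = ecdf(batch)

# The ecdf is a step function that jumps at each observed value
for x in (0, 1, 9, 10.5, 14):
    print(x, F_n(x))

# Quantile levels k/(n+1) and the order statistics assigned to them
for level, x in empirical_quantiles(batch):
    print(round(level, 3), x)
```

Plotting the pairs returned by `empirical_quantiles` against the matching theoretical quantiles of a candidate distribution gives the kind of Q-Q plot that the SAS menu produces.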
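Assessing Goodness of Fit by Using Q-Q Plot
Is the sample $x_1, x_2, \dots, x_n$ from the distribution $F$?
The empirical $\frac{k}{n+1}$th quantile is $x_{(k)}$;
The theoretical $\frac{k}{n+1}$th quantile of $F$ is $x_{k/(n+1)}$, which satisfies $F(x_{k/(n+1)}) = \frac{k}{n+1}$;
The dots $(x_{k/(n+1)},\, x_{(k)})$ on the plane would lie approximately on a straight line if the sample comes from $F$.

Comparing Two Samples by Using Q-Q Plot
Are the samples $x_1, x_2, \dots, x_n$ and $y_1, y_2, \dots, y_n$ from the same distribution?
The empirical $\frac{k}{n+1}$th quantile of the $x$'s is $x_{(k)}$;
The empirical $\frac{k}{n+1}$th quantile of the $y$'s is $y_{(k)}$;
The dots $(x_{(k)},\, y_{(k)})$ on the plane would lie approximately on a straight line if the samples come from the same distribution.

Example
SAS dataset: beeswax.sas
Solutions > Analysis > Interactive Data Analysis (Find beeswax.sas from Work) > Analyze > Distribution > Output > Normal Q-Q plot

Histograms
Example: beeswax.sas

    Interval                 Frequency    Density
    62.7 <= x < 63.0              1       (1/59)/0.3  = 0.0565
    63.0 <= x < 63.3              8       (8/59)/0.3  = 0.4520
    63.3 <= x < 63.6             24       (24/59)/0.3 = 1.3559
    63.6 <= x < 63.9             15       (15/59)/0.3 = 0.8475
    63.9 <= x < 64.2              6       (6/59)/0.3  = 0.3390
    64.2 <= x < 64.5              5       (5/59)/0.3  = 0.2825

Density Curves: Kernel Probability Density Estimation
Let $w(x)$ be the standard normal density; then the rescaled version of $w(x)$, $w_h(x)$, is defined as
\[ w_h(x) = \frac{1}{h}\, w\!\left(\frac{x}{h}\right) = \frac{1}{h} \cdot \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2h^2}} = \frac{1}{\sqrt{2\pi}\, h}\, e^{-\frac{x^2}{2h^2}}, \]
which is the normal density with standard deviation $h$;
Let $x_1, x_2, \dots, x_n$ be a sample from a probability density $f$; then $w_h(x - x_i)$ is the normal density with mean $x_i$ and standard deviation $h$;
The kernel probability density estimate of $f$ is then given by
\[ f_h(x) = \frac{1}{n} \sum_{i=1}^{n} w_h(x - x_i), \]
where $h$ is a chosen bandwidth.

Example
Beeswax
Solutions > Analysis > Interactive Data Analysis (Find beeswax.sas from Work) > Analyze > Distribution > Output > Density Estimate > Normal (kernel density)

Stem-and-Leaf Plots
Example: beeswax.sas

    depth  count  stem : leaf
       1      1   628 : 5
       1      0   629 :
       4      3   630 : 358
       7      3   631 : 033
       9      2   632 : 77
      18      9   633 : 001446669
      23      5   634 : 01335
             10   635 : 0000113668
      26      7   636 : 0013689
      19      2   637 : 88
      17      6   638 : 334668
      11      5   639 : 22223
       6      0   640 :
       6      1   641 : 2
       5      3   642 : 147
       2      0   643 :
       2      2   644 : 02

Measures of Location

The Arithmetic Mean
For a batch of numbers $x_1, x_2, \dots, x_n$, the most commonly used measure of location is
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i. \]
Disadvantage: The measure is sensitive to outliers in the data set.

The Median
For the ordered observations $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$ the median is defined as
\[ \text{median} = \begin{cases} x_{\left(\frac{n+1}{2}\right)}, & \text{if } n \text{ is odd} \\ \dfrac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2}, & \text{if } n \text{ is even} \end{cases} \]
Advantage: The median is robust and insensitive to outliers.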
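To make the contrast between the two location measures concrete, here is a minimal Python sketch of the mean and median formulas, applied to the batch 1, 14, 10, 9, 11, 9 used earlier in the slides. The function names and the contaminated batch (one observation replaced by the made-up value 900) are assumptions introduced purely for illustration.

```python
def mean(batch):
    """Arithmetic mean: (1/n) * sum of x_i."""
    return sum(batch) / len(batch)


def median(batch):
    """Median of the ordered observations, per the odd/even definition."""
    ordered = sorted(batch)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: x_((n+1)/2), which is 0-based index (n-1)/2 == n // 2
        return ordered[mid]
    # n even: average of x_(n/2) and x_(n/2 + 1)
    return (ordered[mid - 1] + ordered[mid]) / 2


batch = [1, 14, 10, 9, 11, 9]
# Hypothetical gross error: the last observation is recorded as 900
contaminated = batch[:-1] + [900]

print(mean(batch), median(batch))
print(mean(contaminated), median(contaminated))
```

In this sketch a single wild value moves the mean by a large amount while the median shifts by only one unit, which is the sensitivity and robustness described above.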
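The Trimmed Mean
The $100\alpha\%$ trimmed mean is the average of the data remaining after discarding the lowest $100\alpha\%$ and the highest $100\alpha\%$ of the ordered data. It can be expressed as
\[ \bar{x}_\alpha = \frac{x_{([n\alpha]+1)} + x_{([n\alpha]+2)} + \dots + x_{(n-[n\alpha])}}{n - 2[n\alpha]}, \]
where $[n\alpha]$ denotes the integer part of $n\alpha$. In general, $\alpha$ will be chosen from 0.1 to 0.2.

M-estimates
The sample mean minimizes
\[ \sum_{i=1}^{n} (x_i - \mu)^2. \]
The median is the minimizer of
\[ \sum_{i=1}^{n} |x_i - \mu|. \]
An M-estimate is defined to minimize
\[ \sum_{i=1}^{n} \Psi(x_i - \mu), \]
where
\[ \Psi(x) = \begin{cases} x^2, & \text{if } |x| \le 1.5 \\ |x|, & \text{if } |x| > 1.5 \end{cases} \]

Comparison of Location Estimation
There is no single estimate that is best for all symmetric distributions, although they all estimate the center of symmetry;
In general, the 10% trimmed mean or the 20% trimmed mean is overall quite an effective estimate, since its variance is quite small.

Measures of Dispersion
The sample standard deviation, $s$, which is the square root of the sample variance,
Disadvantage: The sample standard deviation is sensitive to outlyin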