© Spinnaker Labs, Inc.

Google Cluster Computing Faculty Training Workshop
Module I: Introduction to MapReduce

This presentation includes course content © University of Washington.
Redistributed under the Creative Commons Attribution 3.0 license.
All other contents: © Spinnaker Labs, Inc.

Workshop Syllabus
• Seven lecture modules
  – Information about teaching the course
  – Technical info about Google tools & Hadoop
  – Example course lectures
• Four lab exercises
  – Assigned to students in UW course

Overview
• University of Washington Curriculum
  – Teaching Methods
  – Reflections
  – Student Background
  – Course Staff Requirements
• Introductory Lecture Material

UW: Course Summary
• Course title: “Problem Solving on Large Scale Clusters”
• Primary purpose: developing large-scale problem solving skills
• Format: 6 weeks of lectures + labs, 4 week project

UW: Course Goals
• Think creatively about large-scale problems in a parallel fashion; design parallel solutions
• Manage large data sets under memory, bandwidth limitations
• Develop a foundation in parallel algorithms for large-scale data
• Identify and understand engineering trade-offs in real systems

Lectures
• 2 hours, once per week
• Half formal lecture, half discussion
• Mostly covered systems & background
• Included group activities for reinforcement

Classroom Activities
• Worksheets included pseudo-code programming, working through examples
  – Performed in groups of 2-3
• Small-group discussions about engineering and systems design
  – Groups of ~10
  – Course staff facilitated, but mostly open-ended

Readings
• No textbook
• One academic paper per week
  – E.g., “Simplified Data Processing on Large Clusters”
  – Short homework covered comprehension
• Formed basis for discussion

Lecture Schedule
• Introduction to Distributed Computing
• MapReduce: Theory and Implementation
• Networks and Distributed Reliability
• Real-World Distributed Systems
• Distributed File Systems
• Other Distributed Systems

Intro to Distributed Computing
• What is distributed computing?
• Flynn’s Taxonomy
• Brief history of distributed computing
• Some background on synchronization and memory sharing

MapReduce
• Brief refresher on functional programming
• MapReduce slides
  – More detailed version of module I
• Discussion on MapReduce

Networking and Reliability
• Crash course in networking
• Distributed systems reliability
  – What is reliability?
  – How do distributed systems fail?
  – ACID, other metrics
• Discussion: Does MapReduce provide reliability?

Real Systems
• Design and implementation of Nutch
• Tech talk from Googler on Google Maps

Distributed File Systems
• Introduced GFS
• Discussed implementation of NFS and Andrew FS (AFS) for comparison

Other Distributed Systems
• BOINC: Another platform
• Broader definition of distributed systems
  – DNS
  – One Laptop per Child project

Labs
• Also 2 hours, once per week
• Focused on applications of distributed systems
• Four lab projects over six weeks

Lab Schedule
• Introduction to Hadoop, Eclipse Setup, Word Count
• Inverted Index
• PageRank on Wikipedia
• Clustering on Netflix Prize Data

Design Projects
• Final four weeks of quarter
• Teams of 1-3 students
• Students proposed topic, gathered data, developed software, and presented solution

Example: Geozette
Image © Julia Schwartz

Example: Galaxy Simulation
Image © Slava Chernyak, Mike Hoak

Other Projects
• Bayesian Wikipedia spam filter
• Unsupervised synonym extraction
• Video collage rendering

Common Features
• Hadoop!
• Used publicly-available web APIs for data
• Many involved reading papers for algorithms and translating into MapReduce framework

Course Staff
• Instructor (me!)
• Two undergrad teaching assistants
  – Helped facilitate discussions, directed labs
• One student sysadmin
  – Worked only about three hours/week

Preparation
• Teaching assistants had taken previous iteration of course in winter
• Lectures retooled based on feedback from that quarter
  – Added reasonably large amount of background material
• Ran & solved all labs in advance

The Course: What Worked
• Discussions
  – Often covered broad range of subjects
• Hands-on lab projects
• “Active learning” in classroom
• Independent design projects

Things to Improve: Coverage
• Algorithms were not reinforced during lecture
  – Students requested much more time be spent on “how to parallelize an iterative algorithm”
• Background material was very fast-paced

Things to Improve: Projects
• Labs could have used a moderated/scripted discussion component
  – Just “jumping in” to the code proved difficult
  – No time was devoted to Hadoop itself in lecture
  – Clustering lab should be split in two
• Design projects could have used more time

Conclusions
• Solid basis for future coursework
  – Needs additional background (e.g., algorithms)
  – Full semester requires additional material (e.g., distributed systems, web systems course)
• Hadoop-based systems exciting to students & can teach important CS

Introductory Distributed Systems Material

Overview
• Introduction
• Models of computation
• A brief history lesson
• Connecting distributed modules
• Failure & reliability

Computer Speedup
Moore’s Law: “The density of transistors on a chip doubles every 18 months, for the same cost” (1965)
Image: Tom’s Hardwar