Big Data: Using ArcGIS with Apache Hadoop
Erik Hoel and Mike Park

Outline
• Overview of Hadoop
• Adding GIS capabilities to Hadoop
• Integrating Hadoop with ArcGIS

Apache Hadoop

What is Hadoop?
• Hadoop is a scalable open source framework for the distributed processing of extremely large data sets on clusters of commodity hardware
- Maintained by the Apache Software Foundation
- Assumes that hardware failures are common
• Hadoop is primarily used for:
- Distributed storage
- Distributed computation
• Historically, development of Hadoop began in 2005 as an open source implementation of a MapReduce framework
- Inspired by Google’s MapReduce framework, as published in a 2004 paper by Jeffrey Dean and Sanjay Ghemawat (Google Lab)
- Doug Cutting (Yahoo!) did the initial implementation
• Hadoop consists of a distributed file system (HDFS), a scheduler and resource manager, and a MapReduce engine
- MapReduce is a programming model for processing large data sets in parallel on a distributed cluster
- Map(): a procedure that performs filtering and sorting
- Reduce(): a procedure that performs a summary operation
• A number of frameworks have been built extending Hadoop which are also part of Apache
- Cassandra: a scalable multi-master database with no single points of failure
- HBase: a scalable, distributed database that supports structured data storage for large tables
- Hive: a data warehouse infrastructure that provides data summarization and ad hoc querying
- Pig: a high-level data-flow language and execution framework for parallel computation
- ZooKeeper: a high-performance coordination service for distributed applications

MapReduce
High level overview
[Figure: input data at hdfs://path/input is split across several map() tasks; their results are passed to reduce() tasks, which write part1 and part2 to hdfs://path/output.]

MapReduce: The Word Count Example
[Figure: three input splits ("red red blue red green", "green blue red green", "green blue blue blue") are each mapped to (word, 1) pairs. The pairs are partitioned, shuffled, and sorted by key, then reduced to the totals blue 5, green 4, red 4.]

Map
1. Each line is split into words
2. Each word is written to the map output with the word as the key and a value of 1

Partition / Sort / Shuffle
1. The output of the mapper is sorted and grouped based on the key
2. Each key and its associated values are given to a reducer

Reduce
1. For each key (word) given, sum up the values (counts)
2. Emit the word and its count

Hadoop Clusters
[Figures: Traditional Hadoop Clusters; The Dredd Cluster]

Adding GIS capabilities to Hadoop

General approach
[Figure labels: Hadoop Cluster, .jar]
• Need to reduce large volumes of data into manageable datasets that can be processed in the ArcGIS Platform
- Clipping
- Filtering
- Grouping

Spatial data in Hadoop
• Spatial data in Hadoop can show up in a number of different formats

Comma delimited, with the location defined in multiple fields:

    ONTARIO,34.0544,-117.6058
    RANCHO CUCAMONGA,34.1238,-117.5702
    REDLANDS,34.0579,-117.1709
    RIALTO,34.1136,-117.387
    RUNNING SPRINGS,34.2097,-117.1135

Tab delimited, with the location defined in well-known text (WKT):

    ONTARIO	POINT (34.0544 -117.6058)
    RANCHO CUCAMONGA	POINT (34.1238 -117.5702)
    REDLANDS	POINT (34.0579 -117.1709)
    RIALTO	POINT (34.1136 -117.387)
    RUNNING SPRINGS	POINT (34.2097 -117.1135)

JSON, with Esri’s JSON defining the location:

    {'attr': {'name': 'ONTARIO'}, 'geometry': {'x': 34.05, 'y': -117.60}}
    {'attr': {'name': 'RANCHO…'}, 'geometry': {'x': 34.12, 'y': -117.57}}
    {'attr': {'name': 'REDLANDS'}, 'geometry': {'x': 34.05, 'y': -117.17}}
    {'attr': {'name': 'RIALTO'}, 'geometry': {'x': 34.11, 'y': -117.38}}
    {'attr': {'name': 'RUNNING…'}, 'geometry': {'x': 34.20, 'y': -117.11}}

GIS Tools for Hadoop
Esri on GitHub
• GIS Tools for Hadoop (samples, tools)
- Tools and samples using the open source resources that solve specific problems
• Spatial Framework for Hadoop (hive: spatial-sdk-hive.jar; json: spatial-sdk-json.jar)
- Hive user-defined functions for spatial processing
- JSON helper utilities
• Geoprocessing Tools for Hadoop (HadoopTools.pyt)
- Geoprocessing tools that copy to/from Hadoop, convert to/from JSON, and invoke Hadoop jobs
• Geometry API Java (esri-geometry-api.jar)
- Java geometry library for spatial data processing

Java geometry API
• Topological operations
- Buffer
- Union
- Convex Hull
- Contains
- ...
• In-memory indexing
• Accelerated geometries for relationship tests
- Intersects, Contains, …
• Still being maintained on GitHub

    OperatorContains opContains = OperatorContains.local();
    for (Geometry geometry : someGeometryList) {
        // Build the acceleration structure once per geometry
        opContains.accelerateGeometry(geometry, sref, GeometryAccelerationDegree.enumMedium);
        for (Point point : somePointList) {
            boolean contains = opContains.execute(geometry, point, sref, null);
        }
        // Release the acceleration structure when the geometry is done
        OperatorContains.deaccelerateGeometry(geometry);
    }

Hive
• Apache Hive supports analysis of large datasets in HDFS using a SQL-like language (HiveQL) while also maintaining full support for MapReduce
- Maintains additional metadata for data stored in Hadoop
- Specifically, a schema definition that maps the original data to rows and columns
- Allows SQL-like interaction with data using the Hive Query Language (HQL)
- Sample of Hive table create statement for simple CSV?

Hive spatial functions
• Hive User-Defined Functions (UDF) that wrap geometry API operators
• Modeled on the ST_Geometry OGC compliant geometry type

• Defining a ta