Machine-Learning-Based Prediction and Recommendation for Bank Card Spending Data


Machine learning in finance using Spark ML pipeline

Who am I?

Outline
- Spark and ML/MLlib background
- Spark ML pipeline
- Hyperparameter tuning
- Spark ML/MLlib feature transformers & algorithms
- Financial use cases
- Credit scoring case

Spark background
- Distributed computing engine
- Apache open source
- Built for speed, ease of use, and sophisticated analytics
- Resilient Distributed Dataset (RDD)
- Expressive APIs in Python, Java, Scala and R

Machine learning in Spark
- Spark is the first general-purpose big data processing engine built for ML from day one
- The initial design of Spark was driven by ML optimization
  - Caching: for running on the same data multiple times
  - Accumulator: to keep state across multiple iterations in memory
  - Good support for CPU-intensive tasks with laziness
  - Aggregate & TreeAggregate
- One of the examples in the first version of Spark was an ML example

[Figure: Input -> iteration 1 -> iteration 2 -> iteration 3 ..., compared with Input -> one-time processing -> distributed memory -> iteration 1 -> iteration 2 ... Key: keep the working set in RAM.]

Spark for Data Science
- DataFrames
  - Intuitive manipulation of distributed structured data
  - Familiar API based on R & Python Pandas
  - Distributed, optimized implementation
- Machine Learning Pipelines
  - Integration with DataFrames
  - Familiar API based on scikit-learn
  - Simple parameter tuning

ML workflows are complex
- Example: image classification pipeline
- Specify pipeline
- Inspect & debug
- Re-run on new data
- Tune parameters

[Figure: Data Source 1, Data Source 2, Data Source 3 -> Extract Features -> Feature Transform 1, 2, 3 -> Model Trainer 1, 2, 3 -> Evaluate / Ensemble -> Best Model]

Key abstractions of Spark ML pipeline
- Transformer: feature transformers (e.g., OneHotEncoder) and trained ML models (e.g., LogisticRegressionModel).
- Estimator: ML algorithms for training models (e.g., LogisticRegression).
- Evaluator: evaluates predictions and computes metrics, useful for tuning algorithm parameters (e.g., BinaryClassificationEvaluator).

Data sources for DataFrames
Example: LibSVM relation

    val df = sqlContext.read.format("libsvm").load(path)

Example pipeline
[Figure, repeated across several slides: Load data -> Tokenizer -> hashingTF -> LogisticRegression -> evaluate / predict]

Columns shown at each step:

    Step                       Columns
    Load data                  label: Int, text: String
    Feature transform          label: Int, words: Seq[String]
    Feature transform          label: Int, words: Vector
    Train and evaluate model   label: Int, features: Vector, prediction: Int

The same stages are applied to the train data and to the test data: re-run exactly the same way.

Concise code

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Split the "text" column into a "words" column.
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    // Hash the words into a fixed-length term-frequency vector.
    val hashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.01)
    // Chain the three stages and fit them as one estimator.
    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))
    val model = pipeline.fit(trainingDataset)
    model.transform(testDataset)

Hyperparameter tuning
[Figure: Dataset -> Extract/Transform (#features = 100, 200, 400) -> Training (regParam = 0.01, 0.1, 1.0) -> Evaluation]

Cross validation
- Given: Estimator, parameter grid, Evaluator
- Find: best parameters or models

    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    // Build a parameter grid.
    val paramGrid = new ParamGridBuilder()
      .addGrid(hashingTF.numFeatures, Array(10, 20, 40))
      .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
      .build()
    // Set up cross-validation.
    val cv = new CrossValidator()
      .setNumFolds(3)
      .setEstimator(pipeline)
      .setEstimatorParamMaps(paramGrid)
      .setEvaluator(new BinaryClassificationEvaluator)
    // Fit a model with cross-validation.
    val cvModel = cv.fit(trainingDataset)

Feature Transformers

    Transformer           Description                                      scikit-learn
    Binarizer             Threshold numerical feature to binary            Binarizer
    Bucketizer            Bucket numerical features into ranges
    ElementwiseProduct    Scale each feature/column separately
    HashingTF             Hash text/data to vector; scale by term frequency  FeatureHasher
    IDF                   Scale features by inverse document frequency     TfidfTransformer
    Normalizer            Scale each row to unit norm                      Normalizer
    OneHotEncoder         Encode k-category feature as binary features     OneHotEncoder
    PolynomialExpansion   Create higher-order features                     PolynomialFeatures
    RegexTokenizer        Tokenize text using regular expressions          (part of text methods)
    StandardScaler        Scale features to 0 mean and/or unit variance    StandardScaler
    StringIndexer         Convert String feature to 0-based indices        LabelEncoder
    Tokenizer             Tokenize text on whitespace                      (part of text methods)
    VectorAssembler       Concatenate feature vectors                      FeatureUnion
    VectorIndexer         Identify categorical features, and index
    Word2Vec              Learn vector representation of words

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import (HashingTF, OneHotEncoder, Tokenizer,
                                    VectorAssembler, Word2Vec)

    tok = Tokenizer(inputCol="text", outputCol="words")
    htf = HashingTF(inputCol="words", outputCol="tf", numFeatures=200)
    # Word2Vec expects tokenized input (an array of strings), so it reads
    # the "words" column rather than the raw "text" column.
    w2v = Word2Vec(inputCol="words", outputCol="w2v")
    ohe = OneHotEncoder(inputCol="userGroup", outputCol="ug")
    va = VectorAssembler(inputCols=["tf", "w2v", "ug"], outputCol="features")
    pipeline = Pipeline(stages=[tok, htf, w2v, ohe, va])

Supervised
- Discrete: Classification
  - LogisticRegression (with Elastic-Net)
  - SVM
  - DecisionTree
  - RandomForest
  - GBT
  - NaiveBayes
  - MultilayerPerceptron
  - OneVsRest
- Continuous: Regression
  - LinearRegression (with Elastic-Net)
  - DecisionTree
  - RandomForest
  - GBT
  - AFTSurvivalRegression
  - IsotonicRegression

Unsupervised
- Discrete: Clustering
  - KMeans
  - GaussianMixture
  - LDA
  - PowerIterationClustering
- Continuous: Dimen
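
To make the cost of the cross-validation setup in the slides concrete, the sketch below enumerates the same parameter grid (numFeatures in 10, 20, 40 and regParam in 0.01, 0.1, 1.0) with 3 folds and counts how many pipeline fits the search implies. It is plain Python rather than Spark code: `build_param_grid` and the variable names are hypothetical helpers introduced only for illustration, and the dictionaries merely stand in for the parameter maps that `ParamGridBuilder` produces.

```python
from itertools import product

# Values copied from the parameter grid in the slides.
num_features_values = [10, 20, 40]
reg_param_values = [0.01, 0.1, 1.0]
num_folds = 3


def build_param_grid(num_features_values, reg_param_values):
    """Hypothetical helper: every (numFeatures, regParam) combination as a dict."""
    return [
        {"numFeatures": n, "regParam": r}
        for n, r in product(num_features_values, reg_param_values)
    ]


param_grid = build_param_grid(num_features_values, reg_param_values)

# Each parameter combination is fitted once per fold during the search.
fits_during_search = len(param_grid) * num_folds

print(len(param_grid))
print(fits_during_search)
```

The grid has 3 x 3 = 9 combinations, so a 3-fold search fits the pipeline 27 times before the best setting is refit on the full training data. The count grows multiplicatively with every added grid dimension, which is worth keeping in mind when extending the grid.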
