Weka Development [21]: IBk (KNN) Source Code Analysis

If you haven't read the previous post on IB1, please read it first, because I won't repeat the material already covered there. Let's go straight to buildClassifier; only the code that does not also appear in IB1 is listed here:

try {
  m_NumClasses = instances.numClasses();
  m_ClassType = instances.classAttribute().type();
} catch (Exception ex) {
  throw new Error("This should never be reached");
}

// Throw away initial instances until within the specified window size
if ((m_WindowSize > 0) && (instances.numInstances() > m_WindowSize)) {
  m_Train = new Instances(m_Train, m_Train.numInstances() - m_WindowSize,
      m_WindowSize);
}

// Compute the number of attributes that contribute
// to each prediction
m_NumAttributesUsed = 0.0;
for (int i = 0; i < m_Train.numAttributes(); i++) {
  if ((i != m_Train.classIndex())
      && (m_Train.attribute(i).isNominal()
          || m_Train.attribute(i).isNumeric())) {
    m_NumAttributesUsed += 1.0;
  }
}

// Invalidate any currently cross-validation selected k
m_kNNValid = false;

IB1 does not care about m_NumClasses because it looks for only a single neighbour, so there is only one value anyway. m_WindowSize limits how many samples are used for classification; the samples are not chosen at random: as the comment says, the initial instances are thrown away, so the last m_WindowSize instances are kept. The loop after that counts how many attributes take part in each prediction. KNN is also a classifier that supports incremental learning, so let's look at its updateClassifier code:

public void updateClassifier(Instance instance) throws Exception {
  if (m_Train.equalHeaders(instance.dataset()) == false) {
    throw new Exception("Incompatible instance types");
  }
  if (instance.classIsMissing()) {
    return;
  }
  if (!m_DontNormalize) {
    updateMinMax(instance);
  }
  m_Train.add(instance);
  m_kNNValid = false;
  if ((m_WindowSize > 0) && (m_Train.numInstances() > m_WindowSize)) {
    while (m_Train.numInstances() > m_WindowSize) {
      m_Train.delete(0);
    }
  }
}

Again very simple: update the min/max statistics, and if the window size is exceeded, keep deleting the first (oldest) instance until the training set fits into the window again. Note that IBk does not implement classifyInstance; it only implements distributionForInstance:

public double[] distributionForInstance(Instance instance) throws Exception {
  if (m_Train.numInstances() == 0) {
    throw new Exception("No training instances!");
  }
  if ((m_WindowSize > 0) && (m_Train.numInstances() > m_WindowSize)) {
    m_kNNValid = false;
    boolean deletedInstance = false;
    while (m_Train.numInstances() > m_WindowSize) {
      m_Train.delete(0);
      deletedInstance = true;
    }
    // rebuild datastructure KDTree currently can't delete
    if (deletedInstance == true)
      m_NNSearch.setInstances(m_Train);
  }

  // Select k by cross validation
  if (!m_kNNValid && (m_CrossValidate) && (m_kNNUpper >= 1)) {
    crossValidate();
  }

  m_NNSearch.addInstanceInfo(instance);

  Instances neighbours = m_NNSearch.kNearestNeighbours(instance, m_kNN);
  double[] distances = m_NNSearch.getDistances();
  double[] distribution = makeDistribution(neighbours, distances);

  return distribution;
}

I'll skip the first two checks; crossValidate() is discussed in a moment, and finding the k nearest neighbours was already covered in part [18] of this series. Now let's look at the makeDistribution function.

protected double[] makeDistribution(Instances neighbours, double[] distances)
    throws Exception {

  double total = 0, weight;
  double[] distribution = new double[m_NumClasses];

  // Set up a correction to the estimator
  if (m_ClassType == Attribute.NOMINAL) {
    for (int i = 0; i < m_NumClasses; i++) {
      distribution[i] = 1.0 / Math.max(1, m_Train.numInstances());
    }
    total = (double) m_NumClasses / Math.max(1, m_Train.numInstances());
  }

  for (int i = 0; i < neighbours.numInstances(); i++) {
    // Collect class counts
    Instance current = neighbours.instance(i);
    distances[i] = distances[i] * distances[i];
    distances[i] = Math.sqrt(distances[i] / m_NumAttributesUsed);
    switch (m_DistanceWeighting) {
      case WEIGHT_INVERSE:
        weight = 1.0 / (distances[i] + 0.001); // to avoid div by zero
        break;
      case WEIGHT_SIMILARITY:
        weight = 1.0 - distances[i];
        break;
      default: // WEIGHT_NONE:
        weight = 1.0;
        break;
    }
    weight *= current.weight();
    try {
      switch (m_ClassType) {
        case Attribute.NOMINAL:
          distribution[(int) current.classValue()] += weight;
          break;
        case Attribute.NUMERIC:
          distribution[0] += current.classValue() * weight;
          break;
      }
    } catch (Exception ex) {
      throw new Error("Data has no class attribute!");
    }
    total += weight;
  }

  // Normalise distribution
  if (total > 0) {
    Utils.normalize(distribution, total);
  }
  return distribution;
}

About the comment "Set up a correction to the estimator": I don't think it is really necessary; this isn't Naive Bayes with its divide-by-zero problem, so there is nothing that strictly needs correcting. You can see that three distance-weighting schemes are implemented here: the inverse of the distance, one minus the distance, and a fixed weight of 1. Then, if the class is nominal, the weight is added to the entry for the corresponding class value; if the class is numeric, the current class value multiplied by the weight is added to distribution[0].
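To see how these pieces fit together from the caller's side, here is a minimal usage sketch. It is my own example, not part of the Weka source: the file name iris.arff and the class name IBkDemo are placeholders, and it assumes a dataset whose last attribute is the (nominal) class.

import java.io.BufferedReader;
import java.io.FileReader;

import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.SelectedTag;

public class IBkDemo {
  public static void main(String[] args) throws Exception {
    // Load a dataset (placeholder path) and mark the last attribute as the class
    Instances data = new Instances(
        new BufferedReader(new FileReader("iris.arff")));
    data.setClassIndex(data.numAttributes() - 1);

    IBk knn = new IBk(3);       // k = 3, also the upper bound for cross-validation
    knn.setCrossValidate(true); // let crossValidate() search k = 1..3
    knn.setWindowSize(0);       // 0 means keep all training instances
    knn.setDistanceWeighting(
        new SelectedTag(IBk.WEIGHT_INVERSE, IBk.TAGS_WEIGHTING));
    knn.buildClassifier(data);

    // IBk is updateable, so instances can also be added incrementally
    knn.updateClassifier(data.instance(0));

    // distributionForInstance is what IBk really implements;
    // classifyInstance just picks the index of the largest entry
    double[] dist = knn.distributionForInstance(data.instance(0));
    for (int i = 0; i < dist.length; i++) {
      System.out.println(data.classAttribute().value(i) + ": " + dist[i]);
    }
  }
}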
Simply put, crossValidate() finds the best number of neighbours by brute force: it loops over the samples in m_Train, finds each sample's neighbours, and then tallies which number of neighbours performs best. Since each training instance is predicted from the remaining ones, this is essentially hold-one-out cross-validation. Note the trick in the inner loop: the neighbours are fetched only once with k = m_kNNUpper, and pruneToK then shrinks the list as j counts down, so every k from m_kNNUpper down to 1 is evaluated without re-running the neighbour search.

protected void crossValidate() {
  try {
    double[] performanceStats = new double[m_kNNUpper];
    double[] performanceStatsSq = new double[m_kNNUpper];

    for (int i = 0; i < m_kNNUpper; i++) {
      performanceStats[i] = 0;
      performanceStatsSq[i] = 0;
    }

    m_kNN = m_kNNUpper;
    Instance instance;
    Instances neighbours;
    double[] origDistances, convertedDistances;
    for (int i = 0; i < m_Train.numInstances(); i++) {
      instance = m_Train.instance(i);
      neighbours = m_NNSearch.kNearestNeighbours(instance, m_kNN);
      origDistances = m_NNSearch.getDistances();

      for (int j = m_kNNUpper - 1; j >= 0; j--) {
        // Update the performance stats
        convertedDistances = new double[origDistances.length];
        System.arraycopy(origDistances, 0, convertedDistances, 0,
            origDistances.length);
        double[] distribution = makeDistribution(neighbours,
            convertedDistances);
        double thisPrediction = Utils.maxIndex(distribution);
        if (m_Train.classAttribute().isNumeric()) {
          thisPrediction = distribution[0];
          double err = thisPrediction - instance.classValue();
          performanceStatsSq[j] += err * err; // Squared error
          performanceStats[j] += Math.abs(err); // Absolute error
        } else {
          if (thisPrediction != instance.classValue()) {
            ++performanceStats[j]; // Classification error
          }
        }
        if (j >= 1) {
          neighbours = pruneToK(neighbours, convertedDistances, j);
        }
      }
    }

    // Check through the performance stats and select the best
    // k value (or the lowest k if more than one best)
    double[] searchStats = performanceStats;
    if (m_Train.classAttribute().isNumeric() && m_MeanSquared) {
      searchStats = performanceStatsSq;
    }
    double bestPerformance = Double.NaN;
    int bestK = 1;
    for (int i = 0; i < m_kNNUpper; i++) {
      if (Double.isNaN(bestPerformance)
          || (bestPerformance > searchStats[i])) {
        bestPerformance = searchStats[i];
        bestK = i + 1;
      }
    }
    m_kNN = bestK;
    m_kNNValid = true;
  } catch (Exception ex) {
    throw new Error("Couldn't optimize by cross-validation: "
        + ex.getMessage());
  }
}

For a nominal class, searchStats holds the error counts; for a numeric class it holds the absolute errors, or the squared errors when m_MeanSquared is set. The final loop simply picks the k with the smallest accumulated error, preferring the smaller k on ties, stores it in m_kNN, and marks the selection valid with m_kNNValid = true.
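Because the prune-as-you-go evaluation is easy to lose among Weka's field names, here is a tiny self-contained sketch of the same idea on toy 1-D data. This is my own illustration, not Weka code; the class name LooKnnSketch and all variables are made up. Each point is held out, its neighbours are sorted by distance once, and every k from kMax down to 1 is scored by just shrinking the prefix of that sorted list, with ties broken toward the smaller k as in IBk.

import java.util.Arrays;

public class LooKnnSketch {
  public static void main(String[] args) {
    // Toy 1-D data: feature x[i] with binary nominal class y[i]
    final double[] x = {1.0, 1.2, 1.4, 3.0, 3.2, 3.4, 3.6};
    int[] y = {0, 0, 0, 1, 1, 1, 1};
    int kMax = 4;
    int[] errors = new int[kMax + 1]; // errors[k] = hold-one-out errors for k

    for (int i = 0; i < x.length; i++) {
      // Indices of all other points, sorted by distance to the held-out x[i]
      Integer[] idx = new Integer[x.length - 1];
      for (int j = 0, p = 0; j < x.length; j++) {
        if (j != i) {
          idx[p++] = j;
        }
      }
      final double xi = x[i];
      Arrays.sort(idx, (a, b) ->
          Double.compare(Math.abs(x[a] - xi), Math.abs(x[b] - xi)));

      // Score k = kMax .. 1 by shrinking the neighbour prefix,
      // the same idea as the pruneToK countdown in IBk
      for (int k = kMax; k >= 1; k--) {
        int votesForOne = 0;
        for (int n = 0; n < k; n++) {
          votesForOne += y[idx[n]];
        }
        int prediction = (2 * votesForOne > k) ? 1 : 0; // majority vote, ties -> 0
        if (prediction != y[i]) {
          errors[k]++;
        }
      }
    }

    // Select the k with the fewest errors, smallest k on ties
    int bestK = 1;
    for (int k = 1; k <= kMax; k++) {
      if (errors[k] < errors[bestK]) {
        bestK = k;
      }
    }
    System.out.println("errors per k: " + Arrays.toString(errors)
        + ", best k = " + bestK);
  }
}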