The EM Algorithm

The EM (expectation-maximization) algorithm is an extension of the k-means clustering algorithm. Unlike k-means, which assigns each object to exactly one cluster, EM assigns each object to every cluster with a weight that represents the object's membership probability, and new means are computed from these weighted measurements. In EM there are no hard boundaries between clusters.

Membership probabilities and new means are computed as follows. The probability that object $o_i$ belongs to cluster $c_j$ is

$$P(o_i \in c_j) = p(c_j \mid o_i) = \frac{p(o_i \mid c_j)\, p(c_j)}{\sum_k p(o_i \mid c_k)\, p(c_k)}$$

where $p(o_i \mid c_j) = N(m_j, \sigma_j)(o_i)$ follows a normal distribution with mean $m_j$ and variance $\sigma_j$. The mean $m_j$ of cluster $c_j$ is

$$m_j = \frac{\sum_{i=1}^{n} o_i \, P(o_i \in c_j)}{\sum_{i=1}^{n} P(o_i \in c_j)}$$

where $n$ is the number of objects.

The core idea of the EM algorithm is this: first make an initial estimate of the mixture-model parameters, then repeatedly execute the E-step and the M-step until convergence. The E-step (expectation step) computes, given the current parameters, each object's membership probability with respect to each cluster; the M-step (maximization step) uses the probabilities computed in the E-step to update the parameter estimates.

Algorithm: EM(D, K)
Input: a set of data objects D and the number of clusters K
Output: the parameters of the K clusters
Steps:
1. Make an initial estimate of the mixture-model parameters: randomly select K distinct data objects from D as the centers m_1, m_2, ..., m_K of the K clusters c_1, c_2, ..., c_K, and estimate the variances sigma_1, sigma_2, ..., sigma_K of the K clusters.
2. repeat
   a) E-step: compute P(o_i in c_j) and assign object o_i to cluster c_j according to P(o_i in c_j).
   b) M-step: use the probabilities computed in the E-step to re-estimate m_j.
3. until the parameters no longer change.

A MATLAB implementation of the EM algorithm, saved as EM.m:

function [D, param_struct] = EM(train_features, train_targets, Ngaussians, region)

% Classify using the expectation-maximization algorithm
% Inputs:
%   features     - Train features
%   targets      - Train targets
%   Ngaussians   - Number of Gaussians for each class (vector)
%   region       - Decision region vector: [-x x -y y number_of_points]
%
% Outputs:
%   D            - Decision surface
%   param_struct - A parameter structure containing the parameters of the Gaussians found

[Nclasses, classes] = find_classes(train_targets);  % Number of classes in targets
Nalpha = Ngaussians;                                % Number of Gaussians in each class

% The initial guess is based on k-means preprocessing. If it does not converge after
% max_iter iterations, a random guess is used.
disp('Using k-means for initial guess')
max_iter = 100;
Pw    = zeros(2, max(Ngaussians));
sigma = zeros(2, max(Ngaussians), size(train_features,1), size(train_features,1));
mu    = zeros(2, max(Ngaussians), size(train_features,1));
for i = 1:Nclasses,
    in = find(train_targets == classes(i));
    [initial_mu, targets, labels] = k_means(train_features(:,in), train_targets(:,in), Ngaussians(i), region);
    for j = 1:Ngaussians(i),
        gauss_labels = find(labels == j);
        Pw(i,j) = length(gauss_labels) / length(labels);
        sigma(i,j,:,:) = diag(std(train_features(:,in(gauss_labels))'));
    end
    mu(i,1:Ngaussians(i),:) = initial_mu';
end

% Do the EM: Estimate mean and covariance for each class
for c = 1:Nclasses,
    train = find(train_targets == classes(c));

    if (Ngaussians(c) == 1),
        % If there is only one Gaussian, there is no need to do a whole EM procedure
        sigma(c,1,:,:) = sqrtm(cov(train_features(:,train)',1));
        mu(c,1,:) = mean(train_features(:,train)');
    else
        sigma_i   = squeeze(sigma(c,:,:,:));
        old_sigma = zeros(size(sigma_i));   % Used for the stopping criterion
        iter      = 0;                      % Iteration counter
        n         = length(train);          % Number of training points
        qi        = zeros(Nalpha(c), n);    % This will hold qi's
        P         = zeros(1, Nalpha(c));

        while sum(sum(sum(abs(sigma_i - old_sigma)))) > 1e-10,
            old_sigma = sigma_i;

            % E step: Compute Q(theta; theta_i)
            for t = 1:n,
                data = train_features(:, train(t));
                for k = 1:Nalpha(c),
                    P(k) = Pw(c,k) * p_single(data, squeeze(mu(c,k,:)), squeeze(sigma_i(k,:,:)));
                end
                for i = 1:Nalpha(c),
                    qi(i,t) = P(i) / sum(P);
                end
            end

            % M step: theta_i+1 <- argmax(Q(theta; theta_i))
            % In the implementation given here, the goal is to find the distribution of the
            % Gaussians using maximum likelihood estimation, as shown in section 10.4.2 of DHS

            % Calculating mu's
            for i = 1:Nalpha(c),
                mu(c,i,:) = sum((train_features(:,train) .* (ones(2,1)*qi(i,:)))') / sum(qi(i,:)');
            end

            % Calculating sigma's
            % A bit different from the handouts, but much more efficient
            for i = 1:Nalpha(c),
                data_vec = train_features(:,train);
                data_vec = data_vec - squeeze(mu(c,i,:)) * ones(1,n);
                data_vec = data_vec .* (ones(2,1) * sqrt(qi(i,:)));
                sigma_i(i,:,:) = sqrt(abs(cov(data_vec',1) * n / sum(qi(i,:)')));
            end

            % Calculating alpha's
            Pw(c,1:Ngaussians(c)) = 1/n * sum(qi');

            iter = iter + 1;
            disp(['Iteration: ' num2str(iter)])

            if (iter > max_iter),
                theta = randn(size(sigma_i));
                iter  = 0;
                disp('Redrawing weights.')
            end
        end

        sigma(c,:,:,:) = sigma_i;
    end
end

% Find decision region
p0 = length(find(train_targets == 0)) / length(train_targets);

% If there is only one Gaussian in a class, squeeze will wreck its format, so fixing is needed
m0 = squeeze(mu(1,1:Ngaussians(1),:));
m1 = squeeze(mu(2,1:Ngaussians(2),:));
if (size(m0,2) == 1), m0 = m0'; end
if (size(m1,2) == 1), m1 = m1'; end

param_struct.m0 = m0;
param_struct.m1 = m1;
param_struct.s0 = squeeze(sigma(1,1:Ngaussians(1),:,:));
param_struct.s1 = squeeze(sigma(2,1:Ngaussians(2),:,:));
param_struct.w0 = Pw(1,1:Ngaussians(1),:);
param_struct.w1 = Pw(2,1:Ngaussians(2),:);
param_struct.p0 = p0;

D = decision_region(param_struct, region);
% END EM

function p = p_single(x, mu, sigma)

% Return the probability on a Gaussian probability function. Used by EM

p = 1/(2*pi*sqrt(abs(det(sigma)))) * exp(-0.5*(x-mu)'*inv(sigma)*(x-mu));

The computational complexity of the EM algorithm is O(ndt), where n is the number of objects in the database, d is the number of attributes per object, and t is the number of iterations. The EM algorithm is simple, easy to implement, and converges quickly, but it may not reach the global optimum.
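To make the E-step/M-step cycle described above concrete, here is a minimal Python sketch for a one-dimensional Gaussian mixture. This is an illustration, not a port of the MATLAB routine: the names `em_gmm_1d` and `normal_pdf` are our own, and the initial centers are passed in explicitly for reproducibility, whereas step 1 of the algorithm would draw K distinct objects from D at random.

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, init_means, iters=50):
    """Minimal EM for a 1-D Gaussian mixture; returns (weights, means, variances).

    Illustrative sketch only: init_means plays the role of step 1's randomly
    chosen cluster centers."""
    k = len(init_means)
    n = len(data)
    means = list(init_means)
    overall_mean = sum(data) / n
    # Start every component with the overall data variance, weights 1/k.
    variances = [sum((x - overall_mean) ** 2 for x in data) / n] * k
    weights = [1.0 / k] * k

    for _ in range(iters):
        # E-step: membership probability P(o_i in c_j) for every object/cluster
        resp = []
        for x in data:
            p = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            s = sum(p)
            resp.append([pj / s for pj in p])

        # M-step: re-estimate means, variances, and weights from the memberships
        for j in range(k):
            nj = sum(resp[i][j] for i in range(n))
            means[j] = sum(resp[i][j] * data[i] for i in range(n)) / nj
            variances[j] = sum(resp[i][j] * (data[i] - means[j]) ** 2 for i in range(n)) / nj
            weights[j] = nj / n

    return weights, means, variances
```

For two well-separated groups, e.g. points near 0 and points near 10, the sketch recovers component means near the two group centers with roughly equal weights; note how the mean update is exactly the weighted-mean formula for m_j given above.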