Nutch 1.0 Source Code Analysis [1] Injector    21 MAR 2010 12:55:42 +0800
----------------------------------------------------------------------------
In the main function of Crawl there is the line:

    // initialize crawlDb
    injector.inject(crawlDb, rootUrlDir);

Quoting [李阳]: the inject operation calls Injector, a class in the crawl package, one of Nutch's core packages. The inject operation does three things:

1. Format and filter the URL set, dropping invalid URLs, setting each URL's status (UNFETCHED), and initializing its score by a configured method;
2. Merge the URLs, eliminating duplicate entries;
3. Store each URL, with its status and score, in the crawldb database; where a URL duplicates an existing record, the old record is dropped in favor of the new one.

The result of inject: the crawldb database is updated with the URLs and their status.

Let's look at the function inject calls (first half):

    public void inject(Path crawlDb, Path urlDir) throws IOException {
      // create a temporary directory with a random name
      Path tempDir = new Path(getConf().get("mapred.temp.dir", ".") +
          "/inject-temp-" +
          Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

      // map text input file to a <url, CrawlDatum> file
      JobConf sortJob = new NutchJob(getConf());
      sortJob.setJobName("inject " + urlDir);
      FileInputFormat.addInputPath(sortJob, urlDir);
      sortJob.setMapperClass(InjectMapper.class);

      FileOutputFormat.setOutputPath(sortJob, tempDir);
      sortJob.setOutputFormat(SequenceFileOutputFormat.class);
      sortJob.setOutputKeyClass(Text.class);
      sortJob.setOutputValueClass(CrawlDatum.class);
      sortJob.setLong("injector.current.time", System.currentTimeMillis());
      JobClient.runJob(sortJob);

This is Hadoop at work: the input directory is the user-specified URL directory, and the output directory is the random temporary directory created above. "Hadoop: The Definitive Guide" explains the format behind SequenceFileOutputFormat this way: "Imagine a logfile, where each log record is a new line of text. If you want to log binary types, plain text isn't a suitable format. Hadoop's SequenceFile class fits the bill in this situation, providing a persistent data structure for binary key-value pairs." Here the map function produces the file of <url, CrawlDatum> pairs:

    public void map(WritableComparable key, Text value,
                    OutputCollector<Text, CrawlDatum> output, Reporter reporter)
        throws IOException {
      String url = value.toString();  // value is line of text
      try {
        url = urlNormalizers.normalize(url, URLNormalizers.SCOPE_INJECT);
        url = filters.filter(url);    // filter the url
      } catch (Exception e) {
      }
      if (url != null) {  // if it passes
        value.set(url);   // collect it
        CrawlDatum datum = new CrawlDatum(CrawlDatum.STATUS_INJECTED, interval);
        datum.setFetchTime(curTime);
        datum.setScore(scoreInjected);
        try {
          scfilters.injectedScore
(value, datum);
        } catch (ScoringFilterException e) {
        }
        output.collect(value, datum);
      }
    }

urlNormalizers normalizes URLs, while filters rejects illegal ones. The map output key is the URL and the value is a CrawlDatum; several of its member variables are set here:

    private byte status;
    private long fetchTime = System.currentTimeMillis();
    private int fetchInterval;
    private float score = 1.0f;

ScoringFilters is the class that computes scores. The latter part of the inject function:

      // merge with existing crawl db
      JobConf mergeJob = CrawlDb.createJob(getConf(), crawlDb);
      FileInputFormat.addInputPath(mergeJob, tempDir);
      mergeJob.setReducerClass(InjectReducer.class);
      JobClient.runJob(mergeJob);
      CrawlDb.install(mergeJob, crawlDb);

      // clean up
      FileSystem fs = FileSystem.get(getConf());
      fs.delete(tempDir, true);

      if (LOG.isInfoEnabled()) { LOG.info("Injector: done"); }
    }

mergeJob takes the temporary directory produced by the first job as its input; the output is handled inside the install function, and finally the temporary directory is deleted. Now let's look at the InjectReducer class:

    /** Combine multiple new entries for a url. */
    public static class InjectReducer
        implements Reducer<Text, CrawlDatum, Text, CrawlDatum> {
      private CrawlDatum old = new CrawlDatum();
      private CrawlDatum injected = new CrawlDatum();

      public void reduce(Text key, Iterator<CrawlDatum> values,
                         OutputCollector<Text, CrawlDatum> output,
                         Reporter reporter) throws IOException {
        boolean oldSet = false;
        while (values.hasNext()) {
          CrawlDatum val = values.next();
          if (val.getStatus() == CrawlDatum.STATUS_INJECTED) {
            injected.set(val);
            injected.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
          } else {
            old.set(val);
            oldSet = true;
          }
        }
        CrawlDatum res = null;
        if (oldSet) res = old;  // don't overwrite existing value
        else res = injected;
        output.collect(key, res);
      }
    }

This reduce merges the <url, CrawlDatum> pairs, since there is no point in one URL being associated with several datums. It inspects each CrawlDatum's status: if it is STATUS_INJECTED, i.e. newly injected, the injected datum is set (and its status downgraded to STATUS_DB_UNFETCHED); any other status means a record that already existed in the database, so the old datum is set. The point to note is that if an old record does exist, res is set to old -- an injected URL never overwrites an existing entry -- and only otherwise is it set to injected.

In the install method of the CrawlDb class:

    public static void install(JobConf job, Path crawlDb) throws IOException {
      Path newCrawlDb = FileOutputFormat.getOutputPath(job);
      FileSystem fs = new JobClient(job).getFs();
      Path old = new Path(crawlDb, "old");
      Path current = new Path(crawlDb, CURRENT_NAME);
      if (fs.exists(current)) {
        if (fs.exists(old)) fs.delete(old, true);
        fs.rename(current, old);
      }
      fs.mkdirs(crawlDb);
      fs.rename(newCrawlDb, current);
      if (fs.exists(old)) fs.delete(old, true);
      Path lock = new Path(crawlDb, LOCK_NAME);
      LockUtil.removeLockFile(fs, lock);
    }

newCrawlDb is something like crawl/crawldb/216164146, old is crawl/crawldb/old, and current is crawl/crawldb/current. If current exists, it is renamed to old; then the crawlDb directory is (re)created and the new database is renamed to current; if old is still there it is deleted, and if a lock file exists, the lock is removed.

Nutch 1.0 Source Code Analysis [2] Plugin (1)    21 MAR 2010 12:58:47 +0800
----------------------------------------------------------------------------
Let's use URLNormalizers as a way into Nutch's plugin mechanism. The configure method of the Injector class contains the line:

    urlNormalizers = new URLNormalizers(job, URLNormalizers.SCOPE_INJECT);

which calls:

    public URLNormalizers(Configuration conf, String scope) {
      this.conf = conf;
      this.extensionPoint = PluginRepository.get(conf)
          .getExtensionPoint(URLNormalizer.X_POINT_ID);
      ObjectCache objectCache = ObjectCache.get(conf);
      normalizers = (URLNormalizer[]) objectCache
          .getObject(URLNormalizer.X_POINT_ID + "_" + scope);
      if (normalizers == null) {
        normalizers = getURLNormalizers(scope);
      }
      if (normalizers == EMPTY_NORMALIZERS) {
        normalizers = (URLNormalizer[]) objectCache
            .getObject(URLNormalizer.X_POINT_ID + "_" + SCOPE_DEFAULT);
        if (normalizers == null) {
          nor
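The constructor is cut off above, but its shape is already clear: look for normalizers cached under the current scope, build them for that scope if absent, and fall back to the default scope when the scope yields nothing. Below is a rough standalone sketch of that cache-then-fallback pattern; a plain HashMap stands in for Nutch's ObjectCache, and every name in it is illustrative rather than the real Nutch API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class ScopedLookup {
    private final Map<String, String[]> cache = new HashMap<>();

    // Resolve the handlers registered for (extension point, scope).
    // Misses are computed via the loader and cached under "xPoint_scope";
    // an empty result triggers one retry under the default scope,
    // mirroring the URLNormalizers constructor's EMPTY_NORMALIZERS branch.
    public String[] resolve(String xPoint, String scope,
                            Function<String, String[]> loader) {
        String[] found =
            cache.computeIfAbsent(xPoint + "_" + scope, k -> loader.apply(scope));
        if (found.length == 0) {  // nothing configured for this scope
            found = cache.computeIfAbsent(xPoint + "_default",
                                          k -> loader.apply("default"));
        }
        return found;
    }

    public static void main(String[] args) {
        ScopedLookup lookup = new ScopedLookup();
        // Pretend no normalizer is registered for the "inject" scope,
        // so resolution falls through to the default scope.
        Function<String, String[]> loader = scope ->
            scope.equals("default") ? new String[] {"urlnormalizer-basic"}
                                    : new String[0];
        String[] found = lookup.resolve("URLNormalizer", "inject", loader);
        System.out.println(found[0]);
    }
}
```

In real Nutch the ObjectCache is per-Configuration and the loader role is played by getURLNormalizers, which consults the plugin descriptors through PluginRepository; the sketch only mirrors the control flow of the lookup.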