Web Crawler Thesis

Abstract

A web crawler (WebCrawler), usually just called a crawler, is an important component of a search engine. With the rapid progress of information technology, the web crawler has remained an active research topic, and the quality of a search engine depends largely on the quality of its crawler. Current research on web crawlers falls into two directions: strategies for searching web pages, and algorithms for analysing URLs. Within this work, the topic-focused web crawler is one research direction: it uses web page analysis algorithms to filter out links unrelated to the topic and places the qualifying links in a queue of URLs waiting to be crawled.

If the Internet is compared to a spider web, then the Spider is the spider crawling across it. A web spider finds pages through their link addresses. Starting from one page of a website (usually the home page), it reads the content of the page, finds the other link addresses in it, and then uses those addresses to find the next pages. This loop continues until every page of the site has been fetched. If the whole Internet is treated as a single website, a web crawler can use the same principle to fetch all the pages on the Internet.

Keywords: web crawler; Linux Socket; C/C++; multithreading; mutex

Contents

Abstract
Chapter 1 Overview
    1.1 Project background
    1.2 History and classification of web crawlers
        1.2.1 History of web crawlers
        1.2.2 Classification of web crawlers
    1.3 Development trends of web crawlers
    1.4 Necessity of developing the system
    1.5 Organization of this thesis
Chapter 2 Survey of related technologies and tools
    2.1 Definition of a web crawler
    2.2 Web page search strategies
        2.2.1 Breadth-first search strategy
    2.3 Related tools
        2.3.1 Operating system
        2.3.2 Software configuration
Chapter 3 Analysis and outline design of the web crawler model
    3.1 Model analysis of the web crawler
    3.2 Search strategy of the web crawler
    3.3 Outline design of the web crawler
Chapter 4 Design and implementation of the web crawler model
    4.1 Overall design of the web crawler
    4.2 Detailed design of the web crawler
        4.2.1 URL class design and URL normalization
        4.2.2 Crawling web pages
        4.2.3 Web page analysis
        4.2.4 Web page storage
        4.2.5 Linux socket communication
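
The crawl loop described in the abstract (start from one page, read it, collect its links, queue the new ones, repeat until nothing is left) can be sketched in C++ as a breadth-first traversal. This is only an illustrative sketch, not the thesis's implementation: the names LinkGraph, extract_links and crawl are hypothetical, and the "web" is an in-memory map standing in for the socket download and HTML parsing that a real crawler would perform.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Assumption: the "web" is an in-memory map from a URL to the links found
// on that page. A real crawler would download the page and parse its HTML.
using LinkGraph = std::map<std::string, std::vector<std::string>>;

// Hypothetical stand-in for "fetch the page and extract its links".
// Unknown URLs simply yield no links.
std::vector<std::string> extract_links(const LinkGraph& web,
                                       const std::string& url) {
    auto it = web.find(url);
    if (it == web.end()) {
        return {};
    }
    return it->second;
}

// Breadth-first crawl: visit pages in the order they were discovered and
// never enqueue the same URL twice. Stops when the waiting queue is empty
// or max_pages pages have been visited.
std::vector<std::string> crawl(const LinkGraph& web,
                               const std::string& seed,
                               std::size_t max_pages) {
    std::vector<std::string> visited_order;
    std::set<std::string> seen{seed};   // URLs already queued or visited
    std::queue<std::string> waiting;    // URLs waiting to be crawled
    waiting.push(seed);

    while (!waiting.empty() && visited_order.size() < max_pages) {
        std::string url = waiting.front();
        waiting.pop();
        visited_order.push_back(url);

        for (const std::string& link : extract_links(web, url)) {
            // insert() reports whether the URL was new to the set.
            if (seen.insert(link).second) {
                waiting.push(link);
            }
        }
    }
    return visited_order;
}
```

The set of seen URLs is what keeps the loop from cycling when pages link back to each other, and the max_pages limit is an added safeguard for the sketch, since "until all pages are fetched" is not a practical stopping rule on the open Internet.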
