Python Big Data Programming

Big data processing workflow
1 Data collection
2 Data wrangling
3 Data description
4 Data analysis

Convenient Data Acquisition

Getting data with Python
How do we get local data?
Opening, reading, writing, and closing files
• Open the file
• Read the file, write the file
• Close the file

Getting data with Python
How do we get data from the network?
Fetch the web page, then parse its contents
• urllib
• urllib2 (replaced by urllib.request in Python 3)
• httplib (replaced by http.client in Python 3)
• httplib2

Yahoo Finance data
=%5EDJI+Component

Using the urllib library to get Yahoo Finance data
File
    # Filename: dji.py
    import urllib.request
    import re
    dBytes = urllib.request.urlopen('=%5EDJI+Components').read()
    # In Python 3, read() on the response returns a bytes object rather than a str;
    # this statement converts dBytes to a str
    dStr = dBytes.decode('GBK')
    m = re.findall('<tr><td class="yfnc_tabledata1"><b><a href=".*?">(.*?)</a></b></td><td class="yfnc_tabledata1">(.*?)</td>.*?<b>(.*?)</b>.*?</tr>', dStr)
    if m:
        print(m)
        print('\n')
        print(len(m))
    else:
        print('not match')

Data form
• Contains multiple strings (dji)
– 'AXP', 'American Express Company', '86.40'
– 'BA', 'The Boeing Company', '122.24'
– 'CAT', 'Caterpillar Inc.', '99.44'
– 'CSCO', 'Cisco Systems, Inc.', '23.78'
– 'CVX', 'Chevron Corporation', '115.91'
– …

Convenient network data
Is there a simple, convenient, and fast way to get the historical stock data of the listed companies on Yahoo Finance?
File
    # Filename: quotes.py
    from matplotlib.finance import quotes_historical_yahoo_ochl
    from datetime import date
    import pandas as pd
    today = date.today()
    start = (today.year-1, today.month, today.day)
    quotes = quotes_historical_yahoo_ochl('AXP', start, today)
    df = pd.DataFrame(quotes)
    print(df)
Note: the function quotes_historical_yahoo has since been updated to quotes_historical_yahoo_ochl.

Convenient network data
Contents of quotes: date, open, close, high, low, volume

Convenient network data
Natural Language Toolkit (NLTK)
• Gutenberg corpus
• Brown corpus
• Reuters corpus
• Web and chat text
• …
Source
    >>> from nltk.corpus import gutenberg
    >>> import nltk
    >>> print(gutenberg.fileids())
    [u'austen-emma.txt', u'austen-persuasion.txt', u'austen-sense.txt', u'bible-kjv.txt', u'blake-poems.txt', u'bryant-stories.txt', u'burgess-busterbrown.txt', u'carroll-alice.txt', u'chesterton-ball.txt', u'chesterton-brown.txt', u'chesterton-thursday.txt', u'edgeworth-parents.txt', u'melville-moby_dick.txt', u'milton-paradise.txt', u'shakespeare-caesar.txt', u'shakespeare-hamlet.txt', u'shakespeare-macbeth.txt', u'whitman-leaves.txt']
    >>> texts = gutenberg.words('shakespeare-hamlet.txt')
    >>> texts
    [u'[', u'The', u'Tragedie', u'of', u'Hamlet', u'by', ...]
• Run nltk.download() first to download one or more packages. If the download fails, download the package separately from the official website and place it under nltk_data\corpora in the local Python directory.

Data Preparation

Data form
Logical structure of the stock data for the 30 component stocks (dji): company code, company name, last trade price
Logical structure of the detailed stock data for American Express (quotes): date, open, close, high, low, volume

Data wrangling
Adding attribute names to the quotes data
File
    # Filename: quotesproc.py
    from matplotlib.finance import quotes_historical_yahoo_ochl
    from datetime import date
    import pandas as pd
    today = date.today()
    start = (today.year-1, today.month, today.day)
    quotes = quotes_historical_yahoo_ochl('AXP', start, today)
    fields = ['date', 'open', 'close', 'high', 'low', 'volume']
    quotesdf = pd.DataFrame(quotes, columns=fields)
    print(quotesdf)

Data wrangling
dji data with attribute names: columns code, name, lasttrade (rows AXP, BA, CAT, …, XOM)
quotes data with attribute names: columns date, open, close, high, low, volume (date values 735190.0, 735191.0, 735192.0, …, 735551.0)

Data wrangling
Using 1, 2, … as the index
    quotesdf = pd.DataFrame(quotes, columns=fields)
becomes
    quotesdf = pd.DataFrame(quotes, index=range(1, len(quotes)+1), columns=fields)

Data wrangling
If date can be used directly as the index, can the times in quotes be converted to a conventional form (as in the figure below)?
Source
    >>> from datetime import date
    >>> firstday = date.fromordinal(735190)
    >>> lastday = date.fromordinal(735551)
    >>> firstday
    datetime.date(2013, 11, 18)
    >>> lastday
    datetime.date(2014, 11, 14)

Time series
File
    # Filename: quotesproc.py
    from matplotlib.finance import quotes_historical_yahoo_ochl
    from datetime import date
    import pandas as pd
    today = date.today()
    start = (today.year-1, today.month, today.day)
    quotes = quotes_historical_yahoo_ochl('AXP', start, today)
    fields = ['date', 'open', 'close', 'high', 'low', 'volume']
    list1 = []
    for i in range(0, len(quotes)):
        x = date.fromordinal(int(quotes[i][0]))    # convert the ordinal to a conventional date
        y = x.strftime('%Y-%m-%d')                 # convert to a fixed string format
        list1.append(y)
    quotesdf = pd.DataFrame(quotes, index=list1, columns=fields)
    quotesdf = quotesdf.drop(['date'], axis=1)     # delete the original date column
    print(quotesdf)

Creating a time series
Source
    >>> import pandas as pd
    >>> dates = pd.date_range('20141001', periods=7)
    >>> dates
    <class 'pandas.tseries.index.DatetimeIndex'>
    [2014-10-01, ..., 2014-10-07]
    Length: 7, Freq: D, Timezone: None
    >>> import numpy as np
    >>> dates = pd.DataFrame(np.random.randn(7,3), index=dates, columns=list('ABC'))
    >>> dates
                       A         B         C
    2014-10-01  1.302600 -1.214708  1.411628
    2014-10-02 -0.512343  2.277474  0.403811
    2014-10-03 -0.788498 -0.217161  0.173284
    2014-10-04  1.042167 -0.453329 -2.107163
    2014-10-05 -1.628075  1.663377  0.943582
    2014-10-06 -0.091034  0.335884  2.455431
    2014-10-07 -0.679055 -0.865973  0.246970

    [7 rows x 3 columns]

Data Display

Data display
djidf, quotesdf

Data display
Ways to display:
• Show the index
• Show the column names
• Show the data values
• Show the data description
Source
    >>> djidf.index
    Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29], dtype='int64')
    >>> djidf.columns
    Index([u'code', u'name', u'lasttrade'], dtype='object')
    >>> djidf.values
    array([['AXP', 'American Express Company', '90.67'],
           ['BA', 'The Boeing Company', '128.86'],
           …
           ['XOM', 'Exxon Mobil Corporation', '95.09']], dtype=object)
    >>> djidf.describe
    <bound method DataFrame.describe of
       code                      name lasttrade
    0   AXP  American Express Company     90.67
    1    BA        The Boeing Company    128.86
    …
    29  XOM   Exxon Mobil Corporation     95.09>

Data display
Format of the index
Source
    >>> quotesdf.index
    Index([u'2013-11-18', u'2013-11-19', u'2013-11-20', u'2013-11-21', u'2013-11-22', u'2013-11-25', u'2013-11-26', u'2013-11-27', …-04-08', u'2014-04-09', u'2014-04-10', u'2014-04-11', ...], dtype='object')

Data display
    >>> djidf.head(5)
    index: 0; 1; 2; 3; 4
    code: AXP; BA; CAT; CSCO; CVX
    name: American Express Company; The Boeing Company; Caterpillar Inc.; Cisco Systems, Inc.; Chevron Corporation
    lasttrade: 90.67; 128.86; 101.