Python Web Crawler Programming KC23 (Python爬虫程序设计KC23)

PPT, 22 pages, 76.896 KB, uploaded 2022-11-12


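[Document description] Python爬虫程序设计KC23.pptx, 22 pages, 76.896 KB, uploaded by 小橙橙

When reposting, please keep the link: https://www.ichengzhen.cn/view-2412.html

The following is a partial text excerpt of this document: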

2.3.1 Finding HTML Elements with BeautifulSoup

Finding elements in a document is an important way to extract information from a web page. BeautifulSoup provides a family of methods for finding elements, and the powerful find_all function is one of the most commonly used. Its prototype is:

    find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

self indicates that this is a class member function.
name is the tag name of the elements to find. It defaults to None; if it is not given, elements of every name are searched.
attrs is a dictionary of element attributes. It defaults to empty; if it is given, only elements that have the specified attributes are matched.
recursive specifies whether the search covers the whole subtree under the element node. It defaults to True; with False, only the direct children are examined.
text, limit, and kwargs are more involved and are introduced later, when they are used. (Newer Beautiful Soup releases also accept the text argument under the name string.)

find_all returns a list of all matching elements. Each item in the list is a bs4.element.Tag object.

find_all finds every element node that satisfies the conditions. To find just one element node, use the find function, whose prototype is:

    find(self, name=None, attrs={}, recursive=True, text=None, **kwargs)

It is used like find_all, except that it returns only the first matching node (or None if nothing matches) instead of a list. It takes no limit parameter, because it always stops at the first match.

Example 2-3-1: Find the <title> element in the document

    from bs4 import BeautifulSoup

    doc = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title">The Dormouse's story</p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    </body></html>
    '''

    # Parse the document with the lxml parser.
    soup = BeautifulSoup(doc, "lxml")
    # find returns the first matching element.
    tag = soup.find("title")
    print(type(tag), tag)

Result:

    <class 'bs4.element.Tag'> <title>The Dormouse's story</title>

The <title> element is found, and it is a bs4.element.Tag object.

Examples 2-3-2 through 2-3-8 use the same doc string and the same soup = BeautifulSoup(doc, "lxml") setup as Example 2-3-1. Only the lines that follow that setup are shown.

Example 2-3-2: Find all <a> elements in the document

    tags = soup.find_all("a")
    for tag in tags:
        print(tag)

The program finds three <a> elements:

    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

The order in which the attributes are printed can vary with the Beautiful Soup and Python versions.

Example 2-3-3: Find the first <a> element in the document

    tag = soup.find("a")
    print(tag)

The program finds the first <a> element:

    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

Example 2-3-4: Find the <p> element with class="title"

    tag = soup.find("p", attrs={"class": "title"})
    print(tag)

The program finds the element with class="title":

    <p class="title">The Dormouse's story</p>

Here tag = soup.find("p") finds the same element, because it is the first <p> element in the document.

Example 2-3-5: Find the elements with class="sister"

    tags = soup.find_all(name=None, attrs={"class": "sister"})
    for tag in tags:
        print(tag)

Here name=None means that elements of any name are accepted. The program finds three elements:

    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

For this document, either of the following statements has the same effect:

    tags = soup.find_all("a")
    tags = soup.find_all("a", attrs={"class": "sister"})

2.3.2 Getting an Element's Attribute Values with BeautifulSoup

Once an element has been found, for example an <a> element, how do we get its attribute values? BeautifulSoup uses:

    tag[attrName]

to get the value of the attribute named attrName of the element tag, where tag is a bs4.element.Tag object.

Example 2-3-6: Find all hyperlink addresses in the document

    tags = soup.find_all("a")
    for tag in tags:
        # Index the tag by attribute name to read the attribute value.
        print(tag["href"])

Result:

    http://example.com/elsie
    http://example.com/lacie
    http://example.com/tillie

2.3.3 Getting the Text Contained in an Element with BeautifulSoup

Once an element has been found, for example an <a> element, how do we get the text it contains? BeautifulSoup uses:

    tag.text

to get the text contained in the element tag, where tag is a bs4.element.Tag object.

Example 2-3-7: Find the text contained in all <a> hyperlinks in the document

    tags = soup.find_all("a")
    for tag in tags:
        print(tag.text)

Result:

    Elsie
    Lacie
    Tillie

Example 2-3-8: Find the text contained in all <p> elements in the document

    # find_all is needed here: find would return a single Tag, not a list.
    tags = soup.find_all("p")
    for tag in tags:
        print(tag.text)

Result:

    The Dormouse's story
    Once upon a time there were three little sisters; and their names were
    Elsie,
    Lacie and
    Tillie;
    and they lived at the bottom of a well.
    ...

The value for the second <p> is the concatenation of all text nodes in that node's subtree, including the text inside its three <a> elements.

2.3.4 Advanced Searching with BeautifulSoup

find and find_all usually meet our needs. When they do not, we can write a filter function and search with it.

Example 2-3-9: Find the <a> element with href="http://example.com/lacie"

    from bs4 import BeautifulSoup

    doc = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <a href="http://example.com/elsie">Elsie</a>
    <a href="http://example.com/lacie">Lacie</a>
    <a href="http://example.com/tillie">Tillie</a>
    </body></html>
    '''

    def myFilter(tag):
        # Print each tag name to show which tags are offered to the filter.
        print(tag.name)
        return (tag.name == "a" and tag.has_attr("href")
                and tag["href"] == "http://example.com/lacie")

    soup = BeautifulSoup(doc, "lxml")
    tags = soup.find_all(myFilter)
    print(tags)

Result:

    html
    head
    title
    body
    a
    a
    a
    [<a href="http://example.com/lacie">Lacie</a>]

Explanation: the program defines a filter function myFilter(tag) whose parameter is a tag object. When soup.find_all(myFilter) is called, every tag element is passed to myFilter, which decides whether to keep it. If myFilter returns True, the tag is added to the result set; otherwise it is discarded. The output therefore shows the tags html, head, title, body, a, a, a going through myFilter one by one. Only the node <a href="http://example.com/lacie">Lacie</a> satisfies the condition, so the result is:

    [<a href="http://example.com/lacie">Lacie</a>]

In the filter:
tag.name is the name of the tag;
tag.has_attr(attName) tests whether the tag has an attribute named attName;
tag[attName] is the value of the tag's attName attribute.

Example 2-3-10: A filter function can find elements that match more complex conditions. Here we find all <a> nodes whose text ends with "cie".

    from bs4 import BeautifulSoup

    doc = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <a href="http://example.com/elsie">Elsie</a>
    <a href="http://example.com/lacie">Lacie</a>
    <a href="http://example.com/tillie">Tillie</a>
    <a href="http://example.com/tilcie">Tilcie</a>
    </body></html>
    '''

    def endsWith(s, t):
        # True if string s ends with string t.
        if len(s) >= len(t):
            return s[len(s) - len(t):] == t
        return False

    def myFilter(tag):
        return (tag.name == "a" and endsWith(tag.text, "cie"))

    soup = BeautifulSoup(doc, "lxml")
    tags = soup.find_all(myFilter)
    for tag in tags:
        print(tag)

Result:

    <a href="http://example.com/lacie">Lacie</a>
    <a href="http://example.com/tilcie">Tilcie</a>

The program defines a function endsWith(s, t) that tests whether string s ends with string t, returning True if it does and False otherwise. myFilter calls this function to test whether tag.text ends with "cie", so the search returns all <a> nodes whose text ends with "cie".
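The filter-function search of Example 2-3-10 can also be written more compactly. The following is a minimal sketch, not part of the original slides. It assumes Python's built-in str.endswith in place of the hand-written endsWith helper, and the "html.parser" parser that ships with Python in place of "lxml". The function name text_ends_with_cie is hypothetical. It also shows that find accepts the same kind of filter function as find_all.

```python
from bs4 import BeautifulSoup

# Same four-link document as Example 2-3-10.
doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<a href="http://example.com/elsie">Elsie</a>
<a href="http://example.com/lacie">Lacie</a>
<a href="http://example.com/tillie">Tillie</a>
<a href="http://example.com/tilcie">Tilcie</a>
</body></html>
'''


def text_ends_with_cie(tag):
    """Hypothetical filter: keep <a> tags whose text ends with "cie"."""
    # str.endswith stands in for the hand-written endsWith helper.
    return tag.name == "a" and tag.text.endswith("cie")


# Assumption: "html.parser" is used here so that lxml need not be installed.
soup = BeautifulSoup(doc, "html.parser")

# find_all keeps every tag for which the filter returns True.
matches = soup.find_all(text_ends_with_cie)
texts = [tag.text for tag in matches]
hrefs = [tag["href"] for tag in matches]

# find takes the same filter but returns only the first match.
first = soup.find(text_ends_with_cie)

print(texts)          # ['Lacie', 'Tilcie']
print(hrefs)          # ['http://example.com/lacie', 'http://example.com/tilcie']
print(first["href"])  # http://example.com/lacie
```

Collecting tag.text and tag["href"] in list comprehensions keeps the matching logic in the filter function and the extraction in one place, which is convenient when the results are processed further rather than printed.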
