
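1. BeautifulSoup is one of the most frequently used libraries for scraping information from web pages. Below is a brief introduction to the bs4 techniques I rely on when writing my own scripts.
2. This post uses a shopping venue listed on Daodao (到到网, TripAdvisor's Chinese site) as the example:
url='http://www.tripadvisor.cn/Attraction_Review-g294217-d3821611-Reviews-Empire_International_Tailors-Hong_Kong.html'
3. When writing scripts I often consult what other people have posted online, much of it billed as "for beginners" and "simple and easy to understand". Today, let me show you what truly beginner-level looks like!!!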
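The snippets in this post all use a `soup` object without showing where it comes from. Here is a minimal sketch of one way to build it. I am assuming the page HTML is already available as a string; the `html` value below is a stand-in copied from the excerpt quoted later in this post, and the choice of the built-in `html.parser` is mine.

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded page. In a real script this string would be
# the body of the HTTP response for `url` (the download step is not shown).
html = """
<h1 id="HEADING" property="name" class="heading_name with_alt_title ">
<div class="heading_height"></div>
Empire服装定制
<span class="altHead">Empire International Tailors</span>
</h1>
"""

# "html.parser" ships with Python, so no extra parser package is required.
soup = BeautifulSoup(html, "html.parser")

# Quick sanity check that the document was parsed.
print(soup.h1["id"])
```

Any other installed parser (lxml, for instance) could be passed in place of `"html.parser"`.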
1. find()
Key point: locate a tag that is unique on the page.
Example: we want the English name of the shopping venue on the page above. It sits in this fragment:
<span class="altHead">Empire International Tailors</span>
This class value is unique on the page, so it is easy. The code:

    # find() returns the first <span> whose class is "altHead"
    english_name = soup.find('span', {'class': "altHead"})
    print(english_name.string)

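One thing that bites in practice is that `find()` returns `None` when nothing matches, and calling `.string` on `None` raises an error. A small sketch of the same lookup with a guard; the inline `html` is a trimmed copy of the fragment above, and `no_such_class` is a made-up class name used only to trigger the miss.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the fragment quoted above, inlined so the sketch runs offline.
html = '<h1 id="HEADING"><span class="altHead">Empire International Tailors</span></h1>'
soup = BeautifulSoup(html, "html.parser")

# class_ is the keyword form of {'class': ...}; the trailing underscore avoids
# a clash with Python's reserved word `class`.
english_name = soup.find("span", class_="altHead")
print(english_name.string)

# find() returns None when nothing matches, so check before using the result.
missing = soup.find("span", class_="no_such_class")  # hypothetical class name
if missing is None:
    print("tag not found")
```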
2. extract()
Key point: remove the specified tag from the tree and return it.
Example: things are often less simple than the first case. Say we want the Chinese name from the same page. It sits in this fragment:

    <h1 id="HEADING" property="name" class="heading_name with_alt_title ">
    <div class="heading_height"></div>
    Empire服装定制
    <span class="altHead">Empire International Tailors</span>
    </h1>

With soup.find_all('h1',{'id':"HEADING"}) we get the whole block, with the Chinese and English names mixed together. One way around this is to use extract() to remove the <span> tag. The code:

    # biaoqian ("tag"): the <h1> holding both names
    biaoqian = soup.find('h1', {'id': "HEADING"})
    english_name = biaoqian.span.extract()
    print(english_name)  # the extracted <span>, i.e. the English name
    print(biaoqian)  # the <span> is now gone, leaving the Chinese name: two birds with one stone

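Printing the tags as above still shows the surrounding markup. If only the bare names are wanted, one option (my own addition, not part of the original script) is to call `get_text(strip=True)` on both pieces after the extract. A sketch on an inline copy of the fragment:

```python
from bs4 import BeautifulSoup

html = """
<h1 id="HEADING" property="name" class="heading_name with_alt_title ">
<div class="heading_height"></div>
Empire服装定制
<span class="altHead">Empire International Tailors</span>
</h1>
"""
soup = BeautifulSoup(html, "html.parser")

heading = soup.find("h1", id="HEADING")

# extract() detaches the <span> from the tree and hands it back.
alt = heading.span.extract()

english = alt.get_text(strip=True)
# The heading no longer contains the <span>, so only the Chinese name is left.
chinese = heading.get_text(strip=True)

print(english)
print(chinese)
```

Note that `extract()` modifies the tree in place, so any later lookup for the `altHead` span on the same `soup` will come back empty.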
3. select()
Key point: search by exact attribute value to avoid look-alikes.
Example: many attribute values look alike, such as class="detail", class="detail_section info", and class="detail wrap". A class lookup with find() or find_all() matches any tag that has detail among its classes (class="detail wrap" included), so it tends to pull in a whole batch of tags. That is when precise targeting is needed. Say we want the shop categories from the page. They sit in this fragment:

    <div class="detail">
    <a href="/Attractions-g294217-Activities-c26-t144-Hong_Kong.html">礼品与特产商店</a>, <a href="/Attractions-g294217-Activities-c26-Hong_Kong.html">购物</a>
    </div>

Looking through the whole page source, there are many tags similar to class="detail", so we use select() to pin down the exact attribute value we want. The code:

    # fenlei ("categories"): only <div> tags whose class attribute is exactly "detail"
    fenlei = soup.select('div[class="detail"]')
    for fen in fenlei:
        print(fen.text)

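To make the difference concrete, here is a small sketch comparing a class lookup with the exact-attribute selector. The three `<div>` tags and their one-letter contents are made up for the comparison; only the class values come from the text above.

```python
from bs4 import BeautifulSoup

# Made-up markup: three look-alike class values, one letter of text each.
html = """
<div class="detail">A</div>
<div class="detail wrap">B</div>
<div class="detail_section info">C</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A class lookup matches any tag that has "detail" as one of its classes.
loose = [d.get_text() for d in soup.find_all("div", class_="detail")]

# The attribute selector compares the whole attribute value.
exact = [d.get_text() for d in soup.select('div[class="detail"]')]

print(loose)
print(exact)
```

The class lookup picks up both `detail` and `detail wrap`, while the selector keeps only the tag whose class attribute is exactly `detail`.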
Another example: we want the breadcrumb navigation. It sits in this fragment:

    <ul>
    <li>
    <span>&gt;&gt;</span>
    <a href="/Tourism-g2-Asia-Vacations.html" title="亚洲旅游" onclick="ta.setEvtCookie('Breadcrumbs', 'click', 'Continent', 2, this.href); ">亚洲</a>
    </li>
    <li>
    <span>&gt;&gt;</span>
    <a href="/Tourism-g294211-China-Vacations.html" title="中国旅游" onclick="ta.setEvtCookie('Breadcrumbs', 'click', 'Country', 3, this.href); ">中国</a>
    </li>
    <li>
    <span>&gt;&gt;</span>
    <a href="/Tourism-g294217-Hong_Kong-Vacations.html" title="香港旅游" onclick="ta.setEvtCookie('Breadcrumbs', 'click', 'City', 4, this.href); ">香港</a>
    </li>
    <li>
    <span>&gt;&gt;</span>
    <a href="/Attractions-g294217-Activities-Hong_Kong.html" onclick="ta.setEvtCookie('Breadcrumbs', 'click', 'Attractions', 5, this.href);">香港景点</a>
    </li>
    <li>
    <span>&gt;&gt;</span>
    Empire服装定制
    </li>
    </ul>

Analysis: this one is infuriating to look at. Fetching the items one by one is tedious, and it is also fragile because every shopping venue's page is a little different. selenium would work too, but that is just as tedious and hardly concise. On closer inspection the onclick attribute values share a common prefix, which gives us this code:

    # daohang ("navigation"): <a> tags whose onclick value starts with the given prefix
    daohang = soup.select('a[onclick^="ta.setEvtCookie(\'Breadcrumbs"]')
    for dao in daohang:
        print(dao.string)

Note: ^= matches the beginning of the onclick attribute value; to match the end of the value instead, use $=. This technique also comes up often when matching href attribute values.
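A sketch of both operators side by side, applied to `onclick` and to `href`. The markup is a cut-down copy of the breadcrumb fragment; the third link (`/Other.html` with `ta.track`) is invented so that there is something that should not match.

```python
from bs4 import BeautifulSoup

html = """
<ul>
<li><a href="/Tourism-g2-Asia-Vacations.html" onclick="ta.setEvtCookie('Breadcrumbs', 'click', 'Continent', 2, this.href); ">亚洲</a></li>
<li><a href="/Attractions-g294217-Activities-Hong_Kong.html" onclick="ta.setEvtCookie('Breadcrumbs', 'click', 'Attractions', 5, this.href);">香港景点</a></li>
<li><a href="/Other.html" onclick="ta.track('x')">other</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# ^= : attribute value starts with the given text.
breadcrumbs = soup.select('a[onclick^="ta.setEvtCookie(\'Breadcrumbs"]')
crumb_names = [a.string for a in breadcrumbs]

# $= : attribute value ends with the given text.
vacation_links = [a["href"] for a in soup.select('a[href$="-Vacations.html"]')]

# ^= works on href in the same way.
attraction_links = [a["href"] for a in soup.select('a[href^="/Attractions-"]')]

print(crumb_names)
print(vacation_links)
print(attraction_links)
```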
4. get_text()
Key point: get all the text under a given tag in one call; quick and tidy.
Example: we want the address from the page. It sits in this fragment:

    <span class="format_address">地址: <span class="country-name" property="addressCountry">中国</span><span class="locality"><span property="addressLocality">香港</span></span><span class="street-address" property="streetAddress">麼地道63号好时中心6号铺</span><span class="extended-address">Houston Centre</span><span property="addressRegion" content=""><span class="postal-code" property="postalCode" content=""></span>
    </span>
    </span>

Analysis: one short address is chopped into pieces, which is what makes Daodao so annoying. get_text() scoops it all up in one go. With string you would have to drill down to the innermost tag for each piece, which is a hassle and unnecessary here. The code:

    # dizhi ("address"): the outer <span> that wraps all the address parts
    dizhi = soup.find('span', {'class': "format_address"})
    print(dizhi.get_text())

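The difference between `string` and `get_text()` is easy to see on a shortened copy of the address markup (I kept only the country and city parts to keep the sketch small):

```python
from bs4 import BeautifulSoup

# Shortened copy of the address fragment: label, country, city.
html = (
    '<span class="format_address">地址: '
    '<span class="country-name">中国</span>'
    '<span class="locality"><span>香港</span></span>'
    '</span>'
)
soup = BeautifulSoup(html, "html.parser")

address = soup.find("span", class_="format_address")

# .string only works when a tag has a single child; here there are several,
# so it gives None.
print(address.string)

# get_text() concatenates every piece of text underneath the tag.
print(address.get_text())

# .string does work once we are down at a tag with a single text child.
country = address.find("span", class_="country-name").string
print(country)
```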
5. get_text("",strip=True)
Key point: strip whitespace, line breaks, and the like from the returned text so it comes out clean.
Example: we want the opening hours from the page. They sit in this fragment:

    <div class="hoursOverlay ">
    <div>
    <div class="days"><b>经营时间:</b></div>
    </div>
    <div>
    <span class="days">
    周一 - 周六 </span>
    <span class="hours">上午10点00分 - 下午9点00分</span>
    </div>
    </div>

Getting the text directly looks like this:

    time = soup.find('div', {'class': "hoursOverlay"})
    print(time.get_text())

But the output is full of line breaks and looks messy, so the whitespace needs to go. The code:

    time = soup.find('div', {'class': "hoursOverlay"})
    # first argument: separator placed between text pieces; strip=True trims each piece
    print(time.get_text("", strip=True))

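With an empty separator the pieces are glued together with nothing in between. A variation I find handy (my own choice, not part of the original script) is to pass a single space as the separator so the days and hours stay apart. A sketch on an inline copy of the fragment; I named the variable `hours` here to avoid shadowing Python's `time` module.

```python
from bs4 import BeautifulSoup

html = """
<div class="hoursOverlay ">
<div>
<div class="days"><b>经营时间:</b></div>
</div>
<div>
<span class="days">
周一 - 周六 </span>
<span class="hours">上午10点00分 - 下午9点00分</span>
</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

hours = soup.find("div", {"class": "hoursOverlay"})

# Empty separator: stripped pieces are joined directly.
joined = hours.get_text("", strip=True)

# One-space separator: stripped pieces are joined with a space between them.
spaced = hours.get_text(" ", strip=True)

print(joined)
print(spaced)
```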
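Assessment:
Generally speaking, besides bs4 there are two other tools for pulling information out of web pages: selenium and re. When bs4 cannot reach something, such as odd scraps of information with no regular structure, re is the convenient, direct route (I do this a lot); the latitude and longitude of a POI on Daodao is one example. Once you are comfortable with bs4 and combine it with re, scraping web page information is mostly obstacle-free.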
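As a rough illustration of the re route, here is a sketch that pulls a latitude/longitude pair out of a piece of inline JavaScript. Everything about the input is an assumption: the `page_source` string, its `lat:`/`lng:` layout, and the coordinate values are invented for the example, so the pattern would need adjusting to whatever the real page source looks like.

```python
import re

# Hypothetical fragment of inline JavaScript; the real page may differ.
page_source = "var poi = {lat: 22.297777, lng: 114.17506, zoom: 15};"

# Two signed decimal numbers following the "lat:" and "lng:" keys.
pattern = r"lat:\s*(-?\d+(?:\.\d+)?),\s*lng:\s*(-?\d+(?:\.\d+)?)"

match = re.search(pattern, page_source)
if match:
    latitude, longitude = (float(group) for group in match.groups())
    print(latitude, longitude)
else:
    print("coordinates not found")
```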
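Afterword:
bs4 is broad and deep, and I have only covered the tip of the iceberg. In real work all sorts of unexpected situations come up, which is a sign of my own lack of practice, so I usually keep this page open: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree . It is very comprehensive. Much later I noticed that the top of that page states clearly that there is a Chinese version: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/ .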
