Web Scraping with Beautiful Soup

We can get data from web pages with Beautiful Soup.

It lets us parse the DOM and extract the data we want.

In this article, we’ll look at how to scrape HTML documents with Beautiful Soup.

CData

Beautiful Soup represents CDATA sections with the CData class, so we can create one and put it into a document.

For example, we can write:

from bs4 import BeautifulSoup, CData
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
cdata = CData("A CDATA block")
comment.replace_with(cdata)

print(soup.b.prettify())

We replaced the comment inside the b tag with the CData block, so the print function will print:

<b>
 <![CDATA[A CDATA block]]>
</b>
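To see what kinds of objects the example above is working with, here is a minimal sketch that reuses the same markup. It assumes only that Comment and NavigableString can be imported from bs4 alongside CData; the variable names is_comment, is_string and replaced are ours, for illustration.

```python
from bs4 import BeautifulSoup, CData, Comment, NavigableString

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')

# The only child of the b tag is a comment, so .string returns it.
comment = soup.b.string
is_comment = isinstance(comment, Comment)
is_string = isinstance(comment, NavigableString)
print(is_comment, is_string)

# Swap the comment for a CData block, then read the child back.
cdata = CData("A CDATA block")
comment.replace_with(cdata)
replaced = soup.b.string
print(type(replaced).__name__)
```

Both Comment and CData are special kinds of strings in the tree, which is why one can stand in for the other with replace_with.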

Going Down

We can move down the parse tree by using the name of the tag we want as an attribute of the soup or of another tag.

For example, we can write:

from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.head)
print(soup.title)

The first print call prints the head element.

And the second print call prints the title element.

So we get:

<head><title>The Dormouse's story</title></head>

and:

<title>The Dormouse's story</title>

respectively.

We can also get the b element by writing:

print(soup.body.b)

to get the first b element in body.

So we get:

<b>The Dormouse's story</b>

printed.

And:

print(soup.a)

to get the first a element.

So we get:

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

printed.
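Using a tag name as an attribute only gives us the first match. The short sketch below, on a made-up two-link snippet, shows that it returns the same element as find and that a tag missing from the document gives None.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two links, just for illustration.
html = '<p><a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, 'html.parser')

# Attribute access returns only the first matching tag.
first_link = soup.a
same_as_find = soup.find('a')
print(first_link['id'])
print(first_link == same_as_find)

# There is no table tag in the snippet, so we get None.
missing = soup.table
print(missing)
```

Because a missing tag gives None, it is worth checking the result before chaining further attribute lookups on it.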

We can use the find_all method to find all elements with the given tag name.

For example, we can write:

print(soup.find_all('a'))

And we get:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

printed.
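In practice we usually want data out of the matched tags rather than the tags themselves. As a small sketch on a trimmed copy of the links above, we can pull the href attribute and the text out of each result. The names hrefs and names are ours.

```python
from bs4 import BeautifulSoup

html = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>"""
soup = BeautifulSoup(html, 'html.parser')

# find_all returns a list-like collection of every matching tag.
links = soup.find_all('a')

# Attributes are read with dictionary-style access; text with get_text().
hrefs = [link['href'] for link in links]
names = [link.get_text() for link in links]
print(hrefs)
print(names)
```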

`.contents` and `.children`

We can get a tag’s direct children as a list with the contents attribute.

For example, we can write:

from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
head_tag = soup.head
print(head_tag.contents)

And we see:

[<title>The Dormouse's story</title>]

printed.

We can get the content of the title tag by writing:

from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
head_tag = soup.head
title_tag = head_tag.contents[0]
print(title_tag.contents)

We get the head element with soup.head.

And we get its first child, the title element, with head_tag.contents[0].

And we get the title tag’s content with title_tag.contents.

So we see:

["The Dormouse's story"]

printed.
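One thing to watch for is that contents includes the strings between tags, such as newlines, and not only the tags. Here is a minimal sketch on a shortened, made-up body, assuming Tag is imported from bs4.element so we can filter for it.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<body>
<p class="title">The Dormouse's story</p>
<p class="story">...</p>
</body>"""
soup = BeautifulSoup(html, 'html.parser')
body = soup.body

# contents holds every direct child, including newline strings.
all_children = body.contents

# Keep only the children that are tags.
tag_children = [child for child in all_children if isinstance(child, Tag)]

print(len(all_children), len(tag_children))
print([tag['class'] for tag in tag_children])
```

Filtering like this is handy when we index into contents and do not want the position to depend on how the HTML happens to be indented.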

We can also loop through the title_tag’s children with a for loop and the children attribute:

from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
head_tag = soup.head
title_tag = head_tag.contents[0]
for child in title_tag.children:
    print(child)

Then we see The Dormouse's story printed.
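The children attribute gives us the same items as contents, but as something we iterate over rather than a list we can index. As a rough sketch, we can turn it into a list and compare the two. The name as_list is ours.

```python
from bs4 import BeautifulSoup

html = "<head><title>The Dormouse's story</title></head>"
soup = BeautifulSoup(html, 'html.parser')
title_tag = soup.head.contents[0]

# children is meant for iteration; list() collects its items.
as_list = list(title_tag.children)

# The collected items match what contents already holds.
print(as_list == title_tag.contents)
print(as_list[0])
```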

`.descendants`

We can get all the descendants of an element, not only its direct children, with the descendants attribute.

For example, we can write:

from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
head_tag = soup.head
for child in head_tag.descendants:
    print(child)

Then we see:

<title>The Dormouse's story</title>
The Dormouse's story

printed.

The descendants attribute walks down the tree recursively, so we get the title element first and then the string inside it.
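To make the difference from children concrete, here is a small sketch on a shortened, made-up body that lists the tag names found by each attribute. It assumes Tag is imported from bs4.element so that we can skip the strings.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<body><p class="title"><b>The Dormouse's story</b></p>
<p class="story"><a id="link1">Elsie</a> and <a id="link2">Lacie</a></p></body>"""
soup = BeautifulSoup(html, 'html.parser')
body = soup.body

# children only looks one level down.
child_names = [c.name for c in body.children if isinstance(c, Tag)]

# descendants keeps going into nested tags, in document order.
descendant_names = [d.name for d in body.descendants if isinstance(d, Tag)]

print(child_names)
print(descendant_names)
```

The first list has only the two p tags, while the second also has the b and a tags nested inside them.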

Conclusion

Beautiful Soup can work with CData, and it lets us reach child nodes with tag names, find_all, contents, children, and descendants.
