Menu Close

Web Scraping with Beautiful Soup — Siblings, CSS Selectors, and Node Manipulation

We can get data from web pages with Beautiful Soup.

It lets us parse the DOM and extract the data we want.

In this article, we’ll look at how to scrape HTML documents with Beautiful Soup.

find_all_previous() and find_previous()

We can get all the nodes that comes before a given node with the find_all_previous method.

For example, if we have:

from bs4 import BeautifulSoup
import re
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
first_link = soup.a
print(first_link.find_all_previous('p'))

Then we see:

[<p class="story">Once upon a time there were three little sisters; and their names weren<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,n<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> andn<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;nand they lived at the bottom of a well.</p>, <p class="title"><b>The Dormouse's story</b></p>]

printed.

We get all the p elements that comes before the first a element.

The find_previous method returns the first node only.

CSS Selectors

We can find elements by tags:

from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.select("title"))
print(soup.select("p:nth-of-type(3)"))
print(soup.select("body a"))
print(soup.select("html head title"))
print(soup.select("head > title"))
print(soup.select("p > a"))
print(soup.select(".sister"))
print(soup.select("#link1"))

Then we get the elements with the given CSS selectors with the soup.select method.

Modifying the Tree

We can change the text content of an element by writing:

from bs4 import BeautifulSoup
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')

tag = soup.a
tag.string = "New link text."
print(tag)

We get the a element with soup.a .

Then we set the string property to set the text content.

And then we see print the tag and see:

<a href="http://example.com/">New link text.</a>

append()

We can add to a tag’s content with the append method.

For example, we can write:

from bs4 import BeautifulSoup
soup = BeautifulSoup("<a>Foo</a>", 'html.parser')
soup.a.append("Bar")

print(soup.a.contents)

Then we add 'Bar' to the a element as the child of a .

So soup.a.contents is:

[u'Foo', u'Bar']

extend()

The extend method adds every elemnt of a list to a tag.

For instance., we can write:

from bs4 import BeautifulSoup
soup = BeautifulSoup("<a>Foo</a>", 'html.parser')
soup.a.extend([' ', 'bar', ' ', 'baz'])
print(soup.a)

And we get:

<a>Foo bar baz</a>

as the result.

NavigableString() and .new_tag()

We can add navigable strings into an element.

For example, we can write:

from bs4 import BeautifulSoup, NavigableString
soup = BeautifulSoup("<b></b>", 'html.parser')
tag = soup.b
tag.append("Hello")
new_string = NavigableString(" there")
tag.append(new_string)
print(tag)
print(tag.contents)

And we get:

<b>Hello there</b>

for tag and:

[u'Hello', u' there']

for tag.contents .

Also, we can add a comment node with the Comment class:

from bs4 import BeautifulSoup, Comment
soup = BeautifulSoup("<b></b>", 'html.parser')
tag = soup.b
new_comment = Comment("Nice to see you.")
tag.append(new_comment)
print(tag)
print(tag.contents)

Then tag is:

<b><!--Nice to see you.--></b>

and tag.contents is:

[u'Nice to see you.']

Conclusion

We can get elements and add nodes to other nodes with Beautiful Soup.

Posted in Beautiful Soup, Python