
Web Scraping with Beautiful Soup — Attributes and Strings


We can get data from web pages with Beautiful Soup.

It lets us parse the DOM and extract the data we want.

In this article, we’ll look at how to manipulate tag attributes and work with strings in HTML documents parsed with Beautiful Soup.

Manipulating Attributes

We can manipulate attributes with Beautiful Soup.

For example, we can write:

from bs4 import BeautifulSoup

tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b
tag['id'] = 'verybold'
tag['another-attribute'] = 1
print(tag)
del tag['id']
del tag['another-attribute']
print(tag)

A tag behaves like a dictionary of its attributes, so we set and delete keys on the tag to add, change, and remove attributes.

Then the first print statement prints:

<b another-attribute="1" id="verybold">bold</b>

and the 2nd one prints:

<b>bold</b>
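Since the tag acts like a dictionary, reading attributes follows the same pattern. The sketch below is a minimal illustration, reusing the markup from the example above; the `title` attribute and the `'untitled'` default are made up for the example.

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b

# .attrs exposes the underlying attribute dictionary.
attrs_before = dict(tag.attrs)

# tag['title'] would raise KeyError because the attribute is missing;
# .get() returns None, or a default if one is given.
missing = tag.get('title')
fallback = tag.get('title', 'untitled')

# has_attr() checks for presence without reading the value.
has_id = tag.has_attr('id')

print(attrs_before, missing, fallback, has_id)
```

Using `get()` is usually the safer choice when scraping pages where an attribute may or may not be present.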

Multi-Valued Attributes

Some HTML attributes, such as class and rel, can hold multiple values. Beautiful Soup returns the value of such an attribute as a list.

For example, we can parse:

from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body bold"></p>', 'html.parser')
print(css_soup.p['class'])

Then we get ['body', 'bold'] printed.

When we assign a list to a multi-valued attribute, the values are joined with spaces when the tag is turned back into a string:

from bs4 import BeautifulSoup

rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', 'html.parser')
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)

The print statement will print:

<p>Back to the <a rel="index contents">homepage</a></p>
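Not every attribute is treated as multi-valued. The sketch below, using made-up markup, contrasts `class` with `id`, which the Beautiful Soup documentation describes as single-valued, and shows `get_attribute_list()` for code that always wants a list.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body bold" id="my id"></p>', 'html.parser')
p = soup.p

# class is multi-valued, so the value is split on whitespace into a list.
class_value = p['class']

# id is single-valued, so the value stays one string even with a space in it.
id_value = p['id']

# get_attribute_list() always returns a list, whatever the attribute.
id_as_list = p.get_attribute_list('id')

print(class_value, id_value, id_as_list)
```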

If we parse the document with the lxml HTML parser instead (the 'lxml' argument, which requires the lxml package), we get the same result:

from bs4 import BeautifulSoup

xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'lxml')
print(xml_soup.p['class'])

We still get:

['body', 'strikeout']

printed.
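If the list form is not wanted, the splitting can be switched off. The following is a minimal sketch that assumes a Beautiful Soup version supporting the `multi_valued_attributes` constructor argument (4.8.0 or later, as far as I know).

```python
from bs4 import BeautifulSoup

markup = '<p class="body strikeout"></p>'

# Default behaviour: class is split into a list of values.
as_list = BeautifulSoup(markup, 'html.parser').p['class']

# With multi_valued_attributes=None, the raw attribute string is kept.
as_string = BeautifulSoup(
    markup, 'html.parser', multi_valued_attributes=None
).p['class']

print(as_list)
print(as_string)
```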

NavigableString

We can get text within a tag. For example, we can write:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
print(type(tag.string))

Then we get:

<class 'bs4.element.NavigableString'>

printed.

The tag.string property returns the text inside the b tag as a NavigableString object.

We can convert it into a Python string by writing:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
# str() makes a plain Python string from the NavigableString
unicode_string = str(tag.string)
print(unicode_string)

Then ‘Extremely bold’ is printed.
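To make the difference between the two types concrete, here is a small sketch using the same markup. It relies on `NavigableString` being a subclass of `str` that also knows its place in the parse tree; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
nav = soup.b.string

# A NavigableString is a str subclass, so string methods work on it directly.
is_str_subclass = isinstance(nav, str)
upper = nav.upper()

# It also keeps a link back to the tag that contains it.
parent_name = nav.parent.name

# str() gives a plain string with no reference to the parse tree.
plain = str(nav)
plain_type = type(plain)

print(is_str_subclass, upper, parent_name, plain_type)
```

Converting to a plain string is worth doing when the text is kept long after parsing, since the plain copy does not hold on to the rest of the tree.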

We can replace a navigable string with a different string.

To do that, we write:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
print(tag.string)
tag.string.replace_with("No longer bold")
print(tag.string)

Then we see:

Extremely bold
No longer bold

printed.
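A short sketch of what happens to the old string may help here. It assumes the same markup as above and also shows assigning to `tag.string`, which is another way to replace a tag's text.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
old = tag.string

# replace_with() swaps the string in the tree and returns the replaced element.
returned = old.replace_with('No longer bold')

# The old string is now detached from the tree.
old_parent = old.parent

# Assigning to .string replaces the tag's contents with the new text.
tag.string = 'Bold again'
result = str(tag)

print(returned, old_parent, result)
```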

BeautifulSoup Object

The BeautifulSoup object represents the whole parsed document.

For example, if we have:

from bs4 import BeautifulSoup

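# The "xml" parser requires the lxml package
doc = BeautifulSoup("<document><content/>INSERT FOOTER HERE</document>", "xml")
footer = BeautifulSoup("<footer>Here's the footer</footer>", "xml")
# string= is the current name of the older text= argument
doc.find(string="INSERT FOOTER HERE").replace_with(footer)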
print(doc)
print(doc.name)

Then we see:

<?xml version="1.0" encoding="utf-8"?>
<document><content/><footer>Here's the footer</footer></document>

printed from the first print call.

And:

[document]

printed from the 2nd print call.
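For most purposes the BeautifulSoup object can be used like a tag. The sketch below illustrates this with made-up HTML and the built-in parser, so it does not need lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup(
    '<html><body><p>one</p><p>two</p></body></html>', 'html.parser'
)

# BeautifulSoup is a subclass of Tag, so searching works on it directly.
is_tag = isinstance(soup, Tag)
paragraphs = [p.get_text() for p in soup.find_all('p')]

# It has no real tag name, so it reports the special name '[document]'.
name = soup.name

print(is_tag, name, paragraphs)
```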

Comments and Other Special Strings

Beautiful Soup can parse comments and other special strings.

For example, we can write:

from bs4 import BeautifulSoup

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
print(type(comment))
print(soup.b.prettify())

The soup.b.string property returns the comment inside the b element as a Comment object, which is a special type of NavigableString.

So the first print call prints:

<class 'bs4.element.Comment'>

And the 2nd print call prints:

<b>
 <!--Hey, buddy. Want to buy a used parser?-->
</b>
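When scraping, comments are often something to collect or strip out. The following is one possible sketch with made-up markup: it collects every string in the document, keeps the ones that are `Comment` objects, and then removes them from the tree with `extract()`.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = '<p>Visible text<!-- hidden note --></p><!-- another -->'
soup = BeautifulSoup(markup, 'html.parser')

# find_all(string=True) returns every string node, comments included.
comments = [s for s in soup.find_all(string=True) if isinstance(s, Comment)]

# A Comment holds the text between the comment markers.
texts = [str(c) for c in comments]

# extract() removes each comment from the parse tree.
for c in comments:
    c.extract()

cleaned = str(soup)

print(texts)
print(cleaned)
```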

Conclusion

With Beautiful Soup, we can read and change tag attributes, including multi-valued ones, and work with the strings and comments inside tags.
