Processing webpages with Python

Python can be a useful tool for parsing HTML files.

The first thing we need to do is access the file. For this, we can use Python's urllib library:

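# Python 3; in Python 2 the import was `from urllib import urlopen`
from urllib.request import urlopen

url = 'http://some.url'  # placeholder address
# read() returns bytes, so decode to get text (assuming the page is UTF-8)
content = urlopen(url).read().decode('utf-8')
print(content)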

The code above should print the HTML source of the page at that URL.
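
As a slightly more defensive variant of the download step, here is a minimal sketch. The function name fetch_html, the default_encoding parameter and the 10 second timeout are my own assumptions for illustration; the only library calls used are urlopen and the get_content_charset method of the response headers.

```python
from urllib.request import urlopen


def fetch_html(url, default_encoding="utf-8"):
    """Return the document at `url` as text (hypothetical helper)."""
    # The timeout value is an arbitrary choice for this sketch
    with urlopen(url, timeout=10) as response:
        raw = response.read()
        # Prefer the charset announced in the Content-Type header, if any
        charset = response.headers.get_content_charset() or default_encoding
    # Replace undecodable bytes instead of raising an exception
    return raw.decode(charset, errors="replace")
```

With this in place, content = fetch_html('http://some.url') would give the same text as the code above, without hard-coding the encoding.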

The second step is to select the desired part of the text. Suppose we want to extract the content of a table in the middle of the page. We can use Python's regular expression module, re.

import re

# raw string; '.*?' is non-greedy, so each <tr>...</tr> row matches separately
pattern = r'<tr>.*?</tr>'
m = re.findall(pattern, content)

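The code above stores in m a list of all occurrences of ‘pattern’. In this pattern, ‘.’ matches any character except a newline (pass re.DOTALL as a third argument to findall to match newlines too) and ‘*’ means we are interested in 0 or more repetitions. The ‘?’ character makes the match minimal (non-greedy), so it stops at the first ‘</tr>’ it finds.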

For example, if content was:

<tr>hello</tr> something <tr>world</tr>

The list would be

['<tr>hello</tr>', '<tr>world</tr>']

But if we didn’t include the ‘?’ character, the list would be

['<tr>hello</tr> something <tr>world</tr>']
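
Real table rows usually span several lines, and often we only want the text between the tags. The sketch below covers both cases; the sample string and the names row_pattern and rows are made up for illustration.

```python
import re

# Sample input where the second row spans several lines
content = "<tr>hello</tr> something <tr>\nworld\n</tr>"

# The parentheses capture only the text between the tags;
# re.DOTALL lets '.' match newline characters as well.
row_pattern = re.compile(r"<tr>(.*?)</tr>", re.DOTALL)

# With a single group, findall returns the captured text of each match
rows = [text.strip() for text in row_pattern.findall(content)]
print(rows)  # ['hello', 'world']
```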

I wrote similar code for a very specific task and I probably won’t use it again. I was advised not to parse HTML files using regular expressions. An alternative in Python is to use an HTML/XML parsing library, for example, Beautiful Soup.
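
To give an idea of what the parser-based approach looks like, here is a rough sketch. Beautiful Soup is a third-party package, so this example swaps in html.parser from the standard library instead; it is not Beautiful Soup's API. The class name RowCollector and the sample string are assumptions for illustration.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text inside each <tr> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows, one string per <tr>
        self._current = None  # text pieces of the open row, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._current = []

    def handle_data(self, data):
        # Keep text only while we are inside a <tr>
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag == "tr" and self._current is not None:
            self.rows.append("".join(self._current).strip())
            self._current = None


# Attributes on the tag would break the '<tr>.*?</tr>' pattern, but not the parser
content = '<tr class="odd">hello</tr> something <tr>world</tr>'
collector = RowCollector()
collector.feed(content)
collector.close()
print(collector.rows)  # ['hello', 'world']
```

With Beautiful Soup installed, the rough equivalent would be BeautifulSoup(content, "html.parser").find_all("tr").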
