Python can be a useful tool for parsing HTML files.
The first thing we need to do is access the file. For this, we can use Python's urllib library:
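    # Python 3: urlopen lives in urllib.request
    from urllib.request import urlopen

    url = 'http://some.url'
    # read() returns bytes; decode to text (assuming the page is UTF-8)
    content = urlopen(url).read().decode('utf-8')
    print(content)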
The code above should print the HTML source of the page at that URL.
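Real pages are not always encoded as UTF-8, so here is a slightly more defensive sketch of the download step. It assumes Python 3; the function name `fetch_html` and the `timeout` default are made up for illustration, and the fallback to UTF-8 is an assumption rather than a rule.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the page at `url` as text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Use the charset from the Content-Type header when present;
        # falling back to UTF-8 is an assumption, not a guarantee.
        charset = response.headers.get_content_charset() or 'utf-8'
    # Replace undecodable bytes rather than raising on a bad page.
    return raw.decode(charset, errors='replace')
```

Calling `content = fetch_html('http://some.url')` would then give the same kind of `content` string used in the rest of this post.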
The second part consists of selecting the desired part of the text. Suppose we want to extract the contents of a table in the middle of the page. We can use Python's regular expressions (the re module):
    import re

    pattern = '<tr>.*?</tr>'
    m = re.findall(pattern, content)
The code above returns in m a list of all occurrences of ‘pattern’. In this pattern, ‘.’ matches any character (except a newline, unless the re.DOTALL flag is passed) and ‘*’ means zero or more repetitions. The ‘?’ after ‘*’ makes the match minimal (non-greedy): it stops at the first ‘</tr>’ it can reach.
For example, if content was:
<tr>hello</tr> something <tr>world</tr>
The list would be:
['<tr>hello</tr>', '<tr>world</tr>']
But if we didn’t include the ‘?’ character, the list would be:
['<tr>hello</tr> something <tr>world</tr>']
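The difference between the minimal and the greedy match can be checked directly with a short script. The last two lines also show what happens when a row spans several lines, which is common in real pages; the `multiline` sample string is made up for illustration.

```python
import re

content = '<tr>hello</tr> something <tr>world</tr>'

# Non-greedy: '.*?' stops at the first '</tr>' it can reach.
lazy = re.findall(r'<tr>.*?</tr>', content)

# Greedy: '.*' runs on to the last '</tr>' in the string.
greedy = re.findall(r'<tr>.*</tr>', content)

print(lazy)    # ['<tr>hello</tr>', '<tr>world</tr>']
print(greedy)  # ['<tr>hello</tr> something <tr>world</tr>']

# A row that spans several lines (made-up sample).
multiline = '<tr>\n<td>hello</td>\n</tr>'

# By default '.' does not match '\n', so nothing is found.
no_flag = re.findall(r'<tr>.*?</tr>', multiline)
# With re.DOTALL, '.' also matches newlines and the whole row is found.
with_flag = re.findall(r'<tr>.*?</tr>', multiline, re.DOTALL)

print(no_flag)    # []
print(with_flag)  # ['<tr>\n<td>hello</td>\n</tr>']
```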
I wrote similar code for a very specific task and probably won’t use it again. I was advised not to parse HTML files with regular expressions. An alternative for Python is to use an HTML/XML parsing library, for example, Beautiful Soup.
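Beautiful Soup is a third-party package, so as a rough illustration of the parser-based approach, the sketch below uses the standard library's html.parser module instead. This is not Beautiful Soup's API. The class name `RowTextParser` and the sample input are hypothetical, and the code only collects the text inside each row, ignoring nested tables.

```python
from html.parser import HTMLParser


class RowTextParser(HTMLParser):
    """Collect the text found inside each <tr>...</tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows, one string per <tr>
        self._current = None  # text pieces of the row being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._current = []

    def handle_endtag(self, tag):
        if tag == 'tr' and self._current is not None:
            self.rows.append(''.join(self._current).strip())
            self._current = None

    def handle_data(self, data):
        # Only keep text that appears inside a row.
        if self._current is not None:
            self._current.append(data)


parser = RowTextParser()
parser.feed('<tr>hello</tr> something <tr class="x">\n<td>world</td>\n</tr>')
parser.close()
print(parser.rows)  # ['hello', 'world']
```

Unlike the regular expression, this handles rows with attributes and rows that span several lines without any extra flags.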