At Balihoo, we recently encountered a problem. We needed to extract all the text and formatting from past blog posts we had written, but our service provider didn’t have a tool to do that. Since there were over 600 posts, this isn’t the kind of project you want to do by hand. With a bit of coding I was able to scrape data off of webpages, getting the content from every single blog post- In less than 5 minutes.
If you are unfamiliar with VBA and Internet Explorer, be sure to check out my introduction post about Excel, VBA, and IE.
The task was two-fold. First, we needed to get the URL for every single blog post we’ve created. Secondly there needed to be a way to visit each one of those pages and extract the actual copy from the post. I made a simple Excel sheet to store this data; a column for URL, h1 tag, h2, blog post (with HTML code), and blog post (just text copy, for readability).
URL grabbing scraper
Our blog features 10 posts per page, and the URL is always formatted “http://marketing.balihoo.com/blog/balihoo-blog/page/X” where X is the a number 1-67. The scraped needs to visit each page (so 67 total) and extract all 10 URLs that link to the actual blog posts from the URL. The final goal should be a list of 670 URLs of the actual blog posts.
Spending a few minutes browsing through the page source for a random page, its easy to notice the post title and URL are incased in div class “post-title”.
So, for the URL scraper, my coding outline is:
Loop Page 1 to 67 Open page X of posts Loop 1 to 10 Find URL in "post-title" div, write to Excel sheet Find text Post title in "post-title" div, write to Excel sheet End Loop End Loop
Translated into VBA specific code, it looks like this:
Sub URL_Lookup() Dim objElement As Object Dim objCollection As Object Dim navtar As String Dim URL As String Dim i As Long 'i is our div counter Dim j As Long 'j is our post-title div counter Dim k As Integer 'k is our page counter Set IE = CreateObject("InternetExplorer.Application") IE.Visible = False 'Saves you from having to see this k = Range("B2").Value 'b2 is the start page number, and b3 is the stop page While k < Range("B3").Value + 1 navtar = Range("B1").Value & k 'b1 is the cell with the URL IE.Navigate navtar waitTime = 5 Start = Timer While IE.Busy 'Basically just a loading check DoEvents If Timer > Start + waitTime Then 'if it doesn't load, quit IE IE.Quit Set IE = Nothing ' Clean up Set objElement = Nothing Set objCollection = Nothing Application.StatusBar = "" Exit Sub End If Wend Set objCollection = IE.document.getElementsByTagName("div") i = 0 j = 0 While i < objCollection.Length And j < 10 If objCollection(i).className = "post-title" Then URL = objCollection(i).innerHTML 'Trimming URL = Right(URL, Len(URL) - InStr(URL, Chr(34))) 'Trimming URL = Left(URL, InStr(URL, Chr(34)) - 1) 'Trimming ActiveCell.Value = URL ActiveCell.Offset(1, 0).Activate j = 1 + 1 'next post-title div End If i = i + 1 'next div Wend k = k + 1 'next page Wend IE.Quit Set IE = Nothing ' Clean up Set objElement = Nothing Set objCollection = Nothing Application.StatusBar = "" End Sub
The real bulk of the lifting is found in these two lines:
Set objCollection = IE.document.getElementsByTagName("div")
If objCollection(i).className = "post-title" Then
The first line of code finds all the div tags in the entire source code of the page that is rendered, and the second line of code then goes and look at each div tag and looks for the div class=”post-title”. Inside the post-title div is both the text version of the post title, and more importantly the unique URL for each post. Really, the rest of the script is supporting these two elements, and taking the results and doing something useful (writing to a line of an Excel spreadheet) with them.
When I executed this script it went through every one of 67 pages, and extracted all 10 “post-title” URLs from each page. It then wrote each URL to the first column of each row of an excel spreadsheet. Check back next week for the coding for the script that will actually extract all the post data from the URL list generated above!