At Balihoo, we recently encountered a problem. We needed to extract all the text and formatting from past blog posts we had written, but our service provider didn’t have a tool to do that. Since there were over 600 posts, this isn’t the kind of project you want to do by hand. With a bit of coding, I was able to scrape the data off of the webpages, getting the content from every single blog post, in less than 5 minutes.

If you are unfamiliar with VBA and Internet Explorer, be sure to check out my introduction post about Excel, VBA, and IE.

The task was two-fold. First, we needed to get the URL for every single blog post we’ve created. Second, there needed to be a way to visit each of those pages and extract the actual copy from the post. I made a simple Excel sheet to store this data: a column each for the URL, the h1 tag, the h2 tag, the blog post with HTML code, and the blog post as plain text copy (for readability).

[Image 1: The Excel sheet layout]

URL-grabbing scraper
Our blog features 10 posts per page, and the URL is always formatted “http://marketing.balihoo.com/blog/balihoo-blog/page/X”, where X is a number from 1 to 67. The scraper needs to visit each of those 67 pages and extract all 10 URLs that link to the actual blog posts. The end goal is a list of 670 URLs for the actual blog posts.
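Since the pagination is just a numeric suffix, building each page’s address is a simple string concatenation. Here’s a minimal sketch (in the full script below, the base URL is read from cell B1 instead of being hard-coded):

Dim pageURL As String
Dim k As Integer
For k = 1 To 67
    pageURL = "http://marketing.balihoo.com/blog/balihoo-blog/page/" & k 'e.g. .../page/1 through .../page/67
Next k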

After spending a few minutes browsing the page source of a random page, it’s easy to notice that the post title and URL are enclosed in a div with class “post-title”.

[Image 2: Page source showing the “post-title” div]
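For reference, the markup inside each of those divs looks roughly like this (a hypothetical reconstruction, not copied from the live page; what matters for the trimming code below is that the href is the first quoted value in the div’s innerHTML):

<div class="post-title">
    <a href="http://marketing.balihoo.com/blog/balihoo-blog/some-post">Some Post Title</a>
</div>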

So, for the URL scraper, my coding outline is:

Loop X from 1 to 67
  Open page X of posts
  Loop 1 to 10
    Find URL in "post-title" div, write to Excel sheet
    Find post title text in "post-title" div, write to Excel sheet
  End Loop
End Loop

Translated into VBA-specific code, it looks like this:

Sub URL_Lookup()
    Dim IE As Object
    Dim objElement As Object
    Dim objCollection As Object
    Dim navtar As String
    Dim URL As String
    Dim waitTime As Single 'Seconds to wait for a page before giving up
    Dim Start As Single 'Timer value when navigation began
    Dim i As Long 'i is our div counter
    Dim j As Long 'j is our post-title div counter
    Dim k As Integer 'k is our page counter

    Set IE = CreateObject("InternetExplorer.Application")
    IE.Visible = False 'Saves you from having to see this
    k = Range("B2").Value 'b2 is the start page number, and b3 is the stop page
    While k < Range("B3").Value + 1
         navtar = Range("B1").Value & k 'b1 is the cell with the URL
         IE.Navigate navtar
         waitTime = 5 'Give each page up to 5 seconds to load
         Start = Timer
         While IE.Busy Or IE.ReadyState <> 4 'Loading check (4 = READYSTATE_COMPLETE)
             DoEvents
             If Timer > Start + waitTime Then 'if it doesn't load, quit IE
                IE.Quit 
                Set IE = Nothing ' Clean up
                Set objElement = Nothing
                Set objCollection = Nothing
                Application.StatusBar = ""
                Exit Sub
            End If
        Wend
        Set objCollection = IE.document.getElementsByTagName("div")
        i = 0
        j = 0

        While i < objCollection.Length And j < 10
            If objCollection(i).className = "post-title" Then
                URL = objCollection(i).innerHTML 'Grab the raw HTML inside the div
                URL = Right(URL, Len(URL) - InStr(URL, Chr(34))) 'Drop everything up to the first double quote
                URL = Left(URL, InStr(URL, Chr(34)) - 1) 'Keep everything before the next quote, leaving just the href value
                ActiveCell.Value = URL
                ActiveCell.Offset(1, 0).Activate
                j = j + 1 'next post-title div
            End If
            i = i + 1 'next div
        Wend
        k = k + 1 'next page
    Wend

    IE.Quit
    Set IE = Nothing ' Clean up
    Set objElement = Nothing
    Set objCollection = Nothing
    Application.StatusBar = ""

End Sub
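To run it, put the base page URL (everything before the page number) in cell B1, the start page number in B2, and the stop page in B3, then select the cell where the first URL should go and run the macro; each URL is written one row below the last.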

The bulk of the heavy lifting happens in these two lines:

Set objCollection = IE.document.getElementsByTagName("div")

If objCollection(i).className = "post-title" Then

The first line finds every div tag in the entire rendered source code of the page, and the second line then inspects each of those div tags, looking for class=”post-title”. Inside the post-title div is both the text version of the post title and, more importantly, the unique URL for each post. Really, the rest of the script exists to support these two elements and to do something useful with the results (writing them to a row of an Excel spreadsheet).
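As an aside, if the page renders in a document mode where getElementsByClassName is supported (IE9 and later), the scan over every div could be replaced with a direct lookup. This is only a sketch of that alternative, not what the script above does:

Dim titleDivs As Object
Dim n As Long
Set titleDivs = IE.document.getElementsByClassName("post-title")
For n = 0 To titleDivs.Length - 1
    'Each post-title div holds one anchor; its href is the post URL
    ActiveCell.Value = titleDivs(n).getElementsByTagName("a")(0).href
    ActiveCell.Offset(1, 0).Activate
Next n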

When I executed this script, it went through every one of the 67 pages and extracted all 10 “post-title” URLs from each. It then wrote each URL to the first column of its own row in the Excel spreadsheet. Check back next week for the script that actually extracts the post data from the URL list generated above!
