This post includes the second half of the scripting required for the web-scraping project described in last week's post.

With a completed list of the blog posts and their corresponding URLs, the next step is to code a scraper that will visit each page, search out the blog content and any relevant tags, then copy that data to the Excel sheet.

For each URL in List
  Open URL in Internet Explorer
  Look for the post-body div, write its HTML to the Excel file
  Trim that HTML into plain text, write to Excel
  Look for the post-date and post-author divs, write to Excel
  Look for the <h2> tag, write to Excel
Next

The actual code:

Sub Content_Extractor()
    Dim IE As Object
    Dim objElement As Object
    Dim objCollection As Object
    Dim navtar As String
    Dim copy As String
    Dim i As Long
    Dim waitTime As Single
    Dim Start As Single

    Set IE = CreateObject("InternetExplorer.Application")
    IE.Visible = False

    ' Loop until the URL column (five columns left of the active cell) runs out
    While ActiveCell.Offset(0, -5).Value <> ""
        navtar = ActiveCell.Offset(0, -5).Value
        IE.Navigate navtar
        waitTime = 5 ' seconds to wait before giving up on a page
        Start = Timer
        While IE.Busy
            DoEvents
            If Timer > Start + waitTime Then
                IE.Quit
                Set IE = Nothing ' Clean up
                Set objElement = Nothing
                Set objCollection = Nothing
                Application.StatusBar = ""
                ActiveCell.Offset(0, 0).Value = "Error (page not loading)"
                Exit Sub
            End If
        Wend
        ' Scan every div on the page for the class names we want
        Set objCollection = IE.document.getElementsByTagName("div")
        i = 0
        While i < objCollection.Length
            If objCollection(i).className = "post-body" Then
                copy = objCollection(i).innerText
                ActiveCell.Value = copy
                ActiveCell.Offset(0, -1).Value = objCollection(i).innerHTML
                'ActiveCell.Offset(1, 0).Activate
            End If
            If objCollection(i).className = "post-date" Then
                Debug.Print "Found post date: " & objCollection(i).innerText
                ActiveCell.Offset(0, -2).Value = objCollection(i).innerText
                'ActiveCell.Offset(1, 0).Activate
            End If
            If objCollection(i).className = "post-author" Then
                Debug.Print "Found author: " & objCollection(i).innerText
                ActiveCell.Offset(0, -3).Value = objCollection(i).innerText
                'ActiveCell.Offset(1, 0).Activate
            End If
            i = i + 1
        Wend 
        Set objCollection = IE.document.getElementsByTagName("h2")
        i = 0
        While i < objCollection.Length
            ActiveCell.Offset(0, -4).Value = objCollection(i).innerText
            i = i + 1
        Wend
        ActiveCell.Offset(1, 0).Activate
    Wend
        
    IE.Quit
    Set IE = Nothing ' Clean up
    Set objElement = Nothing
    Set objCollection = Nothing
    Application.StatusBar = ""
End Sub

As with last time, the bulk of the heavy lifting is done by:

        Set objCollection = IE.document.getElementsByTagName("div")
        i = 0
        While i < objCollection.Length
            If objCollection(i).className = "post-body" Then

The idea is to set objCollection to the complete set of div tags, then look for the one with .className = “post-body”. That div contains all of the copy of the blog post we want to scrape, which we can extract with objCollection(i).innerHTML if we want the HTML or objCollection(i).innerText if we just want the plain text. The script also looks for two other divs (post-date and post-author) in the same object collection, then switches collections to look for the h2 tag.
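As a side note, if the page renders in an IE9-or-later document mode, the DOM also exposes getElementsByClassName, which skips the manual scan of every div. A minimal sketch, assuming the same “post-body” class name:

    ' Sketch only: getElementsByClassName needs an IE9+ document mode,
    ' so it may fail on pages rendered in older compatibility modes.
    Dim bodyDivs As Object
    Set bodyDivs = IE.document.getElementsByClassName("post-body")
    If bodyDivs.Length > 0 Then
        ActiveCell.Value = bodyDivs(0).innerText               ' plain text
        ActiveCell.Offset(0, -1).Value = bodyDivs(0).innerHTML ' raw HTML
    End If

The class-scanning loop in the script above is the safer default, since it works regardless of document mode.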

The rest of the script is either setup for that, script cleanup, or code needed to support the writing of the data to Excel. Questions on the script? Email me or comment below.
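If you'd rather not depend on where the cursor happens to be when the macro starts, the same Excel-writing step can be driven from explicit cell references instead of ActiveCell. A minimal sketch, assuming the URLs live in column A starting at row 2 and the scraped text goes in column F of the same row:

    Sub Write_By_Row_Example()
        ' Sketch only: column layout (URLs in A, output in F) is an assumption
        Dim ws As Worksheet
        Dim r As Long
        Set ws = ActiveSheet
        r = 2 ' first data row
        While ws.Cells(r, 1).Value <> ""
            ws.Cells(r, 6).Value = "scraped copy goes here"
            r = r + 1
        Wend
    End Sub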


1 thought on “Visual Basic & div tags Part 2- How to scrape data off of webpages”

  1. Rupert Dick

    Really useful – I have been using SEOtools but it is way too slow.
    I found that it needed this:

    While ie.Busy Or ie.ReadyState <> 4

    otherwise it can read the previous page

    My whole loop is now like this

    While ActiveCell.Offset(0, -3).Value <> ""
        ie.Stop
        navtar = ActiveCell.Offset(0, -3).Value
        Application.StatusBar = "Loading " & navtar
        'ActiveCell.Offset(0, 1).Value = navtar
        On Error GoTo navError

        ie.Navigate navtar
        waitTime = 10
        Start = Timer
        While ie.Busy Or ie.ReadyState <> 4
            DoEvents
            If Timer > Start + waitTime Then
                GoTo navError
            End If
        Wend

        ActiveCell.Offset(0, -2).Value = ie.document.Title
        Set objCollection = ie.document.getElementsByTagName("H1")

        For i = 0 To objCollection.Length - 1
            ActiveCell.Offset(0, -1).Value = objCollection(i).innerText
        Next
        ActiveCell.Offset(0, 0).Value = "Done " & Now()
        GoTo nextLoop

    navError:
        ActiveCell.Offset(0, 0).Value = "URL Error"

    nextLoop:
        ActiveCell.Offset(1, 0).Activate
    Wend
