This post includes the second half of the scripting required for the web-scraping project described in last week's post.
With a completed list of the blog posts and their corresponding URLs, the next step is to code a scraper that will visit each page, find the blog content and any relevant tags, and then copy that data to the Excel sheet.
For each URL in List
    Open URL in Explorer
    Look for <h1> tag, write to Excel file
    Look for <h2> tag, write to Excel file
    Look for post body div, write HTML
    Trim HTML into plain text, write to Excel
Next
The actual code:
Sub Content_Extractor()
    Dim IE As Object          ' late-bound Internet Explorer instance
    Dim objElement As Object
    Dim objCollection As Object
    Dim navtar As String
    Dim copy As String
    Dim i As Long
    Dim j As Long
    Dim k As Integer
    Dim waitTime As Long
    Dim Start As Single

    Set IE = CreateObject("InternetExplorer.Application")
    IE.Visible = False

    ' Walk down the sheet until the URL column (five columns left) is blank
    While ActiveCell.Offset(0, -5).Value <> ""
        navtar = ActiveCell.Offset(0, -5).Value
        IE.Navigate navtar

        ' Give the page five seconds to load before bailing out
        waitTime = 5
        Start = Timer
        While IE.Busy
            DoEvents
            If Timer > Start + waitTime Then
                IE.Quit
                Set IE = Nothing ' Clean up
                Set objElement = Nothing
                Set objCollection = Nothing
                Application.StatusBar = ""
                ActiveCell.Offset(0, 0).Value = "Error (page not loading)"
                Exit Sub
            End If
        Wend

        ' Scan every <div> for the classes we care about
        Set objCollection = IE.document.getElementsByTagName("div")
        i = 0
        While i < objCollection.Length
            If objCollection(i).className = "post-body" Then
                copy = objCollection(i).innerText
                ActiveCell.Value = copy
                ActiveCell.Offset(0, -1).Value = objCollection(i).innerHTML
                'ActiveCell.Offset(1, 0).Activate
            End If
            If objCollection(i).className = "post-date" Then
                Debug.Print "Found post date: " & objCollection(i).innerText
                ActiveCell.Offset(0, -2).Value = objCollection(i).innerText
                'ActiveCell.Offset(1, 0).Activate
            End If
            If objCollection(i).className = "post-author" Then
                Debug.Print "Found author: " & objCollection(i).innerText
                ActiveCell.Offset(0, -3).Value = objCollection(i).innerText
                'ActiveCell.Offset(1, 0).Activate
            End If
            i = i + 1
        Wend

        ' Write any <h2> text (later matches overwrite earlier ones)
        Set objCollection = IE.document.getElementsByTagName("h2")
        i = 0
        While i < objCollection.Length
            ActiveCell.Offset(0, -4).Value = objCollection(i).innerText
            i = i + 1
        Wend

        ActiveCell.Offset(1, 0).Activate
    Wend

    IE.Quit
    Set IE = Nothing ' Clean up
    Set objElement = Nothing
    Set objCollection = Nothing
    Application.StatusBar = ""
End Sub
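Before running it, note the sheet layout the Offset calls assume. The column assignments below are inferred from the offsets in the script; the header names and the Setup_Sheet macro itself are my own illustration, not part of the original workbook:

Sub Setup_Sheet()
    ' Hypothetical header row matching the Offset(0, -5) .. Offset(0, 0)
    ' layout used by Content_Extractor. Column F holds the plain text,
    ' so the macro should be started with F2 selected.
    Range("A1").Value = "URL"          ' ActiveCell.Offset(0, -5)
    Range("B1").Value = "Subtitle"     ' h2, ActiveCell.Offset(0, -4)
    Range("C1").Value = "Author"       ' post-author, ActiveCell.Offset(0, -3)
    Range("D1").Value = "Date"         ' post-date, ActiveCell.Offset(0, -2)
    Range("E1").Value = "HTML"         ' post-body innerHTML, ActiveCell.Offset(0, -1)
    Range("F1").Value = "Plain text"   ' post-body innerText, ActiveCell
    Range("F2").Select
End Sub

With that layout, the macro walks down the sheet one row per URL and stops when it hits a blank cell in the URL column.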
As with last time, the bulk of the heavy lifting is done by:
Set objCollection = IE.document.getElementsByTagName("div")
i = 0
While i < objCollection.Length
    If objCollection(i).className = "post-body" Then
The idea is to set objCollection to be the complete set of div tags, then start looking for the one with .className = "post-body". That div contains all of the copy of the blog post we are looking to scrape, which we can extract with objCollection(i).innerHTML if we want the HTML or objCollection(i).innerText if we just want the plain text. The script also looks for other divs (post-date, post-author) in the same object collection, then switches to a fresh collection to look for the h2 tag.
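If you find yourself hunting for several different classes, the same loop can be wrapped in a small helper. This is a minimal sketch using the same late-bound objects as the main script; the function name FindDivByClass is my own, not part of the original macro:

' Hypothetical helper: returns the first <div> whose class matches
' className, or Nothing if no such div exists on the page.
Function FindDivByClass(IE As Object, className As String) As Object
    Dim divs As Object
    Dim i As Long
    Set divs = IE.document.getElementsByTagName("div")
    For i = 0 To divs.Length - 1
        If divs(i).className = className Then
            Set FindDivByClass = divs(i)
            Exit Function
        End If
    Next i
    Set FindDivByClass = Nothing
End Function

With that in place, the post-body lookup collapses to something like Set postBody = FindDivByClass(IE, "post-body"), followed by a check that postBody is not Nothing before reading .innerText or .innerHTML.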
The rest of the script is either setup for that, script cleanup, or code needed to write the data to Excel. Questions on the script? Email me or comment below.
Really useful – I have been using SEOtools but it is way too slow
I found that it needed this:
While ie.Busy Or ie.ReadyState <> 4
otherwise it can read the previous page
My whole loop now looks like this:
While ActiveCell.Offset(0, -3).Value <> ""
    ie.Stop
    navtar = ActiveCell.Offset(0, -3).Value
    Application.StatusBar = "Loading " & navtar
    'ActiveCell.Offset(0, 1).Value = navtar
    On Error GoTo navError
    ie.Navigate navtar
    waitTime = 10
    Start = Timer
    While ie.Busy Or ie.ReadyState <> 4
        DoEvents
        If Timer > Start + waitTime Then
            GoTo navError
        End If
    Wend
    ActiveCell.Offset(0, -2).Value = ie.document.Title
    Set objCollection = ie.document.getElementsByTagName("H1")
    For i = 0 To objCollection.Length - 1
        ActiveCell.Offset(0, -1).Value = objCollection(i).innerText
    Next
    ActiveCell.Offset(0, 0).Value = "Done " & Now()
    GoTo nextLoop
navError:
    ActiveCell.Offset(0, 0).Value = "URL Error"
nextLoop:
    ActiveCell.Offset(1, 0).Activate
Wend