A Tour of Go Exercises: Web Crawler

Categories: Code, Software Development

To wrap up what I started in my last post, here is my solution to the A Tour of Go: Web Crawler exercise. Be sure to try it yourself before looking at my solution if you want to avoid spoilers…

package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

var empty = struct{}{}

type UrlSet struct {
	mu sync.Mutex
	m map[string]struct{}
}

func NewUrlSet() (us *UrlSet) {
	us = &UrlSet{}

	us.m = make(map[string]struct{})

	return us
}

func (us *UrlSet) ShouldProcessUrl(url string) bool {
	us.mu.Lock()
	defer us.mu.Unlock()
	if _, ok := us.m[url]; ok {
		return false
	}

	us.m[url] = empty

	return true
}

type Fetcher interface {
	// Fetch returns the body of URL and
	// a slice of URLs found on that page.
	Fetch(url string) (body string, urls []string, err error)
}

// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, maxdepth int, fetcher Fetcher) {
	us := NewUrlSet()

	var worker func(url string, depth int, wg *sync.WaitGroup)

	worker = func(url string, depth int, wg *sync.WaitGroup) {
		if wg != nil {
			defer wg.Done()
		}
		
		// TODO: Fetch URLs in parallel. - Done
		// TODO: Don't fetch the same URL twice. - Done

		// Check depth first: otherwise a URL reached with an
		// exhausted depth budget would be marked as visited and
		// never fetched, even if reachable via a shorter path.
		if depth <= 0 {
			return
		}

		if !us.ShouldProcessUrl(url) {
			return
		}
		
		body, urls, err := fetcher.Fetch(url)
		if err != nil {
			fmt.Println(err)
			return
		}
		
		fmt.Printf("found: %s %q\n", url, body)
		
		var wgChild sync.WaitGroup
		
		for _, u := range urls {
			wgChild.Add(1)
			go worker(u, depth-1, &wgChild)
		}
		
		wgChild.Wait()
	}

	worker(url, maxdepth, nil)
}

func main() {
	Crawl("https://golang.org/", 4, fetcher)
}

// fakeFetcher is a Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult

type fakeResult struct {
	body string
	urls []string
}

func (f fakeFetcher) Fetch(url string) (string, []string, error) {
	if res, ok := f[url]; ok {
		// Simulate network latency with a random one- to three-second delay.
		wait := time.Duration(rand.Float64()*3.0+1.0) * time.Second
		time.Sleep(wait)
		return res.body, res.urls, nil
	}
	return "", nil, fmt.Errorf("not found: %s", url)
}

// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
	"https://golang.org/": &fakeResult{
		"The Go Programming Language",
		[]string{
			"https://golang.org/pkg/",
			"https://golang.org/cmd/",
		},
	},
	"https://golang.org/pkg/": &fakeResult{
		"Packages",
		[]string{
			"https://golang.org/",
			"https://golang.org/cmd/",
			"https://golang.org/pkg/fmt/",
			"https://golang.org/pkg/os/",
		},
	},
	"https://golang.org/pkg/fmt/": &fakeResult{
		"Package fmt",
		[]string{
			"https://golang.org/",
			"https://golang.org/pkg/",
		},
	},
	"https://golang.org/pkg/os/": &fakeResult{
		"Package os",
		[]string{
			"https://golang.org/",
			"https://golang.org/pkg/",
		},
	},
}

There were two primary tasks in this exercise. The first was to avoid downloading the same URL twice. My solution was to create a UrlSet structure to track which URLs have already been visited. The structure contains a mutex and a map. Since multiple goroutines may try to crawl at the same time, the mutex is needed to protect both reads and writes to the map; concurrent access to a plain Go map is not safe.

The map is used as a set, where the key is the URL to be processed and the value is always the package-level empty struct declared near the top of the file. The ShouldProcessUrl method checks whether the given URL has already been processed and, if it hasn't, adds it to the map. It returns true if the caller should proceed to crawl.
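To see the map-as-set idiom in isolation, here is a minimal sketch of the same pattern (the type and URL are placeholders, not part of the crawler above): the first call for a given key claims it and returns true; every later call returns false.

```go
package main

import (
	"fmt"
	"sync"
)

// visitSet mirrors UrlSet above: a mutex-guarded
// map[string]struct{} used as a set of seen URLs.
type visitSet struct {
	mu sync.Mutex
	m  map[string]struct{}
}

func (v *visitSet) shouldProcess(url string) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	if _, ok := v.m[url]; ok {
		return false // already claimed by some goroutine
	}
	v.m[url] = struct{}{}
	return true // caller won the race and should crawl this URL
}

func main() {
	v := &visitSet{m: make(map[string]struct{})}
	fmt.Println(v.shouldProcess("https://example.com/")) // true: first visit
	fmt.Println(v.shouldProcess("https://example.com/")) // false: duplicate
}
```

Because the check and the insert happen under the same lock, only one goroutine can ever "win" a given URL, which is exactly the guarantee the crawler needs.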

The second task is slightly trickier:

    // TODO: Fetch URLs in parallel.

Goroutines are the obvious choice, but synchronizing them to ensure that all URLs have been processed before returning to the caller requires some thought. Just as in my Equivalent Binary Trees solution, I chose to use a nested worker function closure to recursively process the URLs for each “page”. The worker function accepts a URL, a depth counter, and a pointer to a WaitGroup.

A WaitGroup lets one goroutine block until a counter, incremented by Add() and decremented by Done(), returns to zero. The heart of my solution is the wgChild WaitGroup declared inside the worker. In the for loop, the counter is incremented and a new goroutine is spawned for each candidate URL. After the loop, the caller blocks in wgChild.Wait() until all of the spawned goroutines exit, each decrementing the counter via its deferred wg.Done() call.

In practice, this means that all of the goroutines start fetching as soon as possible, then wait until all of their children exit before returning. I added a random wait timer to the dummy Fetch() routine to better illustrate how the fetch process would work in an actual application.

And that’s it! Learning to visualize how code flows through multiple threads (or goroutines) is critical for writing reliable, high-performance software. If the flow isn’t clear to you, try working through a couple of simple examples using pen and paper. Now that I’ve finished the tour, I’m working on a new project to provide synchronization primitives via a REST API, so stay tuned!
