A Tour of Go Exercises: Web Crawler

To wrap up what I started in my last post, here is my solution to the A Tour of Go: Web Crawler exercise. Be sure to try it yourself before looking at my solution if you want to avoid spoilers…
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

var empty = struct{}{}

// UrlSet tracks which URLs have already been visited.
type UrlSet struct {
	mu sync.Mutex
	m  map[string]struct{}
}

func NewUrlSet() (us *UrlSet) {
	us = &UrlSet{}
	us.m = make(map[string]struct{})
	return us
}

// ShouldProcessUrl records url and reports whether the caller
// should go on to crawl it (i.e. it has not been seen before).
func (us *UrlSet) ShouldProcessUrl(url string) bool {
	us.mu.Lock()
	defer us.mu.Unlock()
	if _, ok := us.m[url]; ok {
		return false
	}
	us.m[url] = empty
	return true
}

type Fetcher interface {
	// Fetch returns the body of URL and
	// a slice of URLs found on that page.
	Fetch(url string) (body string, urls []string, err error)
}

// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, maxdepth int, fetcher Fetcher) {
	us := NewUrlSet()

	var worker func(url string, depth int, wg *sync.WaitGroup)
	worker = func(url string, depth int, wg *sync.WaitGroup) {
		if wg != nil {
			defer wg.Done()
		}

		// TODO: Fetch URLs in parallel. - Done
		// TODO: Don't fetch the same URL twice. - Done
		if !us.ShouldProcessUrl(url) {
			return
		}
		if depth <= 0 {
			return
		}

		body, urls, err := fetcher.Fetch(url)
		if err != nil {
			fmt.Println(err)
			return
		}
		fmt.Printf("found: %s %q\n", url, body)

		var wgChild sync.WaitGroup
		for _, u := range urls {
			wgChild.Add(1)
			go worker(u, depth-1, &wgChild)
		}
		wgChild.Wait()
	}

	worker(url, maxdepth, nil)
}

func main() {
	Crawl("https://golang.org/", 4, fetcher)
}

// fakeFetcher is a Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult

type fakeResult struct {
	body string
	urls []string
}

func (f fakeFetcher) Fetch(url string) (string, []string, error) {
	if res, ok := f[url]; ok {
		wait := time.Duration(rand.Float64()*3.0+1.0) * time.Second
		time.Sleep(wait)
		return res.body, res.urls, nil
	}
	return "", nil, fmt.Errorf("not found: %s", url)
}

// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
	"https://golang.org/": &fakeResult{
		"The Go Programming Language",
		[]string{
			"https://golang.org/pkg/",
			"https://golang.org/cmd/",
		},
	},
	"https://golang.org/pkg/": &fakeResult{
		"Packages",
		[]string{
			"https://golang.org/",
			"https://golang.org/cmd/",
			"https://golang.org/pkg/fmt/",
			"https://golang.org/pkg/os/",
		},
	},
	"https://golang.org/pkg/fmt/": &fakeResult{
		"Package fmt",
		[]string{
			"https://golang.org/",
			"https://golang.org/pkg/",
		},
	},
	"https://golang.org/pkg/os/": &fakeResult{
		"Package os",
		[]string{
			"https://golang.org/",
			"https://golang.org/pkg/",
		},
	},
}
There were two primary tasks in this exercise. The first was to avoid downloading the same URL twice. My solution to that was to create a UrlSet structure to track which URLs have already been visited. The structure contains a mutex and a map. Since multiple goroutines may be trying to crawl at the same time, the mutex is needed to protect writes made to the map. The map is used as a set where the key is the URL to be processed and the value is always the empty structure declared near the top of the program. The ShouldProcessUrl function checks to see if the given URL has already been processed, and if it hasn't, it adds it to the map. It then returns true if the caller should proceed to crawl.
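As an aside, the same check-and-mark step could be written without an explicit mutex by leaning on sync.Map's atomic LoadOrStore. This is just a sketch of that alternative (the syncUrlSet name is made up for illustration), not what the solution above uses:

package main

import (
	"fmt"
	"sync"
)

// syncUrlSet is a hypothetical alternative to UrlSet that relies on
// sync.Map instead of a mutex-guarded map.
type syncUrlSet struct {
	m sync.Map
}

// ShouldProcessUrl records url and reports whether this is the first
// time it has been seen, i.e. whether the caller should crawl it.
func (us *syncUrlSet) ShouldProcessUrl(url string) bool {
	_, seen := us.m.LoadOrStore(url, struct{}{})
	return !seen
}

func main() {
	us := &syncUrlSet{}
	fmt.Println(us.ShouldProcessUrl("https://golang.org/")) // true
	fmt.Println(us.ShouldProcessUrl("https://golang.org/")) // false
}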
The second task is slightly trickier:
// TODO: Fetch URLs in parallel.
Goroutines are the obvious choice, but synchronizing them to ensure that all URLs have been processed before returning to the caller requires some thought. Just as in my Equivalent Binary Trees solution, I chose to use a nested worker function closure to recursively process the URLs for each “page”. The worker function accepts a URL, a depth counter, and a WaitGroup primitive.
A WaitGroup allows one goroutine to block until the WaitGroup.Done() method has been called a predefined number of times. The heart of my solution is inside the worker closure, where a new WaitGroup (wgChild) is declared. Then in the for loop, the WaitGroup counter is incremented and a new goroutine is spawned for each candidate URL. After the loop terminates, the caller blocks on wgChild.Wait() until all of the spawned goroutines exit (signaled by the defer wg.Done() call at the top of the worker).
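If the Add/Done/Wait dance is new to you, here is a minimal, standalone sketch of the same pattern outside the crawler (the worker count and messages are made up for illustration):

package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1) // increment the counter before spawning
		go func(n int) {
			defer wg.Done() // decrement when this goroutine exits
			fmt.Println("worker", n, "finished")
		}(i)
	}
	wg.Wait() // block until the counter drops back to zero
	fmt.Println("all workers done")
}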
In practice, this means that all of the goroutines start fetching as soon as possible, then wait until all of their children exit before returning. I added a random wait timer to the dummy Fetch() routine to better illustrate how the fetch process would work in an actual application.
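One small detail worth noting: converting the random float to a time.Duration before multiplying truncates the delay to whole seconds (1, 2, or 3). That is fine for this demo, but if sub-second jitter were wanted, a sketch like this could replace the wait calculation inside the fake Fetch:

package main

import (
	"fmt"
	"math/rand"
	"time"
)

func main() {
	// Same 1s-4s range as the demo, but the fractional part is kept
	// because the multiplication happens before the Duration conversion.
	wait := time.Duration((rand.Float64()*3.0 + 1.0) * float64(time.Second))
	fmt.Println("sleeping for", wait)
	time.Sleep(wait)
}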
And that’s it! Learning to visualize how code flows through multiple threads (or goroutines) is critical for writing reliable, high-performance software. If the flow isn’t clear to you, try working through a couple of simple examples using pen and paper. Now that I’ve finished the tour, I’m working on a new project to provide synchronization primitives via a REST API, so stay tuned!