Wednesday, July 16, 2008

Web Crawling Revisited

I was looking through my past posts and came to Web Crawling; it's time to update it.

In that post, Googlebot had started crawling my web site based on a blog comment with a link. The linked page was being crawled, but not much else. My site is not huge, but it does have a twelve-page story with hundreds of pictures and a page that gets worldwide attention from links in a Yahoo group.

After several visits, the bot checked for an index.html, studied that for a few visits, then started visiting the pages linked from the index. After several more weeks, all of the pages had been crawled.

The bot noticed there were images and sent Googlebot-Image for a look. After a few probes, all of the images were indexed, usually 30-50 at a time. Part of the process must be to index the words on the pages and, where they appear with an image, associate the words with that image, because I started getting hits from Google searches. I'll note here that I have a robots.txt and Googlebot always checks it, but I do not have an index restriction on most of the pages and images.
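For anyone curious how a crawler decides what a robots.txt allows, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `/private/` path, the `example.com` URLs, and the `can_crawl` helper are all made up for illustration; they are not my actual file or site, and this is not a claim about how Googlebot itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents (an assumption for this example):
# everything is open to crawlers except one directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def can_crawl(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(can_crawl("Googlebot", "http://example.com/story/page1.html"))
    print(can_crawl("Googlebot-Image", "http://example.com/private/photo.jpg"))
```

With no restriction on a path, a well-behaved bot is free to fetch and index it, which matches what I saw for most of my pages and images.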

I monitor my web site logs to see where page hits come from. Many come from Google or Google Images, and some of the search terms did not match anything written on the page. Around that time I noticed that Google has an Image Labeler. It must show some of my images to the Labeler players; that's the only explanation I can think of for the hits I see.
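Pulling the search terms out of a log's referrer field can be sketched roughly like this. It assumes the referrer is a Google URL that carries the query in a `q` parameter; the `search_terms` helper and the sample URLs are hypothetical, and real referrers (image search in particular) may bury the query elsewhere.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def search_terms(referrer: str) -> Optional[str]:
    """Return the search phrase from a Google referrer URL, or None.

    Assumption: the query lives in the `q` parameter. Only google.com and
    its subdomains are matched here; country domains would need more rules.
    """
    parts = urlparse(referrer)
    host = parts.hostname or ""
    if not (host == "google.com" or host.endswith(".google.com")):
        return None
    values = parse_qs(parts.query).get("q")
    return values[0] if values else None


if __name__ == "__main__":
    # Made-up referrer, for illustration only.
    print(search_terms("http://www.google.com/search?q=web+crawling&hl=en"))
```

Comparing the extracted phrase with the words on the page that was hit is how a mismatch like the one I describe would show up.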

So, once Googlebot gets hold of a web site, it works until the site is completely indexed, then returns frequently, checking for changes and updates. Although my site has been Slurped (Yahoo bot) and MSRBT (Microsoft) has dropped by a few times, neither has done the deep digging that Google has done.


