Introduction

Sitemap XML documents give website owners a way to tune the way search engines index a website. The original standard was developed by Google, but it is now supported by Yahoo, and MSN Search has also said that they will adopt the standard.

The Google Sitemap Generator Pebble Addon generates and streams back a sitemap.xml document based on the content of your blog. As the blog author, you don't need to do any extra work because the sitemap.xml is generated based on your published blog entries.

This document is based on my original blog post about the Google Sitemap addon, with additions for new features.

Overview

  • Obtaining the Google Sitemap Pebble Addon
  • Installing the Servlet
  • Testing the Servlet
  • Configuration
  • Publishing to Search Engines

Obtaining the Google Sitemap Pebble Addon

The PebbleSitemapServlet is available as open source under the terms of the GNU GPL. It can be downloaded in source and binary form from SourceForge:

Either grab the source distro and build it with Krypton, or grab the binary distro.

Installing the Servlet

First, copy the JAR file into your Pebble deployment's WEB-INF/lib directory.

Next, open up WEB-INF/web.xml, and look for the first <servlet> tag. Right before that tag, add the following servlet declaration:

  <!-- Sitemap Generator for Pebble -->
  <servlet>
    <servlet-name>
      PebbleAddonsSitemapServlet
    </servlet-name>
    <servlet-class>com.brendonmatheson.pebbleaddons.sitemap.PebbleSitemapServlet</servlet-class>
  </servlet>
		

And finally, still in web.xml, look for the first <servlet-mapping> tag, and add the following servlet-mapping right before it:

  <!-- Sitemap Generator for Pebble -->
  <servlet-mapping>
    <servlet-name>
      PebbleAddonsSitemapServlet
    </servlet-name>
    <url-pattern>/sitemap.xml</url-pattern>
  </servlet-mapping>
		

Testing the Servlet

Depending on your container, you may have to restart the webapp or the entire container to get the servlet going. Point your browser at the sitemap.xml servlet in your blog. For example:

http://localhost:8080/pebble/sitemap.xml
		

You should see a bunch of XML that looks like the samples in the Google documentation. Google Sitemaps requires that your sitemap be encoded as UTF-8. The servlet sets the encoding, and you can check it by going to the View / Character Encoding menu in Mozilla Firefox or the View / Encoding menu in Internet Explorer and making sure it's set to Unicode.
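As a rough guide to what to look for, here is a sketch of sitemap output using the standard sitemap elements. The URLs are made up (including the entry permalink format), and the exact set of elements the servlet emits is an assumption. The change frequencies and priorities shown are the defaults described in the Configuration section.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative only: hypothetical URLs, assumed element set -->
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <!-- Blog home page -->
  <url>
    <loc>http://localhost:8080/pebble/</loc>
    <changefreq>weekly</changefreq>
    <priority>0.3</priority>
  </url>
  <!-- A published blog entry (hypothetical permalink) -->
  <url>
    <loc>http://localhost:8080/pebble/2007/01/15/example_entry.html</loc>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```

If your browser shows a parse error instead of a tree like this, check the servlet declaration and mapping in web.xml first.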

Configuration

The sitemap servlet has a number of init parameters that you can optionally set to tune its output. The following excerpt from web.xml shows a fully reconfigured version of the servlet:

  <!-- Sitemap XML Generator for Pebble -->
  <servlet>

    <servlet-name>
      PebbleAddonsSitemapServlet
    </servlet-name>
    <servlet-class>com.brendonmatheson.pebbleaddons.sitemap.PebbleSitemapServlet</servlet-class>

    <init-param>
      <param-name>schemaUrl</param-name>
      <param-value>http://www.sitemaps.org/schemas/sitemap/0.9</param-value>
    </init-param>

    <!-- Blog Homepage Settings -->
    <init-param>
      <param-name>blogChangeFreq</param-name>
      <param-value>daily</param-value>
    </init-param>

    <init-param>
      <param-name>blogPriority</param-name>
      <param-value>0.1</param-value>
    </init-param>

    <!-- Blog Entry Settings -->
    <init-param>
      <param-name>
        blogEntryChangeFreq
      </param-name>
      <param-value>monthly</param-value>
    </init-param>

    <init-param>
      <param-name>
        blogEntryPriority
      </param-name>
      <param-value>0.9</param-value>
    </init-param>

    <!-- Static Page Settings -->
    <init-param>
      <param-name>
        staticPageChangeFreq
      </param-name>
      <param-value>monthly</param-value>
    </init-param>

    <init-param>
      <param-name>
        staticPagePriority
      </param-name>
      <param-value>0.7</param-value>
    </init-param>

  </servlet>

		

The meaning of these parameters is as follows:

  • schemaUrl - The URL for the sitemap XML namespace. By default it refers to 0.84, the last Google version, which still works with Google and appears to be accepted by Yahoo. The code fragment above configures the servlet to use the latest public namespace.
  • blogChangeFreq - The change frequency that the blog's home URL will be marked with. Default: "weekly". If you post often you might want to set this to "daily".
  • blogPriority - The priority that the blog's home URL will be marked with. Default: 0.3
  • blogEntryChangeFreq - The change frequency that the blog entry URLs will be marked with. Default: "monthly".
  • blogEntryPriority - The priority that the blog entry URLs will be marked with. Default: 0.8
  • staticPageChangeFreq - The change frequency that static page URLs will be marked with. Default: "monthly".
  • staticPagePriority - The priority that static page URLs will be marked with. Default: 0.8

Note: By default the blog's home URL is given a lower priority (0.3) than blog entry URLs (0.8). This makes it more likely that entry permalinks, rather than the blog's home page, will appear in search engine results.

See http://www.sitemaps.org/protocol.html for more information on the meaning of these parameters.
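Since the init parameters are optional, you only need to declare the ones you want to change. Here is a minimal sketch that overrides just the home page's change frequency. It assumes that any parameter you leave out falls back to the default listed above.

```xml
<!-- Sitemap Generator for Pebble: override a single setting -->
<servlet>
  <servlet-name>PebbleAddonsSitemapServlet</servlet-name>
  <servlet-class>com.brendonmatheson.pebbleaddons.sitemap.PebbleSitemapServlet</servlet-class>

  <!-- Mark the blog home page as changing daily instead of weekly -->
  <init-param>
    <param-name>blogChangeFreq</param-name>
    <param-value>daily</param-value>
  </init-param>
  <!-- Parameters not listed here are assumed to keep their defaults -->
</servlet>
```

This replaces the plain servlet declaration from the installation step. The servlet-mapping stays as it was.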

To make the servlet log its configuration parameter loading to the log4j appenders, make sure it is running with debug enabled by adding the following line to log4j.properties:

log4j.logger.com.brendonmatheson.pebbleaddons.sitemap=debug
		

Publishing to Search Engines

Google

The final step is to tell the GoogleBot to use your sitemap.xml descriptor alongside its standard indexing.

If you haven't logged into Google's Webmaster Tools before, you'll need to log in and verify your website. To access Google Webmaster Tools, all you need is a Gmail account.

After that, you can go to the Sitemaps tab, click "Add a new Sitemap", and point it at the dynamically generated sitemap.xml you now have in your blog. GoogleBot is a busy piece of software, so after you submit your sitemap you'll probably have to wait a little while, possibly a few hours, before it's accessed.

Yahoo!

Yahoo has Site Explorer, a management UI quite similar to Google's Webmaster Tools, which allows you to submit your sitemap.xml's URL. To access Yahoo's Site Explorer app, you need a Yahoo account.

If you're watching your log to see when the bot accesses your sitemap.xml, Yahoo seems to send this User-Agent header:

Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)
			

Cheeky. Or maybe there really is a guy in some Yahoo basement on an old Win 98 box who has the job of manually entering all sitemap.xml information into the Yahoo index. Poor feller.

MSN Search

I've looked all over and haven't been able to find any place where you can submit a sitemap.xml to MSN, so I guess they haven't implemented it yet. If anyone has any info on this please let me know so I can update this post.

Document authored by: