XML sitemaps are a great way to expose your site's content to search engines, especially when you do not have an internal or external linking structure built out yet. In its simplest form, an XML sitemap is a directory of every unique URL your website contains. This gives Google and other search engines a one-stop shop for all the pages they should index. Each sitemap is limited to 50MB uncompressed (10MB under older versions of the protocol) or 50,000 URLs, but this limitation can be worked around with a sitemap index, which links to multiple sitemaps. Sitemaps can also include additional metadata, such as how frequently a page changes or when it was last updated. After you design a site with HTML / CSS templates, make sure you include sitemaps to help search engines discover and index the pages more quickly.
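To make the format concrete, here is a minimal sketch of what a sitemap with the optional metadata looks like when rendered by hand. It uses only the JDK; the class name SitemapXmlSketch, the Entry type, and the example slug are made up for illustration and are not part of SitemapGen4J.

```java
import java.time.LocalDate;
import java.util.List;

// Illustrative only: a hand-rolled renderer, not SitemapGen4J's API.
class SitemapXmlSketch {

    // One <url> entry: the location plus optional metadata (null means "omit").
    static final class Entry {
        final String loc;
        final LocalDate lastMod;
        final String changeFreq;

        Entry(String loc, LocalDate lastMod, String changeFreq) {
            this.loc = loc;
            this.lastMod = lastMod;
            this.changeFreq = changeFreq;
        }
    }

    // Entity-escape the characters that are not allowed raw inside XML text.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    static String render(List<Entry> entries) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        for (Entry e : entries) {
            sb.append("  <url>\n");
            sb.append("    <loc>").append(escape(e.loc)).append("</loc>\n");
            if (e.lastMod != null) {
                // LocalDate.toString() is ISO-8601 (yyyy-MM-dd)
                sb.append("    <lastmod>").append(e.lastMod).append("</lastmod>\n");
            }
            if (e.changeFreq != null) {
                sb.append("    <changefreq>").append(e.changeFreq).append("</changefreq>\n");
            }
            sb.append("  </url>\n");
        }
        sb.append("</urlset>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // "example-post" is a made-up slug.
        System.out.print(render(List.of(
            new Entry("https://www.stubbornjava.com/posts/example-post", LocalDate.of(2017, 1, 15), "monthly"),
            new Entry("https://www.stubbornjava.com/java-libraries", null, null))));
    }
}
```

Only the loc element is required for each URL; lastmod and changefreq are hints that search engines are free to ignore.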

XML Sitemap with Java

The SitemapGen4J library provides a nice object model for generating all of the URLs required to build out a sitemap. Most likely you will need to write code that can generate every possible URL for your website. An alternative is to build a generic crawler that can build a sitemap for any website. Building the custom URLs is not too difficult, so we create a method for each page type. We keep each page type in its own sitemap because we plan on making a sitemap index later.

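// Third-party imports assume Guava, OkHttp 3, jOOL and SitemapGen4J.
// Project classes (Posts, Guides, Tags, JavaLib, InMemorySitemap, ...) are omitted.
import java.net.MalformedURLException;
import java.util.List;
import java.util.Map;

import org.jooq.lambda.Seq;

import com.google.common.collect.Lists;
import com.google.common.collect.Maps;
import com.redfin.sitemapgenerator.WebSitemapGenerator;

import okhttp3.HttpUrl;

class StubbornJavaSitemapGenerator {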
    private static final String HOST = "https://www.stubbornjava.com";

    private static final InMemorySitemap sitemap = InMemorySitemap.fromSupplier(StubbornJavaSitemapGenerator::generateSitemap);
    public static InMemorySitemap getSitemap() {
        return sitemap;
    }

    private static Map<String, List<String>> generateSitemap() {
        Map<String, List<String>> index = Maps.newHashMap();
        try {
            index.put("posts", genPosts());
            index.put("guides", genGuides());
            index.put("recommendations", genRecommendations());
            index.put("tags", genTags());
            index.put("libraries", genLibraries());
            return index;
        } catch (MalformedURLException ex) {
            throw new RuntimeException(ex);
        }
    }

    private static List<String> genPosts() throws MalformedURLException {
        WebSitemapGenerator wsg = new WebSitemapGenerator(HOST);
        List<String> slugs = Posts.getAllSlugs();
        for (String slug: slugs) {
            String url = HttpUrl.parse(HOST)
                                .newBuilder()
                                .addPathSegment("posts")
                                .addPathSegment(slug)
                                .build()
                                .toString();
            wsg.addUrl(url);
        }
        return wsg.writeAsStrings();
    }

    private static List<String> genGuides() throws MalformedURLException {
        WebSitemapGenerator wsg = new WebSitemapGenerator(HOST);
        List<GuideTitle> guides = Guides.findTitles();
        for (GuideTitle guide : guides) {
            String url = HttpUrl.parse(HOST)
                                .newBuilder()
                                .addPathSegment("guides")
                                .addPathSegment(guide.getSlug())
                                .build()
                                .toString();
            wsg.addUrl(url);
        }
        return wsg.writeAsStrings();
    }

    private static List<String> genRecommendations() throws MalformedURLException {
        WebSitemapGenerator wsg = new WebSitemapGenerator(HOST);
        List<String> recommendations = Lists.newArrayList(
            "java-libraries"
            , "best-selling-html-css-themes-and-website-templates"
        );
        for (String recommendation : recommendations) {
            String url = HttpUrl.parse(HOST)
                                .newBuilder()
                                .addPathSegment(recommendation)
                                .build()
                                .toString();
            wsg.addUrl(url);
        }
        return wsg.writeAsStrings();
    }

    private static List<String> genTags() throws MalformedURLException {
        WebSitemapGenerator wsg = new WebSitemapGenerator(HOST);
        List<Tag> tags = Tags.getTags();
        for (Tag tag : tags) {
            String url = HttpUrl.parse(HOST)
                                .newBuilder()
                                .addPathSegment("tags")
                                .addPathSegment(tag.getName())
                                .addPathSegment("posts")
                                .build()
                                .toString();
            wsg.addUrl(url);
        }
        return wsg.writeAsStrings();
    }

    private static List<String> genLibraries() throws MalformedURLException {
        WebSitemapGenerator wsg = new WebSitemapGenerator(HOST);
        List<JavaLib> libraries = Seq.of(JavaLib.values()).toList();
        for (JavaLib lib : libraries) {
            String url = HttpUrl.parse(HOST)
                                .newBuilder()
                                .addPathSegment("java-libraries")
                                .addPathSegment(lib.getName())
                                .build()
                                .toString();
            wsg.addUrl(url);
        }
        return wsg.writeAsStrings();
    }

    public static void main(String[] args) {
        generateSitemap();
    }
}
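Each generator method above returns a List<String> rather than a single String, presumably because a page type with more than 50,000 URLs has to be split across several sitemap files. The library takes care of this for us, but the idea can be sketched in plain Java. The class and method names below are hypothetical, and the example URLs are generated placeholders.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: shows the splitting idea, not SitemapGen4J's internals.
class SitemapChunkSketch {
    // URL limit per sitemap file from the sitemap protocol.
    static final int MAX_URLS_PER_SITEMAP = 50_000;

    // Split items into consecutive chunks of at most maxPerChunk elements.
    static <T> List<List<T>> partition(List<T> items, int maxPerChunk) {
        if (maxPerChunk <= 0) {
            throw new IllegalArgumentException("maxPerChunk must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < items.size(); start += maxPerChunk) {
            int end = Math.min(start + maxPerChunk, items.size());
            // Copy the view so each chunk is independent of the source list.
            chunks.add(new ArrayList<>(items.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Placeholder URLs standing in for a large page type.
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 120_000; i++) {
            urls.add("https://www.stubbornjava.com/posts/post-" + i);
        }
        List<List<String>> chunks = partition(urls, MAX_URLS_PER_SITEMAP);
        for (int i = 0; i < chunks.size(); i++) {
            System.out.println("posts-" + i + ".xml would hold " + chunks.get(i).size() + " urls");
        }
    }
}
```

Each chunk would become its own sitemap file, which is why an index (or some other list of all the files) is needed once a site grows.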

XML Sitemap Index

SitemapGen4J was built to write sitemaps to files on disk; however, we just want to keep ours in memory since it is fairly small. Unfortunately, it looks like exposing the internal object model or additional rendering features was an afterthought. There is an in-memory alternative for the individual sitemaps (writeAsStrings) but not for the index. We should probably contribute an implementation or create a fully custom sitemap generator. For now, we build our own internal mapping from sitemap name to XML content instead. Sitemaps have a limit of 50MB uncompressed (10MB under older versions of the protocol) or 50,000 URLs per sitemap, which is why an index is needed.

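// Imports are an assumption: java.util.function.Supplier for our own suppliers,
// Guava's Suppliers for memoization, and jOOL's Seq.
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

import org.jooq.lambda.Seq;

import com.google.common.base.Suppliers;
import com.google.common.collect.Maps;

public class InMemorySitemap {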
    private final Supplier<Map<String, String>> indexSupplier;
    private InMemorySitemap(Supplier<Map<String, String>> indexSupplier) {
        this.indexSupplier = indexSupplier;
    }

    public String getIndex(String sitemapName) {
        return indexSupplier.get().get(sitemapName);
    }

    public List<String> getIndexNames() {
        return Seq.seq(indexSupplier.get().keySet())
                  .sorted()
                  .toList();
    }

    // Cache the sitemap for the lifetime of the JVM
    public static InMemorySitemap fromSupplier(Supplier<Map<String, List<String>>> supplier) {
        Supplier<Map<String, String>> sup = mapSupplier(supplier);
        Supplier<Map<String, String>> memoized = Suppliers.memoize(sup::get);
        return new InMemorySitemap(memoized);
    }

    // Cache the sitemap but refresh after the given duration.
    public static InMemorySitemap fromSupplierWithExpiration(
            Supplier<Map<String, List<String>>> supplier,
            long duration,
            TimeUnit unit) {
        Supplier<Map<String, String>> sup = mapSupplier(supplier);
        Supplier<Map<String, String>> memoized = Suppliers.memoizeWithExpiration(sup::get, duration, unit);
        return new InMemorySitemap(memoized);
    }

    private static Supplier<Map<String, String>> mapSupplier(Supplier<Map<String, List<String>>> supplier) {
        return () -> {
            Map<String, List<String>> originalMap = supplier.get();
            Map<String, String> newIndex = Maps.newHashMap();
            for (Entry<String, List<String>> entry : originalMap.entrySet()) {
                for (int i = 0; i < entry.getValue().size(); i++) {
                    newIndex.put(entry.getKey() + "-" + i + ".xml", entry.getValue().get(i));
                }
            }
            return newIndex;
        };
    }
}
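The two factory methods differ only in how long the generated map is cached: fromSupplier keeps it for the lifetime of the JVM, while fromSupplierWithExpiration regenerates it after the given duration. As a rough illustration of the second behavior, here is a small JDK-only sketch. It is not Guava's implementation; the class name ExpiringMemoSketch and the injectable clock are assumptions made so the example is easy to test.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Illustrative only: a simplified take on "memoize with expiration".
class ExpiringMemoSketch<T> implements Supplier<T> {
    private final Supplier<T> delegate;
    private final long ttlNanos;
    private final LongSupplier nanoClock;

    private T value;
    private long expiresAt;
    private boolean loaded;

    ExpiringMemoSketch(Supplier<T> delegate, long ttlNanos, LongSupplier nanoClock) {
        this.delegate = delegate;
        this.ttlNanos = ttlNanos;
        this.nanoClock = nanoClock;
    }

    @Override
    public synchronized T get() {
        long now = nanoClock.getAsLong();
        // Subtraction keeps the comparison correct even if nanoTime wraps around.
        if (!loaded || now - expiresAt >= 0) {
            value = delegate.get();
            expiresAt = now + ttlNanos;
            loaded = true;
        }
        return value;
    }

    public static void main(String[] args) {
        Supplier<String> cached = new ExpiringMemoSketch<>(
            () -> "generated at " + System.nanoTime(),
            TimeUnit.MINUTES.toNanos(10),
            System::nanoTime);
        // Both calls happen within the ten minute window, so the delegate runs once.
        System.out.println(cached.get().equals(cached.get()));
    }
}
```

The expiring variant is the one to reach for if new posts should show up in the sitemap without restarting the server.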

XML Sitemap Routes

With an internal representation of the sitemap, we now need to expose it in our Undertow web server. A cool feature of RoutingHandler is that it allows you to combine two RoutingHandlers with the addAll method, so the sitemap routes can live in their own class.

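// Undertow imports; Exchange and timed(...) are project helpers and are omitted.
import io.undertow.server.HttpServerExchange;
import io.undertow.server.RoutingHandler;

public class SitemapRoutes {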
    private final InMemorySitemap sitemap;
    private SitemapRoutes(InMemorySitemap sitemap) {
        this.sitemap = sitemap;
    }

    public void getSitemap(HttpServerExchange exchange) {
        String sitemapName = Exchange.pathParams().pathParam(exchange, "sitemap").orElse(null);
        String content = sitemap.getIndex(sitemapName);
        if (null == content) {
            exchange.setStatusCode(404);
            Exchange.body().sendText(exchange, String.format("Sitemap %s doesn't exist", sitemapName));
            return;
        }
        Exchange.body().sendXml(exchange, content);
    }

    /*
     * Routing Handlers can be reused and combined with each other
     * using the RoutingHandler.addAll() method.
     */
    public static RoutingHandler router(InMemorySitemap sitemap) {
        SitemapRoutes routes = new SitemapRoutes(sitemap);
        RoutingHandler router = new RoutingHandler()
            .get("/sitemaps/{sitemap}", timed("getSitemap", routes::getSitemap))
        ;
        return router;
    }
}

Exposing the Sitemap

Ideally, you can just expose a single sitemap index file that references all of the others. Since we had to hack around this a bit, another option is to list all of the sitemap files in our robots.txt.

public static void robots(HttpServerExchange exchange) {
    String host = Exchange.urls().host(exchange).toString();
    List<String> sitemaps = StubbornJavaSitemapGenerator.getSitemap().getIndexNames();
    Response response = Response.fromExchange(exchange)
                                .with("sitemaps", sitemaps)
                                .with("host", host);
    Exchange.body().sendText(exchange, Templating.instance().renderTemplate("templates/src/pages/robots.txt", response));
}
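The robots.txt template itself is not shown here, so the following is a JDK-only sketch of both options: a robots.txt that lists every sitemap, and a hand-rendered sitemap index for comparison. The class name SitemapExposureSketch and the "User-agent" and "Allow" lines are assumptions. The /sitemaps/ path and the names like posts-0.xml follow the route and the naming scheme used earlier.

```java
import java.util.List;

// Illustrative only: plain string rendering, not the site's actual template.
class SitemapExposureSketch {

    // Sitemaps are served under /sitemaps/{sitemap} by the routes above.
    static String sitemapUrl(String host, String sitemapName) {
        return host + "/sitemaps/" + sitemapName;
    }

    // Option 1: list every sitemap file in robots.txt.
    static String renderRobotsTxt(String host, List<String> sitemapNames) {
        StringBuilder sb = new StringBuilder();
        sb.append("User-agent: *\n"); // assumed policy: allow everything
        sb.append("Allow: /\n");
        for (String name : sitemapNames) {
            sb.append("Sitemap: ").append(sitemapUrl(host, name)).append('\n');
        }
        return sb.toString();
    }

    // Option 2: a single sitemap index that references the individual sitemaps.
    static String renderSitemapIndex(String host, List<String> sitemapNames) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<sitemapindex xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        for (String name : sitemapNames) {
            // Escape the characters that matter inside XML text content.
            String loc = sitemapUrl(host, name)
                .replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
            sb.append("  <sitemap><loc>").append(loc).append("</loc></sitemap>\n");
        }
        sb.append("</sitemapindex>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        String host = "https://www.stubbornjava.com";
        List<String> names = List.of("guides-0.xml", "posts-0.xml", "tags-0.xml");
        System.out.print(renderRobotsTxt(host, names));
        System.out.print(renderSitemapIndex(host, names));
    }
}
```

Either output gives crawlers a single place to discover every sitemap; the index has the advantage that it can be submitted directly to a search engine's webmaster tools.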