Going from HTML to XML and back again

I've used server-side packages for websites and I've done hand-coded HTML. Hand-coding is a simple way to start a site that won't have too many pages, but what do you do as the site grows? Suppose you've got a menu along one side that lists all your pages. Every time you add a page, every other page needs its menu updated. The usual solution is a handy bit of JavaScript for nice dynamic menus (that may or may not pop up an error message with every mouseover) or some PHP to fill in the details as needed. These are both valid. What I want, though, is a template for the common parts of the page and a source document for the content that populates each unique output page (that is, the static web page that the end user can read). Since I've been monkeying with XSLT lately, I decided to try reworking my pages into an XML-based format. My needs are pretty straightforward, so I thought this would be a good learning experience.

I was right.

Since my task was to take some existing HTML documents and rework those as XML, I'll give an example that follows that same approach. Take a site with the following simple documents:
apples.html

<html>
<title>Fruit Review</title>
<body>
<h3>All fruit, All the time</h3>
<div class="review">
Apples come in numerous colours and flavours.
</div>
<div class="menu">
<ul>
<li><a href="apples.html">Apples</a></li>
<li><a href="oranges.html">Oranges</a></li>
<li><a href="sitemap.html">Site Map</a></li>
</ul>
</div>
</body>
</html>

oranges.html

<html>
<title>Fruit Review</title>
<body>
<h3>All fruit, All the time</h3>
<div class="review">
Most oranges come from Florida.
</div>
<div class="menu">
<ul>
<li><a href="apples.html">Apples</a></li>
<li><a href="oranges.html">Oranges</a></li>
<li><a href="sitemap.html">Site Map</a></li>
</ul>
</div>
</body>
</html>

sitemap.html

<html>
<title>Fruit Review</title>
<body>
<h3>All fruit, All the time</h3>
<div class="menu">
<ul>
<li><a href="apples.html">Apples</a></li>
<li><a href="oranges.html">Oranges</a></li>
<li><a href="sitemap.html">Site Map</a></li>
</ul>
</div>
</body>
</html>

These documents are all identical except for the content of the "review" div. That nominates everything outside the review div for inclusion in the XSLT file. I guess you could say that the document surrounding the review is just there to frame the review, which makes it style information. Okay, maybe that's stretching the concept of style, but I think there are much better ways of coming to my conclusion that use a lot of much longer words. Anyhow, you can distill the essentials of these fruit reviews into one XML source document each and handle the decorations in the XSL transforms. The "distilling" step can be tedious, but it's worthwhile since sitewide changes to those non-content items (the menu, ads, headers and footers, etc.) will get easier. It's also less work for you than it was for me, since I'm telling you how to do it. The XSLT that would do the trick looks like this:
fruitreview.xsl

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
xmlns:xhtml="http://www.w3.org/1999/xhtml">

<xsl:output method="html"
version="1.0"
doctype-public = "-//W3C//DTD XHTML 1.0 Transitional//EN"
doctype-system = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
/>

<xsl:template match="docroot">

<html>
<title>Fruit Review</title>
<body>
<h3>All fruit, All the time</h3>

<xsl:apply-templates select="review"/>

<div class="menu">
<ul>
<li><a href="apples.html">Apples</a></li>
<li><a href="oranges.html">Oranges</a></li>
<li><a href="sitemap.html">Site Map</a></li>
</ul>
</div>

</body>
</html>
</xsl:template>

<!-- the review - the main content of the document -->
<xsl:template match="review">
<!-- insert the value of the node, don't turn < into &lt; -->
<!-- even with this, html must be enclosed in <![CDATA[...]]> -->


<div class="review">
<xsl:value-of select="." disable-output-escaping="yes" />
</div>
</xsl:template>

</xsl:stylesheet>

Next trim out the repeated parts of the HTML documents and tag the remaining HTML with the XML tags that the XSL transforms will recognize.
apples.xml

<?xml version="1.0"?>
<docroot>
<review>
Apples come in numerous colours and flavours.
</review>
</docroot>

oranges.xml

<?xml version="1.0"?>
<docroot>
<review>
Most oranges come from Florida.
</review>
</docroot>

sitemap.xml

<?xml version="1.0"?>
<docroot>
</docroot>

Notice that sitemap.xml really has nothing in it. There's no review, so that part of the transform does nothing with sitemap.xml. All the documents become significantly shorter, which helps us focus on the real content that we're producing rather than wading through the tags. In my case, I can now add documents by writing them using HTML and the XML tags that my XSLT will recognize. After finishing a new document, I add it to the list in the XSLT and run the transform on all my documents (using a script or batch file). Of course this doesn't automate everything or do all that I can with XML, but it does make a great transitional step from a hand-coded HTML site towards a purely XML-driven site. Taking small steps like this allows me to continue producing content while scaling the size of my site. The next steps from here would be to make XML tags for the document title and maybe some shortcut tags for style blocks.
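As a rough sketch of that next step, the source document could carry its own title and the stylesheet could pick it up. The `title` element and the "Fruit Review: ..." format are my own assumptions for illustration; the menu is left out to keep the sketch short.

```xml
<!-- Assumed source document (apples.xml) with a hypothetical <title> element:
     <docroot>
       <title>Apples</title>
       <review>Apples come in numerous colours and flavours.</review>
     </docroot>
-->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

<xsl:output method="html"/>

<xsl:template match="docroot">
<html>
<title>
<xsl:text>Fruit Review</xsl:text>
<!-- append the page's own title only when the source supplies one -->
<xsl:if test="title">
<xsl:text>: </xsl:text>
<xsl:value-of select="title"/>
</xsl:if>
</title>
<body>
<h3>All fruit, All the time</h3>
<!-- only the review is applied here, so the title text is not repeated in the body -->
<xsl:apply-templates select="review"/>
</body>
</html>
</xsl:template>

<xsl:template match="review">
<div class="review">
<xsl:value-of select="." disable-output-escaping="yes"/>
</div>
</xsl:template>

</xsl:stylesheet>
```

A source document without a `title` element (sitemap.xml, say) would simply keep the plain "Fruit Review" title.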

Oh, in case this is new to you, you'll need something to perform the XSL transform. I use Instant Saxon on Windows, so the command line for me looks like:
saxon -o apples.html apples.xml fruitreview.xsl
to produce apples.html.


Cool. I understand this was a learning experience for you and that's fine, but other than a learning experience, this isn't very useful right? I have some questions so I'll try to enumerate them to make sure I understand things first (remember I know less XSLT and XML than you):

1) "In my case, I can now add documents by writing them using HTML and the XML tags that my XSLT will recognize."

Why do this? Doesn't it force you to rigidly follow the html template that will later get stripped out and ALSO update the XSLT file? Why not just write the new page (say "apricot") in XML with the tags directly? (I assume you would work on getting the menu, header/footer into some style sheet or use PHP/JavaScript)

2) "After finishing a new document, I add it to the list in the XSLT and run the transform on all my documents (using a script or batch file)."

Again, why do this? This would force you to update EVERY HTML document you have to add the "apricot" entry to the menu. Why not just run the transform for the one new file?

3) Maybe I misunderstood what the "docroot" template does, but how forgiving is XSLT in terms of extra carriage returns, upper-case HTML elements vs. lower-case (i.e. "BODY" vs "body"), etc?

4) I notice that the transformation is done by select=".". Does this mean it looks for the first period to get the full value of the "review" object? What if you have multiple sentences (I realize this was done for example purposes, but what would someone really do - search for the "menu" div somehow?)

Before I start, one thing I might not have been clear on is that I still want to have static HTML documents to actually publish. The end user doesn't get any XML or XSLT, just the output.
1) A new document would look just like the other XML files, but with the review part changed. When I say "HTML tags" I mean the review could contain HTML (more on this later); by "XML tags" I mean the tags I introduced, which are just docroot and review in this example. I glossed over one important detail about including HTML in the review section. I'll try to clarify it later today, but it's in the note on 'disable-output-escaping' in my XSLT example.
2) Again, I want all my HTML files to change because I want the menu included in each one. I'd like to automate this part a little more so that I could generate the menu as part of my "build."
3) Afaik, XML is case-sensitive, so a template only matches element names written in exactly the same case (I keep mine lowercase), but otherwise it's pretty forgiving (for line breaks and such).
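4) The period isn't a literal character to search for. It's an XPath expression that means the current node, so value-of returns the whole text of the review no matter how many sentences it contains. See value-of and XPath Abbreviated Syntax for more detail. I've seen "." used on select in the XSLT spec, but I only see it explained (briefly) in the XPath spec.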
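For point 2, one way the menu might be generated as part of the build is to keep the page list in a small XML file and let the stylesheet read it with the XSLT document() function. This is only a sketch: the file name pages.xml, its layout, and the template name "menu" are hypothetical. It also shows "." doing the same job as in point 4, here standing for the current page element.

```xml
<!-- Assumed companion file pages.xml (hypothetical name and layout),
     kept in the same directory as the stylesheet:
     <pages>
       <page href="apples.html">Apples</page>
       <page href="oranges.html">Oranges</page>
       <page href="sitemap.html">Site Map</page>
     </pages>
-->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

<xsl:output method="html"/>

<xsl:template match="docroot">
<html>
<title>Fruit Review</title>
<body>
<h3>All fruit, All the time</h3>
<xsl:apply-templates select="review"/>
<xsl:call-template name="menu"/>
</body>
</html>
</xsl:template>

<xsl:template match="review">
<div class="review">
<xsl:value-of select="." disable-output-escaping="yes"/>
</div>
</xsl:template>

<!-- build one menu entry per <page>; a relative URI passed to document()
     as a string is resolved against the stylesheet's location -->
<xsl:template name="menu">
<div class="menu">
<ul>
<xsl:for-each select="document('pages.xml')/pages/page">
<!-- inside the loop, "." is the current <page> element -->
<li><a href="{@href}"><xsl:value-of select="."/></a></li>
</xsl:for-each>
</ul>
</div>
</xsl:template>

</xsl:stylesheet>
```

With something like this, adding a page would mean adding one line to pages.xml and re-running the transform on every source document, rather than editing the stylesheet itself.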

[...] One important detail I didn't mention in yesterday's post on converting between XML and HTML was the "disable-output-escaping" attribute. Using disable-output-escaping="yes" causes the result of the transform to keep "<" and "&" in the resulting document. Otherwise, they get turned into &lt; and &amp;. You may have found that HTML tags, and <b> tags, not the XHTML <strong> and <em> tags to emphasize where this should be applied. The HTML inside the CDATA section is treated as just a big text string; XML tags inside the block won't be processed. That's why it's great as an intermediate step for moving your site toward a purely XML based format. When you're ready to start handling the formatting of all your content with HTML, then you can start doing that with new documents and minimize redoing old content when there's not so much benefit. To recap, since the CDATA block causes < and & to be allowed into the transform, then disable-output-escaping="yes" can see those characters and output them in the final document. The result is XML transforms where you need them, dumb old text copying where you don't. [...]
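To make the CDATA point concrete, here is a minimal sketch of a source document whose review contains HTML markup. The bold tag and the ampersand are invented for illustration; the docroot and review elements are the ones from the original post.

```xml
<?xml version="1.0"?>
<!-- apples.xml with HTML inside the review (illustrative content only) -->
<docroot>
<review>
<![CDATA[
Apples come in <b>numerous</b> colours &amp; flavours.
]]>
</review>
</docroot>
```

Inside the CDATA section the parser treats `<b>` and `&amp;` as plain characters, so the review's string value contains them literally. With disable-output-escaping="yes" on the value-of, they should be written to the output unchanged and the browser sees real markup. Without it, the serializer would escape them again, giving roughly `&lt;b&gt;` and `&amp;amp;`. One caveat: the XSLT 1.0 spec makes support for disable-output-escaping optional, so this relies on the processor honouring it.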