XHTML 1.0 crib sheet - Between past and future

5. Between past and future

XHTML 1.0 is a transitional language whose purpose is to port HTML 4 documents to XML. So, let's take a look at the few steps needed to convert HTML 4 documents to XHTML 1.0. Although it makes the migration a little more laborious, we will always keep in mind the direction taken by XHTML 1.1: this will save us a lot of time during later migrations.

5.1 Migrating from HTML 4

Strictly conforming documents

The overall structure of an XHTML 1.0 document is almost identical to that of an HTML 4 document. We have already pointed out the main differences:

the starting XML declaration (sometimes not required, but always recommended);
the DOCTYPE to use (though forward compatibility is assured only by the Strict DTD);
the root element of the document (<html>) must contain the declaration for the XHTML namespace.

Therefore the structure of the document becomes:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <!-- Headers -->
  </head>
  <body>
    <!-- Document body -->
  </body>
</html>

Because of the leading XML declaration, some old browsers may handle XHTML documents as generic XML and display them unpredictably. To keep backward compatibility with those user agents, you can remove the XML declaration, but you will then be restricted to the default UTF-8 or UTF-16 character encodings. Note also that Internet Explorer handles the very same document differently when the XML declaration is present (Internet Explorer 6, for instance, falls back to quirks mode when anything precedes the DOCTYPE).
XML and HTML specify the character encoding differently: the former, as we have seen, specifies it inside the XML declaration, the latter inside the Content-Type HTTP header or a meta element. Therefore, if you want documents with non-default character encodings to be portable and you can't make sure the web server provides the right HTTP headers, you need to include both the XML declaration and the corresponding meta http-equiv element; e.g.:

<meta http-equiv="Content-type" content="text/html; charset=EUC-JP" />
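Putting the two declarations together, the top of such a document might look like the sketch below. The title text is made up for illustration; the point is only that the encoding named in the XML declaration and the one in the meta element agree.

```xml
<?xml version="1.0" encoding="EUC-JP"?>
<!DOCTYPE html
     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ja" lang="ja">
  <head>
    <!-- Read by HTML user agents; must name the same encoding as the XML declaration -->
    <meta http-equiv="Content-type" content="text/html; charset=EUC-JP" />
    <title>Encoding declared twice (illustrative title)</title>
  </head>
  <body>
    <!-- Document body -->
  </body>
</html>
```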

Correct nesting

Elements must nest properly; overlapping is illegal in SGML too (and consequently in HTML), but it is widely tolerated in existing browsers.
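As a quick illustration (the sentence itself is arbitrary), the first paragraph below overlaps its elements, while the second nests them properly:

```xml
<!-- Overlapping elements: not well-formed -->
<p>here is an <em>emphasized paragraph.</p></em>

<!-- Properly nested elements -->
<p>here is an <em>emphasized</em> paragraph.</p>
```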

Element and attribute names must be in lower case

Unlike HTML, XML is case-sensitive and all element and attribute names defined in XHTML are lower case.
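For instance, assuming a hypothetical link to index.html, the same anchor would be written as follows:

```xml
<!-- Accepted by HTML 4, but A and HREF are not defined in XHTML -->
<A HREF="index.html">Home</A>

<!-- XHTML 1.0: lower-case element and attribute names -->
<a href="index.html">Home</a>
```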

End tags

In HTML 4 certain elements were permitted to omit the end tag (e.g. the <p> element). In XHTML 1.0, instead, all non-EMPTY elements must have an end tag. EMPTY elements must either have an end tag or a slash before the closing angle bracket. E.g.:

<hr></hr>
<hr />

The first syntax may give uncertain results in many existing user agents. Therefore, the second is to be preferred; the (optional) space before the trailing slash is recommended for compatibility with HTML browsers.
The second syntax cannot apply to non-EMPTY elements, even if their content is empty. Writing an empty paragraph like this would be incorrect:

<p />
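For non-EMPTY elements, the rule can be sketched as follows (the paragraph text is made up for illustration):

```xml
<!-- HTML 4 style: end tags omitted, not well-formed XML -->
<p>First paragraph
<p>Second paragraph

<!-- XHTML 1.0: every non-EMPTY element has its end tag -->
<p>First paragraph</p>
<p>Second paragraph</p>

<!-- An empty paragraph still needs an explicit end tag -->
<p></p>
```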

Quoting

Attribute values must always be quoted, even those which appear to be numeric.
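A small example, using an arbitrary table cell:

```xml
<!-- Tolerated by HTML 4, not well-formed XML -->
<td rowspan=3>Total</td>

<!-- XHTML 1.0: the value is quoted even though it is numeric -->
<td rowspan="3">Total</td>
```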

Attribute Minimization

XML doesn't support attributes with no value (like, for instance, compact or checked in HTML). The correct syntax is:

<input type="checkbox" checked="checked" />

White Space handling in attribute values

When user agents process attributes, they strip leading and trailing white space and they map sequences of one or more white space characters (including line breaks) to a single inter-word space. That's the theory. In practice, you'd better avoid line breaks and multiple white space characters within attribute values, because user agents handle them inconsistently.
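The following sketch shows the kind of attribute value to avoid and a safer equivalent; the file name logo.png and the alternate text are assumptions made up for the example:

```xml
<!-- Risky: leading/trailing spaces and a line break inside the value -->
<img src="logo.png" alt="  Company
      logo  " />

<!-- Safer: a single line, single spaces between words -->
<img src="logo.png" alt="Company logo" />
```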

script and style elements

In XHTML, the script and style elements are declared as having #PCDATA content. As a result, < and & will be treated as the start of markup, and entities such as &lt; and &amp; will be recognized as entity references by the XML processor and expanded to < and & respectively. You can avoid this by wrapping the content of the script and style elements within a CDATA marked section:

<script type="text/javascript">
<![CDATA[
[...]
]]>
</script>

or by using external script and style documents. Also keep in mind that XML parsers may remove comments. Therefore, the good old habit of "hiding" scripts and styles inside comments may not work in future XML-based user agents.
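A minimal sketch of the external alternative is shown below. The path /js/myscript.js is a hypothetical name chosen for the example; note that script is a non-EMPTY element, so it keeps an explicit end tag even when it has no content.

```xml
<head>
  <title>External script and style sheet</title>
  <!-- link is an EMPTY element, so it takes the trailing slash -->
  <link rel="stylesheet" type="text/css" href="/css/mystyle.css" />
  <!-- script is non-EMPTY: write the end tag, not <script ... /> -->
  <script type="text/javascript" src="/js/myscript.js"></script>
</head>
```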

Elements with id and name attributes

In HTML 4, some elements (a, applet, form, frame, iframe, img and map) can use two different attributes as fragment identifiers: name and id. In XHTML 1.0, instead, only the id attribute is a legal fragment identifier, while the name attribute is formally deprecated and will be removed in XHTML 1.1.
In handling URI-references that end with fragment identifiers (e.g. "#foo"), XML and HTML differ: the former requires elements with an id attribute, the latter elements with a name attribute. Therefore, to ensure maximum forward and backward compatibility, identical values may be supplied for both of these attributes:

<a id="foo" name="foo">...</a>

When defining fragment identifiers to be backward-compatible, only strings matching the pattern [A-Za-z][A-Za-z0-9:_.-]* should be used.
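To make the pattern concrete, here are a few identifiers checked against it; all the values and headings are invented for the example:

```xml
<!-- These match the pattern -->
<h2 id="section-2.1">Overview</h2>
<a id="note_3" name="note_3">Note 3</a>

<!-- These do not match it -->
<h2 id="2nd-section">Overview</h2>  <!-- starts with a digit -->
<h2 id="my section">Overview</h2>   <!-- contains a space -->
```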

Attributes with pre-defined value sets

Some attributes, called enumerated attributes in XML and SGML, have pre-defined and limited sets of values (e.g. the type attribute of an input element). In XHTML 1.0, the interpretation of these values is case-sensitive and all values are lower-case.
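For example, with a hypothetical text field named user:

```xml
<!-- Accepted by HTML 4, not valid in XHTML 1.0 -->
<input type="TEXT" name="user" />

<!-- XHTML 1.0: the enumerated value is lower case -->
<input type="text" name="user" />
```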

Entity references as hex values

SGML and XML both permit references to characters by using hexadecimal values. In SGML, these references could be made by using either the &#Xnn; or the &#xnn; notation. In XML documents, instead, the only option is the lower-case version (i.e. &#xnn;).
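As an illustration, here is the copyright sign (U+00A9) written both ways; the surrounding text is arbitrary:

```xml
<!-- Lower-case x: accepted by both SGML and XML -->
<p>&#xA9; Example Company</p>

<!-- Upper-case X: SGML only, rejected by an XML parser -->
<p>&#XA9; Example Company</p>
```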

isindex

You can't include more than one isindex element in the document head. In any case, the isindex element is deprecated in favor of the input element.

The lang and xml:lang attributes

To ensure maximum compatibility, it is recommended to use both of these attributes to specify the language of an element. If both are present, the value of the xml:lang attribute takes precedence.
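A short example, with a French sentence chosen arbitrarily, could look like this:

```xml
<!-- Same value in both attributes: xml:lang for XML user agents, lang for HTML ones -->
<p xml:lang="fr" lang="fr">Bonjour tout le monde</p>
```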

Ampersand

In both SGML and XML, the ampersand character ("&") always declares the beginning of an entity reference. Therefore, all ampersands used in a document that are to be treated as literal characters must themselves be expressed as an entity reference (i.e. "&amp;"), including ampersands inside the href attribute of an a element; e.g.:

<a href="http://www.kernel-panic.it/myscript.php?id=123456&amp;name=user">...</a>

CSS2 and XHTML

CSS2 applies to both HTML and XML documents, though differences in parsing may produce different visual or aural results. A few hints will help you reduce this risk:

CSS style sheets for XHTML should use lower case element and attribute names;
in tables, the tbody element will be inferred by the parser of an HTML user agent, but not by the parser of an XML user agent. Therefore, you should always explicitly add a tbody element if it is referred to in a CSS selector.
CSS defines different conformance rules for HTML and XML documents; be aware that the HTML rules apply to XHTML documents delivered as HTML and the XML rules apply to XHTML documents delivered as XML.
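The tbody hint can be sketched as follows. The selector and the cell text are assumptions made up for the example; the rule matches in both HTML and XML user agents only because the tbody element is written out explicitly.

```xml
<!-- In the head: lower-case names, selector refers to tbody -->
<style type="text/css">
  tbody td { color: blue; }
</style>

<!-- In the body: tbody is explicit, so an XML parser sees it too -->
<table>
  <tbody>
    <tr>
      <td>Cell matched by the "tbody td" selector</td>
    </tr>
  </tbody>
</table>
```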

style elements

In HTML 4 and XHTML, the style element can be used to define document-internal style rules; XML, instead, uses an XML stylesheet declaration. In order to be compatible with this convention, style elements should have their fragment identifier set using the id attribute, and an XML stylesheet declaration should reference this fragment. E.g.:

<?xml-stylesheet href="/css/mystyle.css" type="text/css"?>
<?xml-stylesheet href="#pageStyle" type="text/css"?>
<!DOCTYPE html 
     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>An internal stylesheet example</title>
<style type="text/css" id="pageStyle">
  p {
    color: blue;
    font-weight: bold;
  }
</style>
</head>
<body>
<p>
  Blue, bold paragraph.
</p>
</body>
</html>

Illegal Characters

Some characters that are legal in HTML are illegal in XML, e.g. the "formfeed" character (U+000C). On the other hand, the &apos; entity (the apostrophe, U+0027) was introduced in XML 1.0 but does not appear in HTML (to ensure compatibility, you should use &#39; instead).
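For instance, with an arbitrary sentence:

```xml
<!-- Understood by XML parsers, but not defined in HTML 4 -->
<p>It&apos;s a transitional language</p>

<!-- Numeric character reference: understood by both -->
<p>It&#39;s a transitional language</p>
```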

5.2 What the future holds: XHTML 1.1

XHTML 1.1 has removed all the features that were deprecated in HTML 4 and XHTML 1.0: therefore, the Transitional and Frameset document types are no longer available in XHTML 1.1. In general, the strategy of XHTML 1.1 is to define a markup language that is rich in structural functionality, but relies upon style sheets for presentation. The main differences can be summarized as follows:

of course the DOCTYPE has changed:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
    "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

the lang attribute has been completely removed in favour of the xml:lang attribute;

on the a and map elements, the name attribute has been removed in favour of the id attribute;

the "ruby" collection of elements has been added.
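Taken together, these differences could be sketched in a minimal XHTML 1.1 document like the one below. The title, the anchor and the ruby text are made up for illustration; the sketch only aims to show xml:lang used alone, id used instead of name, and a simple ruby annotation.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
    "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<!-- Only xml:lang: the lang attribute no longer exists -->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
  <head>
    <title>A minimal XHTML 1.1 sketch</title>
  </head>
  <body>
    <!-- Only id: the name attribute is gone from the a element -->
    <p><a id="top">Top of the page</a></p>
    <!-- Simple ruby markup: base text, annotation, fallback parentheses -->
    <p>
      <ruby>
        <rb>WWW</rb>
        <rp>(</rp><rt>World Wide Web</rt><rp>)</rp>
      </ruby>
    </p>
  </body>
</html>
```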