5. Between past and future

XHTML 1.0 is a transition language whose purpose is to port HTML 4 documents to XML. So let's take a look at the few steps needed to convert HTML 4 documents to XHTML 1.0. Though it may make the migration a little more tedious, we will keep in mind the direction taken by XHTML 1.1 throughout: this will save a lot of time during later migrations.

5.1 Migrating from HTML 4

Strictly conforming documents
The overall structures of an XHTML 1.0 document and an HTML 4 document are almost identical. We have already pointed out the main differences; the structure of the document therefore becomes:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <!-- Headers -->
  </head>
  <body>
    <!-- Document body -->
  </body>
</html>
Because of the leading XML declaration, some older browsers may handle XHTML documents as generic XML and display them unpredictably. To remain backward compatible with those user agents, you can omit the XML declaration, but you are then restricted to the default UTF-8 or UTF-16 character encodings. Note also that Internet Explorer handles the very same document differently when the XML declaration is present (version 6, for instance, falls back to quirks mode when anything precedes the DOCTYPE).
XML and HTML specify the character encoding differently: the former, as we have seen, specifies it inside the XML declaration, the latter inside the Content-Type HTTP header or a meta element. Therefore, if you want documents with non-default character encodings to be portable and you can't be sure that the web server sends the right HTTP headers, include both the XML declaration and the corresponding meta http-equiv element; e.g.:
<meta http-equiv="Content-type" content="text/html; charset=EUC-JP" />
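As a minimal sketch of the advice above, a document declaring a non-default encoding in both places might begin as follows. The EUC-JP value is taken from the example above; the language code, title and body text are placeholders chosen only for illustration.

```html
<?xml version="1.0" encoding="EUC-JP"?>
<!DOCTYPE html
     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ja" lang="ja">
  <head>
    <!-- Same encoding as in the XML declaration, repeated for HTML user agents -->
    <meta http-equiv="Content-type" content="text/html; charset=EUC-JP" />
    <title>Placeholder title</title>
  </head>
  <body>
    <p>Placeholder text</p>
  </body>
</html>
```

The two declarations must name the same encoding, and the file must actually be saved in it.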
Correct nesting
Elements must nest properly; overlapping is illegal in SGML too (and consequently in HTML), but it is widely tolerated in existing browsers.
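To illustrate the nesting rule, here is a small sketch with made-up sentence text. The first paragraph is the kind of overlap that HTML browsers usually tolerate; the second is the properly nested form.

```html
<!-- Incorrect: em is closed while strong is still open -->
<p>This is <em>emphasized <strong>text</em> here</strong>.</p>

<!-- Correct: strong is closed before em -->
<p>This is <em>emphasized <strong>text</strong></em> here.</p>
```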
Element and attribute names must be in lower case
Unlike HTML, XML is case-sensitive and all element and attribute names defined in XHTML are lower case.
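For instance, a link written in the upper-case style common in older HTML pages has to be rewritten as shown in this sketch (the index.html target is a hypothetical file name):

```html
<!-- HTML 4 style: upper-case names, not valid in XHTML -->
<A HREF="index.html">Home</A>

<!-- XHTML 1.0: lower-case element and attribute names -->
<a href="index.html">Home</a>
```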
End tags
In HTML 4 certain elements were permitted to omit the end tag (e.g. the <p> element). In XHTML 1.0, instead, all non-EMPTY elements must have an end tag. EMPTY elements must either have an end tag or a slash before the closing angle bracket. E.g.:
<hr></hr>
<hr />
The first syntax may give uncertain results in many existing user agents. Therefore, the second is to be preferred; the (optional) space before the trailing slash is recommended for compatibility with HTML browsers.
The second syntax should not be used for non-EMPTY elements, even if their content is empty: it is well-formed XML, but HTML user agents may read it as a start tag that is never closed. An empty paragraph should be written as <p></p> rather than like this:
<p />
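The end tag rules above can be sketched as a before/after pair. The text and the logo.png file name are placeholders.

```html
<!-- HTML 4: end tags omitted -->
<p>First line<br>second line
<p><img src="logo.png" alt="Logo">

<!-- XHTML 1.0: non-EMPTY elements closed, EMPTY elements minimized -->
<p>First line<br />second line</p>
<p><img src="logo.png" alt="Logo" /></p>

<!-- An element that merely happens to be empty keeps the expanded form -->
<p></p>
```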
Quoting
Attribute values must always be quoted, even those which appear to be numeric.
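A short sketch of the quoting rule, using a hypothetical table cell:

```html
<!-- HTML 4: unquoted values are tolerated -->
<td rowspan=3 class=total>42</td>

<!-- XHTML 1.0: every value is quoted, numeric or not -->
<td rowspan="3" class="total">42</td>
```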
Attribute Minimization
XML doesn't support attributes with no value (like, for instance, compact or checked in HTML). The correct syntax is:
<input type="checkbox" checked="checked" />
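The same rule applies to the other boolean attributes of HTML 4. A short sketch, in which the field names and values are hypothetical:

```html
<!-- HTML 4: minimized attributes -->
<option value="en" selected>English</option>
<input type="text" name="user" disabled>

<!-- XHTML 1.0: each attribute repeats its own name as the value -->
<option value="en" selected="selected">English</option>
<input type="text" name="user" disabled="disabled" />
```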
White Space handling in attribute values
When user agents process attributes, they strip leading and trailing white space and they map sequences of one or more white space characters (including line breaks) to a single inter-word space. That's the theory. In practice, you'd better avoid line breaks and multiple white space characters within attribute values, because user agents handle them inconsistently.
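As an illustration of the kind of attribute value to avoid, consider this sketch (file name and text are placeholders):

```html
<!-- Risky: a line break and repeated spaces inside the attribute value -->
<img src="logo.png" alt="Company
     logo" />

<!-- Safer: one line, single spaces -->
<img src="logo.png" alt="Company logo" />
```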
script and style elements
In XHTML, the script and style elements are declared as having #PCDATA content. As a result, < and & will be treated as the start of markup, and entities, such as &lt; and &amp;, will be expanded to the corresponding characters. You can avoid this by wrapping the content of the script and style elements within a CDATA marked section:
<script type="text/javascript">
<![CDATA[
[...]
]]>
</script>
or by using external script and style documents. Also keep in mind that XML parsers may remove comments. Therefore, the good old habit of "hiding" scripts and styles inside comments may not work in XML-based user agents.
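A commonly used variant, not part of the rules above and shown here only as a sketch, places the CDATA delimiters behind script comments, so that HTML browsers see ordinary comment lines while XML parsers still see a CDATA section. The isSmall function is a hypothetical example.

```html
<script type="text/javascript">
//<![CDATA[
function isSmall(a, b) {
  // "<" and "&&" are safe here: an XML parser reads the section as character data
  return a < b && b < 10;
}
//]]>
</script>
```

Moving the code to an external file remains the simpler option when the script is more than a few lines long.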
Elements with id and name attributes
In HTML 4, some elements (a, applet, form, frame, iframe, img and map) can use two different attributes as fragment identifiers: name and id. In XHTML 1.0, instead, only the id attribute is a legal fragment identifier, while the name attribute is formally deprecated and will be removed in XHTML 1.1.
In handling URI-references that end with fragment identifiers (e.g. "#foo"), XML and HTML differ: the former requires elements with an id attribute, the latter elements with a name attribute. Therefore, to ensure maximum forward and backward compatibility, identical values may be supplied for both of these attributes:
<a id="foo" name="foo">
When defining fragment identifiers to be backward-compatible, only strings matching the pattern [A-Za-z][A-Za-z0-9:_.-]* should be used.
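The following sketch puts the two recommendations together; the identifier values and link text are made up for the example.

```html
<!-- Target: same value in id and name, matching [A-Za-z][A-Za-z0-9:_.-]* -->
<h2><a id="section-2" name="section-2">Section 2</a></h2>

<!-- Link pointing at the target -->
<p><a href="#section-2">Jump to section 2</a></p>

<!-- To be avoided: the value starts with a digit, so it does not match the pattern -->
<h2><a id="2nd-section" name="2nd-section">Section 2</a></h2>
```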
Attributes with pre-defined value sets
Some attributes, called enumerated attributes in XML and SGML, have pre-defined and limited sets of values (e.g. the type attribute of an input element). In XHTML 1.0, the interpretation of these values is case-sensitive and all values are lower-case.
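For example, with hypothetical field and cell contents:

```html
<!-- HTML 4: enumerated values are matched case-insensitively -->
<input type="TEXT" name="user">
<td valign="TOP">Total</td>

<!-- XHTML 1.0: only the lower-case values are accepted -->
<input type="text" name="user" />
<td valign="top">Total</td>
```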
Entity references as hex values
SGML and XML both permit references to characters by using hexadecimal values. In SGML, these references could be made by using either the &#Xnn; or the &#xnn; notation. In XML documents, instead, the only option is the lower-case version (i.e. &#xnn;).
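A brief sketch using the copyright sign (U+00A9); the surrounding text is a placeholder:

```html
<!-- Upper-case X: accepted in SGML, rejected by XML parsers -->
<p>&#XA9; Example author</p>

<!-- Lower-case x: accepted in both -->
<p>&#xA9; Example author</p>
```

Only the x prefix is affected; the hexadecimal digits themselves may be written in either case.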
isindex
You can't include more than one isindex element in the document head. In any case, the isindex element is deprecated in favor of the input element.
The lang and xml:lang attributes
To ensure maximum compatibility, it is recommended to use both these attributes to specify the language of an element. If both are present, the value of the xml:lang attribute takes precedence.
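A minimal sketch, assuming an English document that quotes an Italian phrase (the sentences are placeholders):

```html
<!-- Both attributes carry the same language code -->
<p xml:lang="en" lang="en">
  The greeting <span xml:lang="it" lang="it">Buongiorno a tutti</span> means "good morning, everyone".
</p>
```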
Ampersand
In both SGML and XML, the ampersand character ("&") always declares the beginning of an entity reference. Therefore, all ampersands used in a document that are to be treated as literal characters must be expressed themselves as an entity reference (i.e. "&amp;"), including ampersands inside the href attribute of an a element; e.g.:
<a href="http://www.kernel-panic.it/myscript.php?id=123456&amp;name=user">
CSS2 and XHTML
CSS2 applies to both HTML and XML documents, though differences in parsing may produce different visual or aural results. A few hints will help you to decrease this risk:
style elements
In HTML 4 and XHTML, the style element can be used to define document-internal style rules; XML, instead, uses an XML stylesheet declaration. In order to be compatible with this convention, style elements should have their fragment identifier set using the id attribute, and an XML stylesheet declaration should reference this fragment. E.g.:
<?xml-stylesheet href="/css/mystyle.css" type="text/css"?>
<?xml-stylesheet href="#pageStyle" type="text/css"?>
<!DOCTYPE html 
     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>An internal stylesheet example</title>
<style type="text/css" id="pageStyle">
  p {
    color: blue;
    font-weight: bold;
  }
</style>
</head>
<body>
<p>
  Blue, bold paragraph.
</p>
</body>
</html>
Illegal Characters
Some characters that are legal in HTML are illegal in XML, e.g. the "formfeed" character (U+000C). Conversely, the &apos; entity (the apostrophe, U+0027) was introduced in XML 1.0 but does not exist in HTML; to ensure compatibility, use &#39; instead.
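For instance, an apostrophe inside a single-quoted attribute value could be written as in this sketch (the title and paragraph text are hypothetical):

```html
<!-- Understood by XML parsers, but &apos; is not defined in HTML 4 -->
<p title='Dante&apos;s Inferno'>First canto</p>

<!-- Portable: numeric character reference -->
<p title='Dante&#39;s Inferno'>First canto</p>
```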

5.2 What the future holds: XHTML 1.1

XHTML 1.1 has removed all the features that were deprecated in HTML 4 and XHTML 1.0: therefore, the Transitional and Frameset document types are no longer available in XHTML 1.1. In general, the strategy of XHTML 1.1 is to define a markup language that is rich in structural functionality, but relies upon style sheets for presentation. The main differences can be summarized as follows: