Characters and their Corresponding Glyphs
In XML [XML10], textual content is defined in terms of a sequence of XML characters, where each character is defined by a particular Unicode code point. Fonts, on the other hand, consist of a collection of glyphs and other associated information, such as font tables. A glyph is a presentable form of one or more characters (or a part of a character in some cases). Each glyph consists of some sort of identifier (in some cases a string, in other cases a number) along with drawing instructions for rendering that particular glyph.
In many cases, there is a one-to-one mapping of Unicode characters (i.e., Unicode code points) to glyphs in a font. For example, it is common for a font designed for Latin languages (where the term Latin is used for European languages such as English with alphabets similar to and/or derivative to the Latin language) to contain a single glyph for each of the standard ASCII characters (i.e., A-to-Z, a-to-z, 0-to-9, plus the various punctuation characters found in ASCII). Thus, in most situations, the string “XML”, which consists of three Unicode characters, would be rendered by the three glyphs corresponding to “X”, “M” and “L”, respectively.
In various other cases, however, there is not a strict one-to-one mapping of Unicode characters to glyphs.
Some of the circumstances when the mapping is not one-to-one:
- Ligatures – For best looking typesetting, it is often desirable that particular sequences of characters are rendered as a single glyph. An example is the word “office”. Many fonts will define an “ffi” ligature. When the word “office” is rendered, sometimes the user agent will render the glyph for the “ffi” ligature instead of rendering distinct glyphs (i.e., “f”, “f” and “i”) for each of the three characters. Thus, for ligatures, multiple Unicode characters map to a single glyph. (Note that for proper rendering of some languages, ligatures are required for certain character combinations.)
- Composite characters – In various situations, commonly used adornments such as diacritical marks will be stored once in a font as a particular glyph and then composed with one or more other glyphs to result in the desired character. For example, it is possible that a font engine might render the é character by first rendering the glyph for e and then rendering the glyph for ´ (the accent mark) such that the accent mark will appear over the e. In this situation, a single Unicode character maps to multiple glyphs.
- Glyph substitution – Some typography systems examine the nature of the textual content and utilize different glyphs in different circumstances. For example, in Arabic, the same Unicode character might render as any of four different glyphs, depending on such factors as whether the character appears at the start, the end or the middle of a sequence of cursively joined characters. Different glyphs might be used for a punctuation character depending on inline-progression-direction (e.g., horizontal vs. vertical). In these situations, a single Unicode character might map to one of several alternative glyphs.
- In some languages, particular sequences of characters will be converted into multiple glyphs such that parts of a particular character are in one glyph and the remainder of that character is in another glyph.
- Alternative glyph specification – SVG contains a facility for the author to explicitly specify that a particular sequence of Unicode characters is to be rendered using a particular glyph. (See Alternate glyphs.) When this facility is used, multiple Unicode characters map to a single glyph.