Characters (Gauche Users’ Reference)

6.9 Characters

Builtin Class: <char>

Reader Syntax: #\charname

[R7RS+] Denotes a literal character.

When the reader reads #\, it fetches the next character. If that character is one of ()[]{}" \|;#, the literal denotes that character itself. Otherwise, the reader keeps reading characters until it sees a character that is not word-constituent. If only one character was read, the literal denotes that character. Otherwise, the reader matches the characters it read against the predefined character names. If no name matches, an error is signaled.

The following character names are recognized. These character names are case insensitive.

space

Whitespace (ASCII #x20)

newline, nl, lf

Newline (ASCII #x0a)

return, cr

Carriage return (ASCII #x0d)

tab, ht

Horizontal tab (ASCII #x09)

page

Form feed (ASCII #x0c)

alarm

Bell (ASCII #x07)

backspace

Backspace (ASCII #x08)

escape, esc

Escape (ASCII #x1b)

delete, del

Delete (ASCII #x7f)

null

NUL character (ASCII #x00)

xN

A character whose Unicode codepoint is the integer N, where N is a hexadecimal integer. This is R7RS lexical syntax. (See the compatibility note below.)

uN

A character whose Unicode codepoint is the integer N, where N is a 4-digit or 8-digit hexadecimal number.

This is legacy Gauche lexical syntax. Use the \xN syntax in new code. (See the compatibility note below.)

#\newline ⇒ #\newline ; newline character
#\x0a     ⇒ #\newline ; ditto
#\x41     ⇒ #\A       ; ASCII letter ’A’
#\x3042   ⇒ #\あ      ; Hiragana letter A
#\x2a6b2  ⇒ #\𪚲      ; JISX0213 Kanji 2-94-86

Compatibility note: Before 0.9.4, the \xNN syntax used Gauche’s internal character encoding rather than the Unicode codepoint. Both are the same if Gauche is compiled with internal encoding utf-8 or none (if it’s none, only characters up to U+00ff are supported, and in this range the characters are the same as Unicode characters). If Gauche is compiled with encoding euc-jp or sjis, the meaning of \xNN beyond the ASCII range differs from 0.9.3.3 and earlier.

If you set the reader mode to legacy (see Reader lexical mode), #\xNN is read as before, which keeps compatibility with old code (but isn’t compatible with R7RS). Alternatively, you can use #\uNNNN, or the character itself, to make the code work in both new and old versions of Gauche.
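The reader rules above can be tried at the REPL. The following is a small sketch, assuming a current Gauche in its default reader mode; it only uses names listed in this section, and the result comments follow from the descriptions above.

```scheme
;; Different spellings of the same character compare equal.
(char=? #\newline #\nl #\lf #\x0a)   ; ⇒ #t  (all denote U+000A)
(char=? #\tab #\ht #\x09)            ; ⇒ #t
(char=? #\SPACE #\space #\x20)       ; ⇒ #t  (names are case insensitive)

;; A delimiter right after #\ denotes that delimiter itself.
(char=? #\( (integer->char 40))      ; ⇒ #t

;; Only one character read: it is the letter x, not a hex escape.
(char=? #\x (integer->char 120))     ; ⇒ #t
```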

Function: char? obj

[R7RS base] Returns #t if obj is a character, #f otherwise.

Function: char=? char1 char2 char3 …

Function: char<? char1 char2 char3 …

Function: char<=? char1 char2 char3 …

Function: char>? char1 char2 char3 …

Function: char>=? char1 char2 char3 …

[R7RS base] Compares characters. Character comparison is done in internal character encoding.
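As an illustration, the comparison procedures accept more than two arguments and check that the whole sequence is ordered. The ASCII characters used here have the same codes in every internal encoding described in this section, so the results below do not depend on how Gauche was compiled.

```scheme
(char=? #\a #\a #\a)    ; ⇒ #t
(char<? #\a #\b #\c)    ; ⇒ #t  (strictly increasing)
(char<? #\a #\c #\b)    ; ⇒ #f  (#\c is not less than #\b)
(char<? #\A #\a)        ; ⇒ #t  (code 65 is less than code 97)
(char>=? #\z #\z #\a)   ; ⇒ #t  (non-increasing)
```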

Function: char-ci=? char1 char2 char3 …

Function: char-ci<? char1 char2 char3 …

Function: char-ci<=? char1 char2 char3 …

Function: char-ci>? char1 char2 char3 …

Function: char-ci>=? char1 char2 char3 …

[R7RS char] Compares characters in a case-insensitive way. Each character is first case-folded, and the comparison is then done on the internal character codes of the folded characters; see char-foldcase below.

In R7RS, these procedures are in the (scheme char) library.
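A short sketch contrasting the case-sensitive and case-insensitive comparisons. It relies only on ASCII letters, whose folded forms are the lower case letters.

```scheme
(char=? #\a #\A)       ; ⇒ #f
(char-ci=? #\a #\A)    ; ⇒ #t  (both fold to #\a)

(char<? #\a #\B)       ; ⇒ #f  (code 97 is not less than code 66)
(char-ci<? #\a #\B)    ; ⇒ #t  (compares #\a with #\b after folding)
```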

Function: char-alphabetic? char

Function: char-numeric? char

Function: char-whitespace? char

Function: char-upper-case? char

Function: char-lower-case? char

Function: char-title-case? char

[R7RS char][SRFI-129] Returns true if char is an alphabetic character (Unicode character category Lu, Ll, Lt, Lm, Lo, Nl), a numeric character (Unicode character category Nd), a whitespace character (Unicode character category Zs, Zp, Zl), an upper case character (Unicode character category Lu), a lower case character (Unicode character category Ll), or a title case character (Unicode character category Lt), respectively.

In R7RS, these procedures except char-title-case? are in the (scheme char) library, while char-title-case? is defined in SRFI-129.
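The predicates can be combined into a simple classifier. This is only a sketch: char-kind is a hypothetical helper name, not part of Gauche, and the title case example assumes the utf-8 internal encoding so that U+01C5 is available.

```scheme
;; Hypothetical helper: classify a character into a coarse kind.
(define (char-kind ch)
  (cond ((char-alphabetic? ch) 'letter)
        ((char-numeric? ch)    'digit)
        ((char-whitespace? ch) 'space)
        (else                  'other)))

(map char-kind (string->list "a1 ?"))   ; ⇒ (letter digit space other)

(char-upper-case? #\A)                  ; ⇒ #t
(char-lower-case? #\A)                  ; ⇒ #f
;; U+01C5 (LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON), category Lt
(char-title-case? #\x01c5)              ; ⇒ #t
```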

Function: char-general-category char

[R6RS] Returns one of the following symbols, representing the Unicode general category of char.

`Cc`	Other, Control
`Cf`	Other, Format
`Cn`	Other, Not Assigned
`Co`	Other, Private Use
`Cs`	Other, Surrogate
`Ll`	Letter, Lowercase
`Lm`	Letter, Modifier
`Lo`	Letter, Other
`Lt`	Letter, Titlecase
`Lu`	Letter, Uppercase
`Mc`	Mark, Spacing Combining
`Me`	Mark, Enclosing
`Mn`	Mark, Nonspacing
`Nd`	Number, Decimal Digit
`Nl`	Number, Letter
`No`	Number, Other
`Pc`	Punctuation, Connector
`Pd`	Punctuation, Dash
`Pe`	Punctuation, Close
`Pf`	Punctuation, Final quote
`Pi`	Punctuation, Initial quote
`Po`	Punctuation, Other
`Ps`	Punctuation, Open
`Sc`	Symbol, Currency
`Sk`	Symbol, Modifier
`Sm`	Symbol, Math
`So`	Symbol, Other
`Zl`	Separator, Line
`Zp`	Separator, Paragraph
`Zs`	Separator, Space

If Gauche is compiled with the euc-jp or sjis encoding, some characters don’t have a corresponding single Unicode codepoint (each of them is represented by one Unicode character plus one Unicode modifier character). A provisional category is assigned to those characters. If a future version of Unicode incorporates these characters, the category may be reassigned.

SJIS	EUC	Cat	Unicode
`82F5`	`A4F7`	`Lo`	`U+304B U+309A` (Semi-voiced Hiragana KA)
`82F6`	`A4F8`	`Lo`	`U+304D U+309A` (Semi-voiced Hiragana KI)
`82F7`	`A4F9`	`Lo`	`U+304F U+309A` (Semi-voiced Hiragana KU)
`82F8`	`A4FA`	`Lo`	`U+3051 U+309A` (Semi-voiced Hiragana KE)
`82F9`	`A4FB`	`Lo`	`U+3053 U+309A` (Semi-voiced Hiragana KO)
`8397`	`A5F7`	`Lo`	`U+30AB U+309A` (Semi-voiced Katakana KA)
`8398`	`A5F8`	`Lo`	`U+30AD U+309A` (Semi-voiced Katakana KI)
`8399`	`A5F9`	`Lo`	`U+30AF U+309A` (Semi-voiced Katakana KU)
`839A`	`A5FA`	`Lo`	`U+30B1 U+309A` (Semi-voiced Katakana KE)
`839B`	`A5FB`	`Lo`	`U+30B3 U+309A` (Semi-voiced Katakana KO)
`839C`	`A5FC`	`Lo`	`U+30BB U+309A` (Semi-voiced Katakana SE)
`839D`	`A5FD`	`Lo`	`U+30C4 U+309A` (Semi-voiced Katakana TSU)
`839E`	`A5FE`	`Lo`	`U+30C8 U+309A` (Semi-voiced Katakana TO)
`83F6`	`A6F8`	`Lo`	`U+31F7 U+309A` (Semi-voiced small Katakana FU)
`8663`	`ABC4`	`Ll`	`U+00E6 U+0300` (Accented latin small ae)
`8667`	`ABC8`	`Ll`	`U+0254 U+0300` (Accented latin small open o)
`8668`	`ABC9`	`Ll`	`U+0254 U+0301` (Accented latin small open o)
`8669`	`ABCA`	`Ll`	`U+028C U+0300` (Accented latin small turned v)
`866A`	`ABCB`	`Ll`	`U+028C U+0301` (Accented latin small turned v)
`866B`	`ABCC`	`Ll`	`U+0259 U+0300` (Accented latin small schwa)
`866C`	`ABCD`	`Ll`	`U+0259 U+0301` (Accented latin small schwa)
`866D`	`ABCE`	`Ll`	`U+025A U+0300` (Accented latin small schwa w/hook)
`866E`	`ABCF`	`Ll`	`U+025A U+0301` (Accented latin small schwa w/hook)
`8685`	`ABE5`	`Sk`	`U+02E9 U+02E5`
`8686`	`ABE6`	`Sk`	`U+02E5 U+02E9`
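For ordinary characters, char-general-category can be explored directly. The categories below are the ones Unicode assigns to these ASCII characters; the result is a symbol, so it can be compared with eq? or used in a case expression.

```scheme
(char-general-category #\a)         ; ⇒ Ll
(char-general-category #\A)         ; ⇒ Lu
(char-general-category #\5)         ; ⇒ Nd
(char-general-category #\space)     ; ⇒ Zs
(char-general-category #\newline)   ; ⇒ Cc
(char-general-category #\()         ; ⇒ Ps
(char-general-category #\+)         ; ⇒ Sm
(char-general-category #\$)         ; ⇒ Sc
```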

Function: char->integer char

Function: integer->char n

[R7RS base] char->integer returns an exact integer that represents the internal encoding of the character char. integer->char returns the character whose internal encoding is the exact integer n. The following expression is always true for a valid character char:

(eq? char (integer->char (char->integer char)))

Note: R7RS defines these procedures to deal with Unicode codepoints. Gauche complies with this when compiled with the utf-8 or none internal encoding (for the latter, only characters up to U+00ff are supported). If Gauche is compiled with the euc-jp or sjis internal encoding, you need to use char->ucs/ucs->char below to convert between Unicode codepoints and characters.

The result is undefined if you pass integer->char an integer n that doesn’t have a corresponding character.
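A brief sketch of the round trip. The first three results hold for any internal encoding, since they only involve ASCII; the last one assumes Gauche is compiled with the utf-8 internal encoding, where the integer is the Unicode codepoint.

```scheme
(char->integer #\A)                           ; ⇒ 65
(integer->char 97)                            ; ⇒ #\a
(integer->char (+ 1 (char->integer #\a)))     ; ⇒ #\b

;; Assumes utf-8 internal encoding: the value is the codepoint U+3042.
(char->integer #\x3042)                       ; ⇒ 12354
```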

Function: char->ucs char

Function: ucs->char n

Converts a character char to integer UCS codepoint, and integer UCS codepoint n to a character, respectively.

If Gauche is compiled with UTF-8 encoding, these procedures are the same as char->integer and integer->char.

When Gauche’s internal encoding differs from UTF-8, these procedures implicitly load the gauche.charconv module to convert between the internal character code and UCS (see Character code conversion). If char doesn’t have a corresponding UCS codepoint, char->ucs returns #f. If the UCS codepoint n can’t be represented in the internal character encoding, ucs->char returns #f, unless the conversion routine provides a substitution character.
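Because ucs->char may return #f, portable code may want a fallback. The sketch below uses a hypothetical helper, ucs->char-or, which is not part of Gauche; the commented results assume a build where the codepoints shown are representable (for example, utf-8).

```scheme
;; Hypothetical helper: convert a UCS codepoint to a character,
;; substituting FALLBACK when the codepoint can't be represented.
(define (ucs->char-or code fallback)
  (or (ucs->char code) fallback))

(char->ucs #\A)                        ; ⇒ 65
(ucs->char #x41)                       ; ⇒ #\A
(char->ucs (ucs->char-or #x3042 #\?))  ; ⇒ 12354 when U+3042 is representable
```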

Function: char-upcase char

Function: char-downcase char

Function: char-titlecase char

Function: char-foldcase char

[R7RS char][SRFI-129] Returns the upper case, lower case, title case and folded case of char, respectively.

The mapping is done according to the Unicode-defined character-by-character case mapping whenever possible. If the native encoding doesn’t support the mapped character defined in Unicode, the operation becomes a no-op. If the native encoding is ’none’, we treat the characters as if they are Latin-1 (ISO-8859-1) characters. So, upcasing the Latin-1 character small y with diaeresis (U+00ff) yields capital y with diaeresis (U+0178) if the internal encoding is utf-8, but it is a no-op if the internal encoding is none.

R7RS defines char-upcase, char-downcase, and char-foldcase in the (scheme char) library, while char-titlecase is defined in SRFI-129. R6RS defines all of them.

The character-by-character case mapping doesn’t handle a character that maps to more than one character; a notable example is eszett (latin small letter sharp S, U+00df), which is mapped to two capital S’s in string context, but (char-upcase #\ß) returns #\ß. To get the full mapping, use string-upcase etc. in the gauche.unicode module (see Full string case conversion).
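The difference between per-character and full string case mapping can be seen in a few lines. This sketch assumes the utf-8 internal encoding; string-upcase comes from gauche.unicode as mentioned above.

```scheme
(use gauche.unicode)        ; provides the full-mapping string-upcase

(char-upcase #\a)           ; ⇒ #\A
(char-downcase #\A)         ; ⇒ #\a
(char-titlecase #\a)        ; ⇒ #\A
(char-foldcase #\A)         ; ⇒ #\a

;; Eszett (U+00DF) has no single-character upper case form ...
(char=? (char-upcase #\xdf) #\xdf)   ; ⇒ #t
;; ... but the string-level mapping expands it to two characters.
(string-upcase "straße")             ; ⇒ "STRASSE"
```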

Function: digit->integer char :optional (radix 10) (extended-range? #f)

If given character char is a valid digit character in radix radix number, the corresponding integer is returned. Otherwise #f is returned.

(digit->integer #\4) ⇒ 4
(digit->integer #\e 16) ⇒ 14
(digit->integer #\9 8) ⇒ #f

If the optional extended-range? argument is true, this procedure recognizes not only ASCII digits, but also all characters with Nd general category—such as FULLWIDTH DIGIT ZERO to NINE (U+ff10 - U+ff19).

R7RS has digit-value, which is equivalent to (digit->integer char 10 #t).

Note: Common Lisp has a similar function with a rather confusing name, digit-char-p.
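Since digit->integer returns #f for a non-digit, it composes naturally into a small parser. The following is a minimal sketch; parse-digits is a hypothetical helper, not a Gauche procedure, and the full-width digit example assumes an internal encoding that includes U+FF13 (such as utf-8).

```scheme
;; Hypothetical helper: the non-negative integer denoted by STR in RADIX,
;; or #f if STR contains a character that is not a digit in that radix.
;; An empty string yields 0.
(define (parse-digits str radix)
  (let loop ((chars (string->list str)) (acc 0))
    (cond ((null? chars) acc)
          ((digit->integer (car chars) radix)
           => (lambda (d) (loop (cdr chars) (+ (* acc radix) d))))
          (else #f))))

(parse-digits "ff" 16)              ; ⇒ 255
(parse-digits "19" 8)               ; ⇒ #f  (#\9 is not an octal digit)

;; FULLWIDTH DIGIT THREE is accepted only with extended-range? set.
(digit->integer #\xff13)            ; ⇒ #f
(digit->integer #\xff13 10 #t)      ; ⇒ 3
```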

Function: integer->digit integer :optional (radix 10) (basechar1 #\0) (basechar2 #\a)

Reverse operation of digit->integer. Returns a character that represents the number integer in the radix radix system. If integer is out of the valid range, #f is returned.

(integer->digit 13 16) ⇒ #\d
(integer->digit 10) ⇒ #f

The optional basechar1 argument specifies the character that stands for zero; by default, it’s #\0. You can give an alternative character, for example U+0660 (ARABIC-INDIC DIGIT ZERO), to convert an integer to an Arabic-Indic digit character.

The optional basechar2 argument specifies the character that stands for ten, and is used for integers 10 or greater. The default value is #\a. You can pass #\A to get upper-case hex digits, for example.

Note: The corresponding function in Common Lisp is digit-char.
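The optional arguments are easiest to see in use. In this sketch, integer->hex-string is a hypothetical helper built on integer->digit, not a Gauche procedure; the Arabic-Indic example assumes the utf-8 internal encoding.

```scheme
(integer->digit 10 16)               ; ⇒ #\a
(integer->digit 13 16 #\0 #\A)       ; ⇒ #\D  (upper-case hex digit)
(integer->digit 16 16)               ; ⇒ #f   (out of range for radix 16)

;; U+0660 is ARABIC-INDIC DIGIT ZERO, so 5 maps to U+0665.
(char->ucs (integer->digit 5 10 #\x0660))   ; ⇒ 1637 (#x665)

;; Hypothetical helper: upper-case hexadecimal string of a
;; non-negative exact integer N.
(define (integer->hex-string n)
  (let loop ((n n) (acc '()))
    (let ((acc (cons (integer->digit (remainder n 16) 16 #\0 #\A) acc)))
      (if (< n 16)
          (list->string acc)
          (loop (quotient n 16) acc)))))

(integer->hex-string 255)            ; ⇒ "FF"
(integer->hex-string 4096)           ; ⇒ "1000"
```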

Function: gauche-character-encoding

Returns a symbol that designates the native character encoding, which is selected at compile time. The possible return values are:

euc-jp: EUC-JP
utf-8: UTF-8
sjis: Shift JIS
none: No multibyte character support (8-bit fixed-length character).

To switch code at compile time according to the internal encoding, you can use the feature identifiers gauche.ces.* (see Platform-dependent features).
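For a decision made at run time rather than at compile time, the returned symbol can be dispatched on with case. This is a minimal sketch; encoding-description is a hypothetical helper, and the strings simply restate the list above.

```scheme
;; Hypothetical helper: a human-readable name for the native encoding.
(define (encoding-description)
  (case (gauche-character-encoding)
    ((utf-8)  "UTF-8")
    ((euc-jp) "EUC-JP")
    ((sjis)   "Shift JIS")
    ((none)   "No multibyte character support")
    (else     "unknown")))

(print "Native encoding: " (encoding-description))
```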

Function: supported-character-encodings

Returns a list of string names of character encoding schemes that are supported in the native multibyte encoding scheme.