Strings
Strings are finite sequences of characters. Of course, the real trouble comes when one asks what a character is. The characters that English speakers are familiar with are the letters A, B, C, etc., together with numerals and common punctuation symbols. These characters are standardized together with a mapping to integer values between 0 and 127 by the ASCII standard. There are, of course, many other characters used in non-English languages, including variants of the ASCII characters with accents and other modifications, related scripts such as Cyrillic and Greek, and scripts completely unrelated to ASCII and English, including Arabic, Chinese, Hebrew, Hindi, Japanese, and Korean. The Unicode standard tackles the complexities of what exactly a character is, and is generally accepted as the definitive standard addressing this problem. Depending on your needs, you can either ignore these complexities entirely and just pretend that only ASCII characters exist, or you can write code that can handle any of the characters or encodings that one may encounter when handling non-ASCII text.
Julia makes dealing with plain ASCII text simple and efficient, and handling Unicode is as simple and efficient as possible. In particular, you can write C-style string code to process ASCII strings, and it will work as expected, both in terms of performance and semantics. If such code encounters non-ASCII text, it will fail gracefully with a clear error message, rather than silently introducing corrupt results. When this happens, modifying the code to handle non-ASCII data is straightforward.
There are a few noteworthy high-level features of Julia’s strings:
- String is an abstraction, not a concrete type — many different representations can implement the String interface, but they can easily be used together and interact transparently. Any string type can be used in any function expecting a String.
- Like C and Java, but unlike most dynamic languages, Julia has a first-class type representing a single character, called Char. This is just a special kind of 32-bit integer whose numeric value represents a Unicode code point.
- As in Java, strings are immutable: the value of a String object cannot be changed. To construct a different string value, you construct a new string from parts of other strings.
- Conceptually, a string is a partial function from indices to characters — for some index values, no character value is returned, and instead an exception is thrown. This allows for efficient indexing into strings by the byte index of an encoded representation rather than by a character index, which cannot be implemented both efficiently and simply for variable-width encodings of Unicode strings.
- Julia supports the full range of Unicode characters: literal strings are always ASCII or UTF-8 but other encodings for strings from external sources can be supported easily and efficiently.
Characters
A Char value represents a single character: it is just a 32-bit integer with a special literal representation and appropriate arithmetic behaviors, whose numeric value is interpreted as a Unicode code point. Here is how Char values are input and shown:
julia> 'x'
'x'
julia> typeof(ans)
Char
julia> int('x')
120
julia> typeof(ans)
Int32
julia> char(120)
'x'
julia> char(0xd800)
'???'
julia> safe_char(0xd800)
invalid Unicode code point: U+d800
julia> char(0x110000)
'\U110000'
julia> safe_char(0x110000)
invalid Unicode code point: U+110000
You can input any Unicode character in single quotes using \u followed by up to four hexadecimal digits or \U followed by up to eight hexadecimal digits (the longest valid value only requires six):
julia> '\u0'
'\0'
julia> '\u78'
'x'
julia> '\u2200'
'∀'
julia> '\U10ffff'
'\U10ffff'
The standard C escape sequences can be used as well, including octal and \x hexadecimal forms:
julia> int('\0')
0
julia> int('\t')
9
julia> int('\n')
10
julia> int('\e')
27
julia> int('\x7f')
127
julia> int('\177')
127
julia> int('\xff')
255
Char values can be compared, and a limited amount of arithmetic can be performed on them:
julia> 'x' - 'a'
23
julia> 'A' < 'a'
true
julia> 'A' <= 'a' <= 'Z'
false
julia> 'A' <= 'X' <= 'Z'
true
julia> 'A' + 1
66
julia> char(ans)
'B'
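As a rough cross-check of the conversions above, here is a minimal sketch that assumes a current Julia 1.x release rather than the version this text was written against. In such a release the conversion functions are spelled Int and Char, and codepoint and isvalid are the assumed replacements for inspecting and validating code points; none of these spellings appear in the session above.

```julia
# Sketch assuming Julia 1.x; the names Int, Char, codepoint and isvalid
# are the current spellings, not the ones used in the session above.
code = Int('x')              # integer value of the code point
back = Char(120)             # build a Char from an integer
forall = codepoint('∀')      # the code point as a UInt32

println(code)                     # 120
println(back)                     # x
println(sizeof(Char))             # 4 bytes, i.e. 32 bits
println(isvalid(Char, 0xd800))    # false: surrogates are not valid code points
println(isvalid(Char, 0x110000))  # false: beyond U+10ffff
println('x' - 'a')                # 23
println('A' <= 'X' <= 'Z')        # true
println(Char(Int('A') + 1))       # B
```

The validity checks play the role that safe_char plays above: they report an invalid code point without constructing anything from it.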
String Basics
Here a variable is initialized with a simple string literal:
julia> str = "Hello, world.\n"
"Hello, world.\n"
julia> str[1]
'H'
julia> str[6]
','
julia> str[end]
'\n'
In any indexing expression, the keyword end can be used as shorthand for length(x), where x is the object being indexed into, whether it is a string, an array, or some other indexable object. You can perform arithmetic and other operations with end, just as with a normal value:
julia> str[end-1]
'.'
julia> str[end/2]
' '
julia> str[end/3]
'o'
julia> str[end/4]
'l'
julia> str[0]
in next: arrayref: index out of range
julia> str[end+1]
in next: arrayref: index out of range
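The same indexing rules can be sketched for a current Julia 1.x release, which is an assumption here since the session above comes from an earlier version. Two differences are assumed: an index must be an integer, so the sketch uses integer division (÷) where the session writes end/2, and an out-of-range index raises a BoundsError.

```julia
# Sketch assuming Julia 1.x: indexing with `end` and a bounds check.
str = "Hello, world.\n"

last_char   = str[end]       # '\n'
before_last = str[end-1]     # '.'
middle      = str[end ÷ 2]   # integer division keeps the index an integer

# Indexing outside 1:end raises an error instead of returning garbage.
out_of_range = try
    str[0]
    false
catch err
    err isa BoundsError
end

println(out_of_range)        # true
```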
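You can also extract a substring using range indexing. Note that indexing with a single integer yields a Char, whereas indexing with a range yields a string:
julia> str[4:9]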
"lo, wo"
julia> str[6]
','
julia> str[6:6]
","
Unicode and UTF-8
Julia fully supports Unicode characters and strings. As discussed above, in character literals, Unicode code points can be represented using Unicode \u and \U escape sequences, as well as all the standard C escape sequences. These can likewise be used to write string literals:
julia> s = "\u2200 x \u2203 y"
"∀ x ∃ y"
julia> s[1]
'∀'
julia> s[2]
invalid UTF-8 character index
julia> s[3]
invalid UTF-8 character index
julia> s[4]
' '
Because of variable-length encodings, strlen(s) and length(s) are not always the same: strlen(s) gives the number of characters in s while length(s) gives the maximum valid byte index into s. If you iterate through the indices 1 through length(s) and index into s, the sequence of characters returned, when errors aren’t thrown, is the sequence of characters comprising the string, s. Thus, we do have the identity that strlen(s) <= length(s) since each character in a string must have its own index. The following is an inefficient and verbose way to iterate through the characters of s:
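julia> for i = 1:length(s)
         try
           println(s[i])
         catch
           # ignore the index error
         end
       end
∀
x
∃
y
Fortunately, this awkward idiom is unnecessary: a string can be used directly as an iterable object, with no exception handling required:
julia> for c = s
         println(c)
       end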
∀
x
∃
y
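To make the relationship between byte indices and characters concrete, here is a small sketch that assumes a current Julia 1.x release. The names are an assumption relative to this text: in such a release length counts characters (the role strlen plays above), lastindex or ncodeunits gives the largest byte index, and eachindex, isvalid and nextind expose the valid indices.

```julia
# Sketch assuming Julia 1.x; the function names differ from those used above.
s = "\u2200 x \u2203 y"

nchars = length(s)                  # 7 characters
nbytes = ncodeunits(s)              # 11 bytes of UTF-8
starts = collect(eachindex(s))      # byte indices where a character begins

inside = isvalid(s, 2)              # false: index 2 is inside the encoding of '∀'
after_forall = nextind(s, 1)        # 4: the next valid index after 1

println(starts)                     # [1, 4, 5, 6, 7, 10, 11]

# Iterating the string visits each character once, with no index errors.
for c in s
    println(c)
end
```

The two three-byte characters account for the gaps in the index list, which is why the character count never exceeds the largest byte index.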
Interpolation
One of the most common and useful string operations is concatenation:
julia> greet = "Hello"
"Hello"
julia> whom = "world"
"world"
julia> strcat(greet, ", ", whom, ".\n")
"Hello, world.\n"
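Constructing strings this way can become cumbersome, so Julia also allows interpolation into string literals using $:
julia> "$greet, $whom.\n"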
"Hello, world.\n"
The shortest complete expression after the $ is taken as the expression whose value is to be interpolated into the string. Thus, you can interpolate any expression into a string using parentheses:
julia> "1 + 2 = $(1 + 2)"
"1 + 2 = 3"
julia> x = 2; y = 3; z = 5;
julia> "x,y,z: $[x,y,z]."
"x,y,z: [2,3,5]."
julia> v = [1,2,3]
[1,2,3]
julia> "v: $v"
"v: [1,2,3]"
julia> c = 'x'
'x'
julia> "hi, $c"
"hi, x"
To include a literal $ in a string literal, escape it with a backslash:
julia> print("I have \$100 in my account.\n")
I have $100 in my account.
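The equivalence between concatenation and interpolation can be checked with a short sketch. It assumes a current Julia 1.x release, where the concatenation function is assumed to be spelled string (and the * operator also joins strings) rather than strcat as above; the interpolation syntax is the same.

```julia
# Sketch assuming Julia 1.x: `string` and `*` stand in for strcat here.
greet = "Hello"
whom = "world"

joined  = string(greet, ", ", whom, ".\n")   # concatenation by function call
starred = greet * ", " * whom * ".\n"        # concatenation with an operator
interp  = "$greet, $whom.\n"                 # interpolation

sum_text = "1 + 2 = $(1 + 2)"                # any expression in parentheses
money = "I have \$100 in my account."        # escaped $ is not interpolated

println(joined == interp)    # true
println(sum_text)            # 1 + 2 = 3
println(money)               # I have $100 in my account.
```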
Common Operations
You can lexicographically compare strings using the standard comparison operators:
julia> "abracadabra" < "xylophone"
true
julia> "abracadabra" == "xylophone"
false
julia> "Hello, world." != "Goodbye, world."
true
julia> "1 + 2 = 3" == "1 + 2 = $(1 + 2)"
true
You can search for the index of a particular character using the strchr function:
julia> strchr("xylophone", 'x')
1
julia> strchr("xylophone", 'p')
5
julia> strchr("xylophone", 'z')
char not found
julia> strchr("xylophone", 'o')
4
You can start the search for a character at a given offset by providing a third argument:
julia> strchr("xylophone", 'o', 5)
7
julia> strchr("xylophone", 'o', 8)
char not found
Another handy string function is repeat:
julia> repeat(".:Z:.", 10)
".:Z:..:Z:..:Z:..:Z:..:Z:..:Z:..:Z:..:Z:..:Z:..:Z:."
Some other useful functions include:
- length(str) gives the maximal (byte) index that can be used to index into str.
- strlen(str) gives the number of characters in str; this is not necessarily the same as length(str).
- i = start(str) gives the first valid index at which a character can be found in str (typically 1).
- c, j = next(str,i) returns the next character at or after the index i and the next valid character index following that. Together with start and length, it can be used to iterate through the characters in str. With length and start, it can also be used to iterate through the characters in str in reverse.
- ind2chr(str,i) gives the number of characters in str up to and including any at index i.
- chr2ind(str,j) gives the index at which the jth character in str occurs.
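For readers on a current Julia 1.x release, the searches and index conversions listed above can be approximated as follows. This is only a sketch under that assumption: findfirst and findnext with an isequal predicate stand in for strchr and return nothing when the character is absent, length(s, i, j) stands in for ind2chr, and nextind(s, 0, j) stands in for chr2ind. None of these spellings come from this text.

```julia
# Sketch assuming Julia 1.x; these calls are stand-ins for the functions above.
word = "xylophone"

p_at   = findfirst(isequal('p'), word)       # 5
o_next = findnext(isequal('o'), word, 5)     # 7: search starts at index 5
z_at   = findfirst(isequal('z'), word)       # nothing: no such character
o_late = findnext(isequal('o'), word, 8)     # nothing: no 'o' at or after 8

s = "\u2200 x"                               # '∀' occupies byte indices 1 to 3
char_count = length(s, 1, 4)                 # 2 characters at indices 1 through 4
second_at  = nextind(s, 0, 2)                # 4: byte index of the 2nd character

dots = repeat(".:Z:.", 2)                    # ".:Z:..:Z:."

println((p_at, o_next, z_at, char_count, second_at))
```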
Non-Standard String Literals
There are situations when you want to construct a string or use string semantics, but the behavior of the standard string construct is not quite what is needed. For these kinds of situations, Julia provides non-standard string literals. A non-standard string literal looks like a regular double-quoted string literal, but is immediately prefixed by an identifier, and doesn’t behave quite like a normal string literal.
Two types of interpretation are performed on normal Julia string literals: interpolation and unescaping (escaping is the act of expressing a non-standard character with a sequence like \n, whereas unescaping is the process of interpreting such escape sequences as actual characters). There are cases where it’s convenient to disable either or both of these behaviors. For such situations, Julia provides three types of non-standard string literals:
- E"..." interpret escape sequences but do not interpolate, thereby rendering $ a harmless, normal character.
- I"..." perform interpolation but do not interpret escape sequences specially.
- L"..." perform neither unescaping nor interpolation.
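The fully literal form can be tried in a current Julia 1.x release, with one assumption to flag: there the built-in literal with neither unescaping nor interpolation is spelled raw"...", and the E, I and L prefixes described above are not assumed to be available. The sketch below compares a raw literal with the equivalent escaped ordinary literal.

```julia
# Sketch assuming Julia 1.x, where raw"..." is the assumed counterpart of
# the fully literal form described above.
raw_text = raw"I have $100 in my account.\n"
escaped  = "I have \$100 in my account.\\n"

println(raw_text)                    # I have $100 in my account.\n
println(raw_text == escaped)         # true
println(endswith(raw_text, "\\n"))   # true: a backslash and an n, not a newline
```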
For example, the E"..." form lets you write $ characters without escaping them:
julia> E"I have $100 in my account.\n"
"I have \$100 in my account.\n"
On the other hand, I"..." string literals perform interpolation but no unescaping:
julia> I"I have $100 in my account.\n"
"I have 100 in my account.\\n"
The third non-standard string form interprets all the characters between the opening and closing quotes literally: the L"..." form. Here is an example usage:
julia> L"I have $100 in my account.\n"
"I have \$100 in my account.\\n"
Byte Array Literals
Some string literal forms don’t create strings at all. In the next section, we will see that regular expressions are written as non-standard string literals. Another useful non-standard string literal, however, is the byte-array string literal: b"...". This form lets you use string notation to express literal byte arrays, i.e. arrays of Uint8 values. The convention is that non-standard literals with uppercase prefixes produce actual string objects, while those with lowercase prefixes produce non-string objects like byte arrays or compiled regular expressions. The rules for byte array literals are the following:
- ASCII characters and ASCII escapes produce a single byte.
- \x and octal escape sequences produce the byte corresponding to the escape value.
- Unicode escape sequences produce a sequence of bytes encoding that code point in UTF-8.
julia> b"DATA\xff\u2200"
[68,65,84,65,255,226,136,128]
julia> "DATA\xff\u2200"
syntax error: invalid UTF-8 sequence
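Note the distinction between \xff and \uff: the former escape produces the single byte 255, whereas the latter is a Unicode escape for code point 255, which is encoded as two bytes in UTF-8:
julia> b"\xff"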
[255]
julia> b"\uff"
[195,191]
If this is all extremely confusing, try reading “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets”. It’s an excellent introduction to Unicode and UTF-8, and may help alleviate some confusion regarding the matter.
In byte array literals, objects interpolate as their binary representation rather than as their string representation:
julia> msg = "Hello."
"Hello."
julia> len = uint16(length(msg))
6
julia> b"$len$msg"
[6,0,72,101,108,108,111,46]
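The three rules for byte array literals can be checked with a short sketch. It assumes a current Julia 1.x release, where the element type is spelled UInt8 and the literal yields a read-only byte view; collect is used here to obtain a plain vector for comparison. Interpolation into byte array literals is not exercised.

```julia
# Sketch assuming Julia 1.x: b"..." literals and their byte values.
data = collect(b"DATA\xff\u2200")   # ASCII bytes, one raw byte, then UTF-8 bytes
lone = collect(b"\xff")             # \x escape: exactly one byte
wide = collect(b"\uff")             # Unicode escape: UTF-8 encoding of U+00ff

println(Int.(data))   # [68, 65, 84, 65, 255, 226, 136, 128]
println(Int.(lone))   # [255]
println(Int.(wide))   # [195, 191]
```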
Regular Expressions
Julia has Perl-compatible regular expressions, as provided by the PCRE library. Regular expressions are related to strings in two ways: the obvious connection is that regular expressions are used to find regular patterns in strings; the other connection is that regular expressions are themselves input as strings, which are parsed into a state machine that can be used to efficiently search for patterns in strings. In Julia, regular expressions are input using non-standard string literals prefixed with various identifiers beginning with r. The most basic regular expression literal without any options turned on just uses r"...":
julia> r"^\s*(?:#|$)"
r"^\s*(?:#|$)"
julia> typeof(ans)
Regex
To check whether a regex matches a string, use the matches function:
julia> matches(r"^\s*(?:#|$)", "not a comment")
false
julia> matches(r"^\s*(?:#|$)", "# a comment")
true
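The matches function only reports whether the regex matches. To capture information about the match, use the match function instead. If the regex does not match the string, match returns nothing, a special value that prints nothing at the interactive prompt:
julia> match(r"^\s*(?:#|$)", "not a comment")
julia> match(r"^\s*(?:#|$)", "# a comment")
RegexMatch("#")
Other than not printing, nothing is a normal value that you can test for programmatically:
m = match(r"^\s*(?:#|$)", line)
if m == nothing
  println("not a comment")
else
  println("blank or comment")
end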
julia> m = match(r"^\s*(?:#\s*(.*?)\s*$|$)", "# a comment ")
RegexMatch("# a comment ", 1="a comment")
You can extract the following information from a RegexMatch object:
- the entire substring matched: m.match
- the captured substrings as a tuple of strings: m.captures
- the offset at which the whole match begins: m.offset
- the offsets of the captured substrings as a vector: m.offsets
julia> m = match(r"(a|b)(c)?(d)", "acd")
RegexMatch("acd", 1="a", 2="c", 3="d")
julia> m.match
"acd"
julia> m.captures
("a","c","d")
julia> m.offset
1
julia> m.offsets
[1,2,3]
julia> m = match(r"(a|b)(c)?(d)", "ad")
RegexMatch("ad", 1="a", 2=nothing, 3="d")
julia> m.match
"ad"
julia> m.captures
("a",nothing,"d")
julia> m.offset
1
julia> m.offsets
[1,0,2]
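The fields listed above can be exercised in a short sketch. It assumes a current Julia 1.x release, where the assumed spelling of the boolean test is occursin rather than matches, and where the captures are assumed to come back as a vector rather than a tuple, so the sketch indexes them one at a time.

```julia
# Sketch assuming Julia 1.x: occursin stands in for matches.
comment = r"^\s*(?:#|$)"

is_comment = occursin(comment, "# a comment")    # true
no_match = match(comment, "not a comment")       # nothing

m = match(r"(a|b)(c)?(d)", "ad")

println(m.match)       # ad
println(m.offset)      # 1
println(m.offsets)     # [1, 0, 2]: an unmatched group has offset 0

# The optional group (c)? did not take part in the match.
println(m.captures[2] === nothing)   # true
```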
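Because the captures are returned as a tuple, you can use destructuring syntax to bind them to local variables:
julia> first, second, third = m.captures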
("a",nothing,"d")
julia> first
"a"
You can modify the behavior of regular expressions with some combination of the flags i, m, s, and x after the closing double quote. These flags have the same meaning as they do in Perl, as explained in this excerpt from the perlre manpage:
i   Do case-insensitive pattern matching.
    If locale matching rules are in effect, the case map is taken
    from the current locale for code points less than 255, and
    from Unicode rules for larger code points. However, matches
    that would cross the Unicode rules/non-Unicode rules boundary
    (ords 255/256) will not succeed.
m   Treat string as multiple lines. That is, change "^" and "$"
    from matching the start or end of the string to matching the
    start or end of any line anywhere within the string.
s   Treat string as single line. That is, change "." to match any
    character whatsoever, even a newline, which normally it would
    not match.
    Used together, as /ms, they let the "." match any character
    whatsoever, while still allowing "^" and "$" to match,
    respectively, just after and just before newlines within the
    string.
x   Tells the regular expression parser to ignore most whitespace
    that is neither backslashed nor within a character class. You
    can use this to break up your regular expression into
    (slightly) more readable parts. The '#' character is also
    treated as a metacharacter introducing a comment, just as in
    ordinary code.
julia> r"a+.*b+.*?d$"ism
r"a+.*b+.*?d$"ims
julia> match(r"a+.*b+.*?d$"ism, "Goodbye,\nOh, angry,\nBad world\n")
RegexMatch("angry,\nBad world")
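The effect of each flag can be seen in isolation with a small sketch. It assumes a current Julia 1.x release and uses occursin as the assumed boolean test; the test patterns and strings are made up for illustration, apart from the last one, which repeats the example above.

```julia
# Sketch assuming Julia 1.x: each flag compared against the unflagged regex.
with_i    = occursin(r"hello"i, "HELLO")   # true: case-insensitive
without_i = occursin(r"hello", "HELLO")    # false

with_m    = occursin(r"^b"m, "a\nb")       # true: ^ matches after a newline
without_m = occursin(r"^b", "a\nb")        # false

with_s    = occursin(r"a.b"s, "a\nb")      # true: . matches a newline
without_s = occursin(r"a.b", "a\nb")       # false

with_x    = occursin(r"a b"x, "ab")        # true: whitespace in the pattern is ignored

combined = match(r"a+.*b+.*?d$"ism, "Goodbye,\nOh, angry,\nBad world\n")
println(repr(combined.match))              # "angry,\nBad world"
```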