Strings and I/O
Strings (ugh)
Lots of stuff here
Fortunately, we've talked about a lot of it before
tl;dr:
- There's char, which is a Unicode code point
- There's str, which is a UTF-8 string of bytes
- There's String, which is the "owned" version of str
- There's
Chars
"Character" can mean a lot of things. There's 7-bit ASCII, 8-bit "latin-1". There's a million per-language coding standards
In many languages there can be some debate about what constitutes a single "character". Even in English, is a ligature like ff a character?
Rust's char is a Unicode "code point" (strictly, a Unicode scalar value). It's a 32-bit quantity, but not all possible values are legal: surrogates and anything above 0x10FFFF are excluded
There are the usual character classifiers and converters, which work on full Unicode
There's a bit of ASCII support, but you probably shouldn't normally use it
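As a rough illustration of the classifiers, here is a minimal sketch that compares the Unicode-aware methods on char with their ASCII-only counterparts. The describe() helper is a hypothetical name made up for this example; the methods it calls are from the standard library.

```rust
/// Classify one character with the Unicode-aware methods on `char`.
/// (`describe` is an illustrative helper, not a library function.)
fn describe(c: char) -> &'static str {
    if c.is_alphabetic() {
        "alphabetic"
    } else if c.is_numeric() {
        "numeric"
    } else if c.is_whitespace() {
        "whitespace"
    } else {
        "other"
    }
}

fn main() {
    // U+00E9 is e-acute, U+03B4 is Greek small delta.
    for c in ['a', '\u{e9}', '\u{3b4}', '7', ' ', '!'] {
        println!(
            "{:?}: {}, ASCII alphabetic: {}",
            c,
            describe(c),
            c.is_ascii_alphabetic()
        );
    }
    // to_digit() takes a radix and returns an Option.
    println!("{:?}", 'f'.to_digit(16));
}
```

The non-ASCII letters count as alphabetic under the Unicode classifier but not under the ASCII one, which is the usual reason to prefer the former.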
You can always cast a char to any integer type big enough to hold it
You can't cast an integer to char (the one exception is u8, which always casts cleanly): you need to use std::char::from_u32() or something like it. It returns an Option depending on whether the particular input is a legal Unicode code point
Note that case conversions can't return a single character because uppercase ←→ lowercase is not always 1:1. So they return a char iterator, which is super-annoying
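The conversions above can be sketched as follows. This is only an illustration: upper() is a hypothetical helper that collects the iterator returned by to_uppercase() into a String, which is one common way to cope with the one-to-many case.

```rust
/// Uppercase a single char. The result may be more than one char,
/// so it is collected into a String. (Illustrative helper.)
fn upper(c: char) -> String {
    c.to_uppercase().collect()
}

fn main() {
    // char -> integer: a plain `as` cast.
    let code = '\u{e9}' as u32;
    println!("U+{:04X}", code);

    // u8 -> char is the one integer-to-char cast the compiler accepts.
    let a = 97u8 as char;
    println!("{}", a);

    // Wider integers go through a checked conversion.
    println!("{:?}", std::char::from_u32(0x3b4)); // a legal code point
    println!("{:?}", std::char::from_u32(0xD800)); // a surrogate: None
    println!("{:?}", std::char::from_u32(0x11_0000)); // out of range: None

    // U+00DF (German sharp s) uppercases to two chars.
    println!("{}", upper('\u{df}'));
}
```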
str
It is deemed "not best practice" to store strings as sequences of 32-bit code points. So a compressed encoding called UTF-8 is used for strings. This encoding stores ASCII characters as themselves, and uses an escape convention to get multibyte coding of non-ASCII characters. A UTF-8 string is never larger than 4x the number of code points, and is almost always much smaller
A Rust str is like an array, except of UTF-8 text. A str is unsized, so it is really only useful behind a pointer, in types such as &str or Box<str>
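To make the size argument concrete, here is a small sketch that compares code point counts with UTF-8 byte counts. The sizes() helper and the sample strings are made up for this example.

```rust
/// Return (number of code points, number of UTF-8 bytes) for a string.
/// (Illustrative helper.)
fn sizes(s: &str) -> (usize, usize) {
    (s.chars().count(), s.len())
}

fn main() {
    // One code point each, taking 1, 2, 3 and 4 bytes respectively.
    for s in ["a", "\u{e9}", "\u{20ac}", "\u{1f980}"] {
        let (points, bytes) = sizes(s);
        println!("{:?}: {} code point(s), {} byte(s)", s, points, bytes);
    }

    // Mostly-ASCII text costs barely more than one byte per code point.
    let text = "h\u{e9}llo, w\u{f6}rld";
    let (points, bytes) = sizes(text);
    println!(
        "{} code points: {} bytes as UTF-8, {} bytes as 32-bit values",
        points,
        bytes,
        points * std::mem::size_of::<char>()
    );
}
```

Note that len() on a string counts bytes, not characters.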
String and &str
Let's just refer to &str and String values collectively as "strings"
An &str is a reference to a str. It is a fat pointer that contains the size of the str in bytes. The normal borrow rules apply
A String is an owned reference to a str. It is a fat pointer that contains the size of the str in bytes, plus the capacity of its heap buffer
Because String is owned, you can modify the contained bytes. However, all the safe methods provided for this are guaranteed to preserve UTF-8 encoding of the bytes. This is carefully tuned to avoid trouble
You can get an iterator over a string's code points (.chars()) or u8 bytes (.bytes())
There is no convenient way to go to a given character (code point) position in a string. If you plan to do that a lot, use chars().collect() to build a Vec<char> that you can index
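A minimal sketch of these points, assuming nothing beyond the standard library. The helpers nth_char() and shout() are hypothetical names for this example.

```rust
/// Return the code point at position `n`, walking the string from the start.
/// (Illustrative helper: this is linear in `n`, not constant time.)
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

/// Build an owned String from a borrowed &str and modify it.
/// (Illustrative helper.)
fn shout(s: &str) -> String {
    let mut owned = String::from(s);
    owned.push('!'); // appends one char, encoded as UTF-8
    owned.push_str("!!"); // appends a whole &str
    owned
}

fn main() {
    // 5 code points, 6 bytes: U+00EF takes two bytes in UTF-8.
    let s: &str = "na\u{ef}ve";
    println!("{} chars, {} bytes", s.chars().count(), s.bytes().count());

    println!("{:?}", nth_char(s, 2));

    // For repeated random access, pay once for a Vec<char>.
    let table: Vec<char> = s.chars().collect();
    println!("{}", table[4]);

    // Byte-range slicing must land on char boundaries; get() checks that
    // and returns None instead of panicking.
    println!("{:?} {:?}", s.get(0..2), s.get(0..3));

    println!("{}", shout(s));
}
```

The Vec<char> costs four bytes per code point, so it is a trade of memory for indexing speed.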
String Methods
There are the obvious methods for converting these things around. Read the book carefully to learn about the vocabulary
Many of the string manipulation functions take a "pattern". There are a lot of kinds, including a char, a &str, a slice of chars, and a function or closure over char: read p. 402 of the book for the details. You will use them all, eventually
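Here is a short sketch of several pattern kinds in use with find(), contains() and split(). The fields() helper and the sample string are made up for illustration.

```rust
/// Split a line on commas or semicolons and trim each field.
/// (Illustrative helper; the closure is itself a pattern.)
fn fields(line: &str) -> Vec<&str> {
    line.split(|c: char| c == ',' || c == ';')
        .map(|f| f.trim())
        .collect()
}

fn main() {
    let s = "alpha, beta;gamma";
    println!("{:?}", s.find('b')); // char pattern
    println!("{}", s.contains("beta")); // &str pattern
    println!("{:?}", s.find(&['x', ';'][..])); // slice of chars: matches any of them
    println!("{:?}", s.find(char::is_whitespace)); // function over char
    println!("{:?}", fields(s)); // closure pattern
}
```

The positions returned by find() are byte offsets, not code point counts.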
There's a regex crate (an external crate, not part of the standard library). It's OK.
You can use from_str() or .parse() to convert a string into something else. You can use to_string() to convert other things into String
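A minimal sketch of both directions of conversion. The sum_ints() helper is a hypothetical example of parsing with error propagation; from_str() needs the FromStr trait in scope, while .parse() does not.

```rust
use std::str::FromStr;

/// Sum whitespace-separated integers, or return the first parse error.
/// (Illustrative helper.)
fn sum_ints(s: &str) -> Result<i64, std::num::ParseIntError> {
    s.split_whitespace().map(|w| w.parse::<i64>()).sum()
}

fn main() {
    // The target type can come from an annotation or a turbofish.
    let n: u32 = "42".parse().unwrap();
    let x = f64::from_str("2.5").unwrap();
    println!("{} {}", n, x);

    println!("{:?}", sum_ints("1 2 3"));
    println!("{}", sum_ints("1 two 3").is_err());

    // Going the other way: anything that implements Display has to_string().
    let s: String = 255u8.to_string();
    println!("{}", s);
}
```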
You can use .as_bytes() and .into_bytes() to grab the bytes of a string for free
The ::from_utf8() methods come in checked, lossy and unsafe flavors. Choose wisely
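The three flavors might be compared like this. It is a sketch only: decode_lossy() is a hypothetical wrapper, and the byte values are made-up test data.

```rust
/// Decode bytes as UTF-8, replacing invalid sequences with U+FFFD.
/// (Illustrative wrapper around String::from_utf8_lossy.)
fn decode_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    let s = String::from("h\u{e9}");
    println!("{:?}", s.as_bytes()); // borrows the bytes: [104, 195, 169]
    let raw: Vec<u8> = s.into_bytes(); // consumes the String, no copy

    // Checked: returns Err if the bytes are not valid UTF-8.
    println!("{:?}", String::from_utf8(raw.clone()));
    println!("{}", String::from_utf8(vec![0xff, 0x68]).is_err());

    // Lossy: never fails, but may substitute replacement characters.
    println!("{}", decode_lossy(&[0x68, 0xff, 0x69]));

    // Unsafe: no check at all. The caller must guarantee validity;
    // here the bytes came straight out of a String, so they are valid.
    let trusted = unsafe { String::from_utf8_unchecked(raw) };
    println!("{}", trusted);
}
```

The checked version is the sensible default; the unsafe one only makes sense when validity is already known and the check shows up in a profile.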
Cow
Previously discussed. Allows avoiding conversion to an owned value until it is actually needed
Usually not worth the trouble
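For reference, a small sketch of the kind of case where Cow can pay off: a function that usually returns its input unchanged and only occasionally has to allocate. The detab() function is a hypothetical example.

```rust
use std::borrow::Cow;

/// Replace tabs with spaces, allocating only if the input contains a tab.
/// (Illustrative function.)
fn detab(s: &str) -> Cow<'_, str> {
    if s.contains('\t') {
        Cow::Owned(s.replace('\t', " "))
    } else {
        Cow::Borrowed(s)
    }
}

fn main() {
    for s in ["no tabs here", "one\ttab"] {
        match detab(s) {
            Cow::Borrowed(b) => println!("borrowed: {}", b),
            Cow::Owned(o) => println!("owned: {}", o),
        }
    }
}
```

Callers that only read the result can treat it as a &str either way, since Cow<str> dereferences to str.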
Formatting
Pages 411-424 of the book discuss the details of the format string language we've been using with println!() and the like. Read it
A lot of the details of this section have to do with macros, so we'll deal with it then
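As a quick reminder of what the format string language looks like, here is a sketch of a few common directives. The row() helper and its column widths are made up for this example.

```rust
/// Format a table row: name left-aligned in 8 columns, value right-aligned
/// in 8 columns with 2 decimal places. (Illustrative helper.)
fn row(name: &str, value: f64) -> String {
    format!("{:<8}|{:>8.2}", name, value)
}

fn main() {
    println!("{}", row("width", 12.3456));
    println!("{:?} vs {}", "quoted", "plain"); // Debug vs Display
    println!("{:08.3} {:#x} {:#010b}", 2.5, 255, 5); // padding, hex, binary
    println!("{0}-{0}-{1}", "a", "b"); // positional arguments
}
```

format!() takes the same format strings as println!() but returns a String instead of printing.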
I/O
There are Read, BufRead and Write traits
The first two correspond to "unbuffered" and "buffered"
Any sensible thing, including strings, either is Read / Write or can be made to be
stdin, stdout and stderr are treated as functions returning readers / writers, because the underlying handles have to be locked across threads
The flush situation is non-optimal: stdout is line-buffered, so a prompt printed with print!() may not show up until you call flush()
Path and friends deal with filenames / pathnames. OsString may also be necessary
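The pieces above might fit together as in the following sketch. The number_lines() function is a hypothetical example, and /tmp/notes.txt is a made-up path that is never opened. A Cursor over a string stands in for a buffered reader and a Vec<u8> stands in for a writer, so the example needs no real files or terminal input.

```rust
use std::io::{self, BufRead, Cursor, Write};
use std::path::Path;

/// Copy lines from any buffered reader to any writer, numbering them.
/// Returns the number of lines copied. (Illustrative function.)
fn number_lines<R: BufRead, W: Write>(input: R, mut output: W) -> io::Result<usize> {
    let mut count = 0;
    for line in input.lines() {
        let line = line?; // each item is an io::Result<String>
        count += 1;
        writeln!(output, "{:>3}: {}", count, line)?;
    }
    output.flush()?; // push out anything the writer is still buffering
    Ok(count)
}

fn main() -> io::Result<()> {
    // A string can be made readable: a Cursor over it implements BufRead.
    let input = Cursor::new("first\nsecond\n");
    // A Vec<u8> implements Write, so it can stand in for a file or stdout.
    let mut buffer: Vec<u8> = Vec::new();
    number_lines(input, &mut buffer)?;
    print!("{}", String::from_utf8_lossy(&buffer));

    // stdout() returns a handle; lock() holds the lock for a run of writes.
    let stdout = io::stdout();
    let mut out = stdout.lock();
    write!(out, "no newline yet")?;
    out.flush()?; // without this, the text may sit in the line buffer
    writeln!(out)?;

    // Path borrows a path name, much as &str borrows a str.
    // The path here is hypothetical and is only taken apart, never opened.
    let p = Path::new("/tmp/notes.txt");
    println!("{:?} {:?}", p.file_name(), p.extension());
    Ok(())
}
```

Because number_lines() is generic over BufRead and Write, the same function should also accept locked standard streams or buffered files in place of the in-memory stand-ins.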