CS Rust Sp2021: Strings

Strings (ugh)

Lots of stuff here
Fortunately, we've talked about a lot of it before
tl;dr:
- There's char which is a Unicode "code point"
- There's str which is a sequence of bytes representing Unicode code points in the compressed UTF-8 format
- There's &str which is a smart pointer to a str with with a length field
- There's String which is the "owned" version of str

"Character" can mean a lot of things. There's 7-bit ASCII, 8-bit "latin-1". There's a million per-language coding standards
In many languages there can be some debate about what constitutes a single "character". Even in English, is a ligature like ﬀ a character?
Rust's char is a Unicode "code point". It's a 32-bit quantity, but not all possible values are legal
There are the usual character classifiers and converters, which work on full Unicode
There's a bit of ASCII support, but probably shouldn't normally use it
You can always cast a char to any integer type. Watch out for overflow!
You can't cast an integer to char (except u8, because ASCII): you need to use std::char::from_u32() or something like it. It returns an Option depending on whether the particular input is a legal Unicode code point
Note that case conversions can't return a single character because uppercase ←→ lowercase is not always 1::1. So return a char iterator, which is super-annoying
examples/charcast.rs
examples/lower.rs

It is deemed "not best practice" to store strings as sequences of 32-bit code points. So a compressed encoding called UTF-8 is used for strings. This encoding stores ASCII characters as themselves, and uses an escape convention to get multibyte coding of non-ASCII strings. A UTF-8 string is almost always much smaller than 4x the number of code points
A Rust str is like an array, except of UTF-8 text. A str is unsized, so it is really only useful in certain type declarations

Let's just refer to &str and String values collectively as "strings"
An &str is a reference to a str. It is a fat pointer that contains the size of the &str in bytes. The normal borrow rules apply
A String is an owned reference to a str. It is a fat pointer that contains the size of the str in bytes.
Because String is owned, you can modify the contained bytes. However, all the methods provided for this are guaranteed to preserve UTF-8 encoding of the bytes. This is carefully tuned to avoid trouble
You can get an iterator over a string's code points (.chars()) or u8 bytes (.bytes())
There is no convenient way to go to a given character (code point) position in a string. If you plan to do that a lot, use chars().collect()
examples/stringchars.rs

There are the obvious methods for converting these things around. Read the book carefully to learn about the vocabulary
Many of the string manipulation functions take a "pattern". There's a lot of kinds: read the book for the details. You will use them all, eventually
There's a regex library crate. It's good.
You can use from_str() or .parse() to convert a string into something else. You can use to_string() to convert other things into String
You can use .as_bytes() and .into_bytes() to grab the bytes of a string for free
The ::from_utf8() methods come in checked, lossy and unsafe flavors. Choose wisely

The book discusses the details of the format string language we've been using with println!() and the like. Read it
A lot of the details of this section have to do with macros, so we'll deal with it then

Last modified: Tuesday, 18 May 2021, 2:00 PM