Strings (ugh)

  • Lots of stuff here

  • Fortunately, we've talked about a lot of it before

  • tl;dr:

    • There's char, which is a single Unicode code point
    • There's str, which is a sequence of bytes guaranteed to be valid UTF-8
    • There's String, which is the "owned" version of str

Chars

  • "Character" can mean a lot of things. There's 7-bit ASCII and 8-bit "Latin-1", plus a million per-language coding standards

  • In many languages there can be some debate about what constitutes a single "character". Even in English, is a ligature like "ﬀ" a character?

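  • Rust's char is a Unicode "code point" (strictly, a "scalar value": any code point except the surrogates). It's a 32-bit quantity, but not all possible values are legal: only 0 to 0xD7FF and 0xE000 to 0x10FFFF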

  • There are the usual character classifiers and converters, which work on full Unicode

  • There's a bit of ASCII-only support, but you probably shouldn't normally use it

  • You can always cast a char to any integer type with as. If the type is too small to hold the code point, the value is silently truncated

  • You can't cast an integer to char with as, except from u8: for anything wider you need to use std::char::from_u32() or something like it. It returns an Option depending on whether the particular input is a legal Unicode code point (surrogates and values above 0x10FFFF are not)

  • Note that case conversions can't return a single character because uppercase ←→ lowercase is not always 1:1. So .to_uppercase() and .to_lowercase() return a char iterator, which is super-annoying
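
  A minimal sketch of the casting and case-conversion rules above. The helper names checked_char and upper are made up for illustration; only the std calls are real API.

```rust
/// Convert an arbitrary u32 to a char, if it is a legal value.
/// (Hypothetical helper: a thin wrapper around std::char::from_u32.)
fn checked_char(n: u32) -> Option<char> {
    std::char::from_u32(n)
}

/// Uppercase one char. The result may be more than one char long,
/// so the iterator is collected into a String. (Hypothetical helper.)
fn upper(c: char) -> String {
    c.to_uppercase().collect()
}

fn main() {
    // char -> integer: `as` always compiles, and truncates if too small.
    assert_eq!('A' as u32, 65);
    assert_eq!('\u{20AC}' as u8, 0xAC); // EURO SIGN, truncated to its low byte

    // u8 -> char is the one integer-to-char cast that `as` allows.
    assert_eq!(0x41u8 as char, 'A');

    // Wider integers need a checked conversion.
    assert_eq!(checked_char(0x1F600), Some('\u{1F600}'));
    assert_eq!(checked_char(0xD800), None); // a surrogate
    assert_eq!(checked_char(0x11_0000), None); // past the end of Unicode

    // Case conversion is not always 1:1.
    assert_eq!(upper('a'), "A");
    assert_eq!(upper('ß'), "SS");

    // The classifiers work on full Unicode, not just ASCII.
    assert!('\u{e9}'.is_alphabetic()); // e with acute accent
    assert!('\u{0663}'.is_numeric()); // ARABIC-INDIC DIGIT THREE
}
```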

str

  • It is deemed "not best practice" to store strings as sequences of 32-bit code points. So a compressed encoding called UTF-8 is used for strings. This encoding stores ASCII characters as themselves (one byte each), and uses an escape convention to get multibyte coding (2 to 4 bytes each) of non-ASCII characters. A UTF-8 string is almost always much smaller than 4 bytes per code point

  • A Rust str is like a byte slice [u8], except that the bytes are guaranteed to be UTF-8 text. A str is unsized, so it is really only useful behind a pointer, in types such as &str or Box<str>
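
  To make the size argument concrete, here is a small sketch comparing UTF-8 byte counts with a 4-bytes-per-code-point encoding. The helper sizes is hypothetical, and the sample strings are arbitrary.

```rust
/// Bytes needed to store `s` as UTF-8 and as 32-bit code points.
/// (Hypothetical helper for illustration.)
fn sizes(s: &str) -> (usize, usize) {
    let utf8 = s.len(); // len() counts bytes, not chars
    let utf32 = s.chars().count() * 4;
    (utf8, utf32)
}

fn main() {
    // ASCII is stored as itself: one byte per character.
    assert_eq!("hi".as_bytes().to_vec(), vec![0x68u8, 0x69]);

    // Non-ASCII characters take 2 to 4 bytes each.
    assert_eq!("\u{e9}".as_bytes().to_vec(), vec![0xC3u8, 0xA9]);
    assert_eq!('\u{20AC}'.len_utf8(), 3); // EURO SIGN
    assert_eq!('\u{1F600}'.len_utf8(), 4); // an emoji

    assert_eq!(sizes("hello"), (5, 20));
    assert_eq!(sizes("na\u{ef}ve"), (6, 20)); // i with diaeresis takes 2 bytes

    // str is unsized, so it always lives behind some kind of pointer.
    let borrowed: &str = "text";
    let boxed: Box<str> = Box::from("text");
    assert_eq!(borrowed, &*boxed);
}
```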

String and &str

  • Let's just refer to &str and String values collectively as "strings"

  • An &str is a reference to a str. It is a fat pointer: an address plus the size of the str in bytes. The normal borrow rules apply

  • A String owns a heap-allocated str. It holds a pointer, the size of the str in bytes, and a capacity, much like a Vec<u8>

  • Because String is owned, you can modify the contained bytes. However, all the safe methods provided for this are guaranteed to preserve UTF-8 encoding of the bytes. This is carefully tuned to avoid trouble

  • You can get an iterator over a string's code points (.chars()) or u8 bytes (.bytes())

  • There is no convenient constant-time way to go to a given character (code point) position in a string: .chars().nth(i) has to scan from the start. If you plan to do that a lot, collect into a Vec<char> with .chars().collect()
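
  A short sketch of bytes versus code points in a string, and of the two ways to reach the character at a given position. The helper nth_char is hypothetical.

```rust
use std::mem::size_of;

/// Return the char at code point index `i` by scanning from the start.
/// (Hypothetical helper: this is O(n), not an array lookup.)
fn nth_char(s: &str, i: usize) -> Option<char> {
    s.chars().nth(i)
}

fn main() {
    // An &str is a fat pointer: an address plus a length.
    assert_eq!(size_of::<&str>(), 2 * size_of::<usize>());

    // String mutators such as push() keep the buffer valid UTF-8.
    let mut owned = String::from("caf");
    owned.push('\u{e9}'); // e with acute accent, encoded as 2 bytes
    assert!(owned.capacity() >= owned.len());

    let borrowed: &str = &owned;
    assert_eq!(borrowed.len(), 5); // bytes
    assert_eq!(borrowed.chars().count(), 4); // code points
    assert_eq!(borrowed.bytes().last(), Some(0xA9));

    // Positional access: scan each time, or collect once and index.
    assert_eq!(nth_char(borrowed, 3), Some('\u{e9}'));
    let chars: Vec<char> = borrowed.chars().collect();
    assert_eq!(chars[3], '\u{e9}');

    // Byte-range slicing must land on character boundaries.
    assert_eq!(borrowed.get(0..3), Some("caf"));
    assert_eq!(borrowed.get(0..4), None); // would split the 2-byte char
}
```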

String Methods

  • There are the obvious methods for converting among these types. Read the book carefully to learn the vocabulary

  • Many of the string manipulation functions take a "pattern". There are many kinds (a char, a &str, a slice of chars, a closure on char): read p. 402 of the book for the details. You will use them all, eventually

  • There's a regex package, but it is not part of the standard library: it's a separate crate on crates.io. It's OK.

  • You can use from_str() or .parse() to convert a string into something else. You can use to_string() to convert other things into String

  • You can use .as_bytes() (borrows) and .into_bytes() (consumes a String) to grab the bytes of a string for free

  • The ::from_utf8() methods come in checked, lossy and unsafe flavors. Choose wisely
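
  The conversions and pattern kinds above, sketched with arbitrary sample data. The helper parse_ints is hypothetical; the unsafe from_utf8_unchecked flavor is deliberately left out.

```rust
/// Parse whitespace-separated integers, skipping tokens that fail to parse.
/// (Hypothetical helper for illustration.)
fn parse_ints(s: &str) -> Vec<i32> {
    s.split_whitespace().filter_map(|t| t.parse().ok()).collect()
}

fn main() {
    // Patterns: a char, a &str, a slice of chars, or a closure on char.
    let s = "a,b;c d";
    assert_eq!(s.find(','), Some(1));
    assert_eq!(s.find("b;"), Some(2));
    let parts: Vec<&str> = s.split(&[',', ';', ' '][..]).collect();
    assert_eq!(parts, ["a", "b", "c", "d"]);
    assert_eq!("x1y22z".split(|c: char| c.is_ascii_digit()).count(), 4);

    // String -> other types with parse(); other types -> String with to_string().
    assert_eq!("2.5".parse::<f64>(), Ok(2.5));
    assert!("abc".parse::<u8>().is_err());
    assert_eq!(parse_ints("1 two 3"), [1, 3]);
    let n = 42;
    assert_eq!(n.to_string(), "42");

    // Bytes of a string: as_bytes() borrows, into_bytes() consumes.
    let text = String::from("hi");
    assert_eq!(b"hi", text.as_bytes());
    assert_eq!(text.into_bytes(), vec![0x68u8, 0x69]);

    // Bytes back to a string: checked and lossy flavors.
    let good = vec![0x68u8, 0x69];
    let bad = vec![0x68u8, 0xFF]; // 0xFF never appears in valid UTF-8
    assert_eq!(String::from_utf8(good).unwrap(), "hi");
    assert!(String::from_utf8(bad.clone()).is_err());
    assert_eq!(String::from_utf8_lossy(&bad), "h\u{FFFD}");
}
```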

Cow

  • Previously discussed. Allows avoiding the conversion from borrowed to owned (and its allocation) until needed

  • Usually not worth the trouble
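
  For reference, a minimal sketch of the kind of function where Cow can pay off: one that usually returns its input unchanged. The function underscore is a made-up example, not something from the book.

```rust
use std::borrow::Cow;

/// Replace spaces with underscores, allocating only when a space is present.
/// (Hypothetical example function.)
fn underscore(s: &str) -> Cow<'_, str> {
    if s.contains(' ') {
        Cow::Owned(s.replace(' ', "_"))
    } else {
        Cow::Borrowed(s)
    }
}

fn main() {
    // No spaces: the input is borrowed, nothing is copied.
    assert!(matches!(underscore("abc"), Cow::Borrowed(_)));

    // Spaces: a new String is built.
    let changed = underscore("a b c");
    assert!(matches!(changed, Cow::Owned(_)));
    assert_eq!(changed, "a_b_c");

    // Either way, the result can be used as a &str.
    assert_eq!(underscore("x y").len(), 3);
}
```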

Formatting

  • Pages 411-424 of the book discuss the details of the format string language we've been using with println!() and the like. Read them

  • A lot of the details of this section have to do with macros, so we'll deal with it then
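
  In the meantime, a few common format directives, sketched with format!() so the results can be checked. The helper samples and its inputs are arbitrary illustrations, not taken from the book.

```rust
/// A handful of format strings and the text they produce.
/// (Hypothetical helper for illustration.)
fn samples() -> Vec<String> {
    vec![
        format!("{:>8.3}", 3.14159),             // right-align, width 8, 3 decimals
        format!("{:05}", 42),                    // zero-pad to width 5
        format!("{:#x}", 255),                   // hex with 0x prefix
        format!("{:?}", "hi"),                   // Debug output keeps the quotes
        format!("{:<4}|", "ab"),                 // left-align, width 4
        format!("{0}-{0}-{name}", 1, name = "x"), // positional and named arguments
    ]
}

fn main() {
    let out = samples();
    assert_eq!(out[0], "   3.142");
    assert_eq!(out[1], "00042");
    assert_eq!(out[2], "0xff");
    assert_eq!(out[3], "\"hi\"");
    assert_eq!(out[4], "ab  |");
    assert_eq!(out[5], "1-1-x");
}
```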

I/O

  • There are the Read, BufRead and Write traits

  • The first two correspond to "unbuffered" and "buffered"

  • Any sensible thing either is Read / Write or can be made to be. This includes strings: the byte slice from .as_bytes() is Read and BufRead, and a Vec<u8> is Write

  • stdin(), stdout() and stderr() are functions returning handles to shared readers / writers. Each handle has a .lock() method, because access has to be synchronized across threads

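  • The flush situation is non-optimal: stdout is line-buffered, so output from print!() without a trailing newline may not appear until you call .flush() yourself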

  • Path and friends deal with filenames / pathnames. OsString may also be necessary
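
  A minimal sketch of writing code against the traits rather than against files or the terminal. The functions count_lines and greet are hypothetical; in-memory buffers stand in for real input and output.

```rust
use std::io::{self, BufRead, Write};

/// Count the lines available from any buffered reader.
/// (Hypothetical helper for illustration.)
fn count_lines<R: BufRead>(reader: R) -> io::Result<usize> {
    let mut n = 0;
    for line in reader.lines() {
        line?; // propagate any read error
        n += 1;
    }
    Ok(n)
}

/// Write a greeting to any writer. (Hypothetical helper.)
fn greet<W: Write>(mut out: W, name: &str) -> io::Result<()> {
    writeln!(out, "hello, {}", name)
}

fn main() -> io::Result<()> {
    // A byte slice is Read and BufRead, so a string can act as input.
    let text = "one\ntwo\n";
    assert_eq!(count_lines(text.as_bytes())?, 2);

    // A Vec<u8> is Write, so output can be captured in memory.
    let mut buf: Vec<u8> = Vec::new();
    greet(&mut buf, "world")?;
    assert_eq!(buf, b"hello, world\n");

    // stdout() returns a handle; lock() gives exclusive access to it.
    let stdout = io::stdout();
    let mut out = stdout.lock();
    write!(out, "no newline yet")?;
    out.flush()?; // without this, the partial line may sit in the buffer
    writeln!(out)?;
    Ok(())
}
```

  The same count_lines would accept a locked stdin or a BufReader wrapped around a File, since both implement BufRead.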

Last modified: Thursday, 17 May 2018, 6:31 PM