Arrays and Strings

Arrays

  • Array

    • Owned type
    • Type includes size
    • Types like [u8;5], values like [0u8;5] or [0u8, 1u8].
  • Slice

    • Reference to all or part of an array (or Vec): contents not owned
    • Reference is a "fat pointer": it carries the length along with the address
    • Types like &[u8], values like &[0u8;5] or &[0u8, 1u8].
    • Can index with a range to get a "slice" of the underlying array, e.g.

      let a = [0u8, 1, 2, 3, 4];
      let s = &a[1..4];
      assert_eq!(s.len(), 3);
      assert_eq!(s[0], 1);
      assert_eq!(s[2], 3);
      
    • There are lots of cool operations on slices (sorting, searching, splitting, chunking); see the textbook, The Book, and the std docs

  • Vec

    • Heap-allocated array-like object
    • Length is tracked at runtime; storage grows as needed and is freed when the Vec is dropped
    • Types like Vec<u8>, values like Vec::new() or vec![1u8, 2, 3]
    • Can append to a Vec with push(), remove the last element with pop() (which returns an Option)
    • Can slice a Vec, e.g.

      let mut a = vec![0u8, 1, 2, 3, 4];
      assert_eq!(a.len(), 5);
      a.push(5);
      assert_eq!(a.len(), 6);
      let s = &a[1..4];
      assert_eq!(s.len(), 3);
      assert_eq!(s[0], 1);
      assert_eq!(s[2], 3);
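
The distinctions above can be checked directly. The following is a minimal sketch, not part of the original notes; `sum_bytes` is a hypothetical helper introduced only to show that one function taking a slice accepts an array, a sub-slice, and a Vec.

```rust
use std::mem::size_of;

/// Sums a slice of bytes. (`sum_bytes` is a hypothetical helper for
/// this illustration, not a library function.)
fn sum_bytes(data: &[u8]) -> u32 {
    data.iter().map(|&b| u32::from(b)).sum()
}

fn main() {
    // An array's length is part of its type, so its size is known statically.
    let a: [u8; 5] = [0, 1, 2, 3, 4];
    assert_eq!(size_of::<[u8; 5]>(), 5);

    // A slice reference is a fat pointer: an address plus a length.
    assert_eq!(size_of::<&[u8]>(), 2 * size_of::<usize>());

    // The same function accepts a whole array, part of it, or a Vec.
    let mut v = vec![0u8, 1, 2, 3, 4];
    assert_eq!(sum_bytes(&a), 10);
    assert_eq!(sum_bytes(&a[1..4]), 6);
    assert_eq!(sum_bytes(&v), 10);

    // push() appends; pop() returns an Option, which is None when empty.
    v.push(5);
    assert_eq!(v.pop(), Some(5));
    let mut empty: Vec<u8> = Vec::new();
    assert_eq!(empty.pop(), None);
}
```

Taking `&[u8]` rather than `&Vec<u8>` or `&[u8; 5]` is usually the more flexible choice for function parameters, since all three coerce to a slice.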
      

Indexing

  • Can only index with a value of type usize (or a range of usize); other integer types must be converted first

  • Indices are bounds-checked at runtime, and an out-of-range index panics. The optimizer can often hoist a check out of a loop or eliminate it entirely

  • May be more efficient and readable to iterate over values or references than to index

        let mut a: Vec<u8> = (0..5).collect();
        for i in 0..a.len() {
            a[i] += 1;
        }
    

    vs

        let mut a: Vec<u8> = (0..5).collect();
        for v in a.iter_mut() {
            *v += 1;
        }
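
When an index comes from outside the program, a panic is often not what you want. The sketch below, which is an illustration rather than part of the original notes, uses the standard `get()` method to turn an out-of-range index into `None`; `lookup` is a hypothetical helper name, and the `u32` index type is an assumption chosen to show the conversion to `usize`.

```rust
use std::convert::TryFrom;

/// Looks up `data[i]` where `i` arrives as a u32 (say, from a file format).
/// Returns None instead of panicking when `i` is out of range.
/// (`lookup` is a hypothetical helper for this illustration.)
fn lookup(data: &[u8], i: u32) -> Option<u8> {
    // Indices must be usize, so other integer types need a conversion.
    let i = usize::try_from(i).ok()?;
    data.get(i).copied()
}

fn main() {
    let a = vec![10u8, 20, 30];

    // Plain indexing is bounds-checked and would panic for a[3].
    assert_eq!(a[1], 20);

    // get() reports an out-of-range index as None.
    assert_eq!(lookup(&a, 2), Some(30));
    assert_eq!(lookup(&a, 3), None);

    // enumerate() supplies the index when the loop body needs both.
    for (i, v) in a.iter().enumerate() {
        assert_eq!(*v, a[i]);
    }
}
```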
    

Strings (ugh)

  • Lots of stuff here

  • tl;dr:

    • There's char, which is a single Unicode code point (strictly, a "scalar value")
    • There's str, which is a sequence of bytes guaranteed to be valid UTF-8
    • There's String, which is the owned, growable version of str

Chars

  • "Character" can mean a lot of things. There's 7-bit ASCII and 8-bit "latin-1". There are countless per-language encoding standards

  • In many languages there can be some debate about what constitutes a single "character". Even in English, is a ligature like ff a character?

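  • Rust's char is a Unicode "code point" (strictly, a Unicode scalar value). It's a 32-bit quantity, but not all possible values are legal: only 0 through 0x10FFFF, excluding the surrogates 0xD800 through 0xDFFF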

  • There are the usual character classifiers and converters, which work on full Unicode

  • There's a bit of ASCII support, but you probably shouldn't normally use it

  • You can cast a char to any integer type with as. The value is truncated if the type is too small to hold it

  • You can't cast an arbitrary integer to char: only u8 can be cast with as. Otherwise you need to use std::char::from_u32() or something like it. It returns an Option<char> that is None when the input is not a legal Unicode scalar value

  • Note that case conversions can't return a single character because uppercase ←→ lowercase is not always 1:1. So to_uppercase() and to_lowercase() return a char iterator, which is super-annoying
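
The casting and case-conversion rules above are easy to get wrong, so here is a minimal sketch of them (an illustration, not part of the original notes). `upper` is a hypothetical helper that collects the iterator returned by `to_uppercase()` into a String.

```rust
/// Uppercases one char. The result is a String because it may be
/// more than one char long. (`upper` is a hypothetical helper.)
fn upper(c: char) -> String {
    c.to_uppercase().collect()
}

fn main() {
    // char to integer: `as` always works (and truncates if too small).
    assert_eq!('A' as u32, 65);
    assert_eq!('\u{3bb}' as u32, 0x3BB); // Greek small letter lambda

    // Integer to char: only u8 may be cast with `as`.
    assert_eq!(97u8 as char, 'a');

    // Anything wider goes through a checked conversion.
    assert_eq!(std::char::from_u32(0x3BB), Some('\u{3bb}'));
    assert_eq!(std::char::from_u32(0xD800), None); // surrogate
    assert_eq!(std::char::from_u32(0x11_0000), None); // past the Unicode range

    // Classifiers work on full Unicode, not just ASCII.
    assert!('\u{3bb}'.is_alphabetic());
    assert!('\u{3bb}'.is_lowercase());

    // Case conversion yields an iterator: German sharp s uppercases to "SS".
    assert_eq!(upper('a'), "A");
    assert_eq!(upper('\u{df}'), "SS");
}
```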

str

  • It is deemed "not best practice" to store strings as sequences of 32-bit code points. So a variable-width encoding called UTF-8 is used for strings. This encoding stores ASCII characters as themselves, and uses an escape convention to get multibyte coding of non-ASCII characters: each code point takes one to four bytes. A UTF-8 string is almost always much smaller than 4x the number of code points

  • A Rust str is like a slice [u8], except that the bytes are guaranteed to be valid UTF-8 text. A str is unsized, so it is only ever used behind a pointer, as in &str or Box<str>
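
To make the byte-versus-code-point distinction concrete, here is a small sketch (an illustration, not from the original notes). `sizes` is a hypothetical helper that reports both lengths of a string.

```rust
/// Returns (length in bytes, length in code points).
/// (`sizes` is a hypothetical helper for this illustration.)
fn sizes(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // ASCII: one byte per code point, stored as itself.
    assert_eq!(sizes("hello"), (5, 5));
    assert_eq!("hi".as_bytes(), b"hi");

    // Non-ASCII code points take two, three, or four bytes.
    assert_eq!(sizes("\u{e9}"), (2, 1)); // e with acute accent
    assert_eq!(sizes("\u{20ac}"), (3, 1)); // euro sign
    assert_eq!(sizes("\u{1f980}"), (4, 1)); // crab emoji

    // str is unsized, so it always lives behind some pointer type.
    let boxed: Box<str> = "owned".into();
    assert_eq!(&*boxed, "owned");
}
```

Note that `len()` on a string always counts bytes, never characters.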

String and &str

  • Let's just refer to &str and String values collectively as "strings"

  • An &str is a reference to a str. It is a fat pointer that contains the length of the str in bytes. The normal borrow rules apply

  • A String owns a heap-allocated str, much as a Vec<u8> owns its bytes. It holds a pointer, the length of the str in bytes, and a capacity, so it can grow

  • Because String is owned, you can modify the contained bytes. However, all the safe methods provided for this are guaranteed to preserve valid UTF-8 encoding of the bytes

  • You can get an iterator over a string's code points (.chars()) or u8 bytes (.bytes())

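  • There is no convenient way to go to a given character (code point) position in a string: indexing is by byte offset, and finding the nth code point means scanning from the start. If you plan to do that a lot, use chars().collect() to build a Vec<char>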
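
A short sketch of these points follows (an illustration, not part of the original notes). `nth_char` is a hypothetical helper; it scans from the start of the string, which is why collecting into a `Vec<char>` pays off when you need many positional lookups.

```rust
/// Returns the n-th code point by scanning from the start: O(n).
/// (`nth_char` is a hypothetical helper for this illustration.)
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

fn main() {
    // A String can grow; every safe method keeps it valid UTF-8.
    let mut owned = String::from("caf");
    owned.push('\u{e9}'); // e with acute accent: two bytes
    owned.push_str(" au lait");
    assert!(owned.capacity() >= owned.len());

    // Borrowing a String yields a &str, a fat pointer (address + byte length).
    let borrowed: &str = &owned;
    assert_eq!(std::mem::size_of::<&str>(), 2 * std::mem::size_of::<usize>());

    // Bytes and code points are counted separately.
    assert_eq!(borrowed.bytes().count(), 13);
    assert_eq!(borrowed.chars().count(), 12);

    // Range indexing is by byte offset and must fall on char boundaries.
    assert_eq!(&borrowed[0..3], "caf");
    assert_eq!(borrowed.get(0..4), None); // byte 4 is inside the accented e

    // Positional access: scan each time, or collect once into a Vec<char>.
    assert_eq!(nth_char(borrowed, 3), Some('\u{e9}'));
    let cs: Vec<char> = borrowed.chars().collect();
    assert_eq!(cs[3], '\u{e9}');
    assert_eq!(cs.len(), 12);
}
```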

String Methods

  • There are the obvious methods for converting among these types. Read the book carefully to learn about the vocabulary

  • Many of the string manipulation functions take a "pattern". There are many kinds (a char, a &str, a slice of chars, a closure on char): read p. 402 of the book for the details. You will use them all, eventually

  • There's a regex crate. It is not part of the standard library, so it must be added as a dependency. It's OK.

  • You can use from_str() or .parse() to convert a string into something else. You can use to_string() to convert other things into String

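  • You can use .as_bytes() (borrows) and .into_bytes() (consumes the String) to grab the bytes of a string for free: neither copies

  • The ::from_utf8() methods come in checked (String::from_utf8(), std::str::from_utf8()), lossy (String::from_utf8_lossy(), which substitutes U+FFFD for bad bytes) and unsafe (from_utf8_unchecked()) flavors. Choose wisely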
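
The conversions and patterns listed above can be sketched as follows (an illustration, not part of the original notes). `parse_fields` and `decode` are hypothetical helpers: the first combines a closure pattern with `parse()`, the second tries the checked `from_utf8()` and falls back to the lossy flavor, which is one possible policy among several.

```rust
use std::num::ParseIntError;
use std::str::FromStr;

/// Splits on ',' or ';' and parses each field as u32.
/// (`parse_fields` is a hypothetical helper for this illustration.)
fn parse_fields(s: &str) -> Result<Vec<u32>, ParseIntError> {
    s.split(|c: char| c == ',' || c == ';')
        .map(|field| field.trim().parse::<u32>())
        .collect()
}

/// Decodes bytes as UTF-8, replacing invalid sequences with U+FFFD.
/// (`decode` is a hypothetical helper; the fallback policy is an assumption.)
fn decode(bytes: Vec<u8>) -> String {
    match String::from_utf8(bytes) {
        Ok(s) => s, // checked: reuses the buffer, no copy
        Err(e) => String::from_utf8_lossy(e.as_bytes()).into_owned(),
    }
}

fn main() {
    // parse() and from_str() are two spellings of the same conversion.
    let n: u32 = "42".parse().unwrap();
    assert_eq!(n, 42);
    assert_eq!(i64::from_str("-7"), Ok(-7));
    assert!("4x2".parse::<u32>().is_err());

    // to_string() goes the other way.
    assert_eq!(3.5f64.to_string(), "3.5");

    // Patterns can be a char, a &str, or a closure on char.
    let s = "a1,b2;c3";
    assert_eq!(s.find(','), Some(2));
    assert!(s.contains("b2"));
    assert_eq!(parse_fields("1, 2;3"), Ok(vec![1, 2, 3]));

    // into_bytes() hands over the buffer; from_utf8() checks it on return.
    let bytes: Vec<u8> = String::from("hi").into_bytes();
    assert_eq!(bytes, b"hi");
    assert_eq!(decode(bytes), "hi");
    assert_eq!(decode(vec![b'o', 0xff, b'k']), "o\u{fffd}k");
}
```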

Last modified: Monday, 1 July 2019, 1:37 AM