RISC-V Vector Basics and Development Environment

I recently set up a development environment for RISC-V Vector (RVV) intrinsics. Here are some tips and steps.

RVV Basics #

This section is not really necessary in terms of setting up a dev env, but it helps if you are also working with RVV.

Disclaimer: I am also a total beginner to RVV, so there may be errors or inaccuracies in my understanding.

RVV differs from other SIMD implementations by making programs scalable across different vector register lengths. Traditional ISAs like x86 (SSE, AVX, etc.) and Arm (Neon) have fixed-length vector registers, so developers don't really need to care about variations across implementations. However, RISC-V implementations are highly diverse, and supporting different vector register lengths is necessary. Thus, RVV introduces a set of new concepts: SEW, VLEN, VL, AVL, VLMAX, and LMUL.

These terms are very confusing at first, but the programming model is simple: you request space from the CPU dynamically (think of it like calloc(3)).

In RVV, you are processing an array of elements. The element width is called SEW (selected element width), and it is 8, 16, 32, or 64 bits. The number of elements you request is called AVL (application vector length).

When you need to do vector operations, you first tell the CPU your element width (SEW) and how many elements you want (AVL), along with other options discussed later. The CPU then gives you back the available element count, VL (vector length), which depends on VLEN, the hard-wired vector register length of that specific CPU.

These data are stored in CSRs (control and status registers): SEW, LMUL, and the tail/mask policies go into vtype, while the element count goes into vl. They can be configured using the following set of instructions:

# VL   = Destination register that receives the granted element count
# AVL  = Register that holds the number of elements you want
# SEW  = e8 / e16 / e32 / e64
# LMUL = Register grouping: mf8 / mf4 / mf2 / m1 / m2 / m4 / m8, see below
# ta   = Tail agnostic: elements past VL may be left unchanged or overwritten
#        with all 1s (the alternative, tu, keeps them undisturbed)
# ma   = Mask agnostic: the same policy for elements masked off by v0
#        (the alternative, mu, keeps them undisturbed)
vsetvli VL, AVL, SEW, LMUL, ta, ma
# Or use vsetivli (immediate AVL) or vsetvl (SEW/LMUL taken from a register)
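With the C intrinsics shown later in this post, SEW and LMUL are encoded in the intrinsic's name instead, and the tail/mask policies default to agnostic. A minimal sketch (my own example, following the rvv-intrinsic naming scheme):

#include <riscv_vector.h>
#include <stddef.h>

/* "Give me avl 32-bit elements, grouping registers by 2."
   Returns VL = min(avl, VLMAX); compiles down to a single vsetvli. */
size_t request_elements(size_t avl) {
        return __riscv_vsetvl_e32m2(avl);
}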

Besides VL, AVL, and SEW, we can also set LMUL, which groups multiple vector registers together to provide larger or smaller spaces. RVV supports both fractional grouping (1/8, 1/4, or 1/2) and integer grouping (2, 4, or 8). To be honest, I don't know how to use fractional grouping yet (it is apparently meant for mixed-width code, so that narrower elements don't tie up whole registers).

Grouped registers have a multiplied length; that is, a register group is VLEN * LMUL bits long. For example, m8 grouping results in each group having VLEN * 8 bits. When doing vector operations, a group is referenced using its first register name. For example, with m8 grouping, there are four groups: v0 (which contains v0 through v7), v8, v16, and v24 (there are 32 vector registers in total, regardless of implementation).

As said before, SEW refers to the width (in bits) of a single array element. That is, on a given CPU, the maximum available number of elements (VLMAX) can be determined as LMUL * VLEN / SEW (the bits of a single register group divided by the width of a single element). For example, with VLEN = 128, LMUL = 8, and SEW = 8, VLMAX = 8 * 128 / 8 = 128 elements.

Lastly, you request a specific number of elements using the AVL parameter. The final number returned to your VL register is VL = min(AVL, VLMAX): you get the number of elements you asked for, capped at the maximum if you asked for too many.
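The intrinsics can also query VLMAX directly, which makes the min() behavior easy to see. A small sketch (assuming the __riscv_vsetvlmax_* intrinsics from the same naming scheme):

#include <riscv_vector.h>
#include <stddef.h>

size_t granted(size_t avl) {
        size_t vlmax = __riscv_vsetvlmax_e8m8(); /* 8 * VLEN / 8 = VLEN */
        size_t vl = __riscv_vsetvl_e8m8(avl);
        /* vl == (avl < vlmax ? avl : vlmax) */
        return vl;
}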

Then, if I understood correctly, vector operations all operate on VL elements, with a partial exception for load / store instructions (discussed below).

If your AVL exceeds VLMAX (which is the common case), you must loop yourself. Just use the VL returned by the vsetvli instruction to determine how many elements each iteration handles. This is commonly called "strip-mining"; you can picture it as a sliding window over the data, as the diagrams below show.

        Max number of elements
            |
            |
            |
            |
Selected    v
 |       +-----+
 |       |VLMAX+--->
 |       +-----+--------------------------------+
 +------>| V L | (Data to process later)        |
         +-----+--------------------------------+
         |                                      |
         +--------------------------------------+->
                         AVL (Required number of elements)
      
                      +-----+
                      |VLMAX+--->
         +------------+-----+-------------------+
         |Processed   | V L |Data to process    |
         +------------+-----+-------------------+
         |                                      |
         +--------------------------------------+->
                         AVL
      
                                             +-----+
                                             |VLMAX|
         +-----------------------------------+--+--+
         |Processed                          |VL|
         +-----------------------------------+--+
         |                                      |
         +--------------------------------------+->
                         AVL

(Generated using https://asciiflow.com/, image credit goes to https://itnext.io/grokking-risc-v-vector-processing-6afe35f2b5de)
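To make the loop concrete, here is a minimal sketch of a strip-mined loop using the C intrinsics set up later in this post (the vector-add function is my own example):

#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

/* c[i] = a[i] + b[i] for i in [0, n): each iteration processes VL elements. */
void add_arrays(int32_t *c, const int32_t *a, const int32_t *b, size_t n) {
        while (n > 0) {
                size_t vl = __riscv_vsetvl_e32m8(n); /* VL = min(n, VLMAX) */
                vint32m8_t va = __riscv_vle32_v_i32m8(a, vl);
                vint32m8_t vb = __riscv_vle32_v_i32m8(b, vl);
                __riscv_vse32_v_i32m8(c, __riscv_vadd_vv_i32m8(va, vb, vl), vl);
                a += vl; b += vl; c += vl; n -= vl; /* slide the window */
        }
}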

There is an exception: for load / store instructions, you get to choose a temporary element width for that operation only, the EEW (effective element width), and (I believe) the CPU will automatically adjust its LMUL to an EMUL (effective LMUL) equal to (EEW / SEW) * LMUL, so that the number of elements stays the same. EEW can be equal to, greater than, or smaller than SEW. For example, with SEW = 32 and LMUL = m2, a load with EEW = 16 uses EMUL = (16 / 32) * 2 = m1.

Execution Environment: QEMU User and GDB #

I chose QEMU User to run my programs. QEMU User is a lightweight userspace emulator: it runs just normal ELF files, but for a different ISA. It will transparently handle ld.so (possibly in a new sysroot) and translate syscalls from the emulated ISA to the host. Programs run using QEMU User are just like other processes on your Unix machine (i.e., they can open any files on your system just like other processes), except that they are interpreted by QEMU. You can also connect GDB to QEMU User to debug your program.

I did not use Spike or other simulators; that is just personal preference.

Thus, we need to compile our programs into ordinary ELF files to be run by QEMU User. For the program entry point, we can either provide our own _start or use a C library. I am going to show both ways.

Assembly Environment: Assembler and Linker #

If you just need to do RVV assembly programming, a cross assembler and a linker are enough. Remember that you need to write your own _start.

I just used the riscv64-elf-binutils package from the Arch Linux repository. Check whether the assembler shipped with your distro supports RVV. You can always build a cross toolchain yourself (see the following sections). As for the difference between -elf- and -linux-: the -elf- toolchain targets bare metal (no OS, typically with newlib), while the -linux- toolchain targets Linux userspace with glibc.

The process is pretty straightforward: write your program, run as(1), then ld(1), and finally qemu-riscv64 /path/to/a.out.

When writing your program, make sure you export your _start symbol using the .global directive. For other directives, see https://github.com/riscv-non-isa/riscv-asm-manual/blob/master/riscv-asm.md.

.section .text
# Make the label visible to the linker.
# Without this directive, ld(1) will not be able to find that label.
# You can test it using `riscv64-elf-nm /path/to/object-file`: `T` shows that
# the symbol is exported, while `t` shows that it is not.
# https://stackoverflow.com/a/37534357/6792243
.global _start
_start:
    # Call the Linux exit(2) syscall (number 93 on riscv64). QEMU User will
    # automatically translate it into the host ABI.
    li a0, 0          # exit code
    li a7, 93
    ecall

Assemble it:

% riscv64-elf-as -march=rv64iv -g -o rv.s.o rv.S
  • -march=rv64iv: The extensions to enable. The v extension is not enabled by default, so you need to specify it yourself. You can add other extensions here.
  • -g: Include debug info in the output file, so we can see the source and line numbers in GDB.
  • -o rv.s.o: Output file.

Examine the output object file on your own.

Link it:

% riscv64-elf-ld -o rv rv.s.o

Examine the output executable file on your own.

Run with QEMU and debug with GDB #

% qemu-riscv64 -cpu rv64,v=true,vlen=128 ./rv
  • -cpu rv64,v=true,vlen=128: Enable RVV support. QEMU does not enable the extension by default, so vsetvli and other RVV instructions will raise illegal-instruction traps. This also sets VLEN to 128, the smallest value the spec allows. You are free to use larger values like 256.
  • ./rv: The executable file.

To debug, first install a GDB that targets RISC-V. Then append -g <port> to the QEMU command line. That argument asks QEMU to listen on the specified port for GDB to connect, and it will also wait for the GDB connection before running anything.

Lastly, run:

$ riscv64-elf-gdb ./rv

and type target remote localhost:<port> to connect.

If you specified -g on the as(1) command line, you can start GDB with -tui to see the sources.
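Putting the pieces together, a debug session looks roughly like this (1234 is an arbitrary port I picked):

% qemu-riscv64 -g 1234 -cpu rv64,v=true,vlen=128 ./rv &
% riscv64-elf-gdb -tui ./rv
(gdb) target remote localhost:1234
(gdb) stepi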

C Environment: GCC #

Writing plain RISC-V assembly only gets you so far; practically everyone prefers SIMD intrinsics over raw assembly instructions. Like x86, RISC-V provides C intrinsics as well: https://github.com/riscv-non-isa/rvv-intrinsic-doc. Although the spec is not frozen, I'd guess its API and architecture won't change much in the future.

To use the intrinsics, you need a compatible C compiler. I did not try Clang, so I will only show the steps for GCC.

GCC has supported RVV intrinsics since 13.1, and Arch Linux does not yet ship this version (riscv64-linux-gnu-gcc and riscv64-elf-gcc are both 12.2.0). Thus, we need to build our own cross compiler.

Because I’m not sure how to do that myself, I used https://github.com/riscv-collab/riscv-gnu-toolchain, which is a set of scripts and Makefiles that builds a complete and up-to-date GNU toolchain for RISC-V. No worries, it pulls all source codes from the upstream, and this repository only contains build scripts. It greatly saved my life when building the toolchain.

To do so, just clone that repository and recursively init all submodules. Be warned though: it contains a lot of submodules, and a full clone takes about 11 GiB (with git history). If you are in China, https://help.mirrors.cernet.edu.cn/riscv-toolchains/ may help with the network. (https://mirror.iscas.ac.cn/riscv-toolchains/release/ also provides prebuilt binaries, but they are currently limited to GCC 12.2 for Ubuntu, so that does not help here.) You may remove the qemu, spike, and pk submodules to reduce download time.
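For reference, the clone described above boils down to:

% git clone https://github.com/riscv-collab/riscv-gnu-toolchain
% cd riscv-gnu-toolchain
% git submodule update --init --recursive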

Before starting, switch to GCC 13.1:

% cd gcc
% git checkout releases/gcc-13.1.0

Then configure for RVV support:

% cd /whatever/build/workspace/path
% /path/to/riscv-gnu-toolchain/configure --prefix=$(pwd)/install --with-arch=rv64gcv --with-abi=lp64d

Lastly:

% make linux -jN

It takes about 15 ~ 20 minutes for a clean build on my Ryzen 5600G with -j25, 112 GiB RAM, and the build directory entirely on tmpfs.

The final artifact (the install path) is about 1.7 GiB. Add its bin/ directory to your PATH, or install it to your system.

Finally, rewrite your assembly code into C:

#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

void *memcpy_c(void *dst, const void *src, size_t n) {
        /* Use byte pointers: arithmetic on void * is non-standard. */
        uint8_t *d = dst;
        const uint8_t *s = src;
        while (n) {
                /* Grab up to VLMAX bytes (SEW=8, LMUL=8). */
                size_t vl = __riscv_vsetvl_e8m8(n);
                /* Load vl bytes from src, store them to dst. */
                vuint8m8_t v = __riscv_vle8_v_u8m8(s, vl);
                __riscv_vse8_v_u8m8(d, v, vl);
                d += vl;
                s += vl;
                n -= vl;
        }
        return dst;
}

And compile with riscv64-unknown-linux-gnu-gcc -march=rv64gcv -g -nostdlib .... (The toolchain was configured for rv64gcv/lp64d above, so use a matching -march.) I switched my whole toolchain to the riscv64-unknown-linux-gnu- prefix, i.e., the -linux- flavor mentioned earlier.

Finally, link together with the previous object file and test. GDB should show C sources as well.

Note: GCC does not auto-vectorize for RVV yet. You may try something like riscv-gcc-rvv-next.

Use glibc #

It is pretty painful to write programs in assembly, and it is still painful to do so without the help of a libc. Fortunately, riscv-gnu-toolchain builds a fresh glibc alongside the other toolchain components, so we can just take advantage of the familiar C standard functions and system call wrappers.

Just remove your assembly _start entry point and drop the -nostdlib flag. Also, use gcc instead of ld to link your executable, because gcc takes care of the various crt0/crt1/crti/crtn pieces that you definitely don't want to deal with yourself.
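For example, a minimal driver for the memcpy_c from the previous section could look like this (main.c and the buffer size are my own choices):

/* main.c - quick sanity check for memcpy_c.
 * Build sketch: riscv64-unknown-linux-gnu-gcc -march=rv64gcv -g -o rv main.c rv.c */
#include <stdio.h>
#include <string.h>

void *memcpy_c(void *dst, const void *src, size_t n);

int main(void) {
        char src[1024], dst[1024];
        for (size_t i = 0; i < sizeof(src); i++)
                src[i] = (char)i;
        memcpy_c(dst, src, sizeof(src));
        puts(memcmp(dst, src, sizeof(src)) == 0 ? "OK" : "MISMATCH");
        return 0;
}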

If you run your program now, QEMU will complain: qemu-riscv64: Could not open '/lib/ld-linux-riscv64-lp64d.so.1': No such file or directory.

The new executable is no longer static: it now requires ld.so and various glibc files (e.g., locale data) to work properly. The only hard-coded path in the ELF is the path to ld.so, and QEMU needs to be able to find it. If you look carefully at the install dir of riscv-gnu-toolchain, you will find a sysroot/ directory containing a minimal glibc installation and the ld.so. You could certainly chroot(2), and it would be a very simple approach, but I want something lighter, at least something that doesn't need root every time I debug my program.

Fortunately, QEMU User supports a custom search path for ld.so via the -L flag. Therefore, we can add the following flag to the QEMU command line:

qemu-riscv64 ... -L /path/to/toolchain/install/dir/sysroot/

and QEMU will be able to invoke ld.so to pull the dynamically-linked objects into our address space. ld.so then does the dirty job of searching for the libraries.
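So the full invocation from earlier becomes:

% qemu-riscv64 -cpu rv64,v=true,vlen=128 -L /path/to/toolchain/install/dir/sysroot/ ./rv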

I haven’t tried linking against other libraries except for glibc at this time. It could be possible that ld.so won’t find such libraries. If that is the case, I have to chroot(2).

You may find https://git.yuuta.moe/rvv.git/ useful. Thanks for reading.