I recently setup a development environment for RISC-V Vector (RVV) intrinsics. Here are some tips and steps.
RVV Basics ๐
This section is not really necessary in terms of setting up a dev env, but it helps if you are also working with RVV.
Disclaimer: I am also a total beginner to RVV, so there may be errors or inaccuracies in my understanding.
RVV differs from other SIMD implmentations by making programs scalable across
different vector register lengths. Traditional ISAs like x86 (SSE, AVX, etc)
and Arm (Neon) have fixed vector registers, so developers don’t really need to
care about the variations across different implementations. However, RISC-V
implementations are highly diversed, and supporting different vector register
lenghts is necessary. Thus, RVV introduces a set of new concepts like SEW
,
VLEN
, VL
, AVL
, VLMAX
, and LMUL
.
These terms are very confusing at the beginning, but the programming model is
simple: you request space from the CPU dynamically (think it like calloc(3)
).
In RVV, you are processing an array of elements. The element length is called
SEW
(Selected element width), and it is either 8bits, 16bits, 32bits, or
64bits. The number of elements you request is called AVL
(application vector
length).
When you need to do vector operations, you first tell the CPU your element
length (SEW
) and how many elements you want (AVL
), with other options to
be discussed later. Then, the CPU will provide the available element count,
VL
(Vector length), which depends on VLEN
, the hard-coded vector register
length on that specific CPU.
These data are stored in a CSR
(Control and status register) called vtype
,
and it can be configured using the following set of instructions:
# VL = Destination register that stores the given element count
# AVL = Register that stores the number of elements you want
# SEW = e8 / e16 / e32 / e64
# LMUL = Register grouping: mf8 / mf4 / mf2 / m1 / m2 / m4 / m8, see following
# Tail Agnostic: I don't know, put ta here
# Mask Agnostic: I don't know either, put ma here
vsetvli VL, AVL, SEW, LMUL, Tail Agnostic, Mask Agnostic
# Or use vsetivli or vsetvl
Besides VL
, AVL
, and SEW
, we can also set LMUL
, which groups multiple
vector registers together to provide larger or smaller spaces. RVV supports
both fractional grouping (1 / 8, 1 / 4, or 1 / 2) and non-fractional grouping
(2 / 4 / 8). To be honest I don’t know how to use fractional grouping yet.
Grouped registers have the multipled length. That is, grouped registers have
the length of VLEN * LMUL
. For example, grouping of m8
will result in
each register having VLEN * 8
bits. When doing vector operations, grouped
registers are referenced using the first register name. For example, with m8
grouping, there are four registers: v0
(which contains v0
, v1
, v2
, v3
, v4
, v5
, v6
, v7
), v8
, v16
, and v24
(there are 32 vector
registers in total, regardless of implementation.)
As said before, SEW
refers to the length (in bits) of a single array element.
That is, on a given CPU, the max available number of elements (VLMAX
) can be
determined using LMUL * VLEN / SEW
(Bits of a single grouped register divided
by the length of a single element).
Lastly, you request a specific amount of elements using the AVL
parameter.
Therefore, the final number returned to your VL
register is
VL = min(AVL, VLMAX)
: take your number of elements, or the max number of
elements if your number is too large.
Then, if I understood correctly, vector operations will all operate on the
VL
number of elements, except for load / store operations.
If your AVL
exceeds VLMAX
(which is the common case), you must loop
yourself. Just use the VL
number returned from vsetvli
instruction to
determine how many iterations do you need. This is called a “sliding window”
approach.
Max number of elements
|
|
|
|
Selected v
| +-----+
| |VLMAX+--->
| +-----+--------------------------------+
+------>| V L | (Data to process later) |
+-----+--------------------------------+
| |
+--------------------------------------+->
AVL (Required number of elements)
+-----+
|VLMAX+--->
+------------+-----+-------------------+
|Processed | V L |Data to process |
+------------+-----+-------------------+
| |
+--------------------------------------+->
AVL
+-----+
|VLMAX|
+-----------------------------------+--+--+
|Processed |VL|
+-----------------------------------+--+
| |
+--------------------------------------+->
AVL
(Generated using https://asciiflow.com/, image credit goes to https://itnext.io/grokking-risc-v-vector-processing-6afe35f2b5de)
There is an exception: for load / store instructions, you get to choose a
temporary SEW
for that operation only: the EEW
(Effective element width),
and (I believe) the CPU will automatically adjust its LMUL
to use a EMUL
(Effective LMUL
), which equals to (EEW / SEW) * LMUL
. EEW
can be equals
to, greater, or smaller than SEW
.
Learn more:
- https://github.com/riscv/riscv-v-spec: RVV spec
- https://itnext.io/grokking-risc-v-vector-processing-6afe35f2b5de: Intro
- https://github.com/aosp-riscv/working-group/blob/b283a7cb9f9ee3e629a6014985a70a63ea05542f/articles/20230629-rvv-note.md: Comments to the above article (Chinese)
- https://www.francisz.cn/2022/03/23/riscv-vector/: Very detailed intro to RVV intructions (Chinese)
Execution Environment: QEMU User and GDB ๐
I choose QEMU User to run my program. QEMU User is a lighweight userspace
simulator: it runs just normal ELF files but with a different ISA. It will
transparently handle ld.so
(possibly in a new sysroot) and translate syscalls
into the simulated ISA. Programs ran using QEMU User are just like other
processes on your Unix machine (i.e., they can open whatever files on your
system just like other processes), except that they are interpreted by QEMU.
QEMU User can also be connected from GDB to debug your program.
I did not use Spike or other simulators because of personal preference.
Thus, we need to compile our programs into ordinary ELF files to be ran by
QEMU User. For the program entry point, we can either provide our own _start
,
or use a C library. I am going to show both ways.
Learn more:
Assembly Environment: Assembler and Linker ๐
If you just need to do RVV assembly programming, a cross compile assembler and
a linker are enough. Remember that you need to write your own _start
.
I just used the riscv-elf-binutils
package from the Arch Linux repository.
Check whether the assembler shipped with your distro supports RVV or not.
You can always build a cross compile toolchain yourself (see the following
sections). I don’t really know whtat’s the difference between -elf-
and
-linux-
.
The process is pretty straight-forward: write your program, run as(1)
, then
ld(1)
, and finally qemu-riscv64 /path/to/a.out
.
When writing your program, make sure you export your _start
function using
the .global
command. For other commands, see
https://github.com/riscv-non-isa/riscv-asm-manual/blob/master/riscv-asm.md.
.section text
# Make it visible by the linker.
# Without this command, ld(1) will not be able to find that label.
# You can test it using `riscv64-elf-nm /path/to/object-file`: `T` shows that
# the label is exported, while `t` shows that it is not.
# https://stackoverflow.com/a/37534357/6792243
.global _start
.start:
# Call the Linux exit(2) syscall. QEMU User will automatically
# translate that into host ABI.
li a7, 0x93
ecall
Assemble it:
% riscv64-elf-as -march=rv64iv -g -o rv.s.o rv.S
-march=rv64imv
: The extensions to enable. Thev
extension is not enabled by default, so you need to specify your own. You can put other extensions here.-g
: Include debug info in the output file, so we can see the source and line numbers in GDB.-o rv.s.o
: Output file.
Examine the output object file on your own.
Link it:
% riscv64-elf-ld -o rv rv.s.o
Examine the output executable file on your own.
Run with QEMU and debug with GDB ๐
% qemu-riscv64 -cpu rv64,v=true,vlen=128 ./rv
-cpu rv64,v=true,vlen=128
: Enable RVV support. QEMU by default does not use that extension, sovsetvli
and other RVV instructions will raise illegal instruction traps. This setsVLEN
to 128, which is the smallest number specified by the spec. You are free to use larger numbers like 256../rv
: The executable file.
To debug, first install the appropriate GDB for RISC-V. Then append -g <port>
to the QEMU command line. That argument asks QEMU to listen on the specified
port for GDB to connect, and it will also wait for GDB connection before
running anything.
Lastly, run:
$ riscv64-elf-gdb ./rv
and type target remote localhost:<port>
to connect.
If you specified -g
on the as(1)
command line, you can use -tui
to see
the sources.
C Environment: GCC ๐
Writing RISC-V assembly is simply not enough. Everyone on earth prefers SIMD intrinsics rather than plain old assembly instructions. Similar to x86, RISC-V provides C intrinsics as well: https://github.com/riscv-non-isa/rvv-intrinsic-doc. Although it is not frozen, its API or architecture won’t change a lot in the future, I guess.
To enable the intrinsics, you need a compatible C compiler. I did not deal with clang, so I will only show the steps for GCC.
GCC had support for RVV intrinsics since 13.1, and Arch Linux does not yet
ship this version (riscv64-linux-gnu-gcc
and riscv64-elf-gcc
are both
12.2.0). Thus, we need to build our own cross compiler.
Because I’m not sure how to do that myself, I used https://github.com/riscv-collab/riscv-gnu-toolchain, which is a set of scripts and Makefiles that builds a complete and up-to-date GNU toolchain for RISC-V. No worries, it pulls all source codes from the upstream, and this repository only contains build scripts. It greatly saved my life when building the toolchain.
To do so, just clone that repository and recursively init all submodules. Be
warned though, it contains a lot of submodules, and a full clone takes about
11GiB (with git histroy). If you are in China,
https://help.mirrors.cernet.edu.cn/riscv-toolchains/ may help with the network.
https://mirror.iscas.ac.cn/riscv-toolchains/release/ also provides prebuilt
binaries, but they are currently limited to 12.2 GCC for Ubuntu, so that is not
helpful.) You may remove qemu
, spike
, and pk
to reduce download time.
Before starting, switch to GCC 13.1:
% cd gcc
% git checkout releases/gcc-13.1.0
Then configure for RVV support:
% cd /whatever/build/workspace/path
% /path/to/riscv-gnu-toolchain/configure --prefix=$(pwd)/install --with-arch=rv64gcv --with-abi=lp64d
Lastly:
% make linux -jN
It takes about 15 ~ 20 minutes for a clean build on my Ryzen 5600G with -j25
,
112GiB RAM, and fully built on tmpfs.
The final artifact (install path) is about 1.7GiB. Add the bin/
path to your
PATH or install to your system.
Finally, rewrite your assembly code into C:
#include <riscv_vector.h>
void *memcpy_c(void *dst, const void *src, size_t n) {
void *dst2 = dst;
while (n) {
size_t vl = __riscv_vsetvl_e8m8(n);
vuint8m8_t v = __riscv_vle8_v_u8m8(src, vl);
__riscv_vse8_v_u8m8(dst2, v, vl);
dst2 += vl;
src += vl;
n -= vl;
}
return dst;
}
And compile with riscv-unknown-linux-gnu-gcc -march=rv64imv -g -nostdlib...
.
I switched my whole toolchain into riscv-unknown-linux-gnu-
. Again, I am not
clear with the difference between -elf-
toolchain and -linux-
one.
Finally, link together with the previous object file and test. GDB should show C sources as well.
Note: GCC does not have auto vectorization yet. You may try something like
riscv-gcc-rvv-next
.
Learn more:
- https://github.com/riscv-collab/riscv-gnu-toolchain
- https://gitee.com/aosp-riscv/working-group/blob/master/articles/20220721-riscv-gcc.md: Chinese
Use glibc ๐
It is pretty painful to write programs in assembly, and it is still painful to do so without the help of libc. Fortunately, riscv-gnu-toolchain builds a fresh glibc together with other toolchain components, so we can just take the advantage of the familiar C standard functions and system call wrappers.
Just remove your assembly _start
entry point and remove the -nostdlib
flag.
Also, use gcc
instead of ld
to link your executable because gcc
takes
care of the various crt0
crt1
crti
crtn
stuff that you definitely
don’t want to deal with yourself.
If you run your program now, QEMU will complain that:
qemu-riscv64: Could not open '/lib/ld-linux-riscv64-lp64d.so.1': No such file or directory
.
The new executable is no longer a static executable - it now requires a ld.so
and various glibc files (e.g., locale) to work properly. The only hard-coded
path in ELF is the path to ld.so
, and QEMU needs to be able to find the path.
If you looked at the install dir of riscv-gnu-toolchain carefully, you will
find a sysroot/
directory, containing a minimal glibc installation and the
ld.so
. You can definitely chroot(2), and it could be a very simple approach,
but I want something lighter, at least something that don’t need root everytime
I debug my program.
Fortunately, QEMU User supports supplying a custom search path for ld.so
:
through the -L
flag. Therefore, we can add the following flag to QEMU:
qemu-riscv64 ... -L /path/to/toolchain/install/dir/sysroot/
and QEMU will be able to invoke ld.so
to pull dynamically-linked stuff into
our address space. It will do the dirty job of searching for libraries.
I haven’t tried linking against other libraries except for glibc at this time.
It could be possible that ld.so
won’t find such libraries. If that is the
case, I have to chroot(2).
You may find https://git.yuuta.moe/rvv.git/ useful. Thanks for reading.