SE Can't Code

A Tokyo based Software Engineer. Not System Engineer :(

Fundamentals of LLVM's Intermediate Representation

This entry introduces LLVM IR by walking through the intermediate representation generated for a HelloWorld program written in C. LLVM is a compiler infrastructure, most commonly used to build ahead-of-time compilers for native languages like C and C++.

About LLVM architecture

First, here is a brief overview of LLVM's architecture and its major components:


LLVM consists of modules that convert source code into machine language by way of LLVM IR, which is where optimization happens, plus libraries and tools for building those modules. You can build a compiler for a wide range of languages and machines by writing a front end for a specific language and a back end for a specific architecture.

The front end takes source code and turns it into LLVM IR. This translation simplifies the job of the rest of the compiler, which doesn't want to deal with the full complexity of C++ source code.

The passes transform IR to IR. In ordinary circumstances, passes optimize the code: they produce an IR program as output that does the same thing as the IR they took as input, except that it's faster.

The back end generates actual machine code from LLVM IR. This is the most difficult part.

The most important point is that LLVM IR is independent of any particular language or architecture, because operations such as optimization work only on LLVM IR. You can freely build a compiler by combining LLVM's analyses and optimization passes.
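The idea of a pass as an IR-to-IR transformation can be sketched with a toy example. The block below is not LLVM code: the `Inst` type, the `fold_constants` function and the `fold_example` helper are all hypothetical names invented for illustration. It runs a simple constant-folding pass over a tiny three-address program, so the output program computes the same value with less work left to do at run time.

```c
#include <stddef.h>

/* Toy three-address IR for illustration only; NOT LLVM's data structures. */
typedef enum { OP_CONST, OP_ADD, OP_MUL } Opcode;

typedef struct {
    Opcode op;
    int lhs; /* OP_CONST: the literal value; otherwise index of the left operand */
    int rhs; /* index of the right operand (unused for OP_CONST) */
} Inst;

/* A toy "pass": replaces every ADD/MUL whose operands are both constants
   with the constant result. Operands always refer to earlier instructions,
   so one forward sweep also folds chains. Returns the number of folds. */
size_t fold_constants(Inst *code, size_t n) {
    size_t folded = 0;
    for (size_t i = 0; i < n; i++) {
        if (code[i].op == OP_CONST) continue;
        Inst a = code[code[i].lhs];
        Inst b = code[code[i].rhs];
        if (a.op != OP_CONST || b.op != OP_CONST) continue;
        int value = (code[i].op == OP_ADD) ? a.lhs + b.lhs : a.lhs * b.lhs;
        code[i].op = OP_CONST;
        code[i].lhs = value;
        code[i].rhs = 0;
        folded++;
    }
    return folded;
}

/* Folds the program for (2 + 3) * (2 + 3) and returns the value left in
   the last instruction. */
int fold_example(void) {
    Inst prog[] = {
        { OP_CONST, 2, 0 }, /* %0 = 2       */
        { OP_CONST, 3, 0 }, /* %1 = 3       */
        { OP_ADD,   0, 1 }, /* %2 = %0 + %1 */
        { OP_MUL,   2, 2 }, /* %3 = %2 * %2 */
    };
    fold_constants(prog, sizeof prog / sizeof prog[0]);
    return prog[3].lhs;
}
```

Calling `fold_example()` from a `main` function returns 25, because both arithmetic instructions were rewritten into constants. Real LLVM passes work on much richer data structures, but the shape is the same: IR in, equivalent IR out.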


Most LLVM IR instructions are in three-address form, similar to assembly (three-address code means an instruction names at most two inputs and one output, each in memory or a register), and values are held in registers. LLVM IR targets a register machine with an infinite number of virtual registers, and it uses SSA form. SSA (static single assignment) is a property of LLVM IR which requires that each variable is assigned exactly once and that every variable is defined before it is used. The advantage of SSA is that it makes the def-use chains of variables easy to follow, which simplifies optimization and analysis. LLVM IR comes in three equivalent forms:

  1. In-memory compiler IR
  2. Bitcode on disk
  3. Human-readable assembly format
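Returning to the three-address and SSA properties described before the list, here is a small sketch in C. The function name `diff_of_squares` is made up for this example, and the IR in the comments is a hand-written approximation, not actual Clang output. Each temporary is assigned exactly once and is defined before it is used, and each line has at most two inputs and one output.

```c
/* Computes (a + b) * (a - b) the way three-address, SSA-style code would:
   every temporary is written exactly once and defined before use. */
int diff_of_squares(int a, int b) {
    const int t1 = a + b;   /* %t1 = add i32 %a, %b   */
    const int t2 = a - b;   /* %t2 = sub i32 %a, %b   */
    const int t3 = t1 * t2; /* %t3 = mul i32 %t1, %t2 */
    return t3;              /* ret i32 %t3            */
}
```

Because `t1`, `t2` and `t3` never change after their definition, finding every use of a value is just a matter of looking for its name, which is the def-use chain that SSA makes easy to follow.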

The structure of LLVM IR is as follows:


Modules contain Functions, which contain BasicBlocks, which contain Instructions. Everything but Module descends from Value.
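As a rough mental model of this containment hierarchy, here is a toy sketch in C. These structs are hypothetical and only share their names with LLVM's C++ classes; they are not LLVM's API and they leave out the Value base class entirely. The sample data and the `count_instructions` helper are also invented for illustration.

```c
#include <stddef.h>

/* Toy model of the hierarchy; NOT LLVM's real classes. */
typedef struct { const char *text; } Instruction;
typedef struct { const char *label; const Instruction *insts; size_t num_insts; } BasicBlock;
typedef struct { const char *name; const BasicBlock *blocks; size_t num_blocks; } Function;
typedef struct { const char *id; const Function *funcs; size_t num_funcs; } Module;

/* Walks Module -> Function -> BasicBlock and counts the Instructions. */
size_t count_instructions(const Module *m) {
    size_t total = 0;
    for (size_t f = 0; f < m->num_funcs; f++)
        for (size_t b = 0; b < m->funcs[f].num_blocks; b++)
            total += m->funcs[f].blocks[b].num_insts;
    return total;
}

/* A sample module with one function, one basic block and two instructions. */
static const Instruction entry_insts[] = {
    { "%sum = add i32 %a, %b" },
    { "ret i32 %sum" },
};
static const BasicBlock add_blocks[] = { { "entry", entry_insts, 2 } };
static const Function sample_funcs[] = { { "add", add_blocks, 1 } };
static const Module sample_module = { "sample", sample_funcs, 1 };
```

Calling `count_instructions(&sample_module)` visits each level in turn and returns 2.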

Intermediate representation of LLVM

Let's confirm the structure of LLVM IR by converting C code to LLVM IR. The code below is the simple HelloWorld program that everyone has seen at least once:


#include <stdio.h>

int main(){
    printf("HelloWorld\n");
    return 0;
}

Run the command below to convert HelloWorld.c to LLVM IR. It uses Clang's -emit-llvm option, and -S asks for the human-readable assembly form:

$ clang -emit-llvm -S -o HelloWorld.ll HelloWorld.c

After running this command, you get a HelloWorld.ll file containing the LLVM IR generated from the C code:

; ModuleID = 'HelloWorld.c'
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

@.str = private unnamed_addr constant [12 x i8] c"HelloWorld\0A\00", align 1

; Function Attrs: nounwind uwtable
define i32 @main() #0 {
entry:
  %retval = alloca i32, align 4
  store i32 0, i32* %retval, align 4
  %call = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([12 x i8], [12 x i8]* @.str, i32 0, i32 0))
  ret i32 0
}

declare i32 @printf(i8*, ...) #1

attributes #0 = { nounwind uwtable "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+fxsr,+mmx,+sse,+sse2" "unsafe-fp-math"="false" "use-soft-float"="false" }
attributes #1 = { "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+fxsr,+mmx,+sse,+sse2" "unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.ident = !{!0}

!0 = !{!"clang version 3.8.0 (tags/RELEASE_380/final)"}

The first three lines give information such as the name of the source file and the target environment. In LLVM IR, ";" starts a comment: everything from this symbol to the end of the line is ignored. "target datalayout" describes how data is laid out in memory, such as sizes and alignment, and "target triple" describes the target environment, such as the architecture and OS. "@.str" is a global variable that defines the string used in this code.

The definition of the main function starts at "define i32 @main() #0 {". "entry" is the label of a BasicBlock, which in this case holds the instructions that print HelloWorld. First, the "alloca" instruction reserves stack memory for one i32 (a 32-bit integer) with 4-byte alignment and names its address "retval". Next, the "store" instruction writes 0 into "retval", the "call" instruction calls the "printf" function, and the "ret" instruction returns the value. Finally, "declare" declares the external "printf" function. That's all.
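To connect the IR back to C, here is a sketch that follows the body of @main one instruction at a time. The helper name `main_like_ir` and the variable name `str` are made up for this example. Unlike the real main, which discards %call and returns 0, this helper returns the value of %call so that it can be inspected.

```c
#include <stdio.h>

/* Counterpart of @.str: 10 letters + '\n' + terminating NUL = 12 bytes,
   which is why the IR type is [12 x i8]. */
static const char str[] = "HelloWorld\n";

/* Hypothetical helper mirroring @main; returns what the IR keeps in %call. */
int main_like_ir(void) {
    int retval;                 /* %retval = alloca i32, align 4            */
    retval = 0;                 /* store i32 0, i32* %retval, align 4       */
    int call = printf(&str[0]); /* call @printf(getelementptr ... @.str, 0, 0) */
    (void)retval;               /* the slot is written but never read back  */
    return call;                /* the real main does `ret i32 0` instead   */
}
```

The `getelementptr` with two zero indices plays the role of `&str[0]` here: it turns a pointer to the whole 12-byte array into a pointer to its first character. `sizeof(str)` is 12, and `printf` returns 11, the number of characters it wrote.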

You can now see some characteristics of LLVM IR. Modules house Functions, which are exactly what they sound like: named chunks of executable code. Aside from declaring its name and arguments, a Function is mainly a container of BasicBlocks. The BasicBlock is a familiar concept from compilers, but for our purposes it's just a contiguous chunk of instructions. An Instruction is a single code operation. Most things in LLVM are C++ classes that inherit from an omnivorous base class called Value.