Tuesday, June 25, 2019

Deep dive into Defense against BTB attacks (Branch Target Buffer)

Briefs on BTB attacks

  • Branch Target Collision

    • Can: Obtain the layout information (ASLR, KASLR)
    • Can't: Read the privileged memory
  • Branch Target Injection

    • Can: Read the privileged memory

Briefs on BTB defenses

  • Retpoline [1]

    • Blocks both Branch Target Collision and Branch Target Injection.
    • Prevents all exploits based on branch target buffer
    • Arch: x86 (Intel, AMD); used by Linux distros
  • Flush BTB entries explicitly [2]

    • Blocks Branch Target Injection only. (How do they deal with Branch Target Collision? Ignore it? Do they have another solution?)
    • Note: it has actually been deprecated in the latest kernel.
    • Arch: arm

Flush BTB entries explicitly

  • The simplest way to stop leaking KASLR bits (Branch Target Collision)
    is to flush the BTB when switching context between user and kernel.

  • Likewise, the way to stop Branch Target Injection
    is to flush the BTB when switching context between processes or between VMs.

  • ARM takes this approach because it is easy to implement and has no side effects.
    There are 3 spots where the BTB entries can be flushed.
    (But in the latest kernel, it seems to be deprecated.)
    • (1) Context Switch Between User and Kernel
      • It stops KASLR bits from leaking, but at the same time imposes a severe performance overhead.
    • (2) Context Switch Between User Processes
      • It stops Branch Target Injection between user processes. (no memory leak from a victim process)
    • (3) Context Switch Between VMs
      • It stops Branch Target Injection between VMs. (no memory leak from a VM or the host)

  • ARM has adopted both (2) and (3) in its Linux kernel.
    (1) has not been adopted because of its performance impact.
    [TODO] But.. does this decision mean ARM allows the KASLR leak??

  • ARM doesn't flush all BTB entries. Instead, it partially flushes the BTB entries that are exploitable for Branch Target Injection.
    ARM's hardening also depends on the CPU type: it does nothing on CPUs that are not vulnerable to Branch Target Injection.

  • In summary, ARM determines where the BTB flush performs in considering
    • CPU Type
      • If a CPU is not fundamentally vulnerable to BTB attacks, there are no BTB flushes.
    • Interest of an attacker
      • Even if a CPU is vulnerable, ARM doesn't do the flush on all three spots. (1), (2), (3).
      • Identify the spots an attacker is interested in,
        and do the BTB flush only on those spots. That significantly reduces the number of BTB flushes.
      • ARM provides an interface for a developer to add a new spot for BTB flush.

  • In the latest Linux kernel (v5.2), there seem to be no BTB flushes anywhere.
    I think ARM regards the BTB flush as a temporary fix.

  • ARM takes special care of the KVM (EL2) vector. ARM replaces the original vectors with hardened vectors instead of performing the BTB flush.

  • Linux kernel code (CONFIG_HARDEN_BRANCH_PREDICTOR, ARM64_HARDEN_BRANCH_PREDICTOR)
// Code-1: KVM
// Assumption behind it: An attacker can inject a branch target across VMs through a hyp-vector.
// Strategy: Makes sure that no branch prediction happens in hardened vector.
// [TODO] I don't understand it clearly. (Does NOPs work for no branch prediction?)
#ifdef CONFIG_HARDEN_BRANCH_PREDICTOR
....
// the hardening sequence is placed in one of the kvm vector slots.
static inline void *kvm_get_hyp_vector(void)
{
....
  if (cpus_have_const_cap(ARM64_HARDEN_BRANCH_PREDICTOR) && data->fn) {
    vect = kern_hyp_va(kvm_ksym_ref(__bp_harden_hyp_vecs_start));
    slot = data->hyp_vectors_slot;
  }
....
  return vect;
}
....
ENTRY(__kvm_hyp_vector) // original hyp vector
....
valid_vect  el1_sync        // Synchronous 64-bit EL1
....
ENDPROC(__kvm_hyp_vector)

.macro hyp_ventry
    .align 7
1:  .rept 27
    nop
    .endr
 ....
    * Where addr = kern_hyp_va(__kvm_hyp_vector) + vector-offset + 4.
    * See kvm_patch_vector_branch for details.
    */
 alternative_cb  kvm_patch_vector_branch
    b   __kvm_hyp_vector + (1b - 0b)
 ....
 alternative_cb_end
.endm

.macro generate_vectors
0:
    .rept 16
    hyp_ventry
    .endr
    .org 0b + SZ_2K     // Safety measure
.endm

    .align  11
ENTRY(__bp_harden_hyp_vecs_start) // hardened vector
    .rept BP_HARDEN_EL2_SLOTS
    generate_vectors
    .endr
ENTRY(__bp_harden_hyp_vecs_end)
// Code-2: Switching context between processes
void check_and_switch_context(struct mm_struct *mm, unsigned int cpu)
{
...
switch_mm_fastpath:
    arm64_apply_bp_hardening();
 // This partially invalidates the BTB. But nothing happens in the latest kernel.

// Code-3: Instruction abort that user triggers (Prefetch abort)
asmlinkage void __exception do_el0_ia_bp_hardening(unsigned long addr, ...
{
....
if (!is_ttbr0_addr(addr))
    arm64_apply_bp_hardening();
 // This partially invalidates the BTB. But nothing happens in the latest kernel.
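  • For reference, arm64_apply_bp_hardening() itself is just a thin per-CPU dispatch.
    The sketch below is paraphrased from arch/arm64/include/asm/mmu.h (not verbatim; details vary between kernel versions),
    and it matches the "depends on CPU type" behavior described above: if the CPU isn't marked vulnerable, no callback is installed and nothing runs.
// Code-4: Paraphrased sketch of arm64_apply_bp_hardening() (my own annotation, not kernel source)
typedef void (*bp_hardening_cb_t)(void);

struct bp_hardening_data {
    int               hyp_vectors_slot;  // which hardened hyp vector slot to use (see Code-1)
    bp_hardening_cb_t fn;                // per-CPU callback that invalidates the BTB
};

DECLARE_PER_CPU_READ_MOSTLY(struct bp_hardening_data, bp_hardening_data);

static inline void arm64_apply_bp_hardening(void)
{
    struct bp_hardening_data *d;

    // CPUs without the ARM64_HARDEN_BRANCH_PREDICTOR capability do nothing at all.
    if (!cpus_have_const_cap(ARM64_HARDEN_BRANCH_PREDICTOR))
        return;

    // The callback (e.g. a firmware/SMC call) is installed per CPU type at boot.
    d = this_cpu_ptr(&bp_hardening_data);
    if (d->fn)
        d->fn();
}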

Retpoline

  • Retpoline works by replacing every indirect jump with a return-based trampoline built from direct calls and returns.
    It relies on the fact that the attacks are infeasible if there are no indirect jumps.
    The retpoline gadget leaves no point at which speculation could be controlled by an external attacker.
Indirect jump replacement with retpoline
  • Before retpoline
jmp *%rax
  • After retpoline
(1) call load_label
    capture_ret_spec:
(2) pause; LFENCE
(3) jmp capture_ret_spec
    load_label:
(4) mov %rax, (%rsp)
(5) ret
  • Our goal is to jump to the address held in %rax.

  • (1) --> (4) --> (5). That sequence means we use a ret instruction to perform the jump.
    But it brings a new predictor into play, the RSB (Return Stack Buffer),
    so we also have to take special care of the RSB.

  • (5) ret triggers a new speculative execution relying on the RSB.
    While speculating, the CPU uses the RSB entry created by (1) and jumps to (2).

  • (2) is a barrier and (3) jumps back to (2). What are they for?
    To make sure the speculation started at (5) goes nowhere, the real jump at (5) must use %rax taken from the stack memory,
    and the speculative path must be trapped. Without the barrier, the speculative execution could fall through,
    hit (5) again, and speculate once more. That's why (2) is needed.

  • When (2) finishes, the CPU continues with either (3) or, without it, eventually (5).
    If (3) were not there, speculation would happen again at (5).
    That's why (3) is needed.

  • Return Stack Buffer (RSB)

    • N entries that store recent return addresses.
    • When a call instruction executes, the corresponding return address is pushed onto the RSB.
    • When a ret instruction executes, a return address is popped off the RSB and used for prediction.
    • RSB: | ret-1 | ret-2 | ret-3 | ....
    • In the above retpoline gadget, (5) consumes an RSB entry when it is executed speculatively; if that entry were not the one created at (1), (5) would jump to an undesired target.

  • Do LFENCE and PAUSE impose a lot of performance overhead?

    • If a CPU always executed the LFENCE, the answer would be yes.
      But the LFENCE is only reached in the context of speculative execution.
      Since benign (architectural) execution never reaches it, it may not exhibit the performance impact typically associated with speculation barriers.
  • For details about replacing call instruction, check out [3].

How Linux kernel supports Retpoline
  • There are 3 options for how the Linux kernel supports Retpoline.

    • retpoline : replace indirect branches
    • retpoline,generic : Google's original retpoline (this is what we looked at in the previous section)
    • retpoline,amd : AMD-specific minimal thunk
  • The Linux kernel mainline uses "retpoline" by default (for x86_64_defconfig).
    The Ubuntu kernel uses "retpoline,generic" by default, which means full retpoline protection.

  • Let's look at the "retpoline" that Linux kernel mainline adopts. (CONFIG_RETPOLINE)

    • "retpoline" version applies two different retpoline gadgets based on the severity of some codes.

    • Default Gadget (used everywhere except BPF JIT)
      • Default Gadget doesn't provide a perfect protection. Instead, add some hardening.
        Since all normal BTB updates are running in _x86_indirect_thunk[r], it's difficult for an attacker to mistrain that.
      • This approach reduces performance overhead coming up with full retpoline protection.
      // kernel code
      # define CALL_NOSPEC                        \
       ANNOTATE_NOSPEC_ALTERNATIVE             \
       ALTERNATIVE_2(                      \
       ANNOTATE_RETPOLINE_SAFE                 \
       "call *%[thunk_target]\n",              \
       "call __x86_indirect_thunk_%V[thunk_target]\n",     \
       X86_FEATURE_RETPOLINE,                  \
       "lfence;\n"                     \
       ANNOTATE_RETPOLINE_SAFE                 \
       "call *%[thunk_target]\n",              \
       X86_FEATURE_RETPOLINE_AMD)
      # define THUNK_TARGET(addr) [thunk_target] "r" (addr)
      
      // kernel binary dump
      // Before retpoline: jmpq *%rbx;
      callq  ffffffff81e00f20 <__x86_indirect_thunk_rbx>
      ....
      // All indirect branches through rbx go through this thunk.
      // Even if an attacker mistrains this branch,
      // other calls (dummy calls from the attacker's point of view) will wipe the malicious entry out of the BTB.
      // So it is an effective hardening with a reasonable performance impact.
      ffffffff81e00f20 <__x86_indirect_thunk_rbx>:
      ffffffff81e00f20:   ff e3                   jmpq   *%rbx
      ffffffff81e00f22:   90                      nop
      ffffffff81e00f23:   90                      nop
      ....
      ffffffff81e00f40 <__x86_indirect_thunk_rcx>:
      ffffffff81e00f40:   ff e1                   jmpq   *%rcx
      ffffffff81e00f42:   90                      nop
      ffffffff81e00f43:   90                      nop
      ffffffff81e00f44:   90                      nop
      
    • BPF JIT Gadget (only used in BPF JIT)
      • Even if the default gadget is adopted across the entire kernel,
        an attacker could generate code where they control the speculation via the BPF JIT.
      • So full retpoline protection should be applied to the code-generation logic of the BPF JIT.
      // kernel code
      #ifdef CONFIG_RETPOLINE
      # ifdef CONFIG_X86_64
      #  define RETPOLINE_RAX_BPF_JIT_SIZE    17
      #  define RETPOLINE_RAX_BPF_JIT()               \
      do {                                \
       EMIT1_off32(0xE8, 7);    /* callq do_rop */     \
       /* spec_trap: */                    \
       EMIT2(0xF3, 0x90);       /* pause */            \
       EMIT3(0x0F, 0xAE, 0xE8); /* lfence */           \
       EMIT2(0xEB, 0xF9);       /* jmp spec_trap */        \
       /* do_rop: */                       \
       EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */    \
       EMIT1(0xC3);             /* retq */         \
      } while (0)
      
      // kernel code - arch/x86/net/bpf_jit_comp.c
      static void emit_bpf_tail_call(u8 **pprog)
      {
      ....
      /*
        * Wow we're ready to jump into next BPF program
        * rdi == ctx (1st arg)
        * rax == prog->bpf_func + prologue_size
        */
       RETPOLINE_RAX_BPF_JIT();
      ....
      
  • Next, let's look at the "retpoline,generic" that the Ubuntu kernel adopts. (full retpoline protection)

    • "retpoline,generic" uses only the one retpoline gadget that Google introduced.
      // kernel code
      .macro RETPOLINE_JMP reg:req
       call    .Ldo_rop_\@
      .Lspec_trap_\@:
       pause
       lfence
       jmp .Lspec_trap_\@
      .Ldo_rop_\@:
       mov \reg, (%_ASM_SP)
       ret
      .endm
      
      /*
       * This is a wrapper around RETPOLINE_JMP so the called function in reg
       * returns to the instruction after the macro.
       */
      .macro RETPOLINE_CALL reg:req
       jmp .Ldo_call_\@
      .Ldo_retpoline_jmp_\@:
       RETPOLINE_JMP \reg
      .Ldo_call_\@:
       call    .Ldo_retpoline_jmp_\@
      .endm
      
      // kernel binary
      // In the binary, there are no call sites that use this retpoline gadget directly.
      // Instead, all sites that should be rewritten to a retpoline gadget are recorded
      // in the alternatives section: __alt_instructions ~ __alt_instructions_end.
      ffffffff828d17df <.altinstr_replacement>:
      ....
      // indirect call using %rdi
      ffffffff828d17ee:   e8 07 00 00 00          callq  0xffffffff828d17fa
      ffffffff828d17f3:   f3 90                   pause
      ffffffff828d17f5:   0f ae e8                lfence
      ffffffff828d17f8:   eb f9                   jmp    0xffffffff828d17f3
      ffffffff828d17fa:   48 89 3c 24             mov    %rdi,(%rsp)
      ffffffff828d17fe:   c3                      retq
      ....
      // indirect call using %rbx
      ffffffff828d1831:   e8 07 00 00 00          callq  0xffffffff828d183d
      ffffffff828d1836:   f3 90                   pause
      ffffffff828d1838:   0f ae e8                lfence
      ffffffff828d183b:   eb f9                   jmp    0xffffffff828d1836
      ffffffff828d183d:   48 89 1c 24             mov    %rbx,(%rsp)
      ffffffff828d1841:   c3                      retq
      
      // kernel runtime
      // runtime patching to apply retpoline gadgets
      apply_alternatives(__alt_instructions, __alt_instructions_end);
      
      // Strengths
      // - Retpoline can be enabled/disabled at boot time.
      // - To disable Retpoline at boot time, pass "spectre_v2=off" on the kernel command line.
      
Why can't ARM use Retpoline?
  • This is quite an interesting question.
    ARM doesn't say anything about it in their whitepaper [4].
    I can't know the exact reason, so the following is just my guess.

  • x86's direct call instruction works in two phases.

    • push return address into stack
    • jump to dest
  • Likewise x86's return instruction works in two phases.

    • pop off return address from stack
    • jump to address
  • arm's direct call instruction(bl) works in one phase.

    • mov lr, =next instruction(pc)
    • mov pc, =dest
  • Likewise arm's return instruction works in one phase.

    • mov pc, lr (no access to the stack)
  • As shown above, ARM doesn't push the return address onto the stack when branching.
    That makes it very difficult to handle speculation on both the RSB and the BTB with a retpoline-like construct.
    (In particular, I think preventing speculation through the RSB would be difficult.)

Appendix

Appendix-1: The flow of handling hvc in KVM
  1. [Guest Kernel] Issue hvc (or trap)
  2. [Guest Kernel] Load the KVM vector & jump to the vector entry
  3. [Hypervisor] Do __guest_exit: exit the guest and enter the host to handle the hvc. (__kvm_hyp_vector)
  4. [Host Kernel] Handle the hvc (or trap)
  5. [Host Kernel] Do __guest_enter: exit the host and enter the guest.
  6. [Guest Kernel] Execute the instruction following the hvc (or trap)

References

[1] Retpoline: a software construct for preventing branch-target-injection
[2] arm64 kpti hardening and variant 2 workarounds
[3] Deep Dive: Retpoline: A Branch Target Injection Mitigation
[4] ARM - Cache Speculation Issues Update

Deep dive into Page Table Isolation

Page Table Isolation(PTI)

  • It separates a page table of user mode from a page table of kernel mode.
  • What attacks can be closed by PTI?
    • Bypassing KASLR, Reading kernel memory from userspace (Meltdown)

Concept

[figure: pti]
  • As shown in the picture above, the concept of PTI is to maintain two page tables for one process:
    one for user mode and another for kernel mode.

Challenges

  • There are 3 challenges in implementing PTI in practice.
Challenge-1: How to switch the page table as fast as possible?
  • Since we have different page tables for user and kernel,
    we have to do two extra page-table switches when executing a syscall:
    user to kernel, and kernel back to user.
    How can we implement that efficiently?
Challenge-2: Is it possible to eliminate the entire kernel mapping?
  • Ideally, handling a syscall with PTI would work as follows.

    • Simple Version
    1. [User] Trigger a syscall
    2. [Kernel] Switch page table to one for the kernel.
    3. [Kernel] Handles a syscall
    
  • But in practice, with PTI it follows the process below.

    • Deeper Version
    1. [User - UserMapping] Trigger a syscall
    2. [Kernel - UserMapping] Interrupt handling
    3. [Kernel - UserMapping] Do something...
    4. [Kernel - UserMapping] Switch page table to one for the kernel.
    5. [Kernel - KernelMapping] Handles a syscall
    
    • What is the problem with the above process?
      Even before switching the page table, we need to execute kernel code and access kernel data.
      But at that point, since the page table is still the user one, the kernel would crash.
  • That means we have to figure out which kernel code and data must be mapped into the user page table,
    and then explicitly map them into both the user page table and the kernel page table.

Challenge-3: How to maintain the TLB efficiently?
  • As we discussed earlier, the TLB is a cache for a page table. So we have to synchronize them.
  • We must take special care of maintaining the TLB even with separated page table.
  • What problem can occur if we don't do anything about the TLB?
    1. [CPU-0][User] Trigger a syscall
    2. [CPU-0][Kernel] Switch page table to one for the kernel.
    3. [CPU-0][Kernel] Handles a syscall. --> Loads kernel entries into the TLB of CPU-0.
    4. [CPU-0][User] Access a kernel address.
        - If we access a kernel address, the fault will be delivered faster by the TLB hit due to 3.
        - If we access an invalid address, the fault will be delivered slower.
    
    • It still opens a side channel to an attacker.
  • How about explicitly flushing all TLB entries on a context switch?
    1. [CPU-0][User] Trigger a syscall
    2. [CPU-0][Kernel] Switch page table to one for the kernel.
    3. [CPU-0][Kernel] Handles a syscall. --> Loads kernel entries into the TLB of CPU-0.
    4. [CPU-0][Kernel] Flush the TLB of CPU-0.
    5. [CPU-0][User] Access a kernel address.
        - Since the TLB has no entries for kernel addresses, this perfectly closes the side channel.
    
    • It closes the side channel but comes with a severe performance impact.
  • Then, How can we flush the TLB efficiently?

Challenge-1: How to switch the page table as fast as possible?

Concept
[figure: pti2]
  • A very simple concept: the kernel allocates the two page tables adjacently.
    With that layout, we can easily compute the address of the page table to switch to.
Implementation on x86_64 (CONFIG_PAGE_TABLE_ISOLATION)
  • Allocating two page tables.
// arch/x86/include/asm/pgalloc.h
#ifdef CONFIG_PAGE_TABLE_ISOLATION
/*
 * Instead of one PGD, we acquire two PGDs.  Being order-1, it is
 * both 8k in size and 8k-aligned.  That lets us just flip bit 12
 * in a pointer to swap between the two 4k halves.
 */
#define PGD_ALLOCATION_ORDER 1
#else
#define PGD_ALLOCATION_ORDER 0
#endif
....
....
// arch/x86/mm/pgtable.c
static inline pgd_t *_pgd_alloc(void)
{
  return (pgd_t *)__get_free_pages(PGALLOC_GFP, PGD_ALLOCATION_ORDER);
  // CONFIG_PAGE_TABLE_ISOLATION --> 8kb, two page tables.
  // !CONFIG_PAGE_TABLE_ISOLATION --> 4kb, one page table.
}
  • Switching the table.
// arch/x86/include/asm/pgtable.h
#ifdef CONFIG_PAGE_TABLE_ISOLATION
....
....
#define PTI_PGTABLE_SWITCH_BIT PAGE_SHIFT
....
....
static inline pgd_t *kernel_to_user_pgdp(pgd_t *pgdp)
{
  return ptr_set_bit(pgdp, PTI_PGTABLE_SWITCH_BIT);
}

static inline pgd_t *user_to_kernel_pgdp(pgd_t *pgdp)
{
  return ptr_clear_bit(pgdp, PTI_PGTABLE_SWITCH_BIT);
}
Implementation on arm64 (CONFIG_UNMAP_KERNEL_AT_EL0)
  • Allocating two page tables.
// [TODO] Where is the code for that? I wasn't able to figure out that..
  • Switching the table.
// arch/arm64/kernel/entry.S
#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
/*
 * Exception vectors trampoline.
 */
 .pushsection ".entry.tramp.text", "ax"

  // From User to Kernel
 .macro tramp_map_kernel, tmp
 mrs \tmp, ttbr1_el1   // Read address of user page table
 add \tmp, \tmp, #(PAGE_SIZE + RESERVED_TTBR0_SIZE) // Move it to kernel page table
 bic \tmp, \tmp, #USER_ASID_FLAG  // For the TLB
 msr ttbr1_el1, \tmp  // Write address of kernel page table to register
....
....
  // From Kernel to User
  .macro tramp_unmap_kernel, tmp
 mrs \tmp, ttbr1_el1
 sub \tmp, \tmp, #(PAGE_SIZE + RESERVED_TTBR0_SIZE)
 orr \tmp, \tmp, #USER_ASID_FLAG
 msr ttbr1_el1, \tmp

Challenge-2: Is it possible to eliminate the entire kernel mapping?

Concept
  • Our goal is to minimize the amount of kernel mapping exposed in the user page table, which reduces the attack surface.
  • There are two approaches.
Approach-1 : trampoline replacing exception entry (on arm64)
  • Install trampoline code that replaces the original exception entry. When a syscall hits, execution first lands in the trampoline.
[figure: pti3]
  • Strengths

    • What an attacker can gain via side channel attacks (on KASLR, Meltdown) is limited to the trampoline itself.
  • Weaknesses

    • An additional indirect jump to the original entry is needed (bad performance from a cache point of view).
      A direct jump to a full 64-bit address may not be supported, and even if it were possible, it would indirectly expose the address of the kernel entry.
  • arm64 takes this approach; x86_64 does not.

Approach-2 : no trampoline. allowing a minimum set of mapping. (percpu on x86)
  • Allow the user page table to map the minimum set of code and data required before switching the page table.
[figure: pti4]
  • Strengths

    • No indirect branch (good performance)
  • Weaknesses

    • Allows side channel attacks to locate the percpu area. (They say it's OK since it doesn't directly leak KASLR. Really??)
Implementation on arm64 (CONFIG_UNMAP_KERNEL_AT_EL0)
  • Define section for trampoline
// arch/arm64/kernel/vmlinux.lds.S
#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
#define TRAMP_TEXT                  \
    . = ALIGN(PAGE_SIZE);               \
    __entry_tramp_text_start = .;           \
    *(.entry.tramp.text)                \
    . = ALIGN(PAGE_SIZE);               \
    __entry_tramp_text_end = .;
#else
#define TRAMP_TEXT
#endif
  • From trampoline to original exception vector
// arch/arm64/kernel/entry.S
// trampoline for pti
ENTRY(tramp_vectors)
    .space  0x400

    tramp_ventry
    tramp_ventry
    tramp_ventry
    tramp_ventry

    tramp_ventry    32
    tramp_ventry    32
    tramp_ventry    32
    tramp_ventry    32
END(tramp_vectors)
....
....
.macro tramp_ventry, regsize = 64
    .align  7
...
 tramp_map_kernel    x30  // Switch page table
#ifdef CONFIG_RANDOMIZE_BASE
    adr x30, tramp_vectors + PAGE_SIZE  // Get the address of original exception vector
alternative_insn isb, nop, ARM64_WORKAROUND_QCOM_FALKOR_E1003
    ldr x30, [x30]
#else
    ldr x30, =vectors
#endif
...
 add x30, x30, #(1b - tramp_vectors)
    isb
    ret   // Jump to the original exception vector
.endm
....
/*
 * Exception vectors.
 */
    .pushsection ".entry.text", "ax"

    .align  11
ENTRY(vectors)
...
 kernel_ventry   0, sync             // Synchronous 64-bit EL0 --> handles a syscall
  • Map trampoline to user page table
#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
static int __init map_entry_trampoline(void)
{
    pgprot_t prot = rodata_enabled ? PAGE_KERNEL_ROX : PAGE_KERNEL_EXEC;
    phys_addr_t pa_start = __pa_symbol(__entry_tramp_text_start);

    /* The trampoline is always mapped and can therefore be global */
    pgprot_val(prot) &= ~PTE_NG; // ~PTE_NG!!

    /* Map only the text into the trampoline page table */
    memset(tramp_pg_dir, 0, PGD_SIZE);
    __create_pgd_mapping(tramp_pg_dir, pa_start, TRAMP_VALIAS, PAGE_SIZE,
                 prot, __pgd_pgtable_alloc, 0);
Implementation on x86_64 (CONFIG_PAGE_TABLE_ISOLATION)
  • Map percpu area to user page table
static void __init setup_cpu_entry_area(unsigned int cpu)
{
....
....
 cea_set_pte(&cea->gdt, get_cpu_gdt_paddr(cpu), gdt_prot);

    cea_map_percpu_pages(&cea->entry_stack_page,
                 per_cpu_ptr(&entry_stack_storage, cpu), 1,
                 PAGE_KERNEL);
....
 cea_map_percpu_pages(&cea->tss, &per_cpu(cpu_tss_rw, cpu),
                 sizeof(struct tss_struct) / PAGE_SIZE, tss_prot);
}
....
....
void cea_set_pte(void *cea_vaddr, phys_addr_t pa, pgprot_t flags)
{
    unsigned long va = (unsigned long) cea_vaddr;
    pte_t pte = pfn_pte(pa >> PAGE_SHIFT, flags);

    /*
     * The cpu_entry_area is shared between the user and kernel
     * page tables.  All of its ptes can safely be global.
     * _PAGE_GLOBAL gets reused to help indicate PROT_NONE for
     * non-present PTEs, so be careful not to set it in that
     * case to avoid confusion.
     */
    if (boot_cpu_has(X86_FEATURE_PGE) &&
        (pgprot_val(flags) & _PAGE_PRESENT))
        pte = pte_set_flags(pte, _PAGE_GLOBAL); // _PAGE_GLOBAL!!
  • Switching the table at syscall entry
ENTRY(entry_SYSCALL_64)
    UNWIND_HINT_EMPTY
    /*
     * Interrupts are off on entry.
     * We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
     * it is too small to ever cause noticeable irq latency.
     */

    swapgs
    /* tss.sp2 is scratch space. */
    movq    %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)  // percpu access before switching; since the user page table already maps the percpu area, this is OK.
    SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp // Switching the table
    movq    PER_CPU_VAR(cpu_current_top_of_stack), %rsp

Challenge-3: How to maintain the TLB efficiently?

Concept
  • Avoid explicit TLB flushes by tagging TLB entries.
  • We can assign a different tag to each TLB entry so that an explicit TLB flush is no longer needed.
    This is the typical way TLBs are maintained for different processes.
    It works via hardware features called PCID (x86) and ASID (arm).
    [figure: pti5]
Implementation on x86_64 (CONFIG_PAGE_TABLE_ISOLATION)
  • Assigning rule
// arch/x86/entry/calling.h
/*
 *  The x86 feature is called PCID (Process Context IDentifier).
 ....
 * ASID  - [0, TLB_NR_DYN_ASIDS-1]
 *         the canonical identifier for an mm
 *
 * kPCID - [1, TLB_NR_DYN_ASIDS]
 *         the value we write into the PCID part of CR3; corresponds to the
 *         ASID+1, because PCID 0 is special.
 *
 * uPCID - [2048 + 1, 2048 + TLB_NR_DYN_ASIDS]
 *         for KPTI each mm has two address spaces and thus needs two
 *         PCID values, but we can still do with a single ASID denomination
 *         for each mm. Corresponds to kPCID + 2048.
 *
// For instance, assume a process-p.
// process-p - ASID: 10 / kPCID : 10+1 / uPCID : 10+1+2048
// ASID acts as a placeholder here, kPCID is for kernel entry of process-p, uPCID is for user entry of process-p.
  • Tagging
#define PTI_USER_PGTABLE_AND_PCID_MASK  (PTI_USER_PCID_MASK | PTI_USER_PGTABLE_MASK)

.macro ADJUST_KERNEL_CR3 reg:req
    ALTERNATIVE "", "SET_NOFLUSH_BIT \reg", X86_FEATURE_PCID
    /* Clear PCID and "PAGE_TABLE_ISOLATION bit", point CR3 at kernel pagetables: */
    andq    $(~PTI_USER_PGTABLE_AND_PCID_MASK), \reg   // we just need to flip one bit!
.endm

.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
    ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
    mov %cr3, \scratch_reg
    ADJUST_KERNEL_CR3 \scratch_reg  // tagging via the CR3 register; the low 12 bits of CR3 specify the PCID.
    mov \scratch_reg, %cr3
.Lend_\@:
.endm
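  • As a reading aid, the CR3 layout that the two macros above manipulate can be summarized as follows
    (my own annotation, with bit positions taken from the PTI_USER_* definitions in arch/x86/entry/calling.h):
// CR3 layout under PTI (annotation, not kernel source):
//   bit  63     : NOFLUSH bit (set via SET_NOFLUSH_BIT when PCID is available)
//   middle bits : physical address of the 8k-aligned PGD pair from Challenge-1
//   bit  12     : PTI_USER_PGTABLE_BIT - selects the user (1) or kernel (0) 4k half
//   bits 11..0  : PCID; bit 11 is PTI_USER_PCID_BIT, which is why uPCID = kPCID + 2048
// SWITCH_TO_KERNEL_CR3 only has to clear these low bits (one andq); the rest of CR3 stays the same,
// so the page-table half and the TLB tag are switched at the same time.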
Implementation on arm64 (CONFIG_UNMAP_KERNEL_AT_EL0)
  • Tagging
#define USER_ASID_BIT   48  // ARM stores ASID (for TLB-tagging) in high 16bits of TTBR register.
#define USER_ASID_FLAG  (UL(1) << USER_ASID_BIT)
#define TTBR_ASID_MASK  (UL(0xffff) << 48)

.macro tramp_map_kernel, tmp
    mrs \tmp, ttbr1_el1
    add \tmp, \tmp, #(PAGE_SIZE + RESERVED_TTBR0_SIZE)
    bic \tmp, \tmp, #USER_ASID_FLAG  // tagging to TTBR1_EL1 register.
    msr ttbr1_el1, \tmp
  • Assigning rule
// bit[48] is reserved for indicating kernel or user.
// so if CONFIG_UNMAP_KERNEL_AT_EL0 is enabled, we have to shift 1bit in computing original ASID.
#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
#define NUM_USER_ASIDS      (ASID_FIRST_VERSION >> 1)
#define asid2idx(asid)      (((asid) & ~ASID_MASK) >> 1)
#define idx2asid(idx)       (((idx) << 1) & ~ASID_MASK)
#else
#define NUM_USER_ASIDS      (ASID_FIRST_VERSION)
#define asid2idx(asid)      ((asid) & ~ASID_MASK)
#define idx2asid(idx)       asid2idx(idx)
#endif
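  • A worked example of the macros above (my own illustration, not from the kernel source):
// If the ASID allocator hands out idx = 5:
//   idx2asid(5)  = 10  -> the kernel ASID for that mm (bit 0 clear)
//   10 | 1       = 11  -> the matching user ASID (bit 0 set via USER_ASID_FLAG in TTBR1_EL1)
//   asid2idx(10) = 5   -> maps back to the same allocator slot
// So each mm consumes one allocator slot but owns a kernel/user ASID pair.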

References

[1] arm64 pti patch
[2] x86 pti patch
[3] x86/pti/64: Remove the SYSCALL64 entry trampoline


Side Channel Attacks On Linux Kernel

Goal

  • Understanding the history of side channel attacks against the Linux kernel.
  • Understanding each attack at a high level. (I'll leave some links for more details.)
  • Thinking about the property that all of the attacks share.

Bypassing KASLR using Side Channel Attack

  • The first attempt at this kind of attack on the Linux kernel was introduced by [1] in 2013 as an academic paper.
    The authors present three different kinds of side channel attacks to defeat KASLR,
    but the approach behind them is exactly the same.

  • The most important attack in [1] is called "Double Page Fault Attack".
    To understand how this attack works, let's see how the CPU handles a memory access.

    • Simple Version
    1. [Code] Access to a Virtual Address (VA)
    2. [CPU] Translate VA to PA (Physical Address)
    3. [CPU] Access to a Physical Address
    
    • Deeper Version
    1. [Code] Access to a Virtual Address (VA)
    2. [CPU] Translate VA to PA (Physical Address)
       2-1. [TLB] Perform TLB lookup to find a cache entry for the VA. (TLB - Caching PA associated with VA)
       2-2. [Page Table] If 2-1 fails, the CPU has to walk the page tables to find the corresponding PA.
    3. [CPU] Access to a Physical Address
    
  • The insight we can take from the "Deeper Version" is that the CPU shows a timing difference between 2-1 and 2-2.
    Let's imagine the attack scenario for recovering KASLR bits.

    • Assume that the Linux kernel is placed at virtual address 0xabc, and the TLB already has an entry that translates 0xabc.
    • If we try to access 0xabc, which is a kernel address,
      we can skip step 2-2, so the access will finish faster.
    • If we try to access 0xdbc, which is not a kernel address and is invalid (not mapped) memory,
      we cannot skip step 2-2, the page table walk, so the access will finish slower.
  • One more thing we should keep in mind

    • We act as a user, not the kernel, which means that accessing a kernel address gives us a segmentation fault.
    • So we have to measure the time duration inside a segmentation fault handler.
    • The segmentation fault handling makes it harder for an attacker to determine whether a given address is a kernel address.
  • Pseudo Code

    int time;
    unsigned long kaddr = 0xffffffffabc00000;
    ...
    void fault_handler(int signo, siginfo_t *info, void *extra)
    {
      // 3. measure the time duration between the access violation and here.
      // If kaddr is correct, the fault will be delivered faster to here.
      time = get_current_time() - time;
      ...
      if (time < threshold)
         printf("kaddr is kernel address!!\n");
      ...
    }
    ...
    int main(int argc, char **argv)
    {
      time = get_current_time();  // 1. measure current time.
      unsigned long d = *(volatile unsigned long *)kaddr;
      // 2. access a candidate kernel address,
      // hit an access violation that leads us into fault_handler().
    }
    
  • Limitation

    • This attack suffers from both inaccuracy and slow speed. To understand why, turn our focus back to the pseudo code.
    1. [User] Access a candidate kernel address. --> Assume it makes a time difference of about 0.1 ms. (0.1 ~ 0.2 ms)
    2. [Kernel] Handle the access violation.
    3. [Kernel] Deliver the segmentation fault to fault_handler() in user-space.
       --> 2 and 3 make a much bigger time difference, about 10 ms. (5 ~ 15 ms)
    4. [User] Measure the time duration. --> Can we be sure that this time difference was made by step 1?
      - Suppose we access an invalid (non-kernel) address, so step 1 takes about 0.2 ms,
        but handling the access violation on the kernel side happens to take only about 5 ms.
        That is fast enough that we would make a wrong determination.
    
    • To overcome this limitation, the attack needs a lot of attempts to increase the probability of a correct guess, which makes it very slow.
    • Then the next question is: how can we overcome this limitation while satisfying both accuracy and speed?

Bypassing KASLR using More Efficient Side Channel Attack

  • How do we build a more efficient side channel attack than the first one we've discussed?
    The way to do that is to skip the kernel's access-violation handling entirely.
    Ideally it looks like the following.

    time = get_current_time();         // 1. measure current time.
    unsigned long d = *kaddr;          // 2. access candidate kernel address. --> no jumping to the kernel!
    time = get_current_time() - time;  // 3. measure again. there is no noise for us to measure the time duration.
    
  • How can this be implemented? [2] and [3] give us answers.

  • Breaking Kernel Address Space Layout Randomization with Intel TSX [2]

    void TSX_fault_handler(...)
    {
      time = get_current_time() - time;  // 3. measure again. there is no noise for us to measure the time duration.
    }
    int main(...)
    {
      time = get_current_time(); // 1. measure current time.
      unsigned long d = *kaddr;  // 2. access candidate kernel address (inside a TSX transaction)
                                 // Directly jumps to TSX_fault_handler without involving the kernel.
                                 // A hardware feature called Intel TSX makes that possible.
    }
    
    • I won't detail what TSX is or how it works; you can easily find that by googling. :)
  • Prefetch Side-Channel Attacks: Bypassing SMAP and Kernel ASLR [3]

    time = get_current_time();         // 1. measure current time.
    prefetch(kaddr);                   // 2. prefetch candidate kernel address instead of accessing it.
    time = get_current_time() - time;  // 3. measure again. there is no noise for us to measure the time duration.
    
    • The prefetch instruction loads memory into the cache (it doesn't commit a value to a register).
      It performs the TLB and page-table lookup just as a load operation does,
      but unlike a normal load it doesn't cause a segmentation fault. (A rough user-space sketch of this probe follows below.)
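  • A minimal user-space sketch of the prefetch probe described above. This is my own illustration, not code from [3]:
    the candidate address, the use of rdtscp/mfence for timing, and the interpretation of the latency are assumptions,
    and a real attack sweeps the whole KASLR range and repeats the measurement many times.

    // sketch-prefetch.c : time a prefetch of a candidate kernel address (illustration only)
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>     // _mm_prefetch, __rdtscp, _mm_mfence

    static inline uint64_t timed_prefetch(const void *addr)
    {
      unsigned int aux;
      _mm_mfence();
      uint64_t start = __rdtscp(&aux);
      _mm_prefetch((const char *)addr, _MM_HINT_T0);  // does not fault, even for kernel addresses
      _mm_mfence();
      return __rdtscp(&aux) - start;
    }

    int main(void)
    {
      // hypothetical candidate address; sweep and repeat in a real attack
      const void *kaddr = (const void *)0xffffffffabc00000UL;
      printf("prefetch latency: %llu cycles\n",
             (unsigned long long)timed_prefetch(kaddr));
      // a short latency (mapped / translation cached) vs. a long latency (unmapped) is the signal
      return 0;
    }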

Reading Privileged Memory using Meltdown

  • Code gadget
// Run this code in userspace
# rcx : kernel address (== secret)
# rbx : probe array (user data)
# al  : least significant byte of rax. the secret byte that the attacker wants to know
retry:
(1) mov al, byte [rcx]
(2) shl rax, 0xc
(3) jz retry
(4) mov rbx, qword [rbx + rax]
  • in-order execution (the way a developer expects it to work)

    • (1) mov al, byte [rcx] --> An access violation occurs, but the value at the address in rcx is loaded into al.
    • (2) The CPU detects the violation and raises an exception.
    • (3) Retire (1) --> clobber al (which held the sensitive kernel value)
  • out-of-order execution (the way Meltdown expects it to work)

    • (1) mov al, byte [rcx] --> An access violation occurs, but the value at the address in rcx is loaded into al.
    • (2) shl rax, 0xc
      jz retry
      mov rbx, qword [rbx + rax] --> a memory access depending on the kernel value (rax) happens before the exception is raised.
    • (3) The CPU detects the violation at (1), raises an exception, and retires (1).
    • In this out-of-order execution, we can recover the kernel value by examining which memory access happened at (2). (A flush+reload sketch of this probing step follows below.)
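  • A minimal sketch of that probing step, using flush+reload over the array indexed at (4).
    This is my own illustration, not the code from [4]: the probe array, the 4096-byte stride (matching "shl rax, 0xc"),
    and the hit threshold are assumptions, and the transient access itself (the gadget above) is left out.

    // probe.c : recover the leaked byte from cache state (illustration only)
    #include <stdint.h>
    #include <string.h>
    #include <x86intrin.h>     // _mm_clflush, __rdtscp, _mm_mfence

    #define STRIDE 4096        // one page per value, matching the shift by 0xc in the gadget
    static uint8_t probe_array[256 * STRIDE];

    static inline uint64_t time_load(const volatile uint8_t *p)
    {
      unsigned int aux;
      _mm_mfence();
      uint64_t start = __rdtscp(&aux);
      (void)*p;                            // reload: fast only if the line is already cached
      _mm_mfence();
      return __rdtscp(&aux) - start;
    }

    int recover_byte(void)
    {
      const uint64_t threshold = 100;      // assumed cache-hit threshold, needs calibration
      memset(probe_array, 1, sizeof(probe_array));   // make sure the pages are backed
      for (int i = 0; i < 256; i++)
        _mm_clflush(&probe_array[i * STRIDE]);       // 1. flush every candidate line

      // 2. ... run the transient gadget here; it touches probe_array[secret * STRIDE] ...

      for (int i = 0; i < 256; i++)                  // 3. reload and time each candidate
        if (time_load(&probe_array[i * STRIDE]) < threshold)
          return i;                                  // the cached index is the leaked byte
      return -1;
    }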

What is the common property of the attacks so far?

  • The attacks we've discussed so far exploit one fact:
    the user's page table maps both user space and kernel space,
    even though a user doesn't need to access kernel space.

  • That makes it possible for a user to attempt accesses to kernel memory
    even though they can't actually read or write it.

  • What is the way to eliminate this property?
    Eliminate the entire kernel mapping from the page table a user can see.
    How does this close the attacks?

  • If the page table doesn't contain the kernel mapping,
    both an access to kernel memory and an access to non-kernel memory require a page table walk.
    With no time difference, an attacker can no longer identify kernel memory.

  • [figure: pti]

Attacks exploiting Branch Target Buffer

Branch Prediction
  • Modern CPUs have a branch predictor to optimize the performance of branches.
  • There are three different kinds of branches that use the Branch Target Buffer (BTB).
// [virtual address]:
0x400: void B(void) {}

void A(int val, void (*fp)(void))
{
0x100:  if (val == 1)     // 1. Conditional Branch
    printf("...");
0x200: B();   // 2. Direct Branch
0x300: fp();  // 3. Indirect Branch
...

}
  • Let's look at how branch prediction works with the BTB.
PC (virtual addr) | Target PC (virtual addr) | Prediction state
0x200             | 0x400                    | T
0x300             | 0x500                    | T
  • In short, the BTB stores the address of the branch and its target as a key-value pair.
  • The BTB's indexing scheme differs between architectures.
    In Haswell, only the low 31 bits of the PC are used as the index, while the target PC records the full 64-bit address.
    This can allow a BTB collision (a tiny sketch of the index computation follows below). For example,
    (1) 0x00000011.00002222: jmp 0x00000011.00004444
    (2) 0x00000022.00002222: jmp 0x00000022.00004444
    • (1) and (2) have different addresses, but since they share the same low 31 bits used as the BTB index,
      the BTB will produce a false prediction.
    • Execute (1) --> Stores "BTB: 00002222 | 0x00000011.00004444"
    • Execute (2) --> Uses "BTB: 00002222 | 0x00000011.00004444"
      • Executing (2) with this BTB entry results in a false prediction.
      • The CPU has to detect the misprediction and roll back to the correct execution.
    • Execute (2) --> Stores "BTB: 00002222 | 0x00000022.00004444"
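  • A tiny model of the index computation above (my own illustration of the Haswell indexing just described; masking is just one way to model it):

    #include <stdint.h>
    #include <stdio.h>

    // Model: only the low 31 bits of the branch address select a BTB entry.
    static uint64_t btb_index(uint64_t pc) { return pc & ((1ULL << 31) - 1); }

    int main(void)
    {
      uint64_t a = 0x0000001100002222ULL;   // branch (1) from the example above
      uint64_t b = 0x0000002200002222ULL;   // branch (2) from the example above
      // Both map to the same BTB entry, so (2) can consume (1)'s predicted target.
      printf("same BTB entry: %d\n", btb_index(a) == btb_index(b));
      return 0;
    }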
Bypassing KASLR by exploiting Direct Branch (Branch Target Buffer)
  • Jump Over ASLR: Attacking Branch Predictors to Bypass ASLR [5]

  • Even though the TLB-based side channel attacks have been blocked, we can go another way to achieve the same goal.
    The way is to exploit the Branch Target Buffer we discussed in the earlier section.

[figure: btb]
  • The main property we exploit here is that the BTB doesn't use the full address as its index. That makes it possible to mount an aliasing attack across boundaries (user-kernel).
    More specifically, since the index contains KASLR bits, we can bypass KASLR using this primitive.

  • Attack scenario

    // Kernel address - 0xffffffffabc00000 ==> abc is randomized bits.
    1. [User] Allocate memory at 0xabc00000.
    2. [User] Perform a direct branch at 0xabc00000 to other user memory.
       - Stores "BTB: 0xabc00000 | user-memory"
    3. [User] Perform the direct branch at 0xabc00000 again. --> It will be fast since it exists in the BTB.
       - Uses "BTB: 0xabc00000 | user-memory"
    4. [User] Trigger a syscall
    5. [Kernel] Execute the syscall --> If the kernel address is 0xffffffffabc00000, it overwrites the BTB entry the user stored.
       - Stores "BTB: 0xabc00000 | kernel-memory"
    6. [User] Perform the direct branch at 0xabc00000 again. --> Is it slow? Then 0xabc would be the correct KASLR bits!
       - Uses "BTB: 0xabc00000 | kernel-memory"
         - False prediction! It will be slow!
    
Reading privileged memory using Branch Target Injection (Direct Branch)
  • "Jump Over ASLR" is an attack to obatin a memory layout information of the kernel.
    Let's move one step further. How to read privileged memory with Branch Target Buffer?

  • If we can control a BTB entry, we can make a victim jump to the target we want.
    Let's imagine a simple scenario with two processes, an attacker and a victim.
    Since the BTB is bound to each CPU, we assume both processes are running on the same CPU.

    // Both processes are running under CPU-0.
    // Attacker aims to read the memory of victim.
    // read-gadget: a gadget that opens a side-channel for reading memory. (victim) (0x400)
    
    // 1. the way victim works by default
    1. [victim] Perform a direct branch at 0x200 to 0x700.
      - read-gadget (0x400) is not a target of this branch.
    
    // 2. the way victim works under attack
    1. [attacker] Allocate memory at 0x200.
    2. [attacker] Perform a direct branch at 0x200 to 0x400. (the address of the victim's read-gadget)
      - Stores "BTB: 0x200 | 0x400" 
    3. [attacker] Send a command to the victim
    4. [victim] Perform a direct branch at 0x200.
      - Uses "BTB: 0x200 | 0x400"
    5. [victim] Jump to 0x400 (read-gadget) due to the false prediction.
    6. [victim] Detect the false prediction. Jump to 0x700 again.  
    
  • Making a victim jump to where an attacker wants is called "Branch Target Injection".
    One question remains: is it possible to execute the whole read-gadget under direct-branch prediction?

  • Assume the read-gadget consists of two instructions, read-gadget-i1 and read-gadget-i2.
    The CPU resolves the jmp instruction in parallel with the predicted execution,
    so the race between them might look like the picture below.

[figure: spectre1]
  • As shown in the picture, executing the whole read-gadget always fails because the misprediction is detected too quickly.
    Next direction? Maybe indirect branches!
Reading privileged memory using Branch Target Injection (Indirect Branch, Spectre Variant2)
  • The problem with branch target injection using a direct branch is that the jmp resolves in a single phase;
    to win the race we saw before, we have to add more phases to it.

  • Let's think about the phases that an indirect branch contains.

    • jmp *(%rax)
      • To resolve the target of this jmp: (1) load the target through %rax, (2) jump there.
    • If we can add latency to (1), the load, we can probably win the race.
[figure: spectre2]
  • Look at the picture above: even though we add the load of %rax, we still fail to execute everything because the load hits in the cache.
[figure: spectre3]
  • If we can evict the relevant cache lines and make sure a cache miss takes place, we can execute the whole read gadget and finally succeed in reading privileged memory.

  • In summary, to read privileged memory we have to leverage both an indirect branch and cache eviction. This attack is called Spectre Variant 2.
    In reality, though, the exploit is a lot more difficult than you might expect because you have to take mitigations like ASLR into account.

  • See the references below for more details on how to build a real exploit.

What is the common property of the BTB attacks?

  • The BTB doesn't use the full instruction address as an index, which allows BTB entries to collide across boundaries.
  • The BTB has no tagging mechanism to distinguish different regions (user/kernel, process, VM).
    That is very important from a defense point of view,
    because it means we can't defend against these attacks the same way PTI does without modifying the SoC.
    As a result, it opens a new problem: how to defend against them with a software-only approach.
  • The software-only approach is Retpoline, which aims to eliminate all indirect branches.
  • Unfortunately, tagging the BTB as in the table below is not possible because of the lack of hardware support.
Tag    | PC (virtual addr) | Target PC (virtual addr) | Prediction state
Kernel | 0x200             | 0x400                    | T
User   | 0x200             | 0x700                    | T

References

[1] Practical Timing Side Channel Attacks Against Kernel Space ASLR
[2] Breaking Kernel Address Space Layout Randomization with Intel TSX
[3] Prefetch Side-Channel Attacks: Bypassing SMAP and Kernel ASLR
[4] Meltdown: Reading Kernel Memory from User Space
[5] Jump Over ASLR: Attacking Branch Predictors to Bypass ASLR
[6] Spectre-Variant2, Exploiting indirect branches
[7] Reading privileged memory with a side-channel

Thursday, February 7, 2019

[How Syzkaller Works] 03 - Fuzzing the Driver (Example)


:) Goal


- Write a simple Linux kernel driver that has bugs, and fuzz the driver using syzkaller to find them.


:) Architecture & Setup & Fuzzing the kernel core


- The docs in the syzkaller GitHub repo explain their architecture very well, how to set it up, and how to use it for fuzzing the kernel core.
  So please check out their GitHub!
- I'll focus only on fuzzing the driver in this post.


:) Codes


- All of the code described in this post is available at the link below.


:) Bugs in an example driver


struct poc_lkm_object {
    unsigned char data0;
    unsigned char data1;
    unsigned long data2;
    unsigned long data3;
};

noinline void poc_lkm_cmd0(struct poc_lkm_object *obj)
{
    if (obj->data0 == 32)      // condition to trigger bug0
        poc_lkm_bug0(obj);  // where bug0 occurs
}
noinline void poc_lkm_cmd1(struct poc_lkm_object *obj)
{
    if (obj->data1 == 64)     // condition to trigger bug1
        poc_lkm_bug1(obj);  // where bug1 occurs
}

noinline long poc_lkm_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
{
struct poc_lkm_object obj;
...
if (copy_from_user(&obj, (void *)arg, sizeof(obj))) {
...
switch (cmd) {
case POC_LKM_CMD_0:
    poc_lkm_cmd0(&obj);
    break;
case POC_LKM_CMD_1:
    poc_lkm_cmd1(&obj);
    break;
}

- There are two bugs in the example driver.
  bug0 will be triggered if data0 is 32.
  bug1 will be triggered if data1 is 64.
  The data used to trigger each bug is taken from the user.

- A user can reach both bugs via the ioctl system call.
- Our goal is to guide syzkaller to find the bugs.


:) Syscall Description for the example driver


...
syz_open_dev$poc_lkm(dev ptr[in, string["/proc/poc_lkm"]], flags flags[open_flags], mode flags[open_mode]) fd_poc_lkm

ioctl$poc_lkm(fd fd_poc_lkm, cmd flags[poc_lkm_cmds], arg ptr[in, poc_lkm_object])
close$poc_lkm(fd fd_poc_lkm)

poc_lkm_object {
Data0 int8  # Mutate this field
Data1 int8   # Mutate this field
Data2 const[0, int64]   # Do not mutate this field
Data3 const[0, int64]   # Do not mutate this field
}

poc_lkm_cmds = 0xc0185000, 0xc0185001
...

- What we're interested in is ioctl(), since it allows us to reach the bugs.
  Therefore, we just pass constants as the parameters when calling open() and close().

- Look at the ioctl() description.
  We're not interested in some of the arguments: fd (the 1st arg) and cmd (the 2nd arg).
  So we're going to focus only on arg.
  Note that the 2nd arg, cmd, could also be worth fuzzing, because a wrong cmd can lead the kernel to a crash.

- The type of the 3rd arg is struct poc_lkm_object, which has 4 fields:
  data0, data1, data2, data3.
  We select the fields to be mutated by syzkaller; the fields we select are data0 and data1, which are able to hit the bugs.
  Note that we recommend mutating all fields of an object if you'd like to fuzz a real-world driver.
  (For reference, a hand-written reproducer that uses this interface is sketched below.)
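- For reference, here is what a hand-written trigger for bug0 looks like through the same interface.
  This is a sketch of my own (not syzkaller output); the struct layout mirrors the driver above,
  and the command value 0xc0185000 and the "/proc/poc_lkm" path come from the description.

// trigger_bug0.c : manual reproducer sketch for bug0 (data0 == 32)
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define POC_LKM_CMD_0 0xc0185000u          /* taken from poc_lkm_cmds above */

struct poc_lkm_object {                    /* must match the driver's struct layout */
    unsigned char data0;
    unsigned char data1;
    unsigned long data2;
    unsigned long data3;
};

int main(void)
{
    struct poc_lkm_object obj;
    memset(&obj, 0, sizeof(obj));
    obj.data0 = 32;                        /* the condition that triggers bug0 */

    int fd = open("/proc/poc_lkm", O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    ioctl(fd, POC_LKM_CMD_0, &obj);        /* should crash the kernel in poc_lkm_bug0() */
    close(fd);
    return 0;
}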


:) Run fuzzing


2019/02/08 00:30:50 VMs 1, executed 0, cover 2090, crashes 0, repro 0
2019/02/08 00:31:00 VMs 1, executed 879, cover 2352, crashes 0, repro 0
2019/02/08 00:31:10 VMs 1, executed 879, cover 2375, crashes 0, repro 0
2019/02/08 00:31:15 vm-0: crash: general protection fault in poc_lkm_bug1
2019/02/08 00:31:20 VMs 0, executed 879, cover 2375, crashes 0, repro 0
.....
2019/02/08 00:31:41 reproducing crash 'general protection fault in poc_lkm_bug1': 1804 programs, 1 VMs
2019/02/08 00:31:50 VMs 0, executed 879, cover 2375, crashes 1, repro 1
......

- Next, we can run the fuzzer on the example driver.
  As the log above shows, syzkaller found bug1 very quickly.
  Once bug1 was found, syzkaller attempted to reproduce it and printed the result of the reproduction.

- Unfortunately, I couldn't find both bugs. In other words, syzkaller couldn't find bug0 after finding bug1. I don't know exactly why...
  

:) References