[Proceedings of the VLDB Endowment 2018] Efficient Distributed Memory Management with RDMA and Caching

Abstract

  • GAM是一個基於directory-based cache coherence over RDMA的distributed in-memory platform

1 Introduction

  • Shared-nothing架構已被廣泛運用在distributed computing,但跟shared-memory架構相比,shared-memory可以統一global的data access並將分散的節點視為1台具有單個unified memory space的powerful server
  • 通常來說,大部分的distributed shared memory (DSM) system都會有cache來保存remote memory access (以加快下次access速度)
    • 問題
      • 為了保持這些cache data的一致性,會需要sync primitive,從而導致顯著的overhead
    • 解法
      • 捨棄cache直接使用RDMA去access memory,但是即便現今的RDMA技術在throughput上接近直接access local memory,latency還是可能會高出許多
    • GAM
      • 保留cache機制
      • 利用RDMA做cache coherence protocol

2 System Design

2.1 Addressing Model and APIs

  • GAM採用partitioned global address space (PGAS) addressing的架構
  • PGAS是每台機器藉由RDMA進行memory之間的connect並只負責global address space中分到的其中一部份
  • GAM提供的API
    • 注意可透過給定的$gaddr$暗示這是local還是remote的操作

2.2 Cache Coherence Protocol

  • GAM的overview
    • layer 1: snooped-based protocol whithin a NUMA node
    • layer 2: directory-based protocol across NUMA nodes
    • layer 3: directory-based protocol used by GAM
  • 降低latency的方法:將memory分層
    • 分為cache及memory層
    • 盡可能降低access較為底層的memory (尤其是remote memory)的機會
  • 由於多了cache層,則必須要有cache coherence機制
    • 不使用snoop-based coherence protocol
      • 因為在RDMA network中做broadcast是不可靠的
      • 註:snoop-based coherence protocol
        • 每個cache Unit都會有一個對應的snoop Unit
        • 當處理器對記憶體區塊進行寫入動作,snooper會從BUS上面發現這個動作
        • snooper會檢查自己的cache是否也存有該區塊的資料並更新
    • 採用directory-based coherence protocol
  • 5種nodes
    • home/remote:data所在的physical memory的擁有節點為home,其餘皆為remote
    • request:要求share/exclusive (read/write)權限的節點
    • sharing/owner:擁有share/exclusive (read/write)權限的節點
  • 每個data可以有多個sharing node但最多只能有1個owner node
  • sharing node跟owner node不能同時存在,除非owner node本身就是唯一的sharing node
  • 3種directory state of cache line
    • Shared:shared by some remote nodes that have read permission
    • Dirty:owned by a remote node that has write permission
    • Unshared:owned by the home node
  • 3種cache state of cache line
    • Shared:read-only
    • Dirty:writable
    • Invalid:invalidated
  • 由於network會有延遲,state的transition不一定是atomic的
    • 故設計一些in-transition state如”SharedToDirty”,詳見2.5

2.3 Read

  • Workflow of read protocol

2.3.1 Local Read

  • request node = home node
  • 若無remote node holding ownership,則data要嘛在local memory (Unshared),不然就是處於read-only mode (Shared),兩者都能直接從local memory得到結果
  • 若有remote node holding ownership,則如上圖(a)

2.3.2 Remote Read

  • request node != home node
  • 若request node本身就有data的cache,則可直接從cache得到結果
  • 若request node本身無data的cache,則有2種case
    • “Unshared/Shared” (Non-dirty),如上圖(b)
    • “Dirty”,如上圖(c)

2.4 Write

  • Workflow of write protocol

2.4.1 Local Write

  • request node = home node
  • 若無sharing or owner node,則data處於Unshared狀態,可以安全地在local memory寫入data
  • 若有sharing or owner node,則有2種case
    • “Shared”,如上圖(a)
    • “Dirty”,大致如上圖(a),不同處在於
      • request node也會送invalidate通知給owner node,並等待ACK
      • owner node在回ACK時必須附帶最新的cache data (因為owner node有可能改過這data,所以要把改的東西傳回去)

2.4.2 Remote Write

  • request node != home node
  • 若request node同時也是owner node,則可以直接進行remote write
  • 若request node不是owner node,則有3種case
    • “Shared”,如上圖(a)
    • “Unshared”,大致如上圖(a),不同處在於
      • home node可以跳過步驟3和4
    • “Dirty”,如上圖(c)

2.5 Race Condition

  • Race condition的發生
    • Example
      • 當node有2個thread同時發送remote read/write指令並等待ACK時
    • 解法
      • 處理request時,所有相關的node都會被當作正處於in-transition state,並且將其他針對該cache line的request通通block
      • 直到處裡完request才解block
    • 問題
      • 會有deadlock產生
  • Deadlock的發生
    • Example
      • 當有一node傳送write request (WRITE_PERMISSION_ONLY)給home node
      • home node又正巧想修改同樣的data於是傳了INVALIDATE給大家
      • 這時就會因為雙方都在等ACK但又因自身處於in-transition state不會回別人ACK而變成deadlock
    • 解法
      • 讓request node有backoff機制,並讓home node處理不一致的問題
      • 舉例來說,當home node將directory state改為Unshared並發完INVALIDATE之後收到WRITE_PERMISSION_ONLY就知道有inconsistency發生了,並自行handle該狀況

2.6 LRU-based Caching

  • GAM採用least recently used (LRU) cache的機制來管理cache
  • 每個node都會維護一個hash table專門map global memory address和相應的cache line,當hash table滿了就用LRU的機制替換掉最少hit的cache
  • 由於採用單一LRU list在很多thread都試圖更新該list時可能會造成huge overhead,所以採用multiple LRU list

3 RDMA-based Implemention

3.1 Protocol Implemention

  • RDMA有2組Verbs可用來做data transmission
    • READ/WRITE
      • one-sided
      • no OS or CPU involved
      • difficult to figure out the completion of data transmission
    • SEND/RECEIVE
      • two-sided
      • receiver side必須先在receive queue post RECEIVE,sender才能做SEND
      • 傳送結束後sender必須notify receiver
  • 將communition分成3種不同的channel
    • control message channel
      • 避免使用RDMA WRITE Verb和busy memory polling,而是採用RDMA SEND/RECEIVE Verbs
        • 因為busy polling跟event-based的方法相比會消耗大量CPU資源
        • sender/receiver的communication buffer會占用大量memory
      • Sec 2.3和2.4中提到的”request”、”forward”、”invalidate”皆使用control channel (RDMA SEND/RECEIVE)
      • error reply亦使用control channel (而非下面提到的notification channel),因為requester需要更多feedback
    • data transmission channel
      • 使用RDMA WRITE Verb直接寫進dest address
    • notification channel
      • 使用RDMA WRITE_WITH_IMM Verb w/ & w/o payload
      • 當notification需要payload的時候會結合data channel,不需要時直接把request identifier嵌入header就好
        • Sec 2.3和2.4中提到的”reply”、”writeback”皆使用data + notification channel (RDMA SEND/RECEIVE w/ payload)
        • Sec 2.3和2.4中提到的”ack”、”transfer”皆使用notification channel only (RDMA SEND/RECEIVE w/o payload)
  • 不使用READ Verb,因為其效能比WRITE還差而且不保證data consistency
  • 所有communication皆採用reliable connection

3.2 Optimizations

4 Memory Consistency Model

  • 註:memory consistency (連結)
    • strong consistency model、relaxed consistency model (TSO、PSO)…
  • 雖然RDMA已經降低許多latency了,但如果強制執行strong consistency會造成嚴重的remote memory access latency,因為它要求read跟write都要同步
    • 解法
      • 採用PSO (relax Read-After-Write跟Write-After-Write)來達成synchronous read和asynchronous write

4.1 Synchronization Operations

  • 提供strong consistency

5 Logging and Failure Recovery

  • 2種logging
    • DLOG
      • request node在寫data進memory/cache前會呼叫
    • OLOG
      • home node在轉換ownership前會呼叫
  • 避免將read request做logging,有助減少overhead
  • 為減少logging時的效能降低,使用NVRAM先暫存log,當log快滿時再用非同步方式寫進disk