[Proceedings of the VLDB Endowment 2018] Efficient Distributed Memory Management with RDMA and Caching

Abstract

GAM is a distributed in-memory platform built on a directory-based cache coherence protocol over RDMA.

1 Introduction

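Shared-nothing architectures are widely used in distributed computing. By comparison, a shared-memory architecture unifies global data access and lets scattered nodes be treated as 1 powerful server with a single unified memory space
Most distributed shared memory (DSM) systems keep a cache of remote memory accesses (to speed up the next access)
- Problem
  - Keeping the cached data consistent requires synchronization primitives, which add significant overhead
- Common solution
  - Drop the cache and access memory directly over RDMA; however, although current RDMA throughput approaches that of local memory access, its latency can still be much higher
- GAM
  - Keeps the cache
  - Uses RDMA to implement the cache coherence protocol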

2 System Design

2.1 Addressing Model and APIs

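GAM adopts the partitioned global address space (PGAS) addressing model
In PGAS, the machines' memories are connected via RDMA, and each machine is responsible only for its assigned portion of the global address space
APIs provided by GAM
- Note that the given $gaddr$ implies whether an operation is local or remote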
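
To make the note about $gaddr$ concrete, here is a minimal sketch of how a PGAS global address could encode its home node. The 16/48 bit split and the helper names (`make_gaddr`, `home_of`, `offset_of`, `is_local`) are assumptions for illustration only; the notes do not state GAM's actual address layout.

```python
# Hypothetical layout: high 16 bits = home node id, low 48 bits = offset
# inside that node's partition. GAM's real encoding may differ.
NODE_BITS = 16
OFFSET_BITS = 48
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def make_gaddr(node_id: int, offset: int) -> int:
    """Pack a node id and a local offset into one global address."""
    if not 0 <= node_id < (1 << NODE_BITS):
        raise ValueError("node_id out of range")
    if not 0 <= offset <= OFFSET_MASK:
        raise ValueError("offset out of range")
    return (node_id << OFFSET_BITS) | offset


def home_of(gaddr: int) -> int:
    """Return the id of the node whose partition contains gaddr."""
    return gaddr >> OFFSET_BITS


def offset_of(gaddr: int) -> int:
    """Return the offset of gaddr inside its home node's partition."""
    return gaddr & OFFSET_MASK


def is_local(gaddr: int, my_node_id: int) -> bool:
    """An operation is local exactly when the caller is the home node."""
    return home_of(gaddr) == my_node_id
```

With such an encoding, a node can decide between the local and remote code paths from the address alone, without a lookup.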

2.2 Cache Coherence Protocol

Overview of GAM
- layer 1: snoop-based protocol within a NUMA node
- layer 2: directory-based protocol across NUMA nodes
- layer 3: directory-based protocol used by GAM
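How latency is reduced: organize memory in layers
- Split into a cache layer and a memory layer
- Minimize accesses to the lower memory layer (especially remote memory)
Adding a cache layer makes a cache coherence mechanism necessary
- A snoop-based coherence protocol is not used
  - because broadcast over an RDMA network is unreliable
  - Note: snoop-based coherence protocol
    - Each cache unit has a corresponding snoop unit
    - When a processor writes to a memory block, the snooper observes the write on the bus
    - The snooper checks whether its own cache also holds that block and, if so, updates it
- A directory-based coherence protocol is adopted
  - Note: directory-based coherence protocol (link)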
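5 kinds of nodes
- home/remote: the node that owns the physical memory where the data resides is the home node; all other nodes are remote nodes
- request: the node requesting shared/exclusive (read/write) permission
- sharing/owner: a node holding shared/exclusive (read/write) permission
Each piece of data can have multiple sharing nodes but at most 1 owner node
Sharing nodes and an owner node cannot coexist, unless the owner node is itself the only sharing node
3 directory states of a cache line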
- Shared：shared by some remote nodes that have read permission
- Dirty：owned by a remote node that has write permission
- Unshared：owned by the home node
3 cache states of a cache line
- Shared：read-only
- Dirty：writable
- Invalid：invalidated
Because of network delay, state transitions are not necessarily atomic
- Hence some in-transition states such as "SharedToDirty" are introduced; see Section 2.5
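
The directory states and the sharing/owner rule above can be written down as a small data structure with an invariant check. This is only a sketch: `DirState`, `DirectoryEntry`, and `check` are hypothetical names, and in-transition states are left out.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, Set


class DirState(Enum):
    UNSHARED = "Unshared"  # owned by the home node
    SHARED = "Shared"      # read permission held by some remote nodes
    DIRTY = "Dirty"        # write permission held by one remote node


@dataclass
class DirectoryEntry:
    """Directory record the home node keeps for one cache line (illustrative)."""
    state: DirState = DirState.UNSHARED
    sharers: Set[int] = field(default_factory=set)
    owner: Optional[int] = None

    def check(self) -> bool:
        """Return True if the entry respects the rules listed in the notes."""
        if self.state is DirState.UNSHARED:
            return not self.sharers and self.owner is None
        if self.state is DirState.SHARED:
            return len(self.sharers) >= 1 and self.owner is None
        # Dirty: exactly one owner; sharers may only contain the owner itself.
        return self.owner is not None and self.sharers <= {self.owner}
```

For example, `DirectoryEntry(DirState.DIRTY, {1, 2}, owner=1).check()` is `False`, because a second sharing node coexists with the owner.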

2.3 Read

Workflow of read protocol

2.3.1 Local Read

request node = home node
If no remote node holds ownership, the data is either only in local memory (Unshared) or in read-only mode (Shared); in both cases the read is served directly from local memory
If a remote node holds ownership, see figure (a) above

2.3.2 Remote Read

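request node != home node
If the request node already has the data in its cache, the read is served directly from the cache
If the request node does not have the data in its cache, there are 2 cases
- "Unshared/Shared" (non-dirty): see figure (b) above
- "Dirty": see figure (c) above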
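
The read cases in 2.3.1 and 2.3.2 can be traced with a toy single-process model. It is a sketch under assumptions, not the figures' exact message sequence: `Line` and `read` are hypothetical names, the message labels are borrowed from Section 3.1, and it is assumed that a dirty owner writes its copy back to the home node and is downgraded to a sharing node.

```python
from enum import Enum


class Dir(Enum):
    UNSHARED = "Unshared"
    SHARED = "Shared"
    DIRTY = "Dirty"


class Line:
    """One cache line as seen by its home node (toy model)."""

    def __init__(self, home, value):
        self.home = home
        self.state = Dir.UNSHARED
        self.sharers = set()
        self.owner = None
        self.memory = value   # copy in the home node's memory
        self.caches = {}      # remote node id -> cached value


def read(line, requester):
    """Return (value, messages) for a read issued by `requester`."""
    msgs = []
    # Remote read that hits the requester's own cache: no messages needed.
    if requester != line.home and requester in line.caches:
        return line.caches[requester], msgs
    if requester != line.home:
        msgs.append(("request", requester, line.home))
    if line.state is Dir.DIRTY:
        owner = line.owner
        msgs.append(("forward", line.home, owner))
        # Assumption: the owner writes back and keeps a read-only copy.
        line.memory = line.caches[owner]
        msgs.append(("writeback", owner, line.home))
        line.owner = None
        line.sharers = {owner}
        line.state = Dir.SHARED
    if requester != line.home:
        msgs.append(("reply", line.home, requester))
        line.caches[requester] = line.memory
        line.sharers.add(requester)
        line.state = Dir.SHARED
    return line.memory, msgs
```

A local read of an Unshared or Shared line produces an empty message list, which matches the statement that it is served from local memory.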

2.4 Write

Workflow of write protocol

2.4.1 Local Write

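request node = home node
If there is no sharing or owner node, the data is in the Unshared state and can safely be written in local memory
If there is a sharing or owner node, there are 2 cases
- "Shared": see figure (a) above
- "Dirty": roughly as in figure (a) above, except that
  - the request node also sends an invalidate to the owner node and waits for the ACK
  - the owner node must attach the latest cached data to its ACK (the owner node may have modified the data, so the changes have to be sent back)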

2.4.2 Remote Write

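request node != home node
If the request node is also the owner node, it can carry out the remote write directly
If the request node is not the owner node, there are 3 cases
- "Shared": see figure (a) above
- "Unshared": roughly as in figure (a) above, except that
  - the home node can skip steps 3 and 4
- "Dirty": see figure (c) above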
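
The write cases can be traced the same way. Again this is a simplified sketch, not the figures' step numbering: `Line` and `write` are hypothetical names, every invalidation is routed through the home node, and the remote Dirty case reuses the invalidate-with-data path described for local writes, which is an assumption.

```python
from enum import Enum


class Dir(Enum):
    UNSHARED = "Unshared"
    SHARED = "Shared"
    DIRTY = "Dirty"


class Line:
    """One cache line as seen by its home node (toy model)."""

    def __init__(self, home, value):
        self.home = home
        self.state = Dir.UNSHARED
        self.sharers = set()
        self.owner = None
        self.memory = value   # copy in the home node's memory
        self.caches = {}      # remote node id -> cached value


def write(line, requester, value):
    """Apply a write by `requester` and return the messages it needs."""
    msgs = []
    # The current owner already has write permission: update its cache only.
    if line.state is Dir.DIRTY and line.owner == requester:
        line.caches[requester] = value
        return msgs
    if requester != line.home:
        msgs.append(("request", requester, line.home))
    if line.state is Dir.DIRTY:
        owner = line.owner
        msgs.append(("invalidate", line.home, owner))
        # The owner's ACK carries its latest data back to the home node.
        line.memory = line.caches.pop(owner)
        msgs.append(("ack", owner, line.home))
        line.owner = None
    elif line.state is Dir.SHARED:
        for node in sorted(line.sharers):
            if node == requester:
                continue
            msgs.append(("invalidate", line.home, node))
            line.caches.pop(node, None)
            msgs.append(("ack", node, line.home))
        line.sharers.clear()
    if requester == line.home:
        line.memory = value
        line.state = Dir.UNSHARED
    else:
        msgs.append(("reply", line.home, requester))
        line.caches[requester] = value
        line.owner = requester
        line.state = Dir.DIRTY
    return msgs
```

In this model the Unshared remote write needs only a request and a reply, which mirrors the remark that the home node can skip the invalidation steps.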

2.5 Race Condition

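How race conditions arise
- Example
  - A node has 2 threads that simultaneously issue remote read/write requests and wait for ACKs
- Solution
  - While a request is being processed, all involved nodes are treated as being in an in-transition state, and every other request for that cache line is blocked
  - The block is lifted only after the request has been fully processed
- Problem
  - Deadlock can occur
How deadlock arises
- Example
  - A node sends a write request (WRITE_PERMISSION_ONLY) to the home node
  - The home node happens to want to modify the same data, so it sends INVALIDATE to all sharing nodes
  - Both sides are now waiting for an ACK, yet neither replies to the other because each is in an in-transition state, so they deadlock
- Solution
  - Give the request node a backoff mechanism and let the home node resolve the inconsistency
  - For example, if the home node has changed the directory state to Unshared and sent out INVALIDATE, and then receives WRITE_PERMISSION_ONLY, it knows an inconsistency has occurred and handles it itself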
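
The blocking rule for in-transition cache lines can be sketched as a per-line gate that queues later requests until the current one finishes. `LineGate`, `begin`, and `finish` are hypothetical names; the sketch covers only the blocking rule, not the backoff or the home node's inconsistency handling.

```python
from collections import defaultdict, deque


class LineGate:
    """Serializes requests per cache line while a line is in transition."""

    def __init__(self):
        self.busy = set()                   # lines currently in transition
        self.pending = defaultdict(deque)   # line address -> blocked requests

    def begin(self, line_addr, request):
        """Return True if `request` may run now, False if it was queued."""
        if line_addr in self.busy:
            self.pending[line_addr].append(request)
            return False
        self.busy.add(line_addr)
        return True

    def finish(self, line_addr):
        """Complete the running request; return the next blocked one, if any."""
        queue = self.pending.get(line_addr)
        if queue:
            # The line stays in transition on behalf of the next request.
            return queue.popleft()
        self.busy.discard(line_addr)
        return None
```

Requests for different cache lines never block each other here, so the gate only serializes the conflicting ones.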

2.6 LRU-based Caching

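GAM manages its cache with a least recently used (LRU) policy
Each node maintains a hash table that maps global memory addresses to the corresponding cache lines; when the hash table is full, LRU replacement evicts the least recently used cache line
A single LRU list can incur huge overhead when many threads try to update it at once, so multiple LRU lists are used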
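
The idea of multiple LRU lists can be illustrated with a sharded cache in which each shard has its own list and lock. This is a minimal sketch: `ShardedLRUCache` is a hypothetical name, and picking the shard by hashing the address is an assumption, since the notes do not say how GAM assigns a cache line to a list.

```python
import threading
from collections import OrderedDict


class ShardedLRUCache:
    """LRU cache split into independent shards to reduce lock contention."""

    def __init__(self, capacity_per_shard, num_shards=4):
        self._capacity = capacity_per_shard
        self._shards = [OrderedDict() for _ in range(num_shards)]
        self._locks = [threading.Lock() for _ in range(num_shards)]

    def _index(self, gaddr):
        # Assumed policy: the shard is chosen by hashing the global address.
        return hash(gaddr) % len(self._shards)

    def get(self, gaddr):
        """Return the cached line or None; a hit marks it most recently used."""
        i = self._index(gaddr)
        with self._locks[i]:
            shard = self._shards[i]
            if gaddr not in shard:
                return None
            shard.move_to_end(gaddr)
            return shard[gaddr]

    def put(self, gaddr, line):
        """Insert a line; return the evicted (gaddr, line) pair or None."""
        i = self._index(gaddr)
        with self._locks[i]:
            shard = self._shards[i]
            shard[gaddr] = line
            shard.move_to_end(gaddr)
            if len(shard) > self._capacity:
                return shard.popitem(last=False)  # least recently used entry
            return None
```

Threads touching addresses in different shards take different locks, which is the contention benefit the notes describe; the trade-off is that eviction is only approximately global LRU.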

3 RDMA-based Implementation

3.1 Protocol Implementation

RDMA offers 2 sets of Verbs for data transmission
- READ/WRITE
  - one-sided
  - no OS or CPU involved
  - difficult to figure out the completion of data transmission
- SEND/RECEIVE
  - two-sided
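  - The receiver must post a RECEIVE to its receive queue before the sender can SEND
  - The receiver is notified when the transfer completes
Communication is split into 3 different channels
- control message channel
  - Avoids the RDMA WRITE Verb with busy memory polling and uses the RDMA SEND/RECEIVE Verbs instead
    - because busy polling consumes far more CPU than an event-based approach
    - and the sender/receiver communication buffers would occupy a large amount of memory
  - The "request", "forward", and "invalidate" messages in Sections 2.3 and 2.4 all use the control channel (RDMA SEND/RECEIVE)
  - Error replies also use the control channel (rather than the notification channel described below), because the requester needs more feedback
- data transmission channel
  - Uses the RDMA WRITE Verb to write directly into the destination address
- notification channel
  - Uses the RDMA WRITE_WITH_IMM Verb, with or without payload
  - When a notification needs a payload it is combined with the data channel; otherwise the request identifier is simply embedded in the header
    - The "reply" and "writeback" messages in Sections 2.3 and 2.4 use the data + notification channels (RDMA WRITE_WITH_IMM w/ payload)
    - The "ack" and "transfer" messages in Sections 2.3 and 2.4 use the notification channel only (RDMA WRITE_WITH_IMM w/o payload)
The READ Verb is not used, because it performs worse than WRITE and does not guarantee data consistency
All communication uses reliable connections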
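
The channel assignment above amounts to a lookup from message type to channel. The sketch below only restates that mapping: `Channel`, `CHANNEL_OF`, and `channel_for` are hypothetical names, and the verb strings follow the notification-channel bullet, which names WRITE_WITH_IMM.

```python
from enum import Enum


class Channel(Enum):
    CONTROL = "RDMA SEND/RECEIVE"
    DATA_AND_NOTIFICATION = "RDMA WRITE_WITH_IMM with payload"
    NOTIFICATION = "RDMA WRITE_WITH_IMM without payload"


# Message types as named in Sections 2.3, 2.4, and 3.1 of the notes.
CHANNEL_OF = {
    "request": Channel.CONTROL,
    "forward": Channel.CONTROL,
    "invalidate": Channel.CONTROL,
    "error_reply": Channel.CONTROL,
    "reply": Channel.DATA_AND_NOTIFICATION,
    "writeback": Channel.DATA_AND_NOTIFICATION,
    "ack": Channel.NOTIFICATION,
    "transfer": Channel.NOTIFICATION,
}


def channel_for(message_type):
    """Return the channel a protocol message travels on."""
    try:
        return CHANNEL_OF[message_type]
    except KeyError:
        raise ValueError(f"unknown message type: {message_type}") from None
```

For instance, `channel_for("writeback").value` names the verb that carries the data together with its notification.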

3.2 Optimizations

4 Memory Consistency Model

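Note: memory consistency (link)
- strong consistency model, relaxed consistency models (TSO, PSO)...
Although RDMA already reduces latency considerably, enforcing strong consistency would still cause severe remote memory access latency, because it requires both reads and writes to be synchronous
- Solution
  - Adopt PSO (relaxing the Read-After-Write and Write-After-Write orderings) to achieve synchronous reads and asynchronous writes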

4.1 Synchronization Operations

Synchronization operations provide strong consistency

5 Logging and Failure Recovery

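2 kinds of logging
- DLOG
  - Called by the request node before it writes data into memory/cache
- OLOG
  - Called by the home node before it transfers ownership
Read requests are not logged, which helps reduce overhead
To limit the performance cost of logging, logs are first buffered in NVRAM and flushed asynchronously to disk when the log is nearly full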
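
The buffer-then-flush scheme can be sketched as follows. Everything here is illustrative: `BufferedLog` is a hypothetical name, plain Python lists stand in for the NVRAM buffer and the disk, and the 75% threshold for "nearly full" is an assumption.

```python
import threading


class BufferedLog:
    """Buffers DLOG/OLOG records and flushes them in the background."""

    def __init__(self, capacity, flush_ratio=0.75):
        self.threshold = max(1, int(capacity * flush_ratio))
        self.buffer = []   # stand-in for the NVRAM log area
        self.disk = []     # stand-in for the on-disk log
        self._lock = threading.Lock()
        self._disk_lock = threading.Lock()
        self._flushers = []

    def append(self, kind, record):
        """Add one record; start an asynchronous flush when nearly full."""
        if kind not in ("DLOG", "OLOG"):
            raise ValueError("only DLOG and OLOG records are logged")
        with self._lock:
            self.buffer.append((kind, record))
            if len(self.buffer) >= self.threshold:
                batch, self.buffer = self.buffer, []
                worker = threading.Thread(target=self._flush, args=(batch,))
                self._flushers.append(worker)
                worker.start()

    def _flush(self, batch):
        # Runs off the caller's thread, so append() does not wait for the disk.
        with self._disk_lock:
            self.disk.extend(batch)

    def wait(self):
        """Block until every flush started so far has completed."""
        for worker in self._flushers:
            worker.join()
```

Because `append` hands the batch to a background thread, the writer pays only for the in-memory append; a real implementation would also have to preserve ordering across batches, which this sketch does not attempt.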
