Abstract
- GAM是一個基於directory-based cache coherence over RDMA的distributed in-memory platform
1 Introduction
- Shared-nothing架構已被廣泛運用在distributed computing,但跟shared-memory架構相比,shared-memory可以統一global的data access並將分散的節點視為1台具有單個unified memory space的powerful server
- 通常來說,大部分的distributed shared memory (DSM) system都會有cache來保存remote memory access (以加快下次access速度)
- 問題
- 為了保持這些cache data的一致性,會需要sync primitive,從而導致顯著的overhead
- 解法
- 捨棄cache直接使用RDMA去access memory,但是即便現今的RDMA技術在throughput上接近直接access local memory,latency還是可能會高出許多
- GAM
- 保留cache機制
- 利用RDMA做cache coherence protocol
- 問題
2 System Design
2.1 Addressing Model and APIs
- GAM採用partitioned global address space (PGAS) addressing的架構
- PGAS是每台機器藉由RDMA進行memory之間的connect並只負責global address space中分到的其中一部份
- GAM提供的API
- 注意可透過給定的$gaddr$暗示這是local還是remote的操作
2.2 Cache Coherence Protocol
- GAM的overview
- layer 1: snooped-based protocol whithin a NUMA node
- layer 2: directory-based protocol across NUMA nodes
- layer 3: directory-based protocol used by GAM
- 降低latency的方法:將memory分層
- 分為cache及memory層
- 盡可能降低access較為底層的memory (尤其是remote memory)的機會
- 由於多了cache層,則必須要有cache coherence機制
- 不使用snoop-based coherence protocol
- 因為在RDMA network中做broadcast是不可靠的
- 註:snoop-based coherence protocol
- 每個cache Unit都會有一個對應的snoop Unit
- 當處理器對記憶體區塊進行寫入動作,snooper會從BUS上面發現這個動作
- snooper會檢查自己的cache是否也存有該區塊的資料並更新
- 採用directory-based coherence protocol
- 不使用snoop-based coherence protocol
- 5種nodes
- home/remote:data所在的physical memory的擁有節點為home,其餘皆為remote
- request:要求share/exclusive (read/write)權限的節點
- sharing/owner:擁有share/exclusive (read/write)權限的節點
- 每個data可以有多個sharing node但最多只能有1個owner node
- sharing node跟owner node不能同時存在,除非owner node本身就是唯一的sharing node
- 3種directory state of cache line
- Shared:shared by some remote nodes that have read permission
- Dirty:owned by a remote node that has write permission
- Unshared:owned by the home node
- 3種cache state of cache line
- Shared:read-only
- Dirty:writable
- Invalid:invalidated
- 由於network會有延遲,state的transition不一定是atomic的
- 故設計一些in-transition state如”SharedToDirty”,詳見2.5
2.3 Read
- Workflow of read protocol
2.3.1 Local Read
- request node = home node
- 若無remote node holding ownership,則data要嘛在local memory (Unshared),不然就是處於read-only mode (Shared),兩者都能直接從local memory得到結果
- 若有remote node holding ownership,則如上圖(a)
2.3.2 Remote Read
- request node != home node
- 若request node本身就有data的cache,則可直接從cache得到結果
- 若request node本身無data的cache,則有2種case
- “Unshared/Shared” (Non-dirty),如上圖(b)
- “Dirty”,如上圖(c)
2.4 Write
- Workflow of write protocol
2.4.1 Local Write
- request node = home node
- 若無sharing or owner node,則data處於Unshared狀態,可以安全地在local memory寫入data
- 若有sharing or owner node,則有2種case
- “Shared”,如上圖(a)
- “Dirty”,大致如上圖(a),不同處在於
- request node也會送invalidate通知給owner node,並等待ACK
- owner node在回ACK時必須附帶最新的cache data (因為owner node有可能改過這data,所以要把改的東西傳回去)
2.4.2 Remote Write
- request node != home node
- 若request node同時也是owner node,則可以直接進行remote write
- 若request node不是owner node,則有3種case
- “Shared”,如上圖(a)
- “Unshared”,大致如上圖(a),不同處在於
- home node可以跳過步驟3和4
- “Dirty”,如上圖(c)
2.5 Race Condition
- Race condition的發生
- Example
- 當node有2個thread同時發送remote read/write指令並等待ACK時
- 解法
- 處理request時,所有相關的node都會被當作正處於in-transition state,並且將其他針對該cache line的request通通block
- 直到處裡完request才解block
- 問題
- 會有deadlock產生
- Example
- Deadlock的發生
- Example
- 當有一node傳送write request (WRITE_PERMISSION_ONLY)給home node
- home node又正巧想修改同樣的data於是傳了INVALIDATE給大家
- 這時就會因為雙方都在等ACK但又因自身處於in-transition state不會回別人ACK而變成deadlock
- 解法
- 讓request node有backoff機制,並讓home node處理不一致的問題
- 舉例來說,當home node將directory state改為Unshared並發完INVALIDATE之後收到WRITE_PERMISSION_ONLY就知道有inconsistency發生了,並自行handle該狀況
- Example
2.6 LRU-based Caching
- GAM採用least recently used (LRU) cache的機制來管理cache
- 每個node都會維護一個hash table專門map global memory address和相應的cache line,當hash table滿了就用LRU的機制替換掉最少hit的cache
- 由於採用單一LRU list在很多thread都試圖更新該list時可能會造成huge overhead,所以採用multiple LRU list
3 RDMA-based Implemention
3.1 Protocol Implemention
- RDMA有2組Verbs可用來做data transmission
- READ/WRITE
- one-sided
- no OS or CPU involved
- difficult to figure out the completion of data transmission
- SEND/RECEIVE
- two-sided
- receiver side必須先在receive queue post RECEIVE,sender才能做SEND
- 傳送結束後sender必須notify receiver
- READ/WRITE
- 將communition分成3種不同的channel
- control message channel
- 避免使用RDMA WRITE Verb和busy memory polling,而是採用RDMA SEND/RECEIVE Verbs
- 因為busy polling跟event-based的方法相比會消耗大量CPU資源
- sender/receiver的communication buffer會占用大量memory
- Sec 2.3和2.4中提到的”request”、”forward”、”invalidate”皆使用control channel (RDMA SEND/RECEIVE)
- error reply亦使用control channel (而非下面提到的notification channel),因為requester需要更多feedback
- 避免使用RDMA WRITE Verb和busy memory polling,而是採用RDMA SEND/RECEIVE Verbs
- data transmission channel
- 使用RDMA WRITE Verb直接寫進dest address
- notification channel
- 使用RDMA WRITE_WITH_IMM Verb w/ & w/o payload
- 當notification需要payload的時候會結合data channel,不需要時直接把request identifier嵌入header就好
- Sec 2.3和2.4中提到的”reply”、”writeback”皆使用data + notification channel (RDMA SEND/RECEIVE w/ payload)
- Sec 2.3和2.4中提到的”ack”、”transfer”皆使用notification channel only (RDMA SEND/RECEIVE w/o payload)
- control message channel
- 不使用READ Verb,因為其效能比WRITE還差而且不保證data consistency
- 所有communication皆採用reliable connection
3.2 Optimizations
- 略
4 Memory Consistency Model
- 註:memory consistency (連結)
- strong consistency model、relaxed consistency model (TSO、PSO)…
- 雖然RDMA已經降低許多latency了,但如果強制執行strong consistency會造成嚴重的remote memory access latency,因為它要求read跟write都要同步
- 解法
- 採用PSO (relax Read-After-Write跟Write-After-Write)來達成synchronous read和asynchronous write
- 解法
4.1 Synchronization Operations
- 提供strong consistency
5 Logging and Failure Recovery
- 2種logging
- DLOG
- request node在寫data進memory/cache前會呼叫
- OLOG
- home node在轉換ownership前會呼叫
- DLOG
- 避免將read request做logging,有助減少overhead
- 為減少logging時的效能降低,使用NVRAM先暫存log,當log快滿時再用非同步方式寫進disk