[SIGCOMM 2020] BeauCoup: Answering Many Network Traffic Queries, One Memory Update at a Time

Abstract

網路管理者通常需要大量方法來監控網路流量以偵測異常
目前現有方法通常聚焦在single statistic (如traffic volume)和traffic attribute (如dest IP)
但每個封包能access的memory有限，要執行錯綜複雜、數量又多的方法就會變成一項挑戰
BeauCoup是個基於coupon collector’s problem的系統
- 贈券收集問題 (coupon collector’s problem)
  - 假設有$n$種贈券，每種贈券獲取機率相同，而且贈券亦無限供應。若取贈券$t$張，能集齊$n$種贈券的機率多少？
  - ans：平均需要$n\cdot(1/1+1/2+\cdots+1/n)\approx n\ln{n}+\gamma n+1/2+o(1),as\space n\to\infty$次才能集齊$n$種贈券，時間複雜度為$\theta (n\ln{n})$
BeauCoup可同時支援多個distinct counting queries，並且保證每個封包只會access到極小且不變的memory

1 Introduction

先備知識
- 此篇論文中1個query就是1種攻擊(像是super-spreader、DDoS、port scan等)
- 1個query可以有多個key
- 每個(query, key)都能對應到複數個不同的coupon

Query, Key	Coupons
1, A	1 2 3 4
1, B	a b c d
1, C	5 6 7 8
2, D	w x y z

新興的switch可以直接分析經過data plane的封包，但是memory和processing resource有限
以往研究者為了解決這個問題會設計一些compact data structure讓其能夠在有限資源中算出大概的答案(不甚精準)，這方法可套用在
- single traffic monitoring query
- multiple queries over same key
  - 如果想support multiple queries with different keys，就必須將多個seperate data structure (相對於前面說的compact data structure)實例化，但這會浪費data plane中寶貴的memory space
  - 除了memory space問題之外，為了保持一定的line rate，switch通常只允許封包access有限且不多的memory
目前已存在支援處理multiple queries的技術大部分極度依賴跑在data plane外的一些程式，造成額外的overhead和latency
- 最簡單的方法是在data plane隨機對封包取樣，然後丟給程式跑一些統計數據
- 雖然以上方法對high volume flow (高流量)的偵測很有用，但是會降低一些計算distinct attribute數量的query的accuracy
BeauCoup能夠支持通用的query abstraction，它可以在一組有關連的封包(具有相同的key)中對不同的item (如attribute)計數，並對那些超過threshold的key進行標記
- 舉例來說，當他要找尋網路蠕蟲，那封包的src IP就是key，而dest IP就是attribute，threshold決定當不同的dest IP超過多少個時就要把src IP標記為蠕蟲
BeauCoup允許用戶將各種header field自定義為query的key和attribute且可以在runtime更改query set而不需要recompile，只有前面提到的改變自定義key或attribute時要recompile
BeauCoup基於coupon collector’s problem的做法如下
- 以偵測super-spreader為例，假設我們想知道某個sender是否對至少130個不同的dest IP傳送封包
- 不將所有IP記錄下來，而是定義32張coupon並將dest IP隨機對應到32張coupon的其中1張
- 當32張都被拿到，就會標記該src IP是super-spreader
- 因為根據公式，要拿到32張coupon平均必須嘗試129.9次(coupon抽出後會被放回)，以此為依據推測src IP已經傳給了130個不同的dest IP封包
- 當threshold改變，要變的就是coupon數量($m$)、對於新的dest IP所會抽到的每個coupon的機率($p$)和必須被抽到的coupon數量($n$)
為了限制BeauCoup所使用的memory大小，必須保持coupon state都很小、必要時才devote state to a key、share memory across queries and keys
另外，BeauCoup必須保證1個封包最多只能拿到1種coupon (亦即只對應到1種攻擊)，所以要讓每個query都只有很小的機率能獲取coupon，且要跟其他query協調避免同時拿到一堆coupon，要做到這件事必須tune $m,\space p,\space and\space n$

2 Algorithm

2.1 Query: Count-Distinct with Threshold

當一個封包在time window $W$內滿足$

{attr_q(i)

key_q(i)=k}>T_q

$，則BeauCoup應輸出alert$(q,k)$for query $q$ and key $k$，其中在query $q$中

$attr_q(i)$是封包$i$在某個query $q$中的某個key對應到的不同的attribute出現的次數
$key_q(i)$是封包$i$在某個query $q$中對應到的每個key (query $q$中所有key的集合)

attribute可以高度客製化，如底下表格的Heavy hitter、SYN-flood

2.2 Updating the Coupon-Collector Table

Coupon-Collector Table是由多個query-key pair $(q,\space k)$的row所組成的，每條row裡有1個$w$-bit的word vector ($w$取決於該系統1個word是多少bit)，1個bit即表示那張coupon是否曾被獲取過
BeauCoup會先針對進來的封包透過random hash function得到1個query和1張coupon，並找到對應的coupon將其mark為1 (表示已被獲取)
coupon table遵循以下設計理念
- Compact rows
  - 每個row只儲存1 $w$-bit word as a bit vector，表示最多有$w$張coupon
- Space efficiency
  - 只有被獲取過至少1張coupon的bit vector對應到的query會被保留在memory裡
  - 當update query table時發現沒有該row才會allocate
- Limited access
  - BeauCoup只有在要獲取coupon時才會access memory，若封包沒有獲取到coupon就不會去access

2.3 Selecting a Query and a Coupon

對於封包$i$的attribute $attr_q(i)$ ($q$是所有queries中其中1條query)，採用一random hash function $h$，使得$h:{attr_q(i),\forall i}\to[0,1)$，並檢查該output是否落在某coupon的獲取範圍裡
- 舉例來說，query $q$有$m_q=4$張coupon，選中其中1張coupon的機率為$p_q=1/8$，則欲獲取coupon#1，hash function output必須落在0~1/8，即$h(attr_q(i))\in [0,1/8)$，其餘類推
- 若hash function output落在$[4/8, 1)$則BeauCoup不會access memory，亦即該封包不會獲取coupon
定義$\gamma_q$為：對於query $q$每個封包所允許獲取的coupon的平均數量，以上一點的舉例來說$\gamma_q=m_q\cdot p_q=1/2$
通常會讓$\gamma_q\lt1$，這有2個優點
- 因為不是每次都會獲得coupon，所以coupon table不用維護每個key的state，只有那些曾獲取過coupon的key會allocate memory並被維護
- 1個小的$\gamma_q$可以讓空閒的memory (因為不是每個$q$都有allocate memory，所以會多出一些空閒memory)被拿來使用在獲取另一個query的coupon的計算，達到multiple queries run concurrently
Grouping queries with the same attribute
- 不同的query可能有相同的attribute (如dest IP)
- 這些query可以共用相同的hash function，如下圖

Coordinating across queries with different attributes
- BeauCoup會針對每個不同的attribute建立一個random hash function，當封包進來，BeauCoup會把所有hash function都計算一遍
- 如果只有一個hash function的output是成功獲取coupon，那BeauCoup就會去update coupon table
- 如果都沒有就不動作
- 如果有多個hash function的output都成功獲取到coupon，那就進入tie-breaking
- 但目前BeauCoup只有在獲得剛好2張coupon時才會進入tie-breaking，並在tie-breaking用擲硬幣(各50%機率)決定要拿哪張coupon
- 多於2張coupon的狀況BeauCoup會決定都不拿，也就是都不動作，但這對accuracy影響很小(在Appendix B有證明1個封包獲得多張coupon的機率很小)

3 The BeauCoup Query Compiler

3.1 Coupon Collector’s Accuracy

由於實際需要把coupon拿完的次數可能跟coupon collector’s problem原始問題計算出的期望值落差很大，所以團隊設計了一個Relative Error用來計算bias和variance

True count

假設系統在觀察了$i_1,i_2,\cdots,i_t$個封包後才發出alert$(q,k)$，定義ground truth number of distinct attributes為$\tau=

{attr_q(i)

key_q(i)=k,i\in i_1,i_2,\cdots,i_t}

Absolute error

但其實早在出現恰好$T_q$ (threshold)個distinct attributes時，系統就該發alert了，我們定義absolute error為$

\tau-T_q

Relative error
- 定義output$(q,k)$的relative error為$( \tau-T_q )/T_q$

3.2 Finding the Best Configuration

implement BeauCoup on PISA switch
- 因為1個memory word是$w$=32-bit，所以必須讓$m_q\le32$
- 為了促進random hash mapping到coupon的效率，必須讓$p_q$是2的指數次方
- 必須遵循每個封包能拿到的coupon小於等於$\gamma_q$，即$m_q\cdot p_q\le \gamma_q$
給定threshold $T_q$及每個封包的coupon limit $\gamma_q$，根據以下步驟來獲取最好的config
- 對於所有可行的$p_q=2^{-j}$，試著計算出最大的$m_q$ (query $q$所能擁有的最大coupon數量)。計算方式為$\overline{m_q}=min(w,\gamma_q/p_q)$，若$\overline{m_q}\lt1$就停止
- 對於每個$p_q$我們要找出可行的config且必須滿足$1\le n_q\le m_q\le\overline{m_q}$，然後計算每個config拿coupon的期望值$CC(m_q,p_q,n_q)$當期望值落在threshold $T_q$的大約5%誤差之間時視為reasonable的config，之所以選5%是因為optimal collector的minimum relative error大約10%
- 從所有reasonable config中基於他們的minimum relative error選擇適當的config

4 BeauCoup on PISA Hardware

PISA switch有2種memory，TCAM跟SRAM
- TCAM可以同時match a bit string with many match rules，通常被使用在packet forwarding。BeauCoup將其用在hash function map from attribute to coupon以及tie-breaking process
- SRAM因為空間有限(通常只有幾MB)，且每個封包只能access為數不多的SRAM，所以BeauCoup將其用在coupon collect table

4.1 Using TCAM for Drawing Coupons

每個random hash function的output都會被encode成16-bit，然後coupon的range會被轉譯成bit prefix並對應到那encode出來的random bits
- 如上圖table #1，coupon range $[0, 1/4)$對應到00、$[1/4, 2/4)$對應到01，$[2/4, 2/4+1.8)$對應到100、$[2/4+1/8, 2/4+2/8)$對應到101
每張table都會對應到一個random hash function
如果有2張coupon同時被拿到，如上圖Table #1的(1,1)跟Table #2的(6,1)，就會進入tie-breaking，即上圖右邊的table
From table#1對應到1000，From table #2對應到0100，From table #1&2就對應到1100，然後用random number generator生出1-bit random number，來決定拿哪張coupon

4.2 Recording Coupons in SRAM

使用SRAM的register array作為coupon table，每個array有$S$個word，每個word有32 bits
BeauCoup會先對進來的封包extract query key $key_q(i)$，然後用一個random hash function $H$把query跟key map到array index，$idx=H(q,key_q(i))$
BeauCoup定義了3種register array，每個array有$S$個words
- $\tau S[\cdot]$存timestamp，用於檢查某個query的coupon collector是否超過了time window $W$，來判斷是否過期，若是則需要allocate新的coupon collector
- $QK[\cdot]$存checksum $checksum(key_q(i))$，每個checksum有32 bits，用於檢查$H$是否發生collision
- $CC[\cdot]$存coupon collector的bit vector
實際操作為以下步驟
- Create new collector
  - 當$\tau S[idx]\lt i.timestamp-W$表示該collector已經過期，則走以下步驟allocate新的collector
  - $\tau S[idx]\leftarrow i.timestamp$
  - $QK[idx]\leftarrow checksum(key_q(i))$
  - $CC[idx]\leftarrow onehot(c)$ ($onehot(c)$是將coupon encode成1個有31個0、1個1的32 bits的bit string)
- Update existing collector
  - 當$\tau S[idx]\ge i.timestamp-W$且$QK[idx]=checksum(key_q(i))$表示沒過期且沒有collision發生，則走以下步驟更新collector
  - $CC[idx]\leftarrow (CC[idx]\lor onehot(c))$ (這邊的OR是bitwise-OR)
  - 如果$CC[idx]$的32 bits中裡的1的數量達到$n_q$就發alert$(q,key_q(i))$
- Handle collision
  - 當$\tau S[idx]\ge i.timestamp-W$且$QK[idx]\ne checksum(key_q(i))$表示沒過期但有collision發生，則忽略該coupon
  - collision發生原因為太多coupon collector，系統快要沒memory可用了

4.3 Query Compiler and Code Generation

略(主要敘述怎麼生出BeauCoup的P4 code)

上篇[SIGCOMM 2020] NetLock: Fast, Centralized Lock Management Using Programmable Switches

下篇[SIGCOMM 2020] OmniMon: Re-architecting Network Telemetry with Resource Efficiency and Full Accuracy