CS 636 Internetworking
Ramana Kompella
Router Algorithmics: Midterm Review

Exact Match Lookups
Scaling via hashing
• Gigaswitch: 32 x 100 Mbps FDDI ports
• Use hashing instead of a search tree
– Avoid the worst case by using perfect hashing to avoid too many collisions
– Hash: A(x)*M(x) mod G(x), where G(x) = x^48 + x^36 + x^25 + x^10 + 1, A(x) is the address, and M(x) is a non-zero multiplier
• Bottom 16 bits index into a 64K hash table
– Remaining 32 bits used to disambiguate entries
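A minimal sketch of the polynomial hash described above, assuming polynomials over GF(2) are packed into integers (bit i is the coefficient of x^i). The function names `clmul`, `polymod`, and `hash_address` are illustrative and not taken from the Gigaswitch design.

```python
# GF(2) polynomial hash sketch; polynomials are ints, bit i = coefficient of x^i.
G = (1 << 48) | (1 << 36) | (1 << 25) | (1 << 10) | 1  # G(x) from the slide

def clmul(a: int, b: int) -> int:
    """Carry-less (XOR) multiplication of two GF(2) polynomials."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

def polymod(p: int, g: int = G) -> int:
    """Remainder of p(x) divided by g(x) over GF(2)."""
    deg_g = g.bit_length() - 1
    while p.bit_length() - 1 >= deg_g:
        p ^= g << (p.bit_length() - 1 - deg_g)
    return p

def hash_address(addr: int, multiplier: int) -> tuple[int, int]:
    """Return (bucket index, 32-bit tag) for a 48-bit address."""
    h = polymod(clmul(addr, multiplier))   # remainder has at most 48 bits
    index = h & 0xFFFF                     # bottom 16 bits pick one of 64K buckets
    tag = h >> 16                          # remaining 32 bits disambiguate entries
    return index, tag
```

Choosing a different non-zero `multiplier` gives a different hash function, which is what makes it possible to rehash when a bucket collects too many collisions.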
Problem with hashing
• Non-deterministic
– Does not provide worst-case guarantees
– Can restrict lookups to 3-4 memory accesses, but update complexity becomes even worse
• Alternate approach
– Hardware pipelining of a binary search tree
IP Prefix Lookups

Unibit tries
Routing table: P1 0*, P2 1*, P3 100*, P4 1000*, P5 100000*, P6 101*, P7 110*, P8 11001*, P9 111*
[Figure: unibit trie for the table above; each edge consumes one address bit, and a node is labeled with the prefix that ends there]
Input: 11000010
Longest matching prefix: P7 (P2 matches after the first bit and is superseded by P7 after the third)

Multibit tries
• Consider multiple bits at a time
• Faster lookup
• Problem:
– Prefixes are not aligned with stride boundaries
• Solution:
– Controlled prefix expansion
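A minimal sketch of controlled prefix expansion, assuming prefixes are written as bit strings and that a longer original prefix wins when two expansions collide. The names `expand`, `controlled_prefix_expansion`, and `lookup` are illustrative; a real multibit trie would index arrays by stride rather than probe a dictionary.

```python
from typing import Optional

def expand(prefix: str, stride: int) -> list[str]:
    """Expand a bit-string prefix to the next multiple of `stride` bits."""
    target = -(-len(prefix) // stride) * stride      # round length up to a stride
    pad = target - len(prefix)
    if pad == 0:
        return [prefix]
    return [prefix + format(i, f"0{pad}b") for i in range(2 ** pad)]

def controlled_prefix_expansion(table: dict[str, str], stride: int) -> dict[str, str]:
    """Map expanded prefix -> next hop; longer original prefixes win conflicts."""
    expanded: dict[str, tuple[int, str]] = {}
    for prefix, hop in table.items():
        for e in expand(prefix, stride):
            # keep the entry that came from the longest original prefix
            if e not in expanded or expanded[e][0] < len(prefix):
                expanded[e] = (len(prefix), hop)
    return {e: hop for e, (_, hop) in expanded.items()}

def lookup(expanded: dict[str, str], addr: str, stride: int) -> Optional[str]:
    """Longest-prefix match by probing stride-aligned lengths, longest first."""
    for length in range(len(addr) - len(addr) % stride, 0, -stride):
        hop = expanded.get(addr[:length])
        if hop is not None:
            return hop
    return None

# Nine-prefix example table, expanded with stride 3.
TABLE = {"0": "P1", "1": "P2", "100": "P3", "1000": "P4", "100000": "P5",
         "101": "P6", "110": "P7", "11001": "P8", "111": "P9"}
EXPANDED = controlled_prefix_expansion(TABLE, stride=3)
```

After expansion every prefix length is a multiple of the stride, so a lookup on 11000010 only needs to probe lengths 6 and 3.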
Controlled prefix expansion
Old routing table:
P1 0*, P2 1*, P3 100*, P4 1000*, P5 100000*, P6 101*, P7 110*, P8 11001*, P9 111*
Controlled prefix expansion with stride 3:
P1 000*, 001*, 010*, 011*
P2 100*, 101*, 110*, 111*
P3 100*
P4 100000*, 100001*, 100010*, 100011*
P5 100000*
P6 101*
P7 110*
P8 110010*, 110011*
P9 111*
New routing table (where expansions collide, the entry from the longer original prefix is kept):
P1 000*, 001*, 010*, 011*
P3 100*
P4 100001*, 100010*, 100011*
P5 100000*
P6 101*
P7 110*
P8 110010*, 110011*
P9 111*

Lulea compressed tries
Routing table after controlled prefix expansion (stride 3):
P1 000*, 001*, 010*, 011*; P3 100*; P4 100001*, 100010*, 100011*; P5 100000*; P6 101*; P7 110*; P8 110010*, 110011*; P9 111*

Lulea bitmap compression
[Figure: uncompressed node with entries 000 P1, 001 P1, 010 P1, 011 P1, 100, 101 P5, 110, 111 P9; compressed node with entries P1, P5, P9; bitmap 000:1 001:0 010:0 011:0 100:1 101:1 110:1 111:1]
Repeating entries are stored only once in the compressed array. An auxiliary bitmap is needed to find the right entry in the compressed node. It stores a 0 for positions that do not differ from the previous one.
Reduces storage to about 160 KBytes for MAE East.

Bitmap supporting fast counting
Bitmap (32 positions in four chunks of 8):
00000-00111: 1 0 0 0 1 1 1 0
01000-01111: 1 0 1 0 1 0 0 1
10000-10111: 1 0 1 0 1 1 1 0
11000-11111: 0 0 0 0 1 1 1 0
Auxiliary array (bits set before each chunk): 00: 0, 01: 4, 10: 8, 11: 13
When the compression bitmaps are large it is expensive to count bits during lookup. The bitmap is divided into chunks and a precomputed auxiliary array stores the number of bits set before each chunk. The lookup algorithm needs to count only bits set within one chunk.
Input 11001010: 13 + 0 = 13; longest matching prefix P7

Representing node as tree bitmap
Routing table: P1 0*, P2 1*, P3 100*, P4 1000*, P5 100000*, P6 101*, P7 110*, P8 11001*, P9 111*
Bitmap for child pointers: 000:0 001:0 010:0 011:0 100:1 101:0 110:1 111:0
Bitmap for prefixes: 0*:1 1*:1 00*:0 01*:0 10*:0 11*:0 000*:0 001*:0 010*:0 011*:0 100*:1 101*:1 110*:1 111*:1
Prefixes stored for this node: P1 P2 P3 P6 P7 P9
Pointers to children and prefixes are stored in separate structures. Prefixes of all lengths are stored, thus leaf pushing is not needed and update is fast. Bitmaps have 1s corresponding to entries that are not empty.
Input 11001010: longest matching prefix P7

Packet Classification

Example Classifier
Rule  Destination Address  Source Address
R1    0*                   10*
R2    0*                   01*
R3    0*                   1*
R4    00*                  1*
R5    00*                  11*
R6    10*                  1*
R7    *                    00*

Set-pruning Tries [Tsuchiya, Sri98]
[Figure: set-pruning trie for the example classifier. A trie on dimension DA whose nodes each point to a trie on dimension SA; rules are replicated across the SA tries]
O(N^2) memory
O(2W) lookup

Beyond 2-d
• Simplest scheme: extend any 2-d scheme to multiple dimensions
– Given that at most 20 rules match on the two dimensions, search those rules linearly on the remaining fields
• Advantage: no replication, port ranges stay as ranges.
[Figure: any two-dimensional search algorithm finds all matches for (S, D); rule lists shown: R1, R2, R5, R7, R2, R6]
Extended grid of tries
[Figure: grid of tries with a Field 1 trie, Field 2 tries holding rule lists, and jump pointers between the Field 2 tries]
• Grid of tries for normal 2-d matches
– We fixed backtracking with replication
• EGT-PC [BSV03] uses pre-computation of rule costs and path compression
Divide-and-conquer
• Three schemes
– Bit vector linear search
– On-demand cross-producting
– Equivalenced cross-producting
• Common idea: Search along individual dimensions and combine results.
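A minimal sketch of the bit vector scheme on a seven-rule, two-field classifier, assuming that rule order is priority order (R1 highest). Each field yields a bit vector of matching rules and the vectors are ANDed. The names `field_bitvector` and `classify` are illustrative.

```python
from typing import Optional

# Seven-rule example: rule -> (destination prefix, source prefix).
RULES = [("0", "10"), ("0", "01"), ("0", "1"), ("00", "1"),
         ("00", "11"), ("10", "1"), ("", "00")]           # R1..R7, "" means *

def field_bitvector(field: int, value: str) -> int:
    """Bit i is set if rule i matches `value` on this one field.

    A real implementation would precompute these vectors and reach them
    through a per-field trie lookup; scanning keeps the sketch short.
    """
    vec = 0
    for i, rule in enumerate(RULES):
        if value.startswith(rule[field]):
            vec |= 1 << i
    return vec

def classify(dst: str, src: str) -> Optional[str]:
    """Return the first (highest-priority) rule matching both fields."""
    both = field_bitvector(0, dst) & field_bitvector(1, src)   # combine results
    if both == 0:
        return None
    lowest = (both & -both).bit_length() - 1                   # lowest set bit
    return f"R{lowest + 1}"
```

The AND touches every bit of the vectors, which is why the scheme is called a linear search: lookup cost grows with the number of rules, although with a small constant because whole machine words are combined at once.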
Equivalenced cross-producting
• Equivalenced cross-producting (a.k.a. recursive flow classification or RFC)
• Combines the results of the per-field longest matching prefix operations two by two.
• Pairs of values are grouped into equivalence classes
• Leads to significant memory savings as compared to simple cross-producting.
• Provides fast packet classification, but compared to other algorithms, the memory requirements are relatively large
Index  Dest IP, Src IP  Rule bitmap  Class
0      M, S             11110011     C1
1      M, TO            11010011     C2
2      M, Net           11010111     C3
3      M, *             11010011     C2
4      TI, S            00000011     C4
5      TI, TO           00001011     C5
6      TI, Net          00000111     C6
7      TI, *            00000011     C4
8      Net, S           00000011     C4
9      Net, TO          00000011     C4
10     Net, Net         00000111     C6
11     Net, *           00000011     C4
12     *, S             00000001     C7
13     *, TO            00000001     C7
14     *, Net           00000100     C8
15     *, *             00000001     C7
16 entries, 8 distinct classes
[Figure: per-field results for Src IP, Dest IP, Src Port, Dest Port, and Proto combined step by step into the final result]

Decision tree approaches
• At each node of the tree, test a bit in a field or perform a range test
– Large fan-out leads to shallow trees and fast classification
• Leaves contain a few rules that are traversed linearly
• Interior nodes may also contain rules that match
• Tests may look at bits from multiple fields
• A rule may appear in multiple nodes of the decision tree; this can lead to increased memory usage
• Tree built using heuristics that pick fields to compare on so that the remaining rules are divided relatively evenly among descendants
• Fast and compact on rule sets used today
HiCuts, HyperCuts
[Figure: example decision tree. Interior nodes test "Dest port < 50", "Source = S ?", and "Dest Port = 53 ?"; leaves hold short rule lists (rules shown: R1 R3 R5 R6 R10 R2 R7 R2 R7 R9 R5 R4)]

Interconnects
Two basic techniques
• Input Queueing: usually a non-blocking switch fabric (e.g. crossbar)
• Output Queueing: usually a fast bus

Karol's result: intuitive proof
• Assume saturation (i.e., all inputs have cells to send at any given instant)
• Assume each packet is destined to each output with probability 1/N
• With equal-size packets, the probability that an output O is idle is the probability that none of the inputs chooses O
• Each input does not choose O with probability 1 - 1/N, so P(O idle) = (1 - 1/N)^N
– P(O idle) converges to 1/e ≈ 0.37, so throughput converges to 1 - 1/e ≈ 0.63
• A careful analysis by Karol that avoids the independence assumption across rounds shows that throughput converges to 2 - √2 ≈ 0.586
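As a quick numerical check of the intuitive argument, the sketch below evaluates the throughput 1 - (1 - 1/N)^N for a few switch sizes and records the two limits. The function name `saturation_throughput` is illustrative.

```python
import math

def saturation_throughput(n: int) -> float:
    """Per-output throughput under the independence assumption: 1 - (1 - 1/N)^N."""
    return 1.0 - (1.0 - 1.0 / n) ** n

for n in (2, 4, 16, 256):
    print(n, round(saturation_throughput(n), 4))

LIMIT_INDEPENDENT = 1.0 - 1.0 / math.e      # about 0.632
LIMIT_KAROL = 2.0 - math.sqrt(2.0)          # about 0.586
```

The values fall from 0.75 at N = 2 toward the independent-round limit, which stays above Karol's exact limit because cells blocked at the head of a queue carry over between rounds.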
Head of Line Blocking
If more than one input has a packet destined to the same output, head of line blocking occurs.
Wastes bandwidth significantly.
Input Queueing Scheduling
[Figure: request graph between inputs 1-4 and outputs 1-4 with edge weights 2, 5, 2, 4, 2, 7, and a bipartite matching of weight 18]
Question: Maximum weight or maximum size?

Input Queueing Scheduling
• Maximum Size
– Maximizes instantaneous throughput
– Does it maximize long-term throughput?
– Is it stable for all arrivals?
• Maximum Weight
– Can clear the most backlogged queues
– But does it sacrifice long-term throughput?
– Is it stable for all arrivals?
Maximum Size Matching (MSM)
• MSM maximizes instantaneous throughput
• MSM algorithm: among all matchings, pick one of maximum size
• If there are multiple, pick any at random.
• Stable for uniform arrivals
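[Figure: request graph and the corresponding bipartite maximum size matching; queues labeled Q11(n) through QN1(n)]

Maximum weight matching: Longest Queue First or Oldest Cell First
Weight = Queue Length (Longest Queue First) or Waiting Time (Oldest Cell First)
[Figure: request graph between inputs 1-4 and outputs 1-4 with edge weights 10, 1, 1, 10, 1, 1, and the resulting maximum weight matching; throughput label 100%]

LQF (Longest Queue First)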
• LQF is the name given to the maximum weight matching where the weight is the queue length: w_ij(n) = L_ij(n).
• LQF doesn't necessarily serve the longest queue.
• LQF can leave a short queue unserved indefinitely.
• However, MWM-LQF is very important theoretically: most (if not all) scheduling algorithms that provide 100% throughput for unknown traffic matrices are variants of MWM!
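A small brute-force sketch showing why LQF does not necessarily serve the longest queue. The 2x2 queue-length matrix `L` is a made-up example, and enumerating permutations is only practical for tiny switches.

```python
from itertools import permutations

def max_weight_matching(weights: list[list[int]]) -> tuple[int, list[tuple[int, int]]]:
    """Brute-force MWM: try every input->output permutation (small N only)."""
    n = len(weights)
    best_weight, best_perm = -1, None
    for perm in permutations(range(n)):
        w = sum(weights[i][perm[i]] for i in range(n))
        if w > best_weight:
            best_weight, best_perm = w, perm
    # report only pairs that actually have cells queued
    pairs = [(i, best_perm[i]) for i in range(n) if weights[i][best_perm[i]] > 0]
    return best_weight, pairs

# Hypothetical 2x2 switch: L[i][j] is the length of the queue from input i to output j.
L = [[10, 9],
     [9, 0]]
weight, match = max_weight_matching(L)
```

Here the matching of weight 18 pairs input 0 with output 1 and input 1 with output 0, so the longest queue (length 10, input 0 to output 0) is left unserved in this time slot.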
Complexity of Maximum Matchings
• Maximum Size Matchings:
– Typical complexity O(N^0.5 M) for M edges, or O(N^2.5) when the request graph is dense
– Finding maximum flow through a network flow graph
• Maximum Weight Matchings:
– Typical complexity O(N^3)
– Algorithm by Kuhn (the Hungarian method)
• In general:
– Hard to implement in hardware
– Slooooow.
• Can we find a faster algorithm?
iSLIP [McKeown et al., 1993]
[Figure: two iSLIP iterations (#1, #2) on a 4x4 switch, each with three steps: 1: Requests, 2: Grant, 3: Accept/Match]

iSLIP Operation
• Grant phase: Each output selects the requesting input at the pointer, or the next input in round-robin order. It only updates its pointer if the grant is accepted.
• Accept phase: Each input selects the granting output at the pointer, or the next output in round-robin order.
• Consequence: Under high load, grant pointers tend to move to unique values.
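A minimal sketch of a single iSLIP iteration, assuming each pointer advances to one position past the matched port when a grant is accepted. The bullets above only say that the pointer is updated on acceptance, so the "one past" rule and the names `rr_pick` and `islip_iteration` are assumptions of this sketch.

```python
def rr_pick(candidates: set[int], pointer: int, n: int) -> int:
    """First candidate at or after `pointer` in round-robin order."""
    for k in range(n):
        port = (pointer + k) % n
        if port in candidates:
            return port
    raise ValueError("no candidates")

def islip_iteration(requests, grant_ptr, accept_ptr):
    """One iSLIP iteration on an N x N switch.

    requests[i] is the set of outputs input i has cells for.
    grant_ptr[j] / accept_ptr[i] are the round-robin pointers, updated in place.
    Returns the list of matched (input, output) pairs.
    """
    n = len(requests)
    # Grant phase: each output grants to one requesting input.
    grants = {}                                   # input -> set of granting outputs
    for out in range(n):
        requesters = {i for i in range(n) if out in requests[i]}
        if requesters:
            chosen = rr_pick(requesters, grant_ptr[out], n)
            grants.setdefault(chosen, set()).add(out)
    # Accept phase: each input accepts one grant.
    matches = []
    for inp, outs in grants.items():
        out = rr_pick(outs, accept_ptr[inp], n)
        matches.append((inp, out))
        # Pointers move one past the matched port, and only on acceptance.
        grant_ptr[out] = (inp + 1) % n
        accept_ptr[inp] = (out + 1) % n
    return sorted(matches)
```

Calling the function twice with `requests = [{0, 1}, {0, 1}, {2}]` and all pointers at 0 shows the desynchronization effect: the first call matches only two pairs, after which the grant pointers differ and the second call matches all three inputs.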
Maximal Matches
• Maximal matching algorithms are widely used in industry (especially algorithms based on WFA and iSLIP).
• PIM and iSLIP are rarely run to completion (i.e. they are sub-maximal).
• We will see that a maximal match with a speedup of 2 is stable for non-uniform traffic.
CIOQ emulating OQ switch
[Figure: a combined input-output queued (CIOQ) switch and an output queued (OQ) switch receiving the same arriving cells]
• Emulation: Apply the same inputs, cell-by-cell, to both switches; the order of departing cells should match.
Key concept: Urgency
• Urgency = departure time - current time
• Algorithm: most urgent cell first (MUCF)
• In each phase,
– Outputs request their most urgent cells from inputs
– Inputs grant to the output whose cell is most urgent
• Ties are broken based on port number
– Losing outputs try to obtain their next most urgent cell
– When no more matchings are possible, cells are transferred
Packet Buffers

Works fine if there is only one FIFO
[Figure: a buffer manager in front of a 320B-wide buffer memory. Write rate R: one 40B packet every 8 ns. Read rate R: one 40B packet every 8 ns. Eight 40B packets (bytes 0-39, 40-79, ..., 280-319) are batched into one 320B memory word]

Hybrid Memory Hierarchy
[Figure: arriving packets (rate R) enter a small tail SRAM that caches the FIFO tails; a large DRAM holds the body of the FIFOs for queues 1..Q and is written and read b bytes at a time; a small head SRAM caches the FIFO heads and supplies departing packets (rate R) in response to unpredictable scheduler requests]

Theorem [IKM08]
Impatient Arbiter: An SRAM cache of size Qb(2 + ln Q) bytes is sufficient to guarantee that a byte is always available when requested. The algorithm is called MDQF (Most Deficit Queue First).
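The bound can be evaluated directly. The sketch below computes Qb(2 + ln Q) for linecard parameters such as Q = 128, b = 640 and Q = 512, b = 2560; the function name `mdqf_sram_bytes` is illustrative.

```python
import math

def mdqf_sram_bytes(q: int, b: int) -> float:
    """SRAM size sufficient under the theorem's bound: Q * b * (2 + ln Q) bytes."""
    return q * b * (2.0 + math.log(q))

print(round(mdqf_sram_bytes(q=128, b=640)))    # 40 Gb/s example parameters
print(round(mdqf_sram_bytes(q=512, b=2560)))   # 160 Gb/s example parameters
```

Because the bound grows as Q ln Q, quadrupling both the number of queues and the block size raises the SRAM requirement by well over 16x.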
Examples:
1. 40 Gb/s linecard, b = 640, Q = 128: SRAM = 560 kBytes
2. 160 Gb/s linecard, b = 2560, Q = 512: SRAM = 10 MBytes
[IKM08] Designing Packet Buffers for Router Line Cards, in TON 2008. Please see my webpage for the paper.

Packet Scheduling

The problems caused by FIFO queues in routers
1. In order to maximize its chances of success, a source has an incentive to maximize the rate at which it transmits.
2. (Related to #1) When many flows pass through it, a FIFO queue is "unfair": it favors the most greedy flow.
3. It is hard to control the delay of packets through a network of FIFO queues.
(Problems 1 and 2 concern fairness; problem 3 concerns delay guarantees.)

Max-Min Fairness
A common way to allocate flows
N flows share a link of rate C. Flow f wishes to send at rate W(f), and is allocated rate R(f).
1. Pick the flow, f, with the smallest requested rate.
2. If W(f) ≤ C/N, then set R(f) = W(f).
3. If W(f) > C/N, then set R(f) = C/N.
4. Set N = N - 1. C = C - R(f).
5. If N > 0 goto 1.
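The steps above can be sketched as a short function. The flow names and demands in the usage line are made-up values, and `max_min_allocation` is an illustrative name.

```python
def max_min_allocation(demands: dict[str, float], capacity: float) -> dict[str, float]:
    """Allocate `capacity` among flows following the five steps above."""
    remaining = capacity
    n = len(demands)
    rates = {}
    # Step 1 repeated: visit flows from smallest to largest requested rate.
    for flow, want in sorted(demands.items(), key=lambda kv: kv[1]):
        fair_share = remaining / n
        rates[flow] = min(want, fair_share)     # steps 2 and 3
        remaining -= rates[flow]                # step 4
        n -= 1
    return rates

rates = max_min_allocation({"a": 0.1, "b": 0.5, "c": 10.0, "d": 5.0}, capacity=1.0)
```

With a link of rate 1, flow "a" receives its full request of 0.1 and the three larger flows split the remaining 0.9 equally.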
Deficit Round Robin
• Provides excellent bandwidth guarantees
• One major problem:
– Poor delay bounds
• Implementation complexity
– Need to skip a lot of queues to find the next active queue
– We can use an active list to keep track of the active queues
– However, it can lead to inactive queues not accumulating their fair share.
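A minimal sketch of Deficit Round Robin with an active list, assuming a single quantum shared by all flows and packet sizes in bytes. The class and method names are illustrative.

```python
from collections import deque

class DRRScheduler:
    """Deficit Round Robin over per-flow FIFO queues, using an active list."""

    def __init__(self, quantum: int):
        self.quantum = quantum
        self.queues = {}          # flow id -> deque of packet sizes (bytes)
        self.deficit = {}         # flow id -> deficit counter (bytes)
        self.active = deque()     # flow ids that currently have packets queued

    def enqueue(self, flow, size: int) -> None:
        q = self.queues.setdefault(flow, deque())
        if not q:                             # flow becomes active
            self.deficit[flow] = 0
            self.active.append(flow)
        q.append(size)

    def service_round(self) -> list:
        """Serve the flow at the head of the active list; return sent packets."""
        if not self.active:
            return []
        flow = self.active.popleft()
        q = self.queues[flow]
        self.deficit[flow] += self.quantum
        sent = []
        while q and q[0] <= self.deficit[flow]:
            size = q.popleft()
            self.deficit[flow] -= size
            sent.append((flow, size))
        if q:
            self.active.append(flow)          # still backlogged: back of the list
        else:
            self.deficit[flow] = 0            # idle flows keep no credit
        return sent
```

The active list removes the need to scan empty queues, and resetting the deficit when a queue empties is the reason an inactive queue does not accumulate credit while it is idle.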
How to provide delay guarantees?
• Fair queuing has good delay bounds
• MDRR tries to provide some delay guarantees, but is an ad hoc solution
• Classic way to provide delay bounds is to use the earliest deadline first (EDF) algorithm
– Schedule the packet with the earliest deadline
– Implementation using virtual clock [Zha91]
Virtual clock
• Short-term unfairness.
– Since flow 2 was not using the bandwidth between 0 and 100, it gets to use up a lot of short-term bandwidth
[Figure: Flow 1 (R = 0.5) and Flow 2 (R = 0.5) on time axes from 1 to 100, with packet deadlines labeled 200, 200, and 2]

Scalable fair queuing
• Aggregation
– IP lookups scaled by using only 150,000 prefixes for over 100 million nodes
– Apply aggregation in the context of fair queuing
– Focus on scheduling aggregates instead of individual flows
• Random aggregation
• Edge aggregation
Stochastic fair queuing [McKenney]
• Routers keep state for a fixed number of queues, say 100,000, on which they do DRR
• A packet is hashed on its header fields to map it to one of these queues
• Multiple flows may map to a given queue
– 200,000 flows: on average about 2 flows share the same class
• Problems:
– Flows compete with different flows at different routers
– No explicit differentiation between flows
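A minimal sketch of the hashing step, assuming the flow is identified by the usual 5-tuple and that SHA-256 stands in for whatever hash a router would actually use. The `salt` parameter and the name `queue_for` are hypothetical.

```python
import hashlib

NUM_QUEUES = 100_000   # fixed amount of scheduler state, as in the example above

def queue_for(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
              proto: int, salt: bytes = b"") -> int:
    """Map a flow's 5-tuple to one of NUM_QUEUES DRR queues."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode() + salt
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % NUM_QUEUES
```

All packets of one flow land in the same queue, while two different flows may collide. If each router used its own salt, a flow would collide with different flows at each hop, which matches the first problem listed above.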
Edge aggregation via Diffserv
• Differentiated services (Diffserv) also aggregates flows into classes
• Edge routers mark the packet class by using a standardized value in the IP TOS field.
• Expedited service
– Certain bandwidth reserved for this class
• Assured service
– Lower drop rate for RED in output queues
Active queue management
• Queue Management
– Drop as a way to give feedback to TCP sources
– Part of a closed loop
• Traditional Queue Management
– Drop Tail
– Problems
• Active Queue Management
– RED
– CHOKe
– AFD
Random Early Detection (RED)
For each arriving packet:
1. If AvgQsize ≤ Minth, admit the new packet.
2. Otherwise, if AvgQsize ≤ Maxth, admit the packet with a probability p.
3. Otherwise (AvgQsize > Maxth), drop the new packet.

Extending RED for Flow Isolation
• Problem: what to do with non-cooperative flows?
• Fair queuing achieves isolation using per-flow state, which is expensive at backbone routers
– How can we isolate unresponsive flows without per-flow state?
• RED penalty box
– Monitor the history of packet drops and identify flows that use disproportionate bandwidth
– Isolate and punish those flows
CHOose and Keep for Responsive flows (CHOKe)
For each arriving packet:
1. If AvgQsize ≤ Minth, admit the new packet.
2. Otherwise, draw a packet at random from the queue. If its flow id is the same as the new packet's, drop both matched packets.
3. Otherwise, if AvgQsize > Maxth, drop the new packet.
4. Otherwise, admit the packet with a probability p.

Traffic Shaping and Policing
• Can we add bandwidth guarantees for flows that are placed in the common queues without segregation?
– E.g., an ISP wants to restrict NEWS traffic to 1 Mbps
– UDP traffic restricted to some value.
• Token bucket policing/shaping
– Uses a single queue
– One counter per flow
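A minimal sketch of a token bucket policer, assuming a token rate `rho` in bytes per second, a bucket depth `sigma` in bytes, and timestamps supplied by the caller. The class and parameter names are illustrative.

```python
class TokenBucket:
    """Token bucket policer: rate rho (bytes/s), bucket depth sigma (bytes)."""

    def __init__(self, rho: float, sigma: float, now: float = 0.0):
        self.rho = rho
        self.sigma = sigma
        self.tokens = sigma       # the one counter kept per flow; starts full
        self.last = now

    def conforms(self, size: int, now: float) -> bool:
        """Return True and charge the bucket if a `size`-byte packet conforms."""
        # Refill for the elapsed time, never beyond the bucket depth.
        self.tokens = min(self.sigma, self.tokens + (now - self.last) * self.rho)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False              # a policer drops or marks; a shaper would delay
```

Only the counter and a timestamp are stored per flow, so packets from all flows can share a single queue while each flow is still limited to its own rate and burst size.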
How the user/flow can conform to the (σ,ρ) regulation
Leaky bucket as a "shaper"
[Figure: variable bit-rate compression feeds a token bucket of size σ that receives tokens at rate ρ; the shaped output goes to the network over a link of rate C; three plots of bytes versus time show the traffic at each stage]