mirror of
https://gitlab.torproject.org/tpo/core/tor.git
synced 2024-11-30 23:53:32 +01:00
55c3619c23
svn:r15905
91 lines
3.7 KiB
Plaintext
91 lines
3.7 KiB
Plaintext
Filename: 108-mtbf-based-stability.txt
|
|
Title: Base "Stable" Flag on Mean Time Between Failures
|
|
Version: $Revision$
|
|
Last-Modified: $Date$
|
|
Author: Nick Mathewson
|
|
Created: 10-Mar-2007
|
|
Status: Closed
|
|
Implemented-In: 0.2.0.x
|
|
|
|
Overview:
|
|
|
|
This document proposes that we change how directory authorities set the
|
|
stability flag from inspection of a router's declared Uptime to the
|
|
authorities' perceived mean time between failure for the router.
|
|
|
|
Motivation:
|
|
|
|
Clients prefer nodes that the authorities call Stable. This flag is (as
|
|
of 0.2.0.0-alpha-dev) set entirely based on the node's declared value for
|
|
uptime. This creates an opportunity for malicious nodes to declare
|
|
falsely high uptimes in order to get more traffic.
|
|
|
|
Spec changes:
|
|
|
|
Replace the current rule for setting the Stable flag with:
|
|
|
|
"Stable" -- A router is 'Stable' if it is active and its observed Stability
|
|
for the past month is at or above the median Stability for active routers.
|
|
Routers are never called stable if they are running a version of Tor
|
|
known to drop circuits stupidly. (0.1.1.10-alpha through 0.1.1.16-rc
|
|
are stupid this way.)
|
|
|
|
Stability shall be defined as the weighted mean length of the runs
|
|
observed by a given directory authority. A run begins when an authority
|
|
decides that the server is Running, and ends when the authority decides
|
|
that the server is not Running. In-progress runs are counted when
|
|
measuring Stability. When calculating the mean, runs are weighted by
|
|
$\alpha ^ t$, where $t$ is time elapsed since the end of the run, and
|
|
$0 < \alpha < 1$. Time when an authority is down do not count to the
|
|
length of the run.
|
|
|
|
Rejected Alternative:
|
|
|
|
"A router's Stability shall be defined as the sum of $\alpha ^ d$ for every
|
|
$d$ such that the router was considered reachable for the entire day
|
|
$d$ days ago.
|
|
|
|
This allows a simpler implementation: every day, we multiply
|
|
yesterday's Stability by alpha, and if the router was observed to be
|
|
available every time we looked today, we add 1.
|
|
|
|
Instead of "day", we could pick an arbitrary time unit. We should
|
|
pick alpha to be high enough that long-term stability counts, but low
|
|
enough that the distant past is eventually forgotten. Something
|
|
between .8 and .95 seems right.
|
|
|
|
(By requiring that routers be up for an entire day to get their
|
|
stability increased, instead of counting fractions of a day, we
|
|
capture the notion that stability is more like "probability of
|
|
staying up for the next hour" than it is like "probability of being
|
|
up at some randomly chosen time over the next hour." The former
|
|
notion of stability is far more relevant for long-lived circuits.)
|
|
|
|
Limitations:
|
|
|
|
Authorities can have false positives and false negatives when trying to
|
|
tell whether a router is up or down. So long as these aren't terribly
|
|
wrong, and so long as they aren't significantly biased, we should be able
|
|
to use them to estimate stability pretty well.
|
|
|
|
Probing approaches like the above could miss short incidents of
|
|
downtime. If we use the router's declared uptime, we could detect
|
|
these: but doing so would penalize routers who reported their uptime
|
|
accurately.
|
|
|
|
Implementation:
|
|
|
|
For now, the easiest way to store this information at authorities
|
|
would probably be in some kind of periodically flushed flat file.
|
|
Later, we could move to Berkeley db or something if we really had to.
|
|
|
|
For each router, an authority will need to store:
|
|
The router ID.
|
|
Whether the router is up.
|
|
The time when the current run started, if the router is up.
|
|
The weighted sum length of all previous runs.
|
|
The time at which the weighted sum length was last weighted down.
|
|
|
|
Servers should probe at random intervals to test whether servers are
|
|
running.
|