Filename: 108-mtbf-based-stability.txt
Title: Base "Stable" Flag on Mean Time Between Failures
Version: $Revision$
Last-Modified: $Date$
Author: Nick Mathewson
Created: 10-Mar-2007
Status: Open

Overview:

   This document proposes that directory authorities set the Stable flag
   based on their own observed mean time between failures (MTBF) for a
   router, rather than on the router's declared uptime.

Motivation:

   Clients prefer nodes that the authorities call Stable.  This flag is (as
   of 0.2.0.0-alpha-dev) set entirely based on the node's declared value for
   uptime.  This creates an opportunity for malicious nodes to declare
   falsely high uptimes in order to get more traffic.

Spec changes:

   Replace the current rule for setting the Stable flag with:

   "Stable" -- A router is 'Stable' if it is active and its observed Stability
   for the past month is at or above the median Stability for active routers.
   Routers are never called stable if they are running a version of Tor
   known to drop circuits stupidly. (0.1.1.10-alpha through 0.1.1.16-rc
   are stupid this way.)
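
   As a rough illustration of the rule above, here is a minimal sketch
   that takes each active router's Stability value as given.  The names
   stable_routers, stability, and bad_version are hypothetical; the set
   of routers running a circuit-dropping version is assumed to be
   computed elsewhere.

```python
from statistics import median
from typing import Dict, Set


def stable_routers(stability: Dict[str, float], bad_version: Set[str]) -> Set[str]:
    """Return the routers that would be called Stable.

    stability maps each active router's identity to its observed Stability.
    bad_version holds routers running a version known to drop circuits
    (hypothetical input; version parsing is not shown here).
    """
    if not stability:
        return set()
    # The median is taken over all active routers, including bad-version ones.
    cutoff = median(stability.values())
    return {r for r, s in stability.items() if s >= cutoff and r not in bad_version}


example = {"A": 2.0, "B": 6.0, "C": 10.0, "D": 30.0}
print(sorted(stable_routers(example, bad_version={"D"})))  # median is 8.0
```

   Here C and D are at or above the median, but D is excluded because of
   its version, so only C is called Stable.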

   Stability shall be defined as the mean length of the runs observed by a
   given directory authority.  A run begins when an authority decides
   that the server is Running, and ends when the authority decides that
   the server is not Running.  In-progress runs are counted when
   measuring Stability.
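
   The definition above could be computed roughly as follows.  This is
   only a sketch: the Run type and mean_run_length are hypothetical
   names, times are assumed to be seconds, and no attempt is made to
   clip runs to the past month, since how to do that is still an open
   question (see Issues).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Run:
    start: float                 # when the authority decided the router was Running
    end: Optional[float] = None  # when it decided otherwise; None if still in progress


def mean_run_length(runs: List[Run], now: float) -> float:
    """Mean length of observed runs; an in-progress run is counted up to `now`."""
    if not runs:
        return 0.0
    lengths = [(r.end if r.end is not None else now) - r.start for r in runs]
    return sum(lengths) / len(lengths)


DAY = 86400.0
runs = [Run(0 * DAY, 10 * DAY), Run(11 * DAY, 15 * DAY), Run(16 * DAY)]
# Runs of 10, 4, and (so far) 4 days give a mean of 6 days.
print(mean_run_length(runs, now=20 * DAY) / DAY)
```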

Issues:

   How do you define a clipped MTBF?  If the current month begins with one
   day at the end of a one-year uptime, and then has 29 days of uptime, do we
   average one day and 29 days?  Or do we average one year and 29 days?  Or
   take 29 days on its own and discard the year?
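
   To see how far apart these three answers are, here is the arithmetic
   for the example above, assuming a 365-day year.  The variable names
   are ours and not part of the proposal.

```python
# Run lengths in days for the example in the text (365-day year assumed).
FULL_OLD_RUN = 365.0    # the one-year uptime, most of it before this month
CLIPPED_OLD_RUN = 1.0   # the part of that run that falls inside this month
NEW_RUN = 29.0          # the run that fills the rest of the month

clip_to_window = (CLIPPED_OLD_RUN + NEW_RUN) / 2  # average one day and 29 days
keep_full_run = (FULL_OLD_RUN + NEW_RUN) / 2      # average one year and 29 days
discard_clipped = NEW_RUN                         # 29 days on its own

print(clip_to_window, keep_full_run, discard_clipped)  # 15.0 197.0 29.0
```

   The same history yields 15, 197, or 29 days depending on the choice,
   so the clipping rule matters a great deal when comparing against the
   median.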

   Surely somebody has done this kind of thing before.

Alternative:

   "A router's Stability shall be defined as the sum of $\alpha ^ d$ for every
   $d$ such that the router was not observed to be unavailable $d$ days ago."

   This allows a simpler implementation: every day, we multiply
   yesterday's Stability by alpha, and if the router was observed to be
   available every time we looked today, we add 1.

   Instead of "day", we could pick an arbitrary time unit.  We should
   pick alpha to be high enough that long-term stability counts, but low
   enough that the distant past is eventually forgotten.  Something
   between .8 and .95 seems right.
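
   The daily update described above can be sketched as follows.  The
   function names and the default alpha of 0.9 are illustrative
   assumptions, not part of the proposal.

```python
from typing import Iterable


def update_stability(previous: float, up_all_day: bool, alpha: float = 0.9) -> float:
    """One daily step: decay yesterday's value, then add 1 for a fully-up day."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    return previous * alpha + (1.0 if up_all_day else 0.0)


def stability_from_history(history: Iterable[bool], alpha: float = 0.9) -> float:
    """history lists days oldest first; True means no downtime was observed."""
    value = 0.0
    for up_all_day in history:
        value = update_stability(value, up_all_day, alpha)
    return value


# Up, down, up with alpha = 0.5: alpha^0 + alpha^2 = 1 + 0.25.
print(stability_from_history([True, False, True], alpha=0.5))  # 1.25
```

   After n consecutive good days the value is (1 - alpha^n)/(1 - alpha),
   so it can never exceed 1/(1 - alpha): 5 for alpha = .8 and 20 for
   alpha = .95.  That bound gives a feel for how many days of history
   effectively count.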

   (By requiring that routers be up for an entire day to get their
   stability increased, instead of counting fractions of a day, we
   capture the notion that stability is more like "probability of
   staying up for the next hour" than it is like "probability of being
   up at some randomly chosen time over the next hour."  The former
   notion of stability is far more relevant for long-lived circuits.)

Limitations:

   Authorities can have false positives and false negatives when trying to
   tell whether a router is up or down.  So long as these aren't terribly
   wrong, and so long as they aren't significantly biased, we should be able
   to use them to estimate stability pretty well.

   Probing approaches like the above could miss short incidents of
   downtime.  If we used the router's declared uptime, we could detect
   these, but doing so would penalize routers that report their uptime
   accurately.

Implementation:

   For now, the easiest way to store this information at authorities
   would probably be in some kind of periodically flushed flat file.
   Later, we could move to Berkeley DB or something if we really had to.