Filename: xxx-microdescriptors.txt
Title: Clients download consensus + microdescriptors
Version: $Revision$
Last-Modified: $Date$
Author: Roger Dingledine
Created: 17-Jan-2009
Status: Open

1. Overview

  This proposal replaces section 3.2 of proposal 141, called "Fetching
  descriptors on demand". Rather than modifying the circuit-building
  protocol to fetch a server descriptor inline at each circuit extend,
  we instead put all of the information that clients need either into
  the consensus itself, or into a new set of data about each relay
  called a microdescriptor. The goal is that descriptor elements that
  are small and frequently changing should go in the consensus itself,
  descriptor elements that are small and relatively static should go in
  the microdescriptor, and if we ever end up with descriptor elements
  that aren't small yet clients need to know them, we'll need to resume
  considering some design like the one in proposal 141.

2. Motivation

  See
  http://archives.seul.org/or/dev/Nov-2008/msg00000.html and
  http://archives.seul.org/or/dev/Nov-2008/msg00001.html and especially
  http://archives.seul.org/or/dev/Nov-2008/msg00007.html
  for a discussion of the options and why this is currently the best
  approach.

3. Design

  There are three pieces to the proposal. First, authorities will list
  in their votes (and thus in the consensus) what relay elements are
  included in the microdescriptor, and also list the expected hash of
  the microdescriptor for each relay. Second, directory mirrors will
  serve microdescriptors. Third, clients will ask for them and then
  cache them.

3.1. Consensus changes

  V3 votes should include a new line:
    microdescriptor-elements bar baz foo

  We also need to include the hash of each expected microdescriptor in
  the routerstatus section. I suggest a new "m" line for each stanza,
  with the base64 of the hash of the elements that the authority voted
  for above.

  The consensus microdescriptor-elements and "m" lines are then computed
  as described in Section 3.1.2 below.

  I believe that means we need a new consensus-method "6" that knows
  how to compute the microdescriptor-elements and add "m" lines.

3.1.1. Descriptor elements to include for now

  To start, the element list that authorities suggest should be
    family onion-key

  (Note that the or-dev posts above only mention onion-key, but if
  we don't also include family then clients will never learn it. It
  seemed like it should be relatively static, so putting it in the
  microdescriptor is smarter than trying to fit it into the consensus.)

3.1.2. Computing consensus for microdescriptor-elements and "m" lines

  One approach is for the consensus microdescriptor-elements line to
  include all elements listed by a majority of authorities, sorted. The
  problem here is that the votes no longer determine the correct hash
  for the "m" line: the majority set may not match the set that any
  single authority hashed over. We could imagine telling the authority
  to go look in its descriptor and produce the right hash itself, but
  we don't want consensus calculation to be based on external data like
  that. (Plus, the authority may not have the descriptor that everybody
  else voted to use.)

  The better approach is to take the exact set that has the most votes
  (breaking ties by the set that has the most elements, and breaking
  ties after that by whichever is alphabetically first). That will
  increase the odds that we actually get a microdescriptor hash that
  is both a) for the descriptor we're putting in the consensus, and b)
  over the elements that we're declaring it should be for.
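
  The selection rule above can be sketched roughly as follows. This is
  only an illustration: the function name and the representation of a
  vote as a list of element keywords are my assumptions, and I read
  "alphabetically first" as an ordinary lexicographic comparison of the
  sorted keyword lists.

```python
from collections import Counter


def choose_element_set(votes):
    """Pick the consensus microdescriptor-elements set.

    `votes` holds one list of element keywords per authority
    (hypothetical representation, not a wire format).
    """
    if not votes:
        return None
    # Normalize each vote to a sorted tuple so equal sets compare equal.
    counts = Counter(tuple(sorted(set(v))) for v in votes)
    # Most votes first, then most elements, then alphabetically first.
    return min(counts, key=lambda s: (-counts[s], -len(s), s))


votes = [["family", "onion-key"], ["onion-key", "family"], ["onion-key"]]
print(choose_element_set(votes))
```

  Because each authority's whole set is counted as one unit, the winner
  is always a set that at least one authority actually hashed over.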

  Then the "m" line for a given relay is the one that gets the most votes
  from authorities that a) voted for the microdescriptor-elements line
  we're using, and b) voted for the descriptor we're using.

  (If there's a tie, use the smaller hash. But really, if there are
  multiple such votes and they differ about a microdescriptor, we caught
  one of them lying or being buggy. We should log it to track down why.)

  If there are no such votes, then we leave out the "m" line for that
  relay. That means clients should avoid it for this time period. (As
  an extension it could instead mean that clients should fetch the
  descriptor and figure out its microdescriptor themselves. But let's
  not get ahead of ourselves.)
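
  Here is a rough sketch of how the "m" line for one relay could be
  chosen under the rules above. The tuple layout of a vote, the function
  name, and the log message are all hypothetical, and "smaller hash" is
  taken here to mean plain string comparison of the encoded hashes,
  which the proposal doesn't pin down.

```python
import logging
from collections import Counter


def choose_m_line(votes, consensus_elements, consensus_descriptor):
    """Return the consensus "m" hash for one relay, or None to omit it.

    `votes` holds one (elements, descriptor_digest, m_hash) tuple per
    authority that listed this relay (hypothetical representation).
    """
    wanted = tuple(sorted(consensus_elements))
    # Only count authorities that voted for both the element set and
    # the descriptor that the consensus is using.
    eligible = [m for elements, desc, m in votes
                if tuple(sorted(elements)) == wanted
                and desc == consensus_descriptor]
    if not eligible:
        return None  # no "m" line for this relay
    counts = Counter(eligible)
    if len(counts) > 1:
        # Same elements, same descriptor, different hash: lying or buggy.
        logging.warning("conflicting microdescriptor hashes: %s",
                        sorted(counts))
    # Most votes first; break ties with the smaller hash.
    return min(counts, key=lambda m: (-counts[m], m))
```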

  It would be nice to have a more foolproof way to agree on what
  microdescriptor hash each authority should vote for, so we can avoid
  missing "m" lines. Just switching to a new consensus-method each time
  we change the set of microdescriptor-elements won't help though, since
  each authority will still have to decide what hash to vote for before
  knowing what consensus-method will be used.

  Here's one way we could do that. Each vote / consensus includes both
  the microdescriptor-elements that were used to compute the hashes,
  and also a preferred-microdescriptor-elements set. If an authority
  has a consensus from the previous period, then it should use the
  consensus preferred-microdescriptor-elements when computing its votes
  for microdescriptor-elements and the appropriate hashes in the upcoming
  period. (If it has no previous consensus, then it just puts down its
  own preferences in both lines.)
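
  As a small sketch of that idea (the function and parameter names are
  made up for illustration), an authority preparing its next vote would
  do something like:

```python
def elements_for_next_vote(own_preferred, previous_consensus_preferred=None):
    """Return (microdescriptor_elements, preferred_elements) for a vote.

    `previous_consensus_preferred` is the
    preferred-microdescriptor-elements set from the previous period's
    consensus, or None if the authority has no previous consensus.
    """
    preferred = sorted(set(own_preferred))
    if previous_consensus_preferred is None:
        # No previous consensus: own preferences go in both lines.
        return preferred, preferred
    # Hash over what the last consensus preferred, and keep voting our
    # own preference for the period after that.
    return sorted(set(previous_consensus_preferred)), preferred


print(elements_for_next_vote(["onion-key", "family", "bar"],
                             ["family", "onion-key"]))
```

  The point of the delay is that every authority with the previous
  consensus computes its hashes over the same element set.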

3.2. Directory mirrors serve microdescriptors

  Directory mirrors should then read the microdescriptor-elements line
  from the consensus, and learn how to answer requests.

  The microdescriptors with hashes <D1>,<D2>,<D3> should be available at:
    http://<hostname>/tor/micro/d/<D1>+<D2>+<D3>.z

  All the microdescriptors from the current consensus should also be
  available at:
    http://<hostname>/tor/micro/all.z
  so a client that's bootstrapping doesn't need to send a 70KB URL just
  to name every microdescriptor it's looking for.
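
  For illustration, a client could build the two request URLs like
  this. The helper name is hypothetical. The sketch treats each hash as
  an opaque string that is already safe to put in a URL; the proposal
  doesn't say how hashes are encoded there, and standard base64 can
  itself contain "+" and "/".

```python
def micro_fetch_url(hostname, hashes=None):
    """Build the URL for fetching microdescriptors from a mirror.

    With no hashes, ask for everything in the current consensus.
    """
    if not hashes:
        return f"http://{hostname}/tor/micro/all.z"
    joined = "+".join(hashes)
    return f"http://{hostname}/tor/micro/d/{joined}.z"


print(micro_fetch_url("mirror.example", ["D1", "D2", "D3"]))
print(micro_fetch_url("mirror.example"))
```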

  The format of a microdescriptor is the header line
  "microdescriptor 1"
  followed by each element (keyword and body), in alphabetical order
  by keyword. There's no need to mention what hash it is, since you can
  hash the elements to learn this.

  (Do we need a footer line to show that it's over, or is the next
  microdescriptor line or EOF enough of a hint? A footer line wouldn't
  hurt much. Also, no fair voting for the microdescriptor-element
  "microdescriptor".)

  The hash of the microdescriptor is simply the hash of the concatenated
  elements -- not counting the header line or hypothetical footer line.
  Is this smart?

  Note that I put a "1" up there in the header line. It isn't part
  of what's hashed, though. Is there a way to put in a version that's
  more useful?
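
  To make the format and the hash concrete, here is a minimal sketch.
  Several things in it are assumptions rather than part of this
  proposal: the hash function (SHA-1 is used only as a stand-in, since
  none is named above), the dict representation of a relay's elements,
  and the function names. The base64 encoding follows the "m" line
  description in Section 3.1.

```python
import base64
import hashlib

HEADER = "microdescriptor 1\n"


def element_body(elements):
    """Concatenate elements in alphabetical order by keyword.

    `elements` maps keyword -> full element text (keyword line plus
    body, newline-terminated); a hypothetical representation.
    """
    return "".join(elements[k] for k in sorted(elements))


def format_microdescriptor(elements):
    # The header line is served but is not part of what gets hashed.
    return HEADER + element_body(elements)


def microdescriptor_hash(elements):
    # ASSUMPTION: SHA-1 stands in for the unspecified hash function.
    digest = hashlib.sha1(element_body(elements).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


elements = {"onion-key": "onion-key\nKEYDATA\n", "family": "family $A $B\n"}
print(format_microdescriptor(elements), end="")
print(microdescriptor_hash(elements))
```

  A mirror or client can use the same hash function to check a
  microdescriptor it is about to serve or has just fetched.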

  Directory mirrors should check to make sure that the microdescriptors
  they're about to serve match the right hashes (the hashes named in the
  fetch URL for a "d" request, or the hashes listed in the consensus
  for an "all" request).

  We will probably want to consider some sort of smart data structure to
  be able to quickly convert microdescriptor hashes into the appropriate
  microdescriptor. Clients will want this anyway when they load their
  microdescriptor cache and want to match it up with the consensus to
  see what's missing.

3.3. Clients fetch them and cache them

  When a client gets a new consensus, it looks to see if there are any
  microdescriptors it needs to learn. If it needs to learn more than
  some threshold of the microdescriptors (half?), it requests 'all',
  else it requests only the missing ones.
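
  A sketch of that decision, assuming the cache is a plain mapping keyed
  by microdescriptor hash (a hypothetical layout) and taking the
  tentative "half" as the default threshold:

```python
def plan_fetch(consensus_hashes, cache, threshold=0.5):
    """Decide what to request after getting a new consensus.

    Returns "all" if more than `threshold` of the microdescriptors are
    missing, otherwise the list of missing hashes (possibly empty).
    """
    # De-duplicate while keeping consensus order.
    wanted = list(dict.fromkeys(consensus_hashes))
    missing = [h for h in wanted if h not in cache]
    if len(missing) > threshold * len(wanted):
        return "all"
    return missing


cache = {"a": "...", "b": "..."}
print(plan_fetch(["a", "b", "c"], cache))
print(plan_fetch(["a", "c", "d"], cache))
```

  A dict lookup per hash is probably enough of a "smart data structure"
  for matching the cache against the consensus.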

  Clients maintain a cache of microdescriptors along with metadata like
  when each one was last referenced by a consensus. They keep a
  microdescriptor until it hasn't been mentioned in any consensus for
  a week.
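
  The cache policy could look roughly like the following. The class and
  method names are made up, and storing a microdescriptor is treated as
  a reference at that moment, which the text above doesn't spell out.

```python
import time

ONE_WEEK = 7 * 24 * 60 * 60  # seconds


class MicrodescCache:
    """Microdescriptors keyed by hash, with a last-referenced time."""

    def __init__(self):
        self._entries = {}  # hash -> [text, last_referenced]

    def __contains__(self, md_hash):
        return md_hash in self._entries

    def store(self, md_hash, text, now=None):
        now = time.time() if now is None else now
        self._entries[md_hash] = [text, now]

    def note_consensus(self, consensus_hashes, now=None):
        """Refresh every cached entry that a new consensus mentions."""
        now = time.time() if now is None else now
        for h in consensus_hashes:
            if h in self._entries:
                self._entries[h][1] = now

    def expire(self, now=None):
        """Drop entries no consensus has mentioned for a week."""
        now = time.time() if now is None else now
        stale = [h for h, (_, last) in self._entries.items()
                 if now - last >= ONE_WEEK]
        for h in stale:
            del self._entries[h]
        return stale
```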

3.3.1. Information leaks from clients

  If a client asks you for a set of microdescs, then you know she didn't
  have them cached before. How much does that leak? What about when
  we're all using our entry guards as directory guards, and we've seen
  that user make a bunch of circuits already?

  Fetching "all" when you need at least half is a good first-order fix,
  but might not be all there is to it.

4. Transition and deployment

  Phase one, the directory authorities should start voting on
  microdescriptors and microdescriptor elements, and putting them in the
  consensus. This should happen during the 0.2.1.x series, and should
  be relatively easy to do.

  Phase two, directory mirrors should learn how to serve them, and learn
  how to read the consensus to find out what they should be serving. It
  would be great if we can squeeze this in during 0.2.1.x also, so once
  clients start to fetch them there will be many mirrors to choose from.

  (Are there reasonable ways to build only part of phase two in 0.2.1.x?)

  Phase three, clients should start fetching and caching them instead
  of normal descriptors. This should happen post 0.2.1.x.
some notes on how exactly to do this microdescriptor thing. svn:r18163 2009-01-18 10:51:09 +01:00
			`Title: Clients download consensus + microdescriptors`
			`Version: $Revision$`
			`Last-Modified: $Date$`
			`Author: Roger Dingledine`
			`Created: 17-Jan-2009`
			`Status: Open`

			`1. Overview`

			`This proposal replaces section 3.2 of proposal 141, called "Fetching`
			`descriptors on demand". Rather than modifying the circuit-building`
			`protocol to fetch a server descriptor inline at each circuit extend,`
			`we instead put all of the information that clients need either into`
			`the consensus itself, or into a new set of data about each relay`
			`called a microdescriptor. The goal is that descriptor elements that`
			`are small and frequently changing should go in the consensus itself,`
			`descriptor elements that are small and relatively static should go in`
			`the microdescriptor, and if we ever end up with descriptor elements`
			`that aren't small yet clients need to know them, we'll need to resume`
			`considering some design like the one in proposal 141.`

			`2. Motivation`

			`See`
			`http://archives.seul.org/or/dev/Nov-2008/msg00000.html and`
			`http://archives.seul.org/or/dev/Nov-2008/msg00001.html and especially`
			`http://archives.seul.org/or/dev/Nov-2008/msg00007.html`
			`for a discussion of the options and why this is currently the best`
			`approach.`

			`3. Design`

			`There are three pieces to the proposal. First, authorities will list`
			`in their votes (and thus in the consensus) what relay elements are`
			`included in the microdescriptor, and also list the expected hash of`
			`microdescriptor for each relay. Second, directory mirrors will serve`
			`microdescriptors. Third, clients will ask for them and then cache them.`

			`3.1. Consensus changes`

			`V3 votes should include a new line:`
			`microdescriptor-elements bar baz foo`

			`We also need to include the hash of each expected microdescriptor in`
			`the routerstatus section. I suggest a new "m" line for each stanza,`
			`with the base64 of the hash of the elements that the authority voted`
			`for above.`

			`The consensus microdescriptor-elements and "m" lines are then computed`
			`as described in Section 3.1.2 below.`

			`I believe that means we need a new consensus-method "6" that knows`
			`how to compute the microdescriptor-elements and add "m" lines.`

			`3.1.1. Descriptor elements to include for now`

			`To start, the element list that authorities suggest should be`
			`family onion-key`

			`(Note that the or-dev posts above only mention onion-key, but if`
			`we don't also include family then clients will never learn it. It`
			`seemed like it should be relatively static, so putting it in the`
			`microdescriptor is smarter than trying to fit it into the consensus.)`

			`3.1.2. Computing consensus for microdescriptor-elements and "m" lines`

			`One approach is for the consensus microdescriptor-elements line to`
			`include all elements listed by a majority of authorities, sorted. The`
			`problem here is that it will no longer be deterministic what the correct`
			`hash for the "m" line should be. We could imagine telling the authority`
			`to go look in its descriptor and produce the right hash itself, but`
			`we don't want consensus calculation to be based on external data like`
			`that. (Plus, the authority may not have the descriptor that everybody`
			`else voted to use.)`

			`The better approach is to take the exact set that has the most votes`
			`(breaking ties by the set that has the most elements, and breaking`
			`ties after that by whichever is alphabetically first). That will`
			`increase the odds that we actually get a microdescriptor hash that`
			`is both a) for the descriptor we're putting in the consensus, and b)`
			`over the elements that we're declaring it should be for.`

			`Then the "m" line for a given relay is the one that gets the most votes`
			`from authorities that a) voted for the microdescriptor-elements line`
			`we're using, and b) voted for the descriptor we're using.`

			`(If there's a tie, use the smaller hash. But really, if there are`
			`multiple such votes and they differ about a microdescriptor, we caught`
			`one of them being lying or buggy. We should log it to track down why.)`

			`If there are no such votes, then we leave out the "m" line for that`
			`relay. That means clients should avoid it for this time period. (As`
			`an extension it could instead mean that clients should fetch the`
			`descriptor and figure out its microdescriptor themselves. But let's`
			`not get ahead of ourselves.)`

			`It would be nice to have a more foolproof way to agree on what`
			`microdescriptor hash each authority should vote for, so we can avoid`
			`missing "m" lines. Just switching to a new consensus-method each time`
			`we change the set of microdescriptor-elements won't help though, since`
			`each authority will still have to decide what hash to vote for before`
			`knowing what consensus-method will be used.`

			`Here's one way we could do that. Each vote / consensus includes both`
			`the microdescriptor-elements that were used to compute the hashes,`
			`and also a preferred-microdescriptor-elements set. If an authority`
			`has a consensus from the previous period, then it should use the`
			`consensus preferred-microdescriptor-elements when computing its votes`
			`for microdescriptor-elements and the appropriate hashes in the upcoming`
			`period. (If it has no previous consensus, then it just puts down its`
			`own preferences in both lines.)`

			`3.2. Directory mirrors serve microdescriptors`

			`Directory mirrors should then read the microdescriptor-elements line`
			`from the consensus, and learn how to answer requests.`

			`The microdescriptors with hashes <D1>,<D2>,<D3> should be available at:`
			`http://<hostname>/tor/micro/d/<D1>+<D2>+<D3>.z`

			`All the microdescriptors from the current consensus should also be`
			`available at:`
			`http://<hostname>/tor/micro/all.z`
			`so a client that's bootstrapping doesn't need to send a 70KB URL just`
			`to name every microdescriptor it's looking for.`

			`The format of a microdescriptor is the header line`
			`"microdescriptor 1"`
			`followed by each element (keyword and body), alphabetically. There's`
			`no need to mention what hash it is, since you can hash the elements`
			`to learn this.`

			`(Do we need a footer line to show that it's over, or is the next`
			`microdescriptor line or EOF enough of a hint? A footer line wouldn't`
			`hurt much. Also, no fair voting for the microdescriptor-element`
			`"microdescriptor".)`

			`The hash of the microdescriptor is simply the hash of the concatenated`
			`elements -- not counting the header line or hypothetical footer line.`
			`Is this smart?`

			`Note that I put a "1" up there in the header line. It isn't part`
			`of what's hashed, though. Is there a way to put in a version that's`
			`more useful?`

			`Directory mirrors should check to make sure that the microdescriptors`
			`they're about to serve match the right hashes (either the hashes from`
			`the fetch URL or the hashes from the consensus, respectively).`

			`We will probably want to consider some sort of smart data structure to`
			`be able to quickly convert microdescriptor hashes into the appropriate`
			`microdescriptor. Clients will want this anyway when they load their`
			`microdescriptor cache and want to match it up with the consensus to`
			`see what's missing.`

			`3.3. Clients fetch them and cache them`

			`When a client gets a new consensus, it looks to see if there are any`
			`microdescriptors it needs to learn. If it needs to learn more than`
			`some threshold of the microdescriptors (half?), it requests 'all',`
			`else it requests only the missing ones.`

			`Clients maintain a cache of microdescriptors along with metadata like`
			`when it was last referenced by a consensus. They keep a microdescriptor`
			`until it hasn't been mentioned in any consensus for a week.`

			`3.3.1. Information leaks from clients`

			`If a client asks you for a set of microdescs, then you know she didn't`
			`have them cached before. How much does that leak? What about when`
			`we're all using our entry guards as directory guards, and we've seen`
			`that user make a bunch of circuits already?`

			`Fetching "all" when you need at least half is a good first order fix,`
			`but might not be all there is to it.`

			`4. Transition and deployment`

			`Phase one, the directory authorities should start voting on`
			`microdescriptors and microdescriptor elements, and putting them in the`
			`consensus. This should happen during the 0.2.1.x series, and should`
			`be relatively easy to do.`

			`Phase two, directory mirrors should learn how to serve them, and learn`
			`how to read the consensus to find out what they should be serving. It`
			`would be great if we can squeeze this in during 0.2.1.x also, so once`
			`clients start to fetch them there will be many mirrors to choose from.`

			`(Are there reasonable ways to build only part of phase two in 0.2.1.x?)`

			`Phase three, clients should start fetching and caching them instead`
			`of normal descriptors. This should happen post 0.2.1.x.`