From df183bb75e688b5e86b5433f8734fc101f76dbad Mon Sep 17 00:00:00 2001
From: Roger Dingledine <arma@torproject.org>
Date: Thu, 9 Nov 2006 08:53:13 +0000
Subject: [PATCH] that's your plan, ray? get her? more work on the discovery
 section.

svn:r8923
---
 doc/design-paper/blocking.tex | 308 ++++++++++++++++++----------------
 1 file changed, 167 insertions(+), 141 deletions(-)
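
Reviewer note (not part of the patch): the bucket allocation and the
IP-based distribution policy added in this change can be sketched as
below. This is only an illustration under stated assumptions, not the
bridge authority's implementation. The text says only "hash ... along
with a secret", so HMAC-SHA256 is an assumed choice of keyed hash, and
the names bucket_for_bridge, partition_for_requestor, n_bits and
n_partitions are hypothetical. Only IPv4 requestors are handled.

```python
import hashlib
import hmac
import ipaddress


def bucket_for_bridge(identity_key: bytes, secret: bytes, n_bits: int) -> int:
    """Return the bucket number: the first n_bits of a keyed hash.

    Assumption: HMAC-SHA256 stands in for "hash the identity key along
    with a secret that only the bridge authority knows".
    """
    if not 0 <= n_bits <= 256:
        raise ValueError("n_bits must be between 0 and 256")
    digest = hmac.new(secret, identity_key, hashlib.sha256).digest()
    # Read the digest as a big-endian integer and keep its top n_bits.
    return int.from_bytes(digest, "big") >> (256 - n_bits)


def partition_for_requestor(ip: str, secret: bytes, n_partitions: int) -> int:
    """Map an IPv4 requestor to a partition, ignoring the last octet.

    Dropping the last octet means every address in one /24 lands in the
    same partition, so such a network counts as a single user.
    """
    if n_partitions < 1:
        raise ValueError("n_partitions must be at least 1")
    prefix = ipaddress.IPv4Address(ip).packed[:3]
    digest = hmac.new(secret, prefix, hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % n_partitions
```

Because the bucket is a prefix of the hash, raising n_bits from 3 to 4
splits each of the 8 buckets in two without moving a bridge between
unrelated buckets: bucket_for_bridge(k, s, 4) >> 1 always equals
bucket_for_bridge(k, s, 3).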

diff --git a/doc/design-paper/blocking.tex b/doc/design-paper/blocking.tex
index 19486ffbf0..32c49e8cb1 100644
--- a/doc/design-paper/blocking.tex
+++ b/doc/design-paper/blocking.tex
@@ -88,6 +88,10 @@ leveraged for a new blocking-resistant design; Section~\ref{sec:related}
 explains the features and drawbacks of the currently deployed solutions;
 and ...
 
+% The other motivation is for places where we're concerned they will
+% try to enumerate a list of Tor users. So even if they're not blocking
+% the Tor network, it may be smart to not be visible as connecting to it.
+
 %And adding more different classes of users and goals to the Tor network
 %improves the anonymity for all Tor users~\cite{econymics,usability:weis2006}.
 
@@ -497,18 +501,19 @@ to get more relay addresses, and to distribute them to users differently.
 
 \subsection{Bridge relays}
 
-Today, Tor servers operate on less than a thousand distinct IP; an adversary
+Today, Tor servers operate on fewer than a thousand distinct IP addresses;
+an adversary
 could enumerate and block them all with little trouble.  To provide a
 means of ingress to the network, we need a larger set of entry points, most
 of which an adversary won't be able to enumerate easily.  Fortunately, we
-have such a set: the Tor userbase.
+have such a set: the Tor users.
 
 Hundreds of thousands of people around the world use Tor. We can leverage
 our already self-selected user base to produce a list of thousands of
 often-changing IP addresses. Specifically, we can give them a little
 button in the GUI that says ``Tor for Freedom'', and users who click
-the button will turn into \emph{bridge relays}, or just \emph{bridges}
-for short. They can rate limit relayed connections to 10 KB/s (almost
+the button will turn into \emph{bridge relays} (or just \emph{bridges}
+for short). They can rate limit relayed connections to 10 KB/s (almost
 nothing for a broadband user in a free country, but plenty for a user
 who otherwise has no access at all), and since they are just relaying
 bytes back and forth between blocked users and the main Tor network, they
@@ -537,13 +542,13 @@ bridge directory authorities.
 
 The main difference between bridge authorities and the directory
 authorities for the main Tor network is that the main authorities provide
-out a list of every known relay, but the bridge authorities only give
+a list of every known relay, but the bridge authorities only give
 out a server descriptor if you already know its identity key. That is,
 you can keep up-to-date on a bridge's location and other information
 once you know about it, but you can't just grab a list of all the bridges.
 
-The identity keys, IP address, and directory port for the bridge
-authorities ship by default with the Tor software, so the bridge relays
+The identity key, IP address, and directory port for each bridge
+authority ship by default with the Tor software, so the bridge relays
 can be confident they're publishing to the right location, and the
 blocked users can establish an encrypted authenticated channel. See
 Section~\ref{subsec:trust-chain} for more discussion of the public key
@@ -551,8 +556,8 @@ infrastructure and trust chain.
 
 Bridges use Tor to publish their descriptors privately and securely,
 so even an attacker monitoring the bridge directory authority's network
-can't make a list of all the addresses contacting the authority and
-track them that way.  Bridges may publish to only a subset of the
+can't make a list of all the addresses contacting the authority.
+Bridges may publish to only a subset of the
 authorities, to limit the potential impact of an authority compromise.
 
 
@@ -666,7 +671,7 @@ Note that, unlike many settings, the reputation problem should not be
 hard here. If a bridge says it is blocked, then it might as well be.
 If an adversary can say that the bridge is blocked wrt
 $\mathcal{censor}_i$, then it might as well be, since
-$\mathcal{censor}_i$ can presumaby then block that bridge if it so
+$\mathcal{censor}_i$ can presumably then block that bridge if it so
 chooses.
 
 11. How much damage can the adversary do by running nodes in the Tor
@@ -718,10 +723,9 @@ be most useful, because clients behind standard firewalls will have
 the best chance to reach them. Is this the best choice in all cases,
 or should we encourage some fraction of them pick random ports, or other
 ports commonly permitted through firewalls like 53 (DNS) or 110
-(POP)?  Or perhaps we should use a port where TLS traffic is expected, like
-443 (HTTPS), 993 (IMAPS), or 995 (POP3S).  We need
-more research on our potential users, and their current and anticipated
-firewall restrictions.
+(POP)?  Or perhaps we should use other ports where TLS traffic is
+expected, like 993 (IMAPS) or 995 (POP3S).  We need more research on our
+potential users, and their current and anticipated firewall restrictions.
 
 Furthermore, we need to look at the specifics of Tor's TLS handshake.
 Right now Tor uses some predictable strings in its TLS handshakes. For
@@ -762,11 +766,14 @@ variety of protocols, and we'll want to automatically handle web browsing
 differently from, say, instant messaging.
 
 % Tor cells are 512 bytes each. So TLS records will be roughly
-% multiples of this size? How bad is this?
+% multiples of this size? How bad is this? -RD
 % Look at ``Inferring the Source of Encrypted HTTP Connections''
 % by Marc Liberatore and Brian Neil Levine (CCS 2006)
 % They substantially flesh out the numbers for the  web fingerprinting
-% attack.
+% attack. -PS
+% Yes, but I meant detecting the signature of Tor traffic itself, not
+% learning what websites we're going to. I wouldn't be surprised to
+% learn that these are related problems, but it's not obvious to me. -RD
 
 \subsection{Identity keys as part of addressing information}
 
@@ -811,29 +818,29 @@ unfortunate fact is that we have no magic bullet for discovery. We're
 in the same arms race as all the other designs we described in
 Section~\ref{sec:related}.
 
-In this section we describe four approaches to adding discovery
-components for our design, in order of increasing complexity. Note that
-we can deploy all four schemes at once---bridges and blocked users can
-use the discovery approach that is most appropriate for their situation.
+In this section we describe three approaches to adding discovery
+components for our design. Note that we should deploy all the schemes
+at once---bridges and blocked users can then use the discovery approach
+that is most appropriate for their situation.
 
 \subsection{Independent bridges, no central discovery}
 
 The first design is simply to have no centralized discovery component at
 all. Volunteers run bridges, and we assume they have some blocked users
 in mind and communicate their address information to them out-of-band
-(for example, through gmail). This design allows for small personal
+(for example, through Gmail). This design allows for small personal
 bridges that have only one or a handful of users in mind, but it can
 also support an entire community of users. For example, Citizen Lab's
 upcoming Psiphon single-hop proxy tool~\cite{psiphon} plans to use this
 \emph{social network} approach as its discovery component.
 
-There are some variations on bootstrapping in this design. In the simple
+There are several ways to do bootstrapping in this design. In the simple
 case, the operator of the bridge informs each chosen user about his
 bridge's address information and/or keys. A different approach involves
 blocked users introducing new blocked users to the bridges they know.
 That is, somebody in the blocked area can pass along a bridge's address to
 somebody else they trust. This scheme brings in appealing but complex game
-theory properties: the blocked user making the decision has an incentive
+theoretic properties: the blocked user making the decision has an incentive
 only to delegate to trustworthy people, since an adversary who learns
 the bridge's address and filters it makes it unavailable for both of them.
 
@@ -860,30 +867,143 @@ recommended bridges from any of the working bridges. Now the client can
 learn new additions to the bridge pool, and can expire abandoned bridges
 or bridges that the adversary has blocked, without the user ever needing
 to care. To simplify maintenance of the community's bridge pool, each
-community could run its own bridge directory authority---accessed via
-the available bridges, or mirrored at each bridge.
+community could run its own bridge directory authority---reachable via
+the available bridges, and also mirrored at each bridge.
 
-\subsection{Social networks with directory-side support}
+\subsection{Public bridges with central discovery}
 
+What about people who want to volunteer as bridges but don't know any
+suitable blocked users? What about people who are blocked but don't
+know anybody on the outside? Here we describe how to make use of these
+\emph{public bridges} while still making it hard for the attacker
+to learn all of them.
+
+The basic idea is to divide public bridges into a set of buckets based on
+identity key, where each bucket has a different policy for distributing
+its bridge addresses to users. Each of these \emph{distribution policies}
+is designed to exercise a different scarce resource or property of
+the user.
+
+How do we divide bridges into buckets such that they're evenly distributed
+and the allocation is hard to influence or predict, but also in a way
+that's amenable to creating more buckets later on without reshuffling
+all the bridges? We compute the bucket for a given bridge by hashing the
+bridge's identity key along with a secret that only the bridge authority
+knows: the first $n$ bits of this hash dictate the bucket number,
+where $n$ is a parameter that describes how many buckets we want at this
+point. We choose $n=3$ to start, so we have 8 buckets available; but as
+we later invent new distribution policies, we can increment $n$ to split
+the 8 into 16 buckets. Since a bridge can't predict the next bit in its
+hash, it can't anticipate which identity key will correspond to a certain
+bucket when the buckets are split. Further, since the bridge authority
+doesn't provide any feedback to the bridge about which bucket it's in,
+an adversary signing up bridges to fill a certain bucket will be slowed.
+
+% This algorithm is not ideal. When we split buckets, each existing
+% bucket is cut in half, where half the bridges remain with the
+% old distribution policy, and half will be under what the new one
+% is. So the new distribution policy inherits a bunch of blocked
+% bridges if the old policy was too loose, or a bunch of unblocked
+% bridges if its policy was still secure. -RD
+
+The first distribution policy (used for the first bucket) publishes bridge
+addresses in a time-release fashion. The bridge authority divides the
+available bridges into partitions which are deterministically available
+only in certain time windows. That is, over the course of a given time
+slot (say, an hour), each requestor is given a random bridge from within
+that partition. When the next time slot arrives, a new set of bridges
+is available for discovery. Thus a bridge is always available when a new
+user arrives, but to learn about all bridges the attacker needs to fetch
+the new addresses at every new time slot. By varying the length of the
+time slots, we can make it harder for the attacker to guess when to check
+back. We expect these bridges will be the first to be blocked, but they'll
+help the system bootstrap until they \emph{do} get blocked. Further,
+remember that we're dealing with different blocking regimes around the
+world that will progress at different rates---so this bucket will still
+be useful to some users even as the arms race progresses.
+
+The second distribution policy publishes bridge addresses based on the IP
+address of the requesting user. Specifically, the bridge authority will
+divide the available bridges in the bucket into a bunch of partitions
+(as in the first distribution scheme), hash the requestor's IP address
+with a secret of its own (as in the above allocation scheme for creating
+buckets), and give the requestor a random bridge from the appropriate
+partition. To raise the bar, we should discard the last octet of the
+IP address before inputting it to the hash function, so an attacker
+who controls only one ``/24'' network counts as just one user. A large
+attacker like China will still be able to control many addresses, but
+the hassle of needing to establish connections from each network (or
+spoof TCP connections) may still slow them down. (We could also imagine
+a policy that combines the time-based and location-based policies to
+further constrain and rate-limit the available bridge addresses.)
+
+The third policy is based on Circumventor's discovery strategy. Realizing
+that its adoption will remain limited without some central coordination
+mechanism, the Circumventor project has started a mailing list to
+distribute new proxy addresses every few days. From experimentation it
+seems they have concluded that sending updates every three or four days
+is sufficient to stay ahead of the current attackers. We could give out
+bridge addresses from the third bucket in a similar fashion.
+
+The fourth policy provides an alternative approach to a mailing list:
+users provide an email address, and receive an automated response
+listing an available bridge address. We could allow only one response per
+email address. To further rate limit queries, we could require a CAPTCHA
+solution~\cite{captcha} in each case too. In fact, we wouldn't need to
+implement the CAPTCHA on our side: if we only deliver bridge addresses
+to Yahoo or Gmail addresses, we can leverage the rate-limiting schemes
+that other parties already impose for account creation.
+
+The fifth policy ties in
+...
+reputation system
 Pick some seeds---trusted people in the blocked area---and give
 them each a few hundred bridge addresses. Run a website next to the
 bridge authority, where they can log in (they only need persistent
 pseudonyms). Give them tokens slowly over time. They can use these
 tokens to delegate trust to other people they know. The tokens can
 be exchanged for new accounts on the website.
-
 Accounts in ``good standing'' accrue new bridge addresses and new
 tokens.
-
 This is great, except how do we decide that an account is in good
 standing? One answer is to measure based on whether the bridge addresses
 we give it end up blocked. But how do we decide if they get blocked?
 Other questions below too.
+\ref{sec:accounts}
 
-\subsection{Public bridges, allocated in different ways}
+Buckets six through eight are held in reserve, in case our currently
+deployed tricks all fail at once---so we can adapt and move to
+new approaches quickly, and have some bridges available for the new
+schemes. (Bridges that sign up and don't get used yet may be unhappy that
+they're not being used; but this is a transient problem: if bridges are
+on by default, nobody will mind not being used yet.)
+
+\subsubsection{Bootstrapping: finding your first bridge.}
+\label{subsec:first-bridge}
+How do users find their first public bridge, so they can reach the
+bridge authority to learn more?
+Most government firewalls are not perfect. That is, they allow connections to
+Google cache or some open proxy servers, or they let file-sharing traffic or
+Skype or World-of-Warcraft connections through. We assume that the
+users have some mechanism for bypassing the firewall initially.
+For users who can't use any of these techniques, hopefully they know
+a friend who can---for example, perhaps the friend already knows some
+bridge relay addresses.
+(If they can't get around it at all, then we can't help them---they
+should go meet more people.)
+
+Is it useful to load balance which bridges are handed out? The above
+bucket concept makes some bridges wildly popular and others less so.
+But I guess that's the point.
+
+Families of bridges: give out 4 or 8 at once, bound together.
+
+\subsection{Advantages of deploying all solutions at once}
+
+For once we're not in the position of the defender: we don't have to
+defend against every possible filtering scheme, we just have to defend
+against at least one.
 
-public proxies. given out like circumventors. or all sorts of other rate
-limiting ways.
 
 
 \subsection{Remaining unsorted notes}
@@ -898,67 +1018,32 @@ number of new bridge relays an external attacker can discover.
 Going to be an arms race. Need a bag of tricks. Hard to say
 which ones will work. Don't spend them all at once.
 
-\subsection{Bootstrapping: finding your first bridge}
-\label{subsec:first-bridge}
-
-Most government firewalls are not perfect. They allow connections to
-Google cache or some open proxy servers, or they let file-sharing or
-Skype or World-of-Warcraft connections through.
-For users who can't use any of these techniques, hopefully they know
-a friend who can---for example, perhaps the friend already knows some
-bridge relay addresses.
-(If they can't get around it at all, then we can't help them---they
-should go meet more people.)
-
 Some techniques are sufficient to get us an IP address and a port,
 and others can get us IP:port:key. Lay out some plausible options
 for how users can bootstrap into learning their first bridge.
 
-Round one:
-
-- the bridge authority server will hand some out.
-
-- get one from your friend.
-
-- send us mail with a unique account, and get an automated answer.
-
-- 
-
-Round two:
-
-- social network thing
-
 attack: adversary can reconstruct your social network by learning who
 knows which bridges.
 
-\subsection{Centrally-distributed personal proxies}
-
-Circumventor, realizing that its adoption will remain limited if would-be
-users can't connect with volunteers, has started a mailing list to
-distribute new proxy addresses every few days. From experimentation
-it seems they have concluded that sending updates every 3 or 4 days is
-sufficient to stay ahead of the current attackers.
-
-If there are many volunteer proxies and many interested users, a central
-watering hole to connect them is a natural solution. On the other hand,
-at first glance it appears that we've inherited the \emph{bad} parts of
-each of the above designs: not only do we have to attract many volunteer
-proxies, but the users also need to get to a single site that is sure
-to be blocked.
-
-There are two reasons why we're in better shape. First, the users don't
-actually need to reach the watering hole directly: it can respond to
-email, for example. Second, 
-
-In fact, the JAP
-project~\cite{web-mix,koepsell:wpes2004} suggested an alternative approach
-to a mailing list: new users email a central address and get an automated
-response listing a proxy for them.
-While the exact details of the
-proposal are still to be worked out, the idea of giving out
 
 
 
+
+
+%\section{The account / reputation system}
+\section{Social networks with directory-side support}
+\label{sec:accounts}
+
+Perhaps each bridge should be known by a single bridge directory
+authority. This makes it easier to trace which users have learned about
+it, so easier to blame or reward. It also makes things more brittle,
+since loss of that authority means its bridges aren't advertised until
+they switch, and means its bridge users are sad too.
+(Need a slick hash algorithm that will map our identity key to a
+bridge authority, in a way that's sticky even when we add bridge
+directory authorities, but isn't sticky when our authority goes
+away. Does this exist?)
+
 \subsection{Discovery based on social networks}
 
 A token that can be exchanged at the bridge authority (assuming you
@@ -978,55 +1063,6 @@ way that a given transaction was successful or unsuccessful.
 (Lesson from designing reputation systems~\cite{rep-anon}: easy to
 reward good behavior, hard to punish bad behavior.
 
-\subsection{How to allocate bridge addresses to users}
-
-Hold a fraction in reserve, in case our currently deployed tricks
-all fail at once---so we can move to new approaches quickly.
-(Bridges that sign up and don't get used yet will be sad; but this
-is a transient problem---if bridges are on by default, nobody will
-mind not being used.)
-
-Perhaps each bridge should be known by a single bridge directory
-authority. This makes it easier to trace which users have learned about
-it, so easier to blame or reward. It also makes things more brittle,
-since loss of that authority means its bridges aren't advertised until
-they switch, and means its bridge users are sad too.
-(Need a slick hash algorithm that will map our identity key to a
-bridge authority, in a way that's sticky even when we add bridge
-directory authorities, but isn't sticky when our authority goes
-away. Does this exist?)
-
-Divide bridges into buckets based on their identity key.
-[Design question: need an algorithm to deterministically map a bridge's
-identity key into a category that isn't too gameable. Take a keyed
-hash of the identity key plus a secret the bridge authority keeps?
-An adversary signing up bridges won't easily be able to learn what
-category he's been put in, so it's slow to attack.]
-
-One portion of the bridges is the public bucket. If you ask the
-bridge account server for a public bridge, it will give you a random
-one of these. We expect they'll be the first to be blocked, but they'll
-help the system bootstrap until it *does* get blocked, and remember that
-we're dealing with different blocking regimes around the world that will
-progress at different rates.
-
-The generalization of the public bucket is a bucket based on the bridge
-user's IP address: you can learn a random entry only from the subbucket
-your IP address (actually, your /24) maps to.
-
-Another portion of the bridges can be sectioned off to be given out in
-a time-release basis. The bucket is partitioned into pieces which are
-deterministically available only in certain time windows.
-
-And of course another portion is made available for the social network
-design above.
-
-Captchas.
-
-Is it useful to load balance which bridges are handed out? The above
-bucket concept makes some bridges wildly popular and others less so.
-But I guess that's the point.
-
 \subsection{How do we know if a bridge relay has been blocked?}
 
 We need some mechanism for testing reachability from inside the
@@ -1079,12 +1115,6 @@ progress reports.
 The above geoip-based approach to detecting blocked bridges gives us a
 solution though.
 
-\subsection{Advantages of deploying all solutions at once}
-
-For once we're not in the position of the defender: we don't have to
-defend against every possible filtering scheme, we just have to defend
-against at least one.
-
 \section{Security considerations}
 \label{sec:security}
 
@@ -1397,10 +1427,6 @@ that is, if they want to block our new design, they will need to
 add a feature to block exactly this.
 strategically speaking, this may come in handy.
 
-hash identity key + secret that bridge authority knows. start
-out dividing into 2^n buckets, where n starts at 0, and we choose
-which bucket you're in based on the first n bits of the hash.
-
 Bridges come in clumps of 4 or 8 or whatever. If you know one bridge
 in a clump, the authority will tell you the rest. Now bridges can
 ask users to test reachability of their buddies.