diff --git a/doc/spec/proposals/126-geoip-reporting.txt b/doc/spec/proposals/126-geoip-reporting.txt index d2da4dc304..57480ff85c 100644 --- a/doc/spec/proposals/126-geoip-reporting.txt +++ b/doc/spec/proposals/126-geoip-reporting.txt @@ -205,7 +205,7 @@ Status: Needs-Research 6. Controllers use the IP-to-country db for mapping and for path building - Down the road, vidalia can use the IP-to-country mappings for placing + Down the road, Vidalia could use the IP-to-country mappings for placing on its map: - The location of the client - The location of the bridges, or other relays not in the @@ -222,6 +222,14 @@ Status: Needs-Research GETINFO ip-to-country/128.31.0.34 250+ip-to-country/128.31.0.34="US","USA","UNITED STATES" +6.1. Other interfaces + + Robert Hogan has also suggested a + GETINFO relays-by-country/cn + + as well as torrc options for ExitCountryCodes, EntryCountryCodes, + ExcludeCountryCodes, etc. + 7. Relays and bridges use the IP-to-country db for usage summaries Once bridges have a GeoIP database locally, they can start to publish @@ -231,5 +239,156 @@ Status: Needs-Research switch to using directory guards for all users by default. But how to safely summarize this information without opening too many - anonymity leaks seems hard... + anonymity leaks? + +7.1 Attacks to think about + + First, note that we need to have a large enough time window that we're + not aiding correlation attacks much. I hope 24 hours is enough. So + that means no publishing stats until you've been up at least 24 hours. + And you can't publish follow-up stats more often than every 24 hours, + or people could look at the differential. + + Second, note that we need to be sufficiently vague about the IP + addresses we're reporting. We are hoping that just specifying the + country will be vague enough. But a) what about active attacks where + we convince a bridge to use a GeoIP db that labels each suspect IP + address as a unique country? 
We have to assume that the consensus GeoIP + db won't be malicious in this way. And b) could such singling-out + attacks occur naturally, for example because of countries that have + a very small IP space? We should investigate that. + +7.2. Granularity of users + + Do we only want to report countries that have a large enough anonymity set + (that is, number of users) for the day? For example, we might avoid + listing any countries that have seen fewer than five addresses over + the 24 hour period. This approach would be helpful in reducing the + singling-out opportunities -- in the extreme case, we could imagine a + situation where one blogger from the Sudan used Tor on a given day, and + we can discover which entry guard she used. + + But I fear that especially for bridges, seeing only one hit from a + given country in a given day may be quite common. + + As a compromise, we should start out with an "Other" category in + the reported stats, which is the sum of unlisted countries; if that + category is consistently interesting, we can think harder about how + to get the right data from it safely. + + But note that bridge summaries will not be made public individually, + since doing so would help people enumerate bridges. Whereas summaries + from normal relays will be public. So perhaps that means we can afford + to be more specific in bridge summaries? In particular, I'm thinking the + "other" category should be used by public relays but not for bridges + (or if it is, used with a lower threshold). + + Even for countries that have many Tor users, we might not want to be + too specific about how many users we've seen. For example, we might + round down the number of users we report to the nearest multiple of 5. + My instinct for now is that this won't be that useful. + +7.3. Other issues + + Another note: we'll likely be overreporting in the case of users with + dynamic IP addresses: if they rotate to a new address over the course + of the day, we'll count them twice.
So be it. + +7.4. Where to publish the summaries? + + We designed extrainfo documents for information like this. So they + should just be more entries in the extrainfo doc. + + But if we want to publish summaries every 24 hours (no more often, + no less often), aren't we tied to the router descriptor publishing + schedule? That is, if we publish a new router descriptor at the 18 + hour mark, and nothing much has changed at the 24 hour mark, won't + the new descriptor get dropped as being "cosmetically similar", and + then nobody will know to ask about the new extrainfo document? + + One solution would be to make and remember the 24 hour summary at the + 24 hour mark, but not actually publish it anywhere until we happen to + publish a new descriptor for other reasons. If we happen to go down + before publishing a new descriptor, then so be it, at least we tried. + +7.5. What if the relay is unreachable or goes to sleep? + + Even if you've been up for 24 hours, if you were hibernating for 18 + of them, then we're not getting as much fuzziness as we'd like. So + I guess that means that we need a 24-hour period of being "awake" + before we're willing to publish a summary. A similar attack works if + you've been awake but unreachable for the first 18 of the 24 hours. As + another example, a bridge that's on a laptop might be suspended for + some of each day. + + This implies that some relays and bridges will never publish summary + stats, because they're not ever reliably working for 24 hours in + a row. If a significant percentage of our reporters end up being in + this boat, we should investigate whether we can accumulate 24 hours of + "usefulness", even if there are holes in the middle, and publish based + on that. + + What other issues are like this? It seems that just moving to a new + IP address shouldn't be a reason to cancel stats publishing, assuming + we were usable at each address. + +7.6.
IP addresses that aren't in the geoip db + + Some IP addresses aren't in the public geoip databases. In particular, + I've found that a lot of African countries are missing, but there + are also some common ones in the US that are missing, like parts of + Comcast. We could just lump unknown IP addresses into the "other" + category, but it might be useful to gather a general sense of how many + lookups are failing entirely, by adding a separate "Unknown" category. + + We could also contribute back to the geoip db, by letting bridges set + a config option to report the actual IP addresses that failed their + lookup. Then the bridge authority operators can manually make sure + the correct answer will be in later geoip files. This config option + should be disabled by default. + +7.7. Bringing it all together + + So here's the plan: + + 24 hours after starting up (modulo Section 7.5 above), bridges and + relays should construct a daily summary of client countries they've + seen, including the above "Unknown" category (Section 7.6) as well. + + Non-bridge relays lump all countries with fewer than K (e.g. K=5) users + into the "Other" category (see Sec 7.2 above), whereas bridge relays are + willing to list a country even when it has only one user for the day. + + Whenever we have a daily summary on record, we include it in our + extrainfo document whenever we publish one. The daily summary we + remember locally gets replaced with a newer one when another 24 + hours pass. + +7.8. Some forward secrecy + + How should we remember addresses locally? If we convert them into + country-codes immediately, we will count them again if we see them + again. On the other hand, we don't really want to keep a list hanging + around of all IP addresses we've seen in the past 24 hours. + + Step one is that we should never write this stuff to disk. Keeping it + only in ram will make things somewhat better.
Step two is to avoid + keeping any timestamps associated with it: rather than a rolling + 24-hour window, which would require us to remember the various times + we've seen that address, we can instead just throw out the whole list + every 24 hours and start over. + + We could hash the addresses, and then compare hashes when deciding if + we've seen a given address before. We could even do keyed hashes. Or + Bloom filters. But if our goal is to defend against an adversary + who steals a copy of our ram while we're running and then does + guess-and-check on whatever blob we're keeping, we're in bad shape. + + We could drop the last octet of the IP address as soon as we see + it. That would cause us to undercount some users from cablemodem and + DSL networks that have a high density of Tor users. And it wouldn't + really help that much -- indeed, the extent to which it does help is + exactly the extent to which it makes our stats less useful. + + Other ideas?
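Reviewer note: to make the plan in 7.7 concrete, here is a rough Python sketch of how the daily summary might be built from the countries seen in one 24-hour window. It is only an illustration, not the Tor implementation. The function name summarize_countries, its parameters, and the choice to always report "Unknown" separately (never folding it into "Other") are assumptions made for this example. The input is assumed to be one country code per distinct client address, with None standing in for a failed GeoIP lookup.

```python
from collections import Counter

def summarize_countries(country_codes, k=5, is_bridge=False):
    """Build a daily per-country summary (hypothetical helper).

    country_codes: one entry per distinct client address seen in the
    window; None means the GeoIP lookup failed for that address.
    k: threshold below which a public relay hides a country in "Other".
    is_bridge: bridges list every country, even with a single user.
    """
    counts = Counter()
    for cc in country_codes:
        counts["Unknown" if cc is None else cc] += 1

    if is_bridge:
        # Bridge summaries are not published individually, so no binning.
        return dict(counts)

    summary = {}
    other = 0
    for cc, n in counts.items():
        # Assumption: "Unknown" is always reported as its own category.
        if cc != "Unknown" and n < k:
            other += n
        else:
            summary[cc] = n
    if other:
        summary["Other"] = other
    return summary
```

Under these assumptions, a public relay that saw six "us" addresses, two "de", one "sd" and one failed lookup would report "us" and "Unknown" by name and fold the remaining three users into "Other", while a bridge would list all four categories. Per 7.8, the caller would throw away the underlying address list every 24 hours rather than keep per-address timestamps.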