Tor configures libevent to attempt a DNS query up to 3 times, with a
maximum of 5 seconds for each attempt. Once those 5 seconds have elapsed,
libevent considers the query "Timed Out", but tor only sees a timeout if
all 3 attempts have failed.
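As a point of reference, such a resolver configuration looks roughly like
the sketch below (a minimal illustration using libevent's generic evdns
options, not necessarily the exact calls tor makes):

```c
#include <event2/dns.h>

/* Rough sketch: per-attempt timeout of 5 seconds, up to 3 attempts per
 * query before the whole query is reported as failed. */
static void
configure_evdns_timeouts(struct evdns_base *dns_base)
{
  evdns_base_set_option(dns_base, "timeout", "5");
  evdns_base_set_option(dns_base, "attempts", "3");
}
```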
By comparison, Unbound has a much higher timeout threshold. It is well
documented in
https://www.nlnetlabs.nl/documentation/unbound/info-timeout/ and has
some complexity to it, but the gist is that when a query does time out,
it takes much more than 5 seconds.
And so the Tor DNS timeouts are more of a "UX issue" than a
"network issue". For this reason, we are removing this metric from the
general overload signal.
See https://gitlab.torproject.org/tpo/network-health/team/-/issues/139
for more information.
Fixes #40527
Signed-off-by: David Goulet <dgoulet@torproject.org>
This avoids performing and then freeing a lot of small mallocs() if
the hash line has too many elements.
Fixes one case of bug 40472; resolves OSS-Fuzz 38363. Bugfix on
0.3.1.1-alpha when the consdiff parsing code was introduced.
Some PT applications support more than one transport. For example,
obfs4proxy supports obfs4, obfs3, and meek. If one or more transports
specified in the torrc file are supported, we shouldn't kill the managed
proxy on a {C,S}METHOD-ERROR. Instead, we should log a warning.
We were already logging warnings on method errors. This change just
makes sure that the managed proxy isn't killed, and then if no
transports are configured for the managed proxy, bumps the log level up
from a notice to a warning.
Closes #7362
MinGW headers sometimes like to define alternative scanf/printf
format attributes depending on whether they're using clang, UCRT,
MINGW_ANSI_STDIO, or the Microsoft version of printf/scanf. This
change attempts to use the right one on the given platform.
This is an attempt to fix part of #40355.
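A simplified sketch of the idea (the macro name here is illustrative, not
necessarily the one in the tree): when the MinGW headers advertise which
printf flavor is in effect, use that archetype for the format attribute,
otherwise fall back to the plain printf archetype.

```c
#ifdef __MINGW_PRINTF_FORMAT
/* mingw-w64 tells us which printf implementation is actually in use. */
#  define CHECK_PRINTF(fmt_idx, arg_idx) \
     __attribute__((format(__MINGW_PRINTF_FORMAT, fmt_idx, arg_idx)))
#else
/* Default: assume the standard printf archetype. */
#  define CHECK_PRINTF(fmt_idx, arg_idx) \
     __attribute__((format(printf, fmt_idx, arg_idx)))
#endif
```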
This also moves the warnings and adds some theatrical effect around the
code, so that anyone modifying those lists should notice the warning
signs and read the comment accordingly.
Signed-off-by: David Goulet <dgoulet@torproject.org>
Our code doesn't allow it, and so this prevents an assert() crash if the
DirPort is, for instance, IPv6-only.
Fixes #40494
Signed-off-by: David Goulet <dgoulet@torproject.org>
When we don't yet have a descriptor for one of our bridges, disable
the entry guard retry schedule on that bridge. The entry guard retry
schedule and the bridge descriptor retry schedule can conflict,
e.g. where we mark a bridge as "maybe up" yet we don't try to fetch
its descriptor yet, leading Tor to wait (refusing to do anything)
until it becomes time to fetch the descriptor.
Fixes bug 40497; bugfix on 0.3.0.3-alpha.
We do a notice-level log when we decide we *don't* have enough dir
info, but in 0.3.5.1-alpha (see commit eee62e13d9, #14950) we lost our
corresponding notice-level log for when things come back.
Bugfix on 0.3.5.1-alpha; fixes bug 40496.
Specifically, every time a guard moves into or out of state
GUARD_REACHABLE_MAYBE, it is an opportunity for the guard reachability
state to get out of sync with the have-minimum-dir-info state.
Fixes even more of #40396.
When we try to fetch a bridge descriptor and we fail, we mark
the guard as failed, but we never schedule a re-computation of
router_have_minimum_dir_info().
So if we had already decided we needed to wait for this new descriptor,
we would just wait forever -- even if, counterintuitively, *losing* the
bridge is just what we need to *resume* using the network, if we had it
in state GUARD_REACHABLE_MAYBE and we were stalling to learn this outcome.
See bug 40396 for more details.
The bridge descriptor fetching code ends up fetching a lot of duplicate
bridge descriptors, because this is how we learn when the descriptor
changes.
This commit only changes comments plus whether we log that one line.
It moves us back to the old behavior, before the previous commit for
30496, where we would only log that line when the bridge descriptor
we're talking about is better than the one we already had (if any).
Without this change, if we have a working bridge, and we add a new bridge,
we will schedule the fetch attempt for that new bridge descriptor for
three hours(!) in the future.
This change is especially needed because of bug #40396, where if you have
one working bridge and one bridge whose descriptor you haven't fetched
yet, your Tor will stall until you have successfully fetched that new
descriptor -- in this case for hours.
In the old design, we would put off all further bridge descriptor fetches
once we had any working bridge descriptor. In this new design, we make the
decision per bridge based on whether we successfully got *its* descriptor.
To make this work, we need to also call learned_bridge_descriptor() every
time we get a bridge descriptor, not just when it's a novel descriptor.
Fixes bug 40396.
Also happens to fix bug 40495 (redundant descriptor fetches for every
bridge) since now we delay fetches once we succeed.
A side effect of this change is that if we have any configured bridges
that *aren't* working, we will keep trying to fetch their descriptors
on the modern directory retry schedule -- every couple of seconds for
the first half minute, then backing off after that -- which is a lot
faster than before.
When this method is in place, any relay which is assigned
MiddleOnly has Exit, V2Dir, Guard, and HSDir cleared
(and has BadExit set if appropriate).
This proposal implements part of Prop335; it's based on a patch
from Neel Chauhan.
When configured to do so, authorities will assign a MiddleOnly flag
to certain relays. Any relay which an authority gives this flag
will not get Exit, V2Dir, Guard, or HSDir, and might get BadExit if
the authority votes for that one.
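The effect on the flags can be illustrated with a small sketch (the struct
and field names here are hypothetical stand-ins for the real routerstatus
flags, not the committed code):

```c
/* Hypothetical flags struct used only for illustration. */
struct relay_flags {
  unsigned is_exit : 1;
  unsigned is_v2_dir : 1;
  unsigned is_possible_guard : 1;
  unsigned is_hs_dir : 1;
  unsigned is_bad_exit : 1;
  unsigned is_middle_only : 1;
};

/* Sketch: a MiddleOnly relay loses its other role flags, and gets
 * BadExit only if this authority is voting on that flag. */
static void
apply_middle_only(struct relay_flags *flags, int voting_on_badexit)
{
  if (!flags->is_middle_only)
    return;
  flags->is_exit = 0;
  flags->is_v2_dir = 0;
  flags->is_possible_guard = 0;
  flags->is_hs_dir = 0;
  if (voting_on_badexit)
    flags->is_bad_exit = 1;
}
```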
We keep it around so it can be used again once libevent is fixed. In
the meantime, keep the compiler from complaining about this unused
variable.
https://gitlab.torproject.org/dgoulet/tor/-/jobs/43358#L1522
Signed-off-by: David Goulet <dgoulet@torproject.org>
We don't output per-type DNS errors anymore, so avoid looping over the
DNS query types and outputting the errors for each of them. Before this
commit, the same message was emitted 3 times because we had A, AAAA and
PTR record types.
Fix on previous commit e7abab8782
Signed-off-by: David Goulet <dgoulet@torproject.org>
This is due to the libevent bug
https://github.com/libevent/libevent/issues/1219, which fails to return
the DNS record type on error.
And so, the MetricsPort now only reports the errors as a global counter
and not per record type.
Closes #40490
Signed-off-by: David Goulet <dgoulet@torproject.org>
With this commit, we will only report a general overload state if we've
seen more than X% of DNS timeout errors over Y seconds. Previous
behavior was to report when a single timeout occurred, which is far too
small a threshold.
The value X is a consensus parameter called
"overload_dns_timeout_scale_percent", which is a scaled percentage
(factor of 1000) so we can represent fractional values of X, like 0.5%
for instance. Its default is 1000, which ends up being 1%.
The value Y is a consensus parameter called
"overload_dns_timeout_period_secs", which is the time period over which
we gather DNS errors; once it has elapsed, we assess whether that X% has
been reached, ultimately triggering a general overload signal.
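As a minimal sketch of the threshold check (hypothetical function, not the
committed code), the comparison can be done entirely in the scaled domain:

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if the observed timeout fraction, scaled the same way as
 * the consensus parameter (percent * 1000), reaches the threshold.
 * With the default of 1000, this means 1% of queries timing out. */
static bool
dns_timeout_overload_reached(uint64_t n_dns_queries,
                             uint64_t n_dns_timeouts,
                             uint64_t timeout_scale_percent)
{
  if (n_dns_queries == 0)
    return false;
  double scaled = ((double)n_dns_timeouts / (double)n_dns_queries)
                  * 100.0 * 1000.0;
  return scaled >= (double)timeout_scale_percent;
}
```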
Closes #40491
Signed-off-by: David Goulet <dgoulet@torproject.org>
Now that we don't have version 2, it gives us:
[warn] HiddenServiceVersion must be between 3 and 3, not 2.
This commit changes it to:
[warn] HiddenServiceVersion must be 3, not 2.
Part of #40476
Signed-off-by: David Goulet <dgoulet@torproject.org>
Relays no longer accept stores or lookups of version 2 descriptors.
This effectively disables version 2 HSDir support for relays.
Part of #40476
Signed-off-by: David Goulet <dgoulet@torproject.org>
Upon receiving a v2 introduction request, the relay will close the
circuit and send back a tor protocol error.
Part of #40476
Signed-off-by: David Goulet <dgoulet@torproject.org>
The minimum service version is raised from 2 to 3, which effectively
disables loading or creating a version 2 onion service.
As for ADD_ONION, for version 2, a 551 error is returned:
"551 Failed to add Onion Service"
Part of #40476
Signed-off-by: David Goulet <dgoulet@torproject.org>
Values greater than 100 would have had the same effect as 100, so
this doesn't actually change Tor's behavior; it just makes the
intent clearer. Fixes #40486; see also torspec#66.
This is the loudest of our LOG_PROTOCOL_WARN messages, it can occur
naturally, and there doesn't seem to be a great response to it.
Partial fix for 40400; bugfix on 0.1.1.13-alpha.
Series 0.4.2.x, 0.4.3.x and 0.4.4.x will all be rejected at the
authority level as of this commit.
Furthermore, the 0.4.5.x alphas and release candidates will also be
rejected.
Closes #40480
Signed-off-by: David Goulet <dgoulet@torproject.org>
When a directory request fails, we flag the relay as non Running so we
don't use it anymore.
This can be problematic with onion services because there are cases
where a tor instance could have a lot of services, ephemeral ones, and
keep failing to upload descriptors, let's say due to a bad network, and
thus flag a lot of nodes as non Running, which in turn can not be used
for circuit building.
This commit makes it so that we never flag nodes as non Running on an
onion service directory request (upload or fetch) failure, so as to keep
the hashring intact and not affect other parts of tor.
Fortunately, the onion service hashring is _not_ selected by looking at
the Running flag, but since we build a 3-hop circuit to the HSDir, other
services on the same instance can influence each other by removing nodes
from the consensus for path selection.
This was made apparent on a small network that ran out of nodes to use
due to a rapid succession of onion services uploading and failing.
See #40434 for details.
Fixes #40434
Signed-off-by: David Goulet <dgoulet@torproject.org>
Without this message getting logged at 'WARN', it's hard to
contextualize the messages we get about compression bombs, so this
message should fix #40175.
I'm rate-limiting this, however, since it _could_ get spammy if
somebody on the network starts acting up. (Right now it should be
very quiet; I've asked Sebastian to check it, and he says that he
doesn't hit this message in practice.)
Closes #40175.
Cached_dir_t is a somewhat "legacy" kind of storage when used for
consensus documents, and it appears that there are cases when
changing our settings causes us to stop updating those entries.
This can cause trouble, as @arma found out in #40375, where he
changed his settings around, and consensus diff application got
messed up: consensus diffs were being _requested_ based on the
latest consensus, but were being (incorrectly) applied to a
consensus that was no longer the latest one.
This patch is a minimal fix for backporting purposes: it has Tor do
the same search when applying consensus diffs as we use to request
them. This should be sufficient for correct behavior.
There's a similar case in GETINFO handling; I've fixed that too.
Fixes #40375; bugfix on 0.3.1.1-alpha.
Current counters are reset every heartbeat. This commit adds two
counters for the assigned and dropped onionskins that are not reset so
they can be exported onto the MetricsPort.
Closes #40387
Signed-off-by: David Goulet <dgoulet@torproject.org>
This resulted in the labels not being surrounded by double quotes, and
thus Prometheus rejecting them.
Signed-off-by: David Goulet <dgoulet@torproject.org>
Emit on the MetricsPort all the DNS statistics we have, that is, the
total number of queries seen and the number of errors per record type.
Related to #40367
Signed-off-by: David Goulet <dgoulet@torproject.org>
We now keep track of all errors and the total number of requests seen. This
is so we can expose those values to the MetricsPort to help Exit
operators monitor the DNS requests and failures.
Related to #40367.
Signed-off-by: David Goulet <dgoulet@torproject.org>
This emits two metrics (read and write) counting the total number of
times the global connection limit was reached.
Related to #40367
Signed-off-by: David Goulet <dgoulet@torproject.org>
With this commit, a relay will emit metrics that give the total number
of sockets and total number of opened sockets.
Related to #40367
Signed-off-by: David Goulet <dgoulet@torproject.org>
With this commit, a relay now emits metrics on the MetricsPort
related to how many onionskins were handled (processed or dropped) for
each handshake type.
Related to #40367
Signed-off-by: David Goulet <dgoulet@torproject.org>
With this commit, a relay now emits metrics on the MetricsPort
related to the OOM invocation for:
- DNS cache
- GeoIP database
- Cell queues
- HSDir caches
Every time the OOM is invoked, the number of bytes is added to the
metrics counter for that specific type of invocation.
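A minimal sketch of the counting (the names are illustrative, not the
actual metrics code):

```c
#include <stdint.h>

/* One counter per OOM target. */
typedef enum {
  OOM_TARGET_DNS_CACHE = 0,
  OOM_TARGET_GEOIP_DB,
  OOM_TARGET_CELL_QUEUE,
  OOM_TARGET_HSDIR_CACHE,
  OOM_TARGET_COUNT
} oom_target_t;

static uint64_t oom_bytes_freed[OOM_TARGET_COUNT];

/* Called each time the OOM handler frees memory of a given kind. */
static void
note_oom_invocation(oom_target_t target, uint64_t bytes_freed)
{
  oom_bytes_freed[target] += bytes_freed;
}
```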
Related to #40367
Signed-off-by: David Goulet <dgoulet@torproject.org>
The basic functions for the relay subsystem to expose metrics onto the
MetricsPort.
Part of #40367
Signed-off-by: David Goulet <dgoulet@torproject.org>
Formatting a label is a common operation that a lot of subsystems can
use, so move it out of the HS subsystem and into the more generic
metrics library.
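A minimal sketch of such a shared formatter (hypothetical signature; the
real helper may differ), producing the key="value" form that Prometheus
expects:

```c
#include <stdio.h>

/* Format a metrics label as key="value" into the caller's buffer. */
static const char *
metrics_format_label(char *buf, size_t buflen,
                     const char *key, const char *value)
{
  snprintf(buf, buflen, "%s=\"%s\"", key, value);
  return buf;
}
```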
Signed-off-by: David Goulet <dgoulet@torproject.org>
Turns out that passing client authorization keys to ADD_ONION for v3 was
not working because we were not setting the "is_client_auth_enabled"
flag to true once the clients were configured. This led to the
descriptor being encoded without the clients.
This patch removes that flag and instead adds an inline function that
can be used to check if a given service has client authorization
enabled.
This is much less error prone than needing to keep the client list and a
flag in sync.
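A minimal sketch of what such an inline check can look like (the type,
field, and function names are assumptions based on tor's service
configuration code, not the exact committed helper):

```c
#include <stdbool.h>

/* Client auth is enabled exactly when at least one client is configured,
 * so there is no separate flag to keep in sync. */
static inline bool
service_has_client_auth(const hs_service_config_t *config)
{
  return config->clients != NULL && smartlist_len(config->clients) > 0;
}
```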
Fixes #40378
Signed-off-by: David Goulet <dgoulet@torproject.org>
This function has been a no-op since Libevent 2.0.4-alpha, when
libevent got an arc4random() implementation. Libevent has finally
removed it, which will break our compilation unless we stop calling
it. (This is currently breaking compilation in OSS-fuzz.)
Closes #40371.
This is related to ticket #40360, which found this problem when a Bridge entry
with a transport name (let's say obfs4) is set without a fingerprint:
Bridge obfs4 <IP>:<PORT> cert=<...> iat-mode=0
(Notice: no fingerprint between PORT and "cert=".)
Problem: commit 09c6d03246 added a check in
get_sampled_guard_for_bridge() that would return NULL if the selected bridge
did not have a valid transport name (that is, the Bridge transport name that
corresponds to a ClientTransportPlugin).
Unfortunately, this function is also used when selecting our eligible guards,
which is done *before* the transport list is populated, and so the added check
for the bridge<->transport name is querying an empty list of transports,
resulting in always returning NULL.
For completeness, the logic is: pick eligible guards (using bridge(s) if need
be), then, for those, initiate a connection to the pluggable transport proxy,
and then populate the transport list once we've connected.
Back to get_sampled_guard_for_bridge(). As said earlier, it is used when
selecting our eligible guards in a way that prevents us from selecting
duplicates. In other words, if that function returns non-NULL, the selection
continues, considering that the bridge was sampled before. But if it returns NULL,
the relay is added to the eligible list.
This bug made it so that our eligible guard list was populated with the *same*
bridge 3 times, like so (remember, no fingerprint):
[info] entry_guards_update_primary(): Primary entry guards have changed. New primary guard list is:
[info] entry_guards_update_primary(): 1/3: [bridge] ($0000000000000000000000000000000000000000)
[info] entry_guards_update_primary(): 2/3: [bridge] ($0000000000000000000000000000000000000000)
[info] entry_guards_update_primary(): 3/3: [bridge] ($0000000000000000000000000000000000000000)
When tor starts, it will find the bridge fingerprint by connecting to it and
will then update the primary guard list by calling
entry_guard_learned_bridge_identity(), which then goes and updates only 1
single entry, resulting in this list:
[debug] sampled_guards_update_consensus_presence(): Sampled guard [bridge] ($<FINGERPRINT>) is still listed.
[debug] sampled_guards_update_consensus_presence(): Sampled guard [bridge] ($0000000000000000000000000000000000000000) is still listed.
[debug] sampled_guards_update_consensus_presence(): Sampled guard [bridge] ($0000000000000000000000000000000000000000) is still listed.
And here lies the problem: now tor is stuck waiting for a valid
descriptor for at least 2 guards, where the second one is a bunch of zeroes,
and thus tor will never fully bootstrap:
[info] I learned some more directory information, but not enough to build a
circuit: We're missing descriptors for 1/2 of our primary entry guards
(total microdescriptors: 6671/6703). That's ok. We will try to fetch missing
descriptors soon.
Now, why does passing the fingerprint make it work? Because the list of
guards contains the same bridge 3 times, but they all have a fingerprint, so
the descriptor can be found and tor can bootstrap.
The solution here is to entirely remove the transport name check in
get_sampled_guard_for_bridge() since the transport_list is empty at that
point. That way, the eligible guard list only gets 1 entry, the bridge, and
tor can then go on to bootstrap properly.
It is OK to do so since when launching a bridge descriptor fetch, we validate
that the bridge transport name is OK and thus avoid connecting to a bridge
without a ClientTransportPlugin. If we wanted to keep the check in place, we
would need to populate the transport_list much earlier and this would require
a much bigger refactoring.
Fixes #40360
Signed-off-by: David Goulet <dgoulet@torproject.org>
We use it in router.c, where chunks are joined with "", not with
NL... so leaving off the terminating NL will lead to an unparseable
extrainfo.
Found by toralf. Bug not in any released Tor.
```
src/feature/stats/rephist.c: In function ‘overload_happened_recently’:
src/feature/stats/rephist.c:215:21: error: comparison between signed and unsigned integer expressions [-Werror=sign-compare]
if (overload_time > approx_time() - 3600 * n_hours) {
```
from https://gitlab.torproject.org/tpo/core/tor/-/issues/40341#note_2729364
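One way to silence this (a sketch, not necessarily the committed fix) is to
keep the right-hand side signed by casting the hour count before the
multiplication:

```c
/* Cast n_hours to time_t so the subtraction stays signed. */
if (overload_time > approx_time() - 3600 * (time_t)n_hours) {
  /* ... the overload happened within the last n_hours hours ... */
}
```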
- Implement overload statistics structure (see the sketch after this list).
- Implement function that keeps track of overload statistics.
- Implement function that writes overload statistics to descriptor.
- Unittest for the whole logic.
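An illustrative sketch of what such an overload statistics structure can
look like (the field names are assumptions, not the committed ones):

```c
#include <stdint.h>
#include <time.h>

typedef struct overload_stats_t {
  /* Last time a general overload condition was seen. */
  time_t overload_general_time;
  /* Last time we hit our configured rate limits. */
  time_t overload_ratelimits_time;
  /* How many times the read/write limits were reached. */
  uint64_t overload_read_count;
  uint64_t overload_write_count;
  /* Last time we ran out of file descriptors. */
  time_t overload_fd_exhausted_time;
} overload_stats_t;
```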
This option changes the time for which a bandwidth measurement period
must have been in progress before we include it when reporting our
observed bandwidth in our descriptors. Without this option, we only
consider a time period towards our maximum if it has been running
for a full day. Obviously, that's unacceptable for testing
networks, where we'd like to get results as soon as possible.
For non-testing networks, I've put a (somewhat arbitrary) 2-hour
minimum on the option, since there are traffic analysis concerns
with immediate reporting here.
Closes #40337.