Change it from "timeout" to "tor_timeout" in order to indicate that the
DNS timeout is one from tor's DNS threshold and not the DNS server
itself.
Fixes#40527
Signed-off-by: David Goulet <dgoulet@torproject.org>
Tor has configure libevent to attempt up to 3 times a DNS query for a
maximum of 5 seconds each. Once that 5 seconds has elapsed, it consider
the query "Timed Out" but tor only gets a timeout if all 3 attempts have
failed.
For example, using Unbound, it has a much higher threshold of timeout.
It is well defined in
https://www.nlnetlabs.nl/documentation/unbound/info-timeout/ and has
some complexity to it. But the gist is that if it times out, it will be
much more than 5 seconds.
And so the Tor DNS timeouts are more of a "UX issue" rather than a
"network issue". For this reason, we are removing this metric from the
overload general signal.
See https://gitlab.torproject.org/tpo/network-health/team/-/issues/139
for more information.
Fixes#40527
Signed-off-by: David Goulet <dgoulet@torproject.org>
This avoids performing and then freeing a lot of small mallocs() if
the hash line has too many elements.
Fixes one case of bug 40472; resolves OSS-Fuzz 38363. Bugfix on
0.3.1.1-alpha when the consdiff parsing code was introduced.
Some PT applications support more than one transport. For example,
obfs4proxy supports obfs4, obfs3, and meek. If one or more transports
specified in the torrc file are supported, we shouldn't kill the managed
proxy on a {C,S}METHOD-ERROR. Instead, we should log a warning.
We were already logging warnings on method errors. This change just
makes sure that the managed proxy isn't killed, and then if no
transports are configured for the managed proxy, bumps the log level up
from a notice to a warning.
Closes#7362
where the or_conn for testing the failure cache would be initialized
with random stack data, so e.g. its potentially_used_for_bootstrapping
field would start out at some random value.
The connect failure cache had a bad interaction with retrying connections
to our guards or bridges when we go offline and then come back online --
while offline we would fail to connect and cache this result, and then
when we return we would decline to even attempt to connect, because our
failure cache said it wouldn't work.
Now only cache connect failures for relays when we connected to them
because of somebody else's EXTEND request.
Fixes bug 40499; bugfix on 0.3.3.4-alpha.
Mingw headers sometimes like to define alternative scanf/printf
format attributes depending on whether they're using clang, UCRT,
MINGW_ANSI_STDIO, or the microsoft version of printf/scanf. This
change attempts to use the right one on the given platform.
This is an attempt to fix part of #40355.
glibc versions 2.33 and newer use the modern "statx" system call in their
implementations of stat() and opendir() for Linux on i386. Prevent failures in
the sandbox unit tests by modifying the sandbox to allow this system call
without restriction on i386 when it is available, and update the test suite to
skip the "sandbox/stat_filename" test in this case as it is certain to fail.
On 32-bit architectures where Linux provides the "stat64" system call,
including i386, the sandbox is unable to filter calls to stat() as glibc uses
this system call itself internally and the sandbox must allow it without
restriction.
Update the sandbox unit tests to skip the "sandbox/stat_filename" test on
systems where the "stat64" system call is defined and the test is certain to
fail. Also reorder the "#if" statement's clauses to correspond with the
comment preceding it, for clarity.
On 32-bit architectures where Linux provides the "clock_gettime64" system call,
including i386, glibc uses it in place of "clock_gettime". Modify the sandbox
implementation to match, to prevent Tor's monotonic-time functions (in
src/lib/time/compat_time.c) failing when the sandbox is active.
On i386 glibc uses the "chown32" system call instead of "chown". Prevent
attempts to filter calls to chown() on this architecture from failing by
modifying the sandbox implementation to match.
This also moves the warnings and add some theatrical effect around the
code so anyone modifying those list should notice the warnings signs and
read the comment accordingly.
Signed-off-by: David Goulet <dgoulet@torproject.org>
Our code doesn't allow it and so this prevents an assert() crash if the
DirPort is for instance IPv6 only.
Fixes#40494
Signed-off-by: David Goulet <dgoulet@torproject.org>
While trying to resolve our CI issues, the Windows build broke with an
unused function error:
src/test/test_switch_id.c:37:1: error: ‘unprivileged_port_range_start’
defined but not used [-Werror=unused-function]
We solve this by moving the `#if !defined(_WIN32)` test above the
`unprivileged_port_range_start()` function defintion such that it is
included in its body.
This is an unreviewed commit.
See: tor#40275
When we don't yet have a descriptor for one of our bridges, disable
the entry guard retry schedule on that bridge. The entry guard retry
schedule and the bridge descriptor retry schedule can conflict,
e.g. where we mark a bridge as "maybe up" yet we don't try to fetch
its descriptor yet, leading Tor to wait (refusing to do anything)
until it becomes time to fetch the descriptor.
Fixes bug 40497; bugfix on 0.3.0.3-alpha.
we do a notice-level log when we decide we *don't* have enough dir
info, but in 0.3.5.1-alpha (see commit eee62e13d9, #14950) we lost our
corresponding notice-level log when things come back.
bugfix on 0.3.5.1-alpha; fixes bug 40496.
Specifically, every time a guard moves into or out of state
GUARD_REACHABLE_MAYBE, it is an opportunity for the guard reachability
state to get out of sync with the have-minimum-dir-info state.
Fixes even more of #40396.
When we try to fetch a bridge descriptor and we fail, we mark
the guard as failed, but we never scheduled a re-compute for
router_have_minimum_dir_info().
So if we had already decided we needed to wait for this new descriptor,
we would just wait forever -- even if, counterintuitively, *losing* the
bridge is just what we need to *resume* using the network, if we had it
in state GUARD_REACHABLE_MAYBE and we were stalling to learn this outcome.
See bug 40396 for more details.
The bridge descriptor fetching codes ends up fetching a lot of duplicate
bridge descriptors, because this is how we learn when the descriptor
changes.
This commit only changes comments plus whether we log that one line.
It moves us back to the old behavior, before the previous commit for
30496, where we would only log that line when the bridge descriptor
we're talking about is better than the one we already had (if any).
Without this change, if we have a working bridge, and we add a new bridge,
we will schedule the fetch attempt for that new bridge descriptor for
three hours(!) in the future.
This change is especially needed because of bug #40396, where if you have
one working bridge and one bridge whose descriptor you haven't fetched
yet, your Tor will stall until you have successfully fetched that new
descriptor -- in this case for hours.
In the old design, we would put off all further bridge descriptor fetches
once we had any working bridge descriptor. In this new design, we make the
decision per bridge based on whether we successfully got *its* descriptor.
To make this work, we need to also call learned_bridge_descriptor() every
time we get a bridge descriptor, not just when it's a novel descriptor.
Fixes bug 40396.
Also happens to fix bug 40495 (redundant descriptor fetches for every
bridge) since now we delay fetches once we succeed.
A side effect of this change is that if we have any configured bridges
that *aren't* working, we will keep trying to fetch their descriptors
on the modern directory retry schedule -- every couple of seconds for
the first half minute, then backing off after that -- which is a lot
faster than before.
When this method is in place, then any relay which is assigned
MiddleOnly has Exit, V2Dir, Guard, and HSDir cleared
(and has BadExit set if appropriate).
This proposal implements part of Prop335; it's based on a patch
from Neel Chauhan.
When configured to do so, authorities will assign a MiddleOnly flag
to certain relays. Any relay which an authority gives this flag
will not get Exit, V2Dir, Guard, or HSDir, and might get BadExit if
the authority votes for that one.
We keep it around until libevent is fixed, it should be used again. In
the meantime, avoid the compiler to complain of this unused variable.
https://gitlab.torproject.org/dgoulet/tor/-/jobs/43358#L1522
Signed-off-by: David Goulet <dgoulet@torproject.org>
When we looked, this was the third most frequent message at
PROTOCOL_WARN, and doesn't actually tell us what to do about it.
Now:
* we just log it at info
* we log it only once per circuit
* we report, in the heartbeat, how many times it happens, how many
cells it happens with per circuit, and how long these circuits
have been alive (on average).
Fixes the final part of #40400.
We don't output per-type DNS errors anymore so avoid looping over the
DNS query type and output each errors for them. Before this commit, it
created 3x the same message because we had A, AAAA and PTR type records.
Fix on previous commit e7abab8782
Signed-off-by: David Goulet <dgoulet@torproject.org>
This is due to the libevent bug
https://github.com/libevent/libevent/issues/1219 that fails to return
back the DNS record type on error.
And so, the MetricsPort now only reports the errors as a global counter
and not a per record type.
Closes#40490
Signed-off-by: David Goulet <dgoulet@torproject.org>
With this commit, we will only report a general overload state if we've
seen more than X% of DNS timeout errors over Y seconds. Previous
behavior was to report when a single timeout occured which is really too
small of a threshold.
The value X is a consensus parameters called
"overload_dns_timeout_scale_percent" which is a scaled percentage
(factor of 1000) so we can represent decimal points for X like 0.5% for
instance. Its default is 1000 which ends up being 1%.
The value Y is a consensus parameters called
"overload_dns_timeout_period_secs" which is the time period for which
will gather DNS errors and once over, we assess if that X% has been
reached ultimately triggering a general overload signal.
Closes#40491
Signed-off-by: David Goulet <dgoulet@torproject.org>
With this commit, we will only report a general overload state if we've
seen more than X% of DNS timeout errors over Y seconds. Previous
behavior was to report when a single timeout occured which is really too
small of a threshold.
The value X is a consensus parameters called
"overload_dns_timeout_scale_percent" which is a scaled percentage
(factor of 1000) so we can represent decimal points for X like 0.5% for
instance. Its default is 1000 which ends up being 1%.
The value Y is a consensus parameters called
"overload_dns_timeout_period_secs" which is the time period for which
will gather DNS errors and once over, we assess if that X% has been
reached ultimately triggering a general overload signal.
Closes#40491
Signed-off-by: David Goulet <dgoulet@torproject.org>
This means that at this commit, tor will stop logging that v2 is
deprecated and treat a v2 address as a bad hostname that we can't use.
Part of #40476
Signed-off-by: David Goulet <dgoulet@torproject.org>
Now that we don't have version 2, it gives us:
[warn] HiddenServiceVersion must be between 3 and 3, not 2.
This commit changes it to:
[warn] HiddenServiceVersion must be 3, not 2.
Part of #40476
Signed-off-by: David Goulet <dgoulet@torproject.org>
Some tests were removed because they were testing something not usable
anymore.
Some tests remains to make sure that things are indeed disabled.
Part of #40476
Signed-off-by: David Goulet <dgoulet@torproject.org>
Now that we don't have version 2, it gives us:
[warn] HiddenServiceVersion must be between 3 and 3, not 2.
This commit changes it to:
[warn] HiddenServiceVersion must be 3, not 2.
Part of #40476
Signed-off-by: David Goulet <dgoulet@torproject.org>
Some tests were removed because they were testing something not usable
anymore.
Some tests remains to make sure that things are indeed disabled.
Part of #40476
Signed-off-by: David Goulet <dgoulet@torproject.org>
Relay do not accept both stores and lookups of version 2 descriptor.
This effectively disable version 2 HSDir supports for relays.
Part of #40476
Signed-off-by: David Goulet <dgoulet@torproject.org>
Now that we don't have version 2, it gives us:
[warn] HiddenServiceVersion must be between 3 and 3, not 2.
This commit changes it to:
[warn] HiddenServiceVersion must be 3, not 2.
Part of #40476
Signed-off-by: David Goulet <dgoulet@torproject.org>
Some tests were removed because they were testing something not usable
anymore.
Some tests remains to make sure that things are indeed disabled.
Part of #40476
Signed-off-by: David Goulet <dgoulet@torproject.org>
Relay do not accept both stores and lookups of version 2 descriptor.
This effectively disable version 2 HSDir supports for relays.
Part of #40476
Signed-off-by: David Goulet <dgoulet@torproject.org>
Upon receiving a v2 introduction request, the relay will close the
circuit and send back a tor protocol error.
Part of #40476
Signed-off-by: David Goulet <dgoulet@torproject.org>
The minimum service version is raised from 2 to 3 which effectively
disable loading or creating an onion service v2.
As for ADD_ONION, for version 2, a 551 error is returned:
"551 Failed to add Onion Service"
Part of #40476
Signed-off-by: David Goulet <dgoulet@torproject.org>
This effectively turns off the ability of tor to use HSv2 as a client by
invalidating the v2 onion hostname passed through a SOCKS request.
Part of #40476
Signed-off-by: David Goulet <dgoulet@torproject.org>
Values greater than 100 would have had the same effect as 100, so
this doesn't actually change Tor's behavior; it just makes the
intent clearer. Fixes#40486; see also torspec#66.
This is the loudest of our LOG_PROTOCOL_WARN messages, it can occur
naturally, and there doesn't seem to be a great response to it.
Partial fix for 40400; bugfix on 0.1.1.13-alpha.
This one happens every time we get a failure from
circuit_receive_relay_cell -- but for all the relevant failing cases
in that function, we already log in that function.
This resolves one case of #40400. Two cases remain.
Series 0.4.2.x, 0.4.3.x and 0.4.4.x will all be rejected at the
authority level at this commit.
Futhermore, the 0.4.5.x alphas and rc will also be rejected.
Closes#40480
Signed-off-by: David Goulet <dgoulet@torproject.org>
Coverity report: CID 1492322
________________________________________________________________________________________________________
*** CID 1492322: Integer handling issues (OVERFLOW_BEFORE_WIDEN)
/src/core/or/congestion_control_flow.c: 399 in circuit_process_stream_xon()
393 }
394
395 log_info(LD_EDGE, "Got XON: %d", xon->kbps_ewma);
396
397 /* Adjust the token bucket of this edge connection with the drain rate in
398 * the XON. Rate is in bytes from kilobit (kpbs). */
>>> CID 1492322: Integer handling issues (OVERFLOW_BEFORE_WIDEN)
>>> Potentially overflowing expression "xon_cell_get_kbps_ewma(xon) * 1000U" with type "unsigned int" (32 bits, unsigned) is evaluated using 32-bit arithmetic, and then used in a context that expects an expression of type "uint64_t" (64 bits, unsigned).
399 uint64_t rate = xon_cell_get_kbps_ewma(xon) * 1000;
400 if (rate == 0 || INT32_MAX < rate) {
401 /* No rate. */
402 rate = INT32_MAX;
403 }
404 token_bucket_rw_adjust(&conn->bucket, (uint32_t) rate, (uint32_t) rate);
Fixes#40478
Signed-off-by: David Goulet <dgoulet@torproject.org>
Fixes issue #22469 where port strings such as '0x00' get accepted, not
because the string gets converted to hex, but because the string is
silently truncated past the invalid character 'x'. This also causes
issues for strings such as '0x01-0x02' which look like a hex port range,
but in reality gets truncated to '0', which is definitely not what a
user intends.
Warn and reject such port strings as invalid.
Also, since we're throwing that "malformed port" warning a lot in the
function, wrap it up in a nice goto.
Fixes#22469
The connection_ap_attach_pending() function processes all pending
streams in the pending_entry_connections list. It first copy the pointer
and then allocates a brand new empty list.
It then iterates over that copy pointer to try to attach entry
connections onto any fitting circuits using
connection_ap_handshake_attach_circuit().
That very function, for onion service, can lead to flagging _all_
streams of the same onion service to be put in state RENDDESC_WAIT from
CIRCUIT_WAIT. By doing so, it also tries to remove them from the
pending_entry_connections but at that point it is already empty.
Problem is that the we are iterating over the previous
pending_entry_connections which contains the streams that have just
changed state and are no longer in CIRCUIT_WAIT.
This lead to this bug warning occuring a lot on busy services:
May 01 08:55:43.000 [warn] connection_ap_attach_pending(): Bug:
0x55d8764ae550 is no longer in circuit_wait. Its current state is
waiting for rendezvous desc. Why is it on pending_entry_connections?
(on Tor 0.4.4.0-alpha-dev )
This fix is minimal and basically allow a state to be not CIRCUIT_WAIT
and move on to the next one without logging a warning. Because the
pending_entry_connections is emptied before processing, there is no
chance for a streams to be stuck there forever thus it is OK to ignore
streams not in the right state.
Fixes#34083
Signed-off-by: David Goulet <dgoulet@torproject.org>
We only need to rate limit reading on edges for flow control, as per the rate
that comes in the XON from the other side. When we rate limit reading from the
edge source to this rate, we will only deliver that fast to the other side,
thus satisfying its rate request.
Signed-off-by: David Goulet <dgoulet@torproject.org>
Relay do not accept both stores and lookups of version 2 descriptor.
This effectively disable version 2 HSDir supports for relays.
Part of #40476
Signed-off-by: David Goulet <dgoulet@torproject.org>
Upon receiving a v2 introduction request, the relay will close the
circuit and send back a tor protocol error.
Part of #40476
Signed-off-by: David Goulet <dgoulet@torproject.org>
The minimum service version is raised from 2 to 3 which effectively
disable loading or creating an onion service v2.
As for ADD_ONION, for version 2, a 551 error is returned:
"551 Failed to add Onion Service"
Part of #40476
Signed-off-by: David Goulet <dgoulet@torproject.org>
This effectively turns off the ability of tor to use HSv2 as a client by
invalidating the v2 onion hostname passed through a SOCKS request.
Part of #40476
Signed-off-by: David Goulet <dgoulet@torproject.org>
When building with --enable-fragile-hardening, add or relax Linux
seccomp rules to allow AddressSanitizer to execute normally if the
process terminates with the sandbox active.
Further resolves issue 11477.
We currently assume that the only way for Tor to listen on ports in the
privileged port range (1 to 1023), on Linux, is if we are granted the
NET_BIND_SERVICE capability. Today on Linux, it's possible to specify
the beginning of the unprivileged port range using a sysctl
configuration option. Docker (and thus the CI service Tor uses) recently
changed this sysctl value to 0, which causes our tests to fail as they
assume that we should NOT be able to bind to a privileged port *without*
the NET_BIND_SERVICE capability.
In this patch, we read the value of the sysctl value via the /proc/sys/
filesystem iff it's present, otherwise we assume the default
unprivileged port range begins at port 1024.
See: tor#40275
This code is based directly on the specification, without looking at
the reference implementation or the implementation in Arti.
Nonetheless, it is now passing with the test vectors generated by
the reference implementation.
When a directory request fails, we flag the relay as non Running so we
don't use it anymore.
This can be problematic with onion services because there are cases
where a tor instance could have a lot of services, ephemeral ones, and
keeps failing to upload descriptors, let say due to a bad network, and
thus flag a lot of nodes as non Running which then in turn can not be
used for circuit building.
This commit makes it that we never flag nodes as non Running on a onion
service directory request (upload or fetch) failure as to keep the
hashring intact and not affect other parts of tor.
Fortunately, the onion service hashring is _not_ selected by looking at
the Running flag but since we do a 3-hop circuit to the HSDir, other
services on the same instance can influence each other by removing nodes
from the consensus for path selection.
This was made apparent with a small network that ran out of nodes to
used due to rapid succession of onion services uploading and failing.
See #40434 for details.
Fixes#40434
Signed-off-by: David Goulet <dgoulet@torproject.org>
Fixes bug 40078.
As reported by hdevalence our batch verification logic can cause an assert
crash.
The assert happens because when the batch verification of ed25519-donna fails,
the code in `ed25519_checksig_batch()` falls back to doing a single
verification for each signature.
The crash occurs because batch verification failed, but then all signatures
individually verified just fine.
That's because batch verification and single verification use a different
equation which means that there are sigs that can pass single verification
but fail batch verification.
Fixing this would require modding ed25519-donna which is not in scope for
this ticket, and will be soon deprecated in favor of arti and
ed25519-dalek, so my branch instead removes batch verification.