I ran into a fun issue on Friday evening with one of our servers, so I thought I’d toss it out into the world just in case anyone else runs into something similar.
We have a rather heavily loaded administrative server, an Xserve, running Mac OS X Server. Additionally, we run a modified sshd configuration, where sshd runs as a standalone service instead of via launchd, as it does in the default system installation. Is it relevant? Yes.
So, Friday evening rolls around (of course), and I get an e-mail from a co-worker saying that when she tries to log in, she gets that oh-so-common error message:
ssh_exchange_identification: Connection closed by remote host
Usually, in my experience, this has been related to TCP wrappers: you’ve got a hosts.allow or hosts.deny set up, you’re blocking the client (on purpose or by accident), and you just need to fix the config. Sometimes you’re totally screwed, but whatever. In our case, however, we aren’t using TCP wrappers. We could, since the default sshd in OS X does support them:
[chet@myhost]$ strings /usr/sbin/sshd | grep wrap
Connection refused by tcp wrapper
libwrap refuse returns
or, if you have Xcode installed, you can use otool (the OS X equivalent to ldd in Linux):
[chet@myhost]$ otool -L /usr/sbin/sshd | grep libwrap
/usr/lib/libwrap.7.dylib (compatibility version 7.0.0, current version 7.6.0)
However, we don’t have a hosts.(allow|deny), so that’s not the issue. Thinking further, I remembered someone mentioning that the box was under more load than usual, and may have hit some type of system limit (maxprocs, maxttys, etc). To test, I basically flooded the box with ssh connection requests using a super-basic loop that a 3rd grader would write:
$ for connection in `jot 500 1`; do
>   ssh myhost &
> done
Sure enough, right when I started the loop, I immediately began getting ssh_exchange_identification errors thrown back at me, whereas a single login outside the loop would work just fine. Bingo. So, it wasn’t a system limit per se, but some type of limit being triggered within the sshd daemon itself.
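If you want a number instead of a wall of errors, the flood can be wrapped so the failures get counted. This is only a rough sketch: flood_host and count_exchange_errors are hypothetical helper names, and the BatchMode and ConnectTimeout options are my additions so that stuck attempts don't hang around waiting for a password.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the original test loop.

# Start N ssh attempts in parallel and wait for all of them to finish.
# stderr is merged into stdout so the error messages can be piped onward.
flood_host() {
    host=$1
    attempts=${2:-50}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # BatchMode=yes: never prompt for a password.
        # ConnectTimeout=10: give up on a connect that takes over 10 seconds.
        ssh -o BatchMode=yes -o ConnectTimeout=10 "$host" true 2>&1 &
        i=$((i + 1))
    done
    wait
}

# Read ssh output on stdin and print how many lines carry the
# ssh_exchange_identification error. grep -c exits non-zero when the
# count is 0, so "|| true" keeps the function's status clean.
count_exchange_errors() {
    grep -c 'ssh_exchange_identification' || true
}

# Usage (only against a host you are allowed to hammer):
#   flood_host myhost 500 | count_exchange_errors
```

Running it before and after a config change gives a quick before/after comparison of how many attempts were dropped.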
Looking through the man pages, I came across an sshd_config option called MaxStartups. Here’s what the man page had to say:
MaxStartups
Specifies the maximum number of concurrent unauthenticated
connections to the sshd daemon. Additional connections will be
dropped until authentication succeeds or the LoginGraceTime
expires for a connection. The default is 10.
This sounded like a totally plausible cause for the issue. When I flooded the box, requests came in so fast that sshd couldn’t keep up with authentications, so it would reach the MaxStartups limit and start dropping connections (to protect against DoS attacks, etc). To test it out, I changed MaxStartups from 10 to 200, ran my flood loop again, and sure enough, was able to connect without any errors. Perfect.
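Before changing anything, it helps to see what a given config file actually sets. Here is a minimal sketch of a checker; get_max_startups is a hypothetical helper, the /etc/sshd_config path in the usage line is an assumption about where your config lives, and the fallback of 10 is simply the default quoted from the man page above. It takes the first uncommented occurrence and only understands the plain "MaxStartups value" form.

```shell
#!/bin/sh
# Hypothetical helper: print the MaxStartups value from an sshd_config-style
# file, or 10 (the default from the man page excerpt) when it is not set.
get_max_startups() {
    config=$1
    # Keywords are case-insensitive; commented lines start with "#" and
    # therefore never match. Only the first match is printed.
    value=$(awk 'tolower($1) == "maxstartups" && !found { print $2; found = 1 }' "$config")
    echo "${value:-10}"
}

# Usage (path is an assumption, adjust for your system):
#   get_max_startups /etc/sshd_config
```

After editing the file, sshd still has to be restarted or told to reload its configuration before the new value takes effect.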
However, how come we hadn’t run into it before? Did it have something to do with us running sshd as a standalone daemon instead of via launchd? The standalone configuration is new for us, so we’re still keeping our eye out for bugs. To test, I ran my same flooding loop on an OS X Server machine running the system default (via launchd) sshd configuration. Sure enough, the issue did NOT present itself for that machine. Interesting.
Now, why is this, you may ask? It has to do with the way the sshd daemon runs via launchd.
The launchd service is very similar to xinetd when it comes to running network services: launchd knows what network services are configured under it, and it knows the ports those services should accept connections on. So, launchd itself will listen on those ports. When a new connection request comes in (example: ssh), launchd will see it’s for port 22/sshd, fork off an sshd parent, and hand your connection off to that parent. If another connection comes in, launchd will fork off yet another sshd parent, and that 2nd user will get the 2nd parent. See the issue?
The MaxStartups option in sshd_config relies on a single sshd parent handling all incoming connections and forking off a child for each new one. If the parent sees that >= MaxStartups children are in an unauthenticated state, it will drop new connections instead of letting them establish.
In the case of launchd, however, each connection is a parent forked from launchd, so no parent will ever see more than 1 connection (since it will have no children). In that respect, it totally subverts the sshd_config option for MaxStartups, opening up your machine to DoS attacks via sshd floods. Awesome.
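To make the difference concrete, here is a toy model of the two setups. It is a simplification for illustration, not sshd's actual logic: simulate_burst is a hypothetical function, and it assumes that nobody finishes authenticating while the burst is arriving.

```shell
#!/bin/sh
# Toy model: feed a burst of connections to either a standalone sshd
# (one parent that counts all unauthenticated children) or a launchd-style
# setup (a fresh sshd per connection that sees no siblings).
simulate_burst() {
    mode=$1       # "standalone" or "launchd"
    incoming=$2   # number of connections in the burst
    max=$3        # MaxStartups value
    pending=0
    accepted=0
    dropped=0
    n=0
    while [ "$n" -lt "$incoming" ]; do
        if [ "$mode" = "standalone" ]; then
            seen=$pending   # the parent knows about every pending child
        else
            seen=0          # each new sshd only knows about itself
        fi
        if [ "$seen" -ge "$max" ]; then
            dropped=$((dropped + 1))
        else
            accepted=$((accepted + 1))
            pending=$((pending + 1))
        fi
        n=$((n + 1))
    done
    echo "accepted=$accepted dropped=$dropped"
}

simulate_burst standalone 500 10   # prints: accepted=10 dropped=490
simulate_burst launchd 500 10      # prints: accepted=500 dropped=0
```

In the launchd case the limit is never consulted in any meaningful way, which is exactly why the flood sailed through on the default configuration.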
In any case, now that we understand the issue, how come it was presenting itself under a scenario where people weren’t trying to flood it with ssh connections? The answer lies somewhere between system load, DirectoryServices, and the pam_securityserver.so PAM module.
When an ssh request comes in, and PAM is enabled, sshd will use the options defined in the /etc/pam.d/sshd file to determine what needs to be done to authorize the user. In our case, we have this configuration (slightly modified from the default OS install, since we’re not running via launchd):
[chet@myhost]$ cat /etc/pam.d/sshd
# sshd: auth account password session
auth required pam_nologin.so
auth optional pam_afpmount.so
auth sufficient pam_securityserver.so
auth sufficient pam_unix.so
auth required pam_deny.so
account required pam_securityserver.so
password required pam_deny.so
session required pam_permit.so
session optional pam_afpmount.so
Looking at the file, we notice a couple of entries for pam_securityserver.so. This module is actually the bread and butter of the authentication process: it queries the local DirectoryServices agent running on the machine to do the user authentication, and will receive a success or failure from the agent. The DirectoryServices agent can be configured to authenticate for local users, remote users from an LDAP directory server, or a mix of both. It’s versatile, but it does a good amount of work. When it is bound to a remote LDAP server, it can also be very slow (depending on whether you’re caching account results or not).
Our issue was two-fold: at the time, the host was heavily loaded with some CPU and network intensive processes. The CPU intensive processes were slowing down the communication between pam_securityserver.so and DirectoryServices, and the network intensive processes were slowing down the communication between DirectoryServices and our remote LDAP server. Between the two, authentications became super slow, and the number of incoming, unauthenticated connections was slowly creeping up (especially due to the scripted cron-job logins that had no logic to drop a hung login, awesome).
However, at that time, things were actually still okay. Slow, but people could still get in. Unfortunately (or luckily), someone sent out a chat message saying the host was acting slow, which of course prompted everyone to try logging in at the same time. At that point, the MaxStartups limit of 10 was hit almost immediately (i.e., 10 people tried logging in, and since authentications were taking forever, all were stuck as unauthenticated), so every subsequent login immediately got the ssh_exchange_identification error. This was a nice bit of detail to note: the sshd daemon itself was still very quick to respond, but as soon as it had to hand an authentication off to PAM, it just sat around waiting.
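The arithmetic behind that pile-up is worth spelling out. As a rough steady-state estimate, the number of connections stuck in the unauthenticated state is the login arrival rate multiplied by how long each authentication takes. The helper below is hypothetical and the figures in the examples are made up for illustration, not measurements from our host.

```shell
#!/bin/sh
# Rough estimate: pending = arrival rate x time spent unauthenticated.
# Uses integer arithmetic, so the result is rounded down.
pending_unauthenticated() {
    logins_per_minute=$1
    auth_seconds=$2
    echo $(( logins_per_minute * auth_seconds / 60 ))
}

pending_unauthenticated 6 5     # healthy: prints 0 (well under the limit)
pending_unauthenticated 6 120   # slow auth: prints 12 (past a limit of 10)
```

The login rate is the same in both examples; only the authentication time changed. That is how a box that nobody is flooding can still run into MaxStartups.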
Ultimately, the resolution was to increase MaxStartups on our heavily-loaded machines (which is still better than the un-regulated launchd sshd), and we will be setting up a separate server to handle all the jobs that were running.
Looking at other options, I noticed while writing this that pam_securityserver.so comes BEFORE pam_unix.so. As a test, I may try using pam_unix.so first, allowing root logins to bypass DirectoryServices, hitting passwd/shadow directly, and hopefully allowing us to jump in on what seems like a hung system due to DS being super slow. I’ll try to get back with an update if I end up testing this.
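For reference, the reordered auth section I have in mind would look something like the fragment below. It is untested, and it is nothing more than the two "sufficient" lines from the file above swapped; whether pam_unix.so behaves well in front of pam_securityserver.so on a given OS X release is an open question.

```
# Untested sketch: pam_unix.so moved ahead of pam_securityserver.so
auth required pam_nologin.so
auth optional pam_afpmount.so
auth sufficient pam_unix.so
auth sufficient pam_securityserver.so
auth required pam_deny.so
```

The account, password, and session lines would stay as they are. Keep a root shell open while testing any PAM change, so a mistake doesn't lock you out.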
That’s all from me for now. Feel free to comment on your own experiences, knowledge, other ideas, or if you’ve run into the same thing. Enjoy!




Thanks for the explanation, that is pretty cool. Now is there a way for an admin to be able to extract a metric on the number of these sessions in preauthenticated state, for analysis, trending, and/or alarming? I am imagining a stats dump like named does when you send a SIGUSR1 (IIRC) to it.
Keep it up!
–N.
You can also accidentally hit that basic issue when you’re using ssh-keyscan to set up known-hosts for a Hadoop account. At least if you use a parallel ssh script to start off all the keyscanning. Ask me how I know…
hahah, nice! i’ll keep that one in mind when the time comes