HN Gopher Feed (2017-12-29) - page 1 of 10 ___________________________________________________________________
Escaping Docker container using waitid() - CVE-2017-5123
283 points by Da5hes
https://www.twistlock.com/2017/12/27/escaping-docker-container-u...
___________________________________________________________________
cirowrc - 6 hours ago
In the article the author states:

> CVE-2017-5123 was published earlier this year on Oct 12 - it was a
Linux kernel vulnerability in the waitid() syscall for 4.12-4.13
kernel versions.

Does this mean that kernel versions prior to 4.12 are not affected?
That's what I understood from the related issue in the bug tracker:
https://bugzilla.redhat.com/show_bug.cgi?id=1500094

By the way, this is very important:

> In 2017 alone, 434 linux kernel exploits where found, and as you
have seen in this post, kernel exploits can be devastating for
containerized environments. This is because containers share the
same kernel as the host, thus trusting the built-in protection
mechanisms alone isn't sufficient. Make sure your kernel is always
updated on all of your production hosts.

Great article!
Da5hes - 6 hours ago
that's correct, versions prior to 4.12 are not affected
userbinator - 6 hours ago
This is 4.11:
https://elixir.free-electrons.com/linux/v4.11/source/kernel/...

The code is significantly different but I still see a lack of
access_ok(), so was the checking performed somewhere else that I
didn't notice (I haven't looked closely at this part of the kernel
before)?
[deleted]
Da5hes - 5 hours ago
it is the use of unsafe_put_user() without access_ok(), not the
lack of access_ok() alone
icebraining - 5 hours ago
IIUC, you only need the access_ok() when using the new
unsafe_put_user(). That code is still using put_user().
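The distinction drawn in the two comments above can be illustrated with a small userspace model. This is only a sketch and not the kernel's implementation: `model_access_ok`, `model_put_user`, `model_unsafe_put_user`, the flat `model_mem` array and the 128-byte user/kernel split are all hypothetical names and values chosen for illustration. It assumes, as the commenters describe, that put_user() validates the destination itself while unsafe_put_user() leaves that to an earlier access_ok() call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MEM_SIZE   256u
#define USER_LIMIT 128u  /* hypothetical split: offsets below this are "user" memory */

/* Flat array standing in for the whole address space (illustrative only). */
uint8_t model_mem[MEM_SIZE];

/* Model of access_ok(): is [off, off + len) entirely inside the user half? */
int model_access_ok(size_t off, size_t len)
{
    return len <= USER_LIMIT && off <= USER_LIMIT - len;
}

/* Model of put_user(): validates the destination itself before writing. */
int model_put_user(uint8_t val, size_t off)
{
    if (!model_access_ok(off, 1))
        return -1;  /* stands in for -EFAULT */
    model_mem[off] = val;
    return 0;
}

/* Model of unsafe_put_user(): writes with no range check of its own.
 * The caller is expected to have called model_access_ok() first. */
void model_unsafe_put_user(uint8_t val, size_t off)
{
    assert(off < MEM_SIZE);  /* keeps the model in bounds; not a security check */
    model_mem[off] = val;
}
```

In this model, `model_put_user(7, 200)` is refused, while `model_unsafe_put_user(7, 200)` writes into the "kernel" half unless the caller checked `model_access_ok(200, 1)` beforehand.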
benevol - 4 hours ago
> Make sure your kernel is always updated on all of your
production hosts.

And in order to avoid any zero-day exploits, always use dedicated
machines, never use a VPS server.
oblio - 4 hours ago
I'm not sure I get this - are you saying that you are more at
risk due to the VM host layer?
openasocket - 3 hours ago
Personally, I'd reasonably trust Xen or KVM or something else
with hardware-based virtualization and the like to protect me
in a multi-tenancy scenario. Much less so in the case of
Docker. Sharing a full kernel with potentially malicious
actors is more risky than sharing a hypervisor: there is much
more surface area for attack.
TheDong - 6 hours ago
That range isn't quite correct. It only impacted 4.13 kernels,
and only 4.13.0-4.13.6 (inclusive, distro-dependent due to
backports).

It was patched in 4.13.7 after being introduced in the 4.13.0 merge
window.

See https://lwn.net/Articles/736348/

This issue shouldn't have happened at all, but it was caught and
patched very quickly, so relatively few real-world systems are or
were affected.
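The range given in the comment above can be written down as a simple check. This is a minimal sketch: the function name `waitid_bug_in_upstream` is hypothetical, it takes the 4.13.0-4.13.6 range from the comment at face value, and it looks only at upstream version numbers, so a distro kernel carrying a backported fix (or a backported bug) would be misclassified.

```c
#include <assert.h>
#include <stdbool.h>

/* True for upstream kernels 4.13.0 through 4.13.6 inclusive, the range
 * stated in the comment above. Distro kernels with backports are
 * deliberately not modeled. */
bool waitid_bug_in_upstream(int major, int minor, int patch)
{
    assert(major >= 0 && minor >= 0 && patch >= 0);
    return major == 4 && minor == 13 && patch <= 6;
}
```

For example, `waitid_bug_in_upstream(4, 13, 6)` is true while `waitid_bug_in_upstream(4, 13, 7)` and `waitid_bug_in_upstream(4, 12, 14)` are false.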
cirowrc - 6 hours ago
Thanks!
Da5hes - 6 hours ago
it was introduced by this commit:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...
which by itself is in a 4.12 vanilla code tree
TheDong - 5 hours ago
That commit is correct (4c48abe91be03d191d0c20cc755877da2cb35622),
but it was not in 4.12 as cut by Linus.

https://github.com/torvalds/linux/commit/4c48abe91be03d191d0...
(click the little '...' to expand the tags it's in) or:

    $ git tag --contains 4c48abe91be03d191d0c20cc755877da2cb35622
    v4.13

What is your methodology that gets that it is in the 4.12 tree?
Da5hes - 5 hours ago
you are right, i actually didn't check on git, my bad
hacknat - 4 hours ago
This exploit is interesting, but if you are doing container
security correctly it's actually not a big deal. In particular if
you are setting per-container usernamespaces, like you ought to be,
then this exploit doesn't do anything. In fact you can actively
give a usernamespaced container any CAPs you want, because they are
isolated to that container's uid:gid offset.

Obviously, giving containers unnecessary CAP privileges is unwise,
but if you are practicing sound security best practices then there
would be multiple layers of defense between you and this CVE. I
think a strong AppArmor profile and SecComp profile would also make
this CVE moot.

Edit: Also, this exploit relies on you being able to fork up to a
certain pid value. You can and should take advantage of Linux's
per-cgroup ulimit functionality. No container needs more than 255
threads (even if they do, you can make special exceptions for such
applications).

Edit2: Additionally this CVE relies on the getuid syscall being
available, there is no reason to give a container this syscall,
you should block it, ala this guide:
https://rhelblog.redhat.com/2016/10/17/secure-your-container...

I have to say I'm more than a little disappointed in Twistlock for
not pointing out what countermeasures you can employ against this
and other CVEs.
bmitch3020 - 2 hours ago
> In particular if you are setting per-container usernamespaces,
like you ought to be, then this exploit doesn't do anything.

User namespacing in docker is enabled at the daemon level, not per
container, so all containers share the same offset. This would
ensure that a root user in the container would escape to a
different uid on the host, but doesn't prevent someone from
moving sideways through the containers on the same host.

Note that enabling this will break the developer workflow of
mounting files from the host into the container. I believe files
will show up with the wrong ownership inside the container.
hacknat - 1 hours ago
You don't need to use containerd. Other runtimes make it
possible (per-container offsets have been possible in runc for
over a year).
wahern - 1 hours ago
1) User namespaces don't magically protect you from a
vulnerability that allows writing to kernel memory. Neither would
AppArmor. seccomp could theoretically, but waitid is a pretty
fundamental Unix syscall and blocking it would break a lot of
basic software.

The author devised a particular exploit, but his example was hardly
the only way to leverage the vulnerability. Being able to write to
kernel memory is about as huge a vulnerability as you can get. Just
because you can't think of a way to leverage a vulnerability
doesn't mean an attacker can't; your failure of imagination is not
evidence that it cannot be done.

2) Plenty of containers need more than 255 threads. Like, pretty
much any Java server. In any event, this particular exploit doesn't
necessarily require hundreds or thousands of simultaneous
processes.

3) Blocking getuid is even worse than blocking waitid. Block getuid
and you'll break glibc and god knows what. In any event, it would
be futile as the real and effective UIDs are passed to the process
through the auxiliary process vector when the kernel executes the
process.

4) You're missing the forest for the trees. The real moral of the
story is this: "In 2017 alone, 434 linux kernel exploits where
found". Unless you're prepared to pore over every published
exploit, 24/7, meticulously devise countermeasures, and be prepared
to run effectively crippled software, you really shouldn't be
relying on containers to isolate high-value assets. I wouldn't rely
on VMs, either, as the driver infrastructure of hypervisors has
also proven fertile ground for exploits.
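Point 3 above, that a process can learn its UIDs without ever calling getuid(), can be sketched with glibc's `getauxval()`. This assumes Linux with glibc 2.16 or later; the helper names `uid_from_auxv`, `euid_from_auxv` and `auxv_matches_syscalls` are made up for illustration. The values reflect the moment of exec, so they would go stale if the process changed its IDs afterwards.

```c
#include <stdbool.h>
#include <sys/auxv.h>   /* getauxval(), AT_UID, AT_EUID (Linux, glibc >= 2.16) */
#include <sys/types.h>
#include <unistd.h>

/* Real UID as the kernel recorded it in the auxiliary vector at exec time. */
uid_t uid_from_auxv(void)
{
    return (uid_t)getauxval(AT_UID);
}

/* Effective UID from the same source. */
uid_t euid_from_auxv(void)
{
    return (uid_t)getauxval(AT_EUID);
}

/* True when the auxv values agree with the getuid()/geteuid() syscalls,
 * which is expected unless the process changed IDs after exec. */
bool auxv_matches_syscalls(void)
{
    return uid_from_auxv() == getuid() && euid_from_auxv() == geteuid();
}
```

Blocking the getuid syscall with a seccomp filter would not stop the first two helpers from working, since they read data already placed in the process's memory.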
hacknat - 1 hours ago
It's not a full kernel memory CVE, you have +-255 bytes access
to kernel memory from the cred pointer. I have no idea if that
extends to userns or not. Also I think you're confusing Java
threads for system threads, they are not the same.

I think you're being overly alarmist. You have to trust someone
else's code at some point, otherwise you'll be paralyzed by
non-productivity.
geofft - 1 hours ago
> It's not a full kernel memory CVE, you have +-255 bytes access to
kernel memory from the cred pointer. I have no idea if that extends
to userns or not.

As I understand it, a kuid_t is the UID in the root namespace, so
setting your cred->uid to 0 gets you considered as equivalent to
root in the container host.

Also, don't think that limited exposure to kernel memory saves
you - take a look at the sudo "vudo" exploit from 2001, in which a
single byte that was erroneously overwritten with 0, and then put
back, turned out to be exploitable.
http://phrack.org/issues/57/8.html (And in general, don't confuse
the lack of public existence of an exploit with a proof that a
thing isn't exploitable in a certain way.)

> Also I think you're confusing Java threads for system threads,
they are not the same.

Current versions of the HotSpot JVM (where by "current" I mean
"since about 1.1") create one OS thread per Java thread:
http://openjdk.java.net/groups/hotspot/docs/RuntimeOverview....
"The basic threading model in Hotspot is a 1:1 mapping between Java
threads (an instance of java.lang.Thread) and native operating
system threads. The native thread is created when the Java thread
is started, and is reclaimed once it terminates." Plus there are
some other OS threads for the runtime itself.

> I think you're being overly alarmist. You have to trust someone
else's code at some point, otherwise you'll be paralyzed by
non-productivity.

Sure, but you can choose which code to trust, and how to structure
your systems to take advantage of the code you trust and not the
code you don't. Putting mutually-distrusted things on physically
separate Linux machines on the same network is a pretty good
architecture: I trust that the Linux kernel is relatively low on
CVEs that let TCP packets from a remote machine overwrite kernel
memory.
twic - 1 hours ago
> Also I think you're confusing Java threads for system threads,
they are not the same.

Oh? On mainstream JVMs, a Java thread is the same as the thing you
could create with pthreads. What do you mean by "system threads"?
wahern - 44 minutes ago
255 bytes is huge, though IIUC it's actually less than that.
Nonetheless, it's much more than is typical. Sometimes these
holes are limited to a single word, and only a single value
for that word (like NULL), and attackers still come up with
marvelously devious exploits.

The critical vulnerability is that the cred pointer address is
entirely under your control, so you get to poke at whatever kernel
memory you want. The limitation is 1) locating the address of what
you want to poke, and 2) being limited to a smallish range of
values that you can write out.

Also, I'm not confusing Java threads with system threads. Most JVMs
use a 1:1 threading model. And because on Linux a thread is just a
process (which unfortunately still causes headaches with things
like signals, setuid, etc), each thread has its own PID.

I'm not being alarmist, just realistic. Nobody is going to stop
using Linux anytime soon. Nor am I. But the fact of the matter is
that the Linux kernel is riddled with vulnerabilities. Something
like the waitid vulnerability comes along at least 3 or 4 times a
year, and that's just the published ones. (IMO, part of the reason
is precisely because of complex features like user namespaces,
which add tremendous complexity to the kernel. But that's a
contentious point.)

At least for high-value assets (however you want to define that),
people should just treat Linux as if it lacks secure process
isolation entirely, absent a commitment to herculean efforts:
extremely locked down seccomp, PNaCl-like sandboxing, etc for all
your code that juggles tainted data. Even then, vulnerabilities
like recvmmsg come along and ruin your day, but those are rare
enough that it would be unfair to single-out Linux.

Not only is that pragmatic and reasonable, after 25 years of
endless vulnerabilities of this sort I wouldn't trust the judgment
of anyone who thought otherwise. And for what it's worth, I'd make
much the same point about Windows, although I have much less
experience on that platform.
borplk - 3 hours ago
> but if you are doing container security correctly ...

The container that wasn't!

(I get the gist of it, just tongue in cheek)
geofft - 1 hours ago
> Edit2: Additionally this CVE relies on the getuid syscall being
available

This exploit relies on it. The vulnerability does not. The exploit
happens to use getuid() along the way to using heap spraying, but
the writeup is pretty clear that neither getuid() nor heap spraying
is required.
hacknat - 1 hours ago
Yeah and I'm wrong about that part anyways. You can't cap out
or block getuid without breaking glibc. I meant setuid, but
that call isn't used in this exploit. I got confused.
jwilk - 3 hours ago
> Additionally this CVE relies on the getuid syscall being
available, there is no reason to give a container this syscall,
you should block it,

Huh? Lots of legitimate things will break without working getuid().

> you should block it, ala this guide

getuid() doesn't require any capabilities, so it can't be blocked
by taking them away.
hacknat - 1 hours ago
Oops good call. I meant setuid, but either way I was wrong.
oblio - 4 hours ago
As somewhat of a container noob, could you expand on
"per-container usernamespaces"?
baq - 4 hours ago
Follow-up question: And why doesn't docker do that by default?
andbberger - 2 hours ago
Maybe because it breaks things?

I just enabled user namespaces after reading this post. Broke
Jenkins and there doesn't appear to be an easy solution. I mount
the docker socket in the Jenkins container, which is not an option
with user namespaces as the user Jenkins now runs as does not have
permission to access the socket.

It seems to be possible to provide this user access to the socket
through a socket proxy, but since all containers use the same user
this seems to defeat the purpose of using namespaces in the first
place.

Cherry on top: although `docker run` supports running containers
with custom userns settings, docker swarm, which I use to run
Jenkins, does not.

So as far as I can tell my only options are:

1. Go back to not using user namespaces
2. Make the docker daemon on the host available over HTTP, which is
   really something I was trying to avoid...

Anyone have a more elegant solution?
zenlikethat - 17 minutes ago
Mm, if you're bind mounting in the Docker socket, enabling
user namespaces won't help much. You just have to deal with
the fact that you have a privileged container (Docker API
access == root, at least unless you're using authz). It'd
be nice if we could see more RBAC around Docker API so you
could do things like "grant only permission to run this one
container".
andbberger - 1 minutes ago
Totally. But the vast majority of containers I use do not
get a bind mount to the Docker socket... for which user
namespaces would be a very nice feature.
LaGrange - 37 minutes ago
As far as I remember things, because it breaks overlay
filesystems, which are a major space saver in Docker world.
Something might have changed, but last time I checked, you
couldn't "offset" uids/gids on a filesystem overlay, so every
layer of the container would have to be copied and chowned
(slowly).

This would obviously only work for minimal containers (i.e. ones
that don't contain a distribution), but software has to be pretty
much built for such a case (e.g. statically linked, no dependencies
on common tooling - popular with Go, but your Python application
won't work; edit: unless you copy all the layers, that is).

You can read the docs here:
https://docs.docker.com/engine/security/userns-remap/#prereq...
Note that it stores image/container layers in subdirectories under
/var/lib/docker.

Tl;dr: user namespaces are inherently incompatible with many of the
usability features Docker brings over other solutions, while
they're not particularly useful for many popular use cases (no
shared hosting, minor differences in consequence between escalating
to the root of the container and its host - though that's an
assumption frequently wrongly made).
zenlikethat - 5 minutes ago
Also, people hold their bind mounts to the host near and
dear, and user namespaces would break all kinds of things
people expect to "just work" with bind mounts. Having user
namespaces on by default would break tons of existing
scripts, Compose/Kube files, etc. that do things like mount
/var/lib/mysql into the container for persistence.
fpoling - 1 hours ago
I tried it a couple of months ago. It immediately broke the build
of one of the images. It was a known bug. So I guess I'll just
wait one more year to try.

In the meantime, I make sure that all my containers run as non-root
with max security restrictions. The exception so far was sshd from
OpenSSH, mostly due to incorrect porting from OpenBSD in portable
ssh.
hacknat - 3 hours ago
Because the Docker project doesn't make money off of
security. It is actually quite infuriating, because they have
become the de facto container image standard. Most of their
security has actually come from Twistlock (I am not a
Twistlock employee, FYI). My recommendation to most Admins or
Devs that are serious about container security is to let your
developers use docker, but run your images with CRI-O on your
servers: http://cri-o.io/
nvarsj - 3 hours ago
CRI-O is bleeding edge. I'm not sure it's ready for
production usage. But it looks very promising. The sooner
we can all dump docker in kubernetes the better.
fpoling - 1 hours ago
Docker excels in image building, especially now with multi-image
Dockerfile support. It seems all those alternatives to Docker just
gave up on providing anything on their own. The documentation
typically starts with "let's pull a docker image".

On the other hand, the container runtime is straightforward. I
recently discovered that one can run a docker image with a bash
script and the unshare command and get a very tight security
setup. That explains the proliferation of various alternatives to
Docker to run its images.
zenlikethat - moments ago
Nah, Docker has taken lots of strides in the right
direction for security over the years, Twistlock or not,
albeit with a few weird dangling remainders. They'd love to
turn user ns on by default but it'd break lots of existing
stuff. Many more users would be mad about having that on by
default than leaving it off.

Disclaimer: I worked at Docker for 3 years.
cpuguy83 - 2 hours ago
There are trade-offs to using userns and many ppl don't
like the current set of trade-offs. In addition, changing a
default like this is a breaking change. Admins can enable
userns by default in a daemon, but making it a hard-coded
default is much more difficult.

It's not just a matter of enabling user ns. There is no support at
the vfs layer for uid/gid mapping. This means in order to use it,
images must be chowned with the remapped IDs. Per-container
mappings are not supported for this reason (it would require
copying and chowning the entire image for each container mapping).

Do you care to qualify your statement about CRI-O?
ecnahc515 - 12 minutes ago
I recall seeing some patches submitted to make it
possible to pass a uid/gid offset to the mount syscall
at one point when people were implementing usernamespaces
for container runtimes like docker. So is this fixable
without having to make every file system implement this
feature, or is there something else holding back better
support for doing uid shifting for use with user
namespaces?
cpuguy83 - 7 minutes ago
That has not been accepted into the kernel. It's called
"shiftfs", which basically lets you perform the uid/gid
shift on mount.
eikenberry - 2 hours ago
Viewing docker containers as anything more than a bundling
and deployment system is a mistake. While they might help
with security they will never be completely secure and you
should architect your deployments with that in mind.

Unless you are a giant enterprise shop with the resources to staff
a decent sized K8s team, you should use the hosted solutions.
raesene9 - 2 hours ago
The docker project has done quite a bit of work on
container security, so I'm not sure it's entirely fair to
say that them not enabling user namespaces by default is for
that reason.

For example, the work that was done on their seccomp filtering and
apparmor profiles.

My guess would be that it's because user namespacing can introduce
some issues (e.g. when mounting host filesystems), so they've
decided the trade-offs aren't worth it.

Also, looking at the CRI-O trello
(https://trello.com/c/Ak2yMcpf/714-epic-cri-o-support-for-use...)
it seems like user namespaces aren't even an option there yet?
hacknat - 4 hours ago
Sure. User namespacing is a feature of container security that
lets you grant a process root access inside its container's
filesystem while the process itself is not root on the host. To the
running process it appears that it is or can run as root, but on
the host it actually isn't root, but some uid:gid offset. Here's an
article explaining more:
http://man7.org/linux/man-pages/man7/user_namespaces.7.html

The gist is that a container is further sandboxed by the kernel, in
a way that is agnostic of the higher level security precautions.
It's not perfect by itself, but used in conjunction with other
features like AppArmor or SELinux and SecComp it can make a
container virtually sandboxed.
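The "uid:gid offset" idea can be sketched as a small translation function, following the three-field layout (ID inside the namespace, ID outside, length) that the linked man page describes for /proc/[pid]/uid_map. This is an illustrative model rather than kernel code: `struct id_map` and `map_to_host` are hypothetical names, and the numbers in the usage note below (offset 100000, length 65536) are example values only.

```c
#include <stdbool.h>
#include <stdint.h>

/* One mapping line: inside-start, outside-start, length. */
struct id_map {
    uint32_t inside_start;   /* first ID as seen inside the namespace */
    uint32_t outside_start;  /* first ID as seen on the host */
    uint32_t length;         /* number of consecutive IDs mapped */
};

/* Translate an ID seen inside the namespace to the host ID.
 * Returns false (and leaves *host untouched) if the ID is unmapped. */
bool map_to_host(const struct id_map *m, uint32_t inside, uint32_t *host)
{
    if (inside < m->inside_start || inside - m->inside_start >= m->length)
        return false;
    *host = m->outside_start + (inside - m->inside_start);
    return true;
}
```

With a mapping of `{0, 100000, 65536}`, root (UID 0) inside the container corresponds to UID 100000 on the host, UID 1000 corresponds to 101000, and UID 65536 is not mapped at all.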
quotemstr - 41 minutes ago
> No container needs more than 255 threads

> Additionally this CVE relies on the getuid syscall being
available, there is no reason to give a container this syscall,

The problem with MAC schemes is that, in practice, they lead to
security people imposing random and arbitrary restrictions on
general APIs in the name of least privilege. In doing so, they
break the orthogonality of general-purpose platform concepts and
break the reductive mental model necessary to get anything done.
It's a misunderstanding of what least privilege actually means.

Security is better achieved by creating clear, principled security
domains and boundaries, then controlling access to these domains in
a general and transparent way. Saying "you, unix process, you can
call system call X, but not system call Y, because in my opinion,
Y is risky", when neither X nor Y breaks through a security
domain, is bad practice. So is arbitrarily capping the number of
threads in a container.
eikenberry - 2 hours ago
> but if you are doing container security correctly

Doing it correctly should be the case using the default settings.
Defaulting to an insecure setup is a bug.
dvdhnt - 2 hours ago
Hmm. Perhaps this is a difference between dev and ops, but
almost every tool we use comes out of the box with settings
unfit for production. Instead, they're tuned for development,
and in some cases, deploying to a staging environment. At
least, this has been the case in my experience.
sverhagen - 2 hours ago
Ah, dev... ops... How about DevOps? As (originally) a dev I
bring my app to production. How do I stand a fighting chance
to reconfigure the defaults in the way you suggest, without
suddenly gaining a whole new set of skills? Good defaults
would be helpful, even if they're very conservative. I can
break things open, but at least then I know what to read up
on.
kemitche - 2 hours ago
100% agree here. Ship with secure settings by default and
have simple "developer guides" that show what to crack open
for easier use in non-production environments.
bacongobbler - 1 hours ago
I would argue the opposite. As a developer I want to have
tools that make my life easier to - you guessed it -
develop. Enabling unnecessary secure defaults that either
hinder or don't apply to my use case is silly.

There's a reason most users choose Ubuntu over OpenBSD as their
workstation. I would put good money on the reason being that it's
"secure enough" without getting too heavy-handed on production use
cases.

However, I do agree that there has to be a balance. Most tooling I
write tends to lean more towards the "good user experience" side
first, and then document the production use case. Either that or
release two separate (but similar) products: one for developers,
one for operations teams. Docker's doing that with the Community
Edition/Enterprise Edition, but I still think the Community Edition
is far too heavy-handed when it comes to things like pulling images
from "insecure" registries.
crb002 - 6 hours ago
Linus really needs to start having more formal verification around
patches.
Matt3o12_ - 5 hours ago
This might be a bit off topic but I wonder why the vulnerability
has been patched this way:

    if (!access_ok(VERIFY_WRITE, infop, sizeof(*infop)))
        goto Efault;

Why doesn't the if use curly brackets? I thought it has been
established that it is best practice to always use curly brackets
even when they are optional, especially after Apple's infamous goto
bug[1].

Secondly, why does it use goto at all? I thought it has also been
established not to use goto unless it is the only performant
solution (and performance is important in that case). Sure, Efault
will probably kill the program, but wouldn't it still be better to
use a function call considering that the desired resolution should
be the same?

[1]: https://www.imperialviolet.org/2014/02/22/applebug.html
chowyuncat - 5 hours ago
The kernel style guide mandates no curly brackets in this case.
https://www.kernel.org/doc/html/v4.10/process/coding-style.h...
[deleted]
heinrichhartman - 5 hours ago
1. Linux kernel coding style is documented here [1], and contains
the line:

> Do not unnecessarily use braces where a single statement will do.

2. There is no built-in exception handling in C.
`goto ERROR_HANDLING_CODE` is a common and well established pattern
to handle exceptions in C, see e.g. [2].

[1] https://www.kernel.org/doc/html/v4.10/process/coding-style.h...
[2] https://news.ycombinator.com/item?id=3883310
[deleted]
umanwizard - 5 hours ago
> I thought it has also been established not to use goto unless
[...]

"Established" by whom? Certainly not by kernel developers - `goto`
is very common in all kernels I have looked at (xnu, Linux, bsd)
[deleted]
drchickensalad - 4 hours ago
And it has quite a consensus as the best solution to this
problem. Goto being considered harmful is a generally true
statement. However, this usage is a more specific exception to
the rule, with objective benefits vs alternatives.
efficax - 2 hours ago
goto is kind of a necessity in any complex C codebase unless you
want to duplicate tons of code. Sometimes you need to jump way
out of the context, especially to handle errors, and C does not
have "exceptions" (although you could do them with
setjmp/longjmp)
adam-ff - 4 hours ago
Note: The present solution replaced the goto with a return:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...
icebraining - 4 hours ago
From your link:

> Maybe the coding style contributed to this by allowing ifs
without braces, but one can have incorrect indentation with braces
too, so that doesn't seem terribly convincing to me.

By the way, gcc now has -Wmisleading-indentation, which is
activated by default if you enable -Wall:
https://news.ycombinator.com/item?id=10875449
marcoperaza - 4 hours ago
goto is the best way to do early exit and cleanup from a C
function. The alternatives are 1) deeply nested if's, one level
for each function call that can return an error code, or 2)
repeat the same cleanup code over and over again at every exit
point.
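The early-exit-and-cleanup pattern described above can be sketched in ordinary userspace C. The function `duplicate_twice` is a made-up example, not kernel code: it acquires two resources in order and uses labels at the bottom so that each failure path releases only what has already been acquired, without nesting or repeated cleanup code.

```c
#include <stdlib.h>
#include <string.h>

/* Copies src into two freshly allocated buffers.
 * Returns 0 on success (caller frees *out_a and *out_b), -1 on failure. */
int duplicate_twice(const char *src, char **out_a, char **out_b)
{
    char *a;
    char *b;
    size_t len;

    if (src == NULL || out_a == NULL || out_b == NULL)
        goto fail;

    len = strlen(src) + 1;

    a = malloc(len);
    if (a == NULL)
        goto fail;

    b = malloc(len);
    if (b == NULL)
        goto free_a;  /* undo only what has been acquired so far */

    memcpy(a, src, len);
    memcpy(b, src, len);
    *out_a = a;
    *out_b = b;
    return 0;

free_a:
    free(a);
fail:
    return -1;
}
```

Adding a third resource would mean one more label above `free_a`, rather than another level of nesting or another copy of the cleanup code at each exit point.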
anonfunction - 4 hours ago
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...
ttul - 4 hours ago
OpenBSD has randomized pids since the dawn of time. Why has Linux
not taken this basic step to improve security?
hacknat - 4 hours ago
Randomized pids wouldn't necessarily help that much in this
situation, especially if the getuid syscall is available.
However, I agree with your general sentiment that there are basic
security features that Linux could implement to make a lot of
CVEs impotent. I think the community is coming round, but this
stuff takes more work than most people may realize.
alpb - 5 hours ago
Any ideas why this is branded as "Docker"? Are the same namespacing
constructs not being used by other Linux container runtimes? I
think this should be titled "Escaping Linux containers" as docker
is not at fault here?
derefr - 5 hours ago
"Docker" isn't at fault here either way, as Docker isn't its own
"execution driver" (in Docker parlance) any more; that would be
https://github.com/opencontainers/runc

But to answer the spirit of your question, each container runtime
uses its own peculiar
combination of such constructs. It's helpful to know that this
attack allows you to break out of the combination used
specifically by runc, and thereby to break out of any system
relying on Docker (with the default runc execution driver.)
icebraining - 4 hours ago
I doubt any runtime would have prevented this bug; some have
seccomp-bpf profiles to blacklist some kernel operations, but I
doubt any block a function as basic as waitid().
jo909 - 5 hours ago
What do you mean by "branded"?

The author shows a concrete exploit of the kernel bug described in
CVE-2017-5123 as he has developed it in the context of the docker
container environment.

He shows how to use this bug to break out of docker, so he calls
the blog post "Escaping Docker ...".

Which is IMHO the most interesting container runtime to write such
an exploit for first because it is very widely deployed, but it
might also just have been what the author is most familiar with or
what was easiest to develop for him.
chowyuncat - 4 hours ago
Think of it this way: what if the author had titled it
"Escaping Ubuntu containers" ?
dchest - 4 hours ago
Why? The article demonstrates exploitation of Docker
containers.
oblio - 4 hours ago
Yeah, but is it limited to Docker containers? Can other
container types be attacked in the same way?
jo909 - 3 hours ago
It is a reasonable _assumption_ that other container
runtimes on linux might be affected by the same kernel
bug. The article does not explore that and the author has
no duty to do so just to avoid using a branded technology
name.

How would you reasonably talk about "Linux
containers" without having a very exhaustive list of all
existing implementations and testing all of them? If one
of them is not affected you are now factually wrong.
chowyuncat - 6 minutes ago
The exploit overwrites kernel memory credentials of a
task structure. That structure is the lynchpin of kernel
security, including SELinux.
dchest - 3 hours ago
Sure, you can write an article demonstrating exploitation
of Ubuntu containers and call it "Escaping Ubuntu
containers".
jo909 - 3 hours ago
"Ubuntu container" is just not a name typically used for
anything. "Docker container" is.

Let's make it realistic and say he had used RedHat OpenShift as his
target and example for the exploit. I'd be completely fine with the
title referencing that exact product by name.

Why would he have to
dance around what he is using in his demo? Maybe that
concrete product has multiple layers of security or lacks
them, or uses a certain version etc. He can only speak to
what he himself was using and testing. "Escaping Docker
container..." is the best short description (as you would
need it for a title) of this demo exploit I can think of.
shykes - 5 hours ago
Marketing.

There's a long tradition by enterprise vendors large and small to
market someone else's product as insecure, in order to create
demand for their "improved, secured" version.

In this particular instance, Twistlock is selling Docker security
by amplifying the meme of "insecure Docker". The Docker brand has
visibility with the target audience (Enterprise IT) so it's a good
target for this kind of piggyback.

This type of FUD marketing happens all the time in many different
markets, it's not specific to Docker.
cbisnett - 2 hours ago
Just to clarify the terminology here:

- A vulnerability is a software bug that has particular behaviors
  and ramifications that allow it to be used maliciously.
- An exploit is a crafted piece of input data that is designed to
  trigger a vulnerability to execute arbitrary code, crash the
  target (Denial-of-Service), etc.

> In 2017 alone, 434 linux kernel exploits where found, and as you
have seen in this post, kernel exploits can be devastating for
containerized environments.

There are a few places in the article like this one where the
correct terminology is vulnerability, not exploit. cvedetails.com
aggregates vulnerabilities. Places like exploit-db.com aggregate
exploits people have written to take advantage of vulnerabilities
to enable them to perform some unintended action against the
target.

Edit: formatting
AgentME - 4 hours ago
Does this escape only work if they have root inside of the
container? I usually try to make it so my containers always contain
a non-root process as an extra layer of security.
hacknat - 3 hours ago
No, it doesn't matter. If they have waitid and getuid then they
are off to the races.
saagarjha - 2 hours ago
In case the article author is here: The code snippets given aren't
escaped, so &, <, > show up as HTML entities instead.
snvzz - 2 hours ago
> In 2017 alone, 434 linux kernel exploits where found, and as you
have seen in this post, kernel exploits can be devastating for
containerized environments. This is because containers share the
same kernel as the host, thus trusting the built-in protection
mechanisms alone isn't sufficient.
More than one kernel exploit _per day_. Exploiting Linux is just a
matter of finding one such vulnerability and using it. This can be
done in a single day. There's just no fixing megabytes of buggy
kernel code. It really drives home the need for a proper OS based
on a verified, capability-enabled microkernel such as seL4.
xi- - 58 minutes ago
> This can be done in a single day.
Only if you're lucky. Most of these exploits probably took weeks
to find and analyze properly; it's not like one person found more
than one a day. They're found because whole teams are working with
the linux kernel at the same time and either happen upon them or
actively look for them.
mehrdadn - 56 minutes ago
I'll surely get a lot of flak for this, but these kinds of bugs
would be trivial to avoid in C++. All you need is to make the
pointer arguments to syscalls be some other data type (say,
user_ptr) that performs an access-check upon conversion to a
raw pointer. Then the compiler simply wouldn't let you bypass the
access-check, so you simply could not forget to do so. That's the
fundamental difference between C++ and C: one of them actually
lets you write code that cannot contain many classes of mistakes,
and the other, well, doesn't. For the life of me I don't
understand the stubbornness behind sticking to the same languages
and tools from decades ago.
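A minimal user-space sketch of the wrapper type described in this comment, not kernel code. The names `user_ptr`, `checked_ptr`, `range_is_user`, `store`, and the `kUserLimit` boundary are all hypothetical, and the range check is a simplified stand-in for what access_ok() does. The point it tries to show is only the type-level one: the raw pointer is reachable solely through `check()`, so forgetting the check becomes a compile error.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <type_traits>

// Hypothetical boundary: addresses in [0, kUserLimit) count as "user space".
constexpr std::uintptr_t kUserLimit = 0x10000000;

// Simplified stand-in for an access_ok()-style range check (assumption,
// not the kernel's implementation). Written to avoid integer overflow.
inline bool range_is_user(std::uintptr_t addr, std::size_t len) {
    return addr < kUserLimit && len <= kUserLimit - addr;
}

template <typename T> class user_ptr;

// A pointer that has passed the check. Its constructor is private, so
// only user_ptr<T>::check() can create one.
template <typename T>
class checked_ptr {
public:
    T* get() const { return ptr_; }
private:
    explicit checked_ptr(T* p) : ptr_(p) {}
    T* ptr_;
    friend class user_ptr<T>;
};

// An unchecked address supplied by user space. It has no conversion to
// T*; the only way to obtain something writable is to call check().
template <typename T>
class user_ptr {
public:
    explicit user_ptr(std::uintptr_t addr) : addr_(addr) {}

    std::optional<checked_ptr<T>> check() const {
        if (!range_is_user(addr_, sizeof(T))) return std::nullopt;
        return checked_ptr<T>(reinterpret_cast<T*>(addr_));
    }
private:
    std::uintptr_t addr_;
};

// Writes are only defined for checked pointers, so passing a user_ptr
// here does not compile.
template <typename T>
void store(checked_ptr<T> dst, const T& value) { *dst.get() = value; }

static_assert(!std::is_convertible_v<user_ptr<int>, int*>,
              "an unchecked user pointer must not decay to a raw pointer");
```

With this shape, a handler that takes `user_ptr<int> p` has to write something like `if (auto ok = p.check()) store(*ok, 0);`, and `store(p, 0)` is rejected by the compiler. Whether this would fit the kernel's real constraints is a separate question the sketch does not address.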
mattmcknight - 1 hours ago
If anything, this points out that the use case of Docker for
security isolation, such as in a multi-tenant architecture, is
probably still not a good one. In most use cases I see containers
used for rapid and consistent deployment. The isolation benefit
with multiple containers on a host is that if you install things
with different library dependencies you don't run into conflicts.
As such, the comparison for the common use case is just software
installed directly on the host, which also is subject to this vuln.
quotemstr - 29 minutes ago
Related: https://lwn.net/Articles/736348/
eeZi - 5 hours ago
This is precisely why we need projects like Grsecurity.
DyslexicAtheist - 25 minutes ago
Linux kernels in production (since we all now like to run docker
there :)) without grsec/seccomp have always been pretty dangerous.
What I dislike about docker is their feature creep and their failure
to proactively steer users toward more secure defaults.
The mindset towards security in the Linux kernel community remains
shockingly stubborn compared to the shift to "better security",
which is taking over the rest of the industry.
DyslexicAtheist - 16 minutes ago
actually the most affected by this CVE would be medium-sized
companies not investing in enough internal development pumping
out services fast with secure default (startups rushing to their
MVP maybe too). The companies running in a totally automated farm
with Kubernetes or Docker Swarm usually don't have containers with
long uptimes.
mehrdadn - 1 hours ago
> The vulnerability is that the highlighted access_ok() check was
missing in the waitid() syscall.
Why in the world does this class of
vulnerabilities still exist in 2017? Why are kernel maintainers not
writing some kind of C linter that makes sure every single pointer
argument to every syscall is passed to a well-known function like
access_ok (Linux) or ProbeForRead (Windows)? Literally all you need
is a syntactic check; you don't even need to do any kind of
semantic analysis... since all you want is to flag the code so
someone can inspect each spot manually. Why is this not done?!
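A rough sketch of what such a purely syntactic check could look like. It is narrower than the proposal above: instead of tracking every pointer argument, it flags lines that call unsafe_put_user() or unsafe_get_user() when no access_ok() call has appeared since the most recent SYSCALL_DEFINE. The function name `flag_unchecked_accesses` and the `Finding` struct are hypothetical, and plain substring matching will produce false positives and negatives (comments, helpers, macros), which is acceptable only because the output is meant for manual review.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// One flagged source line: its 1-based line number and its text.
struct Finding {
    std::size_t line;
    std::string text;
};

// Scans C source text line by line. Seeing "SYSCALL_DEFINE" starts a new
// syscall and clears the "checked" state; seeing "access_ok(" sets it.
// Any unsafe_*_user call reached while the state is clear is reported.
std::vector<Finding> flag_unchecked_accesses(const std::string& source) {
    std::vector<Finding> findings;
    std::istringstream in(source);
    std::string line;
    std::size_t line_no = 0;
    bool checked = false;

    while (std::getline(in, line)) {
        ++line_no;
        if (line.find("SYSCALL_DEFINE") != std::string::npos) checked = false;
        if (line.find("access_ok(") != std::string::npos) checked = true;

        const bool unsafe_call =
            line.find("unsafe_put_user(") != std::string::npos ||
            line.find("unsafe_get_user(") != std::string::npos;
        if (unsafe_call && !checked) findings.push_back({line_no, line});
    }
    return findings;
}
```

Fed a made-up syscall body that calls unsafe_put_user() with no preceding access_ok(), this reports that one line; adding the check before the call silences it. A real tool would more likely be built on a C parser than on string search.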
quotemstr - 26 minutes ago
C++ would also make it harder to get it wrong. Its type system is
powerful enough to enforce rules like "you must call access_ok
before writing through a pointer": you just have access_ok
transform an inaccessible pointer token of some sort, passed in
as a syscall parameter, into a different kind of object through
which you can write into memory. The generated machine code would
be identical to what's in the kernel today, but it'd be both
safer and cleaner. C++ still has to get over the bad gang-of-
four-1990s-era-object-goo reputation it has among systems people.
mehrdadn - 25 minutes ago
> C++ would also make it harder to get it wrong.
Funny you mention this... https://news.ycombinator.com/item?id=16032324
quotemstr - 20 minutes ago
It's a thought a lot of people have, I bet. :-)