Capsicum is a lightweight operating system (OS) capability and sandbox framework planned for inclusion in FreeBSD 9. Capsicum extends, rather than replaces, UNIX APIs, providing new kernel primitives (sandboxed capability mode and capabilities) and a userspace sandbox API. These tools support decomposition of monolithic UNIX applications into compartmentalized logical applications, an increasingly common goal that is supported poorly by existing OS access control primitives. We demonstrate our approach by adapting core FreeBSD utilities and Google's Chromium Web browser to use Capsicum primitives, and compare the complexity and robustness of Capsicum with other sandboxing techniques.
Capsicum is an API that brings capabilities, unforgeable tokens of authority, to UNIX. Fine-grained capabilities have long been the province of research operating systems (OSs) such as EROS.17 UNIX systems have less fine-grained access control, but are widely deployed. Capsicum's additions to the UNIX API suite give application authors an adoption path for one of the ideals of OS security: least-privilege operation. We validate Capsicum through a prototype built on (and planned for inclusion in) FreeBSD 9.0.
Today, many security-critical applications have been decomposed into sandboxed parts in order to mitigate vulnerabilities. Privilege separation,12 or compartmentalization, has been adopted for applications such as OpenSSH, Apple's SecurityServer, and Google's Chromium Web browser. Sandboxing is enforced using various access control techniques, but only with significant programmer effort and limitations: current OSes are simply not designed for this purpose.
Conventional (non-capability-oriented) OSes primarily use Discretionary Access Control (DAC) and Mandatory Access Control (MAC). DAC allows the owner of an object (such as a file) to specify the permissions other users have for it, which are checked when the object is accessed. MAC enforces systemic policies: administrators specify requirements (e.g., "users cleared to Secret may not read Top Secret documents"), which are checked when objects are accessed.
Capsicum addresses these problems by introducing new (and complementary) security primitives to support compartmentalization: capability mode and capabilities. Capabilities extend UNIX file descriptors, encapsulating rights on specific objects, such as files or sockets; they may be delegated from process to process. Capability mode processes can access only resources that they have been explicitly delegated. Capabilities should not be confused with OS privileges, occasionally described as POSIX capabilities, which are exemptions from access control or system integrity protections, such as the right to override file permissions.
We have modified several applications, including UNIX utilities and Chromium, to use Capsicum. No special privilege is required, and code changes are minimal: the tcpdump utility, plagued with past security vulnerabilities, can be sandboxed with Capsicum in around 10 lines of C, and Chromium in just 100 lines. In addition to being more secure and easier to use than other sandboxing techniques, Capsicum performs well: unlike pure capability systems that employ extensive message passing, Capsicum system calls are just a few percent slower than their UNIX counterparts.
Capsicum blends capabilities with UNIX, achieves many of the benefits of least-privilege operation, preserves existing UNIX APIs and performance, and offers application authors an adoption path for capability-oriented software design. Capsicum extends, rather than replaces, standard UNIX APIs by adding kernel-level primitives (a sandboxed capability mode, capabilities, and others) and userspace support code (libcapsicum and a capability-aware runtime linker). These extensions support application compartmentalization, the decomposition of monolithic applications into logical applications whose components run in sandboxes (Figure 1).
Capsicum requires application modification to exploit new security functionality, but this may be done gradually, rather than requiring a wholesale conversion to a pure capability model. Developers can select the changes that maximize positive security impact while minimizing unacceptable performance costs; where Capsicum replaces existing sandbox technology, a performance improvement may even be seen.
Capsicum incorporates many pragmatic design choices, emphasizing compatibility and performance over capability purism, not least by eschewing microkernel design. While applications may adopt message-passing, and indeed will need to do so to fully benefit from the Capsicum architecture, we provide "fast paths" in the form of direct system calls operating on delegated file descriptors. This allows native UNIX I/O performance, while leaving the door open to techniques such as message-passing system calls if that proves desirable.
2.1. Capability mode
Capability mode is a process credential flag set by a new system call, cap_enter(); once set, the flag cannot be cleared, and it is inherited by all descendant processes. Processes in capability mode are denied access to global namespaces such as absolute filesystem paths and PIDs (Figure 1). Several system management interfaces must also be protected to maintain UNIX process isolation (including /dev device nodes, some ioctl() operations, and a number of other APIs).
Capability mode system calls are restricted: those requiring global namespaces are blocked, while others are constrained. For instance, sysctl() can be used not only to query process-local information such as address space layout, but also to monitor a system's network connections. Roughly 30 of 3000 sysctl() MIB entries are permitted in capability mode.
Other constrained system calls include shm_open(), which is permitted to create anonymous memory objects but not named ones, and the openat() family. These calls accept a directory descriptor argument relative to which open(), rename(), etc. path lookups occur; in capability mode, operations are limited to objects "under" the passed directory.
2.2. Capabilities
The most critical choice in adding capability support to a UNIX system is the relationship between capabilities and file descriptors. Some systems, such as Mach, maintain entirely independent notions: Mac OS X provides each task with both capabilities (Mach ports) and BSD file descriptors. Separating these concerns is logical, as ports have different semantics from file descriptors; however, confusing results can arise for application developers dealing with both Mach and BSD APIs, and we wish to reuse existing APIs as much as possible. Instead, we extend the file descriptor abstraction, and introduce a new file descriptor type, the capability, to wrap and protect raw file descriptors.
File descriptors already have some properties of capabilities: they are unforgeable tokens of authority, and can pass between processes via inheritance or inter-process communication (IPC). Unlike "pure" capabilities, they confer very broad rights: even if a file descriptor is read-only, meta-data writes such as fchmod() are permitted. In Capsicum, we restrict file descriptor operations by wrapping the descriptor in a capability that masks available operations (Figure 2).
There are roughly 60 capability mask rights, striking a balance between message-passing (two rights: send and receive) and MAC systems (hundreds of access control checks). Capability rights align with methods on underlying objects: system calls implementing similar operations require the same rights, and calls may require multiple rights. For example, pread() (read file data) requires CAP_READ, whereas read() (read data and update the file offset) requires both CAP_READ and CAP_SEEK in a capability's rights mask. The cap_new() system call creates a new capability given an existing file descriptor and rights mask; if the original descriptor is a capability, the new rights must be a subset of the original rights. Capabilities can wrap any type of file descriptor including directories, which can be passed as arguments to openat() and related system calls. Directory capabilities delegate namespace subtrees, which may be used with *at() system calls (Figure 3). As a result, sandboxed processes can access multiple files in a directory without the performance overhead or complexity of proxying each open() to a process with ambient authority via IPC.
Many past security extensions have composed poorly with UNIX security, leading to vulnerabilities; thus, we disallow privilege elevation via setuid and setgid binaries in capability mode. This restriction does not prevent setuid binaries from using sandboxes.
2.3. Runtime environment
Manually creating sandboxes without leaking resources via file descriptors, memory mappings, or memory contents is difficult. libcapsicum provides a high-level API for managing sandboxes, hiding the implementation details of cutting off global namespace access, closing file descriptors not delegated to the sandbox, and flushing the address space via fexecve(). libcapsicum returns a socket that can be used for IPC with the sandbox, and to delegate further capabilities (Table 1).
3.1. Kernel changes
Most constraints are applied in the implementation of kernel services, rather than by filtering system calls. The advantage of this approach is that a single constraint, such as denying access to the global file system (FS) namespace, can be implemented in one place, namei(), which is responsible for processing all path lookups. For example, one might not have expected the fexecve() call to cause global namespace access, since it takes a file descriptor for a binary as its argument rather than a path. However, the binary passed by file descriptor specifies its runtime linker via an embedded path, which the kernel will implicitly open and execute.
Similarly, capability rights are checked by the kernel function fget(), which converts a numeric descriptor into a struct file reference. We have added a new rights argument, allowing callers to declare what capability rights are required to perform the current operation. If the file descriptor is a raw file descriptor, or wrapped by a capability with sufficient rights, the operation succeeds. Otherwise, ENOTCAPABLE is returned. Changing the signature of fget() allows us to use the compiler to detect missed code paths, giving us greater confidence that all cases have been handled.
One less trivial global namespace to handle is the process ID (PID) namespace, used for process creation, signaling, debugging, and exit status: critical operations for a logical application. A related problem is that libraries cannot create and manage worker processes without interfering with process management in the application itself; unexpected process IDs may be returned by wait(). Process descriptors address these problems in a manner similar to Mach task ports: creating a process with pdfork() returns a file descriptor suitable for process management tasks, such as monitoring for exit via poll(). When a process descriptor is closed, its process is terminated, providing a user experience consistent with that of monolithic processes: when the user hits Ctrl-C, all processes in the logical application exit.
3.2. The Capsicum runtime environment
Removing access to global namespaces forces fundamental changes to the UNIX runtime environment. Even the most basic UNIX operations for starting processes and running programs are restricted: fork() and exec() rely on global PID and FS namespaces, respectively.
Responsibility for launching a sandbox is split between libcapsicum and rtld-elf-cap. libcapsicum is invoked by the application, forks a new process using pdfork(), gathers delegated capabilities from the application and libraries, and directly executes the runtime linker, passing the target binary as a capability. Directly executing the capability-aware runtime linker avoids dependence on fexecve() loading a runtime linker via the global FS namespace. Once rtld-elf-cap is executing in the new process, it links the binary using libraries loaded via directory capabilities. The application is linked against normal C libraries and has access to the full C runtime, subject to sandbox restrictions.
Sandboxed programs call lcs_get() to look up delegated capabilities and retrieve an IPC handle so that they can process RPCs. Capsicum does not specify an Interface Description Language (IDL), as existing compartmentalized or privilege-separated applications have their own, often hand-coded, RPC marshalling already. Here, our design differs from historic microkernel systems, which universally have specified IDLs, such as the Mach Interface Generator (MIG).
libcapsicum's fdlist (file descriptor list) abstraction allows modular applications to declare a set of capabilities to be passed into sandboxes. This avoids hard-coding file descriptor numbers into the ABI between applications and their sandboxed components, a technique used in Chromium that we felt was likely to lead to bugs. Instead, application and library components bind file descriptors to names before creating a sandbox; corresponding code in the sandbox retrieves file descriptors using the same names.
Adapting applications for sandboxing is a nontrivial task, regardless of the framework, as it requires analyzing programs to determine their resource dependencies and adopting a distributed system programming style in which components use message passing or explicit shared memory rather than relying on a common address space. In Capsicum, programmers have access to a number of programming models; each model has its merits and costs in terms of starting point, development complexity, performance, and security:
Modify applications to use cap_enter() directly in order to place an existing process with ambient authority in capability mode, retaining selected capabilities and virtual memory mappings. This works well for applications with simple structures such as "open all resources, process them in an I/O loop," e.g., programs that run in a UNIX pipeline or use a single network connection. Performance overhead is extremely low, as changes consist of converting file descriptor rights into capabilities, followed by entering capability mode. We illustrate this approach with tcpdump.
Reinforce existing compartmentalization with cap_enter(). Applications such as dhclient and Chromium are already structured for message passing, and so benefit from Capsicum without performance or complexity impact. Both programs have improved vulnerability mitigation under Capsicum.
Modify the application to use the libcapsicum API, possibly introducing new compartmentalization. libcapsicum offers a simpler and more robust API than handcrafted separation, but at a potentially higher performance cost: residual capabilities and virtual memory mappings are rigorously flushed. Introducing new separation in an application comes at a significant development cost: boundaries must be identified such that not only is security improved (i.e., code processing risky data is isolated), but the resulting performance is also acceptable. We illustrate this technique with gzip.
Compartmentalized application development is distributed application development, with components running in different processes and communicating via message passing. Commodity distributed debugging tools are, unfortunately, unsatisfying and difficult to use. While we have not attempted to extend debuggers, such as gdb, to better support compartmentalization, we have modified several FreeBSD tools to understand Capsicum, and take some comfort in the synchronous nature of compartmentalized applications.
The procstat command inspects kernel state of running processes, including file descriptors, memory mappings, and credentials. In Capsicum, these resource lists become capability lists, representing the rights available to the process. We have extended procstat to show Capsicum-related information, such as capability rights masks on file descriptors and a process credential flag indicating capability mode.
When adapting existing software to run in capability mode, identifying capability requirements can be tricky; often the best technique is to discover them through dynamic analysis, identifying missing dependencies by tracing real-world use. To this end, capability-related failures are distinguished by a new errno value, ENOTCAPABLE. System calls such as open() are blocked in namei(), rather than at the kernel boundary, so that paths are available in tracing tools such as ktrace.
Another common compartmentalized debugging strategy is to allow the multiprocess logical application to be run as a single process for debugging purposes. libcapsicum provides an API to query sandbox policy, making it easy to disable sandboxing for testing. As RPCs are generally synchronous, the thread stack in a sandbox is logically an extension of the thread stack in the host process, making the distributed debugging task less fraught than it might otherwise appear.
tcpdump provides not only an excellent example of Capsicum offering immediate security improvement through straightforward changes, but also the subtleties that arise when sandboxing software not written with that in mind.
tcpdump has a simple model: compile a Berkeley Packet Filter (BPF) rule, configure a BPF device as an input source, and loop reading and printing packets. This structure lends itself to capability mode: resources are acquired early with ambient authority, and later processing requires only held capabilities. The bottom three lines of Figure 4 implement this change.
This change significantly improves security, as historically fragile packet-parsing code now executes with reduced privilege. However, analysis with the procstat tool is required to confirm that only desired capabilities are exposed, and reveals unconstrained access to a terminal device, /dev/pts/0, which would permit improper access to user input. Adding lc_limitfd calls as in Figure 4 prevents reading stdin while still allowing output. Figure 5 illustrates procstat output, including capabilities wrapping file descriptors to narrow delegated rights.
ktrace reveals another problem: the DNS resolver depends on FS access, but only after cap_enter() (Figure 6). This illustrates a subtle problem with sandboxing: modular software often employs on-demand initialization scattered throughout its components. We correct this by proxying DNS via a local resolver daemon, addressing both FS and network address namespace concerns.
Despite these limitations, this example shows that even minor changes can lead to dramatic security improvements, especially for a critical application with a long history of security problems. An exploited buffer overflow, for example, will no longer yield arbitrary FS or network access.
FreeBSD ships with the privilege-separated OpenBSD DHCP client. DHCP requires substantial privilege to open BPF descriptors, create raw sockets, and configure network interfaces, so is an appealing target for attackers: complex network packet processing while running with root privilege. Traditional UNIX provides only weak tools for sandboxing: the DHCP client starts as the root user, opens the resources its unprivileged component requires (raw socket, BPF descriptor, lease configuration file), forks a process to continue privileged network configuration, and then confines the parent process using chroot() and setuid(). Despite hardening of the BPF ioctl() interface to prevent reprogramming the filter, this confinement is weak: chroot() limits only FS access, and switching credentials offers poor protection against incorrectly configured DAC on System V IPC.
The two-line addition of cap_enter() reinforces existing sandboxing with Capsicum, limiting access to previously exposed global namespaces. As there has been no explicit flush of address space or capabilities, it is important to analyze what capabilities are retained by the sandbox (Figure 7).
dhclient has done an effective job at eliminating directory access, but continues to allow sandboxes to submit arbitrary log messages, modify the lease database, and use a raw socket. It is easy to imagine extending dhclient to use capabilities to further constrain file descriptors inherited by the sandbox, for example, by limiting the IP raw socket to sending and receiving while disallowing ioctl(). I/O interposition could be used to enforce log message and lease file constraints.
gzip presents an interesting target for several reasons: it implements risky compression routines that have suffered past vulnerabilities, executes with ambient user authority, yet is uncompartmentalized. UNIX sandboxing techniques, such as chroot() and sandbox UIDs, are a poor match not only because of their privilege requirement, but also because the notion of a single global application sandbox is inadequate. Many simultaneous gzip sessions can run independently for many different users, and placing them in the same sandbox provides few of the desired security properties.
The first step is to identify natural fault lines in the application: for example, code that requires ambient authority (e.g., opening files or network connections) and code that performs more risky activities (e.g., decoding data). In gzip, this split is obvious: the main run loop of the application opens input and output files, and supplies file descriptors to compression routines. This suggests a partitioning in which pairs of capabilities are passed to a sandbox for processing.
We modified gzip to optionally proxy compression and decompression to a sandbox. Each RPC passes input and output capabilities into a sandbox, as well as miscellaneous fields such as size, original filename, and modification time. By limiting capability rights to combinations of CAP_READ, CAP_WRITE, and CAP_SEEK, a tightly constrained sandbox is created, preventing access to globally named resources in the event a vulnerability in compression code is exploited.
This change adds 409 lines (16%) to the gzip source code, largely to marshal RPCs. In adapting gzip, we were initially surprised to see a performance improvement; investigation of this unlikely result revealed that we had failed to propagate the compression level (a global variable) into the sandbox, leading to incorrect algorithm selection. This serves as a reminder that code not originally written for decomposition requires careful analysis. Oversights such as this one are not caught by the compiler: the variable was correctly defined in both processes, but values were not properly propagated.
gzip raises an important design question: is there a better way to apply sandboxing to applications most frequently used in pipelines? Seaborn has suggested one possibility: a Principle of Least Authority Shell (PLASH), in which the shell runs with ambient privilege but places pipeline components in sandboxes.16 We have begun to explore this approach on Capsicum, but observe that the design tension exists here as well: gzip's non-pipeline mode performs a number of application-specific operations requiring ambient privilege, and logic like this is equally awkwardly placed in the shell. On the other hand, when operating purely in a pipeline, the PLASH approach offers the possibility of near-zero application modification.
We are also exploring library self-compartmentalization, in which library code sandboxes itself transparently to the host application. This has motivated several of our process model design choices: masking SIGCHLD delivery to the parent when using process descriptors avoids disturbing application state. This approach would allow sandboxed video processing in unmodified Web browsers. However, library APIs are often not crafted for sandbox-friendliness: one reason we placed separation in gzip rather than libz is that whereas gzip internal APIs used file descriptors, libz APIs acted on buffers. Forwarding capabilities offers full I/O performance, whereas the cost of transferring buffers via RPCs grows with file size. This approach does not help where vulnerabilities lie in library API use; for example, historic vulnerabilities in libjpeg have centered on callbacks into applications.
The FreeBSD port of Chromium did not include sandboxing, and the sandboxing facilities provided as part of the similar Linux and Mac OS X ports bear little resemblance to Capsicum. However, existing compartmentalization was a useful starting point: Chromium assumes sandboxes cannot open files, and certain services were already forwarded to renderers (e.g., font loading via passed file descriptors and renderer output via shared memory).
Roughly 100 lines of code were required to constrain file descriptors passed to sandboxes, such as Chromium stdio, /dev/random, and font files, to call cap_enter(), and to configure capability-oriented POSIX shared memory instead of System V IPC shared memory. This compares favorably with 4.3 million lines of Chromium source code, but would not have been possible without existing sandbox support.
Chromium provides an ideal context for a comparison with existing sandboxing mechanisms, as it employs six different sandboxing technologies (Table 2). Of these, two are DAC-based, two MAC-based, and two capability-based.
5.1. Windows ACLs and SIDs
On Windows, Chromium employs DAC to create sandboxes.13 The unsuitability of inter-user protections for the intra-user context is well demonstrated: the model is both incomplete and unwieldy. Chromium uses Access Control Lists (ACLs) and Security Identifiers (SIDs) to sandbox renderers on Windows. Chromium creates a SID with reduced privilege, which does not appear in the ACL of any object, in effect running the renderer as an anonymous user, and attaches renderers to an "invisible desktop," isolating them from the user's desktop environment. Many legitimate system calls are denied to sandboxed processes. These calls are forwarded to a trusted process responsible for filtering and processing, which comprises most of the 22,000 lines of code in the sandbox module.
Objects without ACL support are not protected, including FAT FSs and TCP/IP sockets. A sandbox may be unable to read NTFS files, but it can communicate with any server on the Internet or use a configured VPN. USB sticks present a significant concern, as they are used for file sharing, backup, and robustness against malware.
5.2. Linux chroot
Chromium's suid model also attempts to create a sandbox using legacy access control; the result is similarly porous, but with the additional risk posed by the need for OS privilege to create the sandbox. In this model, access to the filesystem is limited to a directory via chroot(): the directory becomes the sandbox's virtual root directory. Access to other namespaces, including System V shared memory (where the user's X window server can be contacted) and network access, is unconstrained, and great care must be taken to avoid leaking resources when entering the sandbox.
chroot() requires a setuid binary helper with full system privilege. While similar in intent to Capsicum's capability mode, this model suffers from significant weaknesses (e.g., permitting full access to System V shared memory as well as all operations on passed file descriptors).
5.3. Mac OS X Sandbox
On Mac OS X, Chromium uses Apple's Sandbox system. Sandbox constrains processes according to a Scheme-based policy language5 implemented via the MAC Framework.19 Chromium uses three policies for different components, allowing access to font directories while restricting access to the global FS namespace. Chromium can create stronger sandboxes than is possible with DAC, but rights granted to renderer processes are still very broad, and policy must be specified independently from code.
As with other techniques, resources are acquired before constraints are imposed, so care must be taken to avoid leaking resources into the sandbox. Fine-grained file system constraints are possible, but other namespaces such as POSIX IPC are an all-or-nothing affair. The Seatbelt-based sandbox model is less verbose than other approaches, but like all MAC systems, policy must be expressed separately from code. This can lead to inconsistencies and vulnerabilities.
5.4. SELinux
Chromium's SELinux sandbox employs a Type Enforcement (TE) policy.9 SELinux provides fine-grained rights management, but in practice, broad rights are often granted as fine-grained TE policies are difficult to write and maintain. SELinux requires that an administrator be involved in defining new policy, which is a significant inflexibility: application policies are effectively immutable.
The Fedora reference policy for Chromium creates a single SELinux domain, chrome_sandbox_t, shared by all renderers, risking potential interference. The domain is assigned broad rights, such as the ability to read the terminal device and all files in /etc. Such policies are easier to craft than fine-grained ones, reducing the impact of the dual-coding problem, but are less effective, allowing leakage between sandboxes and broad access to resources outside of the sandbox.
In contrast, Capsicum eliminates dual-coding by combining policy with code, with both benefits and drawbacks. Bugs cannot arise due to inconsistencies between policy and code, but there is no easily statically analyzable policy. This reinforces our belief that MAC and capabilities are complementary, filling different security niches.
5.5. Linux seccomp
Linux has an optionally compiled capability mode-like facility called seccomp. Processes in seccomp mode are denied access to all system calls except read(), write(), and exit(). At face value this seems promising, but software infrastructure is minimal, so application writers must write their own. In order to allow other system calls within sandboxes, Chromium constructs a process in which one thread executes in seccomp mode, and another thread shares the same address space and has full system call access. Chromium rewrites glibc to forward system calls to the trusted thread, where they are filtered to prevent access to inappropriate memory objects, opening files for write, etc. However, this default policy is quite weak, as read of any file is permitted.
Chromium's seccomp sandbox contains over a thousand lines of handcrafted assembly to set up sandboxing, implement system call forwarding, and craft a security policy. Such code is difficult to write and maintain, with any bugs likely leading to security vulnerabilities. Capsicum's approach resembles seccomp, but offers a rich set of services to sandboxes, so is easier to use correctly.
5.6. Summary of Chromium isolation models
Table 2 compares the security properties of the different sandbox models. Capsicum offers the most complete isolation across various system interfaces: FS, IPC, and networking (Net), as well as isolating sandboxes from one another (S ≠ S'), and avoiding the requirement for OS privilege to instantiate new sandboxes (Priv). Exclamation marks indicate cases where protection does exist in a model, but is either incomplete (FAT protection in Windows) or improperly used (while seccomp blocks open(), Chromium re-enables it with excessive scope via forwarding).
Typical OS security benchmarks try to illustrate near-zero overhead in the hopes of selling general applicability of a technology. Our thrust is different: application authors already adopting compartmentalization accept significant overheads for mixed security return. Our goal is to accomplish comparable performance with significantly improved security. We summarize our results here; detailed exploration may be found in our USENIX Security paper.18
We evaluated performance by characterizing the overhead of Capsicum's new primitives through API micro-benchmarks and broader application benchmarks. We were unable to measure a performance change in our adapted dhclient due to the negligible cost of entering capability mode; on turning our attention to gzip, we found an overhead of 2.4 ms to decompress an empty file. Micro-benchmarks revealed a cost of 1.5 ms for creating and destroying a sandbox, largely attributable to process management. This cost is quickly amortized with growth in data file size: by 512K, performance overhead was <5%.
Capsicum provides an effective platform for capability work on UNIX platforms. However, further research and development are required to bring this project to fruition.
Refinement of the Capsicum primitives would be useful. Performance might be improved for sandbox creation by employing Bittau's S-thread primitive.2 A formal "logical application" construct might improve termination properties.
Another area for research is in integrating user interfaces and OS security; Shapiro has proposed that capability-centered window systems are a natural extension to capability OSs. It is in the context of windowing systems that we have found capability delegation most valuable: gesture-based access control can be investigated through Capsicum enhancements to UI elements, such as Powerboxes (file dialogues with ambient authority) and drag-and-drop. Improving the mapping of application security into OS sandboxes would improve the security of Chromium, which does not consistently assign Web security domains to OS sandboxes.
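The delegation pattern behind a Powerbox can be sketched with ordinary UNIX descriptor passing. The example below is an illustration only, not Capsicum code: it runs both sides in one process, never enters capability mode, and the names powerbox_open, sandbox_receive, and demo are hypothetical. It assumes Python 3.9 or later on a UNIX-like system, where socket.send_fds and socket.recv_fds are available.

```python
import os
import socket
import tempfile


def powerbox_open(channel, path):
    """Trusted side: open the user-chosen path and delegate only its descriptor."""
    fd = os.open(path, os.O_RDONLY)
    try:
        socket.send_fds(channel, [b"fd"], [fd])
    finally:
        os.close(fd)  # the in-flight message keeps its own reference to the file


def sandbox_receive(channel):
    """Sandboxed side: receive one descriptor; the path name is never seen."""
    _data, fds, _flags, _addr = socket.recv_fds(channel, 16, 1)
    return fds[0]


def demo():
    """Pass a read-only descriptor across a socket pair and read through it."""
    trusted, sandboxed = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"user-selected document")
        path = tmp.name
    try:
        powerbox_open(trusted, path)
        fd = sandbox_receive(sandboxed)
        with os.fdopen(fd, "rb") as received:
            return received.read()
    finally:
        trusted.close()
        sandboxed.close()
        os.unlink(path)


if __name__ == "__main__":
    print(demo())
```

In a real deployment the two sides would be separate processes, with the receiving side confined so that the delegated descriptor is its only route to the file; the user's selection in the dialogue is what authorizes the delegation.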
Finally, it is clear that the single largest problem with Capsicum and similar approaches is programmability: converting local development into de facto distributed system development hampers application-writers. Aligning security separation with application structure is also important if such systems are to mitigate vulnerabilities on a large scale: how can the programmer identify and correctly implement compartmentalizations with real security benefits?
Saltzer and Schroeder's 1975 exploration of Multics-era OS protection describes the concepts of hardware capabilities and ACLs, and observes that systems combine the two approaches in order to offer a blend of protection and performance.14 Neumann et al.'s Provably Secure Operating System (PSOS),11 and successor LOCK, propose a tight integration of MAC and capabilities; type enforcement (TE) is extended in LOCK to address perceived shortcomings in the capability model,15 and later appears in systems such as SELinux.9 We adopt a similar philosophy in Capsicum, supporting DAC, MAC, and capabilities.
Despite experimental hardware such as Wilkes' CAP computer,20 the eventual dominance of page-oriented virtual memory over hardware capabilities led to exploration of microkernel object-capability systems. Hydra,3 Mach,1 and, later, L4,8 epitomize this approach, exploring successively greater extraction of historic kernel components into separate tasks, and integrating message passing-based capability security throughout their designs. Microkernels have, however, been largely rejected by commodity OS vendors in favor of higher-performance monolithic kernels. Microkernel capability research has continued in the form of systems such as EROS,17 inspired by KEYKOS.6 Capsicum is a hybrid capability system, observably not a microkernel, and retains support for global namespaces (outside of capability mode), emphasizing compatibility over capability purism.
Provos's OpenSSH privilege separation12 and Kilpatrick's Privman7 in the early 2000s rekindled interest in microkernel-like compartmentalization projects, such as the Chromium Web browser13 and Capsicum's logical applications. In fact, large application suites compare formidably with the size and complexity of monolithic kernels: the FreeBSD kernel is composed of 3.8 million lines of C, whereas Chromium and WebKit come to a total of 4.1 million lines of C++. How best to decompose monolithic applications remains an open research question; Bittau's Wedge offers a promising avenue through automated identification of software boundaries.2
Seaborn and Hand have explored application compartmentalization on UNIX through capability-centric Plash,16 and Xen,10 respectively. Plash offers an intriguing layering of capability security over UNIX semantics by providing POSIX APIs over capabilities, but is forced to rely on the same weak UNIX primitives analyzed in Section 5. Hand's approach suffers from similar issues to seccomp, in that the runtime environment for Xen-based sandboxes is functionality-poor. Garfinkel's Ostia4 proposes a delegation-centric UNIX approach, but focuses on providing sandboxing as an extension, rather than a core OS facility.
We have described Capsicum, a capability security extension to the POSIX API to appear in FreeBSD 9.0 (with ports to other systems, including Linux, under way). Capsicum's capability mode and capabilities appear to be a more natural fit to application compartmentalization than widely deployed discretionary and mandatory schemes. Adaptations of real-world applications, from tcpdump to the Chromium Web browser, suggest that Capsicum improves the effectiveness of OS sandboxing. Unlike research capability systems, Capsicum implements a hybrid capability model that supports commodity applications. Security and performance analyses show that improved security is not without cost, but that Capsicum improves on the state of the art. Capsicum blends immediate security improvements to current applications with long-term prospects of a more capability-oriented future. More information is available at: http://www.cl.cam.ac.uk/research/security/capsicum/
We thank Mark Seaborn, Andrew Moore, Joseph Bonneau, Saar Drimer, Bjoern Zeeb, Andrew Lewis, Heradon Douglas, Steve Bellovin, Peter Neumann, Jon Crowcroft, Mark Handley, and the anonymous reviewers for their help.
1. Accetta, M., Baron, R., Golub, D., Rashid, R., Tevanian, A., Young, M. Mach: A New Kernel Foundation for UNIX Development. Technical report, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, Aug. 1986.
2. Bittau, A., Marchenko, P., Handley, M., Karp, B. Wedge: Splitting applications into reduced-privilege compartments. In Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation (2008), USENIX Association, 309–322.
5. Google, Inc. The Chromium Project: Design Documents: OS X Sandboxing Design. http://dev.chromium.org/developers/design-documents/sandbox/osx-sandboxing-design, Oct. 2010.
9. Loscocco, P.A., Smalley, S.D. Integrating flexible support for security policies into the Linux operating system. In Proceedings of the USENIX Annual Technical Conference (June 2001), USENIX Association, 29–42.
11. Neumann, P.G., Boyer, R.S., Feiertag, R.J., Levitt, K.N., Robinson, L. A Provably Secure Operating System: The System, Its Applications, and Proofs, Second Edition. Technical Report CSL-116, Computer Science Laboratory, SRI International, Menlo Park, CA, May 1980.
16. Seaborn, M. Plash: Tools for practical least privilege, 2007. http://plash.beasts.org/
19. Watson, R.N.M., Feldman, B., Migus, A., Vance, C. Design and implementation of the TrustedBSD MAC framework. In Proceedings of the Third DARPA Information Survivability Conference and Exhibition (DISCEX) (April 2003), IEEE.
The original version of this paper "Capsicum: Practical Capabilities for UNIX" was published in the Proceedings of the 19th USENIX Security Symposium, 2010.
©2012 ACM 0001-0782/12/0300 $10.00
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from email@example.com or fax (212) 869-0481.