Dark pools are a recent type of stock exchange in which information about outstanding orders is deliberately hidden in order to minimize the market impact of large-volume trades. The success and proliferation of dark pools have created challenging and interesting problems in algorithmic trading—in particular, the problem of optimizing the allocation of a large trade over multiple competing dark pools. In this work, we formalize this optimization as a problem of multi-venue exploration from censored data, and provide a provably efficient and near-optimal algorithm for its solution. Our algorithm and its analysis have much in common with well-studied algorithms for managing the explorationexploitation trade-off in reinforcement learning. We also provide an extensive experimental evaluation of our algorithm using dark pool execution data from a large brokerage.

### 1. Introduction

*Dark pools* are a relatively new type of exchange designed to address the problems that arise from the transparent (or “light”) nature of a typical stock exchange—namely, the difficulty of minimizing the impact of large-volume trades.^{3, 5, 7} In a typical exchange, the revelation that there is a large-volume buyer (seller) in the market can cause prices to rise (fall) at the buyer’s (seller’s) expense. If the volume is sufficiently large, and the trading period sufficiently short, such market impacts remain even if one attempts to fragment the trade over time into smaller transactions. As a result, there has been increasing interest in recent years in execution mechanisms that allow full or partial concealment of large trades.

In a typical dark pool, buyers and sellers submit orders that simply specify the total volume of shares they wish to buy or sell, with the price of the transaction determined exogenously by “the market”.^{a} Upon submitting an order to buy (or sell) *v* shares, a trader is put in a queue of buyers (or sellers) awaiting transaction. Matching between buyers and sellers occurs in sequential arrival of orders, similar to a light exchange. However, unlike a light exchange, no information is provided to traders about how many parties or shares might be available in the pool at any given moment. Thus in a given time period, a submission of *v* shares results only in a report of how many shares up to *v* were executed.

While presenting their own trading challenges, dark pools have become tremendously popular exchanges, responsible for executing 1020% of the overall US equity volume. In fact, they have been so successful that there are now approximately 40+ dark pools for the US Equity market alone. The popularity of these exchanges has left large-volume traders and brokerages facing a novel problem: How should one optimally distribute a large trade over the many independent dark pools?

To answer this question, we analyze a framework and algorithm for a more general multi-venue exploration problem. We consider a setting in which at each time period, we have some exogenously determined *volume* of *V* units of an abstract good (for example, shares of a stock that a client would like to sell). Our goal is to “sell” or “consume” as many of these units as possible at each step, and there are *K* abstract “venues” (for example, various dark pools) in which this selling or consumption may occur. We can divide our *V* units into any way we like across the venues in service of this goal. What differentiates this problem from most standard learning settings is that if *v*_{i} units are allocated to venue *i*, and all of them are consumed, we learn only that the total demand at venue *i* was *at least* *v*_{i}, not the precise number of units that *could* have been consumed there. This important aspect of our framework is known as *censoring* in the statistics literature.

In this work, we make the natural and common assumption that the maximum amount of consumption available in venue *i* at each time step (or the total liquidity available, in the dark pool problem) is drawn according to a fixed but unknown distribution *P*_{i}. Formally speaking, this means that when *v*_{i} units are submitted to venue *i*, a value *s*_{i} is drawn randomly from *P*_{i} and the observed (and possibly censored) amount of consumption is min {*s*_{i}, *v*_{i}}.

A learning algorithm in our framework receives a sequence of volumes *V*^{1}, *V*^{2}, … and must decide how to distribute the *V*^{t} units across the venues at each time step *t*. Our goal is to *efficiently* (in time polynomial in the parameters of the model) learn a near-optimal allocation policy. There is a distinct *between-venue exploration* component to this problem, since the best number of shares to submit to venue *i* may depend on both *V*^{t} and the distributions for the *other* venues, and the only mechanism by which we can discover the distributions is by submitting allocations. If we routinely submit too-small volumes to a venue, we receive censored observations and are underutilizing the venue; if we submit too-large volumes, we receive uncensored observations but have excess inventory.

Our main theoretical contribution is a provably polynomial-time algorithm for learning a near-optimal policy for any unknown venue distributions *P*_{i}. This algorithm takes a particularly natural and appealing form, in which *allocation* and distribution *reestimation* are repeatedly alternated. More precisely, at each time step we maintain estimates of the distributions *P*_{i}; pretending that these estimates are in fact exactly correct, we allocate the current volume *V* accordingly. These allocations generate observed consumptions in each venue, which in turn are used to update the estimates. We show that when the estimates are “optimistic tail modifications” of the classical *KaplanMeier* maximum likelihood estimator for censored data, this estimateallocate loop has *provably efficient between-venue exploration* behavior that yields the desired result. Venues with smaller available volumes are gradually given smaller allocations in the estimateallocate loop, whereas venues with repeated censored observations are gradually given larger allocations, eventually settling on a near-optimal overall allocation distribution.

Finally, we present an extensive experimental evaluation of our model and algorithm on the dark pool problem, using trading data from a large brokerage.

The closest problem to our setting is the widely studied *newsvendor problem* from the operations research literature. In this problem, at each time period a player (representing a newsstand owner) chooses a quantity *V* of newspapers to purchase at a fixed per-unit price, and tries to optimize profit in the face of demand uncertainty at a single venue (their newsstand).^{b} Huh et al.^{10} were the first to consider the use of the KaplanMeier estimator in this class of problems. They use an estimateallocate loop similar to ours, and show *asymptotic* convergence to near-optimal behavior in a single venue. Managing the distribution of an *exogenously specified* volume *V* across *multiple* venues (which are the important aspects of the dark pool problem, where the volume to be traded is specified by a client, and there are many dark pools) and the attendant explorationexploitation trade-off *between venues* are key aspects and differentiators of our algorithm and analysis. We also obtain stronger (polynomial time rather than asymptotic) bounds, which require a modification of the classical KaplanMeier estimator.

### 2. The Formal Model

Formally, we consider the following problem. At each time *t*, a learner is presented with a quantity or *volume* *V*^{t}
{1, …, *V*} of units, where *V*^{t} is sampled from an unknown distribution *Q*. The learner must decide on an *allocation*
of these shares to a set of *K* known *venues*, with *v*^{t}_{i}
{0, …, *V*^{t}} for each *i*
{1, …, *K*}, and Σ^{k}_{i = 1} *v*^{t}_{i} = *V*^{t}. The learner is then told the number of units *r*^{t}_{i} *consumed* at each venue *i*. Here *r*^{t}_{i} = min{*s*^{t}_{i}, *v*^{t}_{i}}, where *s*^{t}_{i} is the maximum consumption level of venue *i* at time *t*, which is sampled independently from a fixed but unknown distribution *P*_{i}. If *r*^{t}_{i} = *v*^{t}_{i}, we say that the algorithm receives a *censored observation* because it is possible to infer only that *r*^{t}_{i} ≤ *s*^{t}_{i}st. If *r*^{t}_{i} < *v*^{t}_{i}, we say that the algorithm receives a *direct observation* because it must be the case that *r*^{t}_{i} = *s*^{t}_{i}.

The goal of the learner is to discover a near-optimal one-step allocation policy, that is, an allocation policy that approximately optimizes the expected number of units out of *V*^{t} consumed at each time step *t*. (We briefly discuss other objectives at the end of Section 4.4.)

Throughout the remainder of the paper, we use the shorthand *T*_{i} for the *tail probabilities* associated with *P*_{i}. That is, *T*_{i}(*s*) = Σ_{s‘ ≥ s} *P*_{i}(*s*‘).^{c} Clearly *T*_{i}(0) = 1 for all *i*. We use
^{t}_{i}(*s*) for an empirical estimate of *T*_{i}(*s*) at time *t*.

### 3. A Greedy Allocation Scheme

Before tackling the full exploration-exploitation problem, we must examine a more basic question: Given estimates
_{i} of the tail probabilities *T*_{i} for each venue *i*, how can we maximize the (estimated) expected number of units consumed on a single time step? It turns out that this can be accomplished using a simple greedy allocation scheme. The greedy algorithm allocates one unit at a time. The venue to which the next unit is allocated is chosen to maximize the estimated probability that the unit will be consumed. It is easy to see that if *v*_{i} units have already been allocated to venue *i*, then the estimated probability that the next allocated unit will be consumed is simply
_{i}(*v*_{i} + 1). A formal description of the Greedy algorithm is given in Figure 1.

THEOREM 1. *The allocation returned by* Greedy *maximizes the expected number of units consumed in a single time step, where the expectation is taken with respect to the estimated tail probabilities* {
_{i}}^{K}_{i = 1}

The proof of this theorem is fairly simple. Using the fact that tail probabilities must satisfy
_{i}(*s*) ≥
_{i}(*s*‘) for all *s* ≤ *s*‘, it is easy to verify that by greedily adding units to the venues in decreasing order of
_{i}(*s*), the algorithm returns

The remainder of the proof involves showing that the expression being maximized here equivalent to the expected number of units consumed. This can be done algebraically.^{d}

### 4. The Censored ExplorationExploitation Algorithm

We now present our main theoretical result, which is a polynomial-time, near-optimal algorithm for multi-venue exploration from censored data. The analysis of our algorithm bears strong resemblance to the exploration-exploitation arguments common in the E^{3} and RMAX family of algorithms for reinforcement learning.^{4, 12} In particular, there is an analogy to the notion of a *known state* inherent in those earlier algorithms, along with an *exploitation lemma* (proving that expected payoffs from known states are high) and an *exploration lemma* (proving that extended periods of low payoffs must result in more states becoming known). In our setting, however, the number of states is exponential and thus the special structure of our problem is required to obtain a polynomial time algorithm. We first provide an overview of the algorithm and its analysis before examining it in more detail.

At the highest level, the algorithm is quite simple and natural. It maintains estimates
^{t}_{i} for the true unknown tail probabilities *T*_{i} for each venue *i*. These estimates improve with time in a particular quantifiable sense which drives between-venue exploration. At any given time *t*, the current volume *V*^{t} is allocated across the venues by simply calling the optimal greedy allocation scheme from Figure 1 on the current set of estimated tail probabilities
^{t}_{i}. This results in new censored observations from each venue, which in turn are used to update the estimates
^{t + 1}_{i} used at the next time step. Thus the algorithm, which is formally stated in Figure 2, implements a continuous allocatereestimate loop.

Note that we have not yet described the algorithm’s subroutine *OptimisticKM*, which specifies how we estimate
^{t}_{i} from the observed data. The most natural choice would be the maximum likelihood estimator on the data. This estimator is well-known in the statistics literature as the *KaplanMeier* estimator. In the following section, we describe KaplanMeier and derive a new convergence result that suits our particular needs. This result in turn lets us define an optimistic tail modification of KaplanMeier that becomes our choice for *OptimisticKM*. Figure 3 shows the full subroutine.

The analysis of our algorithm, which is developed in more detail over the next few sections, proceeds as follows:

**Step 1:** We first review the KaplanMeier maximum likelihood estimator for censored data and provide a new finite sample convergence bound for this estimator. This bound allows us to define a *cut-off* for each venue *i* such that the KaplanMeier estimate of the tail probability *T*_{i}(*s*) for every value of *s* up to the cut-off is guaranteed to be close to the true tail probability. We then define a lightly modified version of the KaplanMeier estimates in which the tail probability of the next unit above the cut-off is modified in an optimistic manner. We show that in conjunction with the greedy allocation scheme, this minor modification leads to increased exploration, since the next unit beyond the cut-off always looks at least as good as the cut-off itself.

**Step 2:** We next prove our main *Exploitation Lemma* (Lemma 3). This lemma shows that at any time step, if it is the case that the number of units allocated to each venue by the greedy algorithm is strictly below the cut-off for that venue (which can be thought of as being in a *known state* in the parlance of reinforcement learning) then the allocation is provably *ε*-optimal.

**Step 3:** We then prove our main *Exploration Lemma* (Lemma 4), which shows that on any time step at which the allocation made by the greedy algorithm is *not* *ε*-optimal, it is possible to lower bound the probability that the algorithm explores. Thus, any time we cannot ensure a near-optimal allocation, we are instead assured of exploring.

**Step 4:** Finally, we show that on any sufficiently long sequence of time steps (where *sufficiently long* is polynomial in the parameters of the model), it must be the case that either the algorithm has already implemented a near-optimal solution at almost every time step (and thus will continue to perform well in the future), or the algorithm has explored sufficiently often to learn accurate estimates of the tail distributions out to *V* units on every venue. In either case, we can show that with high probability, at the end of the sequence, the current algorithm achieves an *ε*-optimal solution at each time step with probability at least 1 − *ε*.

**4.1. Convergence of KaplanMeier estimators**

We begin by describing the standard KaplanMeier maximum likelihood estimator for censored data,^{11, 13} restricting our attention to a single venue *i*. Let *z*_{i,s} be the true probability that the demand in this venue is *exactly s* units given that the demand is *at least s* units. Formally,

It is easy to verify that for any *s* > 0,

At a high level, we can think of KaplanMeier as first computing a separate estimate of *z*_{i,s} for each *s* and then using these estimates to compute an estimate of *T*_{i}(*s*).

More specifically, let *M*^{t}_{i,s} be the number of *direct* observations of *s* units up to time *t*, that is, the number of time steps at which strictly more than *s* units were allocated to venue *i* and exactly *s* were consumed. Let *N*^{t}_{i,s} be the number of either direct or censored observations of *at least s* units on time steps at which strictly more than *s* units were allocated to venue *i*. We can then naturally define our estimate *
*^{t}_{i,s} = *M*^{t}_{i,s}/*N*^{t}_{i,s}, with *
*^{t}_{i,s} = 0 if *N*^{t}_{i,s} = 0. The KaplanMeier estimator of the tail probability for any *s* > 0 after *t* time steps can then be expressed as

with
^{t}_{i}(0) = *T*_{i}(0) = 1 for all *t*.

Previous work has established convergence rates for the KaplanMeier estimator to the true underlying distribution in the case that each submission in the sequence *v*^{1}_{i},…,*v*^{t}_{i} is independently and identically distributed (i.i.d.),^{8} and asymptotic convergence for non-i.i.d. settings.^{10} We are not in the i.i.d. case, since the submitted volumes at one venue are a function of the entire history of allocations and executions across all venues. In the following theorem, we give a new finite sample convergence bound applicable to our setting.

THEOREM 2. *Let
*^{t}_{i} *be the KaplanMeier estimate of T*_{i} *as given in Equation 1. For any δ > 0, with probability at least* 1 − *δ, for every s*
{1, …, *V*},

This result shows that as we make more and more direct or censored observations of at least *s* − 1 units on time steps at which at least *s* units are allocated to venue *i*, our estimate of the tail probability for *s* shares rapidly improves.

To prove this theorem, we must first show that the estimates *
*^{t}_{i,s} converge to the true probabilities *z*_{i,s}. In an i.i.d. setting, this could be accomplished easily using standard concentration results such as Hoeffding’s inequality. In our setting, we instead appeal to Azuma’s inequality (see, for example, Alon and Spencer^{2}), a tool for bounding *martingales*, or sequences *X*_{1}, *X*_{2}, … such that for each *n*, |*X*_{n} − *X*_{n+1}| ≤ 1 and E [*X*_{n+1}|*X*_{n}] = *X*_{n}. In particular, we show that the value *N*^{t}_{i,s}(*z*_{i,s} − *
*^{t}_{i,s}) can be expressed as the final term of a martingale sequence, allowing us to bound its absolute value. This in turn implies that bound on |*z*_{i,s} − *
*^{t}_{i,s}| that we need, and all that remains is to show that these bounds imply a bound on the discrepancy between *T*_{i}(*s*) and the estimator
_{i}(*s*).

In Figure 3, we describe the minor modification of KaplanMeier necessary for our analysis. As described above (Step 1), the value *c*^{t}_{i} in this algorithm can intuitively be viewed as a cut-off up to which we are guaranteed to have sufficient data to accurately estimate the tail probabilities using KaplanMeier; this is formalized in Lemma 1. Thus for every quantity *s* < *c*^{t}_{i}, we simply let
^{t}_{i}(*s*) be precisely the KaplanMeier estimate as in Equation 1.

However, to promote exploration, we set the value of
^{t}_{i}(*c*^{t}_{i} + 1) optimistically to the KaplanMeier estimate of the tail probability at *c*^{t}_{i} (*not* at *c*^{t}_{i} + 1). This optimistic modification is necessary to ensure that the greedy algorithm explores (i.e., has a chance of making progress towards increasing at least one cut-off value) on every time step for which it is not already producing an *ε*-optimal allocation. In particular, suppose that the current greedy solution allocated no more than *c*^{t}_{i} units to any venue *i* and exactly *c*^{t}_{j} units to some venue *j.* Using the standard KaplanMeier tail probability estimates, it could be the case that this allocation is suboptimal (there is no way to know if it would have been better to include unit *c*^{t}_{i} + 1 from venue *j* in place of a unit from another venue since we do not have an accurate estimate of the tail probability for this unit), and yet no exploration is taking place. By optimistically modifying the tail probability
^{t}_{i}(*c*^{t}_{j} + 1) for each venue, we ensure that no venue remains unexplored simply because the algorithm unluckily observes a low demand a small number of times.

We now formalize the idea of *c*^{t}_{i} as a cut-off up to which the KaplanMeier estimates are accurate. In the results that follow, we think of *ε* > 0 and *δ* > 0 as fixed parameters of the algorithm.^{e}

LEMMA 1. *For any s ≤ V, let
*^{t}_{i}(*s*) *be the KaplanMeier estimator for T*_{i}(*s*) *returned by* OptimisticKM. *With probability at least* 1 − *δ, for all* *s*≤*c*^{t}_{i},|*T*_{i}(*s*) −
^{t}_{i}(*s*)|≤ *ε* /(8*V*).

PROOF. It is always the case that *T*_{i}(0) =
^{t}_{i}(0) = 1, so the result holds trivially unless *c*^{t}_{i} > 0. Suppose this is the case. Recall that *N*^{t}_{i,s} is the number of direct or censored observations of at least *s* units on time steps at which strictly more than *s* units were allocated to venue *i.* By definition, it must be the case that *N*^{t}_{i,s} ≥ *N*^{t}_{i,s}, whenever *s* ≤ *s*‘. Thus by definition of the cut-off *c*^{t}_{i} in Figure 3, for all *s* < *c*^{t}_{i}cit, *N*^{t}_{i,s} ≥ 128(*s*^{V}/*ε*)^{2} ln(2*V*/*ε*). The lemma then follows immediately from an application of Theorem 2.

Lemma 2 shows that it is also possible to achieve additive bounds on the error of tail probability estimates for quantities *s* much *bigger* than *c*^{t}_{i} as long as the estimated tail probability at *c*^{t}_{i} is sufficiently small. Intuitively, this is because the tail probability at these large values of *s* must be smaller than the true tail probability at *c*^{t}_{i}, which, in this case, is known to be very small already.

LEMMA 2. *If
*^{t}_{i} (*c*^{t}_{i}) ≤ *ε/*(4*V*) *and the high probability event in Lemma 1 holds, then for all s such that c*^{t}_{i} < *s* ≤ *V*, |*T*_{i} (*s*) −
^{t}_{i} (*s*)| ≤ *ε/*(2*V*).

**4.3. Exploitation and exploration lemmas**

We are now ready to state our main *Exploitation Lemma* (Step 2), which formalizes the idea that once a sufficient amount of exploration has occurred, the allocation output by the greedy algorithm is *ε*-optimal. The proof of this lemma is where the optimistic tail modification to the KaplanMeier estimator becomes important. In particular, because of the optimistic setting of
^{t}_{i}(*c*^{t}_{i} + 1), we know that if the greedy policy allocates exactly *c*^{t}_{i} units to a venue *i*, it could not gain too much by reallocating additional units from another venue to venue *i* instead. In this sense, we create a buffer above each cut-off, guaranteeing that it is not necessary to continue exploring as long as one of the two conditions in the lemma statement is met for each venue.

The second condition in the lemma may appear mysterious at first. To see why it is necessary, notice that the rate at which the estimate
^{t}_{i}(*c*^{t}_{i} + 1) converges to the true tail probability *T*_{i}(*c*^{t}_{i} + 1) implied by Theorem 2 depends on the number of times that we observe a consumption of *c*^{t}_{i} or more units. If *T*_{i}(*c*^{t}_{i}) is very small, then the consumption of this many units does not frequently occur. Luckily, if this is the case, then we know that *T*_{i}(*c*^{t}_{i} + 1) must be very small as well, and more exploration of this venue is not needed.

LEMMA 3 (EXPLOITATION LEMMA). *Assume that at time t, the high probability event in Lemma 1 holds. If for each venue i, either (1), v ^{t}_{i} ≤ c^{t}_{i} or (2),
^{t}_{i}*(

*c*)

^{t}_{i}*≤ ε/*(4

*V*)

*, the difference between the expected number of units consumed under allocation and the expected number of units consumed under the optimal allocation is at most ε*.

PROOF SKETCH. The proof begins by creating an arbitrary one-to-one mapping between the units allocated to different venues by the algorithm and an optimal allocation. Consider any such pair in this mapping.

If the first condition in the lemma holds for the venue *i* to which the unit was allocated by the algorithm, we can use Lemma 1 to show that the algorithm’s estimate of the probability of this unit being consumed is close to the true probability; in particular, the algorithm is not overestimating this probability too much. If the second condition holds, then the algorithm’s estimate of the probability of the share being consumed is so small that, again, the algorithm cannot possibly be overestimating it too much (because the lowest the probability could be is zero). This follows from Lemma 2.

Now consider the venue *j* to which unit was allocated by the optimal allocation. If the number of units *v*^{t}_{j} allocated to this venue by the algorithm is strictly less than the cut-off *c*^{t}_{j}, then by Lemma 1, the algorithm could not have underestimated the probability of additional units being consumed by too much. Furthermore, because of the optimistic tail modification of the KaplanMeier estimator, this also holds if *v*^{t}_{j} = *c*^{t}_{j}. Finally, if it is instead the case that the second condition in the lemma statement holds for venue *j*, then the algorithm again could not possibly have underestimated the probability of the unit being consumed too much because the true probability is so low.

Putting these pieces together, we can argue that for each pair in the matching (of which there are no more than *V*), since the algorithm did not overestimate the probability of unit it chose being consumed by too much (in this case, *too much* means more than *ε*/(2*V*)) and did not underestimate the probability of the corresponding unit in the optimal allocation by too much (again, by *ε*/(2*V*)), the difference in expected units consumed between the optimal allocation and the algorithm’s is at most *ε*.

Finally, Lemma 4 presents the main exploration lemma (Step 3), which states that on any time step at which the allocation is *not* *ε*-optimal, the probability of obtaining a useful observation is at least *ε*/(8*V*).

LEMMA 4 (EXPLORATION LEMMA). *Assume that at time t, the high probability event in Lemma 1 holds. If the allocation is not ε-optimal, then for some venue i, with probability at least*

PROOF. Suppose the allocation is not *ε*-optimal at time *t*. By Lemma 3, it must be the case that there exists some venue *i* for which *v*^{t}_{i} > *c*^{t}_{i} and
^{t}_{i}(*c*^{t}_{i}) > *ε*/(4*V*), i.e., a venue in which the algorithm has allocated units past the cut-off but for which the tail probability at the cut-off is not too close to zero. Let be a venue for which this is true. Since *v*^{t}_{
}> *c*^{t}_{
}, it will be the case that the algorithm obtains a useful observation for exploration of this venue (i.e., an observation causing
to be incremented) if the number of units consumed at this venue is sufficiently high (specifically, if *r*^{t}_{
} > *c*^{t}_{
}). Since
^{t}_{
}(*c*^{t}_{
}) > *ε/*(4*V*), Lemma 1 implies that *T*_{
}(*c*^{t}_{
}) > *ε/*(8*V*), which in turn implies that the number of units consumed is high enough to constitute a useful observation with probability at least *ε*/(8*V*).

With the exploitation and exploration lemmas in place, we are finally ready to state our main theorem.

THEOREM 3 (MAIN THEOREM). *For any ε > 0 and δ > 0, with probability* 1* − δ (over the randomness of draws from Q and* {*P*_{i}}*), after running for a time polynomial in K, V,* 1*/ ε, and* ln(1/*δ*),* the algorithm in Figure 2 makes an ε-optimal allocation on each subsequent time step with probability at least* 1* − ε*.

PROOF SKETCH. Suppose that the algorithm runs for *R* time steps, where *R* is a (specific, but unspecified for now) polynomial in the model parameters *K*, *V*, 1/*ε*, and ln(1/*δ*). If it is the case that the algorithm was already *ε*-optimal on a fraction (1 − *ε*) of the *R* time steps, then we can argue that the algorithm will continue to be *ε*-optimal on at least a fraction (1 − *ε*) of future time steps since the algorithm’s performance should improve on average over time as estimates become more accurate.

On the other hand, if the algorithm chose sub-optimal allocations on at least a fraction *ε* of the *R* time steps, then by Lemma 4, the algorithm must have incremented
for some venue *i* and cut-off *c*^{t}_{i} approximately *ε ^{2}R*/(8

*V*) times. By definition of the

*c*

^{t}

_{i}, it can never be the case that was incremented too many times for any

*fixed*values of

*i*and

*c*

^{t}

_{i}(where

*too many*is a polynomial in

*V*, 1/

*ε*, and ln(1/

*δ*)); otherwise the cut-off would have increased. Since there are only

*K*venues and

*V*possible cut-off values to consider in each venue, the total number of increments can be no more than

*KV*times this polynomial, another polynomial in

*V*, 1/

*ε*, ln(1/

*δ*), and now

*K*. If

*R*is sufficiently large (but still polynomial in all of the desired quantities) and approximately

*ε*

^{2}

*R*/(8

*V*) increments were made, we can argue that every venue must have been fully explored, in which case, again, future allocations will be

*ε*-optimal.

We remark that our optimistic tail modifications of the KaplanMeier estimators are relatively mild. This leads us to believe that using the same estimateallocate loop with an *unmodified* KaplanMeier estimator would frequently work well in practice. We investigate a parametric version of this learning algorithm in the experiments described below.

### 5. The Dark Pool Problem

The remainder of this article is devoted to the application of our techniques to the dark pool problem. We begin with a description of the trading data we used, and go on to describe a variety of experiments we performed.

**5.1. Summary of the dark pool data**

Our data set is from the internal dark pool order flow for a major US brokerdealer. Each (possibly censored) observation is of the form discussed throughout the paper—a triple consisting of the dark pool name, the number of shares sent to that pool, and the number of shares subsequently executed within a short time interval. It is important to highlight some limitations of the data. First, note that the data set conflates the policy the brokerage used for allocation across the dark pools with the liquidity available in the pools themselves. For our data set, the policy in force was very similar to the bandit-style approach we discuss below. Second, the “parent” orders determining the overall volumes to be allocated across the pools were determined by the brokerage’s trading needs, and are similarly out of our control.

The data set contains submissions and executions for four active dark pools: BIDS Trading, Automated Trading Desk, D.E. Shaw, and NYFIX, each for a dozen of relatively actively-traded stocks,^{f} thus yielding 48 distinct stockpool data sets. The average daily trading volume of these stocks across all exchanges (light and dark) ranges from 1 to 60 million shares, with a median volume of 15 million shares. Energy, Financials, Consumer, Industrials, and Utilities industries are represented. Our data set spans 30 trading days. For every stockpool pair we have on average 1,200 orders (from 600 to 2,000), which corresponds to 1.3 million shares (from 0.5 to 3 million). Individual order sizes range from 100 to 50,000 shares, with 1,000 shares being the median. Sixteen percent of orders are filled at least partially (meaning that fully 84% result in no shares executed), 9% of the total submitted volume was executed, and 11% of all observations were censored.

**5.2. Parametric models for dark pools**

The theory and algorithm we have developed for censored exploration permit a very general form for the venue distributions *P*_{i}. The downside of this generality is that we are left with the problem of learning a very large number of parameters. More parameters generally mean that more data is necessary to guarantee that the model will generalize well, which means more rounds of exploration are needed before the algorithm’s future performance is near-optimal. In some applications, it is therefore advantageous to employ a less general but more simple parametric form for these distributions.

We experimented with a variety of common parametric forms for the distributions. For each such form, the basic methodology was the same. For each of the 4 × 12 = 48 venuestock pairs, the data for that pair was split evenly into a training set and a test set. The training data was used to select the maximum likelihood model from the parametric class. Note that we can no longer directly apply the nonparametric KaplanMeier estimator—within each model class, we must directly maximize the likelihood on the censored training data. This is a relatively straightforward and efficient computation for each of the model classes we investigated. The test set was then used to measure the generalization performance of each maximum likelihood model.

Our investigations revealed that the best models maintained a separate parameter for the probability of zero shares being available (that is, *P*_{i}(0) is explicitly estimated)—a *zero bin* or ZB parameter. This is due to the fact that the vast majority of submissions (84%) to dark pools result in no shares being executed. We then examined various parametric forms for the nonzero portions of the venue distributions, including uniform (which of course requires no additional parameters), and Poisson, exponential and power law forms (each of which requires a single additional parameter); each of these forms were applied up to the largest volume submitted in the data sets, then normalized.

The generalization results strongly favor the power law form, in which the probability of *s* shares being available is proportional to 1/*s ^{β}* for real

*β*—a so-called heavy-tailed distribution when

*β*> 0. Nonparametric models trained with KaplanMeier are best on the training data but over-fit badly due to their complexity relative to the sparse data, while the other parametric forms cannot accommodate the heavy tails of the data. This is summarized in Table 1. Based on this comparison, for our dark pool study we investigate a variant of our main algorithm, in which the estimateallocate loop has an estimation step using maximum likelihood estimation within the ZB + Power Law model, and allocations are done greedily on these same models.

In terms of the estimated ZB + Power Law parameters themselves, we note that for all 48 stockpool pairs the Zero Bin parameter accounted for most of the distribution (between a fraction 0.67 and 0.96), which is not surprising considering the aforementioned preponderance of entirely unfilled orders in the data. The vast majority of the 48 exponents *β* fell between *β* = 0.25 and *β* = 1.3—so rather long tails indeed—but it is noteworthy that for one of the four dark pools, 7 of the 12 estimated exponents were actually *negative*, yielding a model that predicts *higher* probabilities for larger volumes. This is likely an artifact of our size- and time-limited data set, but is not entirely unrealistic and results in some interesting behavior in the simulations.

**5.3. Data-based simulation results**

As in any control problem, the dark pool data in our possession is unfortunately insufficient to evaluate and compare different allocation algorithms. This is because of the aforementioned fact that the volumes submitted to each venue were fixed by the specific policy that generated the data, and we cannot explore alternative choices—if our algorithm chooses to submit 1000 shares to some venue, but in the data only 500 shares were submitted, we simply cannot infer the outcome of our desired submission.

We thus instead use the raw data to derive a *simulator* with which we can evaluate different approaches. In light of the modeling results of Section 5.2, the simulator for stock *S* was constructed as follows. For each dark pool *i*, we used *all* of the data for *i* and stock *S* to estimate the maximum likelihood Zero Bin + Power Law distribution. (Note that there is no need for a training-test split here, as we have already separately validated the choice of distributional model.) This results in a set of four venue distribution models *P*_{i} that form the simulator for stock *S*. This simulator accepts allocation vectors (*v*_{1}, *v*_{2}, *v*_{3}, *v*_{4}) indicating how many shares some algorithm wishes to submit to each venue, draws a “true liquidity” value *s*_{i} from *P*_{i} for each *i*, and returns the vector (*r*_{1}, *r*_{2}, *r*_{3}, *r*_{4}), where *r*_{i} = min(*v*_{i}, *s*_{i}) is the possibly censored number of shares filled in venue *i*.

Across all 12 stocks, we compared the performance of four different allocation algorithms. The (obviously unrealistic) *ideal allocation* is given the *true parameters* of the ZB + Power Law distributions used by the simulator and allocates shares optimally (greedily) with respect to these distributions. The *uniform allocation* divides any order equally among all four venues. Our *learning algorithm* implements the repeated allocatereestimate loop as in Figure 2, using the maximum likelihood ZB + Power Law model for the reestimation step. Finally, the simple (and fairly naive) *bandit-style algorithm* maintains a weighting over the venues and chooses allocations proportional to the weights. It begins with equal weights assigned to all venues, and each allocation to a venue which results in any nonzero number of shares being executed causes that venue’s weight to be multiplied by a constant factor *α*. (Optimizing *α* over all stockpool pairs resulted in a value of *α* = 1.05.)

Some remarks on these algorithms are in order. First, note that the ideal and uniform allocation methods are nonadaptive and are meant to serve as baselines—one of them the best performance we could hope for (ideal), and the other the most naive allocation possible (uniform). Second, note that our algorithm has a distinct advantage in the sense that it is using the correct parametric form, the same being used by the simulator itself. Thus our evaluation of this algorithm is certainly optimistic compared to what should be expected in practice. Finally, note that the bandit algorithm is the crudest type of weight-based allocation scheme of the type that abounds in the no-regret literature^{6}; we are effectively forcing our problem into a 0/1 loss setting corresponding to “no shares” and “some shares” being executed. Certainly more sophisticated bandit-style approaches can and should be examined.

Each algorithm was run in simulation for some number of *episodes*. Each episode consisted of the allocation of a fixed number *V* of shares—thus the same number of shares is repeatedly allocated by the algorithm, though of course this allocation will change over time for the two adaptive algorithms as they learn. Each episode of simulation results in some fraction of the *V* shares being executed. Two values of *V* were investigated—a smaller value *V* = 1000, and the larger and potentially more difficult *V* = 8000.

We begin by showing full learning curves over 2000 episodes with *V* = 8000 for a couple of representative stocks in Figure 4. Here the average performance of the two non-adaptive allocation schemes (ideal and uniform) are represented as horizontal lines, while learning curves are given for the adaptive schemes. Due to high variance of the heavy-tailed venue distributions used by the simulator, a single trial of 2000 episodes is extremely noisy, so we both average over 400 trials for each algorithm, and smooth the resulting averaged learning curve with a standard exponential decay temporal moving average.

We see that our learning algorithm converges towards the ideal allocation (as suggested by the theory), often relatively quickly. Furthermore, in each case this ideal asymptote is significantly better than the uniform allocation strawman, meaning that optimal allocations are highly nonuniform. Learning curves for the bandit approach exhibit one of the three general behaviors over the set of 12 stocks. In some cases, the bandit approach is quite competitive with our algorithm, though converging to ideal perhaps slightly slower (not shown in Figure 4). In other cases, the bandit approach learns to outperform uniform allocation but appears to asymptote short of the ideal allocation. Finally, in some cases the bandit approach appears to actually “learn the wrong thing”, with performance decaying significantly with more episodes. This happens when one venue has a very heavy tail, but also a relatively high probability of executing zero shares, and occurs because the very naive bandit approach that we use does not have an explicit representation of the tails of the distribution.

The left column of Figure 5 shows more systematic head-to-head comparisons of our algorithm’s performance versus the other allocation techniques after 2000 episodes for both small and large *V*. The values plotted are averages of the last 50 points on learning curves similar to Figure 4. These scatterplots show that across all 12 stocks and both settings of *V*, our algorithm competes well with the optimal allocation, dramatically outperforms uniform, and significantly outperforms the naive bandit allocations (especially with *V* = 8000). The average completion rate across all stocks for the large (small) order sequences is 10.0% (13.1%) for uniform and 13.6% (19.4%) for optimal allocations. Our algorithm performs almost as well as optimal—13.5% (18.7%)—and much better than bandits at 11.9% (17.2%).

In the right column, we measure performance not by the fraction of *V* shares filled in one step, but by the natural alternative of *order half-life*—the number of steps of *repeated* resubmission of any remaining shares to get the total number executed above *V*/2. Despite the fact that our algorithm is not designed to optimize this criterion and that our theory does not directly apply to it, we see the same broad story on this metric as well—our algorithm competes with ideal, dominates uniform allocation and beats the bandit approach on large orders. The average order half-life for large (small) orders is 7.2 (5.3) for uniform allocation and 5.9 (4.4) for the greedy algorithm on the true distributions. Our algorithm requires on average 6.0 (4.9) steps, while bandits uses 7.0 (4.4) to trade the large (small) orders.

### 6. Conclusion

While there has been longstanding interest in quantitative finance in the use of models from machine learning and related fields, they are often applied towards the attempt to predict directional price movements, or in the parlance of the field, to “generate alpha” (outperform the market). Here we have instead focused on a problem in what is often called *algorithmic trading*—where one seeks to optimize properties of a specified trade, rather than decide what to trade in the first place—in the recently introduced dark pool mechanism. In part because of the constraints imposed by the mechanism and the structure of the problem, we have been able to adapt and blend methods from statistics and reinforcement learning in the development of a simple, efficient, and provably effective algorithm. We expect there will be many more applications of machine learning methods in algorithmic trading in the future.

### Acknowledgments

We are grateful to Curtis Pfeiffer and Andrew Westhead for valuable conversations and to Bobby Kleinberg for introducing us to the literature on the newsvendor problem.

### Figures

Figure 1. Optimal allocation algorithm *Greedy*.

Figure 3. Subroutine *OptimisticKM.* Let *M*^{t}_{i,s‘} and *N*^{t}_{i,s‘} be defined in Section 4.1, and assume that *ε, δ* > 0 are fixed parameters.

Figure 4. Sample learning curves. For the stock AIG (left panel), the naive bandits algorithm (labeled blue curve) beats uniform allocation (dashed horizontal line) but appears to asymptote short of ideal (solid horizontal line). For the stock NRG (right panel), the bandits algorithm actually deteriorates with more episodes, underperforming both the uniform and ideal allocations. For both stocks (and the other 10 in our data set), our algorithm (labeled red curve) performs nearly optimally.

Figure 5. Comparison of our learning algorithm to the three baselines. In each plot, the performance of the learning algorithm is plotted on the *y*-axis, and the performance of one of the baselines on the *x*-axis. Left column: Evaluated by the fraction of submitted shares executed in a single time step; higher values are better, and points above the diagonal are wins for our algorithm. Right: Evaluated by order half-life; lower values are better, and points below the diagonal are wins for our algorithm. Each point corresponds to a single stock and order size; small orders (red plus signs) are 1000 shares, large orders (blue squares) are 8000 shares.

## Join the Discussion (0)

## Become a Member or Sign In to Post a Comment