Diving Head-First Into The Dark Pool Problem

Reinforcement Learning for Orders Execution in Modern Quantitative Finance

Louise Ferbach
Towards Data Science


Photo by Drew Dau on Unsplash

This article presents a review, discussion and implementation of the paper Censored Exploration and the Dark Pool Problem by Kuzman Ganchev, Michael Kearns, Yuriy Nevmyvaka and Jennifer Wortman Vaughan from the Computer and Information Science department of the University of Pennsylvania. You can find it here.

It is quite long because I try to be exhaustive about the mathematical arguments that guarantee the efficiency of the proposed algorithms. I also think this really helps in understanding what everything is about. However, if you’re unfamiliar or uncomfortable with the concepts, don’t hesitate to skip the maths and jump to the implementation parts.

Since we will meet a lot of difficult financial, mathematical and algorithmic concepts, I am going to decorate this article with heavenly swimming pools which will make you want to plunge ever deeper into the subject.

Note from Towards Data Science’s editors: While we allow independent authors to publish articles in accordance with our rules and guidelines, we do not endorse each author’s contribution. You should not rely on an author’s works without seeking professional advice. See our Reader Terms for details.

Part One : Overview, Vocabulary and Definitions

This paper is not easy to grasp at first if you are unfamiliar with some technicalities of modern quantitative finance and reinforcement learning. Here, I will give a brief overview of the most important concepts.

First, let’s start with the abstract, which will give us a more accurate idea of the general topic. It sounds very dense at first glance and it can be difficult to find your way around. I have bolded the phrases that probably require some explanation for the average reader.

We introduce and analyze a natural algorithm for multi-venue exploration from censored data, which is motivated by the Dark Pool Problem of modern quantitative finance. We prove that our algorithm converges in polynomial time to a near-optimal allocation policy; prior results for similar problems in stochastic inventory control guaranteed only asymptotic convergence and examined variants in which each venue could be treated independently. Our analysis bears a strong resemblance to that of efficient exploration / exploitation schemes in the reinforcement learning literature. We describe an extensive experimental evaluation of our algorithm on the Dark Pool Problem using real trading data.

Dark pools are a type of stock exchange in which information about outstanding orders is deliberately hidden in order to minimize the market impact of large-volume trades. That is to say, if someone who owned 30% of the total market capitalization of a company wanted to sell their stake, the very process of selling would likely be enough to disrupt market prices.

That’s just the basic demand-and-supply process : the sudden arrival of such a huge amount of stock on the market would be similar to an increase in supply ; demand remaining more or less the same, prices would plummet. Thus, the seller would receive only a fraction of the market value his shares had before being sold. Hiding information about this outstanding order prevents buyers from noticing the sudden jump in supply, and enables the seller to get a price equal to the market value of his stocks before the trade. It is a deliberate case of information asymmetry on a financial market.

Dark pools are a recent type of stock exchange in which relatively little information is provided about the current outstanding orders.

As of February 2020, there were more than 50 dark pools registered with the Securities and Exchange Commission (SEC). The success and proliferation of dark pools have created challenging and interesting problems in algorithmic trading, in particular, the problem of optimizing the allocation of a large trade over multiple competing dark pools : that is why the authors are talking about multi-venue exploration for finding an allocation policy for your trade (i.e., how many of your shares it is optimal to distribute on each of the venues).

In the Dark Pool Problem, at each time step a trader must buy or sell up to V shares of a given stock, and does so by allocating them over multiple distinct exchanges (venues) known as dark pools.

Censored data is a type of incomplete data where you only have information about the boundaries of a value. Example : you have a list of people. You know the exact age of some of them, but for others you only know that they are between 20 and 30, or more than 50, or less than 18, etc. Be careful not to confuse censored data with truncated data : the latter occurs when values beyond a boundary are excluded altogether, whether at gathering time or at analysis time. For example, someone conducting a survey on middle-class workers might ask whether you make more than $100,000 ; if you answered “yes” (lucky you), they would stop there and drop you from the sample : you’ve been truncated.

If v_i shares are allocated to dark pool i, and all of them are executed, the trader learns only that the liquidity available at exchange i was at least v_i, not the actual larger number that could have been executed there; this is why the data is considered censored.

An algorithm is said to run in polynomial time if its running time is upper bounded by a polynomial expression in the size n of the input, i.e., T(n) = O(n^k) for some positive constant k. Problems for which deterministic polynomial-time algorithms exist are considered easy and fast, in contrast to problems of exponential complexity, i.e., T(n) = O(e^n) or worse, which quickly become infeasible as the input grows.

A near-optimal solution to a problem is obtained when finding the true solution is too hard and/or too costly, and one restricts the optimization problem to a class of potential solutions with interesting properties making the optimum easy to find (a convex class for example), or when one relies on numerical methods for finding a proxy to the solution.

The aim of stochastic inventory control is to determine, by stochastic methods, the optimal timing of a replenishment order (and its corresponding quantity) subject to demand uncertainty. Asymptotic convergence of a stochastic procedure means that, as the number of random estimations grows, the algorithm tends towards a finite value that is the solution to the problem. That will become clearer later.

If we routinely allocate too-small volumes to a venue, we receive censored observations on sales and underutilize the venue; if we submit too-large volumes we receive uncensored (or direct) observations on sales in the venue, but have excess inventory.

A recurrent financial problem for companies is to find an efficient exploration and exploitation scheme. Firms are engaged in new activities like research and, at the same time, in more routine ones like development and production. Thus, they must strike a satisfying balance between exploration (uncertain but potentially large future gains) and exploitation (certain, immediate, but moderate gains).

Its transposition to reinforcement learning is direct : the cost here is the running time of the algorithm, and we’d like the agent to find the best solution to a given problem as fast as possible. However, committing to solutions too quickly, without enough exploration of other potential strategies, is risky, as it can lead to local optima. Modern RL algorithms that optimize for the best returns can achieve good exploitation quite efficiently, while exploration remains more of an open topic. The efficient exploration of the action and state space is a crucial factor in the convergence rate of a learning scheme.

A learning algorithm receives a sequence of total volumes V_1, V_2,… and must decide at each time step t how to distribute the V_t units across the venues.

Congratulations ! You’ve made it this far… Now, let’s begin to dip into the subject.

Photo by Drew Dau on Unsplash

Part Two : Presentation of the Problem

Generalities

Consider a setting in which at each time period, we have some volume of V units (possibly varying with time) of an abstract good. Our goal is to “sell” or “consume” as many of these units as possible at each step, and there are K abstract “venues” in which this selling or consumption may occur. We can divide our V units in any way we like across the venues in service of this goal.

Some comments about the problem settings are necessary:

  • The trader must buy or sell up to V shares (V is assumed to be “big”) of a given stock on behalf of a client. It is important that we view V as given exogenously by the client and not under the trader’s control.
  • We assume that the maximum amount of consumption available in venue i at each time step (s_i) is drawn according to a fixed unknown distribution P_i. Formally speaking, this means that when v_i units are submitted to venue i, s_i is drawn randomly from P_i and the observed (and possibly censored) amount of consumption is min{s_i, v_i}.
  • There is a distinct between-venue exploration component to this problem, since the “right” number of shares to submit to venue i may depend on both V_t and the distributions of the other venues, and the only way we can discover the distributions is by submitting allocations.

The objective of the paper is to efficiently (in time polynomial in the “complexity” of the P_i and other parameters) learn a near-optimal allocation policy over time, under stochastic assumptions on the venues, for any unknown venue distributions P_i.

The algorithm presented by the authors takes a particularly natural and appealing form, in which allocation and distribution reestimation are repeatedly alternated. More precisely :

  • At each time step t we maintain distributional estimates ;
  • Pretending that these estimates are in fact exactly correct, we allocate the current volume V_t accordingly ;
  • These allocations generate observed consumptions in each venue, which in turn are used to update or reestimate the distributional estimates.

This procedure will be referred to later as the estimate-allocate loop.

The paper shows that when the distributional estimates are “optimistic tail modifications” of the classical Maximum Likelihood Estimator for censored data (wow ! that’s technical, don’t worry, definitions are coming), this estimate-allocate loop efficiently explores (in the RL sense) allocation schemes across venues :

  • Venues with smaller available volumes (relative to the overall volume V_t and the other venues) are gradually given smaller allocations in the estimate-allocate loop,
  • Whereas venues with repeated censored observations are gradually given larger allocations,
  • Eventually settling on a near-optimal overall allocation distribution.

Related Work

The problem closest to this setting is the widely studied newsvendor problem in operations research. The two main differences are that :

  • it is a single-venue problem,
  • the agent chooses V, which is therefore not exogenous.

At each time period, a player (a newsstand owner) chooses the quantity V of newspapers to purchase at a fixed per-unit price, and tries to maximize profit in the face of demand uncertainty at a single venue (his newsstand).

This is where the Maximum Likelihood Estimator for perishable inventory problems (perishable because newspapers have a one-day lifetime) first appeared, as well as the estimate-allocate loop. It was shown that this problem admits asymptotic convergence to near-optimal behavior.

Variables and Hypotheses

The mathematical formulation of the problem is presented hereunder.

At each time t, a volume of units V_t is sampled from an unknown distribution Q.

The maximum consumption level s_i of each venue i at time t is sampled independently from a fixed but unknown distribution P_i.

The learner must choose an allocation of these units over the K venues : v_1 + v_2 + … + v_K = V_t, with every v_i ≥ 0.

The learner is then told the number of units consumed at each venue i : r_i = min{s_i, v_i}.

If the number of consumed units at venue i is equal to the number of units that had been allocated to it, we say that the algorithm receives a censored observation because it is only possible to infer that the max consumption level of venue i is greater than the observed number. On the other hand, if we consumed fewer units than were allocated, we say that the algorithm receives a direct observation because it means that the max consumption level of the venue has been reached.
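
In code, the distinction is a one-liner. Here is a toy illustration (mine, not the paper’s) :

```python
def observation_type(allocated: int, consumed: int) -> str:
    # consumed == allocated : every submitted unit was executed, so we only
    # learn that the venue's liquidity was at least `allocated` (censored).
    # consumed < allocated : we saw the venue's true maximum (direct).
    return "censored" if consumed == allocated else "direct"
```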

We will denote by T_i the tail probabilities associated with P_i : T_i(s) = Pr[s_i ≥ s] = Σ_{s′ ≥ s} P_i(s′).

The goal of the learner is to discover a near-optimal one-step allocation policy, that is, an allocation policy that approximately optimizes the expected number of units consumed out of the total available at each time step t.

Feeling at ease so far ? Now that you’ve dipped enough into the equations, let’s swim a little further into the theorems.

Photo by Karolina Bobek ✌ on Unsplash

Part Three : Greedy Allocation is Optimal

In this section, we show that given estimates of the tail probabilities for each venue i, a simple greedy allocation algorithm maximizes the (estimated) expected number of units consumed at a single time step.

Presentation and Explanation of the Algorithm

The greedy algorithm allocates one unit at a time. The venue to which the next unit is allocated is chosen to maximize the estimated probability that the unit will be consumed; if v_i units have already been allocated to venue i, then the estimated probability that the next allocated unit will be consumed is simply the estimated tail probability of v_i + 1.

It is important to notice here that the Greedy Algorithm is identical at every timestep. The tail probability estimates are considered to be given beforehand and don’t change over time, there isn’t any time component anywhere in the process. That will be one of the main differences with the Censored Exploration Algorithm that will be presented later.

A formal description is given as Algorithm 1 below.

The Greedy Algorithm
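
Since the original figure cannot be reproduced here, the following is a minimal Python sketch of the same procedure. It assumes the tail estimates are provided as a function tail(i, s) returning the estimated probability that venue i can absorb at least s units :

```python
import numpy as np

def greedy_allocation(V, K, tail):
    """Greedy allocation of V units over K venues (a sketch of Algorithm 1).

    tail(i, s) is the estimated tail probability T_i(s), i.e. the probability
    that venue i can absorb at least s units. Each unit is sent to the venue
    where it is the most likely to be consumed.
    """
    alloc = np.zeros(K, dtype=int)
    # marginal value of one extra unit at each venue : T_i(alloc_i + 1)
    gain = np.array([tail(i, 1) for i in range(K)], dtype=float)
    for _ in range(V):
        i = int(np.argmax(gain))            # best venue for the next unit
        alloc[i] += 1
        gain[i] = tail(i, alloc[i] + 1)     # tails are non-increasing in s
    return alloc
```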

Theorem 1. The allocation returned by Greedy maximizes the expected number of units consumed in a single time step.

While not all units are allocated, the greedy algorithm iteratively adds a unit to the venue whose tail probability of the total allocation augmented by one would be the highest.

In order to understand the proof, you just have to note two things:

  • The tail probabilities (estimates) are decreasing functions ;
  • At each step of the algorithm, the maximum tail probability over one-unit increments is therefore a global maximum over all units not yet allocated.

Thus, we can say that at each step of the algorithm, the sum of the maxima over all venues and all allocated units is equal to the global maximum of this sum (which would not be true in general were the sequences not decreasing, since adding the local maximum at each step would not guarantee reaching the global maximum).

Algorithm 1 illustrated

The figure above illustrates the process. At a given step, we have allocated part of the V units over the K venues (area under the red curve) : in this example, venue 1 has 4 units, venue 3 has 2 units... The sequential process guarantees that, at every step, all items under the red curve have a higher tail probability than all items above the curve. To choose the venue receiving the next unit, we look at the tail probabilities and pick the venue where the estimated tail probability of an increment of 1 unit would be the highest (represented here by the white square).

The output of the greedy algorithm is the allocation (v_1, …, v_K) that maximizes the sum of the estimated tail probabilities, Σ_i Σ_{s=1}^{v_i} T_i(s).

We now want to prove that the expected number of consumed units is maximal for the greedy allocation. If it maximizes the expected number of units consumed at each venue, it also maximizes the overall expected number. Using the fact that the tail probability at a level is defined as the sum of the densities above this level, we can easily introduce telescoping sums that lead to the following result, for each venue i :

E[min{s_i, v_i}] = T_i(1) + T_i(2) + … + T_i(v_i) = Σ_{s=1}^{v_i} T_i(s)

Tadam ! Maximizing the sum of tail probabilities, as Greedy does, is equivalent to maximizing the expected total consumption ! Isn’t that beautiful ? This very simple algorithm naturally produces the optimal allocation.

Python Implementation on a Toy Example

To clear things up a bit, I implemented the greedy algorithm on a toy example. I took a total of V = 10,000 shares to be allocated on K = 50 dark pools.

For each venue i, I chose to model the law of its liquidity P_i by a Poisson distribution of unknown parameter mu_i. I sampled the parameters exogenously in advance, as random integers drawn uniformly from {0, …, 400}.

Please note that here, I use the true distributions directly instead of estimates.
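
The original gist is not embedded here, so hereunder is a sketch of the setup as described (the seed is arbitrary, so results will vary ; it reuses greedy_allocation from above) :

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(42)        # arbitrary seed
K, V = 50, 10_000
mus = rng.integers(0, 401, size=K)     # true Poisson parameters, uniform in {0, ..., 400}

def true_tail(i, s):
    # T_i(s) = P(X >= s) for X ~ Poisson(mu_i) ; sf(s - 1) = P(X > s - 1)
    return poisson.sf(s - 1, mus[i])

greedy_alloc = greedy_allocation(V, K, true_tail)
print(greedy_alloc.min(), greedy_alloc.max())
```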

Taking a look at the result, I noted that the venues had been allocated from only 7 up to 377 units. The distribution of the number of units allocated to each venue is the following :

Interpretation : there are 5 venues to which fewer than 40 units were allocated.

Now, time to check if the expected number of units consumed with the greedy allocation is indeed higher than with another allocation !

I will simply compare it to two other allocations (a sketch of the comparison follows the list) :

  • the uniform allocation, where all V shares are equally split between the K venues ;
  • a random allocation, where all V shares are randomly assigned to any of the K venues.
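
Hereunder is a sketch of this comparison, reusing the identity E[min{s_i, v_i}] = Σ_{s=1}^{v_i} T_i(s) derived above (your exact numbers will differ from mine, as the random draws do) :

```python
def expected_consumption(alloc):
    # E[min{s_i, v_i}] = T_i(1) + ... + T_i(v_i), summed over all venues
    return sum(true_tail(i, np.arange(1, v + 1)).sum()
               for i, v in enumerate(alloc) if v > 0)

uniform_alloc = np.full(K, V // K)                   # 200 units per venue
random_alloc = rng.multinomial(V, np.ones(K) / K)    # V units spread at random
for name, a in (("greedy", greedy_alloc), ("uniform", uniform_alloc), ("random", random_alloc)):
    print(name, round(expected_consumption(a)))
```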

I get 9731 expected units consumed with the greedy allocation, 7683 with the uniform allocation and 7598 with the random allocation. That’s reassuring.

I’m impressed you’ve managed to stay afloat this far. Enough splashing around on the surface ! That was only the tip of the iceberg ; now it’s time to immerse yourself in the depths of the mathematics supporting dark pools.

Photo by Briggs Boyd on Unsplash

Part Four : The Censored Exploration Algorithm

Now, time to get to the serious stuff. In this section I will present the paper’s main theoretical result, which is a polynomial-time, near-optimal algorithm for multi-venue exploration from censored data. I will first provide an overview of the algorithm and its analysis before diving into the technical details.

Presentation and Explanation of the Algorithm

If you understood the Greedy Algorithm, you should have no difficulty with the Censored Exploration Algorithm. As I said previously, the greedy process was identical over time ; the authors simply introduced, at each time step, updates of the tail distribution estimates, computed from the censored observations collected from each venue at the previous time step (i.e., the units consumed at each venue given the number that had been allocated). Thus the algorithm (Algorithm 2 below) implements a continuous estimate-allocate loop.

The Censored Exploration Algorithm

Everything seems simple in my description of the procedure, but as you will have understood, the crux of the matter is updating, at each time step, the estimated probabilities according to the censored data collected at that step. This corresponds to the subroutine OptimisticKM, which specifies how we compute the new estimated tail probabilities from the observed data.

The most natural choice would be the Maximum Likelihood Estimator (MLE) on the data, aka the Kaplan-Meier estimator (thus the KM in OptimisticKM, which is only an optimistically modified MLE that will be detailed later).

I present hereunder an implementation of this algorithm.
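
(The original gist is not embedded here ; the sketch below is my reconstruction of the loop, reusing greedy_allocation and numpy from the earlier sketches. The update_tails routine is pluggable, and the PoissonVenue class is toy scaffolding of my own, not part of the paper.)

```python
class PoissonVenue:
    """Toy venue : hidden Poisson liquidity, returns a possibly censored consumption."""
    def __init__(self, mu, rng):
        self.mu, self.rng = mu, rng
    def execute(self, v):
        s = self.rng.poisson(self.mu)   # hidden max consumption at this step
        return min(s, int(v))           # censored whenever s >= v

def censored_exploration(volumes, venues, update_tails, V_max):
    """Estimate-allocate loop (a sketch of Algorithm 2)."""
    K = len(venues)
    history = [[] for _ in range(K)]                # per-venue (allocated, consumed) pairs
    tails = [np.ones(V_max + 2) for _ in range(K)]  # optimistic start : T_i(s) = 1 everywhere
    for V_t in volumes:
        # allocate pretending the current estimates are exactly correct
        alloc = greedy_allocation(V_t, K, lambda i, s: tails[i][s])
        for i in range(K):
            consumed = venues[i].execute(alloc[i])      # observed, possibly censored
            history[i].append((int(alloc[i]), consumed))
            tails[i] = update_tails(history[i], V_max)  # plain KM or OptimisticKM
    return tails
```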

You can see that any implementation can be plugged in for the update of the tail estimates. Further in the article, I will provide an example with the traditional implementation of the MLE, as well as with the optimistic version.

It’s all very dense, so we will proceed step by step at this stage.

Step 1 :

  • We first review the MLE for censored data and provide a new finite-sample convergence bound for this estimator.
  • This allows us to define a “cut-off point” for each venue i such that the MLE of the tail probability T_i(s) for every value of s up to the cut-off point is guaranteed to be “close to” the true tail probability.
  • We then define a slightly modified version of the MLE in which the tail probability of the next unit above the cut-off is modified in an optimistic manner (to be defined).

We show that in conjunction with the greedy allocation algorithm, this minor modification leads to increased exploration, since the next unit beyond the cut-off point always looks at least as good as the cut-off point itself.

Step 2 : We next prove our main Exploitation Lemma (Lemma 4).

This shows that at any time step, if the number of units greedy-allocated to each venue is strictly below the cut-off point for that venue, then the allocation is “epsilon-optimal” (to be defined).

Step 3 : We then prove our main Exploration Lemma (Lemma 5).

This shows that on any time step at which the greedy-allocation is not “epsilon-optimal”, it is possible to lower-bound the probability that the algorithm explores. Anytime we are not in a known state and thus cannot ensure optimal allocation, we are instead assured of exploring.

Step 4 : Finally, we show that on any sufficiently long sequence of time steps (“sufficiently long” is polynomial in some parameters), we must have :

  • either the algorithm achieves an “epsilon-optimal” solution on a lower-bounded fraction of the sequence,
  • or the algorithm has explored sufficiently often to learn accurate estimates of the tail distributions out to V units on every venue.

In either case, we can show that with probability 1 − δ, by the end of the sequence, the algorithm makes an “epsilon-optimal” allocation with probability at least 1 − ϵ.

Photo by Israel Gil on Unsplash

Step 1 : Review & Adaptation of the Maximum Likelihood Estimator

We will work with a given venue i.

We define by z_s the true probability that the demand (or the liquidity, or the maximum number of units that can be absorbed) in this venue is exactly s units, given that it is at least s : z_s = Pr[s_i = s | s_i ≥ s] = P_i(s) / T_i(s).

Let D be the number of direct observations of s units up to time t, and let N be the number of (direct or censored) observations of at least s units on time steps at which more than s units were requested. The quantity N is then the number of times there was an opportunity for a direct observation of s units, whether or not one occurred.

We can then naturally define the empirical probability ẑ_s = D/N of a direct observation of s units given that a direct observation of s units was possible, which leads us to the MLE of the tail probability for any s > 0 after t time steps :

T̂_i(s) = Π_{s′=0}^{s−1} (1 − ẑ_{s′})

MLE of the tail probabilities, traditional (Kaplan-Meier) calculation

Here is what we get when we implement the MLE update in Algorithm 2 :
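
(The gist and the result plots are not reproduced here ; the following is my sketch of the plain Kaplan-Meier update, following the definitions of D and N above.)

```python
def km_tails(history, V_max):
    """Product-limit (Kaplan-Meier) estimate of the tail probabilities
    from a venue's list of (allocated, consumed) pairs."""
    tails = np.ones(V_max + 2)                              # T(0) = 1
    for s in range(V_max + 1):
        D = sum(1 for v, r in history if r == s and r < v)  # direct observations of exactly s
        N = sum(1 for v, r in history if r >= s and v > s)  # chances to observe s directly
        z = D / N if N > 0 else 0.0                         # no data : assume no mass at s
        tails[s + 1] = tails[s] * (1.0 - z)                 # T(s+1) = T(s) * (1 - z_s)
    return tails

# Toy run with the scaffolding above (reusing rng) : 5 venues, 50 steps of 200 units
venues = [PoissonVenue(mu, rng) for mu in (10, 30, 50, 70, 90)]
tails = censored_exploration([200] * 50, venues, km_tails, V_max=200)
```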

We’ve reached the point where mathematical theory stops helping us :

  • There are convergence rates for the MLE to the true underlying distribution in the case that the submission sequence is i.i.d ;
  • There are asymptotic convergence results for non-i.i.d. settings ;
  • We are in the non-i.i.d. case, since the volumes submitted to one venue are a function of the entire history of allocations and executions across all venues ; but we would like a finite-sample convergence bound.

That’s not a dead end though, as the authors demonstrate. Indeed, from the formula of the tail probability estimate, they derive Theorem 2 hereunder, which states that these tail probability estimates also converge over time to the true distributions, for each venue.

Theorem 2 :

(I will not spell out the proof, which involves a lot of unpleasant calculus, since it doesn’t help comprehension. Feel free to read it in the original paper.)

Hurray ! Now that we have a convergence bound of the MLE towards the true distribution, we can try to modify this estimator a bit. It seems obvious that the more data we have, the more accurate our estimator will be ; but we would like to guarantee that it indeed becomes sufficiently accurate provided we gather the necessary amount of data. I will explain the details hereunder.

Computation of the OptimisticKM

In Algorithm 3, the value c can intuitively be viewed as a “cut-off point” up to which we are guaranteed to have sufficient data to accurately estimate the tail probabilities using the MLE.

Thus for every quantity s lower than this cut-off point, we simply calculate the tail probability estimates through MLE. It remains to fix the estimates for quantities above that threshold. We decide to set the value of the estimated tail probability one unit above the cut-off point to the MLE of the tail probability at the cut-off point.

This optimistic modification ensures that the greedy algorithm explores (i.e., has a chance of making progress towards increasing at least one cut-off value) on every time step for which the produced allocation is not already optimal.

In particular, suppose that the current greedy solution allocated no more than c_i units to any venue i and exactly c_j units to some venue j. Using the standard maximum likelihood tail probability estimates, this allocation could be suboptimal (there is no way to know if it would have been better to include the next unit from venue j in place of a unit from another venue since we do not have an accurate estimate of the tail probability for this unit), and yet no exploration is taking place.

By optimistically modifying the tail probability estimates of the current amounts increased by 1 for each venue, we ensure that no venue remains unexplored simply because the algorithm unluckily observes a low demand a small number of times.

This c_i is a “cut-off point” up to which, at each step, the MLE is accurate.

Hereunder is a Python implementation of the procedure.
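
(A sketch under stated assumptions : the paper’s exact cut-off condition depends on ϵ, δ and the history, so the simple minimum-sample-count threshold n_min below is a simplified stand-in of my own.)

```python
def optimistic_km_tails(history, V_max, n_min=32):
    tails = km_tails(history, V_max)
    def N(s):  # opportunities for a direct observation of s units
        return sum(1 for v, r in history if r >= s and v > s)
    # cut-off c : highest level below which every quantity has been
    # directly observable often enough (simplified criterion, see above)
    c = 0
    while c < V_max and N(c) >= n_min:
        c += 1
    # optimistic modification : the unit just above the cut-off looks at
    # least as good as the cut-off itself, which forces exploration
    tails[c + 1] = tails[c]
    return tails
```

Swapping it into the loop is then a one-liner : censored_exploration([200] * 50, venues, optimistic_km_tails, V_max=200).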

Steps 2 & 3 : Exploitation and Exploration Lemmas

In this part, I will not aim at proving every lemma and theorem mathematically or go deeply into technical details, but rather at giving a general intuition of their meanings.

Lemma 2 :

Lemma 2

This is a trivial consequence of Theorem 2 combined with the formula for the cut-off points from Algorithm 3.

Lemma 3 :

Lemma 3

Lemma 3 shows that it is also possible to achieve additive bounds on the error of tail probability estimates for quantities s much bigger than the cut-off point as long as the tail probability at that cut-off point is sufficiently small.

We are now ready to state our main Exploitation Lemma (Step 2), which formalizes the idea that once a sufficient amount of exploration has occurred, the allocation output by the greedy algorithm will be epsilon-optimal.

Lemma 4 (Exploitation Lemma) :

Lemma 4 : Exploitation Lemma

Note that these bounds are tight in the sense that it is possible to construct examples where the equality is obtained.

Finally, we present the main Exploration Lemma (Step 3), which states that on any time step at which the allocation is not epsilon-optimal, the probability of a “useful” observation is lower bounded.

Lemma 5 (Exploration Lemma) :

Same assumptions. If the allocation is not ϵ-optimal, then for some venue i, with probability at least ϵ/(8V), the algorithm obtains a “useful” observation at venue i, i.e., it makes progress towards increasing that venue’s cut-off value.

Step 4:

Putting it all together, we finally get to the main theorem hereunder :

Theorem 3 :

For any ϵ > 0 and δ > 0, with probability 1 − δ (over the randomness of draws from Q and {P_i}), after running for a time polynomial in K, V, 1/ϵ, and ln(1/δ), Algorithm 2 makes an ϵ-optimal allocation on each subsequent time step with probability at least 1 − ϵ.

I will conclude this theory part with a few brief remarks.

  • Our optimistic tail modifications of the maximum likelihood estimators are relatively mild. In many circumstances we actually expect that the estimate-allocate loop with unmodified MLE would work well.
  • It also seems quite likely that variants of the algorithm and analysis could be developed for alternative objective functions, such as minimizing the number of steps of reallocation required to execute all shares, or maximizing the probability of executing all V shares on a single step.

I hope you are not overwhelmed by the flood of mathematics. Don’t worry, we will now drift towards the dark pools.

Photo by Bruce Christianson on Unsplash

Part Five : Application to the Dark Pool Problem

As mentioned in the introduction, dark pools are a particular and relatively recent type of exchange for listed stocks. Indeed, the main challenge in executing large-volume trades on traditional exchanges is that it is difficult to “conceal” such trades, and their revelation generally results in adverse price impact (e.g., the presence of a large-volume buyer causes the price to rise against that buyer). That’s basically the law of supply and demand.

If the volume is sufficiently large, this difficulty of concealment remains even if one attempts to break the trade up slowly over time. Dark pools arose to address the problems faced by large traders, and tend to emphasize trade and order concealment over price optimization.

In a typical dark pool, buyers and sellers submit orders that simply specify the total volume of shares they wish to buy or sell, with the price of the transaction determined exogenously by “the market”.

Upon submitting an order to buy v shares, a trader is put in a queue of buyers awaiting transaction, and there is a similar queue of sell orders. Matching between buyers and sellers occurs as orders arrive sequentially, similar to a classical exchange. However, in this case no information is provided to traders about how many shares might be available in the pool at any given moment. Thus, at a given time step, a submission of v shares results only in a (possibly censored) report of how many shares (up to v) were executed.

Dark pools have become very popular exchanges, responsible for executing 10–20% of the overall US equity volume. In fact, they have been so successful that there are now more than 40 dark pools for the U.S. equity market alone, leading traders and brokerages to face exactly the censored multi-venue exploration problem we have been studying:

How should one optimally distribute a large trade over the many independent dark pools ?

The authors didn’t release any data or code that would make their results reproducible. Dark pool data can be obtained through the OTC Transparency Data project on the Financial Industry Regulatory Authority (FINRA) website. For full details on the origin of the data, refer to this notice. However, this data is aggregated by week, month, trading venue or company, so it is not possible for us to reproduce the results presented in the article. I warmly encourage you to go and read their implementation and results, which are quite interesting !

Aggregated Trade Data Overview
Aggregated Weekly Data Overview
Photo by Valentin Lacoste on Unsplash

Thanks for reading !
