### Abstract

Constraint Satisfaction Problems (CSPs) and data stream models are two powerful abstractions to capture a wide variety of problems arising in different domains of computer science. Developments in the two communities have mostly occurred independently and with little interaction between them. In this work, we seek to investigate whether bridging the seeming communication gap between the two communities may pave the way to richer fundamental insights. To this end, we focus on two foundational problems: model counting for CSPs and the computation of the number of distinct elements in a data stream, also known as the zeroth frequency moment (*F*_{0}) of a data stream.

Our investigations lead us to observe a striking similarity in the core techniques employed in the algorithmic frameworks that have evolved separately for model counting and distinct elements computation. We design a recipe for translating algorithms developed for distinct elements estimation into algorithms for model counting, resulting in new model counting algorithms. We then observe that algorithms for distributed streaming can be transformed into distributed algorithms for model counting. We next turn our attention to viewing streaming from the lens of counting and show that framing distinct elements estimation as a special case of #DNF counting allows us to obtain a general recipe for a rich class of streaming problems, which had been subjected to case-specific analysis in prior works.

### 1. Introduction

*Constraint Satisfaction Problems* (CSPs) and the *data stream model* are two core themes in computer science with a diverse set of applications in topics including probabilistic reasoning, networks, databases, and verification. *Model counting* and computation of the *zeroth frequency moment* (*F*_{0}) are fundamental problems for CSPs and the data stream model, respectively. This paper is motivated by our observation that despite the usage of similar algorithmic techniques for the two problems, the developments in the two communities have, surprisingly, evolved separately, and rarely has a paper from one community been cited by the other.

Given a set of constraints ϕ over a set of variables in a finite domain *D*, the problem of model counting is to estimate the number of solutions of ϕ. We are often interested when ϕ is restricted to a special class of representations such as Conjunctive Normal Form (CNF) and Disjunctive Normal Form (DNF). A data stream over a domain [*N*] is represented by **a** = 〈*a*_{1}, *a*_{2}, ···, *a*_{m}〉 where each item *a*_{i} is a subset of [*N*]. The *zeroth frequency moment*, denoted *F*_{0}, of **a** is the number of distinct domain elements appearing in **a**, that is, |∪_{i} *a*_{i}| (traditionally, the *a*_{i}s are singletons; we will also be interested in the case when the *a*_{i}s are sets). The fundamental nature of model counting and *F*_{0} computation over data streams has led to intense interest from theoreticians and practitioners alike in the respective communities for the past few decades.

The starting point of this work is the confluence of two viewpoints. The first viewpoint contends that some of the algorithms for model counting can conceptually be thought of as operating on the stream of the solutions of the constraints. The second viewpoint contends that a stream can be viewed as a DNF formula, and the problem of *F*_{0} estimation is similar to model counting. These viewpoints make it natural to believe that algorithms developed in the streaming setting can be directly applied to model counting, and vice versa. We explore this connection and indeed design new algorithms for model counting inspired by algorithms for estimating *F*_{0} in data streams. By exploring this connection further, we design new algorithms to estimate *F*_{0} for streaming sets that are succinctly represented by constraints. It is worth noting that the two communities focus on seemingly different efficiency objectives: in streaming algorithms, space complexity is of major concern while in the context of model counting, time (especially NP query complexity in the context of CNF formulas) is of primary concern. Therefore, it is striking to observe that our transformation recipe leads to the design of *efficient* algorithms for *F*_{0} estimation as well as model counting wherein *efficient* is measured by the concern of the corresponding community. We further investigate this observation and demonstrate that the space complexity of streaming algorithms provides an upper bound on the query complexity of model counting algorithms.

To put our contributions in context, we briefly survey the historical development of algorithmic frameworks in both model counting and *F*_{0} estimation and point out the similarities.

**1.1. Model counting**

The complexity-theoretic study of model counting was initiated by Valiant who showed that this problem, in general, is #P-complete.^{32} This motivated researchers to investigate approximate model counting and in particular to design (ε, δ)-approximation schemes. The complexity of approximate model counting depends on the representation. When ϕ is represented as a CNF formula, designing an efficient (ε, δ)-approximation is NP-hard.^{30} In contrast, when ϕ is represented as a DNF formula, model counting admits a fully polynomial-time randomized approximation scheme (FPRAS).^{21, 22} We will use #CNF to refer to the case when ϕ is a CNF formula and #DNF to refer to the case when ϕ is a DNF formula.

For #CNF, Stockmeyer^{30} provided a hashing-based randomized procedure that can compute an (ε, δ)-approximation with running time poly(|ϕ|, 1/ε, 1/δ), given access to an NP oracle. Building on Stockmeyer’s approach and motivated by the unprecedented breakthroughs in the design of SAT solvers, researchers have proposed a series of algorithmic improvements that have allowed the hashing-based techniques for approximate model counting to scale to formulas involving hundreds of thousands of variables. The practical implementations substitute the NP oracle with a SAT solver. In the context of model counting, we are primarily interested in time complexity and therefore, the number of NP queries is of key importance. The emphasis on the number of NP calls also stems from practice, as practical implementations of model counting algorithms have been shown to spend over 99% of their time in the underlying SAT calls.^{29}

Karp and Luby^{21} proposed the first FPRAS scheme for #DNF, which was improved in subsequent works.^{9, 22} Chakraborty et al.^{5} demonstrated that the hashing-based framework can be extended to #DNF, thereby providing a unified framework for both #CNF and #DNF. Meel et al.^{24, 25} subsequently improved the complexity of the hashing-based approach for #DNF and observed that hashing-based techniques achieve better scalability than Monte Carlo techniques.

**1.2. Zeroth frequency moment estimation**

Estimating an (ε, δ)-approximation of the *k*^{th} frequency moment (*F*_{k}) of a stream is a central problem in the data streaming model.^{1} In particular, considerable work has been done on designing algorithms for estimating the 0^{th} frequency moment (*F*_{0}), the number of distinct elements in the stream. For streaming algorithms, the primary resource concerns are space complexity and processing time per element. In general, for a streaming algorithm to be considered efficient, these should be poly(log *N*, 1/ε), where *N* is the size of the universe (we assume δ to be a small constant and ignore the dependence on δ in this discussion).

The first algorithm for computing *F*_{0} with a constant factor approximation was proposed by Flajolet and Martin, who assumed the existence of hash functions with ideal properties resulting in an algorithm with undesirable space complexity.^{15} In their seminal work, Alon, Matias, and Szegedy designed an *O*(log *N*) space algorithm for *F*_{0} with a constant approximation ratio that employs 2-universal hash functions.^{1} Subsequent investigations into hashing-based schemes by Gibbons and Tirthapura^{16} and Bar-Yossef et al.^{3} provided (ε, δ)-approximation algorithms with space and time complexity *Õ*(1/ε^{2} · log *N*). Later, Bar-Yossef et al. proposed *three algorithms* with improved space and time complexity.^{2} While the three algorithms employ hash functions, they differ conceptually in the usage of relevant random variables for the estimation of *F*_{0}. This line of work resulted in the development of an algorithm with optimal space complexity *O*(1/ε^{2} + log *N*) and *O*(log *N*) update time to estimate the *F*_{0} of a stream.^{20}

The aforementioned works are in the setting where each data item *a*_{i} is an element of the universe. Subsequently, there has been a series of results on estimating *F*_{0} in richer scenarios, with a particular focus on handling the case *a*_{i} ⊆ {1, 2, ···, *N*}, such as a list or a multidimensional range.^{3, 26, 31}

**1.3. The road to a unifying framework**

As mentioned above, the algorithmic developments for model counting and *F*_{0} estimation have largely relied on the usage of hashing-based techniques, and yet these developments have, surprisingly, been separate, and rarely has a work from one community been cited by the other. In this context, we ask: is it possible to bridge this gap, and would such an exercise contribute new algorithms for model counting as well as for *F*_{0} estimation? The main conceptual contribution of this work is an affirmative answer to both questions. First, we point out that two well-known algorithms, Stockmeyer’s #CNF algorithm^{30} as refined by Chakraborty et al.^{5} and Gibbons and Tirthapura’s *F*_{0} estimation algorithm,^{16} are essentially the same.

The core idea of the hashing-based technique of Stockmeyer’s and Chakraborty et al.’s scheme is to use pairwise-independent hash functions to partition the solution space (the satisfying assignments of a CNF formula) into *roughly equal and small* cells, wherein a cell is *small* if the number of solutions in it is no larger than a pre-computed threshold, denoted by Thresh. A good estimate for the number of solutions is then the *number of solutions in an arbitrary cell* × *the number of cells*. To determine the appropriate number of cells, the solution space is iteratively partitioned as follows. At the *m*^{th} iteration, a hash function with range {0, 1}^{m} is considered, resulting in cells *h*^{−1}(*y*) for each *y* ∈ {0, 1}^{m}. An NP oracle can be employed to check whether a particular cell (for example, *h*^{−1}(0^{m})) is small by enumerating its solutions one by one until either Thresh + 1 solutions have been obtained or all solutions have been exhaustively enumerated. If the cell *h*^{−1}(0^{m}) is small, then the algorithm outputs *t* × 2^{m} as an estimate, where *t* is the number of solutions in the cell *h*^{−1}(0^{m}). If the cell *h*^{−1}(0^{m}) is not small, then the algorithm moves on to the next iteration, where a hash function with range {0, 1}^{m+1} is considered.
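The iterative partitioning scheme above can be sketched as follows. This is an illustrative toy, not the authors' pseudocode: it works over an *explicit* set of solutions with a random GF(2) affine hash, whereas a real #CNF implementation would enumerate each cell with an NP oracle (a SAT solver); the function names are our own.

```python
import random

def hash_bits(x, rows, b):
    # h(x) = Ax + b over GF(2); rows[i] is row i of A as an n-bit mask.
    return [(bin(r & x).count("1") + bi) % 2 for r, bi in zip(rows, b)]

def hash_partition_count(solutions, n, thresh):
    """Iteratively refine a random partition of an explicit solution set
    (integers in [0, 2^n)) until the cell h^{-1}(0^m) is small, then
    return t * 2^m as the estimate."""
    rows = [random.getrandbits(n) for _ in range(n)]
    b = [random.getrandbits(1) for _ in range(n)]
    for m in range(n + 1):
        cell = [x for x in solutions
                if all(v == 0 for v in hash_bits(x, rows[:m], b[:m]))]
        if len(cell) <= thresh:  # cell is "small": t solutions, 2^m cells
            return len(cell) * (2 ** m)
    return len(solutions)  # fallback; reached only with tiny probability
```

Note that when Thresh exceeds the number of solutions, the very first iteration (*m* = 0, a single cell) already returns the exact count.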

We now describe Gibbons and Tirthapura’s algorithm for *F*_{0} estimation which we call the Bucketing algorithm. Without loss of generality, we assume that *N* is a power of two and thus identify [*N*] with {0, 1}^{n}. The algorithm maintains a bucket of size Thresh and starts by picking a hash function *h* : {0, 1}^{n} → {0, 1}^{n}. It iterates over sampling levels. At level *m*, when a data item *x* comes, if *h*(*x*) starts with 0^{m}, then *x* is added to the bucket. If the bucket overflows, then the sampling level is increased to *m* + 1 and all elements *x* in the bucket other than the ones with *h*(*x*) = 0^{m+1} are deleted. At the end of the stream, the value *t* × 2^{m} is output as the estimate where *t* is the number of elements in the bucket and *m* is the sampling level.
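The Bucketing algorithm just described can be sketched as a small streaming class. This is a hedged single-estimator illustration (the full algorithm takes a median over *O*(log 1/δ) independent copies); the class and method names are our own, and the hash is an assumed GF(2) affine map standing in for the paper's hash family.

```python
import random

class BucketingF0:
    """One copy of the bucketing estimator: keep stream elements whose
    hashed value starts with 0^m, raising the sampling level m whenever
    the bucket of size `thresh` overflows."""
    def __init__(self, n, thresh):
        self.n, self.thresh, self.m = n, thresh, 0
        self.rows = [random.getrandbits(n) for _ in range(n)]
        self.b = [random.getrandbits(1) for _ in range(n)]
        self.bucket = set()

    def _prefix_zero(self, x, m):
        # Do the first m bits of h(x) = Ax + b equal 0^m?
        return all((bin(self.rows[i] & x).count("1") + self.b[i]) % 2 == 0
                   for i in range(m))

    def process(self, x):
        if self._prefix_zero(x, self.m):
            self.bucket.add(x)
            while len(self.bucket) > self.thresh:  # overflow: level up
                self.m += 1
                self.bucket = {y for y in self.bucket
                               if self._prefix_zero(y, self.m)}

    def estimate(self):
        return len(self.bucket) * (2 ** self.m)
```

When the bucket never overflows (Thresh ≥ *F*_{0}), the sampling level stays at 0 and the estimate is exact.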

These two algorithms are conceptually the same. In the Bucketing algorithm, at sampling level *m*, only the first *m* bits of the hashed value are examined; this is equivalent to considering a hash function with range {0, 1}^{m}. Thus the bucket is nothing but all the elements in the stream that belong to the cell *h*^{−1}(0^{m}). The final estimate is the number of elements in the bucket times the number of cells, identical to Chakraborty et al.’s algorithm. In both algorithms, to obtain an (ε, δ)-approximation, the Thresh value is chosen as *O*(1/ε^{2}). To reduce the error probability to δ, the median of *O*(log(1/δ)) independent estimations is output.

Motivated by the conceptual identity between the two algorithms, we further explore the connections between algorithms for model counting and *F*_{0} estimation.

First, we formalize a recipe to transform streaming algorithms for *F*_{0} estimation into algorithms for model counting. Such a transformation yields new (ε, δ)-approximate algorithms for model counting, which are different from currently known algorithms. Our transformation recipe from *F*_{0} estimation to model counting allows us to view the problem of designing distributed #DNF algorithms through the lens of *distributed functional monitoring*, which is well-studied in the data streaming literature.

Building on the connection between model counting and *F*_{0} estimation algorithms, we design new algorithms to estimate *F*_{0} over *structured set streams* where each element of the stream is a (succinct representation of a) subset of the universe. Thus, the stream is *S*_{1}, *S*_{2}, ···, where each *S*_{i} ⊆ [*N*], and the goal is to estimate the *F*_{0} of the stream, that is, the size of ∪_{i} *S*_{i}. In this scenario, the goal is to design algorithms whose per-item time (the time to process each *S*_{i}) is poly-logarithmic in the size of the universe. Structured set streams that are considered in the literature include 1-dimensional (1D) and multidimensional ranges.^{26, 31} Several interesting problems, including the max-dominance norm^{6} and counting triangles in graphs,^{3} can be reduced to computing *F*_{0} over such ranges.

We observe that several structured sets can be represented as small DNF formulae and thus *F*_{0} counting over these structured set data streams can be viewed as a special case of #DNF. Using the hashing-based techniques for #DNF, we obtain a general recipe for a rich class of structured sets that include DNF sets, affine spaces, and multidimensional ranges. Prior work on structured sets had to rely on involved analysis for each of the specific instances, while our work provides a general recipe for both analysis and implementation.

A natural question that arises from the transformation recipe is the relationship between the space complexity of the streaming algorithms and the query complexity of the obtained model counting algorithms. We establish a relationship between these two quantities by showing that the space complexity is an upper bound on the query complexity.

### 2. Notation

We will identify the universe [*N*] with {0, 1}^{n}.

*F*_{0} *Estimation:* A data stream **a** over domain [*N*] can be represented as **a** = *a*_{1}, *a*_{2}, ···, *a*_{m} wherein each item *a*_{i} ∈ [*N*]. Let **a**_{u} = ∪_{i} {*a*_{i}}. The *F*_{0} of the stream **a** is |**a**_{u}|. We are interested in a *probably approximately correct* scheme that returns an (ε, δ)-*estimate c* of *F*_{0}, that is, Pr[*F*_{0}/(1 + ε) ≤ *c* ≤ (1 + ε) · *F*_{0}] ≥ 1 − δ.

*Model Counting:* Let *X* = {*X*_{1}, *X*_{2}, …, *X*_{n}} be a set of Boolean variables. For a Boolean formula ϕ over variables *X*, let Sol(ϕ) denote the set of all satisfying assignments of ϕ. The *propositional model counting problem* is to compute |Sol(ϕ)| for a given formula ϕ. As in the case of *F*_{0}, we are interested in a *probably approximately correct* algorithm that takes as input a formula ϕ, a tolerance ε > 0, and a confidence δ ∈ (0, 1], and returns an (ε, δ)-estimate *c* of |Sol(ϕ)|, that is, Pr[|Sol(ϕ)|/(1 + ε) ≤ *c* ≤ (1 + ε) · |Sol(ϕ)|] ≥ 1 − δ.

*k-wise independent hash functions:* Let *n, m* ∈ ℕ and let *H*(*n, m*) be a family of hash functions mapping {0, 1}^{n} to {0, 1}^{m}; we write *h* ← *H*(*n, m*) for a function drawn uniformly at random from the family.

DEFINITION 1. *A family of hash functions H*(*n, m*) *is k-wise independent, denoted* *H*_{k-wise}(*n, m*), *if* ∀α_{1}, α_{2}, …, α_{k} ∈ {0, 1}^{m}, *for all distinct* *x*_{1}, *x*_{2}, …, *x*_{k} ∈ {0, 1}^{n},

Pr_{h ← H}[*h*(*x*_{1}) = α_{1} ∧ *h*(*x*_{2}) = α_{2} ∧ ··· ∧ *h*(*x*_{k}) = α_{k}] = 2^{−km}.

*Explicit families.* An explicit hash family that we use is *H*_{Toeplitz}(*n, m*), which is known to be 2-wise independent.^{4} It is the family of functions of the form *h*(*x*) = *Ax* + *b*, where *A* is an *m* × *n* Toeplitz matrix over F_{2} and *b* ∈ F_{2}^{m×1}, with F_{2} the finite field of size 2. For a hash function *h* : {0, 1}^{n} → {0, 1}^{m} and ℓ ∈ {1, …, *m*}, *h*_{ℓ} : {0, 1}^{n} → {0, 1}^{ℓ} is the function where *h*_{ℓ}(*y*) is the first ℓ bits of *h*(*y*).

### 3. From Streaming to Counting

As a first step, we present a unified view of the three hashing-based algorithms proposed in Bar-Yossef et al.^{2} Their first algorithm, the Bucketing algorithm discussed above, is a modification of an *F*_{0} estimation algorithm due to Gibbons and Tirthapura.^{16} The second algorithm, which we call Minimum, is based on the idea that if we hash all the items of the stream, then the *O*(1/ε^{2})-th minimum of the hash values can be used to compute a good estimate of *F*_{0}. The third algorithm, which we call Estimation, chooses a set of *k* functions {*h*_{1}, *h*_{2}, …, *h*_{k}}, such that each *h*_{j} is picked randomly from an *O*(log(1/ε))-independent hash family. For each hash function *h*_{j}, we say that *h*_{j} is not *lonely* if there exists *a*_{i} ∈ **a** such that *h*_{j}(*a*_{i}) = 0. One can then estimate *F*_{0} of **a** by estimating the number of hash functions that are not lonely.

Algorithm 1, called ComputeF0, presents the overarching architecture of the three proposed algorithms. The architecture of ComputeF0 is fairly simple: it chooses a collection of hash functions using ChooseHashFunctions, calls the subroutine ProcessUpdate for every incoming element of the stream and invokes ComputeEst at the end of the stream to return the *F*_{0} approximation.

ChooseHashFunctions. As shown in Algorithm 2, the hash functions depend on the strategy being implemented. The subroutine PickHashFunctions(*H*, *t*) returns a collection of *t* independently chosen hash functions from the family *H*. We use *H* to denote the collection of hash functions returned; this collection is viewed as either a 1D array or as a 2-dimensional (2D) array. When *H* is a 1D array, *H*[*i*] denotes the *i*th hash function of the collection, and when *H* is a 2D array, *H*[*i*][*j*] is the [*i, j*]th hash function.

ProcessUpdate. For a new item *x*, the update of *S*, as shown in Algorithm 3 is as follows:

- Bucketing: For a new item *x*, if *H*[*i*]_{m_i}(*x*) = 0^{m_i}, then we add *x* to *S*[*i*] if it is not already present. If the size of *S*[*i*] exceeds Thresh (which is set to *O*(1/ε^{2})), then we increment *m*_{i} as in line 8.
- Minimum: For a new item *x*, if *H*[*i*](*x*) is smaller than max *S*[*i*], then we replace max *S*[*i*] with *H*[*i*](*x*).
- Estimation: For a new item *x*, compute *z* = TrailZero(*H*[*i, j*](*x*)), that is, the number of trailing zeros in *H*[*i, j*](*x*), and replace *S*[*i, j*] with *z* if *z* is larger than *S*[*i, j*].
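The Minimum case above can be sketched as a per-item update. This is a hedged illustration with assumed names: `h` is a stand-in callable for *H*[*i*] (in the paper it is drawn from *H*_{Toeplitz}(*n*, 3*n*)), and we keep the `thresh` smallest *distinct* hash values seen so far.

```python
class MinimumSketch:
    """One row S[i] of the Minimum strategy's sketch: a bounded set of
    the thresh smallest hash values observed in the stream."""
    def __init__(self, h, thresh):
        self.h, self.thresh, self.vals = h, thresh, []

    def process(self, x):
        v = self.h(x)
        if v in self.vals:          # sketch stores distinct values
            return
        if len(self.vals) < self.thresh:
            self.vals.append(v)
        else:
            j = max(range(self.thresh), key=lambda i: self.vals[i])
            if self.vals[j] > v:    # evict current maximum
                self.vals[j] = v
```

A production version would keep `vals` in a max-heap for *O*(log Thresh) updates; the linear scan here mirrors the pseudocode's arg max step.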

ComputeEst. Finally, for each of the algorithms, we estimate *F*_{0} based on the sketch *S* as described in the sub-routine ComputeEst presented as Algorithm 4. It is crucial to note that the estimation of *F*_{0} is performed solely using the sketch *S* for the Bucketing and Minimum algorithms. The Estimation algorithm requires an additional parameter *r* that depends on a loose estimate of *F*_{0}.

**Algorithm 1** ComputeF0(*n*, ε, δ)

- Thresh ← 96/ε^{2}
- *t* ← 35 log(1/δ)
- *H* ← ChooseHashFunctions(*n*, Thresh, *t*)
- *S* ← {}
- **while** true **do**
- **if** EndStream **then** exit;
- *x* ← *input*()
- ProcessUpdate(*S*, *H*, *x*, Thresh)
- *Est* ← ComputeEst(*S*, Thresh)
- **return** *Est*

*Sketch Properties.* For each of the three algorithms, their corresponding sketches can be viewed as arrays of size 35 log(1/δ). The parameter Thresh is set to 96/ε^{2}.

- Bucketing: The element *S*[*i*] is a tuple 〈*ℓ*_{i}, *m*_{i}〉 where *ℓ*_{i} is a list of size at most Thresh, with *ℓ*_{i} = {*x* ∈ **a** | *H*[*i*]_{m_i}(*x*) = 0^{m_i}}. We use *S*[*i*](0) to denote *ℓ*_{i} and *S*[*i*](1) to denote *m*_{i}.
- Minimum: Each *S*[*i*] holds the Thresh lexicographically smallest elements of {*H*[*i*](*x*) | *x* ∈ **a**}.
- Estimation: Each *S*[*i*] holds a tuple of size Thresh. The *j*th entry of this tuple is the largest number of trailing zeros in any element of *H*[*i, j*](**a**).

**Algorithm 2** ChooseHashFunctions(*n*, Thresh, *t*)

- **switch** AlgorithmType **do**
- **case** AlgorithmType == Bucketing
- *H* ← PickHashFunctions(*H*_{Toeplitz}(*n, n*), *t*)
- **case** AlgorithmType == Minimum
- *H* ← PickHashFunctions(*H*_{Toeplitz}(*n*, 3*n*), *t*)
- **case** AlgorithmType == Estimation
- *s* ← 10 log(1/ε)
- *H* ← PickHashFunctions(*H*_{s-wise}(*n, n*), *t* × Thresh)
- **return** *H*

**Algorithm 3** ProcessUpdate(*S, H, x*, Thresh)

- **for** *i* ∈ [1, |*H*|] **do**
- **switch** AlgorithmType **do**
- **case** Bucketing
- *m*_{i} = *S*[*i*](1)
- **if** *H*[*i*]_{m_i}(*x*) == 0^{m_i} **then**
- *S*[*i*](0) ← *S*[*i*](0) ∪ {*x*}
- **if** size(*S*[*i*](0)) > Thresh **then**
- *S*[*i*](1) ← *S*[*i*](1) + 1
- **for** *y* ∈ *S*[*i*](0) **do**
- **if** *H*[*i*]_{m_i+1}(*y*) ≠ 0^{m_i+1} **then** Remove(*S*[*i*](0), *y*)
- **case** Minimum
- **if** size(*S*[*i*]) < Thresh **then** *S*[*i*].Append(*H*[*i*](*x*))
- **else**
- *j* ← arg max(*S*[*i*])
- **if** *S*[*i*](*j*) > *H*[*i*](*x*) **then** *S*[*i*](*j*) ← *H*[*i*](*x*)
- **case** Estimation
- **for** *j* ∈ [1, Thresh] **do** *S*[*i, j*] ← max(*S*[*i, j*], TrailZero(*H*[*i, j*](*x*)))
- **return** *S*

**3.1. A recipe for transformation**

Observe that for each of the algorithms, the final *F*_{0} estimate depends only on the sketch *S*. Therefore, if two streams **a** and **â** yield the same sketch (under the same chosen hash functions), the three schemes presented above return the same estimates. The recipe for transforming streaming algorithms into model counting algorithms is based on the following insight:

- Capture the relationship *P*(*S*, *H*, **a**_{u}) between the sketch *S*, the set of hash functions *H*, and the set **a**_{u} at the end of the stream.
- View the formula ϕ as a symbolic representation of the unique set **a**_{u} represented by the stream **a**, so that Sol(ϕ) = **a**_{u}.
- Given a formula ϕ and a set of hash functions *H*, design an algorithm to construct a sketch *S* such that the property *P*(*S*, *H*, Sol(ϕ)) holds. Using the sketch *S*, |Sol(ϕ)| can be estimated.

**Algorithm 4** ComputeEst(*S*, Thresh)

- **switch** AlgorithmType **do**
- **case** Bucketing **return** Median({size(*S*[*i*](0)) × 2^{*S*[*i*](1)}}_{i})
- **case** Minimum **return** Median({(Thresh × 2^{3n})/max(*S*[*i*])}_{i})
- **case** Estimation(*r*) **return** Median of the per-*i* estimates computed from the trailing-zero counts *S*[*i*, ·] and the parameter *r*
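For the Bucketing strategy, the final estimation step can be sketched as follows, under the assumption stated in the text that each copy's estimate is (number of elements in the bucket) × 2^{sampling level}; the function name is our own.

```python
import statistics

def bucketing_estimate(sketches):
    """ComputeEst for the Bucketing strategy: each sketch is a
    (bucket, m) pair; the output is the median over all copies of
    len(bucket) * 2^m."""
    return statistics.median(len(bucket) * 2 ** m for bucket, m in sketches)
```

Taking the median over the 35 log(1/δ) independent copies is what drives the failure probability down to δ.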

By applying the above recipe to the three *F*_{0} estimation algorithms, we can derive corresponding model counting algorithms. In particular, applying the above recipe to the Bucketing algorithm leads us to the state-of-the-art hashing-based model counting algorithm, ApproxMC, proposed by Chakraborty et al.^{5} Applying the above recipe to Minimum and Estimation allows us to obtain different model counting schemes. In this extended abstract, we illustrate this transformation for the Minimum-based algorithm.

**3.2. Example application of recipe: Minimum-based algorithm**

We showcase the application of the recipe in the context of the Minimum-based algorithm. For a given multiset **a** (e.g., a data stream or the solutions to a model), we specify the property *P*(*S*, *H*, **a**_{u}) as follows: the sketch *S* is an array of sets indexed by members of *H*, where *S*[*i*] holds the *p* lexicographically minimum elements of *H*[*i*](**a**_{u}) and *p* is *O*(1/ε^{2}); *P*(*S*, *H*, **a**_{u}) is the property that specifies this relationship.

The following lemma, due to Bar-Yossef et al.,^{2} establishes the relationship between the property *P* and the number of distinct elements of a multiset. Let max(*S*_{i}) denote the largest element of the set *S*_{i}.

LEMMA 1. *Let* **a** ⊆ {0, 1}^{n} *be a multiset. Let H* ⊆ *H*_{Toeplitz}(*n*, 3*n*) *where each H*[*i*] *is independently drawn from* *H*_{Toeplitz}(*n*, 3*n*), *and* |*H*| = *O*(log 1/δ). *Let S be such that* *P*(*S*, *H*, **a**_{u}) *holds. Let* Est = Median({*p* · 2^{3n}/max(*S*_{i})}_{i}). *Then*

Pr[|**a**_{u}|/(1 + ε) ≤ Est ≤ (1 + ε) · |**a**_{u}|] ≥ 1 − δ.

Therefore, we can transform the Minimum algorithm for *F*_{0} estimation into a model counting algorithm, given access to a subroutine that can compute *S* such that *P*(*S*, *H*, Sol(ϕ)) holds. The following proposition establishes the existence and complexity of such a subroutine, called FindMin.

PROPOSITION 1. *There is an algorithm* FindMin *that, given ϕ over n variables, h* ∈ *H*_{Toeplitz}(*n, m*), *and p as input, returns a set B* ⊆ *h*(Sol(ϕ)) *so that if* |*h*(Sol(ϕ))| ≤ *p*, *then B* = *h*(Sol(ϕ)); *otherwise B is the p lexicographically minimum elements of h*(Sol(ϕ)). *Moreover, if ϕ is a CNF formula, then* FindMin *makes* *O*(*p* · *m*) *calls to an NP oracle, and if ϕ is a DNF formula with k terms, then* FindMin *takes* *O*(*m*^{3} · *n* · *k* · *p*) *time.*

Equipped with Proposition 1, we are now ready to present the algorithm, called ApproxModelCountMin, for model counting. Since the complexity of FindMin is polynomial when ϕ is in DNF, ApproxModelCountMin is an FPRAS for DNF formulas.

**Algorithm 5** ApproxModelCountMin(*ϕ, ε, δ*)

- *t* ← 35 log(1/δ)
- *H* ← PickHashFunctions(*H*_{Toeplitz}(*n*, 3*n*), *t*)
- *S* ← {}
- Thresh ← 96/ε^{2}
- **for** *i* ∈ [1, *t*] **do**
- *S*[*i*] ← FindMin(ϕ, *H*[*i*], Thresh)
- *Est* ← Median({Thresh · 2^{3n}/max(*S*[*i*])}_{i})
- **return** *Est*

THEOREM 1. *Given ϕ, ε, δ,* ApproxModelCountMin *returns* Est *such that*

Pr[|Sol(ϕ)|/(1 + ε) ≤ Est ≤ (1 + ε) · |Sol(ϕ)|] ≥ 1 − δ.

*If ϕ is a CNF formula, then* ApproxModelCountMin *is a polynomial-time algorithm that makes* *O*(1/ε^{2} · *n* · log(1/δ)) *calls to an NP oracle. If ϕ is a DNF formula, then* ApproxModelCountMin *is an FPRAS.*

We now give a proof of Proposition 1 by describing the subroutine FindMin.

PROOF. We first present the algorithm for the case when ϕ is a DNF formula; adapting the algorithm to the case of CNF can be done using similar ideas. Let ϕ = *T*_{1} ∨ *T*_{2} ∨ ··· ∨ *T*_{k} be a DNF formula over *n* variables where each *T*_{i} is a term. Let *h* : {0, 1}^{n} → {0, 1}^{m} be a linear hash function in *H*_{Toeplitz}(*n, m*) defined by an *m* × *n* binary matrix *A*. Let *C* be the set of hashed values of the satisfying assignments for ϕ: *C* = {*h*(*x*) | *x* ⊨ ϕ} ⊆ {0, 1}^{m}. Let *C*_{p} be the first *p* elements of *C* in lexicographic order. Our goal is to compute *C*_{p}.

We illustrate an algorithm with running time *O*(*m*^{3}*np*) to compute *C*_{p} when the formula is a single term *T*; this algorithm easily generalizes to formulas with *k* terms. Let *T* be a term of width *w* (its number of literals) and *C* = {*Ax* | *x* ⊨ *T*}. By fixing the variables in *T*, we obtain a vector *b*_{T} and an *m* × (*n* − *w*) matrix *A*_{T} so that *C* = {*A*_{T}*x* + *b*_{T} | *x* ∈ {0, 1}^{(n−w)}}. Both *A*_{T} and *b*_{T} can be computed from *A* and *T* in linear time. Let *h*_{T}(*x*) denote the transformation *A*_{T}*x* + *b*_{T}.
We compute *C*_{p} iteratively as follows: assuming we have computed the (*q* − 1)^{th} minimum of *C*, we compute the *q*^{th} minimum using a prefix-search strategy. We use a subroutine for the following basic prefix-search primitive: given any ℓ-bit string *y*_{1} ··· *y*_{ℓ}, is there an *x* ∈ {0, 1}^{n−w} so that *y*_{1} ··· *y*_{ℓ} is a prefix of some string in {*h*_{T}(*x*)}? This task can be performed using Gaussian elimination over an (ℓ + 1) × (*n* − *w*) binary matrix and can be implemented in time *O*(ℓ^{2}(*n* − *w*)).

Let *y* = *y*_{1} ··· *y*_{m} be the (*q* − 1)^{th} minimum in *C*, and let *r* be the position of the rightmost 0 in *y*. Using the above-mentioned procedure, we can find the lexicographically smallest string in the range of *h*_{T} that extends *y*_{1} ··· *y*_{r−1}1, if one exists. If no such string exists in *C*, we find the position of the next 0 in *y* (moving leftward) and repeat the procedure. In this manner, the *q*^{th} minimum can be computed using *O*(*m*) calls to the prefix-search primitive, resulting in an *O*(*m*^{3}*n*) time algorithm. Invoking the above procedure *p* times results in an algorithm to compute *C*_{p} in *O*(*m*^{3}*np*) time.

We now discuss the distributed DNF counting problem.

Consider the problem of *distributed DNF counting.* In this setting, there are *k* sites that can each communicate with a central coordinator. The input DNF formula ϕ is partitioned into *k* DNF subformulas ϕ_{1}, ···, ϕ_{k}, where each ϕ_{i} is a subset of the terms of the original ϕ, with the *j*’th site receiving only ϕ_{j}. The goal is for the coordinator to obtain an (ε, δ)-approximation of the number of solutions to ϕ, while minimizing the total number of bits communicated between the sites and the coordinator. Distributed algorithms for sampling and counting solutions to CSPs have been studied recently in other models of distributed computation.^{11, 12, 13, 14} From a practical perspective, given the centrality of #DNF in the context of probabilistic databases,^{27, 28} a distributed DNF counting algorithm would entail applications in distributed probabilistic databases.

From our perspective, distributed DNF counting falls within the *distributed functional monitoring* framework formalized by Cormode et al.^{7} Here, the input is a stream **a** which is partitioned arbitrarily into sub-streams **a**_{1}, …, **a**_{k} that arrive at each of *k* sites. Each site can communicate with the central coordinator, and the goal is for the coordinator to compute a function of the joint stream **a** while minimizing the total communication. This general framework has several direct applications and has been studied extensively (see Cormode et al.,^{8} Huang et al.,^{18} Woodruff and Zhang,^{33} and the references therein). In the distributed DNF counting problem, each sub-stream **a**_{i} corresponds to the set of satisfying assignments to each subformula ϕ_{i}, while the function to be computed is *F*_{0}.

The algorithms discussed in Section 3 can be extended to the distributed setting. We briefly describe the distributed implementation of the minimum-based algorithm. The coordinator chooses hash functions *H*[1], …, *H*[*t*] from *H*_{Toeplitz}(*n*, 3*n*) and sends them to the *k* sites. Each site runs the FindMin algorithm for each hash function and sends the outputs to the coordinator. Thus, the coordinator receives sets *S*[*i, j*], consisting of the Thresh lexicographically smallest hash values of the solutions to ϕ_{j}. The coordinator then extracts *S*[*i*], the Thresh lexicographically smallest elements of *S*[*i*, 1] ∪ ··· ∪ *S*[*i*, *k*], and proceeds with the rest of the algorithm ApproxModelCountMin. The communication cost is *O*(*kn*/ε^{2} · log(1/δ)) to account for the *k* sites sending the outputs of their FindMin invocations. The time complexity for each site is polynomial in *n*, ε^{−1}, and log(δ^{−1}).
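The coordinator's merge step is simple enough to sketch directly; the function name is our own. The key observation is that the Thresh smallest hash values of the combined formula are always among the union of the per-site Thresh-smallest sets.

```python
def merge_site_sketches(site_sketches, thresh):
    """Coordinator-side merge for one hash function H[i]: each site j
    sends the thresh lexicographically smallest hash values of its
    subformula's solutions; keep the thresh smallest of the union."""
    union = set()
    for sketch in site_sketches:
        union.update(sketch)           # dedupe across sites
    return sorted(union)[:thresh]      # thresh smallest of the union
```

This merged set satisfies the same property *P* with respect to Sol(ϕ) = Sol(ϕ_{1}) ∪ ··· ∪ Sol(ϕ_{k}), so the coordinator can feed it straight into the median estimator.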

A straightforward implementation of the Bucketing algorithm leads to a distributed DNF counting algorithm whose communication cost is *Õ*(*k*(*n* + 1/ε^{2}) · log(1/δ)) and whose time complexity per site is polynomial in *n*, ε^{−1}, and log(δ^{−1}). Similarly, the Estimation-based algorithm leads to a distributed algorithm with *Õ*(*k*(*n* + 1/ε^{2}) · log(1/δ)) communication cost. However, we do not know of a polynomial-time implementation of the Estimation-based algorithm for DNF formulas.

The communication cost for the Bucketing and Estimation-based algorithms is nearly optimal in their dependence on *k* and ε. Woodruff and Zhang^{33} showed that the randomized communication complexity of estimating *F*_{0} up to a 1 + ε factor in the distributed functional monitoring setting is Ω(*k*/ε^{2}). We can reduce the *F*_{0} estimation problem to distributed DNF counting. Namely, if for the *F*_{0} estimation problem, the *j*'th site receives items *a*_{1}, ···, *a*_{m} ∈ [*N*], then for the distributed DNF counting problem, ϕ_{j} is a DNF formula on ⌈log_{2} *N*⌉ variables whose solutions are exactly *a*_{1}, ···, *a*_{m} in their binary encoding. Thus, we immediately get an Ω(*k*/ε^{2}) lower bound for the distributed DNF counting problem. Finding the optimal dependence on *N* for *k* > 1 remains an interesting open question.
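The reduction is mechanical: each item *a* ∈ [*N*] becomes a single DNF term over ⌈log_{2} *N*⌉ variables that is satisfied only by the binary encoding of *a*. A minimal sketch of this encoding (the function names and the integer-bitmask representation of assignments are ours):

```python
def item_to_term(a, nbits):
    """Encode item a as a DNF term: a conjunction of literals, one per bit.
    A literal +v means variable v must be 1; -v means it must be 0.
    Variables are 1-indexed; variable 1 is the least significant bit."""
    return tuple((v if (a >> (v - 1)) & 1 else -v) for v in range(1, nbits + 1))

def satisfies(term, assignment):
    """assignment: integer whose bits give the values of the variables."""
    return all(((assignment >> (abs(lit) - 1)) & 1) == (1 if lit > 0 else 0)
               for lit in term)

nbits = 4
term = item_to_term(10, nbits)  # 10 = 0b1010
# the only satisfying assignment of this single-term DNF is 10 itself
sols = [x for x in range(2 ** nbits) if satisfies(term, x)]
print(sols)  # -> [10]
```

The disjunction of the terms for *a*_{1}, ···, *a*_{m} then has exactly the site's distinct items as its solutions.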

### 4. From Counting to Streaming

In this section, we consider the *structured set streaming model* where each item *S*_{i} of the stream is a succinct representation of a set over the universe *U* = {0, 1}^{n}. Our goal is to design efficient algorithms (both in terms of memory and processing time per item) for computing |∪_{i} *S*_{i}|—the number of distinct elements in the union of all the sets in the stream. We call this problem *F*_{0} computation over structured set streams. We discuss two types of structured sets: *DNF sets* and *affine spaces*. As we mentioned in the introduction, other structured sets studied in the literature are single and multi-dimensional ranges. Our techniques also give algorithms for estimating *F*_{0} of such structured set streams, which we omit in this extended abstract.

A particular representation we are interested in is where each set is presented as the set of satisfying assignments to a DNF formula. Let ϕ be a DNF formula over *n* variables. The DNF set corresponding to ϕ is the set of satisfying assignments of ϕ. The *size* of this representation is the number of terms in the formula ϕ. A stream over DNF sets is a stream of DNF formulas ϕ_{1}, ϕ_{2}, …. Given such a DNF stream, the goal is to estimate |∪_{i} *S*_{i}|, where *S*_{i} is the DNF set represented by ϕ_{i}. This quantity is the same as the number of satisfying assignments of the formula ∨_{i} ϕ_{i}. We show that the algorithms described in the previous section carry over to obtain (ε, δ) estimation algorithms for this problem with space and per-item time poly(1/ε, *n*, *k*, log(1/δ)), where *k* is the size of the formula.

THEOREM 2. *There is a streaming algorithm to compute an* (ε, δ) *approximation of F*_{0} *over DNF sets. This algorithm takes space O*((*n*/ε^{2}) · log(1/δ)) *and processing time O*((*n*^{4} · *k*/ε^{2}) · log(1/δ)) *per item, where k is the size (number of terms) of the corresponding DNF formula.*

PROOF. We show how to adapt the minimum-value based algorithm from Section 3.2 to this setting. The algorithm picks a hash function *h* ∈ *H*_{Toeplitz}(*n*, 3*n*) and maintains the set *B* consisting of the *t* lexicographically minimum elements of the set {*h*(Sol(ϕ_{1} ∨ ··· ∨ ϕ_{i−1}))} after processing *i* − 1 items. When ϕ_{i} arrives, it computes the set *B*′ consisting of the *t* lexicographically minimum values of the set {*h*(Sol(ϕ_{i}))} and subsequently updates *B* by computing the *t* lexicographically smallest elements from *B* ∪ *B*′. By Proposition 1, the computation of *B*′ can be done in time *O*(*n*^{4} · *k* · *t*), where *k* is the number of terms in ϕ_{i}. Updating *B* can be done in *O*(*t* · *n*) time. Thus, the update time for item ϕ_{i} is *O*(*n*^{4} · *k* · *t*). For obtaining an (ε, δ)-approximation, we set *t* = *O*(1/ε^{2}), repeat the procedure *O*(log(1/δ)) times, and take the median value. Thus the update time for the item ϕ_{i} is *O*((*n*^{4} · *k*/ε^{2}) · log(1/δ)). For analyzing space, each hash function uses *O*(*n*) bits and the algorithm stores *t* = *O*(1/ε^{2}) minimums per repetition, resulting in overall space usage of *O*((*n*/ε^{2}) · log(1/δ)). The proof of correctness follows from Lemma 1.

Instead of the Minimum-value based algorithm, we could also adapt the Bucketing-based algorithm to obtain an algorithm with similar space and time complexities.
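The per-item update in the proof of Theorem 2 can be sketched as follows. This is a schematic sketch: the class and method names are ours, and `find_min`, which stands in for the FindMin enumeration over *h*(Sol(ϕ_{i})) from Proposition 1, is abstracted as a user-supplied callable.

```python
import heapq

class F0DNFSketch:
    """Maintain the t lexicographically smallest hash values seen so far
    over a stream of DNF formulas (minimum-value based algorithm).
    `find_min(phi, t)` must return the <= t smallest values of h(Sol(phi));
    here we abstract it as a callable supplied by the user."""

    def __init__(self, t, find_min):
        self.t = t
        self.find_min = find_min
        self.B = []  # current t smallest hash values, sorted

    def process(self, phi):
        # B' = t smallest hash values of Sol(phi); then keep t smallest of B u B'
        B_prime = self.find_min(phi, self.t)
        self.B = heapq.nsmallest(self.t, set(self.B) | set(B_prime))

    def estimate(self, hash_range_size):
        """t-th minimum estimator: roughly t * |range| / (t-th smallest value)."""
        if len(self.B) < self.t:
            return len(self.B)  # fewer than t distinct hash values seen: exact
        return self.t * hash_range_size // max(self.B)
```

As in the proof, one would run *O*(log(1/δ)) independent copies with fresh hash functions and take the median of the estimates.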

Another example of a structured stream is where each item of the stream is an affine space represented by *Ax* = *B*, where *A* is a Boolean matrix and *B* is a zero-one vector. Without loss of generality, we may assume that *A* is an *n* × *n* matrix. Thus an affine stream consists of 〈*A*_{1}, *B*_{1}〉, 〈*A*_{2}, *B*_{2}〉, ···, where each 〈*A*_{i}, *B*_{i}〉 succinctly represents the set {*x* ∈ {0, 1}^{n} | *A*_{i}*x* = *B*_{i}}. Here the operations are over the finite field of size 2. For an *n* × *n* Boolean matrix *A* and a zero-one vector *B*, let Sol(〈*A*, *B*〉) denote the set of all *x* that satisfy *Ax* = *B*.

PROPOSITION 2. *Given* 〈*A*, *B*〉, *h* ∈ *H*_{Toeplitz}(*n*, 3*n*), *and t as input, there is an algorithm*, AffineFindMin, *that returns a set S* ⊆ *h*(Sol(〈*A*, *B*〉)) *so that if* |*h*(Sol(〈*A*, *B*〉))| ≤ *t*, *then S* = *h*(Sol(〈*A*, *B*〉)); *otherwise S is the t lexicographically minimum elements of h*(Sol(〈*A*, *B*〉)). *The time taken by this algorithm is O*(*n*^{4}*t*) *and the space taken by the algorithm is O*(*tn*).
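The core subroutine behind such a procedure is Gaussian elimination over GF(2) followed by enumeration of the solution space. Below is a minimal, unoptimized sketch of that subroutine (function names are ours); it is not the *O*(*n*^{4}*t*) algorithm of the proposition, which additionally interleaves the hash function with a pruned lexicographic search rather than enumerating all solutions.

```python
from itertools import product

def solve_gf2(A, b):
    """Row-reduce the augmented system [A | b] over GF(2). Returns
    (pivot column -> row index, list of free columns, reduced rows),
    or None if the system is inconsistent. Rows are lists of 0/1."""
    rows = [list(r) + [bb] for r, bb in zip(A, b)]  # augmented matrix
    n = len(A[0])
    pivots, r = {}, 0
    for c in range(n):
        pr = next((i for i in range(r, len(rows)) if rows[i][c]), None)
        if pr is None:
            continue
        rows[r], rows[pr] = rows[pr], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][c]:
                rows[i] = [x ^ y for x, y in zip(rows[i], rows[r])]
        pivots[c] = r
        r += 1
    # inconsistent if a zero row has right-hand side 1
    if any(all(v == 0 for v in row[:-1]) and row[-1] for row in rows):
        return None
    return pivots, [c for c in range(n) if c not in pivots], rows

def enumerate_solutions(A, b):
    """Yield all x in {0,1}^n with Ax = b (exponential in #free variables)."""
    res = solve_gf2(A, b)
    if res is None:
        return
    pivots, free, rows = res
    for bits in product([0, 1], repeat=len(free)):
        x = [0] * len(A[0])
        for c, v in zip(free, bits):
            x[c] = v
        for c, ri in pivots.items():
            # pivot variable = rhs XOR contribution of the free variables
            x[c] = rows[ri][-1] ^ sum(rows[ri][f] * x[f] for f in free) % 2
        yield tuple(x)
```

AffineFindMin avoids the exponential enumeration by fixing the output bits of *h* one at a time and checking consistency of the combined linear system, which keeps the running time polynomial.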

The above proposition together with the minimum-based algorithm gives the following theorem.

THEOREM 3. *There is a streaming algorithm that computes an* (ε, δ)-*approximation of F*_{0} *over affine spaces. This algorithm takes space O*((*n*/ε^{2}) · log(1/δ)) *and processing time O*((*n*^{4}/ε^{2}) · log(1/δ)) *per item.*

### 5. Relating Sketch Space Complexity and NP Query Complexity

Our investigations reveal surprising connections between algorithms for *F*_{0} estimation and model counting that are of interest to two different communities. It is noteworthy that the two communities often have different concerns: in the context of model counting, one is focused on the NP-query complexity while in the context of streaming, the focus is on the space complexity. This begs the question of whether the connections are a matter of happenstance or there is an inherent relationship between the space complexity in the context of streaming and the query complexity for model counting. We detail our investigations on the existence of such a relationship.

In the following, we will fold the hash function *h* into the sketch *S*. With this simplification, instead of writing *P*(*S*, *h*, Sol(ϕ)) we write *P*(*S*, Sol(ϕ)).

We first introduce some complexity-theoretic notation. For a complexity class *C*, a language *L* belongs to the complexity class ∃·*C* if there is a polynomial *q*(·) and a language *L*′ ∈ *C* such that for every *x*, *x* ∈ *L* ⇔ ∃*y*, |*y*| ≤ *q*(|*x*|), 〈*x*, *y*〉 ∈ *L*′.

Consider a streaming algorithm for *F*_{0} that constructs a sketch *S* such that *P*(*S*, **a**_{u}) holds for some property *P*, using which we can estimate |**a**_{u}|, where the size of *S* is polylogarithmic in the size of the universe and polynomial in 1/ε. Now consider the following *sketch language*: *L*_{sketch} = {〈ϕ, *S*〉 | *P*(*S*, Sol(ϕ)) holds}.

THEOREM 4. *If L*_{sketch} *belongs to the complexity class C, then there exists an* FP^{∃·*C*} *model counting algorithm that estimates the number of satisfying assignments of a given formula* ϕ. *The number of queries made by the algorithm is bounded by the sketch size.*

Let us apply the above theorem to the minimum-based algorithm. The sketch language consists of tuples of the form 〈ϕ, 〈*h*, *v*_{1}, ···, *v*_{t}〉〉 where {*v*_{1}, ···, *v*_{t}} is the set of *t* lexicographically smallest elements of the set *h*(Sol(ϕ)). It can be seen that this language is in coNP. Since ∃·coNP is the same as the class Σ^{p}_{2}, we obtain an FP^{Σ^{p}_{2}} algorithm. Since *t* = *O*(1/ε^{2}) and *h* maps from *n*-bit strings to 3*n*-bit strings, it follows that the size of the sketch is *O*(*n*/ε^{2}). Thus the number of queries made by the algorithm is *O*(*n*/ε^{2}).

Interestingly, all the model counting algorithms that were obtained following our recipe are probabilistic polynomial-time algorithms that make queries to languages in NP. The above generic transformation gives a deterministic polynomial-time algorithm that makes queries to a Σ^{p}_{2} language. Precisely characterizing the properties of the sketch that lead to probabilistic algorithms making only NP queries is an interesting direction to explore.

### 6. Conclusion and Future Outlook

Our investigation led to a diverse set of results that unify over two decades of work in model counting and *F*_{0} estimation. The viewpoint presented in this work has the potential to spur several new interesting research directions.

**Higher Moments.** There has been a long line of work on the estimation of higher moments, that is, *F*_{k}, over data streams. A natural direction of future research is to adapt the notion of *F*_{k} to the context of model counting and explore its applications. We expect extensions of the framework and recipe presented in this work to derive algorithms for higher frequency moments in the context of model counting.

**Sparse XORs.** In the context of model counting, the performance of underlying SAT solvers strongly depends on the size of XORs. The standard constructions lead to XORs of size Θ(*n*), and an interesting line of research has focused on the design of sparse XOR-based hash functions,^{10,17,19} culminating in showing that one can use hash functions of the form *h*(*x*) = *Ax* + *b* wherein each entry of the *m*-th row of *A* is 1 with probability *O*(log *m*/*m*).^{23} Such XORs were shown to improve runtime efficiency. In this context, a natural direction would be to explore the usage of sparse XORs in the context of *F*_{0} estimation.
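To make the shape of such hash functions concrete, here is a sketch of sampling a sparse *h*(*x*) = *Ax* + *b* over GF(2). The specific density schedule `p_i` below is an illustrative stand-in in the spirit of the *O*(log *m*/*m*) row densities, not the exact construction from the cited works; the function names and the bitmask encoding of rows are ours.

```python
import math
import random

def sample_sparse_xor_hash(n, m, c=1.0, rng=random):
    """Sample a sparse XOR hash h(x) = Ax + b over GF(2): each entry of
    row i of the (m x n) matrix A is 1 with probability
    p_i = min(1/2, c * log2(i+2)/(i+2)), an illustrative stand-in for
    sparse-XOR densities. Rows are returned as integer bitmasks."""
    A, b = [], []
    for i in range(m):
        p = min(0.5, c * math.log2(i + 2) / (i + 2))
        A.append(sum(1 << j for j in range(n) if rng.random() < p))
        b.append(rng.randint(0, 1))
    return A, b

def apply_hash(A, b, x):
    """Evaluate h(x) = Ax + b over GF(2); x is an n-bit integer."""
    return [bin(row & x).count("1") % 2 ^ bb for row, bb in zip(A, b)]
```

Since *A*·0 = 0, h(0) always equals *b*; and because the map is affine over GF(2), XOR-ing two hash outputs cancels *b* and yields *A*(*x* ⊕ *y*).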

### Acknowledgments

We thank the anonymous reviewers of PODS 21 for their valuable comments. We are grateful to Phokion Kolaitis for suggesting exploration beyond the transformation recipe that led to results in Section 5. We thank Wim Martens for providing valuable suggestions on an earlier version of the manuscript. Bhattacharyya was supported in part by the NRF Fellowship Programme [NRF-NRFFAI1-2019-0002] and an Amazon Research Award. Meel was supported in part by the NRF Fellowship Programme [NRF-NRFFAI1-2019-0004], the AI Singapore Programme [AISG-RP-2018-005], and NUS ODPRT Grant [R-252-000-685-13]. Vinod was supported in part by NSF CCF-2130608, NSF CCF-184908, and NSF HDR:TRIPODS-1934884 awards. Pavan was supported in part by NSF CCF-2130536, NSF CCF-1849053, and NSF HDR:TRIPODS-1934884 awards.
