
Model Counting Meets Distinct Elements


Abstract

Constraint Satisfaction Problems (CSPs) and data stream models are two powerful abstractions to capture a wide variety of problems arising in different domains of computer science. Developments in the two communities have mostly occurred independently and with little interaction between them. In this work, we seek to investigate whether bridging the seeming communication gap between the two communities may pave the way to richer fundamental insights. To this end, we focus on two foundational problems: model counting for CSPs and the computation of the number of distinct elements in a data stream, also known as the zeroth frequency moment (F0) of a data stream.

Our investigations lead us to observe striking similarity in the core techniques employed in the algorithmic frameworks that have evolved separately for model counting and distinct elements computation. We design a recipe for the translation of algorithms developed for distinct elements estimation to that of model counting, resulting in new algorithms for model counting. We then observe that algorithms in the context of distributed streaming can be transformed into distributed algorithms for model counting. We next turn our attention to viewing streaming from the lens of counting and show that framing distinct elements estimation as a special case of #DNF counting allows us to obtain a general recipe for a rich class of streaming problems, which had been subjected to case-specific analysis in prior works.


1. Introduction

Constraint Satisfaction Problems (CSPs) and the data stream model are two core themes in computer science with a diverse set of applications in topics including probabilistic reasoning, networks, databases, and verification. Model counting and computation of the zeroth frequency moment (F0) are fundamental problems for CSPs and data streams, respectively. This paper is motivated by our observation that despite the usage of similar algorithmic techniques for the two problems, the developments in the two communities have, surprisingly, evolved separately, and rarely has a paper from one community been cited by the other.

Given a set of constraints ϕ over a set of variables in a finite domain D, the problem of model counting is to estimate the number of solutions of ϕ. We are often interested when ϕ is restricted to a special class of representations such as Conjunctive Normal Form (CNF) and Disjunctive Normal Form (DNF). A data stream over a domain [N] is represented by a = 〈a1, a2, ···, am〉 where each item ai is a subset of [N]. The zeroth frequency moment, denoted as F0, of a is the number of distinct domain elements appearing in a, that is, |∪i ai| (traditionally, the ai's are singletons; we will also be interested in the case when the ai's are sets). The fundamental nature of model counting and F0 computation over data streams has led to intense interest from theoreticians and practitioners alike in the respective communities for the past few decades.
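For example, the stream a = 〈{1}, {3}, {1}, {2, 3}〉 over [4] has F0 = |{1, 2, 3}| = 3, while the Boolean formula ϕ = x1 ∨ x2 has |Sol(ϕ)| = 3 over the domain {0, 1}2.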

The starting point of this work is the confluence of two viewpoints. The first viewpoint contends that some of the algorithms for model counting can conceptually be thought of as operating on the stream of the solutions of the constraints. The second viewpoint contends that a stream can be viewed as a DNF formula, and the problem of F0 estimation is similar to model counting. These viewpoints make it natural to believe that algorithms developed in the streaming setting can be directly applied to model counting, and vice versa. We explore this connection and indeed design new algorithms for model counting inspired by algorithms for estimating F0 in data streams. By exploring this connection further, we design new algorithms to estimate F0 for streaming sets that are succinctly represented by constraints. It is worth noting that the two communities focus on seemingly different efficiency objectives: in streaming algorithms, space complexity is of major concern, while in the context of model counting, time (especially NP query complexity in the context of CNF formulas) is of primary concern. It is therefore striking that our transformation recipe leads to efficient algorithms for both F0 estimation and model counting, where efficiency is measured by the yardstick of the corresponding community. We further investigate this observation and demonstrate that the space complexity of streaming algorithms provides an upper bound on the query complexity of model counting algorithms.

To put our contributions in context, we briefly survey the historical development of algorithmic frameworks in both model counting and F0 estimation and point out the similarities.

*  1.1. Model counting

The complexity-theoretic study of model counting was initiated by Valiant, who showed that this problem, in general, is #P-complete.32 This motivated researchers to investigate approximate model counting and, in particular, to design (ε, δ)-approximation schemes. The complexity of approximate model counting depends on the representation. When ϕ is represented as a CNF formula, designing an efficient (ε, δ)-approximation is NP-hard.30 In contrast, when it is represented as a DNF formula, model counting admits a fully polynomial-time randomized approximation scheme (FPRAS).21, 22 We will use #CNF to refer to the case when ϕ is a CNF formula and #DNF to refer to the case when ϕ is a DNF formula.

For #CNF, Stockmeyer30 provided a hashing-based randomized procedure that can compute an (ε, δ)-approximation with running time poly(|ϕ|, 1/ε, 1/δ), given access to an NP oracle. Building on Stockmeyer's approach and motivated by the unprecedented breakthroughs in the design of SAT solvers, researchers have proposed a series of algorithmic improvements that have allowed the hashing-based techniques for approximate model counting to scale to formulas involving hundreds of thousands of variables. Practical implementations substitute the NP oracle with a SAT solver. In the context of model counting, we are primarily interested in time complexity and, therefore, the number of NP queries is of key importance. The emphasis on the number of NP calls also stems from practice, as practical implementations of model counting algorithms have been shown to spend over 99% of their time in the underlying SAT calls.29

Karp and Luby21 proposed the first FPRAS for #DNF, which was improved in subsequent works.9, 22 Chakraborty et al.5 demonstrated that the hashing-based framework can be extended to #DNF, thereby providing a unified framework for both #CNF and #DNF. Meel et al.24, 25 subsequently improved the complexity of the hashing-based approach for #DNF and observed that hashing-based techniques achieve better scalability than Monte Carlo techniques.

*  1.2. Zeroth frequency moment estimation

Estimating (ε, δ)-approximations of the kth frequency moments (Fk) of a stream is a central problem in the data stream model.1 In particular, considerable work has been done in designing algorithms for estimating the 0th frequency moment (F0), the number of distinct elements in the stream. For streaming algorithms, the primary resource concerns are space complexity and processing time per element. In general, for a streaming algorithm to be considered efficient, these should be poly(log N, 1/ε) where N is the size of the universe (we assume δ to be a small constant and ignore O(log(1/δ)) factors in this discussion).

The first algorithm for computing F0 with a constant factor approximation was proposed by Flajolet and Martin, who assumed the existence of hash functions with ideal properties, resulting in an algorithm with undesirable space complexity.15 In their seminal work, Alon, Matias, and Szegedy designed an O(log N) space algorithm for F0 with a constant approximation ratio that employs 2-universal hash functions.1 Subsequent investigations into hashing-based schemes by Gibbons and Tirthapura16 and Bar-Yossef et al.3 provided (ε, δ)-approximation algorithms with space and time complexity O((1/ε2) · log N). Later, Bar-Yossef et al. proposed three algorithms with improved space and time complexity.2 While the three algorithms employ hash functions, they differ conceptually in the usage of relevant random variables for the estimation of F0. This line of work resulted in the development of an algorithm with optimal space complexity O(1/ε2 + log N) and O(log N) update time to estimate the F0 of a stream.20

The aforementioned works are in the setting where each data item ai is an element of the universe. Subsequently, there has been a series of results on estimating F0 in richer scenarios, with a particular focus on handling the case where each ai ⊆ {1, 2, ···, N} is a structured set such as a list or a multidimensional range.3,26,31

*  1.3. The road to a unifying framework

As mentioned above, the algorithmic developments for model counting and F0 estimation have largely relied on hashing-based techniques, and yet these developments have, surprisingly, been separate; rarely has a work from one community been cited by the other. In this context, it is natural to ask whether this gap can be bridged and whether such an exercise would contribute new algorithms for model counting as well as for F0 estimation. The main conceptual contribution of this work is an affirmative answer to the above question. First, we point out that two well-known algorithms, Stockmeyer's #CNF algorithm30 (later refined by Chakraborty et al.5) and Gibbons and Tirthapura's F0 estimation algorithm,16 are essentially the same.

The core idea of the hashing-based technique of Stockmeyer's and Chakraborty et al.'s schemes is to use pairwise independent hash functions to partition the solution space (satisfying assignments of a CNF formula) into roughly equal and small cells, wherein a cell is small if the number of solutions in it is less than a pre-computed threshold, denoted by Thresh. A good estimate for the number of solutions is then the number of solutions in an arbitrary cell × the number of cells. To determine the appropriate number of cells, the solution space is iteratively partitioned as follows. At the mth iteration, a hash function with range {0, 1}m is considered, resulting in cells h−1(y) for each y ∈ {0, 1}m. An NP oracle can be employed to check whether a particular cell (for example, h−1(0m)) is small by enumerating solutions one by one until we have either obtained Thresh + 1 solutions or we have exhaustively enumerated all the solutions. If the cell h−1(0m) is small, then the algorithm outputs t × 2m as an estimate, where t is the number of solutions in the cell h−1(0m). If the cell h−1(0m) is not small, then the algorithm moves on to the next iteration, where a hash function with range {0, 1}m+1 is considered.
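The following toy Python sketch (our illustration, not the authors' implementation) mimics this loop: explicit enumeration over a given solution set stands in for the NP oracle, a lazily sampled random function stands in for the pairwise-independent family, and "the first m bits are zero" is realized on the low-order bits of integer hash values.

import random

def bounded_count(solutions, h, m, thresh):
    # Count solutions in the cell h^{-1}(0^m), giving up once thresh is exceeded;
    # the real algorithm performs this enumeration with an NP oracle (a SAT solver).
    count = 0
    for x in solutions:
        if (h(x) & ((1 << m) - 1)) == 0:  # convention: "first m bits" = m low-order bits
            count += 1
            if count > thresh:
                return count
    return count

def hashing_based_count(solutions, n, thresh):
    # Iteratively halve the solution space until the chosen cell is small,
    # then return (solutions in the cell) x (number of cells).
    table = {}
    def h(x):  # lazily sampled random function standing in for a 2-universal family
        return table.setdefault(x, random.getrandbits(n))
    for m in range(n + 1):
        t = bounded_count(solutions, h, m, thresh)
        if t <= thresh:
            return t * (2 ** m)
    return thresh * (2 ** n)  # practically unreachable: the cell at m = n is almost surely small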

We now describe Gibbons and Tirthapura's algorithm for F0 estimation, which we call the Bucketing algorithm. Without loss of generality, we assume that N is a power of two and thus identify [N] with {0, 1}n. The algorithm maintains a bucket of size Thresh and starts by picking a hash function h : {0, 1}n → {0, 1}n. It iterates over sampling levels. At level m, when a data item x arrives, if h(x) starts with 0m, then x is added to the bucket. If the bucket overflows, then the sampling level is increased to m + 1 and all elements x in the bucket other than the ones whose h(x) starts with 0m+1 are deleted. At the end of the stream, the value t × 2m is output as the estimate, where t is the number of elements in the bucket and m is the final sampling level.
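A minimal single-estimator sketch of this bucketing process (again our illustration; a lazily sampled fully random function stands in for the pairwise-independent family, and "starts with 0m" is the low-order-bits convention):

import random

def bucketing_f0(stream, n, thresh):
    table = {}
    def h(x):  # stand-in for a pairwise-independent hash h: {0,1}^n -> {0,1}^n
        return table.setdefault(x, random.getrandbits(n))
    m, bucket = 0, set()
    for x in stream:
        if (h(x) & ((1 << m) - 1)) == 0:          # h(x) "starts with" 0^m
            bucket.add(x)
            while len(bucket) > thresh:           # overflow: raise the sampling level
                m += 1
                bucket = {y for y in bucket if (h(y) & ((1 << m) - 1)) == 0}
    return len(bucket) * (2 ** m)                 # estimate: t x 2^m

# e.g., bucketing_f0(iter([3, 7, 3, 1, 9, 7]), n=16, thresh=4) returns exactly 4,
# since the four distinct items never overflow the bucket and m stays 0.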

These two algorithms are conceptually the same. In the Bucketing algorithm, at sampling level m, only the first m bits of the hashed value are examined; this is equivalent to considering a hash function with range {0, 1}m. Thus the bucket is nothing but all the elements in the stream that belong to the cell h−1(0m). The final estimate is the number of elements in the bucket times the number of cells, identical to Chakraborty et al.'s algorithm. In both algorithms, to obtain an (ε, δ)-approximation, the Thresh value is chosen as O(1/ε2). To reduce the error probability to δ, the median of O(log(1/δ)) independent estimations is output.

*  1.4. Our contributions

Motivated by the conceptual identity between the two algorithms, we further explore the connections between algorithms for model counting and F0 estimation.

First, we formalize a recipe to transform streaming algorithms for F0 estimation into algorithms for model counting. Such a transformation yields new (ε, δ)-approximate algorithms for model counting, which are different from currently known algorithms. Our transformation recipe from F0 estimation to model counting also allows us to view the design of distributed #DNF algorithms through the lens of distributed functional monitoring, which is well studied in the data streaming literature.

Building on the connection between model counting and F0 estimation algorithms, we design new algorithms to estimate F0 over structured set streams where each element of the stream is a (succinct representation of a) subset of the universe. Thus, the stream is S1, S2, ··· where each Si ⊆ [N] and the goal is to estimate the F0 of the stream, that is, the size of ∪iSi. In this scenario, the goal is to design algorithms whose per-item time (time to process each Si) is poly-logarithmic in the size of the universe. Structured set streams that are considered in the literature include 1-dimensional (1D) and multidimensional ranges.26,31 Several interesting problems, including the max-dominance norm6 and counting triangles in graphs,3 can be reduced to computing F0 over such ranges.

We observe that several structured sets can be represented as small DNF formulae and thus F0 counting over these structured set data streams can be viewed as a special case of #DNF. Using the hashing-based techniques for #DNF, we obtain a general recipe for a rich class of structured sets that include DNF sets, affine spaces, and multidimensional ranges. Prior work on structured sets had to rely on involved analysis for each of the specific instances, while our work provides a general recipe for both analysis and implementation.

A natural question that arises from the transformation recipe is the relationship between the space complexity of the streaming algorithms and the query complexity of the obtained model counting algorithms. We establish a relationship between these two quantities by showing that the space complexity is an upper bound on the query complexity.


2. Notation

We assume N = 2n and identify the universe [N] with {0, 1}n.

F0 Estimation: A data stream a over domain [N] can be represented as a = a1, a2, ···, am wherein each item ai ∈ [N]. Let au = ∪i{ai}. The F0 of the stream a is |au|. We are interested in a probably approximately correct scheme that returns an (ε, δ)-estimate c of F0, that is, Pr[F0/(1 + ε) ≤ c ≤ (1 + ε) · F0] ≥ 1 − δ.

Model Counting: Let X = {X1, X2, …, Xn} be a set of Boolean variables. For a Boolean formula ϕ over variables X, let Sol(ϕ) denote the set of all satisfying assignments of ϕ. The propositional model counting problem is to compute |Sol(ϕ)| for a given formula ϕ. As in the case of F0, we are interested in a probably approximately correct algorithm that takes as inputs a formula ϕ, a tolerance ε > 0, and a confidence δ ∈ (0, 1], and returns an (ε, δ)-estimate c of |Sol(ϕ)|, that is, Pr[|Sol(ϕ)|/(1 + ε) ≤ c ≤ (1 + ε) · |Sol(ϕ)|] ≥ 1 − δ.

k-wise Independent hash functions: Let n, m ∈ ℕ and let H(n, m) = {h : {0, 1}n → {0, 1}m} denote a family of hash functions mapping {0, 1}n to {0, 1}m.

DEFINITION 1. A family of hash functions H(n, m) is k-wise independent, denoted Hk-wise(n, m), if for all α1, α2, …, αk ∈ {0, 1}m and for all distinct x1, x2, …, xk ∈ {0, 1}n,

Pr[h(x1) = α1 ∧ h(x2) = α2 ∧ ··· ∧ h(xk) = αk] = 2−km,

where h is chosen uniformly at random from H(n, m).

Explicit families. An explicit hash family that we use is HToeplitz(n, m), which is known to be 2-wise independent.4 The family is defined as follows: HToeplitz(n, m) is the family of functions of the form h(x) = Ax + b, where A ∈ F2m×n is a uniformly random Toeplitz matrix (constant along each diagonal), b ∈ F2m×1 is a uniformly random vector, and F2 is the finite field of size 2. For a hash function h : {0, 1}n → {0, 1}m and ℓ ∈ {1, …, m}, hℓ : {0, 1}n → {0, 1}ℓ denotes the function where hℓ(y) is the first ℓ bits of h(y).
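A small sketch of sampling from HToeplitz(n, m) (our illustration): because a Toeplitz matrix is constant along each diagonal, A is determined by only m + n − 1 random bits, which is what makes the family cheap to store.

import random

def sample_toeplitz_hash(n, m):
    # A[i][j] = diag[i - j + (n - 1)]: an m x n Toeplitz matrix from m + n - 1 bits.
    diag = [random.getrandbits(1) for _ in range(m + n - 1)]
    b = [random.getrandbits(1) for _ in range(m)]
    def h(x):  # x: list of n bits; returns h(x) = Ax + b over F_2 as a list of m bits
        return [(b[i] + sum(diag[i - j + n - 1] & x[j] for j in range(n))) % 2
                for i in range(m)]
    return h

h = sample_toeplitz_hash(4, 8)
prefix = h([1, 0, 1, 1])[:3]  # h_3(x): the first 3 bits of h(x)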


3. From Streaming to Counting

As a first step, we present a unified view of the three hashing-based algorithms proposed in Bar-Yossef et al.2 Their first algorithm, the Bucketing algorithm discussed above, is a modification of an F0 estimation algorithm due to Gibbons and Tirthapura.16 The second algorithm, which we call Minimum, is based on the idea that if we hash all the items of the stream, then the O(1/ε2)-th minimum of the hash values can be used to compute a good estimate of F0. The third algorithm, which we call Estimation, chooses a set of k functions, {h1, h2, …, hk}, such that each hj is picked randomly from an O(log(1/ε))-independent hash family. For each hash function hj, we say that hj is not lonely if there exists ai ∈ a such that hj(ai) = 0. One can then estimate F0 of a by estimating the number of hash functions that are not lonely.

Algorithm 1, called ComputeF0, presents the overarching architecture of the three proposed algorithms. The architecture of ComputeF0 is fairly simple: it chooses a collection of hash functions using ChooseHashFunctions, calls the subroutine ProcessUpdate for every incoming element of the stream and invokes ComputeEst at the end of the stream to return the F0 approximation.

ChooseHashFunctions. As shown in Algorithm 2, the hash functions depend on the strategy being implemented. The subroutine PickHashFunctions(H, t) returns a collection of t independently chosen hash functions from the family H. We use H to denote the collection of hash functions returned; this collection is viewed as either a 1D array or a 2-dimensional (2D) array. When H is a 1D array, H[i] denotes the ith hash function of the collection, and when H is a 2D array, H[i][j] is the [i, j]th hash function.

ProcessUpdate. For a new item x, the update of the sketch S, as shown in Algorithm 3, is as follows:

  • Bucketing: For a new item x, if H[i]mi(x) = 0mi, then we add x to S[i] if it is not already present in S[i]. If the size of S[i] is greater than Thresh (which is set to O(1/ε2)), then we increment mi as in line 8.
  • Minimum: For a new item x, if H[i](x) is smaller than max S[i], then we replace max S[i] with H[i](x).
  • Estimation: For a new item x, compute z = TrailZero(H[i, j](x)), that is, the number of trailing zeros in H[i, j](x), and replace S[i, j] with z if z is larger than S[i, j].

ComputeEst. Finally, for each of the algorithms, we estimate F0 based on the sketch S as described in the subroutine ComputeEst, presented as Algorithm 4. It is crucial to note that for the Bucketing and Minimum algorithms, the estimation of F0 is performed solely using the sketch S. The Estimation algorithm requires an additional parameter r that depends on a loose estimate of F0.

Algorithm 1 ComputeF0(n, ε, δ)

  1. Thresh ← 96/ε2
  2. t ← 35 log(1/δ)
  3. H ← ChooseHashFunctions(n, Thresh, t)
  4. S ← {}
  5. while true do
  6. if EndStream then exit;
  7. x ← input()
  8. ProcessUpdate(S, H, x, Thresh)
  9. Est ← ComputeEst(S, Thresh)
  10. return Est

Sketch Properties. For each of the three algorithms, their corresponding sketches can be viewed as arrays of size 35 log(1/δ). The parameter Thresh is set to 96/ε2.

  • Bucketing: The element S[i] is a tuple 〈ℓi, mi〉 where ℓi is a list of size at most Thresh, with ℓi = {x ∈ a | H[i]mi(x) = 0mi}. We use S[i](0) to denote ℓi and S[i](1) to denote mi.
  • Minimum: Each S[i] holds the Thresh lexicographically smallest elements of {H[i](x) | x ∈ a}.
  • Estimation: Each S[i] holds a tuple of size Thresh. The jth entry of this tuple is the largest number of trailing zeros in any element of H[i, j](a).

Algorithm 2 ChooseHashFunctions(n, Thresh, t)

  1. switch AlgorithmType do
  2. case AlgorithmType==Bucketing
  3. H ← PickHashFunctions(HToeplitz (n, n), t)
  4. case AlgorithmType==Minimum
  5. H ← PickHashFunctions(HToeplitz (n, 3n), t)
  6. case AlgorithmType==Estimation
  7. s ← 10 log(1/ε)
  8. H ← PickHashFunctions(Hs-wise (n, n), t × Thresh)
    return H

Algorithm 3 ProcessUpdate(S, H, x, Thresh)

  1. for i ∈ [1, |H|] do
  2. switch AlgorithmType do
  3. case Bucketing
  4. mi ← S[i](1)
  5. if H[i]mi(x) == 0mi then
  6. S[i](0) ← S[i](0) ∪ {x}
  7. if size(S[i](0)) > Thresh then
  8. S[i](1) ← S[i](1) + 1
  9. for y ∈ S[i](0) do
  10. if H[i]mi+1(y) ≠ 0mi+1 then
  11. Remove(S[i](0), y)
  12. case Minimum
  13. if size(S[i]) < Thresh then
  14. S[i].Append(H[i](x))
  15. else
  16. j ← arg max(S[i])
  17. if S[i](j) > H[i](x) then
  18. S[i](j) ← H[i](x)
  19. case Estimation
  20. for j ∈ [1, Thresh] do
  21. S[i, j] ← max(S[i, j], TrailZero(H[i, j](x)))
  22. return S

*  3.1. A recipe for transformation

Observe that for each of the algorithms, the final F0 estimate depends only on the sketch S. Therefore, for two streams a and â, if the corresponding sketches (and the hash functions chosen) match, then the three schemes presented above return the same estimates. The recipe for transforming streaming algorithms into model counting algorithms is based on the following insight:

  1. Capture the relationship P(S, H, au) between the sketch S, the set of hash functions H, and the set au at the end of the stream.
  2. View the formula ϕ as a symbolic representation of the unique set au represented by the stream a such that Sol(ϕ) = au.
  3. Given a formula ϕ and set of hash functions H, design an algorithm to construct sketch S such that the property P(S, H, Sol(ϕ)) holds. Using the sketch S, |Sol(ϕ)| can be estimated.

Algorithm 4 ComputeEst(S, Thresh)

  1. switch AlgorithmType do
  2. case Bucketing
  3. return Median( {size(S[i](0)) × 2S[i](1)}i )
  4. case Minimum
  5. return Median( {(Thresh × 23n) / max(S[i])}i )
  6. case Estimation(r)
  7. return Median( {ln(1 − (1/Thresh) · Σj 1[S[i, j] ≥ r]) / ln(1 − 2−r)}i )
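To make the three-subroutine pipeline concrete, the following self-contained Python sketch runs the Minimum strategy of Algorithms 1-4 end to end. This is our illustration, not the authors' implementation: a lazily sampled, fully random function stands in for HToeplitz(n, 3n), and the stream is materialized as a list so it can be replayed for each of the t estimators.

import math, random, statistics

def minimum_f0(stream, n, eps, delta):
    stream = list(stream)                    # replayed once per estimator
    thresh = math.ceil(96 / eps ** 2)
    t = max(1, math.ceil(35 * math.log(1 / delta)))
    estimates = []
    for _ in range(t):
        table = {}
        def h(x):                            # stand-in for a draw from H_Toeplitz(n, 3n)
            return table.setdefault(x, random.getrandbits(3 * n))
        smallest = set()                     # ProcessUpdate: the thresh smallest hash values
        for x in stream:
            smallest.add(h(x))
            if len(smallest) > thresh:
                smallest.discard(max(smallest))
        if len(smallest) < thresh:           # fewer than thresh distinct hash values: exact (whp)
            estimates.append(len(smallest))
        else:                                # ComputeEst, Minimum case: Thresh x 2^{3n} / max(S[i])
            estimates.append(thresh * 2 ** (3 * n) / max(smallest))
    return statistics.median(estimates)

# e.g., minimum_f0(range(1000), n=10, eps=0.5, delta=0.2) typically returns a value close to 1000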

By applying the above recipe to the three F0 estimation algorithms, we can derive corresponding model counting algorithms. In particular, applying the above recipe to the Bucketing algorithm leads us to the state-of-the-art hashing-based model counting algorithm, ApproxMC, proposed by Chakraborty et al.5 Applying the above recipe to Minimum and Estimation allows us to obtain different model counting schemes. In this extended abstract, we illustrate this transformation for the Minimum-based algorithm.

*  3.2. Example application of recipe: Minimum-based algorithm

We showcase the application of the recipe in the context of the Minimum-based algorithm. For a given multiset a (e.g., a data stream or the set of solutions of a formula), we specify the property P(S, H, au) as follows: the sketch S is an array of sets indexed by members of H, where S[i] holds the p lexicographically minimum elements of H[i](au) with p = ⌈96/ε2⌉. P(S, H, au) is the property that specifies this relationship.

The following lemma, due to Bar-Yossef et al.,2 establishes the relationship between the property P and the number of distinct elements of a multiset. Let max(S[i]) denote the largest element of the set S[i].

LEMMA 1. Let a ⊆ {0, 1}n be a multiset. Let H be a collection of hash functions where each H[i] is independently drawn from HToeplitz(n, 3n) and |H| = O(log(1/δ)). Let S be such that P(S, H, au) holds, and let Est = Median( {(p × 23n) / max(S[i])}i ). Then

Pr[ |au|/(1 + ε) ≤ Est ≤ (1 + ε) · |au| ] ≥ 1 − δ.

Therefore, we can transform the Minimum-based algorithm for F0 estimation into a model counting algorithm, given access to a subroutine that can compute S such that P(S, H, Sol(ϕ)) holds. The following proposition establishes the existence and complexity of such a subroutine, called FindMin.

PROPOSITION 1. There is an algorithm FindMin that, given ϕ over n variables, h ∈ HToeplitz(n, m), and p as input, returns a set B ⊆ h(Sol(ϕ)) so that if |h(Sol(ϕ))| ≤ p, then B = h(Sol(ϕ)); otherwise B is the set of p lexicographically minimum elements of h(Sol(ϕ)). Moreover, if ϕ is a CNF formula, then FindMin makes O(p · m) calls to an NP oracle, and if ϕ is a DNF formula with k terms, then FindMin takes O(m3 · n · k · p) time.

Equipped with Proposition 1, we are now ready to present the algorithm, called ApproxModelCountMin, for model counting. Since the complexity of FindMin is polynomial when ϕ is in DNF, ApproxModelCountMin is an FPRAS for DNF formulas.

Algorithm 5 ApproxModelCountMin(ϕ, ε, δ)

  1. t ← 35 log(1/δ)
  2. H ← PickHashFunctions(HToeplitz (n, 3n), t)
  3. S ← {}
  4. Thresh ← ⌈96/ε2⌉
  5. for i ∈ [1, t] do
  6. S[i] ← FindMin(ϕ, H[i], Thresh)
  7. Est ← Median( {(Thresh × 23n) / max(S[i])}i )
  8. return Est

THEOREM 1. Given ϕ, ε, δ, ApproxModelCountMin returns Est such that

Pr[ |Sol(ϕ)|/(1 + ε) ≤ Est ≤ (1 + ε) · |Sol(ϕ)| ] ≥ 1 − δ.

If ϕ is a CNF formula, then ApproxModelCountMin is a polynomial-time algorithm that makes O((n/ε2) · log(1/δ)) calls to an NP oracle. If ϕ is a DNF formula, then ApproxModelCountMin is an FPRAS.

We now give the proof of Proposition 1 by describing the subroutine FindMin.

PROOF. We first present the algorithm when the formula ϕ is a DNF formula; adapting the algorithm to the case of CNF can be done using similar ideas. Let ϕ = T1 ∨ T2 ∨ ··· ∨ Tk be a DNF formula over n variables, where each Ti is a term. Let h : {0, 1}n → {0, 1}m be a linear hash function in HToeplitz(n, m) defined by an m × n binary matrix A. Let C be the set of hashed values of the satisfying assignments of ϕ: C = {h(x) | x ⊨ ϕ} ⊆ {0, 1}m. Let Cp be the first p elements of C in lexicographic order. Our goal is to compute Cp.

We illustrate an algorithm with running time O(m3np) to compute Cp when the formula is a single term T; this algorithm easily generalizes to formulas with k terms. Let T be a term of width w (its number of literals) and C = {Ax | x ⊨ T}. By fixing the variables in T, we get a vector bT and an m × (n − w) matrix AT so that C = {AT x + bT | x ∈ {0, 1}n−w}. Both AT and bT can be computed from A and T in linear time. Let hT(x) denote the transformation AT x + bT.

We compute Cp iteratively as follows: assuming we have computed the (q − 1)th minimum of C, we compute the qth minimum using a prefix-search strategy. We use a subroutine to solve the following basic prefix-search primitive: given an ℓ-bit string y1 ··· yℓ, is there an x ∈ {0, 1}n−w so that y1 ··· yℓ is a prefix of hT(x)? This task can be performed using Gaussian elimination over an (ℓ + 1) × (n − w) binary matrix and can be implemented in time O(ℓ2(n − w)).

Let y = y1 ··· ym be the (q − 1)th minimum in C and let r1 be the position of the rightmost 0 in y. Using the above procedure, we can find the lexicographically smallest string in the range of hT that extends y1 ··· yr1−1 1, if one exists. If no such string exists in C, we move to the position of the next 0 to the left in y and repeat. In this manner, the qth minimum can be computed using O(m) calls to the prefix-search primitive, resulting in an O(m3n) time algorithm. Invoking the above procedure p times results in an algorithm to compute Cp in O(m3np) time.
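The prefix-search primitive reduces to checking the consistency of a small linear system over F2. A sketch (our illustration, with matrix rows encoded as Python integer bitmasks; FindMin would call prefix_feasible O(m) times per minimum):

def gf2_consistent(rows, rhs):
    # Incremental Gaussian elimination over F_2: decide whether some x satisfies
    # <rows[i], x> = rhs[i] (mod 2) for every i. Rows are integer bitmasks.
    basis = {}                                  # leading bit -> (reduced row, rhs bit)
    for r, c in zip(rows, rhs):
        while r:
            lead = r.bit_length() - 1
            if lead not in basis:
                basis[lead] = (r, c)            # new pivot
                break
            br, bc = basis[lead]
            r, c = r ^ br, c ^ bc               # eliminate the leading bit
        else:                                   # row reduced to zero
            if c:                               # 0 = 1: the system is inconsistent
                return False
    return True

def prefix_feasible(A_rows, b_bits, prefix):
    # Prefix-search primitive: is there an x with A_T x + b_T starting with `prefix`?
    return gf2_consistent(A_rows[:len(prefix)],
                          [b_bits[i] ^ prefix[i] for i in range(len(prefix))])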

We now discuss the distributed DNF counting problem.

*  3.3. Distributed DNF counting

Consider the problem of distributed DNF counting. In this setting, there are k sites that can each communicate with a central coordinator. The input DNF formula ϕ is partitioned into k DNF subformulas ϕ1, ···, ϕk, where each ϕi is a subset of the terms of the original ϕ, with the jth site receiving only ϕj. The goal is for the coordinator to obtain an (ε, δ)-approximation of the number of solutions to ϕ, while minimizing the total number of bits communicated between the sites and the coordinator. Distributed algorithms for sampling and counting solutions to CSPs have been studied recently in other models of distributed computation.11, 12, 13, 14 From a practical perspective, given the centrality of #DNF in the context of probabilistic databases,27, 28 a distributed DNF counting algorithm would entail applications in distributed probabilistic databases.

From our perspective, distributed DNF counting falls within the distributed functional monitoring framework formalized by Cormode et al.7 Here, the input is a stream a which is partitioned arbitrarily into sub-streams a1, …, ak that arrive at each of k sites. Each site can communicate with the central coordinator, and the goal is for the coordinator to compute a function of the joint stream a while minimizing the total communication. This general framework has several direct applications and has been studied extensively (see Cormode et al.,8 Huang et al.,18 Woodruff and Zhang,33 and the references therein). In the distributed DNF counting problem, each sub-stream ai corresponds to the set of satisfying assignments of the subformula ϕi, while the function to be computed is F0.

The algorithms discussed in Section 3 can be extended to the distributed setting. We briefly describe the distributed implementation of the Minimum-based algorithm. The coordinator chooses hash functions H[1], …, H[t] from HToeplitz(n, 3n) and sends them to the k sites. Each site runs the FindMin algorithm for each hash function and sends the outputs to the coordinator. So, the coordinator receives sets S[i, j], consisting of the Thresh lexicographically smallest hash values of the solutions to ϕj. The coordinator then extracts S[i], the Thresh lexicographically smallest elements of S[i, 1] ∪ ··· ∪ S[i, k], and proceeds with the rest of the algorithm ApproxModelCountMin. The communication cost is O(kn2 · log(1/δ)) to account for the k sites sending the outputs of their FindMin invocations. The time complexity for each site is polynomial in n, ε−1, and log(δ−1).
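The coordinator's merge step is simple; a sketch (our illustration) of its work for one hash function, where site_mins[j] holds the Thresh smallest 3n-bit hash values reported by site j's FindMin call:

def coordinator_estimate(site_mins, thresh, n):
    # Merge the sites' reports and keep the thresh smallest distinct hash values.
    merged = sorted(set().union(*map(set, site_mins)))[:thresh]
    if len(merged) < thresh:        # fewer than thresh distinct hash values overall:
        return len(merged)          # the count is exact (with high probability)
    return thresh * 2 ** (3 * n) / max(merged)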

A straightforward implementation of the Bucketing algorithm leads to a distributed DNF counting algorithm whose communication cost is Õ(k(n + 1/ε2) · log(1/δ)) and whose time complexity per site is polynomial in n, ε−1, and log(δ−1). Similarly, the Estimation-based algorithm leads to a distributed algorithm with Õ(k(n + 1/ε2) · log(1/δ)) communication cost. However, we do not know a polynomial-time implementation of the latter algorithm for DNF terms.

*  3.4. Lower bound

The communication cost of the Bucketing- and Estimation-based algorithms is nearly optimal in its dependence on k and ε. Woodruff and Zhang33 showed that the randomized communication complexity of estimating F0 up to a 1 + ε factor in the distributed functional monitoring setting is Ω(k/ε2). We can reduce the F0 estimation problem to distributed DNF counting: if, for the F0 estimation problem, the jth site receives items a1, ···, am ∈ [N], then for the distributed DNF counting problem, ϕj is a DNF formula on ⌈log2 N⌉ variables whose solutions are exactly a1, ···, am in their binary encoding. Thus, we immediately get an Ω(k/ε2) lower bound for the distributed DNF counting problem. Finding the optimal dependence on N for k > 1 remains an interesting open question.
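The reduction is mechanical; a sketch (our illustration) that turns a site's items into the DNF subformula ϕj whose solutions are exactly those items:

import math

def items_to_dnf(items, N):
    # One term per item: each term fixes all ceil(log2 N) variables to the item's
    # binary encoding, so the term's unique solution is the item itself.
    # Literal convention: +v stands for x_v, -v for its negation (variables 1..nbits).
    nbits = max(1, math.ceil(math.log2(N)))
    return [[(v + 1) if (a >> v) & 1 else -(v + 1) for v in range(nbits)]
            for a in items]

# items_to_dnf([5, 2], 8) -> [[1, -2, 3], [-1, 2, -3]]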


4. From Counting to Streaming

In this section, we consider the structured set streaming model where each item Si of the stream is a succinct representation of a set over the universe U = {0, 1}n. Our goal is to design efficient algorithms (both in terms of memory and processing time per item) for computing |∪i Si|, the number of distinct elements in the union of all the sets in the stream. We call this problem F0 computation over structured set streams. We discuss two types of structured sets: DNF sets and affine spaces. As mentioned in the introduction, other structured sets studied in the literature are single and multidimensional ranges. Our techniques also give algorithms for estimating F0 of such structured set streams, which we omit in this extended abstract.

*  4.1. DNF sets

A particular representation we are interested in is where each set is presented as the set of satisfying assignments of a DNF formula. Let ϕ be a DNF formula over n variables. The DNF set corresponding to ϕ is the set of satisfying assignments of ϕ. The size of this representation is the number of terms in the formula ϕ. A stream over DNF sets is a stream of DNF formulas ϕ1, ϕ2, …. Given such a DNF stream, the goal is to estimate |∪i Si| where Si is the DNF set represented by ϕi. This quantity is the same as the number of satisfying assignments of the formula ∨i ϕi. We show that the algorithms described in the previous section carry over to yield (ε, δ)-estimation algorithms for this problem with space and per-item time poly(1/ε, n, k, log(1/δ)) where k is the size of the formula.

THEOREM 2. There is a streaming algorithm to compute an (ε, δ)-approximation of F0 over DNF sets. This algorithm takes space O((n/ε2) · log(1/δ)) and processing time O(n4 · k · (1/ε2) · log(1/δ)) per item, where k is the size (number of terms) of the corresponding DNF formula.

PROOF. We show how to adapt the Minimum-value based algorithm from Section 3.2 to this setting. The algorithm picks a hash function h ∈ HToeplitz(n, 3n) and, after processing i − 1 items, maintains the set B consisting of the t lexicographically minimum elements of the set h(Sol(ϕ1 ∨ ··· ∨ ϕi−1)). When ϕi arrives, it computes the set B′ consisting of the t lexicographically minimum values of the set h(Sol(ϕi)) and subsequently updates B by computing the t lexicographically smallest elements of B ∪ B′. By Proposition 1, the computation of B′ can be done in time O(n4 · k · t), where k is the number of terms in ϕi. Updating B can be done in O(t · n) time. Thus, the update time for item ϕi is O(n4 · k · t). For obtaining an (ε, δ)-approximation, we set t = ⌈96/ε2⌉, repeat the procedure O(log(1/δ)) times, and take the median value. Thus the per-item update time is O(n4 · k · (1/ε2) · log(1/δ)). For space, each hash function uses O(n) bits and the algorithm stores O((1/ε2) · log(1/δ)) minimums, resulting in overall space usage of O((n/ε2) · log(1/δ)). The proof of correctness follows from Lemma 1.
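In code, the per-item update is just a bounded merge; a sketch (our illustration, where find_min is the hypothetical subroutine of Proposition 1 returning the t lexicographically smallest hash values of Sol(ϕi)):

def process_dnf_item(buffer, phi_i, h, t):
    # Merge the t smallest hash values of Sol(phi_i) into the running buffer,
    # then keep only the t smallest of the union.
    new_mins = find_min(phi_i, h, t)        # hypothetical FindMin from Proposition 1
    return sorted(set(buffer) | set(new_mins))[:t]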

Instead of the Minimum-value based algorithm, we could also adapt the Bucketing-based algorithm to obtain an algorithm with similar space and time complexities.

*  4.2. Affine spaces

Another example of a structured stream is where each item of the stream is an affine space represented by Ax = B, where A is a Boolean matrix and B is a zero-one vector. Without loss of generality, we may assume that A is an n × n matrix. Thus an affine stream consists of 〈A1, B1〉, 〈A2, B2〉, ···, where each 〈Ai, Bi〉 succinctly represents a set {x ∈ {0, 1}n | Aix = Bi}. Here operations are over the finite field of size 2. For an n × n Boolean matrix A and a zero-one vector B, let Sol(〈A, B〉) denote the set of all x that satisfy Ax = B.

PROPOSITION 2. Given 〈A, B〉, h ∈ HToeplitz(n, 3n), and t as input, there is an algorithm, AffineFindMin, that returns a set B ⊆ h(Sol(〈A, B〉)) so that if |h(Sol(〈A, B〉))| ≤ t, then B = h(Sol(〈A, B〉)); otherwise B is the set of t lexicographically minimum elements of h(Sol(〈A, B〉)). The time taken by this algorithm is O(n4t) and the space taken by the algorithm is O(tn).
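To see why the same prefix-search machinery applies, note that the image of an affine space under an affine hash map is again an affine space (our sketch of the reasoning, consistent with the stated bounds). Writing x0 for any solution of Ax = B and K for a matrix whose columns form a basis of the kernel of A, both computable by Gaussian elimination,

\[
\mathrm{Sol}(\langle A, B\rangle) = \{\, x_0 + Kw \mid w \in \mathbb{F}_2^{d} \,\},
\qquad
h(\mathrm{Sol}(\langle A, B\rangle)) = \{\, (A_h K)\,w + (A_h x_0 + b_h) \mid w \in \mathbb{F}_2^{d} \,\},
\]

where h(x) = Ahx + bh. Hence h(Sol(〈A, B〉)) has exactly the parametrized form used in the proof of Proposition 1, and the same per-minimum prefix search yields the stated O(n4t) bound.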

The above proposition together with the minimum-based algorithm gives the following theorem.

THEOREM 3. There is a streaming algorithm that computes an (ε, δ)-approximation of F0 over affine spaces. This algorithm takes space O((n/ε2) · log(1/δ)) and processing time O((n4/ε2) · log(1/δ)) per item.


5. Relating Sketch Space Complexity and NP Query Complexity

Our investigations reveal surprising connections between algorithms for F0 estimation and model counting that are of interest to two different communities. It is noteworthy that the two communities often have different concerns: in the context of model counting, one focuses on the NP-query complexity, while in the context of streaming, the focus is on the space complexity. This begs the question of whether the connections are a matter of happenstance or whether there is an inherent relationship between space complexity in the context of streaming and query complexity for model counting. We detail our investigations on the existence of such a relationship.

In the following, we fold the hash function h into the sketch S. With this simplification, instead of writing P(S, h, Sol(ϕ)), we write P(S, Sol(ϕ)).

We first introduce some complexity-theoretic notation. For a complexity class C, a language L belongs to the complexity class ∃·C if there is a polynomial q(·) and a language L′ ∈ C such that for every x, x ∈ L ⇔ ∃y, |y| ≤ q(|x|), 〈x, y〉 ∈ L′.

Consider a streaming algorithm for F0 that constructs a sketch S such that P(S, au) holds for some property P, using which we can estimate |au|, where the size of S is polylogarithmic in the size of the universe and polynomial in 1/ε. Now consider the following sketch language:

Lsketch = { 〈ϕ, S〉 | P(S, Sol(ϕ)) holds }.

THEOREM 4. If Lsketch belongs to the complexity class C, then there exists an FP∃·C model counting algorithm that estimates the number of satisfying assignments of a given formula ϕ. The number of queries made by the algorithm is bounded by the sketch size.

Let us apply the above theorem to the Minimum-based algorithm. The sketch language consists of tuples of the form 〈ϕ, 〈h, v1, ···, vt〉〉 where {v1, ···, vt} is the set of t lexicographically smallest elements of the set h(Sol(ϕ)). It can be seen that this language is in coNP. Since ∃·coNP is the same as the class Σ2p, we obtain an FPΣ2p algorithm. Since t = O(1/ε2) and h maps n-bit strings to 3n-bit strings, it follows that the size of the sketch is O(n2). Thus the number of queries made by the algorithm is O(n2).

Interestingly, all the model counting algorithms obtained via our recipe are probabilistic polynomial-time algorithms that make queries to languages in NP. The above generic transformation gives a deterministic polynomial-time algorithm that makes queries to a Σ2p language. Precisely characterizing the properties of the sketch that lead to probabilistic algorithms making only NP queries is an interesting direction to explore.


6. Conclusion and Future Outlook

Our investigation led to a diverse set of results that unify over two decades of work in model counting and F0 estimation. The viewpoint presented in this work has the potential to spur several new interesting research directions.

Higher Moments. There has been a long line of work on the estimation of higher moments, that is, Fk, over data streams. A natural direction of future research is to adapt the notion of Fk to the context of model counting and explore its applications. We expect extensions of the framework and recipe presented in this work to yield algorithms for higher frequency moments in the context of model counting.

Sparse XORs. In the context of model counting, the performance of the underlying SAT solvers strongly depends on the size of the XOR constraints. The standard constructions lead to XORs of size Θ(n), and an interesting line of research has focused on the design of sparse XOR-based hash functions,10,17,19 culminating in showing that one can use hash functions of the form h(x) = Ax + b wherein each entry of the m-th row of A is 1 with probability O(log m/m).23 Such XORs were shown to improve runtime efficiency. In this context, a natural direction would be to explore the usage of sparse XORs in the context of F0 estimation.


Acknowledgments

We thank the anonymous reviewers of PODS 21 for their valuable comments. We are grateful to Phokion Kolaitis for suggesting exploration beyond the transformation recipe that led to results in Section 5. We thank Wim Martens for providing valuable suggestions on an earlier version of the manuscript. Bhattacharyya was supported in part by the NRF Fellowship Programme [NRF-NRFFAI1-2019-0002] and an Amazon Research Award. Meel was supported in part by the NRF Fellowship Programme [NRF-NRFFAI1-2019-0004], the AI Singapore Programme [AISG-RP-2018-005], and NUS ODPRT Grant [R-252-000-685-13]. Vinod was supported in part by NSF CCF-2130608, NSF CCF-184908, and NSF HDR:TRIPODS-1934884 awards. Pavan was supported in part by NSF CCF-2130536, NSF CCF-1849053, and NSF HDR:TRIPODS-1934884 awards.

 

    1. Alon, N., Matias, Y., Szegedy, M. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58, 1 (1999), 137–147.

    2. Bar-Yossef, Z., Jayram, T.S., Kumar, R., Sivakumar, D., Trevisan, L. Counting distinct elements in a data stream. In Volume 2483 of Proceedings of RANDOM (2002), Springer, Cambridge, USA, 1–10.

    3. Bar-Yossef, Z., Kumar, R., Sivakumar, D. Reductions in streaming algorithms, with an application to counting triangles in graphs. In Proceedings of SODA (2002), ACM/SIAM, NY, 623–632.

    4. Carter, J.L., Wegman, M.N. Universal classes of hash functions. In Proceedings of the 9th Annual ACM Symposium on Theory of Computing (1977), ACM, NY, 106–112.

    5. Chakraborty, S., Meel, K.S., Vardi, M.Y. Algorithmic improvements in approximate counting for probabilistic inference: From linear to logarithmic SAT calls. In Proceedings of IJCAI (2016), IJCAI/AAAI Press, New York, USA.

    6. Cormode, G., Muthukrishnan, S. Estimating dominance norms of multiple data streams. In Proceedings of ESA, Volume 2832 of Lecture Notes in Computer Science. G.D. Battista and U. Zwick, eds. Springer, Budapest, Hungary, 2003, 148–160.

    7. Cormode, G., Muthukrishnan, S., Yi, K. Algorithms for distributed functional monitoring. ACM Trans. Algorithms (TALG) 7, 2 (2011), 1–20.

    8. Cormode, G., Muthukrishnan, S., Yi, K., Zhang, Q. Continuous sampling from distributed streams. J. ACM (JACM) 59, 2 (2012), 1–25.

    9. Dagum, P., Karp, R., Luby, M., Ross, S. An optimal algorithm for Monte Carlo estimation. SIAM J. Comput. 29, 5 (2000), 1484–1496.

    10. Ermon, S., Gomes, C.P., Sabharwal, A., Selman, B. Low-density parity constraints for hashing-based discrete integration. In Proceedings of ICML (2014), JMLR, Beijing, China, 271–279.

    11. Feng, W., Hayes, T.P., Yin, Y. Distributed symmetry breaking in sampling (optimal distributed randomly coloring with fewer colors). arXiv preprint arXiv:1802.06953 (2018).

    12. Feng, W., Sun, Y., Yin, Y. What can be sampled locally? Distrib. Comput. 33 (2018), 1–27.

    13. Feng, W., Yin, Y. On local distributed sampling and counting. In Proceedings of the 2018 ACM Symposium on Principles of Distributed Computing (2018), ACM, NY, 189–198.

    14. Fischer, M., Ghaffari, M. A simple parallel and distributed sampling technique: Local glauber dynamics. In 32nd International Symposium on Distributed Computing (2018) Schloss Dagstuhl - Leibniz-Zentrum für Informatik, New Orleans, USA.

    15. Flajolet, P., Martin, G.N. Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci. 31, 2 (1985), 182–209.

    16. Gibbons, P.B., Tirthapura, S. Estimating simple functions on the union of data streams. In Proceedings of SPAA. A. L. Rosenberg, ed. ACM, NY, 2001, 281–291.

    17. Gomes, C.P., Hoffmann, J., Sabharwal, A., Selman, B. From sampling to model counting. In Proceedings of IJCAI (2007), IJCAI/AAAI Press, Hyderabad, India, 2293–2299.

    18. Huang, Z., Yi, K., Zhang, Q. Randomized algorithms for tracking distributed count, frequencies, and ranks. In Proceedings of PODS (2012), ACM, Scottsdale, USA 295–306.

    19. Ivrii, A., Malik, S., Meel, K.S., Vardi, M.Y. On computing minimal independent support and its applications to sampling and counting. Constraints An Int. J. 21, 1 (2016), 41–58.

    20. Kane, D.M., Nelson, J., Woodruff, D.P. An optimal algorithm for the distinct elements problem. In Proceedings of PODS (2010), ACM, NY, 41–52.

    21. Karp, R., Luby, M. Monte-Carlo algorithms for enumeration and reliability problems. In Proceedings of FOCS (1983), IEEE Computer Society, Arizona, USA.

    22. Karp, R.M., Luby, M., Madras, N. Monte-Carlo approximation algorithms for enumeration problems. J. Algorithms 10, 3 (1989), 429–448.

    23. Meel, K.S., Akshay, S. Sparse hashing for scalable approximate model counting: Theory and practice. In Proceedings of LICS (2020) ACM, Saarbrücken, Germany.

    24. Meel, K.S., Shrotri, A.A., Vardi, M.Y. On hashing-based approaches to approximate DNF-counting. In Proceedings of FSTTCS (2017), Schloss Dagstuhl - Leibniz-Zentrum für Informatik, Kanpur, India.

    25. Meel, K.S., Shrotri, A.A., Vardi, M.Y. Not all FPRASs are equal: Demystifying FPRASs for DNF-counting (extended abstract). In Volume 8 of Proceedings of IJCAI (2019), IJCAI, Macau, China.

    26. Pavan, A., Tirthapura, S. Range-efficient counting of distinct elements in a massive data stream. SIAM J. Comput. 37, 2 (2007), 359–379.

    27. Ré, C., Suciu, D. Approximate lineage for probabilistic databases. Proc. VLDB Endowment 1, 1 (2008), 797–808.

    28. Senellart, P. Provenance and probabilities in relational databases. ACM SIGMOD Rec. 46, 4 (2018), 5–15.

    29. Soos, M., Meel, K.S. BIRD: Engineering an efficient CNF-XOR SAT solver and its applications to approximate model counting. In Proceedings of AAAI Conference on Artificial Intelligence (AAAI) (2019), AAAI Press, Honolulu, USA.

    30. Stockmeyer, L. The complexity of approximate counting. In Proceedings of STOC (1983), ACM, Boston, 118–126.

    31. Tirthapura, S., Woodruff, D.P. Rectangle-efficient aggregation in spatial data streams. In Proceedings of PODS (2012), ACM, NY, 283–294.

    32. Valiant, L. The complexity of enumeration and reliability problems. SIAM J. Comput. 8, 3 (1979), 410–421.

    33. Woodruff, D.P., Zhang, Q. Tight bounds for distributed functional monitoring. In Proceedings of the 44th Annual ACM Symposium on Theory of Computing (2012), ACM, New York, USA 941–960.

    To view the accompanying Technical Perspective, visit doi.acm.org/10.1145/3607825

    The original version of this paper is entitled "Model Counting Meets F0 Estimation," and was published in the Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Virtual Event, China, June 20–25, 2021; https://dl.acm.org/doi/10.1145/3452021.3458311
