In this article, we consider the following question: Can we optimize objective functions from the training data we use to learn them? We formalize this question through a novel framework we call optimization from samples (OPS). In OPS, we are given sampled values of a function drawn from some distribution and the objective is to optimize the function under some constraint.
While there are interesting classes of functions that can be optimized from samples, our main result is an impossibility. We show that there are classes of functions that are statistically learnable and optimizable, but for which no reasonable approximation for optimization from samples is achievable. In particular, our main result shows that there is no constant factor approximation for maximizing coverage functions under a cardinality constraint using polynomially-many samples drawn from any distribution.
We also show tight approximation guarantees for maximization under a cardinality constraint of several interesting classes of functions including unit-demand, additive, and general monotone submodular functions, as well as a constant factor approximation for monotone submodular functions with bounded curvature.
1 Introduction
The traditional approach in optimization typically assumes there is an underlying model known to the algorithm designer, and the goal is to optimize an objective function defined through the model. In a routing problem, for example, the model can be a weighted graph that encodes roads and their congestion, and the objective is to select a route that minimizes expected travel time from source to destination. In influence maximization, we are given a weighted graph that models the likelihood of individuals forwarding information, and the objective is to select a subset of nodes to spread information and maximize the expected number of nodes that receive information [54].
In many applications such as influence maximization or routing, we do not actually know the objective functions we wish to optimize, since they depend on the behavior of the world generating the model. In such cases, we gather information about the objective function from past observations and use that knowledge to optimize it. A reasonable approach is to learn a surrogate function that approximates the function generating the data (e.g., References [24, 27, 28, 30, 46, 62]) and optimize the surrogate. In routing, we may observe traffic, fit weights to a graph that represents congestion times, and optimize for the shortest path on the weighted graph learned from data. In influence maximization, we can observe information spreading in a social network, fit weights to a graph that encodes the influence model, and optimize for the k most influential nodes. But what are the guarantees we have?
One problem with optimizing a surrogate learned from data is that it may be inapproximable. For a problem like influence maximization, for example, even if a surrogate \(\widetilde{f}:2^N \rightarrow \mathbb {R}\) approximates a submodular influence function \(f:2^N\rightarrow \mathbb {R}\) within a factor of \((1\pm \epsilon)\) for sub-constant \(\epsilon \gt 0\), in general there is no polynomial-time algorithm that can obtain a reasonable approximation to \(\max _{S:|S|\le k}\widetilde{f}(S)\) or \(\max _{S:|S|\le k}{f}(S)\) [50]. A different concern is that the function learned from data may be approximable (e.g., if the surrogate remains submodular), but its optima are very far from the optima of the function generating the data. In influence maximization, even if the weights of the graph are learned within a factor of \((1\pm \epsilon)\) for sub-constant \(\epsilon \gt 0\), the optima of the surrogate may be a poor approximation to the true optimum [52, 62]. The sensitivity of optimization to the nuances of the learning method therefore raises the following question:
Can we actually optimize objective functions from the training data we use to learn them?
Optimization from samples. In this article, we consider the following question: Given an unknown objective function \(f:2^N\rightarrow \mathbb {R}\) and samples \(\lbrace S_i,f(S_{i})\rbrace _{i=1}^m\) where \(S_i\) is drawn from some distribution \(\mathcal {D}\) and \(m \in \operatorname{poly}(|N|)\), is it possible to solve \(\max _{S:|S|\le k}f(S)\)? More formally:
A class of functions \(\mathcal {F}:2^N\rightarrow \mathbb {R}\) is \(\alpha\)-optimizable in \(\mathcal {M}\) from samples over distribution \(\mathcal {D}\) if there exists a (not necessarily polynomial time) algorithm whose input is a set of samples \(\lbrace S_{i},f(S_i)\rbrace _{i=1}^{m}\), where \(f \in \mathcal {F}\) and \(S_i\) is drawn i.i.d. from \(\mathcal {D}\), and returns \(S \in \mathcal {M}\) s.t.:
\[ \Pr _{S_1, \ldots , S_m \sim \mathcal {D}}\left[ {\bf {E}}[f(S)] \ge \alpha \cdot \max _{T \in \mathcal {M}} f(T) \right] \ge 1 - \delta , \]
where the expectation is over the decisions of the algorithm, \(m \in \operatorname{poly}(|N|)\), and \(\delta \in [0 , 1)\) is a constant.
An algorithm with the above guarantees is an \(\alpha\)-OPS algorithm. In this article, we focus on the simplest constraint, where \(\mathcal {M} = \lbrace S\subseteq N : |S|\le k\rbrace\) is a cardinality constraint. For a class of functions \(\mathcal {F}\), we say that optimization from samples is possible when there exist some constant \(\alpha \in (0,1]\) and some distribution \(\mathcal {D}\) s.t. \(\mathcal {F}\) is \(\alpha\)-optimizable from samples over \(\mathcal {D}\) in \(\mathcal {M} = \lbrace S:|S|\le k\rbrace\).
Before discussing what is achievable in this framework, the following points are worth noting:
•
Optimization from samples is defined per distribution. Note that if we demand optimization from samples to hold on all distributions, then trivially no function would be optimizable from samples (e.g., for the distribution that always returns the empty set);
•
Optimization from samples seeks to approximate the global optimum. In learning, we evaluate a hypothesis on the same distribution we use to train it, since it enables making a prediction about events that are similar to those observed. For optimization it is trivial to be competitive against a sample by simply selecting the feasible solution with maximal value from the set of samples observed. Since an optimization algorithm has the power to select any solution, the hope is that polynomially many samples contain enough information for optimizing the function. In influence maximization, for example, we are interested in selecting a set of influencers, even if we did not observe a set of highly influential individuals that initiate a cascade together.
•
As previously discussed, a natural optimization from samples algorithm is to first learn a surrogate function \(\tilde{f}\) from the samples and to then optimize this surrogate function \(\tilde{f}\). We note that the optimization from samples framework is more general than the learn-then-optimize approach and that optimization from samples algorithms are not required to employ this two-stage approach.
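To make the second point above concrete, the trivial "best observed sample" baseline can be sketched in a few lines. This is an illustrative sketch only; the additive objective and weights below are ours, not from the article:

```python
import random

def best_sample_baseline(samples, k):
    """Return the highest-valued feasible sample: competitive against
    the observed samples themselves, but not against the global optimum."""
    feasible = [(S, v) for S, v in samples if len(S) <= k]
    return max(feasible, key=lambda sv: sv[1]) if feasible else (frozenset(), 0.0)

# Toy additive objective on a ground set of 10 elements (hypothetical).
w = {i: i + 1 for i in range(10)}
f = lambda S: sum(w[e] for e in S)
rng = random.Random(0)
samples = []
for _ in range(100):
    S = frozenset(e for e in range(10) if rng.random() < 0.3)
    samples.append((S, f(S)))
S_best, v_best = best_sample_baseline(samples, k=3)
```

The point in the text is precisely that an OPS algorithm may select any feasible set, not only one it has observed, so this baseline can be far from the global optimum.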
As we later show, there are interesting classes of functions and distributions that indeed allow us to approximate the global optimum well, in polynomial time using polynomially many samples. The question is therefore not whether optimization from samples is possible, but rather whether particular function classes of interest are optimizable from samples.
1.1 Optimizability and Learnability
Optimization from samples is particularly interesting when functions are learnable and optimizable.
•
Optimizability. We are interested in functions \(f:2^N \rightarrow \mathbb {R}\) and constraints \(\mathcal {M}\) such that given access to a value oracle (given S the oracle returns \(f(S)\)), there exists a constant factor approximation algorithm for \(\max _{S\in \mathcal {M}}f(S)\). For this purpose, monotone submodular functions are a convenient class to work with, where the canonical problem is \(\max _{|S|\le k}f(S)\). It is well known that there is a \(1-1/e\) approximation algorithm for this problem [63] and that this is tight using polynomially many value queries [40]. Influence maximization is an example of maximizing a monotone submodular function under a cardinality constraint [54].
•
PMAC-learnability. The standard framework in the literature for learning set functions is Probably Mostly Approximately Correct (\(\alpha\)-PMAC) learnability due to Balcan and Harvey [5]. This framework nicely generalizes Valiant’s notion of Probably Approximately Correct (PAC) learnability [71]. Informally, PMAC-learnability guarantees that after observing polynomially many samples of sets and their function values, one can construct a surrogate function that is likely to, \(\alpha\)-approximately, mimic the behavior of the function observed from the samples (see Appendix D for formal definitions). Since the seminal paper of Balcan and Harvey, there has been a great deal of work on learnability of submodular functions [2, 3, 4, 41, 43, 44].
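The \(1-1/e\) greedy algorithm underlying the optimizability discussion above can be sketched as follows. The toy coverage instance is ours, for illustration only:

```python
def greedy(f, N, k):
    """Greedy for max_{|S| <= k} f(S): repeatedly add the element with
    the largest marginal contribution f(S ∪ {e}) - f(S). For monotone
    submodular f this is a (1 - 1/e)-approximation [63]."""
    S = set()
    for _ in range(k):
        e_star = max((e for e in N if e not in S),
                     key=lambda e: f(S | {e}) - f(S))
        S.add(e_star)
    return S

# Toy coverage instance: element i covers the set T[i] of universe items.
T = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c"}, 3: {"d"}}
f = lambda S: len(set().union(*(T[i] for i in S))) if S else 0
S = greedy(f, list(T), k=2)
```

Note that greedy requires value-oracle access to f; the OPS question is what happens when only samples are available.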
Are functions that are learnable and optimizable also optimizable from samples?
1.2 Main Result
Our main result is an impossibility. We show that there is an interesting class of functions that is PMAC-learnable and optimizable but not optimizable from samples. This class is coverage functions.
A function f is a coverage function if there exists a family of sets \(T_{1},\ldots , T_{n}\), subsets of a universe U with weights \(w(a_j)\) for \(a_j \in U\), such that for all S, \(f(S) = \sum _{a_j \in \cup _{i \in S}T_i}w(a_j)\). A coverage function is polynomial-sized if the universe is of size polynomial in n.
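Evaluating a coverage function under this definition is straightforward; a minimal sketch (the instance is illustrative):

```python
def coverage_value(S, T, w):
    """f(S) = sum of weights w(a) over universe items a covered by
    the union of the sets T_i for i in S."""
    covered = set()
    for i in S:
        covered |= T[i]
    return sum(w[a] for a in covered)

# Hypothetical instance: three elements covering a universe of four items.
T = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"d"}}
w = {"a": 1.0, "b": 2.0, "c": 1.0, "d": 5.0}
val = coverage_value({1, 2}, T, w)   # covers {a, b, c}
```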
Coverage functions are a canonical example of monotone submodular functions and are hence optimizable. In terms of learnability, for any constant \(\epsilon \gt 0\), coverage functions are \((1-\epsilon)\)-PMAC learnable over any distribution [2], unlike monotone submodular functions that are generally not PMAC learnable [5]. Somewhat surprisingly, coverage functions are not optimizable from samples.
No algorithm can obtain an approximation better than \(2^{-\Omega (\sqrt {\log n})}\) for maximizing a polynomial-sized coverage function under a cardinality constraint, using polynomially many samples drawn from any distribution.
Coverage functions are heavily used in machine learning [1, 48, 56, 58, 68, 69, 73], data mining [22, 25, 29, 47, 65, 67], mechanism design [15, 26, 31, 33, 34, 57], privacy [41, 49], as well as influence maximization [14, 54, 66] (influence maximization is in fact a generalization of maximizing coverage functions under a cardinality constraint). In many of these applications, the functions are learned from data and the goal is to optimize the function under a cardinality constraint. In addition to learnability and optimizability, coverage functions have many other desirable properties (see Appendix E). One important fact is that they are parametric: If the sets \(T_1,\ldots ,T_n\) are known, then the coverage function is completely defined by the weights \(\lbrace w(a) \ : \ a \in U \rbrace\). Our impossibility result holds even in the case where the sets \(T_1,\ldots ,T_n\) are known.
Technical overview. In the value query model, information theoretic impossibility results use functions defined over a partition of the ground set [37, 59, 72]. The hardness then arises from hiding all the information about the partition from the algorithm. Although the constructions in the OPS model also rely on a partition, the techniques are different, since the impossibility is quasi-polynomial and not constant.
Our constructions partition the ground set of elements into “good,” “bad,” and “masking” elements. The optimal solution is the set of good elements and almost all elements are masking elements. The hardness arises from hiding which parts of the partition are good or bad. We note that even though the algorithm cannot distinguish a bad part of the partition from the good part of the partition, it can learn which elements are masking elements. If an algorithm knows which elements are masking elements, then it can estimate the value of almost all sets, since almost all elements are masking elements, which explains why these functions are learnable. However, the value of the sets that consist of mostly good and bad elements cannot be estimated from samples, which is why these functions are not optimizable from samples.
We begin by describing a framework that reduces the problem of showing hardness results to constructing good and bad functions that satisfy certain properties. The desired good and bad functions must have equal value on small sets of equal sizes and a large gap in value on large sets. Interestingly, a main technical difficulty is to simultaneously satisfy these two simple properties, which we do with novel techniques for constructing coverage functions. Another technical part is the use of tools from pseudorandomness to obtain coverage functions of polynomial size.
1.3 Algorithms for OPS
There are classes of functions and distributions for which optimization from samples is possible. Most of the algorithms use a simple technique that consists of estimating the expected marginal contribution of an element to a random sample. For general submodular functions, we show an essentially tight bound using a non-trivial analysis of an algorithm that uses such estimates.
There exists an \(\tilde{\Omega }(n^{-1/4})\)-OPS algorithm over a distribution \(\mathcal {D}\) for monotone submodular functions. Furthermore, this approximation ratio is essentially tight.
For unit-demand and additive functions, we give near-optimal optimization from samples results. The result for unit-demand is particularly interesting, as it shows one can easily optimize a function from samples even when recovering it is impossible (see Section 4). For monotone submodular functions with curvature c, we obtain a \(((1-c)^2-o(1))\)-OPS algorithm.
1.4 Recent Subsequent Work
Recent follow-up work has studied the optimization from samples model introduced in this article in different settings. An optimal \((1-c)/(1+c-c^2)\) approximation was obtained for optimizing a monotone submodular function with curvature c from samples [7]. Optimization from samples has also been studied in the context of convex optimization [11], submodular minimization [10], and influence maximization [6]. Finally, very recent work circumvents this article’s impossibility result for coverage functions by considering a stronger model with structured samples where the samples contain structural information about the function [20]. In that model, and under three natural assumptions, a \((1-1/e)/2\) approximation is obtained.
Another recent line of work on the adaptive complexity of submodular maximization originated from this article, see, e.g., References [8, 9, 12, 18, 19, 35, 36, 38, 39] and references therein. As discussed in Reference [12], the inapproximability for optimization from samples is a consequence of non-adaptivity: An optimization from samples algorithm cannot make adaptive queries after observing samples. The hardness result for OPS and non-adaptive algorithms raised the following question: How many rounds of adaptive queries are needed for optimizing coverage and submodular functions? Reference [12] shows that \(\tilde{\Theta }(\log n)\) rounds of adaptivity are both necessary and sufficient to obtain a constant factor approximation guarantee for maximizing monotone submodular functions under a cardinality constraint.
1.5 Article Organization
We begin with the hardness result in Section 2. The OPS algorithms are presented in Section 3. We discuss the notion of recoverability in Section 4 and additional related work in Section 5. The proofs are deferred to the appendix.
2 Impossibility of Optimization from Samples
We show that optimization from samples is in general impossible, over any distribution \(\mathcal {D}\), even when the function is learnable and optimizable. Specifically, we show that there exists no constant \(\alpha\) and distribution \(\mathcal {D}\) such that coverage functions are \(\alpha\)-optimizable from samples, even though they are \((1 - \epsilon)\)-PMAC learnable over any distribution \(\mathcal {D}\) and can be maximized under a cardinality constraint within a factor of \(1 - 1/e\). In Section 2.1, we construct a framework that reduces the problem of proving information theoretic lower bounds to constructing functions that satisfy certain properties. We then construct coverage functions that satisfy these properties in Section 2.2.
2.1 A Framework for OPS Hardness
The framework we introduce partitions the ground set of elements into good, bad, and masking elements. We derive two conditions on the values of these elements so that samples do not contain enough information to distinguish good and bad elements with high probability. We then give two additional conditions so that if an algorithm cannot distinguish good and bad elements, then the solution returned by this algorithm has low value compared to the optimal set consisting of the good elements. We begin by defining the partition, which depends on a variable r defined later.
The collection of partitions \(\mathcal {P}\) contains all partitions P of the ground set N into r parts \(T_1, \ldots , T_r\) of k elements each and a part M of the remaining \(n - rk\) elements, where \(n = |N|\).
For some \(i \in [r]\), the elements in \(T_i\) are called the good elements. The bad and masking elements are the elements in \(T_{-i} := \cup _{j = 1, j \ne i}^{r}T_j\) and M, respectively. Next, we define a class of functions \(\mathcal {F}(g,b,m,m^+)\) such that \(f \in \mathcal {F}(g,b,m,m^+)\) is defined in terms of good, bad, and masking functions g, b, and \(m^+\), and a masking fraction m such that \(m(S) \in [0,1]\) for all \(S \subseteq N\).
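A sketch of drawing such a partition and the induced good/bad/masking split (the instance sizes and variable names are ours):

```python
import random

def sample_partition(N, r, k, rng):
    """Draw P ~ U(P): r disjoint parts T_1..T_r of k elements each, plus
    the part M of the n - r*k remaining (masking) elements."""
    elems = list(N)
    rng.shuffle(elems)
    parts = [set(elems[j * k:(j + 1) * k]) for j in range(r)]
    M = set(elems[r * k:])
    return parts, M

rng = random.Random(0)
n, r, k = 100, 4, 5
parts, M = sample_partition(range(n), r, k, rng)
i = rng.randrange(r)                               # index of the good part T_i
good = parts[i]
bad = set().union(*(parts[j] for j in range(r) if j != i))
```

As in the text, almost all elements land in M when \(rk \ll n\).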
Given functions \(g,b,m,m^{+}\), the class of functions \(\mathcal {F}(g,b,m,m^+)\) contains functions \(f^{P,i}\), where \(P \in \mathcal {P}\) and \(i \in [r]\), defined as
\[ f^{P,i}(S) = (1 - m(S \cap M)) \left(g(S \cap T_i) + b(S \cap T_{-i})\right) + m^+(S \cap M). \]
We use probabilistic arguments over the partition \(P \in \mathcal {P}\) and the integer \(i \in [r]\) chosen uniformly at random to show that for any distribution \(\mathcal {D}\) and any algorithm, there exists a function in \(\mathcal {F}(g,b,m,m^+)\) that the algorithm optimizes poorly given samples from \(\mathcal {D}\). The functions \(g,b,m,m^{+}\) have desired properties that are parametrized below. At a high level, the identical-on-small-samples and masking-on-large-samples properties imply that the samples do not contain enough information to learn i, i.e., distinguish good and bad elements, even though the partition P can be learned. The gap and curvature properties imply that if an algorithm cannot distinguish good and bad elements, then the algorithm performs poorly.
The class of functions \(\mathcal {F}(g,b,m,m^{+})\) has an \((\alpha , \beta)\)-gap if the following conditions are satisfied for some t, where \(\mathcal {U}(\mathcal {P})\) is the uniform distribution over \(\mathcal {P}\).
(1)
Identical-on-small-samples. For a fixed \(S:|S| \le t\), with probability \(1 - n^{-\omega (1)}\) over partition \(P \sim \mathcal {U}(\mathcal {P})\), \(g(S \cap T_{i}) + b(S \cap T_{- i})\) is independent of i;
(2)
Masking-on-large-samples. For a fixed \(S :|S| \ge t\), with probability \(1 - n^{-\omega (1)}\) over partition \(P \sim \mathcal {U}(\mathcal {P})\), the masking fraction is \(m(S \cap M) = 1\);
(3)
\(\alpha\)-Gap. Let \(S: |S| =k\), then \(g(S) \ge \max \lbrace \alpha \cdot b(S),\alpha \cdot m^+(S)\rbrace\);
(4)
\(\beta\)-Curvature. Let \(S_1:|S_1| = k\) and \(S_2: |S_2| = k / r\), then \(g(S_1) \ge (1 - \beta) \cdot r \cdot g(S_2)\).
The following lemma reduces the problem of showing an impossibility result to constructing \(g, b,m,\) and \(m^+\), which satisfy the above properties.
Consider a distribution \(\mathcal {D}\). The proof of this result consists of three parts.
(1)
Fix a set S. With probability \(1 - n^{-\omega (1)}\) over \(P \sim \mathcal {U}(\mathcal {P})\), \(f^{P,i}(S)\) is independent of i, by the identical-on-small-samples and masking-on-large-samples properties.
(2)
There exists a partition \(P \in \mathcal {\mathcal {P}}\) such that with probability \(1- n^{-\omega (1)}\) over polynomially many samples \(\mathcal {S}\) drawn from \(\mathcal {D}\), \(f^{P,i}(S)\) is independent of i for all \(S \in \mathcal {S}\). Thus, given samples \(\lbrace (S_j, f^{P,i}(S_j))\rbrace _j\) for such a partition P, the decisions of the algorithm are independent of i.
(3)
There exists \(f^{P,i}\) such that the algorithm does not obtain a \(2\max (1 /(r (1- \beta)), 2/\alpha)\) approximation for \(f^{P,i}\) with samples from \(\mathcal {D}\). This holds by a second probabilistic argument, this time over \(i \in \mathcal {U}([r])\), and by the gap and curvature properties. Intuitively, the gap property ensures that good elements have higher value than bad and masking elements. The curvature property ensures that the set of k good elements has higher value than a small set of good elements. Combined, these two properties imply that a solution that achieves a good approximation must contain a large number of good elements.
2.2 OPS Hardness of Coverage Functions
We use this framework to show that there exists no constant \(\alpha\) and distribution \(\mathcal {D}\) such that coverage functions are \(\alpha\)-optimizable from samples over \(\mathcal {D}\). We first state a definition of coverage functions that is equivalent to the traditional definition and that is used throughout this section.
The construction of good and bad coverage functions g and b that combine the identical-on-small-samples property and a large \(\alpha\)-gap on large sets as needed by the framework is a main technical challenge. The bad function b needs to increase slowly (or not at all) for large sets to obtain a large \(\alpha\)-gap, which requires a non-trivial overlap in the children covered by bad elements (this is related to coverage functions being second-order supermodular [55]). The overlap in children covered by good elements then must be similar (identical-on-small-samples), while the good function still needs to grow quickly for large sets (large gap), as illustrated in Figure 1. We consider the cardinality constraint \(k = n^{2/5 - \epsilon }\) and a number of parts \(r = n^{1/5 - \epsilon }\). At a high level, the proof follows three main steps.
Fig. 1.
(1)
Constructing the good and bad functions. In Section 2.2.1, we construct the good and bad functions whose values are identical-on-small-samples for \(t = n^{3/5 + \epsilon }\), have gap \(\alpha = n^{1/5-\epsilon }\), and curvature \(\beta = o(1)\). These good and bad functions are affine combinations of primitives \(\lbrace C_{p}\rbrace _{p \in \mathbb {N}}\), which are coverage functions with desirable properties;
(2)
Constructing the masking function. In Section 2.2.2, we construct m and \(m^+\) that are masking-on-large-samples for \(t = n^{3/5 + \epsilon }\) and that have a gap \(\alpha = n^{1/5}\). In this construction, masking elements cover the children from functions g and b such that t masking elements cover all the children, but k masking elements only cover an \(n^{-1/5}\) fraction of them.
(3)
From exponential to polynomial-sized coverage functions. The construction from Section 2.2.2 requires exponentially many children and thus only applies to exponential-sized coverage functions. In Section 2.2.3, we prove the hardness result for polynomial-sized coverage functions. This construction relies on constructions of \(\ell\)-wise independent variables to reduce the number of children.
2.2.1 Constructing the Good and the Bad Coverage Functions.
In this section, we describe the construction of good and bad functions that are identical-on-small-samples for \(t = n^{3/5 + \epsilon }\), with a gap \(\alpha = n^{1/5-\epsilon }\) and curvature \(\beta = o(1)\). To do so, we introduce a class of primitive functions \(C_{p}\), through which we express the good and bad functions. For symmetric functions h (i.e., whose value only depends on the size of the set), we abuse notation and simply write \(h(y)\) instead of \(h(S)\) for a set S of size y.
The construction. We begin by describing the primitives we use for the good and bad functions. These primitives are the family \(\lbrace C_{p}\rbrace _{p\in \mathbb {N}}\), which are symmetric, and defined as:
\[C_{p}(y) = p \cdot \left(1 - (1 - 1/p)^y \right).\]
These are coverage functions defined over an exponential number of children.
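Numerically, the primitives behave as the construction requires: each \(C_p\) is monotone with decreasing marginals and saturates at p, while \(C_p(1) = 1\) for every p. A sketch (the probabilistic reading in the comment is our intuition only; the paper's construction is deterministic):

```python
def C(p, y):
    """C_p(y) = p * (1 - (1 - 1/p)^y). One way to read this value: the
    expected weight covered by y elements when each of p unit-weight
    children is covered by each element independently with probability
    1/p (an intuition we add; the article's construction is explicit)."""
    return p * (1.0 - (1.0 - 1.0 / p) ** y)

p = 4
values = [C(p, y) for y in range(20)]
marginals = [b - a for a, b in zip(values, values[1:])]
```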
For a given \(\ell \in [n]\), we construct g and b as affine combinations of \(\ell\) coverage functions \(C_{p_j}(y)\) weighted by variables \(x_j\) for \(j \in [\ell ]\):
Overview of the analysis of the good and bad functions. Observe that if \(g(y) = b^{\prime }(y)\) for all \(y \le \ell\) for some sufficiently large \(\ell\), then we obtain the identical-on-small-samples property. The main idea is to express these \(\ell\) constraints as a system of linear equations \(A \mathbf {x}= \mathbf {y}\) where \(A_{ij} := C_{p_j}(i)\) and \(y_j := j\), with \(i,j \in [\ell ]\). We prove that this matrix has two crucial properties:
(1)
A is invertible. In Lemma A.2, we show that there exists \(\lbrace p_j\rbrace _{j=1}^{\ell }\) such that the matrix A is invertible by interpreting its entries defined by \(C_{p_j}\) as non-zero polynomials of degree \(\ell\). This implies that the system of linear equations \(A\cdot \mathbf {x}= \mathbf {y}\) can be solved and that there exists a coefficient \(\mathbf {x}^{\star }\) needed for our construction of the good and the bad functions;
(2)
\(||\mathbf {x}^{\star }||_{\infty }\) is bounded. In Lemma A.5, we use Cramer’s rule and Hadamard’s inequality to prove that the entries of \(\mathbf {x}^{\star }\) are bounded. This implies that the linear term y in \(g(y)\) dominates \(x^{\star }_j \cdot C_{p_j}(y)\) for large y and all j. This then allows us to prove the curvature and gap properties.
These properties of A imply the desired properties of g and b for \(\ell = \log \log n\).
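The system \(A \mathbf {x} = \mathbf {y}\) with \(A_{ij} = C_{p_j}(i)\) and \(y_i = i\) can be solved exactly over the rationals; a sketch for a small \(\ell\) (the choice \(p_j \in \lbrace 2,3,5\rbrace\) is ours, for illustration, and is not the \(\lbrace p_j\rbrace\) of Lemma A.2):

```python
from fractions import Fraction

def solve(A, y):
    """Exact Gauss-Jordan elimination over the rationals for A x = y."""
    n = len(A)
    M = [[Fraction(A[i][j]) for j in range(n)] + [Fraction(y[i])]
         for i in range(n)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

ell = 3
ps = [2, 3, 5]
Cp = lambda p, y: p * (1 - (1 - Fraction(1, p)) ** y)   # exact C_p(y)
A = [[Cp(ps[j], i + 1) for j in range(ell)] for i in range(ell)]
x = solve(A, list(range(1, ell + 1)))
# x satisfies sum_j x_j * C_{p_j}(i) = i for all i in [ell]
```

Exact arithmetic matters here: the point of Lemma A.5 is precisely to control the magnitude of the entries of \(\mathbf {x}^{\star }\).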
Lemma A.4 shows the identical-on-small-samples property. It uses Lemma A.3, which shows that if \(|S| \le n^{3/5 + \epsilon }\), then with probability \(1 - n^{-\omega (1)}\), \(|S \cap T_j| \le \log \log n\) for all j. The property then follows from the system of linear equations. The gap and curvature properties are proven in Lemmas A.6 and A.7 using the fact that the term y in g dominates the other terms in g and b.
2.2.2 Constructing the Masking Function.
Masking elements make good and bad elements indistinguishable on large samples.
The masking elements. The construction of the coverage functions g and b defined in the previous section is generalized so we can add masking elements M with desirable properties. For each child \(a_\ell\) in the coverage function defined by \(g + b\), we divide \(a_\ell\) into \(n^{3/5}\) children \(a_{\ell ,1}, \ldots , a_{\ell ,n^{3/5}}\) with equal weights \(w(a_{\ell ,j}) = \frac{w(a_\ell)}{n^{3/5}}\) for all j. Each element that covered \(a_\ell\) now covers children \(a_{\ell ,1}, \ldots , a_{\ell ,n^{3/5}}\). Note that the value of \(g(S)\) and \(b(S)\) remains unchanged with this new construction and thus, the previous analysis still holds.
Next, for each \(P \in \mathcal {P}\) and \(i \in [r]\), we define a bipartite graph \(G^{P,i}\) (between elements \(T_i \cup T_{-i} \cup M\) and children) and argue that the coverage functions \(f^{P,i}\) induced by bipartite graphs \(G^{P,i}\) are such that \(f^{P,i}(S) = (1 - m(S \cap M)) (g(S \cap T_i) + b(S \cap T_{-i})) + m^+(S \cap M)\) for some functions \(m^+\) and m. In particular, the children covered by good elements \(T_i\) and bad elements \(T_{-i}\) in \(G^{P,i}\) are the children \(a_{\ell , j}\) covered by \(T_i\) and \(T_{-i}\) in the coverage functions g and b, respectively. For each masking element \(e \in M\), we draw \(j_e \sim \mathcal {U}{[n^{3/5}]}\). The children covered by masking element e in the graph \(G^{P,i}\) are \(a_{\ell , j_e}\), for all \(\ell\).
We observe that the coverage function \(f^{P,i}(S)\) induced by the bipartite graph \(G^{P,i}\) is such that \(f^{P,i}(S) = (1 - m(S \cap M)) (g(S \cap T_i) + b(S \cap T_{-i})) + m^+(S \cap M)\) where the masking function \(m^+(S \cap M)\) is the total weight covered by masking elements \(S \cap M\) and the masking fraction \(m(S \cap M)\) is the fraction of indices \(j \in [n^{3/5}]\) such that j is drawn for at least one element in \(S \cap M\).
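A sketch of the masking fraction m computed from the random indices \(j_e\) (the instance sizes are illustrative stand-ins for \(n^{3/5}\) and \(|M|\)):

```python
import random

def masking_fraction(S_M, j_of, num_indices):
    """m(S ∩ M): the fraction of indices j such that j = j_e for at
    least one masking element e in S ∩ M."""
    hit = {j_of[e] for e in S_M}
    return len(hit) / num_indices

# Hypothetical instance: 50 masking elements, 10 indices; each masking
# element e draws its single index j_e uniformly.
rng = random.Random(1)
M = list(range(50))
j_of = {e: rng.randrange(10) for e in M}
m_few = masking_fraction(set(M[:2]), j_of, 10)
m_all = masking_fraction(set(M), j_of, 10)
```

With many masking elements, every index is hit with high probability, so \(m(S \cap M) = 1\) and the identity above collapses to \(f^{P,i}(S) = m^+(S \cap M)\), revealing nothing about i.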
Masking properties. Masking elements cover children that are already covered by good or bad elements. A large number of masking elements mask the good and bad elements, which implies that good and bad elements are indistinguishable.
•
In Lemma A.8, we prove that the masking property holds for \(t = n^{3/5 + \epsilon }\).
•
We show a gap \(\alpha = n^{1/5}\) in Lemma A.9. For any \(S:|S|\le k\), we have \(g(S)\ge n^{1/5}\cdot m^+(S)\).
An impossibility result for exponential-size coverage functions. Combining the constructions above, the four properties hold with an \((n^{1/5 - \epsilon }, o(1))\)-gap.
2.2.3 From Exponential to Polynomial Size Coverage Functions.
The construction above relies on the primitives \(C_{p}\), which are defined with exponentially many children. In this section, we modify the construction to use primitives \(c_{p}\), which are coverage functions with polynomially many children. The functions in the class \(\mathcal {F}(g,b,m,m^+)\) obtained are then coverage functions with polynomially many children. The functions \(c_{p}\) we construct satisfy \(c_{p}(y) = C_{p}(y)\) for all \(y \le \ell\), and thus the matrix A for polynomial-size coverage functions is identical to the general case. We lower the cardinality constraint to \(k = 2^{\sqrt {\log n}} = |T_j|\) so that the functions \(c_{p}(S \cap T_j)\) need to be defined over only \(2^{\sqrt {\log n}}\) elements. We also lower the number of parts to \(r = 2^{\sqrt {\log n}/2}\).
Maintaining symmetry via \(\ell\)-wise independence. The technical challenge in defining a coverage function with polynomially many children is in maintaining the symmetry of non-trivial size sets. To do so, we construct coverage functions \(\lbrace \zeta ^z\rbrace _{z\in [k]}\) for which the elements that cover a random child are approximately \(\ell\)-wise independent. The next lemma reduces the problem to that of constructing coverage functions \(\zeta ^z\) that satisfy certain properties.
The proof is constructive. We obtain \(c_{p}\) by replacing, for all \(z \in [k]\), all children in \(C_{p}\) that are covered by z elements with children from \(\zeta ^z\), with weights normalized to sum to \(C_{p}(k)\). Next, we construct such \(\zeta ^z\). Assume without loss of generality that k is prime (otherwise pick some prime close to k). Given \(\mathbf {a}\in [k]^{\ell }\), and \(x \in [z]\), let
The children in \(\zeta ^z\) are \(U = \lbrace \mathbf {a}\in [k]^{\ell } \, : \,h_{\mathbf {a}}(x_1) \ne h_{\mathbf {a}}(x_2) \text{ for all distinct } x_1, x_2 \in [z]\rbrace\). The k elements are \(\lbrace j \, : \,0 \le j \lt k\rbrace\). Child \(\mathbf {a}\) is covered by elements \(\lbrace h_{\mathbf {a}}(x) \, : \,x \in [z]\rbrace\). Note that \(|U| \le k^{\ell } = 2^{\ell \sqrt {\log n}}\) and we pick \(\ell = \log \log n\) as previously. The next lemma shows that we obtain the desired properties for \(\zeta ^z\).
At a high level, the proof uses Lemma A.12, which shows that the parents of a random child \(\mathbf {a}\) are approximately \(\ell\)-wise independent. This follows from \(h_{\mathbf {a}}(x)\) being a polynomial of degree \(\ell - 1\), a standard construction for \(\ell\)-wise independent random variables. Then, using inclusion-exclusion over subsets T of a set S of size at most \(\ell\), the probability that T is the set of parents of a child \(\mathbf {a}\) depends only on \(|T|\) by Lemma A.12. Thus, \(\zeta ^z(S)\) depends only on \(|S|\). We are now ready to show the properties for \(g,b,m,m^+\) with polynomially many children.
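The exact definition of \(h_{\mathbf{a}}\) is elided in this excerpt; the standard degree-\((\ell - 1)\) polynomial construction that the discussion of Lemma A.12 points to is \(h_{\mathbf{a}}(x) = \sum_{i=1}^{\ell} a_i x^{i-1} \bmod k\) for prime k. A sketch under that assumption:

```python
def h(a, x, k):
    """h_a(x) = a_1 + a_2*x + ... + a_ell*x^(ell-1) mod k (k prime):
    the standard polynomial construction of ell-wise independent hash
    values. The article's elided definition may differ in details."""
    return sum(a_i * pow(x, i, k) for i, a_i in enumerate(a)) % k

# Children of zeta^z are the a in [k]^ell on which h_a is injective over
# [z]; the parents of child a are the z elements {h_a(x) : x in [z]}.
k, ell, z = 7, 3, 3
children = [a for a in ((a0, a1, a2)
                        for a0 in range(k)
                        for a1 in range(k)
                        for a2 in range(k))
            if len({h(a, x, k) for x in range(z)}) == z]
```

For \(z = \ell = 3\), the map \(\mathbf{a} \mapsto (h_{\mathbf{a}}(0), h_{\mathbf{a}}(1), h_{\mathbf{a}}(2))\) is a bijection of \([k]^{\ell}\) (a Vandermonde system modulo a prime), so exactly \(k(k-1)(k-2)\) of the \(k^{\ell}\) children survive the injectivity filter.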
We construct \(g,b,m,m^+\) as in the general case but in terms of primitives \(c_{p}\) instead of \(C_{p}\). By Lemmas 2.2 and 2.3, we obtain the same matrix A and coefficients \(\mathbf {x}^{\star }\) as in the general case, so the identical-on-small-samples property holds. The masking-on-large-samples and curvature properties hold almost identically as previously. Finally, since k is reduced, the gap \(\alpha\) is reduced to \(2^{\Omega (\sqrt {\log n})}\).

OPS hardness for coverage functions. We get our main result by combining Theorem 2.1 with this \((\alpha = 2^{\Omega (\sqrt {\log n})}, \beta = o(1))\)-gap.
3 Algorithms for OPS
In this section, we describe OPS-algorithms. Our algorithmic approach is not to learn a surrogate function and then optimize the surrogate. Instead, the algorithms estimate the expected marginal contribution of elements to a random sample directly from the samples (Section 3.1) and solve the optimization problem using these estimates. The marginal contribution of an element e to a set S is \(f_S(e) := f(S \cup \lbrace e\rbrace) - f(S)\). If these marginal contributions are decreasing, i.e., \(f_S(e) \ge f_T(e)\) for all \(e \in N\) and \(S \subseteq T \subseteq N\), then f is submodular. If they are non-negative, i.e., \(f_S(e) \ge 0\) for all \(e \in N\) and \(S \subseteq N\), then f is monotone.
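These definitions can be checked by brute force on a tiny instance; below is a sketch with a hypothetical three-element coverage function (the instance and all names are ours, for illustration only):

```python
from itertools import combinations

# Hypothetical toy instance: a small coverage function where
# f(S) = number of children covered by the elements of S.
COVERS = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c", "d"}}
N = list(COVERS)

def f(S):
    return len(set().union(*(COVERS[e] for e in S)))

def marginal(e, S):
    # f_S(e) := f(S ∪ {e}) - f(S)
    return f(S | {e}) - f(S)

def subsets(ground):
    for r in range(len(ground) + 1):
        for c in combinations(ground, r):
            yield set(c)

# Monotone: f_S(e) >= 0 for every element e and set S.
monotone = all(marginal(e, S) >= 0 for S in subsets(N) for e in N)
# Submodular: f_S(e) >= f_T(e) whenever S ⊆ T and e ∉ T.
submodular = all(marginal(e, S) >= marginal(e, T)
                 for S in subsets(N) for T in subsets(N)
                 if S <= T for e in N if e not in T)
```

As expected for a coverage function, both checks succeed on this instance.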
This simple idea turns out to be quite powerful; we use these estimates to develop an \(\tilde{\Omega }(n^{-1/4})\)-OPS algorithm for monotone submodular functions in Section 3.2. This approximation is essentially tight: it matches a hardness result for general submodular functions, shown in Appendix F using the framework from the previous section. In Section 3.3, we show that when samples are generated from a product distribution, there are interesting classes of functions that are amenable to optimization from samples.
3.1 OPS via Estimates of Expected Marginal Contributions
A simple case in which the expected marginal contribution \({\bf {E}}_{S \sim \mathcal {D}|e_i \not\in S}[{f_S(e_i)}]\) of an element \(e_i\) to a random set \(S \sim \mathcal {D}\) can be estimated arbitrarily well is that of product distributions. We now show a simple algorithm we call EEMC, which estimates the expected marginal contribution of an element when the distribution \(\mathcal {D}\) is a product distribution. This estimate is simply the difference between the average value of a sample containing \(e_i\) and the average value of a sample not containing \(e_i\).
The proof consists of the following two steps. First note that
\begin{align*} {\bf {E}}_{S \sim \mathcal {D}|e_i \not\in S }[{f_S(e_i)]} =& \,{\bf {E}}_{S \sim \mathcal {D}|e_i \not\in S }[{f(S \cup e_i)]} - {\bf {E}}_{S \sim \mathcal {D}|e_i \not\in S }[{f(S)]} \\ =&\, {\bf {E}}_{S \sim \mathcal {D}|e_i \in S }[{f(S)]} - {\bf {E}}_{S \sim \mathcal {D}|e_i \not\in S }[{f(S)]} \end{align*}
where the second equality holds because \(\mathcal {D}\) is a product distribution. Then, by standard concentration bounds, the average value \((\sum _{S \in \mathcal {S}_i} f(S)) / |\mathcal {S}_i|\) of the samples containing \(e_i\) estimates \({\bf {E}}_{S \sim \mathcal {D}|e_i \in S }[{f(S)]}\) well. Similarly, \((\sum _{S \in \mathcal {S}_{-i}} f(S)) / |\mathcal {S}_{-i}|\) estimates \({\bf {E}}_{S \sim \mathcal {D}|e_i \not\in S }[{f(S)]}\).
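A minimal sketch of this estimator, using a hypothetical additive objective and a uniform product distribution (EEMC is the paper's name for the procedure; the Python names and instance are ours):

```python
import random

random.seed(0)
n = 6
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # hypothetical additive objective

def f(S):
    return sum(values[e] for e in S)

# Samples S ~ D from a product distribution: each element included w.p. 1/2.
samples = []
for _ in range(20000):
    S = {e for e in range(n) if random.random() < 0.5}
    samples.append((S, f(S)))

def eemc(samples, n):
    """Estimate E_{S~D | e not in S}[f_S(e)] for every element e as the
    difference between the average value of the samples containing e and
    the average value of the samples not containing e."""
    estimates = []
    for e in range(n):
        with_e = [v for S, v in samples if e in S]
        without_e = [v for S, v in samples if e not in S]
        estimates.append(sum(with_e) / len(with_e)
                         - sum(without_e) / len(without_e))
    return estimates

est = eemc(samples, n)
```

For an additive function the expected marginal contribution of \(e_i\) is exactly \(f(\lbrace e_i\rbrace)\), so the estimates converge to the element values; for general submodular f they converge to the expected marginal contribution to a random sample.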
3.2 A Tight Approximation for Submodular Functions
We develop an \(\tilde{\Omega }(n^{-1/4})\)-OPS algorithm over \(\mathcal {D}\) for monotone submodular functions, for some distribution \(\mathcal {D}\). This bound is essentially tight, since submodular functions are not \(n^{-1/4 + \epsilon }\)-optimizable from samples over any distribution (Appendix F). We first describe the distribution for which the approximation holds. Then, we describe the algorithm, which builds upon estimates of expected marginal contributions.
The distribution. Let \(\mathcal {D}_i\) be the uniform distribution over all sets of size i. Define the distribution \(\mathcal {D}^{sub}\) to be the distribution that draws from \(\mathcal {D}_k\), \(\mathcal {D}_{\sqrt {n}}\), and \(\mathcal {D}_{\sqrt {n}+1}\) at random. In Lemma B.2, we generalize Lemma 3.1 to estimate \(\hat{v}_i \approx {\bf {E}}_{S \sim \mathcal {D}_{\sqrt {n} }|e_i \not\in S }[{f_S(e_i)]}\) with samples from \(\mathcal {D}_{\sqrt {n} }\) and \(\mathcal {D}_{\sqrt {n}+1}\).
The algorithm. We begin by computing the expected marginal contributions of all elements. We then place the elements in \(3 \log n\) bins according to their estimated expected marginal contribution \(\hat{v}_i\). The algorithm then simply returns either the best sample of size k or a random subset of size k of a random bin. Up to logarithmic factors, we can restrict our attention to just one bin. We give a formal description below.
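A schematic sketch of this procedure (our paraphrase: the exact bin boundaries and tie-breaking are simplifications, and we return both candidate solutions rather than picking one at random; the analysis argues that one of them performs well):

```python
import math
import random

def ops_submodular(samples, n, k, estimates):
    """Sketch of the algorithm: place elements with positive estimated
    expected marginal contribution into 3*log(n) geometric bins, then
    return the best observed size-k sample and a random size-k subset
    of a random bin. `samples` is a list of (set, value) pairs."""
    v_max = max(estimates)
    num_bins = 3 * max(1, math.ceil(math.log2(n)))
    bins = [[] for _ in range(num_bins)]
    for i, v in enumerate(estimates):
        if v > 0:  # bin j holds elements with estimate about v_max / 2^j
            j = min(num_bins - 1, int(math.log2(v_max / v)))
            bins[j].append(i)
    # Candidate 1: the best observed sample of size k.
    size_k = [(S, val) for S, val in samples if len(S) == k]
    best_sample = set(max(size_k, key=lambda t: t[1])[0]) if size_k else set()
    # Candidate 2: a random size-k subset of a random nonempty bin.
    b = random.choice([b for b in bins if b])
    random_bin_subset = set(random.sample(b, min(k, len(b))))
    return best_sample, random_bin_subset

# Tiny usage example (hypothetical data).
samples = [({0, 1}, 5.0), ({2, 3}, 3.0), ({0, 2}, 4.0)]
best, rnd = ops_submodular(samples, n=4, k=2, estimates=[4.0, 2.0, 1.0, 0.5])
```

The sketch assumes at least one positive estimate; the analysis below explains why one of the two candidates achieves the stated approximation.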
Analysis of the algorithm. The main crux of this result is in the analysis of the algorithm.
The analysis is divided into two cases, depending on whether a random set \(S \sim \mathcal {D}_{\sqrt {n} }\) of size \(\sqrt {n}\) has low value. Let \(S^{\star }\) be the optimal solution.
•
Assume that \({\bf {E}}_{S \sim \mathcal {D}_{\sqrt {n}} }[{f(S)}] \le f(S^{\star }) / 4\). Then, by submodularity, optimal elements have large estimated expected marginal contributions \(\hat{v}_i\). Let \(B^{\star }\) be the bin B with the largest value \(f(B)\) among the bins with contributions \(\hat{v} \ge f(S^{\star }) / (4k)\). We argue that a random subset of \(B^{\star }\) of size k performs well. Lemma B.4 shows that a random subset of \(B^{\star }\) is a \(|B^{\star }| / (4k\sqrt {n})\)-approximation. At a high level, a random subset S of size \(\sqrt {n}\) contains \(|B^{\star }| /\sqrt {n}\) elements from bin \(B^{\star }\) in expectation, and each of these elements contributes at least \(f(S^{\star }) / (4k)\) to the value of \(S_{B^{\star }} := S \cap B^{\star }\). Lemma B.5 shows that a random subset of \(B^{\star }\) is an \(\tilde{\Omega }(k/|B^{\star }|)\)-approximation to \(f(S^{\star })\). The proof first shows that \(f(B^{\star })\) is large given the assumption that a random set \(S \sim \mathcal {D}_{\sqrt {n}}\) has low value, and then uses the fact that a random subset of \(B^{\star }\) of size k is a \(k/|B^{\star }|\)-approximation to \(f(B^{\star })\). Note that either \(|B^{\star }| / (4k\sqrt {n})\) or \(\tilde{\Omega }(k/|B^{\star }|)\) is at least \(\tilde{\Omega }(n^{-1/4})\).
•
Assume that \({\bf {E}}_{S \sim \mathcal {D}_{\sqrt {n}} }[{f(S)}] \ge f(S^{\star }) / 4\). We argue that the best sample of size k performs well. Lemma B.6 shows that, by submodularity, a random set of size k is a \(k / (4\sqrt {n})\)-approximation, since a random set of size k is a factor \(k / \sqrt {n}\) smaller than a random set from \(\mathcal {D}_{\sqrt {n}}\) in expectation. Lemma B.7 shows that the best sample of size k is a \(1/k\)-approximation, since it contains the element with the highest value with high probability. Note that either \(k / (4\sqrt {n})\) or \(1/k\) is at least \(n^{-1/4}\).
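In both cases, the final claim follows from the same observation: the two approximation ratios multiply to \(1/(4\sqrt {n})\), a quantity independent of \(|B^{\star }|\) and k, so by \(\max (a,b) \ge \sqrt {ab}\) their maximum cannot be small (up to constants and the logarithmic factors hidden in \(\tilde{\Omega }(\cdot)\)):

\begin{align*} \max \left(\frac{|B^{\star }|}{4k\sqrt {n}}, \frac{k}{|B^{\star }|}\right) &\ge \sqrt {\frac{|B^{\star }|}{4k\sqrt {n}} \cdot \frac{k}{|B^{\star }|}} = \frac{1}{2n^{1/4}}, \\ \max \left(\frac{k}{4\sqrt {n}}, \frac{1}{k}\right) &\ge \sqrt {\frac{k}{4\sqrt {n}} \cdot \frac{1}{k}} = \frac{1}{2n^{1/4}}. \end{align*}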
3.3 Bounded Curvature and Additive Functions
A simple \(((1-c)^2 - o(1))\)-OPS algorithm for monotone submodular functions with curvature c over product distributions follows immediately from estimating expected marginal contributions. This result was recently improved to \((1-c)/(1+c-c^2)\), which was shown to be tight [7]. An immediate corollary is that additive (linear) functions, which have curvature 0, are \((1 - o(1))\)-OPS over product distributions. The curvature c of a function measures how far the function is from being additive.
The curvature c of a submodular function f is \(c := 1 - \min _{e \in N, S \subseteq N} f_{S \setminus e}(e) / f(e)\).
This definition implies that \(f_S(e) \ge (1 - c) f(e) \ge (1-c)f_T(e)\) for all \(S,T\) and all \(e \not\in S \cup T\): the first inequality is by the definition of curvature, and the second holds since \(f(e) \ge f(T \cup \lbrace e\rbrace) - f(T) = f_T(e)\) by submodularity. The algorithm simply returns the k elements with the highest estimated expected marginal contributions.
The proof follows almost immediately from the definition of curvature. Let S be the set returned by the algorithm and \(S^{\star }\) be the optimal solution, then \(f(S)\) and \(f(S^{\star })\) are sums of marginal contributions of elements in S and \(S^{\star }\), which are each at most a factor \(1-c\) away from their estimated expected marginal contribution by curvature. A \(1 - o(1)\) approximation follows immediately for additive functions, since they have curvature \(c = 0\). A function f is additive if \(f(S) = \sum _{e_i \in S} f(\lbrace e_i\rbrace)\).
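Given the estimates, the algorithm itself is a one-liner; a sketch, with a hypothetical additive instance where curvature \(c = 0\) makes the guarantee exact (MaxMargCont is the paper's name for the algorithm; the instance is ours):

```python
def max_marg_cont(estimates, k):
    """Return the k elements with the largest estimated expected marginal
    contributions (the MaxMargCont algorithm)."""
    order = sorted(range(len(estimates)),
                   key=lambda i: estimates[i], reverse=True)
    return set(order[:k])

# Hypothetical additive instance: the estimates equal the element values,
# so (curvature c = 0) the returned set is exactly the optimal solution.
chosen = max_marg_cont([0.5, 3.0, 1.0, 2.5, 0.1], k=2)
```

For curvature \(c \gt 0\), the same selection loses at most the stated \(1-c\) factors on each marginal contribution.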
4 Recoverability
The largely negative results from the sections above raise the following question: How well must a function be learned for it to be optimizable from samples? One extreme is a notion we refer to as recoverability (REC). A function is recoverable if it can be learned everywhere within an approximation of \(1 \pm 1/n^2\) from samples. Does a function need to be learnable everywhere for it to be optimizable from samples?
A function f is recoverable for distribution \(\mathcal {D}\) if there exists an algorithm that, given a polynomial number of samples drawn from \(\mathcal {D}\), outputs a function \(\tilde{f}\) such that for all sets S, \((1 - 1/n^2) f(S) \le \tilde{f}(S) \le (1 + 1/ n^2) f(S)\),
with probability at least \(1 - \delta\) over the samples and the randomness of the algorithm, where \(\delta \in [0,1)\) is a constant.
This notion of recoverability is similar to the problem of approximating a function everywhere from Goemans et al. [45]. The differences are that recoverability is from samples, whereas their setting allows value queries, and that recoverability requires being within an approximation of \(1 \pm 1/n^2\). It is important to be within such bounds and not within some arbitrarily small constant, because even such small perturbations can lead to an \(O(n^{-1/2 + \delta })\) impossibility result for optimization [50]. We show that if a monotone submodular function f is recoverable, then it is optimizable from samples by running the greedy algorithm on the recovered function \(\tilde{f}\). The proof is similar to the classical analysis of the greedy algorithm.
We show that additive functions are in REC under some mild condition. Combined with the previous result, we get an alternate proof from the previous section for additive functions being \(1 - o(1)\)-optimizable from samples over product distributions.
We also note that submodular functions that are a c-junta for some constant c are recoverable. A function f is a c-junta [43, 61, 70] if it depends only on a set of elements T of size c. If c is constant, then with enough samples, T can be learned, since each element not in T is in at least one sample that does not contain any element in T. For each subset of T, there is also at least one sample that intersects with T in exactly that subset, so f is exactly recoverable.
The previous results lead us to the following question: Does a function need to be recoverable to be optimizable from samples? We show that it is not the case, since unit demand functions are optimizable from samples and not recoverable. A function f is a unit demand function if \(f(S) = \max _{e_i \in S} f(\lbrace e_i\rbrace)\).
We conclude that functions do not need to be learnable everywhere from samples to be optimizable from samples.
5 Additional Related Work
Revenue maximization from samples. The discrepancy between the model on which algorithms optimize and the true state of nature has recently been studied in algorithmic mechanism design. Most closely related to our work are several recent papers (e.g., References [17, 21, 23, 32, 53, 60]) that also consider models that bypass the learning algorithm and let the mechanism designer access samples from a distribution rather than an explicit Bayesian prior. In contrast to our negative conclusions, these papers achieve mostly positive results. In particular, Huang et al. [53] show that the obtainable revenue is much closer to the optimum than the information-theoretic bound on learning the valuation distribution.
Comparison to online learning and reinforcement learning. Another line of work that combines decision-making and learning is online learning (see Reference [51]). In online learning, a player iteratively makes decisions. For each decision, the player incurs a cost and the cost function for the current iteration is immediately revealed. The objective is to minimize regret, which is the difference between the sum of the costs of the decisions of the player and the sum of the costs of the best fixed decision. The fundamental difference with our framework is that decisions are made online after each observation, instead of offline given a collection of observations. In addition, the benchmarks, regret in one case and the optimal solution in the other, are not comparable.
A similar comparison can be made with reinforcement learning, where at each iteration the player typically interacts with a Markov decision process (MDP) [64]. At each iteration, an action is chosen in an online manner and the player receives a reward based on the action and the current state of the MDP. Again, this differs from our setting, where there is one offline decision to be made given a collection of observations.
Additional learning results for submodular functions. In addition to the PMAC learning results mentioned in the introduction for coverage functions, there are multiple learning results for submodular functions. Monotone submodular functions are \(\alpha\)-PMAC learnable over product distributions for some constant \(\alpha\) under some assumptions [5]. Impossibility results arise for general distributions, in which case submodular functions are not \(\tilde{\Omega }(n^{-1/3})\)-PMAC learnable [5]. Finally, submodular functions can be \((1-\epsilon)\)-PMAC learned over the uniform distribution over all sets with a running time and sample complexity exponential in \(1/\epsilon\) and polynomial in n [43]. This exponential dependency is necessary, since \(2^{\Omega (\epsilon ^{-2/3})}\) samples are needed to learn submodular functions with \(\ell _1\)-error of \(\epsilon\) over this distribution [42].
6 Conclusion
In many optimization problems, we often do not know the objective function we wish to optimize and instead learn it from data. In this article, we ask whether we can actually obtain good guarantees when optimizing objective functions from the training data that is used to learn them. To answer this question, we introduce the optimization from samples model. In this model, we are given samples \(\lbrace S_i, f(S_i)\rbrace _i\) of an unknown objective function \(f : 2^N \rightarrow \mathbb {R}\) and the goal is to optimize the function f. Our main result is that the class of coverage functions, which is both PMAC-learnable and optimizable, is not optimizable from samples. In particular, this result implies that there is no guarantee achievable when optimizing a function that has been learned from data, even if this function is both learnable and optimizable.
On the positive side, we give an algorithm for maximizing monotone submodular functions under a cardinality constraint from samples that achieves an \(\tilde{\Omega }(n^{-1/4})\) approximation, which we show is tight up to lower order terms. We also give constant factor approximation algorithms for optimizing from samples unit-demand functions, additive functions, and monotone submodular functions with bounded curvature.
A first natural direction for future work is to study whether optimization from samples is possible for other optimization problems. In this direction, recent subsequent work has obtained both positive and negative results [6, 7, 10, 11, 20]. More broadly, the impossibility result for optimizing coverage functions from samples implies that PMAC-learning is not a sufficiently strong concept to obtain guarantees when optimizing a function that is learned from data. Thus, a more conceptual and open-ended direction is to develop new learning concepts that provide guarantees when optimizing learned functions. Such a learning concept would need to depart from measuring average case performance.
Footnotes
1
The notation \(m^+\) refers to the role of this function, which is to maintain monotonicity of masking elements. These four functions are assumed to be normalized such that \(g(\emptyset) = b(\emptyset) = m(\emptyset) = m^+(\emptyset) = 0\).
2
The marginals are bounded if for all e, \(e \in S\sim \mathcal {D}\) and \(e \not\in S\sim \mathcal {D}\) w.p. at least \(1/ \operatorname{poly}(n)\) and at most \(1 - 1/\operatorname{poly}(n)\).
3
Formally, with exponentially high probability means with probability at least \(1 - e^{-\Omega (n^{\epsilon })}\) for some constant \(\epsilon \gt 0\).
4
The marginals are bounded if for all e, \(e \in S\sim \mathcal {D}\) and \(e \not\in S\sim \mathcal {D}\) w.p. at least \(1/ \operatorname{poly}(n)\) and at most \(1 - 1/\operatorname{poly}(n)\).
5
For simplicity, this proof uses estimates that we know how to compute. However, the values \(f(\lbrace e_i\rbrace)\) can be recovered exactly by solving a system of linear equations in which each row corresponds to a sample, provided that the matrix of this system is invertible, which is the case with a sufficiently large number of samples by results from random matrix theory such as the survey by Blake and Studholme [13].
Appendices A Impossibility of OPS
A Framework for OPS Hardness
We reduce the problem of showing hardness results to the problem of constructing \(g,b,m,m^+\) with an \((\alpha , \beta)\)-gap. Recall that a partition P has r parts \(T_1, \ldots , T_r\) of k elements and a part M of remaining \(n - rk\) elements. The functions \(f^{P,i}(S) \in \mathcal {F}(g,b,m,m^+)\) are defined as \(f^{P,i}(S) := (1 - m(S \cap M)) \left(g(S \cap T_i) + b(S \cap T_{-i})\right) + m^+(S \cap M)\) with \(i \in [r]\).
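For concreteness, the definition of \(f^{P,i}\) can be transcribed directly, with \(g,b,m,m^{+}\) passed in as callables (the toy functions in the usage below are placeholders of ours, not the paper's constructions):

```python
def make_f(g, b, m, m_plus, parts, M, i):
    """Build f^{P,i}(S) = (1 - m(S∩M)) * (g(S∩T_i) + b(S∩T_{-i})) + m^+(S∩M),
    where T_{-i} is the union of all parts except T_i."""
    T_i = parts[i]
    T_minus_i = set().union(*(T for j, T in enumerate(parts) if j != i))
    def f(S):
        S = set(S)
        return ((1 - m(S & M)) * (g(S & T_i) + b(S & T_minus_i))
                + m_plus(S & M))
    return f

# Toy placeholders (not the paper's functions): g counts, b halves, no masking.
parts = [{0, 1}, {2, 3}]
M = {4, 5}
f = make_f(lambda S: len(S), lambda S: len(S) / 2,
           lambda S: 0.0, lambda S: 0.0, parts, M, i=0)
```

With these placeholders, \(f(\lbrace 0,1,2\rbrace) = g(\lbrace 0,1\rbrace) + b(\lbrace 2\rbrace) = 2.5\), illustrating how the good part \(T_i\) and the bad parts \(T_{-i}\) are scored separately.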
Theorem 2.1. Assume the functions \(g,b,m,m^{+}\) have an \((\alpha , \beta)\)-gap; then \(\mathcal {F}(g,b,m,m^{+})\) is not \(2\max (1 /(r (1- \beta)), 2/\alpha)\)-optimizable from samples over any distribution \(\mathcal {D}\).
OPS Hardness of Coverage Functions
We consider the cardinality constraint \(k = n^{2/5 - \epsilon }\) and the number of parts \(r = n^{1/5 - \epsilon }\).
Constructing the Good and the Bad Coverage Functions.
For symmetric functions h (i.e., whose value only depends on the size of the set), we abuse notation and simply write \(h(y)\) instead of \(h(S)\) for a set S of size y. We begin by showing that the primitives \(C_{p}(y) = p \cdot \left(1 - (1 - 1/p)^y \right)\) (illustrated in Figure 2) are coverage functions. It then follows that the functions g and b are coverage.
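A quick numerical check of these primitives (our sketch; it verifies \(C_{p}(0) = 0\), \(C_{p}(1) = 1\), boundedness by p, and decreasing marginals in y):

```python
def C(p, y):
    """The coverage primitive C_p(y) = p * (1 - (1 - 1/p)^y)."""
    return p * (1 - (1 - 1 / p) ** y)

p = 8
vals = [C(p, y) for y in range(12)]
margs = [b - a for a, b in zip(vals, vals[1:])]
# C_p(0) = 0, C_p(1) = 1, C_p is increasing toward p, and the marginal
# contribution of the y-th element, (1 - 1/p)^y, is strictly decreasing.
```

The strictly decreasing marginals are what make the symmetric function \(C_{p}\) submodular.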
In the remainder of this section, we prove Lemma 2.1. Recall that for some parameter t, the identical-on-small-samples property must hold for any fixed set S with \(|S| \le t\).
The good and bad functions are defined as \(g(y) = y + \sum _{j \, : \,x_j \lt 0} (-x_j) C_{p_j}(y)\) and \(b(S) = \sum _{j=1, j \ne i}^r b^{\prime }(S \cap T_j)\) with \(b^{\prime }(y) := \sum _{j \, : \,x_j \gt 0} x_j C_{p_j}(y)\). We obtain the coefficients \(\mathbf {x}\) by solving the system of linear equations \(A \mathbf {x}= \mathbf {y}\) where \(A_{ij} := C_{p_j}(i)\) and \(y_j := j\), as illustrated in Figure 3, with \(i,j \in [\ell ]\).
Fig. 3.
To prove Lemma 2.1, we begin by showing that A is invertible in Lemma A.2, so the coefficients \(\mathbf {x}\) satisfying the system of linear equations exist. We then show the three desired properties. Lemma A.3 shows that a set S of size at most \(n^{3/5 + \epsilon }\) contains at most \(\ell\) elements from any part \(T_j\) w.p. \(1 - n^{-\omega (1)}\), thus the identical-on-small-samples property holds by the system of linear equations (Lemma A.4). Lemma A.5 bounds the coefficients \(\mathbf {x}\), thus the y term in the good function dominates and we obtain the gap (Lemma A.6) and curvature (Lemma A.7) properties.
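A small instance of this system, solved exactly over the rationals (the choices \(\ell = 3\) and \(p_j \in \lbrace 2,4,8\rbrace\) are ours, for illustration). Once \(A\mathbf {x}= \mathbf {y}\) is solved, the resulting good and bad primitives agree on all sizes up to \(\ell\) but diverge beyond:

```python
from fractions import Fraction

def C(p, y):
    """The coverage primitive C_p(y) = p(1 - (1 - 1/p)^y), computed exactly."""
    p = Fraction(p)
    return p * (1 - (1 - 1 / p) ** y)

ell = 3
ps = [2, 4, 8]                              # illustrative parameters p_j
A = [[C(ps[j], i + 1) for j in range(ell)] for i in range(ell)]
y = [Fraction(i + 1) for i in range(ell)]   # y_j := j

def solve(A, y):
    """Gauss-Jordan elimination over the rationals."""
    A = [row[:] + [rhs] for row, rhs in zip(A, y)]
    n = len(A)
    for c in range(n):
        piv = next(r for r in range(c, n) if A[r][c] != 0)
        A[c], A[piv] = A[piv], A[c]
        A[c] = [v / A[c][c] for v in A[c]]
        for r in range(n):
            if r != c and A[r][c] != 0:
                A[r] = [vr - A[r][c] * vc for vr, vc in zip(A[r], A[c])]
    return [row[-1] for row in A]

x = solve(A, y)
good = lambda s: s + sum(-x[j] * C(ps[j], s) for j in range(ell) if x[j] < 0)
bad = lambda s: sum(x[j] * C(ps[j], s) for j in range(ell) if x[j] > 0)
# A x = y gives sum_j x_j C_{p_j}(i) = i for i <= ell, hence b'(i) = g(i).
```

For this instance the solution is \(\mathbf {x}= (1/3, -2, 8/3)\): the positive coefficients (for \(p = 2, 8\)) define \(b^{\prime }\) and the negative one (for \(p = 4\)) contributes to g, so \(g(i) = b^{\prime }(i)\) for \(i \le 3\) while \(g(y) - b^{\prime }(y)\) grows linearly in y, which is the source of the gap.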
We need the following lemma to show the identical-on-small-samples property:
For coverage functions, we let \(\ell = \log \log n\).
The gap and curvature properties require bounding the coefficients \(\mathbf {x}\) (Lemma A.5). We recall two basic results from linear algebra (Theorems A.1 and A.2) that are used to bound the coefficients.
Using the bounds previously shown for \(\mathbf {x}^{\star }\), the two following lemmas establish the gap \(\alpha\) and curvature \(\beta\) of the good and bad functions \(g(\cdot)\) and \(b(\cdot)\).
Proof. We show the gap between the good and the bad function on a set S of size k. Recall that \(b(S) \le r \cdot { b^{\prime }}(k) = r\cdot \sum _{j \, : \,x_j^{\star } \gt 0, j \le \ell } x_j^{\star } C_{p_j}(k)\). We can bound each summand as:
and therefore \({ b^{\prime }}(k) \le \ell ^{O(\ell ^4)}\). On the other hand, the good function is bounded from below by the cardinality: \(g(k) \ge k\). Plugging in \(k = n^{2/5-\epsilon }\), \(r = n^{1/5 - \epsilon }\), and \(\ell = \log \log n\), we get the following gap \(\alpha\):
Finally, combining Lemmas A.4, A.6, and A.7, we get Lemma 2.1.
Constructing the Masking Function.
To obtain the desired properties of the masking functions \(m^+\) and masking fraction m, each child \(a_i\) in the universe of \(g+b\) is divided into \(n^{3/5}\) children \(a_{i,1}, \ldots , a_{i,n^{3/5}}\) of equal weights \(w(a_i) / n^{3/5}\). For each masking element, draw \(j \sim \mathcal {U}([n^{3/5}])\); this masking element then covers \(a_{i,j}\) for all i. The function \(m^+(S)\) is then the total weight covered by masking elements S and the masking fraction \(m(S)\) is the fraction of \(j \in [n^{3/5}]\) such that j is drawn for at least one element in S. Lemmas A.8 and A.9 show the masking property on large samples and the \(\alpha\)-gap for masking elements. We begin by stating the Chernoff bound, used in Lemma A.8.
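A small simulation of this construction (sizes shrunk; variable names ours). It checks the identity \(m^+(S) = W \cdot m(S)\), where W is the total weight of the children of \(g+b\): covering one index j gains exactly weight \(W / n^{3/5}\):

```python
import random

random.seed(1)
B = 16                       # stands in for the n^{3/5} copies per child
weights = [3.0, 1.0, 2.0]    # weights w(a_i) of the children of g + b
W = sum(weights)

# Each masking element independently draws j ~ U([B]) and covers a_{i,j}
# for every child a_i.
masks = {e: random.randrange(B) for e in range(10)}

def m(S):
    """Masking fraction: fraction of indices j hit by the elements of S."""
    return len({masks[e] for e in S}) / B

def m_plus(S):
    """Total weight covered by the masking elements in S."""
    return sum(w / B for w in weights) * len({masks[e] for e in S})

S = {0, 1, 2, 3}
```

As more masking elements are added, distinct indices j accumulate and \(m(S)\) approaches 1, which is exactly the masking-on-large-samples behavior the Chernoff bound formalizes.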
Combining Lemmas 2.1, A.8, and A.9, we obtain an \((n^{1/5 - \epsilon }, o(1))\)-gap. The main result for exponential size coverage functions then follows from Theorem 2.1.
From Exponential to Polynomial Size Coverage Functions.
We modify \(C_{p}\) to use primitives \(c_{p}\) that are coverage functions with polynomially many children. The function class \(\mathcal {F}(g,b,m,m^+)\) obtained then consists of coverage functions over a polynomial-size universe. The matrix A for polynomial-size coverage functions is identical to the one in the general case. We lower the cardinality constraint to \(k = 2^{\sqrt {\log n}} = |T_j|\) so the functions \(c_{p}(S \cap T_j)\) need to be defined over only \(2^{\sqrt {\log n}}\) elements. We also lower the number of parts to \(r = 2^{\sqrt {\log n}/2}\).
The main technical challenge is to obtain symmetric coverage functions for sets of size at most \(\ell\) with polynomially many children. We start by reducing the problem to constructing such functions with certain properties in Lemma 2.2. We then construct such functions and prove they satisfy these properties in Lemma 2.3. Combining these Lemmas, we obtain a \((2^{\Omega (\sqrt {\log n})}, o(1))\)-gap (Lemma 2.4).
We now construct such \(\zeta ^z\). Assume without loss of generality that k is prime (otherwise, pick some prime close to k). Given \(\mathbf {a}\in [k]^{\ell }\) and \(x \in [z]\), let \(h_{\mathbf {a}}(x) := \sum _{i \in [\ell ]} a_i x^i \mod {k}\). The children in \(\zeta ^z\) are \(U = \lbrace \mathbf {a}\in [k]^{\ell } \, : \,h_{\mathbf {a}}(x_1) \ne h_{\mathbf {a}}(x_2) \text{ for all distinct } x_1, x_2 \in [z]\rbrace\). The k elements are \(\lbrace j \, : \,0 \le j \lt k\rbrace\). Child \(\mathbf {a}\) is covered by the elements \(\lbrace h_{\mathbf {a}}(x) \, : \,x \in [z]\rbrace\). Note that \(|U| \le k^{\ell } = 2^{\ell \sqrt {\log n}}\) and we pick \(\ell = \log \log n\) as previously. The following lemma is useful to show the symmetry of \(\zeta ^z\):
We are now ready to show the main lemma for the coverage functions \(\zeta ^z(\cdot)\).
We obtain an \((\alpha = 2^{\Omega (\sqrt {\log n})}, \beta = o(1))\)-gap for polynomial sized coverage functions by using the primitives \(c_p\).
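To make the \(\zeta ^z\) construction concrete, here is a tiny instance (we take \(h_{\mathbf {a}}\) to be a degree-\((\ell -1)\) polynomial with a constant term, \(h_{\mathbf {a}}(x) = \sum _{i=0}^{\ell -1} a_i x^i \bmod k\), the standard \(\ell\)-wise independent family; the parameters are shrunk far below those in the proof):

```python
from itertools import product, combinations

k, ell, z = 5, 2, 2   # k prime; tiny parameters for illustration

def h(a, x):
    return sum(a_i * x ** i for i, a_i in enumerate(a)) % k

X = range(z)
# Children: coefficient vectors a whose z hash values are pairwise distinct.
U = [a for a in product(range(k), repeat=ell)
     if len({h(a, x) for x in X}) == z]

def parents(a):
    return {h(a, x) for x in X}

def zeta(S):
    """Coverage value: number of children covered by some element of S."""
    S = set(S)
    return sum(1 for a in U if parents(a) & S)

# Symmetry on small sets: zeta(S) depends only on |S| for |S| <= ell.
sizes = {r: {zeta(set(S)) for S in combinations(range(k), r)}
         for r in range(1, ell + 1)}
```

On this instance every singleton covers the same number of children and every pair covers the same number of children, which is exactly the symmetry used in the proof.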
We conclude with the main result for coverage functions by combining Claim 2, Lemma 2.4, and Theorem 2.1.
B Algorithms for OPS
OPS via Estimates of Expected Marginal Contributions
We denote by \(\mathcal {S}_i\) and \(\mathcal {S}_{-i}\) the collections of all samples containing and not containing element \(e_i\), respectively. The estimate \(\hat{v}_i\) is then the difference in the average value of a sample in \(\mathcal {S}_i\) and the average value of a sample in \(\mathcal {S}_{-i}\). By standard concentration bounds (Hoeffding’s inequality, Lemma B.1), these are good estimates of \({\bf {E}}_{S \sim \mathcal {D}|e_i \not\in S }[{f_S(e_i)]}\) for product distributions \(\mathcal {D}\) (Lemma 3.1).
A Tight Approximation for Submodular Functions
Let \(\mathcal {D}_i\) be the uniform distribution over all sets of size i. Define the distribution \(\mathcal {D}^{sub}\) to be the distribution that draws from \(\mathcal {D}_k\), \(\mathcal {D}_{\sqrt {n}}\), and \(\mathcal {D}_{\sqrt {n}+1}\) at random. This section is devoted to showing that Algorithm 2 is an \(\tilde{\Omega }(n^{-1/4})\)-OPS algorithm over \(\mathcal {D}^{sub}\) (Theorem 3.1). Define \(\mathcal {S}_{i,j}\) and \(\mathcal {S}_{-i,j}\) to be the collections of samples of size j containing and not containing \(e_i\), respectively. For Algorithm 2, we use a slight variation of EEMC(\(\mathcal {S}\)) where the estimates are
These are good estimates of \({\bf {E}}_{S \sim \mathcal {D}_{\sqrt {n}}|e_i\not\in S }[{f_S(e_i)]}\), as shown by the following lemma. The proof follows almost identically as the proof for Lemma 3.1.
Next, we show a simple lemma that is useful when returning random sets (Lemma B.3). The analysis is then divided into two cases, depending on whether a random set \(S \sim \mathcal {D}_{\sqrt {n} }\) has low value. If a random set has low value, then we obtain an \(\tilde{\Omega }(n^{-1/4})\)-approximation by Corollary 2. Corollary 2 combines Lemmas B.4 and B.5, which, respectively, obtain \(t / (4k\sqrt {n})\) and \(k/t\) approximations. If a random set has high value, then we obtain an \(n^{-1/4}\)-approximation by Corollary 3. Corollary 3 combines Lemmas B.6 and B.7, which, respectively, obtain \(k / (4\sqrt {n})\) and \(1/k\) approximations.
In the first case of the analysis, we assume that \({\bf {E}}_{S \sim \mathcal {D}_{\sqrt {n} } }[{f(S)}] \le f(S^{\star }) / 4\). Let \(j^{\prime }\) be the largest j such that bin \(B_{j}\) contains at least one element \(e_i\) with \(\hat{v}_i \ge f(S^{\star }) / (2k)\). Then any element \(e_i \in B_j\) with \(j \le j^{\prime }\) satisfies \(\hat{v}_i \ge f(S^{\star }) / (4k)\). Define \(B^{\star } = \operatorname{argmax}_{B_j: j \le j^{\prime }} f(S^{\star } \cap B_j)\) to be the bin with high marginal contributions that has the highest value from the optimal solution. Let t be the size of \(B^{\star }\).
In the second case of the proof, we assume that \({\bf {E}}_{S \sim \mathcal {D}_{\sqrt {n} } }[{f(S)}] \ge f(S^{\star }) / 4\).
Proof. If \(k \ge \sqrt {n}\), then a uniformly random set of size k is a \(1/4\)-approximation to \(f(S^{\star })\) by monotonicity. Otherwise, a uniformly random subset of size k of N is a uniformly random subset of size k of a uniformly random subset of size \(\sqrt {n}\) of N. So by Lemma B.3,
By combining Corollaries 2 and 3, we obtain the main result for this section.
Bounded Curvature and Additive Functions
The algorithm MaxMargCont simply returns the k elements with the largest estimate \(\hat{v}_i\).
Proof. Let \(S^{\star } = \lbrace e^{\star }_1, \ldots , e^{\star }_k\rbrace\) be the optimal solution and \(S = \lbrace e_1, \ldots , e_k\rbrace\) be the set returned by Algorithm 3. Let \(S_i^{\star } := \lbrace e^{\star }_1, \ldots , e^{\star }_i\rbrace\) and \(S_{i} := \lbrace e_{1}, \ldots , e_{i}\rbrace\). If \(e_j \not\in S\), then
A function f is recoverable for distribution \(\mathcal {D}\) if given samples drawn from \(\mathcal {D}\), it is possible to output a function \(\tilde{f}(\cdot)\) such that for all sets S, \((1 - 1/n^2) f(S) \le \tilde{f}(S) \le (1 + 1/ n^2) f(S)\) with high probability over the samples and the randomness of the algorithm.
Proof. We have already shown that the expected marginal contribution of an element to a random set of size \(k-1\) can be estimated from samples for submodular functions.5 In the case of additive functions, this marginal contribution of an element is its value \(f(\lbrace e_i\rbrace)\).
We apply Lemma 3.1 with \(\epsilon = f(\lbrace e_i\rbrace) / n^2\) to compute \(\hat{v}_i\) such that \(|\hat{v}_i - f(\lbrace e_i\rbrace)| \le f(\lbrace e_i\rbrace) / n^2\). Note that \(\epsilon = f(\lbrace e_i\rbrace) / n^2\) satisfies \(\epsilon \ge f(S^{\star }) / \operatorname{poly}(n)\), since \(v_{min} \ge v_{max} / \operatorname{poly}(n)\). Let \(\tilde{f}(S) = \sum _{e_i \in S} \hat{v}_i\), then
As a model for statistical learnability, we use the notion of PAC learnability due to Valiant [71] and its generalization to real-valued functions, PMAC learnability, due to Balcan and Harvey [5]. Let \(\mathcal {F}\) be a hypothesis class of functions \(\lbrace f_{1},f_{2},\ldots \rbrace\) with \(f_i : 2^N \rightarrow \mathbb {R}\). Given precision parameters \(\epsilon \gt 0\) and \(\delta \gt 0\), the input to a learning algorithm is a set of samples \(\lbrace S_{i},f(S_i)\rbrace _{i=1}^{m}\), where the \(S_i\)’s are drawn i.i.d. from some distribution \(\mathcal {D}\) and the number of samples m is polynomial in \(1/\epsilon , 1/\delta\), and n. The learning algorithm outputs a function \(\widetilde{f}: 2^N \rightarrow \mathbb {R}\) that should approximate f in the following sense:
•
\(\mathcal {F}\) is PAC-learnable on distribution \(\mathcal {D}\) if there exists a (not necessarily polynomial time) learning algorithm such that for every \(\epsilon , \delta \gt 0\), with probability at least \(1 - \delta\) over the samples, \(\Pr _{S \sim \mathcal {D}}\left[\widetilde{f}(S) = f(S)\right] \ge 1 - \epsilon\).
•
\(\mathcal {F}\) is \(\alpha\)-PMAC-learnable on distribution \(\mathcal {D}\) if there exists a (not necessarily polynomial time) learning algorithm such that for every \(\epsilon , \delta \gt 0\), with probability at least \(1 - \delta\) over the samples, \(\Pr _{S \sim \mathcal {D}}\left[\alpha \cdot f(S) \le \widetilde{f}(S) \le f(S)\right] \ge 1 - \epsilon\).
A class \(\mathcal {F}\) is PAC (or \(\alpha\)-PMAC) learnable if it is PAC- (\(\alpha\)-PMAC)-learnable on every distribution \(\mathcal {D}\).
E Discussion
Beyond set functions. Thinking about models as set functions is a useful abstraction, but optimization from samples can be considered for general optimization problems. Instead of the max-k-cover problem, one may ask whether samples of spanning trees can be used for finding an approximately minimum spanning tree. Similarly, one may ask whether shortest paths, matching, maximal likelihood in phylogenetic trees, or any other problem where crucial aspects of the objective functions are learned from data, is optimizable from samples.
Coverage functions. In addition to their stronger learning guarantees, coverage functions have additional guarantees that distinguish them from general monotone submodular functions.
•
Any polynomial-sized coverage function can be exactly recovered, i.e., learned exactly for all sets, using polynomially many (adaptive) queries to a value oracle [16]. In contrast, there are monotone submodular functions for which no algorithm can recover the function using fewer than exponentially many value queries [16]. It is thus interesting that despite being a distinguished class within submodular functions with enough structure to be exactly recovered via adaptive queries, polynomial-sized coverage functions are inapproximable from samples.
•
In mechanism design, one seeks to design polynomial-time mechanisms that have desirable properties in equilibrium (e.g., truthful-in-expectation). Although there is an impossibility result for general submodular functions [34], one can show that for coverage functions there is a mechanism that is truthful-in-expectation [31, 33].
F Hardness of Submodular Functions
Using the hardness framework from Section 2.1, it is relatively easy to show that submodular functions are not \(n^{-1/4 + \epsilon }\)-OPS over any distribution \(\mathcal {D}\). The good, bad, and masking functions \(g,b,m,m^+\) we use are:
It is easy to show that \(\mathcal {F}(g,b,m,m^+)\) is a class of monotone submodular functions (Lemma F.2). To derive the optimal \(n^{-1/4+\epsilon }\) impossibility, we consider the cardinality constraint \(k = n^{1/4 - \epsilon /2}\) and the size of the partition to be \(r = n^{1/4}\). We show that \(\mathcal {F}(g,b,m,m^+)\) has an \((n^{1/4 - \epsilon }, 0)\)-gap.
Proof. We show that these functions satisfy the properties to have an \((n^{1/4 - \epsilon }, 0)\)-gap.
•
Identical-on-small-samples. Assume \(|S| \le n^{1/2 + \epsilon /4}\). Then \(|T_{-i}|\cdot |S| / n \le n^{1/2- \epsilon /2} \cdot n^{1/2 + \epsilon /4} / n \le n^{-\epsilon /4}\), so by Lemma A.3, \(|S \cap T_{-i}| \le \log n\) w.p. \(1- o(1)\) over \(P \sim \mathcal {U}(\mathcal {P})\). Thus, the functions in the class are identical on such samples.
•
Masking-on-large-samples. Assume \(|S| \ge n^{1/2 +\epsilon /4}\). Then \(|S \cap M| \ge n^{1/2}\) with exponentially high probability over \(P \sim \mathcal {U}(\mathcal {P})\) by a Chernoff bound (Lemma A.7), and \(m(S \cap M) = 1\) w.p. at least \(1- o(1)\).
•
Gap \(n^{1/4 - \epsilon }\). Note that \(g(S) = k = n^{1/4 - \epsilon /2}\), \(b(S) = \log n\), \(m^+(S) = n^{-\epsilon /2}\) for \(|S| = k\), so \(g(S) \ge n^{1/4 - \epsilon } b(S)\) for n large enough and \(g(S) = n^{1/4} m^+(S)\).
•
Curvature \(\beta = 0\). The curvature \(\beta = 0\) follows from \(g\) being linear.\(\Box\)
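The two inequalities in the gap computation can be checked numerically. In this sketch, \(n\) and \(\epsilon\) are arbitrary illustrative choices, with \(n\) taken large enough that \(n^{\epsilon /2} \ge \log n\):

```python
import math

# Sanity check of the gap computation: g(S) = k = n^(1/4 - eps/2),
# b(S) = log n, and m+(S) = n^(-eps/2) for |S| = k.
# n and eps are arbitrary illustrative choices (n must satisfy
# n^(eps/2) >= log n for the first inequality to kick in).
n, eps = 1e45, 0.1
g_val = n ** (0.25 - eps / 2)       # g(S) = k
b_val = math.log(n)                  # b(S) = log n
m_plus = n ** (-eps / 2)             # m+(S)

# Gap of n^(1/4 - eps) over the bad function, for n large enough.
assert g_val >= n ** (0.25 - eps) * b_val
# Exact relation g(S) = n^(1/4) * m+(S).
assert math.isclose(g_val, n ** 0.25 * m_plus)
```

For small \(n\) the first assertion fails, matching the "for n large enough" qualifier in the proof: the gap over \(b\) only dominates once \(n^{\epsilon /2}\) outgrows \(\log n\).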
We then show that the functions obtained are indeed monotone submodular.
Together with Theorem 2.1, these two lemmas imply the hardness result.
References
Ioannis Antonellis, Anish Das Sarma, and Shaddin Dughmi. 2012. Dynamic covering for recommendation systems. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management. ACM, 26–34.
Ashwinkumar Badanidiyuru, Shahar Dobzinski, Hu Fu, Robert Kleinberg, Noam Nisan, and Tim Roughgarden. 2012. Sketching valuation functions. In Proceedings of the 23rd Annual ACM-SIAM Symposium on Discrete Algorithms.
Maria-Florina Balcan. 2015. Learning submodular functions with applications to multi-agent systems. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems.
Maria-Florina Balcan, Florin Constantin, Satoru Iwata, and Lei Wang. 2012. Learning valuation functions. In Proceedings of the 25th Annual Conference on Learning Theory.
Eric Balkanski, Nicole Immorlica, and Yaron Singer. 2017. The importance of communities for learning to influence. In Proceedings of the Conference on Neural Information Processing Systems.
Eric Balkanski, Aviad Rubinstein, and Yaron Singer. 2016. The power of optimization from samples. In Proceedings of the Conference on Neural Information Processing Systems.
Eric Balkanski, Aviad Rubinstein, and Yaron Singer. 2019. An exponential speedup in parallel running time for submodular maximization without loss in approximation. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms.
Eric Balkanski, Aviad Rubinstein, and Yaron Singer. 2019. An optimal approximation for submodular maximization under a matroid constraint in the adaptive complexity model. In Proceedings of the ACM Symposium on Theory of Computing.
Eric Balkanski and Yaron Singer. 2017. Minimizing a submodular function from samples. In Proceedings of the Conference on Neural Information Processing Systems.
Eric Balkanski and Yaron Singer. 2018. The adaptive complexity of maximizing a submodular function. In Proceedings of the ACM Symposium on Theory of Computing.
Ian F. Blake and Chris Studholme. 2006. Properties of random matrices and applications. Unpublished report. Retrieved from http://www.cs.toronto.edu/~cvs/coding.
Christian Borgs, Michael Brautbar, Jennifer Chayes, and Brendan Lucier. 2014. Maximizing social influence in nearly optimal time. In Proceedings of the 25th Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 946–957.
Dave Buchfuhrer, Michael Schapira, and Yaron Singer. 2010. Computation and incentives in combinatorial public projects. In Proceedings of the 11th ACM Conference on Electronic Commerce. ACM, 33–42.
Deeparnab Chakrabarty and Zhiyi Huang. 2012. Testing coverage functions. In Proceedings of the International Colloquium on Automata, Languages, and Programming. 170–181.
Shuchi Chawla, Jason D. Hartline, and Denis Nekipelov. 2014. Mechanism design for data science. In Proceedings of the ACM Conference on Economics and Computation. 711–712.
Chandra Chekuri and Kent Quanrud. 2019. Parallelizing greedy for submodular set function maximization in matroids and beyond. In Proceedings of the ACM Symposium on Theory of Computing.
Chandra Chekuri and Kent Quanrud. 2019. Submodular function maximization in parallel via the multilinear relaxation. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms.
Wei Chen, Xiaoming Sun, Jialin Zhang, and Zhijie Zhang. 2020. Optimization from structured samples for coverage functions. In Proceedings of the International Conference on Machine Learning. PMLR, 1715–1724.
Yu Cheng, Ho Yee Cheung, Shaddin Dughmi, Ehsan Emamjomeh-Zadeh, Li Han, and Shang-Hua Teng. 2015. Mixture selection, mechanism design, and signaling. In Proceedings of the IEEE Symposium on Foundations of Computer Science.
Flavio Chierichetti, Ravi Kumar, and Andrew Tomkins. 2010. Max-cover in map-reduce. In Proceedings of the 19th International Conference on World Wide Web. ACM, 231–240.
Richard Cole and Tim Roughgarden. 2014. The sample complexity of revenue maximization. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing. 243–252.
Hadi Daneshmand, Manuel Gomez-Rodriguez, Le Song, and Bernhard Schölkopf. 2014. Estimating diffusion network structures: Recovery conditions, sample complexity & soft-thresholding algorithm. In Proceedings of the 31st International Conference on Machine Learning. 793–801.
Anirban Dasgupta, Arpita Ghosh, Ravi Kumar, Christopher Olston, Sandeep Pandey, and Andrew Tomkins. 2007. The discoverability of the web. In Proceedings of the 16th International Conference on World Wide Web. ACM, 421–430.
Shahar Dobzinski and Michael Schapira. 2006. An improved approximation algorithm for combinatorial auctions with submodular bidders. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithm. Society for Industrial and Applied Mathematics, 1064–1073.
Nan Du, Yingyu Liang, Maria-Florina Balcan, and Le Song. 2014. Influence function learning in information diffusion networks. In Proceedings of the 31st International Conference on Machine Learning. 2016–2024.
Nan Du, Yingyu Liang, Maria-Florina Balcan, and Le Song. 2014. Learning time-varying coverage functions. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 3374–3382.
Nan Du, Le Song, Manuel Gomez-Rodriguez, and Hongyuan Zha. 2013. Scalable influence estimation in continuous-time diffusion networks. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 3147–3155.
Shaddin Dughmi. 2011. A truthful randomized mechanism for combinatorial public projects via convex optimization. In Proceedings of the 12th ACM Conference on Electronic Commerce. ACM, 263–272.
Shaddin Dughmi, Li Han, and Noam Nisan. 2014. Sampling and representation complexity of revenue maximization. In Proceedings of the 10th International Conference on Web and Internet Economics. 277–291.
Shaddin Dughmi, Tim Roughgarden, and Qiqi Yan. 2011. From convex optimization to randomized mechanisms: Toward optimal combinatorial auctions. In Proceedings of the 43rd Annual ACM Symposium on Theory of Computing. ACM, 149–158.
Alina Ene and Huy L. Nguyen. 2019. Submodular maximization with nearly-optimal approximation and adaptivity in nearly-linear time. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms.
Alina Ene, Huy L. Nguyen, and Adrian Vladu. 2019. Submodular maximization with matroid and packing constraints in parallel. In Proceedings of the ACM Symposium on Theory of Computing.
Alina Ene, Jan Vondrák, and Yi Wu. 2013. Local distribution and the symmetry gap: Approximability of multiway partitioning problems. In Proceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms. 306–325.
Matthew Fahrbach, Vahab Mirrokni, and Morteza Zadimoghaddam. 2019. Non-monotone submodular maximization with nearly optimal adaptivity complexity. In Proceedings of the International Conference on Machine Learning. PMLR.
Matthew Fahrbach, Vahab Mirrokni, and Morteza Zadimoghaddam. 2019. Submodular maximization with optimal approximation, adaptivity and query complexity. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms.
Vitaly Feldman and Pravesh Kothari. 2014. Learning coverage functions and private release of marginals. In Proceedings of the 27th Conference on Learning Theory.
Vitaly Feldman, Pravesh Kothari, and Jan Vondrák. 2013. Representation, approximation and learning of submodular functions using low-rank decision trees. In Proceedings of the 26th Annual Conference on Learning Theory.
Vitaly Feldman and Jan Vondrák. 2013. Optimal bounds on approximation of submodular and XOS functions by juntas. In Proceedings of the 54th Annual IEEE Symposium on Foundations of Computer Science.
Vitaly Feldman and Jan Vondrák. 2015. Tight bounds on low-degree spectral concentration of submodular and XOS functions. In Proceedings of the Foundations of Computer Science. IEEE, 923–942.
Michel X. Goemans, Nicholas J. A. Harvey, Satoru Iwata, and Vahab Mirrokni. 2009. Approximating submodular functions everywhere. In Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms.
Manuel Gomez-Rodriguez, Jure Leskovec, and Andreas Krause. 2010. Inferring networks of diffusion and influence. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1019–1028.
Carlos Guestrin, Andreas Krause, and Ajit Paul Singh. 2005. Near-optimal sensor placements in Gaussian processes. In Proceedings of the 22nd International Conference on Machine Learning. 265–272.
Anupam Gupta, Moritz Hardt, Aaron Roth, and Jonathan Ullman. 2013. Privately releasing conjunctions and the statistical query barrier. SIAM J. Comput. 42, 4 (2013), 1494–1520.
Xinran He and David Kempe. 2016. Robust influence maximization. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 885–894.
Zhiyi Huang, Yishay Mansour, and Tim Roughgarden. 2015. Making the most of your samples. In Proceedings of the 16th ACM Conference on Economics and Computation. 45–60.
David Kempe, Jon Kleinberg, and Éva Tardos. 2003. Maximizing the spread of influence through a social network. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 137–146.
Nitish Korula, Vahab S. Mirrokni, and Morteza Zadimoghaddam. 2015. Online submodular welfare maximization: Greedy beats 1/2 in random order. In Proceedings of the 47th Annual ACM on Symposium on Theory of Computing.
Andreas Krause and Carlos Guestrin. 2007. Near-optimal observation selection using submodular functions. In Proceedings of the AAAI Conference on Artificial Intelligence. 1650–1654.
Benny Lehmann, Daniel Lehmann, and Noam Nisan. 2001. Combinatorial auctions with decreasing marginal utilities. In Proceedings of the 3rd ACM Conference on Electronic Commerce. ACM, 18–28.
Hui Lin and Jeff Bilmes. 2011. A class of submodular functions for document summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 510–520.
Vahab Mirrokni, Michael Schapira, and Jan Vondrák. 2008. Tight information-theoretic lower bounds for welfare maximization in combinatorial auctions. In Proceedings of the 9th ACM Conference on Electronic Commerce.
Jamie Morgenstern and Tim Roughgarden. 2015. The pseudo-dimension of nearly-optimal auctions. In Proceedings of the Conference on Advances in Neural Information Processing Systems 28.
Elchanan Mossel, Ryan O’Donnell, and Rocco P. Servedio. 2003. Learning juntas. In Proceedings of the 35th Annual ACM Symposium on Theory of Computing. ACM, 206–212.
Harikrishna Narasimhan, David C. Parkes, and Yaron Singer. 2015. Learnability of influence in networks. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 3186–3194.
G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. 1978. An analysis of approximations for maximizing submodular set functions II. Math. Program. Stud. 8 (1978).
Barna Saha and Lise Getoor. 2009. On maximum coverage in the streaming model & application to multi-topic blog-watch. In Proceedings of the SIAM International Conference on Data Mining. SIAM, 697–708.
Lior Seeman and Yaron Singer. 2013. Adaptive seeding in social networks. In Proceedings of the IEEE 54th Annual Symposium on Foundations of Computer Science. IEEE, 459–468.
Yaron Singer. 2012. How to win friends and influence people, truthfully: Influence maximization mechanisms for social networks. In Proceedings of the 5th ACM International Conference on Web Search and Data Mining. ACM, 733–742.
Ashwin Swaminathan, Cherian V. Mathew, and Darko Kirovski. 2009. Essential pages. In Proceedings of the IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology. 173–182.
Hiroya Takamura and Manabu Okumura. 2009. Text summarization model based on maximum coverage problem and its variant. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 781–789.
Gregory Valiant. 2012. Finding correlations in subquadratic time, with applications to learning parities and juntas. In Proceedings of the IEEE 53rd Annual Symposium on Foundations of Computer Science. IEEE, 11–20.
Yisong Yue and Thorsten Joachims. 2008. Predicting diverse subsets using structural SVMs. In Proceedings of the 25th International Conference on Machine Learning. 1224–1231.