What We Do In The Shadows (Part 1/3)

A gentle intro to quantum and the start of a 3-part blog about shadow tomography.

Here's a 3-part blog about shadow tomography, a problem in quantum learning theory with some neat connections to multi-distribution learning, adaptive data analysis, and no-regret dynamics. \(\def\textsc#1{\dosc#1\csod} \def\dosc#1#2\csod{{\rm #1{\small #2}}}\def\ket#1{\lvert #1\rangle}\def\bra#1{\langle #1\rvert}\def\braket#1#2{\langle #1 | #2 \rangle}\def\ketbra#1#2{\lvert #1\rangle\langle #2\rvert}\def\defeq{\mathrel{:=}}\) I mainly wrote these notes for myself, but thought I'd share them anyways since I haven't posted in a while. This first post just provides a gentle introduction to quantum and the Bădescu-O'Donnell algorithm.

The Shadow Tomography Problem

An observable \(A_i \in \mathbb{C}^{d \times d}\) is just a Hermitian matrix. Its expectation on a state \(\rho\) is given by \(\operatorname*{E}_\rho[A_i] = \operatorname{Tr}(\rho A_i)\). For a pure state, this is \(\operatorname*{E}_\rho[A_i] = \langle \psi | A_i | \psi \rangle\).

Now let's say you have a \(d\)-dimensional quantum state \(\rho \in \mathbb{C}^{d \times d}\) and you want to estimate the expectations of some observables \(A_1, \ldots, A_m\) using as few copies of \(\rho\) as possible. This is the shadow tomography problem of Aaronson 2017.

[Image: Shadow Tomography. A shadow of pure drip.]

Basics of Quantum Learning Theory

We'll start from the top and review the basics of quantum learning theory (and page back in some linear algebra from undergrad).

First, recall that a normal matrix commutes with its conjugate transpose, i.e., \(M M^\dagger = M^\dagger M\), or equivalently has an orthonormal basis of eigenvectors. Quantum states and measurements are just normal matrices with real eigenvalues.

Definition 1.1 (Hermitian Matrix). A square matrix \(M \in \mathbb{C}^{d \times d}\) is Hermitian (or self-adjoint) if it is equal to its conjugate transpose, i.e., \(M = M^\dagger\). Hermitian matrices have real eigenvalues. Density matrices and measurement operators (projectors, POVM elements) are required to be Hermitian.

Definition 1.2 (Projection Matrix). A square matrix \(P \in \mathbb{C}^{d \times d}\) is a projection matrix (or projector) if it is Hermitian (\(P = P^\dagger\)) and idempotent (\(P^2 = P\)). A projector projects vectors onto a subspace.

Evolutions of quantum systems are normal matrices with eigenvalues on the unit circle.

Definition 1.3 (Unitary Matrix). A square matrix \(U \in \mathbb{C}^{d \times d}\) is unitary if its conjugate transpose \(U^\dagger\) is also its inverse, i.e., \(U^\dagger U = U U^\dagger = I\). Unitary transformations preserve the inner product between vectors, and thus preserve lengths and angles.

Quantum States

A \(d\)-dimensional pure quantum state is represented by a unit vector \(|\psi\rangle \in \mathbb{C}^d\). More generally, a mixed state is described by a density matrix \(\rho \in \mathbb{C}^{d \times d}\) that is positive semi-definite (\(\rho \ge 0\), which in particular means \(\rho\) is Hermitian) and has trace one (\(\operatorname{Tr}(\rho) = 1\)). A pure state \(\psi\) can be written as \(\rho = |\psi\rangle\langle\psi|\); pure states are the special case where \(\operatorname{rank}(\rho)=1\).

Remark 1.4 (Classical special case). A classical distribution is a density matrix that is diagonal in the standard basis \(\{|1\rangle, \dots, |d\rangle\}\), i.e. \(\rho = \sum_{x=1}^d p_x |x\rangle\langle x|\) corresponds to the distribution \(p = (p_1, \dots, p_d)\) over the outcomes \(\{1, \dots, d\}\).

Since \(\rho\) is Hermitian, it's unitarily diagonalizable, which means there's always some basis (specifically \(\rho\)'s eigenbasis) in which \(\rho\) looks like a classical distribution.

Definition 1.5 (Purification of a Density Matrix). Let \(\rho\) be a state in a \(d_A\)-dimensional system (System A). A purification of \(\rho\) is a pure state \(|\psi\rangle_{AB}\) in a larger system (System AB), such that the original state \(\rho\) is recovered by a partial trace over the auxiliary system (System B). \[ \rho = \operatorname{Tr}_B(|\psi\rangle\langle\psi|_{AB}). \] A purification can always be constructed if the dimension \(d_B\) of the auxiliary system B is at least the rank of \(\rho\). Let the spectral decomposition of \(\rho\) be \(\rho = \sum_{i=1}^{r} \lambda_i |v_i\rangle\langle v_i|\), where \(r = \operatorname{rank}(\rho)\), \(\lambda_i > 0\) are the non-zero eigenvalues, and \(\{|v_i\rangle\}\) are the corresponding orthonormal eigenvectors in \(\mathbb{C}^{d_A}\). Choose an auxiliary system B of dimension \(d_B \ge r\) with an orthonormal basis \(\{|u_i\rangle_B\}_{i=1}^{d_B}\). A standard way to construct a purification is: \[ |\psi\rangle_{AB} = \sum_{i=1}^{r} \sqrt{\lambda_i} |v_i\rangle_A \otimes |u_i\rangle_B. \] If one were to perform measurements only on System A, the statistics obtained from the pure state \(|\psi\rangle_{AB}\) would be indistinguishable from those obtained from the mixed state \(\rho\). Purifications are not unique; choosing a different basis \(\{|u_i\rangle_B\}\) or applying a unitary transformation only to System B results in a different valid purification.
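To make this concrete, here's a minimal numpy sketch (my own illustration, not tied to any particular library) that builds the purification above for a random mixed state and checks that tracing out System B recovers \(\rho\); helper names like `purify` and `partial_trace_B` are just made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_density_matrix(d):
    """Sample a random d-dimensional mixed state (PSD, trace one)."""
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = G @ G.conj().T
    return rho / np.trace(rho)

def purify(rho):
    """Return |psi>_{AB} = sum_i sqrt(lambda_i) |v_i>_A |i>_B as a vector in C^{d_A * d_B}."""
    evals, evecs = np.linalg.eigh(rho)
    keep = evals > 1e-12                      # keep only the nonzero eigenvalues
    lam, V = evals[keep], evecs[:, keep]
    r = lam.size                              # rank of rho; we take d_B = r
    psi = np.zeros((rho.shape[0], r), dtype=complex)
    for i in range(r):
        psi[:, i] = np.sqrt(lam[i]) * V[:, i]
    return psi.reshape(-1)                    # index (a, b) flattened with B as the fast index

def partial_trace_B(psi, dA, dB):
    """Tr_B of the pure state |psi><psi| on C^{dA} (x) C^{dB}."""
    M = psi.reshape(dA, dB)                   # psi_{a,b}
    return M @ M.conj().T                     # (Tr_B)_{a,a'} = sum_b psi_{a,b} conj(psi_{a',b})

dA = 4
rho = random_density_matrix(dA)
psi = purify(rho)
dB = psi.size // dA
print(np.allclose(rho, partial_trace_B(psi, dA, dB)))   # True: Tr_B |psi><psi| recovers rho
```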

Quantum Measurements

The simplest operations on a quantum computer are: (i) applying a unitary \(U\) to your state, (ii) appending ancilla registers (say, qubits initialized to \(|0\rangle\)), and (iii) measuring in the standard/canonical/computational basis \(\{|1\rangle, \dots, |d\rangle\}\). All of the measurements we'll talk about can be performed using a combination of these operations. We'll defer for now how to simulate post-measurement states and focus on simulating outcomes.

Definition 1.6 (Basis measurement). Let \(|v_1\rangle, \dots, |v_d\rangle\) be an orthonormal basis of \(\mathbb{C}^d\). Given a mixed state \(\rho \in \mathbb{C}^{d \times d}\), if we perform this basis measurement, then the probability we observe outcome “\(i\)" is \[ \mathbb{P}[\text{Outcome “}i\text{"}] = \langle v_i| \rho |v_i\rangle = \operatorname{Tr}(|v_i\rangle\langle v_i| \cdot \rho). \]

To simulate measuring in an arbitrary basis \(\{|v_1\rangle, \dots, |v_d\rangle\}\), first apply the unitary \(U\) which rotates this basis into the standard basis, i.e., \(U |v_i\rangle = |i\rangle\). Then measure in the standard basis. Given a state \(\rho\), the evolved state is \(U \rho U^\dagger\). The probability of observing outcome “\(i\)" in the standard basis is \[ \mathbb{P}[\text{Outcome “}i\text{"}] = \langle i| (U \rho U^\dagger) |i\rangle = \operatorname{Tr}(|i\rangle\langle i| U \rho U^\dagger) = \operatorname{Tr}(U^\dagger |i\rangle\langle i| U \rho). \] Since \(U^\dagger |i\rangle = |v_i\rangle\), we have that \(U^\dagger |i\rangle\langle i| U = |v_i\rangle\langle v_i|\), so \[ \mathbb{P}[\text{Outcome “}i\text{"}] = \operatorname{Tr}(|v_i\rangle\langle v_i| \rho) = \langle v_i| \rho |v_i\rangle. \]
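As a quick sanity check, here's a small numpy sketch (my own, with an arbitrary dimension and seed) verifying that applying the rotation \(U\) and then measuring in the standard basis reproduces the probabilities \(\langle v_i|\rho|v_i\rangle\).

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3

# A random orthonormal basis {|v_1>, ..., |v_d>}: the columns of a random unitary V.
G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
V, _ = np.linalg.qr(G)                        # columns of V are the |v_i>

# A random mixed state rho.
G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
rho = G @ G.conj().T
rho /= np.trace(rho)

# Direct basis-measurement probabilities: <v_i| rho |v_i>.
p_direct = np.real([V[:, i].conj() @ rho @ V[:, i] for i in range(d)])

# Simulation: apply U = V^dagger (so that U|v_i> = |i>), then measure in the standard basis.
U = V.conj().T
rho_evolved = U @ rho @ U.conj().T
p_simulated = np.real(np.diag(rho_evolved))   # <i| U rho U^dag |i>

print(np.allclose(p_direct, p_simulated))     # True
```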

Definition 1.7 (Projective measurement). A projective measurement is described by a set of projection matrices \(\Pi = \{\Pi_1, \dots, \Pi_m\}\) that sum to the identity, \(\Pi_1 + \dots + \Pi_m = I\). The projectors satisfy \(\Pi_i \Pi_j = \delta_{ij} \Pi_i\), meaning they project onto mutually orthogonal subspaces. If this measurement is performed on a state \(\rho \in \mathbb{C}^{d \times d}\), then \[ \mathbb{P}[\text{Outcome “}i\text{"}] = \operatorname{Tr}(\Pi_i \cdot \rho). \]

Given a projective measurement \(\Pi = \{\Pi_1, \dots, \Pi_m\}\), we can decompose each projector \(\Pi_i\) into a sum over an orthonormal basis for its image subspace \( \Pi_i = \sum_{j=1}^{r_i} |v_{i,j}\rangle\langle v_{i,j}|, \) where \(r_i = \operatorname{rank}(\Pi_i)\) and \(\{|v_{i,j}\rangle\}_{j=1}^{r_i}\) is an orthonormal basis for the subspace projected onto by \(\Pi_i\). Because the projectors are orthogonal, the set of all vectors \(\{v_{i,j}\}_{i \in [m], j \in [r_i]}\) forms an orthonormal basis for the entire space \(\mathbb{C}^d\) (or the subspace spanned by the non-zero projectors if \(\sum \Pi_i \neq I\), but we assume \(\sum \Pi_i = I\)). To simulate measuring \(\Pi\) on a state \(\rho\), we'll first perform a basis measurement in the basis \(\{v_{i,j}\}\) and then, if the outcome corresponds to vector \(|v_{i,j}\rangle\), output \(i\). The total probability of outputting \(i\) is \[ \mathbb{P}[\text{Output “}i\text{"}] = \sum_{j=1}^{r_i} \mathbb{P}[\text{Outcome “}v_{i,j}\text{"}] = \sum_{j=1}^{r_i} \operatorname{Tr}(|v_{i,j}\rangle\langle v_{i,j}| \cdot \rho) = \operatorname{Tr} \left( \left( \sum_{j=1}^{r_i} |v_{i,j}\rangle\langle v_{i,j}| \right) \cdot \rho \right) = \operatorname{Tr}(\Pi_i \cdot \rho). \]

Definition 1.8 (POVMs). A positive operator-valued measure (POVM) is described by a set of Hermitian matrices \(E = \{E_1, \dots, E_m\}\) such that \(E_i \ge 0\) for all \(1 \le i \le m\), and \(E_1 + \dots + E_m = I\). These \(E_i\) are called POVM elements. If this measurement is performed on a state \(\rho \in \mathbb{C}^{d \times d}\), then \[ \mathbb{P}[\text{Outcome “}i\text{"}] = \operatorname{Tr}(E_i \cdot \rho). \]

POVMs generalize projective measurements by dropping the requirement that \(E_i^2 = E_i\); i.e., the \(E_i\) are allowed to have arbitrary eigenvalues in \([0, 1]\) rather than only \(\{0, 1\}\). This allows for more general measurement strategies, including those with more outcomes than the dimension \(d\), or measurements that cannot perfectly distinguish non-orthogonal states.

Simulating POVMs (Neumark's Extension Theorem). Any POVM \(E = \{E_1, \dots, E_m\}\) on a system \(\mathcal{H}_S\) can be physically realized using a unitary operation \(U\) on an extended system \(\mathcal{H}_S \otimes \mathcal{H}_A\) followed by a projective measurement on the auxiliary (ancilla) system \(\mathcal{H}_A\). The procedure is:

  1. Introduce an ancilla system \(\mathcal{H}_A\) of dimension at least \(m\). Initialize it in a fixed state, e.g., \(|0\rangle_A\). The combined system starts in state \(\rho \otimes |0\rangle\langle 0|_A\).
  2. Define an isometry \(V: \mathcal{H}_S \to \mathcal{H}_S \otimes \mathcal{H}_A\) by \(V |\psi\rangle_S = \sum_{i=1}^m (\sqrt{E_i} |\psi\rangle) \otimes |i\rangle_A\). Here, \(\{|i\rangle_A\}_{i=1}^m\) is an orthonormal set of states in \(\mathcal{H}_A\). This is indeed an isometry because \[\langle\psi| V^\dagger V |\psi\rangle = \sum_{i,j} \langle\psi| \sqrt{E_j}^\dagger \sqrt{E_i} |\psi\rangle \langle j|i\rangle_A = \sum_i \langle\psi| E_i |\psi\rangle = \langle\psi| (\sum_i E_i) |\psi\rangle = \langle\psi|\psi\rangle.\]
  3. Extend the isometry \(V\) to a unitary operator \(U\) acting on the full space \(\mathcal{H}_S \otimes \mathcal{H}_A\) (we'll just take this for granted). Apply this unitary \(U\) to the initial state \(\rho \otimes |0\rangle\langle 0|_A\).
  4. Perform a projective measurement on the ancilla system \(\mathcal{H}_A\) described by the projectors \(\{\Pi_i = I_S \otimes |i\rangle\langle i|_A\}_{i=1}^m\), plus possibly a projector onto the remaining subspace if \(\dim(\mathcal{H}_A) > m\). Here \(I_S\) is the identity operator on the system Hilbert space \(\mathcal{H}_S\).

The probability of obtaining outcome \(i\) from the projective measurement on the ancilla is: \begin{align*} \mathbb{P}[\text{Ancilla outcome } i] &= \operatorname{Tr}_{S+A}\left( \Pi_i U (\rho \otimes |0\rangle\langle 0|_A) U^\dagger \right) \\ &= \operatorname{Tr}_{S+A}\left( \Pi_i V \rho V^\dagger \right) \quad \text{(since } U \text{ acts as } V \text{ on } \mathcal{H}_S \otimes |0\rangle_A\text{)} \\ &= \operatorname{Tr}_S \langle i|_A V \rho V^\dagger |i\rangle_A \\ &= \operatorname{Tr}_S \left( \sqrt{E_i} \rho (\sqrt{E_i})^\dagger \right) = \operatorname{Tr}( E_i \rho ). \quad (\text{since } E_i \text{ is Hermitian}) \end{align*} This probability exactly matches the POVM rule \(\operatorname{Tr}(E_i \rho)\). Thus, any abstract POVM can be implemented using standard quantum operations.
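Here's a toy numpy sketch of the Neumark construction (my own illustration): it builds the isometry \(V\) by stacking the blocks \(\sqrt{E_i}\), checks \(V^\dagger V = I\), and confirms that the ancilla-measurement probabilities equal \(\operatorname{Tr}(E_i\rho)\). The random POVM construction and helper names are just for this example.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 3, 4                                             # system dimension, number of POVM outcomes

def psd_sqrt(M):
    """Square root of a PSD matrix via its eigendecomposition."""
    evals, evecs = np.linalg.eigh(M)
    return evecs @ np.diag(np.sqrt(np.clip(evals, 0, None))) @ evecs.conj().T

# A random POVM {E_1, ..., E_m}: conjugate random PSD matrices so that they sum to I.
raw = []
for _ in range(m):
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    raw.append(G @ G.conj().T)
S = sum(raw)
S_inv_sqrt = np.linalg.inv(psd_sqrt(S))
E = [S_inv_sqrt @ R @ S_inv_sqrt for R in raw]          # now sum_i E_i = I

# A random state rho.
G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
rho = G @ G.conj().T
rho /= np.trace(rho)

# Neumark isometry V|psi> = sum_i (sqrt(E_i)|psi>) (x) |i>_A, stored as an (m*d) x d matrix
# with the ancilla as the outer (slow) index.
V = np.vstack([psd_sqrt(Ei) for Ei in E])
print(np.allclose(V.conj().T @ V, np.eye(d)))           # isometry check: V^dag V = I

joint = V @ rho @ V.conj().T                            # the joint state V rho V^dag on A (x) S
for i in range(m):
    block = joint[i * d:(i + 1) * d, i * d:(i + 1) * d]  # the <i|_A (.) |i>_A block
    print(np.isclose(np.trace(block).real, np.trace(E[i] @ rho).real))  # True: equals Tr(E_i rho)
```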

Example (Trivial POVM). Consider the POVM \(E_1 = I/2, E_2 = I/2\). Let's simulate it:

  1. Add an ancilla qubit initialized to \(|0\rangle\). State is \(\rho \otimes |0\rangle\langle 0|\).
  2. Apply a Hadamard gate \(H\) to the ancilla: \((I \otimes H) (\rho \otimes |0\rangle\langle 0|) (I \otimes H^\dagger) = \rho \otimes |+\rangle\langle +|\).
  3. Measure the ancilla in the standard basis \(\{|0\rangle, |1\rangle\}\). The projectors are \(\Pi_0 = I \otimes |0\rangle\langle 0|\) and \(\Pi_1 = I \otimes |1\rangle\langle 1|\).

Mapping outcome \(0 \to 1\) and outcome \(1 \to 2\) simulates the POVM \(\{I/2, I/2\}\): \[\mathbb{P}[0] = \operatorname{Tr}(\Pi_0 (\rho \otimes |+\rangle\langle +|)) = \operatorname{Tr}(\rho) \operatorname{Tr}(|0\rangle\langle 0| |+\rangle\langle +|) = 1 \cdot |\langle 0|+\rangle|^2 = 1/2\] \[\mathbb{P}[1] = \operatorname{Tr}(\Pi_1 (\rho \otimes |+\rangle\langle +|)) = \operatorname{Tr}(\rho) \operatorname{Tr}(|1\rangle\langle 1| |+\rangle\langle +|) = 1 \cdot |\langle 1|+\rangle|^2 = 1/2\]

Example (Optimal Unambiguous Discrimination POVM). Consider pure states \(|\psi_1\rangle\) and \(|\psi_2\rangle\) with \(\langle\psi_1|\psi_2\rangle = \cos\theta\). We want to discriminate between them, or declare "don't know". We'll do this with the POVM \(E = \{E_1, E_2, E_{dk}\}\) with \(E_1 = \frac{1}{\lambda} \Pi_1\), \(E_2 = \frac{1}{\lambda} \Pi_2\), and \(E_{dk} = I - E_1 - E_2\), where \(\Pi_1 = |\psi_2^\perp\rangle\langle\psi_2^\perp|\), \(\Pi_2 = |\psi_1^\perp\rangle\langle\psi_1^\perp|\), \(\lambda = 1+\cos\theta\). Following the Neumark recipe, the implementation (1) adds a 3-level ancilla (qutrit) in state \(|0\rangle_A\), (2) defines the isometry \[V |\psi\rangle_S = (\sqrt{E_1} |\psi\rangle) \otimes |1\rangle_A + (\sqrt{E_2} |\psi\rangle) \otimes |2\rangle_A + (\sqrt{E_{dk}} |\psi\rangle) \otimes |3\rangle_A,\] (3) extends \(V\) to a unitary \(U\) and applies it to \(\rho \otimes |0\rangle\langle 0|_A\), and (4) measures the ancilla projectively with \(\{\Pi_1 = I \otimes |1\rangle\langle 1|_A, \Pi_2 = I \otimes |2\rangle\langle 2|_A, \Pi_3 = I \otimes |3\rangle\langle 3|_A\}\).

Probability Things We'll Probably Need

In addition to total variation distance \(d_{\mathrm{TV}}(p,q) = \frac{1}{2} \sum_{x=1}^d |p_x - q_x|\), we'll need to know the Bhattacharyya coefficient and Hellinger distance.

Definition 1.9 (Bhattacharyya coefficient). The Bhattacharyya coefficient between \(p\) and \(q\) is \( BC(p,q) = \sum_{x=1}^d \sqrt{p_x q_x} \). It measures overlap: it equals \(1\) for identical distributions and \(0\) for distributions with disjoint supports. It's bounded (\(0 \le BC(p,q) \le 1\)), symmetric (\(BC(p,q) = BC(q,p)\)), and multiplicative (\(BC(p^{\otimes n}, q^{\otimes n}) = BC(p, q)^n\)).

Definition 1.10 (Hellinger distance). The Hellinger distance between distributions \(p\) and \(q\) is \( d_H(p,q) = \sqrt{1 - BC(p,q)} \). Equivalently, representing \(p\) as \(|\sqrt{p}\rangle = \sum_x \sqrt{p_x} |x\rangle\), \( d_H(p,q) = \frac{1}{\sqrt{2}} \left\lVert |\sqrt{p}\rangle - |\sqrt{q}\rangle \right\rVert_2 \). It's a metric (on the space of probability distributions) and bounded \(0 \le d_H(p,q) \le 1\).

We'll now define the quantum analogs of these distances/coefficients.

Definition 1.11 (Trace Distance). The trace distance between density matrices \(\rho\) and \(\sigma\) is \[ d_{\mathrm{Tr}}(\rho, \sigma) = \frac{1}{2} \left\lVert\rho - \sigma\right\rVert_1 = \frac{1}{2} \operatorname{Tr}|\rho - \sigma|, \] where \(|\rho-\sigma| = \sqrt{(\rho-\sigma)^\dagger(\rho-\sigma)} = \sqrt{(\rho-\sigma)^2}\) since \(\rho-\sigma\) is Hermitian. If \(\lambda_i\) are the eigenvalues of \(\rho - \sigma\), then \(d_{\mathrm{Tr}}(\rho, \sigma) = \frac{1}{2} \sum_{i=1}^d |\lambda_i|\). An equivalent definition is \[ d_{\mathrm{Tr}}(\rho, \sigma) = \max_{0 \le \Pi \le I} \operatorname{Tr}(\Pi(\rho - \sigma)), \] where the maximum is over all Hermitian operators \(\Pi\) such that \(0 \le \Pi \le I\). The maximum is achieved when \(\Pi\) is the projector onto the non-negative eigenspace of \(\rho-\sigma\).

Trace distance generalizes TV distance, is a metric and bounded just like TV distance, and satisfies \(0 \le d_{\mathrm{Tr}}(\rho, \sigma) \le 1\) with \(d_{\mathrm{Tr}}(\rho, \sigma) = 1\) iff \(\rho, \sigma\) have orthogonal support. It also has an operational meaning (Helstrom bound): the maximum success probability for distinguishing \(\rho\) and \(\sigma\) with equal priors is \(\frac{1}{2} + \frac{1}{2} d_{\mathrm{Tr}}(\rho, \sigma)\).

Definition 1.12 (Fidelity). The fidelity between two density matrices \(\rho\) and \(\sigma\) is \[ F(\rho, \sigma) = \left\lVert\sqrt{\rho}\sqrt{\sigma}\right\rVert_1 = \operatorname{Tr}\sqrt{\sqrt{\rho}\sigma\sqrt{\rho}}. \] Here \(\sqrt{\rho}\) is the unique PSD square root of \(\rho\). Fidelity generalizes Bhattacharyya coefficient; it's similarly bounded (\(0 \le F(\rho, \sigma) \le 1\)), symmetric (\(F(\rho, \sigma) = F(\sigma, \rho)\)), and multiplicative (\(F(\rho^{\otimes n}, \sigma^{\otimes n}) = F(\rho, \sigma)^n\)).

For pure states, \(F(|\psi\rangle\langle\psi|, |\phi\rangle\langle\phi|) = |\langle\psi|\phi\rangle|\) and \(F(|\psi\rangle\langle\psi|, \sigma) = \sqrt{\langle\psi|\sigma|\psi\rangle}\). Uhlmann's theorem says \(F(\rho, \sigma) = \max_{|\psi\rangle, |\phi\rangle} \{ |\langle\psi|\phi\rangle| \}\), where the maximum is over all purifications \(|\psi\rangle\) of \(\rho\) and \(|\phi\rangle\) of \(\sigma\) in an extended Hilbert space.

Remark 1.13. The definition \(F(\rho, \sigma) = \operatorname{Tr}\sqrt{\sqrt{\rho}\sigma\sqrt{\rho}}\) is sometimes called the "square root fidelity". Some literature defines fidelity as \(F_{alt}(\rho, \sigma) = F(\rho, \sigma)^2 = (\operatorname{Tr}\sqrt{\sqrt{\rho}\sigma\sqrt{\rho}})^2\).

Definition 1.14 (Bures distance). The Bures distance between density matrices \(\rho\) and \(\sigma\) is defined in terms of fidelity: \(d_{\mathrm{Bures}}(\rho, \sigma) = \sqrt{2(1 - F(\rho, \sigma))}\). It generalizes Hellinger distance, is a metric, and bounded (\(0 \le d_{\mathrm{Bures}}(\rho, \sigma) \le \sqrt{2}\)).

Theorem 1.15 (Fuchs-van de Graaf Inequalities). The trace distance \(d_{\mathrm{Tr}}(\rho, \sigma)\) and fidelity \(F(\rho, \sigma)\) are related by \( 1 - F(\rho, \sigma) \le d_{\mathrm{Tr}}(\rho, \sigma) \le \sqrt{1 - F(\rho, \sigma)^2} \). In terms of the Bures distance \(d_{\mathrm{Bures}}(\rho, \sigma) = \sqrt{2(1-F(\rho, \sigma))}\), this is: \[ d_{\mathrm{Bures}}(\rho, \sigma)^2 \le 2 d_{\mathrm{Tr}}(\rho, \sigma) \le \sqrt{4 - (2 - d_{\mathrm{Bures}}(\rho, \sigma)^2)^2} = \sqrt{4 d_{\mathrm{Bures}}(\rho, \sigma)^2 - d_{\mathrm{Bures}}(\rho, \sigma)^4}. \] A simpler upper bound derived from this is \(d_{\mathrm{Tr}}(\rho, \sigma) \le d_{\mathrm{Bures}}(\rho, \sigma)\).
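If you want to see these quantities in action, here's a short numpy sketch (my own) that computes trace distance and (square root) fidelity for random mixed states and spot-checks the Fuchs-van de Graaf inequalities; the dimension, seed, and tolerance constants are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4

def rand_rho(d):
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    M = G @ G.conj().T
    return M / np.trace(M)

def psd_sqrt(M):
    evals, evecs = np.linalg.eigh(M)
    return evecs @ np.diag(np.sqrt(np.clip(evals, 0, None))) @ evecs.conj().T

def trace_distance(rho, sigma):
    # Half the sum of absolute eigenvalues of the Hermitian matrix rho - sigma.
    return 0.5 * np.abs(np.linalg.eigvalsh(rho - sigma)).sum()

def fidelity(rho, sigma):
    # "Square root" fidelity F = Tr sqrt( sqrt(rho) sigma sqrt(rho) ).
    s = psd_sqrt(rho)
    return np.trace(psd_sqrt(s @ sigma @ s)).real

for _ in range(5):
    rho, sigma = rand_rho(d), rand_rho(d)
    T, F = trace_distance(rho, sigma), fidelity(rho, sigma)
    # Fuchs-van de Graaf: 1 - F <= d_Tr <= sqrt(1 - F^2) (small slack for numerical error).
    print(1 - F <= T + 1e-9, T <= np.sqrt(1 - F**2) + 1e-9)
```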

Pure State Discrimination

We now reach our first toy problem: distinguish between two known, non-orthogonal pure quantum states, \(|\psi_1\rangle\) and \(|\psi_2\rangle\), given a single copy of an unknown state \(|\psi\rangle\) promised to be one of the two. Let's say the states span a 2-dimensional subspace \(S = \operatorname{span}\{|\psi_1\rangle, |\psi_2\rangle\}\). We'll assume \(|\psi_1\rangle, |\psi_2\rangle\) are normalized and let \(\langle\psi_1|\psi_2\rangle = \cos\theta e^{i\phi}\). By applying a global phase (e.g. apply \(e^{-i \phi}\) to \(\psi_2\)), we can assume \(\langle\psi_1|\psi_2\rangle = \cos\theta\) is real and non-negative, so \(\theta \in (0, \pi/2]\) is the angle between the states.

Optimal Worst-Case Error Discrimination

We want a POVM \(E=\{E_1, E_2\}\) that minimizes \[\max\{\mathbb{P}[\text{Guess 2}|\psi_1], \mathbb{P}[\text{Guess 1}|\psi_2]\} = \max\{\langle\psi_1|E_2|\psi_1\rangle, \langle\psi_2|E_1|\psi_2\rangle\}.\] Let \(|u_1\rangle = |\psi_1\rangle\) and \(|u_2\rangle = (|\psi_2\rangle - \cos\theta |\psi_1\rangle) / \sin\theta\) form an orthonormal basis for \(S\). We'll define the orthogonal measurement basis vectors within \(S\) as: \begin{align*} |v_1\rangle &= \cos\left(\frac{\pi}{4} - \frac{\theta}{2}\right) |u_1\rangle - \sin\left(\frac{\pi}{4} - \frac{\theta}{2}\right) |u_2\rangle \\ |v_2\rangle &= \sin\left(\frac{\pi}{4} - \frac{\theta}{2}\right) |u_1\rangle + \cos\left(\frac{\pi}{4} - \frac{\theta}{2}\right) |u_2\rangle \end{align*} One way to view this is to work in the \(\{|u_1\rangle, |u_2\rangle\}\) plane, where \(\psi_1\) sits at angle \(0\) and \(\psi_2\) at angle \(\theta\): we place \(v_1\) and \(v_2\) at right angles to each other, arranged symmetrically about the bisector of \(\psi_1\) and \(\psi_2\) (the direction at angle \(\frac{\theta}{2}\)). If outcome \(E_1 = |v_1\rangle\langle v_1|\) occurs, guess \(|\psi_1\rangle\); if outcome \(E_2 = |v_2\rangle\langle v_2|\) occurs, guess \(|\psi_2\rangle\).

Theorem 2.1 (Optimal worst-case error). The worst-case error is \(\frac{1}{2} - \frac{1}{2}\sin(\theta)\).

Proof. By construction, the setup is symmetric with respect to \(|\psi_1\rangle\) and \(|\psi_2\rangle\). The angle between \(|v_1\rangle\) and \(|\psi_1\rangle = |u_1\rangle\) is \(\alpha = \frac{\pi}{4} - \frac{\theta}{2}\). The angle between \(|v_2\rangle\) and \(|\psi_2\rangle\) is also \(\alpha\).

The success probability given \(|\psi_1\rangle\) is \[\mathbb{P}[\text{Guess 1}|\psi_1] = \langle\psi_1| E_1 |\psi_1\rangle = |\langle v_1|\psi_1\rangle|^2 = \cos^2(\alpha) = \cos^2(\frac{\pi}{4} - \frac{\theta}{2}).\] Using the identity \(\cos^2(x) = (1+\cos(2x))/2\): \[ \cos^2\left(\frac{\pi}{4} - \frac{\theta}{2}\right) = \frac{1 + \cos\left(\frac{\pi}{2} - \theta\right)}{2} = \frac{1 + \sin(\theta)}{2}. \] So the error probability given \(|\psi_1\rangle\) is \(1 - \frac{1+\sin(\theta)}{2} = \frac{1}{2} - \frac{1}{2}\sin(\theta)\). The other case follows symmetrically. \(\square\)
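A quick numerical sanity check of Theorem 2.1 (my own sketch, working directly in the \(\{|u_1\rangle, |u_2\rangle\}\) coordinates with an arbitrary \(\theta\)):

```python
import numpy as np

theta = 0.3                                    # angle between the two states
# Work inside the 2-d span, with |u1> = (1, 0) and |u2> = (0, 1).
psi1 = np.array([1.0, 0.0])
psi2 = np.array([np.cos(theta), np.sin(theta)])       # <psi1|psi2> = cos(theta)

a = np.pi / 4 - theta / 2
v1 = np.array([np.cos(a), -np.sin(a)])
v2 = np.array([np.sin(a), np.cos(a)])

# Error probabilities: guessing "2" given psi1, and guessing "1" given psi2.
err_given_1 = abs(v2 @ psi1) ** 2
err_given_2 = abs(v1 @ psi2) ** 2
print(err_given_1, err_given_2, 0.5 - 0.5 * np.sin(theta))    # all three agree
```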

Optimal Unambiguous Discrimination

Here, we seek a POVM \(E = \{E_1, E_2, E_{dk}\}\) such that incorrect guesses are forbidden: \(\operatorname{Tr}(E_1 |\psi_2\rangle\langle\psi_2|) = 0\) and \(\operatorname{Tr}(E_2 |\psi_1\rangle\langle\psi_1|) = 0\). The goal is to minimize the worst-case probability of guessing “don't know", \(\max\{\operatorname{Tr}(E_{dk}|\psi_1\rangle\langle\psi_1|), \operatorname{Tr}(E_{dk}|\psi_2\rangle\langle\psi_2|)\}\).

Let \(|\psi_1^\perp\rangle\) and \(|\psi_2^\perp\rangle\) be normalized states in \(S = \operatorname{span}\{|\psi_1\rangle, |\psi_2\rangle\}\) such that \(\langle\psi_1|\psi_1^\perp\rangle=0\) and \(\langle\psi_2|\psi_2^\perp\rangle=0\). Again using \(|u_1\rangle=|\psi_1\rangle\) and \(|u_2\rangle=(|\psi_2\rangle - \cos\theta |\psi_1\rangle)/\sin\theta\) for \(S\), let's define the projectors \(\Pi_1 = |\psi_2^\perp\rangle\langle\psi_2^\perp|\) and \(\Pi_2 = |\psi_1^\perp\rangle\langle\psi_1^\perp|\). Let: \[\lambda = 1 + |\langle\psi_1^\perp|\psi_2^\perp\rangle| = 1 + |\langle u_2|(\sin\theta |u_1\rangle - \cos\theta |u_2\rangle)| = 1 + |-\cos\theta| = 1 + \cos\theta.\] Our POVM is then \[ E_1 = \frac{1}{\lambda} \Pi_1, \quad E_2 = \frac{1}{\lambda} \Pi_2, \quad E_{dk} = I - E_1 - E_2. \]

Let's quickly verify that \(E_1, E_2, E_{dk}\) form a valid POVM. The main thing to verify is that \(E_{dk} \geq 0\), which we can do by observing that \( \Pi_1 + \Pi_2 \leq \lambda I \): restricted to \(S\), the operator \(\Pi_1 + \Pi_2\) has eigenvalues \(1 \pm |\langle\psi_1^\perp|\psi_2^\perp\rangle| = 1 \pm \cos\theta\), both of which are at most \(\lambda = 1 + \cos\theta\), and it vanishes on \(S^\perp\).

Theorem 2.2 (Optimal unambiguous worst-case don't know). The worst-case probability of guessing “don't know" is \(\cos(\theta)\).

Proof. By definition of \(|\psi_1^\perp\rangle\), we have \[\operatorname{Tr}(E_2 |\psi_1\rangle\langle\psi_1|) = \langle\psi_1| E_2 |\psi_1\rangle = \frac{1}{\lambda} \langle\psi_1| \Pi_2 |\psi_1\rangle = \frac{1}{\lambda} |\langle\psi_1|\psi_1^\perp\rangle|^2 = 0.\] Similarly, \(\operatorname{Tr}(E_1 |\psi_2\rangle\langle\psi_2|) = \frac{1}{\lambda} |\langle\psi_2|\psi_2^\perp\rangle|^2 = 0\). Thus, the measurement is unambiguous.

Now consider where the state is \(|\psi_1\rangle\). The probability of guessing “\(|\psi_1\rangle\)" (outcome \(E_1\)) is: \begin{align*} \mathbb{P}[\text{Output “}|\psi_1\rangle\text{"}|\psi_1] &= \operatorname{Tr}(E_1 |\psi_1\rangle\langle\psi_1|) = \langle\psi_1| E_1 |\psi_1\rangle \\ &= \frac{1}{\lambda} \langle\psi_1| \Pi_1 |\psi_1\rangle = \frac{1}{\lambda} |\langle\psi_1|\psi_2^\perp\rangle|^2 \end{align*} Recalling that \(|\psi_1\rangle = |u_1\rangle\) and \(|\psi_2^\perp\rangle = \sin\theta |u_1\rangle - \cos\theta |u_2\rangle\), we get \[\langle\psi_1|\psi_2^\perp\rangle = \langle u_1 | (\sin\theta |u_1\rangle - \cos\theta |u_2\rangle) = \sin\theta \langle u_1|u_1\rangle - \cos\theta \langle u_1|u_2\rangle.\] Since \(\{|u_1\rangle, |u_2\rangle\}\) is an orthonormal basis, \(\langle u_1|u_1\rangle = 1\) and \(\langle u_1|u_2\rangle = 0\), simplifying the inner product to \(\sin\theta\). Since \(\theta \in (0, \pi/2]\), \(\sin\theta > 0\). So, \(|\langle\psi_1|\psi_2^\perp\rangle|^2 = \sin^2\theta\). Plugging this back in, we have \[ \mathbb{P}[\text{Output “}|\psi_1\rangle\text{"}|\psi_1] = \frac{1}{1+\cos(\theta)} \sin^2(\theta) = \frac{1 - \cos^2(\theta)}{1+\cos(\theta)} = \frac{(1-\cos(\theta))(1+\cos(\theta))}{1+\cos(\theta)} = 1 - \cos(\theta). \] The probability of guessing "don't know" given \(|\psi_1\rangle\) is: \begin{align*} \mathbb{P}[\text{Output “don't know"}|\psi_1] &= \operatorname{Tr}(E_{dk} |\psi_1\rangle\langle\psi_1|) \\ &= 1 - \mathbb{P}[\text{Output “}|\psi_1\rangle\text{"}|\psi_1] - \mathbb{P}[\text{Output “}|\psi_2\rangle\text{"}|\psi_1] \\ &= 1 - (1 - \cos(\theta)) - 0 \\ &= \cos(\theta). \end{align*} By symmetry, \(\mathbb{P}[\text{Output “don't know"}|\psi_2]\) is also \(\cos(\theta)\). \(\square\)
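And a matching numerical check of Theorem 2.2 (again my own sketch in the \(\{|u_1\rangle, |u_2\rangle\}\) coordinates): the POVM never guesses wrong and says "don't know" with probability \(\cos\theta\) on either input.

```python
import numpy as np

theta = 0.3
psi1 = np.array([1.0, 0.0])
psi2 = np.array([np.cos(theta), np.sin(theta)])

# Unit vectors in the span orthogonal to psi2 and psi1 respectively.
psi2_perp = np.array([np.sin(theta), -np.cos(theta)])
psi1_perp = np.array([0.0, 1.0])

lam = 1 + np.cos(theta)
E1 = np.outer(psi2_perp, psi2_perp) / lam      # "guess psi1"
E2 = np.outer(psi1_perp, psi1_perp) / lam      # "guess psi2"
Edk = np.eye(2) - E1 - E2                      # "don't know"

print(np.all(np.linalg.eigvalsh(Edk) > -1e-12))              # E_dk >= 0, so the POVM is valid
print(psi2 @ E1 @ psi2, psi1 @ E2 @ psi1)                    # both ~0: never guesses wrong
print(psi1 @ Edk @ psi1, psi2 @ Edk @ psi2, np.cos(theta))   # don't-know probability = cos(theta)
```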

The Pretty Good Measurement (PGM)

Problem Setup: Distinguishing Multiple Quantum States

Now suppose a state \(\rho\) is guaranteed to be one of \(K\) possible \(d\)-dimensional mixed states \(\rho_1, \dots, \rho_K\). We are also given prior probabilities \(\alpha_1, \dots, \alpha_K\) (with \(\sum_{i=1}^K \alpha_i = 1\)) such that the probability of being given state \(\rho_i\) is \(\alpha_i\). The average success probability for a given POVM \(M = \{M_1, \dots, M_K\}\) is \(P_{\text{success}}(M) = \sum_{i=1}^K \alpha_i \operatorname{Tr}(M_i \rho_i)\). While the optimal POVM can be found using semidefinite programming, there is generally no simple closed-form expression.

Definition of the Pretty Good Measurement (PGM)

The Pretty Good Measurement provides a specific, explicit construction for a POVM that often performs well. First, define the average state \(S\): \[ S = \sum_{j=1}^K \alpha_j \rho_j. \] \(S\) is a density operator (trace 1, positive semidefinite). We need the concept of the pseudoinverse for operators that might not be invertible. For a positive semidefinite operator \(S = \sum_k \lambda_k |u_k\rangle\langle u_k|\), its pseudoinverse square root is \(S^{-1/2} = \sum_{k: \lambda_k \ne 0} \frac{1}{\sqrt{\lambda_k}} |u_k\rangle\langle u_k|\).

Definition 2.3 (Pretty Good Measurement (PGM)). The Pretty Good Measurement is the POVM \(E = \{E_1, \dots, E_K\}\) defined by \[ E_i = S^{-1/2} (\alpha_i \rho_i) S^{-1/2} \quad \text{for } i = 1, \dots, K. \] It can be verified that \(E_i \ge 0\) and \(\sum_{i=1}^K E_i = P_{\text{supp}(S)}\), the projector onto the support of \(S\). Assuming we work within the support of \(S\) (\(S\) may not be invertible, but each \(\rho_i\) is supported within the support of \(S\)), we have \(\sum_{i=1}^K E_i = I\).

The success probability achieved by the Pretty Good Measurement is denoted by \(P_{\text{PGM}}\): \[ P_{\text{PGM}} = \sum_{i=1}^K \alpha_i \operatorname{Tr}(E_i \rho_i) = \sum_{i=1}^K \alpha_i \operatorname{Tr}( S^{-1/2} (\alpha_i \rho_i) S^{-1/2} \rho_i ). \]

Theorem 2.4 (Barnum-Knill Bound on PGM Success). Let \(P_{\text{opt}}\) be the optimal average success probability for distinguishing states \(\{\rho_i\}_{i=1}^K\) with priors \(\{\alpha_i\}_{i=1}^K\). Let \(P_{\text{PGM}}\) be the average success probability achieved by the Pretty Good Measurement. Then \[ P_{\text{opt}}^2 \le P_{\text{PGM}} \le P_{\text{opt}}. \]

Proof. The inequality \(P_{\text{PGM}} \le P_{\text{opt}}\) holds by definition. We now focus on proving \(P_{\text{opt}}^2 \le P_{\text{PGM}}\). Let \(M = \{M_1, \dots, M_K\}\) be an arbitrary POVM. Its success probability is \[ P_{\text{success}}(M) = \sum_{i=1}^K \alpha_i \operatorname{Tr}[M_i \rho_i]. \] Let \(S = \sum_{j=1}^K \alpha_j \rho_j\) be the average state. We rewrite the success probability using the Hilbert-Schmidt inner product \(\langle X, Y \rangle = \operatorname{Tr}[X^\dagger Y]\). Let \(A_i = S^{1/4} M_i S^{1/4}\) and \(B_i = \alpha_i S^{-1/4} \rho_i S^{-1/4}\). Since \(M_i\) and \(\rho_i\) are PSD, and \(S^{\pm 1/4}\) are PSD (on the support of S), \(A_i\) and \(B_i\) are also PSD and thus Hermitian (\(A_i^\dagger=A_i\), \(B_i^\dagger=B_i\)). \begin{align*} P_{\text{success}}(M) &= \sum_{i=1}^K \operatorname{Tr}[M_i (\alpha_i \rho_i)] \\ &= \sum_{i=1}^K \operatorname{Tr}[ S^{1/4} M_i S^{1/4} \cdot S^{-1/4} (\alpha_i \rho_i) S^{-1/4} ] \quad (\text{inserting } S^{1/4}S^{-1/4} = I ) \\ &= \sum_{i=1}^K \operatorname{Tr}[ A_i B_i ] = \sum_{i=1}^K \operatorname{Tr}[ A_i^\dagger B_i ] = \sum_{i=1}^K \langle A_i, B_i \rangle \end{align*} We can view this as the Hilbert-Schmidt inner product \(\langle A, B \rangle\) where \(A = \bigoplus_i A_i\) and \(B = \bigoplus_i B_i\) are block diagonal matrices. Applying the Cauchy-Schwarz inequality \(|\langle A, B \rangle| \le \|A\| \|B\|\): \begin{align*} P_{\text{success}}(M) &\le \|A\| \|B\| \\ &= \sqrt{\operatorname{Tr}[A^\dagger A]} \sqrt{\operatorname{Tr}[B^\dagger B]} \\ &= \sqrt{\sum_{i=1}^K \operatorname{Tr}[A_i^\dagger A_i]} \sqrt{\sum_{i=1}^K \operatorname{Tr}[B_i^\dagger B_i]} \\ &= \sqrt{\sum_{i=1}^K \operatorname{Tr}[A_i^2]} \sqrt{\sum_{i=1}^K \operatorname{Tr}[B_i^2]} \quad (\text{since } A_i, B_i \text{ are Hermitian/PSD}) \end{align*} Let's evaluate the term involving \(B_i\): \begin{align*} \sum_{i=1}^K \operatorname{Tr}[B_i^2] &= \sum_{i=1}^K \operatorname{Tr}[ (\alpha_i S^{-1/4} \rho_i S^{-1/4})^2 ] \\ &= \sum_{i=1}^K \alpha_i^2 \operatorname{Tr}[ S^{-1/4} \rho_i S^{-1/2} \rho_i S^{-1/4} ] \\ &= \sum_{i=1}^K \alpha_i^2 \operatorname{Tr}[ S^{-1/2} \rho_i S^{-1/2} \rho_i ] \\ &= \sum_{i=1}^K \alpha_i \operatorname{Tr}[ (S^{-1/2} \alpha_i \rho_i S^{-1/2}) \rho_i ] \\ &= \sum_{i=1}^K \alpha_i \operatorname{Tr}[ E_i \rho_i ] = P_{\text{PGM}} \quad (\text{by definition of } E_i \text{ and } P_{\text{PGM}}) \end{align*} Now let's bound the term involving \(A_i\): \begin{align*} \sum_{i=1}^K \operatorname{Tr}[A_i^2] &= \sum_{i=1}^K \operatorname{Tr}[ (S^{1/4} M_i S^{1/4})^2 ] \\ &= \sum_{i=1}^K \operatorname{Tr}[ S^{1/4} M_i S^{1/2} M_i S^{1/4} ] \\ &= \sum_{i=1}^K \operatorname{Tr}[ S^{1/2} M_i S^{1/2} M_i ] \end{align*} Let \(Y_i = S^{1/2} M_i S^{1/2}\). Since \(M_i \ge 0\), \(Y_i \ge 0\). Since \(\sum_{j=1}^K M_j = I\), we have \(0 \le M_i \le I\). For any \(Y \ge 0\) and \(0 \le M \le I\), we have \(\operatorname{Tr}[YM] \le \operatorname{Tr}[Y]\). Applying this with \(Y = Y_i\) and \(M = M_i\): \begin{align*} \sum_{i=1}^K \operatorname{Tr}[Y_i M_i] &\le \sum_{i=1}^K \operatorname{Tr}[Y_i] \\ &= \sum_{i=1}^K \operatorname{Tr}[ S^{1/2} M_i S^{1/2} ] \\ &= \operatorname{Tr}[ S^{1/2} \left(\sum_{i=1}^K M_i\right) S^{1/2} ] \\ &= \operatorname{Tr}[ S^{1/2} I S^{1/2} ] \\ &= \operatorname{Tr}[S] = \operatorname{Tr}\left[\sum_{j=1}^K \alpha_j \rho_j\right] = \sum_{j=1}^K \alpha_j \operatorname{Tr}[\rho_j] = \sum_{j=1}^K \alpha_j = 1. \end{align*} So, \(\sum_{i=1}^K \operatorname{Tr}[A_i^2] \le 1\). 
Plugging the bounds \(\sum_{i=1}^K \operatorname{Tr}[A_i^2] \le 1\) and \(\sum_{i=1}^K \operatorname{Tr}[B_i^2] = P_{\text{PGM}}\) back into the Cauchy-Schwarz inequality: \[ P_{\text{success}}(M) \le \sqrt{\sum_{i=1}^K \operatorname{Tr}[A_i^2]} \sqrt{\sum_{i=1}^K \operatorname{Tr}[B_i^2]} \le \sqrt{1} \cdot \sqrt{P_{\text{PGM}}} = \sqrt{P_{\text{PGM}}}. \] Since this holds for any POVM \(M\), it must hold for the optimal POVM that achieves \(P_{\text{opt}}\). Therefore, \( P_{\text{opt}} \le \sqrt{P_{\text{PGM}}}\). \(\square\)

Remark 2.5. Observe that \(1 - P_{\text{PGM}} \le 1 - P_{\text{opt}}^2\). If \(P_{\text{opt}} = 1 - \epsilon_{\text{opt}}\) with small \(\epsilon_{\text{opt}}\), then \(P_{\text{opt}}^2 \approx 1 - 2\epsilon_{\text{opt}}\), so \(1 - P_{\text{PGM}} \le 2 \epsilon_{\text{opt}} = 2(1-P_{\text{opt}})\), i.e. the error of PGM is at most about twice the optimal error.
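Here's a small numpy sketch (my own) that builds the PGM for random states and priors, checks POVM completeness, and sanity-checks the Barnum-Knill bound against the trivial "always guess the likeliest state" baseline (which lower-bounds \(P_{\text{opt}}\)); the dimensions and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
d, K = 3, 4

def psd_sqrt(M):
    evals, evecs = np.linalg.eigh(M)
    return evecs @ np.diag(np.sqrt(np.clip(evals, 0, None))) @ evecs.conj().T

def rand_rho(d):
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    M = G @ G.conj().T
    return M / np.trace(M)

states = [rand_rho(d) for _ in range(K)]
alphas = rng.dirichlet(np.ones(K))                        # random priors summing to 1

S = sum(a * r for a, r in zip(alphas, states))            # the average state
S_inv_sqrt = np.linalg.inv(psd_sqrt(S))                   # S is full rank here, so a true inverse is fine

E = [S_inv_sqrt @ (a * r) @ S_inv_sqrt for a, r in zip(alphas, states)]
print(np.allclose(sum(E), np.eye(d)))                     # POVM completeness: sum_i E_i = I

P_pgm = sum(a * np.trace(Ei @ r).real for a, Ei, r in zip(alphas, E, states))
P_guess = alphas.max()                                    # "always guess the likeliest state" baseline
# Barnum-Knill (Theorem 2.4): P_pgm >= P_opt^2 >= P_guess^2, since the baseline lower-bounds P_opt.
print(P_pgm, P_guess, P_pgm >= P_guess**2 - 1e-9)
```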

Dumb Approaches to Shadow Tomography

Let's now start looking at some dumb approaches to the shadow tomography problem.

Dumb Approach #1: Estimate Each Observable Independently

Theorem 3.1. Suppose each \(A_i\) satisfies \(0 \le A_i \le I\) (so that measurement outcomes lie in \([0,1]\)), and suppose we estimate the true expectation values \(\mu_i := \operatorname{Tr}(\rho A_i)\) independently for each \(i \in \{1, \dots, m\}\) on \(k\) copies of \(\rho\), yielding outcomes \(X_{i,1}, \dots, X_{i,k}\), and forming the empirical mean \( \hat{\mu}_i^{(k)} := \frac{1}{k} \sum_{j=1}^k X_{i,j} \). If \(k \geq \frac{1}{2\varepsilon^2} \ln\left( \frac{2m}{\delta} \right)\) for an error tolerance \(\varepsilon \in (0,1)\) and failure probability \(\delta \in (0,1)\), then the estimates \(\hat{\mu}_i^{(k)}\) satisfy \[ \mathbb{P}\left[ \max_{1 \leq i \leq m} |\hat{\mu}_i^{(k)} - \mu_i| < \varepsilon \right] \geq 1 - \delta. \]

Proof. To estimate \(\mu_i = \operatorname{Tr}(\rho A_i)\), we perform the projective measurement defined by the observable \(A_i\). Since \(A_i\) is Hermitian, it has a spectral decomposition \(A_i = \sum_{j} \lambda_{ij} \Pi_{ij}\), where \(\lambda_{ij}\) are the distinct eigenvalues of \(A_i\); since \(0 \le A_i \le I\), each \(\lambda_{ij} \in [0,1]\). We define our projective measurement to be \(\{\Pi_{ij}\}_{j=1}^r\), where \(r\) is the number of distinct eigenvalues of \(A_i\). Recall that the rank of \(\Pi_{ij}\) is the multiplicity of \(\lambda_{ij}\), and we can write \(\Pi_{ij} = \sum_{k} |\psi_{ijk}\rangle \langle \psi_{ijk}|\) where \(\{\psi_{ijk}\}_k\) are the eigenvectors associated with \(\lambda_{ij}\). This measurement yields \(\lambda_{ij}\) with probability \(p_{ij} = \operatorname{Tr}(\rho \Pi_{ij})\). The expected value of the outcome is \(\sum_{j} \lambda_{ij} p_{ij} = \operatorname{Tr}(\rho A_i)\).

Let \(X_{i,l}\) be the outcome of the \(l\)-th measurement (\(l=1, \dots, k\)). These \(X_{i,l}\) are i.i.d. random variables with \(X_{i,l} \in [0, 1]\) and \(\operatorname*{E}[X_{i,l}] = \mu_i\). The estimator for \(\mu_i\) is the sample mean \( \hat{\mu}_i^{(k)} = \frac{1}{k} \sum_{l=1}^k X_{i,l} \), which is an unbiased estimator. Hoeffding's inequality then gives \[ \mathbb{P}[|\hat{\mu}_i^{(k)} - \mu_i| \geq \varepsilon] \leq 2 \exp\left(-\frac{2 k \varepsilon^2}{(1-0)^2}\right) = 2 e^{-2k\varepsilon^2}. \] Let \(\mathcal{F}\) be the event that at least one estimator fails, i.e., \(|\hat{\mu}_i^{(k)} - \mu_i| \geq \varepsilon\) for some \(i\). A union bound gives \(\mathbb{P}[\mathcal{F}] \le 2m e^{-2k\varepsilon^2}\), which is at most \(\delta\) when \(k \geq \frac{1}{2\varepsilon^2} \ln\left(\frac{2m}{\delta}\right)\). Therefore, the probability that all \(m\) estimators succeed (\(|\hat{\mu}_i^{(k)} - \mu_i| < \varepsilon\) for all \(i\)) is \(\mathbb{P}[\mathcal{F}^c]=1 - \mathbb{P}[\mathcal{F}] \geq 1 - \delta\).

The total number of copies of \(\rho\) consumed is \( O\left(\frac{m}{\varepsilon^2} \log\left(\frac{m}{\delta}\right)\right)\). \(\square\)

Dumb Approach #2: State Tomography

For this approach, we'll output a classical description \(\hat{\rho}\) of the state \(\rho\) so that \(d_{\mathrm{Tr}}(\rho, \hat{\rho}) \leq \epsilon\) and then manually compute the observables on our classical guess of \(\hat{\rho}\).

We start with some elementary facts about Pauli matrices \(I, X, Y\), and \(Z\): \[ I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad X = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad Y = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}, \quad Z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}. \] We'll write n-qubit Pauli matrices as \(\operatorname{Pauli}_n\), where \[ \operatorname{Pauli}_n = \{ P = P_1 \otimes \dots \otimes P_n \mid P_i \in \{I, X, Y, Z\} \}. \]

Proposition 3.2 (Properties of Pauli matrices). The n-qubit Pauli matrices satisfy the following properties:

  1. They are Hermitian with eigenvalues \(\pm 1\).
  2. They square to the identity, i.e., the square of any n-qubit Pauli matrix is the n-qubit identity matrix (\(P^2 = I^{\otimes n}\)).
  3. Each non-identity Pauli matrix is traceless (\(\operatorname{Tr}(P)=0\) if \(P \ne I^{\otimes n}\)).
  4. If \(P\) and \(Q\) are n-qubit Pauli matrices, then \[ \operatorname{Tr}(PQ) = \begin{cases} 2^n & \text{if } P = Q, \\ 0 & \text{otherwise.} \end{cases} \]

Lemma 3.3 (Vector spaces spanned by Pauli matrices). The n-qubit Pauli matrices form bases for the following vector spaces:

  1. the \(4^n\)-dimensional complex vector space \(\mathbb{C}^{2^n \times 2^n}\) of \(2^n \times 2^n\) complex matrices,
  2. the \(4^n\)-dimensional real vector space of \(2^n \times 2^n\) Hermitian matrices.

Let's now solve the n-qubit tomography problem. First, we note that any n-qubit density matrix can be written as \[ \rho = \frac{1}{2^n} \sum_{P \in \operatorname{Pauli}_n} \alpha_P P, \] where each \(\alpha_P = \operatorname{Tr}(P\rho)\) is real (since \(\rho, P\) are Hermitian), \(\alpha_I = \operatorname{Tr}(I^{\otimes n} \rho) = \operatorname{Tr}(\rho) = 1\), and the normalization factor comes from the fact that we require \(\rho\) to have unit trace (\(\operatorname{Tr}(\rho) = \frac{1}{2^n} \sum_P \alpha_P \operatorname{Tr}(P) = \frac{1}{2^n} \alpha_I \operatorname{Tr}(I^{\otimes n}) = \frac{1}{2^n} \alpha_I 2^n = \alpha_I\)).

Algorithm 3.4: The “textbook" tomography algorithm

  1. Fix \(P \in \operatorname{Pauli}_n\) such that \(P \ne I^{\otimes n}\). Measure the observable \(P\) on \(m\) fresh copies of \(\rho\), receiving outcomes \(x_1, \dots, x_m \in \{\pm 1\}\). Set \(\hat{\alpha}_P = \frac{1}{m}(x_1 + \dots + x_m)\), the sample average of these observations.
  2. Repeat for all other non-identity \(P\)'s. (There are \(4^n-1\) such matrices).
  3. Output \(\hat{\rho} = \frac{1}{2^n} \sum_{P \in \operatorname{Pauli}_n} \hat{\alpha}_P P\), where \(\hat{\alpha}_{I^{\otimes n}} = 1\).
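Here's a minimal numpy simulation of Algorithm 3.4 for two qubits (my own sketch: rather than simulating individual basis measurements, it samples the \(\pm 1\) outcome of each Pauli \(P\) directly using \(\mathbb{P}[+1] = (1+\operatorname{Tr}(P\rho))/2\), which gives the same statistics since \(P^2 = I^{\otimes n}\)).

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(5)

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
paulis = {'I': I2, 'X': X, 'Y': Y, 'Z': Z}

def kron_all(mats):
    out = np.array([[1.0 + 0j]])
    for M in mats:
        out = np.kron(out, M)
    return out

n = 2                                                   # number of qubits
d = 2 ** n
G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
rho = G @ G.conj().T
rho /= np.trace(rho)

m = 20000                                               # samples per non-identity Pauli
rho_hat = np.eye(d, dtype=complex) / d                  # the alpha_I = 1 term
for labels in product('IXYZ', repeat=n):
    if labels == ('I',) * n:
        continue
    P = kron_all([paulis[c] for c in labels])
    # Measuring P yields +-1 with P[+1] = (1 + Tr(P rho)) / 2, since P^2 = I^{(x)n}.
    p_plus = float(np.clip((1 + np.trace(P @ rho).real) / 2, 0.0, 1.0))
    outcomes = rng.choice([1.0, -1.0], size=m, p=[p_plus, 1 - p_plus])
    rho_hat += outcomes.mean() * P / d                  # add alpha_hat_P * P / 2^n

tr_dist = 0.5 * np.abs(np.linalg.eigvalsh(rho_hat - rho)).sum()
print(tr_dist)                                          # small, and shrinks as m grows
```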

Theorem 3.5. Set \(m = O(4^n/\epsilon^2)\), so we use \((4^n-1) \cdot m = O(16^n/\epsilon^2)\) copies. Then, \(d_{\mathrm{Tr}}(\rho, \hat{\rho}) \le \epsilon\) with probability at least 99%.

Proof. First, we bound the 2-norm in expectation. Observe that \[ \rho - \hat{\rho} = \frac{1}{2^n} \sum_{P \in \operatorname{Pauli}_n, P \ne I^{\otimes n}} (\alpha_P - \hat{\alpha}_P) P. \] Then, \[ \begin{aligned} \operatorname*{E} [\|\rho - \hat{\rho}\|_2^2] &= \operatorname*{E} [\operatorname{Tr}((\rho - \hat{\rho})^2)] \\ &= \operatorname*{E} \left[ \operatorname{Tr} \left( \left( \frac{1}{2^n} \sum_{P \ne I^{\otimes n}} (\alpha_P - \hat{\alpha}_P) P \right) \left( \frac{1}{2^n} \sum_{Q \ne I^{\otimes n}} (\alpha_Q - \hat{\alpha}_Q) Q \right) \right) \right] \\ &= \frac{1}{4^n} \operatorname*{E} \left[ \sum_{P \ne I^{\otimes n}} \sum_{Q \ne I^{\otimes n}} (\alpha_P - \hat{\alpha}_P)(\alpha_Q - \hat{\alpha}_Q) \operatorname{Tr}(PQ) \right] \\ &= \frac{1}{4^n} \sum_{P \ne I^{\otimes n}} \sum_{Q \ne I^{\otimes n}} \operatorname*{E}[(\alpha_P - \hat{\alpha}_P)(\alpha_Q - \hat{\alpha}_Q)] \operatorname{Tr}(PQ) \end{aligned} \] Since \(\hat{\alpha}_P\) and \(\hat{\alpha}_Q\) are estimated using independent sets of samples for \(P \ne Q\), they are independent random variables. Also \(\operatorname*{E}[\hat{\alpha}_P] = \alpha_P\). Thus \(\operatorname*{E}[\alpha_P - \hat{\alpha}_P] = 0\). For \(P \ne Q\), \(\operatorname*{E}[(\alpha_P - \hat{\alpha}_P)(\alpha_Q - \hat{\alpha}_Q)] = \operatorname*{E}[\alpha_P - \hat{\alpha}_P]\operatorname*{E}[\alpha_Q - \hat{\alpha}_Q] = 0\). Using \(\operatorname{Tr}(PQ) = \delta_{PQ} 2^n\), we get \[ \begin{aligned} \operatorname*{E} [\|\rho - \hat{\rho}\|_2^2] &= \frac{1}{4^n} \sum_{P \ne I^{\otimes n}} \operatorname*{E}[(\alpha_P - \hat{\alpha}_P)^2] \operatorname{Tr}(P^2) \\ &= \frac{1}{4^n} \sum_{P \ne I^{\otimes n}} \operatorname{Var}[\hat{\alpha}_P] (2^n) \\ &= \frac{1}{2^n} \sum_{P \ne I^{\otimes n}} \operatorname{Var}[\hat{\alpha}_P] \end{aligned} \] For each \(P \ne I^{\otimes n}\), the measurement outcomes \(x_i\) have mean \(\alpha_P = \operatorname{Tr}(P\rho)\) and variance \(1 - \alpha_P^2 \le 1\). Thus \(\operatorname{Var}[\hat{\alpha}_P] = \frac{\operatorname{Var}[x_i]}{m} \le \frac{1}{m}\). \[ \begin{aligned} \operatorname*{E} [\|\rho - \hat{\rho}\|_2^2] &\le \frac{1}{2^n} \sum_{P \ne I^{\otimes n}} \frac{1}{m} = \frac{1}{2^n} \frac{4^n - 1}{m} < \frac{4^n}{2^n m}=\frac{2^n}{m}. \end{aligned} \] To convert this 2-norm bound into a 1-norm bound, we proceed as follows. First, observe that \(d=2^n\). Using the result \(\|M\|_1 \le \sqrt{d} \|M\|_2\), \[ \operatorname*{E}[d_{\mathrm{Tr}}(\rho, \hat{\rho})]=\frac{1}{2} \operatorname*{E}[\|\rho - \hat{\rho}\|_1] \le \frac{\sqrt{2^n}}{2} \operatorname*{E}[\|\rho - \hat{\rho}\|_2]. \] Next, by Jensen's inequality, \(\operatorname*{E}[\|A\|_2] \le \sqrt{\operatorname*{E}[\|A\|_2^2]}\), so \[ \operatorname*{E}[d_{\mathrm{Tr}}(\rho, \hat{\rho})] \le \frac{\sqrt{2^n}}{2} \sqrt{\operatorname*{E}[\|\rho - \hat{\rho}\|_2^2]} \le \frac{\sqrt{2^n}}{2} \sqrt{\frac{2^n}{m}}=\frac{2^n}{2\sqrt{m}}=\frac{2^{n-1}}{\sqrt{m}}. \] If we set \(m=C \cdot 4^n/\epsilon^2\) for a sufficiently large constant \(C\), then \(\operatorname*{E}[d_{\mathrm{Tr}}(\rho, \hat{\rho})] \le \epsilon/100\) (say). Markov's inequality then bounds the probability that \(d_{\mathrm{Tr}}(\rho, \hat{\rho})> \epsilon\) by a small constant, e.g., 1%. □

In principle, this algorithm is very simple to carry out: each tensor-product Pauli observable \(P_1 \otimes \dots \otimes P_n\) can be measured by measuring each qubit individually in the corresponding single-qubit Pauli eigenbasis and multiplying the resulting \(\pm 1\) outcomes, which is a relatively simple task. Also, since \(d = 2^n\), we have that the total number of copies is \(O(16^n/\epsilon^2) = O((2^n)^4/\epsilon^2) = O(d^4/\epsilon^2)\), so the algorithm is polynomial in \(d\).

Shadow Tomography via Private Multiplicative Weights

Now let's look at an algorithm based on private multiplicative weights.

Theorem 1.1 (Bădescu & O'Donnell 2020). There is an online algorithm that uses \[ n = O\left(\frac{(\log^{2}m+\log(\frac{\log d}{\delta\epsilon}))(\log d)}{\epsilon^{4}} \cdot \log(\tfrac{\log d}{\delta\epsilon})\right) \] copies of \(\rho\) to output estimates \(\widehat{\mu}_{1}, \widehat{\mu}_{2}, \dots, \widehat{\mu}_{m}\) of observables \(A_{1},A_{2},\dots,A_{m}\in\mathbb{C}^{d\times d}\) such that with probability at least \(1-\delta\), all \(m\) estimates satisfy \(|\widehat{\mu}_{t}-\mu_t|\leq\epsilon\).

This approach reframes shadow tomography as an online learning problem reminiscent of private multiplicative weights. Imagine a student trying to predict the expectation values \(\mu_t = \operatorname*{E}_\rho[A_t]\), while a teacher who has access to the true state \(\rho\) provides feedback when the student makes a significant mistake:

  1. The student uses a mistake-bounded online learning algorithm to guarantee that if the teacher behaves correctly, the student won't make too many "significant" mistakes.
  2. Lemma 1.2 (Matrix Multiplicative Weights). Consider an interaction where a student receives quantum events \(A_t\) (\(0 \le A_t \le I\)) from \(\mathbb{C}^{d \times d}\) and predicts \(\widehat{\mu}_{t}\) for \(\mu_{t}=\operatorname*{E}_{\rho}[A_{t}]\). A teacher observes \(\hat{\mu}_t\) and either "passes" or declares a "mistake" providing a correction \(\mu'_t\). The teacher satisfies: (1) always declares "Mistake" when \(|\widehat{\mu}_{t}-\mu_{t}|>\epsilon\), (2) always passes when \(|\widehat{\mu}_{t}-\mu_{t}|\leq\frac{3}{4}\epsilon\), and (3) provides corrections \(\mu^{\prime}_{t}\) with \(|\mu^{\prime}_{t}-\mu_{t}|\leq\frac{1}{4}\epsilon\). Then there exists a matrix multiplicative weights algorithm that guarantees \(O(\log d/\epsilon^{2})\) mistakes.

  3. The teacher uses a threshold search algorithm to check if the student's prediction \(\hat{\mu}_t\) significantly differs from \(\mu_t\). If so, the teacher provides a high-accuracy estimate \(\mu'_t\).
  4. Lemma 1.3 (Quantum Threshold Search). There is an algorithm that given each input \((A_t, \theta_t)\) either (1) declares that \(|\operatorname*{E}_{\rho}[A_{t}]-\theta_{t}|>\frac{3}{4}\epsilon\) and provides a value \(\mu^{\prime}_{t}\) with \(|\operatorname*{E}_{\rho}[A_{t}]-\mu^{\prime}_{t}|\leq\frac{1}{4}\epsilon\) or (2) moves to the next input; an input will only be passed if \(|\operatorname*{E}_{\rho}[A_{t}]-\theta_{t}|\leq\epsilon\). This algorithm succeeds using \[ n'_{\mathrm{TS}}(m, \epsilon, \delta) = O\left(\frac{\log^{2}m+\log(1/\delta)}{\epsilon^{2}}\cdot \log(1/\delta)\right) \] copies of \(\rho\), with probability at least \(1-\delta\).

Full algorithm. The overall algorithm uses \(n\) copies of \(\rho\), divided into \(R\) batches. Each batch is reserved for the teacher to use during one "round" of interaction. Formally:

  1. Prepare \(n = R n_0\) copies of \(\rho\), partitioned into batches \(\mathcal{B}_1, \dots, \mathcal{B}_R\). Let \(r = 1\) and initialize the Student and the Teacher.
  2. For \(t = 1, 2, \dots, m\):
    1. Receive observable \(A_t\) and run the Student to get its prediction \(\widehat{\mu}_{t}\).
    2. Run the Teacher with input \((A_t, \theta_t = \widehat{\mu}_t)\) using \(\mathcal{B}_r\).
    3. If the Teacher passes, we output \(\widehat{\mu}_{t}\) as the estimate for \(\mu_t = \operatorname*{E}_\rho[A_t]\) (the Teacher's guarantees then ensure \(|\widehat{\mu}_{t}-\mu_t|\leq\epsilon\)) and inform the Student that it passed.
    4. If the Teacher declares \(|\operatorname*{E}_{\rho}[A_{t}]-\widehat{\mu}_{t}|>\frac{3}{4}\epsilon\), it provides us a value \(\mu^{\prime}_{t}\) with \(|\operatorname*{E}_{\rho}[A_{t}]-\mu^{\prime}_{t}|\leq\frac{1}{4}\epsilon\). We output \(\mu^{\prime}_{t}\), inform the Student that a mistake occurred, and pass along \(\mu^{\prime}_{t}\). We then discard batch \(\mathcal{B}_r\) and increment \(r \leftarrow r + 1\).
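To make the Student's side concrete, here's a toy numpy sketch of a matrix-multiplicative-weights learner in the spirit of Lemma 1.2. It is not the Bădescu-O'Donnell algorithm: the teacher here is an idealized classical oracle that peeks at \(\rho\) (rather than running threshold search on copies), and the particular loss, step size, and parameters are choices I made for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
d, eps, eta = 8, 0.1, 0.1            # dimension, accuracy, learning rate (eta on the order of eps is typical)

def rand_rho(d):
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    M = G @ G.conj().T
    return M / np.trace(M)

def rand_event(d):
    """A random 'quantum event' A with 0 <= A <= I."""
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    M = G @ G.conj().T
    return M / np.linalg.eigvalsh(M).max()

def expm_hermitian(H):
    evals, evecs = np.linalg.eigh(H)
    return evecs @ np.diag(np.exp(evals)) @ evecs.conj().T

rho = rand_rho(d)                    # the unknown state (this toy teacher gets to peek at it)
loss_sum = np.zeros((d, d), dtype=complex)
mistakes = 0

for t in range(500):
    A = rand_event(d)
    omega = expm_hermitian(-eta * loss_sum)
    omega /= np.trace(omega)                         # Student's current hypothesis state
    mu_hat = np.trace(A @ omega).real                # Student's prediction
    mu_true = np.trace(A @ rho).real
    if abs(mu_hat - mu_true) > eps:                  # idealized Teacher: flags real mistakes
        mu_corr = mu_true                            # and hands back a perfect correction
        loss_sum += np.sign(mu_hat - mu_corr) * A    # subgradient of w -> |Tr(A w) - mu'|
        mistakes += 1

print(mistakes)   # rounds with a Teacher intervention; Lemma 1.2 bounds such counts by O(log d / eps^2)
```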

Quantum Threshold Search

Let's first reduce the Quantum Threshold Search problem to a simpler variant.

Lemma 1.4 (Simpler Version of Lemma 1.3). Given \(A_1, \dots, A_m\) online where \(A_i\) are projectors, your algorithm must halt on any \(A_t\) where \(\operatorname*{E}_{\rho}[A_{t}] > 3/4\). If all \(\operatorname*{E}_{\rho}[A_{t}] \le 1/4\), your algorithm must pass all inputs. You can do this with \(1-\delta\) success probability using \(n_{\mathrm{TS}}(m, \delta) = O(\log^{2}m)\) copies of \(\rho\).

We can go from Lemma 1.4 to Lemma 1.3 with just a few reductions.

  • Focus on Thresholds. We can always use a holdout set of \(N_{holdout} = O(\log(1/\delta)/\epsilon^{2})\) copies of \(\rho\) to output a \(\mu^{\prime}_{t}\), so we'll instead focus on knowing when to output \(\mu^{\prime}_{t}\).
  • Reduce to Projectors: Using Neumark's Theorem, any quantum event \(A_i\) on \(\mathbb{C}^d\) can be associated with a projector \(\Pi_i\) on \(\mathbb{C}^d \otimes \mathbb{C}^2\) such that \(\operatorname*{E}_{\rho \otimes \ketbra{0}{0}}[\Pi_i] = \operatorname*{E}_\rho[A_i]\). We can work with the state \(\rho' = \rho \otimes \ketbra{0}{0}\) and projectors \(\Pi_i\) instead of \(A_i\). This only increases the dimension by a factor of 2, which doesn't affect the complexity bounds that are logarithmic in \(d\). So, we assume \(A_i\) are projectors WLOG.
  • Reduce to Thresholds: We can reduce from distinguishing whether \(|\operatorname*{E}_{\rho}[A_{t}]-\theta_{t}|>\frac{3}{4}\epsilon\) or \(|\operatorname*{E}_{\rho}[A_{t}]-\theta_{t}|\leq\frac{1}{4}\epsilon\) to whether \(\operatorname*{E}_{\rho}[A_{t}'] \geq \theta_t'\) or \(\operatorname*{E}_{\rho}[A_{t}'] < \theta_t' - \tfrac14 \epsilon\) where \[(A_1', \theta_1'), \dots, (A_{2m}', \theta_{2m}')=(A_1, \theta_1), (I - A_1, 1 - \theta_1), \dots, (A_m, \theta_m), (I - A_m, 1- \theta_m).\] We can further reduce to fixed thresholds where \(\epsilon < \tfrac 12\) and \(\theta_t=1/4\) via the Quantum Chernoff Bound, which constructs new projectors \(B_i\) acting on \(n_0=O(1/\epsilon^2)\) copies of \(\rho\) such that:
    • If \(\operatorname*{E}_\rho[A_i] > \theta_i\), then \(\operatorname*{E}_{\rho^{\otimes n_0}}[B_i] > 3/4\).
    • If \(\operatorname*{E}_\rho[A_i] \le \theta_i - \epsilon\), then \(\operatorname*{E}_{\rho^{\otimes n_0}}[B_i] \le 1/4\).
  • Reduce to Constant Probability: It suffices to have threshold detection that succeeds with small constant probability (e.g., 0.01). We can run the constant-probability algorithm \(L = O(\log(1/\delta))\) times in parallel. Whenever an algorithm claims a threshold violation on \(A_t\), we can just use a failsafe check to verify the claim to high precision \(< \epsilon\).

Stable Threshold Reporting

Before we proceed to a proof of Lemma 1.4, we start with a classical result: "adding exponential noise provides a (composably) \(\chi^{2}\)-stable mechanism for reporting if a distribution’s mean is above a given threshold".

Lemma 2.1 (\(\chi^{2}\)-stable Threshold Reporting). Let \(S\sim\operatorname{Binomial}(n,p)\). Assume that \(X\) is an Exponential RV with mean \(\operatorname*{E}[X] \ge \operatorname{stddev}[S] \ge 1\). Let \(B\) be the event that \(S+X > \theta n\), and assume that \(\mathbb{P}[B] < \frac{1}{4}\). Then \[ d_{\mathrm{\chi^{2}}}((S\mid\overline{B}),S) \lesssim\left(\mathbb{P}[B]\cdot\frac{\operatorname{stddev}[S]}{\operatorname*{E}[X]}\right)^{2}\leq\mathbb{P}[B]^{2}\cdot(n/\operatorname*{E}[X]^{2}). \]

Proof. Let \(\lambda = 1/\operatorname*{E}[X]\) and \(\sigma = \operatorname{stddev}[S]\). We'll also let \[f(s) = \mathbb{P}[B \mid S=s] = \mathbb{P}[X > \theta n - s] = \min(1, e^{-\lambda (\theta n - s)}) .\]

The \(\chi^2\) distance is related to the variance of \(f(S)\) by \(d_{\mathrm{\chi^{2}}}((S\mid\overline{B}),S) = \operatorname*{Var}[f(S)] / \mathbb{P}[\overline{B}]^2\). Since \(\mathbb{P}[B] < 1/4\), we have \(\mathbb{P}[\overline{B}]^2 \gtrsim 1\) and it suffices to show \(\operatorname*{Var}[f(S)] \lesssim (\mathbb{P}[B] \sigma \lambda)^2\).

Next, we bound the variance using an auxiliary function. Let \(g(s) = e^{-\lambda\theta n} e^{\lambda s}\). Because \(f(s) = \min(1, g(s))\) and the function \(y \mapsto \min(1, y)\) is 1-Lipschitz, the variance does not increase: \(\operatorname*{Var}[f(S)] \le \operatorname*{Var}[g(S)]\).

Next, we'll write the variance in terms of the MGF of \(S\), \(M(t) = \operatorname*{E}[e^{tS}] = (q+pe^t)^n\) where \(q = 1-p\): \[ \operatorname*{Var}[g(S)] = \operatorname*{E}[g(S)^2] - (\operatorname*{E}[g(S)] )^2 = (\operatorname*{E}[g(S)] )^2 \left( \frac{M(2\lambda)}{M(\lambda)^2} - 1 \right). \]

The MGF ratio can be bounded using our assumption that \(\sigma\lambda \le 1\): \[ \frac{M(2\lambda)}{M(\lambda)^2} - 1 = \left(\frac{q+pe^{2\lambda}}{(q+pe^{\lambda})^2}\right)^n - 1 \lesssim (\sigma\lambda)^2. \] Substituting this into the variance expression gives \( \operatorname*{Var}[g(S)] \lesssim (\operatorname*{E}[g(S)] )^2 (\sigma\lambda)^2 \). The crucial step remaining is to show \(\operatorname*{E}[g(S)] \lesssim \mathbb{P}[B]\).

  • Case 1: \(p \ge 1/n\) (Concentrated Regime). Here \(S\) concentrates near its mean \(np \ge 1\), with \(\mathbb{P}[S \approx np] \gtrsim 1\). We lower bound \(\mathbb{P}[B] = \operatorname*{E}[f(S)]\) using these typical values of \(S\), yielding \(\mathbb{P}[B] \gtrsim e^{-\lambda(\theta-p)n}\) (since \(\mathbb{P}[B]<1 /4\) implies \(\theta n> np\)). To show \(\operatorname*{E}[g(S)] = e^{-\lambda\theta n} M(\lambda) \lesssim \mathbb{P}[B]\), we use a standard MGF bound \(M(\lambda) \lesssim e^{\lambda pn}\) that reflects the concentration near \(np\).
  • Case 2: \(p < 1/n\) (Sparse Regime). Here \(S\) is typically 0, with \(\mathbb{P}[S=0] = (1-p)^n \gtrsim 1\). We lower bound \(\mathbb{P}[B]\) using the event \(\{S=0\}\), giving \[\mathbb{P}[B] \gtrsim \mathbb{P}[S=0]\mathbb{P}[X > \theta n] \gtrsim e^{-\lambda\theta n}.\] To show \(\operatorname*{E}[g(S)] = e^{-\lambda\theta n} M(\lambda) \lesssim \mathbb{P}[B]\), we now need \(M(\lambda) \lesssim 1\). This also holds via standard bounds, consistent with \(S\) being near 0.
In both regimes, \(\operatorname*{E}[g(S)] \lesssim \mathbb{P}[B]\). Finally, we combine the bounds: \[ d_{\mathrm{\chi^{2}}}((S\mid\overline{B}),S) \lesssim \operatorname*{Var}[f(S)] \le \operatorname*{Var}[g(S)] \lesssim (\operatorname*{E}[g(S)] )^2 (\sigma\lambda)^2 \lesssim (\mathbb{P}[B])^2 (\sigma\lambda)^2. \] Recalling \(\sigma = \operatorname{stddev}[S]\) and \(\lambda = 1/\operatorname*{E}[X]\), this yields the first inequality: \[ d_{\mathrm{\chi^{2}}}((S\mid\overline{B}),S) \lesssim \left(\mathbb{P}[B]\cdot\frac{\operatorname{stddev}[S]}{\operatorname*{E}[X]}\right)^{2}. \] The second inequality follows directly from \((\sigma\lambda)^2 = npq\lambda^2 \le n\lambda^2\). \(\square\)
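Since everything in Lemma 2.1 is classical, we can check it numerically. Here's a small sketch (my own, with arbitrary parameters satisfying the lemma's assumptions) that computes \(d_{\chi^2}((S\mid\overline{B}), S)\) exactly from the binomial pmf and compares it to \((\mathbb{P}[B]\cdot\operatorname{stddev}[S]/\operatorname*{E}[X])^2\); recall the \(\lesssim\) hides a constant.

```python
import numpy as np
from math import comb

n, p, theta = 200, 0.3, 0.5
mean_X = 20.0                               # E[X] >= stddev[S] >= 1, as the lemma assumes
lam = 1.0 / mean_X

s_vals = np.arange(n + 1)
pmf = np.array([comb(n, int(k)) * p**k * (1 - p)**(n - k) for k in s_vals])

# f(s) = P[B | S = s] = P[X > theta*n - s] = min(1, exp(-lam*(theta*n - s))).
f = np.minimum(1.0, np.exp(-lam * (theta * n - s_vals)))
P_B = float(pmf @ f)                        # P[B] = E[f(S)]; below 1/4 for these parameters

# Law of S conditioned on "not B", and its chi^2 divergence from the law of S.
pmf_given_notB = pmf * (1 - f) / (1 - P_B)
chi2 = float(((pmf_given_notB - pmf) ** 2 / pmf).sum())

stddev_S = np.sqrt(n * p * (1 - p))
bound = (P_B * stddev_S / mean_X) ** 2      # the lemma's right-hand side, up to its hidden constant
print(P_B, chi2, bound)                     # chi2 and bound come out the same order of magnitude
```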

We now prove an extension of this result to quantum events.

Lemma 2.2 (Quantum Threshold Reporting). Fix \(\rho\in\mathbb{C}^{d\times d}\) and projector \(A\in\mathbb{C}^{d\times d}\), and let \(S \sim \mathrm{Binomial}(n, \operatorname*{E}_{\rho}[A])\) and \(X \sim \mathrm{Exponential}(\lambda)\). Suppose \(S\) and \(X\) satisfy the assumptions of Lemma 2.1. Then there exists a quantum event \(B\in(\mathbb{C}^{d\times d})^{\otimes n}\) such that \(\operatorname*{E}_{\rho^{\otimes n}}[B]=\mathbb{P}[{ S}+{X} > \theta n]\) and \[ d_{\mathrm{Bures}}\left(\rho^{\otimes n},\rho^{\otimes n}|_{\sqrt{I-B}}\right) \lesssim\operatorname*{E}_{\rho^{\otimes n}}[B]\cdot \frac{\operatorname*{stddev}[{S}]}{\operatorname*{ E}[{X}]}. \] Moreover, \( \operatorname*{E}_{\rho^{\otimes n}}[B] \leq\exp(-n\lambda(\theta-(e-1)\operatorname*{E}_{\rho}[ A])) \).

Proof. We will take for granted that there exists an event \(B\) that exhibits the same statistics as the classical event \({S}+{X} > \theta n\) from Lemma 2.1. Specifically, \(\operatorname*{E}_{\rho^{\otimes n}}[B]=\mathbb{P}[{S}+{X} > \theta n]\) and \(\operatorname{F}\left(\rho^{\otimes n},\rho^{\otimes n}|_{\sqrt{I-B}}\right)=\operatorname{BC}((S\mid{S}+{X}\leq\theta n),S)\). Note that \(1-\operatorname{BC}(\mu,\nu)\leq d_{\mathrm{\chi^{2}}}(\mu,\nu)\). By Lemma 2.1, \[ \sqrt{d_{\mathrm{\chi^{2}}}((S\mid\overline{B}),S)}\lesssim \mathbb{P}[B]\cdot\frac{\operatorname*{stddev}[{S}]}{\operatorname*{E}[{X}]}\leq \mathbb{P}[B]\cdot\frac{\sqrt{n}}{\operatorname*{E}[{X}]}. \] Thus, \[ d_{\mathrm{Bures}}\left(\rho^{\otimes n},\rho^{\otimes n}|_{\sqrt{I-B}}\right) =\sqrt{2\left( 1-\operatorname{F}\left(\rho^{\otimes n},\rho^{\otimes n}|_{\sqrt{I-B}}\right)\right)} \] \[ =\sqrt{2(1-\operatorname{BC}((S\mid S + X\leq\theta n),S))} =\sqrt{2}\, d_{\mathrm{H}}((S\mid S+ X\leq\theta n),S) \] \[ \leq\sqrt{2\, d_{\mathrm{\chi^{2}}}((S\mid S+ X\leq\theta n),S)} \lesssim\operatorname*{E}_{\rho^{\otimes n}}[B]\cdot \frac{\operatorname*{stddev}[{S}]}{\operatorname*{ E}[{X}]}. \] Finally, writing \(p = \operatorname*{E}_{\rho}[A]\) and using \(\operatorname*{E}_{\rho^{\otimes n}}[B]=\mathbb{P}[{S}+{X} > \theta n]\) together with \(\mathbb{P}[X > t] \le e^{-\lambda t}\) (valid for all \(t\), trivially when \(t < 0\)) and \(e^{\lambda}-1 \le (e-1)\lambda\) for \(\lambda \le 1\), \[ \operatorname*{E}_{\rho^{\otimes n}}[B] \leq\operatorname*{E}[\exp(-\lambda(\theta n-{S}))] \le \exp(-n\lambda(\theta-(e-1)p)). \] \(\square\)

Lemma 1.4. Given \(A_1, \dots, A_m\) online where \(A_i\) are projectors, your algorithm must halt on any \(A_t\) where \(\operatorname*{E}_{\rho}[A_{t}] > 3/4\). If all \(\operatorname*{E}_{\rho}[A_{t}] \le 1/4\), your algorithm must pass all inputs. You can do this with \(1-\delta\) success probability using \(n_{\mathrm{TS}}(m, \delta) = O(\log^{2}m)\) copies of \(\rho\).

The core idea is to sequentially test each incoming projector \(A_t\) using a specially constructed "noisy" measurement derived from Lemma 2.2. This measurement involves adding classical exponential noise to the outcome statistics. The noise ensures that if we pass the test (outcome \(\overline{B}_t\)), the quantum state is only slightly disturbed, allowing for reuse. If we fail the test (outcome \(B_t\)), it indicates \(A_t\) likely has a high expectation, and we halt.

Proof. We take \(n = C\log^2 m\) copies for a sufficiently large constant \(C\). The exponential noise \(X\) used in constructing the test events \(B_t\) has mean \(\operatorname*{E}[{X}] = \Theta(\sqrt{n})\). We'll set the threshold for the noisy test to \(\theta = 2/3\) and a parameter \(\eta = 0.01\).

Algorithm. Initialize the state to \(\varrho_{0}=\rho^{\otimes n}\). For each \(A_t\), Lemma 2.2 provides a quantum event \(B_t\) acting on \(\rho^{\otimes n}\), mimicking \(S_t + X > \theta n\) where \(S_t \sim \mathrm{Binomial}(n, \operatorname*{E}_\rho[A_t])\). We'll accordingly, for each \(t=1, \dots, m\), measure \((B_t, \overline{B}_t)\) on \(\varrho_{t-1}\). If outcome \(\overline{B}_t\) occurs, update the state to \(\varrho_t = \varrho_{t-1}|_{\overline{B}_t}\) and continue. If outcome \(B_t\) occurs, halt and output \(t\).

We aim to show the probability of halting at a "good" index (\(\operatorname*{E}_{\rho}[A_{t}]\geq 1/3\)) is at least \(0.01\). First, we note that if a high-expectation projector exists, the cumulative probability of passing must eventually drop significantly.

Claim 3.1. Suppose that \(\operatorname*{E}_{\rho}[A_k] \ge 3/4\) for some \(k\). Then there exists a step \(t \in [m]\) such that the probability of passing the first \(t\) tests (let this probability be \(q_t\)) is at most 4/5. Furthermore, if \(t\) is the first such index and \(t>1\), then \(q_{t-1} > 3/4\), and the sum of the initial probabilities \(\sum_{i=1}^{t-1} \operatorname*{E}_{\rho^{\otimes n}}[B_{i}]\) is at most 1/4.

Second, we note that the noisy test \(B_i\) is designed to have a very low chance of succeeding if the underlying expectation \(\operatorname*{E}_\rho[A_i]\) is significantly below the threshold \(\theta\).

Claim 3.2. If \(\operatorname*{E}_{\rho}[A_{i}] < 1/3\) (a "bad" projector), then the initial probability of the corresponding test \(B_i\) succeeding is very small: \(\operatorname*{E}_{\rho^{\otimes n}}[B_{i}] \le (\eta/m)^2\).

Now, let's use these claims. Let \(t\) be the minimal index from Claim 3.1 such that \(q_t \le 4/5\). The algorithm halts by step \(t\) with probability \(1 - q_t \ge 1/5\). We'll take for granted the following Quantum union-bound: \[ 1 \leq\sqrt{q_{t}}\operatorname{F}(\rho^{\otimes n},\varrho_{t})+ \sum_{i=1}^{t}\sqrt{s_{i}}\sqrt{p_{i}}. \] Here, \(\varrho_t\) is the state after \(t\) passes, \(s_i\) is the probability of halting exactly at step \(i\), and \(p_i = \operatorname*{E}_{\rho^{\otimes n}}[B_{i}]\) is the initial success probability of test \(B_i\). The first term is easiest to handle: since \(\operatorname{F}(\cdot) \le 1\) and \(q_t \le 4/5\), the first term is \(\le \sqrt{4/5}\).

For the second term, let's isolate the "bad" indices \(\mathcal{B} = \{i \in [t] \mid \operatorname*{E}_{\rho}[A_{i}] < 1/3\}\). We can use Claim 3.2 for the bad indices as \(p_i \le (\eta/m)^2\) for \(i \in \mathcal{B}\) and thus \(\sum_{i\in\mathcal{B}}\sqrt{s_{i}} (\eta/m) \le \eta\). For the sum over good indices (\(i \notin \mathcal{B}\)), let's write \(P_{good} :=\sum_{i\not\in\mathcal{B}} s_i\) and apply Cauchy-Schwarz: \( \sum_{i\not\in\mathcal{B}}\sqrt{s_{i}}\sqrt{p_{i}} \le \sqrt{P_{good}} \sqrt{\sum_{i\not\in\mathcal{B}} p_i} \). Since Claim 3.1 ensures that \(\sum_{i\not\in\mathcal{B}, i \le t} p_i \le 1/4\), we have: \[ \sum_{i\not\in\mathcal{B}}\sqrt{s_{i}}\sqrt{p_{i}} \le \sqrt{P_{good}} \sqrt{1/4}=\frac{1}{2}\sqrt{P_{good}}. \]

Since \(1 \le \sqrt{q_t}\operatorname{F}(\cdot) + \sum_{i \in \mathcal{B}} \sqrt{s_i p_i} + \sum_{i \notin \mathcal{B}} \sqrt{s_i p_i}\), we have \(1 \le \sqrt{4/5} + \eta + \frac{1}{2}\sqrt{P_{good}}\). A careful calculation then shows that \(P_{good} \ge 0.0365\). \(\square\)

The only things left are Claims 3.1 and 3.2. Claim 3.2 follows directly from the tail bound in Lemma 2.2: if \(\operatorname*{E}_{\rho}[A_i] < 1/3\) and \(\theta = 2/3\), then \(\operatorname*{E}_{\rho^{\otimes n}}[B_i] \le \exp(-\Omega(n\lambda)) = \exp(-\Omega(\sqrt{n}))\), which is at most \((\eta/m)^2\) once \(n = C\log^2 m\) for a large enough constant \(C\). Proving Claim 3.1 is where we apply our guarantees for quantum threshold reporting.

Proof of Claim 3.1. Recall from Lemma 2.2 that \(p_i = \operatorname*{E}_{\rho^{\otimes n}}[B_{i}] = \mathbb{P}[S_i + X > \theta n]\), where \(S_i \sim \mathrm{Binomial}(n, \operatorname*{E}_{\rho}[A_i])\) and \(\theta=2/3\). We are given that there exists some \(k \in [m]\) such that \(\operatorname*{E}_{\rho}[A_k] \ge 3/4\). Since \(3/4 > \theta\), the mean of \(S_k\) is significantly larger than \(\theta n\). For sufficiently large \(n\), Hoeffding's inequality gives \(\mathbb{P}[S_k > \theta n] \ge 1 - \exp(-1/4)\). Since adding the non-negative noise \(X\) only increases the value, we have \[ p_k = \mathbb{P}[S_k + X > \theta n] \ge \mathbb{P}[S_k > \theta n] \ge 1 - \exp(-1/4). \]

Now, consider the sequence of products \( \prod_{i=1}^j (1-p_i) \) for \(j=1, \dots, m\). Since \(p_k\) is close to 1, this product must eventually become small. Let \(t \in [m]\) be the minimal index such that \[ (1-p_1)(1-p_2)\dotsm(1-p_t) \le \exp(-1/4). \] Such a \(t\) must exist because \((1-p_k) \le \exp(-1/4)\).

We now need to relate \(q_t\), the actual probability of passing the first \(t\) tests when run sequentially, to this product of individual probabilities \(\prod_{i=1}^t (1-p_i)\). Let \(\varrho_i\) be the state after passing the first \(i\) tests. Let's again take for granted the quantum union bound, which gives: \[ \left| q_t - \prod_{i=1}^t (1-p_i) \right| \le 2 \sum_{i=1}^{t-1} \left( \prod_{j=1}^i (1-p_j) \right) d_{\mathrm{Tr}}(\rho^{\otimes n}, \varrho_i / q_i). \] (Note: \(\varrho_i / q_i\) is the normalized state after passing \(i\) tests). From Lemma 2.2 and the fact \(d_{\mathrm{Tr}} \le d_{\mathrm{Bures}}\), we have \[ d_{\mathrm{Tr}}(\rho^{\otimes n}, \varrho_i / q_i) \le d_{\mathrm{Bures}}(\rho^{\otimes n}, \varrho_i / q_i) \lesssim p_i \cdot \frac{\operatorname*{stddev}[S_i]}{\operatorname*{E}[X]} \le O\left( p_i \frac{\sqrt{n}}{\operatorname*{E}[X]} \right). \] Since \(\prod_{j=1}^i (1-p_j) \le 1\), the bound becomes \[ \left| q_t - \prod_{i=1}^t (1-p_i) \right| \le \sum_{i=1}^{t-1} O\left( p_i \frac{\sqrt{n}}{\operatorname*{E}[X]} \right) = O\left( \frac{\sqrt{n}}{\operatorname*{E}[X]} \right) \sum_{i=1}^{t-1} p_i. \]

Consider two cases for the minimal \(t\):

  1. Case \(t=1\): The condition is \((1-p_1) \le \exp(-1/4)\). The bound above gives \(|q_1 - (1-p_1)| = 0\). Thus \(q_1 = 1-p_1 \le \exp(-1/4) < 4/5\), so the claim holds (its “furthermore" clause is vacuous when \(t = 1\)).
  2. Case \(t > 1\): By minimality of \(t\), we have \((1-p_1)\dotsm(1-p_{t-1}) > \exp(-1/4)\). Using the inequality \(1-x \le e^{-x}\), \[ \exp\left(-\sum_{i=1}^{t-1} p_i\right) \ge \prod_{i=1}^{t-1} (1-p_i) > \exp(-1/4). \] This implies \(\sum_{i=1}^{t-1} p_i < 1/4\). Now substitute this into our error bound: \[ \left| q_t - \prod_{i=1}^t (1-p_i) \right| \le O\left( \frac{\sqrt{n}}{\operatorname*{E}[X]} \right) \cdot \frac{1}{4}. \] By assumption, we chose \(\operatorname*{E}[X]=\Omega(\sqrt{n})\). By choosing the constant in the \(\Omega\) sufficiently large, we can ensure this error term is small, say \( \le 4/5 - \exp(-1/4) \approx 0.021\). Since \(\prod_{i=1}^t (1-p_i) \le \exp(-1/4)\), we get \[ q_t \le \prod_{i=1}^t (1-p_i) + (\text{error}) \le \exp(-1/4) + (4/5 - \exp(-1/4))=4/5. \] We can similarly bound \(q_{t-1} \geq 3/4\).

In both cases \(q_t \le 4/5\), and when \(t > 1\) we additionally have \(q_{t-1} \ge 3/4\) and \(\sum_{i=1}^{t-1} p_i \le 1/4\), which establishes the claim. \(\square\)


References

  1. Wright, J. (2024). Quantum Learning Theory Lecture Notes 1-6. Link
  2. Aaronson, S. (2017). Shadow Tomography of Quantum States. Link
  3. Bădescu, C., & O’Donnell, R. (2020). Improved quantum data analysis. TheoretiCS, Volume 3, Article 7. Link

Thanks for reading! Anonymous feedback can be left here. Feel free to reach out if you think there's something I should add or clarify.