It is argued from several points of view that quantum probabilities might play a role in statistical settings. New approaches toward quantum foundations have postulates that appear to be equally valid in macroscopic settings. One such approach is described here in detail, while two others are briefly sketched. In particular, arguments behind the Born rule, which gives the basis for quantum probabilities, are given. A list of ideas for possible statistical applications of quantum probabilities is provided and discussed. A particular area is machine learning, where there exists substantial literature on links to quantum probability. Here, an idea about model reduction is sketched and is motivated from a quantum probability model. Quantum models can play a role in model reduction, where the partial least squares regression model is a special case. It is shown that for certain experiments, a Bayesian prior given by a quantum probability can be motivated. Quantum decision theory is an emerging discipline that can be motivated by this author’s theory of quantum foundations.
Corresponding author: Inge S. Helland, ingeh@math.uio.no
The basis for nearly all articles in theoretical and applied statistics is Kolmogorovian probability[1]. Quantum probability is mostly looked upon by statisticians as an exotic tool with no relevance for statistical science and statistical inference. This implies that, in the statistical literature, there is very little discussion of possible links between statistical theory and quantum theory. (A good exception is the article by Barndorff-Nielsen et al.[2], where quantum versions of exponential models, sufficiency, and Fisher information are discussed, among other related topics.)
In my opinion, this has limited statistical science, from both a theoretical and an applied point of view. I relate this statement to the idea that traditional statistical theory, at least in most cases, looks upon the parameter space as a set with no structure. Introducing more structure into the parameter space will give a richer theory, and I will show in the next Section that some such structures indeed lead to quantum probabilities. Later in this article, I will discuss a specific structure, namely symmetry introduced by letting a concrete group act on the parameter space, and its implications for model reduction.
A discussion of links between quantum theory and statistical theory seems overdue. This can be underpinned by looking at the current revolution in artificial intelligence, in particular machine learning, an area that engages more and more statisticians. It is probably not well known among statisticians that connections between machine learning and quantum mechanics have already been discussed in the literature; see, for instance, the review article by Dunjko & Briegel[3]. If we really mean that machine learning should be based upon statistics, connections between statistical theory and quantum theory should also be of high interest.
One can then ask: Are quantum probabilities not relevant only to the microcosmos? My answer is no. To argue for this answer, one can look at recent derivations of quantum mechanics from various postulates. In two books[4][5] and in a series of articles[6][7][8][9][10][11][12][13][14][15], this author has proposed a completely new foundation of quantum theory and, in this connection, also discussed the interpretation of the theory. This foundation is summarized in Subsections 2.2 and 2.3 below, where the essence of quantum theory is deduced from 7 postulates. Most of these postulates may be seen to be equally valid in a macroscopic setting.
Another basis for quantum theory is given in a series of articles by De Raedt and his collaborators. One of these is De Raedt et al.[16]. Here, essential elements of quantum mechanics are derived from what is called logical inference to experiments, an assumption that there are uncertainties about individual events and that the frequencies are robust with respect to small changes in the parameters. A basic tool in the derivations is that of Fisher information. Fisher information as a general basis for nearly all aspects of science has earlier been advocated by Frieden[17].
Finally, it is of interest in this connection to look at the quantum-like models introduced by Khrennikov and collaborators; see Khrennikov[18][19] and Haven & Khrennikov[20][21]. These models are based upon quantum probabilities and are applied to several sciences, including cognitive psychology, sociology, finance, and biology. An important element here is a theory of quantum decisions; see also Helland[10][14].
While there are few examples of statistical papers related to quantum probability, there are several articles in the quantum theory literature that use basic statistical concepts and ideas. Much of the discussion among theoretical physicists concerns the interpretation of quantum theory. Some still want to look upon it as an ontological theory, but more and more physicists now conclude that quantum mechanics should be interpreted as an epistemic or epistemological theory: It is a theory of our knowledge of the world, in the same way that statistical theory is basically about knowledge from experiments and observations. A strong school in the quantum community is QBism, originally an abbreviation for Quantum Bayesianism, but now claimed by its founders to have a somewhat wider basis; see Caves et al.[22] and Fuchs[23][24]. A very recent article on Bayes’ rule and related inference in quantum mechanics is Liu[25].
The plan of this article is as follows: In Section 2, my own basis for quantum theory is discussed in some detail, together with other bases. One aim is to motivate the use of quantum probabilities in macroscopic settings. In Section 3, several potential ideas for the use of quantum probabilities in statistical settings are discussed. This includes model reductions, quantum probabilities as priors for certain experiments, decision theory, and machine learning. The final section gives some general discussion points.
The traditional approach towards quantum theory, as found in numerous textbooks, is rather formal. To understand the main message of this article, it is not necessary to read this subsection in complete detail, but some of the elements discussed here will be used later.
As a start, quantum mechanics is a nearly 100-year-old theory, whose formal foundation is given by von Neumann[26]. That book summarizes the efforts of many prominent physicists at the beginning of the last century, physicists whose names need not be mentioned here. The book by von Neumann has, in turn, inspired a large literature, consisting of research articles, monographs, textbooks, and more popular books.
A thorough and more modern formal treatment is given by Hall[27]. The basic mathematical concept underlying all the formal literature on quantum mechanics is that of a Hilbert space, a complex linear space of vectors admitting a scalar product. The theory is much simpler when the dimension of the Hilbert space is finite; the space can then be taken to be ℂ^r for some r, where ℂ is the space of complex numbers. Physical states are represented by normalized vectors in the Hilbert space, and physical variables are represented by operators (in the finite-dimensional case, matrices).
Hall[27] gives axioms behind quantum mechanics that are mathematically precise, also for the case where the operators involved are unbounded, which implies for an operator A that it is only defined on a subset D=D_A of the Hilbert space. For this case, the adjoint A† of the operator A is defined as follows: A vector |ϕ⟩ belongs to the domain D† of A† if the linear functional ⟨ϕ,A⋅⟩ defined on D is bounded, and for |ϕ⟩∈D†, |ξ⟩=A†|ϕ⟩ is defined as the unique vector such that ⟨ξ,ψ⟩=⟨ϕ,Aψ⟩ for all |ψ⟩∈D, where ⟨⋅,⋅⟩ is the scalar product of the Hilbert space. From this, the concept of a self-adjoint operator is defined by requiring that A†=A. A self-adjoint operator has a real-valued spectrum, in particular real-valued eigenvalues, so, by connecting such an operator to a physical variable, the eigenvalues can be interpreted as the possible values of this variable.
A simpler concept is that of a symmetric operator: An operator A is symmetric if ⟨ϕ,Aψ⟩=⟨Aϕ,ψ⟩ for all |ϕ⟩,|ψ⟩∈D. Conditions under which a symmetric operator is self-adjoint are discussed in Chapter 9 of Hall[27]. Symmetric operators also have real-valued eigenvalues. All bounded operators that are symmetric are self-adjoint.
Of some interest to statisticians are also the Lecture Notes by Meyer[28] and Holevo[29]. What is common to those monographs is that they assume the superposition principle: If |ϕ⟩ and |ψ⟩ are possible state vectors and a and b are complex numbers, then a normalized version of a|ϕ⟩+b|ψ⟩ is also a possible state vector. In the next subsection, I will limit the concept of state vector to vectors in the Hilbert space that are eigenvectors of some physically or statistically meaningful operator, and then the general superposition principle is not necessarily true. As shown in Helland[8][12], this view can be used to resolve various quantum paradoxes.
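In the finite-dimensional case, both the symmetry condition and the real-valued eigenvalues can be checked numerically. The following is a minimal sketch in plain Python and is not taken from Hall[27]; the 2×2 matrices `A` and `B` and all helper names are illustrative assumptions.

```python
import cmath

def inner(x, y):
    """Scalar product <x, y>, conjugate-linear in the first argument."""
    return sum(complex(a).conjugate() * b for a, b in zip(x, y))

def apply(A, x):
    """Matrix-vector product A x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def is_symmetric(A, tol=1e-12):
    """Check <phi, A psi> = <A phi, psi> on the standard basis vectors.

    Because the scalar product is sesquilinear, checking basis vectors
    is enough in finite dimension.
    """
    n = len(A)
    basis = [[1 if i == j else 0 for i in range(n)] for j in range(n)]
    return all(abs(inner(p, apply(A, q)) - inner(apply(A, p), q)) < tol
               for p in basis for q in basis)

def eigenvalues_2x2(A):
    """Roots of the characteristic polynomial of a 2x2 matrix."""
    half_trace = (A[0][0] + A[1][1]) / 2
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    root = cmath.sqrt(half_trace ** 2 - det)
    return half_trace - root, half_trace + root

A = [[1, 2 - 1j], [2 + 1j, -1]]   # assumed example: equals its conjugate transpose
B = [[0, 1], [-1, 0]]             # assumed example: not symmetric in the above sense

print(is_symmetric(A), eigenvalues_2x2(A))   # symmetric, eigenvalues with zero imaginary part
print(is_symmetric(B), eigenvalues_2x2(B))   # not symmetric, purely imaginary eigenvalues
```

The domain questions that separate symmetric from self-adjoint operators only arise in infinite dimension, so this sketch says nothing about them.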
In newer versions of this traditional approach to quantum probabilities, a basic notion is that of a POVM (positive operator-valued measure): Let H be a Hilbert space, and (Ω,ℱ) a measurable space with ℱ being the Borel σ-algebra on Ω. A POVM is a function F defined on ℱ whose values are positive bounded self-adjoint operators on H such that for every |ψ⟩∈H the mapping E↦⟨F(E)ψ,ψ⟩ is a non-negative countably additive measure on ℱ, and F(Ω)=I is the identity operator.
In the finite-dimensional case, a POVM can be constructed from a set of positive semi-definite self-adjoint matrices {M_i} such that ∑_i M_i=I by letting F({i})=M_i for i=1,…,r. Then for a set of non-negative constants p_i adding to 1, and for a fixed orthonormal basis {|ψ_i⟩} of the Hilbert space, define ⟨M_iψ_i,ψ_i⟩=p_i. This gives for any subset E of {1,…,r}, and for any |ψ⟩ in the Hilbert space, ⟨F(E)ψ,ψ⟩=∑_{i∈E}|a_i|²p_i when |ψ⟩=∑_i a_i|ψ_i⟩.
A POVM is called a projection-valued measure (PVM) if F(E) is a projection operator for every E. Naimark’s dilation theorem shows how any POVM F can be obtained from a PVM P acting on a larger Hilbert space K: There is a bounded linear transformation U from K to H satisfying UU†=I (U is unitary only when K and H have the same dimension) such that F(E)=UP(E)U† for every E.
The key property of a POVM is that ⟨F(E)ψ,ψ⟩ can be interpreted as a probability of outcome E when measured in the state |ψ⟩ when this is a normalized vector in the Hilbert space. This is a version of Born’s formula.
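A small numerical example may make the POVM definition and its probability interpretation concrete. The sketch below, in plain Python, uses three rank-one matrices on ℂ² (often called a trine); this particular POVM, the state `psi`, and the helper names are assumptions chosen for illustration and do not come from the text.

```python
import math

def outer(v):
    """Rank-one matrix |v><v| for a vector v."""
    return [[a * complex(b).conjugate() for b in v] for a in v]

def scale(c, M):
    return [[c * x for x in row] for row in M]

def add(M, N):
    return [[x + y for x, y in zip(r, s)] for r, s in zip(M, N)]

def matmul(M, N):
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

def born(M, psi):
    """<M psi, psi>, which is real for a self-adjoint matrix M."""
    Mpsi = [sum(a * b for a, b in zip(row, psi)) for row in M]
    return sum(complex(a).conjugate() * b for a, b in zip(psi, Mpsi)).real

# Three unit vectors in C^2 at angles 0, pi/3 and 2*pi/3 (assumed example).
vectors = [(math.cos(k * math.pi / 3), math.sin(k * math.pi / 3)) for k in range(3)]
povm = [scale(2 / 3, outer(v)) for v in vectors]   # F({k}) = (2/3)|v_k><v_k|

total = add(add(povm[0], povm[1]), povm[2])         # F(Omega), numerically the identity
psi = (1.0, 0.0)                                    # a normalized state vector
probs = [born(M, psi) for M in povm]                # probabilities of outcomes 0, 1, 2

print(total)
print(probs)
```

The three matrices sum to the identity and yield probabilities adding to 1, yet none of them is a projection (`matmul(povm[0], povm[0])` differs from `povm[0]`), so this POVM is not a PVM.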
A completely new approach towards quantum foundations is proposed in Helland[12][13][14][15], where the formal properties of quantum mechanics are derived from a rather simple set of postulates. These postulates will be repeated below.
As a possible general interpretation, the basis can be taken to be relative to an actor who is in some fixed (physical or statistical) context. In this context, there are theoretical variables, and some of these variables, say θ,λ,η,… may be related to the actor C. Some of these variables are accessible to him, which means roughly that, given some estimation principle, it is in principle possible for him at some future time to obtain estimates of the relevant variable that are as accurate as he wishes. Other variables are inaccessible. In Helland[12][13][14][15], physical examples are given. In the present article, applications to statistics will be the theme, and then in most cases, the theoretical variables will be parameters of some statistical model. However, I will also allow other interpretations of the variables: In special applications, they may be latent variables, future data, or combinations of parameters and data.
The above characterization of accessible and inaccessible variables will, in this article, mainly be related to a statistical implication of the theory. But the theory itself is purely mathematical and can be made precise in different directions. In particular, the terms ‘accessible’ and ‘inaccessible’ can just be seen as primitive notions of the theory. In addition to the statistical implication, two other ways that the theory can be made precise are 1) ordinary quantum mechanics, where the theoretical variables are physical variables, in my interpretation connected to a fixed context and also to the mind of some actor; 2) quantum decision theory, where the variables are decision variables. From a mathematical point of view, it is only assumed that if λ is a theoretical variable and θ=f(λ) for some function f, then θ is a theoretical variable. And if λ is accessible, then θ is accessible.
As said, in physical applications, the variables may also be connected to the mind of some actor. Note, however, that actors may communicate. The mathematical model developed in the articles mentioned above is equally valid relative to a group of people that can communicate about the various theoretical variables. This gives a new version of the theory, a version where all theoretical variables are defined jointly for such a group of actors. In physical applications, the actor or the communicating group of actors is important. In many statistical applications, we may take the group to be the set of all possible actors; see Section 3 below.
From a mathematical point of view, an accessible variable θ is called maximal if there is no other accessible variable λ such that θ=f(λ) for some non-invertible function f. In other words, θ is maximal with respect to the partial ordering of variables given by α≤β iff α=f(β) for some function f.
To be precise, every accessible variable is assumed to vary over some topological space; in most cases, they are real-valued or vector-valued, and all functions discussed are assumed to be Borel-measurable.
A basic assumption in my theory is that there exists an inaccessible variable ϕ such that all the accessible variables can be seen as functions of ϕ. In simple physical applications, such a ϕ may easily be defined explicitly. In statistical applications, ϕ may be some total, inaccessible parameter, say, the set of all parameters that in some way may be included in a certain statistical model.
Two different accessible variables θ and η are defined to be related if there is a transformation (group action) k in ϕ-space and a function f such that θ=f(ϕ) and η=f(kϕ).
As a summary of the above discussion, here are the first 3 postulates of the theory:
Postulate 1. If η is a theoretical variable and γ=f(η) for some function f, then γ is also a theoretical variable.
Postulate 2. If θ is accessible to C and λ=f(θ) for some function f, then λ is also accessible to C.
Postulate 3. In the given context, there exists an inaccessible variable ϕ such that all the accessible ones can be seen as functions of ϕ. There is a transitive group K acting upon ϕ.
A definition is now needed for the fourth postulate:
Definition 1. The accessible variable θ is called maximal if θ is maximal as an accessible variable under the partial ordering defined by α≤β iff α=f(β) for some function f.
Note that this partial ordering is consistent with accessibility: If β is accessible and α=f(β), then α is accessible. Also, ϕ from Postulate 3 is an upper bound under this partial ordering.
Postulate 4. There exist maximal accessible variables relative to this partial ordering. For every accessible variable θ there exists a maximal accessible variable λ such that θ is a function of λ.
Then, in my opinion, two different maximal accessible variables come very close to what Bohr called complementary variables; see Plotnitsky[30] for a thorough discussion. The term complementary originated in connection with the variables position and momentum, but has since found a number of applications; see, for example, Steiner & Rendell[31] and Maccone[32].
It is crucial what is meant by ‘different’ here. If θ=f(η), where f is a bijective function, (i.e., there is a one-to-one correspondence between θ and η), then θ and η contain the same information, and they must be considered ‘equal’ in this sense. θ and η are said to be ‘different’ if they are not ‘equal’ in this meaning. This is consistent with the partial ordering in Definition 1. The word ‘different’ is used in the same meaning in the Theorem below.
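On a finite ϕ-space, the partial ordering of Definition 1 and the notion of ‘equal’ variables can be checked mechanically. The following plain-Python sketch is only an illustration: the six-point ϕ-space and the variables `lam`, `theta`, `eta`, and `theta_flipped` are hypothetical, and accessibility itself is not modeled.

```python
def is_function_of(alpha, beta, phi_space):
    """True if alpha = f(beta) for some function f, i.e. alpha <= beta."""
    seen = {}
    for phi in phi_space:
        b, a = beta(phi), alpha(phi)
        # Each value of beta must determine a single value of alpha.
        if seen.setdefault(b, a) != a:
            return False
    return True

def equal_in_information(alpha, beta, phi_space):
    """'Equal' variables: each one is a function of the other."""
    return (is_function_of(alpha, beta, phi_space)
            and is_function_of(beta, alpha, phi_space))

phi_space = range(6)                      # hypothetical finite phi-space
lam = lambda phi: phi                     # carries all the information in phi
theta = lambda phi: phi % 2
eta = lambda phi: phi % 3
theta_flipped = lambda phi: 1 - phi % 2   # a bijective function of theta

print(is_function_of(theta, lam, phi_space))                 # theta <= lam
print(is_function_of(theta, eta, phi_space))                 # theta is not a function of eta
print(equal_in_information(theta, theta_flipped, phi_space)) # same information
```

Here `theta` and `eta` are ‘different’ in the sense used above, and neither is a function of the other, while `theta` and `theta_flipped` count as ‘equal’.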
Postulate 4 can be motivated by using Postulate 3 together with Zorn’s lemma (if this lemma, which is equivalent to the axiom of choice, is assumed to hold), but such a motivation is not necessary if Postulate 4 is accepted. Physical examples of maximal accessible variables are the position or the momentum of some particle, or the spin component in some direction. In a more general situation, the maximal accessible variable may be a vector, whose components are simultaneously measurable variables.
Assuming these postulates, the main result of Helland[6][12][15] is as follows:
Theorem 1. Consider a context where there are two different maximal accessible variables θ and η. Assume that both θ and η are real-valued or real vectors, taking at least two values. Make the following additional assumptions:
Then there exists a Hilbert space H connected to the situation, and to every (real-valued or vector-valued) accessible variable there can be associated a symmetric operator on H.
The main result is that each accessible variable ξ is associated with an operator Aξ. The proof goes by first constructing Aθ and Aη, then operators associated with other accessible variables are found by using the spectral theorem. For this, we need weak conditions[27] ensuring that the symmetric operators are self-adjoint.
To formulate the spectral theorem in general, the spectrum σ_A of an operator A is first defined as the set of constants λ such that A−λI does not have a bounded inverse. For self-adjoint operators, the spectrum is contained in the real line and contains all eigenvalues of A. For operators on a finite-dimensional Hilbert space, the spectrum is equal to the set of eigenvalues.
Then, in general, we have for any self-adjoint operator A that there exists a projection-valued measure EA such that
$$A=\int_{\sigma_A}\lambda\,dE^A(\lambda). \tag{1}$$
From this spectral theorem (see Hall[27] for a proof), if $A=A^\eta$ is the operator which according to Theorem 1 is associated with the maximal accessible variable η, we can define the operator associated with ξ=f(η) by
$$A^\xi=f(A)=\int_{\sigma_A}f(\lambda)\,dE^A(\lambda). \tag{2}$$
In particular, we have
$$\int_{\sigma_A}dE^A(\lambda)=I. \tag{3}$$
Note that here η may be any maximal accessible variable associated with some θ which satisfies (i) and (ii) above. It is shown in Helland (2024d) that, as a consequence of the assumption (ii), the two variables θ and η will be related.
By Postulate 4, for any accessible variable ξ, there exists a maximal variable η and a function f such that ξ=f(η). In this way, operators associated with any accessible variable may be defined.
Groups acting on a space are important in my approach. A group G acting on a space Ω is called transitive if for every ω∈Ω, the range of gω as g runs through G is the full space Ω. If this holds for one ω, it holds for every ω∈Ω. The isotropy group of G at ω is the set of g∈G such that gω=ω. In the transitive case, for different ω, the isotropy groups are isomorphic. In particular, if the isotropy group is trivial for one ω, it is trivial for every ω∈Ω.
When there is a transitive group G with a trivial isotropy group acting on Ω, there will be a one-to-one correspondence between the points ω∈Ω and the group elements g∈G. This is important for the proof of Theorem 1.
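For finite groups, transitivity and isotropy groups can be computed by brute force. The sketch below contrasts two assumed examples that do not appear in the text: the cyclic group Z₄ acting on four points by addition modulo 4 (transitive with trivial isotropy), and the symmetric group S₃ permuting three points (transitive, but with isotropy groups of order 2). All names are illustrative.

```python
from itertools import permutations

def orbit(group, act, w):
    """The set {g w : g in G}."""
    return {act(g, w) for g in group}

def isotropy(group, act, w):
    """The group elements g with g w = w."""
    return [g for g in group if act(g, w) == w]

def is_transitive(group, act, space):
    return all(orbit(group, act, w) == set(space) for w in space)

# Cyclic group Z_4 acting on {0, 1, 2, 3} by addition modulo 4.
space4 = range(4)
z4 = list(range(4))
shift = lambda g, w: (g + w) % 4

# Symmetric group S_3 acting on {0, 1, 2} by permuting the points.
space3 = range(3)
s3 = list(permutations(range(3)))
permute = lambda g, w: g[w]

# With trivial isotropy, g -> g w0 is one-to-one from the group onto the space.
correspondence = {g: shift(g, 0) for g in z4}

print(is_transitive(z4, shift, space4), [isotropy(z4, shift, w) for w in space4])
print(is_transitive(s3, permute, space3), [len(isotropy(s3, permute, w)) for w in space3])
print(correspondence)
```

Only in the first example do group elements and points of the space correspond one-to-one; in the second, two group elements fix each point, so the map from group elements to points is two-to-one.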
It is shown in Helland[15] that under the assumptions of Theorem 1, there exists a variable ξ that is a bijective function of η and a transformation k on the ϕ-space by which the variables θ and ξ are related: ξ(ϕ)=θ(kϕ).
An important special case of Theorem 1 is when the accessible variables take a finite number of values, say u1,u2,…,ur. For this case, it is proved in Helland[12][15] that a group G and a transformation k with the above properties can always be constructed. The following Corollary then follows:
Corollary 1. Assume that there exist two different maximal accessible variables θ and η, each taking r values, and not in one-to-one correspondence. Then, there exists an r-dimensional Hilbert space H describing the situation, and every accessible variable in this situation will have an associated self-adjoint operator in H.
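In the finite case, equations (1)-(3) take a simpler form. The operator $A^\eta$ will have eigenvalues $\{u_j\}$ and corresponding eigenvectors $\{\mathbf{u}_j\}$. The spectral theorem then reads
$$A^\eta=\sum_j u_j\,\mathbf{u}_j\mathbf{u}_j^\dagger, \tag{4}$$
where $\mathbf{u}_j^\dagger$ is the complex conjugate row vector corresponding to $\mathbf{u}_j$. In the quantum mechanical literature, these vectors are often written as ket and bra vectors: $\mathbf{u}_j=|j\rangle$ and $\mathbf{u}_j^\dagger=\langle j|$. The eigenvectors can be chosen as orthonormal: $\mathbf{u}_i^\dagger\mathbf{u}_j=\langle i|j\rangle=\delta_{ij}$. In the following, both notations will be used.
We then further have
$$A^{f(\eta)}=\sum_j f(u_j)\,\mathbf{u}_j\mathbf{u}_j^\dagger, \tag{5}$$
in particular
$$\sum_j\mathbf{u}_j\mathbf{u}_j^\dagger=I, \tag{6}$$
the corresponding resolution of the identity.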
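The finite-dimensional spectral sums can be verified directly for a small matrix. The following plain-Python sketch assumes a two-valued variable η with values 2 and −1 and a particular orthonormal pair of eigenvectors; these choices and the helper names are illustrative only.

```python
import math

def outer(v):
    """The matrix u u^dagger for a column vector u."""
    return [[a * complex(b).conjugate() for b in v] for a in v]

def combine(coeffs, mats):
    """Linear combination sum_j coeffs[j] * mats[j] of square matrices."""
    n = len(mats[0])
    return [[sum(c * M[i][k] for c, M in zip(coeffs, mats)) for k in range(n)]
            for i in range(n)]

def matmul(M, N):
    n = len(M)
    return [[sum(M[i][k] * N[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def max_diff(M, N):
    """Largest entrywise absolute difference between two matrices."""
    return max(abs(x - y) for r, t in zip(M, N) for x, y in zip(r, t))

s = 1 / math.sqrt(2)
values = [2.0, -1.0]                        # assumed values u_1, u_2 of eta
vectors = [(s, s), (s, -s)]                 # assumed orthonormal eigenvectors
projections = [outer(v) for v in vectors]   # the matrices u_j u_j^dagger

A = combine(values, projections)                              # spectral sum for the operator of eta
A_squared = combine([u ** 2 for u in values], projections)    # operator of f(eta) with f(u) = u^2
identity = combine([1.0, 1.0], projections)                   # resolution of the identity

print(A)
print(max_diff(A_squared, matmul(A, A)))          # close to 0: f(A) agrees with A times A
print(max_diff(identity, [[1, 0], [0, 1]]))       # close to 0
```

Applying f to the eigenvalues inside the sum gives the same matrix as multiplying the operator by itself, which is the content of the formula for the operator of f(η) in this small case.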
Theorem 1 and its Corollary constitute the first steps in a new proposed foundation of quantum theory. In statistical applications, the variables (parameters) involved will in most cases have a continuous variation. However, continuous parameters can be approximated by a sequence of parameters taking a finite number of values; see subsection 5.3 in Helland[4]. In this way, we may avoid both the symmetry assumptions of Theorem 1 and the technical issues relating symmetric and self-adjoint operators. Examples are given in Section 3 below.
The second step is now to prove the following: If k is the transformation connecting two related maximal accessible variables θ and η, and Aθ and Aη are the associated operators, then there is a unitary operator W(k) such that Aη=W(k)⁻¹AθW(k). This, and a more general related result, is proved as Theorem 5 in Helland[12].
Given these results, a rich theory follows. The set of eigenvalues of the operator Aθ is proved to be identical to the set of possible values of θ. The variable θ is shown to be maximal if and only if all eigenvalues of the corresponding operator are simple. In general, the eigenspaces of Aθ are in one-to-one correspondence with questions ‘What is θ’/ ‘What will θ be if we measure it?’ together with sharp answers θ=u for some eigenvalue u of Aθ. If θ is a maximal accessible variable, then all eigenvectors of the operator Aθ have a similar interpretation. In my theory, I wish to limit the concept of a state vector to vectors in the Hilbert space that are eigenvectors of some meaningful operator. Then this gives a simple interpretation of all possible state vectors.
What is lacking in the above theory is a foundation of Born’s formula, necessary for the computation of quantum probabilities. Several versions of Born’s formula are proved from two new postulates in Helland[4][15]. The first postulate is as follows:
Postulate 5. The likelihood principle holds.
As is well known, the likelihood principle is a principle that many statisticians base their inference on. In its strict form, it is controversial; see, for instance, the discussion in Schweder and Hjort[33]. Elsewhere[4], I have advocated the view that the principle should be restricted to a specific context, and then it is less controversial. For a basic historical discussion of the principle, see Berger and Wolpert[34].
Recall that the likelihood principle runs as follows: Relative to any experiment, the experimental evidence is always a function of the likelihood. Here, the term ‘experimental evidence’ is left undefined and can be made precise in several directions. But as everybody would agree, an experiment is always done in a context, and such a context should include a well-defined experimental question.
In a quantum mechanical setting, a potential or actual experiment is seen in relation to an actor C or to a communicating group of actors. Concentrate here on the first scenario. In the simplest case, assuming a discrete-valued variable, we assume that C knows the state |a;i⟩ of a physical system and that this state can be interpreted as the knowledge that θa=ui for some maximal accessible variable θa. Then assume that C has focused upon a new maximal accessible variable θb, and we are interested in the probability distribution of this variable.
The last postulate is connected to the scientific ideals of C, ideals that either are given by certain conscious or unconscious principles, or are connected to some concrete persons. These ideals are then modeled by some ‘higher being’ D that C considers to be perfectly rational with respect to any aspect of the relevant theoretical variables.
Postulate 6. Consider in the context τ an experiment where the likelihood principle is assumed to be satisfied, and the whole situation is observed by an experimentalist C whose decisions can be shaped or influenced by a ‘superior being’ D. Assume D’s probability for some given outcome E is q, that D is seen by C to be perfectly rational in agreement with the Dutch Book Principle, and that q is assumed to be the real probability for E.
The Dutch Book Principle states the following: No choice of payoffs in a series of bets shall lead to a sure loss for the bettor.
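The standard textbook argument behind this principle can be illustrated with two unit bets, one on an event E and one on its complement. The sketch below is an assumption-laden toy, not part of the theory above: the agent is assumed to accept either side of each bet at the quoted prices, and the function name and the numbers are hypothetical.

```python
from fractions import Fraction

def sure_loss(q_event, q_complement):
    """Guaranteed loss of an agent who prices unit bets on E and on not-E
    at q_event and q_complement and accepts either side of each bet."""
    total = q_event + q_complement
    if total > 1:
        # The agent buys both tickets for `total` and collects exactly 1.
        return total - 1
    if total < 1:
        # The agent sells both tickets for `total` and pays out exactly 1.
        return 1 - total
    return Fraction(0)   # coherent prices: this particular book yields no sure loss

incoherent = sure_loss(Fraction(3, 5), Fraction(1, 2))   # prices sum to 11/10
coherent = sure_loss(Fraction(3, 5), Fraction(2, 5))     # prices sum to 1

print(incoherent, coherent)
```

Whatever the outcome, the agent with prices 3/5 and 1/2 loses 1/10, whereas prices that add to 1 cannot be exploited by this pair of bets.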
A situation where Postulate 6 holds will be called a rational epistemic setting. It will be seen in the next subsection to imply essential aspects of quantum probability. As shown in Helland[13], it also gives a foundation for probabilities in quantum decision theory.
In Helland[4][14], a generalized likelihood principle is proved from the ordinary likelihood principle: Given some experiment, assumed here to have a discrete, maximal accessible parameter θa, and assume a context τ connected to the experiment, any experimental evidence will under the above assumptions be a function of the so-called likelihood effect F=Fa, defined by
$$F^a(u;z,\tau)=\sum_i p(z\mid\tau,\theta^a=u_i)\,|a;i\rangle\langle a;i|. \tag{7}$$
In particular, the probability q of Postulate 6 must be a function of F: q(F|τ).
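For a two-valued parameter and binary data, the likelihood effect is a 2×2 matrix for each data value z. The plain-Python sketch below assumes a hypothetical statistical model `likelihood`, a hypothetical orthonormal basis standing in for the states |a;1⟩ and |a;2⟩, and a fixed context τ that is left implicit; none of these come from the text.

```python
import math

def outer(v):
    """The projection |v><v| for a unit vector v."""
    return [[a * complex(b).conjugate() for b in v] for a in v]

def combine(coeffs, mats):
    """Linear combination sum_j coeffs[j] * mats[j] of square matrices."""
    n = len(mats[0])
    return [[sum(c * M[i][k] for c, M in zip(coeffs, mats)) for k in range(n)]
            for i in range(n)]

def expect(M, v):
    """<v| M |v> for a self-adjoint matrix M."""
    Mv = [sum(a * b for a, b in zip(row, v)) for row in M]
    return sum(complex(a).conjugate() * b for a, b in zip(v, Mv)).real

t = math.pi / 6
basis = [(math.cos(t), math.sin(t)), (-math.sin(t), math.cos(t))]   # stand-ins for |a;1>, |a;2>
projections = [outer(v) for v in basis]

# Hypothetical model: likelihood[z][i] = p(z | theta^a = u_i) for binary data z.
likelihood = {0: [0.1, 0.8], 1: [0.9, 0.2]}

# Likelihood effect for each data value z: sum_i p(z | u_i) |a;i><a;i|.
effect = {z: combine(p, projections) for z, p in likelihood.items()}

print(expect(effect[1], basis[0]), expect(effect[1], basis[1]))
print(combine([1.0, 1.0], [effect[0], effect[1]]))   # numerically the identity
```

In the state |a;i⟩ the effect returns the likelihood value p(z|θa=ui), and because the likelihoods sum to 1 over z, the effects for the different data values sum to the identity.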
In many textbooks, quantum mechanics is restricted to discrete-valued variables as above. For a continuous variable θa, the likelihood effect can be defined by appealing to the spectral theorem for the operator A=Aθa and using the probability density p(z|τ,θa=u) for the data:
$$F^a(u;z,\tau)=\int_{\sigma_A}p(z\mid\tau,\theta^a=u)\,dE^A(u). \tag{8}$$
Using these postulates and a version of Gleason’s Theorem due to Busch[35], the following variant of Born’s formula is proved in Helland[4][14]:
Theorem 2 [Born’s formula, simple version] Assume a rational epistemic setting and assume two different discrete maximal accessible variables θa and θb. In the above situation, we have:
$$P(\theta^b=v_j\mid\theta^a=u_i)=|\langle a;i|b;j\rangle|^2. \tag{9}$$
Here, $|a;i\rangle$ is the state given by $\theta^a=u_i$ and $|b;j\rangle$ is the state given by $\theta^b=v_j$. In this version of the Born formula, I have assumed perfect measurements: there is no experimental noise, so that the experiment gives a direct value of the relevant theoretical variable. Another assumption is that the events $\theta^a=u_i$ and $\theta^b=v_j$ are contained in the experimental questions related to the respective experiments.
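A familiar instance of the simple Born formula is a spin-1/2 particle, where θa and θb are spin components along two directions. The sketch below, in plain Python, assumes directions in the x-z plane separated by an angle of π/3 and uses the standard eigenvectors for such spin components; the function names are illustrative.

```python
import math

def inner(x, y):
    """Scalar product <x|y>, conjugate-linear in the first argument."""
    return sum(complex(a).conjugate() * b for a, b in zip(x, y))

def spin_states(angle):
    """Eigenvectors for spin 'up' and 'down' along a direction in the x-z plane
    making the given angle with the z-axis."""
    c, s = math.cos(angle / 2), math.sin(angle / 2)
    return [(c, s), (-s, c)]

def born_table(states_a, states_b):
    """Matrix with entries |<a;i|b;j>|^2."""
    return [[abs(inner(a, b)) ** 2 for b in states_b] for a in states_a]

table = born_table(spin_states(0.0), spin_states(math.pi / 3))

for row in table:
    print(row, sum(row))   # each row is a probability distribution over the values of theta^b
```

For this angle the transition probabilities are cos²(π/6)=3/4 and sin²(π/6)=1/4, and each row sums to 1 because the states |b;j⟩ form an orthonormal basis.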
A last postulate is needed to compute probabilities of independent events. A version of such a postulate is
Postulate 7. If the probability of an event E1 is computed by a probability amplitude z1 from the Born rule in the Hilbert space H1, the probability of an event E2 is computed by a probability amplitude z2 from the Born rule in the Hilbert space H2, and these two events are independent, then the probability of the event E1∩E2 can be computed from the probability amplitude z1z2, associated with the tensor product of the Hilbert spaces H1 and H2.
This postulate can be motivated by its relation to classical probability theory: If P(E1)=|z1|² and P(E2)=|z2|², then P(E1∩E2)=P(E1)P(E2)=|z1|²|z2|²=|z1z2|².
The simple Born formula can now be generalized to the case where the variables are continuous, and where the accessible variables θa and θb are not necessarily maximal. There is also a variant for a mixed state involving θa.
Define first the mixed state operator associated with any accessible variable θ by using the spectral theorem for A=Aθ:
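$$\rho^\theta=\int_{\sigma_A}p_\theta(u)\,dE^A(u). \tag{10}$$
In the continuous case, $p_\theta(u)$ is the probability density for θ. In the discrete case, $p_\theta(u)$ is the point probability of θ, and (10) reads
$$\rho^\theta=\sum_j p_\theta(u_j)\,\mathbf{u}_j\mathbf{u}_j^\dagger, \tag{11}$$
where $\mathbf{u}_j$ is the eigenvector of $A^\theta$ corresponding to the eigenvalue $u_j$.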
The probability distribution for θ assumed in (10) can be of many kinds. For a Bayesian, it can be a prior or posterior distribution. For a frequentist, it can be a confidence distribution of the kind discussed by Schweder and Hjort[33]. Also, statisticians that follow some fiducial school operate with probability distributions of parameters.
Assume now first that θb in Theorem 2 is discrete, but not necessarily maximal. Then θb is a function f of a maximal accessible variable η, and it follows by summation over j, assuming that η=vj belongs to some set B1 defined by f(vj)∈B for some given set B, that
$$P(\theta^b\in B\mid\theta^a=u_i)=\langle a;i|\Pi_B|a;i\rangle, \tag{12}$$
where $\Pi_B$ is the projection upon the space spanned by those eigenvectors $|b;j\rangle$ of $A^\eta$ whose eigenvalues $v_j$ satisfy $f(v_j)\in B$.
Now, by approximating a continuous $\theta^b$ by discrete variables $\theta^b_n$ such that $\theta^b_n\to\theta^b$ as $n\to\infty$, it is easy to show that (12) holds in general, where now $\Pi_B$ is interpreted as the projection upon the eigenspace of the indicator for the set $\theta^b\in B$ corresponding to the value 1 for this indicator. More precisely, we should use $\Pi_B=\int_{B\cap\sigma_b}dE^b(u)$, where $\sigma_b$ is the spectrum of the operator $A^{\theta^b}$, and $\{E^b\}$ is found from the spectral theorem of the same operator.
In the same way, a continuous θa may be approximated by discrete θar, assumed to be functions fr of some maximal accessible variables ξr, replacing the variable θa in Theorem 2. Then, using the definition (10) (which can be generalized to the case of non-maximal accessible variables), we can prove
Theorem 3 [Born’s formula, general version]. Assume Postulate 5 and Postulate 6, and assume that we have two different accessible variables θa and θb in the relevant context. Assume that the knowledge of θa is given by the density matrix ρa. Then for any Borel set B in Ωθb we have
$$P(\theta^b\in B\mid\rho^a)=\mathrm{trace}(\rho^a\Pi_B). \tag{13}$$
This result is not necessarily associated with a microscopic situation, a fact that I will come back to in examples in Section 3.
As a corollary, we have
$$E(\theta^b\mid\rho^a)=\mathrm{trace}(\rho^a A^{\theta^b}). \tag{14}$$
Here $A^{\theta^b}$ is the operator corresponding to $\theta^b$.
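The trace formulas for probabilities and expectations can be evaluated by hand for 2×2 matrices. In the plain-Python sketch below, the mixed state (weights 0.7 and 0.3 on the two values of θa), the choice of θb as a ±1-valued variable whose eigenvectors are rotated by an angle π/3 relative to those of θa, and all helper names are assumptions made for illustration.

```python
import math

def outer(v):
    """The projection |v><v| for a unit vector v."""
    return [[a * complex(b).conjugate() for b in v] for a in v]

def combine(coeffs, mats):
    """Linear combination sum_j coeffs[j] * mats[j] of square matrices."""
    n = len(mats[0])
    return [[sum(c * M[i][k] for c, M in zip(coeffs, mats)) for k in range(n)]
            for i in range(n)]

def matmul(M, N):
    n = len(M)
    return [[sum(M[i][k] * N[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

def spin_states(angle):
    """Orthonormal eigenvectors for a spin component in the x-z plane."""
    c, s = math.cos(angle / 2), math.sin(angle / 2)
    return [(c, s), (-s, c)]

states_a = spin_states(0.0)
states_b = spin_states(math.pi / 3)

rho_a = combine([0.7, 0.3], [outer(v) for v in states_a])   # assumed mixed state for theta^a
proj_b = [outer(v) for v in states_b]                        # projections for theta^b = +1 and -1
A_b = combine([1.0, -1.0], proj_b)                           # operator associated with theta^b

p_plus = trace(matmul(rho_a, proj_b[0])).real    # P(theta^b = +1 | rho^a)
mean_b = trace(matmul(rho_a, A_b)).real          # E(theta^b | rho^a)

print(p_plus, mean_b)
```

The probability equals the mixture 0.7·(3/4)+0.3·(1/4)=0.6 of the simple Born probabilities, and the expectation equals 2·0.6−1=0.2, as it must for a ±1-valued variable.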
Finally, one can generalize to the case where the final measurement is not necessarily perfect. Let us assume future data zb instead of a perfect theoretical variable θb, which is now taken to be the parameter of the experiment. Note that we only need the likelihood principle (together with Postulate 6) for perfect experiments in order to prove that (13) is valid. Then we can define an operator corresponding to zb by
$$A^{z^b}=\int_{\sigma_A}z^b\,p(z^b\mid\theta^b=u)\,dE^b(u), \tag{15}$$
where $\{E^b\}$ is found from the spectral theorem used on the operator $A=A^{\theta^b}$. Then from the version (13) of the Born formula, we obtain
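$$E(z^b\mid\rho^a)=\mathrm{trace}(\rho^a A^{z^b}) \tag{16}$$
and
$$P(z^b\in F)=\mathrm{trace}\Big(\rho^a\int_{z\in F\cap\sigma_{A_1}}dE^{z^b}(z)\Big), \tag{17}$$
where $\{E^{z^b}\}$ is found from the spectral theorem used on the operator $A_1=A^{z^b}$.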
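Again, elementary quantum mechanics uses discrete data and discrete parameters, a setting unfamiliar to statisticians, but useful as an approximation. Then $p(z^b|\theta^b=u_i)$ is the point probability of the data, and we define
$$A^{z^b}=\sum_i p(z^b\mid\theta^b=u_i)\,\mathbf{u}_i\mathbf{u}_i^\dagger, \tag{18}$$
and Born’s formula gives $E(z^b|\rho^a)=\mathrm{trace}(\rho^a A^{z^b})$ and
$$P(z^b\in F)=\mathrm{trace}\Big(\rho^a\sum_{z_j\in F}\mathbf{v}_j\mathbf{v}_j^\dagger\Big), \tag{19}$$
where $\mathbf{v}_j$ is the eigenvector of $A^{z^b}$ corresponding to the eigenvalue $z_j$.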
All this not only points to a new foundation of quantum theory, but it also suggests a general epistemic interpretation of the theory: Quantum Theory is not directly a theory about the world, but a theory about an actor’s knowledge of the world. In particular, the probabilities in the Born formula can be interpreted as probabilities attached to a single observer, or to a communicating group of observers. It is crucial that the probabilities at the outset according to Postulate 6 are seen as probabilities as evaluated by the ‘superior actor’ D.
In De Raedt et al.[16][36] another approach to the foundation of quantum theory is discussed. This approach will be described very briefly here.
First, the authors define inference-probability as any conditional probability satisfying the three rules
These rules are the same as the rules for the concept of plausibility, derived from reasonable assumptions and discussed in detail by Jaynes[37]. In op. cit., A, B and Z are propositions, and to be precise, AB denotes that both propositions A and B are true, and A+B denotes that at least one of the propositions is true.
Next, De Raedt et al.[16][36] assume the following conditions, which are made precise in these articles:
Conditions 1. There may be uncertainty about each event. The condition under which the experiment is carried out may be uncertain. The frequencies with which events are observed are reproducible and robust against small changes in the conditions. Individual events are independent.
Using these assumptions, they first derive quantum probabilities for the Einstein-Podolsky-Rosen-Bohm thought experiment. This experiment consists of a source S, a router R1 to the left of the source oriented by a chosen unit vector a1, a router R2 to the right of the source oriented by a chosen unit vector a2, and two detectors connected to each router.
The experiment is the same as what is called the Bell experiment, an experiment that has now been shown experimentally to give outcomes as predicted by quantum theory, but not by ‘common sense’ use of classical arguments. The Bell experiment has been discussed under various assumptions by many authors, including Helland[7][11].
De Raedt aims at making a minimal set of assumptions about the experiment; for details, see De Raedt et al.[16][36] and references therein:
The whole experiment is then run a large number N of times, giving frequencies for the various outcomes. Probabilities are obtained by letting N→∞. It turns out that these probabilities are in agreement with quantum theory, which deviates from the common sense use of classical arguments.
This is a special experiment, but it is an important experiment, distinguishing between quantum predictions and classical conditions. A simpler experiment is the Stern-Gerlach experiment. Here the source S, activated at times n=1,2,…,N, sends a particle carrying a magnetic moment s to a magnet M with magnetization a. Depending on a and s, the particle is detected with 100% certainty by one of two detectors. It is crucial that this depends only on the scalar product a⋅s. Again, by a long argument, the predictions of quantum theory are derived under very weak assumptions.
In addition to this, there is a discussion of the Schrödinger equation in De Raedt et al.[16], a theme that I will not discuss in detail in the present article.
It is crucial that fundamental quantum predictions are derived in op. cit., using plausible assumptions together with the basic Conditions 1. There is nothing microscopic connected to these assumptions, supporting my main thesis that quantum theory also may be relevant in a macroscopic context.
Finally, these derivations are also consistent with my epistemic interpretation of quantum theory, as shown by the following citation: ‘… current scientific knowledge derives, through cognitive processes in the human brain, from the discrete events which are observed in laboratory experiments and from the relationships between those experiments that we, humans, discover.’
As said in the Introduction above, it is also highly relevant that the concept of Fisher information is used in their detailed arguments.
In a recent, long, and detailed article[38], the Bell experiment and the violation of the so-called CHSH inequality are discussed from the general point of view of mathematical models and discrete data. It is concluded that discrete data recorded by experiments and mathematical models used to describe relevant features belong to different, separate universes and should be treated accordingly. This is a conclusion that seemingly has large consequences for ordinary statistical modelling.
In my opinion, this conclusion indeed seems to be supported in experiments like the Bell experiment. It is connected to the fact that any human being, including a theoretician who makes mathematical models, meets limitations that apply to these models: It may in certain cases be impossible to include more than two maximal accessible parameters in the models. More precisely: If the model contains two, really different, related maximal accessible parameters θ and η, it cannot at the same time contain a parameter ζ which is related to θ, but not to η. The notion of being related can be given a precise definition connected to the mind of a person or to the joint minds of a group of communicating persons (see Subsection 2.2). All this follows from the discussion in Helland[7][11], a discussion which is also briefly given in Helland (2024c).
In a number of books and articles, see Khrennikov[18][19] and Haven & Khrennikov[20][21] for some of them, Andrei Khrennikov has advocated what he calls quantum-like models. I will start with a particular argument given in Khrennikov[18].
He there takes the point of departure that the law of total probability
$$P(B \mid C) = \sum_a P(A_a \mid C)\, P(B \mid A_a \cap C), \tag{20}$$
where $\{A_a\}$ is a disjoint partition of the probability space, does not hold under all circumstances. He assumes in particular that a term of the form
$$2\lambda(B \mid a, C) \sqrt{\prod_a \big(P(A_a \mid C)\, P(B \mid A_a \cap C)\big)},$$
where $\prod_a$ denotes the product, may have to be added to the right-hand side of (20).
He then shows that quantum probabilities may be derived under the specific condition that $|\lambda(B \mid a, C)| \le 1$ for all $B, a, C$. One can then write $\lambda = \cos(\phi)$ for an angle $\phi$. Specialize to the case where the partition has two elements $A$ and $\bar{A}$, and use the elementary formula $x + y + 2\sqrt{xy}\cos(\phi) = |\sqrt{x} + e^{i\phi}\sqrt{y}|^2$ to arrive at
$$P(B \mid C) = |\psi(B)|^2,$$
where
$$\psi(B) = \sqrt{P(A \mid C)\, P(B \mid A \cap C)} + e^{i\phi} \sqrt{P(\bar{A} \mid C)\, P(B \mid \bar{A} \cap C)},$$
which can be seen as a special case of Born’s formula.
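The two-element case can be checked numerically. The sketch below compares the total-probability formula with the added interference term against the squared modulus of the amplitude. The function names and the numerical inputs are hypothetical and chosen only for illustration.

```python
import cmath
import math

def amplitude(p_a, p_b_given_a, p_b_given_not_a, phi):
    """Complex amplitude psi(B) for a two-element partition {A, not A}."""
    x = p_a * p_b_given_a
    y = (1.0 - p_a) * p_b_given_not_a
    return math.sqrt(x) + cmath.exp(1j * phi) * math.sqrt(y)

def total_probability_with_interference(p_a, p_b_given_a, p_b_given_not_a, phi):
    """Classical total probability plus the term 2*cos(phi)*sqrt(x*y)."""
    x = p_a * p_b_given_a
    y = (1.0 - p_a) * p_b_given_not_a
    return x + y + 2.0 * math.cos(phi) * math.sqrt(x * y)

# Assumed illustrative inputs.
args = (0.4, 0.5, 0.3, 1.0)
born_value = abs(amplitude(*args)) ** 2
modified_value = total_probability_with_interference(*args)
print(born_value, modified_value)
```

For an angle of π/2 the interference term vanishes and the classical law of total probability is recovered.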
This is a minor technical argument, but it again shows that quantum probabilities may be derived under conditions that are not necessarily microscopic. In fact, the many examples and the various other arguments in Khrennikov[18][19] underline this point. Let me end this subsection with a citation from the preface of Khrennikov[19].
‘Quantum-like modeling is built on the methodology and the mathematical apparatus of quantum theory and it is directed to applications outside of physics, namely to biology, cognition, psychology, decision making, economics, finances, social and political sciences, and artificial intelligence.’
Note that statistics as a science aims at a similar list of applications.
I have already mentioned that the law of total probability does not hold for quantum probabilities. This implies also that Savage’s sure thing principle[39] does not hold:
If an event B is true (has probability 1) under condition A, and it also is true under condition ¯A, then it is always true.
This may seem counterintuitive, but it can be understood by the fact that quantum mechanics allows states that can be seen as superpositions of the states specified by the condition A and by the condition ¯A, states where other conditions are focused upon. (I do not need the general superposition principle here; the ‘other conditions’ can be related to a complementary theoretical variable.)
Another property of quantum probabilities is that the probability of successive events depends on the order of the events. An example mentioned in the quantum decision literature is an opinion poll, where American citizens were asked about their opinions on Al Gore and on Bill Clinton. Empirically, it was shown that the answers depended on the order in which the questions were asked.
To prove this property, we also have to assume the well-known collapse rule of quantum mechanics: Start with a pure state $|\psi\rangle$, so that, according to the Born rule, the probability of an event $B$ is $\|\Pi_B|\psi\rangle\|^2$, where $\Pi_B$ is the projection on the indicator of $B$. Then the collapse rule says that after the measurement, the state changes to
$$|\psi_B\rangle = \frac{\Pi_B|\psi\rangle}{\|\Pi_B|\psi\rangle\|}.$$
For a derivation of this rule from a knowledge-based perspective, see Shrapnel et al.[40].
This gives for successive events $B$ and $C$:
$$P(B \text{ and then } C) = P(B \mid \psi)\, P(C \mid B) = \|\Pi_B|\psi\rangle\|^2 \left\| \Pi_C \frac{\Pi_B|\psi\rangle}{\|\Pi_B|\psi\rangle\|} \right\|^2 = \|\Pi_C \Pi_B|\psi\rangle\|^2, \tag{25}$$
which in general is different from $P(C \text{ and then } B)$.
Finally, it follows from (25) and a simple geometric argument that one in certain cases may have
$$P(B \text{ and then } C) > P(C),$$
which to some may seem counterintuitive, but can be illustrated by the so-called Linda paradox, discussed by several authors, for instance Busemeyer and Bruza[41].
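Both the order effect and the inequality can be seen in a two-dimensional toy case. The state and the two projectors below are assumptions made purely for illustration; they do not model the opinion poll or the Linda paradox themselves.

```python
import numpy as np

def projector(v):
    """Rank-one projector onto the normalized real vector v."""
    v = np.asarray(v, dtype=float)
    v = v / np.linalg.norm(v)
    return np.outer(v, v)

def sequential_probability(first, second, psi):
    """Probability of 'first and then second': || second @ first @ psi ||^2."""
    return float(np.linalg.norm(second @ first @ psi) ** 2)

psi = np.array([1.0, 0.0])    # assumed pure state
Pi_B = projector([1.0, 1.0])  # event B: projection onto the diagonal direction
Pi_C = projector([0.0, 1.0])  # event C: projection onto the second axis

p_C = float(np.linalg.norm(Pi_C @ psi) ** 2)
p_B_then_C = sequential_probability(Pi_B, Pi_C, psi)
p_C_then_B = sequential_probability(Pi_C, Pi_B, psi)

print(p_C, p_B_then_C, p_C_then_B)
```

In this toy case the two orders give different probabilities because the projectors do not commute, and the sequence "B and then C" is more probable than C alone.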
In the statistical applications below, I will only in some special cases go into details concerning the related quantum-mechanical calculations, which may be complicated. The main purpose of this Section is to point to some ideas under which such calculations may illuminate or complement a statistical analysis.
In a medical experiment, let $\mu_a, \mu_b, \mu_c$ and $\mu_d$ be continuous inaccessible parameters, the hypothetical effects of treatments $a, b, c$ and $d$, respectively. Assume that the focus of the experiment is to compare treatment $b$ with the mean effect of the other treatments, which is supposed to give the parameter $\tfrac{1}{3}(\mu_a + \mu_c + \mu_d)$. One wants to do a pairwise experiment, but it turns out that the maximal parameter which can be estimated is
$$\theta^b = \mathrm{sign}\big(\mu_b - \tfrac{1}{3}(\mu_a + \mu_c + \mu_d)\big).$$
(Imagine, for example, that one has four different ointments against rash. A patient is treated with ointment $b$ on one side of his back and with a mixture of the other ointments on the other side. It is only possible to observe which side improves most, but this observation is assumed to be very accurate. One can in principle do the experiment on several patients and select the patients where the difference is clear.)
Described in this way, it may be natural, after the data are collected, to do a Bayesian analysis with a prior given by P(θb=−1)=P(θb=+1)=1/2. But assume now that we have the following modification of the experiment:
The experiment is done on a selected set of experimental units, on whom it is known from earlier accurate experiments that the corresponding parameter
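$$\theta^a = \mathrm{sign}\big(\mu_a - \tfrac{1}{3}(\mu_b + \mu_c + \mu_d)\big)$$
takes the value $+1$. In other words, for a Bayesian analysis, one is interested in the priors
$$\pi = P(\theta^b = +1 \mid \theta^a = +1).$$
Consider first a full Bayesian approach, also toward these priors. Natural priors for $\mu_a, \dots, \mu_d$ are independent $N(\nu, \sigma^2)$ with the same $\nu$ and $\sigma$. By location and scale invariance, there is no loss of generality in assuming $\nu = 0$ and $\sigma = 1$. Then the joint prior of $\zeta^a = \mu_a - \tfrac{1}{3}(\mu_b + \mu_c + \mu_d)$ and $\zeta^b = \mu_b - \tfrac{1}{3}(\mu_a + \mu_c + \mu_d)$ is multinormal with mean 0 and covariance matrix
$$\begin{pmatrix} \tfrac{4}{3} & -\tfrac{4}{9} \\ -\tfrac{4}{9} & \tfrac{4}{3} \end{pmatrix}.$$
A numerical calculation from this gives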
$$\pi = P(\zeta^b > 0 \mid \zeta^a > 0) \approx 0.39.$$
This result can also be assumed to be valid when $\sigma \to \infty$, a case which can be considered as independent objective priors for $\mu_a, \dots, \mu_d$, more precisely, a joint non-informative prior for the parameters under the translation group; see Helland[42].
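The covariance matrix and the conditional probability can be reproduced with a short calculation. The sketch below assumes the independent standard normal priors stated above, uses the classical orthant formula for a zero-mean bivariate normal, P(X>0, Y>0) = 1/4 + arcsin(r)/(2·3.14159…) with correlation r, and adds a Monte Carlo cross-check. The seed and the sample size are arbitrary choices.

```python
import math
import numpy as np

# Coefficient vectors of the two contrasts in terms of (mu_a, mu_b, mu_c, mu_d).
c_a = np.array([1.0, -1 / 3, -1 / 3, -1 / 3])
c_b = np.array([-1 / 3, 1.0, -1 / 3, -1 / 3])

# With mu ~ N(0, I), the covariance of (zeta_a, zeta_b) is C C^T.
C = np.vstack([c_a, c_b])
cov = C @ C.T
corr = cov[0, 1] / math.sqrt(cov[0, 0] * cov[1, 1])

# Orthant probability for a zero-mean bivariate normal, then condition on zeta_a > 0.
p_joint = 0.25 + math.asin(corr) / (2 * math.pi)
prior_closed_form = p_joint / 0.5

# Monte Carlo cross-check (seed and sample size are arbitrary).
rng = np.random.default_rng(0)
mu = rng.standard_normal((200_000, 4))
zeta_a = mu @ c_a
zeta_b = mu @ c_b
prior_monte_carlo = float(np.mean(zeta_b[zeta_a > 0] > 0))

print(cov)
print(prior_closed_form, prior_monte_carlo)
```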
Now consider quantum probabilities for the same priors. Since again scale is irrelevant, a natural group on μa,…,μd is a 4-dimensional rotation group around a point (ν,…,ν) together with a translation of ν. Furthermore, ζa and ζb are contrasts, that is, linear combinations with coefficients adding to 0. The space of such contrasts is a 3-dimensional subspace of the original 4-dimensional space, and by a single orthogonal transformation, the relevant subset of the 4-dimensional rotations can be transformed into the group G of 3-dimensional rotations on this latter space, and the translation in ν is irrelevant. One such orthogonal transformation is given by
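$$\psi_0 = \tfrac{1}{2}(\mu_a + \mu_b + \mu_c + \mu_d),$$
Let $G$ be the group of rotations orthogonal to $\psi_0$. We find
$$\zeta^a = -\tfrac{2}{3}(\psi_1 + \psi_2 + \psi_3),$$
The rotation group element transforming $\zeta^a$ into $\zeta^b$ under $G$ is strongly related to the group element $g_{ab}$ transforming $a = -\tfrac{1}{\sqrt{3}}(1, 1, 1)$ into $b = -\tfrac{1}{\sqrt{3}}(1, -1, -1)$ under a group of rotations of unit vectors.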
Furthermore, let Ga be the maximal subgroup of G under which ζa is permissible. The following definition was given in Helland[42][43] and is further discussed in these two articles:
Definition 2. Let G be a group acting upon a parameter η, and let ζ be a function of η. We say that ζ is permissible if ζ(η1)=ζ(η2) implies ζ(gη1)=ζ(gη2) for all g∈G.
In general, if ζ is a permissible parameter in this way, one can define group actions h∈H on ζ by h(ζ(η))=ζ(gη) for g∈G.
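Definition 2 can be tested by brute force when the parameter space and the group are finite. The toy setting below (cyclic shifts on six points, and the functions `mod3` and `lower_half`) is an assumption made for illustration and is unrelated to the rotation groups of the example.

```python
from itertools import product

def is_permissible(zeta, group, space):
    """Check Definition 2: zeta(e1) == zeta(e2) must imply zeta(g e1) == zeta(g e2)."""
    for g in group:
        for e1, e2 in product(space, repeat=2):
            if zeta(e1) == zeta(e2) and zeta(g(e1)) != zeta(g(e2)):
                return False
    return True

# Toy parameter space {0,...,5} with the cyclic group of shifts acting on it.
space = range(6)
group = [lambda e, k=k: (e + k) % 6 for k in range(6)]

def mod3(e):
    """Residue modulo 3; shifts map residue classes to residue classes."""
    return e % 3

def lower_half(e):
    """Indicator of {0, 1, 2}; a shift by one splits this set."""
    return e < 3

print(is_permissible(mod3, group, space))        # True
print(is_permissible(lower_half, group, space))  # False
```

For the permissible function, the induced action on its values is well defined: each shift by k sends the residue class r to the class (r + k) mod 3.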
The subgroup Ga is here isomorphic with the unit vector transformation group of rotations around a together with a reflection in the plane perpendicular to a. The action by the group Ha induced on ζa by Ga is just a reflection together with the unit element.
Again, all these groups have their analogues in relation to the rotation group of unit vectors.
In conclusion, the whole situation is completely equivalent to the spin-example discussed in many books, for instance, Helland[4], and may be assumed to satisfy the postulates of Subsection 2.2 above. This implies, by an application of the Born rule (see Proposition 7 in Helland[4]):
$$\pi = P(\mathrm{sign}(\zeta^b) = +1 \mid \mathrm{sign}(\zeta^a) = +1) = \tfrac{1}{2}(1 + a \cdot b) = \tfrac{1}{3}.$$
So the two analyses give different results for the desired prior. Which solution should one recommend? Here is my opinion: Both solutions are based upon symmetries implied by group actions. The full Bayesian solution is based upon a prior distribution on the inaccessible parameters $\mu_a, \mu_b, \mu_c$ and $\mu_d$, which could be related to group actions upon these parameters. In the quantum solution, one ends up with a group acting upon the accessible parameters $\theta^a = \mathrm{sign}(\zeta^a)$ and $\theta^b = \mathrm{sign}(\zeta^b)$. In general, in applied statistics, it is crucial that the parameter space is not too large. From an applied point of view, symmetries based upon accessible parameters should be preferred when compared to rather abstract symmetries based upon inaccessible parameters. Therefore, even though the arguments are more complicated, I will here prefer the quantum probability solution. Related arguments are discussed more generally in the next subsections.
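The arithmetic behind the value 1/3 can be checked directly from the unit vectors a and b. The sketch below also verifies that one possible orthogonal transformation reproduces both contrasts. The rows for ψ1, ψ2 and ψ3 in the matrix `T` are an assumption: they are one choice consistent with the stated ψ0, ζa, a and b, and need not be the transformation the author used.

```python
import numpy as np

# Hypothetical orthogonal transformation; only the first row (psi_0) is given in the text.
T = 0.5 * np.array([
    [1, 1, 1, 1],     # psi_0
    [-1, -1, 1, 1],   # psi_1 (assumed)
    [-1, 1, -1, 1],   # psi_2 (assumed)
    [-1, 1, 1, -1],   # psi_3 (assumed)
], dtype=float)

# Unit vectors from the text.
a = -np.ones(3) / np.sqrt(3)
b = -np.array([1.0, -1.0, -1.0]) / np.sqrt(3)

# Express (2/sqrt(3)) * (unit vector . (psi_1, psi_2, psi_3)) in the mu coordinates.
coef_a = (2 / np.sqrt(3)) * a @ T[1:]
coef_b = (2 / np.sqrt(3)) * b @ T[1:]

# Born-rule value for the prior, as in the spin analogy.
born_prior = 0.5 * (1 + a @ b)

print(coef_a, coef_b, born_prior)
```

Under this assumed transformation, `coef_a` and `coef_b` are the coefficient vectors of the two contrasts, and the scalar product a⋅b equals the correlation -1/3 that appears in the Bayesian calculation.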
In applied statistics, it is crucial that the parameter space is not too large. For instance, in regression problems, when the number of regression variables p is larger than the number of units n, ordinary least squares regression runs into problems. Let, in general, ϕ be the set of all thinkable parameters that a statistician C wants to include in his model, and, for the sake of the argument, let us assume that there exists a group M acting on the space Ωϕ.
The total parameter ϕ may, in many cases, be so large that it cannot be estimated from the available data, using some estimation principle like unbiased estimation or equivariant estimation. Here, I will concentrate on the latter estimation principle.
In general, let a group $G$ act upon the space $\Omega_\theta$ over which a parameter $\theta$ varies. In many cases, this group may be induced by a group $\hat{G}$ acting upon the sample space $\Omega$, based upon some statistical model $P_\theta(x)$:
$$P_{g\theta}(\hat{g}x) = P_\theta(x) \quad \text{for all } x \in \Omega.$$
This introduces a homomorphism from $\hat{G}$ to $G$: If $\hat{g}_1$ is mapped to $g_1$ and $\hat{g}_2$ is mapped to $g_2$, then $\hat{g}_1\hat{g}_2$ is mapped to $g_1 g_2$.
An estimator $\hat{\theta}(x)$ of the parameter $\theta$ is said to be equivariant if $\hat{\theta}(\hat{g}x)$ is the estimator of $g\theta$ whenever $\hat{g}$ is mapped to $g$. Many arguments can be given for concentrating on equivariant estimators.
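A familiar special case is the translation group acting on a location parameter, where shifting every observation by a constant shifts the parameter by the same constant. The sketch below checks that the sample mean is equivariant in this sense, reading the definition as the requirement that the estimate computed from the shifted data equals the shifted estimate. The normal model, the seed, and the shift are illustrative assumptions.

```python
import numpy as np

def estimator(sample):
    """Sample mean as an estimator of a location parameter."""
    return float(np.mean(sample))

rng = np.random.default_rng(1)                 # arbitrary seed
x = rng.normal(loc=2.0, scale=1.0, size=50)    # assumed location model
shift = 3.5                                    # group element: translation by 3.5

estimate_from_shifted_data = estimator(x + shift)  # estimator applied to g-hat x
shifted_estimate = estimator(x) + shift            # g applied to the estimate

print(estimate_from_shifted_data, shifted_estimate)
```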
Under very weak conditions, there exists a right-invariant measure $\mu$ on $\Omega_\theta$ under the group $G$: First, a right-invariant Haar measure $\nu$ is defined on the group $G$ itself by $\nu(D \cdot g) = \nu(D)$ for all $D \subseteq G$ and $g \in G$. Then $\mu$ is said to be right-invariant if $\mu(A) = \nu(G_A)$ with $G_A = \{g : g\theta_0 \in A\}$ for some $\theta_0$. Left-invariant measures have a similar definition. In many cases, the left-invariant measure is equal to the right-invariant measure.
This introduces an invariant measure on every orbit of the group G: An orbit is the set {gθ0} for some θ0∈Ωθ. The space Ωθ is always divided into a disjoint set of orbits. If θ1 and θ2 belong to the same orbit, this orbit can equivalently be characterized by either {gθ1} or {gθ2}. If there is only one orbit of G in Ωθ, the group is said to be transitive.
An objective Bayes estimator with respect to G is an estimator that uses the right-invariant measure as a prior. In Helland[42], 12 different reasons for using such a prior are given; among other things, it can be proved that credibility sets with some credibility probability are equal to frequentist confidence sets with the same confidence probability.
Turning to quantum theory, it is important for the foundation that there exists a transitive group on the variable space (see point (i) in Theorem 1 of Subsection 2.2). If G should not be transitive, we can introduce the following model reduction principle:
Principle 1. Reduce Ωθ to an orbit of the group G. Choose the orbit such that a subparameter ζ of interest is permissible.
In Helland[44] and Helland[4], this principle is used on the electron spin, a qubit. It is shown that a classical model of spin can be reduced to a quantum model using this principle. In Helland et al.[45], the same principle was used to motivate the model reduction in multiple regression, leading to the partial least squares regression model.
Go back to the example in Subsection 3.1. By a change of notation, let G be the group given by reflections of three-dimensional unit vectors a together with rotations around a. This group is intransitive, and its orbits are found by fixing some a. This corresponds to what was called Ga there, and a group reduction gives the quantum-mechanical interpretation of the example.
More statistical theory related to transitive and intransitive groups defined on the parameter space and the sample space is given in Helland[42]. It is of independent interest that the statistical model corresponding to partial least squares regression can be motivated by model reduction to orbits of a certain group defined on the parameter space[46][47].
Partial least squares regression is an algorithmic method for estimating the regression parameter $\beta$, intended for the case of collinearity. It was connected to a statistical model in Helland[48]. Briefly, this model can be formulated as follows: Let $\Sigma_x$ be the covariance matrix of the $p$ explanatory variables $x_i$, $i = 1, \dots, p$, assumed to be random, and let $\{d_i\}$ be the eigenvectors of $\Sigma_x$. Decompose the regression vector $\beta$ of the response variable $y$ upon $x = (x_1, \dots, x_p)$ as
$$\beta = \sum_{i=1}^{p} \gamma_i d_i, \tag{27}$$
and then introduce the following hypothesis for some $m < p$:
$$H_m: \ \text{There are exactly } m \text{ nonzero terms in (27).}$$
There are two mechanisms by which the number of terms can be reduced: 1) Some terms are really zero; 2) There are coinciding eigenvalues of $\Sigma_x$, and then the eigenvectors may be rotated in such a way that there is only one in the relevant eigenspace that is along $\beta$.
Then the following is proved in Helland[48]: The parametric version of the partial least squares regression algorithm stops after m steps under the hypothesis Hm. Using the resulting partial least squares regression for prediction seems to give a good solution to the collinearity problem.
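The stopping property can be illustrated numerically under one common description of the population version of the algorithm, namely that it works within the Krylov sequence generated by Σx and the covariance vector Σxβ between x and y. That description is an assumption here and is not spelled out in the text. The dimensions, eigenvalues, and coefficients below are likewise hypothetical, and the eigenvalues are taken to be distinct so that only the first mechanism is in play.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed
p, m = 6, 2

# Random orthonormal eigenvectors d_i (columns of Q) and distinct eigenvalues.
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
eigenvalues = np.array([5.0, 4.0, 3.0, 2.0, 1.5, 1.0])
Sigma_x = Q @ np.diag(eigenvalues) @ Q.T

# Hypothesis H_m: exactly m nonzero coefficients gamma_i in the expansion of beta.
gamma = np.zeros(p)
gamma[[0, 3]] = [1.0, -2.0]
beta = Q @ gamma
sigma_xy = Sigma_x @ beta        # covariance vector between x and y

# Normalized Krylov vectors sigma_xy, Sigma_x sigma_xy, Sigma_x^2 sigma_xy, ...
columns = []
for k in range(p):
    v = np.linalg.matrix_power(Sigma_x, k) @ sigma_xy
    columns.append(v / np.linalg.norm(v))
K = np.column_stack(columns)
krylov_dim = int(np.linalg.matrix_rank(K, tol=1e-8))

# Regression vector restricted to the span of the first m Krylov vectors.
K_m = K[:, :m]
beta_m = K_m @ np.linalg.solve(K_m.T @ Sigma_x @ K_m, K_m.T @ sigma_xy)

print(krylov_dim, np.allclose(beta_m, beta))
```

In this sketch the Krylov space has dimension m, so nothing is gained after m steps, and the regression vector restricted to that space coincides with β.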
In Helland[47], this is studied further. Among other results, one can prove the following: (Theorem 5 in op. cit.) Using a least squares criterion, the partial least squares model under certain technical conditions gives the best possible model reduction for linear prediction.
In discussing this and related results, it turns out to be of some relevance to use results from the foundation of quantum theory, in particular Theorem 1 from Subsection 2.2. Specifically, let ϕ=(β,Σx,σ2) be the full parameter of the model, where σ2=Var(y|xi;i=1,…,p), let θ=θ(ϕ) be the model reduced β under the hypothesis Hm, and let η=η(ϕ) be any other m-parametric model reduction of β. Then it is shown in Helland[47] that the assumptions of Theorem 1 are satisfied.
Using this, it is shown: The technical condition ensuring that the PLS regression model (θ) is better than the arbitrary reduction η holds if a statistician B has a non-informative prior on η.
Furthermore, assume Postulate 5 and Postulate 6. Then the general version of the Born formula holds. In Postulate 6, we may specify the superior actor D to represent general scientific ideals connected to any statistician making the statistical analysis. Note that the probabilities here must be interpreted as probabilities as calculated by D, that is, probabilities assuming general scientific ideals.
The discussion in this subsection must be considered tentative. There are mathematical issues that should be resolved.
In the traditional approach to quantum mechanics, the Hilbert space is directly determined by the variable considered. If this is the position of a particle, say, the Hilbert space is L2(R,dμ), where μ is Lebesgue measure, and the operator corresponding to position x is just a multiplication by x.
Similarly, for a continuous statistical parameter θ that varies over the whole space, and where the relevant group is the translation group, we can take the Hilbert space to be L2(R,dμ), and take the operator Aθ corresponding to θ to be multiplication of f∈L2(R,dμ) by θ.
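This also determines the operator for any $\xi = \xi(\theta)$: By the spectral theorem, we have $A^\theta = \int_{\sigma_{A^\theta}} \lambda\, dE^{A^\theta}$, which gives $A^\xi = \int_{\sigma_{A^\theta}} \xi(\lambda)\, dE^{A^\theta}$. This reduces to multiplication with $\xi(\theta)$ in this case.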
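In finite dimensions the same construction replaces the integral by a sum over eigenvalues. The following is a minimal sketch of that analogue. The matrix and the function ξ(λ) = λ² are assumptions chosen so that the result can be compared with an ordinary matrix product.

```python
import numpy as np

def function_of_operator(A, xi):
    """Finite-dimensional spectral calculus: sum over xi(lambda_i) times the eigenprojector."""
    eigenvalues, eigenvectors = np.linalg.eigh(A)
    # Scale each eigenvector column by xi(lambda_i), then recombine.
    return (eigenvectors * xi(eigenvalues)) @ eigenvectors.conj().T

A_theta = np.array([[2.0, 1.0], [1.0, 2.0]])   # assumed symmetric operator
A_xi = function_of_operator(A_theta, lambda lam: lam ** 2)

print(A_xi)
print(A_theta @ A_theta)
```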
Complementarity is a notion due to Niels Bohr, who called the position and momentum of a particle complementary variables. In my theory, in a statistical context, two parameters are called complementary if they are really different and both are maximal accessible variables. By Theorem 1, the existence of two such complementary parameters is the essential basis for the development of quantum phenomena in a statistical setting.
Subsection 3.1 gave a setting where two discrete complementary parameters implied quantum probability as a possible prior. For continuous parameters, the theory is more complicated. The point is that an operator such as multiplication by θ on L2(R,dμ) has a continuous spectrum and no eigenvectors in the space, so the Hilbert space has no basis consisting of eigenvectors of this operator. On many occasions, also in this article, it may nevertheless be useful to think in terms of a finite set of basis vectors. This corresponds to parameters θ and η taking a finite number of values. Continuous parameters may be approximated by such finite-valued parameters. For mathematicians, a direct strictly precise theory is given in Hall[27].
Example
Consider a modified version of the example from Subsection 3.1. Assume that in the rash-medicine illustration, we really are able to measure the difference in improvement on the sides of each patient’s back, and thus in one set of measurements get an estimate of $\zeta^a = \mu_a - \tfrac{1}{3}(\mu_b + \mu_c + \mu_d)$ and in another set of measurements get an estimate of $\zeta^b = \mu_b - \tfrac{1}{3}(\mu_a + \mu_c + \mu_d)$. These are contrasts, but not orthogonal contrasts.
We are interested in the contrast ζb, but relative to a particular population for which we have information about the contrast ζa. This information is obtained from a previous experiment that has been analyzed by either frequentist or Bayesian methods. In the first case, we have obtained a confidence distribution of the contrast ζa, and in the last case, a posterior probability distribution. In either case, we possess now a probability density pa(ζ) for ζa. We will use this to find a prior for the experiment on ζb, relevant for the resulting population.
Now we use Theorem 1. The assumptions may be shown to be satisfied, see below, with $\theta = \zeta^b$, $\eta = \zeta^a$, $G$ equal to the translation group on $\theta$, and $M$ equal to the translation group on $\phi = (\mu_a, \mu_b, \mu_c, \mu_d)$. The result is two symmetric operators, $A^a$ corresponding to $\zeta^a$ and $A^b$ corresponding to $\zeta^b$. We can take $A^a$ to be of the simple multiplication form. The operators will be self-adjoint. Using the spectral theorem for $A^a$ ($A^a = \int \zeta\, dE^a(\zeta)$) and the probability density $p^a(\zeta)$, we find a density operator $\rho^a = \int p^a(\zeta)\, dE^a(\zeta)$, and the spectral theorem for $A^b$ ($A^b = \int \eta\, dE^b(\eta)$) gives a projection operator $\Pi_B = \int_B dE^b(\eta)$ for each Borel set $B$. Then, from Born’s formula, a prior for $\zeta^b$ is given by
$$P(\zeta^b \in B \mid \rho^a) = \mathrm{trace}(\rho^a \Pi_B). \tag{28}$$
It is crucial for this example that both the parameters $\zeta^a$ and $\zeta^b$ can be seen to be maximal as accessible parameters; see Definition 1 of Subsection 2.2. They belong to different experiments and must be maximal in these experiments: For instance, any parameter $\lambda$ for the first experiment such that $\zeta^a$ is a function of $\lambda$ which is not bijective must be inaccessible, that is, not possible to estimate from the available data. This is a rather strong requirement.
One such potential $\lambda$ will be the vector $(\mu_a, \tfrac{1}{3}(\mu_b + \mu_c + \mu_d))$, and another will be the vector $(\mu_a, \mu_b, \mu_c, \mu_d)$. These must be inaccessible; it is only possible to measure contrasts/differences between sides of the back. (This can be seen as a somewhat strange requirement, but it is here crucial for my arguments.) Furthermore, one must argue that priors should only depend upon accessible parameters. Then, a quantum prior determined by (28) can be motivated for the second experiment.
Note that in this example, if we take the basic inaccessible variable as $\phi = (\mu_a, \mu_b, \mu_c, \mu_d)$, and $M$ as the translation group acting on $\phi$, then the contrast function $\theta = \zeta^b = \mu_b - \tfrac{1}{3}(\mu_a + \mu_c + \mu_d)$ is not permissible with respect to $M$. (See Definition 2 of Subsection 3.1.) The group $G$ acting on $\theta$ is not induced by the full group $M$.
In the discussion above, I have assumed that we really had knowledge about ζa for each unit, so that this was available for the selection of units. For the more realistic case, see below.
The essence of this example can be generalized. Assume a statistical model P(z|ϕ) with density p(z|ϕ) with some large parameter space Ωϕ. Let M be a group acting upon Ωϕ, and let θ=θ(ϕ) be some focus parameter. Assume that θ is maximally accessible in the given situation, that is, 1) It can be estimated using some estimation principle; 2) If θ=f(ξ) for a non-invertible function f, then the parameter ξ cannot be estimated. Assume that there is a transitive group G with a trivial isotropy group acting on Ωθ. Consider some experimental units, and make an experiment in accordance with the model on these units. This gives a Bayesian posterior or a confidence distribution (more generally, a fiducial distribution[49]) with density p(θ). In the Bayesian case, it is natural, if possible, to use as a prior the invariant measure associated with the group G.
Then do a new experiment on a selected set of units, selected according to the probability distribution p(θ). Let η=η(ϕ) be another maximal accessible parameter, a focus parameter on the second experiment, and essentially different from θ. Then, according to Theorem 1, there exists a Hilbert space H, and two symmetric operators Aθ and Aη in H, one associated with θ and one with η. Let Aθ=∫θdEθ(θ) be the spectral decomposition of Aθ, and define ρθ=∫p(θ)dEθ(θ). Then it can be argued that a prior π for the second experiment should be chosen such that π(B)=trace(ρθΠB), where ΠB is the projection operator defined by ΠB=∫BdEη(η), with {Eη} chosen such that Aη=∫ηdEη(η).
All this assumes that units can be chosen by values of θ that are really known. If not, we define u=ˆθ, where ˆθ is the chosen estimator of θ, and we must replace ρθ by ρu=∫r(u)dEu(u), where r(u)=∫q(u|θ)p(θ)dθ with q being the density in the distribution of the estimator, assumed to only depend upon θ, and where {Eu} is found from the spectral distribution of Au=∫uq(u|θ)dEθ(θ). Units are then chosen from data z of the first experiment according to the density r(u(z)).
Similar discussions can, in principle, be made in very complicated statistical models. For many such models, the groups G and M of Theorem 1 can be defined. An example from the design of experiments where such groups can be defined is the model for randomized experiments discussed by Bailey[50]. But to discuss examples where links to quantum theory can be found, one has to have a basic inaccessible parameter ϕ, and two really different maximal accessible (complementary) parameters.
An even more general class of statistical models is discussed by McCullagh[51] using category theory. Group theory can be seen as a special case of category theory, and using this, examples with groups G and M can be found. But again, it is a challenge to find applied examples with two different maximal accessible parameters.
It is obvious that more research in this area is required.
For the purpose of this subsection, it is crucial that my foundation of quantum theory is also valid when relevant accessible variables are macroscopic. In the language of Khrennikov[18][19], I also want to include quantum-like models, which are applicable in biology, economics, psychology, and many other disciplines. Quantum structure may be ubiquitous. A special case is quantum cognition theory[41], which includes quantum decision theory.
The traditional tool for decision-making in statistics is to minimize the expected value of a loss function. However, there are many decisions that are made in a statistical analysis, which cannot be seen in this way: The choice of model, the choice of method in the analysis, the choice of variables or set of variables to include in a multiple regression setting, or the choice to report or not report a p-value. These are examples of decisions made by a statistician or by a communicating group of statisticians, decisions that sometimes can be modeled by quantum decision theory[52].
Decisions can be made on the basis of knowledge, on the basis of beliefs, or both. They are always made in a concrete context. Single persons can make decisions, and joint decisions can be made by a group of communicating persons.
Consider a person C or a group of persons in some decision situation. Say that he or she or they has/have the choice between a finite set of actions a1,…,ar. Relative to this situation, we can define a finite-valued decision variable θ, taking the different values 1,…,r, such that θ=i corresponds to the action ai (i=1,…,r). If C (or the group) really is (are) able to make a decision here and carry out the actions, we say that θ is an accessible variable. As discussed for the general situation in Subsection 2.2, the variable θ is in relation to a person C or to the group of persons; in fact, here θ belongs to the mind of C or to the joint minds of the group.
This must be made precise. A decision problem is said to be maximal if C (or the group) is (are) just able to make up his (their) mind(s) with respect to this decision; if the problem is made slightly more complicated, he (they) is (are) not able to make a decision. Let two (completely) different maximal decision variables be θ and η, where θ=i corresponds to the action ai (i=1,…,r), and η=j corresponds to the action bj (j=1,…,r). Then, by the theory of Subsection 2.2, we can model the situation by using quantum theory.
Note that both θ and η may be vector variables. Say that θ=(θ1,…,θn), where each θj is a simpler decision variable. Then this corresponds to a situation where the actor(s), in addition to the difficult decision given by η, is (are) faced with n more simple decisions. Such a situation is not uncommon. In each situation where we shall make a difficult decision, we will be in a context where also a number of trivial decisions may have to be made just in order to survive and to function well in the given context. For many people, these trivial decisions occupy a large portion of their mind, such a large part that the vector decision variable θ also must be considered to be maximal. The assumption that both θ and η take the same number of values r, can be satisfied by artificially adding some actions to one of the decision problems.
It is crucial that both θ and η can be seen as functions of some large inaccessible variable ϕ. The solution here depends on which philosophy one has. One psychological theory might be that our more automatic decisions depend upon our culture and upbringing, which, modeled in some way, can be seen as a part of ϕ. In addition, ϕ must contain something that may be called our free will.
The simple model above does not cover all situations. Sometimes we have a choice between an infinite number of possibilities, and sometimes the outer context changes during the decision process. Nevertheless, the simple model is a good starting point.
It is well known that our minds may be limited, for instance, when faced with difficult decisions. I will first mention a side result in this direction from the present development.
In Helland[7], Theorem 2 says essentially: Imagine a person C who, in some context, has two related maximal accessible variables θ and η in his mind. Impose a specific symmetry assumption. Then C cannot simultaneously have in mind any other maximal accessible variable which is related to θ, but not related to η. It was claimed in Helland[7] that the violation of a famous inequality by practical Bell experiments can be understood on the basis of this theorem. See also Helland (2023e), where a corresponding theorem is formulated without any symmetry assumption.
Note that this result has the qualification ‘at the same time’ and indicates a specific restriction to two maximal variables. But the human mind is very flexible. Taking time into account, we can think of very many variables, even ones that are not related.
For the present article, however, the direct results from Subsection 2.2 are equally important. Consider again a decision situation, and assume the simple model of the present subsection. In particular, let C at the same time be confronted with at least two different maximal related decision processes. Then the following hold:
- Each decision variable η is associated with a self-adjoint operator Aη, whose eigenvalues are the possible values of η.
This can be taken as a starting point of quantum decision theory, but to develop this theory further, we need to be able to calculate probabilities for the various decisions. For this, I refer to the discussion of the Born rule in Subsection 2.3. In particular, note that the probabilities are assumed to be calculated in a way that can be associated with some (abstract or concrete) superior being, assumed to be perfectly rational. The interpretation of this point also depends on our philosophy. My own view is discussed in Helland[53].
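As a minimal numerical sketch of these two ingredients, the code below assumes a real r-dimensional Hilbert space, an arbitrarily chosen orthonormal basis serving as the eigenvectors of Aη, and a pure state vector ψ standing in for the state of mind of C. The function names, the basis, and the state are illustrative assumptions and are not taken from Subsection 2.2 or 2.3; only the spectral form of a self-adjoint operator and the Born rule for a pure state are standard.

```python
import numpy as np

def decision_operator(values, basis):
    """Build A = sum_j values[j] * e_j e_j^T from the orthonormal columns e_j of `basis`."""
    values = np.asarray(values, dtype=float)
    return basis @ np.diag(values) @ basis.T

def born_probabilities(basis, psi):
    """Born rule for a pure state psi: P(eta = values[j]) = <e_j, psi>^2."""
    psi = np.asarray(psi, dtype=float)
    psi = psi / np.linalg.norm(psi)          # normalize the state vector
    return (basis.T @ psi) ** 2

# Hypothetical example: r = 3 actions b_1, b_2, b_3 coded as eta = 1, 2, 3.
rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # arbitrary orthonormal basis
values = [1.0, 2.0, 3.0]
A_eta = decision_operator(values, basis)

psi = rng.normal(size=3)                           # arbitrary (unnormalized) state
probs = born_probabilities(basis, psi)
expected_eta = float(probs @ np.array(values))     # mean of eta under the Born rule
```

The eigenvalues of `A_eta` are the codes 1, 2, 3 of the possible decisions, and `expected_eta` coincides with the quadratic form ψᵀAηψ for the normalized ψ.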
The literature on Artificial Intelligence, in particular Machine Learning, has exploded in recent years. For the purpose of this article, I will focus on a simple neural network with one hidden layer and a single output, as described from a statistical point of view in Efron and Hastie[54]. Here, assume p predictors (features) x=(x1,x2,…,xp), which for simplicity are centered to have zero expectation, k hidden units $a_l = g\big(\sum_{j=1}^{p} w_{jl} x_j\big)$ (l=1,…,k) and output $z = h\big(\sum_{l=1}^{k} v_l a_l\big)$, where g and h are non-linear, monotonically increasing functions satisfying g(0)=0 and h(0)=0, and {wjl} and {vl} are the parameters of the model, the weights. The {xj} and z are observed on n units, and our task, the learning of the network, is to estimate the weights. In many applications, n is very large, and procedures such as backpropagation are used. I will here also consider the case where n is moderate, perhaps smaller than p. Then a model reduction may be called for.
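The forward pass of this network can be sketched in a few lines. The choice g = h = tanh is an assumption made only for illustration (it is monotonically increasing and vanishes at 0, as required); the dimensions and the random weights are likewise made up.

```python
import numpy as np

def forward(x, W, v, g=np.tanh, h=np.tanh):
    """Forward pass of the one-hidden-layer network with a single output.

    x: (p,) or (n, p) predictors; W: (p, k) weights w_jl; v: (k,) weights v_l.
    Returns the hidden units a, of shape (k,) or (n, k), and the output z.
    """
    a = g(x @ W)   # a_l = g(sum_j w_jl x_j), l = 1, ..., k
    z = h(a @ v)   # z = h(sum_l v_l a_l)
    return a, z

# Hypothetical sizes and weights, for illustration only.
rng = np.random.default_rng(1)
n, p, k = 5, 4, 3
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)        # center the predictors
W = rng.normal(size=(p, k))
v = rng.normal(size=k)
a, z = forward(X, W, v)
```

Because g(0)=0 and h(0)=0, a unit with x=0 is mapped to z=0, which is the property used later in the model reduction.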
Before discussing this, I take a brief look at some of the recent literature concerning links between machine learning and quantum theory.
In their abstract, Dunjko and Briegel[3] mention 3 points: 1) Quantum computing is finding vital applications in providing speed-ups for machine learning problems. 2) Machine learning may become instrumental in advanced quantum technologies. 3) One can consider quantum generalizations of learning and artificial intelligence concepts.
Op. cit. is a review article with many references to recent papers. Quantum models of relevance to machine learning are discussed in detail. Historically, the first such model was the quantum Turing machine[55], but there are many more modern models. In general, machine learning can be divided into supervised and unsupervised learning. In supervised learning, we start with a training set (yi,xi),i=1,…,n, where xi is a vector, and the task is to predict y from x on a new unit, so multiple regression can be seen as a special case of machine learning. Quantum information is a more general concept, where data are replaced by quantum states. This is a large area with independent literature. Again, specializing to the multiple regression case, one can mention works by Wiebe et al.[56], Wang[57], and Schuld et al.[58].
In two recent articles, Wu et al.[59] and Zhu et al.[60] develop network models that can simultaneously predict multiple quantum properties and the behavior of an unknown quantum process.
Quantum foundations seek to understand and develop the mathematical and conceptual basis for quantum theory. Bharti et al.[61] survey representative works at the interface of machine learning and quantum foundations. Special topics considered are entanglement, Bell-type inequalities, and contextuality. It is proposed that neural networks can be seen as ‘hidden’ variable models for quantum systems.
Go back to the simple model introduced in the first paragraph of this Section. We are interested in model reduction. The whole neural net is usually learned by gradient descent[54][62]. For the purpose of model reduction, I will concentrate on a single perceptron

$$a = g\Big(\sum_{j=1}^{p} w_j x_j\Big).$$

I now, as a model, assume that the features x=(x1,…,xp) have a random distribution with expectation 0 and covariance matrix Σ, assumed to be positive definite. Expand w=(w1,…,wp) in terms of eigenvectors di of Σ:

$$w = \sum_{i=1}^{p} \gamma_i d_i. \tag{30}$$

Then, completely in analogy with partial least squares regression (Subsection 3.3), introduce the model reduction (θm)

$$H_m: \text{There are exactly } m \text{ nonzero terms in (30).}$$

The theory of Helland[47] carries over. We can represent Hm with the parameter θm=(γ1d1,…,γmdm), which is a function of ϕ=(w,Σ). Assume any other model reduction ηm to m terms. Assume a non-informative probability distribution of ηm. Then, it can be proved that by a least squares criterion, θm gives a better model reduction than ηm.
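To make the expansion (30) and the hypothesis Hm concrete, the sketch below computes the coefficients γi for a given pair (w, Σ) and counts the nonzero terms. The function names, the tolerance, and the simulated Σ and w are assumptions for illustration, and the eigenvalues of Σ are assumed to be distinct so that the eigenvectors are determined up to sign.

```python
import numpy as np

def eigen_expansion(w, Sigma):
    """Return (gamma, D) with w = sum_i gamma[i] * D[:, i], D the eigenvectors of Sigma."""
    _, D = np.linalg.eigh(Sigma)     # columns of D are orthonormal eigenvectors d_i
    gamma = D.T @ w                  # gamma_i = <d_i, w>
    return gamma, D

def number_of_terms(gamma, tol=1e-10):
    """The m for which H_m holds: the number of nonzero coefficients gamma_i."""
    return int(np.sum(np.abs(gamma) > tol))

# Made-up positive definite covariance matrix.
rng = np.random.default_rng(2)
p = 5
B = rng.normal(size=(p, p))
Sigma = B @ B.T + p * np.eye(p)

# Construct a weight vector from exactly two eigenvectors of Sigma.
_, D_true = np.linalg.eigh(Sigma)
w = 2.0 * D_true[:, 0] - 1.5 * D_true[:, 3]

gamma, D = eigen_expansion(w, Sigma)
m = number_of_terms(gamma)
# theta_m collects the nonzero terms gamma_i * d_i of the expansion.
theta_m = [gamma[i] * D[:, i] for i in np.flatnonzero(np.abs(gamma) > 1e-10)]
```

Here `m` equals 2 by construction, and summing the vectors in `theta_m` recovers w, so the reduced parameter carries all the information about w when Hm holds.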
In proving this in op. cit., essential use was made of a joint quantum model for θm and ηm. There, and here, the assumptions of Theorem 1 (Subsection 2.2) can be shown to be satisfied. The group G acting on θ is given by orthogonal transformations of the dj’s and γj↦αjγj (j=1,2,…,m). It is convenient to let the group M on ϕ be defined by orthogonal transformations of all the di’s (i=1,…,p) and by γi↦g(αiγi). Then the orbits of the group M are given by m and the hypothesis Hm (Theorem 2 in Helland[47]). This is the reason why I have chosen the constraint g(0)=0.
To carry out the model reduction for the perceptron in theory, and also in practice, the whole literature on partial least squares can be taken over if we base ourselves on x and $y = g^{-1}(a) = \sum_{j=1}^{p} w_j x_j$. The theoretical population algorithm is given in Appendix 1 of Helland[47]. In practice, we have data on n copies of (x,y), and in the algorithm, theoretical variances and covariances must be replaced by estimated variances and covariances. The size m can be determined by cross-validation.
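The population algorithm of Helland[47] is not reproduced here. As a stand-in, the sketch below uses the well-known Krylov-subspace form of partial least squares regression with a single response, with sample covariances in place of the population quantities, and chooses m by K-fold cross-validation. The function names, the fold scheme, and the simulated data are illustrative assumptions.

```python
import numpy as np

def pls_coefficients(X, y, m):
    """PLS regression vector with m components (Krylov form); X and y must be centered."""
    n = X.shape[0]
    S = X.T @ X / (n - 1)              # estimated covariance matrix of x
    s = X.T @ y / (n - 1)              # estimated covariances between x and y
    cols = [s]
    for _ in range(m - 1):
        cols.append(S @ cols[-1])      # Krylov sequence s, S s, S^2 s, ...
    Q, _ = np.linalg.qr(np.column_stack(cols))   # orthonormal basis of the Krylov space
    return Q @ np.linalg.solve(Q.T @ S @ Q, Q.T @ s)

def choose_m(X, y, max_m, n_folds=5, seed=0):
    """Pick m in 1..max_m by K-fold cross-validated squared prediction error."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, n_folds)
    errors = np.zeros(max_m)
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        x_mean, y_mean = X[train].mean(axis=0), y[train].mean()
        for m in range(1, max_m + 1):
            beta = pls_coefficients(X[train] - x_mean, y[train] - y_mean, m)
            pred = y_mean + (X[test] - x_mean) @ beta
            errors[m - 1] += np.sum((y[test] - pred) ** 2)
    return int(np.argmin(errors)) + 1, errors

# Simulated data (an assumption): y plays the role of g^{-1}(a) plus noise.
rng = np.random.default_rng(3)
n, p = 40, 6
X = rng.normal(size=(n, p))
w = rng.normal(size=p)
y = X @ w + 0.1 * rng.normal(size=n)

Xc, yc = X - X.mean(axis=0), y - y.mean()
beta_full = pls_coefficients(Xc, yc, p)     # m = p: no reduction
m_hat, cv_errors = choose_m(X, y, max_m=p)
```

With m = p and n > p, the Krylov space is generically the whole of the predictor space and the PLS vector reduces to the ordinary least squares solution; smaller m gives the reduced models among which cross-validation selects.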
Then all this should be incorporated into the algorithm for the whole neural network. Look again at the simple model defined at the beginning of this Subsection. We are given data (x,z) on n units, where all these variables are centered by their means. We use a feedforward procedure, which means that we start the estimation by first looking at the transition from the x-data to the variables al, then from al to z. The now well-known procedures here can be found in the machine learning literature. Each of these steps must now be replaced by a series of steps of the type described in the previous paragraph. I omit the details here, since the purpose of the present article is to introduce ideas.
A model reduction of this kind can be expected to give an advantage when there is not too much data (n) compared to the number p of variables, or perhaps more accurately, as shown in the partial least squares case by the recent asymptotics of Cook and Forzani[63], in the abundant case where many predictors x contribute information about the response z, often correlated information.
The social scientist Ralph D. Stacey once said: ‘Culture is a set of attitudes, opinions and convictions that a group of humans share, about how one shall behave against each other, how things shall be evaluated and done, which questions that are important and which answers that are accepted. The most important elements of culture are unconscious, and cannot be forced upon us from the outside.’
From this perspective, statistical theory and quantum theory, as they have functioned up to now, may be seen as connected to separate cultures. It is hoped that this article may help to bridge the gap between these two cultures.
The investigations here started with Helland[64], where mathematical models in various sciences were discussed from several points of view. With the present article, this discussion can be said to lead to concrete results.
Of course, there are differences between models in quantum theory on the one side and statistically related models on the other side. It is important that quantum models are always seen in a context, and that they often are related to an observer or a group of communicating observers. Contextual quantum measurements have been discussed from several points of view by Khrennikov[65]. By contrast, statistical models are more universal. But this does not contradict the fact that quantum-like models are ubiquitous, cf. Khrennikov[18][19].
I really want to thank Andrei Khrennikov for his yearly, very enlightening conferences on quantum foundations and related topics. I have learned a lot by attending a few of these conferences. I also want to thank Christopher Fuchs and Richard Gill for their patience in trying to understand my basic message. Finally, I am grateful to Wolfgang Tiefenbrunner for making me aware of the works by Hans De Raedt et al. I am grateful to Solve Sæbø and Trygve Almøy for discussions, and I am grateful to Gudmund Hermansen for doing numerical calculations in connection to the experiment in Subsection 3.1.