Modelling Football Data

By
Renzo Galea

A Dissertation Submitted in Partial Fulfilment of the Requirements
For the Degree of Bachelor of Science (Honours)
Statistics and Operations Research as main area

DEPARTMENT OF STATISTICS AND OPERATIONS RESEARCH
FACULTY OF SCIENCE
UNIVERSITY OF MALTA

MAY 2011

Declaration of Authorship

I, Renzo Galea 25889G, declare that this dissertation entitled:
“Modelling Football Data”, and the work presented in it is my own.

I confirm that:
(1) This work was carried out under the auspices of the Department of Statistics and Operations Research in partial fulfilment of the requirements of the Bachelor of Science (Hons.) course.
(2) Where any part of this dissertation has previously been submitted for a degree or any other qualification at this university or any other institution, this has been clearly stated.
(3) Where I have used or consulted the published work of others, this is always clearly attributed.
(4) Where I have quoted from the works of others, the source is always given. With the exception of such quotations, this dissertation is entirely my own work.
(5) I have acknowledged all sources used for the purpose of this work.

Signature:

_______________________

Date:

_______________________

Abstract
Renzo Galea, B.Sc. (Hons.)
Department of Statistics & Operations Research
May 2011
University of Malta

The main goal of this dissertation is to investigate the performance of Bayesian modelling for football data. An extensive study of Markov processes and the Bayesian statistical approach is carried out, with special reference to the Markov Chain Monte Carlo sampling technique. Using real data from the Italian Serie A championship, a Bayesian modelling application (following the rationale of Gianluca Baio and Marta A. Blangiardo) is considered and compared with the performance of a comparable generalised linear model.

Acknowledgements
It is my pleasure to thank the several people who have made this dissertation possible with their precious support.
First and foremost, I am deeply thankful to my supervisor, Professor Lino Sant (Head of the Statistics & Operations Research Department, University of Malta), whose continuous encouragement and professional guidance motivated me to develop a thorough understanding of the subject.
I also wish to express my deep gratitude to two more individuals for their time and availability. The first is Vincent Marmara (from Betfair Group, Malta), who helped me develop a better picture of the sports betting industry. The other is Professor Gianluca Baio (from the Department of Statistical Science, University College London), with whom I corresponded frequently via email regarding my queries about the implementation of his Bayesian mixture model.
Lastly I would also like to extend my sincerest appreciation to all staff and friends within the Statistics & Operations Research Department who have provided the best possible environment in which to learn and grow throughout the whole course.

Contents

Abstract
Acknowledgements
List of Figures
List of Tables

1 Introduction
1.1 Structure of Dissertation

2 Literature Review
2.1 Markov Processes
2.2 Markov Chain Monte Carlo (MCMC)
2.3 Bayesian Inference
2.4 Football Analysis

3 Markov Chains & MCMC Algorithms
3.1 Introduction
3.2 Markov Chains
3.3 Transition Probabilities
3.4 Properties
3.5 General state-space Markov Chains
3.6 Convergence
3.7 MCMC
3.8 The Metropolis-Hastings Algorithm

4 The Bayesian Approach
4.1 General Framework
4.2 Bayes’ Rule
4.2.1 Likelihood Functions
4.2.2 Posterior Distribution
4.2.3 Sufficient Statistics
4.3 Bayesian Estimation
4.3.1 Point Estimation
4.3.2 Interval Estimation
4.3.2.1 Credible Interval
4.3.2.2 Highest Posterior Density Interval
4.4 Hypothesis Testing
4.4.1 Simple Null & Alternative Hypotheses
4.4.2 Simple Null & Composite Alternative Hypotheses

5 Model Estimation using Football Data
5.1 Introduction
5.2 Data Set
5.2.1 Descriptive Statistics
5.3 Defining the Problem
5.4 Bayesian Modelling Procedure
5.4.1 Results
5.5 Poisson Regression Formulation
5.5.1 Results
5.6 Bayesian vs GLM

6 Conclusion

A Matlab m-files
A.1 Calculating team strengths (using previous 6 match days)

B WinBUGS files (Baio & Blangiardo (2010))
B.1 Bayesian mixture model
B.2 Initial values

C SPSS files
C.1 Data (Serie A season 1994/1995)

Bibliography

List of Figures

3.1 Markov Model format.
4.1 The Bayesian synthesis.
5.1 Histogram for the number of home and away goals.
5.2 Posterior densities for the attack parameters.
5.3 Posterior densities for the defense parameters.
5.4 The estimated team parameters vs the final league points.
5.5 Residuals for the 2 generalised linear models.
5.6 Plotting the differences between the home attack parameters.

List of Tables

5.1 Summary statistics of home and away goals.
5.2 Bayesian estimation of the main parameters.
5.3 Final league table and the corresponding team parameter estimates.
5.4 Deviance Information Criterion.
5.5 Respective tests of model effects for the dependent home goals and dependent away goals models.
5.6 Summary statistics of TS variable.
5.7 Newly generated tests of model effects.
5.8 Parameter estimates for the model with home goals as dependent variable.
5.9 Parameter estimates for the model with away goals as dependent variable.
5.10 AICs and BICs for the generalised linear model.

To Mum and Dad

Chapter 1

Introduction

Association football (nowadays simply known as football) is probably one of the most popular sports in the world. It unites millions of fans across the globe, irrespective of age, ethnicity or nationality, and it combines skill, chance and intelligence in an interesting mix. Each football league nevertheless has its own distinguishing characteristics. The most common league format is the so-called double round-robin tournament, in which each team plays every other team twice (home and away) in order to compensate for the home advantage bias. The standard points scheme follows the 3-1-0 system, in which a win is worth 3 points, a draw 1 point and a loss none.
However, many football leagues (such as the Italian Serie A) originally awarded 2 points for a win, which encouraged team managers to settle for draws in away matches and focus on winning home games. The change of system placed additional value on wins (with respect to draws), and so stimulated more attacking play.
In recent years, football modelling and match prediction have attracted considerable and sustained interest. It does not take long to see that this has been motivated by the explosive growth of the sports betting industry. Figures published in the European Commission’s green paper on online gambling report that sports betting generated an annual revenue of around €1.97 billion in 2008 (accounting for 32% of the total €6.16 billion registered by the online gambling market). Developments in digital technology and the widespread availability of internet services have surely been the main catalysts of this growth. In the same year Malta had as many as 500 registered online gambling operators, making it the EU member state with the highest number of licences in this category.
Such gaming companies have truly revolutionised the betting market by introducing a wider range of exotic betting systems. Wagers were once simply placed on the probable match outcomes. Nowadays it is even possible to bet on bookings, shirt numbers or corner kicks, amongst others. This is in fact part of the so-called “Spread Betting” system, where it is possible to place bets while the football event is going on. Other common football betting systems, which are only mentioned here, are the “Asian Handicap” (also known as “Hang Cheng”), “Fixed-Odds Betting” and “Pari-mutuel Betting”.
Security, speed and pricing are undoubtedly the most significant attributes of a profitable internet sports book. The first two fall under the exclusive responsibility of technology providers, whereas the methodology for compiling prices has its roots in statistics.
Determining accurate probabilities for football match outcomes is, however, not an easy task. It is a very complex process that involves numerous factors, such as the playing conditions and the teams’ abilities. Moreover, a team’s ability is itself subject to fluctuation (for example due to injuries, transfers, suspensions, pressure levels or motivation). The “right” model should therefore be capable of estimating a very wide list of parameters.
This dissertation follows the Bayesian modelling rationale developed by Gianluca Baio and Marta A. Blangiardo. The model they proposed is fitted to a particular football season, and its performance is compared with that of a comparable generalised linear model. Various software packages are used in doing so: WinBUGS’s ability to update via MCMC sampling is exploited for the Bayesian modelling, whilst Matlab and SPSS are used to develop the generalised linear model fit.

1.1 Structure of Dissertation

Having reached the end of this introductory part, we now give a brief overview of the five chapters that follow.
Chapter 2 covers the literature related to the theory involved and the evolution of the BUGS software project. It then discusses several statistical football models that have been proposed over time, with special reference to those that employ the Bayesian approach. After the literature review we introduce the relevant theoretical underpinnings: Chapters 3 and 4 look at the fundamental concepts of Markov processes and of the Bayesian statistical approach. In particular, we delve into the Markov Chain Monte Carlo method, a sampling technique that has been widely adopted within the Bayesian world.
Chapter 5 is the core of this dissertation. It starts with a detailed description of the available data set and formulates the problem to be dealt with. The Bayesian mixture model of Baio and Blangiardo (2010) is estimated for the Serie A season 1994/1995 and compared with the performance of a generalised linear model. Finally, we conclude by summing up the key outcomes of this dissertation, with a last thought for the possible future of football analysis.

Chapter 2

Literature Review

2.1 Markov Processes

What are nowadays known as Markov chains first appeared in 1906, when the Russian mathematician Andrei Andreevich Markov introduced the notion of ‘chains’ in the paper “Extension of the law of large numbers to dependent quantities”. As speculated in Medhi (1982), he developed the idea whilst watching Pushkin’s (and Tchaikovsky’s) opera “Evgeni Onegin”. Indeed, he eventually published a chain study of the consonant-vowel alternations in the first 20,000 words of Pushkin’s work.
However, Markov’s interest in this field was in reality probably motivated by Nekrasov’s “abuse of mathematics” (as frequently alleged by Markov himself). The first paper to spark controversy was that of 1898, entitled “General properties of numerous independent events in connection with approximate calculation of functions of very large numbers”. The dispute continued four years later, when Nekrasov not only stated that “pairwise independence” would yield the Weak Law of Large Numbers (WLLN) according to Chebychev, but also erroneously declared that “independence is a necessary condition for the law of large numbers”.

In response, Markov aimed to extend Chebychev’s conclusions by applying the WLLN and the Central Limit Theorem to his own sequences of dependent random variables. After studying variables whose dependence diminishes with increasing mutual distance, he concluded his 1906 paper by claiming that the “independence of quantities does not constitute a necessary condition for the existence of the law of large numbers”. For Seneta (1996), this statement encapsulated all the motivation that Markov had for studying new schemes of chain dependence.
Subsequently, Bernstein introduced the phrase “Markov chain” for the first time. Clearly influenced by Markov’s work, he (among many others such as Romanovsky and Neyman) managed to take statistics to new levels. However, few know that Poincaré could have taken all the recognition instead. In fact, Medhi (1982) recalls how Poincaré had already come across the same sequences of random variables before, but did not delve into the subject as much as Markov did.
Among the many interesting ideas that have been closely associated with Markov chains, one finds the special case of random walks. Bernoulli and Laplace were among the first to consider urn models, and Galton later reused the theory to model the propagation of family names. Markov processes have also had a considerable impact in physics. With special reference to statistical mechanics, the systems encountered generally evolve according to rules that do not depend on time, whilst obeying the memoryless condition. The most eminent figures known to have worked with Markov processes within these areas are Einstein and Smoluchowski.
Another, relatively recent, Markov chain application concerns internet traffic and the navigation of websites. This is not surprising, since Markov processes provide the foundations of Queueing Theory. Remaining within the statistical field, Markov processes were also used as the basis for a very important class of Monte Carlo techniques (which later became known as Markov Chain Monte Carlo).

2.2 Markov Chain Monte Carlo (MCMC)

MCMC methods have their roots in the Metropolis algorithm, derived when Metropolis et al. (1953) attempted to compute complex integrals by expressing them as expectations with respect to a particular distribution and using samples to estimate those expectations. Thereafter, Hastings (1970) and Peskun (1973, 1981) extended the original algorithm to a more general case in order to overcome the curse of dimensionality (usually met by Monte Carlo methods).
Meanwhile, several earlier pioneers had been sowing the seeds of another important MCMC technique, the Gibbs sampler. In particular, Hammersley and Clifford had been developing an argument that recovers a joint distribution from its conditionals. As coined by Besag (1974, 1986), this result is nowadays known as the Hammersley-Clifford theorem. The real breakthrough of the Gibbs sampler, however, came with Geman and Geman (1984), who carried out a Bayesian study of Gibbs random fields (from which the name Gibbs sampler derives).
Subsequently, one of the most influential papers on MCMC theory was presented by Tierney. Tierney (1994) put forward all the assumptions required to analyse Markov chains and developed their properties. For instance, some of his most important results concerned the convergence of ergodic averages and central limit theorems. In addition, Liu et al. (1994, 1995) continued to enrich the MCMC literature by analysing the covariance structure of the Gibbs sampler and establishing the validity of Rao-Blackwellisation (which had previously been used by Gelfand and Smith (1990)).
Among the many other contributors to MCMC theory, prominent names include Gelman, Gilks, Roberts, Rosenthal and Tweedie. More importantly, all their works have made it possible to sample from more complicated probability distributions of interest. In this respect, Bayesian inference could benefit from a wider range of posterior distributions for simulation.

2.3 Bayesian Inference

Bayesian statistics owes its origin to the English mathematician and nonconformist Unitarian minister, Thomas Bayes. It all started after his death, when Richard Price (the executor of Bayes’ will) discovered an unpublished paper containing a detailed description of what is nowadays known as Bayes’ Theorem. After being drawn to the attention of the Royal Society of London, it was posthumously published in 1763 under the title “An Essay Towards Solving a Problem in the Doctrine of Chances”.
Bayes’ paper introduced the use of a uniform prior distribution on a binomial parameter, and dealt with the problem of predicting new observations. The first generalisation of the theorem was subsequently introduced by Laplace, who is also notable for approaching problems in celestial mechanics and medical statistics, amongst others. Laplace (1774) considered an elaborate version of the inference problem for the unknown binomial parameter. Unlike Bayes, he justified the choice of a uniform prior distribution by arguing that the parameter’s posterior distribution should be proportional to the likelihood of the data.
In spite of some rivalry and heated conflicts, the term “Bayesian” entered circulation thanks to one of the most significant contributors to classical statistical theory: it was in 1950 that Fisher introduced the adjective in the volume “Contributions to Mathematical Statistics”. Subsequently there were significant explorations by Savage, Jeffreys, Cox, Jaynes and Good, among others, and more recently considerable contributions by Berger and Bernardo.
Interestingly, Stigler (1982) questioned whether Bayes might have originally intended his results in a rather more limited way than they are used today. Nonetheless, the variety of research and successful applications that have integrated Bayesian statistics is nowadays enormous. It is enough to consider the radical changes brought about in fields such as astronomy, cosmology, artificial intelligence, biology and marketing, to mention just a few.
However, a dramatic boost of this kind was only possible with the rise of computing power and the consequent unleashing of MCMC’s potential. Geman and Geman (1984) and Pearl (1987) were among the very first to stimulate the software developments needed to reduce the computational problems. Among the numerous software packages that have been closely associated with Bayesian statistics, the so-called BUGS development project has arguably been one of the most successful.

Spiegelhalter et al. (1995) published a very important manual for the first version of this software (originally known as Bayesian inference Using Gibbs Sampling). The authors found their motivation in the as yet unexploited power of MCMC theory and, as the name itself indicates, the software was intended to perform Bayesian analysis via the Gibbs sampler. Thereafter, the project triggered a renewed interest in MCMC theory. In particular, Mengersen and Tweedie (1996) explored the speed of convergence of MCMC algorithms to the target distribution, and Roberts et al. (1997) set explicit targets for the acceptance rate of the Metropolis-Hastings algorithm.
Subsequently, Lunn et al. (2000) introduced a newer version of the software package, named WinBUGS, which retained most of its predecessor’s features. An interesting remark worth mentioning was made by J. A. Royle, who described WinBUGS as a real ‘MCMC blackbox’. Lunn et al. (2009) further provided a critical evaluation of the whole project, and shared some thoughts regarding the future of the open source version of WinBUGS proposed by Thomas et al. (2006).

2.4 Football Analysis

The first statistical analysis of football results dates back to 1956, when M. J. Moroney published the work entitled “Facts from Figures”. In this ‘primitive’ study, he suggested that a modified Poisson distribution (namely the Negative Binomial distribution) should provide an adequate fit to football results. The same Negative Binomial distribution was then reused by Reep and Benjamin (1968) to model ball passing among football team mates. Having introduced the “r-pass movement” model, in which a series of r successful passes precedes a shot on goal or an intercepted pass, they came up with the important statement that “chance does dominate the game”.
Hennessy’s opinion was far more controversial: in 1969 he stated that only chance was involved. Eventually, however, I. D. Hill published the paper “Association Football and Statistical Inference”, in which he expressed his dissent from such unconvincing arguments. With particular reference to Reep and Benjamin (1968), he argued that although good passing sequences are necessary, it is not right to base an entire model on them alone, since that “is not the aim of the game”. In addition, Hill (1974) compared Goal’s pre-season forecasts with the final league tables for the season 1971-72, and observed a significant positive correlation. With this in mind, he proposed that although “there is obviously a considerable element of chance”, a significant amount of skill should dominate the final outcome, implying that the situation is predictable to some extent.
In the meantime, several other attempts to model the qualities of league teams included maximum likelihood estimation by Thompson (1975) and linear model methodology as in Harville (1977). However, the first real model to predict football scores was put forward by Maher (1982). According to his model, the goals scored by two opposing teams in a particular match are drawn from independent Poisson distributions. Whilst introducing a home advantage factor, he assigned each team a pair of fixed parameters (α and β), so that the model simply combines the respective attacking and defensive parameters of the opposing teams.
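As a rough illustration of how an independent Poisson model of this kind turns team parameters into match outcome probabilities, the sketch below may help. The exact parameterisation used by Maher is not reproduced in this review, so the multiplicative form of the scoring rates, the function names and all numerical values are assumptions made purely for illustration.

```python
from math import exp, factorial

def poisson_pmf(k, rate):
    """Probability of exactly k goals when goals follow a Poisson(rate) law."""
    return exp(-rate) * rate ** k / factorial(k)

def outcome_probabilities(attack_home, defence_home, attack_away, defence_away,
                          home_advantage=1.0, max_goals=15):
    """Home win, draw and away win probabilities under independent Poisson scores.

    Assumed (hypothetical) form of the scoring rates:
      home rate = home_advantage * attack_home * defence_away
      away rate = attack_away * defence_home
    A larger defence value stands for a weaker defence.
    """
    rate_home = home_advantage * attack_home * defence_away
    rate_away = attack_away * defence_home
    home = draw = away = 0.0
    # Sum the joint probabilities over a truncated grid of scorelines.
    for x in range(max_goals + 1):
        for y in range(max_goals + 1):
            p = poisson_pmf(x, rate_home) * poisson_pmf(y, rate_away)
            if x > y:
                home += p
            elif x == y:
                draw += p
            else:
                away += p
    return home, draw, away

# Two identical hypothetical teams, with and without a home advantage factor.
neutral = outcome_probabilities(1.2, 1.0, 1.2, 1.0)
at_home = outcome_probabilities(1.2, 1.0, 1.2, 1.0, home_advantage=1.3)
print(neutral, at_home)
```

With identical teams and no home advantage, the home and away win probabilities coincide; a home advantage factor above one shifts probability towards the home win.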
Maher’s Poisson approach laid the basis for several other studies within the field. For instance, Lee (1997) relied on this model to simulate the English Premier League season 1995/1996 around 1000 times, and investigated whether Manchester United really deserved to emerge victorious. However, those who used the independent Poisson model frequently observed a (relatively low) correlation between the opponents’ goals. Subsequently, Dixon and Coles (1997) extended Maher’s model by introducing an indirect kind of dependence. Having foreseen the probable need to vary the parameters $\alpha_i$ and $\beta_i$ with time, they also tapered the likelihood function in order to assign greater weight to the more recent results.
Meanwhile, Griffiths and Milne (1978) had introduced the theoretical foundations for bivariate Poisson models. The idea of two goal variables following a bivariate Poisson distribution seemed to be a good alternative to Maher’s independent Poisson model. In fact, Karlis and Ntzoufras (2003) replaced the independence assumption with a bivariate Poisson model that included an additional covariance parameter for the respective goals. This left room for correlation between the scores, which is quite plausible for two competing teams. In addition, they also considered a diagonal inflation factor to improve the precision with which draws are estimated.
Recently, Karlis and Ntzoufras approached the problem in a completely different manner. In their paper “Bayesian modelling of football outcomes for the goal difference”, published in 2008, they focused on modelling the goal difference instead. Relying on real data from the English Premier League season 2006/2007, they built their reasoning on one of their own previous papers, entitled “Bayesian analysis of the differences of count data”. Whilst doing away with the score correlation and with the assumption of Poisson marginals, they made use of the Bayesian approach’s ability to incorporate any available prior knowledge.
Similarly, the work of Baio and Blangiardo (2010), which has very much inspired this dissertation, proposed a Bayesian hierarchical model that was tested on the Italian Serie A championship 1991/1992. From such recent works, one can note that it took quite a while before Bayesian concepts were integrated within this field. Some earlier works nevertheless include Rue and Salvesen (1997) and Knorr-Held (2000). The former applied a Bayesian dynamic generalised linear model and used MCMC to generate dependent samples from the posterior density, whilst the latter made use of recursive Bayesian estimation on the 1996/1997 German Bundesliga and investigated the possible time dependency of team strengths.

Chapter 3

Markov Chains & MCMC Algorithms

3.1 Introduction

This chapter starts off with Markov chains, and develops the framework for Markov Chain Monte Carlo sampling. Markov chains are best known for the Markov property, under which the future, conditional on the present, is independent of the past. MCMC, on the other hand, is the most widely used sampling technique in Bayesian statistics. With this in mind, a theoretical overview of both discrete and continuous state-space Markov chains is given, after which the simulation process is explained in detail.

3.2 Markov Chains

A sequence of discrete random variables $\{X_0, X_1, \ldots\}$ is recognised as a Markov chain if it evolves in accordance with the Markov property:

\[
\mathbb{P}[X_{n+1} = j \mid X_n = i_n, X_{n-1} = i_{n-1}, \ldots, X_1 = i_1] = \mathbb{P}[X_{n+1} = j \mid X_n = i_n]
\]

for all $n \in \mathbb{N}$ and $j, i_n, i_{n-1}, \ldots, i_1 \in \mathbb{N}$.

Figure 3.1 gives a graphical view of one particular example. It exhibits three states that are interconnected with one another, so that from any state the process can either move to one of the other two states or remain where it is.

Figure 3.1: Markov Model format

This mechanism is able to model a large variety of discrete processes. The probability with which the process moves from one state to another is, however, entirely determined by the so-called transition probabilities.

3.3 Transition Probabilities

The single-step evolution of states within a discrete-parameter stochastic process $(X_n)_{n \in \mathbb{N}}$ is generally described by:

\[
p_{ij} = \mathbb{P}[X_n = j \mid X_{n-1} = i],
\]

such that the whole scenario (of the previous 3-state Markov chain model) is then captured within a transition probability matrix:

\[
\Pi = \begin{pmatrix}
p_{11} & p_{12} & p_{13} \\
p_{21} & p_{22} & p_{23} \\
p_{31} & p_{32} & p_{33}
\end{pmatrix}.
\]

As a stochastic matrix, $\Pi$ should only contain non-negative probabilities ($p_{ij} \geq 0, \ \forall i, j$), whilst the sum of each row should be equal to unity ($\sum_j p_{ij} = 1, \ \forall i$). Furthermore, it can be noted that the first condition allows for a non-zero probability of remaining in the current state ($p_{ii} \geq 0$).
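To make the two conditions on a stochastic matrix concrete, the following minimal sketch checks them for a 3-state chain and simulates a short path. The matrix entries, the state labels 0, 1, 2 and the helper names are hypothetical, chosen only for illustration.

```python
import random

# Hypothetical 3-state transition matrix (values chosen only for illustration).
P = [
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
]

def is_stochastic(matrix, tol=1e-12):
    """True if every entry is non-negative and every row sums to one."""
    return all(
        all(p >= 0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in matrix
    )

def simulate_chain(matrix, start, n_steps, rng):
    """Simulate X_0, ..., X_n; each move depends only on the current state."""
    path = [start]
    states = range(len(matrix))
    for _ in range(n_steps):
        current = path[-1]
        # The next state is drawn from the row of the current state.
        path.append(rng.choices(states, weights=matrix[current])[0])
    return path

rng = random.Random(0)
path = simulate_chain(P, start=0, n_steps=10, rng=rng)
print(is_stochastic(P), path)
```

Drawing each new state from the row of the current state alone is exactly what the Markov property requires: the earlier part of the path plays no role.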

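Moreover, processes do not necessarily need to evolve in single steps. It might be the case that a process in state i finds itself in state j after n transitions. In this respect, the transition probability formula changes to:

\[
p_{ij}^{(n)} = \mathbb{P}[X_{m+n} = j \mid X_m = i]
\]

for all $i, j, m, n \geq 0$.

As a result, the associated transition probability matrix is now:

\[
\Pi_n = \begin{pmatrix}
p_{11}^{(n)} & p_{12}^{(n)} & p_{13}^{(n)} \\
p_{21}^{(n)} & p_{22}^{(n)} & p_{23}^{(n)} \\
p_{31}^{(n)} & p_{32}^{(n)} & p_{33}^{(n)}
\end{pmatrix}.
\]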

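In general, these n-step transition probabilities can be obtained by conditioning on the state at an intermediate stage, according to the Chapman-Kolmogorov equation:

\[
\begin{aligned}
p_{ij}^{(n+m)} &= \mathbb{P}[X_{n+m} = j \mid X_0 = i] \\
&= \sum_{k=0}^{\infty} \mathbb{P}[\{X_{n+m} = j\} \cap \{X_n = k\} \mid \{X_0 = i\}] \\
&= \sum_{k=0}^{\infty} \mathbb{P}[\{X_{n+m} = j\} \mid \{X_n = k\} \cap \{X_0 = i\}] \ \mathbb{P}[X_n = k \mid X_0 = i] \\
&= \sum_{k=0}^{\infty} \mathbb{P}[\{X_{n+m} = j\} \mid \{X_n = k\}] \ \mathbb{P}[X_n = k \mid X_0 = i] \\
&= \sum_{k=0}^{\infty} \mathbb{P}[\{X_m = j\} \mid \{X_0 = k\}] \ \mathbb{P}[X_n = k \mid X_0 = i] \\
&= \sum_{k=0}^{\infty} p_{ik}^{(n)} \, p_{kj}^{(m)}
\end{aligned}
\]

for all n, m ≥ 0 and any states i, j. The fourth line follows from the Markov property and the fifth from time-homogeneity.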
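For a finite state space, the Chapman-Kolmogorov equation says that the (n+m)-step transition matrix is the product of the n-step and m-step matrices. The sketch below checks this numerically for a hypothetical 3-state matrix; the entries and helper names are assumptions made for illustration only.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]

def mat_pow(matrix, n):
    """n-step transition matrix, obtained by multiplying the one-step matrix n times."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result

# Hypothetical one-step transition matrix.
P = [
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
]

n, m = 2, 3
lhs = mat_pow(P, n + m)                      # p_ij^(n+m)
rhs = mat_mul(mat_pow(P, n), mat_pow(P, m))  # sum over k of p_ik^(n) p_kj^(m)
max_gap = max(abs(lhs[i][j] - rhs[i][j]) for i in range(3) for j in range(3))
print(max_gap)
```

The two sides agree up to floating-point rounding, and each row of the multi-step matrix still sums to one.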

3.4 Properties

For a state j to be considered accessible from state i (written i → j), there must exist a non-zero probability that the chain, starting from state i, reaches state j in a finite number of transitions. That is, for some integer n ≥ 0,

\[
\mathbb{P}[X_n = j \mid X_0 = i] = p_{ij}^{(n)} > 0.
\]

If the same holds the other way round (i.e. j → i), the states are said to communicate (i ↔ j). A Markov chain in which all states communicate with one another is said to be irreducible.
The period k of a state i is defined by

\[
k = \gcd\{ n : \mathbb{P}[X_n = i \mid X_0 = i] > 0 \},
\]

where gcd stands for the greatest common divisor. If the returns to state i are irregular, k = 1 and the state is considered to be aperiodic. If instead k > 1, state i is periodic, and when k = ∞ the probability of a possible return is equal to zero.
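The definition of the period can be explored numerically for a finite chain by taking the greatest common divisor of the step counts at which a return has positive probability. The sketch below only inspects a finite number of steps, so it is an approximation of the definition; the function name, the cut-off `max_steps` and the example matrices are assumptions made for illustration.

```python
from math import gcd

def period(matrix, state, max_steps=50):
    """gcd of the step counts n <= max_steps with a positive n-step return probability.

    Returns None when no return is observed within max_steps
    (the case written as k = infinity in the text).
    """
    size = len(matrix)
    # Only the pattern of positive entries matters, so track it with booleans.
    one_step = [[matrix[i][j] > 0 for j in range(size)] for i in range(size)]
    n_step = [row[:] for row in one_step]
    k = 0
    for n in range(1, max_steps + 1):
        if n_step[state][state]:
            k = gcd(k, n)
        # Extend the reachability pattern by one more transition.
        n_step = [
            [any(n_step[i][m] and one_step[m][j] for m in range(size))
             for j in range(size)]
            for i in range(size)
        ]
    return k if k > 0 else None

cycle = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]   # deterministic cycle 0 -> 1 -> 2 -> 0
lazy = [[0.5, 0.5, 0], [0, 0.5, 0.5], [0.5, 0, 0.5]]  # same cycle, but the chain may stay put
print(period(cycle, 0), period(lazy, 0))
```

In the deterministic cycle a return is only possible every three steps, whereas allowing the chain to stay put makes returns possible at any step and the state aperiodic.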

Given a state i, the probability of a first return in n steps is defined by:

\[
f_{ii}^{(n)} = \mathbb{P}[X_n = i, \ X_{n-k} \neq i \mid X_0 = i],
\]

where 0 < k < n. If $f_{ii} = \sum_{n=1}^{\infty} f_{ii}^{(n)} = 1$, state i is said to be recurrent. Further, if in addition $p_{ij} = 0$ for i ≠ j, state i is impossible to leave and is referred to as absorbing. An alternative but equivalent condition for the recurrence of state i is $\sum_{n=1}^{\infty} p_{ii}^{(n)} = \infty$.

Furthermore, it is of great interest to model the expected return time. Since $f_{ii}^{(n)}$ has been defined as the probability of a first return in n steps, the mean number of steps in which such a return occurs is given by:

\[
\mu_i = \sum_{n=1}^{\infty} n \, f_{ii}^{(n)}.
\]

For slow rates of return, $\mu_i$ is generally infinite, and the state is referred to as null recurrent. If the rate is instead such that $\mu_i$ is finite, the state is called positive recurrent. On the other hand, non-recurrent states are defined as transient and have $f_{ii} < 1$. Thus it is possible that state i will not be revisited in the future (and so $\mu_i$ is clearly infinite).

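The mean return time is important in stating three further results. Given an aperiodic, irreducible and recurrent Markov chain, and writing $\mu_i$ for the mean return time of state i:

1) $\lim_{n \to \infty} p_{ii}^{(n)} = \frac{1}{\mu_i}$ for positive recurrent states i

2) $\lim_{n \to \infty} p_{ii}^{(n)} = 0$ for null recurrent states i

3) $\lim_{n \to \infty} p_{ji}^{(n)} = \lim_{n \to \infty} p_{ii}^{(n)}$ for all states j.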
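The first of these results can be checked numerically on a small chain. The sketch below computes the first-return probabilities of a state, sums them into a (truncated) mean return time, and compares its reciprocal with the long-run return probability. The 3-state matrix, the truncation at 500 steps and the helper name are assumptions made for illustration.

```python
# Hypothetical 3-state transition matrix (irreducible and aperiodic).
P = [
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
]

def first_return_probs(matrix, state, n_max):
    """Probabilities that the first return to `state` occurs at step n = 1..n_max."""
    size = len(matrix)
    others = [k for k in range(size) if k != state]
    probs = [matrix[state][state]]                 # n = 1: immediate return
    # Probability mass that has left `state` and has not yet come back.
    away = {k: matrix[state][k] for k in others}
    for _ in range(2, n_max + 1):
        probs.append(sum(away[k] * matrix[k][state] for k in others))
        away = {j: sum(away[k] * matrix[k][j] for k in others) for j in others}
    return probs

f = first_return_probs(P, 0, 500)
mean_return = sum(n * f_n for n, f_n in enumerate(f, start=1))

# Long-run probability of being in state 0 when starting from state 0.
dist = [1.0, 0.0, 0.0]
for _ in range(500):
    dist = [sum(dist[i] * P[i][j] for i in range(3)) for j in range(3)]
limit_p00 = dist[0]

print(sum(f), mean_return, 1.0 / limit_p00)
```

For this particular matrix the first-return probabilities sum to one (the state is recurrent), and the mean return time of about 4.3 steps matches the reciprocal of the limiting return probability.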

Theorem: Any irreducible, aperiodic and positive recurrent Markov chain has a unique stationary distribution π. As a result, for all states i and j:

\[
\lim_{n \to \infty} p_{ij}^{(n)} = \pi_j = \sum_i \pi_i \, p_{ij},
\]

where $\sum_j \pi_j = 1$ and $p_{ij}$ represent the transition probabilities.
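A simple way to see the theorem at work on a finite chain is to apply the transition matrix repeatedly to different starting distributions and observe that they all settle on the same vector, which is left unchanged by one further step. The matrix, the number of iterations and the helper names below are assumptions made for illustration; this is a sketch, not the procedure used later in the dissertation.

```python
def step(dist, matrix):
    """Distribution after one transition: the row vector dist times the matrix."""
    size = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(size)) for j in range(size)]

def long_run(start, matrix, n_iter=500):
    """Distribution of the chain after n_iter transitions from `start`."""
    dist = list(start)
    for _ in range(n_iter):
        dist = step(dist, matrix)
    return dist

# Hypothetical irreducible, aperiodic 3-state transition matrix.
P = [
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
]

pi_from_0 = long_run([1.0, 0.0, 0.0], P)   # chain started in state 0
pi_from_2 = long_run([0.0, 0.0, 1.0], P)   # chain started in state 2
pi_next = step(pi_from_0, P)               # one more step should change nothing
print(pi_from_0, pi_from_2)
```

Both starting points lead to the same limiting vector, which sums to one and satisfies the stationarity equation up to rounding.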

Stationarity of Markov chains is a fundamental property for MCMC sampling, since it is the basis for replicating a target distribution. However, in most problems of interest (such as the estimation of football results), the distribution π is absolutely continuous. As a result, MCMC theory is instead based upon discrete-time Markov chains with continuous state spaces, and so the properties mentioned earlier have to be revisited from this perspective.

3.5 General state-space Markov Chains

In this respect, a time-homogeneous Markov chain $\{X_n ; n \geq 0\}$ on a continuous state-space E is defined by:

\[
\mathbb{P}[X_{n+1} \in A \mid X_n = x_n, X_{n-1} = x_{n-1}, \ldots, X_1 = x_1] = \mathbb{P}[X_{n+1} \in A \mid X_n = x_n]
\]

for all $n \in \mathbb{N}$, $A \subset E$ and $x_1, \ldots, x_n \in E$. So, unlike before, the chain is no longer restricted to a countable number of possible values; it can now take any value over some continuous interval.
Furthermore, given an initial distribution, the evolution of the process is governed by a transition kernel P, defined as:

\[
P(x, A) = \mathbb{P}[X_{n+1} \in A \mid X_n = x]
\]

for all measurable sets A and x ∈ E. So if ν represents the initial probability distribution on E and P is the transition kernel, νP gives the distribution of the Markov chain after exactly one step:

\[
\nu P(A) = \int P(x, A) \, \nu(dx).
\]


On the other hand, if h is a real valued function on E, one can define two more quantities:

$$Ph(x) = \int h(y)\, P(x, dy) \quad \text{and} \quad vh = \int h(x)\, v(dx).$$

In this context, one should also consider first returns to some set A ⊂ E. Denoted by $\tau_A$, these are defined as:

$$\tau_A = \inf\{n \geq 1 : X_n \in A\},$$

and by convention, a chain will not return to A if $\tau_A = \infty$.

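Another strong condition for the general state-space Markov chains is the detailed balance property. It implies that around any closed cycle of states, no net flow of probability takes place. Moreover a Markov process that satisfies the detailed balance condition is said to be reversible.

Theorem: A Markov chain $\{X_n\}$ on a state-space E is reversible with respect to a probability distribution π on E, if and only if $P(x, dy)$ satisfies the following relation:

$$\pi(dx) P(x, dy) = \pi(dy) P(y, dx)$$

for all x, y ∈ E.

Proof: According to the Bayes rule for conditional probability, we have:

$$\mathbb{P}(X_n \in A \mid X_{n+1} \in B) = \frac{\mathbb{P}(X_{n+1} \in B \mid X_n \in A)\,\mathbb{P}(X_n \in A)}{\mathbb{P}(X_{n+1} \in B)}.$$

However, if we assume that the Markov chain is reversible, $\mathbb{P}(X_n \in A \mid X_{n+1} \in B) = \mathbb{P}(X_{n+1} \in A \mid X_n \in B)$ and thus,

$$\mathbb{P}(X_{n+1} \in A \mid X_n \in B) = \frac{\mathbb{P}(X_{n+1} \in B \mid X_n \in A)\,\mathbb{P}(X_n \in A)}{\mathbb{P}(X_{n+1} \in B)}$$

such that, using $\mathbb{P}(X_{n+1} \in B) = \mathbb{P}(X_n \in B)$ for a chain started from π,

$$\Rightarrow \mathbb{P}(X_{n+1} \in A \mid X_n \in B)\,\mathbb{P}(X_n \in B) = \mathbb{P}(X_{n+1} \in B \mid X_n \in A)\,\mathbb{P}(X_n \in A) \quad \text{for all } A, B \subset E$$

$$\Rightarrow \int_{y \in B} \int_{x \in A} \pi(dy) P(y, dx) = \int_{x \in A} \int_{y \in B} \pi(dx) P(x, dy) \quad \text{for all } A, B \subset E$$

$$\Rightarrow \pi(dx) P(x, dy) = \pi(dy) P(y, dx) \quad \text{for all } x, y \in E.$$

□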

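Theorem: If a Markov chain is reversible with respect to a probability measure π, then π is a stationary probability measure for the chain, such that:

$$\pi(dy) = \int \pi(dx) P(x, dy). \qquad (3.5.1)$$

Proof:

$$\int \pi(dx) P(x, dy) = \int \pi(dy) P(y, dx) = \pi(dy) \int P(y, dx) = \pi(dy).$$

□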
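The link between detailed balance and stationarity can be checked on a small discrete example. The sketch below assumes a hypothetical 3-state target `pi` and builds a transition matrix with a Metropolis-type rule (uniform proposal among the other states); both the target and the construction are illustrative choices rather than part of the theory above.

```python
# Hypothetical target distribution on three states (illustrative values).
pi = [0.2, 0.3, 0.5]

def metropolis_matrix(pi):
    """Build a transition matrix that satisfies detailed balance w.r.t. pi."""
    n = len(pi)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # Propose j uniformly among the other states,
                # accept with probability min(1, pi_j / pi_i).
                P[i][j] = (1.0 / (n - 1)) * min(1.0, pi[j] / pi[i])
        P[i][i] = 1.0 - sum(P[i])  # remaining mass: stay in state i
    return P

P = metropolis_matrix(pi)
n = len(pi)

# Largest violation of pi_i * p_ij = pi_j * p_ji.
balance_gap = max(abs(pi[i] * P[i][j] - pi[j] * P[j][i])
                  for i in range(n) for j in range(n))

# Largest violation of the stationarity equation pi = pi * P.
stationarity_gap = max(abs(sum(pi[i] * P[i][j] for i in range(n)) - pi[j])
                       for j in range(n))
print(balance_gap, stationarity_gap)
```

Both gaps are expected to vanish up to floating point error, mirroring the proof of result (3.5.1).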

Then assuming that a Markov chain has transition kernel P and stationary distribution π, the sample path averages $\bar{f}_n$ should converge to the corresponding expectations πf for any initial distribution. And for this to be possible the chain must firstly be irreducible, such that all interesting sets on the state-space can be reached.

Definition: A Markov chain is π-irreducible for a probability distribution π on E if π(A) > 0 for a set A ⊂ E implies that,

$$\mathbb{P}_x\{\tau_A < \infty\} > 0$$

for all x ∈ E, where $\mathbb{P}_x$ represents the probability law of a Markov chain that starts with $X_0 = x$.

Consequently it can be said that if a Markov chain is π-irreducible with respect to some distribution π, the chain is also irreducible, and π can be considered as an irreducibility distribution for it. In addition, the possibility of repeatedly reaching the same sets in the long run is represented by the recurrence property.


Definition: A π-irreducible Markov chain is recurrent if any set A ⊂ E with π(A) > 0 satisfies:

$$\mathbb{P}_x\{X_n \in A \text{ infinitely often}\} > 0 \quad \text{for all } x,$$

$$\mathbb{P}_x\{X_n \in A \text{ infinitely often}\} = 1 \quad \text{for } \pi\text{-almost all } x.$$

It is thus evident that this recurrence differs from that found in the discrete case. Here it is in fact regarded as a property of an entire irreducible chain, and is no longer defined for individual states. Moreover if an irreducible recurrent chain has a stationary probability distribution, it is said to be positive recurrent. On the other hand, if the second condition of the previous definition is strengthened so that:

$$\mathbb{P}_x\{X_n \in A \text{ infinitely often}\} = 1$$

for all x ∈ E, the chain is said to be Harris recurrent.

3.6 Convergence

At this point, the ultimate goal (which plays a very important role for the next section) is to actually prove that the transition kernels truly converge to some stationary distribution. In fact, by running a sufficiently long chain, the total variation distance between the n-step kernel and the stationary distribution will eventually tend to 0.
Theorem: Assuming an aperiodic, π-irreducible Markov chain with transition kernel P and stationary distribution π, then,

$$\lim_{n \to \infty} \| P^n(x, \cdot) - \pi \| = 0$$

for π-almost all x. If the transition kernel is positive Harris recurrent, this convergence can be further extended to all x.
Proof: Suppose that the total variation distance can be represented as,

$$\| P^n(x, \cdot) - \pi \| = \int | P^n(x, y) - \pi(y) |\, dy.$$


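Also let,

$$P^n(x, \alpha) = \sum_{j=0}^{n} \mathbb{P}_x(\tau_\alpha = j)\, P^{n-j}(\alpha, \alpha) \quad \text{and} \quad P^n(\alpha, y) = \sum_{j=0}^{n} P^{j}(\alpha, \alpha)\, {}_\alpha P^{n-j}(\alpha, y)$$

be the first entrance and last exit decompositions, where ${}_\alpha P^{n}(x, y)$ denotes the n-step transition from x to y that avoids α in between. Then if we denote,

$$a_x(n) = \mathbb{P}_x(\tau_\alpha = n), \qquad u(n) = P^{n}(\alpha, \alpha), \qquad t_y(n) = {}_\alpha P^{n}(\alpha, y),$$

where α is a fixed reference state, and apply the convolution notation $a * b\,(n) = \sum_{j=0}^{n} a(j)\, b(n - j)$:

$$P^n(x, \alpha) = \sum_{j=0}^{n} a_x(j)\, u(n - j) = a_x * u\,(n)$$

and

$$P^n(\alpha, y) = \sum_{j=0}^{n} u(j)\, t_y(n - j) = u * t_y\,(n).$$

Thus combining these two equations would give:

$$P^n(x, y) = {}_\alpha P^{n}(x, y) + a_x * u * t_y\,(n),$$

such that the total variation distance is then bounded by three terms:

$$\| P^n(x, \cdot) - \pi \| \leq \int {}_\alpha P^{n}(x, y)\, dy + \int | a_x * u - \pi(\alpha) | * t_y\,(n)\, dy + \int \pi(\alpha) \sum_{j=n+1}^{\infty} t_y(j)\, dy$$

for any x, y, α ∈ E.

However for the first term,

$$\int {}_\alpha P^{n}(x, y)\, dy = \mathbb{P}_x(\tau_\alpha \geq n) \to 0$$

for all x from Harris recurrence.
Meanwhile from result (3.5.1) which proves the existence of an invariant measure,

$$\pi(y) = \pi(\alpha) \sum_{j=1}^{\infty} t_y(j),$$

and if we integrate both sides,

$$\int \pi(y)\, dy = \pi(\alpha) \int \sum_{j=1}^{\infty} t_y(j)\, dy < \infty.$$

Thus the third term,

$$\int \pi(\alpha) \sum_{j=n+1}^{\infty} t_y(j)\, dy$$

should also tend to 0 as n goes to infinity.

Furthermore, using the finiteness of $\pi(\alpha) \sum_{j=1}^{\infty} \int t_y(j)\, dy$ and assuming that α is ergodic, such that,

$$\pi(\alpha) = \lim_{n \to \infty} a_x * u\,(n),$$

the middle term tends to zero as well.

Hence since all the bounds converge to 0, we can confirm that,

$$\lim_{n \to \infty} \| P^n(x, \cdot) - \pi \| = 0$$

for all x.

□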
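The convergence result can be visualised on a finite state space, where the integral in the total variation distance reduces to a sum. The sketch below assumes a hypothetical 3-state transition matrix; the reference distribution `pi` is itself approximated by a long run of the chain, which is an illustrative shortcut rather than part of the proof.

```python
# Hypothetical 3-state transition matrix (illustrative values only).
P = [
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
]

def step(dist, P):
    """Return the row vector dist * P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def tv_distances(P, start_state, n_max, n_ref=500):
    """Distances sum_y |P^n(x, y) - pi(y)| for n = 1, ..., n_max."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(n_ref):          # approximate the stationary distribution
        pi = step(pi, P)
    dist = [1.0 if k == start_state else 0.0 for k in range(n)]
    distances = []
    for _ in range(n_max):
        dist = step(dist, P)
        distances.append(sum(abs(d - p) for d, p in zip(dist, pi)))
    return distances

distances = tv_distances(P, start_state=0, n_max=20)
for n_steps, d in enumerate(distances, start=1):
    print(n_steps, d)
```

The printed distances are expected to shrink towards 0 as the number of steps grows, whichever starting state is chosen.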

Finally, all these desired properties open doors for the leading method of MCMC sampling.
With the intention of exploring posterior distributions of interest, this can be essentially described as Monte Carlo integration by using Markov chains.

3.7 MCMC

The Markov Chain Monte Carlo technique constructs a Markov chain of the type discussed above, such that it approximates some target distribution π by its stationary distribution.


The general idea behind the process is to compile random samples from some target distribution by using the theory of random walks.
One typical approach involves conditioning. Assuming that X has a distribution π, and that Y = f(X) for some function f defined on E, we consider,

$$Q(y, A) = \mathbb{P}\{X \in A \mid Y = y\},$$

such that P(x, A) = Q(f(x), A) represents a transition kernel with stationary distribution π.
Moreover P is generally not irreducible, but if one constructs a series of conditioning kernels $P_1, P_2, \ldots, P_m$ for a list of several functions, one can derive another kernel $P = P_1 P_2 \cdots P_m$ with stationary distribution π that is also irreducible. For instance, the Gibbs sampling algorithm (which is highly integrated within the key computational software packages available) relies upon the functions,

$$f_i(x) = f_i(x_1, \ldots, x_m) = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_m)$$

for i = 1, ..., m and $x = (x_1, \ldots, x_m) \in E$, where E denotes a subset of a product space.

The kernel is used to sample from the conditional distribution $X \mid Y = f(X_n)$ and produce the next state $X_{n+1}$. Moreover, since each conditional distribution is derived from π, the kernel also has stationary distribution π. In case it is not irreducible, a series of kernels can be used as before. This strategy is employed in Gibbs sampling, which is however a special case of the original Metropolis algorithm as generalised by Hastings.
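The conditioning strategy can be sketched for a simple two-component case. The example below assumes a hypothetical target, a standard bivariate normal distribution with correlation `rho`, for which each full conditional is known to be normal with mean `rho` times the other coordinate and variance `1 - rho**2`. The function name, seed and settings are illustrative assumptions.

```python
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=1):
    """Gibbs sampler: alternately draw each coordinate given the other one."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    cond_sd = (1.0 - rho * rho) ** 0.5
    draws = []
    for t in range(n_samples + burn_in):
        x = rng.gauss(rho * y, cond_sd)  # X | Y = y ~ N(rho * y, 1 - rho^2)
        y = rng.gauss(rho * x, cond_sd)  # Y | X = x ~ N(rho * x, 1 - rho^2)
        if t >= burn_in:
            draws.append((x, y))
    return draws

draws = gibbs_bivariate_normal(rho=0.6, n_samples=20000)
n = len(draws)
mean_x = sum(x for x, _ in draws) / n
mean_y = sum(y for _, y in draws) / n
var_x = sum((x - mean_x) ** 2 for x, _ in draws) / n
var_y = sum((y - mean_y) ** 2 for _, y in draws) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in draws) / n
corr = cov_xy / (var_x * var_y) ** 0.5
print(mean_x, mean_y, corr)
```

Each sweep applies the two conditioning kernels in turn, so the product kernel leaves the joint target invariant; the sample correlation should settle close to the value of `rho` supplied.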

3.8 The Metropolis-Hastings Algorithm

Given a target distribution π that has a density (also denoted π) with respect to a measure μ, the algorithm starts by defining a proposal Markov transition kernel,

$$Q(x, dy) = q(x, y)\, \mu(dy).$$

At each state $X_n = x$, a proposal Y is generated from $Q(x, \cdot)$ for the next state $X_{n+1}$. Then the relation between the current and the proposed state is evaluated according to the acceptance probability,

$$\alpha(x, y) = \begin{cases} \min\left\{1, \dfrac{\pi(y) q(y, x)}{\pi(x) q(x, y)}\right\} & \text{if } \pi(x) q(x, y) > 0, \\ 1 & \text{if } \pi(x) q(x, y) = 0. \end{cases}$$

Unless it is rejected, the candidate point Y becomes the new state. Otherwise the chain remains at the same state. So,

$$X_{n+1} = \begin{cases} Y & \text{with probability } \alpha(x, Y), \\ x & \text{with probability } 1 - \alpha(x, Y). \end{cases}$$

Furthermore, this algorithm can be proven to produce a Markov chain $\{X_n\}$ which is reversible with respect to the stationary distribution π(·). In fact, with P denoting the resulting transition kernel, it satisfies the detailed balance (reversibility) condition:

$$\pi(dx) P(x, dy) = \pi(dy) P(y, dx).$$

Proof: Assuming that x ≠ y,

$$\begin{aligned} \pi(dx) P(x, dy) &= [\pi(x) \mu(dx)][q(x, y) \alpha(x, y) \mu(dy)] \\ &= \pi(x) q(x, y) \min\left\{1, \frac{\pi(y) q(y, x)}{\pi(x) q(x, y)}\right\} \mu(dx) \mu(dy) \\ &= \min\{\pi(x) q(x, y), \pi(y) q(y, x)\}\, \mu(dx) \mu(dy) \\ &= \pi(dy) P(y, dx). \end{aligned}$$

□
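The steps of the algorithm can be sketched in a few lines. The example below is a minimal random-walk version under stated assumptions: the target is a hypothetical unnormalised density θ^s exp(−nθ) on θ > 0 (with illustrative values s = 12 and n = 8), and the proposal is a symmetric normal random walk, so that q(x, y) = q(y, x) cancels in the acceptance probability.

```python
import math
import random

def log_target(theta, s=12, n=8):
    """Log of the unnormalised target theta^s * exp(-n * theta) on theta > 0."""
    if theta <= 0.0:
        return float("-inf")  # zero density outside the support
    return s * math.log(theta) - n * theta

def metropolis_hastings(log_target, x0, n_iter, step=0.5, seed=2):
    """Random-walk Metropolis-Hastings with a symmetric normal proposal."""
    rng = random.Random(seed)
    x = x0
    chain = []
    n_accepted = 0
    for _ in range(n_iter):
        y = rng.gauss(x, step)  # proposal Y drawn from Q(x, .)
        # Log acceptance probability; the proposal densities cancel.
        log_alpha = min(0.0, log_target(y) - log_target(x))
        if rng.random() < math.exp(log_alpha):
            x = y               # accept the candidate
            n_accepted += 1
        chain.append(x)         # on rejection the chain stays at x
    return chain, n_accepted / n_iter

chain, acceptance_rate = metropolis_hastings(log_target, x0=1.0, n_iter=20000)
kept = chain[1000:]             # discard an initial burn-in period
posterior_mean = sum(kept) / len(kept)
print(posterior_mean, acceptance_rate)
```

For this particular target the normalised density is a gamma distribution with mean (s + 1)/n = 1.625, which gives a reference value against which the chain average can be compared.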

One can find a variety of different MCMC algorithms which stem from the Metropolis-Hastings algorithm. However the fundamental characteristics of the original version are evidently retained. Differences exist in the structure of the acceptance probability, which sometimes makes the process highly comparable to the importance sampling technique. Moreover, irrespective of the algorithm employed, the ultimate goal of any software implementation remains that the transition kernels converge to the stationary distribution.

Chapter 4

The Bayesian Approach

4.1 General Framework

Bayesian inference is a branch of statistical inference, where the posterior probability of some statistic is given rather than a decision as to the significance or otherwise of some parameter. This is done in accordance with two distinct sources of information: the prior beliefs and the sample data. In this respect the approach uses the rules of probability to fit distributions over every parameter and unobserved quantity of interest. As a result, a model parameter is treated as a random variable rather than as an unknown constant.
With this in mind, a family of distributions that best models the situation under study is firstly selected. Then, the prior beliefs are expressed in terms of a probability density function $f(\theta)$, which represents the possible values within the random distribution parameter Θ. Finally this is all modified with respect to the sample data Y at hand, which is assumed to be interchangeable.

Definition: A sequence of random variables $\{Y_1, \ldots, Y_n\}$ is interchangeable if the joint density function remains the same under all permutations of the indices, such that:

$$f(y_1, \ldots, y_n) = f(y_{i_1}, \ldots, y_{i_n}),$$

whenever $i_1, \ldots, i_n$ represent a permutation of $1, \ldots, n$.


Chapter 4: Bayesian Inference

However the main issue regards the way in which the parameters' prior beliefs are related to the observed evidence. In fact, this is achieved by deriving the conditional density of Θ given some sample data Y = y. Known as the posterior distribution function $f(\theta \mid y)$, this summarises the current state of knowledge about all observable or unobservable parameters of interest. And as the name 'Bayesian' itself indicates, the calculations build upon Bayes' theorem. Figure 4.1 summarises the whole process.

Figure 4.1: The Bayesian synthesis

4.2 Bayes' Rule

Assuming that there exists a space Ω where the Y's and Θ can be defined jointly, we can express the joint probability mass or density function as:

$$f(y_1, \ldots, y_n, \theta) = f(y_1, \ldots, y_n \mid \theta) f(\theta), \quad \text{or} \quad f(y_1, \ldots, y_n, \theta) = f(\theta \mid y_1, \ldots, y_n) f(y_1, \ldots, y_n).$$

Then, combining these equations together will produce the posterior probability density function of the distribution parameters given the sample data Y:

$$f(\theta \mid y_1, \ldots, y_n) = \frac{f(y_1, \ldots, y_n, \theta)}{f(y_1, \ldots, y_n)} = \frac{f(y_1, \ldots, y_n \mid \theta) f(\theta)}{f(y_1, \ldots, y_n)}, \qquad (4.2.1)$$

where $f(y_1, \ldots, y_n)$ represents the total summation (or integration, depending on the nature of the parameters) of $f(y_1, \ldots, y_n \mid \theta) f(\theta)$ over all values of θ available.

In fact $f(y_1, \ldots, y_n)$ is actually a marginal function of $(y_1, \ldots, y_n)$ only, since θ is integrated out. Herein result (4.2.1) can be simplified into:

$$f(\theta \mid y_1, \ldots, y_n) = f(y_1, \ldots, y_n \mid \theta) f(\theta) \times c, \qquad (4.2.2)$$

where c is a constant of proportionality. However this result can be rewritten further once likelihood functions are introduced.

4.2.1 Likelihood Functions
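The likelihood is a function that represents all information within the observed sample $(y_1, \ldots, y_n)$ about the parameter values θ of our statistical model. So given that the random variables $\{Y_1, \ldots, Y_n\}$ are independent and identically distributed,

$$L(\theta ; y_1, \ldots, y_n) = f(y_1 \mid \theta) \cdots f(y_n \mid \theta) = \prod_{i=1}^{n} f(y_i \mid \theta).$$

For instance if we assume that goals follow a Poisson distribution, the probability function of a single observation would be:

$$f(y_i \mid \theta) = \frac{\theta^{y_i} e^{-\theta}}{y_i!},$$

and so the likelihood for the whole sample becomes:

$$L(\theta ; y_1, \ldots, y_n) = \prod_{i=1}^{n} \frac{\theta^{y_i} e^{-\theta}}{y_i!} = \theta^{\sum y_i} e^{-n\theta} \prod_{i=1}^{n} \frac{1}{y_i!} \propto \theta^{\sum y_i} e^{-n\theta}, \qquad (4.2.3)$$

where $\prod_{i=1}^{n} \frac{1}{y_i!}$ is treated as a constant of proportionality due to being independent of θ.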
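Result (4.2.3) can be checked numerically. The sketch below assumes a small hypothetical sample of goal counts (not taken from the data set of this study) and verifies that the full Poisson log-likelihood and the log of the kernel θ^Σy exp(−nθ) differ only by a constant that does not depend on θ.

```python
import math

def poisson_log_likelihood(theta, ys):
    """Full Poisson log-likelihood, including the factorial terms."""
    return sum(y * math.log(theta) - theta - math.lgamma(y + 1) for y in ys)

def log_kernel(theta, ys):
    """Log of theta^(sum y) * exp(-n * theta), the kernel in (4.2.3)."""
    return sum(ys) * math.log(theta) - len(ys) * theta

goals = [0, 1, 1, 2, 3, 0, 2, 1]           # hypothetical goal counts
grid = [0.05 * k for k in range(1, 101)]   # candidate values of theta

# The difference between the two functions should not depend on theta.
offsets = [poisson_log_likelihood(t, goals) - log_kernel(t, goals) for t in grid]

# Both functions are therefore maximised at the same value of theta.
best_theta = max(grid, key=lambda t: poisson_log_likelihood(t, goals))
print(best_theta, sum(goals) / len(goals))
```

The maximising grid value coincides with the sample mean of the counts, which is the familiar maximum likelihood estimate of a Poisson rate.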


Consequently the next subsection will deal with the appropriate changes to the posterior probability density function that was obtained as result (4.2.2).

4.2.2 Posterior Distribution
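Substituting in the likelihood function,

$$f(\theta \mid y_1, \ldots, y_n) = L(\theta ; y_1, \ldots, y_n) f(\theta) \times c.$$

And as a result, we may finally re-write the posterior as proportional to the product of the likelihood and prior:

$$f(\theta \mid y_1, \ldots, y_n) \propto L(\theta ; y_1, \ldots, y_n) f(\theta).$$

Herein in case we are lacking prior information about goals, we would like to model our prior beliefs by a normal distribution. Hence we would have:

$$f(\theta ; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(\theta - \mu)^2}{2\sigma^2}\right\},$$

and on combination with result (4.2.3), the posterior density function that follows would satisfy:

$$\begin{aligned} f(\theta \mid y_1, \ldots, y_n) &\propto \theta^{\sum y_i} e^{-n\theta} \exp\left\{-\frac{(\theta - \mu)^2}{2\sigma^2}\right\} \\ &\propto \theta^{\sum y_i} e^{-n\theta} \exp\left\{-\frac{(\theta^2 - 2\mu\theta + \mu^2)}{2\sigma^2}\right\} \\ &\propto \theta^{\sum y_i} \exp\left\{-n\theta - \frac{(\theta^2 - 2\mu\theta)}{2\sigma^2}\right\} \qquad (4.2.4) \end{aligned}$$

where any term that is independent of θ is considered as a constant of proportionality.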
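Since (4.2.4) is only known up to a constant, it can be normalised numerically. The following is a minimal sketch, assuming hypothetical values for the total goals `s`, the sample size `n` and the prior parameters `mu` and `sigma2`; the grid range and resolution are also illustrative choices.

```python
import math

def log_posterior_kernel(theta, s, n, mu, sigma2):
    """Log of (4.2.4): theta^s * exp(-n*theta - (theta^2 - 2*mu*theta) / (2*sigma2))."""
    return (s * math.log(theta) - n * theta
            - (theta ** 2 - 2.0 * mu * theta) / (2.0 * sigma2))

def posterior_on_grid(s, n, mu, sigma2, upper=6.0, n_points=6000):
    """Evaluate the kernel at grid midpoints and rescale it to integrate to 1."""
    step = upper / n_points
    grid = [step * (k + 0.5) for k in range(n_points)]   # midpoints, all > 0
    logs = [log_posterior_kernel(t, s, n, mu, sigma2) for t in grid]
    peak = max(logs)
    weights = [math.exp(v - peak) for v in logs]         # avoids underflow
    total = sum(weights) * step                          # numerical integral
    density = [w / total for w in weights]
    return grid, density, step

# Hypothetical inputs: 10 goals in 8 matches, prior N(1, 4).
grid, density, step = posterior_on_grid(s=10, n=8, mu=1.0, sigma2=4.0)
posterior_mean = sum(t * d for t, d in zip(grid, density)) * step
print(posterior_mean)
```

The rescaling by `total` plays the role of the constant c in (4.2.2); the upper limit of the grid is assumed to be large enough for the neglected tail mass to be negligible.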

However in this example, the likelihood of this posterior density function depends upon the full sample data. Instead, it is sometimes possible to work with an appropriate function of the same observations in question.


4.2.3 Sufficient Statistics

In view of some sample data $(y_1, \ldots, y_n)$, a statistic $T(y_1, \ldots, y_n) = t$ is essentially sufficient in the sense that it contains the same amount of relevant information about some unknown parameter of interest Θ as the full sample, such that:

Definition: The conditional probability distribution of the sample data $(y_1, \ldots, y_n)$ given t is independent of Θ, and so:

$$\mathbb{P}[Y = y \mid T(Y) = t ; \theta] = \mathbb{P}[Y = y \mid T(Y) = t].$$

And in this respect, a particular theorem by the name of Neyman's factorisation is frequently consulted because of its ability to isolate the dependence on θ to one multiplicand function.

Theorem: $T = T(Y_1, \ldots, Y_n)$ is considered as a sufficient statistic for some parameter θ if and only if there are two functions g and h such that,

$$f(y_1, \ldots, y_n \mid \theta) = g(T(y_1, \ldots, y_n), \theta)\, h(y_1, \ldots, y_n),$$

where h does not depend on θ.

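Proof: Consider a discrete random variable, and suppose that T is sufficient for θ. For brevity write $\mathbf{Y} = (Y_1, \ldots, Y_n)$ and $\mathbf{y} = (y_1, \ldots, y_n)$. Then if $T(\mathbf{y}) = t$ and $Y_i = y_i$ for all i = 1, …, n,

$$\begin{aligned} f(\mathbf{y} \mid \theta) &= \mathbb{P}_\theta\{\mathbf{Y} = \mathbf{y}\} \\ &= \sum_{t'} \mathbb{P}_\theta\{\mathbf{Y} = \mathbf{y} \mid T(\mathbf{Y}) = t'\}\, \mathbb{P}_\theta\{T(\mathbf{Y}) = t'\} \\ &= \mathbb{P}_\theta\Big[\bigcup_{t'} \{\mathbf{Y} = \mathbf{y} \cap T(\mathbf{Y}) = t'\}\Big] \\ &= \mathbb{P}_\theta[\{\mathbf{Y} = \mathbf{y}\} \cap \{T(\mathbf{Y}) = T(\mathbf{y})\}] \\ &= \mathbb{P}_\theta\{\mathbf{Y} = \mathbf{y} \mid T(\mathbf{Y}) = T(\mathbf{y})\} \times \mathbb{P}_\theta\{T(\mathbf{Y}) = T(\mathbf{y})\}, \end{aligned}$$

where the summation and union extend over all possible values t' of T.

Hence by definition of sufficiency, $\mathbb{P}\{\mathbf{Y} = \mathbf{y} \mid T(\mathbf{Y}) = T(\mathbf{y})\}$ is independent of θ and factorisation holds for:

$$g(T(\mathbf{y}), \theta) = \mathbb{P}_\theta\{T(\mathbf{Y}) = T(\mathbf{y})\} \quad \text{and} \quad h(\mathbf{y}) = \mathbb{P}\{\mathbf{Y} = \mathbf{y} \mid T(\mathbf{Y}) = T(\mathbf{y})\}.$$

Conversely, suppose that factorisation holds, then for $T(\mathbf{y}) = t$,

$$\begin{aligned} \mathbb{P}_\theta\{\mathbf{Y} = \mathbf{y} \mid T(\mathbf{Y}) = t\} &= \frac{\mathbb{P}_\theta\{\mathbf{Y} = \mathbf{y}, T(\mathbf{Y}) = t\}}{\mathbb{P}_\theta\{T(\mathbf{Y}) = t\}} \\ &= \frac{\mathbb{P}_\theta\{\mathbf{Y} = \mathbf{y}\}}{\mathbb{P}_\theta\{T(\mathbf{Y}) = t\}} \\ &= \frac{g(T(\mathbf{y}), \theta)\, h(\mathbf{y})}{\sum g(T(\mathbf{y}'), \theta)\, h(\mathbf{y}')} \\ &= \frac{h(\mathbf{y})}{\sum h(\mathbf{y}')}, \end{aligned}$$

where the summation extends over all $\mathbf{y}'$ for which $T(\mathbf{y}') = t$. Then, if:

$T(\mathbf{y}) = t \Rightarrow \mathbb{P}_\theta\{\mathbf{Y} = \mathbf{y} \mid T(\mathbf{Y}) = t\}$ does not depend on θ,

$T(\mathbf{y}) \neq t \Rightarrow \mathbb{P}_\theta\{\mathbf{Y} = \mathbf{y} \mid T(\mathbf{Y}) = t\} = 0$ and does not depend on θ as well.

Hence T is a sufficient statistic for θ.

□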

As a result if we reconsider the goals' Poisson likelihood that was obtained in (4.2.3), it can be decomposed as:

$$g(t, \theta) = \theta^{t} e^{-n\theta} \quad \text{and} \quad h(y_1, \ldots, y_n) = \prod_{i=1}^{n} \frac{1}{y_i!},$$

where $t = \sum_{i=1}^{n} y_i$ is sufficient.
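The factorisation has a practical reading: two samples of the same size with the same total carry the same information about θ. The sketch below assumes two hypothetical samples of goal counts with equal sums and checks that the ratio of their Poisson likelihoods is free of θ, being equal to the ratio of their h terms.

```python
import math

def poisson_likelihood(theta, ys):
    """Poisson likelihood of the sample ys at rate theta."""
    return math.prod(theta ** y * math.exp(-theta) / math.factorial(y) for y in ys)

def h(ys):
    """The theta-free factor of the factorisation: product of 1 / y_i!."""
    return math.prod(1.0 / math.factorial(y) for y in ys)

# Two hypothetical samples with n = 4 and the same total t = 6.
sample_a = [0, 1, 2, 3]
sample_b = [2, 2, 1, 1]

thetas = [0.5, 1.0, 2.0, 3.5]
ratios = [poisson_likelihood(t, sample_a) / poisson_likelihood(t, sample_b)
          for t in thetas]
print(ratios, h(sample_a) / h(sample_b))
```

Because the g terms cancel whenever the totals agree, every printed ratio should equal h(sample_a)/h(sample_b), whatever value of θ is used.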


4.3 Bayesian Estimation

After having defined the fundamental constituents that make up the posterior distribution, we shall next move on to discuss the Bayesian estimation methods for the unknown probability distribution parameters. While starting off with the concepts of point estimation, the focus will eventually turn to the ideas behind interval estimation.

4.3.1 Point Estimation
The first approach to inference is to calculate a single point estimate that should serve as the best possible value of an unknown population parameter. And just as the classical approach has the maximum likelihood estimator, the Bayesian perspective has its own so called generalised maximum likelihood estimation method.

Definition: The generalised maximum likelihood estimate $\hat{\theta}$ for the parameter θ is the value at which the posterior density is at a maximum:

$$\hat{\theta} = \arg\max_{\theta} \{ f(\theta \mid y_1, \ldots, y_n) \}.$$

In other words, the parameter estimate $\hat{\theta}$ can also be regarded as the mode of the posterior density. Nevertheless rather than presenting such a parameter estimate on its own, it would be appropriate to accompany it with some measure of its statistical behaviour. Herein we can calculate the Bayes risk by referring to the posterior variance.

Definition: The posterior variance of $\hat{\Theta}$ is defined as:

$$\begin{aligned} \operatorname{Var}(\hat{\Theta} \mid y_1, \ldots, y_n) &= E\left[(\hat{\Theta} - E(\hat{\Theta}))^2 \mid y_1, \ldots, y_n\right] \\ &= E\left[(\hat{\Theta} - \Theta)^2 \mid y_1, \ldots, y_n\right] \\ &= \int (\hat{\Theta} - \theta)^2 f(\theta \mid y_1, \ldots, y_n)\, d\theta, \end{aligned}$$

where $\hat{\Theta}$ is assumed to be an unbiased estimator of the real valued parameter Θ with posterior density $f(\theta \mid y_1, \ldots, y_n)$.


However, although point estimation is very practical and efficient, it is sometimes also desirable to go beyond the generation of single ‘best’ estimates.
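As a worked illustration of the generalised maximum likelihood estimate, consider the posterior kernel θ^s exp{−nθ − (θ² − 2μθ)/(2σ²)} with s denoting the total number of goals. Setting the derivative of its logarithm to zero gives s/θ − n − (θ − μ)/σ² = 0, that is θ² + (nσ² − μ)θ − sσ² = 0, whose positive root is the posterior mode. The sketch below, with hypothetical input values, compares this closed form with a direct grid search.

```python
import math

def posterior_mode(s, n, mu, sigma2):
    """Positive root of theta^2 + (n*sigma2 - mu)*theta - s*sigma2 = 0."""
    b = n * sigma2 - mu
    return (-b + math.sqrt(b * b + 4.0 * s * sigma2)) / 2.0

def log_kernel(theta, s, n, mu, sigma2):
    """Log of the posterior kernel with a Poisson likelihood and normal prior."""
    return (s * math.log(theta) - n * theta
            - (theta ** 2 - 2.0 * mu * theta) / (2.0 * sigma2))

# Hypothetical inputs: 10 goals in 8 matches, prior N(1, 4).
s, n, mu, sigma2 = 10, 8, 1.0, 4.0

mode_closed_form = posterior_mode(s, n, mu, sigma2)

# Grid search over theta in (0, 6] with spacing 0.001.
grid = [0.001 * k for k in range(1, 6001)]
mode_grid = max(grid, key=lambda t: log_kernel(t, s, n, mu, sigma2))
print(mode_closed_form, mode_grid)
```

The two values should agree up to the grid spacing, since the log kernel is concave in θ and therefore has a single maximum.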

4.3.2 Interval Estimation
In contrast to point estimation, this method expands the estimation to an interval which is most likely to host the true value of some parameter. While commonly compared to the confidence intervals found in classical statistical inference, these are known as Bayesian credible intervals.

4.3.2.1 Credible Interval

For some level of significance α, a credible interval can be defined as a range for which the posterior probability that the parameter Θ lies in the same interval is taken as (1 − α). In other words this means that with 100(1 − α)% certainty the selected interval contains the parameter in question, which is however quite different from the interpretation of the confidence interval. In contrast, the 100(1 − α)% of the classical framework refers to the proportion of selected confidence intervals that are likely to host the parameter under study.

Definition: Suppose that $F(\theta \mid y_1, \ldots, y_n)$ represents the posterior cumulative distribution function for the parameter Θ, then we can specify a credible interval which ranges over the limits a and b such that,

$$(1 - \alpha) = \mathbb{P}[a < \Theta < b \mid y_1, \ldots, y_n] = F(b \mid y_1, \ldots, y_n) - F(a \mid y_1, \ldots, y_n),$$

where α stands for some predetermined level of significance.

Hence according to result (4.2.4), the credible interval [a, b] is derived from:

$$F(a \mid y_1, \ldots, y_n) = c \int_0^a \theta^{\sum y_i} \exp\left\{-n\theta - \frac{(\theta^2 - 2\mu\theta)}{2\sigma^2}\right\} d\theta = \frac{\alpha}{2},$$

and

$$F(b \mid y_1, \ldots, y_n) = c \int_0^b \theta^{\sum y_i} \exp\left\{-n\theta - \frac{(\theta^2 - 2\mu\theta)}{2\sigma^2}\right\} d\theta = 1 - \frac{\alpha}{2},$$

where c is the normalising constant of the posterior density.

However the problem with such interval estimations is that there is more than just one interval with probability (1 − α). Therefore in absence of this interval uniqueness, we need to define something which filters the selection.
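The two tail conditions above can be solved numerically once the posterior is discretised. The sketch below is one possible approach under stated assumptions: hypothetical inputs (10 goals in 8 matches, prior N(1, 4)), a grid truncated at an assumed upper limit of 6, and the normalising constant replaced by a division by the total grid weight.

```python
import math

def posterior_cell_masses(s, n, mu, sigma2, upper=6.0, n_points=6000):
    """Discretise the kernel (4.2.4) on (0, upper) and normalise it to sum to 1."""
    step = upper / n_points
    grid = [step * (k + 0.5) for k in range(n_points)]
    logs = [s * math.log(t) - n * t - (t * t - 2.0 * mu * t) / (2.0 * sigma2)
            for t in grid]
    peak = max(logs)
    weights = [math.exp(v - peak) for v in logs]
    total = sum(weights)
    return grid, [w / total for w in weights]

def equal_tailed_interval(grid, masses, alpha):
    """Return (a, b) leaving a posterior mass of about alpha/2 in each tail."""
    cumulative = 0.0
    a = b = None
    for theta, m in zip(grid, masses):
        cumulative += m                      # running value of F(theta | y)
        if a is None and cumulative >= alpha / 2.0:
            a = theta
        if cumulative >= 1.0 - alpha / 2.0:
            b = theta
            break
    return a, b

grid, masses = posterior_cell_masses(s=10, n=8, mu=1.0, sigma2=4.0)
a, b = equal_tailed_interval(grid, masses, alpha=0.05)
coverage = sum(m for t, m in zip(grid, masses) if a <= t <= b)
print(a, b, coverage)
```

The resulting interval is only one of many intervals with posterior probability close to 0.95, which is exactly the lack of uniqueness discussed above.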

4.3.2.2 Highest Posterior Density Interval
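The highest posterior density interval is an extension of the credible interval theory. It applies certain conditions which lead to that specific interval that contains the points with the highest posterior densities.

Definition: For a unimodal posterior density, the highest posterior density intervals are essentially those smallest (1 − α) credible intervals that satisfy:

1. $F(b \mid y_1, \ldots, y_n) - F(a \mid y_1, \ldots, y_n) = 1 - \alpha$, and

2. $f(\theta_1 \mid y_1, \ldots, y_n) \geq f(\theta_2 \mid y_1, \ldots, y_n)$, for all $\theta_1 \in [a, b]$ and $\theta_2 \notin [a, b]$.

In other words the principal condition ensures that each point inside the highest posterior density interval has a greater posterior density than the points outside the interval. As a result, given that by (4.2.4):

$$c \int_a^b \theta^{\sum y_i} \exp\left\{-n\theta - \frac{(\theta^2 - 2\mu\theta)}{2\sigma^2}\right\} d\theta = 1 - \alpha,$$

where c is the normalising constant of the posterior density, we fix a value:

$$k = c\, a^{\sum y_i} \exp\left\{-na - \frac{(a^2 - 2\mu a)}{2\sigma^2}\right\} = c\, b^{\sum y_i} \exp\left\{-nb - \frac{(b^2 - 2\mu b)}{2\sigma^2}\right\},$$

such that:

For all $\theta \notin [a, b]$: $\quad c\, \theta^{\sum y_i} \exp\left\{-n\theta - \frac{(\theta^2 - 2\mu\theta)}{2\sigma^2}\right\} < k,$

For all $\theta \in [a, b]$: $\quad c\, \theta^{\sum y_i} \exp\left\{-n\theta - \frac{(\theta^2 - 2\mu\theta)}{2\sigma^2}\right\} \geq k.$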


Moreover, in the case of higher dimensions the definition of a highest posterior density interval can be easily generalised as that (1 − α) credible region that comprises the Θ values with the highest posterior densities. In addition the highest posterior density interval is guaranteed to hold the uniqueness property as shown by the next theorem.

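Theorem: Given that the posterior density for all credible intervals with limits (a, b) is never uniform in any interval of the space of θ, the highest posterior density interval exists and is unique.
Proof: Assuming a unimodal posterior density, we start by defining the Lagrangian as,

$$\mathcal{L} = b - a + \lambda \left[ \int_a^b f(\theta \mid y_1, \ldots, y_n)\, d\theta - (1 - \alpha) \right],$$

where λ is a Lagrange multiplier. Then, if we differentiate partially with respect to a and b respectively, and equate to 0:

$$\frac{\partial \mathcal{L}}{\partial a} = -1 - \lambda f(a \mid y_1, \ldots, y_n) = 0 \implies f(a \mid y_1, \ldots, y_n) = -\frac{1}{\lambda},$$

and

$$\frac{\partial \mathcal{L}}{\partial b} = 1 + \lambda f(b \mid y_1, \ldots, y_n) = 0 \implies f(b \mid y_1, \ldots, y_n) = -\frac{1}{\lambda}.$$

Hence for the probability density to be positive, λ should be negative. Moreover the second order differential terms are as follows:

$$\frac{\partial^2 \mathcal{L}}{\partial a^2} = -\lambda \frac{\partial f(a \mid y_1, \ldots, y_n)}{\partial a} > 0,$$

$$\frac{\partial^2 \mathcal{L}}{\partial b^2} = \lambda \frac{\partial f(b \mid y_1, \ldots, y_n)}{\partial b} > 0,$$

$$\frac{\partial^2 \mathcal{L}}{\partial a\, \partial b} = \frac{\partial^2 \mathcal{L}}{\partial b\, \partial a} = 0,$$

where the inequalities hold because λ is negative and, for a unimodal density with a below the mode and b above it, f is increasing at a and decreasing at b. The Hessian matrix is therefore positive definite, and so the interval (a, b) is truly a minimum. □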
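The definition of the highest posterior density interval suggests a simple numerical recipe on a discretised posterior: keep adding the grid cells with the largest density until their mass reaches (1 − α). The sketch below assumes hypothetical inputs (10 goals in 8 matches, prior N(1, 4)) and a unimodal posterior, so that the selected cells form a single interval.

```python
import math

def posterior_cell_masses(s, n, mu, sigma2, upper=6.0, n_points=6000):
    """Discretise the kernel (4.2.4) on (0, upper) and normalise it to sum to 1."""
    step = upper / n_points
    grid = [step * (k + 0.5) for k in range(n_points)]
    logs = [s * math.log(t) - n * t - (t * t - 2.0 * mu * t) / (2.0 * sigma2)
            for t in grid]
    peak = max(logs)
    weights = [math.exp(v - peak) for v in logs]
    total = sum(weights)
    return grid, [w / total for w in weights]

def hpd_interval(masses, alpha):
    """Indices of the end points of the highest posterior density interval."""
    order = sorted(range(len(masses)), key=lambda k: masses[k], reverse=True)
    chosen, total = [], 0.0
    for k in order:                 # visit cells from highest to lowest density
        chosen.append(k)
        total += masses[k]
        if total >= 1.0 - alpha:
            break
    return min(chosen), max(chosen)

grid, masses = posterior_cell_masses(s=10, n=8, mu=1.0, sigma2=4.0)
lo, hi = hpd_interval(masses, alpha=0.05)
a, b = grid[lo], grid[hi]
coverage = sum(masses[lo:hi + 1])
print(a, b, coverage)
```

At the two end points the density is approximately equal, which corresponds to the common value −1/λ obtained from the first order conditions of the proof.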

Another common estimation technique uses the empirical Bayes estimator, in which the sample data is utilised to estimate the parameters (better referred to as hyperparameters) of the prior distribution. Hence strictly speaking this process departs from the Bayesian framework, which formally requires that the formulation of the prior distribution is totally independent of the actual data. Moreover, the absence of a natural standard error is a major setback when establishing credible intervals or testing hypotheses. As a result it is not deemed very useful for this study.

4.4 Hypothesis Testing

Although the parameter's posterior distribution generally summarises all the required information, a researcher might need to investigate whether the random parameter Θ lies in any particular part of the parameter space $\Omega_\Theta$. As a result, null and alternative hypotheses are established as:

$$H_0 : \Theta \in \Omega_{\Theta_0},$$
$$H_1 : \Theta \in \Omega_{\Theta_1},$$

where $\Omega_{\Theta_0}$ and $\Omega_{\Theta_1}$ are two subsets that partition the parameter space $\Omega_\Theta$.

Rejection of a hypothesis is only based upon comparison of the posterior probabilities $\mathbb{P}[H_0 \mid y_1, \ldots, y_n]$ and $\mathbb{P}[H_1 \mid y_1, \ldots, y_n]$. In contrast to Frequentist hypothesis testing, these subjective probabilities are in accordance with the data and prior information. As a result it is important to outline that the accepted hypothesis is not necessarily true, but it is temporarily the best in light of the actual data.
Next we shall illustrate the approach for two different types of hypotheses. In the process the new concept of posterior odds ratio will also be introduced according to Jeffreys' hypothesis testing criterion which states that: “If the posterior odds ratio exceeds unity, we accept $H_0$; otherwise, we reject $H_0$ in favor of $H_1$.” [Press 2003]

4.4.1 Simple Null & Alternative Hypotheses
The simplest hypothesis testing occurs when both distribution functions of the null and alternative hypotheses are fully specified. Suppose that $\theta_0$ and $\theta_1$ are constants, then the hypotheses would be:

$$H_0 : \Theta = \theta_0,$$
$$H_1 : \Theta = \theta_1.$$

Assuming these hypotheses to be mutually exclusive and exhaustive, we denote the appropriate test statistic for a sample of n readings by T. Then using Bayes' theorem, the posterior probabilities for the two hypotheses given the observed data value t are:

$$\mathbb{P}[H_0 \mid t] = \frac{\mathbb{P}[t \mid H_0]\, \mathbb{P}[H_0]}{\mathbb{P}[t \mid H_0]\, \mathbb{P}[H_0] + \mathbb{P}[t \mid H_1]\, \mathbb{P}[H_1]}$$

and,

$$\mathbb{P}[H_1 \mid t] = \frac{\mathbb{P}[t \mid H_1]\, \mathbb{P}[H_1]}{\mathbb{P}[t \mid H_1]\, \mathbb{P}[H_1] + \mathbb{P}[t \mid H_0]\, \mathbb{P}[H_0]}$$

where $\mathbb{P}[H_0]$ and $\mathbb{P}[H_1]$ represent the prior probabilities of the respective hypotheses.
Consequently we can combine these two equations as:

$$\frac{\mathbb{P}[H_0 \mid t]}{\mathbb{P}[H_1 \mid t]} = \frac{\mathbb{P}[t \mid H_0]\, \mathbb{P}[H_0]}{\mathbb{P}[t \mid H_1]\, \mathbb{P}[H_1]},$$

and since $\mathbb{P}[H_0 \mid t]$ and $\mathbb{P}[H_1 \mid t]$ add up to 1, this gives the posterior odds ratio in favour of $H_0$.
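The calculation can be sketched for a Poisson model. The example below assumes hypothetical values throughout: the test statistic t is taken as the total number of goals in n matches (which follows a Poisson distribution with rate nθ when the individual counts are independent Poisson variables with rate θ), the hypotheses are θ₀ = 1.0 against θ₁ = 1.5, and both receive prior probability 0.5.

```python
import math

def poisson_pmf(t, rate):
    """Probability of observing the count t under a Poisson(rate) model."""
    return math.exp(-rate) * rate ** t / math.factorial(t)

def posterior_probabilities(t, n, theta0, theta1, prior0=0.5):
    """Posterior probabilities of H0: theta = theta0 and H1: theta = theta1."""
    prior1 = 1.0 - prior0
    w0 = poisson_pmf(t, n * theta0) * prior0   # P[t | H0] P[H0]
    w1 = poisson_pmf(t, n * theta1) * prior1   # P[t | H1] P[H1]
    return w0 / (w0 + w1), w1 / (w0 + w1)

# Hypothetical data: 12 goals observed over 8 matches.
p0, p1 = posterior_probabilities(t=12, n=8, theta0=1.0, theta1=1.5)
posterior_odds = p0 / p1

# Jeffreys' criterion: accept H0 only if the posterior odds exceed unity.
decision = "accept H0" if posterior_odds > 1.0 else "reject H0 in favour of H1"
print(p0, p1, posterior_odds, decision)
```

With equal prior probabilities the posterior odds reduce to the likelihood ratio, so the decision in this sketch is driven entirely by the data.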

4.4.2 Simple Null & Composite Alternative Hypotheses
Another common hypothesis testing case is when the distribution function of the alternative hypothesis is partly specified by a range of possible values for θ. Hence assuming a constant value $\theta_0$, the null and alternative hypotheses would be:

$$H_0 : \Theta = \theta_0,$$
$$H_1 : \Theta \neq \theta_0.$$

Moreover assuming that T retains its same significance, the posterior odds ratio in favour of $H_0$ is differently defined as:

$$\frac{\mathbb{P}[H_0 \mid t]}{\mathbb{P}[H_1 \mid t]} = \frac{\mathbb{P}[t \mid H_0]\, \mathbb{P}[H_0]}{\mathbb{P}[t \mid H_1]\, \mathbb{P}[H_1]} = \left[ \frac{\mathbb{P}[t \mid H_0, \theta_0]}{\int \mathbb{P}[t \mid H_1, \theta]\, f_1(\theta)\, d\theta} \right] \frac{\mathbb{P}[H_0]}{\mathbb{P}[H_1]},$$

with $f_1(\theta)$ being the prior density for θ under $H_1$.


Chapter 5

Model Estimation using Football Data

5.1 Introduction

Having covered the fundamental concepts of both Markov processes and Bayesian statistics, it is time to put the theory into action. This chapter starts off with a concise description of the available data set, and moves on to formulate the problem that will be regarded. A Bayesian hierarchical model (following Baio and Blangiardo (2010)) is consequently presented for a particular football season, and its performance is compared with reference to a generalised linear model that was estimated using the same data.

5.2 Data Set

The study will focus upon the Italian Serie A (scudetto) season 1994/1995, which saw Juventus F.C. dominate the final league table and the introduction of the 3-1-0 points scheme. The main reason behind this selection is that this prestigious football league features some very slow football (in comparison to other major leagues such as the English Premier League or the Spanish Liga). As a result some football experts argue that the Italian football style gives rise to sequences of matches whose results are somewhat more predictable than for other leagues. Hence it is quite plausible to expect that such football is more susceptible to mathematical modelling.
The corresponding data set (which is also enclosed in Appendix C(C.1)) was downloaded from http://www.football-data.co.uk/italym.php. This is a free website that collects match statistics and betting odds data for up to 22 European league divisions. Its primary intention is that of enhancing the development and analysis of football betting systems.
The data set per se comprises the season's list of 306 football matches. Each of these observations consists of the final match score (e.g. 0 – 1) along with the corresponding opposing teams (e.g. Bari vs Lazio, where Bari would be the home side). Such online data files usually contain much more information (such as team line ups, referees, attendances, shots on goal, etc.), but we are not interested in using these in our study. Most importantly, the data set is also ordered by the respective match dates.

5.2.1 Descriptive Statistics
In order to gain a better understanding of the data set at hand, this subsection is dedicated to some descriptive statistics for the variables under study. Table 5.1 gives the summary statistics for the home and away goals scored in all the 306 matches of the season in question. As expected, the average number of home goals (1.5621) is considerably higher than the average number of away goals (0.9542).
Summary Statistics

                  home_goals   away_goals
N (Valid)         306          306
Mean              1.5621       .9542
Median            1.0000       1.0000
Std. Deviation    1.31488      1.09748
Minimum           .00          .00
Maximum           8.00         5.00

Table 5.1: Summary statistics of home and away goals.


Meanwhile from the same table, the maximum number of home goals is 8, which exceeds the maximum number of away goals by 3. In addition, the respective standard deviations (1.31488 and 1.09748) indicate that the variability of the home goals is slightly greater than that of the away goals. Consequently, one can observe how both variances (1.72891 and 1.20446) exceed the respective sample means. In such cases of overdispersion the use of the Poisson distribution for modelling goals is rather questionable. Furthermore Figure 5.1 represents the respective distributions of the home and away goals.
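The overdispersion argument can be made explicit with a short calculation. The sketch below takes the means and standard deviations from Table 5.1 and computes the ratio of the variance to the mean for each variable; under a Poisson distribution this ratio would be equal to 1. The dictionary layout and function name are illustrative choices.

```python
# Means and standard deviations as reported in Table 5.1.
summary = {
    "home_goals": {"mean": 1.5621, "sd": 1.31488},
    "away_goals": {"mean": 0.9542, "sd": 1.09748},
}

def dispersion_index(mean, sd):
    """Variance-to-mean ratio; equal to 1 for a Poisson distribution."""
    return sd ** 2 / mean

variances = {name: v["sd"] ** 2 for name, v in summary.items()}
indices = {name: dispersion_index(v["mean"], v["sd"]) for name, v in summary.items()}

for name in summary:
    print(name, round(variances[name], 5), round(indices[name], 4))
```

Both ratios exceed 1, although only moderately, which is consistent with the cautious wording used above.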

[Figure: bar chart titled "Goals' Distribution"; horizontal axis: Number of Goals (0 to 8); vertical axis ticks from 0 to 140; series: home_goals, away_goals.]

Figure 5.1: Histogram for the number of home and away goals.
Nevertheless it is very important to comprehend that the collection of goals illustrated above forms a so-called heterogeneous population. Had the set of goals originated from one sole team, it would have been a totally different story. However in reality the goals under study come from a mixed group of participating teams, each of which is endowed with its own different qualities and special abilities.

5.3 Defining the Problem

Given the whole list of match observations, it is thus desirable to estimate the necessary parameters that best explain the (just mentioned) different attributes of the teams. Therefore for each football match the study will follow Baio and Blangiardo (2010), and assume a log linear combination of the teams' attack and defense parameters, such that:

$$\log \theta_1 = home + att_h + def_a,$$

$$\log \theta_2 = att_a + def_h,$$

where $\theta_1$ and $\theta_2$ are the scoring rates of the home and away sides, home is the home advantage effect, and the subscripts h and a identify the home and away teams of the match.
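To show how the log linear structure translates into scoring rates, the sketch below evaluates the two equations for one hypothetical fixture, Juventus at home against Bari. The parameter values are the posterior means reported in Table 5.2; plugging posterior means into the equations is an illustrative simplification, and the function name is an assumption.

```python
import math

# Posterior means taken from Table 5.2 (home effect, attack and defense).
home = 0.4919
att = {"Juventus": 0.3164, "Bari": -0.02083}
defense = {"Juventus": -0.325, "Bari": -0.04996}

def scoring_rates(home_team, away_team):
    """Scoring rates (theta_1, theta_2) implied by the log linear model."""
    log_theta1 = home + att[home_team] + defense[away_team]   # home side
    log_theta2 = att[away_team] + defense[home_team]          # away side
    return math.exp(log_theta1), math.exp(log_theta2)

theta1, theta2 = scoring_rates("Juventus", "Bari")
print(theta1, theta2)
```

Under these values the home side has the larger scoring rate, reflecting both the home advantage and the relative attack and defense effects of the two teams.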

Consequently we shall extract comparable parameters from a generalised linear model. However at this point, it should be clear that this reference model is not expected to provide an exceptional fit. One should keep in mind that goals are assumed to follow a Poisson distribution mainly because the Poisson theory is one of the most developed theories available at the moment. One flaw of this distribution arises from the absence of a theoretical upper bound, so that the number of match goals is unrealistically allowed to be arbitrarily large.

5.4 Bayesian Modelling Procedure

First of all, the data set in question was very slightly modified according to the general requirements of WinBUGS. Apart from the specific text file format, the participating teams were ordered alphabetically and consequently assigned a number from 1 to 18. Following the Bayesian trend, all random parameters were then assigned a suitable prior distribution (with the normal and gamma being the main protagonists). Furthermore all the 18 participating teams were identified as top, mid or low-table teams according to the general perception of the respective attack and defense potential.
The model (which is attached in Appendix B(B.1)) was then implemented in WinBUGS along with the use of a special file of initial values (also attached in Appendix B(B.2)). In fact this particular file of initial values was directly obtained from professor Gianluca Baio himself, and comprises a series of sensible initial values for all important nodes of the model. He advised me that due to the complexity of the model, one cannot let WinBUGS generate the initial values on its own as it will tend to produce unacceptable values that will freeze the simulation process.


5.4.1 Results
Table 5.2 represents the summary statistics obtained for the posterior distributions of the parameters of interest. These comprise the posterior mean along with the corresponding standard deviation and MC error of each node, as well as the median and the 95% credible interval. The simulation consisted of 30,000 iterations, but the first 500 were discarded as a burn-in period (in order for the final estimates to become independent of the arbitrary initial values).

node                  mean       sd        MC error   2.5%       median     97.5%
home                  0.4919     0.06946   8.836E-4   0.3569     0.4916     0.6279
att[Bari]             -0.02083   0.1142    0.001061   -0.2683    -0.01662   0.2016
att[Brescia]          -0.5688    0.2154    0.002727   -1.042     -0.5506    -0.2024
att[Cagliari]         -0.02697   0.1152    0.001097   -0.2785    -0.02124   0.1931
att[Cremonese]        -0.2481    0.1549    0.002108   -0.5523    -0.2478    0.05275
att[Fiorentina]       0.3537     0.1237    0.00181    0.1102     0.3528     0.5982
att[Foggia]           -0.2732    0.1558    0.002064   -0.583     -0.2716    0.02916
att[Genoa]            -0.2382    0.1577    0.002163   -0.545     -0.2388    0.07062
att[Internazionale]   -0.158     0.1573    0.002233   -0.4657    -0.1577    0.1468
att[Juventus]         0.3164     0.1227    0.001838   0.06996    0.3171     0.5547
att[Lazio]            0.4007     0.1255    0.001887   0.1567     0.3991     0.6536
att[Milan]            0.2546     0.1305    0.001928   -0.01348   0.2594     0.5
att[Napoli]           -0.02067   0.1144    0.00106    -0.2726    -0.01567   0.198
att[Padova]           -0.1918    0.1593    0.002331   -0.4997    -0.1931    0.1212
att[Parma]            0.2307     0.1335    0.001991   -0.04523   0.2361     0.4793
att[Reggiana]         -0.4164    0.1739    0.002437   -0.7872    -0.4075    -0.1031
att[Roma]             0.0367     0.1126    0.001029   -0.1813    0.03124    0.2816
att[Sampdoria]        0.234      0.1332    0.001988   -0.04135   0.2408     0.4794
att[Torino]           0.02676    0.1144    0.001237   -0.1981    0.02175    0.2725

def[Bari] def[Brescia] def[Cagliari] def[Cremonese] def[Fiorentina] def[Foggia] def[Genoa] def[Internazionale] def[Juventus] def[Lazio] def[Milan] def[Napoli] def[Padova] def[Parma] def[Reggiana] def[Roma] def[Sampdoria] def[Torino] -0.04996
0.2675
-0.2468
-0.2662
0.227
0.03375
0.02213
-0.3145
-0.325
-0.2963
-0.3261
-0.01576
0.2175
-0.3392
0.1913
-0.4264
-0.2775
0.01925

0.1188
0.125
0.1408
0.1393
0.1261
0.1125
0.1107
0.1354
0.1359
0.1372
0.1364
0.1139
0.1266
0.1369
0.1273
0.1545
0.138
0.1122

0.001425
0.001689
0.002035
0.002242
0.001766
0.001149
0.001098
0.002029
0.002123
0.002195
0.002108
0.001257
0.00165
0.002139
0.001631
0.002317
0.002035
0.001167

-0.3214
0.02425
-0.5103
-0.5279
-0.02151
-0.1847
-0.2011
-0.5807
-0.5995
-0.5614
-0.596
-0.2595
-0.03289
-0.616
-0.06106
-0.763
-0.5425
-0.207

-0.04015
0.2669
-0.2527
-0.2712
0.2274
0.02811
0.01866
-0.3149
-0.3234
-0.2993
-0.325
-0.0119
0.2181
-0.3372
0.1912
-0.4145
-0.2814
0.01663

0.1656
0.5166
0.04289
0.02059
0.4744
0.2736
0.2519
-0.04452
-0.05822
-0.01356
-0.05745
0.2068
0.4643
-0.0726
0.4407
-0.152
0.003826
0.2531

Table 5.2: Bayesian estimation of the main parameters
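The way such summary statistics are obtained from a simulated chain can be sketched as follows. This is only a minimal illustration: the chain below is a hypothetical stand-in generated with independent normal draws (it is not the WinBUGS output), and the helper name summarise is our own.

```python
import random
import statistics

def summarise(chain, burn_in):
    """Posterior summaries after discarding the first `burn_in` draws."""
    kept = chain[burn_in:]
    # quantiles(n=40) returns 39 cut points in steps of 2.5%,
    # so the first and last are the 2.5% and 97.5% percentiles.
    cuts = statistics.quantiles(kept, n=40)
    return {
        "n": len(kept),
        "mean": statistics.fmean(kept),
        "sd": statistics.stdev(kept),
        "q025": cuts[0],
        "median": statistics.median(kept),
        "q975": cuts[-1],
    }

random.seed(1)
# Hypothetical chain: 500 early draws still drifting away from a poor
# starting value, followed by 29,500 draws around the target value.
warm_up = [3.0 - 2.5 * i / 500 + random.gauss(0, 0.07) for i in range(500)]
settled = [random.gauss(0.49, 0.07) for _ in range(29500)]
chain = warm_up + settled

summary = summarise(chain, burn_in=500)
untrimmed_mean = statistics.fmean(chain)
print(summary)
print("mean without discarding the burn-in:", untrimmed_mean)
```

Comparing the two printed means shows why the early iterations are discarded: the drifting draws pull the untrimmed mean away from the value around which the chain eventually settles.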
The home effect is estimated at 0.4919 and is assumed to be constant across all participating teams. All other nodes represent the attack and defense parameters of the respective teams, whose posterior density plots are presented next.
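As a hedged illustration of how these nodes combine, the sketch below assumes the log-linear scoring intensities of Baio & Blangiardo (2010): log(lambda_home) = home + att[home team] + def[away team] and log(lambda_away) = att[away team] + def[home team]. This form is not spelled out in this section, so it is an assumption here, and plugging in posterior means gives only a rough point estimate rather than a full posterior prediction. The function name expected_goals is our own.

```python
import math

# Posterior means taken from Table 5.2 (two teams only, for brevity).
HOME = 0.4919
ATT = {"Juventus": 0.3164, "Brescia": -0.5688}
DEF = {"Juventus": -0.325, "Brescia": 0.2675}

def expected_goals(home_team, away_team):
    """Plug-in scoring intensities under the assumed log-linear model."""
    lam_home = math.exp(HOME + ATT[home_team] + DEF[away_team])
    lam_away = math.exp(ATT[away_team] + DEF[home_team])
    return lam_home, lam_away

lam_home, lam_away = expected_goals("Juventus", "Brescia")
print(f"Juventus (home) vs Brescia (away): {lam_home:.2f} - {lam_away:.2f}")
```

Under this assumed form, a larger defense parameter for the opponent raises a team's expected goals, which is why higher defense values correspond to poorer defending.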

[Figure 5.2 consists of 18 WinBUGS posterior density panels, att[1] to att[18], each based on a sample of 29500.]

Figure 5.2: Posterior densities for the attack parameters.
Each of these posterior densities is proportional to the product of the likelihood function (which formally expresses the data) and the prior distribution (which models our external beliefs), as explained previously in Chapter 4. The posterior densities above represent the attack parameters, whilst those in Figure 5.3 represent the defense parameters of the teams (numbered from 1 to 18 in alphabetical order).
On a general note, all posterior plots are unimodal and most of the probability mass lies between –0.5 and 0.5. However each plot is centred around a different value, namely the posterior mean of the node presented in Table 5.2. All in all, more defense parameters than attack parameters are centred around a negative mean. The majority of the plots are also symmetric.


Nevertheless there are several exceptions which do not conform to this general description. The most markedly different posterior densities are those for att[2], att[15] and def[16]. These represent the attack parameters of Brescia and Reggiana (both relegated) and Roma’s defense parameter respectively.

[Figure 5.3 consists of 18 WinBUGS posterior density panels, def[1] to def[18], each based on a sample of 29500.]

Figure 5.3: Posterior densities for the defense parameters.
Furthermore it is useful to place the posterior means in a more comparable setting. To this end, the respective attack and defense parameters were embedded within the final league table. It is important to understand that higher parameters indicate more goals (i.e. an effective style in the case of attack, and poor play in the case of defense).

Team            Played  Wins  Draws  Lost  Goals  Attack      Goals    Defense     Final
                                           for    parameter   against  parameter   points
Bari            34      12    8      14    40     -0.02083    43       -0.04996    44
Brescia (R)     34      2     6      26    18     -0.5688     65        0.2675     12
Cagliari        34      13    10     11    40     -0.02697    39       -0.2468     49
Cremonese       34      11    8      15    35     -0.2481     38       -0.2662     41
Fiorentina      34      12    11     11    61      0.3537     57        0.227      47
Foggia (R)      34      8     10     16    32     -0.2732     50        0.03375    34
Genoa           34      10    10     14    34     -0.2382     49        0.02213    40
Internazionale  34      14    10     10    39     -0.158      34       -0.3145     52
Juventus (C)    34      23    4      7     59      0.3164     32       -0.325      73
Lazio           34      19    6      9     69      0.4007     34       -0.2963     63
Milan           34      17    9      8     53      0.2546     32       -0.3261     60
Napoli          34      13    12     9     40     -0.02067    45       -0.01576    51
Padova          34      12    4      18    37     -0.1918     58        0.2175     40
Parma           34      18    9      7     51      0.2307     31       -0.3392     63
Reggiana (R)    34      4     6      24    24     -0.4164     59        0.1913     18
Roma            34      16    11     7     46      0.0367     25       -0.4264     59
Sampdoria       34      13    11     10    51      0.234      37       -0.2775     50
Torino          34      12    9      13    44      0.02676    48        0.01925    45

Table 5.3: Final league table and the corresponding team parameters estimates.
From Table 5.3, one can see that Juventus (the crowned champions) are assigned one of the highest attack parameters (0.3164). However it is important to note that Lazio (0.4007) and Fiorentina (0.3537) obtained higher estimates. This is supported by their respective goal tallies (69 and 61), which exceed the 59 goals scored by Juventus.
Meanwhile the lowest attack parameter (–0.5688) unsurprisingly belongs to Brescia, who managed merely 18 goals in total. Brescia also had the worst defense parameter (0.2675), and are accompanied in this respect by the other relegated teams and by several exceptions such as Padova and Fiorentina. On the other hand the best defense parameter (–0.4264) belongs to Roma, who conceded the fewest goals (25) of all.
In addition, the same two sets of estimated parameters were plotted against the final league points. Figure 5.4 confirms the expected opposite linear dependence: final league points tend to increase with the attack parameter and to decrease with the defense parameter.

[Figure 5.4 is a scatter plot of the final league points (vertical axis, 0 to 80) against the estimated parameters (horizontal axis, –0.8 to 0.6), with one series for the attack parameter and one for the defense parameter.]

Figure 5.4: The estimated team parameters vs the final league points.
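The opposite dependence seen in Figure 5.4 can also be checked numerically. The following is a minimal sketch that computes the Pearson correlation between each set of posterior means and the final points, using the values of Table 5.3; the correlation coefficient is our own addition and was not part of the original analysis.

```python
import math

# (attack parameter, defense parameter, final points), from Table 5.3.
table = {
    "Bari": (-0.02083, -0.04996, 44), "Brescia": (-0.5688, 0.2675, 12),
    "Cagliari": (-0.02697, -0.2468, 49), "Cremonese": (-0.2481, -0.2662, 41),
    "Fiorentina": (0.3537, 0.227, 47), "Foggia": (-0.2732, 0.03375, 34),
    "Genoa": (-0.2382, 0.02213, 40), "Internazionale": (-0.158, -0.3145, 52),
    "Juventus": (0.3164, -0.325, 73), "Lazio": (0.4007, -0.2963, 63),
    "Milan": (0.2546, -0.3261, 60), "Napoli": (-0.02067, -0.01576, 51),
    "Padova": (-0.1918, 0.2175, 40), "Parma": (0.2307, -0.3392, 63),
    "Reggiana": (-0.4164, 0.1913, 18), "Roma": (0.0367, -0.4264, 59),
    "Sampdoria": (0.234, -0.2775, 50), "Torino": (0.02676, 0.01925, 45),
}

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

attack = [row[0] for row in table.values()]
defense = [row[1] for row in table.values()]
points = [row[2] for row in table.values()]

r_attack = pearson(attack, points)
r_defense = pearson(defense, points)
print(f"attack vs points:  r = {r_attack:.3f}")
print(f"defense vs points: r = {r_defense:.3f}")
```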
Furthermore, WinBUGS provides a Bayesian method for model comparison known as the Deviance Information Criterion (DIC):
DIC = Dbar + pD = Dhat + 2 pD,
where Dbar is the posterior mean of the deviance, Dhat is a point estimate of the deviance, and pD is the effective number of parameters. Table 5.4 comprises the DIC obtained for our two main variables of interest.

             Dbar        Dhat        pD        DIC
Awaygoals    777.969     769.526     8.443     786.412
Homegoals    923.093     909.608     13.485    936.578
total        1701.060    1679.130    21.927    1722.990

Table 5.4: Deviance Information Criterion
As a matter of fact, the smallest DIC is generally considered to indicate the best replication of a data set. However in our case the effective numbers of parameters (pD) are very low for both variables, and their difference of 13.485 – 8.443 = 5.042 is minimal. This suggests that our model is probably too simplistic to cater for the complexity with which the actual data generating mechanism operates. Let us therefore compare this model with another which introduces a further key factor.
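The arithmetic behind Table 5.4 can be retraced with a short sketch. It relies on the standard relation pD = Dbar – Dhat, which is consistent with the two expressions for the DIC given above; the dictionary layout and variable names are our own.

```python
# (Dbar, Dhat) as reported by WinBUGS in Table 5.4.
deviance = {
    "Awaygoals": (777.969, 769.526),
    "Homegoals": (923.093, 909.608),
}

results = {}
for node, (dbar, dhat) in deviance.items():
    p_d = dbar - dhat          # effective number of parameters
    dic = dbar + p_d           # equivalently dhat + 2 * p_d
    results[node] = {"pD": p_d, "DIC": dic}
    print(f"{node}: pD = {p_d:.3f}, DIC = {dic:.3f}")
```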

5.5 Poisson Regression Formulation

The development of a comparable Poisson regression model was both challenging and interesting. The first difference to clarify with respect to the Bayesian approach is that we now estimate two models (one for each dependent variable), which will be referred to as:
(i) the dependent home goals model.
(ii) the dependent away goals model.
As a result we generate 2 sets of attack and defense parameters for the same team. We started by running the simplest possible model using SPSS, and Table 5.5 presents the respective tests of model effects.
Tests of Model Effects (Type III)

               Dependent variable: home goals     Dependent variable: away goals
Source         Wald Chi-Square   df   Sig.        Wald Chi-Square   df   Sig.
(Intercept)    54.109            1    .000        7.238             1    .007
home_team      41.755            17   .001        26.584            17   .064
away_team      35.574            17   .005        38.555            17   .002

Table 5.5: Respective tests of model effects for the dependent home goals and dependent away goals models.
However, after much discussion, we questioned whether an extra model predictor could be added to improve these results. Eventually we came up with the idea of generating a new sequence of team strengths from the same data set. This variable measures the difference between the two teams’ actual strengths at the time the corresponding match was held, by considering the teams’ performances over the previous 6 fixtures.
With this in mind an appropriate program (which is attached in Appendix A.1) was constructed, and the variable which holds this difference in strengths was called TS. It is also important to mention that in the absence of prior fixtures, as in the case of match day 1, the final league table of the previous season was used to generate the very first 9 TS values.
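The construction of TS can be sketched in Python as follows, mirroring the logic of the Matlab program in Appendix A.1. This is an illustrative sketch only: the function name, the data layout and the four-team toy league are hypothetical, and the previous-season values are assumed here to be points.

```python
def team_strength_diffs(points, fixtures, prev_season, window=6):
    """TS for every match, in playing order.

    points[r][t]   : points won by team t on match day r (0-based)
    fixtures[r]    : list of (home, away) team indices on match day r
    prev_season[t] : previous-season value of team t (assumed: points)
    """
    ts = []
    for r, matches in enumerate(fixtures):
        for home, away in matches:
            if r == 0:
                # No earlier fixtures: fall back on the previous season.
                ts.append(prev_season[home] - prev_season[away])
            else:
                # Points gathered over (up to) the previous `window` match days.
                recent = points[max(0, r - window):r]
                ts.append(sum(day[home] for day in recent)
                          - sum(day[away] for day in recent))
    return ts

# Hypothetical toy league with 4 teams and 3 match days.
prev_season = [50, 40, 30, 20]
fixtures = [[(0, 1), (2, 3)], [(1, 2), (3, 0)], [(0, 2), (1, 3)]]
points = [[3, 0, 1, 1],   # day 1: team 0 beats team 1, teams 2 and 3 draw
          [3, 3, 0, 0],   # day 2: team 1 beats team 2, team 0 wins at team 3
          [0, 0, 0, 0]]   # day 3 results are not needed for its own TS values

ts = team_strength_diffs(points, fixtures, prev_season)
print(ts)
```

A positive TS value means that the home team has recently gathered more points than its opponent; a negative value means the opposite.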


5.5.1 Results
Having repeated the SPSS Poisson regression with this new variable, we obtained some very interesting results, which are discussed below. However we first present Table 5.6, which introduces some characteristics of the TS variable.
Variable Information

Covariate   N     Minimum   Maximum   Mean      Std. Deviation
TS          306   -16       18        -0.2026   5.15639

Table 5.6: Summary statistics of TS variable.
The TS values range from –16 up to a maximum of 18, with a mean of –0.2026 and a standard deviation of 5.15639. Subsequently Table 5.7 comprises the new tests of model effects for both the dependent home goals and dependent away goals models.
Tests of Model Effects (Type III)

                  Dependent variable: home goals     Dependent variable: away goals
Source            Wald Chi-Square   df   Sig.        Wald Chi-Square   df   Sig.
(Intercept)       49.86             1    0.000       7.182             1    0.007
home_team         45.823            17   0.000       28.792            17   0.036
away_team         39.965            17   0.001       39.216            17   0.002
team_strengths    4.514             1    0.034       2.631             1    0.105

Table 5.7: Newly generated tests of model effects.
On comparing Table 5.7 with Table 5.5, one can notice that the Wald Chi-Squares of the respective intercepts were reduced, whereas the Wald Chi-Squares for the home and away teams all increased. The new variable itself is significant at the 5% level in the dependent home goals model (p = 0.034) but not in the dependent away goals model (p = 0.105). Such changes imply that the contribution of this new variable is worth considering. Subsequently we present Tables 5.8 and 5.9, which comprise the estimated sets of parameters (along with the corresponding standard errors), the 95% Wald confidence intervals and the respective hypothesis tests.

Parameter Estimates

                                           95% Wald Confidence Interval   Hypothesis Test
Parameter           B         Std. Error   Lower      Upper               Wald Chi-Square   df   Sig.
(Intercept)          0.692    0.2651        0.172      1.212              6.812             1    0.009
home_team
  [Bari]            -0.053    0.2836       -0.609      0.503              0.035             1    0.852
  [Brescia]         -0.746    0.3428       -1.418     -0.075              4.741             1    0.029
  [Cagliari]        -0.006    0.2835       -0.562      0.549              0.001             1    0.982
  [Cremonese]       -0.134    0.2897       -0.702      0.434              0.213             1    0.644
  [Fiorentina]       0.485    0.2577       -0.021      0.99               3.535             1    0.06
  [Foggia]          -0.223    0.2974       -0.806      0.36               0.563             1    0.453
  [Genoa]           -0.109    0.2896       -0.677      0.458              0.143             1    0.705
  [Internazionale]  -0.159    0.2929       -0.733      0.415              0.295             1    0.587
  [Juventus]         0.192    0.281        -0.359      0.743              0.468             1    0.494
  [Lazio]            0.738    0.2488        0.251      1.226              8.808             1    0.003
  [Milan]            0.04     0.2853       -0.519      0.599              0.02              1    0.889
  [Napoli]          -0.04     0.2866       -0.602      0.521              0.02              1    0.888
  [Padova]          -0.059    0.2869       -0.621      0.504              0.042             1    0.838
  [Parma]            0.316    0.2678       -0.208      0.841              1.396             1    0.237
  [Reggiana]        -0.628    0.3314       -1.278      0.021              3.593             1    0.058
  [Roma]             0.109    0.2805       -0.441      0.658              0.15              1    0.698
  [Sampdoria]        0.328    0.2628       -0.187      0.843              1.558             1    0.212
  [Torino]           0 (a)    .             .          .                  .                 .    .
away_team
  [Bari]            -0.575    0.2795       -1.123     -0.028              4.238             1    0.04
  [Brescia]          0.057    0.2481       -0.429      0.543              0.053             1    0.817
  [Cagliari]        -0.268    0.2526       -0.763      0.227              1.125             1    0.289
  [Cremonese]       -0.352    0.2653       -0.872      0.168              1.763             1    0.184
  [Fiorentina]       0        0.2371       -0.465      0.464              0                 1    0.999
  [Foggia]           0.005    0.2418       -0.469      0.479              0                 1    0.983
  [Genoa]           -0.122    0.2458       -0.604      0.359              0.248             1    0.618
  [Internazionale]  -0.645    0.2804       -1.195     -0.096              5.293             1    0.021
  [Juventus]        -0.707    0.2856       -1.267     -0.147              6.126             1    0.013
  [Lazio]           -0.743    0.2958       -1.323     -0.163              6.306             1    0.012
  [Milan]           -0.618    0.2778       -1.162     -0.073              4.948             1    0.026
  [Napoli]          -0.319    0.2579       -0.825      0.186              1.533             1    0.216
  [Padova]           0.113    0.2304       -0.339      0.564              0.239             1    0.625
  [Parma]           -0.637    0.2803       -1.186     -0.088              5.163             1    0.023
  [Reggiana]         0.049    0.2423       -0.425      0.524              0.042             1    0.839
  [Roma]            -0.731    0.295        -1.309     -0.153              6.143             1    0.013
  [Sampdoria]       -0.654    0.2893       -1.221     -0.087              5.107             1    0.024
  [Torino]           0 (a)    .             .          .                  .                 .    .
team_strengths      -0.027    0.0125       -0.051     -0.002              4.514             1    0.034
(Scale)              1 (b)

Table 5.8: Parameter estimates for the model with home goals as dependent variable.
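The columns of Table 5.8 are linked by the usual Wald relations, which the sketch below retraces for Lazio's home attack coefficient: the statistic is (B / Std. Error) squared, the interval is B plus or minus the normal 97.5% quantile times the standard error, and the p-value is the upper tail of a chi-square distribution with 1 degree of freedom. We assume that SPSS uses these standard formulas; small differences from the table arise because B and the standard error are rounded.

```python
import math
from statistics import NormalDist

def wald_summary(b, se, level=0.95):
    """Wald chi-square, confidence interval and p-value for one coefficient."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for 95%
    wald = (b / se) ** 2
    # Upper tail of a chi-square with 1 df: P(Z^2 > w) = erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(wald / 2))
    return {"wald": wald, "lower": b - z * se, "upper": b + z * se, "p": p_value}

# Lazio, home_team block of Table 5.8: B = 0.738, Std. Error = 0.2488.
lazio = wald_summary(0.738, 0.2488)
print({name: round(value, 3) for name, value in lazio.items()})
```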

Parameter Estimates

                                           95% Wald Confidence Interval   Hypothesis Test
Parameter           B         Std. Error   Lower      Upper               Wald Chi-Square   df   Sig.
(Intercept)         -0.19     0.374        -0.923      0.543              0.257             1    0.612
home_team
  [Bari]             0.615    0.3597       -0.09       1.32               2.925             1    0.087
  [Brescia]          1.014    0.3542        0.32       1.708              8.195             1    0.004
  [Cagliari]        -0.111    0.4181       -0.931      0.708              0.071             1    0.79
  [Cremonese]        0.138    0.3941       -0.635      0.91               0.122             1    0.727
  [Fiorentina]       0.542    0.3632       -0.17       1.254              2.226             1    0.136
  [Foggia]           0.263    0.3825       -0.486      1.013              0.474             1    0.491
  [Genoa]            0.397    0.3736       -0.335      1.129              1.13              1    0.288
  [Internazionale]   0.143    0.3941       -0.629      0.916              0.132             1    0.716
  [Juventus]        -0.053    0.4134       -0.863      0.757              0.017             1    0.898
  [Lazio]            0.284    0.3798       -0.46       1.028              0.559             1    0.455
  [Milan]           -0.127    0.421        -0.953      0.698              0.092             1    0.762
  [Napoli]           0.435    0.3695       -0.289      1.159              1.384             1    0.239
  [Padova]           0.407    0.3736       -0.326      1.139              1.185             1    0.276
  [Parma]           -0.236    0.43         -1.079      0.607              0.301             1    0.583
  [Reggiana]         0.651    0.3709       -0.076      1.378              3.078             1    0.079
  [Roma]            -0.468    0.4583       -1.366      0.431              1.042             1    0.307
  [Sampdoria]        0.369    0.3738       -0.364      1.101              0.973             1    0.324
  [Torino]           0 (a)    .             .          .                  .                 .    .
away_team
  [Bari]            -0.211    0.3462       -0.889      0.468              0.37              1    0.543
  [Brescia]         -1.612    0.5555       -2.701     -0.524              8.427             1    0.004
  [Cagliari]        -0.211    0.3463       -0.89       0.467              0.373             1    0.541
  [Cremonese]       -0.588    0.3808       -1.335      0.158              2.387             1    0.122
  [Fiorentina]       0.226    0.3151       -0.392      0.843              0.512             1    0.474
  [Foggia]          -0.61     0.3825       -1.359      0.14               2.541             1    0.111
  [Genoa]           -0.567    0.3805       -1.313      0.179              2.219             1    0.136
  [Internazionale]  -0.075    0.3349       -0.731      0.582              0.05              1    0.824
  [Juventus]         0.62     0.3023        0.028      1.213              4.209             1    0.04
  [Lazio]            0.006    0.3305       -0.642      0.654              0                 1    0.986
  [Milan]            0.445    0.2998       -0.143      1.033              2.203             1    0.138
  [Napoli]          -0.151    0.34         -0.817      0.516              0.196             1    0.658
  [Padova]          -0.444    0.3694       -1.168      0.28               1.442             1    0.23
  [Parma]            0.048    0.3363       -0.611      0.707              0.021             1    0.886
  [Reggiana]        -0.799    0.4086       -1.6        0.002              3.826             1    0.05
  [Roma]            -0.017    0.3248       -0.654      0.62               0.003             1    0.958
  [Sampdoria]       -0.167    0.3402       -0.834      0.499              0.242             1    0.623
  [Torino]           0 (a)    .             .          .                  .                 .    .
team_strengths       0.025    0.0156       -0.005      0.056              2.631             1    0.105
(Scale)              1 (b)

Table 5.9: Parameter estimates for the model with away goals as dependent variable.


Unlike the Bayesian results, one of the generalised linear model parameter estimates in each set (that of Torino) has been set to 0 so that it serves as the base parameter, and all other parameters are interpreted as differences from it. In the case of the dependent home goals model the base parameter (0.692) is Torino’s home attack, whilst in the case of the dependent away goals model the base parameter (–0.19) is the same team’s home defense.
From Table 5.8, the best home attack parameter is 0.692 + 0.738 = 1.43 and is associated with Lazio, while the worst is 0.692 – 0.746 = –0.054 and belongs to Brescia. It is interesting to note that these teams scored the most (69) and the fewest (18) goals respectively. These two estimates are also the only home attack parameters in Table 5.8 with a p-value below 0.05.
A similar story holds for the away attack, where Brescia again had the worst parameter estimate. Using Table 5.9, this was –0.19 – 1.612 = –1.802. The best away attack parameter, however, was reserved for the league champions. Juventus’ away attack parameter is –0.19 + 0.62 = 0.43, and it probably embodies the real team strength behind their success.
With regard to defense, the interpretation of the parameters is the other way round: a higher coefficient indicates more goals conceded. Brescia reconfirmed its insecure position with the worst home defense parameter (1.014 – 0.19 = 0.824), but in the away defense category Padova did the worst with 0.692 + 0.113 = 0.805. Furthermore the two Roman teams (Roma & Lazio) had the best home (–0.19 – 0.468 = –0.658) and away (0.692 – 0.743 = –0.051) defense parameter estimates respectively. However Roma had a very strong away defense parameter as well.
Consequently we should consider the goodness of fit of this generalised linear model application. Table 5.10 comprises the associated:
$$\mathrm{AIC} = -2\ln L(\hat{\theta}) + 2k \qquad \text{and} \qquad \mathrm{BIC} = -2\ln L(\hat{\theta}) + k\ln n,$$
where $L(\hat{\theta})$ represents the model likelihood, $k$ refers to the number of parameters, and $n$ is the number of observations.

Goodness of Fit

                                         Dependent Variable:   Dependent Variable:
                                         home goals            away goals
Akaike's Information Criterion (AIC)     959.556               813.567
Bayesian Information Criterion (BIC)     1093.605              947.616

Table 5.10: AICs and BICs for the generalised linear model.
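The two criteria differ only in their penalty terms, so that BIC – AIC = k (ln n – 2). The sketch below checks this against Table 5.10, assuming k = 36 estimated parameters (the degrees of freedom 1 + 17 + 17 + 1 of Table 5.7) and n = 306 matches (Table 5.6); both values are inferred from those tables rather than stated alongside Table 5.10.

```python
import math

k = 36    # assumed: intercept + 17 home_team + 17 away_team + team_strengths
n = 306   # assumed: number of matches, as reported for TS in Table 5.6

fit = {"home goals": (959.556, 1093.605), "away goals": (813.567, 947.616)}

penalty_gap = k * (math.log(n) - 2)   # k*ln(n) - 2k
gaps = {}
log_likelihoods = {}
for model, (aic, bic) in fit.items():
    gaps[model] = bic - aic
    # Rearranging AIC = -2 ln L + 2k gives the maximised log-likelihood.
    log_likelihoods[model] = (2 * k - aic) / 2
    print(f"{model}: BIC - AIC = {gaps[model]:.3f}, "
          f"ln L = {log_likelihoods[model]:.3f}")
print(f"k (ln n - 2) = {penalty_gap:.3f}")
```

Since both models carry the same penalty, any difference between their AICs (or BICs) reflects a difference in the maximised log-likelihoods alone.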

Similarly to the DIC statistic which was considered earlier for the Bayesian model, smaller values of the AIC and BIC indicate better model fits. In this respect, both criteria in Table 5.10 show that the Poisson regression fit was better when the away goals were considered as the dependent variable.
Finally, the last part of this section was designed to identify any possible outliers that could have influenced the results. In this respect Figure 5.5 plots the differences between the observed and the predicted goals for both the dependent home goals and the dependent away goals models.

[Figure 5.5 consists of two panels, residuals(homegoals) and residuals(awaygoals), plotting the residuals (vertical axis, –4 to 6) against the match number.]

Figure 5.5: Residuals for the 2 generalised linear models.
From the above residual plots one can note that in general there is an element of randomness, and no particular patterns are present. This is a good indication that the residuals behave as expected.
Both plots feature a few exceptions where large residuals are observed. These are the so-called outliers, which mainly result from the most unexpected match scores. For instance, two of the most unexpected heavy wins featured in the above plots are Torino vs Internazionale (5 – 0) and Padova vs Napoli (5 – 0) respectively.

5.6 Bayesian vs GLM

Overall the two models considered during this study were both interesting for their own diverse characteristics. Whilst the Bayesian method generated one set of attack and defense parameters along with a fixed home effect, the generalised linear model differed by developing two sets of parameters for the same team. For instance, if we consider Juventus’s attacking style when playing at home:

From Table 5.2:   home + att[Juventus] = 0.4919 + 0.3164 = 0.8083
From Table 5.8:   intercept + home attack [Juventus] = 0.692 + 0.192 = 0.884

These two results are not the same, but are however quite comparable. Consequently in order to understand whether this was just an exception, the following figure plots the differences between the respective home attack parameters of both models.

[Figure 5.6 is a bar chart titled "Home Attack Differences", with one bar per team (Bari to Torino) and a vertical axis ranging from –0.6 to 0.2.]

Figure 5.6: Plotting the differences between the home attack parameters.
In fact this graph illustrates that some discrepancies do exist. And basically the same situation was observed for the remaining 3 categories of parameter estimates as well.
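The comparison made above for Juventus can be repeated for other teams with a short sketch. The values are taken from Tables 5.2 and 5.8 for a handful of teams; taking the difference as Bayesian minus GLM is our own assumption about the sign convention of the plot.

```python
# Bayesian posterior means (Table 5.2) and GLM estimates (Table 5.8).
HOME_EFFECT = 0.4919
INTERCEPT = 0.692
bayes_att = {"Brescia": -0.5688, "Juventus": 0.3164,
             "Lazio": 0.4007, "Torino": 0.02676}
glm_home_att = {"Brescia": -0.746, "Juventus": 0.192,
                "Lazio": 0.738, "Torino": 0.0}   # Torino is the base team

differences = {}
for team in bayes_att:
    bayes = HOME_EFFECT + bayes_att[team]    # home + att[team]
    glm = INTERCEPT + glm_home_att[team]     # intercept + home attack[team]
    differences[team] = bayes - glm          # assumed sign convention
    print(f"{team:10s} Bayesian {bayes:7.4f}   GLM {glm:7.4f}   "
          f"difference {differences[team]:7.4f}")
```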


However, in conclusion we can outline that although the respective parameter estimates did differ from each other, the main characteristics of the season (which particularly regarded the performances of Juventus, Lazio, Roma, Brescia and Padova amongst others) were pointed out by both models.

Chapter 6

Conclusion

Throughout this dissertation, Markov theory and Bayesian statistics were fundamental in the modelling process of football data. In particular, the Markov Chain Monte Carlo technique was useful and interesting to work with, even if it was computationally time-demanding to obtain the parameter estimates of the participating teams for all sorts of models. Validity and interpretation can be quite tricky, but such exercises are nowadays being utilised in all fields of application.
Recapitulating our efforts, the football league under study was the Italian Serie A season 1994/1995. This was chosen because of its tendency to feature some of the slowest-played football, and hence a style that could be somewhat more amenable to mathematical modelling. As in many of the studies consulted, the first data characteristic to be noted was overdispersion. This was mainly because the goals considered originate from a list of different teams (with distinct qualities), and thus the population is not homogeneous.
Eventually it was pointed out that the Bayesian application was probably too simplistic to cope with the complexity with which the actual data generating mechanism operates. When compared to a reference generalised linear model, the respective parameter estimates were not particularly close. The dependent variables used were the home goals and away goals sequences. However with regard to the generalised linear model, the situation became fairly interesting with the introduction of more predictors.
For example, the appropriately generated sequence of team strengths played a very important role in the generalised linear model. This new TS variable was intended to measure the actual difference between two opposing teams by considering the results of up to 6 previous match days. As confirmed by the change in the Wald Chi-Square statistics, it managed to improve the contributions of both model predictors used (home & away teams).
However one must also admit that the final conclusions are very much the same. For instance both models singled out Brescia’s poor performance whilst recognising the prestigious performances of Juventus and Lazio amongst others. Thus it can be concluded that the two models are quite comparable when explaining the overall season performance. In this respect the results are considered to be quite satisfying.
Meanwhile the Poisson regression model was originally not expected to provide an exceptional fit. The assumption that goals follow a Poisson distribution has its own limitations as well. For instance the nonexistent upper bound of this particular distribution theoretically allows match scores to tend to infinity, a situation which is definitely not realistic.
In addition, when the whole procedure was repeated for a particular season of the English Premier League (EPL), the results were too inconsistent to be considered here. The situation encountered brings to mind Reep and Benjamin’s statement that “chance does dominate the game”. Nevertheless this was probably because of the faster and more spontaneous football that is present in the EPL. In this respect, apart from the fact that each football season is characterised by its own different story, this highlights that different countries offer different dynamics and styles which might need different modelling techniques.
With regard to the future of this field, there is no doubt that software improvements will continue to enhance interest. However in a more technical framework the situation could develop at a more delicate pace. At this point, rather than modelling team attacks and defenses, an interesting idea which could probably entice people in the gambling sector is to focus on systems that make use of direct player evaluations.

Appendix A
Matlab m-files

A.1

Calculating team strengths (using previous 6 match days)

Input: List of season’s participating teams (teams), final league table for previous season (c0(1,:)), sequence of scores (scores), and respective teams (home & away).

winpoints=3;             % points awarded for a win
n=18;                    % number of teams
c=zeros(2*(n-1),n);      % c(i,t): points won by team t on match day i

% translate the home and away team names of every match into team indices
for i=1:n*(n-1)
    team=home(i,:); k=0;
    for j=1:n
        if team==teams(j,:)
            k=j;
        end
    end
    h(i,1)=k;
    team=away(i,:); k=0;
    for j=1:n
        if team==teams(j,:)
            k=j;
        end
    end
    a(i,1)=k;
end

% points gained by each team on each match day
for i=1:2*(n-1)
    for j=1:n/2
        k=j+(i-1)*n/2;                      % index of match j on match day i
        u=scores(k,1)-scores(k,2);          % goal difference (home - away)
        if u>0
            c(i,h(k))=c(i,h(k))+winpoints;  % home win
        elseif u==0
            c(i,h(k))=c(i,h(k))+1;          % draw
            c(i,a(k))=c(i,a(k))+1;
        else
            c(i,a(k))=c(i,a(k))+winpoints;  % away win
        end
    end
end

% first round, first match
TS=[];
for k=1:n/2
    TS=[TS
        c0(h(k))-c0(a(k))];
end

% later match days: difference in points over (up to) the previous 6 match days
for i=2:2*(n-1)
    for j=1:n/2
        k=j+(i-1)*n/2;
        l=max(1,i-6);
        u=max(1,i-1);
        v=sum(c(l:u,h(k)))-sum(c(l:u,a(k)));
        TS=[TS
            v];
    end
end


Appendix B
WinBUGS files (Baio & Blangiardo (2010))

B.1

Bayesian mixture model

Input: Home & Away teams, scores, attack & defense prior categories, and initial values.

model {
for (i in 1:ngames) {
# observed no of goals
Homegoals[i] ~ dpois(lambda[i,1])
Awaygoals[i] ~ dpois(lambda[i,2])
# Predictive distribution for the number of goals scored
ynew[i,1] ~ dpois(lambda[i,1])
ynew[i,2] ~ dpois(lambda[i,2])
# scoring intensity
log(lambda[i,1])
