
Ministry of Education and Science of the Russian Federation

Federal Agency for Education of the City of Irkutsk

Baikal State University of Economics and Law

Department of Informatics and Cybernetics

Chi-square distribution and its applications

Kolmykova Anna Andreevna

2nd year student

group IS-09-1

Irkutsk 2010

Introduction

1. Chi-square distribution

Application

Conclusion

Bibliography

Introduction

How are the approaches, ideas and results of probability theory used in our lives?

The basis is a probabilistic model of a real phenomenon or process, i.e. a mathematical model in which objective relationships are expressed in terms of probability theory. Probabilities are used primarily to describe the uncertainties that must be taken into account when making decisions. This refers to both undesirable opportunities (risks) and attractive ones (“lucky chance”). Sometimes randomness is deliberately introduced into a situation, for example, when drawing lots, randomly selecting units for control, conducting lotteries or conducting consumer surveys.

Probability theory allows some probabilities to be used to calculate other probabilities of interest to the researcher.

A probabilistic model of a phenomenon or process is the foundation of mathematical statistics. Two parallel series of concepts are used - those related to theory (probabilistic model) and those related to practice (sampling of observation results). For example, the theoretical probability corresponds to the frequency found from the sample. The mathematical expectation (theoretical series) corresponds to the sample arithmetic mean (practical series). As a rule, sample characteristics are estimates of theoretical ones. At the same time, quantities related to the theoretical series “are in the heads of researchers”, relate to the world of ideas (according to the ancient Greek philosopher Plato), and are not available for direct measurement. Researchers have only sample data with which they try to establish the properties of a theoretical probabilistic model that interest them.

Why do we need a probabilistic model? The fact is that only with its help can the properties established from the analysis of a specific sample be transferred to other samples, as well as to the entire so-called general population. The term "population" is used when referring to a large but finite collection of units being studied. For example, about the totality of all residents of Russia or the totality of all consumers of instant coffee in Moscow. The goal of marketing or sociological surveys is to transfer statements obtained from a sample of hundreds or thousands of people to populations of several million people. In quality control, a batch of products acts as a general population.

To transfer conclusions from a sample to a larger population requires some assumptions about the relationship of the sample characteristics with the characteristics of this larger population. These assumptions are based on an appropriate probabilistic model.

Of course, it is possible to process sample data without using one or another probabilistic model. For example, one can calculate a sample arithmetic mean, count the frequency with which certain conditions are fulfilled, and so on. However, the calculation results will relate only to the specific sample; transferring the conclusions obtained with them to any other population is incorrect. This activity is sometimes called "data analysis." Compared to probabilistic-statistical methods, data analysis has limited inferential value.

So, the use of probabilistic models based on estimation and testing of hypotheses using sample characteristics is the essence of probabilistic-statistical methods of decision making.

Chi-square distribution

Using the normal distribution, three distributions are defined that are now often used in statistical data processing. These are the Pearson (“chi-square”), Student and Fisher distributions.

We will focus on the χ² ("chi-square") distribution. This distribution was first studied by the astronomer F. Helmert in 1876. In connection with the Gaussian theory of errors, he studied sums of squares of n independent standard normally distributed random variables. Karl Pearson later gave this distribution function the name "chi-square," and the distribution now bears his name.

Due to its close connection with the normal distribution, the χ2 distribution plays an important role in probability theory and mathematical statistics. The χ2 distribution, and many other distributions that are defined by the χ2 distribution (for example, the Student distribution), describe sample distributions of various functions from normally distributed observation results and are used to construct confidence intervals and statistical tests.

The Pearson ("chi-square") distribution is the distribution of the random variable

χ² = X1² + X2² + … + Xn²,

where X1, X2, …, Xn are independent normal random variables, each with mathematical expectation zero and standard deviation one.

This sum of squares is distributed according to the χ² ("chi-square") law.

In this case the number of terms, i.e. n, is called the "number of degrees of freedom" of the chi-square distribution. As the number of degrees of freedom increases, the distribution slowly approaches the normal one.
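As a quick illustration (not part of the original paper), the definition above can be checked by simulation: summing squares of n standard normal variables produces chi-square variates whose sample mean is close to n.

```python
import random

# Monte Carlo sketch: build chi-square variates as sums of squares of
# n standard normal variables and check that the sample mean is close
# to n, the theoretical expectation.
random.seed(0)

def chi_square_variate(n):
    """One chi-square variate with n degrees of freedom."""
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n))

n = 5
samples = [chi_square_variate(n) for _ in range(20000)]
mean = sum(samples) / len(samples)
print(round(mean, 1))  # close to n = 5
```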

The density of this distribution is

f(x) = x^(n/2 − 1) · e^(−x/2) / (2^(n/2) · Γ(n/2)) for x ≥ 0.

So the χ² distribution depends on one parameter n, the number of degrees of freedom.

The distribution function of χ² has the form

F(x) = ∫₀ˣ f(t) dt, x ≥ 0. (2.7)
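A small sketch (not in the original): the standard chi-square density f(x) = x^(n/2 − 1) · e^(−x/2) / (2^(n/2) · Γ(n/2)) for x ≥ 0 can be evaluated directly with the standard library.

```python
import math

def chi2_density(x, n):
    """Density of the chi-square distribution with n degrees of freedom."""
    if x < 0:
        return 0.0
    return x ** (n / 2 - 1) * math.exp(-x / 2) / (2 ** (n / 2) * math.gamma(n / 2))

# For n = 2 the density reduces to 0.5 * exp(-x / 2)
print(chi2_density(1.0, 2))  # 0.5 * exp(-0.5) ≈ 0.3033
```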

Figure 1 shows a graph of the probability density and the χ2 distribution function for different degrees of freedom.

Figure 1. Probability density φ(x) of the χ² (chi-square) distribution for different numbers of degrees of freedom.

Moments of the chi-square distribution: the mathematical expectation is M(χ²) = n and the variance is D(χ²) = 2n.

The chi-square distribution is used in estimating variance (using a confidence interval), testing hypotheses of agreement, homogeneity, independence, primarily for qualitative (categorized) variables that take a finite number of values, and in many other tasks of statistical data analysis.

2. "Chi-square" in problems of statistical data analysis

Statistical methods of data analysis are used in almost all areas of human activity. They are used whenever it is necessary to obtain and justify any judgments about a group (objects or subjects) with some internal heterogeneity.

The modern stage of development of statistical methods can be dated from 1900, when the Englishman K. Pearson founded the journal "Biometrika". The first third of the twentieth century passed under the sign of parametric statistics. Methods were studied based on the analysis of data from parametric families of distributions described by the Pearson family of curves. The most popular was the normal distribution. To test hypotheses, the Pearson, Student, and Fisher tests were used. The maximum likelihood method and analysis of variance were proposed, and the basic ideas of experiment design were formulated.

The chi-square distribution is one of the most widely used in statistics for testing statistical hypotheses. Based on the chi-square distribution, one of the most powerful goodness-of-fit tests is constructed - the Pearson chi-square test.

A goodness-of-fit test is a test of the hypothesis that an unknown distribution follows an assumed law.

The χ² (chi-square) test can be used to test hypotheses about a wide variety of distributions; this is its advantage.

The criterion is computed by the formula

χ² = Σ (m − m′)² / m′,

where m and m′ are, respectively, the empirical and theoretical frequencies of the distribution in question, and n is the number of degrees of freedom.

To check, we need to compare empirical (observed) and theoretical (calculated under the assumption of a normal distribution) frequencies.

If the empirical frequencies coincide completely with the calculated (expected) frequencies, then Σ(E − T) = 0 and the χ² criterion will also equal zero. If Σ(E − T) is not equal to zero, this indicates a discrepancy between the calculated frequencies and the empirical frequencies of the series. In such cases it is necessary to assess the significance of the χ² criterion, which can theoretically vary from zero to infinity. This is done by comparing the actually obtained value χ²_f with its critical value χ²_st. The null hypothesis, i.e. the assumption that the discrepancy between the empirical and theoretical (expected) frequencies is random, is rejected if χ²_f is greater than or equal to χ²_st for the accepted significance level (α) and number of degrees of freedom (n).

The quantitative study of biological phenomena necessarily requires the creation of hypotheses with which to explain these phenomena. To test a particular hypothesis, a series of special experiments are carried out and the actual data obtained are compared with those theoretically expected according to this hypothesis. If there is a coincidence, this may be sufficient reason to accept the hypothesis. If the experimental data do not agree well with the theoretically expected ones, great doubt arises about the correctness of the proposed hypothesis.

The degree to which the actual data correspond to the expected (hypothetical) data is measured by the chi-square test:

χ² = Σ (E − T)² / T,

where E is the actually observed count of the characteristic in the i-th group, T is the theoretically expected count (indicator) for that group, and k is the number of data groups.

The criterion was proposed by K. Pearson in 1900 and is sometimes called the Pearson criterion.

Task. Among 164 children who inherited a factor from one parent and a factor from the other, there were 46 children with the factor, 50 with the factor, 68 with both. Calculate the expected frequencies for a 1:2:1 ratio between groups and determine the degree of agreement of the empirical data using the Pearson test.

Solution: The ratio of observed frequencies is 46:68:50, theoretically expected 41:82:41.

Let us set the significance level at 0.05. The table value of the Pearson criterion for this significance level with the number of degrees of freedom n = 3 − 1 = 2 is 5.99. The computed value is χ²_f = (46 − 41)²/41 + (68 − 82)²/82 + (50 − 41)²/41 ≈ 4.98. Since χ²_f = 4.98 < 5.99, the hypothesis that the experimental data correspond to the theoretical data can be accepted.
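The arithmetic of this task can be sketched in a few lines (the 5.99 table value for significance level 0.05 and two degrees of freedom is the one quoted in the text):

```python
# Observed frequencies 46 : 68 : 50 against the 1:2:1 expectation
# (41 : 82 : 41) for 164 children.
observed = [46, 68, 50]
expected = [41, 82, 41]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
critical = 5.99  # table value for significance level 0.05, n = 2

print(round(chi2, 2))   # ≈ 4.98
print(chi2 < critical)  # True: the 1:2:1 hypothesis is accepted
```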

Note that when calculating the chi-square test we no longer impose the condition that the distribution must be normal. The chi-square test can be used for any distributions that we are free to choose in our assumptions; in this sense the criterion is universal.

Another application of the Pearson test is to compare an empirical distribution with the Gaussian normal distribution; in this role it belongs to the group of tests for checking the normality of a distribution. The only limitation is that the total number of values (options) when using this criterion must be large enough (at least 40), and the number of values in individual classes (intervals) must be at least 5; otherwise adjacent intervals should be combined. When checking normality, the number of degrees of freedom should be calculated as n = M − 3, where M is the number of classes after combining: three degrees of freedom are lost because the total count, the mean and the standard deviation of the theoretical distribution are estimated from the sample.

Fisher criterion.

This parametric test is used to test the null hypothesis that the variances of normally distributed populations are equal:

H0: σ1² = σ2².

With small sample sizes, the Student test can be applied correctly only if the variances are equal. Therefore, before testing the equality of sample means, it is necessary to verify that the Student t test may be used.

The test statistic is the ratio of the larger sample variance to the smaller one:

F = s1² / s2², where s1² ≥ s2²,

and N1, N2 are the sample sizes, while ν1 = N1 − 1 and ν2 = N2 − 1 are the numbers of degrees of freedom for these samples.

When using the tables, note that the number of degrees of freedom for the sample with the larger variance is chosen as the column number of the table, and that for the smaller variance as the row number.

For the significance level α we find the table value F_st in the tables of mathematical statistics. If F > F_st, the hypothesis of equality of variances is rejected at the chosen significance level.

Example. The effect of cobalt on the body weight of rabbits was studied. The experiment was carried out on two groups of animals: experimental and control. The experimental subjects received a diet supplement in the form of an aqueous solution of cobalt chloride. During the experiment, weight gain was in grams:



The distribution of probable values of the random variable χ² is continuous and asymmetric. It depends on the number of degrees of freedom (n) and approaches the normal distribution as the number of observations increases. Therefore, applying the χ² criterion to discrete distributions involves certain errors that affect its value, especially for small samples. To obtain more accurate estimates, the sample arranged into a variation series should contain at least 50 options. Correct application of the χ² criterion also requires that the frequencies of variants in the extreme classes be no less than 5; if a frequency is less than 5, it is combined with the frequencies of neighboring classes so that the total is greater than or equal to 5. After frequencies are combined, the number of classes (N) decreases, and the number of degrees of freedom is determined by the new number of classes, taking into account the number of restrictions on the freedom of variation.

Since the accuracy of the χ² criterion depends largely on the accuracy of the theoretical frequencies (T), unrounded theoretical frequencies should be used when forming the differences between the empirical and calculated frequencies.
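The class-combining rule described above can be sketched as follows (a simple illustration, not code from the source):

```python
def merge_small_classes(freqs, minimum=5):
    """Merge adjacent classes until every class frequency >= minimum."""
    merged = list(freqs)
    i = 0
    while i < len(merged) and len(merged) > 1:
        if merged[i] < minimum:
            # fold the small class into a neighboring class
            j = i + 1 if i + 1 < len(merged) else i - 1
            merged[j] += merged[i]
            del merged[i]
            i = 0  # re-scan from the start after each merge
        else:
            i += 1
    return merged

merged = merge_small_classes([2, 3, 12, 30, 12, 4])
print(merged)  # every class now holds at least 5 observations
```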

As an example, let's take a study published on a website dedicated to the application of statistical methods in the humanities.

The Chi-square test allows you to compare frequency distributions regardless of whether they are normally distributed or not.

Frequency refers to the number of occurrences of an event. One usually deals with event frequencies when variables are measured on a nominal scale and their other characteristics, besides frequency, are impossible or problematic to obtain; in other words, when a variable has qualitative characteristics. Also, many researchers tend to convert test scores into levels (high, average, low) and build tables of score distributions to find out the number of people at these levels. To prove that the number of people really is greater (or smaller) in one of the levels (one of the categories), the chi-square coefficient is also used.

Let's look at the simplest example.

A test assessing self-esteem was administered to younger adolescents. The test scores were converted into three levels: high, average, low. The frequencies were distributed as follows:

High: 27 people.

Average: 12 people.

Low: 11 people.

It is obvious that the majority of children have high self-esteem, but this needs to be proven statistically. To do this, we use the Chi-square test.

Our task is to check whether the obtained empirical data differ from theoretically equally probable ones. To do this, you need to find the theoretical frequencies. In our case, theoretical frequencies are equally probable frequencies, which are found by adding all frequencies and dividing by the number of categories.

In our case:

(High + Average + Low)/3 = (27 + 12 + 11)/3 ≈ 16.67

Formula for calculating the chi-square test:

χ² = Σ (E − T)² / T

We build the table:

Level    | Empirical (E) | Theoretical (T) | (E − T)² / T
High     | 27            | 16.67           | 6.40
Average  | 12            | 16.67           | 1.31
Low      | 11            | 16.67           | 1.93

Find the sum of the last column: χ² = 6.40 + 1.31 + 1.93 = 9.64.

Now we need to find the critical value of the criterion using the table of critical values (Table 1 in the Appendix). For this we need the number of degrees of freedom (n).

n = (R - 1) * (C - 1)

where R is the number of rows in the table, C is the number of columns.

In our case, there is only one column (meaning the original empirical frequencies) and three rows (categories), so the formula changes - we exclude the columns.

n = (R - 1) = 3-1 = 2

For the error probability p = 0.05 and n = 2, the critical value is χ² = 5.99.

The obtained empirical value is greater than the critical value: the differences between the frequencies are significant (χ² = 9.64; p < 0.05).
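The whole computation above fits in a few lines (a sketch of the worked example, not code from the source):

```python
# Recomputing the self-esteem example: 27 / 12 / 11 children against
# equal theoretical frequencies of (27 + 12 + 11) / 3.
observed = [27, 12, 11]
theoretical = sum(observed) / len(observed)  # 50 / 3 ≈ 16.67

chi2 = sum((e - theoretical) ** 2 / theoretical for e in observed)
print(round(chi2, 2))  # 9.64, above the critical value 5.99 for n = 2
```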

As you can see, calculating the criterion is very simple and does not take much time. The practical value of the chi-square test is enormous. This method is most valuable when analyzing responses to questionnaires.

Let's look at a more complex example.

For example, a psychologist wants to know whether it is true that teachers are more biased against boys than against girls, i.e. are more likely to praise girls. To do this, the psychologist analyzed the characteristics of students written by teachers for the frequency of occurrence of three words: "active," "diligent," "disciplined"; synonyms of these words were also counted. Data on the frequency of occurrence of the words were entered into a table:

To process the obtained data we use the chi-square test.

To do this, we will build a table of the distribution of empirical frequencies, i.e. those frequencies that we observe:

Theoretically, we expect the frequencies to be distributed equally, i.e. proportionally between boys and girls. Let us build a table of theoretical frequencies. To do this, multiply the row sum by the column sum and divide the result by the grand total.

The final table for calculations will look like this:

χ² = Σ (E − T)² / T

n = (R - 1), where R is the number of rows in the table.

In our case, chi-square = 4.21; n = 2.

Using the table of critical values of the criterion we find: for n = 2 and an error level of 0.05, the critical value is χ² = 5.99.

The resulting value is less than the critical value, which means the null hypothesis is accepted.

Conclusion: teachers do not attach importance to the gender of the child when writing characteristics for him.

Application

Critical points of the χ² distribution

The chi-square test is a universal method for checking the agreement between the results of an experiment and the statistical model used.

The Pearson distance X²

Pyatnitsky A.M.

Russian State Medical University

In 1900, Karl Pearson proposed a simple, universal and effective way to test the agreement between model predictions and experimental data. The “chi-square test” he proposed is the most important and most commonly used statistical test. Most problems related to estimating unknown model parameters and checking the agreement between the model and experimental data can be solved with its help.

Let there be an a priori ("pre-experimental") model of the object or process being studied (in statistics one speaks of the "null hypothesis" H0), and results of an experiment with this object. We must decide whether the model is adequate (does it correspond to reality)? Do the experimental results contradict our ideas of how reality works or, in other words, should H0 be rejected? Often this task can be reduced to comparing the observed (Oi = Observed) and the expected under the model (Ei = Expected) average frequencies of occurrence of certain events. It is assumed that the observed frequencies were obtained in a series of N independent (!) observations made under constant (!) conditions. As a result of each observation, one of M events is recorded. These events cannot occur simultaneously (they are pairwise incompatible), and one of them necessarily occurs (their union forms a certain event). The totality of all observations reduces to a table (vector) of frequencies (Oi) = (O1, …, OM), which completely describes the result of the experiment. The value O2 = 4 means that event number 2 occurred 4 times. The sum of the frequencies is O1 + … + OM = N. It is important to distinguish two cases: N fixed (non-random) and N a random variable. For a fixed total number of experiments N, the frequencies have a multinomial distribution. Let us illustrate this general scheme with a simple example.

Using the chi-square test to test simple hypotheses.

Let the model (null hypothesis H0) be that the die is fair: all faces appear equally often, with probability pi = 1/6, i = 1, …, 6; M = 6. An experiment was conducted in which the die was thrown 60 times (N = 60 independent trials). Under the model we expect all observed frequencies Oi of occurrence of 1, 2, …, 6 points to be close to their mean values Ei = N·pi = 60·(1/6) = 10. Under H0 the vector of mean frequencies is (Ei) = (N·pi) = (10, 10, 10, 10, 10, 10). (Hypotheses under which the mean frequencies are completely known before the experiment begins are called simple.) If the observed vector (Oi) were (34, 0, 0, 0, 0, 26), it would be immediately clear that the model is wrong: the die cannot be fair, since only 1s and 6s came up in 60 throws. The probability of such an event for a fair die is negligible: P = (2/6)^60 ≈ 2.4·10⁻²⁹. However, such obvious discrepancies between model and experiment are the exception. Let the vector of observed frequencies (Oi) be (5, 15, 6, 14, 4, 16). Is this consistent with H0? So we must compare two frequency vectors, (Ei) and (Oi). Here the vector of expected frequencies (Ei) is not random, while the vector of observed frequencies (Oi) is random: in the next experiment (a new series of 60 throws) it would come out different. It is useful to introduce a geometric interpretation of the problem and assume that in frequency space (here 6-dimensional) two points are given, with coordinates (5, 15, 6, 14, 4, 16) and (10, 10, 10, 10, 10, 10). Are they far enough apart to be considered incompatible with H0? In other words, we need:

  1. learn to measure distances between frequencies (points in frequency space),
  2. have a criterion for what distance should be considered too (“implausibly”) large, that is, inconsistent with H 0 .

The square of the ordinary Euclidean distance would be

X²_Euclid = Σ (Oi − Ei)² = (5−10)² + (15−10)² + (6−10)² + (14−10)² + (4−10)² + (16−10)² = 154.

In this case the surfaces X²_Euclid = const are always spheres if we fix the values Ei and vary Oi. Karl Pearson noted that the Euclidean distance should not be used in frequency space. Thus, it is wrong to assume that the points (O = 1030, E = 1000) and (O = 40, E = 10) are equally far apart, even though in both cases the difference is O − E = 30. After all, the higher the expected frequency, the larger the deviations from it that should be considered possible. Therefore the points (O = 1030, E = 1000) should be considered "close," and the points (O = 40, E = 10) "far" from each other. It can be shown that if the hypothesis H0 is true, the fluctuations of the frequency Oi about Ei are of the order of the square root(!) of Ei. Therefore Pearson proposed squaring, when calculating the distance, not the differences (Oi − Ei) but the normalized differences (Oi − Ei)/√Ei. So here is the formula by which the Pearson distance is calculated (it is in fact the square of a distance):

X²_Pearson = Σ ((Oi − Ei)/√Ei)² = Σ (Oi − Ei)² / Ei

In our example:

X²_Pearson = (5−10)²/10 + (15−10)²/10 + (6−10)²/10 + (14−10)²/10 + (4−10)²/10 + (16−10)²/10 = 15.4
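Both distances for the dice data can be recomputed directly (a sketch verifying the arithmetic above):

```python
# The dice example recomputed: observed frequencies (5, 15, 6, 14, 4, 16)
# against the expected 10 per face in 60 throws.
observed = [5, 15, 6, 14, 4, 16]
expected = [10] * 6

euclid = sum((o - e) ** 2 for o, e in zip(observed, expected))
pearson = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(euclid)             # 154
print(round(pearson, 1))  # 15.4
```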

For a fair die all expected frequencies Ei are equal, but in general they differ, so surfaces on which the Pearson distance is constant (X²_Pearson = const) turn out to be ellipsoids rather than spheres.

Now that the formula for computing the distances has been chosen, we must find out which distances should be considered "not too large" (consistent with H0). For example, what can we say about the distance of 15.4 that we computed? In what percentage of cases (with what probability) would we obtain a distance greater than 15.4 in experiments with a fair die? If this percentage is small (< 0.05), then H0 must be rejected. In other words, we need to find the distribution of the Pearson distance. If all expected frequencies Ei are not too small (≥ 5) and H0 is true, then the normalized differences (Oi − Ei)/√Ei are approximately equivalent to standard Gaussian random variables: (Oi − Ei)/√Ei ≈ N(0, 1). This means, for example, that in 95% of cases |(Oi − Ei)/√Ei| < 1.96 ≈ 2 (the "two sigma" rule).

Explanation. The number of observations Oi falling into the table cell with number i has a binomial distribution with parameters m = N·pi = Ei and σ = √(N·pi·(1 − pi)), where N is the number of observations (N ≫ 1) and pi is the probability for a single observation to fall into the given cell (recall that the observations are independent and carried out under constant conditions). If pi is small, then σ ≈ √(N·pi) = √Ei and the binomial distribution is close to the Poisson distribution, in which the mean number of observations is Ei = λ and the standard deviation is σ = √λ = √Ei. For λ ≥ 5 the Poisson distribution is close to the normal N(m = Ei = λ, σ = √Ei = √λ), and the normalized value (Oi − Ei)/√Ei ≈ N(0, 1).

Pearson defined the random variable χ²_n, "chi-square with n degrees of freedom," as the sum of the squares of n independent standard normal random variables:

χ²_n = T1² + T2² + … + Tn², where all Ti ~ N(0, 1) are independent standard normal random variables.

Let us try to understand clearly the meaning of this random variable, the most important one in statistics. To do so, in the plane (for n = 2) or in space (for n = 3), picture a cloud of points whose coordinates are independent and have the standard normal distribution f_T(x) ~ exp(−x²/2). In the plane, by the "two sigma" rule applied independently to both coordinates, 90% (0.95 × 0.95 ≈ 0.90) of the points are contained in the square −2 < x, y < 2. The density of χ²_2 (the squared distance of such a point from the origin) turns out to be

f_{χ²_2}(a) = C·exp(−a/2) = 0.5·exp(−a/2).

With a sufficiently large number of degrees of freedom n (n > 30), the chi-square distribution approaches the normal distribution N(m = n, σ = √(2n)). This is a consequence of the "central limit theorem": a sum of identically distributed quantities with finite variance approaches the normal law as the number of terms increases.

In practice, one should remember that the mean squared distance is m(χ²_n) = n and its variance is σ²(χ²_n) = 2n. From this it is easy to conclude which chi-square values should be considered too small or too large: most of the distribution lies in the range from n − 2·√(2n) to n + 2·√(2n).

So, Pearson distances substantially exceeding n + 2·√(2n) should be considered implausibly large (inconsistent with H0). If the result is close to n + 2·√(2n), one should use tables to find out exactly in what proportion of cases such large chi-square values can appear.

It is important to know how to choose the number of degrees of freedom (abbreviated d.f.) correctly. It seemed natural to assume that n is simply equal to the number of cells: n = M. In his article Pearson suggested exactly this. In the dice example that would mean n = 6. However, several years later it was shown that Pearson was mistaken: the number of degrees of freedom is less than the number of cells whenever there are constraints linking the random variables Oi. In the dice example the sum of the Oi is 60, so only 5 frequencies can be varied independently, and the correct value is n = 6 − 1 = 5. For this value of n we get n + 2·√(2n) = 5 + 2·√10 = 11.3. Since 15.4 > 11.3, the hypothesis H0 (the die is fair) must be rejected.
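The rule of thumb just used can be checked in two lines (a sketch, using the Pearson distance 15.4 computed earlier for the dice):

```python
import math

# Most of the chi-square mass for n degrees of freedom lies below
# n + 2 * sqrt(2 * n).
n = 5  # six faces minus one linear constraint on the frequencies
bound = n + 2 * math.sqrt(2 * n)
print(round(bound, 1))  # 11.3

distance = 15.4  # Pearson distance from the dice experiment
print(distance > bound)  # True: H0 (a fair die) is rejected
```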

After the error was clarified, the existing χ² tables had to be supplemented, since initially they did not include the case n = 1 (the smallest number of digits being 2). Now it turns out that there are cases in which the Pearson distance has the χ² distribution with n = 1.

Example. In 100 coin tosses the number of heads is O₁ = 65 and the number of tails is O₂ = 35. The number of digits is M = 2. If the coin is symmetrical, the expected frequencies are E₁ = 50, E₂ = 50.

X² = Σ(Oᵢ − Eᵢ)²/Eᵢ = (65 − 50)²/50 + (35 − 50)²/50 = 2·225/50 = 9.

The resulting value should be compared with the values that the random variable χ² with n = 1 can take, defined as the square of a standard normal value: χ²₁ = T², and X² ≥ 9 ⇔ T ≥ 3 or T ≤ −3. The probability of such an event is very low: P(χ²₁ ≥ 9) = P(|T| ≥ 3) ≈ 0.003. Therefore the coin cannot be considered symmetrical: H₀ should be rejected. That the number of degrees of freedom cannot equal the number of digits is evident from the fact that the sum of the observed frequencies always equals the sum of the expected ones, for example O₁ + O₂ = 65 + 35 = E₁ + E₂ = 50 + 50 = 100. Therefore the random points with coordinates O₁ and O₂ lie on the straight line O₁ + O₂ = E₁ + E₂ = 100, and the distance to the center turns out to be less than if this restriction did not exist and the points could lie anywhere on the plane. Indeed, for two independent random variables with mathematical expectations E₁ = 50, E₂ = 50, the sum of their realizations need not always equal 100: for example, the values O₁ = 60, O₂ = 55 would be admissible.
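The coin computation is simple enough to check mechanically. Here is a minimal Python sketch of the Pearson distance (the function name is our own, chosen for the illustration):

```python
def pearson_distance(observed, expected):
    """Pearson distance X^2 = sum over digits of (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# 100 tosses: 65 heads and 35 tails; a symmetrical coin predicts 50/50.
x2 = pearson_distance([65, 35], [50, 50])
print(x2)  # 9.0
```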

Explanation. Let us compare the result of the Pearson criterion for M = 2 with what the Moivre-Laplace formula gives when estimating the random fluctuations of the frequency of occurrence ν = K/N of an event having probability p in a series of N independent Bernoulli trials (K is the number of successes):

χ²₁ = Σ(Oᵢ − Eᵢ)²/Eᵢ = (O₁ − E₁)²/E₁ + (O₂ − E₂)²/E₂ = (Nν − Np)²/(Np) + (N(1 − ν) − N(1 − p))²/(N(1 − p)) =

= (Nν − Np)²·(1/p + 1/(1 − p))/N = (Nν − Np)²/(Np(1 − p)) = ((K − Np)/(Npq)^½)² = T²

The quantity T = (K − Np)/(Npq)^½ = (K − m(K))/σ(K) is approximately N(0, 1) when σ(K) = (Npq)^½ ≥ 3. We see that in this case Pearson's result coincides exactly with what the normal approximation to the binomial distribution gives.
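This equivalence can be verified numerically for the coin data. The sketch below (an illustration, with q = 1 − p) computes both sides and shows that they agree:

```python
import math

def pearson_two_digits(k, n_trials, p):
    """Pearson distance for two digits: successes vs failures."""
    observed = [k, n_trials - k]
    expected = [n_trials * p, n_trials * (1 - p)]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def t_squared(k, n_trials, p):
    """Square of the standardized deviation T = (K - Np) / sqrt(Npq)."""
    t = (k - n_trials * p) / math.sqrt(n_trials * p * (1 - p))
    return t * t

x2 = pearson_two_digits(65, 100, 0.5)
t2 = t_squared(65, 100, 0.5)
print(x2, t2)  # both equal 9.0 for the coin data
```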

So far we have considered simple hypotheses for which the expected average frequencies E i are completely known in advance. For information on how to choose the correct number of degrees of freedom for complex hypotheses, see below.

Using the chi-square test to test complex hypotheses

In the examples with a regular die and coin, the expected frequencies could be determined before(!) the experiment. Such hypotheses are called “simple”. In practice, “complex hypotheses” are more common. Moreover, in order to find the expected frequencies E i, it is necessary to first estimate one or several quantities (model parameters), and this can only be done using experimental data. As a result, for “complex hypotheses” the expected frequencies E i turn out to depend on the observed frequencies O i and therefore themselves become random variables, varying depending on the results of the experiment. In the process of selecting parameters, the Pearson distance decreases - the parameters are selected so as to improve the agreement between the model and experiment. Therefore, the number of degrees of freedom should decrease.

How are model parameters estimated? There are many estimation methods: the maximum likelihood method, the method of moments, the substitution method. However, one can dispense with any additional tools and find parameter estimates by minimizing the Pearson distance itself. In the pre-computer era this approach was rarely used: it is inconvenient for manual calculations and, as a rule, cannot be carried out analytically. In computer calculations, numerical minimization is usually easy to perform, and the advantage of the method is its versatility. Thus, by the "chi-square minimization method" we choose the values of the unknown parameters so that the Pearson distance becomes smallest. (Incidentally, by studying how this distance changes under small displacements from the found minimum, one can estimate the accuracy of the estimate, i.e. construct confidence intervals.) After the parameters and the minimum distance itself have been found, it is again necessary to ask whether that distance is small enough.

The general sequence of actions is as follows:

  1. Model selection (hypothesis H₀).
  2. Selection of digits and determination of the vector of observed frequencies Oᵢ.
  3. Estimation of unknown model parameters and construction of confidence intervals for them (for example, by searching for the minimum Pearson distance).
  4. Calculation of expected frequencies Eᵢ.
  5. Comparison of the found value of the Pearson distance X² with the critical chi-square value χ²crit, the largest value still considered plausible, i.e. compatible with H₀. We find χ²crit from tables by solving the equation

P(χ²ₙ > χ²crit) = α,

where α is the significance level (also called the size of the criterion, or the probability of a type I error); a typical value is α = 0.05.

Usually the number of degrees of freedom n is calculated using the formula

n = (number of digits) – 1 – (number of parameters to be estimated)

If X² > χ²crit, the hypothesis H₀ is rejected; otherwise it is accepted. In α·100% of cases (that is, quite rarely) this method of testing H₀ leads to an error of the first kind: the hypothesis H₀ is rejected erroneously.

Example. In 10 series of 100 seeds each, the number of seeds infected by the green-eyed fly was counted. The data obtained: Oᵢ = (16, 18, 11, 18, 21, 10, 20, 18, 17, 21).

Here the vector of expected frequencies is unknown in advance. If the data are homogeneous and come from a binomial distribution, then one parameter is unknown: the proportion p of infected seeds. Note that the original table in fact contains not 10 but 20 frequencies, which satisfy 10 relations: 16 + 84 = 100, ..., 21 + 79 = 100.

X² = (16 − 100p)²/(100p) + (84 − 100(1 − p))²/(100(1 − p)) + ... + (21 − 100p)²/(100p) + (79 − 100(1 − p))²/(100(1 − p))

Combining terms in pairs (as in the example with a coin), we obtain the form of writing the Pearson criterion, which is usually written immediately:

X² = (16 − 100p)²/(100p(1 − p)) + ... + (21 − 100p)²/(100p(1 − p)).

Now, if the minimum of the Pearson distance is used as the method for estimating p, then we must find the p for which X² = min. (The model tries, as far as possible, to "adjust itself" to the experimental data.)
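In the spirit of the chi-square minimization method, this estimate can be found by a crude numerical search. The sketch below uses a simple grid over p (the grid range and step are arbitrary choices made for the illustration):

```python
# Ten series of 100 seeds each; O_j = number of infected seeds in series j.
observed = [16, 18, 11, 18, 21, 10, 20, 18, 17, 21]

def pearson_distance(p, data=observed, n=100):
    """X^2(p) in the paired-digit form written above."""
    return sum((o - n * p) ** 2 for o in data) / (n * p * (1 - p))

# Grid search for the p minimizing X^2 (step 0.0001 on 0.01 <= p < 0.5).
best_p = min((i / 10000 for i in range(100, 5000)), key=pearson_distance)
print(best_p, pearson_distance(best_p))
```

The minimum-chi-square estimate turns out to be close to, though not exactly equal to, the simple frequency estimate p = 170/1000 = 0.17.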

The Pearson criterion is the most universal of all used in statistics. It can be applied to univariate and multivariate data, quantitative and qualitative features. However, precisely because of its versatility, one should be careful not to make mistakes.

Important points

1. Selection of digits (categories).

  • If the distribution is discrete, then there is usually no arbitrariness in the choice of digits.
  • If the distribution is continuous, then some arbitrariness is inevitable. Statistically equivalent blocks can be used (all Oᵢ equal, for example, to 10); the interval lengths are then different. In manual calculations, equal intervals used to be preferred. Must the intervals be equal when studying the distribution of a univariate trait? No.
  • The digits should be combined so that the expected (not the observed!) frequencies are not too small (≥ 5). Recall that it is they (the Eᵢ) that stand in the denominators when computing X²! When analyzing one-dimensional characteristics, this rule may be violated in the two extreme digits, where Eᵢ = 1 is allowed. If the number of digits is large and the expected frequencies are close, then X² is a good approximation of χ² even for Eᵢ = 2.

Parameter estimation. The use of ad hoc, inefficient estimation methods can lead to inflated values of the Pearson distance.

Choosing the right number of degrees of freedom. If parameter estimates are made not from frequencies, but directly from the data (for example, the arithmetic mean is taken as an estimate of the mean), then the exact number of degrees of freedom n is unknown. We only know that it satisfies the inequality:

(number of digits − 1 − number of parameters estimated) < n < (number of digits − 1)

Therefore, it is necessary to compare X² with the critical values χ²crit calculated over this entire range of n.

How should implausibly small chi-square values be interpreted? Should a coin be considered symmetrical if, after 10,000 tosses, it comes up heads exactly 5,000 times? Previously, many statisticians believed that H₀ should be rejected in this case too. Now another approach is proposed: accept H₀, but subject the data and the methodology of their analysis to additional verification. There are two possibilities: either a too-small Pearson distance means that an increase in the number of model parameters was not accompanied by a proper decrease in the number of degrees of freedom, or the data themselves were falsified (perhaps unintentionally adjusted to the expected result).

Example. Two researchers, A and B, calculated the proportion of recessive homozygotes aa in the second generation of an AA × aa monohybrid cross. According to Mendel's laws this proportion is 0.25. Each researcher conducted 5 experiments, with 100 organisms studied in each.

Results of A: 25, 24, 26, 25, 24. The researcher's conclusion: Mendel's law holds(?).

Results of B: 29, 21, 23, 30, 19. The researcher's conclusion: Mendel's law does not hold(?).

However, Mendel's law is statistical in nature, and quantitative analysis of the results reverses both conclusions! Combining the five experiments into one, we arrive at a chi-square distribution with 5 degrees of freedom (a simple hypothesis is being tested):

X²_A = ((25−25)² + (24−25)² + (26−25)² + (25−25)² + (24−25)²)/(100·0.25·0.75) = 0.16

X²_B = ((29−25)² + (21−25)² + (23−25)² + (30−25)² + (19−25)²)/(100·0.25·0.75) = 5.17

The mean value is m[χ²₅] = 5, the standard deviation is σ[χ²₅] = (2·5)^½ ≈ 3.2.

Therefore, even without consulting tables it is clear that the value X²_B is typical, while the value X²_A is implausibly small. According to the tables, P(χ²₅ < 0.16) < 0.0001.
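The two distances are easy to recompute. The sketch below reproduces the arithmetic of the combined test (five series, expected share p = 0.25, 100 organisms per series; the function name is ours):

```python
def combined_x2(counts, n=100, p=0.25):
    """Pearson distance for several series in the paired-digit form:
    sum of (O_j - n*p)^2 / (n*p*(1-p)) over the series."""
    return sum((o - n * p) ** 2 for o in counts) / (n * p * (1 - p))

x2_a = combined_x2([25, 24, 26, 25, 24])  # researcher A
x2_b = combined_x2([29, 21, 23, 30, 19])  # researcher B
print(round(x2_a, 2), round(x2_b, 2))  # 0.16 5.17
```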

This example is an adaptation of a real case that occurred in the 1930s (see A. N. Kolmogorov's paper "On a New Confirmation of Mendel's Laws"). Interestingly, researcher A was a proponent of genetics, and researcher B was an opponent.

Confusion in notation. One must distinguish the Pearson distance, whose calculation requires additional conventions, from the mathematical concept of a chi-square random variable. Under certain conditions the Pearson distance has a distribution close to chi-square with n degrees of freedom. Therefore it is advisable NOT to denote the Pearson distance by the symbol χ²ₙ but to use the similar yet distinct notation X².

The Pearson criterion is not omnipotent. There are infinitely many alternatives to H₀ that it is unable to take into account. Suppose you are testing the hypothesis that a feature has a uniform distribution, you have 10 digits, and the vector of observed frequencies is (130, 125, 121, 118, 116, 115, 114, 113, 111, 110). The Pearson criterion cannot "notice" that the frequencies decrease monotonically, and H₀ will not be rejected. If it were supplemented with the runs criterion, it would be!

The chi-square distribution is one of the most widely used in statistics for testing statistical hypotheses. Based on the chi-square distribution, one of the most powerful goodness-of-fit tests is constructed - the Pearson chi-square test.

A goodness-of-fit criterion is a criterion for testing a hypothesis about the assumed law of an unknown distribution.

The χ² (chi-square) test is used to test hypotheses about a wide variety of distributions; this is its advantage.

The criterion is calculated by the formula

χ² = Σ(m − m′)² / m′,

where m and m′ are, respectively, the empirical and theoretical frequencies of the distribution in question, and n is the number of degrees of freedom.

To check, we need to compare empirical (observed) and theoretical (calculated under the assumption of a normal distribution) frequencies.

If the empirical frequencies coincide completely with the calculated (expected) frequencies, then all the differences E − T are zero and the χ² criterion is also zero. If the differences are not all zero, this indicates a discrepancy between the calculated and the empirical frequencies of the series. In such cases it is necessary to assess the significance of the χ² criterion, which in theory can vary from zero to infinity. This is done by comparing the actually obtained value χ²fact with its critical value χ²crit. The null hypothesis, i.e. the assumption that the discrepancy between the empirical and the theoretical (expected) frequencies is random, is rejected if χ²fact is greater than or equal to χ²crit for the accepted significance level α and number of degrees of freedom n.

The distribution of possible values of the random variable χ² is continuous and asymmetric. It depends on the number of degrees of freedom n and approaches the normal distribution as the number of observations increases. Therefore, applying the χ² criterion to the assessment of discrete distributions involves certain errors that affect its value, especially in small samples. To obtain more accurate estimates, the sample arranged into a variation series should contain at least 50 observations. Correct application of the χ² criterion also requires that the frequencies of variants in the extreme classes be no less than 5; if there are fewer than 5, they are combined with the frequencies of neighbouring classes so that the combined total is at least 5. After frequencies are combined, the number of classes N decreases accordingly. The number of degrees of freedom is then established from this reduced number of classes, taking into account the number of restrictions on the freedom of variation.
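The class-merging rule can be expressed as a small routine. The sketch below is only an illustration: the merging order here is a simplistic choice (each undersized class is merged with a neighbour), whereas the text speaks specifically of the extreme classes.

```python
def merge_small_classes(observed, expected, min_expected=5.0):
    """Merge classes until every expected frequency is at least
    min_expected; returns the merged (observed, expected) lists."""
    obs, exp = list(observed), list(expected)
    i = 0
    while i < len(exp):
        if exp[i] < min_expected and len(exp) > 1:
            # Merge class i into its right neighbour (left one at the end).
            j = i + 1 if i + 1 < len(exp) else i - 1
            obs[j] += obs[i]
            exp[j] += exp[i]
            del obs[i]
            del exp[i]
            i = 0  # rescan from the start after a merge
        else:
            i += 1
    return obs, exp

obs, exp = merge_small_classes([2, 3, 10, 12, 4], [1.5, 3.0, 11.0, 12.5, 2.0])
print(obs, exp)
```

Note that merging preserves the totals of both the observed and the expected frequencies.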



Since the accuracy of determining the χ2 criterion largely depends on the accuracy of calculating theoretical frequencies (T), unrounded theoretical frequencies should be used to obtain the difference between the empirical and calculated frequencies.

As an example, let's take a study published on a website dedicated to the application of statistical methods in the humanities.

The Chi-square test allows you to compare frequency distributions regardless of whether they are normally distributed or not.

Frequency refers to the number of occurrences of an event. One usually deals with the frequency of occurrence of events when variables are measured on a nominal scale and their characteristics other than frequency are impossible or problematic to measure; in other words, when a variable has qualitative characteristics. Also, many researchers tend to convert test scores into levels (high, medium, low) and build tables of score distributions to find out the number of people at each level. To prove that the number of people at one of the levels (in one of the categories) really is greater (or smaller), the chi-square coefficient is also used.

Let's look at the simplest example.

A test measuring self-esteem was administered to younger adolescents. The test scores were converted into three levels: high, medium, low. The frequencies were distributed as follows:

High: 27 people.

Medium: 12 people.

Low: 11 people.

It is obvious that the majority of children have high self-esteem, but this needs to be proven statistically. To do this, we use the Chi-square test.

Our task is to check whether the obtained empirical data differ from theoretically equally probable ones. To do this, you need to find the theoretical frequencies. In our case, theoretical frequencies are equally probable frequencies, which are found by adding all frequencies and dividing by the number of categories.

In our case:

(27 + 12 + 11)/3 = 16.67

Formula for calculating the chi-square test:

χ² = Σ(E − T)² / T

We build the calculation table: for each category the contribution is (E − T)²/T, i.e. (27 − 16.67)²/16.67 ≈ 6.41, (12 − 16.67)²/16.67 ≈ 1.31 and (11 − 16.67)²/16.67 ≈ 1.93. The sum of the last column gives χ² = 6.41 + 1.31 + 1.93 ≈ 9.64.

Now you need to find the critical value of the criterion using the table of critical values ​​(Table 1 in the Appendix). To do this we need the number of degrees of freedom (n).

n = (R - 1) * (C - 1)

where R is the number of rows in the table, C is the number of columns.

In our case there is only one column (the original empirical frequencies) and three rows (the categories), so the formula changes: the column factor is excluded.

n = (R − 1) = 3 − 1 = 2

For the error probability p≤0.05 and n = 2, the critical value is χ2 = 5.99.

The obtained empirical value is greater than the critical value - the differences in frequencies are significant (χ2= 9.64; p≤0.05).
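The whole computation fits in a few lines of Python (the sketch assumes, as in the text, equiprobable theoretical frequencies):

```python
counts = {"high": 27, "medium": 12, "low": 11}
total = sum(counts.values())    # 50 people in all
expected = total / len(counts)  # equiprobable theoretical frequency

chi2 = sum((o - expected) ** 2 / expected for o in counts.values())
print(round(chi2, 2))  # 9.64
```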

As you can see, calculating the criterion is very simple and does not take much time. The practical value of the chi-square test is enormous. This method is most valuable when analyzing responses to questionnaires.


Let's look at a more complex example.

For example, a psychologist wants to know whether it is true that teachers are more biased against boys than against girls, i.e. are more likely to praise girls. To do this, the psychologist analyzed the characteristics of students written by teachers, counting the frequency of occurrence of three words: "active," "diligent," "disciplined" (synonyms of these words were also counted). The data on the frequency of occurrence of the words were entered into a table.

To process the obtained data we use the chi-square test.

To do this, we build a table of the distribution of empirical frequencies, i.e. the frequencies that we actually observe.

Theoretically, we expect the frequencies to be distributed equally, i.e. each frequency distributed proportionally between boys and girls. Let us build the table of theoretical frequencies: multiply each row sum by each column sum and divide the result by the grand total s.

The final table for calculations will look like this:

χ² = Σ(E − T)² / T

n = (R − 1)·(C − 1), where R is the number of rows and C the number of columns of the table; here n = (3 − 1)·(2 − 1) = 2.

In our case, chi-square = 4.21; n = 2.

Using the table of critical values ​​of the criterion, we find: with n = 2 and an error level of 0.05, the critical value is χ2 = 5.99.
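The original frequency table did not survive in this copy, so the counts in the sketch below are purely hypothetical; the code only illustrates the computation scheme just described (expected frequency = row sum × column sum / grand total, with n = (R − 1)·(C − 1) degrees of freedom):

```python
# Hypothetical counts, for illustration only (the example's real table
# is not reproduced here): word frequencies in boys' vs girls' reports.
observed = {
    "active":      {"boys": 10, "girls": 7},
    "diligent":    {"boys": 5,  "girls": 11},
    "disciplined": {"boys": 4,  "girls": 8},
}

cols = ["boys", "girls"]
row_sums = {w: sum(observed[w][c] for c in cols) for w in observed}
col_sums = {c: sum(observed[w][c] for w in observed) for c in cols}
grand = sum(row_sums.values())

chi2 = 0.0
for w in observed:
    for c in cols:
        expected = row_sums[w] * col_sums[c] / grand
        chi2 += (observed[w][c] - expected) ** 2 / expected

df = (len(observed) - 1) * (len(cols) - 1)
print(round(chi2, 2), df)
```

With these made-up counts the statistic also comes out below the critical value 5.99, matching the qualitative conclusion of the example.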

The resulting value is less than the critical value, which means the null hypothesis is accepted.

Conclusion: teachers do not attach importance to the gender of the child when writing characteristics for him.


Conclusion.

K. Pearson made a significant contribution to the development of mathematical statistics, introducing a large number of its fundamental concepts. Pearson's main philosophical position can be formulated as follows: the concepts of science are artificial constructions, means for describing and ordering sensory experience; the rules for combining them into scientific statements are singled out by the grammar of science, which is the philosophy of science. The universal discipline of applied statistics makes it possible to connect disparate concepts and phenomena, although, according to Pearson, it is subjective.

Many of K. Pearson's constructions are directly related or developed using anthropological materials. He developed numerous methods of numerical classification and statistical criteria used in all areas of science.


Literature.

1. Bogolyubov A. N. Mathematics. Mechanics. Biographical reference book. - Kyiv: Naukova Dumka, 1983.

2. Kolmogorov A. N., Yushkevich A. P. (eds.). Mathematics of the 19th Century. - M.: Nauka. - Vol. I.

3. Borovkov A. A. Mathematical Statistics. - M.: Nauka, 1994.

4. Feller W. An Introduction to Probability Theory and Its Applications. - M.: Mir, Vol. 2, 1984.

5. Harman H. H. Modern Factor Analysis. - M.: Statistika, 1972.
