BIOSTATISTICS: December 2010

Friday, December 10, 2010

Student's t Distribution

According to the central limit theorem, the sampling distribution of a statistic (like a sample mean) will follow a normal distribution, as long as the sample size is sufficiently large. Therefore, when we know the standard deviation of the population, we can compute a z-score, and use the normal distribution to evaluate probabilities with the sample mean.

But sample sizes are sometimes small, and often we do not know the standard deviation of the population. When either of these problems occurs, statisticians rely on the distribution of the t statistic (also known as the t score), whose values are given by:

t = [ x - μ ] / [ s / sqrt( n ) ]

where x is the sample mean, μ is the population mean, s is the standard deviation of the sample, and n is the sample size. The distribution of the t statistic is called the t distribution or the Student t distribution.
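
As a rough illustration, the t statistic can be computed directly from raw observations. The sketch below uses only Python's standard library; the function name t_statistic, the five data values, and the assumed population mean of 10 are made up for this example and are not part of the lesson.

```python
import math
import statistics

def t_statistic(sample, mu):
    """Return t = (x - mu) / (s / sqrt(n)) for a list of observations."""
    n = len(sample)
    x_bar = statistics.mean(sample)   # sample mean (x in the formula)
    s = statistics.stdev(sample)      # sample standard deviation (n - 1 in the denominator)
    return (x_bar - mu) / (s / math.sqrt(n))

# Hypothetical data: 5 measurements compared with an assumed population mean of 10.
sample = [9.0, 11.0, 10.0, 12.0, 13.0]
t = t_statistic(sample, mu=10.0)
print(round(t, 4))
```

Note that statistics.stdev divides by n - 1, which matches the sample standard deviation s used in the formula.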

Degrees of Freedom

There are actually many different t distributions. The particular form of the t distribution is determined by its degrees of freedom. The degrees of freedom refers to the number of independent observations in a set of data.

When estimating a mean score or a proportion from a single sample, the number of independent observations is equal to the sample size minus one. Hence, the distribution of the t statistic from samples of size 8 would be described by a t distribution having 8 - 1 or 7 degrees of freedom. Similarly, a t distribution having 15 degrees of freedom would be used with a sample of size 16.

For other applications, the degrees of freedom may be calculated differently. We will describe those computations as they come up.

Properties of the t Distribution

The t distribution has the following properties:

The mean of the distribution is equal to 0 .
The variance is equal to v / ( v - 2 ), where v is the degrees of freedom (see last section) and v > 2.
The variance is always greater than 1, although it is close to 1 when there are many degrees of freedom. With infinite degrees of freedom, the t distribution is the same as the standard normal distribution.
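
To see how the variance formula behaves, the short sketch below evaluates v / ( v - 2 ) for a few degrees of freedom. The helper name t_variance and the chosen values of v are illustrative assumptions.

```python
def t_variance(v):
    """Variance of a t distribution with v degrees of freedom (defined only for v > 2)."""
    if v <= 2:
        raise ValueError("the variance formula requires v > 2")
    return v / (v - 2)

# The variance stays above 1 but shrinks toward 1 as v grows.
variances = {v: t_variance(v) for v in (3, 7, 15, 30, 1000)}
for v, var in variances.items():
    print(v, round(var, 4))
```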

When to Use the t Distribution

The t distribution can be used with any statistic having a bell-shaped distribution (i.e., approximately normal). The central limit theorem states that the sampling distribution of a statistic will be normal or nearly normal, if any of the following conditions apply.

The population distribution is normal.
The sampling distribution is symmetric, unimodal, without outliers, and the sample size is 15 or less.
The sampling distribution is moderately skewed, unimodal, without outliers, and the sample size is between 16 and 40.
The sample size is greater than 40, without outliers.

The t distribution should not be used with small samples from populations that are not approximately normal.

Probability and the Student t Distribution

When a sample of size n is drawn from a population having a normal (or nearly normal) distribution, the sample mean can be transformed into a t score, using the equation presented at the beginning of this lesson. We repeat that equation below:

t = [ x - μ ] / [ s / sqrt( n ) ]

where x is the sample mean, μ is the population mean, s is the standard deviation of the sample, n is the sample size, and degrees of freedom are equal to n - 1.

Rules of Probability

Often, we want to compute the probability of an event from the known probabilities of other events. This lesson covers some important rules that simplify those computations.

Definitions and Notation

Before discussing the rules of probability, we state the following definitions:

Two events are mutually exclusive or disjoint if they cannot occur at the same time.
The probability that Event A occurs, given that Event B has occurred, is called a conditional probability. The conditional probability of Event A, given Event B, is denoted by the symbol P(A|B).
The complement of an event is the event not occurring. The probability that Event A will not occur is denoted by P(A').

The probability that Events A and B both occur is the probability of the intersection of A and B. The probability of the intersection of Events A and B is denoted by P(A ∩ B). If Events A and B are mutually exclusive, P(A ∩ B) = 0.
The probability that Events A or B occur is the probability of the union of A and B. The probability of the union of Events A and B is denoted by P(A ∪ B) .
If the occurrence of Event A changes the probability of Event B, then Events A and B are dependent. On the other hand, if the occurrence of Event A does not change the probability of Event B, then Events A and B are independent.

Rule of Subtraction

In a previous lesson, we learned two important properties of probability:

The probability of an event ranges from 0 to 1.
The sum of probabilities of all possible events equals 1.

The rule of subtraction follows directly from these properties.

Rule of Subtraction. The probability that event A will occur is equal to 1 minus the probability that event A will not occur.

P(A) = 1 - P(A')

Suppose, for example, the probability that Bill will graduate from college is 0.80. What is the probability that Bill will not graduate from college? Based on the rule of subtraction, the probability that Bill will not graduate is 1.00 - 0.80 or 0.20.

Rule of Multiplication

The rule of multiplication applies to the situation when we want to know the probability of the intersection of two events; that is, we want to know the probability that two events (Event A and Event B) both occur.

Rule of Multiplication. The probability that Events A and B both occur is equal to the probability that Event A occurs times the probability that Event B occurs, given that A has occurred.

P(A ∩ B) = P(A) P(B|A)

Example
An urn contains 6 red marbles and 4 black marbles. Two marbles are drawn without replacement from the urn. What is the probability that both of the marbles are black?

Solution: Let A = the event that the first marble is black; and let B = the event that the second marble is black. We know the following:

In the beginning, there are 10 marbles in the urn, 4 of which are black. Therefore, P(A) = 4/10.
After the first selection, there are 9 marbles in the urn, 3 of which are black. Therefore, P(B|A) = 3/9.

Therefore, based on the rule of multiplication:

P(A ∩ B) = P(A) P(B|A)
P(A ∩ B) = (4/10)*(3/9) = 12/90 = 2/15
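
The marble calculation can be checked with exact fractions. The sketch below applies the rule of multiplication and then, as an independent cross-check, counts every ordered draw of two marbles; the variable names are chosen for this illustration only.

```python
from fractions import Fraction
from itertools import permutations

# Rule of multiplication: P(A and B) = P(A) * P(B|A)
p_a = Fraction(4, 10)          # first marble is black
p_b_given_a = Fraction(3, 9)   # second marble is black, given the first was black
p_both = p_a * p_b_given_a

# Cross-check: list every ordered pair of distinct marbles and count the all-black pairs.
urn = ["red"] * 6 + ["black"] * 4
draws = list(permutations(range(len(urn)), 2))
both_black = sum(1 for i, j in draws if urn[i] == "black" and urn[j] == "black")
p_enumerated = Fraction(both_black, len(draws))

print(p_both, p_enumerated)
```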

Rule of Addition

The rule of addition applies to the following situation. We have two events, and we want to know the probability that either event occurs.

Rule of Addition. The probability that Event A or Event B occurs is equal to the probability that Event A occurs plus the probability that Event B occurs minus the probability that both Events A and B occur.

P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

Note: Invoking the fact that P(A ∩ B) = P( A )P( B | A ), the Addition Rule can also be expressed as

P(A ∪ B) = P(A) + P(B) - P(A)P( B | A )
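
A small worked case may help here. The sketch below uses one roll of a fair die, which is a hypothetical example not taken from the lesson, with Event A = the roll is even and Event B = the roll is greater than 3. It compares the rule of addition with a direct count of the union.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}   # one roll of a fair die (hypothetical example)
A = {2, 4, 6}                   # Event A: the roll is even
B = {4, 5, 6}                   # Event B: the roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))

p_union_rule = prob(A) + prob(B) - prob(A & B)   # rule of addition
p_union_direct = prob(A | B)                     # count the union directly

print(p_union_rule, p_union_direct)
```

Subtracting P(A ∩ B) prevents the outcomes 4 and 6, which belong to both events, from being counted twice.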

Simple Random Sampling

To understand sampling, you need to first understand a few basic definitions.

The total set of observations that can be made is called the population.
A sample is a subset of a population.
A parameter is a measurable characteristic of a population, such as a mean or standard deviation.
A statistic is a measurable characteristic of a sample, such as a mean or standard deviation.
A sampling method is a procedure for selecting sample elements from a population.
A random number is a number determined totally by chance, with no predictable relationship to any other number.
A random number table is a list of numbers, composed of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. Numbers in the list are arranged so that each digit has no predictable relationship to the digits that preceded it or to the digits that followed it. In short, the digits are arranged randomly. The numbers in a random number table are random numbers.

Simple Random Sampling

Simple random sampling refers to a sampling method that has the following properties.

The population consists of N objects.
The sample consists of n objects.
All possible samples of n objects are equally likely to occur.

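The main benefit of simple random sampling is that every possible sample has the same chance of being chosen, so the selection procedure does not systematically favor any part of the population. It does not guarantee that a particular sample is representative, but it is what justifies the probability calculations behind standard statistical conclusions.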

There are many ways to obtain a simple random sample. One way would be the lottery method. Each of the N population members is assigned a unique number. The numbers are placed in a bowl and thoroughly mixed. Then, a blind-folded researcher selects n numbers. Population members having the selected numbers are included in the sample.

Random Number Generator

In practice, the lottery method described above can be cumbersome, particularly with large sample sizes. As an alternative, a random number generator such as Stat Trek's Random Number Generator can select n random numbers quickly and easily.

Sampling With Replacement and Without Replacement

Suppose we use the lottery method described above to select a simple random sample. After we pick a number from the bowl, we can put the number aside or we can put it back into the bowl. If we put the number back in the bowl, it may be selected more than once; if we put it aside, it can be selected only one time.

When a population element can be selected more than one time, we are sampling with replacement. When a population element can be selected only one time, we are sampling without replacement.
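
The two sampling schemes can be sketched with Python's standard random module. The population of 20 numbered members, the sample size of 5, and the seed are arbitrary assumptions chosen for illustration.

```python
import random

rng = random.Random(2010)          # fixed seed so the sketch is repeatable (arbitrary choice)
population = list(range(1, 21))    # hypothetical population: members numbered 1 to 20
n = 5

# Without replacement: each member can appear at most once in the sample.
without_replacement = rng.sample(population, n)

# With replacement: the same member may be drawn more than once.
with_replacement = rng.choices(population, k=n)

print(without_replacement)
print(with_replacement)
```

Here random.sample plays the role of the lottery method with numbers set aside, while random.choices corresponds to returning each number to the bowl before the next draw.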

Poisson Distribution

Attributes of a Poisson Experiment

A Poisson experiment is a statistical experiment that has the following properties:

The experiment results in outcomes that can be classified as successes or failures.
The average number of successes (μ) that occurs in a specified region is known.
The probability that a success will occur is proportional to the size of the region.
The probability that a success will occur in an extremely small region is virtually zero.

Note that the specified region could take many forms. For instance, it could be a length, an area, a volume, a period of time, etc.

Notation

The following notation is helpful, when we talk about the Poisson distribution.

e: A constant equal to approximately 2.71828. (Actually, e is the base of the natural logarithm system.)
μ: The mean number of successes that occur in a specified region.
x: The actual number of successes that occur in a specified region.
P(x; μ): The Poisson probability that exactly x successes occur in a Poisson experiment, when the mean number of successes is μ.

Poisson Distribution

A Poisson random variable is the number of successes that result from a Poisson experiment. The probability distribution of a Poisson random variable is called a Poisson distribution.

Given the mean number of successes (μ) that occur in a specified region, we can compute the Poisson probability based on the following formula:

Poisson Formula. Suppose we conduct a Poisson experiment, in which the average number of successes within a given region is μ. Then, the Poisson probability is:

P(x; μ) = (e^-μ) (μ^x) / x!

where x is the actual number of successes that result from the experiment, and e is approximately equal to 2.71828.

The Poisson distribution has the following properties:

The mean of the distribution is equal to μ .
The variance is also equal to μ .

Example 1

The average number of homes sold by the Acme Realty company is 2 homes per day. What is the probability that exactly 3 homes will be sold tomorrow?

Solution: This is a Poisson experiment in which we know the following:

μ = 2; since 2 homes are sold per day, on average.
x = 3; since we want to find the likelihood that 3 homes will be sold tomorrow.
e = 2.71828; since e is a constant equal to approximately 2.71828.

We plug these values into the Poisson formula as follows:

P(x; μ) = (e^-μ) (μ^x) / x!
P(3; 2) = (2.71828^-2) (2^3) / 3!
P(3; 2) = (0.13534) (8) / 6
P(3; 2) = 0.180

Thus, the probability of selling 3 homes tomorrow is 0.180 .
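
The same calculation can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library; the function name poisson_probability is an assumption made for this example.

```python
import math

def poisson_probability(x, mu):
    """P(x; mu) = e^(-mu) * mu^x / x!"""
    return math.exp(-mu) * mu ** x / math.factorial(x)

# Acme Realty example: mu = 2 homes per day on average, x = 3 homes tomorrow.
p = poisson_probability(3, 2)
print(round(p, 3))
```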

Normal Distribution

The normal distribution refers to a family of continuous probability distributions described by the normal equation.

The Normal Equation

The normal distribution is defined by the following equation:

Normal equation. The value of the random variable Y is:

Y = { 1 / [ σ * sqrt(2π) ] } * e^( -(X - μ)^2 / (2σ^2) )

where X is a normal random variable, μ is the mean, σ is the standard deviation, π is approximately 3.14159, and e is approximately 2.71828.

The random variable X in the normal equation is called the normal random variable. The normal equation is the probability density function for the normal distribution.
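
For readers who want to evaluate the density numerically, the sketch below codes the normal equation as a function: one over σ times sqrt(2π), multiplied by e raised to the power -(X - μ)^2 / (2σ^2). The function name normal_pdf and the test values are assumptions made for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height Y of the normal curve at x, for mean mu and standard deviation sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coeff * math.exp(exponent)

# The curve is highest at the mean; a larger sigma lowers and widens it.
peak_narrow = normal_pdf(0.0, 0.0, 1.0)
peak_wide = normal_pdf(0.0, 0.0, 2.0)
print(round(peak_narrow, 4), round(peak_wide, 4))
```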

The Normal Curve

The graph of the normal distribution depends on two factors - the mean and the standard deviation. The mean of the distribution determines the location of the center of the graph, and the standard deviation determines the height and width of the graph. When the standard deviation is large, the curve is short and wide; when the standard deviation is small, the curve is tall and narrow. All normal distributions look like a symmetric, bell-shaped curve, as shown below.

The curve on the left is shorter and wider than the curve on the right, because the curve on the left has a bigger standard deviation.

Probability and the Normal Curve

The normal distribution is a continuous probability distribution. This has several implications for probability.

The total area under the normal curve is equal to 1.
The probability that a normal random variable X equals any particular value is 0.
The probability that X is greater than a equals the area under the normal curve bounded by a and plus infinity (as indicated by the non-shaded area in the figure below).
The probability that X is less than a equals the area under the normal curve bounded by a and minus infinity (as indicated by the shaded area in the figure below).

Additionally, every normal curve (regardless of its mean or standard deviation) conforms to the following "rule".

About 68% of the area under the curve falls within 1 standard deviation of the mean.
About 95% of the area under the curve falls within 2 standard deviations of the mean.
About 99.7% of the area under the curve falls within 3 standard deviations of the mean.

Collectively, these points are known as the empirical rule or the 68-95-99.7 rule. Clearly, given a normal distribution, most outcomes will be within 3 standard deviations of the mean.
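
The three percentages can be checked numerically. The sketch below relies on the standard identity that the area within k standard deviations of the mean equals erf(k / sqrt(2)); the helper name area_within is an assumption made for this example.

```python
import math

def area_within(k):
    """Area under any normal curve within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))

areas = [area_within(k) for k in (1, 2, 3)]
for k, area in zip((1, 2, 3), areas):
    print(k, round(100 * area, 2))
```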

Standard Normal Distribution

The standard normal distribution is a special case of the normal distribution. It is the distribution that occurs when a normal random variable has a mean of zero and a standard deviation of one.

The normal random variable of a standard normal distribution is called a standard score or a z-score. Every normal random variable X can be transformed into a z score via the following equation:

z = (X - μ) / σ

where X is a normal random variable, μ is the mean of X, and σ is the standard deviation of X.

The Normal Distribution as a Model for Measurements

Often, phenomena in the real world follow a normal (or near-normal) distribution. This allows researchers to use the normal distribution as a model for assessing probabilities associated with real-world phenomena. Typically, the analysis involves two steps.

Transform raw data. Usually, the raw data are not in the form of z-scores. They need to be transformed into z-scores, using the transformation equation presented earlier: z = (X - μ) / σ.
Find probability. Once the data have been transformed into z-scores, you can use standard normal distribution tables, online calculators (e.g., Stat Trek's free normal distribution calculator), or handheld graphing calculators to find probabilities associated with the z-scores.
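
The two steps above can be sketched with Python's statistics.NormalDist in place of a printed table. The raw score of 130, the mean of 100, and the standard deviation of 15 are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Step 1: transform a raw value into a z-score, z = (X - mu) / sigma."""
    return (x - mu) / sigma

# Hypothetical values: raw score 130 from a normal population with mean 100 and sd 15.
z = z_score(130, mu=100, sigma=15)

# Step 2: look up probabilities on the standard normal distribution (mean 0, sd 1).
p_below = NormalDist().cdf(z)   # P(Z < z)
p_above = 1 - p_below           # P(Z > z)

print(z, round(p_below, 4), round(p_above, 4))
```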

Measures of Central Tendency

Several different measures of central tendency are defined below.

The mode is the most frequently appearing value in the population or sample. Suppose we draw a sample of five women and measure their weights. They weigh 100 pounds, 100 pounds, 130 pounds, 140 pounds, and 150 pounds. Since more women weigh 100 pounds than any other weight, the mode would equal 100 pounds.
To find the median, we arrange the observations in order from smallest to largest value. If there is an odd number of observations, the median is the middle value. If there is an even number of observations, the median is the average of the two middle values. Thus, in the sample of five women, the median value would be 130 pounds; since 130 pounds is the middle weight.
The mean of a sample or a population is computed by adding all of the observations and dividing by the number of observations. Returning to the example of the five women, the mean weight would equal (100 + 100 + 130 + 140 + 150)/5 = 620/5 = 124 pounds.
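
The three measures for the sample of five women can be reproduced with Python's standard statistics module, as in this short sketch.

```python
import statistics

weights = [100, 100, 130, 140, 150]   # pounds, from the example above

mode = statistics.mode(weights)       # most frequently appearing value
median = statistics.median(weights)   # middle value of the sorted observations
mean = statistics.mean(weights)       # sum of observations divided by their count

print(mode, median, mean)
```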

Proportions and Percentages

When the focus is on the degree to which a population possesses a particular attribute, the measure of interest is a percentage or a proportion.

A proportion refers to the fraction of the total that possesses a certain attribute. For example, we might ask what proportion of women in our sample weigh less than 135 pounds. Since 3 women weigh less than 135 pounds, the proportion would be 3/5 or 0.60.
A percentage is another way of expressing a proportion. A percentage is equal to the proportion times 100. In our example of the five women, the percent of the total who weigh less than 135 pounds would be 100 * (3/5) or 60 percent.
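
Continuing with the same five weights, the proportion and percentage can be computed as follows. The variable names are illustrative assumptions; q is included to show the complement of the sample proportion.

```python
weights = [100, 100, 130, 140, 150]   # pounds, the same sample of five women

count = sum(1 for w in weights if w < 135)   # women weighing less than 135 pounds
p = count / len(weights)                     # sample proportion with the attribute
q = 1 - p                                    # sample proportion without the attribute
percentage = 100 * p

print(count, p, q, percentage)
```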

Notation

Of the various measures, the mean and the proportion are most important. The notation used to describe these measures appears below:

X: Refers to a population mean (written as μ in the formulas elsewhere in this document).
x: Refers to a sample mean.
P: The proportion of elements in the population that has a particular attribute.
p: The proportion of elements in the sample that has a particular attribute.
Q: The proportion of elements in the population that does not have a specified attribute. Note that Q = 1 - P.
q: The proportion of elements in the sample that does not have a specified attribute. Note that q = 1 - p.

Note that capital letters refer to population parameters, and lower-case letters refer to sample statistics.