Study of the collection, analysis, interpretation, and presentation of data.
In this unit, we will delve into the fundamental concepts of statistics and probability, which form the backbone of data science. Understanding these concepts is crucial for data analysis and predictive modeling.
Statistics is a branch of mathematics dealing with the collection, analysis, interpretation, presentation, and organization of data. In data science, we primarily focus on two types of statistics:
Descriptive Statistics: These are methods for organizing, visualizing, and summarizing data. They provide simple summaries of a sample and its measurements. These summaries may be either quantitative (e.g., mean, median, mode) or visual (e.g., graphs and charts).
Inferential Statistics: This involves methods of using information from a sample to draw conclusions (inferences) about the population. It allows us to make predictions or generalizations about a population from a sample of data.
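To make the idea of inference concrete, the sketch below estimates a population mean from a small sample and attaches an approximate 95% confidence interval. The sample values and the helper name `mean_confidence_interval` are hypothetical, and the interval uses a normal approximation, which is an assumption; for very small samples a t-based interval is usually more appropriate.

```python
import math
import statistics


def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the population mean.

    Assumes the sample mean is roughly normally distributed.
    """
    n = len(sample)
    mean = statistics.mean(sample)
    # Standard error: sample standard deviation divided by sqrt(n)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value of the standard normal distribution
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err


# Hypothetical sample of 10 measurements (illustrative values only)
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 5.1, 4.9]
low, high = mean_confidence_interval(sample)
print(f"sample mean = {statistics.mean(sample):.2f}")
print(f"approximate 95% CI = ({low:.2f}, {high:.2f})")
```

The sample mean is a descriptive summary. The interval around it is the inferential step, since it makes a statement about the population the sample was drawn from.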
Data can be classified into four types:
Measures of central tendency are statistical measures that identify a single value as representative of an entire distribution. The three most common are the mean, the median, and the mode.
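As a minimal sketch, the mean, median, and mode can be computed with Python's standard `statistics` module. The `scores` list is hypothetical data chosen only for illustration.

```python
import statistics

# Hypothetical exam scores (illustrative values only)
scores = [70, 85, 85, 90, 100]

mean_score = statistics.mean(scores)      # arithmetic average: sum / count
median_score = statistics.median(scores)  # middle value of the sorted data
mode_score = statistics.mode(scores)      # most frequently occurring value

print("mean:", mean_score)      # 86
print("median:", median_score)  # 85
print("mode:", mode_score)      # 85
```

Here the mean is slightly higher than the median because the single score of 100 pulls the average upward, while the median depends only on the middle of the sorted data.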
Measures of dispersion are statistical measures that describe the variability or spread in a data set. The most common include the range, the variance, and the standard deviation.
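The following sketch computes a few widely used measures of spread (range, variance, and standard deviation) with the standard `statistics` module. The `data` list is hypothetical, and the choice between population and sample formulas depends on whether the data is the whole population or only a sample from it.

```python
import statistics

# Hypothetical data set (illustrative values only); its mean is 5
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)              # largest value minus smallest value
population_variance = statistics.pvariance(data)  # mean squared deviation from the mean
population_std = statistics.pstdev(data)          # square root of the population variance
sample_variance = statistics.variance(data)       # divides by n - 1 instead of n

print("range:", data_range)                        # 7
print("population variance:", population_variance)  # 4
print("population std dev:", population_std)        # 2.0
print("sample variance:", round(sample_variance, 3))
```

The sample variance is larger than the population variance because dividing by n - 1 corrects for the tendency of a sample to underestimate the spread of the population.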
Probability is a mathematical framework for quantifying uncertainty. It provides a way of summarizing the uncertainty that comes from our laziness and ignorance. It is an essential tool for predicting what will happen next, and so it underlies much of machine learning.
A probability distribution describes how the values of a random variable are distributed. It tells us which outcomes are likely, which are less likely, and how likely they are. Distributions fall into two broad classes, continuous and discrete, and each class includes a wide range of specific distributions:
Normal Distribution: Also known as the Gaussian distribution, this is a continuous probability distribution for a real-valued random variable. Its graph is bell-shaped and symmetric about the mean.
Binomial Distribution: A discrete probability distribution of the number of successes in a sequence of n independent experiments, each of which has two possible outcomes (success or failure) and the same probability of success.
Poisson Distribution: A discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space, assuming the events occur independently and at a constant average rate.
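The three distributions above can be explored with a short sketch that uses only the Python standard library. The helper names `binomial_pmf` and `poisson_pmf` and the example parameters are hypothetical, chosen for illustration; in practice a library such as SciPy is commonly used instead.

```python
import math
from statistics import NormalDist


def binomial_pmf(k, n, p):
    """P(X = k): k successes in n independent trials, each with success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(X = k): k events in an interval where lam events are expected on average."""
    return math.exp(-lam) * lam**k / math.factorial(k)


# Normal: probability that a value falls within one standard deviation of the mean
standard_normal = NormalDist(mu=0, sigma=1)
within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)

# Binomial: probability of exactly 5 heads in 10 fair coin flips
five_heads = binomial_pmf(5, n=10, p=0.5)

# Poisson: probability of exactly 2 events when 3 are expected per interval
two_events = poisson_pmf(2, lam=3)

print(f"normal, within 1 sd: {within_one_sd:.4f}")
print(f"binomial, 5 of 10:   {five_heads:.4f}")
print(f"poisson, 2 given 3:  {two_events:.4f}")
```

The normal result is computed from the cumulative distribution function because a continuous variable has probability only over ranges of values, whereas the binomial and Poisson results are probabilities of exact counts.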
Understanding these concepts will provide a solid foundation for the more advanced data science techniques to come.