# Data Analysis and Visualization

This guide was created to support Tulane faculty, staff, and students as they undertake data analysis and visualization tasks. If you have any questions about this guide or suggestions for improving its content, reach out to scholarlyengagement@tulane.edu.

## Statistical Analysis

Descriptive statistics are used to describe measures of frequency, central tendency, variation, and position of variables in a data set.

Measures of Frequency

Measuring the frequency of values in a data set consists of counting how often each value occurs and presenting that count in context with the rest of the data set. For example, you may administer a survey to people from the United States and find that 50/100 (50%) of your respondents are from the West Coast. You might use this measure to make decisions or investigate further.
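
As a rough illustration of the survey example above, the following Python sketch counts how often each value occurs and expresses each count as a share of the whole data set. The `responses` list is made-up data, not real survey results.

```python
from collections import Counter

# Hypothetical survey responses: the region each respondent lives in.
responses = ["West Coast"] * 50 + ["East Coast"] * 30 + ["Midwest"] * 20

counts = Counter(responses)      # absolute frequency of each value
total = sum(counts.values())     # number of respondents

# Relative frequency: each count as a share of the whole data set.
relative = {region: n / total for region, n in counts.items()}

for region, n in counts.most_common():
    print(f"{region}: {n}/{total} ({relative[region]:.0%})")
```

The first line printed is `West Coast: 50/100 (50%)`, matching the example in the text.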

Measures of Central Tendency

Measures of central tendency give you a sense of your data's "average-ness," helping you understand the data's mid-point. These measures include:

Mean (Average) - Sum of all values divided by the number of values

Median - Middle value of an ordered set of numeric values

Mode - Value that occurs most frequently in a given set
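
The three measures above can be computed with Python's built-in `statistics` module. This is a minimal sketch; the `scores` list is hypothetical data chosen so the arithmetic is easy to check by hand.

```python
import statistics

# Hypothetical test scores.
scores = [70, 85, 85, 90, 100]

mean = statistics.mean(scores)      # (70 + 85 + 85 + 90 + 100) / 5 = 86
median = statistics.median(scores)  # middle value of the ordered list = 85
mode = statistics.mode(scores)      # most frequent value = 85

print(mean, median, mode)
```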

Measures of Variation and Position

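Variation measures give insight into the spread or variability of the values of a variable in a data set, while position measures describe where a value falls relative to the rest of the data. Together, these measures help you determine the distance between data points.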

Range - The difference between the smallest number and largest number of a given set/variable

Quartiles - The three cut points (Q1, Q2, Q3) that divide an ordered set/variable into four equal sections

Percentiles - Values below which a certain percentage of the data can be found (Example: If a student scores in the 90th percentile on a standardized test, they scored the same as or higher than 90% of students who took the test)

Interquartile Range - The difference between the 3rd and 1st quartiles, which covers the middle 50% of the data (IQR = Q3 - Q1)

Variance - The average of the squared deviations from the mean, where each deviation is the difference between a value and the mean value

Standard Deviation - A measurement of the dispersion of data compared to the mean value; standard deviation is the square root of variance

Outliers - Term used to describe values that are outside of a specified range. The specified range is usually considered the preferred range for values in this set.
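
The measures in this list can be sketched in Python as follows. The `values` list is hypothetical. Two choices here are assumptions rather than part of this guide: quartiles are computed with the `inclusive` method of `statistics.quantiles` (other methods give slightly different cut points), and the "specified range" for outliers is taken to be the common 1.5 × IQR rule.

```python
import statistics

# Hypothetical measurements; 30 is deliberately far from the rest.
values = [2, 4, 4, 5, 7, 9, 10, 12, 30]

value_range = max(values) - min(values)  # largest minus smallest

# Three cut points that split the ordered data into four equal sections.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Population variance: average squared deviation from the mean.
# (statistics.variance would divide by n - 1 for sample data instead.)
variance = statistics.pvariance(values)
std_dev = statistics.pstdev(values)      # square root of the variance

# Assumed outlier rule: anything beyond 1.5 * IQR from the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(value_range, (q1, q2, q3), iqr, outliers)
```

With this data the quartiles are 4, 7, and 10, the IQR is 6, and the only value flagged as an outlier is 30.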

Inferential statistics are used to help researchers make estimations about larger populations based on sample data. When random samples are taken from a population, it is assumed that (1) as sample size increases, the sampling distribution of the sample mean becomes similar to the normal distribution, (2) the mean of the sample means equals the mean of the population, and (3) the standard deviation of the sample means represents the standard error. Inferential statistics include the following concepts:

Estimation

The use of a sample statistic to estimate a population parameter. One might use, for example, the sample's mean as an estimation of the population mean. This is especially useful if the sample distribution is similar to that of the population.

Point estimations - Point estimations are measures of sample data that can be used to determine parameters (measures) of the respective population

Probability - The likelihood of an event based on known information
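
As a small sketch of point estimation, the code below uses a sample mean as the estimate of a population mean and computes the standard error that indicates how much that estimate would vary from sample to sample. The `sample` values are hypothetical.

```python
import math
import statistics

# Hypothetical sample of measurements drawn from a larger population.
sample = [4.1, 5.0, 5.5, 6.2, 4.8, 5.9, 5.1, 5.4]

n = len(sample)
point_estimate = statistics.mean(sample)  # estimates the population mean
sample_sd = statistics.stdev(sample)      # sample SD (n - 1 denominator)

# Standard error of the mean: shrinks as the sample size grows.
standard_error = sample_sd / math.sqrt(n)

print(f"estimate = {point_estimate:.2f}, standard error = {standard_error:.3f}")
```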

Hypothesis Testing

Hypothesis testing involves performing a parametric or non-parametric statistical test to determine whether there is a statistically significant relationship between an independent and dependent variable or to estimate the difference between two or more groups. Tests are chosen based on data type and comparison factor (mean, variance, etc.).

In hypothesis testing, you create a null hypothesis that expresses the absence of a significant relationship or difference. The alternative hypothesis states that there is a significant relationship or difference in your data. You then either reject or fail to reject your null hypothesis based on your test's generated statistic and a probability value (p-value). The p-value is the probability of seeing results at least as extreme as yours if the null hypothesis were true. Traditionally, .05 or .01 is used as the threshold for significance. A p-value above the threshold (p > .05 or .01) suggests that the relationship or difference noted in your data could be due to chance, so you fail to reject your null hypothesis. If your p-value is below the threshold (p < .05 or .01), you can reject your null hypothesis, thus asserting that the difference or relationship is statistically significant.
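
The decision rule can be sketched with a one-sample z-test, one of the simpler parametric tests. This is only an illustration under stated assumptions: the sample is hypothetical, the population mean (100) and standard deviation (15) are assumed to be known, and `one_sample_z_test` is a helper written for this example rather than a library function.

```python
import math
from statistics import NormalDist, mean

def one_sample_z_test(sample, population_mean, population_sd):
    """Return (z, two-sided p-value) for H0: the sample mean equals population_mean."""
    standard_error = population_sd / math.sqrt(len(sample))
    z = (mean(sample) - population_mean) / standard_error
    # Probability of a result at least this extreme if H0 were true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical scores from 16 students.
sample = [112, 105, 118, 99, 121, 108, 115, 103,
          110, 119, 107, 113, 116, 102, 111, 109]

alpha = 0.05  # significance threshold chosen before running the test
z, p_value = one_sample_z_test(sample, population_mean=100, population_sd=15)
reject_null = p_value < alpha

print(f"z = {z:.2f}, p = {p_value:.4f}, reject null: {reject_null}")
```

Here the sample mean is 110.5, giving z = 2.80 and a p-value of about .005, which is below .05, so the null hypothesis is rejected.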

Confidence intervals and levels are also generated from inferential tests to help convey how precise the resulting estimates are.

Confidence Interval - The range of values that is used to estimate the true value of a population parameter based on a sample

Confidence Level - The percentage of confidence intervals, calculated the same way from repeated samples, that would be expected to contain the population parameter (for example, 95%)
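
The following sketch builds a confidence interval for a mean at a chosen confidence level. It assumes a normal approximation (a z critical value); for small samples a t-distribution is usually preferred and gives a slightly wider interval. The data and the helper name `mean_confidence_interval` are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence_level=0.95):
    """Return (lower, upper) bounds for the population mean (normal approximation)."""
    standard_error = stdev(sample) / math.sqrt(len(sample))
    # Critical value that leaves (1 - confidence_level) / 2 in each tail.
    z_critical = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    margin = z_critical * standard_error
    center = mean(sample)
    return center - margin, center + margin

# Hypothetical commute times in minutes.
sample = [22, 25, 19, 30, 27, 24, 21, 26, 28, 23, 25, 29]

low_95, high_95 = mean_confidence_interval(sample, 0.95)
low_99, high_99 = mean_confidence_interval(sample, 0.99)

print(f"95% CI: ({low_95:.1f}, {high_95:.1f})")
print(f"99% CI: ({low_99:.1f}, {high_99:.1f})")
```

Raising the confidence level from 95% to 99% widens the interval: more confidence that the interval contains the parameter comes at the cost of a less precise estimate.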

Parametric Tests

Parametric tests are used on data that have (1) normal distribution, (2) consistent variance across comparison groups, and (3) independent observations.

Parametric tests include:

• t-test
• z-test
• Paired t-test
• ANOVA

Non-Parametric Tests

Non-parametric tests can be used to determine significance of differences and relationships when your data does not meet one or more of the parametric assumptions.

Non-parametric tests include:

• Mann-Whitney test
• Wilcoxon signed rank test
• Spearman's rank correlation (Spearman's rho)
• Chi-Square Test of Independence
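
As one example from this list, a chi-square test of independence compares the counts observed in a contingency table with the counts expected if the two variables were unrelated. The sketch below uses a hypothetical 2x2 table and a helper written for this example. The p-value shortcut is valid only for 1 degree of freedom (a 2x2 table), and no continuity correction is applied.

```python
import math
from statistics import NormalDist

def chi_square_independence(table):
    """Return (chi2, degrees_of_freedom) for a contingency table given as rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof

# Hypothetical counts: rows are two groups, columns are yes/no answers.
table = [[30, 20],
         [20, 30]]

chi2, dof = chi_square_independence(table)

# With 1 degree of freedom, chi2 is the square of a standard normal value,
# so the p-value can be taken from the normal distribution.
p_value = 2 * (1 - NormalDist().cdf(math.sqrt(chi2)))

print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
```

Every expected count here is 25, so chi2 = 4.00 and the p-value is about .046.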

Regression Analysis

Regression analysis helps predict how much one variable changes in response to a change in another. Regression analysis generates a model/equation that can be used to predict outcomes based on a known variable.
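
A minimal sketch of simple linear regression, the one-predictor case, is shown below. It fits the line y = intercept + slope * x by ordinary least squares. The data and the helper `fit_line` are hypothetical and only meant to show how the model/equation is produced and then used for prediction.

```python
from statistics import mean

def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line through the points."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx                    # change in y per one-unit change in x
    intercept = y_bar - slope * x_bar    # predicted y when x is 0
    return slope, intercept

# Hypothetical data: hours studied and exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 67, 72]

slope, intercept = fit_line(hours, scores)
predicted = intercept + slope * 6        # predicted score for 6 hours

print(f"score = {intercept:.1f} + {slope:.1f} * hours; 6 hours -> {predicted:.1f}")
```

For this data the fitted slope is 4.9 points per hour of study, and the model predicts a score of about 76.7 for 6 hours.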

Correlation Analysis

Correlation analysis helps determine the strength and direction of a relationship between variables, presented as a number between -1.00 and 1.00. Values near -1.00 or 1.00 indicate a strong relationship, and values near 0 indicate a weak one.
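
One common correlation measure is Pearson's r, sketched below for hypothetical data. Pearson's r is chosen here as an assumption for illustration; rank-based measures such as Spearman's may suit data that do not meet parametric assumptions.

```python
import math
from statistics import mean

def pearson_r(x, y):
    """Return Pearson's correlation coefficient, a value between -1 and 1."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: hours studied and exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 67, 72]

r = pearson_r(hours, scores)
print(f"r = {r:.3f}")  # positive sign: scores rise as hours rise
```

The result is close to 1.00, indicating a strong positive relationship in this made-up data. A value near -1.00 would indicate that one variable falls as the other rises.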

For more on inferential statistics, check out the DATA ANALYSIS section on the SELECT READINGS page of this guide.

Need help determining which inferential test to use? Use the following flowchart to help with your analysis. Take a closer look using the following link: InferentialStatisticalDecisionMakingTrees.pdf (wikimedia.org)