Data Analysis¶
This tutorial is based on the following resources:
Introduction¶
- Data
- Why Python
- Libs
- Installation
What Kinds of Data¶
The focus here is the analysis of structured data, which includes:
- Two-dimensional (tabular or spreadsheet-like) data that has rows and columns. Each column may be a different type.
- Multidimensional arrays, also called matrices.
- Multiple tables of data interrelated by foreign keys. This is the data structure of a relational database.
- Time series data.
Why Python¶
- Most data analysis consists of a small portion of time-consuming computation and a large amount of "glue code" that doesn't run often. Python integrates well with C, C++, and Fortran code.
- Computationally intensive code should be implemented in C or C++.
- Python can be used for both prototyping and production.
- Python has rich libraries for data manipulation, numeric computing, machine learning, visualization, and natural language processing (NLP).
Essential Python Libraries¶
- NumPy (Numerical Python): matrix and numerical computing.
- pandas: high-level data structures and functions for processing tabular and time series data. It borrows the data frame idea from R.
- matplotlib: data visualization.
- SciPy: scientific computing, including calculus, linear algebra, signal processing, statistics, and more.
- scikit-learn: machine learning toolkit, including classification, regression, clustering, and more.
- statsmodels: classical statistics and econometrics.
- TensorFlow (Google) and PyTorch (Facebook): two popular deep learning libraries.
Installation¶
pip install jupyter # for Jupyter Notebook
pip install numpy matplotlib pandas seaborn statsmodels
pip install scipy scikit-learn patsy pyarrow numba
pip install lxml beautifulsoup4 html5lib openpyxl requests sqlalchemy
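After installation, a quick check can confirm that the packages are importable. The following is a minimal sketch using only the standard library; the helper name installed_packages is hypothetical, and the list covers only some of the packages above. Note that the import name can differ from the pip name (scikit-learn is imported as sklearn, beautifulsoup4 as bs4).

```python
from importlib.util import find_spec

# Import names for a few of the packages installed above.
packages = ["numpy", "pandas", "matplotlib", "seaborn",
            "scipy", "sklearn", "statsmodels", "bs4"]


def installed_packages(names):
    """Map each import name to True if Python can locate it, else False."""
    return {name: find_spec(name) is not None for name in names}


status = installed_packages(packages)
for name, ok in status.items():
    print(f"{name:12s} {'ok' if ok else 'missing'}")
```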
NumPy¶
NumPy stands for Numerical Python. It is a foundation for many Python numeric computation packages. It provides:
- ndarray: an array data type
- Array data operations
- Tools for reading/writing disk or memory files
- Linear algebra, random numbers, and Fourier transforms
- A C API for connecting NumPy with C, C++, or Fortran
Basic Functions¶
- Array operations
- Array algorithms like sorting, unique, and set operations
- Data alignment and relational data manipulation for heterogeneous datasets
- Conditional logic
- Group-wise data manipulations (aggregation, transformation, and function application)
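A few of the functions listed above can be sketched as follows. This is only an illustration on a small made-up array; the variable names are arbitrary.

```python
import numpy as np

arr = np.array([3, 1, 2, 3, 1])

sorted_arr = np.sort(arr)                # sorted copy: [1 1 2 3 3]
unique_vals = np.unique(arr)             # sorted unique values: [1 2 3]
common = np.intersect1d(arr, [2, 3, 4])  # values found in both: [2 3]
is_member = np.isin(arr, [1, 2])         # [False  True  True False  True]

# Conditional logic: choose between two values element by element.
clipped = np.where(arr > 2, 0, arr)      # [0 1 2 0 1]

print(sorted_arr, unique_vals, common, is_member, clipped)
```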
ndarray¶
- ndarray is a fast, flexible data type for large datasets.
- Operations apply to the whole array at once, so no explicit loops are needed.
import numpy as np
data = np.array([[1.5, -0.1, 3], [0, -3, 6.5]])
print(data * 10)
print(data + data)
print(data.shape)
print(data.dtype)
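To make the "whole array" idea concrete, the sketch below compares an explicit Python loop with the equivalent single array expression. It reuses the same sample values as above; the names looped and vectorized are illustrative only.

```python
import numpy as np

data = np.array([[1.5, -0.1, 3], [0, -3, 6.5]])

# Element-by-element loop written in plain Python.
looped = np.empty_like(data)
for i in range(data.shape[0]):
    for j in range(data.shape[1]):
        looped[i, j] = data[i, j] * 10

# The same computation as one array expression.
vectorized = data * 10

print(np.array_equal(looped, vectorized))  # True
print(data.ndim, data.shape, data.dtype)   # 2 (2, 3) float64
```

On large arrays the array expression is usually much faster, because the loop runs in compiled code rather than in the Python interpreter.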
Creating Arrays¶
- array: Convert input data (list, tuple, array, or other sequence type) to an ndarray, either by inferring a data type or by explicitly specifying one; copies the input data by default.
- ones, zeros: Produce an array of all 1s or all 0s with the given shape and data type.
- ones_like, zeros_like: Take another array and produce a ones/zeros array of the same shape and data type.
- empty, empty_like, full, full_like: Produce an array with uninitialized data (empty) or filled with a given value (full).
- eye, identity: Create an N x N identity matrix with 1s on the diagonal and 0s elsewhere.
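The creation functions above might be used as in this short sketch. The shapes and fill values are arbitrary choices for illustration.

```python
import numpy as np

a = np.array([1, 2, 3], dtype=np.float64)  # explicit data type
z = np.zeros((2, 3))                       # 2 x 3 array of 0.0
o = np.ones_like(z)                        # same shape and dtype as z, all 1.0
f = np.full((2, 2), 7)                     # 2 x 2 array filled with 7
i3 = np.eye(3)                             # 3 x 3 identity matrix
e = np.empty((2, 2))                       # uninitialized: contents are arbitrary

print(a.dtype, z.shape, o.sum(), f.tolist(), i3.trace(), e.shape)
```

Because empty does not initialize memory, its contents should be overwritten before they are read.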
Data Types and Arithmetic Operations¶
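- Types: int8/16/32/64, uint8/16/32/64, float16/32/64, complex64/128/256, bool, object, string_, unicode_. The complex256 type is platform-dependent, and the string_ and unicode_ aliases were removed in NumPy 2.0 (use bytes_ and str_ instead).
- Conversion: astype.
- Vector arithmetic operations: +, -, *, /, **, >, etc.
- Element-wise and binary operations
- Indexing and slicing: similar to lists, except that a slice is a view on the original array, not a copy.
- Boolean indexing
- Reshape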
names = np.array(["Bob", "Joe", "Will", "Bob", "Will", "Joe", "Joe"])
data = np.array([[4, 7], [0, 2], [-5, 6], [0, 0], [1, 2], [-12, -4], [3, 4]])
data[names == "Bob"] # selecting rows
data[names == "Bob", 1:] # selecting rows and columns
data[~(names == "Bob")] # negate selection
# multiple conditions
mask = (names == "Bob") | (names == "Will")
data[mask]
# selecting and setting
data[data < 0] = 0
arr = np.arange(32).reshape((8, 4)) # reshape
print(arr)
print(arr.T) # transposing
print(arr.T @ arr) # matrix multiplication
arr.swapaxes(0, 1) # transposing is a special case of swapping axes
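Type conversion with astype and the behavior of slices are not covered by the code above, so here is a small sketch of both. The sample values are made up for illustration.

```python
import numpy as np

# astype always returns a new array with the requested data type.
floats = np.array([1.7, -2.2, 3.9])
ints = floats.astype(np.int64)   # fractional part is truncated: [ 1 -2  3]
print(ints, ints.dtype)

# Strings that represent numbers can be converted as well.
nums = np.array(["1.25", "-9.6", "42"]).astype(np.float64)
print(nums)

# A slice is a view: writing to it changes the original array.
base = np.arange(10)
view = base[5:8]
view[:] = 0
print(base)                      # [0 1 2 3 4 0 0 0 8 9]

# Use copy() when an independent array is needed.
independent = base[5:8].copy()
independent[:] = 99
print(base)                      # unchanged: [0 1 2 3 4 0 0 0 8 9]
```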
Pseudorandom Number Generation¶
The numpy.random module has many random number generation functions.
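- standard_normal: draw samples from a normal distribution with mean 0 and standard deviation 1
- permutation: return a random permutation of a sequence, or a permuted range
- shuffle: randomly permute a sequence in place
- uniform: draw samples from a uniform distribution
- integers: draw random integers from a given low-to-high range
- binomial: draw samples from a binomial distribution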
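The functions above are available as methods of a generator object created with numpy.random.default_rng. The sketch below assumes that approach; the seed value and the sizes are arbitrary.

```python
import numpy as np

# A seeded generator makes the results reproducible.
rng = np.random.default_rng(seed=12345)

samples = rng.standard_normal((2, 3))      # 2 x 3 draws from N(0, 1)
ints = rng.integers(0, 10, size=5)         # integers in [0, 10)
unif = rng.uniform(0.0, 1.0, size=4)       # floats in [0.0, 1.0)
flips = rng.binomial(n=10, p=0.5, size=3)  # successes out of 10 trials

perm = rng.permutation(5)                  # new shuffled array of 0..4
deck = np.arange(5)
rng.shuffle(deck)                          # shuffles deck in place

print(samples.shape, ints, unif, flips, perm, deck)
```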
Array-Oriented Programming¶
- The numpy.meshgrid function takes two one-dimensional arrays and produces two two-dimensional matrices corresponding to all pairs of (x, y) in the two arrays.
- Array statistical methods: sum, mean, std, var, etc.
- Linear algebra: diag, dot, trace, eig, solve, etc.
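The three items above can be sketched together on small made-up inputs. The arrays and the linear system here are illustrative only.

```python
import numpy as np

# meshgrid: every (x, y) pair from two one-dimensional arrays.
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 10.0])
xs, ys = np.meshgrid(x, y)   # both have shape (2, 3)
z = xs + ys                  # evaluated on the whole grid at once

# Statistical methods, over the whole array or along one axis.
print(z.sum(), z.mean())     # 36.0 6.0
print(z.sum(axis=0))         # column sums: [10. 12. 14.]
print(z.mean(axis=1))        # row means: [ 1. 11.]

# Linear algebra: solve the system A @ sol = b.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
sol = np.linalg.solve(A, b)
print(np.allclose(A @ sol, b))  # True
print(np.trace(A))              # 5.0
```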
pandas¶
- pandas is based on NumPy and contains data structures and data manipulation tools.
- It is designed to work with tabular or heterogeneous data.
Data Structures¶
- Series: a one-dimensional array with labels.
- DataFrame: tabular data with row and column labels.
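A minimal sketch of both data structures follows. The labels, column names, and values are made up for illustration.

```python
import pandas as pd

# Series: values plus an index of labels.
s = pd.Series([4, 7, -5], index=["a", "b", "c"])
print(s["b"])                    # 7
print(s[s > 0].index.tolist())   # ['a', 'b']

# DataFrame: columns of possibly different types sharing one row index.
df = pd.DataFrame({
    "state": ["Ohio", "Ohio", "Nevada"],
    "year": [2000, 2001, 2001],
    "pop": [1.5, 1.7, 2.4],
})
print(df.shape)                  # (3, 3)
print(df.dtypes)

# Use df["pop"], not df.pop, because pop is also a DataFrame method.
by_state = df.groupby("state")["pop"].sum()
print(by_state)
```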
Visualization¶
matplotlib is a library for creating static, animated, and interactive visualizations in Python. It integrates with NumPy array data types.
Seaborn is a library for making statistical graphics in Python. It builds on top of matplotlib and integrates with pandas data structures.
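As a minimal sketch of the matplotlib and NumPy integration, the example below plots a sine curve from a NumPy array and renders it to an in-memory PNG. The Agg backend is selected here only so that the script runs without a display; in a Jupyter Notebook that line can be dropped and the figure is shown inline.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, chosen so no display is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")  # NumPy arrays are passed directly
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()

# Render to an in-memory buffer; pass a file name instead to write to disk.
buf = io.BytesIO()
fig.savefig(buf, format="png")
plt.close(fig)
print(len(buf.getvalue()), "bytes written")
```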