Data Science
Data Science is basically trying to make sense of data by identifying patterns, applying algorithms to extract information and predict future actions that are suitable for a specific situation.
We know data can be available in both unstructured and structured formats. But what is Unstructured and Structured Data?
-
Unstructured Data: It does not have a predefined data model. It can be available in audio, video and text format.
-
Structured Data: It is mostly available in a text format which originates from databases or specific file formats like CSV (Comma Separated Values)
After categorizing data (unstructured or structured), you need to analyze it. Identifying patterns of the data and understanding the data by processing it in such a way that you convert the data into information. This area of processing the data is typically referred as Data Analytics.
Having analyzed this data, you’d want to present it visually so that decisions can be made. This is called Data Visualization.
Once you have visualized the data, you can decide what to do with that data. Either you can ask an expert to take suitable actions using that data or hand over the entire process to machines. When you transfer the process to a machine, the process is basically automation. You feed data into the machine, and the machine tries to identify patterns in data after rigorous training and testing and there upon gives us desired results. This is Machine Learning. Whenever there is situation when the machine needs to make decisions on its own where judgment is involved, it is called Artificial Intelligence.
Mumpy
As a general-purpose programming language, Python is now the most popular programming language in data science. It’s easy to use, has great community support, and integrates well with other frameworks (e.g., web applications) in an engineering environment.
Numerical Data
Datasets come from a wide range of sources and formats: it could be collections of numerical measurements, text corpus, images, audio clips, or basically anything. No matter the format, the first step in data science is to transform it into arrays of numbers.
We collected 45 U.S. president heights in centimeters in chronological order and stored them in a list, a built-in data type in python.
heights = [189, 170, 189, 163, 183, 171, 185, 168, 173, 183, 173, 173, 175, 178, 183, 193, 178, 173, 174, 183, 183, 180, 168, 180, 170, 178, 182, 180, 183, 178, 182, 188, 175, 179, 183, 193, 182, 183, 177, 185, 188, 188, 182, 185, 191]
In this example, George Washington was the first president, and his height was 189 cm.
If we wanted to know how many presidents are taller than 188cm, we could iterate through the list, compare each element against 188, and increase the count by 1 as the criteria is met.
cnt = 0
for height in heights:
if height > 188:
cnt +=1
print(cnt)
Numpy (short for Numerical Python) allows us to find the answer to how many presidents are taller than 188cm with ease. Below we show how to use the library and start with the basic object in numpy.
import numpy as np
heights_arr = np.array(heights)
print((heights_arr > 188).sum())
The import statement allows us to access the functions and modules inside the numpy library. The library will be used frequently, so by convention numpy is imported under a shorter name, np. The second line is to convert the list into a numpy array object, via np.array(), that tools provided in numpy can work with. The last line provides a simple and natural solution, enabled by numpy, to the original question.
As our datasets grow larger and more complicated, numpy allows us the use of a more efficient and for-loop-free method to manipulate and analyze our data. Our dataset example in this module will include the US Presidents’ height, age and party.
Size and Shape
An array class in Numpy is called an ndarray or n-dimensional array. We can use this to count the number of presidents in heights_arr, use attribute numpy.ndarray.size:
heights_arr.size
Note that once an array is created in numpy, its size cannot be changed.
Size tells us how big the array is, shape tells us the dimension. To get current shape of an array use attribute shape:
heights_arr.shape
The output is a tuple, recall that the built-in data type tuple is immutable whereas a list is mutable, containing a single value, indicating that there is only one dimension, i.e., axis 0. Along axis 0, there are 45 elements (one for each president) Here, heights_arr is a 1d array
re shape
Other data we have collected includes the ages of the presidents:
ages = [57, 61, 57, 57, 58, 57, 61, 54, 68, 51, 49, 64, 50, 48, 65, 52, 56, 46, 54, 49, 51, 47, 55, 55, 54, 42, 51, 56, 55, 51, 54, 51, 60, 62, 43, 55, 56, 61, 52, 69, 64, 46, 54, 47, 70]
Since both heights and ages are all about the same presidents, we can combine them:
heights_and_ages = heights + ages
# convert a list to a numpy array
heights_and_ages_arr = np.array(heights_and_ages)
heights_and_ages_arr.shape
This produces one long array. It would be clearer if we could align height and age for each president and reorganize the data into a 2 by 45 matrix where the first row contains all heights and the second row contains ages. To achieve this, a new array can be created by calling numpy.ndarray.reshape with new dimensions specified in a tuple:
heights_and_ages_arr.reshape((2,45))
The reshaped array is now a 2darray, yet note that the original array is not changed. We can reshape an array in multiple ways, as long as the size of the reshaped array matches that of the original.
Data Type
Another characteristic about numpy array is that it is homogeneous, meaning each element must be of the same data type.
For example, in heights_arr, we recorded all heights in whole numbers; thus each element is stored as an integer in the array. To check the data type, use numpy.ndarray.dtype
heights_arr.dtype
If we mixed a float number in, say, the first element is 189.0 instead of 189:
heights_float = [189.0, 170, 189, 163, 183, 171, 185, 168, 173, 183, 173, 173, 175, 178, 183, 193, 178, 173, 174, 183, 183, 180, 168, 180, 170, 178, 182, 180, 183, 178, 182, 188, 175, 179, 183, 193, 182, 183, 177, 185, 188, 188, 182, 185, 191]
Then after converting the list into an array, we’d see all other numbers are coerced into floats:
heights_float_arr = np.array(heights_float)
heights_float_arr
heights_float_arr.dtype