Boost your learning by mastering this key library.
Check out our free Numpy course:
There are thousands of data science tutorials online and many students jump from one to another endlessly, because they feel that their knowledge is not solid enough to adapt what they learned to their own projects. However, that is precisely the best way to learn data science: trying to adapt what you learn to a new context.
Many aspiring data scientists lack the ability to manipulate data sets proficiently. Libraries such as Keras and Scikit-learn allow us to implement Machine Learning projects easily but hide the complexity of the operations they perform, so if something goes wrong the student does not have the tools to solve it.
Without this solid foundation, they can develop impostor syndrome and put off the search for their first job as a Data Scientist for far too long.
However, the cure for this ill is very simple: learning how to manipulate datasets comfortably. This will allow you to modify and extend everything you learn in tutorials to your own projects and understand the errors you encounter along the way.
The easiest and most efficient way to develop this skill is to become familiar with the most important Python data science library: Numpy. It is the core library because most of the others rely on it to implement their functionality. It is a numerical computation library that allows us to manipulate large volumes of data in a very simple way.
Numpy is the central data science library in Python.
So much so that most other data science libraries depend on it in one way or another.
Numpy + Matplotlib
Matplotlib is the core Python plotting library. Its plotting functions take Numpy arrays as input:
import numpy as np
import matplotlib.pyplot as plt
# 100 evenly spaced points between 0 and 5
x = np.linspace(0, 5, 100)
y = np.sin(x)
plt.plot(x, y)
Numpy + TensorFlow & PyTorch
TensorFlow and PyTorch are numerical computing libraries, just like Numpy. The difference is that they are geared toward artificial intelligence. However, both libraries modeled a large part of their interface on Numpy.
For example, to square the values of an array using the Numpy library, we can simply write:
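A minimal sketch of that operation, assuming a small integer array built with `np.arange` (the variable names are illustrative and not taken from the original article):

```python
import numpy as np

# Illustrative input: the integers 0 through 4.
x = np.arange(5)

# np.square works element by element; x ** 2 gives the same result.
x_squared = np.square(x)

print(x_squared)  # [ 0  1  4  9 16]
```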
Can you guess how it is done in TensorFlow and PyTorch?
As you can see, these libraries perform operations in a very similar way. Therefore, learning Numpy will help you master these two other libraries.
Numpy + Scikit-learn
Scikit-learn is a core data analysis and machine learning library in Python. Of course, it also uses Numpy arrays to implement its functionality, as the following linear regression example shows:
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.ones((4,2))
y = np.arange(4)
reg = LinearRegression().fit(X, y)
Numpy + Pandas
Pandas is a library for manipulating and analyzing tabular data.
This library represents datasets using the Series and DataFrame classes. Objects of both classes can be created directly from Numpy arrays:
import numpy as np
import pandas as pd
x = np.array([1., 2., 3.])
series = pd.Series(x)
Moreover, converting objects of these classes into Numpy arrays is just as easy.
x = series.to_numpy()
Ndarray: an extremely versatile object
It is no coincidence that Numpy is the central data science library in Python. This library allows us to perform complex operations on large volumes of data quickly and efficiently. To achieve this, it uses a data structure called the n-dimensional array (ndarray).
This data structure is specially designed to represent datasets. To that end, it stores elements in a regular grid that can extend along one, two, three, or more dimensions. For example:
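A short sketch of arrays with one, two, and three dimensions. The values and shapes below are made up for illustration:

```python
import numpy as np

vector = np.array([1, 2, 3])                # 1 dimension
matrix = np.array([[1, 2, 3], [4, 5, 6]])   # 2 dimensions: 2 rows, 3 columns
cube = np.zeros((2, 3, 4))                  # 3 dimensions

# ndim is the number of dimensions, shape is the size along each one.
for arr in (vector, matrix, cube):
    print(arr.ndim, arr.shape)
```

The `shape` attribute is usually the first thing worth checking when an operation does not behave as expected.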
In addition to being extremely efficient, these arrays are very versatile: they can represent phenomena of a very diverse nature and, above all, they are very easy to manipulate. Imagine you have a dataset and you want to separate the dependent variable (the target) from the independent variables (the features):
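A minimal sketch of that separation, assuming a hypothetical dataset in which each row is a sample, the last column holds the value to predict, and the remaining columns are the inputs:

```python
import numpy as np

# Hypothetical dataset: 5 samples, 3 input columns plus the target in the last column.
data = np.arange(20).reshape(5, 4)

X = data[:, :-1]  # every row, all columns except the last
y = data[:, -1]   # every row, last column only

print(X.shape, y.shape)  # (5, 3) (5,)
```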
Or you might need to separate your data into training and validation sets:
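One way to sketch this split, assuming an 80/20 ratio and a made-up dataset of 10 samples. The shuffling step is an addition of this example, so that the validation set is not simply the last rows of the file:

```python
import numpy as np

# Hypothetical dataset: 10 samples, 4 columns.
data = np.arange(40).reshape(10, 4)

# Shuffle the rows first (fixed seed so the example is reproducible).
rng = np.random.default_rng(0)
shuffled = data[rng.permutation(len(data))]

# The split itself is a single line of slicing.
split = int(0.8 * len(shuffled))
train, val = shuffled[:split], shuffled[split:]

print(train.shape, val.shape)  # (8, 4) (2, 4)
```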
If you want to create an image classifier, you will have to reorder the image dimensions so that the channels come first:
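A sketch of that reordering, assuming the image is stored as (height, width, channels) and the classifier expects (channels, height, width). The 32x32 RGB size is only an example:

```python
import numpy as np

# Hypothetical 32x32 RGB image in (height, width, channels) layout.
image = np.arange(32 * 32 * 3).reshape(32, 32, 3)

# Move axis 2 (channels) to the front and keep height and width in order.
channels_first = np.transpose(image, (2, 0, 1))

print(channels_first.shape)  # (3, 32, 32)
```

`np.moveaxis(image, -1, 0)` would give the same result here.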
As you can see, each of these operations takes a single line of code, while in pure Python they would require nested loops or more elaborate code. That is the power of Numpy: being able to do complex operations intuitively.
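To make the comparison concrete, here is a sketch that moves the channel axis of a tiny made-up image to the front, first with nested Python lists and then with Numpy. The sizes and variable names are illustrative:

```python
import numpy as np

height, width, channels = 2, 3, 3
image = np.arange(height * width * channels).reshape(height, width, channels)

# Pure Python: rebuild the nested lists index by index.
nested = image.tolist()
channels_first_py = [
    [[nested[h][w][c] for w in range(width)] for h in range(height)]
    for c in range(channels)
]

# Numpy: one call.
channels_first_np = np.transpose(image, (2, 0, 1))

print(np.array_equal(channels_first_py, channels_first_np))  # True
```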
How to learn Numpy in record time
At Escape Velocity Labs we have created a free one-and-a-half-hour course in which you will learn the core functionality of this library. You can register for free at the following link:
Mastering Numpy will allow you to…
- Take your data science projects beyond tutorials.
- Create and adapt your own datasets.
- Understand how high-level libraries work (Keras, Scikit-learn, etc).
- Diagnose and resolve bugs in your code implementation.
- Understand how TensorFlow and PyTorch work at a low level.
- And above all: learn to reason about the shape of your data and the effect that each operation has on it. Although it may seem counter-intuitive, this is the most useful skill you can develop in your first steps as a Data Scientist.