In this tutorial for Python, we’ll show you NumPy basics for data science and machine learning.
NumPy is the fundamental library for array/scientific computing in Python. Many Python libraries for machine learning are built upon NumPy. So it’s hard to do data analysis using Python without NumPy.
Following this beginner-friendly tutorial, you’ll learn the basics of Python NumPy:
- What NumPy is in Python, and how to install and import it.
- What NumPy multi-dimensional arrays are.
- How to create and manipulate NumPy arrays.
- How to do math operations (linear algebra) and indexing on NumPy arrays.
- What the popular NumPy random number generators are.
- Much more.
This tutorial with examples will help you start data operations in Python, which is the foundation for more advanced machine learning tasks.
Note: if you are new to Python, take our FREE Python crash course for data science to learn the basics. This tutorial of Python NumPy assumes you have a basic knowledge of Python.
Let’s get started!
- What is NumPy Python for Data Science?
- Install and Import NumPy
- Python NumPy Array Creation and Basics
- From Range to One-Dimensional Array
- From List to Multi-Dimensional Array
- Special Arrays (zeros, ones, full, eye)
- Data Types
What is NumPy Python for Data Science?
Or why NumPy?
NumPy introduces powerful N-dimensional arrays, which are more memory efficient and faster than Python lists. For more details, check out What’s the difference between a Python list and a NumPy array?
NumPy makes numerical operations in Python easier, providing functionality comparable to MATLAB and R.
Besides this, NumPy also provides useful linear algebra, Fourier transform, and random number capabilities.
Many other machine learning packages (e.g., Pandas, SciPy, scikit-learn, TensorFlow) are built upon or rely on NumPy.
In summary, NumPy is a must-know package for data science and machine learning.
Note before the tutorial: it’s not necessary (or possible) to memorize all the functions or methods. Plus, there might be other functions or methods you need in the future.
We suggest following the tutorial to learn the essentials systematically. When you are practicing data science, a simple Google search will often solve the problem.
Install and Import NumPy
So how do we get started with NumPy?
You can install the Anaconda distribution to start using Python for data science, which is FREE and includes popular packages like NumPy:
- If you need help with installing Anaconda, check out How to Install/Setup Python and Prep for Data Science NOW.
- If you want to install NumPy and learn the basics of Python, take our FREE Python crash course. This tutorial assumes you have all the basic Python knowledge.
If you already have Python but not NumPy, please use the command below to install it.
pip install numpy
After installing the package, we need to import it in each new Python session.
Note that np is the popular alias name used for NumPy.
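A minimal sketch of the import, using the conventional np alias. Printing the version is only a quick check that the import worked.

```python
import numpy as np

# Print the installed version to confirm that NumPy is available.
print(np.__version__)
```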
Python NumPy Array Creation and Basics
As mentioned earlier, the main object within NumPy is the multi-dimensional array (ndarray). It’s a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers. In NumPy, the dimensions of an array are also called axes.
First, let’s see how to create some NumPy arrays.
To create new NumPy arrays, we use np.array. We’ll show a few common methods used in data science.
Let’s start from the most basic way of creating NumPy arrays: from existing Python sequences.
From Range to One-Dimensional Array
First, we can create an array from range (for example, with the stop position at 100 below).
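A short sketch of this step, passing a Python range with stop position 100 to np.array:

```python
import numpy as np

# Build a one-dimensional array from a Python range (0 up to, not including, 100).
x = np.array(range(100))

print(x[:5])   # [0 1 2 3 4]
print(len(x))  # 100
```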
Great, we created our first NumPy array. It has one dimension/axis with 100 elements, or we can say it has a length of 100.
Since this is the first NumPy Array we created, let’s also look at its important attributes:
- ndim: the number of axes (dimensions).
- shape: the dimensions of the array as a tuple. For a matrix with n rows and m columns, the shape will be (n,m).
- size: the total number of elements of the array, which equals the product of the elements of shape.
- dtype: the data type of the elements in the array. More on this in the Data Types section.
- itemsize: the size in bytes of each element of the array. For example, int32 has itemsize 4 (=32/8).
Below, the value of each of these attributes for array x is shown after the # sign.
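Here is a sketch of the attribute lookups. The dtype is set to int32 explicitly, which is an assumption made so that the results are the same on every platform (the default integer type differs between systems).

```python
import numpy as np

# Assumption: dtype is fixed to int32 so dtype/itemsize match on any platform.
x = np.array(range(100), dtype=np.int32)

print(x.ndim)      # 1
print(x.shape)     # (100,)
print(x.size)      # 100
print(x.dtype)     # int32
print(x.itemsize)  # 4
```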
From List to Multi-Dimensional Array
Another popular way of creating arrays is from lists of lists (or nested lists).
For example, we create another multidimensional array y below.
We can also look at its primary attributes, which are straightforward.
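A minimal sketch of creating y from a nested list and inspecting it. The values are illustrative and not taken from the original example.

```python
import numpy as np

# Illustrative values: a nested list with 2 rows and 3 columns
# becomes a two-dimensional array.
y = np.array([[1, 2, 3], [4, 5, 6]])

print(y.ndim)   # 2
print(y.shape)  # (2, 3)
print(y.size)   # 6
```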
Special Arrays (zeros, ones, full, eye)
Next, let’s use some array functions to create special arrays.
When the size of the array is known but not the elements, we can use NumPy functions to create arrays with initial placeholders. This helps us avoid the expensive operation of growing arrays later.
We can use the zeros function to create arrays full of zeros. By default, the dtype of the created array is float64.
For example, we can create a one-dimensional array of zeros with length 10.
Or a two-dimensional array of zeros by specifying its shape parameter.
Similarly, we can use the ones function to create arrays filled with ones.
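The zeros and ones calls described above can be sketched as follows. The two-dimensional shapes (3, 4) and (2, 5) are arbitrary choices for illustration.

```python
import numpy as np

zeros_1d = np.zeros(10)      # one dimension, length 10
zeros_2d = np.zeros((3, 4))  # shape given as a tuple: 3 rows, 4 columns
ones_2d = np.ones((2, 5))    # same idea, filled with ones

print(zeros_1d.shape, zeros_2d.shape, ones_2d.shape)  # (10,) (3, 4) (2, 5)
print(zeros_1d.dtype)  # float64
```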
We can also use the full function to create arrays filled with a given fill_value.
For example, we can create a 5*5 array filled with the number 53.
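A short sketch of the full call for this 5*5 example:

```python
import numpy as np

# First argument is the shape, second is the fill value.
x = np.full((5, 5), 53)

print(x.shape)  # (5, 5)
print(x[0])     # [53 53 53 53 53]
```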
We can also create arrays based on the shape of existing arrays.
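One way to do this is with the *_like functions, sketched below. The template array is a hypothetical example; the new arrays take their shape and dtype from it.

```python
import numpy as np

# Hypothetical existing array whose shape we want to reuse.
template = np.array([[1, 2, 3], [4, 5, 6]])

z = np.zeros_like(template)    # zeros with the shape of template
o = np.ones_like(template)     # ones with the shape of template
f = np.full_like(template, 7)  # every element set to 7

print(z.shape, o.shape, f.shape)  # (2, 3) (2, 3) (2, 3)
```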
Another useful function is eye. It returns a 2-D array with ones on the diagonal and zeros elsewhere, which can be used to create Identity Matrices.
For example, we can specify the number of rows in the array to be 5, while the number of columns will be the same by default. Now we have an Identity Matrix of size 5.
Or we can specify both the number of rows and columns.
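Both eye calls can be sketched like this. The 3-by-5 shape in the second call is an arbitrary illustration.

```python
import numpy as np

identity = np.eye(5)  # 5 rows; columns default to the same number
rect = np.eye(3, 5)   # 3 rows and 5 columns, ones on the main diagonal

print(identity.shape)  # (5, 5)
print(rect.shape)      # (3, 5)
```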
Data Types
NumPy supports a greater variety of numerical types than Python does. For instance, NumPy has its own data types such as numpy.int32 and numpy.float64. Different data types take up different amounts of memory.
For a full list of data types in NumPy, take a look at the official data types document.
We’ve been leaving the data types to default when creating arrays. Let’s try to specify the data type using the dtype parameter.
Let’s create two arrays from range sequences: one using int32 (the default integer type on some platforms; on many others the default is int64), the other using int8.
Next, let’s print out the arrays and their corresponding itemsizes.
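A sketch of the comparison. Both dtypes are passed explicitly here, an assumption made because the default integer type varies by platform; the range stop of 100 is also illustrative.

```python
import numpy as np

# Same values, different data types (0..99 fits in int8).
x = np.array(range(100), dtype=np.int32)
y = np.array(range(100), dtype=np.int8)

print(x.itemsize)  # 4
print(y.itemsize)  # 1
print(x.nbytes, y.nbytes)  # 400 100
```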
x and y hold the same numeric values, but x has an itemsize of 4 and y has an itemsize of 1, so y takes up much less space. Note, however, that int8 numbers cannot have values outside the range -128 to 127, while int32 covers a much larger range (-2147483648 to 2147483647).
When we only need a small range of numeric values, choosing a smaller data type helps save memory.
Now that we’ve learned the basics of NumPy arrays, let’s see how we can manipulate them after creation.
Before fitting machine learning models, we often need to change the shape of an existing array. Let’s see some handy methods to do that.
We’ll start from the reshape method.
For instance, say we have an array with data of length 90.
We can use the reshape method to change the shape of this array while keeping its data.
Below we change x to a two-dimensional array of 10 rows and 9 columns.
We can also change the array to three dimensions, with shape (3, 3, 10).
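The two reshapes can be sketched as follows, assuming x is built from range(90):

```python
import numpy as np

x = np.array(range(90))      # 90 elements in one dimension

x2d = x.reshape(10, 9)       # 10 rows, 9 columns
x3d = x.reshape(3, 3, 10)    # three dimensions

print(x2d.shape)  # (10, 9)
print(x3d.shape)  # (3, 3, 10)
```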
Instead of specifying every dimension, we can set one of them to -1. In this case, its value is inferred from the length of the array and the remaining dimensions, since the new shape needs to be compatible with the original one.
For example, the reshaped array below can only have shape (3, 3, 10), since the length of x is 90, which equals 3*3*10.
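A minimal sketch of the -1 shortcut, again assuming x is built from range(90):

```python
import numpy as np

x = np.array(range(90))

# The last dimension is inferred: 90 / (3 * 3) = 10.
y = x.reshape(3, 3, -1)

print(y.shape)  # (3, 3, 10)
```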
We’ve seen examples of transforming a 1D array into a multidimensional array. What if we want to do the opposite?
Let’s create a 3D array below.
We can use the ravel method to flatten it to 1D array.
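A sketch of flattening with ravel. The 3D shape (10, 3, 3) is an assumption chosen for illustration.

```python
import numpy as np

# Assumed example: a 3D array of shape (10, 3, 3).
x = np.array(range(90)).reshape(10, 3, 3)

flat = x.ravel()  # back to one dimension

print(flat.shape)  # (90,)
```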
Besides modifying the shapes of arrays, transposing an array is also a common operation.
Let’s see a quick example.
To transpose the array, we can use T. We can see that x is transposed from shape (10, 3, 3) to (3, 3, 10).
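A short sketch of the transpose, assuming x is filled from range(90). For an array with more than two dimensions, T reverses the order of the axes.

```python
import numpy as np

x = np.array(range(90)).reshape(10, 3, 3)
xt = x.T  # axes reversed

print(x.shape)   # (10, 3, 3)
print(xt.shape)  # (3, 3, 10)

# An element at [i, j, k] in x sits at [k, j, i] in the transpose.
print(x[7, 1, 2] == xt[2, 1, 7])  # True
```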
That’s it for now about array manipulation. Let’s move on to some essential math functions for arrays, which are handy for learning about the numeric values stored in them.
Single Array Math Operations
To show the mathematical functions, let’s first create an example array.
It’s useful to learn about the summary statistics of the array. We can perform sum, min, max, mean, std on the array for the elements within it.
The outputs for the functions are copied behind the #s.
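A sketch of these summary statistics. The example array (the integers 1 to 100) is an assumption, chosen because it is consistent with the median of 50.5 reported in this section.

```python
import numpy as np

# Assumed example array: integers 1 through 100.
x = np.array(range(1, 101))

print(x.sum())            # 5050
print(x.min())            # 1
print(x.max())            # 100
print(x.mean())           # 50.5
print(round(x.std(), 2))  # 28.87
```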
Other than these functions, we can also get the median, or get the non-negative square-root of the array.
The median is 50.5, while sqrt is applied to each element of array x and returns a new array.
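A minimal sketch of both calls, assuming x holds the integers 1 to 100. Only a few square roots are printed to keep the output short.

```python
import numpy as np

x = np.array(range(1, 101))  # assumed example array

print(np.median(x))  # 50.5

roots = np.sqrt(x)   # element-wise square root, returns a new array
print(roots[[0, 3, 8]])  # [1. 2. 3.]  (square roots of 1, 4, and 9)
```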
In a later section of this tutorial, we’ll also show some basics for multiple NumPy array operations. Keep reading to find out!
Before that, let’s continue to see what else can be done with one array.
Array Indexing and Slicing
Similar to Python lists, we can use indexing and slicing to get elements from the NumPy arrays.
To show examples, we’ll create a new array with two-dimensions of shape 10*10.
The most basic selection is a single element: we can get an element of the array by its index positions. Recall that index positions start from 0. For example, x[0, 1] selects the element in the first row (index position 0) and the second column (index position 1).
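A sketch of single-element selection, assuming x holds the values 0 to 99 in a 10*10 layout. The second lookup, x[2, 9], is a hypothetical choice of index.

```python
import numpy as np

# Assumed example: values 0..99 arranged in 10 rows and 10 columns.
x = np.array(range(100)).reshape(10, 10)

print(x[0, 1])  # 1   (first row, second column)
print(x[2, 9])  # 29  (third row, last column)
```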
The Python code above returns 1 and 29.
We can also use indexing and slicing to select a subsection of the array.
This concept is most natural to learn with a 2D array with columns and rows. For example, to select the second row (index position 1), we use the Python code below.
To select the third column (index position 2), we use the Python code below.
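Row and column selection can be sketched like this, assuming x holds the values 0 to 99 in a 10*10 layout:

```python
import numpy as np

x = np.array(range(100)).reshape(10, 10)

second_row = x[1]       # same as x[1, :]
third_col = x[:, 2]     # every row, column index 2

print(second_row)  # [10 11 12 13 14 15 16 17 18 19]
print(third_col)   # [ 2 12 22 32 42 52 62 72 82 92]
```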
What if we want a chunk of the array?
We can use the slicing technique as well.
For example, by specifying the start and stop indexes, the code below can return multiple rows. Recall that the stopping index position is not included.
We can do similar slicing to take multiple columns as well.
Or we can get subsets of both rows and columns.
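The three kinds of slices above can be sketched as follows. The start and stop positions are arbitrary illustrations, and x is assumed to hold 0 to 99 in a 10*10 layout.

```python
import numpy as np

x = np.array(range(100)).reshape(10, 10)

rows = x[1:3]         # rows 1 and 2 (stop index 3 is excluded)
cols = x[:, 2:5]      # columns 2, 3, and 4 of every row
block = x[1:3, 2:5]   # rows 1-2 and columns 2-4

print(rows.shape)      # (2, 10)
print(cols.shape)      # (10, 3)
print(block.tolist())  # [[12, 13, 14], [22, 23, 24]]
```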
We can also use advanced indexing to get particular rows.
Or particular rows and a particular column.
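A sketch of advanced (list-based) indexing, assuming x holds 0 to 99 in a 10*10 layout. The row list [0, 3, 9] matches the one used in this section; the column index 1 is a hypothetical choice.

```python
import numpy as np

x = np.array(range(100)).reshape(10, 10)

picked_rows = x[[0, 3, 9]]       # rows 0, 3, and 9
picked_col = x[[0, 3, 9], 1]     # column 1 of those rows

print(picked_rows.shape)  # (3, 10)
print(picked_col)         # [ 1 31 91]
```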
Below is another example. From each row, a specific element is selected. The row index is [0,3,9], and the column index specifies which element to choose from the corresponding row.
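A sketch of this pairwise selection. The column index list [1, 2, 3] is hypothetical (the original value is not shown), and x is assumed to hold 0 to 99 in a 10*10 layout.

```python
import numpy as np

x = np.array(range(100)).reshape(10, 10)

row_index = [0, 3, 9]
col_index = [1, 2, 3]  # hypothetical: one column per selected row

# Picks x[0, 1], x[3, 2], and x[9, 3].
picked = x[row_index, col_index]
print(picked)  # [ 1 32 93]
```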
Besides the above, we can also use boolean (True/False) indexing on NumPy arrays. It’s handy for selecting the elements of an array that satisfy some condition.
For example, if we want to see the elements of array x that are over 50, we can use the following code.
The bool_index is a boolean array of the same shape as x, where each element says whether the corresponding element of x is > 50. Then we can use this boolean array to select elements from array x.
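A minimal sketch of boolean indexing, assuming x holds 0 to 99 in a 10*10 layout:

```python
import numpy as np

x = np.array(range(100)).reshape(10, 10)

bool_index = x > 50       # boolean array with the same shape as x
selected = x[bool_index]  # 1D array of the elements where the mask is True

print(bool_index.shape)  # (10, 10)
print(selected.min(), selected.max(), selected.size)  # 51 99 49
```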
There is more for indexing of NumPy arrays. Please check out the Advanced Indexing documentation of NumPy.
NumPy arrays are mutable. With indexing and slicing, we can also reassign the elements within an array.
We’ll show an example of a 3D array of shape (3, 3, 3).
We can reassign the element with index 0, 0, 0 to 100.
We can also change multiple elements at once. Arrays with more than two dimensions are hard to picture, but the examples should give you the idea.
The Python code below reassigns all the elements with 2nd and 3rd dimensions with index position 1 to 99.
While the Python code below reassigns all the elements of index 0 in the first dimension to -888.
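The three reassignments above can be sketched in sequence. The starting values (0 to 26) are an assumption for illustration.

```python
import numpy as np

# Assumed example: values 0..26 in a (3, 3, 3) array.
x = np.array(range(27)).reshape(3, 3, 3)

x[0, 0, 0] = 100      # a single element
print(x[0, 0, 0])     # 100

x[:, 1, 1] = 99       # index 1 in the 2nd and 3rd dimensions, all of the 1st
print(x[:, 1, 1])     # [99 99 99]

x[0] = -888           # everything at index 0 of the first dimension
print(x[0, 1, 1])     # -888
print(x[1, 1, 1])     # 99
```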
So far this Python NumPy tutorial has been about a single array. Let’s now look at manipulation and operations with multiple arrays.
Multiple Arrays: Stack / Concatenation
We’ll start with how to combine two NumPy arrays.
Let’s create two example arrays of the same shape.
How do we stack these arrays together?
We can either stack them or concatenate them.
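A sketch of both options with two small illustrative arrays. np.stack joins the arrays along a new axis, while np.concatenate joins them along an existing one.

```python
import numpy as np

# Illustrative arrays of the same shape (2, 2).
x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])

stacked = np.stack([x, y])                   # new leading axis
joined_rows = np.concatenate([x, y])         # along axis 0 (default)
joined_cols = np.concatenate([x, y], axis=1) # along axis 1

print(stacked.shape)      # (2, 2, 2)
print(joined_rows.shape)  # (4, 2)
print(joined_cols.shape)  # (2, 4)
```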
We are not covering examples of splitting, the opposite operation of concatenation. If you are interested, take a look at np.split.
Multiple Arrays: Math Operations / Linear Algebra
Now let’s also look at some multiple array operations. We’ll include the basic arithmetic operations, broadcasting, and multiplication.
Most of the concepts are easy to learn with basic knowledge of matrix operations in linear algebra.
Basic Arithmetic Operations
To show the basics of arithmetic calculation between arrays, we create two simple arrays.
To perform element-wise arithmetic operations, we can simply write code as below.
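A sketch of element-wise arithmetic. The two arrays are assumptions; the shape (5, 2) is chosen to be consistent with the matrix multiplication discussion that follows.

```python
import numpy as np

# Assumed example arrays, both of shape (5, 2).
x = np.array(range(10)).reshape(5, 2)
y = np.array(range(10, 20)).reshape(5, 2)

# Each operator works element by element.
print((x + y)[0])  # [10 12]
print((x - y)[0])  # [-10 -10]
print((x * y)[0])  # [ 0 11]
print((x / y).shape)  # (5, 2)
```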
Matrix multiplication is a basic linear algebra operation. Let’s see how to do it on NumPy arrays.
We’ll use the same arrays as the previous example.
Can we multiply arrays x and y using the dot product method?
We got an error message, because for matrix multiplication, the number of columns of the first matrix must equal the number of rows of the second matrix.
This will work if we transpose y to be shaped (2, 5). The result is a new array of shape (5, 5).
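A sketch of both attempts, assuming x and y are the (5, 2) arrays below. The failing call is wrapped in try/except so the example runs to the end.

```python
import numpy as np

x = np.array(range(10)).reshape(5, 2)
y = np.array(range(10, 20)).reshape(5, 2)

# (5, 2) dot (5, 2): inner dimensions 2 and 5 do not match.
try:
    x.dot(y)
    dot_failed = False
except ValueError as err:
    dot_failed = True
    print("dot failed:", err)

# (5, 2) dot (2, 5) works and gives a (5, 5) result.
product = x.dot(y.T)
print(product.shape)  # (5, 5)
print(product[0, 0])  # 11  (0*10 + 1*11)
```

The @ operator, as in x @ y.T, gives the same result for two-dimensional arrays.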
What is broadcasting?
In NumPy, broadcasting allows mathematical operations on arrays of different shapes. NumPy can “broadcast” the smaller array across the larger array so that they have compatible shapes.
This is very useful since we often have arrays of different shapes, and it would be cumbersome always to have to match them.
Broadcasting is so powerful that we might have been using it without even noticing!
Yet, two main rules must be met for broadcasting to work.
Broadcasting Rule One
The arrays must have matching dimension sizes. A match means that either the dimension sizes are the same OR at least one of the arrays has size 1 in that dimension.
When one of the dimensions is 1, NumPy will expand that dimension to be the same dimension as the other array, so that they can match up. For example:
- array1 with shape (3, 4, 5) would match with array2 with shape (3, 4, 1). Since the third dimension of array2 can be expanded to be of size 5.
- array3 with shape (7, 1, 10) would match with array4 with shape (1, 10, 10). The second dimension of array3 can be expanded to 10, and the first dimension of array4 can be extended to 7.
- but array5 with shape (1, 2, 3) wouldn’t match with array6 with shape (1, 2, 6) because the third dimensions are not equal, and neither of them is 1.
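The three cases in this list can be checked with arrays of ones. This is a sketch that only inspects the resulting shapes.

```python
import numpy as np

array1 = np.ones((3, 4, 5))
array2 = np.ones((3, 4, 1))
print((array1 + array2).shape)  # (3, 4, 5)

array3 = np.ones((7, 1, 10))
array4 = np.ones((1, 10, 10))
print((array3 + array4).shape)  # (7, 10, 10)

array5 = np.ones((1, 2, 3))
array6 = np.ones((1, 2, 6))
try:
    array5 + array6  # third dimensions are 3 and 6: no match
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("Cannot broadcast:", err)
```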
We’ve seen examples for arrays with the same number of dimensions. What about arrays with different numbers of dimensions?
Broadcasting Rule Two
The array with fewer dimensions will have additional dimensions of size 1 prepended, until it has the same number of dimensions as the other array. For example,
- array1 with shape (3, 4, 5) would match with array2 with shape (4, 5). NumPy will expand the shape of array2 to be (1, 4, 5). And as mentioned earlier, shape (3, 4, 5) will match with shape (1, 4, 5).
- array3 with shape (5, 6, 7, 8) would match with array4 with shape (7, 8). NumPy will expand the shape of array4 to be (1, 1, 7, 8), which matches (5, 6, 7, 8).
- array5 with shape (1, 2, 3, 4) would match with array6 with shape (1, ). The shape of array6 will be expanded to (1, 1, 1, 1).
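A sketch that checks the three cases in this list, again using arrays of ones and looking only at the result shapes:

```python
import numpy as np

array1 = np.ones((3, 4, 5))
array2 = np.ones((4, 5))        # treated as (1, 4, 5)
print((array1 + array2).shape)  # (3, 4, 5)

array3 = np.ones((5, 6, 7, 8))
array4 = np.ones((7, 8))        # treated as (1, 1, 7, 8)
print((array3 + array4).shape)  # (5, 6, 7, 8)

array5 = np.ones((1, 2, 3, 4))
array6 = np.ones((1,))          # treated as (1, 1, 1, 1)
print((array5 + array6).shape)  # (1, 2, 3, 4)
```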
So what are the typical applications of broadcasting?
The most common use of broadcasting is doing arithmetic on an array with a scalar value. A scalar value can be treated as an array with shape (1, ), which matches an array of any shape.
Let’s see an example.
We can do simple arithmetic operations of this array and a scalar. Without broadcasting, we’d have to expand the scalar into the same shape as x.
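A sketch of scalar broadcasting. The array x (values 0 to 9, shape (5, 2)) is an assumed example. The last lines show the manual expansion that broadcasting spares us.

```python
import numpy as np

x = np.array(range(10)).reshape(5, 2)  # assumed example array

print((x + 1)[0])  # [1 2]
print((x * 2)[4])  # [16 18]

# Without broadcasting, the scalar would need the same shape as x.
expanded = np.full(x.shape, 1)
print(np.array_equal(x + 1, x + expanded))  # True
```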
Another common usage is to do something different for each row or column of an array.
Besides the x array just created, let’s also create a new array y as below.
The shapes of x and y are different: (5, 2) and (2, ).
But we can still add them together with broadcasting.
As we can see, the array y is added to each row of array x without us expanding y.
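A sketch of the row-wise case. The values of x and y are illustrative; only the shapes (5, 2) and (2, ) come from the text.

```python
import numpy as np

x = np.array(range(10)).reshape(5, 2)  # shape (5, 2)
y = np.array([100, 200])               # shape (2,)

result = x + y  # y is added to every row of x

print(result.shape)  # (5, 2)
print(result[0])     # [100 201]
print(result[4])     # [108 209]
```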
Broadcasting saves us a lot of extra work!
Likewise, let’s see an example of column-wise addition.
We create two new arrays below as examples.
They are of different shapes (5, 2) and (5, 1).
But again, we can add them together without further transformation. As you can guess, array y is added to each column of array x.
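A sketch of the column-wise case. The values are illustrative; only the shapes (5, 2) and (5, 1) come from the text.

```python
import numpy as np

x = np.array(range(10)).reshape(5, 2)          # shape (5, 2)
y = np.array([[10], [20], [30], [40], [50]])   # shape (5, 1)

result = x + y  # y is added to every column of x

print(result.shape)  # (5, 2)
print(result[0])     # [10 11]
print(result[4])     # [58 59]
```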
Random Number Generation
Besides array operations, another powerful data science feature of NumPy is its random module. We often use it to generate pseudo-random numbers, which can be used for simulations, sample data creation, and random sampling.
For example, random.uniform returns a random array of a specified size with elements uniformly distributed over the interval [low, high). random.rand creates an array of the given shape and populates it with random samples from a uniform distribution over [0, 1).
Some examples are shown below.
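A sketch of both functions. The bounds and shapes are arbitrary illustrations, and the printed values differ on every run.

```python
import numpy as np

# Five samples from the uniform distribution over [0, 10).
u = np.random.uniform(low=0, high=10, size=5)

# A 2-by-3 array of samples from the uniform distribution over [0, 1).
r = np.random.rand(2, 3)

print(u)
print(r)
```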
Similarly, random.normal draws an array whose elements are random samples from a normal (Gaussian) distribution. The loc parameter is the mean, while scale is the standard deviation of the distribution.
random.randn draws a random array with elements from the standard normal distribution (the normal distribution with mean 0 and variance 1).
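A sketch of both calls. The mean of 10, standard deviation of 2, and shapes are arbitrary illustrations; values differ on every run.

```python
import numpy as np

# Samples from a normal distribution with mean 10 and standard deviation 2.
n = np.random.normal(loc=10, scale=2, size=(2, 3))

# Samples from the standard normal distribution, shape (2, 3).
s = np.random.randn(2, 3)

print(n)
print(s)
```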
Besides drawing numbers from statistical distributions, we can also draw random integers from a “discrete uniform” distribution.
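One function for this is random.randint, sketched below with arbitrary bounds. The low bound is included and the high bound is excluded.

```python
import numpy as np

# Five random integers from 0 up to, but not including, 10.
ints = np.random.randint(low=0, high=10, size=5)

print(ints)
```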
So far, we’ve been drawing random numbers. What if we want a random sample from an existing set?
We use random.choice to generate a random sample from a given 1-D array.
For example, we can generate a random sample of size five from the elements in choices.
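A sketch of random.choice. The contents of choices are hypothetical; by default the sample is drawn with replacement, so values may repeat.

```python
import numpy as np

# Hypothetical set of values to sample from.
choices = np.array([10, 20, 30, 40, 50, 60])

sample = np.random.choice(choices, size=5)

print(sample)
```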
That’s it for random numbers in NumPy. Before we finish this tutorial, let’s see a few more useful functions.
Arange and linspace
We’ve been using the range function a lot in Python. In NumPy, we have a similar function – the arange function.
arange returns evenly spaced values within a given interval [start, stop). For integer inputs, the function is equivalent to the Python built-in range function, but returns a NumPy array rather than a range object.
So why do we need arange?
One of the improvements of arange over range is the support for floating-point values. If we try to create a range sequence with decimal start and stop positions, Python raises an error. But arange can handle that.
For example, the range code below raises an error, while arange creates an array of floating-point values.
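A sketch of the comparison, with arbitrary start and stop values. The range call is wrapped in try/except so the example runs to the end.

```python
import numpy as np

# range only accepts integers, so this raises a TypeError.
try:
    range(0.5, 5.5)
    range_failed = False
except TypeError as err:
    range_failed = True
    print("range failed:", err)

# arange accepts floating-point start and stop (default step is 1).
x = np.arange(0.5, 5.5)
print(x)  # [0.5 1.5 2.5 3.5 4.5]
```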
An alternative to arange is numpy.linspace. linspace creates an array of evenly spaced numbers over a specified interval. For example, we can generate num = 101 equally spaced samples ranging from 0 to 10.
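A minimal sketch of this linspace call. By default the stop value is included, so the spacing is 0.1.

```python
import numpy as np

x = np.linspace(0, 10, num=101)

print(len(x))       # 101
print(x[0], x[-1])  # 0.0 10.0
```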
Min and Max Index
Previously, we learned how to get the minimum and maximum elements from the array.
What if we want to know their index positions?
For example, say we have a simple array x.
We can use argmin/argmax and the format methods to print out the min and max elements’ indexes and values.
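A sketch of the lookup, assuming x is a random array of ten values from the standard normal distribution. The output differs on every run, so it will not match the sample output shown in this section.

```python
import numpy as np

x = np.random.randn(10)  # assumed example array

print("The index of the minimum value: {} The value of the minimum: {}".format(
    x.argmin(), x.min()))
print("The index of the maximum value: {} The value of the maximum: {}".format(
    x.argmax(), x.max()))
```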
The index of the minimum value: 5 The value of the minimum: -1.6087688672846108
The index of the maximum value: 8 The value of the maximum: 0.9493540484559849
What if we only want to know the index of the min and max of certain dimensions?
By default, the index given by argmin/argmax is into the flattened array. But we can specify the axis: for 2D arrays, axis = 0 searches down the rows and returns one row index per column, while axis = 1 searches across the columns and returns one column index per row.
Let’s see an example.
For this random array, we can get the row indices of the minimum column values or the other way around.
For instance, when axis = 0, the indices of the minimum column values are 4, 4, 3. The minimum value is -1.02420395 with index position 4 for the first column of the array, and so on.
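A sketch of argmin along each axis. The shape (5, 3) is an assumption (three columns, as in the text), and the values differ on every run.

```python
import numpy as np

x = np.random.randn(5, 3)  # assumed shape: 5 rows, 3 columns

row_idx = x.argmin(axis=0)  # for each column, the row index of its minimum
col_idx = x.argmin(axis=1)  # for each row, the column index of its minimum

print(row_idx.shape)  # (3,)
print(col_idx.shape)  # (5,)

# Looking the indices up again recovers the column minimums.
print(np.array_equal(x[row_idx, np.arange(3)], x.min(axis=0)))  # True
```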
Similar to other Python objects, we can also sort the arrays.
We’ll show a quick example in this tutorial.
For the above random array, we can sort it in ascending order as below.
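A sketch of sorting with np.sort. A small fixed array is used here instead of a random one (an assumption) so that the results can be shown. For a 2D array, np.sort works along the last axis by default.

```python
import numpy as np

x = np.array([[3, 1, 2], [0, 7, -1]])  # illustrative values

print(np.sort(x).tolist())             # [[1, 2, 3], [-1, 0, 7]]  (each row)
print(np.sort(x, axis=0).tolist())     # [[0, 1, -1], [3, 7, 2]]  (each column)
print(np.sort(x, axis=None).tolist())  # [-1, 0, 1, 2, 3, 7]      (flattened)
```

np.sort returns a sorted copy, while the sort method of an array sorts it in place.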
The last thing we want to cover about NumPy arrays is copying.
Say we have an array x. Let’s create two arrays from it: y by assigning directly, and z by using the copy method.
If we change an element of x from 1 to 10, the same element of y changes to 10 as well, while the element of the copy z stays at 1.
So when we don’t want the two arrays to link to each other, we have to use the copy method.
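A sketch of the difference. The array values are illustrative, chosen so that the changed element starts at 1.

```python
import numpy as np

x = np.array([1, 2, 3])  # illustrative values

y = x         # assignment: y is another name for the same array
z = x.copy()  # copy: z has its own data

x[0] = 10

print(y[0])    # 10  (y follows x)
print(z[0])    # 1   (z is independent)
print(y is x)  # True
```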
That’s it! Well done!
In this tutorial, you’ve learned a lot about the Python NumPy library, the foundation of data science and machine learning. You should now be able to use NumPy n-dimensional arrays and their basic operations for data science.
We’ll be writing a quickstart tutorial for the Python pandas library as well. Stay tuned!