11 Python packages you should learn as a data scientist
Data scientists perform a large variety of tasks on a daily basis — data collection, pre-processing, analysis, machine learning, and visualization.
If you are a beginner in the data science industry, you might have taken a course in Python or R, and understand the basics of the data science life-cycle.
However, when you try to experiment with datasets on Kaggle on your own, you might find it difficult because you don’t know where to start.
You don’t know the right tools to use for the tasks you want to perform.
In this article, I will walk you through some of the most important Python libraries for data science. There is a library out there for almost every task you want to perform, and I will break some of them down here.
These are libraries that most data scientists use on a daily basis (I use most of them at my data science job), so it is important you know how to work with them.
I will also list some free resources (videos, articles, and tutorials) for you to go through, in order to get hands-on experience with these libraries.
The first step to solving any data science problem is data collection. Sometimes, this data will be available in the form of an SQL database or an Excel sheet. At other times, you will need to extract data yourself — either by using APIs or web scraping.
Below, I will list some of the most common data collection libraries in Python. I use these libraries very often depending on the type of data I need to collect, and they have made my data science workflow a lot easier.
If the data you need to extract is in the form of an SQL database, you will need to load the database into Python before pre-processing and analyzing it.
MySQLConnector (installed as the mysql-connector-python package) is a library that allows you to connect to a MySQL database from Python.
You can load database tables easily with the help of this library, then convert the tables into Pandas data frames to perform further data manipulation.
You can also create databases and write to them with the help of this library.
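As a rough sketch of the table-to-data-frame step described above: the helper below runs a query through a standard database cursor and builds a Pandas data frame from the rows. The function names (`query_to_frame`, `load_from_mysql`) and the `customers` table are made up for illustration. Because a MySQL server may not be at hand, the demo at the bottom uses Python's built-in sqlite3 module as a stand-in; it exposes the same cursor interface, but it is not MySQL.

```python
import sqlite3  # stand-in database, used only for the offline demo below

import pandas as pd


def query_to_frame(connection, query):
    """Run a query on a DB-API connection and return the rows as a DataFrame."""
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        rows = cursor.fetchall()
        # cursor.description has one entry per column; its first field is the name.
        columns = [col[0] for col in cursor.description]
    finally:
        cursor.close()
    return pd.DataFrame(rows, columns=columns)


def load_from_mysql(query, **connect_kwargs):
    """Hypothetical helper: connect to MySQL, fetch the result, then disconnect."""
    import mysql.connector  # provided by the mysql-connector-python package

    connection = mysql.connector.connect(**connect_kwargs)
    try:
        return query_to_frame(connection, query)
    finally:
        connection.close()


# Offline demo with an in-memory SQLite database and a made-up table.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
demo.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ana"), (2, "Ben")])
customers = query_to_frame(demo, "SELECT id, name FROM customers ORDER BY id")
demo.close()
print(customers)
```

Against a real server, the call would look something like `load_from_mysql("SELECT * FROM customers", host="localhost", user="my_user", password="my_password", database="shop")`, where every connection value is a placeholder.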
Get started with MySQLConnector:
Companies often depend on external data when making business decisions — they might want to compare prices of competitor products, analyze competitor brand reviews, etc.
BeautifulSoup is a Python library that parses HTML and XML, which lets you pull data out of a web page once you have downloaded it (for example, with the requests library).
Here is a tutorial to help you get started with BeautifulSoup in Python.
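To give a feel for what scraping with BeautifulSoup looks like, here is a minimal sketch that extracts product names and prices. The HTML snippet, class names, and products are invented for the example; a real page would first need to be downloaded and would have its own structure.

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment standing in for a downloaded competitor page.
html = """
<ul id="products">
  <li class="product"><span class="name">Kettle</span><span class="price">$24.99</span></li>
  <li class="product"><span class="name">Toaster</span><span class="price">$31.50</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

products = []
for item in soup.select("li.product"):  # CSS selector: every <li class="product">
    name = item.select_one(".name").get_text(strip=True)
    price_text = item.select_one(".price").get_text(strip=True)
    products.append({"name": name, "price": float(price_text.lstrip("$"))})

print(products)
```

A list of dictionaries like `products` can be passed straight to `pandas.DataFrame` for further analysis.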
Social media APIs
Social media platforms like Twitter, Facebook, and Instagram generate large amounts of data on a daily basis.
This data can be useful for many data science projects, such as:
Company A has just released a product and come up with a special discount. How are their customers responding to the product and this discount? Are people talking more about the brand than usual? Is the promotion driving higher brand awareness? How good is the overall product sentiment when compared to competitor brands?
It is difficult for a company to gauge things like overall brand sentiment (on a large scale) solely with internal data.
Social media analysis plays a huge role in collecting data for tasks like churn prediction and customer segmentation.
And it really isn’t difficult to collect data from social platforms, since there are a lot of publicly available APIs that can help you do this quickly. Some of them include:
- Twitter: Tweepy, Twint
- Facebook: Python-Facebook-API
- YouTube: Python-YouTube
Here are some tutorials to help you get started:
Here is an example of a sentiment analysis project I created with a Twitter API.
Real world data is dirty. It doesn’t always come in the format of an Excel sheet or a .csv file. It could come in the format of an SQL database, text file, JSON dictionary, or even a PDF file.
As a data scientist, a huge portion of your time will be dedicated to creating data frames, cleaning them, and merging them together.
Some Python libraries that can help with data preparation include:
NumPy is a package that allows you to perform operations quickly on large amounts of data.
You can convert data frames into arrays, manipulate matrices, and easily find basic statistics (like the median or standard deviation) of a dataset with the help of NumPy.
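The operations mentioned above might look like this in practice. This is a small sketch on invented measurements; the column names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Made-up measurements for five people.
df = pd.DataFrame(
    {
        "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
        "weight_kg": [50.0, 60.0, 70.0, 80.0, 90.0],
    }
)

values = df.to_numpy()  # data frame -> 2-D array of shape (5, 2)

medians = np.median(values, axis=0)  # median of each column
stds = np.std(values, axis=0)        # population standard deviation (ddof=0)
transposed = values.T                # simple matrix manipulation: shape (2, 5)

print(medians)
print(stds)
print(transposed.shape)
```

Note that `np.std` divides by n by default, whereas Pandas' `std` divides by n - 1, so the two can give slightly different numbers on the same column.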
Some tutorials to help you get started with NumPy:
Pandas is one of the most popular and widely used Python packages for data science.
You can easily read different file types and create data frames with the help of Pandas. Then, you can create functions to pre-process this data really quickly — you can clean the data frame, remove missing/invalid values, and perform data scaling/standardization.
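As an illustration of that read-then-clean workflow, here is a minimal sketch. The CSV content is invented and read from a string so the example is self-contained; `preprocess` is a hypothetical helper name, and dropping rows with missing values is only one of several reasonable cleaning choices.

```python
import io

import pandas as pd

# Made-up CSV data with two missing values.
csv_text = "name,age,income\nAna,34,52000\nBen,,48000\nCara,29,\nDev,41,61000\n"
raw = pd.read_csv(io.StringIO(csv_text))


def preprocess(frame):
    """Drop rows with missing values, then standardize the numeric columns."""
    cleaned = frame.dropna().copy()
    numeric = cleaned.select_dtypes(include="number").columns
    # Standardize: subtract the column mean and divide by the standard deviation.
    cleaned[numeric] = (cleaned[numeric] - cleaned[numeric].mean()) / cleaned[numeric].std()
    return cleaned


cleaned = preprocess(raw)
print(cleaned)
```

With a real file, `pd.read_csv("your_file.csv")` would replace the `io.StringIO` call, and Pandas has similar readers for other formats such as `pd.read_excel` and `pd.read_json`.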
To learn Pandas, you can take the following tutorials:
Have you ever encountered invalid values, weird symbols, or whitespaces when working with Pandas data frames?
Although RegEx isn’t built specifically for data scientists, I’m adding it to this list because it is incredibly useful.
You can use RegEx (regular expressions) to match patterns of characters within data. In Python, regular expressions are available through the built-in re module, and Pandas string methods accept them too. You can use them to find rows of data that satisfy a certain condition, or to pre-process data and remove invalid values that don’t match a specific format.
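Here is a short sketch of both uses on a Pandas column: stripping stray whitespace, then keeping only the rows whose value matches a pattern. The email addresses are invented, and the pattern is a deliberately loose assumption for illustration, not a complete email validator.

```python
import pandas as pd

# Made-up column with whitespace and invalid entries.
df = pd.DataFrame(
    {"email": ["  ana@example.com ", "not-an-email", "ben@example.org", "cara@@example"]}
)

# Pre-process: remove leading and trailing whitespace.
df["email"] = df["email"].str.strip()

# Loose pattern: something@something.something, with no spaces or extra "@".
pattern = r"[^@\s]+@[^@\s]+\.[^@\s]+"

# Keep only the rows whose whole value matches the pattern.
valid = df[df["email"].str.fullmatch(pattern)]
print(valid)
```

`str.fullmatch` requires the entire value to match; `str.contains` would instead keep rows where the pattern appears anywhere in the value.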
Some tutorials to start using RegEx with Pandas data frames:
The most important library for data analysis is Pandas. I’ve explained the use of Pandas for data pre-processing above, so I will now go through one of the best companion packages for data analysis with Pandas:
Pandas-profiling is an incredibly useful package for data analysis. It is installed separately from Pandas, and recent releases are published under the name ydata-profiling. Once you run pandas-profiling on a data frame, it generates a report with summary statistics of the data.
It also describes each variable, including its distribution and cardinality, and shows the correlations between variables.
To learn more about Pandas profiling, read this article.
Another crucial part of any data science project is visualization. It is important to visualize the spread of variables, check their skewness, and understand the relationship between them.
Seaborn is a library you can use for this purpose. It is quick to import and you can make charts easily, with only one or two lines of code.
Here are some learning resources to help you get started with Seaborn:
Here is a data visualization tutorial I created in Seaborn:
Plotly is another visualization library I’m adding to this list. With Plotly, you can make beautiful, interactive visualizations.
It takes slightly more code and a bit more effort to customize Plotly visualizations.
I generally use Seaborn if I want to quickly check the distribution of variables or the relationship between them. I use Plotly if I need to present visualizations to others, because Plotly’s charts are interactive and look nice.
Plotly also allows you to build interactive choropleth maps, which make it easy to plot location data. If you need to present data by region, country, or latitude/longitude, Plotly’s maps are a good way to do so.
Some learning resources to get started with Plotly:
The most popular libraries for machine learning in Python include:
Scikit-Learn is one of the most widely used Python libraries for machine learning. It allows you to train machine learning models like linear regression, logistic regression, and decision trees with just a few lines of code.
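To show how few lines that can be, here is a minimal linear regression sketch. The data is synthetic and follows y = 2x + 1 exactly, so the fitted coefficients are easy to check by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data that follows y = 2x + 1 exactly.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # features must be 2-D
y = 2.0 * X.ravel() + 1.0

model = LinearRegression().fit(X, y)

# Predict for a new, unseen input.
prediction = model.predict(np.array([[6.0]]))

print(model.coef_, model.intercept_, prediction)
```

Other estimators such as `LogisticRegression` and `DecisionTreeClassifier` follow the same `fit` and `predict` pattern.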
You can get started with Scikit-Learn with the help of this tutorial.
Here is a linear regression tutorial I created that uses Scikit-Learn to implement the model.
Statsmodels is another Python library you can use to build statistical and machine learning models.
I prefer using Statsmodels over Scikit-Learn because of the detailed summary it provides after fitting a model.
With just one line of code, we can take a look at metrics like the standard error, model coefficients for each variable, and p-values.
This tells us a lot about the fitted model at a glance.
To get started with Statsmodels, you can read the following articles:
A data scientist’s job doesn’t start and end with building machine learning models.
You should be able to pull data from different sources when required, and clean/analyze this data first before using it to build models. When working in the data industry, you should know how to perform an end-to-end data science workflow.
You need to know how to collect and pre-process data, analyze it, and build machine learning models.
The 11 packages listed above are some of the easiest ways to do this with Python.
Once you get familiar with some of these packages, you can start using them to build data science projects. This will help enhance your data science and programming skills.