How to select subsets of data In SQL Query Style in Pandas?

Pandas Server Side Programming Programming MySQL

Introduction

In this post, I will show you how to perform Data Analysis with SQL style filtering with Pandas. Most of the corporate company’s data are stored in databases that require SQL to retrieve and manipulate it. For instance, there are companies like Oracle, IBM, Microsoft having their own databases with their own SQL implementations.

Data scientists have to deal with SQL at some stage of their career as the data is not always stored in CSV files. I personally prefer to use Oracle, as the majority of my company’s data is stored in Oracle.

Scenario – 1 Suppose we are given a task to find all the movies from our movies dataset with below conditions.

The language of the movies should be either English(en) or Spanish(es).
The popularity of the movies must be between 500 and 1000.
The movie’s status must be released.
The vote count must be greater than 5000. For the above scenario, the SQL statement would look some thing like below.

SELECT
FROM WHERE
title AS movie_title
,original_language AS movie_language
,popularityAS movie_popularity
,statusAS movie_status
,vote_count AS movie_vote_count movies_data
original_languageIN ('en', 'es')

AND status=('Released')
AND popularitybetween 500 AND 1000
AND vote_count > 5000;

Now that you have seen the SQL for the requirement, let’s do this step by step using pandas. I will show you two methods.

Method 1:- Boolean Indexing

1. Load the movies_data dataset to DataFrame.

import pandas as pd movies = pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv")

Assign a variable for each condition.

languages = [ "en" , "es" ] condition_on_languages = movies . original_language . isin ( languages )
condition_on_status = movies . status == "Released"
condition_on_popularity = movies . popularity . between ( 500 , 1000 )
condition_on_votecount = movies . vote_count > 5000

3. Combine all the conditions(boolean arrays) together.

final_conditions = ( condition_on_languages & condition_on_status & condition_on_popularity & condition_on_votecount )
columns = [ "title" , "original_language" , "status" , "popularity" , "vote_count" ]
# clubbing all together movies . loc [ final_conditions , columns ]

title	original_language	status	popularity	vote_count
95 Interstellar	en	Released	724.247784	10867
788Deadpool	en	Released	514.569956	10995

Method 2:- .query() method.

The .query() method is a SQL where clause style way of filtering the data. The conditions can be passed as a string to this method, however, the column names must not contain any spaces.

If you have spaces in your column names, replace them with underscores using the python replace function.

From my experience I have seen query() method when applied on a larger DataFrame is faster than the previous method.

import pandas as pd movies = pd . read_csv ( "https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv" )

4.Build the query string and execute the method.

Note the .query method does not work with triple quoted strings spanning multiple lines.

final_conditions = (
"original_language in ['en','es']"
"and status == 'Released' "
"and popularity > 500 "
"and popularity < 1000"
"and vote_count > 5000"
) final_result = movies . query ( final_conditions )
final_result

	budget	id	original_language	original_title	popularity	release_date	revenue	runtime	st
95	165000000	157336	en	Interstellar	724.247784	5/11/2014	675120017	169.0	Rele
788	58000000	293660	en	Deadpool	514.569956	9/02/2016	783112979	108.0	Rele

There is more, often in my coding, I have multiple values to check in my “in” clause. So the above syntax is not ideal to work with. It is possible to reference Python variables using the at symbol (@).

You can also programmatically create the values as a python List and use them with (@).

movie_languages = [ 'en' , 'es' ]
final_conditions = (
"original_language in @movie_languages "
"and status == 'Released' "
"and popularity > 500 "
"and popularity < 1000"
"and vote_count > 5000" )
final_result = movies . query ( final_conditions )
final_result

	budget	id	original_language	original_title	popularity	release_date	revenue	runtime	st
95	165000000	157336	en	Interstellar	724.247784	5/11/2014	675120017	169.0	Rele
788	58000000	293660	en	Deadpool	514.569956	9/02/2016	783112979	108.0	Rele

Kiran P

Updated on: 10-Nov-2020

279 Views

Kickstart Your Career

Get certified by completing the course

Get Started