How to select subsets of data In SQL Query Style in Pandas?


Introduction

In this post, I will show you how to perform Data Analysis with SQL style filtering with Pandas. Most of the corporate company’s data are stored in databases that require SQL to retrieve and manipulate it. For instance, there are companies like Oracle, IBM, Microsoft having their own databases with their own SQL implementations.

Data scientists have to deal with SQL at some stage of their career as the data is not always stored in CSV files. I personally prefer to use Oracle, as the majority of my company’s data is stored in Oracle.

Scenario – 1 Suppose we are given a task to find all the movies from our movies dataset with below conditions.

  • The language of the movies should be either English(en) or Spanish(es).
  • The popularity of the movies must be between 500 and 1000.
  • The movie’s status must be released.
  • The vote count must be greater than 5000. For the above scenario, the SQL statement would look some thing like below.
SELECT
FROM WHERE
title AS movie_title
,original_language AS movie_language
,popularityAS movie_popularity
,statusAS movie_status
,vote_count AS movie_vote_count movies_data
original_languageIN ('en', 'es')

AND status=('Released')
AND popularitybetween 500 AND 1000
AND vote_count > 5000;

Now that you have seen the SQL for the requirement, let’s do this step by step using pandas. I will show you two methods.

Method 1:- Boolean Indexing

1. Load the movies_data dataset to DataFrame.

import pandas as pd movies = pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv")

Assign a variable for each condition.

languages = [ "en" , "es" ] condition_on_languages = movies . original_language . isin ( languages )
condition_on_status = movies . status == "Released"
condition_on_popularity = movies . popularity . between ( 500 , 1000 )
condition_on_votecount = movies . vote_count > 5000

3. Combine all the conditions(boolean arrays) together.

final_conditions = ( condition_on_languages & condition_on_status & condition_on_popularity & condition_on_votecount )
columns = [ "title" , "original_language" , "status" , "popularity" , "vote_count" ]
# clubbing all together movies . loc [ final_conditions , columns ]


title
original_language
status
popularity
vote_count
95 Interstellar
en
Released
724.247784
10867
788Deadpool
en
Released
514.569956
10995


Method 2:- .query() method.

The .query() method is a SQL where clause style way of filtering the data. The conditions can be passed as a string to this method, however, the column names must not contain any spaces.

If you have spaces in your column names, replace them with underscores using the python replace function.

From my experience I have seen query() method when applied on a larger DataFrame is faster than the previous method.

import pandas as pd movies = pd . read_csv ( "https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv" )

4.Build the query string and execute the method.

Note the .query method does not work with triple quoted strings spanning multiple lines. 

final_conditions = (
"original_language in ['en','es']"
"and status == 'Released' "
"and popularity > 500 "
"and popularity < 1000"
"and vote_count > 5000"
) final_result = movies . query ( final_conditions )
final_result



budget
id
original_language
original_title
popularity
release_date
revenue
runtime
st
95
165000000
157336
en
Interstellar
724.247784
5/11/2014
675120017
169.0
Rele
788
58000000
293660
en
Deadpool
514.569956
9/02/2016
783112979
108.0
Rele


There is more, often in my coding, I have multiple values to check in my “in” clause. So the above syntax is not ideal to work with. It is possible to reference Python variables using the at symbol (@).

You can also programmatically create the values as a python List and use them with (@).

movie_languages = [ 'en' , 'es' ]
final_conditions = (
"original_language in @movie_languages "
"and status == 'Released' "
"and popularity > 500 "
"and popularity < 1000"
"and vote_count > 5000" )
final_result = movies . query ( final_conditions )
final_result



budget
id
original_language
original_title
popularity
release_date
revenue
runtime
st
95
165000000
157336
en
Interstellar
724.247784
5/11/2014
675120017
169.0
Rele
788
58000000
293660
en
Deadpool
514.569956
9/02/2016
783112979
108.0
Rele

Updated on: 10-Nov-2020

279 Views

Kickstart Your Career

Get certified by completing the course

Get Started
Advertisements