
Data Structure
Networking
RDBMS
Operating System
Java
MS Excel
iOS
HTML
CSS
Android
Python
C Programming
C++
C#
MongoDB
MySQL
Javascript
PHP
- Selected Reading
- UPSC IAS Exams Notes
- Developer's Best Practices
- Questions and Answers
- Effective Resume Writing
- HR Interview Questions
- Computer Glossary
- Who is Who
Select Subset of Data with Index Labels in Python Pandas
Introduction
Pandas have a dual selection capability to select the subset of data using the Index position or by using the Index labels. Inthis post, I will show you how to “Select a Subset Of Data Using Index Labels” using the index label.
Remember, Python dictionaries and lists are built-in data structures that select their data either by using the index label or byindex position. A dictionary’s key must be a string, integer, or tuple while a List must either use integers (the position) or sliceobjects for selection.
Pandas have .loc and.iloc attributes available to perform index operations in their own unique ways. ). With.iloc attribute,pandas select only by position and work similarly to Python lists. The .loc attribute selects only by index label, which is similarto how Python dictionaries work.
Select a Subset Of Data Using Index Labels with .loc[]
The loc and iloc attributes are available on both Series and DataFrame
1.Import the movies dataset with the title as index.
import pandas as pd movies = pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv", index_col="title", usecols=["title","budget","vote_average","vote_count"])
I always recommend sorting the index, especially if the index is made up of strings. You will notice the difference if youare dealing with a huge dataset when your index is sorted.
movies.sort_index(inplace = True) movies.head(3)
budget vote_average vote_count title ___________________________________ #Horror 1500000 3.3 52 (500) Days of Summer 7500000 7.2 2904 10 Cloverfield Lane 15000000 6.8 2468
I have sorted the index using sort_index and "inplace = True" parameter.
1. One thing interesting about the syntax of the loc method is that it does not take parenthesis() rather takes squarebrackets[]. I think (might be wrong) this is because they wanted consistency i.e. you can use [] on a Series to extractrows, while applied on a Dataframe will fetch you the columns.
# extract "Spider-Man 3" ( I'm not a big fan of spidy) movies.loc["Spider-Man 3"]
budget 258000000.0 vote_average 5.9 vote_count 3576.0 Name: Spider-Man 3, dtype: float64
1. Use a slice to pull out many values. I'm going to pull the movies which I haven't watched. Because this is a string labelwe are going to get all the data for our search criteria including "Avatar".
Remember - If you work with Python List the last value is excluded but since we are working with strings it's inclusive.
movies.loc["Alien":"Avatar"]
budget vote_average vote_count title Alien 11000000 7.9 4470 Alien Zone 0 4.0 3 Alien: Resurrection 70000000 5.9 1365 Aliens 18500000 7.7 3220 Aliens in the Attic 45000000 5.3 244 ... ... ... ... Australia 130000000 6.3 694 Auto Focus 7000000 6.1 56 Automata 7000000 5.6 670 Autumn in New York 65000000 5.7 135 Avatar 237000000 7.2 11800
167 rows × 3 columns
1. Can I get any two or more random movies that are not next to each other? Definitely yes, but you need to put more effortinto passing a list of movies you need.
What I meant was you need to have square brackets with in a square bracket.
movies.loc[["Avatar","Avengers: Age of Ultron"]]
budget vote_average vote_count title Avatar 237000000 7.2 11800 Avengers: Age of Ultron 280000000 7.3 6767
6. Can I change the order of selection? Of course, you can help yourself by specifying the list of labels you need in an order.
While this looks cool to specify the list of labels you want to extract, do you know what happens if you spelled a valuewrongly? Pandas would have stuck missing Values (NaN) for the wrongly spelled label. But those days are gone, with thelatest updates it raises an exception.
movies.loc[["Avengers: Age of Ultron","Avatar","When is Avengers next movie?"]]
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) <ipython-input-6-ebe975264840> in <module> ----> 1 movies.loc[["Avengers: Age of Ultron","Avatar","When is Avengers next movie?"]] ~\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__ (self, key) 1766 1767 maybe_callable = com.apply_if_callable( key,self.obj) -> 1768 return self._getitem_axis(maybe_callable,axis = axis) 1769 1770 def_is_scalar_access(self,key:Tuple): ~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis (self, key, axis) 1952 raiseValueError("Cannot index with multidimensional key") 1953 -> 1954 return self._getitem_iterable(key, axis=axis) 1955 1956 # nested tuple slicing ~\anaconda3\lib\site-packages\pandas\core\indexing.py in_getitem_iterable(self, key, axis) 1593 else: 1594 # A collection of keys -> 1595 keyarr,indexer=self._get_listlike_indexer(key,axis,raise_missing=False) 1596 return self.obj._reindex_with_indexers( 1597 {axis:[keyarr,indexer]},copy=True,allow_dups=True ~\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_listlike_indexer(self, key, axis, raise_missing) 1550 keyarr,indexer,new_indexer=ax._reindex_non_unique (keyarr) 1551 -> 1552 self._validate_read_indexer( 1553 keyarr,indexer,o._get_axis_number (axis),raise_missing=raise_missing 1554 ) ~\anaconda3\lib\site-packages\pandas\core\indexing.py in _validate_read_indexer (self, key, indexer, axis, raise_missing) 1652 # just raising 1653 ifnot(ax.is_categorical()orax.is_interval() ) : -> 1654 raise KeyError( 1655 "Passing list-likes to .loc or [] with any missing labels " 1656 "is no longer supported, see "
KeyError: 'Passing list-likes to .loc or [] with any missing labels is no longer supported, see https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike'
One way to take care is by checking the values in the index directly.
"When is Avengers next movie?"in movies.index
Output
False
If you want to ignore the error and move on you can use below approach
movies.query("title in ('Avatar','When is Avengers next Movie?')")
budget vote_average vote_count title Avatar 237000000 7.2 11800