Python Pandas - Working with Text Data



Pandas provides tools for working with text data through the .str accessor. The accessor applies string operations element by element to Series and Index objects, so text columns in a DataFrame can be processed without writing explicit loops.
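As a minimal sketch of the idea, the example below calls a few .str methods on a Series and on an Index. The sample values and variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
names = pd.Series(["alice", "BOB", "Charlie"])

# Each .str method is applied to every element and returns a new object.
upper_names = names.str.upper()   # ALICE, BOB, CHARLIE
name_lengths = names.str.len()    # 5, 3, 7

# The same accessor is available on Index objects.
labels = pd.Index(["  a", "b  "])
clean_labels = labels.str.strip()  # a, b

print(upper_names.tolist())
print(name_lengths.tolist())
print(list(clean_labels))
```

The original objects are left unchanged, so the result has to be assigned back if it should replace the old values.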

The .str accessor provides string methods for transformation, concatenation, searching, and many other operations. The sections below group these methods by functionality.

String Transformation

This category includes methods that transform the strings in some way, such as changing the case, formatting, or modifying specific characters.

Sr.No. Methods & Description
1

Series.str.capitalize()

Transforms the first character of each string in the Series or Index to uppercase and the rest to lowercase.

2

Series.str.casefold()

Converts each string to lowercase in a more aggressive manner suitable for case-insensitive comparisons.

3

Series.str.lower()

Converts all characters in each string of the Series or Index to lowercase.

4

Series.str.upper()

Converts all characters in each string of the Series or Index to uppercase.

5

Series.str.title()

Converts each string to titlecase, where the first character of each word is capitalized.

6

Series.str.swapcase()

Swaps the case of each character, converting uppercase characters to lowercase and vice versa.

7

Series.str.replace()

Replaces occurrences of a pattern or regular expression in each string with another string.
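The transformation methods above can be sketched as follows. The sample strings and variable names are assumptions chosen for the example, and the comments list the values each call is expected to produce.

```python
import pandas as pd

# Hypothetical sample data.
s = pd.Series(["hello world", "PANDAS text", "MiXeD Case"])

capitalized = s.str.capitalize()  # Hello world, Pandas text, Mixed case
titled = s.str.title()            # Hello World, Pandas Text, Mixed Case
swapped = s.str.swapcase()        # HELLO WORLD, pandas TEXT, mIxEd cASE

# replace() with a literal string (regex=False) and with a regular expression.
underscored = s.str.replace(" ", "_", regex=False)
no_vowels = s.str.replace(r"[aeiouAEIOU]", "", regex=True)

# casefold() is more aggressive than lower(): the German sharp s becomes "ss".
german = pd.Series(["Straße"])
lowered = german.str.lower()      # straße
folded = german.str.casefold()    # strasse

print(capitalized.tolist())
print(folded.tolist())
```

Passing regex explicitly to replace() avoids depending on the default, which has changed between pandas versions.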

String Trimming

This category includes methods that remove specified characters, or a specified prefix or suffix, from the ends of each string.

Sr.No. Methods & Description
1

Series.str.lstrip()

Removes leading characters (by default, whitespace) from each string.

2

Series.str.strip()

Removes leading and trailing characters (by default, whitespace) from each string.

3

Series.str.rstrip()

Removes trailing characters (by default, whitespace) from each string.

4

Series.str.removeprefix(prefix)

Removes the specified prefix from each string in the Series or Index, if it exists.

5

Series.str.removesuffix(suffix)

Removes the specified suffix from each string in the Series or Index, if it exists.
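A short sketch of the trimming methods follows. It assumes pandas 1.4 or later, where removeprefix() and removesuffix() are available, and the sample values are made up. It also shows one easy-to-miss difference: the strip family treats its argument as a set of characters, while removeprefix() removes one exact substring.

```python
import pandas as pd

# Hypothetical sample data.
padded = pd.Series(["  apple ", "banana  "])

both = padded.str.strip()    # "apple", "banana"
left = padded.str.lstrip()   # "apple ", "banana  "
right = padded.str.rstrip()  # "  apple", "banana"

# A custom set of characters can be stripped instead of whitespace.
stripped_x = pd.Series(["xxhixx"]).str.strip("x")  # "hi"

# removeprefix()/removesuffix() remove an exact substring, only if present.
ids = pd.Series(["id_001", "id_dice", "003"])
without_prefix = ids.str.removeprefix("id_")  # "001", "dice", "003"
files = pd.Series(["a.csv", "b.csv", "c.txt"])
without_suffix = files.str.removesuffix(".csv")  # "a", "b", "c.txt"

# lstrip("id_") strips any leading 'i', 'd', or '_' characters,
# so it removes more than the literal prefix in "id_dice".
char_stripped = ids.str.lstrip("id_")  # "001", "ce", "003"

print(without_prefix.tolist())
print(char_stripped.tolist())
```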

String Concatenation and Joining Methods

These methods allow you to combine multiple strings into one or join elements within strings using specified separators.

Sr.No. Methods & Description
1

Series.str.cat()

Concatenates strings in the Series or Index with an optional separator.

2

Series.str.join()

Joins the elements of the list stored in each entry of the Series or Index using the specified separator.
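The difference between the two methods can be sketched as follows: cat() combines strings across Series (or collapses one Series into a single string), while join() works inside each element, which is expected to hold a list of strings. The names and sample values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical sample data.
first = pd.Series(["Ada", "Alan"])
last = pd.Series(["Lovelace", "Turing"])

# Element-wise concatenation of two Series with a separator.
full_names = first.str.cat(last, sep=" ")  # "Ada Lovelace", "Alan Turing"

# Without another Series, cat() collapses the whole Series into one string.
all_first = first.str.cat(sep=", ")        # "Ada, Alan"

# join() combines the items of the list held in each element.
tags = pd.Series([["a", "b"], ["c"]])
joined_tags = tags.str.join("-")           # "a-b", "c"

print(full_names.tolist())
print(all_first)
print(joined_tags.tolist())
```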

String Padding Methods

This category includes methods to pad strings to a specific length or align them within a specified width.

Sr.No. Methods & Description
1

Series.str.center()

Centers each string in the Series or Index within a specified width, padding with a character.

2

Series.str.pad()

Pads each string in the Series or Index to a specified width, with an option to pad from the left, right, or both sides.

3

Series.str.ljust()

Pads the right side of each string in the Series or Index with a specified character to reach the specified width.

4

Series.str.rjust()

Pads the left side of each string in the Series or Index with a specified character to reach the specified width.

5

Series.str.zfill()

Pads each string in the Series or Index with zeros on the left, up to the specified width.
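The padding methods can be compared side by side on one small Series. The values below are made up for illustration. Strings that already reach the requested width are returned unchanged.

```python
import pandas as pd

# Hypothetical sample data: strings of length 1, 3, and 5.
codes = pd.Series(["7", "123", "45678"])

zero_filled = codes.str.zfill(5)          # 00007, 00123, 45678
right_aligned = codes.str.rjust(5, "*")   # ****7, **123, 45678
left_aligned = codes.str.ljust(5, ".")    # 7...., 123.., 45678
centered = codes.str.center(5, "-")       # --7--, -123-, 45678

# pad() covers the same cases through its side argument.
padded_left = codes.str.pad(5, side="left", fillchar="_")  # ____7, __123, 45678

print(zero_filled.tolist())
print(centered.tolist())
```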

String Searching Methods

These methods help you locate substrings, count occurrences, or check for patterns within the text.

Sr.No. Methods & Description
1

Series.str.contains()

Checks whether each string in the Series or Index contains a specified pattern.

2

Series.str.count()

Counts occurrences of a pattern or regular expression in each string of the Series or Index.

3

Series.str.find()

Finds the lowest index of a substring in each string of the Series or Index, returning -1 where the substring is not found.

4

Series.str.rfind()

Finds the highest index of a substring in each string of the Series or Index, returning -1 where the substring is not found.

5

Series.str.index()

Similar to find(), but raises an exception if the substring is not found.

6

Series.str.rindex()

Similar to rfind(), but raises an exception if the substring is not found.

7

Series.str.match()

Checks for a match only at the beginning of each string.

8

Series.str.fullmatch()

Checks for a match across the entire string.

9

Series.str.extract()

Extracts the capture groups of the first regular-expression match in each string, returning them as columns of a DataFrame by default.

10

Series.str.extractall()

Extracts the capture groups of every regular-expression match in each string, returning a DataFrame with one row per match.
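The searching methods can be sketched on a small, made-up Series of dessert names. The patterns and variable names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical sample data.
desserts = pd.Series(["apple pie", "banana split", "cherry tart"])

has_an = desserts.str.contains("an", regex=False)  # False, True, False
a_count = desserts.str.count("a")                  # 1, 3, 1
first_e = desserts.str.find("e")                   # 4, -1, 2
last_e = desserts.str.rfind("e")                   # 8, -1, 2

# match() anchors at the start; fullmatch() must cover the whole string.
starts_a_or_b = desserts.str.match(r"[ab]")        # True, True, False
is_pie = desserts.str.fullmatch(r"[a-z]+ pie")     # True, False, False

# index() raises ValueError because "banana split" contains no "e".
try:
    desserts.str.index("e")
except ValueError:
    index_failed = True
else:
    index_failed = False

# extract() returns the groups of the first match, one column per group.
parts = desserts.str.extract(r"(?P<fruit>\w+) (?P<dish>\w+)")

# extractall() returns one row per match, indexed by (row, match number).
vowels = desserts.str.extractall(r"(?P<vowel>[aeiou])")

print(parts)
print(vowels.loc[0, "vowel"].tolist())
```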

String Splitting Methods

Splitting methods divide strings based on a delimiter or pattern, which is useful for parsing text data into separate components.

Sr.No. Methods & Description
1

Series.str.split()

Splits each string in the Series or Index by the specified delimiter or regular expression, and returns a list of strings.

2

Series.str.rsplit()

Splits each string in the Series or Index by the specified delimiter or regular expression, starting from the right side, and returns a list of strings.

3

Series.str.partition()

Splits each string at the first occurrence of the delimiter into three parts: the part before the delimiter, the delimiter itself, and the part after the delimiter. By default the parts are returned as three columns of a DataFrame; with expand=False they are returned as a tuple per string.

4

Series.str.rpartition()

Splits each string at the last occurrence of the delimiter into three parts: the part before the delimiter, the delimiter itself, and the part after the delimiter. By default the parts are returned as three columns of a DataFrame; with expand=False they are returned as a tuple per string.
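The following sketch shows the splitting methods on a made-up Series of comma-separated values. In pandas, partition() and rpartition() return a DataFrame unless expand=False is passed, and split() returns lists unless expand=True is passed.

```python
import pandas as pd

# Hypothetical sample data.
rows = pd.Series(["a,b,c", "d,e", "f"])

# split() returns a list per element by default.
pieces = rows.str.split(",")            # [a, b, c], [d, e], [f]

# expand=True spreads the pieces over DataFrame columns instead.
wide = rows.str.split(",", expand=True)

# rsplit() with n=1 splits only once, starting from the right.
last_split = rows.str.rsplit(",", n=1)  # [a,b, c], [d, e], [f]

# partition()/rpartition() always produce three parts per string.
head = rows.str.partition(",")          # columns 0, 1, 2
tail = rows.str.rpartition(",")
as_tuples = rows.str.partition(",", expand=False)

print([list(p) for p in pieces])
print(head.iloc[0].tolist())   # ['a', ',', 'b,c']
print(tail.iloc[2].tolist())   # ['', '', 'f']
```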

String Filtering Methods

These methods test the character content of each string or extract elements from it, which is useful for filtering rows and cleaning text data.

Sr.No. Methods & Description
1

Series.str.filter()

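Returns elements for which a provided function evaluates to true. Note that the pandas .str accessor does not define a filter() method; the same result is usually obtained by boolean indexing with a mask, such as one produced by Series.str.contains().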

2

Series.str.get()

Extracts the element at the specified position from each string, list, or tuple in the Series or Index.

3

Series.str.get_dummies()

Splits each string in the Series by the specified delimiter and returns a DataFrame of dummy/indicator variables.

4

Series.str.isalpha()

Checks whether each string consists only of alphabetic characters.

5

Series.str.isdigit()

Checks whether each string consists only of digits.

6

Series.str.isnumeric()

Checks whether each string consists only of numeric characters.

7

Series.str.isspace()

Checks whether each string consists only of whitespace.

8

Series.str.isupper()

Checks whether all characters in each string are uppercase.

9

Series.str.islower()

Checks if all characters in each string are lowercase.

10

Series.str.isalnum()

Checks if all characters in each string are alphanumeric (letters and digits).

11

Series.str.istitle()

Checks if each string in the Series or Index is in title case, where each word starts with a capital letter.

12

Series.str.isdecimal()

Checks if all characters in each string are decimal characters.

13

Series.str.len()

Computes the length of each string in the Series or Index.

14

Series.str.findall()

Finds all occurrences of a pattern or regular expression in each string.
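A few of the checks above can be sketched as follows, using made-up sample values. The boolean results are commonly used as masks to select rows, which is shown at the end.

```python
import pandas as pd

# Hypothetical sample data.
s = pd.Series(["abc", "123", "abc123", "   "])

is_alpha = s.str.isalpha()   # True, False, False, False
is_digit = s.str.isdigit()   # False, True, False, False
is_alnum = s.str.isalnum()   # True, True, True, False
is_space = s.str.isspace()   # False, False, False, True
lengths = s.str.len()        # 3, 3, 6, 3
first_char = s.str.get(0)    # "a", "1", "a", " "
digits = s.str.findall(r"\d")  # [], [1, 2, 3], [1, 2, 3], []

# Boolean results work as masks for selecting elements.
only_digits = s[s.str.isdigit()]  # keeps "123"

# get_dummies() builds one indicator column per distinct token.
labels = pd.Series(["a|b", "b|c", "a"])
dummies = labels.str.get_dummies(sep="|")

print(only_digits.tolist())
print(dummies)
```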

Miscellaneous Methods

This category includes methods that perform a variety of other operations on strings, such as encoding, decoding, slicing, repeating, and checking how a string starts or ends.

Sr.No. Methods & Description
1

Series.str.encode()

Encodes each string using the specified encoding.

2

Series.str.decode()

Decodes each bytes object in the Series or Index into a string using the specified encoding.

3

Series.str.expandtabs()

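Expands tab characters ('\t') into spaces. This mirrors Python's built-in str.expandtabs(); where the .str accessor does not expose it, Series.str.wrap() accepts an expand_tabs option, or str.expandtabs can be applied per element with Series.map().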

4

Series.str.repeat()

Repeats each string in the Series or Index the specified number of times.

5

Series.str.slice_replace()

Replaces a slice in each string with a passed replacement.

6

Series.str.translate()

Maps each character in the string through a translation table.

7

Series.str.slice()

Extracts a substring from each string in the Series or Index using start, stop, and step positions.

8

Series.str.startswith()

Checks whether each string in the Series or Index starts with a specified prefix. The prefix is matched literally, not as a regular expression.

9

Series.str.endswith()

Checks whether each string in the Series or Index ends with a specified suffix. The suffix is matched literally, not as a regular expression.

10

Series.str.normalize()

Normalizes the Unicode representation of each string in the Series or Index to the specified normalization form.

11

Series.str.wrap()

Wraps each string in the Series or Index to the specified line width, breaking lines as needed.
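Several of the miscellaneous methods can be sketched together. The sample strings, the translation table, and the variable names are assumptions made for this example.

```python
import pandas as pd

# Hypothetical sample data.
words = pd.Series(["abc", "hello"])

repeated = words.str.repeat(2)                # abcabc, hellohello
first_two = words.str.slice(0, 2)             # ab, he
patched = words.str.slice_replace(0, 1, "X")  # Xbc, Xello
starts_a = words.str.startswith("a")          # True, False
ends_o = words.str.endswith("o")              # False, True

# encode() produces bytes objects; decode() turns bytes back into strings.
encoded = words.str.encode("utf-8")
decoded = encoded.str.decode("utf-8")

# translate() maps characters through a table built with str.maketrans().
table = str.maketrans({"a": "4", "e": "3"})
leet = words.str.translate(table)             # 4bc, h3llo

# normalize() composes "e" + combining acute accent into one code point.
composed = pd.Series(["e\u0301"]).str.normalize("NFC")

# wrap() inserts newlines so that no line exceeds the given width.
wrapped = pd.Series(["the quick brown fox"]).str.wrap(10)

print(leet.tolist())
print(wrapped.tolist())
```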
