Regular Expressions in Python

A regular expression, usually shortened to RegEx or regex, is a search pattern made up of a sequence of characters. You use one to check whether a string contains the pattern you supply.

Module for RegEx

Python ships with a built-in module called re for working with regular expressions. Import it as follows.

import re

RegEx Functions

The re module provides several functions for searching a string for a match. The following are the ones you are likely to use most often:

findall

The findall function returns a list containing all matches. Consider the following example.

import re

the_string = "Codeunderscored rains in the development industry."
the_result = re.findall("ai", the_string)
print(the_result)

The matches are listed in the order in which they were found. If there are no matches, an empty list is returned. For example,

import re

the_string = "Codeunderscored rains in the development industry."
the_result = re.findall("Google", the_string)
print(the_result)
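
Because findall always returns a list, the result can be tested directly for truthiness. The short sketch below reuses the example string; the helper name report_matches is made up for illustration and is not part of the re module.

```python
import re

the_string = "Codeunderscored rains in the development industry."


def report_matches(pattern, text):
    """Describe what findall found (illustrative helper, not part of re)."""
    matches = re.findall(pattern, text)
    if matches:  # a non-empty list is truthy
        return f"{len(matches)} match(es): {matches}"
    return "no matches"  # findall returned an empty list


print(report_matches("ai", the_string))
print(report_matches("Google", the_string))
```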

search

The search function, on the other hand, returns a Match object if the pattern matches anywhere in the string. If there are multiple matches, only the first one is returned. The following example finds the first white-space character in the string:

import re

the_string = "Codeunderscored rains in the development industry."
the_result = re.search(r"\s", the_string)
print("The first white-space character was found at position:", the_result.start())

If there are no matches, None is returned, as in the following example.

import re

the_string = "Codeunderscored rains in the development industry."
the_result = re.search("Google", the_string)
print(the_result)
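
Since search may return None, calling .start() on the result without checking it first can raise an AttributeError. A minimal sketch of a guard follows; first_position is a hypothetical helper name used only for this illustration.

```python
import re

the_string = "Codeunderscored rains in the development industry."


def first_position(pattern, text):
    """Return the index of the first match, or None when nothing matches."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.start()


print(first_position(r"\s", the_string))     # index of the first white-space
print(first_position("Google", the_string))  # no match, so None
```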

split

The split function returns a list in which the string has been split at every match. In the following example, we split the string at each white-space character.

import re

the_string = "Codeunderscored rains in the development industry."
the_result = re.split(r"\s", the_string)
print(the_result)

The maxsplit parameter limits the number of splits. In the following example, the string is split only at the first two white-space characters, and the remainder is returned as the last item.

import re

the_string = "Codeunderscored rains in the development industry."
the_result = re.split(r"\s", the_string, maxsplit=2)
print(the_result)
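
When the text contains runs of spaces or tabs, splitting on a single white-space character leaves empty strings in the result. The sketch below compares \s with \s+ and passes maxsplit by keyword; messy_string is an invented sample, not taken from the examples above.

```python
import re

messy_string = "Codeunderscored   rains in\tthe development industry."

# \s splits at every single white-space character, leaving empty strings
single = re.split(r"\s", messy_string)

# \s+ treats each run of white-space as one separator
runs = re.split(r"\s+", messy_string)

# maxsplit by keyword: split at most twice and keep the rest intact
limited = re.split(r"\s+", messy_string, maxsplit=2)

print(single)
print(runs)
print(limited)
```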

sub

The sub function is the last one we will consider in this article, but it is as important as the others. It replaces one or more matches with a given string.

Let us consider the following example, which replaces every white-space character with the string __replaced__:

import re

the_string = "Codeunderscored rains in the development industry."
the_result = re.sub(r"\s", "__replaced__", the_string)
print(the_result)

The count parameter can be used to adjust the number of replacements; for instance, you may only need to replace the first occurrence, as shown below.

import re

the_string = "Codeunderscored rains in the development industry."
the_result = re.sub(r"\s", "__replaced__", the_string, count=1)
print(the_result)
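
As a further illustration of count, the sketch below replaces only the first two white-space characters and then all of them. Passing count by keyword keeps the call readable; a count of 0, the default, means "replace every match".

```python
import re

the_string = "Codeunderscored rains in the development industry."

# Replace only the first two white-space characters
first_two = re.sub(r"\s", "_", the_string, count=2)

# count=0 is the default and replaces every match
all_of_them = re.sub(r"\s", "_", the_string, count=0)

print(first_two)
print(all_of_them)
```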

Using a Phone Number to explore regular expressions

Regular expressions, often known as regex, are widely used to parse data. Before we go into how to do that, let’s look at a real-world example that uses US phone numbers. The following are all valid ways of writing the same phone number:

  • 1-333-234-0007
  • +1(333)- 234-0007
  • +13332340007
  • +1-333-234-0007
  • 333-234-0007

All of these are the same number, only formatted differently. So, how would we search a whole document for every possible phone number format?
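
One way to check that the five formats above describe the same number is to strip every non-digit character and compare what remains. The sketch below does this with re.sub; the helper last_ten_digits is a name made up for this example, and it assumes a 10-digit national number with an optional leading country code.

```python
import re

phone_formats = [
    "1-333-234-0007",
    "+1(333)- 234-0007",
    "+13332340007",
    "+1-333-234-0007",
    "333-234-0007",
]


def last_ten_digits(phone):
    """Drop every non-digit, then keep the 10-digit national number."""
    digits_only = re.sub(r"\D", "", phone)  # \D matches any non-digit
    return digits_only[-10:]  # ignore the country code when present


normalized = {last_ten_digits(phone) for phone in phone_formats}
print(normalized)
```

The set collapses to a single value, which confirms that only the formatting differs.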

There are various ways to solve this problem, machine learning among them. In this case, however, we will use pattern matching with regular expressions, which makes the problem easier to solve.

Regular expressions can be intimidating and take some getting used to, so this article aims to explain how to use them in Python step by step. The patterns and principles carry over to other languages, largely because Python's regex syntax is based on Perl's.

Consider the following example:

usa_phone_number_pattern = "333-234-0007"

How do we extract all of the digits from the above string (without the dashes -)? Let’s start with the most laborious approach, which is also the one that comes naturally to many of us.

Approach 1:

usa_phone_number_pattern = "333-234-0007"
num_vals = []
for char in usa_phone_number_pattern:
    the_val = None
    try:
        the_val = int(char)
    except ValueError:
        # the character is not a digit (here, a dash), so skip it
        pass
    if the_val is not None:
        num_vals.append(the_val)

the_result = "".join([f"{x}" for x in num_vals])
the_result

Approach 2:

Here’s another direction your intuition could lead you:

usa_phone_number_pattern = "333-234-0007"
the_result = usa_phone_number_pattern.replace("-", "")
the_result

Approach 3:

Finally, Python strings have a built-in method, .isdigit(), that returns True when every character in the string is a digit. Applied to one character at a time, it lets us keep only the digits:

usa_phone_number_pattern = "333-234-0007"
the_result = "".join([phone for phone in usa_phone_number_pattern if phone.isdigit()])
the_result

These approaches all work; however, there is a more practical and reliable method: regular expressions. Let’s take a look at the first regex example:

import re # the built-in regex library

code_pattern = r"\d+"
the_result = re.findall(code_pattern, usa_phone_number_pattern)
the_result

The re.findall function gives a more useful result here because each group of digits is returned as a separate item.

It’s easier for us to recognize ['333', '234', '0007'] as a phone number than 3332340007. That’s because we are from the United States, and that’s how we group phone numbers.

There is a further reason to use regex: we are looking for a specific pattern in our text, not just digits, and we want to disregard any digits that don’t follow this pattern. For example, let’s say we are given a string that contains both a time and a phone number. First, let’s extract all the digits from it.

the_string = " Please check on Codeunderscored, at 14:00 or dial my phone at +1-333-234-0007."

code_pattern = r"\d+"
the_result = re.findall(code_pattern, the_string)
the_result
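
For reference, here is a self-contained version of that snippet. It shows that \d+ picks up the digits of the time as well as those of the phone number, which is why a more specific pattern is needed. The variable name all_digit_runs is chosen just for this sketch.

```python
import re

the_string = " Please check on Codeunderscored, at 14:00 or dial my phone at +1-333-234-0007."

# \d+ matches every run of one or more digits, wherever it appears
all_digit_runs = re.findall(r"\d+", the_string)
print(all_digit_runs)
```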

If we wanted to extract the phone number from the string above, we would run the following code.

code_pattern = r"\+\d{1}-\d{3}-\d{3}-\d{4}"
the_result = re.findall(code_pattern, the_string)
the_result

Let’s break down this pattern:

The pattern \d matches any single digit. The r in front of the string marks it as a raw string, so Python passes the backslash through to the regex engine unchanged.

Part 1: \+\d{1}-

The \+ matches a literal + sign, so a match must begin with a +. The backslash is needed because an unescaped + is a quantifier in regex.
The \+ is followed by \d, which, as you may recall, matches any single digit. It does not match letters, spaces, or dashes.

The \d is followed by {1}. When you see braces containing a number inside a regex pattern, such as {1} or {3}, it means the preceding pattern must repeat exactly n times. For instance, r"\d{6}" is a pattern for six digits.
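
A small sketch of the {n} quantifier follows. Note that \d{6} also matches the first six digits of a longer run; the \b word boundaries in the second pattern are one way to require exactly six. The sample text is invented for this illustration.

```python
import re

samples = "12345 123456 1234567"

# \d{6} needs six digits in a row: "12345" is too short, and from
# "1234567" only the first six digits are taken
six_digits = re.findall(r"\d{6}", samples)

# \b word boundaries restrict the match to runs of exactly six digits
exactly_six = re.findall(r"\b\d{6}\b", samples)

print(six_digits)
print(exactly_six)
```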

Finally, this part ends with a dash (-). We don’t need to escape the dash as we did the +, because a dash has no special meaning outside a character class such as [0-9].

Part 2: \d{3}-

It should be self-evident at this point. This chunk matches exactly three digits (\d{3}) followed by a dash.

Part 3: \d{3}-

This is identical to part 2 above. It matches exactly three digits followed by a dash.

Part 4: \d{4}

This part matches exactly four digits (\d{4}), as you can infer from parts 2 and 3 above.
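
The four parts can also be written as one annotated pattern. The sketch below uses the re.VERBOSE flag, which ignores white-space inside the pattern and allows comments; the name phone_pattern is chosen for this illustration.

```python
import re

# re.VERBOSE lets the pattern span several lines and carry comments
phone_pattern = re.compile(
    r"""
    \+\d{1}-   # part 1: a literal plus sign, one digit, a dash
    \d{3}-     # part 2: three digits and a dash
    \d{3}-     # part 3: three digits and a dash
    \d{4}      # part 4: four digits
    """,
    re.VERBOSE,
)

the_string = " Please check on Codeunderscored, at 14:00 or dial my phone at +1-333-234-0007."
phone_numbers = phone_pattern.findall(the_string)
print(phone_numbers)
```

The time 14:00 is ignored because it does not fit the pattern.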

Key things you have to master in Regular Expressions

? is responsible for making things optional

For instance, \+?\d{1}- is similar to what we covered in part 1 above. The difference is the ? after the escaped plus sign, which makes the + in a string optional.

import re

code_pattern = r"\+?\d{1}-?\d{3}-?\d{3}-?\d{4}"

numbers_without_dashes = "+15558655309"
print(re.findall(code_pattern, numbers_without_dashes))

numbers_without_plus = "15558655309"

print(re.findall(code_pattern, numbers_without_plus))

number_with_dashes = "1555-8655309"

print(re.findall(code_pattern, number_with_dashes))

number_dashes_and_plus = "+1555-865-5309"

print(re.findall(code_pattern, number_dashes_and_plus))

How to write () in regex expressions

Parentheses, like the plus +, are special regex characters. As a result, to match a literal parenthesis we must write the escape character \ before both the opening and the closing parentheses.

For example,

extract_with_parentheses = r"\(" + r"\d{3}" + r"\)"

the_string = "I need characters within the brackets,+1(555)-555-3121, isolated in this string."
print(re.findall(extract_with_parentheses, the_string))
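
Combining the optional marker ? with escaped parentheses, a single pattern can accept all five phone number formats listed earlier. The pattern below is only a sketch written for this illustration: it is deliberately permissive and would also accept odd inputs such as an unbalanced parenthesis.

```python
import re

flexible_pattern = re.compile(
    r"\+?1?-?"       # optional +, optional country code 1, optional dash
    r"\(?\d{3}\)?"   # area code, with optional literal parentheses
    r"-?\s?"         # optional dash and optional white-space
    r"\d{3}-?\d{4}"  # exchange code, optional dash, line number
)

phone_formats = [
    "1-333-234-0007",
    "+1(333)- 234-0007",
    "+13332340007",
    "+1-333-234-0007",
    "333-234-0007",
]

# fullmatch succeeds only when the whole string fits the pattern
for phone in phone_formats:
    print(phone, bool(flexible_pattern.fullmatch(phone)))
```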

The or | operator

In some regexes, you’ll need to allow for two different alternatives. The | operator matches either the pattern on its left or the pattern on its right.

Assume we needed to match either of two different three-digit codes. Let us try 455 and 456 to cement our understanding of this concept.

the_string = "I need characters within the brackets,+1(555)-555-3121, or 213-323-1233 or 312-456-6666 isolated in this string."
extract_with_parentheses = r"\(?" + r"(?:455|456)" + r"\)?" + "-?"
print(re.findall(extract_with_parentheses, the_string))
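
The (?:...) wrapper around the alternatives is a non-capturing group. The difference matters with findall, which returns only the captured part when the pattern contains a capturing group. A short sketch with an invented sample string:

```python
import re

the_string = "Call 312-456-6666 or 213-455-1234, but not 213-323-1233."

# Non-capturing group (?:...): findall returns the whole match
whole_numbers = re.findall(r"\d{3}-(?:455|456)-\d{4}", the_string)

# Capturing group (...): findall returns only the captured part
codes_only = re.findall(r"\d{3}-(455|456)-\d{4}", the_string)

print(whole_numbers)
print(codes_only)
```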

Groups

Regex lets us group components of a pattern to make them easier to extract. For instance, a US phone number is made up of the country code, the area code, the exchange code, and the line number. In 1-212-323-5123, for example:

  • 1 is the country code
  • 212 is the area code
  • 323 is the exchange code
  • 5123 is the line number

Let’s divide our pattern into groups before assigning names. To demonstrate, we’ll use simple chunks:

group_1 = r"(\d{1}-?)"
group_2 = r"(\d{3}-?)"
group_3 = r"(\d{3}-?)"
group_4 = r"(\d{4})"

phone_example = "1-212-555-5123"
combined_patterns = f"{group_1}{group_2}{group_3}{group_4}"

matched = re.compile(combined_patterns).match(phone_example)
print('group', matched.group())
print('groups', matched.groups())

Groups are simply chunks of a regex surrounded by parentheses (). These parentheses must not be escaped with the escape character \; otherwise, they match literal parentheses and the group is no longer a group. The entire match is also available as group 0, whether or not parentheses are used. Now, we only want digits in our groups (i.e. no dashes -). That’s easy to solve by placing the parentheses () around only the part of the pattern that we want to extract:

group_1 = r"(\d{1})-?"
group_2 = r"(\d{3})-?"
group_3 = r"(\d{3})-?"
group_4 = r"(\d{4})"

phone_example = "1-212-555-5123"
combined_patterns = f"{group_1}{group_2}{group_3}{group_4}"

matched = re.compile(combined_patterns).match(phone_example)
print('group', matched.group())
print('groups', matched.groups())
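
Individual groups can also be read by position with group(n), where group(0) is the entire match. The sketch below writes the four chunks as one pattern and joins the groups to get the digits without dashes; the names phone_pattern and digits_only are chosen for this example.

```python
import re

phone_pattern = re.compile(r"(\d{1})-?(\d{3})-?(\d{3})-?(\d{4})")
matched = phone_pattern.match("1-212-555-5123")

if matched is not None:
    print(matched.group(0))  # the entire match
    print(matched.group(2))  # the second group, here the area code
    digits_only = "".join(matched.groups())  # groups() returns a tuple of strings
    print(digits_only)
```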

What are Named Groups

Named groups let you attach a name to each group in your regular expression. Let’s look at an example to bring the point home.

group_1 = r"(?P<country_code>\d{1})-?"
group_2 = r"(?P<area_code>\d{3})-?"
group_3 = r"(?P<exchange_code>\d{3})-?"
group_4 = r"(?P<line_number>\d{4})"

phone_example = "1-555-333-6502"
named_group_pattern = f"{group_1}{group_2}{group_3}{group_4}"

matched = re.compile(named_group_pattern).match(phone_example)
print('named_groups', matched.groupdict())
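
Besides groupdict(), a named group can be read with group("name") or with square brackets on the Match object. The sketch below also uses re.finditer to walk over every phone number in a longer text; the_text is an invented sample.

```python
import re

named_group_pattern = re.compile(
    r"(?P<country_code>\d{1})-?"
    r"(?P<area_code>\d{3})-?"
    r"(?P<exchange_code>\d{3})-?"
    r"(?P<line_number>\d{4})"
)

the_text = "Office: 1-555-333-6502, front desk: 1-212-555-5123."

# finditer yields one Match object per phone number found
area_codes = [m.group("area_code") for m in named_group_pattern.finditer(the_text)]

first = named_group_pattern.search(the_text)
print(area_codes)
print(first["line_number"])  # same as first.group("line_number")
```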

Letters in Regular Expressions

So far, we’ve only matched digits, using \d or [0-9]. Letters work in much the same way, except that letters can be lowercase or uppercase, so there are two ranges: [a-z] and [A-Z]. Thus, you can use [a-z] or [A-Z] in the same way as [0-9].

the_string = "CODEUNDERSCORED is having a good run of OVER 50 % ?"

the_pattern = r"[a-z]"

print(re.findall(the_pattern, the_string))
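
A single [a-z] matches one letter at a time. Adding + collects whole runs of letters, and the two ranges can be combined in one character class. A brief sketch on the same string:

```python
import re

the_string = "CODEUNDERSCORED is having a good run of OVER 50 % ?"

lowercase_words = re.findall(r"[a-z]+", the_string)   # runs of lowercase letters
uppercase_words = re.findall(r"[A-Z]+", the_string)   # runs of uppercase letters
all_words = re.findall(r"[A-Za-z]+", the_string)      # both ranges in one class

print(lowercase_words)
print(uppercase_words)
print(all_words)
```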

What does Match Object mean?

A Match Object is a data structure that contains information about the search and its outcome. If there is no match, the value None is returned instead of a Match Object.

Let us look at an example in which we perform a search that yields a Match Object:

import re

the_string = "Codeunderscored rains in the development industry."
the_result = re.search("red", the_string)

# finally print out the resultant object
print(the_result)

The Match object has properties and methods for retrieving information about the search and the result:

  • .string returns the string passed into the function.
  • .span() returns a tuple containing the start and end positions of the match.
  • .group() returns the part of the string where there was a match.

Example 1: Print the first match occurrence’s location (both start and end positions)

The regular expression checks for words that begin with the letter “C” in upper case:

import re

the_string = "codeunderscored rains in the Code development industry."
the_result = re.search(r"\bC\w+", the_string)
print(the_result.span())

Example 2: Printing the string that was supplied into the function

the_string = "codeunderscored rains in the Code development industry."
the_result = re.search(r"\bC\w+", the_string)
print(the_result.string)

Example 3: Print the section of the string where a match was found

In the third example, the regular expression again looks for a word that begins with an upper-case “C”. Keep in mind that if there is no match, None is returned instead of a Match Object, and calling .group() on None raises an AttributeError.

import re

the_string = "codeunderscored rains in the Code development industry."
the_result = re.search(r"\bC\w+", the_string)
print(the_result.group())
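
The three examples can be tied together in one short sketch: slicing the original string with the values from span() gives the same text that group() returns. The None check is included because search finds nothing when no word starts with an upper-case C.

```python
import re

the_string = "codeunderscored rains in the Code development industry."
the_result = re.search(r"\bC\w+", the_string)

if the_result is not None:
    start, end = the_result.span()
    # the slice and group() refer to the same piece of text
    print(the_result.group(), the_string[start:end], (start, end))
    # .string is the text that was searched
    print(the_result.string == the_string)
```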

Conclusion

A regular expression (abbreviated regex) is a sequence of characters used to determine whether or not a pattern exists in a given text (string). If you’ve ever used search engines, word processors’ search and replace capabilities, or text editors, you’ve seen regular expressions in action. They’re used on the server to validate the format of email addresses or passwords during registration, and they’re also used to parse text data files to find, alter, or delete specific strings, among other things.

Essentially, they aid with manipulating textual data, which is frequently required for data science initiatives that involve text mining.

The strings to be searched can be both Unicode and 8-bit strings. Unicode and 8-bit strings, on the other hand, cannot be mixed: a Unicode string cannot be matched with a byte pattern, and vice versa; similarly, when requesting a substitution, the replacement string must be of the same type as the pattern and the search string.
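
The rule about not mixing types can be seen in a few lines. In the sketch below, a str pattern works on a str, a bytes pattern (written with the rb prefix) works on bytes, and mixing the two raises a TypeError. The sample values are invented for this illustration.

```python
import re

text_data = "order 123"
byte_data = b"order 123"

str_match = re.search(r"\d+", text_data)     # str pattern, str input
bytes_match = re.search(rb"\d+", byte_data)  # bytes pattern, bytes input

try:
    re.search(r"\d+", byte_data)  # str pattern on bytes input
    mixed_ok = True
except TypeError as error:
    mixed_ok = False
    print("Mixing types fails:", error)

print(str_match.group(), bytes_match.group())
```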

Finally, the backslash character (“\”) is used in regular expressions to denote special forms or to allow special characters to be used without invoking their special meaning.
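
As a closing sketch, special characters can be escaped by hand with backslashes in a raw string, or automatically with re.escape, which adds the backslashes for any text that should be matched literally. The sample string is invented for this example.

```python
import re

the_string = "call +1(555)-555-3121"

# Escaping by hand: \+, \( and \) match the literal characters
by_hand = re.search(r"\+1\(555\)", the_string)

# re.escape builds an equivalent pattern from the literal text
escaped_pattern = re.escape("+1(555)")
automatic = re.search(escaped_pattern, the_string)

print(by_hand.group())
print(automatic.group())
```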
