RegEx is short for Regular Expression: a sequence of characters that defines a search pattern. You use a RegEx to check whether a string contains the specified pattern.
Module for RegEx
Python supports regular expressions through a built-in module called re, which you import as follows.
import re
RegEx Functions
Regular expressions come with many functions that you will find very useful if you are a frequent user of the re module. The following are some of the re module functions that allow us to search a string for a match:
findall
The findall function returns a list containing all matches. The following example shows how it works.
import re
the_string = "Codeunderscored rains in the development industry."
the_result = re.findall("ai", the_string)
print(the_result)
The matches are listed in the order in which they were found. If there are no matches, an empty list is returned. For example,
import re
the_string = "Codeunderscored rains in the development industry."
the_result = re.findall("Google", the_string)
print(the_result)
search
On the other hand, the search function returns a Match object if there is a match anywhere in the string. If there are multiple matches, only the first one is returned. Let’s consider an example that searches for the first white-space character in the string:
import re
the_string = "Codeunderscored rains in the development industry."
the_result = re.search(r"\s", the_string)
print("The first white-space character was found at position:", the_result.start())
If there are no matches, the value None is returned, as in the following demo.
import re
the_string = "Codeunderscored rains in the development industry."
the_result = re.search("Google", the_string)
print(the_result)
split
The split function returns a list in which the string has been split at every match. In this example, we split the string at each white-space character, as shown.
import re
the_string = "Codeunderscored rains in the development industry."
the_result = re.split(r"\s", the_string)
print(the_result)
Note that the maxsplit parameter can be used to limit the number of splits. For example, we can split the string at only the first two occurrences:
import re
the_string = "Codeunderscored rains in the development industry."
the_result = re.split(r"\s", the_string, maxsplit=2)
print(the_result)
sub
The sub function is the last one we will consider in this article, and it is just as important as the others. It replaces one or many matches with a given string.
Let us consider the following example, which replaces every white-space character with the string __replaced__:
import re
the_string = "Codeunderscored rains in the development industry."
the_result = re.sub(r"\s", "__replaced__", the_string)
print(the_result)
The count parameter can be used to limit the number of replacements; for instance, you may only need to replace the first occurrence, as shown below.
import re
the_string = "Codeunderscored rains in the development industry."
the_result = re.sub(r"\s", "__replaced__", the_string, count=1)
print(the_result)
Using a Phone Number to explore regular expressions
Regular expressions, often known as regex, are widely used to help us parse data. Before we go into how to do that, let’s look at a real-world example utilizing US phone numbers. The following are all legal phone number formats in writing:
- 1-333-234-0007
- +1(333)- 234-0007
- +13332340007
- +1-333-234-0007
- 333-234-0007
All of these represent the same number, only formatted differently. So, how would we search a whole document for every possible phone number format?
There are various ways to approach this problem, including machine learning. In this case, however, we will use pattern matching with regular expressions, which makes the problem easier to solve.
Regular expressions can be intimidating and take some getting used to, so this article aims to explain how to use them in Python step by step. Most of these regex patterns and principles are shared across languages, largely because Python regex is modeled on Perl regex.
Consider the following example:
usa_phone_number_pattern = "333-234-0007"
How do we extract all of the digits from the above string (without the dashes -)? Let’s start with the more laborious approach that would come naturally to most of us.
Approach 1:
usa_phone_number_pattern = "333-234-0007"
num_vals = []
for char in usa_phone_number_pattern:
    try:
        # int() raises ValueError for non-digit characters such as "-"
        num_vals.append(int(char))
    except ValueError:
        pass
the_result = "".join(str(x) for x in num_vals)
print(the_result)
Approach 2:
Here’s another direction your intuition could lead you:
usa_phone_number_pattern = "333-234-0007"
the_result = usa_phone_number_pattern.replace("-", "")
print(the_result)
Approach 3:
Finally, Python strings have a built-in method, .isdigit(), which returns True when every character in the string is a digit. Applied to one character at a time, it lets us keep only the digits:
usa_phone_number_pattern = "333-234-0007"
the_result = "".join(char for char in usa_phone_number_pattern if char.isdigit())
print(the_result)
These approaches all accomplish the goal; however, there is a more practical and reliable method: regular expressions. Let’s take a look at the first regex example:
import re  # the built-in regex library
usa_phone_number_pattern = "333-234-0007"
code_pattern = r"\d+"
the_result = re.findall(code_pattern, usa_phone_number_pattern)
print(the_result)
The re.findall function produces a more useful result because each group of digits is returned as a separate item.
It’s easier to recognize ['333', '234', '0007'] as a phone number than 3332340007, because US phone numbers are conventionally written in groups of three, three, and four digits.
That is not the only reason to use regex. We are usually looking for a specific pattern in our text, not just digits, and we want to disregard any digits that don’t follow that pattern. For example, let’s say we are given a string containing both a time and a phone number. Extracting all the digits gives us the time as well:
import re
the_string = " Please check on Codeunderscored, at 14:00 or dial my phone at +1-333-234-0007."
code_pattern = r"\d+"
the_result = re.findall(code_pattern, the_string)
print(the_result)
If we wanted to extract only the phone number from the string above, we would run the following code.
# the + is escaped because an unescaped + is a regex quantifier
phone_pattern = r"\+\d{1}-\d{3}-\d{3}-\d{4}"
the_result = re.findall(phone_pattern, the_string)
print(the_result)
Let’s break down this pattern:
In a pattern such as r"\d", the r prefix marks a Python raw string, which stops Python from treating the backslash as an escape character before the pattern reaches the regex engine. The \d itself is the regex token that matches any single digit.
Part 1: \+\d{1}-
The \+ matches a literal plus sign. The backslash is required because an unescaped + is a regex quantifier meaning "one or more of the preceding item".
The \+ is followed by \d, which, as you may recall, matches a single digit. It does not match letters, spaces, or dashes.
The \d is followed by {1}. Braces containing a number, such as {1} or {3}, mean that the preceding item must occur exactly that many times. For instance, r"\d{6}" is a pattern for exactly six digits.
Finally, this part ends with a dash (-). We don’t need to escape the dash like the +, because outside a character class (square brackets) the dash has no special meaning in regex.
Part 2: \d{3}-
This chunk matches exactly three digits (\d{3}) followed by a dash.
Part 3: \d{3}-
This part is identical to part 2: exactly three digits followed by a dash.
Part 4: \d{4}
This part matches exactly four digits (\d{4}), with no trailing dash.
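As a rough check of the breakdown above, the sketch below rebuilds the pattern from its four parts and tests it with re.fullmatch. The variable names (part_1 to part_4, full_pattern) and the second sample string are illustrative assumptions, not part of the original example.

```python
import re

# Each part mirrors one step of the breakdown; the names are illustrative.
part_1 = r"\+\d{1}-"   # a literal plus, one digit, then a dash
part_2 = r"\d{3}-"     # three digits, then a dash
part_3 = r"\d{3}-"     # three digits, then a dash
part_4 = r"\d{4}"      # four digits

full_pattern = part_1 + part_2 + part_3 + part_4

# re.fullmatch succeeds only if the whole string fits the pattern.
matches_full = re.fullmatch(full_pattern, "+1-333-234-0007") is not None
matches_short = re.fullmatch(full_pattern, "+1-333-234-007") is not None

print(matches_full)   # True
print(matches_short)  # False: the last part has only three digits
```

Building the pattern from named pieces like this can make a long regex easier to read and to test one part at a time.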
Key things you have to master in Regular Expressions
? is responsible for making things optional
For instance, \+?\d{1}- is similar to what we covered in part 1 above. The difference is the ? after the escaped plus sign, which makes the + optional. In the pattern below, each dash is also followed by ?, so the dashes are optional too.
import re
code_pattern = r"\+?\d{1}-?\d{3}-?\d{3}-?\d{4}"
numbers_without_dashes = "+15558655309"
print(re.findall(code_pattern, numbers_without_dashes))
numbers_without_plus = "15558655309"
print(re.findall(code_pattern, numbers_without_plus))
number_with_dashes = "1555-8655309"
print(re.findall(code_pattern, number_with_dashes))
number_dashes_and_plus = "+1555-865-5309"
print(re.findall(code_pattern, number_dashes_and_plus))
How to write () in regex expressions
Parentheses, like the plus +, are special regex characters. As a result, to match a literal parenthesis we must write an escape character (a backslash) before both the opening and the closing parentheses.
For example:
import re
# \( and \) match literal parentheses; unescaped, they would form a group
extract_with_parentheses = r"\(" + r"\d{3}" + r"\)"
the_string = "I need characters within the brackets,+1(555)-555-3121, isolated in this string."
print(re.findall(extract_with_parentheses, the_string))
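Combining optional markers with escaped parentheses, one pattern can cover all five phone number formats listed earlier. The sketch below is one possible way to do it; flexible_pattern and the way it is split into pieces are assumptions made for illustration, and real-world phone validation usually needs more care.

```python
import re

# Hypothetical pattern covering the five formats listed earlier.
flexible_pattern = re.compile(
    r"(?:\+?1[-\s]?)?"   # optional country code, e.g. "+1", "1", "+1-" or "1-"
    r"\(?\d{3}\)?"       # area code, optionally wrapped in parentheses
    r"[-\s]*"            # any dashes or spaces between the groups
    r"\d{3}-?\d{4}"      # exchange code and line number
)

formats = [
    "1-333-234-0007",
    "+1(333)- 234-0007",
    "+13332340007",
    "+1-333-234-0007",
    "333-234-0007",
]

# fullmatch requires the entire string to fit the pattern.
results = {text: flexible_pattern.fullmatch(text) is not None for text in formats}
for text, matched in results.items():
    print(text, matched)
```

Adjacent string literals are joined by Python, which allows a comment next to each piece of the pattern.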
The or | operator
In some regexes, you’ll need to allow for two alternative patterns. The | operator matches either the expression on its left or the one on its right.
Assume we needed the area code to match one of two different area codes. Let us try 455 and 456 to cement our understanding of this concept.
import re
the_string = "I need characters within the brackets,+1(555)-555-3121, or 213-323-1233 or 312-456-6666 isolated in this string."
# optional literal parentheses around either 455 or 456, then an optional dash
extract_with_parentheses = r"\(?" + "(?:455|456)" + r"\)?" + "-?"
print(re.findall(extract_with_parentheses, the_string))
Groups
We can group components of a pattern to make them easier to extract. For instance, an American phone number is made up of a country code, an area code, an exchange code, and a line number. In the number 1-212-323-5123:
- 1 is the country code
- 212 is the area code
- 323 is the exchange code
- 5123 is the line number
Let’s divide our pattern into groups before assigning names. To demonstrate, we’ll use simple chunks:
import re
group_1 = r"(\d{1}-?)"
group_2 = r"(\d{3}-?)"
group_3 = r"(\d{3}-?)"
group_4 = r"(\d{4})"
phone_example = "1-212-555-5123"
combined_patterns = f"{group_1}{group_2}{group_3}{group_4}"
matched = re.compile(combined_patterns).match(phone_example)
print('group', matched.group())
print('groups', matched.groups())
Groups are simply regex chunks surrounded by parentheses (). These parentheses must not be escaped with a backslash; otherwise, they match literal parentheses and no longer form a group. The entire match is also available as group 0, whether or not parentheses are used. Now, we only want digits in our groups (i.e., no dashes). To get that, place the parentheses around only the part of the pattern we want to extract:
import re
group_1 = r"(\d{1})-?"
group_2 = r"(\d{3})-?"
group_3 = r"(\d{3})-?"
group_4 = r"(\d{4})"
phone_example = "1-212-555-5123"
combined_patterns = f"{group_1}{group_2}{group_3}{group_4}"
matched = re.compile(combined_patterns).match(phone_example)
print('group', matched.group())
print('groups', matched.groups())
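To make the numbering of groups concrete, here is a small sketch that reads individual groups from the same kind of match. The variable names (whole, area_code, pair) are chosen for illustration only.

```python
import re

pattern = re.compile(r"(\d{1})-?(\d{3})-?(\d{3})-?(\d{4})")
matched = pattern.match("1-212-555-5123")

whole = matched.group(0)       # the entire match, same as matched.group()
area_code = matched.group(2)   # the second pair of parentheses
pair = matched.group(1, 4)     # several groups at once, returned as a tuple

print(whole)      # 1-212-555-5123
print(area_code)  # 212
print(pair)       # ('1', '5123')
```

Groups are numbered from 1 in the order their opening parentheses appear in the pattern.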
What are Named Groups
Named groups allow you to attach a name to each group in your regex, using the syntax (?P<name>...). Let’s look at an example to bring the point home.
import re
group_1 = r"(?P<country_code>\d{1})-?"
group_2 = r"(?P<area_code>\d{3})-?"
group_3 = r"(?P<exchange_code>\d{3})-?"
group_4 = r"(?P<line_number>\d{4})"
phone_example = "1-555-333-6502"
named_group_pattern = f"{group_1}{group_2}{group_3}{group_4}"
matched = re.compile(named_group_pattern).match(phone_example)
print('named_groups', matched.groupdict())
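Named groups can also be read one at a time and reused in a replacement string. The sketch below assumes the same group names as the example above; the output format "(555) 333-6502" is just one illustrative choice.

```python
import re

named_group_pattern = re.compile(
    r"(?P<country_code>\d{1})-?"
    r"(?P<area_code>\d{3})-?"
    r"(?P<exchange_code>\d{3})-?"
    r"(?P<line_number>\d{4})"
)
phone_example = "1-555-333-6502"

matched = named_group_pattern.match(phone_example)
area_code = matched.group("area_code")  # look a group up by its name

# In a replacement string, \g<name> inserts the text captured by that group.
reformatted = named_group_pattern.sub(
    r"(\g<area_code>) \g<exchange_code>-\g<line_number>", phone_example
)

print(area_code)    # 555
print(reformatted)  # (555) 333-6502
```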
Letters in Regular Expressions
So far, we’ve only matched digits, using \d or [0-9]. Letters work in much the same way, except that case matters: [a-z] matches lower-case letters and [A-Z] matches upper-case letters. Thus, you can use [a-z] or [A-Z] in place of [0-9]. The example below returns only the lower-case letters, one at a time:
import re
the_string = "CODEUNDERSCORED is having a good run of OVER 50 % ?"
the_pattern = r"[a-z]"
print(re.findall(the_pattern, the_string))
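The character classes can be combined, or paired with the re.IGNORECASE flag, to match whole words regardless of case. The following is a brief sketch using the same sample string; the variable names are illustrative.

```python
import re

the_string = "CODEUNDERSCORED is having a good run of OVER 50 % ?"

# + means "one or more", so each run of letters comes back as one item.
upper_words = re.findall(r"[A-Z]+", the_string)
any_case_words = re.findall(r"[A-Za-z]+", the_string)
ignore_case_words = re.findall(r"[a-z]+", the_string, flags=re.IGNORECASE)

print(upper_words)     # ['CODEUNDERSCORED', 'OVER']
print(any_case_words)  # every word, upper or lower case
print(any_case_words == ignore_case_words)  # True
```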
What does Match Object mean?
A Match Object is a data structure that contains information about the search and its outcome. If there is no match, the value None is returned instead of a Match Object.
Let us look at an example in which we perform a search that yields a Match Object:
import re
the_string = "Codeunderscored rains in the development industry."
the_result = re.search("red", the_string)
# finally print out the resultant object
print(the_result)
The Match object has properties and methods that are used to access information about the search and the result.
- .string returns the string passed into the function.
- .span() returns a tuple containing the start and end positions of the match.
- .group() returns the part of the string where there was a match.
Example 1: Print the first match occurrence’s location (both start and end positions)
The regular expression looks for a word that begins with an upper-case “C”:
import re
the_string = "codeunderscored rains in the Code development industry."
the_result = re.search(r"\bC\w+", the_string)
print(the_result.span())
Example 2: Printing the string that was supplied into the function
import re
the_string = "codeunderscored rains in the Code development industry."
the_result = re.search(r"\bC\w+", the_string)
print(the_result.string)
Example 3: Print the section of the string where a match was found
In the third example, the regular expression again looks for a word that begins with an upper-case “C”. Remember that if there is no match, the value None is returned instead of a Match Object.
import re
the_string = "codeunderscored rains in the Code development industry."
the_result = re.search(r"\bC\w+", the_string)
print(the_result.group())
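Because search returns None when nothing matches, calling .group() directly on the result can raise an AttributeError. One common way to guard against that is sketched below; the helper name first_capital_c_word is hypothetical and the second sample string is an assumption added for contrast.

```python
import re

def first_capital_c_word(text):
    """Return the first word starting with an upper-case C, or None."""
    the_result = re.search(r"\bC\w+", text)
    if the_result is None:
        # No match: there is no Match object to call .group() on.
        return None
    return the_result.group()

found = first_capital_c_word("codeunderscored rains in the Code development industry.")
missing = first_capital_c_word("codeunderscored rains in the development industry.")

print(found)    # Code
print(missing)  # None
```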
Conclusion
Regular Expressions (abbreviated regex) are a set of characters used to determine whether or not a pattern exists in a given text (string). If you’ve ever used search engines, word processors’ search and replace capabilities, or text editors, you’ve seen regular expressions in action. They’re used on the server to validate the format of email addresses or passwords during registration, and they’re also used to parse text data files to find, alter, or delete specific strings, among other things.
Essentially, they aid with manipulating textual data, which is frequently required for data science initiatives that involve text mining.
Both the pattern and the string to be searched can be Unicode strings (str) or 8-bit strings (bytes). However, the two cannot be mixed: a Unicode string cannot be matched with a bytes pattern, and vice versa. Similarly, when requesting a substitution, the replacement string must be of the same type as both the pattern and the search string.
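The type rule can be seen in a short sketch. The sample text is an assumption made for illustration; the point is only that str goes with str and bytes with bytes.

```python
import re

# str pattern with a str subject, and bytes pattern with a bytes subject.
text_result = re.findall(r"\d+", "dial 333-234-0007")
bytes_result = re.findall(rb"\d+", b"dial 333-234-0007")

# Mixing a bytes pattern with a str subject raises TypeError.
try:
    re.findall(rb"\d+", "dial 333-234-0007")
    mixed_error = None
except TypeError as error:
    mixed_error = error

print(text_result)   # ['333', '234', '0007']
print(bytes_result)  # [b'333', b'234', b'0007']
print(type(mixed_error).__name__)  # TypeError
```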
Finally, the backslash character ("\") is used in regular expressions to indicate special forms (such as \d) or to allow special characters (such as + or parentheses) to be used without invoking their special meaning.
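Since Python string literals also use the backslash for escapes, regex patterns are usually written as raw strings, and re.escape can add the backslashes for text that should be matched literally. The sketch below illustrates both; the sample strings are assumptions.

```python
import re

# Without the r prefix, each backslash must itself be escaped.
plain = "\\d+"
raw = r"\d+"
same_pattern = plain == raw

# re.escape puts a backslash in front of regex metacharacters,
# so the text is matched literally.
literal = re.escape("+1(333)")
found = re.findall(literal, "call +1(333)-234-0007")

print(same_pattern)  # True
print(found)         # ['+1(333)']
```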