Postgres regexp_match() function
Extract substrings matching a regular expression pattern
The Postgres regexp_match()
function is used to extract substrings that match a regular expression pattern from a given string. It returns an array of matching substrings, including capture groups if specified in the pattern.
This function is particularly useful for complex string parsing tasks, such as extracting structured information from semi-structured text data. For example, it can be used to parse log files, extract specific components from URLs, or analyze text data for specific patterns.
Try it on Neon!
Neon is Serverless Postgres built for the cloud. Explore Postgres features and functions in our user-friendly SQL editor. Sign up for a free account to get started.
Function signature
The regexp_match()
function has the following form:
regexp_match(string text, pattern text [, flags text]) -> text[]
string
: The input string to search for matches.pattern
: A POSIX regular expression pattern to match against the string.flags
(optional): A string of one or more single-letter flags that modify how the regular expression is interpreted.
The function returns an array of text values, where each element corresponds to a substring within the first match of the pattern in the input string. If there are no matches, the function returns NULL. If there are no capture groups in the pattern, the array contains a single element with the full match.
Example usage
Consider a table log_entries
with a log_text
column containing log messages. We can use regexp_match()
to extract specific information from these logs.
WITH log_entries AS (
SELECT '[2024-03-04 10:15:30] INFO: User john_doe logged in from 192.168.1.100' AS log_text
UNION ALL
SELECT '[2024-03-04 10:20:45] ERROR: Failed login attempt for user jane_smith from 10.0.0.50' AS log_text
UNION ALL
SELECT '[2024-03-04 10:25:55] INFO: User admin logged out' AS log_text
)
SELECT
regexp_match(log_text, '\[(.*?)\] (\w+): (.*)$') AS parsed_log
FROM log_entries;
This query extracts the timestamp, log level, and message from each log entry. The regular expression pattern \[(.*?)\] (\w+): (.*)$
captures three groups:
- The timestamp between square brackets
- The log level (INFO, ERROR, etc.), which is alphabetical and terminated with a colon
- The rest of the message
parsed_log
-----------------------------------------------------------------------------------------
{"2024-03-04 10:15:30",INFO,"User john_doe logged in from 192.168.1.100"}
{"2024-03-04 10:20:45",ERROR,"Failed login attempt for user jane_smith from 10.0.0.50"}
{"2024-03-04 10:25:55",INFO,"User admin logged out"}
(3 rows)
Advanced examples
regexp_match()
with regex flags
Use The regexp_match()
function accepts optional flags to modify how the regular expression is interpreted. Here's an example using the 'i' flag for case-insensitive matching:
WITH user_agents AS (
SELECT 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36' AS user_agent
UNION ALL
SELECT 'Mozilla/5.0 (iPhone; CPU iPhone OS 14_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Mobile/15E148 Safari/604.1' AS user_agent
UNION ALL
SELECT 'CHROME/91.0.4472.124' AS user_agent
)
SELECT
regexp_match(user_agent, '(chrome|safari|firefox|msie|opera)\/[\d\.]+', 'i') AS browser
FROM user_agents;
This query extracts the browser name and version from user agent strings, using case-insensitive matching.
browser
----------
{Chrome}
{Safari}
{CHROME}
(3 rows)
regexp_match()
in a WHERE clause
Use You can use regexp_match()
in a WHERE clause to filter rows based on a regex pattern:
WITH emails AS (
SELECT 'john.doe@example.com' AS email
UNION ALL
SELECT 'jane.smith@company.co.uk' AS email
UNION ALL
SELECT 'support@mydomain.io' AS email
)
SELECT *
FROM emails
WHERE regexp_match(email, '^[^@]+@[^@]+\.(com|org|io)$') IS NOT NULL;
This query selects all rows from the emails
table where the email address ends with .com
, ``.org, or
.io`.
email
----------------------
john.doe@example.com
support@mydomain.io
(2 rows)
Additional considerations
Performance implications
Using regexp_match()
can be computationally expensive, especially on large datasets or with complex patterns. For better performance:
- Use simpler patterns when possible.
- Consider using
LIKE
orSIMILAR TO
for simple pattern matching. - If you frequently filter based on regex patterns, consider creating a functional index using the
regexp_match()
expression.
NULL handling
regexp_match()
returns NULL if there's no match or if the input string is NULL. This behavior can be useful in WHERE
clauses but may require careful handling in SELECT
lists.
Alternative functions
regexp_matches()
: Returns a set of all matches, useful for extracting multiple occurrences of the pattern in the input string.regexp_replace()
: Replaces substrings matching a regex pattern within a specified string.regexp_split_to_array()
: Splits a string using a regex pattern as the delimiter and returns the result as an array.substring()
: Extracts substrings based on a regex pattern similar toregexp_match()
, but only returns the first captured group of the match.
Resources
- PostgreSQL documentation: Pattern Matching
- PostgreSQL documentation: Regular Expression Details
- Regular Expression Tester: A useful tool for testing and debugging regular expressions