This document lists common regex patterns and shows how to do regex matching and replacement in Java.

Regex cheat sheet

Character classes

Character Set  e.g. [aeiou]  Matches any character in the set.
e.g. any of the characters 'a', 'e', 'i', 'o', 'u'.
Negated Set  e.g. [^aeiou]  Matches any character that is not in the set.
e.g. any character except 'a', 'e', 'i', 'o', 'u'.
Range  e.g. [g-s]  Matches a character whose character code lies between the two specified characters, inclusive.
e.g. all characters from 'g' to 's', both inclusive.
Word \w Matches any word character (alphanumeric & underscore). Only matches low-ascii characters (no accented or non-roman characters). Equivalent to [A-Za-z0-9_]
Digit \d Matches any digit character (0-9). Equivalent to [0-9].
Whitespace \s Matches any whitespace character (spaces, tabs, line breaks).
Not Word \W Matches any character that is not a word character (alphanumeric & underscore). Equivalent to [^A-Za-z0-9_]
Not Digit \D Matches any character that is not a digit character (0-9). Equivalent to [^0-9].
Not Whitespace \S Matches any character that is not a whitespace character (spaces, tabs, line breaks).
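Dot . Matches any character except line breaks. Roughly equivalent to [^\n\r]; in Java, . also skips the other line terminators (\u0085, \u2028, \u2029) unless the Pattern.DOTALL flag is set.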
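
To see the character classes above in action, here is a minimal Java sketch. The class name CharacterClassDemo, the helper findAll and the sample text are made up for illustration; only Pattern and Matcher come from java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharacterClassDemo {
    // Collects every non-overlapping match of regex in text, in order.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "Room 42, gate B7"; // hypothetical sample input
        System.out.println(findAll("[aeiou]", text));   // [o, o, a, e]
        System.out.println(findAll("\\d+", text));      // [42, 7]
        System.out.println(findAll("\\w+", text));      // [Room, 42, gate, B7]
        System.out.println(findAll("[^\\w\\s]", text)); // [,]
    }
}
```

Note that every backslash is doubled inside a Java string literal, so the regex \d is written as "\\d".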

Anchors

Beginning  ^ Matches the beginning of the string, or the beginning of a line if the multiline flag (m) is enabled. This matches a position, not a character.

Example
Regex :  ^\w+
Text : johny can read
Match: johny

End  $ Matches the end of the string, or the end of a line if the multiline flag (m) is enabled. This matches a position, not a character.

Example
Regex:  \w+$
Text :  johny can read
Match: read

Word Boundary \b  Matches a word boundary position between a word character and non-word character or position (start / end of string).

Example
Regex:  n\b
Text: johny can read only numbers
Match: only the "n" at the end of "can". The "n" in "johny", "only" and "numbers" is followed by a word character, so there is no boundary after it.

Not Word Boundary  \B  Matches any position that is not a word boundary. This matches a position, not a character.

Example
Regex:  n\B
Text: johny can read only numbers
Match: the "n" in "johny", "only" and "numbers". The "n" at the end of "can" is not matched because it sits at a word boundary.
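
The anchor examples above can be reproduced in Java with a short sketch like the one below. The class AnchorDemo, its two helpers and the second sample line ("mary can write") are illustrative assumptions; Pattern.MULTILINE is the Java equivalent of the multiline flag (m).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorDemo {
    // Returns the start index of every match of the pattern in text.
    static List<Integer> matchStarts(Pattern pattern, String text) {
        List<Integer> starts = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            starts.add(matcher.start());
        }
        return starts;
    }

    // Returns the matched text of every match of the pattern in text.
    static List<String> matchTexts(Pattern pattern, String text) {
        List<String> texts = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            texts.add(matcher.group());
        }
        return texts;
    }

    public static void main(String[] args) {
        String text = "johny can read only numbers";
        System.out.println(matchStarts(Pattern.compile("n\\b"), text)); // [8]
        System.out.println(matchStarts(Pattern.compile("n\\B"), text)); // [3, 16, 20]

        String lines = "johny can read\nmary can write";
        System.out.println(matchTexts(Pattern.compile("^\\w+"), lines));                    // [johny]
        System.out.println(matchTexts(Pattern.compile("^\\w+", Pattern.MULTILINE), lines)); // [johny, mary]
        System.out.println(matchTexts(Pattern.compile("\\w+$", Pattern.MULTILINE), lines)); // [read, write]
    }
}
```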

Escaped Characters

 

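Reserved Characters  e.g. \+  The following characters have special meaning and must be preceded by a \ (backslash) to represent a literal character: + * ? ^ $ \ . [ ] { } ( ) | /
The / only needs escaping in languages that write regexes between slashes; in Java it is an ordinary character.

Within a character set, only \, - and ] need to be escaped (and ^ when it is the first character).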
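
As a small illustration of escaping in Java, the sketch below compares an unescaped pattern, a manually escaped one, and Pattern.quote, which treats a whole string as a literal. The class EscapeDemo and the helper matchesLiterally are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Matches input against literal exactly, with no regex interpretation.
    static boolean matchesLiterally(String literal, String input) {
        return Pattern.matches(Pattern.quote(literal), input);
    }

    public static void main(String[] args) {
        // Unescaped, + is a quantifier: "one or more 1, then 1=2".
        System.out.println(Pattern.matches("1+1=2", "1+1=2"));   // false
        System.out.println(Pattern.matches("1+1=2", "11=2"));    // true

        // Escaped, + is a literal plus sign.
        System.out.println(Pattern.matches("1\\+1=2", "1+1=2")); // true
        System.out.println(matchesLiterally("1+1=2", "1+1=2"));  // true

        // Inside a character set, + * ? . are already literal.
        System.out.println(Pattern.matches("[+*?.]", "+"));      // true
    }
}
```

Pattern.quote is usually the safer choice when the literal text comes from user input.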

Groups and References

Capturing Group  (ABC) Groups multiple tokens together and creates a capture group for extracting a substring or using a backreference.

Example 1
Regex: (ha)+
Text: hahaha haa hah!
Matched Text: "hahaha", then the "ha" in "haa", then the "ha" in "hah!"

Example 2
Regex: (http://)(www)*.([\w.]*)
Text: http://www.wealthminder.com
Matched Groups:
Group 1: http://
Group 2: www
Group 3: wealthminder.com

Numeric Reference  \1 Matches the results of a capture group. For example \1 matches the results of the first capture group & \3 matches the third.

Example 1
Regex:  (\w)a\1
Text: hah dad bad dab gag gab
Matched Text: "hah", "dad" and "gag" (the words whose first and last letters are the same)

Non Capturing group  (?:ABC)  Groups multiple tokens together without creating a capture group.

Example 1
Regex:  (?:ha)+
Text: hahaha haa hah!
Matched Text: "hahaha", then the "ha" in "haa", then the "ha" in "hah!"
(The same matches as the capturing version, but no groups are created.)
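
The group examples above can be checked in Java with a sketch along these lines. GroupDemo and its helpers groups and findAll are illustrative names, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Returns capture groups 1..n if regex matches the whole text, else an empty list.
    static List<String> groups(String regex, String text) {
        List<String> groups = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        if (matcher.matches()) {
            for (int i = 1; i <= matcher.groupCount(); i++) {
                groups.add(matcher.group(i));
            }
        }
        return groups;
    }

    // Collects every non-overlapping match of regex in text.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(groups("(http://)(www)*.([\\w.]*)", "http://www.wealthminder.com"));
        // [http://, www, wealthminder.com]
        System.out.println(findAll("(ha)+", "hahaha haa hah!"));             // [hahaha, ha, ha]
        System.out.println(findAll("(\\w)a\\1", "hah dad bad dab gag gab")); // [hah, dad, gag]
        System.out.println(groups("(?:ha)+", "hahaha"));                     // [] (matches, but no capture groups)
    }
}
```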

Look Arounds

Positive look ahead  (?=ABC) Matches a group after the main expression without including it in the result.

Example
Regex:  \d(?=px)
Text: 1pt 2px 3em 4px
Matched Text: the "2" in "2px" and the "4" in "4px" (the "px" itself is not part of the match)

Negative look ahead  (?!ABC) Specifies a group that can not match after the main expression (if it matches, the result is discarded).

Example
Regex: \d(?!px)
Text: 1pt 2px 3em 4px
Matched Text: the "1" in "1pt" and the "3" in "3em"
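
Because a lookahead is not part of the match, it is handy in replacements: only the main expression is replaced. The sketch below shows this with the two examples above; LookaheadDemo, findAll and maskPixelValues are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    // Collects every non-overlapping match of regex in text.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    // Replaces each digit that is followed by "px" with '#', leaving "px" untouched.
    static String maskPixelValues(String text) {
        return text.replaceAll("\\d(?=px)", "#");
    }

    public static void main(String[] args) {
        String text = "1pt 2px 3em 4px";
        System.out.println(findAll("\\d(?=px)", text)); // [2, 4]
        System.out.println(findAll("\\d(?!px)", text)); // [1, 3]
        System.out.println(maskPixelValues(text));      // 1pt #px 3em #px
    }
}
```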

 

 

Regex Matching in Java

Pattern

A compiled representation of a regular expression.

A regular expression, specified as a string, must first be compiled into an instance of this class. The resulting pattern can then be used to create a Matcher object that can match arbitrary character sequences against the regular expression. All of the state involved in performing a match resides in the matcher, so many matchers can share the same pattern.

A typical invocation sequence is thus

Pattern p = Pattern.compile("a*b");
Matcher m = p.matcher("aaaaab");
boolean b = m.matches();

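The three statements above can also be written as a single call. This convenience method compiles the pattern on every call, so it is less efficient when the same pattern is used repeatedly: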

 boolean b = Pattern.matches("a*b", "aaaaab");
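
When the same expression is applied to many inputs, one common approach is to compile it once and create a new Matcher per input. The following is a minimal sketch of that idea; the class SharedPatternDemo and its methods are made-up names.

```java
import java.util.regex.Pattern;

class SharedPatternDemo {
    // Compiled once; Pattern instances are immutable and safe to share between threads.
    private static final Pattern A_STAR_B = Pattern.compile("a*b");

    // A Matcher holds the state of one match, so a fresh one is created per input.
    static boolean isAStarB(String input) {
        return A_STAR_B.matcher(input).matches();
    }

    // Counts how many of the inputs match the shared pattern.
    static int countMatching(String... inputs) {
        int count = 0;
        for (String input : inputs) {
            if (isAStarB(input)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(isAStarB("aaaaab"));                // true
        System.out.println(countMatching("b", "ab", "ba", "")); // 2
    }
}
```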

 

Matcher

An engine that performs match operations on a character sequence by interpreting a Pattern.

A matcher is created from a pattern by invoking the pattern’s matcher method. Once created, a matcher can be used to perform three different kinds of match operations:

  • The matches method attempts to match the entire input sequence against the pattern.
  • The lookingAt method attempts to match the input sequence, starting at the beginning, against the pattern.
  • The find method scans the input sequence looking for the next subsequence that matches the pattern.
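
The difference between the three operations is easiest to see on one input. The sketch below applies the pattern \d+ to a made-up string; MatchOperationsDemo and its method names are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchOperationsDemo {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // matches(): true only when the entire input is digits.
    static boolean isAllDigits(String input) {
        return DIGITS.matcher(input).matches();
    }

    // lookingAt(): true when the input starts with digits; trailing text is allowed.
    static boolean startsWithDigits(String input) {
        return DIGITS.matcher(input).lookingAt();
    }

    // find(): counts runs of digits anywhere in the input.
    static int countDigitRuns(String input) {
        Matcher matcher = DIGITS.matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "123abc456";
        System.out.println(isAllDigits(input));      // false
        System.out.println(startsWithDigits(input)); // true
        System.out.println(countDigitRuns(input));   // 2

        Matcher matcher = DIGITS.matcher(input);
        while (matcher.find()) {
            // Prints "123 at [0, 3)" and then "456 at [6, 9)".
            System.out.println(matcher.group() + " at [" + matcher.start() + ", " + matcher.end() + ")");
        }
    }
}
```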

 

 

Pattern Matching Examples

Example 1 : Extracting file name

Suppose file names have the form <timestamp>_<filename>.<extension>, e.g.
2346786876_AI Companies.pdf

We need to extract the actual file name and its extension, e.g.
File Name = AI Companies
Extension = pdf

A program to do this could look like:

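import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static void patternMatch(String inputFileName) {
    // Group 1: timestamp digits, group 2: file name, group 3: extension.
    String regex = "([\\d]*)_(.*)\\.([\\w]*)";
    Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    Matcher matcher = pattern.matcher(inputFileName);

    // matches() succeeds only if the whole input fits the pattern.
    if (matcher.matches()) {
        String fileName = matcher.group(2);
        String extension = matcher.group(3);
        System.out.println(fileName);
        System.out.println(extension);
    }
}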

Note: The second argument of Pattern.compile (the flags) is optional.
Here we pass Pattern.CASE_INSENSITIVE to make the regex match case-insensitive.

Explaining the Pattern Used

([\\d]*)_  : The first group matches the leading digits (the timestamp), up to the _ character

(.*)\\.  : The second group matches any characters up to the last dot (.) in the name, because .* is greedy

([\\w]*) : The last group matches only word characters, i.e. the extension
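
As an optional variation, Java (7 and later) supports named groups, which can make the extraction easier to read than numeric indexes. The sketch below is one possible rewrite under that assumption; FileNameDemo, split, the group names and the second and third sample file names are all hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FileNameDemo {
    // Same structure as the pattern above, with a name given to each group.
    private static final Pattern FILE_NAME =
            Pattern.compile("(?<timestamp>\\d*)_(?<name>.*)\\.(?<extension>\\w*)");

    // Returns {name, extension}, or null when the input does not follow the convention.
    static String[] split(String input) {
        Matcher matcher = FILE_NAME.matcher(input);
        if (!matcher.matches()) {
            return null;
        }
        return new String[] { matcher.group("name"), matcher.group("extension") };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("2346786876_AI Companies.pdf"))); // [AI Companies, pdf]
        // The greedy .* keeps earlier dots in the name and splits at the last one.
        System.out.println(Arrays.toString(split("2346786876_report.final.pdf"))); // [report.final, pdf]
        // No underscore, so the input does not match at all.
        System.out.println(Arrays.toString(split("no-timestamp.pdf")));            // null
    }
}
```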

 

 

 

 

Example 2: Extracting data from URLs

Consider sample URLs like:

http://dummy-dev-site/data/huntley-IL/jeff-aurand/285917
https://prod-site/data/san-rafael-CA/john-mcnertney/264870

  • The URLs can begin with http or https.
  • The website name is not fixed and can vary.
  • The website name is followed by /data/.
  • Then come the city and state, which we need to extract.

e.g. the cities in the URLs above are Huntley and San Rafael, and the states are IL and CA.

The following program extracts the data:

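import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static void matchURL(String[] urls) {
    // Groups: 1 = scheme, 2 = website name, 3 = city, 4 = state, 5 = rest of the path.
    String regex = "(http[s]?:\\/\\/)([\\w-.]*)/data/([\\w-]*)-([\\w]{2})/(.*)";
    Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);

    Arrays.stream(urls)
            .forEach(url -> {
                Matcher match = pattern.matcher(url);
                // groupCount() depends only on the pattern, so matches() is the only check needed.
                if (match.matches()) {
                    System.out.println("City = " + match.group(3) + ", " +
                            "State = " + match.group(4));
                }
            });
}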

 

Explaining the Pattern Used

(http[s]?:\\/\\/) : The first group matches http:// or https://

([\\w-.]*)/data/ : The second group matches the website name, which is followed by /data/ in the URL

([\\w-]*)-([\\w]{2})/ : The third and fourth groups match the city and state respectively.
The fourth group is two word characters, preceded by - and followed by /
All word characters and hyphens before that - form the third group, i.e. the city.
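
The program prints the city exactly as it appears in the URL (for example san-rafael). If a display name such as "San Rafael" is wanted, one possible follow-up step is sketched below. It assumes that the city and state are separated by a hyphen in the path, as the pattern requires; the class CityStateDemo, the method describe and the simplified pattern are illustrative, not the original program.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CityStateDemo {
    // Simplified variant of the pattern above: only city and state are captured.
    private static final Pattern DATA_URL = Pattern.compile(
            "https?://[\\w.-]*/data/([\\w-]*)-(\\w{2})/.*", Pattern.CASE_INSENSITIVE);

    // Returns e.g. "San Rafael, CA", or null when the URL does not match.
    static String describe(String url) {
        Matcher matcher = DATA_URL.matcher(url);
        if (!matcher.matches()) {
            return null;
        }
        StringBuilder city = new StringBuilder();
        // Turn a slug such as "san-rafael" into "San Rafael".
        for (String word : matcher.group(1).split("-")) {
            if (word.isEmpty()) {
                continue;
            }
            if (city.length() > 0) {
                city.append(' ');
            }
            city.append(Character.toUpperCase(word.charAt(0))).append(word.substring(1));
        }
        return city + ", " + matcher.group(2).toUpperCase();
    }

    public static void main(String[] args) {
        System.out.println(describe("http://dummy-dev-site/data/huntley-IL/jeff-aurand/285917"));
        // Huntley, IL
        System.out.println(describe("https://prod-site/data/san-rafael-CA/john-mcnertney/264870"));
        // San Rafael, CA
    }
}
```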

 

Example 3: Finding out valid matches

Example: Topic numbers are two-digit numbers separated by a dot character (.)
e.g. valid topic numbers are 01, 01.03.02.01, 31.01.03
Invalid topic numbers are 1, 1.1, 1.01, 01.a

Given a list of topic numbers, print whether each one is valid:

String[] testInput = {
        "01", "1", "1.1", "1.01", "01.01.03.04", "01.a", "31.01.13"};
Pattern pattern = Pattern.compile("[\\d]{2}([.][\\d]{2})*");
Arrays.stream(testInput).forEach(s -> 
        System.out.println(s + "\t: " + pattern.matcher(s).matches()));

 

Explaining the pattern used.

[\\d]{2}([.][\\d]{2})*

Because matches() is used, the whole string must fit the pattern: it must begin with a two-digit number,
which can be followed by zero or more repetitions of a dot and another two-digit number.
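
To print the valid and invalid numbers as two separate lists, one option is to partition the input with a stream collector. The sketch below uses the same pattern written without the redundant brackets; TopicNumberDemo, isValid and partition are hypothetical names. It also shows why matches() matters here: find() would accept any string that merely contains a valid topic number.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class TopicNumberDemo {
    // Equivalent to [\d]{2}([.][\d]{2})* from the example above.
    private static final Pattern TOPIC = Pattern.compile("\\d{2}(\\.\\d{2})*");

    // True when the whole string is a valid topic number.
    static boolean isValid(String topic) {
        return TOPIC.matcher(topic).matches();
    }

    // Splits the inputs into valid (key true) and invalid (key false), keeping input order.
    static Map<Boolean, List<String>> partition(String... inputs) {
        return Stream.of(inputs).collect(Collectors.partitioningBy(TopicNumberDemo::isValid));
    }

    public static void main(String[] args) {
        Map<Boolean, List<String>> result =
                partition("01", "1", "1.1", "1.01", "01.01.03.04", "01.a", "31.01.13");
        System.out.println("Valid:   " + result.get(true));  // [01, 01.01.03.04, 31.01.13]
        System.out.println("Invalid: " + result.get(false)); // [1, 1.1, 1.01, 01.a]

        // find() only needs a matching substring, so it would wrongly accept "1.01".
        System.out.println(TOPIC.matcher("1.01").find());    // true
    }
}
```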

 

 

Example 4 : Converting a normal SQL Query into a count(*) query

The goal is to convert a given SQL query into its equivalent count(*) query.
Consider a scenario where the SQL queries are generated at run time based on user input, and the application also needs a matching count(*) query to get the total number of records.

Consider the following SQL query:

SELECT
        jh.jobHistoryId,
        j.jobId,
        j.jobName,
        j.jobType,
        jh.startedOn
FROM
        job_history jh
        INNER JOIN job j ON jh.jobId = j.jobId
        INNER JOIN user_ user ON jh.startedBy = user.userId
where 1=1

The equivalent count(*) query is shown below: all text between SELECT and FROM has been replaced by count(*).

SELECT
        count(*)
FROM
        job_history jh
        INNER JOIN job j ON jh.jobId = j.jobId
        INNER JOIN user_ user ON jh.startedBy = user.userId
where 1=1

Since the text between SELECT and FROM varies from one user query to the next, we use a regex to replace it.

The program to convert the query could look like this:

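import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static String generateCountQuery(String sql) {
    String regex = "[\\s]*(select)([.,_\\-\\s\\d\\w]*)(from.*)";
    // DOTALL lets .* in group 3 run across the line breaks of a multi-line query.
    Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    Matcher matcher = pattern.matcher(sql);

    if (matcher.matches()) {
        return matcher.group(1) + " count(*) " + matcher.group(3);
    }
    // The input did not look like a SELECT ... FROM ... query.
    return null;
}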

Explaining the pattern used

[\\s]* : Matches any whitespace at the beginning of the SQL

(select) : Matches the SELECT keyword of the query and puts it into group 1

([.,_\\-\\s\\d\\w]*) : Matches zero or more whitespace, hyphen, dot, comma, underscore, digit and word characters (the select list), and puts them into group 2

(from.*) : Matches the text that begins with FROM and puts it into group 3. Note that . does not match line breaks unless Pattern.DOTALL is set, so a multi-line query needs that flag.

Note: Group 2 matches the text between group 1 and group 3.

The final count(*) query is then generated by concatenating group 1 + count(*) + group 3.
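
The same conversion can also be written as a regex replace instead of rebuilding the string from groups. The sketch below is one possible approach, not the program above: it uses Matcher.replaceFirst with a reluctant .*? so that the select list ends at the first standalone FROM. The class CountQueryDemo and the method toCountQuery are hypothetical, and queries with subqueries in the select list are assumed not to occur.

```java
import java.util.regex.Pattern;

class CountQueryDemo {
    // Leading whitespace, SELECT, the shortest possible select list, then FROM as a whole word.
    // DOTALL lets . cross line breaks; CASE_INSENSITIVE accepts select/SELECT/Select.
    private static final Pattern SELECT_LIST = Pattern.compile(
            "^\\s*select\\b.*?\\bfrom\\b", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the count(*) version of sql, or sql unchanged if it is not a SELECT query.
    static String toCountQuery(String sql) {
        return SELECT_LIST.matcher(sql).replaceFirst("SELECT count(*) FROM");
    }

    public static void main(String[] args) {
        String sql = "SELECT\n  j.jobId,\n  j.jobName\nFROM\n  job j\nwhere 1=1";
        System.out.println(toCountQuery(sql));
        // SELECT count(*) FROM
        //   job j
        // where 1=1
    }
}
```

The word boundaries (\b) keep a column such as from_date from being mistaken for the FROM keyword.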

 

 

Reference

https://regexr.com/

https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

https://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html