Basic Regular Expression (regex) usage: Perl, Python, R, Shell, Javascript, C

This post demonstrage simple regular expression usage in various programming languages. The examples contains two parts:

Given a simple text string (chr1 65436543 rsID776), check if it matches a specific pattern.
Extract a value from the string (e.g. extract the 1 from chr1) using regular-expression groupping.

Regular expression Flavors - there are three common ‘flavors’ of regular expression capabilities: basic, extended, perl (or Perl-Compatible) (learn more on the https://en.wikipedia.org/wiki/Regular_expression#Standards).

Generally speaking, the following are the supported flavors for each environment (examples below):

grep       => basic
grep -E    => extended
grep -P    => Perl (not always supported)
pcregrep   => Perl
sed        => basic
sed -r     => extended (on linux)
sed -E     => extended (on some macs)
awk        => extended (on linux), basic (on mac)
perl       => Perl (aka PCRE = Perl-Compatible Regular Expressions)
python     => very similar to Perl
javascript => ECMAScript (similar to Perl)
standard C => basic or extended
R          => Perl
C++11      => ECMAScript (similar to Perl)

Python

Download regex-demo.py .

#!/usr/bin/env python
import re

# Python Regex Howto + Syntax
# https://docs.python.org/2/howto/regex.html
data = "chr1   65436543   rsID776"


# Build a Regex Object based on desired pattern
regex = re.compile('^chr')
if regex.search(data):
    print "Found match! line starts with 'chr'"

# Regex to extract information (ie. 'grouping')
# Get the number following the 'chr', store in object 'm'
regex = re.compile('^chr([0-9]+)')
m = regex.search(data)
if m:
     print "found chromosome: ", m.group(1)

Perl

Download regex-demo.pl .

#!/usr/bin/env perl
use strict;
use warnings;

# Perl Quick Regex Tutorial: http://perldoc.perl.org/perlrequick.html
# Perl Long Regex Tutorial:  http://perldoc.perl.org/perlretut.html
# Perl Full Regex Syntax:    http://perldoc.perl.org/perlrequick.html
my $data = "chr1   65436543   rsID776";

# Regex to test if string matches
if ($data =~ m/^chr/) {
    print "Found match! line starts with 'chr'\n";
}

# Regex to extract information (ie. 'grouping')
# Get the number following the 'chr', store it in '$1'
if ($data =~ m/^chr([0-9]+)/) {
	print "found chromosome: $1\n";
}

Javascript

Download regex-demo.js .

#!/usr/bin/env node

// Javascript Regex Tutorial:
//   https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions
// Javascript Regex Reference:

var data = "chr1   65436543   rsID776";

// Test if 'data' matches the regex. returns NULL if no match found.
if (data.match(/^chr/)) {
    console.log("Found match! line starts with 'chr'");
}

// Regex to extract information (ie. 'grouping')
// Get the number following the 'chr', store it in array 'm'
// 'm' will be NULL if no match is found.
var m = data.match(/^chr([0-9]+)/);
if (m) {
   console.log("found chromosome: ", m[1]);
}

Shell

Download regex-demo.sh .

#!/bin/sh

# NOTE:
# 'grep -E' to use Extended Regular expression in grep (the default is 'basic').
#
# 'sed -r' to use Extended regular expressions in sed
#          (sadly does not work on Mac).

data="chr1   65436543   rsID776"

# option 1: print lines that match the regex
echo "Matching lines:"
echo "$data" | grep -E '^chr'

# option 2: just check if there is a match and print nothing
# 'grep -q' means 'quiet' (don't print anything)
if echo "$data" | grep -q -E '^chr' ; then
	echo 'Found match! line starts with 'chr'\n'
fi

# sed trick:
# The regex will find:
#   1. lines starting with 'chr'
#   2. followed by digits
#   3. then anything (.*)
# And will be replaced by whatever was found in the parenthesis group
echo "found chromosome: "
echo "$data" | sed -r 's/^chr([0-9]*).*/\1/'

C

Download regex-demo.c .

/*
 to compile:
     cc -Wall -g -o regex-demo regex-demo.c
 to run:
     ./regex-demo
*/
#include <string.h>
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

int main()
{
    const char *data = "chr1   65436543   rsID776";
    regex_t re;
    #define MAX_MATCHES (10)
    regmatch_t matches[MAX_MATCHES];

    // Compile the regex
    int i = regcomp(&re,"^chr",REG_EXTENDED);
    assert (i==0);

    // Test it on the string
    i = regexec(&re, data, 0, NULL, 0);
    if (i != REG_NOMATCH) {
	printf("Found match! line starts with 'chr'\n");
    }
    regfree(&re);


    // Compile a regex with grouping
    i = regcomp(&re,"^chr([0-9]+)",REG_EXTENDED);
    assert (i==0);

    // Test it, results will be stored in 'matches'
    i = regexec(&re, data, MAX_MATCHES, (regmatch_t*)&matches, 0);
    if ( i != REG_NOMATCH ) {
        // matches[1].rm_so => start-offset of match
        // matches[1].rm_eo => end-offset of match
        const int len = matches[1].rm_eo - matches[1].rm_so;
        const char *ptr = data + matches[1].rm_so;
        // extract the matched value into a new string
        char *val = strndup(ptr, len);
        printf("found chromosome: %s\n", val);
	free(val);
    }
    regfree(&re);

    return 0;
}

C++ (using the c++11 standard)

Download regex-demo.cpp .

/*
 NOTE:
    Using std::regex requires g++>=4.9 or a recent enough clang.
 to compile:
     c++ -std=c++11 -Wall -g -o regex-demo-cpp regex-demo.cpp
 to run:
     ./regex-demo-cpp
*/
#include <iostream>
#include <string>
#include <regex>

using namespace std;

/*
NOTES:
std::regex_match => requires the ENTIRE string to match the regex
                    (as if with '^' and '$' anchors).
std::regex_search => match any part of the string.

The above funtions accepts a 'flags' parameter specifing the regex flavor,
defaulting to EMCAScript if none is given.
See: http://www.cplusplus.com/reference/regex/regex_constants/
     http://www.cplusplus.com/reference/regex/ECMAScript/
*/

int main()
{
    const string data("chr1   65436543   rsID776");

    const regex re("^chr");
    if (std::regex_search(data,re)) {
        cout << "Found match! line starts with 'chr'" << endl;
    }

    const regex re2("^chr(\\d+)");
    smatch matches;
    if (regex_search(data,matches,re2)) {
        cout << "found chromosome: " << matches[1] << endl;
    }

    return 0;
}

R

## example 1: check if a string matches a regular expression pattern

# A vector with three character strings
data = c("chr1   65436543   rsID776",
         "foo bar",
		 "chr2   123456     rsID531431")

# 'grep' returns a vector of the indices of matching elements
grep('^chr',data,perl=T)
#
# Output: c(1, 3)
# meaning: data[1] and data[3] matched '^chr' .


# 'grepl' returns a vector of TRUE/FALSE for each element
# in the input vector.
grepl('^chr',data,perl=T)
#
# Output: c(TRUE, FALSE, TRUE)


## example 2: extract value using regex grouping
data = "chr1   65436543   rsID776"
tmp = regexec('^chr(\\d+)',data)
res = regmatches(data,tmp)[[1]]

# The first element is the entire match
> res[1]
[1] "chr1"

# The second element is the content of the matched group
> res[2]
[1] "1"

pcregrep

pcrerep (Perl Compatible Regular Expression grep) is a command-line utilitry available on most modern unixes. It supports perl-style regular expressions (e.g. with \d and \s, which are not supported in standard GREP), and also allows printing the value of the matched groups directly:

$ echo "chr1 65436543 rsID776" | pcregrep -o1 '^chr(\d+)'
1

The parameter -o1 means print the content of the first matched group (the value matching (\d+) - which is 1 in the above example).

To install, try:

sudo apt-get install pcregrep on Ubuntu/Debian
sudo yum install pcregrep on Redhat/Centos
sudo brew install pcre on Mac OS X with homebrew.