Basic Regular Expression (regex) usage: Perl, Python, R, Shell, Javascript, C
This post demonstrage simple regular expression usage in various programming languages. The examples contains two parts:
- Given a simple text string (
chr1 65436543 rsID776
), check if it matches a specific pattern. - Extract a value from the string (e.g. extract the
1
fromchr1
) using regular-expression groupping.
Regular expression Flavors - there are three common ‘flavors’ of
regular expression capabilities: basic
, extended
, perl
(or Perl-Compatible
)
(learn more on the
https://en.wikipedia.org/wiki/Regular_expression#Standards).
Generally speaking, the following are the supported flavors for each environment (examples below):
grep => basic
grep -E => extended
grep -P => Perl (not always supported)
pcregrep => Perl
sed => basic
sed -r => extended (on linux)
sed -E => extended (on some macs)
awk => extended (on linux), basic (on mac)
perl => Perl (aka PCRE = Perl-Compatible Regular Expressions)
python => very similar to Perl
javascript => ECMAScript (similar to Perl)
standard C => basic or extended
R => Perl
C++11 => ECMAScript (similar to Perl)
Python
Download regex-demo.py .
#!/usr/bin/env python
import re
# Python Regex Howto + Syntax
# https://docs.python.org/2/howto/regex.html
data = "chr1 65436543 rsID776"
# Build a Regex Object based on desired pattern
regex = re.compile('^chr')
if regex.search(data):
print "Found match! line starts with 'chr'"
# Regex to extract information (ie. 'grouping')
# Get the number following the 'chr', store in object 'm'
regex = re.compile('^chr([0-9]+)')
m = regex.search(data)
if m:
print "found chromosome: ", m.group(1)
Perl
Download regex-demo.pl .
#!/usr/bin/env perl
use strict;
use warnings;
# Perl Quick Regex Tutorial: http://perldoc.perl.org/perlrequick.html
# Perl Long Regex Tutorial: http://perldoc.perl.org/perlretut.html
# Perl Full Regex Syntax: http://perldoc.perl.org/perlrequick.html
my $data = "chr1 65436543 rsID776";
# Regex to test if string matches
if ($data =~ m/^chr/) {
print "Found match! line starts with 'chr'\n";
}
# Regex to extract information (ie. 'grouping')
# Get the number following the 'chr', store it in '$1'
if ($data =~ m/^chr([0-9]+)/) {
print "found chromosome: $1\n";
}
Javascript
Download regex-demo.js .
#!/usr/bin/env node
// Javascript Regex Tutorial:
// https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions
// Javascript Regex Reference:
var data = "chr1 65436543 rsID776";
// Test if 'data' matches the regex. returns NULL if no match found.
if (data.match(/^chr/)) {
console.log("Found match! line starts with 'chr'");
}
// Regex to extract information (ie. 'grouping')
// Get the number following the 'chr', store it in array 'm'
// 'm' will be NULL if no match is found.
var m = data.match(/^chr([0-9]+)/);
if (m) {
console.log("found chromosome: ", m[1]);
}
Shell
Download regex-demo.sh .
#!/bin/sh
# NOTE:
# 'grep -E' to use Extended Regular expression in grep (the default is 'basic').
#
# 'sed -r' to use Extended regular expressions in sed
# (sadly does not work on Mac).
data="chr1 65436543 rsID776"
# option 1: print lines that match the regex
echo "Matching lines:"
echo "$data" | grep -E '^chr'
# option 2: just check if there is a match and print nothing
# 'grep -q' means 'quiet' (don't print anything)
if echo "$data" | grep -q -E '^chr' ; then
echo 'Found match! line starts with 'chr'\n'
fi
# sed trick:
# The regex will find:
# 1. lines starting with 'chr'
# 2. followed by digits
# 3. then anything (.*)
# And will be replaced by whatever was found in the parenthesis group
echo "found chromosome: "
echo "$data" | sed -r 's/^chr([0-9]*).*/\1/'
C
Download regex-demo.c .
/*
to compile:
cc -Wall -g -o regex-demo regex-demo.c
to run:
./regex-demo
*/
#include <string.h>
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
int main()
{
const char *data = "chr1 65436543 rsID776";
regex_t re;
#define MAX_MATCHES (10)
regmatch_t matches[MAX_MATCHES];
// Compile the regex
int i = regcomp(&re,"^chr",REG_EXTENDED);
assert (i==0);
// Test it on the string
i = regexec(&re, data, 0, NULL, 0);
if (i != REG_NOMATCH) {
printf("Found match! line starts with 'chr'\n");
}
regfree(&re);
// Compile a regex with grouping
i = regcomp(&re,"^chr([0-9]+)",REG_EXTENDED);
assert (i==0);
// Test it, results will be stored in 'matches'
i = regexec(&re, data, MAX_MATCHES, (regmatch_t*)&matches, 0);
if ( i != REG_NOMATCH ) {
// matches[1].rm_so => start-offset of match
// matches[1].rm_eo => end-offset of match
const int len = matches[1].rm_eo - matches[1].rm_so;
const char *ptr = data + matches[1].rm_so;
// extract the matched value into a new string
char *val = strndup(ptr, len);
printf("found chromosome: %s\n", val);
free(val);
}
regfree(&re);
return 0;
}
C++ (using the c++11 standard)
Download regex-demo.cpp .
/*
NOTE:
Using std::regex requires g++>=4.9 or a recent enough clang.
to compile:
c++ -std=c++11 -Wall -g -o regex-demo-cpp regex-demo.cpp
to run:
./regex-demo-cpp
*/
#include <iostream>
#include <string>
#include <regex>
using namespace std;
/*
NOTES:
std::regex_match => requires the ENTIRE string to match the regex
(as if with '^' and '$' anchors).
std::regex_search => match any part of the string.
The above funtions accepts a 'flags' parameter specifing the regex flavor,
defaulting to EMCAScript if none is given.
See: http://www.cplusplus.com/reference/regex/regex_constants/
http://www.cplusplus.com/reference/regex/ECMAScript/
*/
int main()
{
const string data("chr1 65436543 rsID776");
const regex re("^chr");
if (std::regex_search(data,re)) {
cout << "Found match! line starts with 'chr'" << endl;
}
const regex re2("^chr(\\d+)");
smatch matches;
if (regex_search(data,matches,re2)) {
cout << "found chromosome: " << matches[1] << endl;
}
return 0;
}
R
## example 1: check if a string matches a regular expression pattern
# A vector with three character strings
data = c("chr1 65436543 rsID776",
"foo bar",
"chr2 123456 rsID531431")
# 'grep' returns a vector of the indices of matching elements
grep('^chr',data,perl=T)
#
# Output: c(1, 3)
# meaning: data[1] and data[3] matched '^chr' .
# 'grepl' returns a vector of TRUE/FALSE for each element
# in the input vector.
grepl('^chr',data,perl=T)
#
# Output: c(TRUE, FALSE, TRUE)
## example 2: extract value using regex grouping
data = "chr1 65436543 rsID776"
tmp = regexec('^chr(\\d+)',data)
res = regmatches(data,tmp)[[1]]
# The first element is the entire match
> res[1]
[1] "chr1"
# The second element is the content of the matched group
> res[2]
[1] "1"
pcregrep
pcrerep
(Perl Compatible Regular Expression grep) is
a command-line utilitry available on most modern unixes. It supports
perl-style regular expressions (e.g. with \d
and \s
, which are not
supported in standard GREP), and also allows printing the value of the
matched groups directly:
$ echo "chr1 65436543 rsID776" | pcregrep -o1 '^chr(\d+)'
1
The parameter -o1
means print the content of the first matched
group (the value matching (\d+)
- which is 1
in the above example).
To install, try:
sudo apt-get install pcregrep
on Ubuntu/Debiansudo yum install pcregrep
on Redhat/Centossudo brew install pcre
on Mac OS X with homebrew.