Regular expressions ⎕R ⎕S

⎕R and ⎕S are Dyalog’s regex operators. Take note that they are operators, not functions; their operator syntax occasionally has unexpected consequences, so it is important to remember this. They are dyadic operators. The left operand is always a character scalar, a character vector, or a vector of such. The right operand may also be any of those, but can also be a function (of any kind: tacit, dfn or tradfn), and ⎕S can also take an integer scalar or vector as right operand.

They then derive an ambivalent function which can be named or applied to text. Some of their behaviour can be modified with the ⍠ (variant) operator, but since operators can only take functions (or arrays) as operands, ⍠ will be acting on the derived function, not on ⎕R or ⎕S themselves. This may sound trivial, but it means that you cannot make a case-insensitive (more about that later) version of ⎕S with MyRegexMachine←⎕S⍠1, only MyRegexMachine←'something'⎕S'something else'⍠1.
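As a small sketch of that point (Hide is a made-up name for illustration), the variant is bound to the complete derived function, which can then be named and applied like any other function:

Hide←'[aeiou]'⎕R'_'⍠1
Hide 'And Or'
_nd _r

Both the A and the O are replaced because ⍠1 made the derived function case insensitive.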

Basic use

Final note before we really start: The regex flavour is PCRE, which is well documented, so we won’t go into too much detail about it. It is summarised here and described in detail here.

⎕R (Replace) changes text in-place and returns the entire amended argument. ⎕S only returns the amended match(es). In most other aspects, they are identical, so when we speak of one, it applies to the other unless otherwise noted.

OK, the basic example is:

'and' ⎕R 'or' ⊢ 'Programming Puzzles and Code Golf'  ⍝ Replace 'and' with 'or'
Programming Puzzles or Code Golf
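For comparison, here is a sketch of the same operands with ⎕S instead of ⎕R. It assumes boxed display (]box on), as the other nested results on this page do. Only the transformed match comes back, as a one-element list, and the surrounding text is dropped:

'and' ⎕S 'or' ⊢ 'Programming Puzzles and Code Golf'
┌──┐
│or│
└──┘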

However, the operands are not just simple text vectors, but rather regexes. For the left operand, that’s just regular PCRE to find a match, but the right operand uses something that very much feels like regex, but in fact is a Dyalog-invented notation to indicate what you want the match replaced by.

The first such notational symbol is & which means the match itself; in other words, no change:

'(.)\1' ⎕S '&' ⊢ 'Programming Puzzles and Code Golf'  ⍝ Match repeated pairs
┌──┬──┐
│mm│zz│
└──┴──┘

The left operand is just PCRE: . is any character, the parentheses form a capture group, which gives it a number, and \1 is a back-reference to the first such group. So the pattern matches any two identical characters in a row.

A % in the right operand means the entire container (line or document) which contained the match:

'(.)\1' ⎕S '%' ⊢ 'Programming' 'Puzzles' 'and' 'Code' 'Golf'
┌───────────┬───────┐
│Programming│Puzzles│
└───────────┴───────┘

So this returned a list of all lines which contained double letters.

The transformation string in depth

We’ve earlier talked about how simple APL’s “string” (i.e. character vector) model is. The only special character is the quote which you need to double. There’s no escaping, rather you have to use …',(⎕UCS nn),'….

However, in the transformation string (that’s what the right operand is called), you may also use some common escapes: \n and \r for newline and carriage return, and \x{nn} for any other Unicode character, where nn is in hex. Moreover, as &, % and \ are special, you’ll have to escape them too, with a prefix backslash.
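A couple of minimal sketches of escaping the special characters (the input texts are made up for illustration). Without the backslashes, & would insert the match and % the whole block:

'and' ⎕R '\&' ⊢ 'Programming Puzzles and Code Golf'
Programming Puzzles & Code Golf

'all' ⎕R '100\%' ⊢ 'all done'
100% done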

You may of course mix and match transformation strings as you please:

'(.)\1' ⎕S '"%" has "&"' ⊢ 'Programming' 'Puzzles' 'and' 'Code' 'Golf'
┌──────────────────────┬──────────────────┐
│"Programming" has "mm"│"Puzzles" has "zz"│
└──────────────────────┴──────────────────┘

You can also refer to the numbered capture groups with \N (or \(NN) for two-digit numbers):

'(.)\1' ⎕S '"%" has two "\1"s' ⊢ 'Programming' 'Puzzles' 'and' 'Code' 'Golf'
┌──────────────────────────┬──────────────────────┐
│"Programming" has two "m"s│"Puzzles" has two "z"s│
└──────────────────────────┴──────────────────────┘
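Group references are perhaps most useful with ⎕R, for rearranging parts of a match. A small sketch with two capture groups (the pattern and text are only illustrative):

'(\w+) (\w+)' ⎕R '\2 \1' ⊢ 'Code Golf'
Golf Code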

Finally, you can fold to upper or lowercase by inserting u or l immediately after the backslash (adding a backslash to & and %):

'(.)\1' ⎕S '"\u%" has 2 "\u1"s' ⊢ 'Programming' 'Puzzles' 'and' 'Code' 'Golf'
┌────────────────────────┬────────────────────┐
│"PROGRAMMING" has 2 "M"s│"PUZZLES" has 2 "Z"s│
└────────────────────────┴────────────────────┘

This means that you can also use ⎕R to just fold case (like ⎕C):

'.'⎕R'\u&'⊢'Programming Puzzles and Code Golf'
PROGRAMMING PUZZLES AND CODE GOLF
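Since the folding applies only to the text it is attached to, it can also be more selective than ⎕C. The following sketch uses the PCRE word boundary \b to capture just the first letter of each word and uppercase it; the second line lowercases whole words:

'\b(\w)' ⎕R '\u1' ⊢ 'programming puzzles and code golf'
Programming Puzzles And Code Golf

'\w+' ⎕R '\l&' ⊢ 'Code GOLF'
code golf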

In addition to using these text-based codes, ⎕S can also use a few numeric codes which then return numeric results.

0 is the offset from the start of the input of the start of the match:

'(.)\1'⎕S 0⊢'Programming Puzzles and Code Golf'
6 14

The above means that mm and zz begin 6 and 14 characters from the start of the input. Notice that these are offsets, not indices, so they correspond to indices in origin 0 (⎕IO←0).

1 is the length of the match:

'\w+' ⎕S 1 ⊢ 'Programming Puzzles and Code Golf' ⍝ Length of each word
11 7 3 4 4

\w is any word character, and + means one or more, so this matches whole words, and the result is a list of word lengths.

Question:

Is there a way to get how many uppercased characters there are in a string?

You can e.g. match all uppercase letters and then tally the result:

≢'[[:upper:]]' ⎕S 0 ⊢ 'Programming Puzzles and Code Golf'   ⍝ POSIX character class reflecting your locale
≢'[A-Z]' ⎕S 0 ⊢ 'Programming Puzzles and Code Golf'         ⍝ Ranged character class
≢'\p{Lu}' ⎕S 0 ⊢ 'Programming Puzzles and Code Golf'        ⍝ Unicode uppercase letter property
4
4
4

2 is the number of the block which had the match:

'(.)\1' ⎕S 2 ⊢ 'Programming' 'Puzzles' 'and' 'Code' 'Golf'
0 1

So we can see that only strings 0 and 1 had double-letters (again, always origin 0.)

Simultaneous patterns

The last one, 3, is the pattern number, which brings us to an amazing feature of ⎕R and ⎕S: multiple simultaneous patterns:

'(.)\1' 'P' ⎕S 3 ⊢ 'Programming Puzzles and Code Golf'
1 0 1 0

Again, the patterns are numbered in origin 0, so 0 is the double-letter pattern and 1 is P. Reading the result from left to right, we first find a P, then a double-letter (mm), then another P, and then a double-letter (zz). The amazing thing about the multiple patterns is that ⎕R and ⎕S step through the input letter by letter, and for each letter they look whether each pattern (from left to right) begins there.
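A small sketch of what that left-to-right rule means when two patterns can match at the same position (the patterns are chosen only for illustration). With 'Pro' first, it claims the start of Programming and P only gets the second word; with P first, it wins at both places and 'Pro' never matches:

'Pro' 'P' ⎕S 3 ⊢ 'Programming Puzzles'
0 1

'P' 'Pro' ⎕S 3 ⊢ 'Programming Puzzles'
0 0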

You can of course also have multiple transformation patterns. This means that you can use a pattern to exclude from other patterns by placing the exclusion first, and replacing with the match (&):

' ' '\w' ⎕R (,¨'&' '_') ⊢ 'Programming Puzzles and Code Golf'
___________ _______ ___ ____ ____

This replaced spaces with themselves, and word characters with underscores.

(,¨' ' '.') ⎕R (,¨'&' '_') ⊢ 'Programming Puzzles and Code Golf'
___________ _______ ___ ____ ____

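But here, the second pattern (.) matches any character, including spaces, and the result is still the same: the space pattern comes first, so it gets to replace each space with itself before . is even tried there.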
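The same exclusion trick can protect larger pieces of text. In this sketch (the sample text is made up), the first pattern matches a double-quoted string and replaces it with itself, so the second pattern only gets to uppercase the words outside the quotes:

'"[^"]*"' '\w+' ⎕R ((,'&') '\u&') ⊢ 'say "hi" now'
SAY "hi" NOW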

The vectorisation also works differently for numeric and text operands. Text goes pairwise, while numbers return the entire list for each. You can have one transformation string for each matching string, or a single transformation string for all the matching strings:

(,¨'aeiou') ⎕R (,¨'AEIOU') ⊢ 'Programming Puzzles and Code Golf'
(,¨'aeiou') ⎕R '_' ⊢ 'Programming Puzzles and Code Golf'
PrOgrAmmIng PUzzlEs And COdE GOlf
Pr_gr_mm_ng P_zzl_s _nd C_d_ G_lf

But of course, you can’t have multiple transformation strings for a single matching string:

'o'⎕R(,¨'AEIOU')⊢'Programming Puzzles and Code Golf'  ⍝ LENGTH ERROR
LENGTH ERROR: Invalid transformation format
      'o'⎕R(,¨'AEIOU')⊢'Programming Puzzles and Code Golf'  ⍝ LENGTH ERROR
         ∧
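The numeric side of the vectorisation rule can be sketched like this (boxed display assumed). Giving ⎕S the codes 0 1 returns, for every match, the whole list of offset and length:

'\w+' ⎕S 0 1 ⊢ 'Code Golf'
┌───┬───┐
│0 4│5 4│
└───┴───┘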

Variants

We mentioned earlier that you can use the variant operator, ⍠. The most commonly used option is case insensitivity, so it is the principal option, which means that you don’t have to use the name-value pair ⍠'IC' 1 (Insensitive Case); ⍠1 is enough:

'g'⎕R'_'⍠1⊢'Programming Puzzles and Code Golf'
Pro_rammin_ Puzzles and Code _olf

Notice that g matched both upper and lowercase Gs.
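For reference, the same call can be sketched with the option spelled out as the name-value pair, which should behave identically:

'g' ⎕R '_' ⍠ 'IC' 1 ⊢ 'Programming Puzzles and Code Golf'
Pro_rammin_ Puzzles and Code _olf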

Another cool option is for ⎕S only: ⍠'OM' 1 (Overlapping Matches):

'[^aeiou]{3}'⎕S'&'⊢'Programming Puzzles and Code Golf'  ⍝ Non-overlapping matches
┌───┬───┬───┐
│ng │zzl│nd │
└───┴───┴───┘

[^aeiou] is a negated character class, which means any character that is NOT one of these letters, and {3} means exactly three of such.

'[^aeiou]{3}'⎕S'&'⍠'OM'1⊢'Programming Puzzles and Code Golf' ⍝ Overlapping matches
┌───┬───┬───┬───┬───┐
│ng │g P│zzl│nd │d C│
└───┴───┴───┴───┴───┘

Notice how this matched g P even though its first two letters were already found in the first match. ⎕R cannot allow overlapping matches because that may lead to infinite substitution looping: 'x' ⎕R 'xx'⍠'OM' 1 would loop forever. In xyz it would first replace x with xx to get xxyz then continue at the next character, which also matches, and makes xxxyz, etc.
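A tiny sketch that isolates the effect, using offsets on a made-up input. Normally the search resumes after the end of each match; with overlapping matches it resumes one character after the start of each match:

'aa' ⎕S 0 ⊢ 'aaaa'
0 2

'aa' ⎕S 0 ⍠ 'OM' 1 ⊢ 'aaaa'
0 1 2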

Function operand

Arguably the most powerful feature of them all is the fact that the right operand may be any monadic (or ambivalent) function. The right argument (which may of course be ignored) will be a namespace with a few members. This namespace survives between matches for the entire time that the current ⎕R/⎕S call is ongoing, so you can further populate the namespace and use it to convey information from earlier matches to later matches. The only names that are reserved (i.e. get overwritten each time your operand function is called) are:

  • Block – same as %

  • BlockNum – same as 2

  • Pattern – the literal pattern which matched (i.e. not the match itself)

  • PatternNum – the origin 0 number of the above

  • Match – same as &

  • Offsets – first element is same as 0 but has additional elements corresponding to capture groups

  • Lengths – first element is same as 1 but has additional elements corresponding to capture groups

  • ReplaceMode – 0 for ⎕S and 1 for ⎕R

  • TextOnly – Boolean whether the result of the function must be a character vector (i.e. for ⎕R) or can be anything (i.e. for ⎕S).

The function can then do any computation necessary to determine its result, so you could even have it prompt the user for whether to replace this match or not (i.e. when implementing a “Replace All” button in an editor). This of course renders ⎕R and ⎕S as powerful as Dyalog APL as a whole – they are both supersets and subsets of Dyalog APL!
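A few minimal sketches of function operands, using dfns that only look at the Match member (the sample texts are made up). The first reverses every word. The second doubles every number, so the function has to convert the match to a number with ⍎ and back to text with ⍕ because ⎕R needs a character result. The third uses ⎕S, where the function may return numbers directly:

'\w+' ⎕R {⌽⍵.Match} ⊢ 'Programming Puzzles and Code Golf'
gnimmargorP selzzuP dna edoC floG

'\d+' ⎕R {⍕2×⍎⍵.Match} ⊢ 'buy 3 apples and 12 pears'
buy 6 apples and 24 pears

'\d+' ⎕S {⍎⍵.Match} ⊢ 'buy 3 apples and 12 pears'
3 12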