Counting words, faster#

As a practical application, let’s consider the counting of words in a string. There are many ways to do that, but I’ll show you how an array oriented approach can give tremendous speed-ups. But first we have to generate some test data. Since actual letters don’t matter, we’ll just have a text consisting of XY,. , will be our “space” because it is easier to see that way. Now a Perl programmer would, of course, jump to regular expressions. As we’ve seen, Dyalog APL has a really powerful support for the PCRE-flavour of regular expressions built-in:

'[^,]+'⎕S 3',YYY,,YYYYYY,,XXXXXX,YYYYYYXXYYY,YYYXXYYYXX,XX,XXYYYXXXX,YYY'
8

⎕S is an operator which takes the regex on its left and what to return for each match on its right. 3 is a special code meaning the pattern number, which is just 0 because we only have one regex. Then we tally (count) that with and we’re done.

Another approach is to just split on the delimiter: a good job for here. If you give it a Boolean mask as left argument, it isolates runs corresponding to runs of 1s, discarding the elements corresponding to 0s:

','',YYY,,YYYYYY,,XXXXXX,YYYYYYXXYYY,YYYXXYYYXX,XX,XXYYYXXXX,YYY'      ⍝ non-delimiters
','(≠⊆⊢)',YYY,,YYYYYY,,XXXXXX,YYYYYYXXYYY,YYYXXYYYXX,XX,XXYYYXXXX,YYY'  ⍝ groups corresponding to runs of 1s
0 1 1 1 0 0 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1
┌───┬──────┬──────┬───────────┬──────────┬──┬─────────┬───┐
│YYY│YYYYYY│XXXXXX│YYYYYYXXYYY│YYYXXYYYXX│XX│XXYYYXXXX│YYY│
└───┴──────┴──────┴───────────┴──────────┴──┴─────────┴───┘

Read ≠⊆⊢ as “the difference partitions the right argument”. What remains is to count the partitions:

','(≢≠⊆⊢)',YYY,,YYYYYY,,XXXXXX,YYYYYYXXYYY,YYYXXYYYXX,XX,XXYYYXXXX,YYY'
8

This solution has an issue, but before we get to that, let’s compare the performance of the “pure” APL solution to the regex solution.

cmpx '≢''[^,]+''⎕S 3⊢t' 's(≢≠⊆⊢)t'  s','  t',XY'[/⍨?1e63]  'cmpx'⎕CY'dfns'
2000655
  ≢'[^,]+'⎕S 3⊢t → 5.3E¯2 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  s(≢≠⊆⊢)t       → 3.2E¯2 | -40% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕                

On about 2 million characters we’re saving two thirds of the running time by using the split and count approach over regex. Quite a bit faster, but there is more scope: it is problematic that we split the array to count the pieces, as this has to make a new (pointer!) array.

So our issue is that we need to ignore multiple spaces. We actually need to do edge detection. If we have a text, say ,YYY,,YYYYYY,, we want to see whenever we go from a non-space to a space (or the opposite). The only gotcha is at the end, if there are no trailing spaces, we will miss the last word. APL has the “find” function :

'ss''mississippi'
0 0 1 0 0 1 0 0 0 0 0

It indicates the beginning of its left argument (“the top-left corner”) in its right argument. So now we can create an is-space mask, and look for 0 1.

','=',YYY,,YYYYYY,,'        ⍝ is-space mask
0 1','=',YYY,,YYYYYY,,'    ⍝ locate star points for 0 1 patterns
+/0 1','=',YYY,,YYYYYY,,'  ⍝ count them
1 0 0 0 1 1 0 0 0 0 0 0 1 1
0 0 0 1 0 0 0 0 0 0 0 1 0 0
2

However, it counts wrong here:

+/0 1','=',YYY,,YYYYYY'
1

So we need to add a “space” to the end.

','{+/0 11,=}',YYY,,YYYYYY'
2

How do they stack up, speed-wise?

cmpx '≢''[^,]+''⎕S 3⊢t' 's(≢≠⊆⊢)t' 's{+/0 1⍷1,⍨⍺=⍵}t' s','  t',XY'[/⍨?1e63]
2000992
  ≢'[^,]+'⎕S 3⊢t   → 5.3E¯2 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  s(≢≠⊆⊢)t         → 3.2E¯2 | -40% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕                
  s{+/0 1⍷1,⍨⍺=⍵}t → 2.4E¯2 | -56% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕                      

Unfortunately, that’s a bit slower, but perhaps we can think of another way to exploit our idea to look for the 0 to 1 transition? Since we’re looking for 0 1, we can just insert < between elements, using a windowed reduction:

','{+/2</1,=}',YYY,,YYYYYY'
2

As before, we append a 1 after we’ve calculated our binary mask. We could, of course, also have written that as

','{+/2</=,}',YYY,,YYYYYY'
2

which is adding an extra ‘,’ at the end, before calculating the mask. When we concatenate the space to the string, APL has to create a copy of the whole string with one additional byte at the end, which is costlier than appending a 1 to a bit-Boolean array as we did in the first version.

Note also that 2</ takes care of duplicate spaces. What about performance?

cmpx '≢''[^,]+''⎕S 3⊢t' 's(≢≠⊆⊢)t' 's{+/0 1⍷⍺=⍵,⍺}t' 's{+/2</1,⍨⍺=⍵}t' s','  t',XY'[/⍨?1e63]
2000631
  ≢'[^,]+'⎕S 3⊢t  → 5.4E¯2 |    0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  s(≢≠⊆⊢)t        → 3.2E¯2 |  -41% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕                
  s{+/0 1⍷⍺=⍵,⍺}t → 2.4E¯2 |  -56% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕                      
  s{+/2</1,⍨⍺=⍵}t → 1.9E¯4 | -100%                                         

That’s a crushing improvement! Let’s zoom in a bit, by removing the slower versions from the comparison:

cmpx 's(≢≠⊆⊢)t' 's{+/2</(⍺=⍵),1}t' s','  t',XY'[/⍨?1e63]
2000890
  s(≢≠⊆⊢)t         → 3.2E¯2 |    0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  s{+/2</(⍺=⍵),1}t → 1.9E¯4 | -100%