Counting words, faster
Counting words, faster#
As a practical application, let’s consider the counting of words in a string. There are many ways to do that, but I’ll show you how an array oriented approach can give tremendous speed-ups. But first we have to generate some test data. Since actual letters don’t matter, we’ll just have a text consisting of XY,
. ,
will be our “space” because it is easier to see that way. Now a Perl programmer would, of course, jump to regular expressions. As we’ve seen, Dyalog APL has a really powerful support for the PCRE-flavour of regular expressions built-in:
≢'[^,]+'⎕S 3⊢',YYY,,YYYYYY,,XXXXXX,YYYYYYXXYYY,YYYXXYYYXX,XX,XXYYYXXXX,YYY'
8
⎕S
is an operator which takes the regex on its left and what to return for each match on its right. 3 is a special code meaning the pattern number, which is just 0 because we only have one regex. Then we tally (count) that with ≢
and we’re done.
Another approach is to just split on the delimiter: a good job for ⊆
here. If you give it a Boolean mask as left argument, it isolates runs corresponding to runs of 1s, discarding the elements corresponding to 0s:
','≠',YYY,,YYYYYY,,XXXXXX,YYYYYYXXYYY,YYYXXYYYXX,XX,XXYYYXXXX,YYY' ⍝ non-delimiters
','(≠⊆⊢)',YYY,,YYYYYY,,XXXXXX,YYYYYYXXYYY,YYYXXYYYXX,XX,XXYYYXXXX,YYY' ⍝ groups corresponding to runs of 1s
0 1 1 1 0 0 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1
┌───┬──────┬──────┬───────────┬──────────┬──┬─────────┬───┐ │YYY│YYYYYY│XXXXXX│YYYYYYXXYYY│YYYXXYYYXX│XX│XXYYYXXXX│YYY│ └───┴──────┴──────┴───────────┴──────────┴──┴─────────┴───┘
Read ≠⊆⊢
as “the difference partitions the right argument”. What remains is to count the partitions:
','(≢≠⊆⊢)',YYY,,YYYYYY,,XXXXXX,YYYYYYXXYYY,YYYXXYYYXX,XX,XXYYYXXXX,YYY'
8
This solution has an issue, but before we get to that, let’s compare the performance of the “pure” APL solution to the regex solution.
cmpx '≢''[^,]+''⎕S 3⊢t' 's(≢≠⊆⊢)t' ⊣ s←',' ⊣ ⎕←≢t←',XY'[/⍨?1e6⍴3] ⊣ 'cmpx'⎕CY'dfns'
2000655 ≢'[^,]+'⎕S 3⊢t → 5.3E¯2 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕ s(≢≠⊆⊢)t → 3.2E¯2 | -40% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
On about 2 million characters we’re saving two thirds of the running time by using the split and count approach over regex. Quite a bit faster, but there is more scope: it is problematic that we split the array to count the pieces, as this has to make a new (pointer!) array.
So our issue is that we need to ignore multiple spaces. We actually need to do edge detection. If we have a text, say ,YYY,,YYYYYY,,
we want to see whenever we go from a non-space to a space (or the opposite). The only gotcha is at the end, if there are no trailing spaces, we will miss the last word. APL has the “find” function ⍷
:
'ss'⍷'mississippi'
0 0 1 0 0 1 0 0 0 0 0
It indicates the beginning of its left argument (“the top-left corner”) in its right argument. So now we can create an is-space mask, and look for 0 1
.
','=',YYY,,YYYYYY,,' ⍝ is-space mask
0 1⍷','=',YYY,,YYYYYY,,' ⍝ locate star points for 0 1 patterns
+/0 1⍷','=',YYY,,YYYYYY,,' ⍝ count them
1 0 0 0 1 1 0 0 0 0 0 0 1 1
0 0 0 1 0 0 0 0 0 0 0 1 0 0
2
However, it counts wrong here:
+/0 1⍷','=',YYY,,YYYYYY'
1
So we need to add a “space” to the end.
','{+/0 1⍷1,⍨⍺=⍵}',YYY,,YYYYYY'
2
How do they stack up, speed-wise?
cmpx '≢''[^,]+''⎕S 3⊢t' 's(≢≠⊆⊢)t' 's{+/0 1⍷1,⍨⍺=⍵}t'⊣ s←',' ⊣ ⎕←≢t←',XY'[/⍨?1e6⍴3]
2000992 ≢'[^,]+'⎕S 3⊢t → 5.3E¯2 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕ s(≢≠⊆⊢)t → 3.2E¯2 | -40% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕ s{+/0 1⍷1,⍨⍺=⍵}t → 2.4E¯2 | -56% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
Unfortunately, that’s a bit slower, but perhaps we can think of another way to exploit our idea to look for the 0 to 1 transition? Since we’re looking for 0 1
, we can just insert <
between elements, using a windowed reduction:
','{+/2</1,⍨⍺=⍵}',YYY,,YYYYYY'
2
As before, we append a 1 after we’ve calculated our binary mask. We could, of course, also have written that as
','{+/2</⍺=⍵,⍺}',YYY,,YYYYYY'
2
which is adding an extra ‘,’ at the end, before calculating the mask. When we concatenate the space to the string, APL has to create a copy of the whole string with one additional byte at the end, which is costlier than appending a 1 to a bit-Boolean array as we did in the first version.
Note also that 2</
takes care of duplicate spaces. What about performance?
cmpx '≢''[^,]+''⎕S 3⊢t' 's(≢≠⊆⊢)t' 's{+/0 1⍷⍺=⍵,⍺}t' 's{+/2</1,⍨⍺=⍵}t'⊣ s←',' ⊣ ⎕←≢t←',XY'[/⍨?1e6⍴3]
2000631 ≢'[^,]+'⎕S 3⊢t → 5.4E¯2 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕ s(≢≠⊆⊢)t → 3.2E¯2 | -41% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕ s{+/0 1⍷⍺=⍵,⍺}t → 2.4E¯2 | -56% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕ s{+/2</1,⍨⍺=⍵}t → 1.9E¯4 | -100%
That’s a crushing improvement! Let’s zoom in a bit, by removing the slower versions from the comparison:
cmpx 's(≢≠⊆⊢)t' 's{+/2</(⍺=⍵),1}t'⊣ s←',' ⊣ ⎕←≢t←',XY'[/⍨?1e6⍴3]
2000890 s(≢≠⊆⊢)t → 3.2E¯2 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕ s{+/2</(⍺=⍵),1}t → 1.9E¯4 | -100%