Dealing with real data#

You want it in one line? Does it have to fit in 80 columns? –Larry Wall

Amazingly, we got all the way to here without once mentioning how to get data in (or out) with less than typing it in. Nor have we dealt with “real” data much. Real data is usually messy. Dyalog APL has a rich set of routines to deal with data, either as bytes on disk, or formatted as JSON or CSV (or XML, but let’s keep pretending that XML was never actually a thing), or fetching it from a database, or as an HTTP-request. Some of this stuff we won’t cover – if you need it, you’re probably capable enough as an APLer to figure it out yourself.

Some other resources to consult:

⎕IO  0
]box on
]rows on
Was ON
Was OFF

Reading text files: ⎕NGET#

The system function ⎕NGET reads text files from disk:

⎕NGET'/Users/stefan/work/dyalog/file.txt'1
┌───────────────┬───────────────┬───────────────┬───────────────┐
│1 8 6 9 5 9 5 9│6 3 4 2 8 1 8 2│8 0 7 5 5 1 2 3│6 5 2 9 8 2 6 1│
└───────────────┴───────────────┴───────────────┴───────────────┘

We can note two things from that:

  1. We disclose the result (actually, we pick the first item)

  2. There is a 1 after the file name

Let’s explore what happens if we don’t do these things.

⎕NGET'/Users/stefan/work/dyalog/file.txt'1
┌─────────────────────────────────────────────────────────────────┬───────────┬──┐
│┌───────────────┬───────────────┬───────────────┬───────────────┐│UTF-8-NOBOM│10│
││1 8 6 9 5 9 5 9│6 3 4 2 8 1 8 2│8 0 7 5 5 1 2 3│6 5 2 9 8 2 6 1││           │  │
│└───────────────┴───────────────┴───────────────┴───────────────┘│           │  │
└─────────────────────────────────────────────────────────────────┴───────────┴──┘

We can see that when we read a file with ⎕NGET we get back a vector with several elements, the first is the data itself, and the second contains information about the file’s encoding. The last one is Dyalog’s opinion of what the file’s newline separator is. This will vary across operating systems.

What happens if we leave out the 1 at the end?

⎕NGET'/Users/stefan/work/dyalog/file.txt'
┌───────────────┬───────────┬──┐
│1 8 6 9 5 9 5 9│UTF-8-NOBOM│10│
│6 3 4 2 8 1 8 2│           │  │
│8 0 7 5 5 1 2 3│           │  │
│6 5 2 9 8 2 6 1│           │  │
│               │           │  │
└───────────────┴───────────┴──┘

Leaving out the 1 (or indeed instead passing the default value of 0) returns a single character vector containing the data:

⍴⊃⎕NGET'/Users/stefan/work/dyalog/file.txt'
64

This is probably not what you want in most cases when processing a text file. Note that we’ve read lines of characters. But we can convert them to numbers:

1⊢↑⊃⎕NGET'/Users/stefan/work/dyalog/file.txt'1 ⍝ Note health warning below!
1 8 6 9 5 9 5 9
6 3 4 2 8 1 8 2
8 0 7 5 5 1 2 3
6 5 2 9 8 2 6 1

We mixed the vector into an array of rank 2, then applied the Hydrant, at rank 1 (vectors). Hydrant is a function called Execute, and it takes a string and evaluates it as if it was a little APL program, similar to how eval() works in Python and Perl:

 '1 8 6 7 5 9 5 9'
1 8 6 7 5 9 5 9

If you have ever seen code written in PHP, you know why evaluating strings requires a degree of caution unless you’re certain of the provenance. Perl has a taint mode to stop bleed-over from untrusted sources, and Python papers over the issue by decree, calling eval() un-Pythonic, and Guido-hated.

Tip

Dyalog provides more industry-strength mechanisms for converting strings to numbers in a safe manner, like verify-fix-input, ⎕VFI

Anyway, enough of the yak shaving already.

Reading CSV: ⎕CSV#

CSV, or comma-separated values, is a data format for tabular data. It’s hairier than one might think to write a correct CSV parser. Fortunately, Dyalog provides one for us as the system function ⎕CSV. It can be used to convert data both to and from the CSV format. The ⎕CSV function has one extremely handy feature: it converts numbers for us, avoiding the hydrant-dance we resorted to above.

Consider the following decidedly not-CSV-looking data. It’s a section from the data given in Day 2 in the Advent of Code competition from 2020. Let’s assume we’ve already read from disk using ⎕NGET:

data  '3-6 s: ssdsssss' '17-19 f: cnffsfffzhfnsffttms' '8-11 c: tzvtwncnwvwttp' '8-10 r: rwrrtrvttrrrr' '1-2 p: zhpjph' '4-6 l: pldnxv'

Whilst it’s not CSV, it does look tabular. One handy use of ⎕CSV is to rely on it to make a table from data, and convert columns of numbers. In our case, we separate each item by converting each “non-word” character to a comma, and then ask ⎕CSV to do the conversion:

⎕CSV('\W+'⎕R','data)''4
┌──┬──┬─┬───────────────────┐
│3 │6 │s│ssdsssss           │
├──┼──┼─┼───────────────────┤
│17│19│f│cnffsfffzhfnsffttms│
├──┼──┼─┼───────────────────┤
│8 │11│c│tzvtwncnwvwttp     │
├──┼──┼─┼───────────────────┤
│8 │10│r│rwrrtrvttrrrr      │
├──┼──┼─┼───────────────────┤
│1 │2 │p│zhpjph             │
├──┼──┼─┼───────────────────┤
│4 │6 │l│pldnxv             │
└──┴──┴─┴───────────────────┘

We called ⎕CSV monadically, with an argument vector containing three items. The first, in our case, is obviously the data itself. The second is called the data description, which is used to specify the encoding where necessary. Dyalog’s docs tells us:

If omitted or empty, the file encoding is deduced

The 4 at the end of that is the column specifier, and this is what Dyalog’s docs have to say about it:

4: The field is to be interpreted numeric data but invalid numeric data is tolerated. Empty fields and fields which cannot be converted to numeric values are returned instead as character data.

We can read files containing CSV data directly with ⎕CSV, in which case the first element of the argument vector should be a character vector containing the file name. We can also use ⎕CSV to write .csv files – consult Dyalog’s docs for details.

Reading JSON: ⎕JSON#

Dyalog also has a function for reading (and writing) JSON, appropriately enough called ⎕JSON. The JSON data interchange format crops up everywhere: usually in REST-like APIs as the payload, as config files, as the document format in document-oriented databases. The name JSON is short for JavaScript Object Notation, and as one can expect, dealing with JSON in JavaScript is both trivial and convenient. Python’s native dictionary and list types - through luck, mostly - map pretty closely to JSON, too, making working with JSON in Python pretty simple.

In APL, we have a wealth of different data types: array. JSON doesn’t do arrays of rank > 1. This makes a recursively defined format depending on dictionaries, like JSON, potentially more awkward to deal with. In the specific case of Dyalog APL, we do have the namespace which can be persuaded to act a bit like a dictionary.

Warning

Dyalog notes:

JSON supports a limited number of data types and there is not a direct correspondence between JSON and APL data structures. In particular:

  • JSON does not support arrays with rank > 1.

  • The JSON standard includes Boolean values true and false which are distinct from numeric values 1 and 0, and have no direct APL equivalent.

  • The JSON5 standard includes numeric constants Infinity, -Infinity, NaN and -NaN which have no direct APL equivalent.

  • JSON objects are named and these might not be valid names in APL.

Whilst the ⎕JSON function does a good job given these constraints, working with JSON data (IMHO) in Dyalog always end up feeling a smidge gritty compared with JavaScript or Python.

Let’s look at some examples:

json  '{"key": 1, "list": [1, 2, 3, {"colour": "red", "shape": "oblong", "number": 5}, [98, 43, 77]]}'

Yep, that’s JSON alright. Let’s convert that to APL as vectors and namespaces:

  data  ⎕JSON json
#.[JSON object]

Ok, that went through without complaints, and we can see that Dyalog (correctly) thinks that the top level is “object”. We can now get at the innards of this by attribute names and indices:

data.key
data.list
data.list[3].shape
1
┌─┬─┬─┬───────────────────────────────┬────────┐
│1│2│3│ #.[JSON object].[JSON object] │98 43 77│
└─┴─┴─┴───────────────────────────────┴────────┘
oblong

Well, that looks perfectly smooth, right? But the unsmooth bit here is that in a JSON dict, the keys are strings, not object instance attributes which is what they end up as in our namespace. If we’re reading an unknown bit of JSON data where we don’t already know what the layout is, how do we, for example, process each key’s value in turn if the keys aren’t known to us?

We can list all keys in a namespace using the following construct, where ⎕NL is the Name list and the ¯2 magic number means that we want a vector back:

data.⎕NL¯2
┌───┬────┐
│key│list│
└───┴────┘

To get the corresponding values, we need to evaluate each such key:

data.(¨⎕NL¯2)
┌─┬────────────────────────────────────────────────┐
│1│┌─┬─┬─┬───────────────────────────────┬────────┐│
│ ││1│2│3│ #.[JSON object].[JSON object] │98 43 77││
│ │└─┴─┴─┴───────────────────────────────┴────────┘│
└─┴────────────────────────────────────────────────┘

Let’s say that we want to sum all numbers in the JSON. Here’s one way we can achieve that:

]dinput
Values  {
    keys  .⎕NL¯2
    values  keys {0=≢⍺:.(¨)}  
    0=≢.⎕NL¯9:values       ⍝ No nested namespaces: done
    values,∊¨.(¨⎕NL¯9)   ⍝ Also expand any namespaces we found as values
}
]dinput
JSONSum  {
    {(1=2|⎕DR)}⍵:⍵                    ⍝ Are we a number?
    (⍕≡⊢)⍵:0                           ⍝ Are we a string?
    {(326=⎕DR)(0=≡)} ⍵:+/Values   ⍝ Are we an object?
    +/¨                             ⍝ We're a list
}
JSONSum data
230

The bits to find the type of each field in JSONSum are all things you can just look up on APLCart: monadic ⎕DR means Data representation, and you just need to feed it the right magic number.

There is a lot more to say about the ⎕JSON function that we won’t go into here (we’ll see a bit more of it in the next chapter). It can also convert JSON to a pure array format instead of using namespaces, and it can also be used to convert APL arrays to JSON. The interested reader will have plenty to go at in the Dyalog docs, and Morten’s webinar signposted at the top of this chapter.