myawk scripts
myawk is a small OCaml library for quick, rough and
ready scripts. It is meant as a lightweight and less idiosyncratic
replacement for AWK (let alone Perl) -- and also as an exploration and
demonstration of a better alternative to shell pipes.
As its name implies, it is a personal project, for personal scripting needs. A bit of motivation may still be in order. Why scripting in OCaml?
The drawbacks of shell scripts are well-known: late reporting of
errors, numerous edge cases (of substitutions, escaping, IFS) often
with security implications, waste of system resources. Less criticized
are pipes, which are regarded as one of shell's and Unix's greatest
strengths. Building a long hose, so to speak, by plugging in simple
segments is a commendable strength -- which diminishes when the
segments are not simple and need tuning in their own idiosyncratic
ways. The lack of consistency among Unix command-line tools was noted
a long time ago, and is getting worse as the number of flags keeps
increasing. Pipes as a concept also deserve criticism.
One of the motivations for myawk is to see if an expressive streaming
library could, in fact, be pleasant to use in everyday
scripting. We introduce myawk pipelines below, contrasting them with
shell pipes and thus pin-pointing the limitations of the latter. The
rest of this page shows off half a dozen myawk scripts. Some of
them are re-written from shell or AWK, to compare with the classical
scripting.
OCaml is a somewhat verbose language. Therefore, myawk scripts often (but
not always!) have more characters than the corresponding Perl or shell
script, although with better performance. One does not have
to type in all these characters, however. Scripts in myawk are meant to be
entered in a text editor, where one may take full advantage of
code completion, advanced editing, documentation and other
facilities of modern editors.
As an aside, the command line, as an artifact of teletypes and line-oriented processing of early computers, is one of the most limited text user interfaces. Lisp Machines, Acme, language modes of Emacs and other editors, and, recently, Proof General and CoqIDE showed that one may interact with a computing system -- execute commands, evaluate expressions, adjust and re-execute -- within the full editor, never having to drop to a `command line'.
myawk core
strings.ml [5K]
The other file of the myawk library:
frequently occurring string operations not found in Stdlib
Makefile [<1K]
How to build the library
Introducing myawk pipelines and comparing them with shell pipes
The simplest, the `identity' pipeline is
for_each_line @@ print_endline
The producer for_each_line parses the standard input into lines and
sends each line (without the line terminator) to the consumer,
print_endline, which prints the line on standard output, adding the
newline. This myawk pipeline looks and feels like an ordinary shell
pipe: the `process' for_each_line produces lines, the print_endline
`process' reads them across the @@ `pipe'. The only difference with
shell is the pipe's transmitting structured data: lines.
No processes are launched in myawk, however: for_each_line is an
ordinary function, and print_endline is a standard-library
function. The `pipe' operator @@ also comes from the standard
library: it is function application, of lower precedence
than juxtaposition. In the present example, the parsing
precedence is irrelevant and the operator can be dropped:
for_each_line print_endline. What is happening is merely
for_each_line applying its argument print_endline to each line it
reads from the standard input.
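To make the mechanics concrete, here is a minimal sketch of what for_each_line might look like; it is an assumption about its shape, not the actual myawk source (which may differ, e.g., in how the optional ~fname argument seen later on this page is handled).
(* A minimal sketch, not the actual myawk code: read lines from stdin
   (or a named file) and pass each one to the consumer. *)
let for_each_line ?fname (consumer : string -> unit) : unit =
  let ch = match fname with
    | None       -> stdin
    | Some fname -> open_in fname in
  try
    while true do
      consumer (input_line ch)      (* input_line strips the newline *)
    done
  with End_of_file -> if fname <> None then close_in ch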
A more interesting, though still essentially identity, pipeline splits lines into fields, using whitespace as the field separator:
for_each_line @@ map words @@ print
Here,
words (as the eponymous function in Haskell) splits a string
of whitespace-separated words into the list of words. The function
print prints the list of words, using space as the (default)
separator, adding newline at the end. This pipeline can also be
interpreted as a shell pipe: between the line producer process
for_each_line and the ultimate consumer print we wedge the
process map words that repeatedly
reads lines from for_each_line and splits them,
writing the resulting list of fields into the pipe to
print. Again, the only apparent difference from shell is that the
pipes are structured; in fact, more structured than before: the
pipe to print has as its transmission unit a list of words.
In reality, map is no process at all: it is a mere function
composition (with swapped arguments). The pipeline could have been
written simply, albeit less elegantly, as:
for_each_line (print <|> words)
where <|> is functional composition. In this `pipeline'
for_each_line passes each line from stdin to the composition
of print and words. The role of map hence is merely to arrange
individual processors in a more visually pleasing order, highlighting
the left-to-right flow of data.
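For reference, plausible definitions of the two combinators just mentioned are given below; these are assumptions about their shape, consistent with the description above, rather than the actual myawk source.
(* Assumed shapes, not the actual myawk code. *)
let ( <|> ) f g = fun x -> f (g x)            (* composition: print <|> words *)
let map f consumer = fun x -> consumer (f x)  (* composition with swapped arguments *)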
One may look at map from yet another angle: keeping in mind that
@@ is right associative, the original pipeline is in fact
for_each_line (map words print)
which may be read as `adjusting' the consumer of fields
print to consume lines instead. Here words does the adjustment,
splitting a line into fields.
Once we took the trouble of splitting a line into fields, we can do something with them, say, extract the first field for further processing.
for_each_line @@
map words @@
map_option (function (x::_) -> Some x | _ -> None) @@
print_endline
Here map_option is another `adjuster' of a consumer to a producer,
transforming the produced data to fit the consumer. Only now the
transformation is allowed to fail, in which case the produced data is
disregarded. In our example, empty lines, which have no fields, are
skipped.
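A plausible sketch of map_option, under the same assumptions as before (not the actual myawk code):
let map_option (f : 'a -> 'b option) (consumer : 'b -> unit) : 'a -> unit =
  fun x -> match f x with
  | Some y -> consumer y
  | None   -> ()          (* the transformation failed: drop the datum *)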
Why do we now use print_endline whereas before it was print? The
reader is encouraged to try using print and see what happens.
Indeed, there actually are types and type checking here.
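To spell out the answer: the pipeline delivers a single string (the first field) to its consumer, so a consumer of strings such as print_endline fits, whereas print consumes a list of fields. To use print here one would have to deliver a list, e.g. by wrapping the field, as in the variant below.
for_each_line @@
map words @@
map_option (function (x::_) -> Some [x] | _ -> None) @@
print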
Having extracted the first field, we may parse it, say, to an integer:
for_each_line @@
map words @@
map_option (function (x::_) -> Some x | _ -> None) @@
map_option int_of_string_opt @@
Printf.printf "The field is %d.\n"
The two consecutive map_option steps may, of course, be fused, as shown below -- which
eliminates some constant overhead. In either case, the overall
processing, just as with shell pipelines, is incremental:
for_each_line reads a line, which is then split into fields, the
first field extracted, converted to an integer and printed. If any
conversion fails, the line is disregarded. It is after the line is fully
consumed that for_each_line tries to read another.
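For instance, the fused version could read (same behaviour, one adjuster fewer):
for_each_line @@
map words @@
map_option (function (x::_) -> int_of_string_opt x | _ -> None) @@
Printf.printf "The field is %d.\n"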
All the previous myawk pipelines could be interpreted as shell pipes
(with typed pipes), with a loss of performance. Now we come to
pipelines that cannot be viewed as shell pipes, and which are very
difficult to write in shell. Suppose that we want to amend the
previous pipeline to print not just the extracted field, converted to
an integer, but also the original line in which that field appeared.
It is easy to do if we name a line produced by
for_each_line, for later use:
for_each_line @@ fun l -> l |>
map words @@
map_option (function (x::_) -> Some x | _ -> None) @@
map_option int_of_string_opt @@
fun x -> Printf.printf "The field is %d and the line is `%s'\n" x l
Here, |> from the standard library is yet another application
operator, with swapped arguments. (In fact, x |> f
is the same as f @@ x, which is the same as f x). In myawk, we
can introduce names any time we need to refer to some
intermediate result.
The final example is a truly non-linear pipeline. It is a modification to the earlier one, to log the lines whose first field is not parseable as an int, rather than dropping them:
for_each_line @@ fun l -> l |>
map words @@
map_option (function (x::_) -> Some x | _ -> None) @@
cases [
optthen int_of_string_opt
(map (fun x -> x*x) @@
fun x -> Printf.printf "The squared field is %d, the line is `%s'\n" x l);
fun x -> Some
(Printf.eprintf "The first field `%s' in the line `%s' isn't int\n" x l)
]
As the name suggests, cases does the case analysis, routing the incoming
data to the first alternative that succeeds in transforming them.
In our example, if a field can be parsed to an integer, the
integer is passed to the consumer for squaring and printing. If the
field cannot be parsed, it is passed to the second
alternative.
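Plausible sketches of cases and optthen follow, assuming each alternative has the type 'a -> unit option; this is an assumption consistent with the examples on this page, not the actual myawk source.
let optthen (f : 'a -> 'b option) (consumer : 'b -> unit) : 'a -> unit option =
  fun x -> Option.map consumer (f x)

let cases (alternatives : ('a -> unit option) list) : 'a -> unit =
  fun x ->
    (* List.exists tries the alternatives in order, stopping at the first success *)
    ignore (List.exists (fun alt -> alt x <> None) alternatives)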
myawk was meant for AWK-like processing, like the
following. It is similar to the examples explained in the tutorial
above. Below is the complete code:
#!/bin/env -S ocaml
#load "myawk.cma"
open Myawk open Strings
let hash = string_of_int <|> Hashtbl.hash
;;
(* Sanitize the files originally used by the example further below.
The files are made of space-separated fields; the first field is the
key. It is sensitive; but because it is a key it can't be replaced with
meaningless garbage. We obfuscate it beyond recognition. The third field
is obfuscated as well. The second and fourth can be left as they are,
and the fifth, if present, is replaced with XXX
The script is a proper filter: reads from stdin, writes to stdout
*)
for_each_line @@ map words @@ function (f1::f2::f3::f4::rest) ->
print [hash f1; f2; hash f3; f4; if rest = [] then "" else "XXX"]
;;
The first application of myawk was a script to perform
a join on two text files -- what one could write in SQL as
SELECT T2.* from Table1 as T1, Table2 as T2 where T1.f1 = T2.f1
where
Table1 and Table2 are actually text files with space-separated column
values. Here is the first version of the script (Table1 is supposed
to be fed to stdin):
for_each_line @@ map words @@
map_option (function (x::_) -> Some x | _ -> None) @@
(ignore <|> shell "grep %s table1.txt")
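The shell helper used above presumably formats its argument into a command line and runs it; a guess at its shape (quoting of the argument, in particular, is an assumption):
(* A guess at shell, not the actual myawk code: format the argument into a
   command string and run it via the system shell, returning the exit code
   (which the pipeline above ignores). No quoting is attempted here. *)
let shell fmt = Printf.ksprintf Sys.command fmt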
This first script is typical rough-and-dirty. Alas, it was too rough: I was so excited that it typechecked and worked the first time that I did not look carefully at the output and overlooked what I was looking for (resulting in unneeded hassle and an apology). I should have queried exactly for what I wanted:
SELECT T1.f1, T1.f4 FROM Table1 as T1, Table2 as T2
WHERE T1.f1 = T2.f1 AND T1.f3 <> "3"
which is actually easy to write in myawk (probably not so easy in AWK, though):
for_each_line ~fname:"table2.txt" @@ map words @@
map_option (function (w::_) -> Some w | _ -> None) @@
fun w ->
for_each_line ~fname:"table1.txt" @@ map words @@
map_option (function
(x::f2::f3::f4::_) when x = w && f4 <> "3" -> Some [x;f4] | _ -> None) @@
print
This is the classical nested-loop join.
join2.ml [<1K]
An honest nested-loop join: producing just
what I was looking for
table1.txt [3K]
table2.txt [<1K]
The `database' files for join?.ml
Pipelines in myawk do not have to be linear: the dataflow may split
into several branches, routed by guards.
The motivation came from the following script to partition a file
into two:
cat ./bench_result.txt | grep baseline >| bench_baseline.txt && \
cat ./bench_result.txt | grep staged >| bench_staged.txt
Here is its rendition in myawk, which reads the input file only
once. Also, it prints the lines which contain neither `baseline' nor `staged'
on the standard output. It hence splits the input three-way.
let () =
with_output_file "bench_baseline.txt" @@ fun chbase ->
with_output_file "bench_staged.txt" @@ fun chstaged ->
for_each_line ~fname:"bench_result.txt" @@ fun l ->
l |> cases [
optthen (strstr "baseline") (fun _ -> print ~ch:chbase [l]);
optthen (strstr "staged") (fun _ -> print ~ch:chstaged [l]);
fun l -> Some (print_endline l) (* default case *)
]
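The two helpers assumed here can be sketched as follows; again, these are plausible shapes rather than the actual myawk/Strings code. with_output_file opens a file and makes sure it is closed; strstr searches for a substring.
let with_output_file (fname : string) (body : out_channel -> 'a) : 'a =
  let ch = open_out fname in
  Fun.protect ~finally:(fun () -> close_out ch) (fun () -> body ch)

(* Return the position of the first occurrence of needle in haystack, if any *)
let strstr (needle : string) (haystack : string) : int option =
  let ln = String.length needle and lh = String.length haystack in
  let rec go i =
    if i + ln > lh then None
    else if String.sub haystack i ln = needle then Some i
    else go (i + 1)
  in go 0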
Another example is a re-write of the AWK script
stdlib/remove_module_aliases.awk from the OCaml source distribution,
whose contents are shown below:
BEGIN { in_aliases=0 }
NR == 1 { printf ("# 1 \"%s\"\n", FILENAME) }
/^\(\*MODULE_ALIASES\*\)\r?$/ { in_aliases=1 }
!in_aliases { print }
The stateful processing is realized in myawk as follows:
let filename = Sys.argv.(1)
let in_aliases = ref false
let () = Printf.printf "# 1 \"%s\"\n" filename
let () =
for_each_line ~fname:filename @@ fun l ->
if is_prefix "(*MODULE_ALIASES*)" l <> None then in_aliases := true;
if not !in_aliases then print_endline l
There is no need for the idiosyncratic BEGIN {} and the
record counter NR.
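The is_prefix function comes from the Strings part of the library; a plausible reading (an assumption, not the actual source: it returns the remainder of the string after the prefix, or None):
let is_prefix (prefix : string) (s : string) : string option =
  let lp = String.length prefix in
  if String.length s >= lp && String.sub s 0 lp = prefix
  then Some (String.sub s lp (String.length s - lp))
  else None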
The final example does not show off myawk well, but it does show what is to decry about shell
pipes. One can do better in almost any other language.
The original shell script is a sample GIT commit hook,
.git/hooks/commit-msg.sample found in any GIT repository. Its
comments explain:
Called by "git commit" with one argument, the name of the file that has the commit message. The hook should exit with non-zero status after issuing an appropriate message if it wants to stop the commit. The hook is allowed to edit the commit message file.This example catches duplicate Signed-off-by lines.
Here is the shell script itself:
test "" = "$(grep '^Signed-off-by: ' "$1" |
sort | uniq -c | sed -e '/^[ ]*1[ ]/d')" || {
echo >&2 Duplicate Signed-off-by lines.
exit 1
}
Looking at the original shell script brings despair. Not only is the
script algorithmically ugly: if a duplicate signed-off line occurs
near the beginning, we should be able to report it right away and
stop. We do not need to read the rest of the commit message, filter
it, sort it, precisely count all duplicates and filter again. Not only
does the script gratuitously waste system resources by launching many
processes and allocating communication
buffers. Mainly, the script is not good for its primary purpose: it
is not easy to write and read. Pipeline composition of small stream
processors is generally a good thing -- but not when each stream
processor is written in its own idiosyncratic language. (For example:
although one may have a general idea about uniq, what does uniq -c
do? Understanding the sed phrase requires reading yet another man page.)
Incidentally,
I have doubts about the script: I think that quotes around $1
are meant to be embedded; but why are they not
escaped then? Probably it is some edge case of bash, one of several
thousand.
In contrast, the OCaml script below does exactly what is required, with no extra work -- and can be understood without reading many man pages.
module H = Hashtbl
let commit_msg = Sys.argv.(1)
let ht = H.create 5
let () =
for_each_line ~fname:commit_msg @@ fun l ->
if is_prefix "Signed-off-by: " l <> None then begin
if H.find_opt ht l <> None then begin
prerr_endline "Duplicate Signed-off-by lines.";
exit 1
end else
H.add ht l ()
end