Yay! YAML for bash scripts

Yay: a bash Yamlesque parser.


YAML is a data configuration format consisting of hierarchical collections of named data items. Yay is a parser for a subset of YAML, called Yamlesque, that is intended as a way to provide basic configuration or other data to bash shell scripts.

Yamlesque has a structured syntax that is a small subset of YAML. Valid Yamlesque is also valid YAML, but the reverse isn't necessarily true because Yamlesque supports only a basic subset of the YAML syntax. The name Yay is a reminder that Yaml ain't Yamlesque!

Valid Yamlesque will pass a YAML validity check: http://www.yamllint.com. The full YAML specification is at http://yaml.org.

Yay is inspired by http://stackoverflow.com/a/21189044 and https://gist.github.com/pkuczynski/8665367

Format

Yamlesque is written in a plain text file. Such files contain one or more input lines, each made up of the following fields, which may be separated by whitespace:

  • an indent
  • a key
  • a colon (:)
  • a value

like this:

<indent><key>:[<value>]

Only lines in this format are parsed and anything else is ignored.

Lines beginning with the octothorpe character (# aka hash, sharp or pound) are ignored, as is any trailing part of a line beginning with this character.

In general, whitespace is ignored except when it is leading whitespace, in which case it is considered to be an indent. An indent is zero or more pairs of space characters (tabs are not valid indentation in YAML), each pair representing one level of indentation.

Note that, unlike YAML, two spaces must be used for each level of indentation.

If a line does not have a value then it defines a new collection of key/value pairs which follow in subsequent lines and have one more level of indent.

If a value is given then the key defines a setting in the current collection. If the value is wrapped in quotation marks then these are removed, otherwise the value is used as-is including whitespace.
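The format rules above can be sketched with a bash regular expression. This is only an illustration under stated assumptions: the helper name yamlesque_fields is hypothetical and not part of Yay, it insists on whole pairs of spaces for the indent, and it does not strip comments.

```shell
# Hypothetical helper (not part of Yay): split one Yamlesque line into
# "<indent level>|<key>|<value>" or fail if the line is not in
# <indent><key>:[<value>] form.
yamlesque_fields() {
  local line=$1
  local re='^(( {2})*)([a-zA-Z0-9_]+)[[:space:]]*:[[:space:]]*(.*)$'
  local quoted='^"(.*)"$'

  [[ $line =~ $re ]] || return 1

  local indent=$(( ${#BASH_REMATCH[1]} / 2 ))   # one level per two spaces
  local key=${BASH_REMATCH[3]}
  local value=${BASH_REMATCH[4]}

  # Remove one pair of surrounding double quotes, if present.
  if [[ $value =~ $quoted ]]; then
    value=${BASH_REMATCH[1]}
  fi

  printf '%s|%s|%s\n' "$indent" "$key" "$value"
}

yamlesque_fields 'drink:'
yamlesque_fields '    colour: brown'
yamlesque_fields 'root_key2: "this is value two"'
```

A line with an empty value, such as drink:, yields an empty third field, which is how a new collection would be recognised in this sketch.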

Yay provides a bash function, yay_parse, that reads an appropriately formatted data file and writes bash commands that, when executed, define associative arrays containing the data read from the file. It takes one or two arguments:

yay_parse <filename> [<dataset>]

Where <filename> is the name of the file. If the given name doesn't exist then further searches are performed with the suffixes .yay and .yml appended. The first matching file is used.

The <dataset> is a label that is used to prefix the arrays that get created, to reduce the risk of collisions. If omitted then the filename, less its suffix, is used.
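The lookup and default label described above might be implemented along these lines. The function names resolve_input and default_dataset are hypothetical, and dropping the directory part of the path is an assumption; the real yay_parse may differ in detail.

```shell
# Hypothetical sketch: try the name as given, then with .yay and .yml
# appended, and print the first candidate that exists.
resolve_input() {
  local name=$1 candidate
  for candidate in "$name" "$name.yay" "$name.yml"; do
    if [[ -f $candidate ]]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1   # no matching file
}

# Hypothetical sketch: derive the default dataset label from a file name.
default_dataset() {
  local base=${1##*/}          # assumption: any leading directory is dropped
  printf '%s\n' "${base%.*}"   # drop the last suffix, if there is one
}

default_dataset /etc/conf/demo.yml
```

Used together, something like input=$(resolve_input "$1") || exit 1 followed by prefix=${2:-$(default_dataset "$input")} would give the behaviour described.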

There are various ways to apply Yay definitions to the current shell environment:

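  • eval "$(yay_parse demo)"
  • source <(yay_parse demo)
  • yay_parse demo | source /dev/stdin (only takes effect in the current shell when shopt -s lastpipe is set and job control is off; otherwise source runs in a subshell)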

However, the easiest approach is to use the yay helper which loads data from the given file and creates arrays in the current environment.

$ yay demo
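The approaches listed above can be compared with a stand-in generator. In this sketch fake_parse is a hypothetical substitute for yay_parse that prints two commands of the kind it emits. With default shell options, the piped form runs source in a subshell, so the array should not survive in the calling shell.

```shell
# Hypothetical stand-in for yay_parse.
fake_parse() {
  printf '%s\n' 'declare -g -A demo;' 'demo[root_key1]="this is value one";'
}

eval "$(fake_parse)"             # quoted so spacing inside values survives
echo "after eval:   ${demo[root_key1]-<not set>}"
unset demo

source <(fake_parse)             # process substitution, runs in this shell
echo "after source: ${demo[root_key1]-<not set>}"
unset demo

fake_parse | source /dev/stdin   # each side of a pipe is normally a subshell
echo "after pipe:   ${demo[root_key1]-<not set>}"
```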

An example of Yay is shown below:

# Example YAY data file
root_key1: this is value one
root_key2: "this is value two"

drink:
  state: liquid
  coffee:
    best_served: hot
    colour: brown
  orange_juice:
    best_served: cold
    colour: orange

food:
  state: solid
  apple_pie:
    best_served: warm

root_key_3: this is value three

! Yay uses associative arrays, which are a feature of Bash version 4 and later. It will not work with earlier bash versions.
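A script that depends on Yay could guard against an older shell before loading it. This is a minimal sketch; the function name bash_supports_yay is hypothetical, and it checks only the major version number held in BASH_VERSINFO[0].

```shell
# Hypothetical guard: succeed only when the running bash is version 4 or later.
bash_supports_yay() {
  (( BASH_VERSINFO[0] >= 4 ))
}

if ! bash_supports_yay; then
  echo "Yay needs bash 4 or later (found $BASH_VERSION)" >&2
  exit 1
fi
```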

Usage

First, include the Yay source in a script and then load a file:

#!/bin/bash
. /path/to/yay
yay demo

This leaves at least one array that is named after the data set. It will have one entry per top-level key/value pair. It will also have a special entry called keys that contains a space-delimited string of the names of all such keys. Another special entry called children lists the names of further arrays defining other data sets within it. Such arrays follow the same structure.
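Individual settings can then be read with ordinary array syntax. The sketch below builds two arrays by hand as a stand-in for what yay demo would leave behind (an assumption made so the snippet runs without a data file) and reads a child collection through indirect expansion.

```shell
# Hand-written stand-in for the arrays created by loading the demo file.
declare -A demo demo_drink
demo[root_key1]="this is value one"
demo[keys]+=" root_key1"
demo[children]+=" demo_drink"
demo_drink[state]="liquid"
demo_drink[keys]+=" state"

# Direct lookup of a top-level setting.
echo "${demo[root_key1]}"

# Visit each child collection; the expansion is deliberately unquoted so
# that the space-delimited list is split into names.
for child in ${demo[children]}; do
  ref="${child}[state]"          # name of the element to read indirectly
  echo "$child state: ${!ref}"
done
```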

Here is a recursive example that displays a data set:

# helper to get array value at key
value() { eval echo \${$1[$2]}; }

# print a data set
print_dataset() {
  local k c
  for k in $(value $1 keys)
  do
    echo "$2$k = $(value $1 $k)"
  done

  for c in $(value $1 children)
  do
    echo -e "$2$c\n$2{"
    print_dataset $c "  $2"
    echo "$2}"
  done
}

yay demo
print_dataset demo

which, given the example input above, produces:

root_key1 = this is value one
root_key2 = this is value two
root_key_3 = this is value three
demo_drink
{
  state = liquid
  demo_coffee
  {
    best_served = hot
    colour = brown
  }
  demo_orange_juice
  {
    best_served = cold
    colour = orange
  }
}
demo_food
{
  state = solid
  demo_apple_pie
  {
    best_served = warm
  }
}

Internals

The yay_parse function first locates the input file or exits with an exit status of 1. Next, it determines the dataset prefix, either explicitly specified or derived from the file name.

It writes valid bash commands to its standard output that, if executed, define arrays representing the contents of the input data file. The first of these defines the top-level array:

echo "declare -g -A $prefix;"

Note that array declarations are associative (-A), which is a feature of Bash version 4. Declarations are also global (-g, added in Bash 4.2) so they can be executed inside a function, such as the yay helper, and still be available in the global scope:

yay() { eval $(yay_parse "$@"); }
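The effect of -g can be seen in isolation. In this sketch the function and array names are made up for illustration; the only difference between the two functions is the -g flag.

```shell
# Without -g, a declaration made inside a function is local to that function.
make_local()  { declare -A local_set;     local_set[k]=v; }

# With -g, the array is created in the global scope.
make_global() { declare -g -A global_set; global_set[k]=v; }

make_local
make_global

echo "local_set:  ${local_set[k]-<not set>}"
echo "global_set: ${global_set[k]-<not set>}"
```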

The input data is initially processed with sed. It drops lines that don't match the Yamlesque format specification before delimiting the valid Yamlesque fields with an ASCII File Separator character and removing any double-quotes surrounding the value field.

 local s='[[:space:]]*' w='[a-zA-Z0-9_]*' fs=$(echo @|tr @ '\034')
 sed -n -e "s|^\($s\)\($w\)$s:$s\"\(.*\)\"$s\$|\1$fs\2$fs\3|p" \
        -e "s|^\($s\)\($w\)$s:$s\(.*\)$s\$|\1$fs\2$fs\3|p" "$input" |

The two expressions are similar; they differ only in that the first one picks out quoted values whereas the second one picks out unquoted ones.
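To see what the two expressions produce, they can be run over a few sample lines with the File Separator translated to a visible character. The wrapper function split_fields and the sample input are assumptions added for this sketch; the sed expressions are the ones shown above, reading standard input instead of "$input".

```shell
s='[[:space:]]*' w='[a-zA-Z0-9_]*' fs=$(echo @ | tr @ '\034')

# Hypothetical wrapper: apply the two expressions to stdin and show the
# File Separator as '|' so the three fields are visible.
split_fields() {
  sed -n -e "s|^\($s\)\($w\)$s:$s\"\(.*\)\"$s\$|\1$fs\2$fs\3|p" \
         -e "s|^\($s\)\($w\)$s:$s\(.*\)$s\$|\1$fs\2$fs\3|p" |
    tr '\034' '|'
}

printf '%s\n' 'drink:' '  state: "liquid"' 'not a yamlesque line' |
  split_fields
```

The third sample line has no colon, so neither expression matches it and nothing is printed for it.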

The File Separator (decimal 28, hex 1C, octal 034) is used because, as a non-printable character, it is unlikely to be in the input data.
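The separator can also be written with bash's ANSI-C quoting, and a delimited record can be split again with read. This is a small sketch with a hand-made record; the variable names are illustrative.

```shell
fs=$'\034'                        # same byte that tr produces from '\034'

# Show the character code in decimal, hex and octal.
printf '%d %X %o\n' "'$fs" "'$fs" "'$fs"

# A hand-made record in the <indent><FS><key><FS><value> layout.
record="  ${fs}state${fs}liquid"

# Split on the separator only; leading spaces in the indent are kept
# because IFS contains no whitespace here.
IFS=$fs read -r indent key value <<< "$record"
echo "indent width ${#indent}, key $key, value $value"
```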

The result is piped into awk which processes its input one line at a time. It uses the FS character to assign each field to a variable:

indent       = length($1)/2;
key          = $2;
value        = $3;

All lines have an indent (possibly zero) and a key but they don't all have a value. It computes an indent level for the line by dividing the length of the first field, which contains the leading whitespace, by two. The top-level items without any indent are at indent level zero.
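The indent calculation can be tried on its own by feeding awk a few records that are already delimited. The function name indent_levels and the sample records are assumptions for this sketch.

```shell
fs=$'\034'   # File Separator placed between the fields

# Hypothetical helper: print "<indent level> <key>" for each record on stdin.
indent_levels() {
  awk -F"$fs" '{ printf "%d %s\n", length($1) / 2, $2 }'
}

printf '%s\n' "${fs}drink${fs}" "  ${fs}state${fs}liquid" "    ${fs}colour${fs}brown" |
  indent_levels
```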

Next, it works out what prefix to use for the current item. This is what gets added to a key name to make an array name. There's a root_prefix for the top-level array which is defined as the data set name and an underscore:

root_prefix  = "'$prefix'_";
if (indent == 0) {
  prefix = "";          parent_key = "'$prefix'";
} else {
  prefix = root_prefix; parent_key = keys[indent-1];
}

The parent_key is the key at the indent level above the current line's indent level and represents the collection that the current line is part of. The collection's key/value pairs will be stored in an array with its name defined as the concatenation of the prefix and parent_key.

For the top level (indent level zero) the data set prefix is used as the parent key so it has no prefix (it's set to ""). All other arrays are prefixed with the root prefix.

Next, the current key is inserted into an (awk-internal) array containing the keys. This array persists throughout the whole awk session and therefore contains keys inserted by prior lines. The key is inserted into the array using its indent as the array index.

keys[indent] = key;

Because this array contains keys from previous lines, any keys with an indent level greater than the current line's indent level are removed:

 for (i in keys) {if (i > indent) {delete keys[i]}}

This leaves the keys array containing the key-chain from the root at indent level 0 to the current line. It removes stale keys that remain when the prior line was indented deeper than the current line.
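The pruning step can be watched in a reduced form. The sketch below assumes its input is already digested into "<indent level> <key>" pairs, and the function name key_chains is hypothetical. It also adds 0 to the loop index to force a numeric comparison, which is a small variation on the statement above.

```shell
# Hypothetical helper: for each "<indent level> <key>" line, print the
# chain of keys from the root down to that line.
key_chains() {
  awk '{
    indent = $1 + 0
    key = $2
    keys[indent] = key

    # Drop stale keys left over from more deeply indented earlier lines.
    for (i in keys) { if (i + 0 > indent) { delete keys[i] } }

    chain = keys[0]
    for (n = 1; n <= indent; n++) { chain = chain " > " keys[n] }
    print chain
  }'
}

printf '%s\n' '0 drink' '1 coffee' '2 colour' '1 orange_juice' '0 food' |
  key_chains
```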

The final section outputs the bash commands: an input line without a value starts a new indent level (a collection in YAML parlance) and an input line with a value adds a key to the current collection.

The collection's name is the concatenation of the current line's prefix and parent_key.

When a key has a value, a key with that value is assigned to the current collection like this:

printf("%s%s[%s]=\"%s\";\n", prefix, parent_key , key, value);
printf("%s%s[keys]+=\" %s\";\n", prefix, parent_key , key);

The first statement outputs the command to assign the value to an associative array element named after the key and the second one outputs the command to add the key to the collection's space-delimited keys list:

<current_collection>[<key>]="<value>";
<current_collection>[keys]+=" <key>";
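As an illustration, the same two printf calls can be run with hand-supplied values and the result evaluated. The wrapper emit_setting is hypothetical; in yay_parse the variables come from the parsed line rather than from arguments.

```shell
# Hypothetical wrapper around the two printf calls shown above.
emit_setting() {
  awk -v prefix="$1" -v parent_key="$2" -v key="$3" -v value="$4" 'BEGIN {
    printf("%s%s[%s]=\"%s\";\n", prefix, parent_key, key, value);
    printf("%s%s[keys]+=\" %s\";\n", prefix, parent_key, key);
  }'
}

declare -A demo

# Show the generated commands, then execute them in the current shell.
emit_setting "" demo root_key1 "this is value one"
eval "$(emit_setting "" demo root_key1 "this is value one")"

echo "value: ${demo[root_key1]}"
echo "keys: ${demo[keys]}"
```

Because the value is placed inside double quotes and then evaluated, data files should come from a trusted source.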

When a key doesn't have a value, a new collection is started like this:

printf("%s%s[children]+=\" %s%s\";\n", prefix, parent_key , root_prefix, key);
printf("declare -g -A %s%s;\n", root_prefix, key);

The first statement outputs the command to add the new collection to the current collection's space-delimited children list and the second one outputs the command to declare a new associative array for the new collection:

<current_collection>[children]+=" <new_collection>";
declare -g -A <new_collection>;

All of the output from yay_parse can be parsed as bash commands by the bash eval or source built-in commands.

yay_parse is a good introduction to using awk for more than simple column extraction.