Data tools

The ADM data engineering team shall use simple tools -- preferably native Unix/Linux command line utilities. Every data transformation should be written as code rather than drawn in a GUI.

The sequencing of those transformations shall initially be expressed in a makefile, but may later move to a code-based orchestration tool such as Apache Airflow.
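
As a minimal sketch of makefile-based sequencing, the script below writes a throwaway makefile and runs it. The file names (names.txt, sorted.txt, counts.txt) and the two steps are assumptions chosen for illustration, not part of the ADM pipeline. The point is only that each rule names its prerequisite, so make derives the execution order.

```bash
#!/bin/bash
# Illustrative only: every file name below is hypothetical.
workdir=$(mktemp -d)

# Sample input standing in for an extracted list of names.
printf '%s\n' 'Ada' 'Grace' 'Ada' > "$workdir/names.txt"

# Rules use the one-line "target: prerequisite ; recipe" form, which avoids
# the literal tab characters that multi-line recipes require.
cat > "$workdir/Makefile" <<'EOF'
all: counts.txt
sorted.txt: names.txt ; sort names.txt > sorted.txt
counts.txt: sorted.txt ; uniq -c sorted.txt > counts.txt
EOF

# make builds sorted.txt first because counts.txt depends on it.
(cd "$workdir" && make -s all)
```

Running make a second time without touching names.txt should do no work, because both targets are already up to date.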

Tool inventory

| Tool | Description | Install |
| --- | --- | --- |
| jq | A command-line JSON processor | brew install jq |
| gojq | A command-line JSON processor written in Go | brew install gojq |
| jaq | A command-line JSON processor written in Rust | brew install jaq |
| sort | Sort lines of text files | Native to the OS |
| uniq | Report or omit repeated lines | Native to the OS |
| sed | A stream editor for filtering and transforming text | Native to the OS |
| xsv | A suite of utilities for converting to and working with CSV, the king of tabular file formats | brew install xsv |
| adm-gen-init | An internally developed CLI for generating a Neo4j database initialization script from a metamodel | npm i -g @united-talent-agency/adm-gen-init (see doc below) |
| guish | Experimental | |

A cookbook of several examples showing how the tools are used:

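```bash
# Count contacts by contactType across all people
cat people.json | gojq '.[].contacts[].contactType' | sort | uniq -c
```

```bash
# Print every person's addresses, dropping null values and empty arrays
cat people.json | gojq '.[].addresses' | sed '/^null/d' | sed '/^\[\]/d'
```

```bash
#! /bin/bash

# Print the field named by the first argument ($1) for every person,
# dropping null values and empty arrays
cat people.json \
    | gojq ".[].$1" \
    | sed '/^null/d' \
    | sed '/^\[\]/d'
```

```bash
# For people with at least one address, emit their ID and addresses
cat people.json | gojq '.[] | select(.addresses | length >= 1) | { onyxId: ._id."$oid", addresses: .addresses }'
```

```bash
# Look up a single person by ID
cat people.json | gojq '.[] | select(._id."$oid" == "5d9d0b10d6703f0011cd3c84")'
```

```bash
# Count the top-level records
cat people.json | jq length
```

```bash
# List duplicated names with their counts, most frequent first
# (sort -rn compares the counts numerically)
cat people.json \
    | gojq '.[].name' \
    | sort \
    | uniq -D \
    | uniq -c \
    | sort -rn
```

```bash
# Find names containing "o/b/o", ignoring case
cat people.json | gojq '.[].name' | grep -i "o/b/o"
```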
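
To try the duplicate-counting idea without people.json or a JSON processor installed, here is a small sketch that uses only native tools on inline sample data. The function name count_duplicates and the sample names are made up for illustration, and the awk filter is a substitute for the uniq -D step in the recipe above.

```bash
#!/bin/bash
# Hypothetical helper: reads one value per line on stdin and prints each
# value that occurs more than once, prefixed by its count, highest first.
count_duplicates() {
    sort \
        | uniq -c \
        | awk '$1 > 1' \
        | sort -rn
}

# Inline sample standing in for the output of: gojq '.[].name'
printf '%s\n' '"Ada"' '"Grace"' '"Ada"' '"Linus"' '"Grace"' '"Ada"' \
    | count_duplicates
```

The numeric sort (sort -rn) keeps the ordering correct once counts reach two or more digits, regardless of locale.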

adm-gen-init

The adm-gen-init command line interface allows the ADM data engineering team to generate a Cypher init script for authoritative data. We will use this not only to prepare production and test databases, but also to build local development and CI-based containers for testing.

The source code for adm-gen-init can be found at https://github.com/united-talent-agency/adm-gen-init.

Authenticating to the GitHub Package Registry

adm-gen-init is packaged as an npm module and hosted in UTA's private GitHub packages repository. To install adm-gen-init or any other UTA private package:

  1. Create a "classic" personal access token with the read:packages scope.
  2. Authorize that token to access the @united-talent-agency organization using the "Configure SSO" button.
  3. Create (or update) a .npmrc file in your home directory (~/.npmrc) with the following contents:
```ini
@united-talent-agency:registry=https://npm.pkg.github.com
//npm.pkg.github.com/:_authToken=YOUR_TOKEN_HERE
```

WARNING

Be sure to keep your token secure!
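
One way to keep the token out of the file itself is to let npm read it from an environment variable, since npm expands ${VAR} references when it loads an npmrc file. The sketch below assumes a variable named NPM_TOKEN and a helper named configure_npmrc. Both names are illustrative, not something the team has standardized on.

```bash
#!/bin/bash
# Hypothetical helper: adds the registry settings to an npmrc file without
# storing the token value. NPM_TOKEN is an assumed environment variable name.
configure_npmrc() {
    local npmrc="${1:-$HOME/.npmrc}"
    local line

    touch "$npmrc"
    chmod 600 "$npmrc"   # readable and writable by the owner only

    # Append each line only if it is not already present, so reruns are safe.
    # Single quotes keep ${NPM_TOKEN} literal; npm expands it at read time.
    for line in \
        '@united-talent-agency:registry=https://npm.pkg.github.com' \
        '//npm.pkg.github.com/:_authToken=${NPM_TOKEN}'
    do
        grep -qxF -- "$line" "$npmrc" || printf '%s\n' "$line" >> "$npmrc"
    done
}
```

Calling configure_npmrc with no argument targets ~/.npmrc. The token then has to be exported as NPM_TOKEN in the shell (or injected by CI) before running npm.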

At this point you should be able to run or install the CLI as described below.

Installation

If you want to install the package, you can do so with:

```bash
npm i -g @united-talent-agency/adm-gen-init
```

You can also run it via npx or pnpx:

```bash
pnpx @united-talent-agency/adm-gen-init -i mdm-meta-model.yaml -o init.cypher
```

Usage

Run adm-gen-init --help for usage information.

The metamodel

The input file (-i) is required and should be a YAML-based "metamodel": a small DSL for specifying the core entities that must be initialized in Neo4j before data is loaded. The data engineering team maintains the latest metamodel in the data engineering repo, but below is a sample of what the metamodel file looks like.

adm-gen-seed

The adm-gen-seed command line interface allows the ADM data engineering team to generate a Cypher script that seeds sample authoritative data into a database. It will not be used for production, only for test databases used in development and CI.

The source code for adm-gen-seed can be found at https://github.com/united-talent-agency/adm-gen-seed.

Installation

If you want to install the package, you can do so with:

```bash
npm i -g @united-talent-agency/adm-gen-seed
```

You can also run it via npx or pnpx:

```bash
pnpx @united-talent-agency/adm-gen-seed > seed.cypher
```

Usage

Run adm-gen-seed --help for usage information.

Confidential. For internal use only.