Community Data Validation (thanks to WebDetails & Pedro Alves for this work)


Why?

We need a way to do data validation across the whole stack: server, queries, dashboards, and ETL.

Use cases

Below are some of the use cases we want to tackle. Emphasized are the ones we think the current spec satisfies.

  • Global
    • Is the server running?
    • Is the server running properly?
  • Connectivity
    • Do we have all the access we should? (network / database)
  • Query specific
    • Do we have up to date data?
    • Can we trust the data?
    • How long did the queries take to run?
    • Do we have wrong data? (duplicated users in community)
    • Do we have a big number of ‘unknowns’? (tk=1 in DW)
    • Do we have peaks or valleys in the data? (due to double process or no process)
    • Is the data stalled? (eg: number of twitter followers not updating)
    • Did the data format change?
    • We need a way to handle known effects (eg: Christmas dip)
    • We need to correlate independent datasources (eg: comparing AUS with Blocklist evolution)
    • We need to connect the long-running queries from CDA to CDV
    • Be able to validate big chunks of reprocessing
    • Do we have clearly wrong lines in the resultset? (eg: a stray line)
  • Dashboards
    • Are the dashboards rendering properly?
      • Do we have all the components?
      • Any layout change?
      • Any js errors?
    • Are the dashboards performing properly?
    • Can CDF talk with CDV to report client-side errors?
    • Alternatively, can CDA talk with CDV to report query errors?
      • Who caused the error?
  • ETL
    • Did the ETL run?
    • Are we processing the expected amount of data?
    • Is the ETL taking the expected time to run?
    • Did the ETL finish before X am?
    • Test the ETL against tracer bullets?

Work flow

We expect from this system:

  • A central dashboard that allows us to quickly glimpse the overall status of our system.
    • Did all tests pass?
    • Which one failed?
    • Why?
    • When was the last time the tests ran?
  • We need simple ways to define the tests (based on existing CDAs)
  • We need to know which queries failed and which queries took a long time to run
  • We need a push notification system via email
  • We need to make sure it can talk to Nagios
  • We need an outside test to check if server is up

Logging types

Every test will result in one of the following levels:

  • Critical
  • Error
  • Warn
  • Ok

Each specific test will be responsible for converting the output of that test (the validation function for CDA, TBD for Kettle) into that status. The object format is:

{
 level: "Critical",
 type: "Missing data",
 description: "Whatever string the user defined" 
}

On each test definition, we need to be able to optionally set a timing threshold for the queries; that will automatically generate a log with type ‘Duration’.
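
For illustration, a breached threshold might then produce a log entry such as the following (the values are hypothetical):

{
 level: "Warn",
 type: "Duration",
 description: "Query took 8200ms against an expected 5000ms"
}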

Test types

There are 4 possible types of tests:

  • CDA-based query validation
  • ETL monitoring
  • Datawarehouse validation (a specific subset of the CDA-based query validation)
  • Dashboard validation (we may opt to leave this one out for now, as we’ll try to infer the errors from CDA’s 405)

CDA-based query

Workflow

We want to select one or more CDA / dataAccessId pairs from our system, define the input parameters, and select the type of validation we need.
The shape of the function will be: f( [ query, [params] ], validationFunction )
The generic test will be the implementation of the validation function:

validationFunction = function ( [ {metadata: [] , resultset: [[]]} ] ) :  value

That value will then be freely mapped to the log outputs.
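
As a sketch, a validation function matching that shape could simply count rows across all resultsets and return that value (the names here are illustrative, not part of the spec):

validationFunction = function (results) {
  /* results is an array of query outputs, each shaped as {metadata: [], resultset: [[]]} */
  return results.reduce(function (total, r) {
    return total + r.resultset.length; /* total number of rows across all queries */
  }, 0);
};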

ETL monitoring query

The workflow defined here has to match the previous section. We’ll build specific CDA queries that read the Kettle log files; from that point on, specific validations will have to be built for these logs.
In Pentaho, we’ll need to define which connection refers to the Kettle logging tables, either by defining a special JNDI or by specifying it in the settings.
We’ll need to test for:

  • Time
  • Start / end time
  • Amount of data processed
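
A minimal sketch of such a validation, assuming the CDA query returns one row per run with columns [errors, linesWritten, durationSeconds] (this column layout is an assumption, not the actual Kettle log schema):

validationFunction = function (results) {
  var rows = results[0].resultset;
  return rows.every(function (row) {
    var errors = row[0], linesWritten = row[1], duration = row[2];
    /* fail if a run errored, processed no data, or took longer than one hour */
    return errors === 0 && linesWritten > 0 && duration < 3600;
  });
};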

Datawarehouse schema validation

There are some specific tests we can do on the sanity of a datawarehouse.

  • Coherent amount of data on a daily / hourly basis
  • Test the same as before with specific breakdowns
  • Test for the amount of ‘unknowns’ on dimensions
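
For example, the ‘unknowns’ test could be a validation over a query returning the unknown and total counts per dimension (a sketch; the 5% threshold and column layout are assumptions):

validationFunction = function (results) {
  /* assume rows shaped as [dimensionName, unknownCount, totalCount] */
  return results[0].resultset.every(function (row) {
    return (row[1] / row[2]) < 0.05; /* at most 5% 'unknowns' (tk=1) per dimension */
  });
};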

Invocation and Scheduling

There are 2 ways to call the validations:

  • By url request
  • Scheduled calls

The URL will be based on the id / query name (tbd). The scheduled calls will be cron-based, with the following presets:

  • Every hour
  • Every day
  • Every week
  • Every month
  • Custom cron
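
Assuming Quartz-style cron expressions (the sample in the Technical approach section looks Quartz-like), the presets could map to something along these lines (illustrative only):

var cronPresets = {
  everyHour:  "0 0 * * * ?",
  everyDay:   "0 0 0 * * ?",
  everyWeek:  "0 0 0 ? * MON",
  everyMonth: "0 0 0 1 * ?"
};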

User interface

These are the features of the main user interface (this is the ultimate goal; the implementation may be broken into stages):

  • See existing validations
    • Allow firing a specific validation
    • Get the url of a specific validation / all validations
  • Create / Edit validation
    • Define query name
    • Define queries and parameters
    • Define validation function
    • Choose log alerts (when to throw critical / error / warn / ok)
    • Choose duration thresholds
    • Define error message
    • Define cron
  • Validation status dashboard
  • CDA Query error dashboard (Should this belong to CDA instead?)
    • Query and parameters
    • Error
    • Incidents
  • Duration dashboard to identify slow points in the system
    • Query and parameters
    • Duration
    • Incidents

Technical approach

All the specific information will be stored in solution/cdv/queries/. The files will have the format _queryName.cdv and will internally be JSON files with the following structure:

{
  type: "query",
  name: "validationName",
  group: "MyGrouping" 
  validation: [ 
    { cdaFile: "/solution/cda/test.cda", dataAccessId: "1" , parameters: [...] },
    { cdaFile: "/solution/cda/test2.cda", dataAccessId: "2" , parameters: [...] }
  ],
  validationType: "custom",
  validationFunction: "function(resultArray,arguments){ return 123 }",
  alerts: {
     /* These functions will be executed from the bottom up. While the functions return true, the next one
        will be executed, and the last matching level will be thrown.
        The exception to this rule is the optional okAlert(v) function: if it returns true, no other calls will be made
     */
     criticalAlert: "function(v){ return v > 10 }",
     errorAlert: undefined,
     warnAlert:  "function(v){ return v > 5 }",
     okAlert: "function(v){return v<3;}",
     alertType: "MissingData",
     alertMessage: "function(level,v){ return 'My error message: ' + v; }"  /* this can either be a function or a string */
  },
  executionTimeValidation: {
      expected: 5000,
      warnPercentage: 0.30,
      errorPercentage: 0.70,
      errorOnLow: true
  },

  cron: "0 2 * * ? *" 
}
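
One possible reading of the executionTimeValidation fields, sketched as code (interpreting the percentages as allowed deviations from ‘expected’ is an assumption):

function evaluateDuration(actual, conf) {
  /* with errorOnLow unset, runs faster than expected are fine */
  if (!conf.errorOnLow && actual < conf.expected) { return "Ok"; }
  var deviation = Math.abs(actual - conf.expected) / conf.expected;
  if (deviation > conf.errorPercentage) { return "Error"; }
  if (deviation > conf.warnPercentage)  { return "Warn"; }
  return "Ok";
}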

Preset validations

We won’t need to manually define every validation. CDV will support a set of presets that can be extended by adding definitions to solution/cdv/validationFunctions/. The template for one such JavaScript file looks like this:

wd.cdv.validation.register({
  name: "Existence",
  validationArguments: [
    {name: "testAll", type:"boolean", default: true}
  ],

  validationFunction: function(rs, conf) {
    /* with testAll, every query must return rows; otherwise a single non-empty resultset is enough */
    return rs.map(function(r){ return r.length > 0; }).reduce(function(prev, curr){
      return conf.testAll ? (curr && prev) : (curr || prev);
    });
  },

  alertArguments: [
    {name: "failOnExistence", type: "boolean", default: true},
    {name: "failureLevel", type: "alarmLevel", default: "ERROR"},
    {name: "failureMessage", type: "String", default: "Failed Existence Test: ${result}"},
    {name: "successMessage", type: "String", default: "Passed Existence Test"}
  ],

  alertMapper: function(result, conf) {
    /* when failOnExistence is set, existing data means the test failed */
    var success = conf.failOnExistence ? !result : result,
        level = success ? "OK" : conf.failureLevel,
        message = success ? conf.successMessage : conf.failureMessage;
    return Alarm(level, message, result);
  }
});
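
Presumably, a query definition would then reference the preset by name instead of carrying a custom function; a hypothetical usage could look like this:

{
  type: "query",
  name: "salesNotEmpty",
  validation: [
    { cdaFile: "/solution/cda/test.cda", dataAccessId: "1", parameters: [] }
  ],
  validationType: "Existence",
  validationArguments: { testAll: true },
  alertArguments: { failOnExistence: false, failureLevel: "ERROR" }
}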

The wd.cdv.validation API is defined in the Validation Module.
There are 5 objects there that we need to analyze:

  • name – The identifier under which the preset validation is registered
  • validationFunction(rs, conf) – The validation function that will be executed after the query runs
  • validationArguments – Definition of the arguments that will be used within the validation function
  • alertArguments – Definition of the arguments that will be sent to the alertMapper
  • alertMapper(result, conf) – Mapping between the validation result and the alerts

Preset validations or custom validations

When we define a query, we can choose which preset validation function to use and pass the parameters that the specific validation requires.
Alternatively, we can use a custom validation function. That function has the following format, where all we need is to return the alarm level (this is a spec and may change after implementation):

function(rs, conf) {

    var exists = rs.map(function(r){return r.length > 0}).reduce(function(prev, curr){
      return conf.testAll ? (curr && prev) : (curr || prev);
    });

    return exists ? Alarm.ERROR : Alarm.OK;
}

CDA integration

We need a tight integration between CDA and CDV to report:

  • Errors in CDA queries
  • Long running CDA queries
  • Queries with obvious errors in the structure (eg: missing lines)

It will obviously need to take into account the fact that CDV may not be installed, and it must not have a performance impact on CDA.
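
As an illustration of what such a report could carry (this payload is hypothetical; the actual contract between CDA and CDV is still to be defined):

{
  source: "CDA",
  cdaFile: "/solution/cda/test.cda",
  dataAccessId: "1",
  error: "Query timed out after 30000ms",
  duration: 30000,
  timestamp: "2012-06-01T02:00:00Z"
}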

External interfaces

We can have several external interfaces supported:

  • Email
  • Http
  • Nagios integration
  • Server up check

The last one is a very specific check: all the other integrations will fail if the server suddenly hangs, and we must be notified of that. On the http and Nagios integrations, we’ll be able to get reports on individual tests and also on test groups. This will not rerun the tests but will report the last known status of each test.
In the http case, we can pass a flag to force a test to be rerun.
For Nagios, we can have an export of the test rules.

Settings

We’ll be able to define the group rules, mainly for connectivity reasons. The settings (which can later be converted to a UI) will look like this:

  • Groups
    • Email
      • Threshold
      • Destinations
    • SMS
      • Threshold
      • Destinations
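
A sketch of how these settings might look before a UI exists, following the JSON conventions used above (the structure is an assumption based on this outline):

{
  groups: {
    email: {
      threshold: "Error",    /* minimum level that triggers an email */
      destinations: ["admin@example.com"]
    },
    sms: {
      threshold: "Critical", /* reserve SMS for critical alerts */
      destinations: ["+351900000000"]
    }
  }
}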