Motivation

  • Developers spend an estimated 35-50% of their time validating and debugging software
  • Testing, debugging and verification are estimated to account for 50-75% of total development cost

General Idea

Debugging can be done

  • during development and testing (triggered by failing tests or static analysis reports)
  • after commit, during integration and in production (triggered by user reports)

Terminology

Mistake

Human act/decision resulting in error

Defect/bug

Error in program code

Fault

Point in the execution at which the programming error is triggered and the program enters an unwanted state

Failure

The point at which the fault becomes externally visible

Types of Debugging

printf Debugging

Pros

  • Simple and intuitive
  • Language and tool agnostic

Cons

  • Can quickly become confusing and unclear
  • Print statements might be forgotten and left in the program code
  • Might require recompilation
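A minimal illustration (function and values made up) of printf debugging as ad-hoc trace output:

    def interest(balance, rate):
        # ad-hoc trace output: quick to add, but easy to forget to remove,
        # and changing it means editing (and possibly recompiling) the code
        print(f"interest: balance={balance}, rate={rate}")
        return balance * rate

    print(interest(100.0, 0.05))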

Logging

Logging is a more systematic alternative, allowing for multiple debugging levels:

  • DEBUG a general debugging event
  • INFO an event for informational purposes
  • WARN an event that might possibly lead to an error
  • ERROR an error in application that is possibly recoverable
  • FATAL a fatal event preventing continuation of program
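A minimal sketch using Python's standard logging module (which names the WARN and FATAL levels WARNING and CRITICAL):

    import logging

    logging.basicConfig(level=logging.DEBUG,
                        format="%(asctime)s %(levelname)s %(message)s")
    log = logging.getLogger(__name__)

    log.debug("entering parser with 3 tokens")       # DEBUG
    log.info("configuration loaded")                 # INFO
    log.warning("cache miss, falling back to disk")  # WARN
    log.error("could not write output file")         # ERROR
    log.critical("out of memory, shutting down")     # FATAL

Since the level is configured in one place, the DEBUG output can be silenced in production without touching the code.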

Debugger

A debugger works by setting breakpoints

  • i.e. locations in the code at which execution pauses.

While paused, the user can

  • step into/over/out of methods
  • execute the program step by step
  • inspect the program state at that point in time

Pros

  • More systematic than printf debugging
  • Can display additional information
  • Can set breakpoints specific to values
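As an illustration (function and condition made up), Python's built-in breakpoint() acts as a value-specific breakpoint without any IDE support:

    def average(values):
        total = 0
        for v in values:
            if v < 0:         # value-specific condition
                breakpoint()  # drop into the pdb debugger when a suspicious value appears
            total += v
        return total / len(values)

Inside pdb, the commands step, next and return correspond to stepping into, over and out of calls, and p total inspects the program state.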

Mental Model

[Diagram: an execution as a sequence of program states: Input → State 1 (wanted, passed) → State 2 (unwanted) → State 3 (unwanted) → Output]

In order to debug the failure, we need to trace it back to the defect that caused the first fault.

Challenges

Program states are typically large, making it challenging or infeasible to search them all manually.

  • It is not always clear whether intermediate states are correct
  • Executions might consist of millions of steps

Scientific Method to Debugging

  1. Formulate a question
  2. Come up with a hypothesis, based on the knowledge obtained while formulating the question, that may explain the observed behaviour
  3. Determine the logical consequences of the hypothesis and formulate a prediction that can support or refute it. Ideally, the prediction distinguishes the hypothesis from likely alternatives.
  4. Test the prediction in an experiment. If the prediction holds, confidence in the hypothesis increases; otherwise it decreases.
  5. Repeat steps 2-4 until there are no discrepancies between the hypothesis and the predictions or observations.

Example

Given a function that strips away HTML markup, these are the test cases.

Input               | Expected Output | Output     | Outcome
<b>foo</b>          | foo             | foo        | Passed
<b>"foo"</b>        | "foo"           | foo        | Failed
"<b>foo</b>"        | "foo"           | <b>foo</b> | Failed
<b id="bar">foo</b> | foo             | foo        | Passed
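The function under test is not included in these notes; the following reconstruction (after Zeller's classic remove_html_markup example) reproduces exactly the outcomes above:

    def remove_html_markup(s):
        tag = False
        quote = False
        out = ""
        for c in s:
            if c == '<' and not quote:
                tag = True
            elif c == '>' and not quote:
                tag = False
            elif c == '"' or c == "'" and tag:
                # defect: precedence makes this (c == '"') or (c == "'" and tag),
                # so double quotes toggle `quote` even outside tags
                quote = not quote
            elif not tag:
                out = out + c
        return out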
Then, we can come up with some example hypotheses:

Hypothesis                               | Outcome | Evidence
Tag <> always stripped from input        | False   | Check test case 3
Double quotes always stripped from input | True    |
Tags in double quotes are not stripped   | True    | Check test cases 2, 3
  • Then, we can refine the hypothesis: the error is due to the variable tag being set.
  • However, this turns out to be wrong.
  • We refine it again: the error is due to the quote condition evaluating to true.
  • This one is correct.
  • Continue, repeat and iterate until the diagnosis is complete.

Diagnosis

Only when the diagnosis explains both causality and incorrectness is it time to actually fix the code accordingly.

Rubberducking

Revisit the observations and explain the problem to someone else:

  • a rubber duck or another inanimate object works just as well for explaining the issue.

Program Slicing

How can we automatically determine which earlier variables influenced the current, erroneous state?

Slicing

Extracts a subset of the program that potentially affects variables or values at a certain program location.

The main advantage is that only the slice needs to be inspected, rather than the whole program.

Pros

  • Rules out locations in the program that could not have had an effect on the failure
  • Brings together possible origins that may be scattered across the code

Tracking Origins

  • Start with the faulty state
  • Determine the state's origins: the parts of earlier states that could have caused the faulty state
  • For each origin, determine whether it is faulty or not

If we find a part of the state that is faulty yet has only correct origins, the defect has been found.

Dependencies

Data dependency

A variable x is data dependent on a variable y if x is assigned a value computed from y.

Control dependency

A statement is control dependent on a variable y if the statement is only executed because a condition involving y was evaluated, influencing the execution path.
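An illustrative snippet (all names made up) showing both kinds of dependency and the resulting backward slice of z at the last line:

    x = 7
    y = x * 2          # in the slice: z is data dependent on y, and y on x
    unused = "logged"  # not in the slice: never influences z
    if y > 10:         # in the slice: the assignment to z is control dependent on this condition
        z = y + 1
    else:
        z = 0
    print(z)           # slicing criterion: the value of z here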

Static vs Dynamic Slicing

Static slicing generally contains the whole method, while dynamic slicing, which considers one concrete execution, eliminates more of the code: for example, a branch that was not executed cannot be part of the dynamic slice.

Forward vs Backward Slicing

Backward slicing determines which statements influenced a value, while forward slicing determines which statements are influenced by a value.

Statistical Fault Localisation

How to automatically localise faults?

Statistical debugging

Statistical reasoning for debugging

Statistical fault localisation

Utilise multiple executions for localising faults

Observation: whether a program passes or fails likely correlates with the events that precede the outcome, although correlation does not imply causation.

Tarantula

Intuition

Program elements that are executed by a failing test are more likely to be faulty than those primarily executed by passing test cases.

If suspiciousness is low:

  • failing test cases do not exercise this line

If suspiciousness is high:

  • failing test cases exercise this line
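Tarantula's standard formula scores an element by the fraction of failing tests covering it, normalised by the corresponding fraction of passing tests. A minimal sketch:

    def tarantula(failed_cov, total_failed, passed_cov, total_passed):
        """Suspiciousness of one program element, where failed_cov and
        passed_cov count the failing/passing tests covering it."""
        fail_ratio = failed_cov / total_failed if total_failed else 0.0
        pass_ratio = passed_cov / total_passed if total_passed else 0.0
        denom = fail_ratio + pass_ratio
        return fail_ratio / denom if denom else 0.0

    # covered by all 3 failing tests but only 1 of 10 passing tests:
    print(tarantula(3, 3, 1, 10))  # ~0.91: highly suspicious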

Statistical Debugging Tools Adoption

There is limited tool support in the form of IDE integrations, but statistical fault localisation is useful as an ingredient in other approaches.

Hypotheses:

  1. Programmers who debug with the assistance of automated debugging tools will locate bugs faster (conclusion: yes, on easy debugging tasks)
  2. The effectiveness of automated tools increases with the level of difficulty (no support)
  3. The effectiveness of debugging with ranking-based automated tools is affected by the rank of the faulty statement (no support)

Test Case Reduction

Are large inputs (generated by automated testing techniques) necessary to trigger bugs?

Small scope hypothesis

A high proportion of errors can be found by testing a program for all test inputs within some small scope.

[Diagram: test input + test oracle → reduced test case]

Test-case reduction tools remove an element of the input, apply the user-supplied test oracle, and check whether the bug is still triggered. If so, they continue with another reduction; if not, they undo the change and attempt a different reduction.

Pros

  • Easier to debug/analyze
  • Identify duplicate bug-inducing test cases
  • Easier to communicate

Cons

  • Complex interestingness test
  • Delta debugging is inefficient for highly structured languages

Approaches

Binary search: split the input in half and test each half. But if both halves pass, then where is the error?

Delta Debugging

Systematically remove elements from bug-inducing input.

Generally

  • find a smaller test case that still fails
  • find a local minimum: removing any further subset no longer causes the test to fail
  • start reducing with a granularity of n = 2 (halves)
  • consider the n partitions of the input, as well as their complements
  • if any partition causes the test to fail, continue reducing it; otherwise reduce to a failing complement (see the sketch below)
    • if neither works, increase the granularity
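A minimal sketch of ddmin over string inputs, assuming a user-supplied fails() oracle that returns True while the reduced input still triggers the bug:

    def ddmin(inp, fails):
        assert fails(inp), "the initial input must trigger the bug"
        n = 2  # granularity: start with halves
        while len(inp) >= 2:
            chunk = len(inp) // n
            subsets = [inp[i:i + chunk] for i in range(0, len(inp), chunk)]
            reduced = False
            for s in subsets:                  # 1. does some partition alone fail?
                if fails(s):
                    inp, n, reduced = s, 2, True
                    break
            if not reduced:                    # 2. does some complement fail?
                for i in range(len(subsets)):
                    c = "".join(subsets[:i] + subsets[i + 1:])
                    if fails(c):
                        inp, n, reduced = c, max(n - 1, 2), True
                        break
            if not reduced:                    # 3. neither: refine granularity
                if n >= len(inp):
                    break                      # local minimum reached
                n = min(n * 2, len(inp))
        return inp

    # hypothetical oracle: the bug is triggered whenever "<b" appears
    print(ddmin('"<b>foo</b>"', lambda s: "<b" in s))  # -> '<b'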

Isolating Failure-Inducing Changes

Regression bug

Something that worked before stopped working

Automatic regression debugging using version control systems.

Approach

Naive approach: run every version to check whether the failure is triggered

  • takes a long time
  • many changes between two adjacent versions make pinpointing the cause hard
  • but fully automated

Binary search:

  • identifies the failure-inducing commit in a logarithmic number of tries
  • supported by git bisect
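git bisect automates exactly this search. A sketch of the underlying idea, with a hypothetical fails() oracle that would build and test a given revision:

    def bisect(revisions, fails):
        """First failing revision, assuming revisions[0] passes,
        revisions[-1] fails, and there is a single pass-to-fail transition."""
        lo, hi = 0, len(revisions) - 1
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if fails(revisions[mid]):
                hi = mid  # failure was introduced at mid or earlier
            else:
                lo = mid  # failure was introduced after mid
        return revisions[hi]

    # hypothetical history in which revision "r5" introduced the failure
    history = ["r1", "r2", "r3", "r4", "r5", "r6", "r7"]
    print(bisect(history, lambda r: r >= "r5"))  # -> "r5"

With git itself, the same loop is driven by git bisect start, git bisect bad and git bisect good <rev>, and can be fully automated with git bisect run <test-script>.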

Delta debugging on changes:

  • apply delta debugging by considering individual patches
  • different granularities possible