Motivation
- Developers spend an estimated 35-50% of their time validating and debugging software
- Estimated cost of testing, debugging and verification: 50-75% of total development cost
General Idea
Debugging can be done
- during development and testing (due to failing tests, static analysis reports)
- during commit, integration and production (based on user reports)
Terminology
Mistake
Human act/decision resulting in error
Defect/bug
Error in program code
Fault
Location where the programming error is triggered and an unwanted state is entered
Failure
Where fault becomes externally visible
Types of Debugging
printf Debugging
Pros
- Simple and intuitive
- Language and tool agnostic
Cons
- Can quickly become confusing and unclear
- Print statements might be forgotten and left in the program code
- Might require recompilation
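As a minimal illustration (the function and values are hypothetical):

```python
def interest(balance, rate, years):
    for year in range(years):
        balance = balance * rate
        # ad-hoc print statement; easy to forget to remove later
        print(f"[DEBUG] year={year} balance={balance:.2f}")
    return balance

print(interest(100.0, 1.05, 3))  # prints three debug lines, then 115.76...
```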
Logging
Logging is a more systematic alternative, allowing for multiple debugging levels:
DEBUG
a general debugging event
INFO
an event for informational purposes
WARN
an event that might possibly lead to an error
ERROR
an error in the application that is possibly recoverable
FATAL
a fatal event preventing continuation of the program
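A minimal sketch using Python's logging module, whose levels map onto the ones above (Python names them WARNING and CRITICAL instead of WARN and FATAL); the records are hypothetical:

```python
import logging

logging.basicConfig(
    level=logging.DEBUG,  # emit everything from DEBUG upwards
    format="%(asctime)s %(levelname)-8s %(message)s",
)
log = logging.getLogger(__name__)

records = ["a=1", "b=", "c=3"]  # hypothetical input
log.debug("processing %d records", len(records))
for i, rec in enumerate(records):
    key, _, value = rec.partition("=")
    if not value:
        log.warning("record %d (%r) has no value", i, rec)
        continue
    log.info("parsed %s -> %s", key, value)
# log.error(...) and log.critical(...) would cover the ERROR/FATAL levels
```

Unlike print statements, the output can be filtered by level without touching the code.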
Debugger
A debugger works by setting breakpoints
- i.e., pausing execution at certain locations in the code.
With the pause, the user can
- step in/over/out of methods
- do step by step execution
- inspect program state at a point of time
Pros
- More systematic than printf debugging
- Can display additional information
- Can set breakpoints specific to values
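A minimal sketch using Python's built-in pdb (the function is hypothetical):

```python
import pdb

def total(values):
    s = 0
    for v in values:
        s += v
    return s

# pause execution before the call; at the (Pdb) prompt you can:
#   step / next / return     -- step into, over, or out of methods
#   p s, p v                 -- inspect program state at this point in time
#   break <lineno>, v < 0    -- a breakpoint at "s += v" specific to a value
pdb.set_trace()              # or simply breakpoint() since Python 3.7
print(total([1, 2, -3]))
```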
Mental Model
In order to debug the failure, we need to trace it back to the defect that caused the first fault.
Challenges
Program states are typically large, making it challenging or infeasible to search them all manually.
- It is not always clear whether intermediate states are correct
- Executions might consist of millions of steps
Scientific Method to Debugging
1. Formulate a question
2. Come up with a hypothesis that may explain the observed behaviour, based on knowledge obtained while formulating the question
3. Determine the logical consequences of the hypothesis and formulate a prediction that can support or refute it. Ideally, the prediction would distinguish the hypothesis from likely alternatives.
4. Test the prediction in an experiment. If the prediction holds, confidence in the hypothesis increases; otherwise it decreases.
5. Repeat steps 2-4 until there are no discrepancies between hypothesis, predictions, and observations.
Example
Given a function that strips away HTML markup, these are the test cases.
| Input | Expected Output | Actual Output | Outcome |
| --- | --- | --- | --- |
| <b>foo</b> | foo | foo | Passed |
| <b>"foo"</b> | "foo" | foo | Failed |
| "<b>foo</b>" | "foo" | <b>foo</b> | Failed |
| <b id="bar">foo</b> | foo | foo | Passed |
Then, we can come up with some example hypotheses:

| Hypothesis | Outcome | Test Case |
| --- | --- | --- |
| Tag <> is always stripped from input | False | Check test case 3 |
| Double quotes are always stripped from input | True | |
| Tags in double quotes are not stripped | True | Check test cases 2, 3 |
- Then, we can refine the hypothesis with a new one: the error is due to the tag flag being set.
- However, this turns out to be wrong.
- We can then refine it again: the error is due to the quote condition evaluating to true.
- Now, this is correct.
- Continue refining and testing until the defect is isolated.
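For reference, a sketch of a buggy implementation consistent with the observations above (modeled on Zeller's classic remove_html_markup example; the code actually used in the source may differ):

```python
def remove_html_markup(s):
    tag = False
    quote = False
    out = ""
    for c in s:
        if c == '<' and not quote:
            tag = True               # entering a tag
        elif c == '>' and not quote:
            tag = False              # leaving a tag
        elif c == '"' or c == "'" and tag:
            # BUG: precedence makes this (c == '"') or (c == "'" and tag),
            # so a double quote toggles `quote` even outside a tag --
            # exactly the "quote condition evaluating to true" hypothesis
            quote = not quote
        elif not tag:
            out = out + c
    return out
```

The fix is to parenthesize the condition as (c == '"' or c == "'") and tag, so quotes are only treated specially inside tags.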
Diagnosis
Only when the diagnosis explains both causality (how the defect leads to the failure) and incorrectness (why the code is actually wrong) is it time to fix the code accordingly.
Rubberducking
Revisit observations and explain problem to someone else:
- use a rubber duck/another inanimate object to explain the issue.
Program Slicing
How to automatically determine which earlier variables influenced the current, erroneous state?
Slicing
Extracts a subset of the program that potentially affects variables or values at a certain program location.
The main advantage is that only the slice needs to be inspected, rather than the whole program.
Pros
- Rules out parts of the program that could not have an effect on the failure
- Brings together possible origins that may be scattered across the code
Tracking Origins
- Start with the faulty state
- Determine its origins: the parts of earlier states that could have caused the faulty state
- For each origin, determine whether it is faulty or not
If we find a part of the state that is faulty, yet only has correct origins, the defect has been found.
Dependencies
Data dependency
A variable is assigned a value computed from another variable.
Control dependency
A statement is only executed because a condition involving some variable was evaluated, influencing the execution path.
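A minimal sketch (hypothetical code, with line labels in comments) illustrating both kinds of dependency:

```python
def abs_max(values):
    m = 0                # (1)
    for v in values:     # (2)
        if v > m:        # (3) reads m -> data dependency on (1) or (4)
            m = v        # (4) only runs if (3) is true -> control dependency
    return m             # (5) reads m -> data dependency on (1) or (4)
```

Line (4) is data dependent on the values read in (3) and control dependent on the condition in (3), which in turn is control dependent on the loop in (2).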
Static vs Dynamic Slicing
A static slice generally contains most of the method, while a dynamic slice, computed for one concrete execution, eliminates more of the code: for example, if a branch was not executed in that run, it cannot be part of the slice.
Forward vs Backward Slicing
Backward slicing determines which statements influenced a value at a location, while forward slicing determines which statements are influenced by a value.
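For the abs_max sketch above, the backward slice of m at line (5) contains lines (1)-(4), while the forward slice of the assignment at line (1) contains lines (3)-(5). For the concrete run abs_max([]), the dynamic backward slice of m at (5) shrinks to lines (1), (2) and (5), since the branch at (3)-(4) was never executed.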
Statistical Fault Localisation
How to automatically localise faults?
Statistical debugging
Statistical reasoning for debugging
Statistical fault localisation
Utilise multiple executions for localising faults
Observation: whether a program passes or fails likely correlates with events that precede it - where correlation does not imply causation.
Tarantula
Intuition
Program elements that are executed by a failing test are more likely to be faulty than those primarily executed by passing test cases.
If suspiciousness is low:
- failing test cases do not exercise this line
If suspiciousness is high:
- failing test cases exercise this line
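The standard Tarantula metric (Jones and Harrold) makes this concrete: suspiciousness(e) = (failed(e)/total_failed) / (passed(e)/total_passed + failed(e)/total_failed). A minimal sketch:

```python
def tarantula(passed, failed, total_passed, total_failed):
    """Tarantula suspiciousness of a program element.

    passed/failed: number of passing/failing tests executing the element;
    total_passed/total_failed: overall numbers of passing/failing tests.
    """
    fail_ratio = failed / total_failed if total_failed else 0.0
    pass_ratio = passed / total_passed if total_passed else 0.0
    if fail_ratio + pass_ratio == 0:
        return 0.0  # element executed by no test
    return fail_ratio / (fail_ratio + pass_ratio)

# a line executed by all 3 failing tests but only 1 of 10 passing tests
print(tarantula(passed=1, failed=3, total_passed=10, total_failed=3))  # ~0.91
```

Elements are then ranked by suspiciousness, and the developer inspects them from the top down.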
Statistical Debugging Tools Adoption
Limited tool support in the form of IDE integrations, but the techniques are useful as ingredients in other approaches.
Hypotheses:
- Programmers who debug with the assistance of automated debugging tools will locate bugs faster (conclusion: yes, but only on easy debugging tasks)
- The effectiveness of automated tools increases with the level of difficulty (no support)
- The effectiveness of debugging with ranking-based automated tools is affected by the rank of the faulty statement (no support)
Test Case Reduction
Are large inputs (generated from automated testing techniques) necessary to trigger bugs?
Small scope hypothesis
A high proportion of errors can be found by testing a program for all test inputs within some small scope.
Test-case reduction tools remove an input element, apply the user-supplied test oracle, and check whether the bug is still triggered. If so, they continue with another reduction; if not, they undo the change and attempt a different reduction.
Pros
- Easier to debug/analyze
- Identify duplicate bug-inducing test cases
- Easier to communicate
Cons
- Complex interestingness test
- Delta debugging is inefficient for highly structured languages
Approaches
Binary search - test each half of the input separately; but if both halves pass, where is the error? (The failure may need elements from both halves, which motivates delta debugging.)
Delta Debugging
Systematically remove elements from bug-inducing input.
Generally
- find a smaller test case that still fails
- find a local minimum: removing any further subset of the reduced test makes the failure disappear
- start reducing with granularity n = 2 (halves)
- consider the n partitions of the input, and their complements
- if any partition still causes the test to fail, continue reducing it; otherwise try reducing to a complement
- increase the granularity n if neither works (see the sketch below)
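A minimal sketch of the reduction loop (a simplified ddmin variant that only tests complements; `still_fails` is the user-supplied oracle):

```python
def ddmin(failing_input, still_fails):
    """Reduce a failing input (any sliceable sequence) while it keeps failing."""
    assert still_fails(failing_input)
    inp, n = failing_input, 2                 # start with granularity n = 2
    while len(inp) >= 2:
        reduced = False
        for i in range(n):
            # complement of the i-th of n chunks
            start, end = i * len(inp) // n, (i + 1) * len(inp) // n
            complement = inp[:start] + inp[end:]
            if still_fails(complement):
                inp, n = complement, max(n - 1, 2)  # keep the smaller input
                reduced = True
                break
        if not reduced:
            if n >= len(inp):
                break                          # cannot reduce further
            n = min(n * 2, len(inp))           # increase granularity
    return inp

# hypothetical oracle: the "bug" is triggered whenever "()" appears
print(ddmin("a(b)c()d", lambda s: "()" in s))  # -> "()"
```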
Isolating Failure-Inducing Changes
Regression bug
Something that worked before stopped working
Automatic regression debugging using version control systems.
Approach
Naive approach: run every version to check if the failure is triggered
- takes a long time
- there are many changes between two adjacent versions
- but it is fully automated
Binary search:
- identify the failure-inducing commit in O(log n) tries
- supported by git bisect
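Typical usage (revision names and the test script are hypothetical):

```sh
git bisect start
git bisect bad                 # current HEAD fails
git bisect good v1.2           # last revision known to work
# git now checks out a midpoint commit; test it and mark it with
# `git bisect good` / `git bisect bad` until the first bad commit
# is reported -- or automate the whole search:
git bisect run ./test.sh       # exit 0 = good, 1-127 = bad (125 = skip)
git bisect reset               # return to the original HEAD
```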
Delta debugging on changes:
- apply delta debugging by considering individual patches as the elements to remove
- different granularities are possible