Design of Experiments in Python

Hank Anderson
hank@statease.com
https://github.com/hpanderson/dexpy-pymntos

Agenda

  • What is Design of Experiments?
  • Design an experiment with dexpy to improve the office coffee
  • Measure the power of our experiment
  • Model the taste test results using statsmodels
  • Visualize the data with seaborn and matplotlib to find the best pot of coffee

What is Design of Experiments (DoE)?

A systematic series of tests, in which purposeful changes are made to input factors, so that you may identify causes for significant changes in the output responses.

NIST has a nice primer on DOE.

In [4]:
dot = Digraph(comment='Design of Experiments')
dot.body.extend(['rankdir=LR', 'size="10,10"'])
dot.node_attr.update(shape='rectangle', style='filled', fontsize='20', fontname="helvetica")

dot.node('X', 'Controllable Factors', color='mediumseagreen', width='3')
dot.node('Z', 'Noise Factors', color='indianred2', width='3')
dot.node('P', 'Process', color='lightblue', height='1.25', width='3')
dot.node('Y', 'Responses', color='lightblue')

dot.edges(['XP', 'ZP', 'PY'] * 3)

dot
Out[4]:
%3 X Controllable Factors P Process X->P X->P X->P Z Noise Factors Z->P Z->P Z->P Y Responses P->Y P->Y P->Y

What Is It For?

  • Screening Experiments - Determining which inputs to your system are important
  • Modeling a Process - Gain a better understanding of your system
  • Optimization - Find the combination of inputs that results in a better output
  • Robustness - Find a combination of inputs that produces a consistent result

DoE in Python: dexpy

  • Based on Design-Expert® software, a package for design and analysis of industrial experiments
  • Apache2 licensed, pure python (for now), available on pypi
  • Requires numpy, scipy, pandas and patsy
  • Recommend statsmodels for analysis, matplotlib and seaborn for visualization
  • Other alternatives are:

Motivating Example: Better Office Coffee

  • Current coffee is subpar ("disgusting and unacceptable")
  • Need to answer the following questions via experimentation:
    • What coffee beans to use?
    • How much coffee to use?
    • How to grind the coffee?

Motivating Example: Better Office Coffee

  • 5 input factors
    • Amount of Coffee (2.5 to 4.0 oz.)
    • Grind size (8-10mm)
    • Brew time (3.5 to 4.5 minutes)
    • Grind Type (burr vs blade)
    • Coffee beans (light vs dark)
  • 1 response: Average overall liking by a panel of 5 office coffee addicts
    • Each taster rates the coffee from 1-9
  • Maximum of 3 taste tests a day, for liability reasons
         ((((
        ((((
         ))))
      _ .---.
     ( |`---'|
      \|     |
      : `---' :
       `-----'

Factorial Design

  • Change multiple inputs at once
  • Reveals interactions
  • Maximizes information with minimum runs
In [8]:
df = dexpy.factorial.build_factorial(2, 4)
df.columns = ['amount', 'grind_size']

fg = sns.lmplot('amount', 'grind_size', data=df, fit_reg=False)
ax = fg.axes[0, 0]
ax.set_xticks([-1, 1])
ax.set_xticklabels(get_tick_labels('amount', actual_lows, actual_highs, units))
ax.set_yticks([-1, 1])
ax.set_yticklabels(get_tick_labels('grind_size', actual_lows, actual_highs, units))
p = ax.add_patch(patches.Rectangle((-1, -1), 2, 2, color="navy", alpha=0.3, lw=2))
In [9]:
cube_design = dexpy.factorial.build_factorial(3, 8)

points = np.array(cube_design)
fig = plt.figure()

ax = fig.add_subplot(111, projection='3d', axisbg='w')
ax.view_init(30, -60) # rotate plot

X, Y = np.meshgrid([-1,1], [-1,1])

cube_alpha = 0.1
ax.plot_surface(X, Y, 1, alpha=cube_alpha)
ax.plot_surface(X, Y, -1, alpha=cube_alpha)
ax.plot_surface(X, -1, Y, alpha=cube_alpha)
ax.plot_surface(X, 1, Y, alpha=cube_alpha)
ax.plot_surface(1, X, Y, alpha=cube_alpha)
ax.plot_surface(-1, X, Y, alpha=cube_alpha)
ax.scatter3D(points[:, 0], points[:, 1], points[:, 2], c="b")

ax.set_xticks([-1, 1])
ax.set_xticklabels(["8mm", "10mm"])
ax.set_yticks([-1, 1])
ax.set_yticklabels(["2.5oz", "4oz"])
ax.set_zticks([-1, 1])
ax.set_zticklabels(["light", "dark"])

plt.show()

Statistical Power

The probability that a design will detect an active effect.

Effect? Retain H0 Reject H0
No OK Type 1 Error
Yes Type II Error OK

Power is expressed as a probability to detect an effect of size Δ, given noise σ. This is typically given as a delta to sigma ratio Δ/σ. Power is a function of the signal to noise ratio, as well as the number and layout of experiments in the design.

In [10]:
runs = 50
delta = 0.5
sigma = 1.5
alpha = 0.05

one_factor = pd.DataFrame([ -1, 1 ] * runs, columns=['beans'])
one_factor_power = dexpy.power.f_power('beans', one_factor, delta/sigma, alpha)

display(Markdown('''
# Power Example
{} pots of coffee are tested with light beans, then {} pots with dark beans.
There is a variance of {} taste rating from pot to pot. If we expect a {} change
in the taste rating when going from light to dark, what is the likelihood we would detect it?
(Answer: **{:.2f}%**)

Note: this assumes that we reject H<sub>0</sub> at p <= {}
'''.format(int(runs / 2), int(runs / 2), sigma, delta, one_factor_power[1]*100, alpha)
))

def plot_shift(runs, delta, sigma, annotate=False):
    """Plots two sets of random normal data, one shifted up delta units."""
    mean = 5
    low = sigma*np.random.randn(int(runs/2),1)+mean
    high = sigma*np.random.randn(int(runs/2),1)+mean+delta
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.set_ylabel("taste")
    ax.set_xlabel("runs")
    ax.set_ylim([0, 11])
    
    plt.plot(np.concatenate([low, high]))
    plt.plot([0, (runs/2)], [low.mean()] * 2, color='firebrick', lw=2)
    plt.plot([(runs/2), runs-1], [high.mean()] * 2, color='g', lw=2)
    
    p_value = scipy.stats.f_oneway(low, high).pvalue[0]
    if annotate:
        plt.annotate("p: {:.5f}".format(p_value),
                     xy=(runs / 2, (low.mean() + high.mean()) / 2), xycoords='data',
                     xytext=(.8, .9), textcoords='axes fraction',
                     bbox=dict(boxstyle="round4", fc="w"),
                     arrowprops=dict(arrowstyle='-[', linewidth=2, color='black', connectionstyle="angle3")
                     )

    plt.show()
    return [low, high]

low, high = plot_shift(runs, delta, sigma)

Power Example

25 pots of coffee are tested with light beans, then 25 pots with dark beans. There is a variance of 1.5 taste rating from pot to pot. If we expect a 0.5 change in the taste rating when going from light to dark, what is the likelihood we would detect it? (Answer: 37.86%)

Note: this assumes that we reject H0 at p <= 0.05

In [11]:
increased_delta = delta*4
increased_delta_power = dexpy.power.f_power('beans', one_factor, increased_delta/sigma, alpha)

display(Markdown('''
# Power Example - Increase Delta
What if we don't care about a taste increase of 0.5? That's not that much better
than the current coffee, after all. Instead, if we say we only care about a change
in rating above {}, what is the likelihood we would detect such a change?
(Answer: **{:.5f}%**)
'''.format(increased_delta, increased_delta_power[1]*100)
))

_ = plot_shift(runs, increased_delta, sigma)

Power Example - Increase Delta

What if we don't care about a taste increase of 0.5? That's not that much better than the current coffee, after all. Instead, if we say we only care about a change in rating above 2.0, what is the likelihood we would detect such a change? (Answer: 99.99983%)

In [12]:
decreased_sigma = sigma*0.5
decreased_sigma_power = dexpy.power.f_power('beans', one_factor, delta/decreased_sigma, alpha)

display(Markdown('''
# Power Example - Decrease Noise
Instead of lowering our standards for our noisy taste ratings, instead
we could bring in expert testers who have a much more accurate palate.
If we assume a decrease in noise from {} to {}, then we can detect a
change in rating of {} with {:.2f}% probability.
'''.format(sigma, decreased_sigma, delta, decreased_sigma_power[1]*100)
))

_ = plot_shift(runs, delta, sigma*0.1)

Power Example - Decrease Noise

Instead of lowering our standards for our noisy taste ratings, instead we could bring in expert testers who have a much more accurate palate. If we assume a decrease in noise from 1.5 to 0.75, then we can detect a change in rating of 0.5 with 91.00% probability.

In [13]:
increased_runs = runs * 4
one_factor = pd.DataFrame([ -1, 1 ] * increased_runs, columns=['beans'])
increased_runs_power = dexpy.power.f_power('beans', one_factor, delta/sigma, alpha)

display(Markdown('''
# Power Example - Increase Runs
If expert testers are too expensive, and we are unwilling to compromise
our standards, then the only remaining option is to create a more powerful
design. In this toy example, there isn't much we can do, since it's
just one factor. Increasing the runs from {} to {} gives a power of
{:.2f}%. This may be a more acceptable success rate than the original power
of {:.2f}%, however... that is a lot of coffee to drink.

For more complicated designs changing the structure of the design
can also increase power.
'''.format(runs, increased_runs, increased_runs_power[1]*100, one_factor_power[1]*100)
))
_ = plot_shift(increased_runs, delta, sigma)

Power Example - Increase Runs

If expert testers are too expensive, and we are unwilling to compromise our standards, then the only remaining option is to create a more powerful design. In this toy example, there isn't much we can do, since it's just one factor. Increasing the runs from 50 to 200 gives a power of 91.39%. This may be a more acceptable success rate than the original power of 37.86%, however... that is a lot of coffee to drink.

For more complicated designs changing the structure of the design can also increase power.

In [14]:
help(dexpy.power.f_power)
Help on function f_power in module dexpy.power:

f_power(model, design, effect_size, alpha)
    Calculates the power of an F test.
    
    This calculates the probability that the F-statistic is above its critical
    value (alpha) given an effect of some size.
    
    :param model: A patsy formula for which to calculate power.
    :type model: patsy.formula
    :param design: A pandas.DataFrame representing a design.
    :type design: pandas.DataFrame
    :param effect_size: The size of the effect that the test should be able to detect (also called a signal to noise
        ratio).
    :type effect_size: float
    :param alpha: The critical value that we want the test to be above.
    :type alpha: float between 0 and 1
    :returns: A list of percentage probabilities that an F-test could detect an effect of the given size at the given
        alpha value for a particular column.
    
    Usage:
      >>> design = dexpy.factorial.build_factorial(4, 8)
      >>> print(dexpy.power.f_power("1 + A + B + C + D", design, 2.0, 0.05))
      [ 95.016, 49.003, 49.003, 49.003, 49.003 ]

In [16]:
full_design = dexpy.factorial.build_factorial(5, 2**5)
full_design.columns = ['amount', 'grind_size', 'brew_time', 'grind_type', 'beans']
factorial_power = dexpy.power.f_power(model, full_design, sn, alpha)
factorial_power.pop(0)
factorial_power = ['{0:.2f}%'.format(i*100) for i in factorial_power] # convert to %
factorial_power = pd.DataFrame(factorial_power, columns=['Power'], index=full_design.columns)

display(Markdown('''
# Calculating Power with dexpy: Factorial
* {} total runs, with a signal to noise ratio of 2.
* Model (`patsy` for: `{}`
'''.format(len(full_design), model)))
display(PrettyPandas(factorial_power))

Calculating Power with dexpy: Factorial

  • 32 total runs, with a signal to noise ratio of 2.
  • Model (patsy for: amount + grind_size + brew_time + grind_type + beans
Power
amount 99.97%
grind_size 99.97%
brew_time 99.97%
grind_type 99.97%
beans 99.97%

Fractional Factorials

  • Coffee experiment is 25 runs (32)
  • We want to add 4 center point runs to check for curvature
  • Total runs = 36, 3 per day if all testers are in the office
  • Estimate experiment will take a month

Fractional Factorials

  • Power for the experiment is > 99%
  • Full factorial is overkill
  • Instead run 25-1 experiments, a "half fraction"
In [17]:
help(dexpy.factorial.build_factorial)
Help on function build_factorial in module dexpy.factorial:

build_factorial(factor_count, run_count)
    Builds a regular two-level design based on a number of factors and runs.
    
    Full two-level factorial designs may be run for up to 9 factors. These
    designs permit estimation of all main effects and all interaction effects.
    If the number of runs requested is a 2^factor_count, the design will be a
    full factorial.
    
    If the number of runs is less than 2^factor_count (it still must be a power
    of two) a fractional design will be created. Not all combinations of runs
    and factor counts will result in a design. Use the
    :ref:`alias list<alias-list>` method to see what terms are estimable in
    the resulting design.
    
    :param factor_count: The number of factors to build for.
    :type factor_count: int
    :param run_count: The number of runs in the resulting design. Must be a power of 2.
    :type run_count: int
    :returns: A pandas.DataFrame object containing the requested design.

In [18]:
coffee_design = dexpy.factorial.build_factorial(5, 2**(5-1))
coffee_design.columns = ['amount', 'grind_size', 'brew_time', 'grind_type', 'beans']
center_points = [
    [0, 0, 0, -1, -1],
    [0, 0, 0, -1, 1],
    [0, 0, 0, 1, -1],
    [0, 0, 0, 1, 1]
]

coffee_design = coffee_design.append(pd.DataFrame(center_points * 2, columns=coffee_design.columns))
coffee_design.index = np.arange(0, len(coffee_design))

actual_design = coded_to_actual(coffee_design, actual_lows, actual_highs)

display(Markdown("## 2<sup>(5-1)</sup> Factorial Design"))
display(PrettyPandas(actual_design))

2(5-1) Factorial Design

amount grind_size brew_time grind_type beans
0 2.5 8 3.5 burr dark
1 2.5 8 3.5 blade light
2 2.5 8 4.5 burr light
3 2.5 8 4.5 blade dark
4 2.5 10 3.5 burr light
5 2.5 10 3.5 blade dark
6 2.5 10 4.5 burr dark
7 2.5 10 4.5 blade light
8 4 8 3.5 burr light
9 4 8 3.5 blade dark
10 4 8 4.5 burr dark
11 4 8 4.5 blade light
12 4 10 3.5 burr dark
13 4 10 3.5 blade light
14 4 10 4.5 burr light
15 4 10 4.5 blade dark
16 3.25 9 4 burr light
17 3.25 9 4 burr dark
18 3.25 9 4 blade light
19 3.25 9 4 blade dark
20 3.25 9 4 burr light
21 3.25 9 4 burr dark
22 3.25 9 4 blade light
23 3.25 9 4 blade dark
In [19]:
model = ' + '.join(coffee_design.columns)
factorial_power = dexpy.power.f_power(model, coffee_design, sn, alpha)
factorial_power.pop(0)
factorial_power = ['{0:.2f}%'.format(i*100) for i in factorial_power] # convert to %
factorial_power = pd.DataFrame(factorial_power, columns=['Power'], index=coffee_design.columns)

display(Markdown('''
## 2<sup>(5-1)</sup> Factorial Power
* Power for {} total runs
* Signal to noise ratio of 2
* Model: `{}`
'''.format(len(coffee_design), model)))
display(PrettyPandas(factorial_power))

2(5-1) Factorial Power

  • Power for 24 total runs
  • Signal to noise ratio of 2
  • Model: amount + grind_size + brew_time + grind_type + beans
Power
amount 96.55%
grind_size 96.55%
brew_time 96.55%
grind_type 99.61%
beans 99.61%

Analysis

  • statsmodels has lots of routines for modeling data
  • We will use Ordinary Least Squares (OLS) to fit
  • statsmodels typically takes numpy arrays for X and y data
  • It also has a "formulas" api that accepts a patsy formula
In [27]:
reduced_model = "amount + grind_size + brew_time + grind_type + beans + amount:beans + grind_size:brew_time + grind_size:grind_type"
lm = statsmodels.formula.api.ols("taste_rating ~" + reduced_model, data=coffee_design).fit()
print(lm.summary2())
                   Results: Ordinary least squares
=====================================================================
Model:                OLS               Adj. R-squared:      0.746   
Dependent Variable:   taste_rating      AIC:                 79.5691 
Date:                 2016-11-10 19:52  BIC:                 90.1715 
No. Observations:     24                Log-Likelihood:      -30.785 
Df Model:             8                 F-statistic:         9.438   
Df Residuals:         15                Prob (F-statistic):  0.000123
R-squared:            0.834             Scale:               1.2184  
---------------------------------------------------------------------
                       Coef.  Std.Err.    t    P>|t|   [0.025  0.975]
---------------------------------------------------------------------
Intercept              5.0318   0.2253 22.3328 0.0000  4.5516  5.5121
amount                 0.9731   0.2759  3.5266 0.0031  0.3850  1.5613
grind_size             0.0022   0.2759  0.0078 0.9939 -0.5860  0.5903
brew_time              1.2061   0.2759  4.3709 0.0005  0.6180  1.7943
grind_type            -0.0974   0.2253 -0.4324 0.6716 -0.5777  0.3828
beans                  0.5774   0.2253  2.5628 0.0216  0.0972  1.0577
amount:beans          -1.4820   0.2759 -5.3707 0.0001 -2.0702 -0.8939
grind_size:brew_time   0.3961   0.2759  1.4354 0.1717 -0.1921  0.9843
grind_size:grind_type -0.6927   0.2759 -2.5103 0.0240 -1.2809 -0.1046
---------------------------------------------------------------------
Omnibus:               4.208          Durbin-Watson:            2.190
Prob(Omnibus):         0.122          Jarque-Bera (JB):         1.550
Skew:                  -0.116         Prob(JB):                 0.461
Kurtosis:              1.777          Condition No.:            1    
=====================================================================

Visualization

  • seaborn is built on top of matplotlib and adds support for pandas dataframes
  • Can build a plot using seaborn, then manipulate it with matplotlib
  • Default themes look a lot nicer than what you get from matplotlib out of the box
In [28]:
display(Markdown('''
If we take the experiment data from the design and use our new model to fit that data, then plot it against
the observed values we can get an idea for how well our model predicts. Points above the 45 degree line are
overpredicting for that combination of inputs. Points below the line predict a lower taste rating than
we actually measured during the experiment.'''))

actual_predicted = pd.DataFrame({ 'actual': coffee_design['taste_rating'],
                                  'predicted': lm.fittedvalues
                                }, index=np.arange(len(coffee_design['taste_rating'])))
fg = sns.FacetGrid(actual_predicted, size=5)
fg.map(plt.scatter, 'actual', 'predicted')
ax = fg.axes[0, 0]
ax.plot([1, 10], [1, 10], color='g', lw=2)
ax.set_xticks(np.arange(1, 11))
ax.set_xlim([0, 11])
ax.set_yticks(np.arange(1, 11))
ax.set_title('Actual vs Predicted')
_ = ax.set_ylim([0, 11])

If we take the experiment data from the design and use our new model to fit that data, then plot it against the observed values we can get an idea for how well our model predicts. Points above the 45 degree line are overpredicting for that combination of inputs. Points below the line predict a lower taste rating than we actually measured during the experiment.

In [29]:
display(Markdown('''
Plotting the prediction for two factors at once shows how they interact with each other.
In this graph you can see that at the low brew time the larger grind size results in
a poor taste rating, likely because the coffee is too weak.'''))

pred_points = pd.DataFrame(1, columns = coffee_design.columns, index=np.arange(4))
pred_points.loc[1,'grind_size'] = -1
pred_points.loc[3,'grind_size'] = -1
pred_points.loc[2,'brew_time'] = -1
pred_points.loc[3,'brew_time'] = -1
pred_points['taste_rating'] = lm.predict(pred_points)
pred_points = coded_to_actual(pred_points, actual_lows, actual_highs)

fg = sns.factorplot('grind_size', 'taste_rating', hue='brew_time', kind='point', data=pred_points)
ax = fg.axes[0, 0]
ax.set_xticklabels(get_tick_labels('grind_size', actual_lows, actual_highs, units))
_ = ax.set_title('Grind Size/Brew Time Interaction')

Plotting the prediction for two factors at once shows how they interact with each other. In this graph you can see that at the low brew time the larger grind size results in a poor taste rating, likely because the coffee is too weak.

In [30]:
display(Markdown('''
This graph contains the prediction with the highest taste rating, 7.72.
However, if you look at the dark bean line there is a point where we can get
a rating of 6.93 with 2.5oz of grounds.
'''))

pred_points = pd.DataFrame(1, columns = coffee_design.columns, index=np.arange(4))
pred_points.loc[1,'amount'] = -1
pred_points.loc[3,'amount'] = -1
pred_points.loc[2,'beans'] = -1
pred_points.loc[3,'beans'] = -1
pred_points['taste_rating'] = lm.predict(pred_points)
pred_points = coded_to_actual(pred_points, actual_lows, actual_highs)

fg = sns.factorplot('amount', 'taste_rating', hue='beans', kind='point', palette={'dark': 'maroon', 'light': 'peru'}, data=pred_points)
ax = fg.axes[0, 0]
ax.set_xticklabels(get_tick_labels('amount', actual_lows, actual_highs, units))
ax.set_title('Amount/Beans Interaction')
plt.show()

display(PrettyPandas(pred_points))
display(Markdown('''That savings of 1.5oz per pot would create a nice surplus in the coffee budget at the end of the year.'''))

This graph contains the prediction with the highest taste rating, 7.72. However, if you look at the dark bean line there is a point where we can get a rating of 6.93 with 2.5oz of grounds.

amount grind_size brew_time grind_type beans taste_rating
0 4 10 4.5 blade dark 5.91462
1 2.5 10 4.5 blade dark 6.93237
2 4 10 4.5 blade light 7.7238
3 2.5 10 4.5 blade light 2.81345

That savings of 1.5oz per pot would create a nice surplus in the coffee budget at the end of the year.

coffeemaker

The End