Design of Experiments in Python¶

Hank Anderson
hank@statease.com
https://github.com/hpanderson/dexpy-pymntos

Agenda¶

What is Design of Experiments?
Design an experiment with dexpy to improve the office coffee
Measure the power of our experiment
Model the taste test results using statsmodels
Visualize the data with seaborn and matplotlib to find the best pot of coffee

What is Design of Experiments (DoE)?¶

A systematic series of tests, in which purposeful changes are made to input factors, so that you may identify causes for significant changes in the output responses.

NIST has a nice primer on DOE.

In [4]:

dot = Digraph(comment='Design of Experiments')
dot.body.extend(['rankdir=LR', 'size="10,10"'])
dot.node_attr.update(shape='rectangle', style='filled', fontsize='20', fontname="helvetica")

dot.node('X', 'Controllable Factors', color='mediumseagreen', width='3')
dot.node('Z', 'Noise Factors', color='indianred2', width='3')
dot.node('P', 'Process', color='lightblue', height='1.25', width='3')
dot.node('Y', 'Responses', color='lightblue')

dot.edges(['XP', 'ZP', 'PY'] * 3)

dot

Out[4]:

What Is It For?¶

Screening Experiments - Determining which inputs to your system are important
Modeling a Process - Gain a better understanding of your system
Optimization - Find the combination of inputs that results in a better output
Robustness - Find a combination of inputs that produces a consistent result

DoE in Python: dexpy¶

Based on Design-Expert^® software, a package for design and analysis of industrial experiments
Apache2 licensed, pure python (for now), available on pypi
Requires numpy, scipy, pandas and patsy
Recommend statsmodels for analysis, matplotlib and seaborn for visualization
Other alternatives are:
- pyDOE (not actively maintained)
- R ???
- SciDOE
- gosset
- Proprietary software ($$$) - Design-Expert, JMP, Minitab

Motivating Example: Better Office Coffee¶

Current coffee is subpar ("disgusting and unacceptable")
Need to answer the following questions via experimentation:
- What coffee beans to use?
- How much coffee to use?
- How to grind the coffee?

Motivating Example: Better Office Coffee¶

5 input factors
- Amount of Coffee (2.5 to 4.0 oz.)
- Grind size (8-10mm)
- Brew time (3.5 to 4.5 minutes)
- Grind Type (burr vs blade)
- Coffee beans (light vs dark)
1 response: Average overall liking by a panel of 5 office coffee addicts
- Each taster rates the coffee from 1-9
Maximum of 3 taste tests a day, for liability reasons

         ((((
        ((((
         ))))
      _ .---.
     ( |`---'|
      \|     |
      : `---' :
       `-----'

Factorial Design¶

Change multiple inputs at once
Reveals interactions
Maximizes information with minimum runs

In [8]:

df = dexpy.factorial.build_factorial(2, 4)
df.columns = ['amount', 'grind_size']

fg = sns.lmplot('amount', 'grind_size', data=df, fit_reg=False)
ax = fg.axes[0, 0]
ax.set_xticks([-1, 1])
ax.set_xticklabels(get_tick_labels('amount', actual_lows, actual_highs, units))
ax.set_yticks([-1, 1])
ax.set_yticklabels(get_tick_labels('grind_size', actual_lows, actual_highs, units))
p = ax.add_patch(patches.Rectangle((-1, -1), 2, 2, color="navy", alpha=0.3, lw=2))

In [9]:

cube_design = dexpy.factorial.build_factorial(3, 8)

points = np.array(cube_design)
fig = plt.figure()

ax = fig.add_subplot(111, projection='3d', axisbg='w')
ax.view_init(30, -60) # rotate plot

X, Y = np.meshgrid([-1,1], [-1,1])

cube_alpha = 0.1
ax.plot_surface(X, Y, 1, alpha=cube_alpha)
ax.plot_surface(X, Y, -1, alpha=cube_alpha)
ax.plot_surface(X, -1, Y, alpha=cube_alpha)
ax.plot_surface(X, 1, Y, alpha=cube_alpha)
ax.plot_surface(1, X, Y, alpha=cube_alpha)
ax.plot_surface(-1, X, Y, alpha=cube_alpha)
ax.scatter3D(points[:, 0], points[:, 1], points[:, 2], c="b")

ax.set_xticks([-1, 1])
ax.set_xticklabels(["8mm", "10mm"])
ax.set_yticks([-1, 1])
ax.set_yticklabels(["2.5oz", "4oz"])
ax.set_zticks([-1, 1])
ax.set_zticklabels(["light", "dark"])

plt.show()

Statistical Power¶

The probability that a design will detect an active effect.

Effect?	Retain H₀	Reject H₀
No	OK	Type 1 Error
Yes	Type II Error	OK

Power is expressed as a probability to detect an effect of size Δ, given noise σ. This is typically given as a delta to sigma ratio Δ/σ. Power is a function of the signal to noise ratio, as well as the number and layout of experiments in the design.

In [10]:

runs = 50
delta = 0.5
sigma = 1.5
alpha = 0.05

one_factor = pd.DataFrame([ -1, 1 ] * runs, columns=['beans'])
one_factor_power = dexpy.power.f_power('beans', one_factor, delta/sigma, alpha)

display(Markdown('''
# Power Example
{} pots of coffee are tested with light beans, then {} pots with dark beans.
There is a variance of {} taste rating from pot to pot. If we expect a {} change
in the taste rating when going from light to dark, what is the likelihood we would detect it?
(Answer: **{:.2f}%**)

Note: this assumes that we reject H<sub>0</sub> at p <= {}
'''.format(int(runs / 2), int(runs / 2), sigma, delta, one_factor_power[1]*100, alpha)
))

def plot_shift(runs, delta, sigma, annotate=False):
    """Plots two sets of random normal data, one shifted up delta units."""
    mean = 5
    low = sigma*np.random.randn(int(runs/2),1)+mean
    high = sigma*np.random.randn(int(runs/2),1)+mean+delta
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.set_ylabel("taste")
    ax.set_xlabel("runs")
    ax.set_ylim([0, 11])
    
    plt.plot(np.concatenate([low, high]))
    plt.plot([0, (runs/2)], [low.mean()] * 2, color='firebrick', lw=2)
    plt.plot([(runs/2), runs-1], [high.mean()] * 2, color='g', lw=2)
    
    p_value = scipy.stats.f_oneway(low, high).pvalue[0]
    if annotate:
        plt.annotate("p: {:.5f}".format(p_value),
                     xy=(runs / 2, (low.mean() + high.mean()) / 2), xycoords='data',
                     xytext=(.8, .9), textcoords='axes fraction',
                     bbox=dict(boxstyle="round4", fc="w"),
                     arrowprops=dict(arrowstyle='-[', linewidth=2, color='black', connectionstyle="angle3")
                     )

    plt.show()
    return [low, high]

low, high = plot_shift(runs, delta, sigma)

Power Example¶

25 pots of coffee are tested with light beans, then 25 pots with dark beans. There is a variance of 1.5 taste rating from pot to pot. If we expect a 0.5 change in the taste rating when going from light to dark, what is the likelihood we would detect it? (Answer: 37.86%)

Note: this assumes that we reject H₀ at p <= 0.05

In [11]:

increased_delta = delta*4
increased_delta_power = dexpy.power.f_power('beans', one_factor, increased_delta/sigma, alpha)

display(Markdown('''
# Power Example - Increase Delta
What if we don't care about a taste increase of 0.5? That's not that much better
than the current coffee, after all. Instead, if we say we only care about a change
in rating above {}, what is the likelihood we would detect such a change?
(Answer: **{:.5f}%**)
'''.format(increased_delta, increased_delta_power[1]*100)
))

_ = plot_shift(runs, increased_delta, sigma)

Power Example - Increase Delta¶

What if we don't care about a taste increase of 0.5? That's not that much better than the current coffee, after all. Instead, if we say we only care about a change in rating above 2.0, what is the likelihood we would detect such a change? (Answer: 99.99983%)

In [12]:

decreased_sigma = sigma*0.5
decreased_sigma_power = dexpy.power.f_power('beans', one_factor, delta/decreased_sigma, alpha)

display(Markdown('''
# Power Example - Decrease Noise
Instead of lowering our standards for our noisy taste ratings, instead
we could bring in expert testers who have a much more accurate palate.
If we assume a decrease in noise from {} to {}, then we can detect a
change in rating of {} with {:.2f}% probability.
'''.format(sigma, decreased_sigma, delta, decreased_sigma_power[1]*100)
))

_ = plot_shift(runs, delta, sigma*0.1)

Power Example - Decrease Noise¶

Instead of lowering our standards for our noisy taste ratings, instead we could bring in expert testers who have a much more accurate palate. If we assume a decrease in noise from 1.5 to 0.75, then we can detect a change in rating of 0.5 with 91.00% probability.

In [13]:

increased_runs = runs * 4
one_factor = pd.DataFrame([ -1, 1 ] * increased_runs, columns=['beans'])
increased_runs_power = dexpy.power.f_power('beans', one_factor, delta/sigma, alpha)

display(Markdown('''
# Power Example - Increase Runs
If expert testers are too expensive, and we are unwilling to compromise
our standards, then the only remaining option is to create a more powerful
design. In this toy example, there isn't much we can do, since it's
just one factor. Increasing the runs from {} to {} gives a power of
{:.2f}%. This may be a more acceptable success rate than the original power
of {:.2f}%, however... that is a lot of coffee to drink.

For more complicated designs changing the structure of the design
can also increase power.
'''.format(runs, increased_runs, increased_runs_power[1]*100, one_factor_power[1]*100)
))
_ = plot_shift(increased_runs, delta, sigma)

Power Example - Increase Runs¶

If expert testers are too expensive, and we are unwilling to compromise our standards, then the only remaining option is to create a more powerful design. In this toy example, there isn't much we can do, since it's just one factor. Increasing the runs from 50 to 200 gives a power of 91.39%. This may be a more acceptable success rate than the original power of 37.86%, however... that is a lot of coffee to drink.

For more complicated designs changing the structure of the design can also increase power.

Calculating Power with dexpy¶

https://statease.github.io/dexpy/evaluation.html#statistical-power

In [14]:

help(dexpy.power.f_power)

Help on function f_power in module dexpy.power:

f_power(model, design, effect_size, alpha)
    Calculates the power of an F test.
    
    This calculates the probability that the F-statistic is above its critical
    value (alpha) given an effect of some size.
    
    :param model: A patsy formula for which to calculate power.
    :type model: patsy.formula
    :param design: A pandas.DataFrame representing a design.
    :type design: pandas.DataFrame
    :param effect_size: The size of the effect that the test should be able to detect (also called a signal to noise
        ratio).
    :type effect_size: float
    :param alpha: The critical value that we want the test to be above.
    :type alpha: float between 0 and 1
    :returns: A list of percentage probabilities that an F-test could detect an effect of the given size at the given
        alpha value for a particular column.
    
    Usage:
      >>> design = dexpy.factorial.build_factorial(4, 8)
      >>> print(dexpy.power.f_power("1 + A + B + C + D", design, 2.0, 0.05))
      [ 95.016, 49.003, 49.003, 49.003, 49.003 ]

In [16]:

full_design = dexpy.factorial.build_factorial(5, 2**5)
full_design.columns = ['amount', 'grind_size', 'brew_time', 'grind_type', 'beans']
factorial_power = dexpy.power.f_power(model, full_design, sn, alpha)
factorial_power.pop(0)
factorial_power = ['{0:.2f}%'.format(i*100) for i in factorial_power] # convert to %
factorial_power = pd.DataFrame(factorial_power, columns=['Power'], index=full_design.columns)

display(Markdown('''
# Calculating Power with dexpy: Factorial
* {} total runs, with a signal to noise ratio of 2.
* Model (`patsy` for: `{}`
'''.format(len(full_design), model)))
display(PrettyPandas(factorial_power))

Calculating Power with dexpy: Factorial¶

32 total runs, with a signal to noise ratio of 2.
Model (patsy for: amount + grind_size + brew_time + grind_type + beans

	Power
amount	99.97%
grind_size	99.97%
brew_time	99.97%
grind_type	99.97%
beans	99.97%

Fractional Factorials¶

Coffee experiment is 2⁵ runs (32)
We want to add 4 center point runs to check for curvature
Total runs = 36, 3 per day if all testers are in the office
Estimate experiment will take a month

Fractional Factorials¶

Power for the experiment is > 99%
Full factorial is overkill
Instead run 2^5-1 experiments, a "half fraction"

Fractional Factorials in dexpy¶

https://statease.github.io/dexpy/design-build.html#module-dexpy.factorial

In [17]:

help(dexpy.factorial.build_factorial)

Help on function build_factorial in module dexpy.factorial:

build_factorial(factor_count, run_count)
    Builds a regular two-level design based on a number of factors and runs.
    
    Full two-level factorial designs may be run for up to 9 factors. These
    designs permit estimation of all main effects and all interaction effects.
    If the number of runs requested is a 2^factor_count, the design will be a
    full factorial.
    
    If the number of runs is less than 2^factor_count (it still must be a power
    of two) a fractional design will be created. Not all combinations of runs
    and factor counts will result in a design. Use the
    :ref:`alias list<alias-list>` method to see what terms are estimable in
    the resulting design.
    
    :param factor_count: The number of factors to build for.
    :type factor_count: int
    :param run_count: The number of runs in the resulting design. Must be a power of 2.
    :type run_count: int
    :returns: A pandas.DataFrame object containing the requested design.

In [18]:

coffee_design = dexpy.factorial.build_factorial(5, 2**(5-1))
coffee_design.columns = ['amount', 'grind_size', 'brew_time', 'grind_type', 'beans']
center_points = [
    [0, 0, 0, -1, -1],
    [0, 0, 0, -1, 1],
    [0, 0, 0, 1, -1],
    [0, 0, 0, 1, 1]
]

coffee_design = coffee_design.append(pd.DataFrame(center_points * 2, columns=coffee_design.columns))
coffee_design.index = np.arange(0, len(coffee_design))

actual_design = coded_to_actual(coffee_design, actual_lows, actual_highs)

display(Markdown("## 2<sup>(5-1)</sup> Factorial Design"))
display(PrettyPandas(actual_design))

2^(5-1) Factorial Design¶

	amount	grind_size	brew_time	grind_type	beans
0	2.5	8	3.5	burr	dark
1	2.5	8	3.5	blade	light
2	2.5	8	4.5	burr	light
3	2.5	8	4.5	blade	dark
4	2.5	10	3.5	burr	light
5	2.5	10	3.5	blade	dark
6	2.5	10	4.5	burr	dark
7	2.5	10	4.5	blade	light
8	4	8	3.5	burr	light
9	4	8	3.5	blade	dark
10	4	8	4.5	burr	dark
11	4	8	4.5	blade	light
12	4	10	3.5	burr	dark
13	4	10	3.5	blade	light
14	4	10	4.5	burr	light
15	4	10	4.5	blade	dark
16	3.25	9	4	burr	light
17	3.25	9	4	burr	dark
18	3.25	9	4	blade	light
19	3.25	9	4	blade	dark
20	3.25	9	4	burr	light
21	3.25	9	4	burr	dark
22	3.25	9	4	blade	light
23	3.25	9	4	blade	dark

In [19]:

model = ' + '.join(coffee_design.columns)
factorial_power = dexpy.power.f_power(model, coffee_design, sn, alpha)
factorial_power.pop(0)
factorial_power = ['{0:.2f}%'.format(i*100) for i in factorial_power] # convert to %
factorial_power = pd.DataFrame(factorial_power, columns=['Power'], index=coffee_design.columns)

display(Markdown('''
## 2<sup>(5-1)</sup> Factorial Power
* Power for {} total runs
* Signal to noise ratio of 2
* Model: `{}`
'''.format(len(coffee_design), model)))
display(PrettyPandas(factorial_power))

2^(5-1) Factorial Power¶

Power for 24 total runs
Signal to noise ratio of 2
Model: amount + grind_size + brew_time + grind_type + beans

	Power
amount	96.55%
grind_size	96.55%
brew_time	96.55%
grind_type	99.61%
beans	99.61%

Analysis¶

statsmodels has lots of routines for modeling data
We will use Ordinary Least Squares (OLS) to fit
statsmodels typically takes numpy arrays for X and y data
It also has a "formulas" api that accepts a patsy formula

In [27]:

reduced_model = "amount + grind_size + brew_time + grind_type + beans + amount:beans + grind_size:brew_time + grind_size:grind_type"
lm = statsmodels.formula.api.ols("taste_rating ~" + reduced_model, data=coffee_design).fit()
print(lm.summary2())

                   Results: Ordinary least squares
=====================================================================
Model:                OLS               Adj. R-squared:      0.746   
Dependent Variable:   taste_rating      AIC:                 79.5691 
Date:                 2016-11-10 19:52  BIC:                 90.1715 
No. Observations:     24                Log-Likelihood:      -30.785 
Df Model:             8                 F-statistic:         9.438   
Df Residuals:         15                Prob (F-statistic):  0.000123
R-squared:            0.834             Scale:               1.2184  
---------------------------------------------------------------------
                       Coef.  Std.Err.    t    P>|t|   [0.025  0.975]
---------------------------------------------------------------------
Intercept              5.0318   0.2253 22.3328 0.0000  4.5516  5.5121
amount                 0.9731   0.2759  3.5266 0.0031  0.3850  1.5613
grind_size             0.0022   0.2759  0.0078 0.9939 -0.5860  0.5903
brew_time              1.2061   0.2759  4.3709 0.0005  0.6180  1.7943
grind_type            -0.0974   0.2253 -0.4324 0.6716 -0.5777  0.3828
beans                  0.5774   0.2253  2.5628 0.0216  0.0972  1.0577
amount:beans          -1.4820   0.2759 -5.3707 0.0001 -2.0702 -0.8939
grind_size:brew_time   0.3961   0.2759  1.4354 0.1717 -0.1921  0.9843
grind_size:grind_type -0.6927   0.2759 -2.5103 0.0240 -1.2809 -0.1046
---------------------------------------------------------------------
Omnibus:               4.208          Durbin-Watson:            2.190
Prob(Omnibus):         0.122          Jarque-Bera (JB):         1.550
Skew:                  -0.116         Prob(JB):                 0.461
Kurtosis:              1.777          Condition No.:            1    
=====================================================================

Visualization¶

seaborn is built on top of matplotlib and adds support for pandas dataframes
Can build a plot using seaborn, then manipulate it with matplotlib
Default themes look a lot nicer than what you get from matplotlib out of the box

In [28]:

display(Markdown('''
If we take the experiment data from the design and use our new model to fit that data, then plot it against
the observed values we can get an idea for how well our model predicts. Points above the 45 degree line are
overpredicting for that combination of inputs. Points below the line predict a lower taste rating than
we actually measured during the experiment.'''))

actual_predicted = pd.DataFrame({ 'actual': coffee_design['taste_rating'],
                                  'predicted': lm.fittedvalues
                                }, index=np.arange(len(coffee_design['taste_rating'])))
fg = sns.FacetGrid(actual_predicted, size=5)
fg.map(plt.scatter, 'actual', 'predicted')
ax = fg.axes[0, 0]
ax.plot([1, 10], [1, 10], color='g', lw=2)
ax.set_xticks(np.arange(1, 11))
ax.set_xlim([0, 11])
ax.set_yticks(np.arange(1, 11))
ax.set_title('Actual vs Predicted')
_ = ax.set_ylim([0, 11])

If we take the experiment data from the design and use our new model to fit that data, then plot it against the observed values we can get an idea for how well our model predicts. Points above the 45 degree line are overpredicting for that combination of inputs. Points below the line predict a lower taste rating than we actually measured during the experiment.

In [29]:

display(Markdown('''
Plotting the prediction for two factors at once shows how they interact with each other.
In this graph you can see that at the low brew time the larger grind size results in
a poor taste rating, likely because the coffee is too weak.'''))

pred_points = pd.DataFrame(1, columns = coffee_design.columns, index=np.arange(4))
pred_points.loc[1,'grind_size'] = -1
pred_points.loc[3,'grind_size'] = -1
pred_points.loc[2,'brew_time'] = -1
pred_points.loc[3,'brew_time'] = -1
pred_points['taste_rating'] = lm.predict(pred_points)
pred_points = coded_to_actual(pred_points, actual_lows, actual_highs)

fg = sns.factorplot('grind_size', 'taste_rating', hue='brew_time', kind='point', data=pred_points)
ax = fg.axes[0, 0]
ax.set_xticklabels(get_tick_labels('grind_size', actual_lows, actual_highs, units))
_ = ax.set_title('Grind Size/Brew Time Interaction')

Plotting the prediction for two factors at once shows how they interact with each other. In this graph you can see that at the low brew time the larger grind size results in a poor taste rating, likely because the coffee is too weak.

In [30]:

display(Markdown('''
This graph contains the prediction with the highest taste rating, 7.72.
However, if you look at the dark bean line there is a point where we can get
a rating of 6.93 with 2.5oz of grounds.
'''))

pred_points = pd.DataFrame(1, columns = coffee_design.columns, index=np.arange(4))
pred_points.loc[1,'amount'] = -1
pred_points.loc[3,'amount'] = -1
pred_points.loc[2,'beans'] = -1
pred_points.loc[3,'beans'] = -1
pred_points['taste_rating'] = lm.predict(pred_points)
pred_points = coded_to_actual(pred_points, actual_lows, actual_highs)

fg = sns.factorplot('amount', 'taste_rating', hue='beans', kind='point', palette={'dark': 'maroon', 'light': 'peru'}, data=pred_points)
ax = fg.axes[0, 0]
ax.set_xticklabels(get_tick_labels('amount', actual_lows, actual_highs, units))
ax.set_title('Amount/Beans Interaction')
plt.show()

display(PrettyPandas(pred_points))
display(Markdown('''That savings of 1.5oz per pot would create a nice surplus in the coffee budget at the end of the year.'''))

This graph contains the prediction with the highest taste rating, 7.72. However, if you look at the dark bean line there is a point where we can get a rating of 6.93 with 2.5oz of grounds.

	amount	grind_size	brew_time	grind_type	beans	taste_rating
0	4	10	4.5	blade	dark	5.91462
1	2.5	10	4.5	blade	dark	6.93237
2	4	10	4.5	blade	light	7.7238
3	2.5	10	4.5	blade	light	2.81345

That savings of 1.5oz per pot would create a nice surplus in the coffee budget at the end of the year.

coffeemaker

The End¶

We were able to build and execute an experiment in Python that resulted in a better tasting (and cheaper!) coffee.
These slides can be found at https://hpanderson.github.io/dexpy-pymntos
The jupyter notebook they are based on can be found on my github: https://github.com/hpanderson/dexpy-pymntos
You can reach me at: hank@statease.com
The dexpy docs are at: https://statease.github.io/dexpy/
dexpy is only at version 0.1, we plan on greatly expanding the design and analysis capabilities

Design of Experiments in Python¶

Agenda¶

What is Design of Experiments (DoE)?¶

What Is It For?¶

DoE in Python: dexpy¶

Motivating Example: Better Office Coffee¶

Motivating Example: Better Office Coffee¶

Factorial Design¶

Statistical Power¶

Power Example¶

Power Example - Increase Delta¶

Power Example - Decrease Noise¶

Power Example - Increase Runs¶

Calculating Power with dexpy¶

Calculating Power with dexpy: Factorial¶

Fractional Factorials¶

Fractional Factorials¶

Fractional Factorials in dexpy¶

2(5-1) Factorial Design¶

2(5-1) Factorial Power¶

Analysis¶

Visualization¶

The End¶

2^(5-1) Factorial Design¶

2^(5-1) Factorial Power¶