Fitting of equilibrium binding data

This notebook plots an equilibrium binding dataset (with error bars, for datasets containing replicates) and performs non-linear curve fitting of a model to the data.

The following reference gives a clear account of the theory of equilibrium binding experiments, along with many important practical considerations:

Jarmoskaite I, AlSadhan I, Vaidyanathan PP & Herschlag D (2020) How to measure and evaluate binding affinities. eLife 9: e57264 https://doi.org/10.7554/eLife.57264

Example datasets are from the following publication:

Gaullier G, Roberts G, Muthurajan UM, Bowerman S, Rudolph J, Mahadevan J, Jha A, Rae PS & Luger K (2020) Bridging of nucleosome-proximal DNA double-strand breaks by PARP2 enhances its interaction with HPF1. PLOS ONE 15: e0240932 https://doi.org/10.1371/journal.pone.0240932

The original files are freely available from https://doi.org/10.5281/zenodo.3519435 (CC-BY 4.0). They are also available reformatted for compatibility with this notebook in this repository: https://github.com/Guillawme/julia-curve-fitting/tree/main/datasets

Load data

The data file must be in CSV format. The first row is assumed to contain column names. The first column is assumed to contain the $X$ values; all other columns are assumed to be replicate $Y$ values that will be averaged (fitting is done against the mean values, weighted by their standard deviations). There must not be any row with $X = 0$, because this would cause an error when plotting on a logarithmic scale. I always perform two measurements at $X = 0$ and always sort rows by descending $X$ values, so this notebook automatically skips the last two rows of the CSV file; adjust accordingly if you don't measure at $X = 0$ and want to keep all rows (see section Data processing below). The data does not need to be scaled so that $Y$ takes values between $0$ and $1$: the binding models can account for arbitrary minimum and maximum $Y$ values (see section Model functions below). Scaling hides differences in signal change between datasets, and these differences may tell you something about the system under study, so scaling should never be done "blindly"; always look at the raw data.
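
As a minimal sketch of the row handling described above, assuming a small in-memory CSV text rather than a real file: the column names and values are made up for illustration, and `parse_table` is a hypothetical helper, not the notebook's own `processData()`.

```julia
using Statistics  # standard library: mean, std

# Hypothetical CSV content: header row, X in the first column, then
# replicate Y values; rows sorted by descending X, two X = 0 rows at the end.
csv_text = """
concentration,rep1,rep2
100.0,10.0,12.0
10.0,6.0,8.0
0.0,1.0,1.2
0.0,0.9,1.1
"""

# Parse the text into a header and a numeric matrix (one matrix row per line).
function parse_table(text::AbstractString)
    lines = split(strip(text), '\n')
    header = String.(split(lines[1], ','))
    rows = [parse.(Float64, split(line, ',')) for line in lines[2:end]]
    data = permutedims(reduce(hcat, rows))
    return header, data
end

header, data = parse_table(csv_text)

# Skip the trailing X = 0 rows; set this to 0 to keep every row.
rows_to_ignore = 2
kept = data[1:end - rows_to_ignore, :]

x = kept[:, 1]
y_mean = vec(mean(kept[:, 2:end], dims = 2))  # mean of replicates, per row
y_std = vec(std(kept[:, 2:end], dims = 2))    # sample standard deviation, per row
```

Dropping the rows before averaging keeps $X = 0$ out of everything downstream, including the logarithmic axis.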

Indicate below which data files to process:

if the path given is not absolute, it is assumed to be relative to the notebook file (wherever the notebook is located)
a list of files can be provided with one file path per line and separated by commas
files can be located by local path or URL, but each type of location should be in the dedicated list


dataURLs = [
    "https://raw.githubusercontent.com/Guillawme/julia-curve-fitting/main/datasets/dataset_003.csv",
    "https://raw.githubusercontent.com/Guillawme/julia-curve-fitting/main/datasets/dataset_005.csv"
]

dataFiles = [
    #"datasets/dataset_003.csv",
    #"datasets/dataset_005.csv"
]

Number of rows to ignore at the end of files:

Your data should appear below shortly; check that it looks normal. In addition to the columns present in your CSV file, you should see three columns named mean, std and measurement (these values will be used for fitting and plotting).

DataFrames.DataFrame1

	concentration	rep1	rep2	rep3	mean	std	measurement
	Float64	Float64	Float64	Float64	Float64	Float64	Measurement
1	1000.0	269.8	253.3	272.7	265.267	10.4644	$265.0 \pm 10.0$
2	600.0	264.55	241.1	264.15	256.6	13.4249	$257.0 \pm 13.0$
3	360.0	252.15	235.05	256.35	247.85	11.2823	$248.0 \pm 11.0$
4	216.0	251.6	224.4	247.45	241.15	14.6536	$241.0 \pm 15.0$
5	129.6	241.4	214.0	235.05	230.15	14.3422	$230.0 \pm 14.0$
6	77.76	225.3	204.6	226.15	218.683	12.2039	$219.0 \pm 12.0$
7	46.656	212.35	189.9	218.35	206.867	14.9967	$207.0 \pm 15.0$
8	27.9936	194.7	165.85	199.5	186.683	18.2011	$187.0 \pm 18.0$
9	16.7962	172.5	149.9	174.7	165.7	13.7273	$166.0 \pm 14.0$
10	10.0777	151.0	121.65	153.2	141.95	17.6147	$142.0 \pm 18.0$
⋮	(rows 11 to 21 not shown)
22	0.021937	73.35	55.5	66.45	65.1	9.00125	$65.1 \pm 9.0$

	concentration	fp1	fp2	fp3	mean	std	measurement
	Float64	Float64	Float64	Float64	Float64	Float64	Measurement
1	1000.0	281.5	283.2	282.05	282.25	0.867468	$282.25 \pm 0.87$
2	600.0	279.05	284.25	276.5	279.933	3.94979	$279.9 \pm 3.9$
3	360.0	281.75	278.35	271.8	277.3	5.05742	$277.3 \pm 5.1$
4	216.0	277.65	270.6	270.65	272.967	4.05596	$273.0 \pm 4.1$
5	129.6	274.55	268.6	256.7	266.617	9.08878	$266.6 \pm 9.1$
6	77.76	268.8	257.25	253.05	259.7	8.15583	$259.7 \pm 8.2$
7	46.656	260.15	244.35	233.0	245.833	13.6356	$246.0 \pm 14.0$
8	27.9936	237.9	217.2	207.75	220.95	15.4208	$221.0 \pm 15.0$
9	16.7962	217.45	196.15	181.7	198.433	17.984	$198.0 \pm 18.0$
10	10.0777	176.75	161.7	150.3	162.917	13.2669	$163.0 \pm 13.0$
⋮	(rows 11 to 21 not shown)
22	0.021937	71.4	66.0	65.45	67.6167	3.28798	$67.6 \pm 3.3$

We will need to keep track of dataset names. You can edit them here if you want to use something more meaningful than the file name or URL; changes will propagate to plot legends. This list must contain the same number of elements as you have datasets: 2 in this case.

datasetNames

Any1

"https://raw.githubusercontent.com/Guillawme/julia-curve-fitting/main/datasets/dataset_003.csv"

"https://raw.githubusercontent.com/Guillawme/julia-curve-fitting/main/datasets/dataset_005.csv"

Visualizations

Your data and fit should appear below shortly. Take a good look at both, and make sure you check the residuals. Once you're happy with the fit, check the numerical results.

Data and fit

Select binding model:

Show fit line with initial parameters?

For the quadratic model, indicate receptor concentration (the receptor is the binding partner kept at constant, low concentration across the titration series). Parameter $R_{0} =$ 5.0

Residuals

The fit residuals should be randomly and approximately normally distributed around $0$. If they show a systematic trend, the fit systematically deviates from your data, and the model you chose might not be justified. Be careful when considering alternative models: introducing more free parameters will likely bring the fit line closer to the data points and yield a lower sum of squared residuals, but this is not helpful if the additional parameters don't contribute to explaining the physical phenomenon being modeled. Another possibility is a problem with your data. The most common problems are:

the data does not cover the proper concentration range
the concentration of receptor is too high relative to the $K_{D}$

In either case, your best option is to design a new experiment and collect new data.
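
One crude way to put a number on "systematic trend" is to count how often consecutive residuals change sign. The sketch below is only a heuristic to complement the plots, not a formal statistical test and not part of the notebook; the function name `count_sign_changes` and the residual values are made up for illustration.

```julia
# Count how many times consecutive residuals change sign.
# Residuals scattered randomly around 0 change sign often; a systematic
# trend produces long runs of the same sign and therefore few changes.
function count_sign_changes(residuals::AbstractVector{<:Real})
    signs = sign.(filter(!iszero, residuals))
    return count(i -> signs[i] != signs[i + 1], 1:length(signs) - 1)
end

# Made-up residuals, ordered by concentration, for illustration only.
scattered = [1.2, -0.8, 0.5, -1.1, 0.9, -0.3, 0.7, -0.6]
trending = [-2.0, -1.5, -0.9, -0.2, 0.4, 1.1, 1.6, 2.2]

changes_scattered = count_sign_changes(scattered)  # alternates at every step
changes_trending = count_sign_changes(trending)    # one negative run, then one positive run
```

A count that is very low relative to the number of data points suggests the fit line sits on one side of the data over a whole concentration range, which is the pattern to look for in the scatter plot of residuals.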

Scatter plot

Histogram

Numerical results

Model parameters

Dataset				Kd				Smin			Smax			h
dataset_003.csv		16.7 ± 0.1		62.9 ± 0.1		262.9 ± 0.2		0.9 ± 0.0
dataset_005.csv		11.8 ± 0.0		68.0 ± 0.2		279.7 ± 0.3		1.2 ± 0.0

Sum of squared residuals

Dataset				Sum of squared residuals
dataset_003.csv		2161.2
dataset_005.csv		449.13

Code

The code doing the actual work is in this section. Do not edit unless you know what you are doing.

Necessary packages and notebook setup

Data processing

The commonProcessing() function computes the mean and standard deviation of replicates, defines measurements as mean ± std, and returns a DataFrame containing all the data. It is used by all methods of the following processData() function, which handle various inputs (path to a local file, URL to a remote file, loaded CSV file, loaded DataFrame).

commonProcessing (generic function with 1 method)

The processData() function loads one data file, computes the mean and standard deviation of replicates, defines measurements as mean ± std, and returns a DataFrame containing all the data.

processData (generic function with 1 method)

This function should also work when passed a URL to a remote file:

processData (generic function with 2 methods)

This function should also work when passed an already loaded CSV file:

processData (generic function with 3 methods)

This function should also work when passed an already loaded data frame (for example, if the user wants to load data and pre-process it in a different way before averaging replicates):

processData (generic function with 4 methods)
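
The pattern described in this section, one shared processing step reused by several methods that differ only in their input type, can be sketched with Julia's multiple dispatch. This is an illustration under simplifying assumptions: it uses plain matrices, CSV text and streams instead of the notebook's files, URLs and DataFrames, and the names `summarize_replicates` and `process` are hypothetical.

```julia
using Statistics  # standard library: mean, std

# Shared step: given a numeric matrix (X in column 1, replicates after),
# return X with the mean and standard deviation of the replicates.
function summarize_replicates(data::AbstractMatrix{<:Real})
    replicates = data[:, 2:end]
    return (x = data[:, 1],
            ymean = vec(mean(replicates, dims = 2)),
            ystd = vec(std(replicates, dims = 2)))
end

# Method for an already loaded numeric matrix.
process(data::AbstractMatrix{<:Real}) = summarize_replicates(data)

# Method for raw CSV text with a header row: parse, then reuse the method above.
function process(text::AbstractString)
    lines = split(strip(text), '\n')[2:end]  # drop the header row
    rows = [parse.(Float64, split(line, ',')) for line in lines]
    return process(permutedims(reduce(hcat, rows)))
end

# Method for an open stream (a file handle, for example): read, then parse.
process(io::IO) = process(read(io, String))

# The same made-up data, supplied in three different forms.
from_matrix = process([10.0 1.0 3.0; 1.0 0.0 1.0])
from_text = process("x,a,b\n10.0,1.0,3.0\n1.0,0.0,1.0\n")
from_stream = process(IOBuffer("x,a,b\n10.0,1.0,3.0\n1.0,0.0,1.0\n"))
```

Each additional input type costs one short method, and the averaging logic stays in a single place.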

Plotting

The initMainPlot() function initializes a plot; the plotOneDataset!() function plots one dataset (call it repeatedly to plot more datasets on the same axes).

initMainPlot (generic function with 1 method)

plotOneDataset! (generic function with 3 methods)

The initResidualPlot() function initializes a plot; the plotOneResiduals!() function plots the fit residuals from one dataset (call it repeatedly to plot more datasets on the same axes).

initResidualPlot (generic function with 1 method)

plotOneResiduals! (generic function with 1 method)

The initResidualHistogram() function initializes a histogram; the plotOneResidualsHistogram!() function plots a histogram of the fit residuals from one dataset (call it repeatedly to plot more datasets on the same axes).

initResidualHistogram (generic function with 1 method)

plotOneResidualsHistogram! (generic function with 1 method)

Model functions

Model selection

This dictionary maps radio button options (in section Visualizations above) to their corresponding model function:

bindingModels

Dict
"Hyperbolic" => hyperbolic
"Quadratic" => quadratic
"Hill" => hill
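
The lookup this dictionary enables can be sketched as follows. The stub functions and the `selected` variable are placeholders invented for the example; in the notebook the selection comes from the radio buttons and the values are the real model functions.

```julia
# Placeholder model functions, standing in for the real ones.
hyperbolic_stub(L) = "hyperbolic model called with L = $L"
hill_stub(L) = "Hill model called with L = $L"

# Functions are ordinary values in Julia, so they can be stored in a Dict
# and looked up by the label of the selected option.
models = Dict(
    "Hyperbolic" => hyperbolic_stub,
    "Hill" => hill_stub,
)

selected = "Hill"          # would come from the radio button
model = models[selected]   # look up the function
message = model(2.5)       # call it like any other function
```

Because every entry has the same calling convention, the code that fits and plots does not need to know which model was selected.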

The remaining cells in this section are only meant to check that the model selection buttons work. This first cell should return the name of the selected binding model (corresponding to the active radio button in section Visualizations above):

"Hill"

This other cell should return the model function corresponding to the selected binding model (the active radio button in section Visualizations above):

hill (generic function with 1 method)

Hill model

This is the Hill equation:

$S = S_{min} + (S_{max} - S_{min}) \times \frac{L^{h}}{{K_{D}}^{h} + L^{h}}$

In which $S$ is the measured signal ($Y$ value) at a given value of ligand concentration $L$ ($X$ value), $S_{min}$ and $S_{max}$ are the minimum and maximum values the observed signal can take, respectively, $K_{D}$ is the equilibrium dissociation constant and $h$ is the Hill coefficient.
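
The equation translates directly into a function. This is a minimal sketch, not the notebook's own `hill` definition: the name `hill_signal`, the positional argument order and the parameter values are assumptions made for the example.

```julia
# Hill equation; argument names mirror the symbols in the text.
function hill_signal(L, Smin, Smax, Kd, h)
    return Smin + (Smax - Smin) * L^h / (Kd^h + L^h)
end

# Illustrative parameter values (made up, not fitted to any dataset).
Smin, Smax, Kd = 60.0, 260.0, 15.0

at_kd = hill_signal(Kd, Smin, Smax, Kd, 1.0)        # halfway between Smin and Smax
at_kd_steep = hill_signal(Kd, Smin, Smax, Kd, 2.5)  # still halfway: h does not move the midpoint
no_ligand = hill_signal(0.0, Smin, Smax, Kd, 1.0)   # Smin in the absence of ligand
```

Whatever the value of $h$, the signal at $L = K_{D}$ is halfway between $S_{min}$ and $S_{max}$; $h$ only changes how steeply the curve passes through that point.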

hill (generic function with 1 method)

Hyperbolic model

The hyperbolic equation is a special case of the Hill equation, in which $h = 1$:

$S = S_{min} + (S_{max} - S_{min}) \times \frac{L}{K_{D} + L}$
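
A quick numerical check of this special case, as a sketch: both functions below are written from the equations in the text, with hypothetical names and made-up parameter values, and are not the notebook's own definitions.

```julia
# Hill equation, as given earlier in the text.
hill_signal(L, Smin, Smax, Kd, h) = Smin + (Smax - Smin) * L^h / (Kd^h + L^h)

# Hyperbolic equation: the same expression with h fixed to 1.
hyperbolic_signal(L, Smin, Smax, Kd) = Smin + (Smax - Smin) * L / (Kd + L)

# Compare the two over a made-up concentration series.
concentrations = [0.1, 1.0, 10.0, 100.0, 1000.0]
same = all(isapprox(hyperbolic_signal(L, 60.0, 260.0, 15.0),
                    hill_signal(L, 60.0, 260.0, 15.0, 1.0))
           for L in concentrations)
```

The hyperbolic model has one fewer free parameter, which is why it is worth trying before the Hill model when there is no reason to expect cooperativity.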

hyperbolic (generic function with 1 method)

Quadratic model

Unlike the Hill and hyperbolic models, the quadratic model does not make the approximation that the concentration of free ligand at equilibrium is equal to the total ligand concentration:

$S = S_{min} + (S_{max} - S_{min}) \times \frac{(K_{D} + R_{tot} + L_{tot}) - \sqrt{(- K_{D} - R_{tot} - L_{tot})^{2} - 4 \times R_{tot} \times L_{tot}}}{2 \times R_{tot}}$

Symbols have the same meaning as in the previous equations, except here $L_{tot}$ is the total concentration of ligand, not the concentration of free ligand at equilibrium. $R_{tot}$ is the total concentration of receptor.

In principle, $R_{tot}$ could be left as a free parameter to be determined by the fitting procedure, but in general it is known accurately enough from the experimental setup, and one should replicate the same experiment with different concentrations of receptor to check its effect on the results. Ideally, $R_{tot}$ should be set in the experiment to be smaller than $K_{D}$, or at least of the same order of magnitude as $K_{D}$. It might take a couple of experiments to obtain an estimate of $K_{D}$ before one can determine an adequately small concentration of receptor at which to perform a definitive experiment.
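
To see why the receptor concentration matters, the sketch below evaluates the quadratic equation at $L_{tot} = K_{D}$ for a low and a high $R_{tot}$ and compares both with the hyperbolic equation. The function names and all numerical values are made up for illustration; this is not the notebook's own `quadratic` definition.

```julia
# Quadratic equation from the text; b is the sum Kd + Rtot + Ltot.
function quadratic_signal(Ltot, Smin, Smax, Kd, Rtot)
    b = Kd + Rtot + Ltot
    bound_fraction = (b - sqrt(b^2 - 4 * Rtot * Ltot)) / (2 * Rtot)
    return Smin + (Smax - Smin) * bound_fraction
end

# Hyperbolic equation, which treats free ligand as equal to total ligand.
hyperbolic_signal(L, Smin, Smax, Kd) = Smin + (Smax - Smin) * L / (Kd + L)

Smin, Smax, Kd = 60.0, 260.0, 10.0
Ltot = Kd  # the hyperbolic model predicts half-saturation here

reference = hyperbolic_signal(Ltot, Smin, Smax, Kd)
low_receptor = quadratic_signal(Ltot, Smin, Smax, Kd, 0.001)   # Rtot far below Kd
high_receptor = quadratic_signal(Ltot, Smin, Smax, Kd, 100.0)  # Rtot ten times Kd
```

With $R_{tot}$ far below $K_{D}$ the two models agree closely; with $R_{tot}$ well above $K_{D}$ most of the ligand is bound, and the signal at $L_{tot} = K_{D}$ is far below half-saturation.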

quadratic (generic function with 1 method)

Parameters and their initial values

The findInitialValues() function takes the measured data and returns an array containing initial values for the model parameters (in this order): $S_{min}$, $S_{max}$, $K_{D}$ and $h$ (for the Hill model only, so the function needs to know which model was selected).

Initial values for $S_{min}$ and $S_{max}$ are simply taken as the minimal and maximal values found in the data. The initial estimate for $K_{D}$ is the concentration of the data point that has a signal closest to halfway between $S_{min}$ and $S_{max}$ (if the experiment was properly designed, this is a reasonable estimate and close enough to the true value for the fit to converge). The initial estimate of $h$ is $1.0$, meaning we assume no cooperativity.
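
The heuristic described above can be sketched in a few lines. The function name `initial_values`, its `with_hill` keyword and the data are assumptions made for the example; the notebook's `findInitialValues()` may be organized differently.

```julia
# Initial parameter guesses from the data, in the order Smin, Smax, Kd (and h).
function initial_values(x, y; with_hill = false)
    Smin, Smax = extrema(y)
    halfway = (Smin + Smax) / 2
    # Index of the data point whose signal is closest to the halfway value.
    _, i = findmin(abs.(y .- halfway))
    Kd = x[i]
    return with_hill ? [Smin, Smax, Kd, 1.0] : [Smin, Smax, Kd]
end

# Made-up titration: concentrations (descending) and mean signal.
x = [100.0, 30.0, 10.0, 3.0, 1.0, 0.3]
y = [95.0, 80.0, 52.0, 25.0, 12.0, 6.0]

guess_hyperbolic = initial_values(x, y)
guess_hill = initial_values(x, y; with_hill = true)
```

Here the halfway signal is 50.5, the closest measured signal is 52.0, and the corresponding concentration, 10.0, becomes the initial estimate of $K_{D}$.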

findInitialValues (generic function with 1 method)

Determine initial values of the selected model's parameters from the currently loaded datasets:

initialParams

2-element Vector{Vector{Float64}}:
 [64.3333, 265.267, 16.7962, 1.0]
 [66.95, 282.25, 10.0777, 1.0]

Fitting

Perform fit of the selected model to the measurements' mean values using initial values for the model parameters determined previously. If the dataset contains replicates, the fit will be weighted by the measurements' standard deviations.
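
A weighted fit with the LsqFit package can be sketched as follows. Several things here are assumptions made for the example rather than the notebook's own code: the model is the hyperbolic equation regardless of the selected option, the data are synthetic and noise-free, and the weights are inverse variances, which is a common convention for weighted least squares; the notebook's own weighting may differ.

```julia
using LsqFit  # third-party package; must be installed

# Hyperbolic model in the form LsqFit expects: model(x, p),
# with p = [Smin, Smax, Kd].
model(L, p) = p[1] .+ (p[2] - p[1]) .* L ./ (p[3] .+ L)

# Synthetic, noise-free data generated from known parameters (made up).
true_params = [60.0, 260.0, 15.0]
L = [1000.0, 300.0, 100.0, 30.0, 10.0, 3.0, 1.0, 0.3, 0.1]
y = model(L, true_params)

# Made-up standard deviations; the weights here are inverse variances.
sigma = fill(5.0, length(L))
weights = 1 ./ sigma .^ 2

p0 = [50.0, 250.0, 10.0]  # initial guesses
fit = curve_fit(model, L, y, weights, p0)

best = fit.param        # best-fit parameters
errors = stderror(fit)  # standard errors of the parameters
n_free = dof(fit)       # number of data points minus number of parameters
```

Because the synthetic data contain no noise, the fit should recover `true_params` almost exactly; with real data, the residuals and standard errors are what tell you how far to trust the result.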

LsqFit.LsqFitResult

1: LsqFitResult{Vector{Float64}, Vector{Float64}, Matrix{Float64}, Vector{Float64}}

param: [62.8743, 262.9, 16.7121, 0.882222]

resid: [-24.7024, -6.77057, 8.57461, 10.7606, 17.2247, 11.3427, -5.99416, -6.07259, -9.60075, -4.25664, -9.7462, 8.95732, 12.1772, 12.4126, 7.48185, 1.66752, -1.42567, -2.09179, -7.73141, -8.74607, -1.93884, -4.9613]

jacobian: 22×4 Matrix{Float64}:
 0.0852277  3.14964     -0.876231    67.9152
 0.149256   3.51474     -1.51183    102.55
 0.209864   3.14905     -2.07755    120.82
 0.362452   3.46555     -3.46485    167.97
 0.533953   3.25315     -4.84321    187.925
 0.715535   2.77788     -6.00797    174.982
 1.11479    2.75776     -8.38277    163.031
 ⋮
 3.12997    0.0854486   -0.878296   -67.9083
 2.55232    0.0443997   -0.460812   -40.0883
 3.1649     0.0350821   -0.36638    -35.4185
 2.68096    0.0189363   -0.198551   -21.1156
 3.42975    0.0154365   -0.162267   -18.8271
 2.99163    0.00857973  -0.0903365  -11.3554

converged: true

wt: [10.4644, 13.4249, 11.2823, 14.6536, 14.3422, 12.2039, 14.9967, 18.2011, 13.7273, 17.6147, 18.5369, 17.0853, 13.4128, 11.7831, 11.1293, 9.46348, 10.3389, 6.74296, 10.2399, 7.28943, 11.8693, 9.00125]

2: LsqFitResult{Vector{Float64}, Vector{Float64}, Matrix{Float64}, Vector{Float64}}

param: [67.9997, 279.69, 11.8407, 1.1958]

resid: [-3.35862, -4.29772, -2.51543, 0.703577, 4.89174, -0.524154, -2.00412, 11.7914, -11.7911, 2.75704, -0.933741, -0.0285809, 3.73199, 4.43753, 0.593579, 2.2514, -3.46711, -4.94193, 1.11113, 2.07233, -3.99565, 0.901898]

jacobian: 22×4 Matrix{Float64}:
 0.0046039  0.926775      -0.0979387    4.30217
 0.0180206  1.96939       -0.381763    14.8387
 0.037276   2.21159       -0.783702    26.4976
 0.0606428  1.9533        -1.25742     36.1542
 0.163079   2.85168       -3.29781     78.1399
 0.272158   2.58368       -5.26389     98.0995
 0.600055   3.09259      -10.7438     145.88
 ⋮
 2.09691    0.0240353     -0.50802    -18.7986
 2.19144    0.0136368     -0.289734   -12.1867
 1.33566    0.00451225    -0.096141    -4.53017
 1.43926    0.00263967    -0.0563292   -2.93916
 1.97307    0.00196457    -0.041958    -2.40152
 1.8123     0.000979646   -0.0209322   -1.30396

converged: true

wt: [0.867468, 3.94979, 5.05742, 4.05596, 9.08878, 8.15583, 13.6356, 15.4208, 17.984, 13.2669, 15.7613, 10.87, 8.32922, 7.37637, 5.31139, 2.2228, 4.49843, 4.86236, 1.79606, 2.07906, 3.90075, 3.28798]

Degrees of freedom:

Int641

Best fit parameters:

2-element Vector{Vector{Float64}}:
 [62.8743, 262.9, 16.7121, 0.882222]
 [67.9997, 279.69, 11.8407, 1.1958]

Standard errors of best-fit parameters:

paramsStdErrors

2-element Vector{Vector{Float64}}:
 [0.143571, 0.237742, 0.0728792, 0.00317502]
 [0.201093, 0.262722, 0.0425967, 0.00477153]
