Kasper Stovgaard:
Inferential Determination of Protein Structure from Small-Angle X-ray Scattering Data

Date: 01-03-2012    Supervisors: Thomas Hamelryck & Anders Krogh

A new method for calculating small-angle X-ray scattering (SAXS) curves for protein structures was developed in this PhD project. The method is based on the Debye formula and a set of coarse-grained scattering form factors, which are necessary for computationally efficient evaluation. A maximum likelihood estimation procedure was used to determine point estimates of the form factors from data generated for a large set of high-quality protein structures from the Protein Data Bank. The method enables SAXS information to be used as a likelihood function in an inferential structure determination approach based on an MCMC sampling scheme. It was validated on a large set of known protein structures and in a decoy recognition experiment. In these tests it performed as well as the current state-of-the-art method, CRYSOL.
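
As an illustration of the Debye formula mentioned above, the following minimal sketch evaluates I(q) = sum_i sum_j F_i(q) F_j(q) sin(q r_ij) / (q r_ij) for a set of coarse-grained scattering bodies. It is not the thesis implementation: the function name, the array layout and the form factor values passed in are assumptions made for this example only.

```python
import numpy as np

def debye_intensity(coords, form_factors, q_values):
    """Evaluate the Debye formula for N scattering bodies.

    coords:       (N, 3) body positions in Angstrom (hypothetical layout)
    form_factors: (N, Q) form factor of each body at each q (placeholder values)
    q_values:     (Q,)   scattering momenta in 1/Angstrom
    """
    coords = np.asarray(coords, dtype=float)
    form_factors = np.asarray(form_factors, dtype=float)
    q_values = np.asarray(q_values, dtype=float)

    # Pairwise distances r_ij between all bodies, shape (N, N).
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    intensity = np.empty(len(q_values))
    for k, q in enumerate(q_values):
        # np.sinc(x) = sin(pi x) / (pi x), so this is sin(q r) / (q r),
        # with the correct limit of 1 when q r = 0.
        kernel = np.sinc(q * dist / np.pi)
        f = form_factors[:, k]
        # Double sum over i and j written as a quadratic form.
        intensity[k] = f @ kernel @ f
    return intensity
```

The cost grows with the square of the number of scattering bodies, which is why reducing that number through coarse-graining makes repeated evaluation inside a sampling scheme more affordable.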

The coarse-grained model was implemented as an energy function in the PHAISTOS protein structure prediction program. The SAXS likelihood function was thereby combined with TorusDBN, a probabilistic model of local protein structure; because only structures with realistic local structure are sampled, the conformational search space was dramatically reduced. Sampling in a generalized ensemble, with weights inversely proportional to the density of states, made it possible to obtain a conservative statistical ensemble of solution structures that all fit the experimental SAXS data. Such an ensemble is not readily obtained using existing methods.
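
The idea of weights inversely proportional to the density of states can be sketched as a Metropolis acceptance rule. The sketch assumes a symmetric proposal and an already available estimate of ln g(E) for the current and proposed states; the function name and arguments are hypothetical and do not reflect the PHAISTOS code.

```python
import math
import random

def accept_move(ln_g_old, ln_g_new, rng=random):
    """Acceptance test for target weights w(E) proportional to 1 / g(E).

    ln_g_old, ln_g_new: estimated log density of states at the energy of the
    current and the proposed structure (assumed to be given).
    rng: any object with a random() method returning a float in [0, 1).
    """
    # w(E_new) / w(E_old) = g(E_old) / g(E_new), computed in log space.
    log_ratio = ln_g_old - ln_g_new
    if log_ratio >= 0.0:
        return True
    return rng.random() < math.exp(log_ratio)
```

Under this rule, moves into sparsely populated energy regions are always accepted, while moves into densely populated regions are accepted with probability g(E_old) / g(E_new), which flattens the sampled energy histogram.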

The program was applied to three different flexible multi-domain proteins, a class that is notoriously difficult for high-resolution structure determination techniques. For a protein with a known high-resolution structure and generated SAXS data, the results were within 2 Å of the native structure in terms of root-mean-square deviation. For larger proteins, our models agreed with the low-resolution bead models from existing methods, while providing biologically meaningful results and information on domain arrangement and linker flexibility.