Docsity
Docsity

Prepare for your exams
Prepare for your exams

Study with the several resources on Docsity


Earn points to download
Earn points to download

Earn points by helping other students or get them with a premium plan


Guidelines and tips
Guidelines and tips

Understanding Protein Folding Dynamics: A Submillisecond Time Scale Approach, Lab Reports of Materials science

The challenges of simulating protein folding dynamics on the submillisecond time scale using distributed computing and molecular dynamics. The authors describe their methodology for simulating folding events and the importance of balancing communication time against work time. They also discuss the implications of their research for understanding protein folding mechanisms and its potential applications in nanotechnology and protein design.

Typology: Lab Reports

Pre 2010

Uploaded on 07/29/2009

koofers-user-80p
koofers-user-80p 🇺🇸

10 documents

1 / 19

Toggle sidebar

Related documents


Partial preview of the text

Download Understanding Protein Folding Dynamics: A Submillisecond Time Scale Approach and more Lab Reports Materials science in PDF only on Docsity! Atomistic Protein Folding Simulations on the Submillisecond Time Scale Using Worldwide Distributed Computing Vijay S. Pande1–4 Ian Baker1 Jarrod Chapman4,* Sidney P. Elmer1 Siraj Khaliq5 Stefan M. Larson2 Young Min Rhee1 Michael R. Shirts1 Christopher D. Snow2 Eric J. Sorin1 Bojan Zagrovic2 1 Department of Chemistry, Stanford University, Stanford, CA 94305-5080 2 Biophysics Program, Stanford University, Stanford, CA 94305-5080 3 Department of Structural Biology, Stanford University, Stanford, CA 94305-5080 4 Stanford Synchrotron Radiation Laboratory, Stanford University, Stanford, CA 94305-5080 5 Department of Computer Science, Stanford University, Stanford, CA 94305-5080 Received 27 March 2002; accepted 22 May 2002 Abstract: Atomistic simulations of protein folding have the potential to be a great complement to experimental studies, but have been severely limited by the time scales accessible with current computer hardware and algorithms. By employing a worldwide distributed computing network of tens of thousands of PCs and algorithms designed to efficiently utilize this new many-processor, highly heterogeneous, loosely coupled distributed computing paradigm, we have been able to simulate hundreds of microseconds of atomistic molecular dynamics. This has allowed us to directly Correspondence to: Vijay S. Pande; email: pande@stanford. edu *Present address: Physics Department, University of California, Berkeley, CA Contact grant sponsor: ACS PRF, NSF MRSEC CPIMA, NIH BISTI, ARO, and Stanford University Contact grant numbers: 36028-AC4 (ACS PRF), DMR-9808677 (NSF MRSEC CPIMA), IP20 6M64782-01 (NIH BISTI), 41778- LS-RIP (ARO) Biopolymers, Vol. 68, 91–109 (2003) © 2002 Wiley Periodicals, Inc. 91 simulate the folding mechanism and to accurately predict the folding rate of several fast-folding proteins and polymers, including a nonbiological helix, polypeptide -helices, a -hairpin, and a three-helix bundle protein from the villin headpiece. Our results demonstrate that one can reach the time scales needed to simulate fast folding using distributed computing, and that potential sets used to describe interatomic interactions are sufficiently accurate to reach the folded state with exper- imentally validated rates, at least for small proteins. © 2002 Wiley Periodicals, Inc. Biopolymers 68: 91–109, 2003 Keywords: atomistic protein folding; microsecond time scale; computer hardware; computer algorithms; molecular dynamics; distributed computing; villin; beta hairpin INTRODUCTION Understanding the sequence–structure relationship of proteins will play a pivotal role in the postgenomic era, and will have great impact on genetics, biochem- istry, and pharmaceutical chemistry.1–3 A detailed picture of the folding process itself will be important in understanding diseases, such as Alzheimer’s and variant Creutzfeldt–Jacob disease, believed to be re- lated to protein misfolding.4 Finally, an understanding of protein folding mechanisms will be important in protein design and nanotechnology, in which self- assembling nanomachines may be designed using synthetic polymers with protein-like folding proper- ties.5 Unfortunately, current computational techniques to tackle protein folding simulations are fundamentally limited by the long time scales (from a simulation point of view) needed to study the dynamics of inter- est. For example, while the fastest proteins fold on the order of tens of microseconds, current single com- puter processors can only simulate on the order of a nanosecond of real-time folding in full atomic detail per CPU day—a 10,000-fold-computational gap. Great strides in traditional parallel molecular dynam- ics (MD), utilizing many processors to speed a single dynamics simulation, have been made and have par- tially overcome this divide. A tour-de-force parallel- ization of simulation code for supercomputers by Duan and Kollman has previously led to the simulation of 1 s of dynamics for the villin headpiece three- helix bundle,6 demonstrating that parallelization schemes using hundreds of processors can be used to make significant progress at closing this computa- tional gap. However, such methods have fundamental drawbacks: in particular, these methods require com- plex, expensive supercomputers due to the need for fast communication between processors. Moreover, due to the stochastic nature of folding, in order to study the folding of a 10-s folder, one must simulate hundreds of microseconds, requiring computing power equal to thousands or tens of thousands of today’s processors. Developing such large-scale parallelization meth- ods is very difficult, and current parallelization schemes cannot scale to the level of even thousands of processors (i.e., cannot use so many processors effi- ciently). To understand why scalability to thousands of processors is so difficult, consider an analogy to a graduate student thesis. A typical thesis takes 1500 graduate student days. If one employed 1500 graduate students to accomplish this goal, would it be possible to complete a thesis in a single day? Clearly not—the overhead of communication between students, as well as the inability to devise an “algorithm” to divide the labor evenly, would make 100% efficiency impossible in this case. At this level of scaling, it is likely that the work would actually take longer. These issues, in particular balancing communication time against time spent actually doing work, are mirrored in the division of labor between computer processors. Clearly, the only way to efficiently utilize such a large number of processors is to divide work in such a way that re- quires minimal communication. Even with an algorithm with perfect scalability (e.g., with a 10,000-fold increase in speed using 10,000 processors), we are still left with the problem of obtaining a 10,000-processor supercomputer. For comparison, the largest unclassified supercomputer in the world (the SP at NERSC) has 2500 processors, and of course this resource must be shared between many (hundreds) different research groups. Recently, another approach has been developed to bridge this enormous computational gap: worldwide distributed computing.7 There are hundreds of millions of idle PCs potentially available for use at any given time, the majority of which are vastly underused. These computers could be used to form the most powerful supercomputer on the planet by several orders of magnitude.7 However, to tap into this resource effi- ciently and productively, we must employ nontradi- tional parallelization techniques.8,9 Indeed, we wish to accomplish a seemingly impossible goal: to push the scalability of MD simulations to previously unattain- able levels (i.e., the efficient use of tens of thousands 92 Pande et al. Standard propagation of error results in time constant  standard deviation of   1/k  Ntotalt/Nfolded  Ntotalt/Nfolded 3/2 For example, for the -hairpin folding data (see below), we have Ntotal  2700, Nfolded  8, and t  14 ns, which results in k  2.1  105  0.74  104 s1, and   4.7  1.7 s. We stress that as long as one can identify transitions, thus allowing an M times speed-up for all barrier crossings, the dynamics we simulate will faithfully follow the dynam- ics one would obtain from traditional MD, but simply M times faster. If there are off-pathway traps, our method will go to them; indeed, we will reach them M times faster. However, we will escape these traps M times faster as well. This method is not intended as a structure prediction algo- rithm, but rather a means to speed dynamics and study the mechanism of folding, which may include on-pathway in- termediates or diversions to traps. How Can One Identify Free Energy Barrier Crossings (“Transitions”)? Of course, the utility of this method rests on our ability to identify transitions, i.e., to calculate whether a simulation has crossed a free energy barrier. Voter’s parallel replica method was intended to accelerate the dynamics of solid- state systems that have energy barriers, in which one can identify new states by performing energy minimization to see whether one has crossed an energy barrier.9 However, in protein folding (as well as many other complex systems), the relevant barriers are free energy barriers, and thus an energy minimization technique is not applicable. In order for this method to be applied to a broad range of barrier crossing problems, one needs to use a more general way to identify free energy barrier crossings. We suggest that, in analogy to first-order phase transi- tions, one could look for a large variance in energy, which can loosely be related to a momentary surge in the heat capacity (a common sign of a first-order phase transition). Such energy variance peaks have been seen to coincide with free energy barrier crossings in simple models23 and all- atom (S. Perkins and V. Pande, unpublished results) models of protein folding. This technique has the significant advan- tage that it does not require any knowledge of the structure of the protein at the barrier.24 Moreover, to the extent that energy variance peaks correctly identify transitions, these peaks would aid in the interpretation of the simulation results, since they would demarcate transitions to new free energy minima. Of course, in the case of single-exponential kinetics, as is experimentally observed for almost all small proteins, there exists only one rate-determining free energy barrier, and thus recognizing the barrier is not essential to the technique. In fact, although we do not discuss it in this paper, in some cases ignoring the transitions can result in reaching the folded state faster than by recognizing all the barriers.8 Finally, simulating completely independent trajec- tories is another appealing possibility for systems with single exponential kinetics; we discuss this possibility in the Discussion section below. Simulation Details For all of the molecules presented here, each simulation is started from a completely extended state. This is done to avoid any possibility of biasing the initial state toward the native state of the molecule. Clearly the extended state does not represent the structure of the unfolded state. Indeed, we find that rapidly—i.e., within 1–3 ns of MD simulation— this extended state relaxes to the unfolded state of the protein. While this practice utilizes more computational time than, for example, starting from some predicted un- folded state, it has the virtue of not making any assumptions of the unfolded ensemble and removes any possibility of biasing the system to the native state. For each run, we have used M “clone” processors, each simulating folding in atomic detail with molecular or sto- chastic dynamics simulations (Figure 1). Once one of these clones makes a transition (identified by a spike in the energy variance: see below), we declare that the simulation has gone through a transition, copy the resulting configuration to all of the other processors, and recommence simulations from the new configuration. After restarting all simulations from the coordinates of the barrier-crossing simulation, one must ensure decorrelation of the next ensemble of trajecto- ries in order to achieve an increase in computational speed. This process is performed many times, over several “runs.” In our simulations, the spatial coordinates of the barrier- crossing simulation were copied and unique random number seeds (for Langevin dynamics random forces) were used to immediately differentiate the simulations. In a purely deter- ministic simulation, one would need to differentiate each simulation by restarting them with differing velocities, which may lead to potentially nonphysical discontinuities in the path; however, if the velocity decorrelation time is much shorter than the conformational decorrelation time (certainly true for dynamics in any water-like solvent), then the effects are likely to be minimal. In either case, the path obtained would correspond to a fast traversal of the potential land- scape, but the total simulation time among all processors would be equivalent to the additional time waiting in min- ima that a representative “serial” simulation would take. The Folding@Home distributed computing system (http://folding.stanford.edu) was used for the two most de- manding calculations (the -hairpin and villin simulations) presented here. The Folding@Home client software (which performs the scientific calculations) is based upon the Tinker molecular dynamics code,25 with numerous modifi- cations performed by Michael Shirts, other members of the Pande group, Jed Pitera, and Bill Swope. We simulated folding and unfolding at 300 K and at pH 7 (unless noted otherwise), using the OPLS26 parameter set and the GB/ SA27 implicit solvent model. Stochastic dynamics were used to simulate the viscous drag of water (  91/ps), and a 2 fs integration time step was used with the RATTLE Protein Folding Simulations on the Submillisecond Time Scale 95 algorithm28 to maintain bond lengths. Long-range interac- tions were truncated using 16 Å cutoffs and 12 Å tapers. We identified transitions by a heat capacity spike asso- ciated with crossing a free energy barrier. It has been previously shown that this is a means to identify transitions in all-atom8 and simplified protein model29 simulations. To monitor the heat capacity during the simulation, we calcu- late the energy variance, and use the thermodynamic rela- tionship Cv  (E 2  E2)/T, where E is the energy of the system; note that since we are using an implicit solvent, our “energy” is often called an “internal free energy” (the total free energy except for protein conformational entropy). Each PC runs a 100 ps MD simulation (“1 generation”), calculates the energy variance within this time period, and then returns this data to the Folding@Home server. If the energy variance exceeds a preset threshold value, the server identifies this trajectory as having gone through a transition, and then resets all other processors to the newly reported coordinates. Since the heat capacity is extensive, we used a fixed value of this threshold per atom (0.8 kcal2/mol2/atom) for all molecules (which for example leads to a threshold of 300 kcal2/mol2 for villin). Since transitions occur relatively infrequently (see be- low), one need not run these simulations on massively parallel supercomputers (with high speed communication); instead, these simulations are well suited for large, decen- tralized distributed computing clusters, such as the Folding@Home project. Not only is this a demonstration that such distributed computing clusters can be used to study long time-scale kinetics with molecular dynam- ics—we stress that this is likely the only way such calcula- tions could have been practically performed, considering the great computational demands of these calculations. RESULTS Protein -Helices To test the methodology presented above, we have simulated the folding of two different -helical pep- tides. One sequence we examined, the “Fs peptide” Ac–A5(A3RA)3A–NH2, has been shown experimen- tally to have biexponential kinetics, with characteris- tic times of 10 ns and 160  60 ns.10 Helices are believed to form via nucleation,30,31 which is influ- enced by the disorder in a system (either as a nucle- ation accelerator or blocker), analogous to a liquid with impurities. In our system, the arginine residues could be considered to be an analog of these impuri- ties that blocks propagation, and it is interesting to consider the role of this disorder in the sequence above, i.e., whether the arginine residues affect the nucleation processes. To address this, we have also folded a pure poly-A chain, Ac–A20–NH2. We have been able to fold both of these protein sequences and find rates comparable to experiment at a temperature of 10°C. The initial configurations were completely elongated chains (  135°,  135°). Qualitatively, both the poly-A helix and the Fs pep- tide folded by first undergoing nucleation followed by propagation toward the termini. We found propaga- tion in both directions (N to C and C to N), although we do not have sufficient statistics to determine a bias in propagation direction.30 Quantitatively, the Fs pep- tide folded (i.e., reached 15 helical residues, the value expected from experiment10) in 82  60 ns in our simulations. Note that while 7 runs were used for this average, 2 of the 7 Fs peptide runs did not fold after 160 ns. Since these runs were included in the average as folding in 160 ns, the average is somewhat lower than it should be. However, since the experimental rate is 1/160 ns, one would expect that on average 4.4/ 7 [63%  1  exp(kt)] of the runs would fold after 160 ns, whereas we observed 5/7, which is well within the experimental bounds.11 Moreover, our simulations capture some finer de- tail about the nature of folding. We see fast early events, as found experimentally. The alanine-rich N- terminal part of the Fs peptide folded very quickly, in 15  10 ns, consistent with the observation of N- terminal fluorophore quenching in 10 ns by Eaton and co-workers11 and the faster rate observed by Dyer and co-workers.10 The poly-A helix folded considerably faster (18  8 ns, out of 8 runs) and typically had more helical content (17.8 residues vs 15.1 for the Fs peptide). Apparently, the arginine residues are responsible for the differences in these folding rates by acting as blockers of helix nucleation and propagation. Looking at the formation of secondary structure vs. time (see Figure 2, right), we see that helical propagation halts at the arginine residues (R) and often the completion of helix formation requires additional nucleation events. While our eight poly-A helix runs did show stalling of propagation (Figure 2), these events were not localized to any particular point in the chain. Why does arginine limit propagation? We suggest that the long Arg side chain significantly limits its mobility, and moving into a helical / orientation thus occurs much more slowly. Nonbiological Helices How generally applicable and accurate is the coupled simulation method? To address this question, we have applied this method to study the folding of a nonbio- logical helix, a 12-mer of polyphenylacetylene (PPA).12 This polymer can be considered to be a nonbiological analog of polyalanine, since it is a ho- mopolymer with a simple side chain that folds into a 96 Pande et al. helix.12 We have previously shown13 that this poly- mer folds to a helix on the tens of nanosecond time- scale, in accordance with previous experimental ob- servations.14 We find that our ensemble dynamics method works well for PPA. The mean folding time and folding time distribution are consistent with brute force, traditional simulations of PPA.13 This is dem- onstrated by the agreement in mean folding times between the two methods and the similarity of the folding time distribution (see Figure 3 and Table 1). C-Terminal -Hairpin of Protein G -Helices and -hairpins together represent the most ubiquitous secondary structural elements in proteins. In a previous section, we discussed our simulation results for helices and now we concentrate on hair- pins. We have recently reported a full-atom, implicit- solvent simulation of folding of the hairpin at a bio- logically relevant temperature,32 and here we briefly summarize those results. We have obtained a very large ensemble of conformations, which includes mostly partially folded structures, as well as eight complete, fully independent folding trajectories. These data sets allow us to determine the key trends characterizing the folding process and determine sev- eral average properties that have been measured or could, in principle, be measured experimentally. Based on our results, we can estimate the folding rate of the hairpin in the following way: we have simulated 27 independent runs, each consisting of M  100 clone simulations that, on average, completed approximately 14 ns of simulated time, bringing the total to approximately 38 s of real time simulation. Out of 2700 simulations, we have detected eight com- plete folding events, which (if we assume single ex- ponential folding kinetics) results in an estimated folding time of approximately 4.7 s. This prediction is in excellent agreement with the experimentally measured time of 6 s.16,32 Our results offer the following picture of the fold- ing mechanism (Figure 4). Folding from a fully ex- FIGURE 2 Folding simulation of -helices. Shown above are trajectory data for simulations of the poly-A helix (left) and Fs peptide (right). Top: number of helical units vs time (dotted line) and energy variance vs time. We see that peaks are associated with nucleation events. Bottom: Secondary structure formation vs time: red, yellow, and blue denote helices, -sheets, and turns respectively. In both cases, we see nucleation events (corresponding to energy variance peaks). However, in the case of the Fs peptide, nucleation events did not occur at the arginine residues (R) and propagation typically was blocked at these residues (also seen in the other seven runs we performed, data not shown). We estimated the time by multiplying the directly simulated time t by the number of processors M (M  24 and M  128 for the left and right trajectories, respectively). Protein Folding Simulations on the Submillisecond Time Scale 97 folding to the intermediate state: the protein did not completely unfold during the simulation time scale ( 1 s). However, a transition to the partially folded intermediate (I) was detected. Finally, we compare our results to what one might expect from protein folding theory. One of our pri- mary results is that folding appears to proceed through transitions between free energy minima: starting in an unfolded state (U) to an intermediate (I) and then to the native state (N) (e.g., see reviews Refs. 1–3 and 40, and references therein). As previously dis- cussed,1–3 the collapse to the I state appears to be driven by hydrophobic interactions. However, consid- ering that villin is one of the fastest folding proteins currently known, it is interesting to consider that many of these interactions were non-native. 100 Pande et al. While there is little structure in U, the intermediate I is collapsed, with some native tertiary structure, much non-native structure, and little side-chain pack- ing. Thus, I is very much like the molten globule intermediates found in other proteins.41 Also, the in- termediate state found by Duan and Kollman6 fits our I state, since it is collapsed, partly native, but missing certain native contacts, and satisfies our I state defi- nition in terms of the RMSD and radius of gyration. We see a somewhat cooperative U 3 I transition (e.g., reflected in free energy barriers in Figure 7a) and a very cooperative I3 N transition, as predicted previously.42,43 Indeed, the cooperativity of the I3 N transition appears to result from side chain packing in our model, since many non-native contacts must col- lectively break to allow the formation of native side- chain packing. It seems highly unlikely that the native structure could be reached so rapidly through piece- wise movements without this collective event. It is clear that any potential set employed to model atomic interactions will have its limitations. The rel- evant question to ask is, How good do they need to be and what would result from errors in these potentials? Since we do see folding to the native state, it appears that the potentials we used were sufficient in this case. However, we cannot rule out the possibility that in our model, the I state is comparably (or more) stable than the N state. This could be the result of slight errors in the potentials.44 Much like adding denaturant in a physical example, adding errors to a potential reduces the energetic favorability of the N state and thus makes the I state relatively more favorable due to its entropic advantage.44 DISCUSSION Scalability of the Algorithm As more computers become accessible to distrib- uted computing methods, it is important to under- stand the limits of the scalability of the method, i.e., the limits to the number of processors one can use to achieve a speed increase. While this method can yield significant speed improvements for simulating complex systems (and scalability considerably be- yond traditional parallel MD), there are some im- portant limitations to its scalability we must con- sider. For example, simulating a process where the mean time to fold tfold  100 ns using M  10 6 processors will not necessarily mean that one will achieve folding events using only tfold/M  100 fs trajectories. The scalability will be inherently lim- ited by the barrier crossing time tcross (i.e., the time spent actually crossing the barrier, not including the much longer time spent “waiting” in the free energy minima, which dominates the folding time tfold). Since the speed increase from our method is due to the elimination of the waiting time, we expect that M tfold/tcross additional processors will not give any additional speed increase.8,9 Thus, the bounds of scalability for this method are also related to an interesting physical question: How much time is required to actually cross the free energy barrier? This time can be quantified by using our method to look for the limits of scalability within our tech- nique. For the proteins we have examined, it is likely that this time is on the hundreds of picoseconds to nano- second time scale. It is interesting to consider how FIGURE 4 A detailed analysis of a folding trajectory of the -hairpin from the C-terminal segment of protein G. (a) Cartoon representation of the folding trajectory; the backbone of the peptide is represented as a gray trace; the core hydrophobic residues (Trp43, Tyr45, Phe52, Val54) are shown in dot representation; (b) RMSD from the 1GB1 structure of the hairpin (residues 43–54), radius of gyration, and the number of backbone–backbone hydrogen bonds; (c) distance between key hydrogen bonding partners (green: Trp43–Val54; red: Tyr45–Phe52), and the minimum distance between Trp43 and Phe52 (black). Note that the minimum distance between Trp43 and Phe52 reaches its final value before the key hydrogen bonds are established; (d) solvation energy (Esolvation), charge–charge energy (Echarge), and total potential energy vs time. The initial hydro- phobic collapse of the unfolded peptide correlates with a sharp decrease in Etotal, while the attainment of the final structure correlates with Etotal reaching its final value. A significant deviation around G160 of Echarge and Esolvation from their final value is correlated with the temporary breaking of the key Tyr45–Phe52 hydrogen bond; (e) a concise summary of the key events along the folding trajectory (color code: yellow—high; violet—low). HB-ij denotes the distance between the hydro- gen bonding partners i and j; min-kl denotes the minimum distance between residues k and l. Note that the establishment of the hydrophobic Trp43–Phe52 interaction is the earliest event of signifi- cance along the trajectory. Time is reported in the number of generations: roughly, 1 generation corresponds to 100 processors  0.1 ns/generation/processor  10 ns/generation. Protein Folding Simulations on the Submillisecond Time Scale 101 this minimum time varies with the folding time. Since there need not be any correlation between these times, it is possible that slower folding proteins (e.g., those which fold on the millisecond and longer time scale) could be folded using our method with current micro- processors by simply employing more of them. In- deed, computational resources on the million-proces- sor scale have been proposed, such as IBM’s Blue Gene, as have other distributed computing projects. With such computational resources, it is possible that we could push our simulations from the hundreds of microsecond timescale to fractions of a second, al- lowing us to reach timescales relevant for slow fold- ers. 102 Pande et al. vant physics to faithfully reproduce and predict the physical effect of interest. For the small proteins we have examined, it appears that the model we have employed is indeed “correct enough” for predicting rates (see Figure 8 and Table I). This implies that either the discrete nature of water is not relevant for folding, that folding rates are fairly robust to such inaccuracies of the model, or that there is a convenient cancellation of errors. In order to discern between these two possibilities, one must resimulate these pro- teins with explicit solvation models and compare the rate and mechanistic predictions; if these predictions agree, then perhaps the potential gain in accuracy of explicit solvent models would indeed not be relevant for folding kinetics. Furthermore, we stress that ex- plicit solvent models make approximations as well,47 and there is no reason why an arbitrary explicit sol- vation model would necessarily be better than a well- designed implicit model. Finally, it is important to consider that the question of the validity of implicit solvation models goes be- yond a simple debate of the validity of particular computational methodology, but also impacts the way in which one thinks of protein structure in general. If explicit solvation were critical to protein folding, then it is likely that one should not think of protein struc- tures without the requisite cloud of water molecules it interacts with, as it is the very discrete and potentially structural aspects of the water that play a large role in folding. However, if implicit solvent models are suf- ficiently accurate, this suggests that a structural pic- ture of a protein alone (implicitly considering the effects of water, such as hydrophobicity, etc., but not with a discrete, structural form in mind) is indeed sufficient. Alternative Methods to Simulate Dynamical Events on Long Time Scales Using Low Viscosity Simulation Water is a relatively viscous solvent. Indeed, in quan- titative terms, the damping force of water is on the order of  100/ps. It is intriguing to consider whether one can simulate the effective result of long time-scale events by simulating the effect of much lower viscosity solvents, say  1/ps, while keeping all of the other properties of a water-implicit solvation model unchanged. This is appealing since this is triv- ial to perform with implicit solvation models and this ability to explicitly set the viscosity of the solvent in the model may represent one of the great strengths of using implicit solvent models. FIGURE 8 Comparison of theoretical rate predictions from @Folding@Home and the according experimental folding rate determinations. We compare the folding rates for the proteins and polymers described in this review: PPA, polyalanine-based helices, the C-terminal -hairpin from protein G, and the villin headpiece. If our folding rate prediction were perfect, all points would lie on the diagonal line. The agreement strongly suggests that our method can accurately predict the absolute folding rate for small proteins, peptides, and foldamers. Protein Folding Simulations on the Submillisecond Time Scale 105 If one were to decrease the viscosity by 100 times, could one simulate 10 ns and expect to get 1000 ns  1 s of sampling? This question has been ad- dressed in many models and systems. For example, Klimov and Thirumalai48 have shown that (for a coarse-grained protein model) the rate of folding in- creases with decreasing viscosity to a point ( 1/ps) at which the rate decreases with decreasing viscosity. This nonmonotonic dependence can be understood in terms of the dual role of the solvent viscosity: viscos- ity retards motion through the solvent, but also creates the random forces that are needed to drive the system over energy and free energy barriers. Thus, the rate should be optimal at intermediate viscosity. If this peak in the rate vs viscosity curve does peak at 1/ps, then one should expect an increase in sam- pling at viscosities in between 1/ps and 100/ps, and for thermodynamic properties, this increased sam- pling should be beneficial. However, it is still unclear whether kinetic properties would be unchanged by significant changes in the viscosity. Alternative Methods to Simulate Dynamical Events on Long Time Scales Using Large-Scale Distributed Computing In this review, we have discussed protein folding simulations using ensemble dynamics, our parallel replica-like method intended to handle free energy barrier crossing problems. The greatest weakness of this method rests in the need for transitions and for the accurate identification of these transitions. Our sug- gested means to identify transitions, looking for en- ergy variance spikes during dynamics, has the benefit of being a purely thermodynamic method and thus does not use any information of the protein native state or any folding-related hypothesized reaction co- ordinates. However, if transitions are incorrectly iden- tified, the validity of the resulting data is put under question. Considering that great computational re- sources are needed to generate the folding simulations presented here, this limitation could be very expen- sive computationally—the failure to accurately iden- tify transitions may mean that the resulting data set is invalid. It is interesting to consider a simpler method, which does not have the liabilities described above. Namely, instead of loosely coupling simulations (i.e., restarting simulations after transitions have been de- tected), one could simply run a large number M of completely independent simulations. For single expo- nential kinetics, we would still gain an M times speed- up (as described above). However, even if the reaction under study did not have single exponential kinetics, independent trajectories might still have value. In- deed, in a sense, a set of thousands of simulations each on the tens of nanosecond time scale is a data set that stands on its own. For example, one could inter- pret the results for single-exponential kinetics, by examining the fraction f(t) that fold in time t and fitting a rate with the slope. However, one would not be limited to this exponential kinetics analysis, and this data could be reanalyzed a postiori to test new hypotheses or kinetic models. Considering the great computational cost of producing these data sets, this more “pure” method for simulating kinetics has a great appeal. Indeed, we have reexamined the folding of villin with uncoupled trajectories49 in this manner (B. Zagrovic, et al. J Mol Biol, 2002, in press) and it will be interesting to determine how the uncoupled simulations differ (e.g., in rates and mechanism) from those presented here. Is Transition Detection Really Necessary? The discussion above regarding the possibility of us- ing independent trajectories and still gaining a speed increase linear with the number of processors raises the question, Must one bother with transition detec- tion as used here? Another way to examine this ques- tion is to ask with what frequency were transitions detected in the examples described here. For the helix folding simulations, 2 or 3 transitions were detected before the simulation reached the folded state. The first transition accompanied the first formation of he- lical structure and the other transitions occurred dur- ing propagation. For the larger molecules, transitions were even more infrequent. For example, the -hair- pin and villin simulations typically had a single tran- sition that occurred early in the folding process, ac- companying the collapse of the protein chain. Thus, we find that for the larger molecules, transi- tion detection was likely not necessary, since the transitions occurred earlier and thus the simulations were essentially running independently (as suggested in the subsection above). We suggest that the transi- tions were not needed since these larger molecules fold with single exponential kinetics, and thus have a single rate-limiting step. The helices are potentially different: the rates of nucleation and propagation of helices in our model are not highly separated (e.g., see the trajectories in Figure 2) and thus transition detec- tion may be needed in the helix case, but not for the -hairpin or villin molecules. 106 Pande et al. CONCLUSIONS Comparison to Experiment With a wide range of molecules under study, from the nonbiological PPA helices to the 36-residue villin headpiece, we have simulated a set of molecules with a range of folding times spanning over four orders of magnitude, from nanoseconds to tens of microsec- onds. Since the primary means of comparison to ex- periment is the comparison of rates determined by simulation and experiment, we concentrate on our prediction of rates. Figure 8 shows a striking agree- ment between predicted and experimental rates (see Table I for details). Of course, with just four mole- cules simulated, it is unclear whether this agreement is simply fortuitous. In order to more fully address this question, we plan to simulate the folding kinetics of additional molecules, including larger and more slowly folding proteins. Indeed, more recent work on a small -fold (C. Snow et al., Nature, 2002, in press) and villin (B. Zagrovic et al., J Mol Biol, 2002, in press) also result in strong agreement with experi- mental rates. With this quantitative agreement with experiment, it is also interesting to ask how do our results reflect upon the quality of modern force fields? On the sur- face, one might conclude that our agreement with experiment is evidence that force fields are suffi- ciently accurate. We stress that the only question that can truly be addressed by our work is whether force fields are sufficiently accurate to reproduce experi- mental rates and structures. Ignoring for the moment the possibility that the agreement may be fortuitous, the agreement between our simulations and experi- ments suggest that force fields are sufficiently accu- rate to predict the folding rates of small proteins. Indeed, this accuracy can be quantified in terms of the strong correlation (R2  0.996) and low p value (0.000026) of the logarithms of the predicted to ex- perimental rates. However, this statement should def- initely not be overgeneralized—it is unclear whether the analogous rate prediction for large protein folding would be similarly accurate or whether these results are fortuitous (such that the simulation of additional proteins would weaken the correlation). We are cur- rently addressing this question by examining the fold- ing of different and larger proteins. What Have We Learned About the Protein Folding Mechanism? The question of “how proteins fold” has been asked for decades, and remains a difficult problem due to the complexities and difficulties of computational and experimental methods. However, the methods pre- sented here have allowed us to understand, for the first time, the folding mechanism for some small fast fold- ing proteins, in atomistic detail with experimentally validated rates (Figure 8 and Table I). We have been able to discern the mechanism of a few particular proteins, but it is unclear whether we can expect these to generalize to larger and more complex proteins. An understanding of the mechanism of larger proteins will likely require further direct simulation. However, considering the diversity of mechanistic results found even in these small proteins, it seems reasonable to consider that there may not be a single, universal folding mechanism. Indeed, evolution may be mech- anistically agnostic and may have selected proteins for function, without concern for folding mechanism. This could lead to a variety of protein folding mech- anisms (even for sequences which fold to the same structure), and thus there may not be a single answer to the question of “how proteins fold.” Future Perspectives The ensemble dynamics technique coupled with dis- tributed computing has allowed us to break funda- mental computational barriers in the dynamics of complex systems, such as protein and polymer fold- ing. However, one need not build a distributed com- puting infrastructure to gain the benefits of our meth- ods. Indeed, with a cluster accessible to almost any group (e.g., a hundred PCs), one can simulate 100 ns in a day (assuming 1 ns/processor). This is a signifi- cant advance over state of the art of traditional parallel molecular dynamics.6 Of course, the combination of our method on top of traditional parallel MD (i.e., using traditional MD to speed up individual simula- tions to the maximum scalability of parallel MD and then using our method to statistically sample runs) may lead to the greatest advance, especially on mas- sively parallel architectures with millions of proces- sors, such as IBM’s proposed million processor Blue Gene supercomputer. Moreover, this technique should have broad appli- cability to any dynamical system that progresses by crossing free energy barriers, especially in the most intractable problems with high free energy barriers. It could also serve to augment existing computational methods, such as path sampling20 (which requires simulating a fast trajectory over the relevant free energy barriers) or the determination of transition states using pfold analysis50 (which is currently hin- dered by simulations dwelling in transiently stable intermediate states). Protein Folding Simulations on the Submillisecond Time Scale 107
Docsity logo



Copyright © 2024 Ladybird Srl - Via Leonardo da Vinci 16, 10126, Torino, Italy - VAT 10816460017 - All rights reserved