The 5' untranslated region (5'UTR) plays a crucial regulatory role in messenger RNA (mRNA), and modified 5'UTRs are widely used in vaccine production, gene therapy, and related applications. However, manual optimization of 5'UTRs makes it difficult to balance the effects of the various cis-elements. Consequently, multiple 5'UTR libraries have been created, and machine learning models have been used to analyze and predict translation efficiency (TE) and protein expression, providing insights into critical regulatory features. On the one hand, these screening libraries, which are based on TE and mean ribosome load, struggle to quantify protein expression accurately; on the other hand, quantifying 5'UTRs precisely requires a considerably more expensive library. To resolve this dilemma, we constructed a library that uses firefly luciferase as the reporter to measure protein expression accurately. In addition, we optimized the library construction method by clustering mRNA sequences to remove redundant data and minimize the size of the dataset. This dual strategy of increasing accuracy and reducing dataset size was found to be effective in predicting the 5'UTRs from the PC3 cell line.
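The abstract does not state which clustering algorithm or similarity measure was used to reduce redundancy. As a purely illustrative sketch, the idea of keeping one representative per group of similar sequences can be shown with greedy clustering on k-mer Jaccard similarity. The function names, the k-mer size, and the 0.5 threshold below are assumptions chosen for the example, not values taken from the study.

```python
from typing import List, Set


def kmers(seq: str, k: int = 4) -> Set[str]:
    """Return the set of k-mers in a sequence (DNA 'T' is normalized to RNA 'U')."""
    seq = seq.upper().replace("T", "U")
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two k-mer sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def greedy_cluster(seqs: List[str], k: int = 4, threshold: float = 0.5) -> List[List[int]]:
    """Assign each sequence to the first cluster whose representative is
    similar enough; otherwise start a new cluster with it as representative.

    Hypothetical helper: the threshold and k are illustrative assumptions.
    Returns clusters as lists of indices into `seqs`; the first index in
    each cluster is its representative.
    """
    representatives = []  # k-mer set of each cluster's first member
    clusters: List[List[int]] = []
    for i, seq in enumerate(seqs):
        ks = kmers(seq, k)
        for c, rep_ks in enumerate(representatives):
            if jaccard(ks, rep_ks) >= threshold:
                clusters[c].append(i)
                break
        else:
            representatives.append(ks)
            clusters.append([i])
    return clusters


if __name__ == "__main__":
    # Made-up example sequences, not data from the study.
    utrs = [
        "GGGAAAUAAGAGAGAAAAGAAGAG",
        "GGGAAAUAAGAGAGAAAAGAAGAC",  # differs from the first by one nucleotide
        "CCUCCUGGCUCGCUGCCGCCACC",
    ]
    clusters = greedy_cluster(utrs)
    # Keep only the representative of each cluster to shrink the library.
    reduced = [utrs[c[0]] for c in clusters]
    print(clusters, len(reduced))
```

Keeping only the first member of each cluster is the simplest way to shrink a library; a real pipeline would more likely rely on a dedicated sequence-clustering tool and a biologically motivated similarity cutoff.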