Indeed. From our own experience: we use pretty much off-the-shelf maximum entropy parameter estimators for parse disambiguation and fluency ranking. In the past ~10 years most of the gain has come from smart feature engineering: applying linguistic insights, analyzing common classes of classification errors, etc. Beyond L1 or L2 regularization, the use of (even) more sophisticated machine learning algorithms/techniques has not yet given much, if any, improvement for these tasks in our system.
What did help in understanding models is the application of newer feature selection techniques that give a ranked list of features, such as grafting.
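For anyone who hasn't seen grafting, here is a rough sketch of the idea, not our actual code. It assumes a binary logistic regression model rather than a full maxent ranker, and the function names (`fit_active`, `grafting_rank`) and the step-size and penalty values are made up for illustration. At each round the feature whose weight has the largest loss-gradient magnitude is added, as long as that magnitude exceeds the L1 penalty; then the active weights are re-fit. The order in which features enter is the ranked list.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_active(X, y, w, active, lam, steps=500, lr=0.1):
    """Re-fit only the active weights with proximal gradient descent
    on the L1-penalized logistic loss (soft-thresholding step)."""
    idx = np.array(active)
    for _ in range(steps):
        grad = X[:, idx].T @ (sigmoid(X @ w) - y) / len(y)
        v = w[idx] - lr * grad
        w[idx] = np.sign(v) * np.maximum(np.abs(v) - lr * lam, 0.0)
    return w

def grafting_rank(X, y, lam=0.05, max_features=None):
    """Return (features in the order they were added, final weights)."""
    n, d = X.shape
    w = np.zeros(d)
    active = []
    limit = d if max_features is None else max_features
    while len(active) < limit:
        # Gradient of the unpenalized loss with respect to every weight.
        grad = X.T @ (sigmoid(X @ w) - y) / n
        if active:
            grad[active] = 0.0  # only consider features not yet selected
        j = int(np.argmax(np.abs(grad)))
        # Stop when no remaining feature can overcome the L1 penalty.
        if abs(grad[j]) <= lam:
            break
        active.append(j)
        w = fit_active(X, y, w, active, lam)
    return active, w
```

Calling `grafting_rank(X, y)` on a feature matrix `X` (rows are examples) and a 0/1 label vector `y` gives the selection order, which is what makes the resulting model easier to inspect: the features that enter first are the ones the model leans on most.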