Writing Tips

This page discusses some observations that we have made when reading assignments for Statistics 504. Many of these points apply to any writing about data, but a few of them are specific to this course. It is very important to keep in mind that most of the points made here are general guidelines, not absolute rules. In a specific context, you should use your judgment and violate any of these principles if you feel that it is appropriate to do so.

Content/overall approach #

  • Make specific claims and assertions, then back them up with specific evidence and arguments.

  • Focus on findings that follow directly from your analysis of the data at hand. For this class, you do not need to research the scientific literature to corroborate or interpret any of your findings.

  • Be confident about what you have found, but be modest about its implications. It is very unlikely that your data analysis will change the world.

  • Be realistic about what can be accomplished when addressing a complex and challenging question. One analysis of one data set will only very rarely allow definitive or deeply novel insights to be gained.

  • Don’t use “hedge words” unnecessarily, but do use qualifications whenever a definite statement cannot be justified. If you find yourself hedging too frequently, your findings may be too weak to report.

  • Avoid “straw man” arguments, for example, spending a lot of time talking about the difficulty of answering questions that are tangential or unrelated to the main analysis goal.

  • Don’t enumerate things that are impossible to do with the data that you have. Don’t use much (or any) space in your writing talking about things that you did not do (even if you think that you have an interesting reason for not having done them).

  • Avoid presenting your work as if you followed a set script. There are no pre-defined scripts for data analysis. The flow of your report should follow the logic of the question you are aiming to address, and the evidence and interpretation that you use to support your claims.

  • Discussion of how you “cleaned” the data is only relevant if it is needed for the reader to understand your question, approach, and conclusions.

  • Try to make your writing as self-contained as reasonably possible. For this course, you can assume that the reader has a knowledge of statistical methodology similar to yours. Nevertheless, it helps to reinforce certain key points by explaining the statistical reasoning behind your conclusions.

  • To make the writing more realistic, do not refer to our class as such, or to prior assignments.

Analysis #

  • Avoid over-summarization of data. In most cases, if you report a statistic such as the mean based on more than a few hundred observations, you have missed an opportunity to delve deeper into the data, e.g. by reporting the means for relevant subgroups.

  • Provide quantitative context around your key findings. For example, if your primary point is about the mean of some measure, it may also be important to report its standard deviation. Context can also be provided by reporting the mean and standard deviation of the variable in relevant subgroups, not only in the dataset as a whole.

  • Unambiguously define any measures or variables that you use. For example, if you are looking at crime, is it the absolute number of crimes or the crime rate? When reporting rates, always be explicit about the denominator of the rate (e.g. violent crimes per 10,000 population per year).

  • If you are presenting a regression analysis, state unambiguously which variable is the dependent variable and which variables are the independent variables.

  • Avoid presenting lists of descriptive statistics without using them to support a larger point, e.g. don’t report the mean of every variable in a dataset without stating what we learn from knowing these means.

  • Avoid over-simplified analyses, since most systems in the real world have nonlinear and interactive behavior. When modeling heterogeneous data, you should generally aim to fit as complex a model as you have statistical power to estimate. If presenting summary statistics for heterogeneous data, present them at an appropriately disaggregated level.

  • Don’t overstate or misstate the role of Gaussianity in statistical analyses. Often it is a minor consideration. It is almost never important that the data are Gaussian. To the extent that Gaussianity matters, it is usually only that summary values should be Gaussian in order to calibrate tests and confidence intervals.
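
The advice above about reporting means and standard deviations for relevant subgroups, rather than a single overall mean, can be sketched as follows. This is a minimal illustration, not a prescribed workflow: the `records` values and the group labels "A" and "B" are invented, and the helper names `summarize` and `summarize_by_group` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical records: (group label, measured value). Invented, not real data.
records = [
    ("A", 10.0), ("A", 12.0), ("A", 14.0),
    ("B", 20.0), ("B", 30.0), ("B", 40.0),
]


def summarize(values):
    """Return the sample size, mean, and sample standard deviation."""
    return {"n": len(values), "mean": mean(values), "sd": stdev(values)}


def summarize_by_group(rows):
    """Group (label, value) pairs by label and summarize each group."""
    groups = defaultdict(list)
    for label, value in rows:
        groups[label].append(value)
    return {label: summarize(vals) for label, vals in sorted(groups.items())}


# The overall summary hides the fact that the two groups differ
# in both their center and their spread.
overall = summarize([value for _, value in records])
by_group = summarize_by_group(records)

print("overall:", overall)
for label, stats in by_group.items():
    print(label + ":", stats)
```

In this made-up example the overall mean falls between the two groups and describes neither of them well, and reporting the sample size alongside each summary keeps the quantitative context explicit.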

Organization and structure #

  • Always include a title. The title should convey something specific about the main focus of your writing. It should reflect the main substantive question that you are addressing, and may also refer to the data that you will be using. You generally will not want to mention the analytic methodology (e.g. Poisson regression) in the title.

  • The introductory paragraph should convey the main topic and focus of your writing. You should also discuss the data set and the methodology that you plan to use, but the primary substantive question is the most important information and should be covered first. You may need to provide a bit of background information to motivate your question, but keep the background discussion in the first paragraph brief.

  • It should be possible to understand your main ideas by reading the essay once from top to bottom. Avoid making statements that are only meaningful after reading something that comes later in the essay (even if it is in the very next sentence).

Language and style #

  • Avoid the passive voice.

  • Avoid presenting your work like a recipe or diary, e.g. “First I did …, then I did …”.

  • Avoid excessive use of “would”, e.g. “the analysis would provide us with some insights…” is better written as “the analysis will provide us with some insights…” or “the analysis provides us with some insights…”

  • Limit ambiguous internal cross-references, e.g. “above”, “below”, “before” (used to refer to something that appears elsewhere in your report). Use such references only if it is unambiguous what you are referring to; otherwise be more specific (e.g. provide a section number if you are using them, or use language such as “in the regression analysis presented above we saw that…”).

  • Favor short sentences and paragraphs. If a sentence has more than two clauses, consider reworking it into multiple shorter sentences. If a paragraph has more than five sentences, consider splitting it into multiple shorter paragraphs.

  • Don’t start a sentence with “And”. Avoid starting sentences with “Such”, or starting non-interrogative sentences with “Which”.

  • Avoid generic and low-information statements such as stating that something is “reasonable” or “interesting”.

  • Use “etc.” only when it is unambiguous what it could be replaced with.

  • Use consistent tense. If you are describing something concrete that you did (like conducting an analysis of a data set), it may be appropriate to use past tense (e.g. “I dropped all rows with missing data”). If you are describing a research finding, it is usually preferable to use present tense (“smoking and age are positively associated” not “smoking and age were positively associated”).

Terminology #

  • Don’t be afraid to use technical terms as needed, even when writing for a somewhat non-technical audience, but use them precisely and only where the exact definition is important. Avoid “jargon” (technical language that is used informally, needlessly excluding people from a discussion).

  • Since the word “significant” has a specific technical meaning in statistics, it is best to avoid using it in its colloquial sense. In general, it is best to use the phrase “statistically significant” when you mean that a formal statistical uncertainty assessment has been used to support a finding. Using “significant” on its own can be ambiguous since it has both a common-language meaning and a technical meaning.

  • The word “valid” is best avoided. In addition to being vague, this term reinforces a binary interpretation of statistical evidence, i.e. that there is a sharp line between analytic methods that are right or wrong (“valid” or “invalid”). Most statisticians prefer to think in more continuous terms. An analysis should be meaningful (otherwise don’t present it), but nearly always will have limitations. It is neither fully valid nor fully invalid.

  • “Assumption” is very commonly used when describing statistical methods, but is arguably over-used. A good alternative term is “condition”. Your findings will be more meaningful if the conditions approximately hold, and less meaningful if they are strongly violated. If you do choose to refer to “assumptions”, in applied statistics we usually want to argue that the assumptions hold “approximately”, not “exactly”.

  • It is important to distinguish “testable assumptions” (which need not really be assumptions at all if you carefully assess them) from “untestable assumptions”, which you must take on faith. Of course, most assumptions are neither completely testable nor completely untestable, but fall somewhere in between.

  • Keep in mind the distinction between “methods” and “models”. A “method” is any technique used for analyzing and gaining insight from data. A “model” is a specific mathematical or computational formalism that aims to describe how a system works. For any given class of models, there is a corresponding class of methods that can be used to fit those models to data. For example, we have “linear models” and “least squares regression” – the former is a class of models, the latter is a class of techniques for fitting models to data. It is usually not appropriate to write “least squares models”, since it confuses these two distinct ideas.

Causality #

  • Causality is a fundamental issue in most statistical analysis. In some cases, due to how the data were collected and analyzed, it is possible to interpret associations in causal terms. More often, however, we are unable to be strongly confident about the extent to which any findings reflect mechanisms in a causal sense.

  • Certain phrases such as “the effect of x on y” can be re-worded to convey that the findings are not meant to be interpreted causally. For example, it is possible to discuss “the relationship between x and y”, or to state that “x predicts y”.

  • When discussing a regression model, it is better to refer to the “coefficient for x” or the “slope parameter for x” rather than the “effect of x”, since the latter implies causality.

  • It is usually not necessary to eliminate all hints of causal thinking from your writing. Science is primarily concerned with identifying causal relationships. Evidence for causality does not need to come solely from the data that you are analyzing. In some cases, there may be a basis for interpreting results in a causal manner even when the data themselves do not demonstrate causality.

  • There is a branch of statistics called “causal inference”. Methods from this field, such as weighting, stratification, and matching, can be useful ways to assess whether a data set supports causal interpretations of findings. However, there is not a bright line between methods from “causal inference” and other statistical methods. A regression model with suitable control variables may support a causal interpretation, even though it is not viewed as being a technique from causal inference. In general, demonstrating causality is a trade-off between rigor and power. Causal inference tends to favor rigor even when a great sacrifice in power results. Mainstream regression analysis favors a different balance in which some effort is made to promote a causal interpretation (e.g. by controlling for known confounders), but efforts to achieve causality are balanced with efforts to maintain power.
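
As a rough sketch of why stratifying on a known confounder can change the story that a pooled comparison tells, consider the following. The counts are invented for illustration, the stratum labels and helper names are hypothetical, and agreement across strata does not by itself demonstrate causality.

```python
# Hypothetical counts, invented for illustration: (successes, total) for
# treated and untreated units within two strata of a confounder.
strata = {
    "low risk": {"treated": (90, 100), "untreated": (800, 1000)},
    "high risk": {"treated": (300, 1000), "untreated": (20, 100)},
}


def proportion(successes, total):
    """Proportion of successes among `total` units."""
    return successes / total


def pooled_difference(data):
    """Treated minus untreated success proportion, ignoring the strata."""
    pooled = {}
    for arm in ("treated", "untreated"):
        successes = sum(data[name][arm][0] for name in data)
        total = sum(data[name][arm][1] for name in data)
        pooled[arm] = proportion(successes, total)
    return pooled["treated"] - pooled["untreated"]


def stratified_differences(data):
    """Treated minus untreated success proportion within each stratum."""
    return {
        name: proportion(*arms["treated"]) - proportion(*arms["untreated"])
        for name, arms in data.items()
    }


pooled = pooled_difference(strata)
within = stratified_differences(strata)

print("pooled difference:", round(pooled, 3))
for name, diff in within.items():
    print(name, "difference:", round(diff, 3))
```

With these made-up counts the pooled difference is negative while the difference within each stratum is positive, because treated units are concentrated in the high-risk stratum. Whether the stratified comparison deserves a causal reading still depends on how the data were collected.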

Graphics #

  • Avoid pie charts in almost all situations.

  • Bar graphs are a very elementary and limited form of statistical graphic. It is almost always possible to create something more informative. For example, side-by-side box plots show the median (a measure of center comparable to the mean that a bar graph often shows), but also convey information about the dispersion. Or, you can identify another attribute of the data underlying each bar, and then show a scatterplot of the primary feature against this secondary attribute. Another good alternative to bar graphs is a dot plot.

Grammar #

  • In general, it is better to avoid contractions (e.g. “it’s”) in this type of writing.

Statistical content #

  • Statistical analysis usually aims to address a question about a population, based on a sample of data from the population. Be sure to make clear what the relevant population for your analysis is, and how the sample was obtained.

  • In almost any report, you should state the sample size for any analyses that you are reporting.

  • Avoid fitting overly simple models to large datasets. Underfitting is just as serious a problem as overfitting.

  • Be explicit when discussing percentage changes in variables that are themselves percentages. The term “percentage points” can clarify your meaning in this setting. For example, if you say that the unemployment rate increased by 3% (say, from 3%), it is not clear if you mean that it increased by 3 percentage points (to 6%), or by 3% of 3% (to 3.09%).

  • Categorical variables must usually be coded into multiple dummy variables before being considered in a regression model. The interpretation of the coefficients for these dummy variables is only meaningful relative to the reference category (if there is one, or more generally it can only be interpreted in light of the coding scheme). It is critical to consider and clearly report the coding scheme when discussing parameter estimates for categorical variables.

  • It is fine to do a certain amount of “descriptive analysis”, but avoid providing descriptive statistics for no clear reason.

  • Avoid referring to single data values (e.g. outliers, extremes). Statistical data analysis is almost always about abstracting away from the specific data that you have observed and saying something about the population. Single data values in your data set rarely provide much evidence regarding properties of the population.

  • One of the main contributions that someone with advanced statistical training can make in a data analysis is to reveal what the data say after removing or controlling for certain forms of unwanted variation, or to uncover how various subgroups relate to (or differ from) each other, or to adjust for confounding, selection biases, or other measurement artifacts and inadequacies. This can be accomplished in many ways, for example, by reporting summary statistics on stratified data, or by using a model-based adjustment (e.g. regression).

  • Administrative geographical units (e.g. U.S. states and counties) are generally very unbalanced in terms of population. When using such an unbalanced partition as a grouping variable, it is rarely of interest to report on the absolute amount of any variable. For example, in the case of U.S. states, virtually everything (GDP, murders, bankruptcies, airports, …) will track with state population. If you rank the states by any of these features, you will usually get a list that starts with California, Texas, Florida, etc. In nearly all cases, when working with an unbalanced grouping you should report some type of rate or other normalized value. Note that this does not imply that you should never model absolute quantities. But whether you are reporting modeled results, or raw data, you should generally present your findings on a relative scale.
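
Two of the points in this section come down to simple arithmetic that is easy to state ambiguously: percentage points versus percent change, and absolute counts versus rates. The sketch below illustrates both under stated assumptions. The region names and counts are invented, and the helpers `describe_change` and `rate_per` are hypothetical names introduced only for this example.

```python
def describe_change(old_pct, new_pct):
    """Return (change in percentage points, relative change in percent)."""
    points = new_pct - old_pct
    relative = (new_pct - old_pct) / old_pct * 100
    return points, relative


def rate_per(events, population, per=100_000):
    """Events per `per` population, so the denominator is always explicit."""
    return events * per / population


# An unemployment rate moving from 3% to 6% versus from 3% to 3.09%.
points_a, relative_a = describe_change(3.0, 6.0)
points_b, relative_b = describe_change(3.0, 3.09)
print("3% to 6%:", points_a, "points,", relative_a, "percent")
print("3% to 3.09%:", round(points_b, 2), "points,", round(relative_b, 2), "percent")

# Hypothetical counts for three made-up regions; not real data.
regions = {
    "Region X": {"events": 500, "population": 10_000_000},
    "Region Y": {"events": 120, "population": 1_000_000},
    "Region Z": {"events": 30, "population": 150_000},
}

# Ranking by raw count mostly reproduces the ranking by population.
by_count = sorted(regions, key=lambda r: regions[r]["events"], reverse=True)
# Ranking by rate compares regions on a relative scale.
by_rate = sorted(regions, key=lambda r: rate_per(**regions[r]), reverse=True)

print("ranked by count:", by_count)
print("ranked by rate per 100,000:", by_rate)
```

In this invented example the two rankings are exactly reversed: the most populous region has the most events but the lowest rate. When you report a rate, state the denominator (here, per 100,000 population) next to it.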