Plot a histogram of annual compensations: Python Data Analysis series part 2
In the first part of this series we went through some exploratory data analysis of ages to filter out the bad data, and at the end we plotted a bar chart with the age frequencies. Today we are working on the annual compensations from the 2020 Stack Overflow Developer Survey results.
This will involve binning the values so that we can plot them in a histogram at the end. For that, we need to create bin labels (to improve the visualization) and the bin intervals. We’ll make plenty of use of the wonderful list comprehension feature of Python!
While Plotly can bin data on its own, given the number of bins to create, in this demo I’m taking you through the approach of creating custom bins. Not only can we control precisely the bin intervals, but we can also reuse some of the code to create bin labels to improve the visualization at the end.
Before we get into it, here are some handy links you may need:
- Part 1: Analyse the distribution of ages
- Part 3: Analyse the education level of respondents
- Part 4: Unpivot delimited data
- Part 5: Move your Jupyter notebooks to an Azure DataBricks workspace
- Jupyter notebook for this article
Don’t forget that, if you prefer, you can read the notebook instead of this article; it has the same information. Just return here for the links to the other parts of the series :)
Analyse the annual compensations
The numerical column of data we are working with today has the annual compensation of respondents, converted to USD. As before, we are working with this column outside of the context of the dataset, i.e., we only want to put this single column in a suitable format for plotting a histogram.
Those first eight lines are exactly the same as before: they import the libraries and load the dataset. Line 10 keeps only the column we are interested in, “ConvertedComp”. I did not mention this in the previous part, but the double brackets in
data = data[["ConvertedComp"]]
are used to return a one-column DataFrame instead of a Series of “ConvertedComp”. For the transformations we are doing it’s easier to work with the data as a DataFrame.
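If you want to see the difference for yourself, here is a minimal sketch. The tiny `sample` DataFrame is made up for illustration; only the column name comes from the survey.

```python
import pandas as pd

# Made-up stand-in for the survey data; only "ConvertedComp" is a real column name.
sample = pd.DataFrame({"ConvertedComp": [50_000.0, 120_000.0], "Age": [30, 41]})

as_series = sample["ConvertedComp"]    # single brackets -> pandas Series
as_frame = sample[["ConvertedComp"]]   # double brackets -> one-column DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (2, 1)
```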
data = data[(data["ConvertedComp"] >= 0) & (data["ConvertedComp"] <= 200_000)]
Since we already went through exploratory data analysis in the previous article, I skipped that code and included only the filter, but by all means feel free to explore the data and establish your own sensible limits for the values reported in the survey. I chose to keep only compensations between $0 and $200,000 (line 12). Oh, and notice the underscore (_) used in the numbers. It is a neat trick available in Python to visually separate groups of digits in large numbers. Python ignores the underscore when reading the number, but it does improve readability for us developers.
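Here is a small sketch of the filter at work on made-up values (the `sample` and `filtered` names are just for this illustration). Note that missing answers are dropped too, since comparisons against NaN evaluate to False.

```python
import pandas as pd

# The underscore is only a visual separator: both literals are the same integer.
print(200_000 == 200000)  # True

# Made-up compensations; NaN stands in for a respondent who left the field empty.
sample = pd.DataFrame(
    {"ConvertedComp": [-1.0, 0.0, 85_000.0, 200_000.0, 1_000_000.0, float("nan")]}
)

# Same filter as in the article: keep values between 0 and 200,000, both inclusive.
filtered = sample[(sample["ConvertedComp"] >= 0) & (sample["ConvertedComp"] <= 200_000)]
print(filtered["ConvertedComp"].tolist())  # [0.0, 85000.0, 200000.0]
```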
bin_labels = [
    f"[{int(i / 1_000):,}K, {int((i + 15_000) / 1_000):,}K)"
    for i in range(0, 200_001, 15_000)
]
On line 15 we start working on the bins. In that line we create the bin labels, which consist of the intervals with the compensations displayed in thousands of $ (mind the K in the strings). We combine Python’s f-strings with a list comprehension to create the list of labels with much more concise code. Also note how the string labels are closed on the left ([) and open on the right ()). This is how the bin intervals will be treated as well.
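To make the result concrete, this sketch runs the same list comprehension and prints a few of the labels it produces:

```python
# Same comprehension as in the article: one label per 15,000-wide bin.
bin_labels = [
    f"[{int(i / 1_000):,}K, {int((i + 15_000) / 1_000):,}K)"
    for i in range(0, 200_001, 15_000)
]

print(len(bin_labels))   # 14
print(bin_labels[0])     # [0K, 15K)
print(bin_labels[1])     # [15K, 30K)
print(bin_labels[-1])    # [195K, 210K)
```

The last bin starts at 195K, so it runs past the 200,000 cut-off used in the filter; that upper value simply lands inside it.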
compensation_bins = pd.IntervalIndex.from_tuples(
    [
        (i, i + 15_000)
        for i in range(0, 200_001, 15_000)
    ],
    closed="left"
)
On line 20 we use very similar code to create the bin intervals. The biggest difference in this second list comprehension is that we are creating tuples of integers for the intervals instead of strings. The resulting list is passed to IntervalIndex.from_tuples. In other words, on line 20 we create the actual bin intervals, specifying that we want them closed on the left side, as per the closed="left" argument. The open and closed notation in the string labels is purely visual; here we actually define which side is closed.
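A quick way to convince yourself of what closed="left" means is to test a few boundary values against the intervals. This is only a sketch; `first_bin` and `second_bin` are names I made up for the example.

```python
import pandas as pd

compensation_bins = pd.IntervalIndex.from_tuples(
    [(i, i + 15_000) for i in range(0, 200_001, 15_000)],
    closed="left",
)

first_bin = compensation_bins[0]    # covers 0 up to, but not including, 15,000
second_bin = compensation_bins[1]   # covers 15,000 up to, but not including, 30,000

print(compensation_bins.closed)  # left
print(0 in first_bin)            # True: the left edge belongs to the bin
print(15_000 in first_bin)       # False: the right edge does not
print(15_000 in second_bin)      # True: it is the left edge of the next bin
```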
data = pd.cut(
    data["ConvertedComp"],
    compensation_bins,
    precision=0,
    include_lowest=True
)
On line 28 we perform the binning with the cut function. precision sets the precision at which the bin labels are stored and displayed (here, integer precision), and include_lowest asks for the first interval to be left-inclusive. Since our intervals are already closed on the left, the lowest value, 0, lands in the first bin either way.
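Here is a minimal sketch of the binning on a handful of made-up values, so you can see which bin each one falls into. I left out the optional precision and include_lowest arguments to keep it short, and `sample` and `binned` are illustrative names.

```python
import pandas as pd

compensation_bins = pd.IntervalIndex.from_tuples(
    [(i, i + 15_000) for i in range(0, 200_001, 15_000)],
    closed="left",
)

# Made-up compensations sitting on or next to bin edges.
sample = pd.Series([0, 14_999, 15_000, 200_000], name="ConvertedComp")

# Each value is replaced by the interval that contains it (a categorical Series).
binned = pd.cut(sample, compensation_bins)

# Position of each value's bin within compensation_bins.
print(binned.cat.codes.tolist())  # [0, 0, 1, 13]
print(binned.astype("str").tolist())
```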
As I mentioned earlier, it is possible to let Plotly bin the data on its own by telling it how many bins we want. However, with this approach we kept control of the precise bin intervals to use and the type of intervals (closed on the left), and we could even reuse the code for the labels.
data.sort_values(inplace=True)
data = data.astype("str")
From here on it is smooth sailing. Lines 34 and 36 sort the binned values and convert them to strings, respectively. The sort comes before the cast to string because this way the values are sorted in numerical order; otherwise the strings would be sorted alphabetically. Oh, and we need them as strings because otherwise Plotly would complain about the interval values.
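To see why the order of those two steps matters, here is a small sketch with three made-up bins and values (the variable names are just for this example):

```python
import pandas as pd

# Three hand-picked bins so that alphabetical and numerical order disagree.
bins = pd.IntervalIndex.from_tuples(
    [(0, 15_000), (15_000, 30_000), (105_000, 120_000)],
    closed="left",
)
binned = pd.cut(pd.Series([110_000, 20_000, 5_000]), bins)

# Sort while the values are still intervals, then cast: bin (numerical) order.
sorted_then_cast = binned.sort_values().astype("str").tolist()

# Cast first, then sort: plain string comparison, so "[105000" sorts before "[15000".
cast_then_sorted = binned.astype("str").sort_values().tolist()

print(sorted_then_cast)
print(cast_then_sorted)
```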
Finally, the rest of the code is to plot the histogram and give the visual some nice formatting.
Note how the X-axis labels use the bin labels we defined before with the K-abbreviated values and not the complete numbers from the intervals. This way we did not have to add an extra data transformation but still delivered a more readable visualization.
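As a rough sketch of how the interval strings and the K labels line up, the snippet below pairs them one to one. The `label_by_interval` lookup is a name I made up for this illustration and is not necessarily how the notebook wires the labels into the chart.

```python
import pandas as pd

edges = range(0, 200_001, 15_000)
bin_labels = [
    f"[{int(i / 1_000):,}K, {int((i + 15_000) / 1_000):,}K)" for i in edges
]
compensation_bins = pd.IntervalIndex.from_tuples(
    [(i, i + 15_000) for i in edges], closed="left"
)

# Hypothetical lookup: stringified interval -> readable K label.
# Both lists were built from the same edges, so they pair up position by position.
label_by_interval = {
    str(interval): label for interval, label in zip(compensation_bins, bin_labels)
}

# Made-up compensations, binned, sorted and cast to strings as in the article.
binned = pd.cut(pd.Series([20_000, 5_000, 22_000]), compensation_bins)
binned = binned.sort_values().astype("str")

readable = binned.map(label_by_interval)
print(readable.tolist())  # ['[0K, 15K)', '[15K, 30K)', '[15K, 30K)']
```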
And with this we’ve reached the end of part 2. We did everything needed to bin a column of numerical data, creating both the bin labels for the visualization and the bin intervals for the data transformations. At the end, we created a nice histogram that shows which compensation intervals are most common. Maybe we could’ve used smaller intervals and the data would’ve revealed different answers, but I will leave that up to you :)
To conclude, I leave you with some handy links for this series: