Structuring a dataset
This article is part of the publishing journey for statistics.gov.scot. You can find the full journey in Publishing open data on statistics.gov.scot.
To upload to statistics.gov.scot successfully, datasets must meet the criteria specified in this article. We require linked datasets to be in tidyformat. In having all datasets in this format, we are avoiding the issue of messy data:
“Tidy datasets are all alike, but every messy dataset is messy in its own way.” ~ Hadley Wickham.
An example of a dataset that would upload successfully is shown in the screenshot below.
Column headers are case sensitive – the first letter of each word should be capitalised. As a minimum, each dataset must have the 5 columns highlighted in red, with the following headings:
GeographyCode/FeatureCode
The standard geography register code relating to the geography. This may also be called the ‘FeatureCode.’
Please be aware that the system reads the feature code and creates two new columns in the dataset: ‘FeatureName’, and ‘FeatureType’. These columns provide more information on the geographies of the observations and are produced automatically. As these columns are created from the FeatureCode value, these two new columns do not create any issues for your draft.
DateCode
The date to which the observation is related. We have multiple formats available for this, which are as follows:
- Calendar year e.g., 2016.
- Quarter of a calendar year, e.g., 2016-Q2 (for April-June 2016).
- Month e.g., 2016M08. Government/Financial year e.g., 2009/2010 (for April 2009 - March 2010).
- Range of years, 2005-2007 (currently under consideration).
- Range of government years, 2005/06 - 2009/10 - currently a 4-year period in that example, but for consistency with the range of calendar years issue, should become a 3-year period from 1 April 2005 to 31 March 2009.
- Weeks.
- Single dates.
Measurement
A range of measurement types are available, but count (number), or ratio (percentage, or per x) are the most used.
Units
The unit type for the data – think of how you would describe it in a sentence. For example, if it were GBP currency you would set it to Pounds.
Value
The observed value at the geography, date, measurement, and unit (e.g., total pancakes cooked in 2010 in S01000001).
You can use special characters to indicate where a value has been suppressed or it is not available. The convention is using an asterisk ("*") when values have been supressed due to statistical disclosure control or a small sample size, which makes them unreliable, and using a hyphen ("-") when data are not available due to varied reasons, like a change in the methodology, data not able to be collected or other reasons.
Others
Alongside these five mandatory columns, you are able to add additional columns relevant to your data. For example, in the screenshot above “Type of animal” is a 6th dimension to the data, which describes the type of pet. The story of the first row in that example would be “For area S01000001 in 2014, there were 10 pets which were cats”.
The column header will appear as the dimension name within the data set, please try to name them appropriately so that they make sense regarding the header.
You can add as many additional columns as you feel are necessary for the dataset, however, please be aware that a value must exist for every single observation, which can quickly create large data sets.
For example, in the above example if we have 3 types of animals, there would be 6505 data zones * 3 types of animal = 19515 observations. If we were to add an additional dimension for example colour, this would then be 6505 * 3 * number of colours observations.
Large data sets begin to cause performance issues on the site, so it may be more appropriate if you have many different dimensions to split the data up into several smaller data sets. The statistics publishing team are happy to advise on this.
Others - Naming conventions
Where possible, please refer to the defined concepts list available as a list of vocabularies on the website and use these as naming conventions. It is unlikely that different statistics publications will use the same names, however for things like “Age” these would be likely to be about a range of ages so could be under the same concept. In this case, case sensitivity is important and must match existing concepts; otherwise, an upload error will occur.