1 Using microdata for research

1.1 Background to research data at Statistics Finland

Finnish statistics are an exceptional data source for economic and social research. The data sets at Statistics Finland, collected from registers and surveys, form a comprehensive collection of detailed data about different areas of society. According to the Finnish Statistics Act (280/2004), Statistics Finland may issue a licence to use the confidential data in its possession, originally collected for statistical purposes, for scientific research and statistical surveys of social conditions.

FIONA – remote data access

The use of microdata for research has slowly expanded, especially following the example of the other Nordic countries and the Netherlands. In 2010, we launched a remote access service to allow easy and equal access for researchers to licensed microdata from their local workstation in a monitored and secure environment.
You can use microdata in the remote access system. Working on the remote system’s desktop is similar to working on your own computer, but the remote access environment has no inward or outward network connections. For this reason, any research output you want to export from the environment goes through a separate data protection review.

Pseudoidentifiers improve data protection

The direct identifiers in research data, such as personal identity codes and business identifiers, have been replaced with pseudoidentifiers in the remote access service. Pseudoidentifiers are fake identifiers that replace direct unit identifiers, allowing units (or individuals) to be tracked across data sets and their data to be linked between different sets.

1.2 Data for scientific research and statistical surveys

Statistics Finland compiles microdata based on its diverse statistics and issues user licences for said microdata for scientific research and statistical surveys. The units in microdata can be enterprises, establishments, households or individuals, for example.

Ready-made and tailored research data

Our data offering includes both research data sets that are ready for use and custom research data tailored from the data sets at Statistics Finland. If permitted by the Statistics Act, these data can be combined with your own data or data delivered by other organisations. The data sets and data descriptions suitable for research are continuously developed in cooperation between Statistics Finland experts and researchers.

Data can be combined by using protected unit identifiers that remain static over time (pseudoidentifiers), which allows the statistical units to be tracked across years and data sets. By combining data, the data content required for research can be built to a very extensive degree if necessary. The remote access service allows you to create your own data combination by adding Statistics Finland data to register and survey data collected from other sources.
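The linking described above can be sketched as a simple join on the pseudoidentifier. This is a minimal illustration in plain Python, not the remote access service's own tooling; the field name `pid`, the sample records and the helper `link_by_pid` are all assumptions made for the example.

```python
# Hypothetical records. "pid" stands for a pseudoidentifier, not a real identity code.
employment = [
    {"pid": "A001", "year": 2019, "industry": "C"},
    {"pid": "A002", "year": 2019, "industry": "G"},
    {"pid": "A003", "year": 2019, "industry": "Q"},
]
education = [
    {"pid": "A001", "degree": "master"},
    {"pid": "A003", "degree": "upper secondary"},
]


def link_by_pid(left, right, key="pid"):
    """Left-join two lists of records on a shared pseudoidentifier."""
    lookup = {row[key]: row for row in right}
    linked = []
    for row in left:
        match = lookup.get(row[key], {})
        # Copy every field from the matching record except the key itself.
        merged = {**row, **{k: v for k, v in match.items() if k != key}}
        # Keep track of which units found a match in the second data set.
        merged["matched"] = row[key] in lookup
        linked.append(merged)
    return linked


linked = link_by_pid(employment, education)
```

Keeping a `matched` flag makes it easy to see how many units are missing from the second data set before any analysis is run on the combined data.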

Business activity data

You can research the features and development of business activities by collecting information from group, enterprise and establishment data, which are based on comprehensive administrative registers and surveys.

For example, we provide the following information:

enterprise attributes (industry, location, enterprise form and ownership)
activities (profitability, production, exports, imports, research and development expenditure, innovations, ICT and business subsidies)
personnel (pay, education, occupations and mobility).

Population data

You can analyse the properties, behaviour and history of the population from data based on both registers and surveys.

The following data are available:

employment and employment relationships
unemployment and pensions
pay and working conditions
education, studies and qualifications
income, consumption and time use
housing and construction
criminal matters and causes of death.

Combined employee-employer data provide an opportunity to study different phenomena in the business world and the labour market, including their interactions, through employee characteristics, employee mobility between enterprises and industries and the dynamics of occupational structures, for example. Additional information is available about employees’ pay and pay structures by employer.

1.3 Special features of longitudinal data used for research

The data sets’ panel features provide an extensive description of the historical development of the statistical units. The longest time series for enterprises extend all the way back to the 1970s, and samples of individuals have been registered since the 1950 census. Annual data for the entire working population are available from 1988 onwards.

As a rule, ready-made data sets are updated annually after the statistics have been completed. Some ready-made data have been edited as far as possible to unify or harmonise the changes in data content over time. For example, industry categories, regional division categories and occupations have been pre-harmonised. Variable grouping and summation over time, as well as other processing, have also been carried out in advance for some data sets. Detailed data descriptions are available in the Taika research data catalogue.

Most of the data are not harmonised for changes in categories (e.g. occupation, education, industry, area or product titles) or variable content over time. The target populations and sampling frames may also have changed due to updates to data collection or information systems, for example. We try to include any changes to classifications and categories in the data descriptions.

Statistics Finland – Classifications

Issues with time series

If you want to track changes over time, you must check that the variables you use continue to measure the phenomenon you are researching at different times (over several years, for example). You can trace and harmonise changes based on the data descriptions and classifications.

Concepts and titles must use the same factors

For example, to analyse changes in pay reliably by occupation, the concept of pay must cover the same components (unit of time, basic pay, seniority allowances, overtime bonuses, performance-based bonuses, etc.) in every period, and occupational titles must describe the same occupation or work at different times. The occupational categories defined by employers’ organisations that are used in the national classification system may change significantly at times. The changes can be traced and harmonised as needed by using the data descriptions and classifications.
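One way to harmonise a category that changes over time is a correspondence table that maps each version-specific code to a common code. The sketch below assumes two classification versions with a switch in 2010; the version names, the codes and the harmonised labels are invented for illustration and do not come from any actual Statistics Finland classification.

```python
# Hypothetical correspondence table:
# (classification version, original code) -> harmonised code.
CORRESPONDENCE = {
    ("v2001", "3431"): "ADMIN",
    ("v2010", "3341"): "ADMIN",
    ("v2010", "3343"): "ADMIN",
    ("v2001", "2411"): "ACCOUNTING",
    ("v2010", "2411"): "ACCOUNTING",
}


def version_for_year(year):
    """Assumed rule: data before 2010 use version v2001, later data use v2010."""
    return "v2001" if year < 2010 else "v2010"


def harmonise(records):
    """Add a harmonised occupation code; unmapped codes become None for review."""
    out = []
    for rec in records:
        key = (version_for_year(rec["year"]), rec["occupation"])
        out.append({**rec, "occupation_h": CORRESPONDENCE.get(key)})
    return out


records = [
    {"pid": "A001", "year": 2005, "occupation": "3431"},
    {"pid": "A001", "year": 2012, "occupation": "3341"},
    {"pid": "A002", "year": 2012, "occupation": "9999"},
]
harmonised = harmonise(records)
```

Leaving unmapped codes as `None` rather than guessing a category keeps the gaps visible, so they can be checked against the data descriptions.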

Data may come from multiple sources

Data generated by decision rules may be used instead of data from survey forms. For example, the variable “person’s main type of activity” in the employment statistics is inferred by using the “register estimation method”. This method uses information about a person’s age, employment relationships, unemployment, studies, pension, etc. Decision rules have been created based on data from previous censuses and register data from those periods. The decision rules can also prioritise the data sets in case of conflicting data.
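The idea of prioritised decision rules can be illustrated with a small sketch. The rules, their order and the field names below are invented for the example; they are not the actual rules of the register estimation method, which are defined by Statistics Finland.

```python
# Illustrative rules only, checked in priority order: the first match wins.
# This ordering is how conflicting register information is resolved here.
RULES = [
    ("child", lambda p: p["age"] < 15),
    ("employed", lambda p: p["has_job"]),
    ("unemployed", lambda p: p["registered_unemployed"]),
    ("student", lambda p: p["enrolled"]),
    ("pensioner", lambda p: p["receives_pension"]),
]


def main_activity(person):
    """Return the label of the first rule that applies, or 'other' if none does."""
    for label, rule in RULES:
        if rule(person):
            return label
    return "other"


# A person who both works and studies: the registers conflict,
# and the higher-priority rule decides the outcome.
working_student = {
    "age": 23,
    "has_job": True,
    "registered_unemployed": False,
    "enrolled": True,
    "receives_pension": False,
}
activity = main_activity(working_student)
```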

Challenges of combined data

Combined data can be used to expand the description of a topic or compare data from different sources. Data may prove difficult to combine due to differences in the target populations, dates of data collection and statistical limits. For example, the annual statistics of the Business Register only include enterprises that have operated for at least six months in that statistical year and that have employed more than half a person or whose turnover has exceeded the annual statistical limit.

In the employment statistics, on the other hand, a person’s employer enterprise is determined according to the situation in the last week of the year. Enterprises founded and terminated within the same year may therefore be treated differently in the two sets of statistics. The content and values of variables (e.g. “main activity”) may also differ, with different definitions in different sources.
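The difference between the two inclusion rules can be made concrete with a short sketch. The function names, the turnover limit of 10,000 and the example enterprise are assumptions for illustration, and the last week of the year is simplified to 25 to 31 December.

```python
from datetime import date


def in_annual_business_statistics(months_active, personnel, turnover, turnover_limit):
    """Inclusion rule as described in the text; turnover_limit is a caller-supplied assumption."""
    return months_active >= 6 and (personnel > 0.5 or turnover > turnover_limit)


def employed_in_last_week(spell_start, spell_end, year):
    """True if an employment spell overlaps 25-31 December (a simplified 'last week')."""
    return spell_start <= date(year, 12, 31) and spell_end >= date(year, 12, 25)


# Hypothetical enterprise founded in September 2020: only four active months.
included_in_business_stats = in_annual_business_statistics(
    months_active=4, personnel=3.0, turnover=200_000.0, turnover_limit=10_000.0
)

# One of its employees is still employed at the end of the year.
appears_as_employer = employed_in_last_week(date(2020, 9, 15), date(2021, 3, 1), 2020)
```

In this example the enterprise shows up as an employer in the employment data but is left out of the annual business statistics, so a join between the two sources would leave its employees without enterprise attributes.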

It may be especially difficult to combine register and survey data. Combining such data may require extra consideration of their sample designs and weighting the data by statistical methods.
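As a simple illustration of why weighting matters, the sketch below compares an unweighted mean with a weighted one. It assumes each survey respondent carries a weight indicating how many population units they represent; the values and weights are invented, and real survey estimation usually also needs the sample design for variance estimates.

```python
def weighted_mean(values, weights):
    """Weighted mean: each value counts in proportion to its survey weight."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical survey answers and weights.
values = [10.0, 20.0]
weights = [1.0, 3.0]

unweighted = sum(values) / len(values)
weighted = weighted_mean(values, weights)
```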

1.4 Quality and scope of data

When examining data, it is good to remember that extensive statistics include deviations and errors, including:

repeated observations (duplicates)
erroneous variable values
missing information.

The statistical production process can only correct some of the errors. At the same time, the data content is as comprehensive as possible, as the basic data are often released for research use almost as is. This leaves you, as a researcher, free to choose your own solutions for handling outliers, for example. Start the analysis of the data carefully with distribution tables and graphical charts. For time series data, indicators calculated for each year reveal much about changes in the data.
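A first pass over the data might look like the sketch below, which checks for duplicates, missing values and suspiciously large values. The records, the field names and the outlier threshold of ten times the median are assumptions chosen for the example, not recommended rules.

```python
from collections import Counter
from statistics import median

# Hypothetical enterprise records with typical problems.
records = [
    {"pid": "E1", "year": 2020, "turnover": 120.0},
    {"pid": "E2", "year": 2020, "turnover": None},    # missing value
    {"pid": "E1", "year": 2020, "turnover": 120.0},   # repeated observation
    {"pid": "E3", "year": 2020, "turnover": 95.0},
    {"pid": "E4", "year": 2020, "turnover": 110.0},
    {"pid": "E5", "year": 2020, "turnover": 9500.0},  # suspiciously large
]


def find_duplicates(rows, keys=("pid", "year")):
    """Return the key combinations that occur more than once."""
    counts = Counter(tuple(r[k] for k in keys) for r in rows)
    return [key for key, n in counts.items() if n > 1]


def count_missing(rows, field):
    """Count records where the field has no value."""
    return sum(1 for r in rows if r[field] is None)


def flag_outliers(rows, field, factor=10):
    """Flag values larger than factor times the median (an arbitrary rule of thumb)."""
    values = [r[field] for r in rows if r[field] is not None]
    mid = median(values)
    return [r for r in rows if r[field] is not None and r[field] > factor * mid]


duplicates = find_duplicates(records)
missing = count_missing(records, "turnover")
outliers = flag_outliers(records, "turnover")
```

Flagged records are only candidates for review. Whether a large value is an error or a genuinely large enterprise is a decision for the researcher.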

Statistical data are checked for observation quality, defects (unit and item non-responses) and errors, either automatically en masse or manually. Special attention is paid to systematic and significant errors. Large enterprises receive more scrutiny than small ones, for example.

Imputation means estimating missing information

Imputation is a methodological approach to replacing missing or incorrect values with substitute values that are as accurate as possible. The imputed observations and methods used are documented and indicated in the data wherever possible. In some cases, such as financial statement data, the corrections include multiple stages with various correction procedures and imputation methods.

The variance of imputed variables is reduced, partly because imputation adds observations to the data and partly because the imputed values themselves vary less than observed values. When you use imputed variables, take this into account in your research design.
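The effect on variance can be seen in a small numerical sketch. It assumes the simplest possible method, replacing each missing value with the mean of the observed values; this is an illustrative choice and not necessarily a method used for any particular set of statistics.

```python
from statistics import mean, pvariance

# Hypothetical observed values and two missing observations.
observed = [10.0, 12.0, 14.0, 16.0]
n_missing = 2

# Mean imputation: every missing value is replaced with the observed mean.
imputed = observed + [mean(observed)] * n_missing

var_observed = pvariance(observed)
var_imputed = pvariance(imputed)
```

The imputed values sit exactly at the mean, so they add observations without adding any spread, and the variance of the completed variable is lower than that of the observed values alone.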

For example, the number of personnel has been partly estimated on the basis of pay for enterprises with fewer than 20 people. In this group, using an enterprise’s size to explain pay distorts the estimated dependencies, because part of the size variable was itself derived from pay. Further information about the quality and non-response corrections (correction of missing information) of data is given in the quality description of each set of statistics.

Concept quiz

The remote access system is a desktop environment researchers use to access licensed research data securely from their own workstation.
Imputation is a methodological approach to replacing missing or incorrect values with substitute values that are as accurate as possible.
Microdata are register or survey data collected for statistics that include information about individual statistical units such as people, households or enterprises.
Pseudoidentifiers are fake identifiers used to replace personal identity codes, for example.
