Case 4: Analysis of Danish Parliament Data
The project will carry out linguistic analyses of material from the Danish Parliament. Furthermore, the project group participates in a European research network focusing on parliamentary data from approx. 20 countries.
- Timeline: 2017-(ongoing)
- Number of people involved: currently 3 participants
- Number of institutions involved: University of Copenhagen, collaboration with research group at Aarhus University
- Funding (EU funding or not): No EU funding. Currently funded by University of Copenhagen
Knowledge of FAIR prior to case engagement
When were tools used in working towards FAIR in the case
The project is still in its start phase. The currently shared data from the Danish Parliament have already been released and are available for re-use with a persistent identifier, together with searchable and findable metadata, in the Clarin.dk repository. The data have a license attached and are openly available.
Results and new data packages will be released once the results of the analysis have been evaluated and documented. These data sets are planned to be released in the Clarin.dk repository as well.
The text data are currently being annotated automatically or semi-automatically with various types of information.
The research will address the data processes marked in the figure:
Which FAIR tools were used in the case and for what purpose
- Clarin.dk for release of data sets.
- Software and scripts will be stored in the Git version control system at the University of Copenhagen for tracking and documenting the software.
What are the main challenges using FAIR in this case
- Always remembering to acknowledge the Danish Parliament as the source of the data, both when releasing derived data sets and when choosing a valid license for the derived data.
- Deciding when a version of the annotated data can be released as a particular version with a new persistent identifier. The data will be processed and analyzed continuously, but versions of annotated data have to be released during the work, and each version has to be well documented.
- Selecting formats for released annotations and derived datasets that are recognized and well documented. Internal formats might be used in the analysis workflow; these will have to be converted, ideally without loss, to sharable formats.
- Identifying subsets of the data, e.g. data that has been manually annotated with certain types of information versus data that has been automatically annotated.
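The last challenge above, identifying subsets by how they were annotated, can be illustrated with a minimal Python sketch. It assumes that each document carries a small provenance record stating, per annotation layer, whether the layer was produced manually or automatically. The `Document` structure, the layer names (`pos`, `topic`), the document IDs and the `select_subset` helper are all hypothetical and are not taken from the project's actual workflow or from the Clarin.dk repository.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """One unit of the corpus, e.g. a single speech (hypothetical structure)."""
    doc_id: str
    # Maps annotation layer name -> how it was produced ("manual" or "automatic").
    annotations: dict[str, str] = field(default_factory=dict)


def select_subset(documents, layer, method):
    """Return the sorted IDs of documents whose `layer` was produced by `method`."""
    return sorted(
        doc.doc_id
        for doc in documents
        if doc.annotations.get(layer) == method
    )


# Illustrative data only; IDs and layers are invented for this sketch.
corpus = [
    Document("speech-001", {"pos": "automatic", "topic": "manual"}),
    Document("speech-002", {"pos": "automatic"}),
    Document("speech-003", {"pos": "manual", "topic": "automatic"}),
]

manual_topic_ids = select_subset(corpus, "topic", "manual")
auto_pos_ids = select_subset(corpus, "pos", "automatic")
print(manual_topic_ids)  # ['speech-001']
print(auto_pos_ids)      # ['speech-001', 'speech-002']
```

Recording provenance per annotation layer, rather than per document, is one possible design choice: it allows a document to appear in a manually annotated subset for one layer and in an automatically annotated subset for another, and the resulting ID lists could be documented alongside a released version.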
What were the main benefits using FAIR in this case
- When data is openly available in a repository with a PID, metadata and a clear license, it is easier to use it in cooperative projects with external partners.
- It is easier to compare similar data in different countries.
- Data must be documented for later use, both internally and externally.
Key learning points
- The data lifecycle and project lifecycle can help identify phases in the annotation work.
- There should be a way of identifying subsets of large data collections in the data lifecycle and project lifecycle.
For further information on this case contact: Senior researcher Costanza Navarretta or Dorte Haltrup Hansen, University of Copenhagen.