Accessing the CDB via R under Windows 10

Example script how to access the climate database (CDB) from R under Windows 10. In comparison to Ubuntu, we have to first install some external libraries to use the database from within R. For more information, see also

Since the climate database is a PostgreSQL one, we need the appropriate driver. For windows, the .msi installer can be downloaded from here https://www.postgresql.org/ftp/odbc/versions/msi/. Just scroll to the bottom, get the latest version, and install it. If succesful, then you should see "PostgreSQL Unicode" when running the following command (requires the odbc package).

sort(unique(odbc::odbcListDrivers()[[1]]))

## [1] "PostgreSQL ANSI"         "PostgreSQL ANSI(x64)"   
## [3] "PostgreSQL Unicode"      "PostgreSQL Unicode(x64)"
## [5] "SQL Server"

Required packages

To set up the connection, the following two packages are required.

library(DBI)
library(odbc)

Using the tidyverse in R, we can circumvent the direct usage of SQL language, and can use the more user-friendly dplyr syntax. For this we need these packages.

library(dplyr)
library(dbplyr)

For data wrangling and visualization we will use these packages.

library(sf)
library(lubridate)
library(ggplot2)
library(mapview)

Set up the connection

The following sets up the connection to the SQL database. This is currently only possible within the ScientificNet. The parameters are copied from the documentation.

For windows, we also need to set up the encoding parameter, otherwise accented characters do net get read in correctly and might cause problems.

conn <- dbConnect(odbc(), 
                  Driver = "PostgreSQL Unicode", 
                  Server = "10.8.244.31", 
                  Database = "climate_data", 
                  UID = "climate_user", 
                  PWD = "meteo_data", Port = 5432,
                  encoding = "windows-1252")

Overview of available data

List the available tables.

dbListTables(conn)

##  [1] "climatologies"            "data_restricted_temp"    
##  [3] "data_temp"                "geography_columns"       
##  [5] "geometry_columns"         "metadata"                
##  [7] "raster_columns"           "raster_overviews"        
##  [9] "spatial_ref_sys"          "stations_data"           
## [11] "stations_data_backup"     "stations_data_join"      
## [13] "stations_data_restricted"

To view the data, we will use dplyr. This is efficient, since dplyr uses lazy evaluation, in the sense that it will only fetch the data required for printing, and will not download all the data (unless explicitly requested, see also further below). You can see this in the output below that identifies the source of the table and does not know the number of rows.

The daily values are stored in the table stations_data.

tbl(conn, "stations_data")

## # Source:   table<stations_data> [?? x 7]
## # Database: postgres [climate_user@localhost:/climate_data]
##    date        tmin  tmax tmean  prec province station     
##    <date>     <dbl> <dbl> <dbl> <dbl> <chr>    <chr>       
##  1 2017-06-15 10.8   17.6  14.2   0.2 TN       Passo_Brocon
##  2 2017-06-16 11.3   22.1  16.7   1.2 TN       Passo_Brocon
##  3 2017-06-17 10.4   20.2  15.3   0   TN       Passo_Brocon
##  4 2017-06-18  8.60  17.3  12.9   0   TN       Passo_Brocon
##  5 2017-06-19  9.90  17.5  13.7   0   TN       Passo_Brocon
##  6 2017-06-20 11.6   21.6  16.6   0   TN       Passo_Brocon
##  7 2017-06-21 12.6   21.7  17.1   0   TN       Passo_Brocon
##  8 2017-06-22 11.8   22.5  17.1  17   TN       Passo_Brocon
##  9 2017-06-23 13.6   23.5  18.6   0   TN       Passo_Brocon
## 10 2017-06-24 14.2   23.4  18.8   0   TN       Passo_Brocon
## # ... with more rows

The climatologies (averages for the period 1981-2010) values are stored in the table climatologies.

tbl(conn, "climatologies")

## # Source:   table<climatologies> [?? x 6]
## # Database: postgres [climate_user@localhost:/climate_data]
##    month  tmin  tmax tmean  prec station              
##    <int> <dbl> <dbl> <dbl> <dbl> <chr>                
##  1     1 NA    NA    NA     62.4 Galtür               
##  2     2 NA    NA    NA     56.9 Galtür               
##  3     3 NA    NA    NA     66.7 Galtür               
##  4     4 NA    NA    NA     50.6 Galtür               
##  5     5 NA    NA    NA     79.4 Galtür               
##  6     6 NA    NA    NA    126.  Galtür               
##  7     7 NA    NA    NA    143.  Galtür               
##  8     2 -2.8   6.30  1.8   40.4 Innsbruck_Universität
##  9     3  1    11.8   6.40  55.8 Innsbruck_Universität
## 10     4  4.60 16.3  10.4   55.4 Innsbruck_Universität
## # ... with more rows

And finally the metadata with the station locations and more info on the processing in metadata.

tbl(conn, "metadata")

## # Source:   table<metadata> [?? x 16]
## # Database: postgres [climate_user@localhost:/climate_data]
##       id geom    LAT   LON   ELE P_use P_lenght P_source P_homo T_use T_lenght
##    <int> <chr> <dbl> <dbl> <dbl> <dbl>    <dbl> <chr>     <dbl> <dbl>    <dbl>
##  1   260 0101~  46.2  10.8  1681     1        2 <NA>          1     1        2
##  2    27 0101~  47.0  11.1  2290     1        0 BZ-Prov~      0    NA       NA
##  3    30 0101~  46.5  11.3   239     0        2 BZ-Prov~      0    NA       NA
##  4    63 0101~  47.0  11.4  1360     1        2 BZ-Prov~      0    NA       NA
##  5   164 0101~  46.9  10.4  2730     0        0 METEOSW~      0     1        2
##  6   162 0101~  46.9  11.2  2230    NA       NA <NA>         NA     1        0
##  7   160 0101~  46.5  10.6  2825     0        0 BZ-Prov~      0    NA       NA
##  8   151 0101~  46.8  11.8   746     1        2 BZ-Prov~      0    NA       NA
##  9    94 0101~  46.5  11.7  2055     1        0 BZ-Prov~      0     1        0
## 10   109 0101~  46.5  11.0  1142     1        2 BZ-Prov~      1     1        2
## # ... with more rows, and 5 more variables: T_source <chr>, Tmin_homo <dbl>,
## #   Tmax_homo <dbl>, Province <chr>, NAME <chr>

Map of the station locations

The metadata contains the geom column, which holds the spatial information. We can read it in as a spatial object using the sf package.

sf_metadata <- st_read(dsn = conn, "metadata")
sf_metadata

## Simple feature collection with 346 features and 15 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 10 ymin: 45.73919 xmax: 12.61463 ymax: 47.261
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## First 10 features:
##     id      LAT      LON  ELE P_use P_lenght    P_source P_homo T_use T_lenght
## 1  260 46.24248 10.83836 1681     1        2        <NA>      1     1        2
## 2   27 46.99690 11.13565 2290     1        0 BZ-Province      0    NA       NA
## 3   30 46.48451 11.29918  239     0        2 BZ-Province      0    NA       NA
## 4   63 47.01774 11.42200 1360     1        2 BZ-Province      0    NA       NA
## 5  164 46.93333 10.45000 2730     0        0  METEOSWISS      0     1        2
## 6  162 46.91667 11.15000 2230    NA       NA        <NA>     NA     1        0
## 7  160 46.50000 10.61667 2825     0        0 BZ-Province      0    NA       NA
## 8  151 46.81667 11.75000  746     1        2 BZ-Province      0    NA       NA
## 9   94 46.51188 11.68879 2055     1        0 BZ-Province      0     1        0
## 10 109 46.54187 10.99014 1142     1        2 BZ-Province      1     1        2
##         T_source Tmin_homo Tmax_homo Province                       NAME
## 1  METEOTRENTINO         0         1       TN    Passo_Campo_Carlo_Magno
## 2           <NA>        NA        NA       AU             Dresdner_Hütte
## 3           <NA>        NA        NA       BZ     Etsch_Sigmundskron (M)
## 4           <NA>        NA        NA       AU       Obernberg_am_Brenner
## 5     METEOSWISS         0         0       CH             Vinadi_Alpetta
## 6    BZ-Province         0         0       BZ                 Timmelsalm
## 7           <NA>        NA        NA       BZ           Sulden_Madritsch
## 8           <NA>        NA        NA       BZ                  Obervintl
## 9    BZ-Province         0         0       BZ       Seiser_Alm_Zallinger
## 10   BZ-Province         0         0       BZ St_Walburg_Zoggler_Stausee
##                         geom
## 1  POINT (10.83836 46.24248)
## 2   POINT (11.13565 46.9969)
## 3  POINT (11.29918 46.48451)
## 4    POINT (11.422 47.01774)
## 5     POINT (10.45 46.93333)
## 6     POINT (11.15 46.91667)
## 7      POINT (10.61667 46.5)
## 8     POINT (11.75 46.81667)
## 9  POINT (11.68879 46.51188)
## 10 POINT (10.99014 46.54187)

We will use mapview to interactively explore the station locations.

mapview(sf_metadata)

Get some example data and plot

Let’s extract the time series of a station, make monthly means of minimum temperature for every year and visualize them.

For example, let’s take the station named Welschnofen.

tbl(conn, "stations_data") %>% 
  filter(station == "Deutschnofen")

## # Source:   lazy query [?? x 7]
## # Database: postgres [climate_user@localhost:/climate_data]
##    date        tmin  tmax tmean  prec province station     
##    <date>     <dbl> <dbl> <dbl> <dbl> <chr>    <chr>       
##  1 1978-01-01    -7     4  -1.5   NaN BZ       Deutschnofen
##  2 1978-01-02    -6     4  -1     NaN BZ       Deutschnofen
##  3 1978-01-03    -3     7   2     NaN BZ       Deutschnofen
##  4 1978-01-04    -8     3  -2.5   NaN BZ       Deutschnofen
##  5 1978-01-05   -10    -3  -6.5   NaN BZ       Deutschnofen
##  6 1978-01-06   -11     2  -4.5   NaN BZ       Deutschnofen
##  7 1978-01-07    -3     7   2     NaN BZ       Deutschnofen
##  8 1978-01-08    -4     8   2     NaN BZ       Deutschnofen
##  9 1978-01-09    -3     8   2.5   NaN BZ       Deutschnofen
## 10 1978-01-10    -4     5   0.5   NaN BZ       Deutschnofen
## # ... with more rows

For aggregating, we have two options. First to download the the data using collect, and then doing the aggregations. From what is printed, we can see that the table is now stored locally (and R knows the number of rows).

tbl(conn, "stations_data") %>% 
  filter(station == "Deutschnofen") %>% 
  collect() -> tbl_local_daily
tbl_local_daily

## # A tibble: 15,340 x 7
##    date        tmin  tmax tmean  prec province station     
##    <date>     <dbl> <dbl> <dbl> <dbl> <chr>    <chr>       
##  1 1978-01-01    -7     4  -1.5   NaN BZ       Deutschnofen
##  2 1978-01-02    -6     4  -1     NaN BZ       Deutschnofen
##  3 1978-01-03    -3     7   2     NaN BZ       Deutschnofen
##  4 1978-01-04    -8     3  -2.5   NaN BZ       Deutschnofen
##  5 1978-01-05   -10    -3  -6.5   NaN BZ       Deutschnofen
##  6 1978-01-06   -11     2  -4.5   NaN BZ       Deutschnofen
##  7 1978-01-07    -3     7   2     NaN BZ       Deutschnofen
##  8 1978-01-08    -4     8   2     NaN BZ       Deutschnofen
##  9 1978-01-09    -3     8   2.5   NaN BZ       Deutschnofen
## 10 1978-01-10    -4     5   0.5   NaN BZ       Deutschnofen
## # ... with 15,330 more rows

tbl_local_daily %>% 
  group_by(year = year(date), month = month(date)) %>% 
  summarise(mean_tmin = mean(tmin)) -> tbl_local_agg
tbl_local_agg

## # A tibble: 504 x 3
## # Groups:   year [42]
##     year month mean_tmin
##    <dbl> <dbl>     <dbl>
##  1  1978     1     -5.77
##  2  1978     2     -6.71
##  3  1978     3     -2.35
##  4  1978     4     -1   
##  5  1978     5      2.77
##  6  1978     6      6.8 
##  7  1978     7      9.06
##  8  1978     8      8.84
##  9  1978     9      6.93
## 10  1978    10      3.13
## # ... with 494 more rows

The second option is to do the aggregations on the server. The dplyr syntax automatically converts the verbose code into SQL code which is run on the server, and then we collect only the results. Since the internet connections are good and data volumes are low, this is not really necessary but can be useful. However, attention has to be paid because of possibly unexpected behaviour concerning missing values.

tbl(conn, "stations_data") %>% 
  filter(station == "Deutschnofen") %>% 
  group_by(year = year(date), month = month(date)) %>% 
  summarise(mean_tmin = mean(tmin))  %>% 
  collect() -> tbl_summary_done_remote

## Warning: Missing values are always removed in SQL.
## Use `mean(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.

tbl_summary_done_remote

## # A tibble: 504 x 3
## # Groups:   year [42]
##     year month mean_tmin
##    <dbl> <dbl>     <dbl>
##  1  2004     9      7.56
##  2  2019     5    NaN   
##  3  1991    12     -5.23
##  4  2012     2     -7.75
##  5  1984    12     -5.97
##  6  2011     6      8.93
##  7  2013     3     -2.73
##  8  1992     9      7.1 
##  9  2001     2     -3.93
## 10  1996     1     -4.03
## # ... with 494 more rows

One can see the SQL code that dplyr generated and ran on the server with the function show_query(). We will run it before collecting the data for illustration purposes.

tbl(conn, "stations_data") %>% 
  filter(station == "Deutschnofen") %>% 
  group_by(year = year(date), month = month(date)) %>% 
  summarise(mean_tmin = mean(tmin)) %>% 
  show_query()

## <SQL>
## SELECT "year", "month", AVG("tmin") AS "mean_tmin"
## FROM (SELECT "date", "tmin", "tmax", "tmean", "prec", "province", "station", EXTRACT(year FROM "date") AS "year", EXTRACT(MONTH FROM "date") AS "month"
## FROM (SELECT *
## FROM "stations_data"
## WHERE ("station" = 'Deutschnofen')) "dbplyr_003") "dbplyr_004"
## GROUP BY "year", "month"

And some plot to see the data.

tbl_local_agg %>% 
  ggplot(aes(year, mean_tmin))+
  geom_point()+
  geom_smooth()+
  facet_wrap(~month)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## Warning: Removed 35 rows containing non-finite values (stat_smooth).

## Warning: Removed 35 rows containing missing values (geom_point).

Just for the fun of it, let’s plot the deviations from the 1981-2010 climatology, to better see the different seasonal changes. We’ll get the climatology data first, then merge it to the previous one, calculate differences and plot it.

tbl(conn, "climatologies") %>% 
  filter(station == "Deutschnofen") %>% 
  collect() -> tbl_climatology
tbl_climatology

## # A tibble: 12 x 6
##    month  tmin  tmax tmean  prec station     
##    <int> <dbl> <dbl> <dbl> <dbl> <chr>       
##  1     1 -5.60  2.10 -1.7   18   Deutschnofen
##  2     2 -5.80  2.8  -1.5   17.8 Deutschnofen
##  3     3 -3.10  5.80  1.4   32.4 Deutschnofen
##  4     4 -0.1   8.70  4.30  58.1 Deutschnofen
##  5     5  4.60 13.6   9.10  94.6 Deutschnofen
##  6     6  7.60 17.5  12.6  105.  Deutschnofen
##  7     7 10    20    15    104   Deutschnofen
##  8     8  9.80 19.3  14.6   95.1 Deutschnofen
##  9     9  6.70 15.3  11     76.1 Deutschnofen
## 10    10  3    10.9   7     81   Deutschnofen
## 11    11 -1.8   5.30  1.7   57.6 Deutschnofen
## 12    12 -4.80  2    -1.4   31.7 Deutschnofen

tbl_local_agg %>% 
  left_join(tbl_climatology) %>% 
  mutate(mean_tmin_anomalies = mean_tmin - tmin) -> tbl_anomalies

## Joining, by = "month"

tbl_anomalies %>%
  ggplot(aes(year, mean_tmin_anomalies))+
  geom_point()+
  geom_smooth()+
  facet_wrap(~month)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## Warning: Removed 35 rows containing non-finite values (stat_smooth).

## Warning: Removed 35 rows containing missing values (geom_point).

When finished, close the connection

dbDisconnect(conn)