Example script how to access the climate database (CDB) from R under Windows 10. In comparison to Ubuntu, we have to first install some external libraries to use the database from within R. For more information, see also
Since the climate database is a PostgreSQL one, we need the appropriate driver. For windows, the .msi installer can be downloaded from here https://www.postgresql.org/ftp/odbc/versions/msi/. Just scroll to the bottom, get the latest version, and install it. If succesful, then you should see "PostgreSQL Unicode"
when running the following command (requires the odbc
package).
sort(unique(odbc::odbcListDrivers()[[1]]))
## [1] "PostgreSQL ANSI" "PostgreSQL ANSI(x64)"
## [3] "PostgreSQL Unicode" "PostgreSQL Unicode(x64)"
## [5] "SQL Server"
To set up the connection, the following two packages are required.
library(DBI)
library(odbc)
Using the tidyverse in R, we can circumvent the direct usage of SQL language, and can use the more user-friendly dplyr
syntax. For this we need these packages.
library(dplyr)
library(dbplyr)
For data wrangling and visualization we will use these packages.
library(sf)
library(lubridate)
library(ggplot2)
library(mapview)
The following sets up the connection to the SQL database. This is currently only possible within the ScientificNet. The parameters are copied from the documentation.
For windows, we also need to set up the encoding parameter, otherwise accented characters do net get read in correctly and might cause problems.
conn <- dbConnect(odbc(),
Driver = "PostgreSQL Unicode",
Server = "10.8.244.31",
Database = "climate_data",
UID = "climate_user",
PWD = "meteo_data", Port = 5432,
encoding = "windows-1252")
List the available tables.
dbListTables(conn)
## [1] "climatologies" "data_restricted_temp"
## [3] "data_temp" "geography_columns"
## [5] "geometry_columns" "metadata"
## [7] "raster_columns" "raster_overviews"
## [9] "spatial_ref_sys" "stations_data"
## [11] "stations_data_backup" "stations_data_join"
## [13] "stations_data_restricted"
To view the data, we will use dplyr. This is efficient, since dplyr uses lazy evaluation, in the sense that it will only fetch the data required for printing, and will not download all the data (unless explicitly requested, see also further below). You can see this in the output below that identifies the source of the table and does not know the number of rows.
The daily values are stored in the table stations_data
.
tbl(conn, "stations_data")
## # Source: table<stations_data> [?? x 7]
## # Database: postgres [climate_user@localhost:/climate_data]
## date tmin tmax tmean prec province station
## <date> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 2017-06-15 10.8 17.6 14.2 0.2 TN Passo_Brocon
## 2 2017-06-16 11.3 22.1 16.7 1.2 TN Passo_Brocon
## 3 2017-06-17 10.4 20.2 15.3 0 TN Passo_Brocon
## 4 2017-06-18 8.60 17.3 12.9 0 TN Passo_Brocon
## 5 2017-06-19 9.90 17.5 13.7 0 TN Passo_Brocon
## 6 2017-06-20 11.6 21.6 16.6 0 TN Passo_Brocon
## 7 2017-06-21 12.6 21.7 17.1 0 TN Passo_Brocon
## 8 2017-06-22 11.8 22.5 17.1 17 TN Passo_Brocon
## 9 2017-06-23 13.6 23.5 18.6 0 TN Passo_Brocon
## 10 2017-06-24 14.2 23.4 18.8 0 TN Passo_Brocon
## # ... with more rows
The climatologies (averages for the period 1981-2010) values are stored in the table climatologies
.
tbl(conn, "climatologies")
## # Source: table<climatologies> [?? x 6]
## # Database: postgres [climate_user@localhost:/climate_data]
## month tmin tmax tmean prec station
## <int> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1 NA NA NA 62.4 Galtür
## 2 2 NA NA NA 56.9 Galtür
## 3 3 NA NA NA 66.7 Galtür
## 4 4 NA NA NA 50.6 Galtür
## 5 5 NA NA NA 79.4 Galtür
## 6 6 NA NA NA 126. Galtür
## 7 7 NA NA NA 143. Galtür
## 8 2 -2.8 6.30 1.8 40.4 Innsbruck_Universität
## 9 3 1 11.8 6.40 55.8 Innsbruck_Universität
## 10 4 4.60 16.3 10.4 55.4 Innsbruck_Universität
## # ... with more rows
And finally the metadata with the station locations and more info on the processing in metadata
.
tbl(conn, "metadata")
## # Source: table<metadata> [?? x 16]
## # Database: postgres [climate_user@localhost:/climate_data]
## id geom LAT LON ELE P_use P_lenght P_source P_homo T_use T_lenght
## <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 260 0101~ 46.2 10.8 1681 1 2 <NA> 1 1 2
## 2 27 0101~ 47.0 11.1 2290 1 0 BZ-Prov~ 0 NA NA
## 3 30 0101~ 46.5 11.3 239 0 2 BZ-Prov~ 0 NA NA
## 4 63 0101~ 47.0 11.4 1360 1 2 BZ-Prov~ 0 NA NA
## 5 164 0101~ 46.9 10.4 2730 0 0 METEOSW~ 0 1 2
## 6 162 0101~ 46.9 11.2 2230 NA NA <NA> NA 1 0
## 7 160 0101~ 46.5 10.6 2825 0 0 BZ-Prov~ 0 NA NA
## 8 151 0101~ 46.8 11.8 746 1 2 BZ-Prov~ 0 NA NA
## 9 94 0101~ 46.5 11.7 2055 1 0 BZ-Prov~ 0 1 0
## 10 109 0101~ 46.5 11.0 1142 1 2 BZ-Prov~ 1 1 2
## # ... with more rows, and 5 more variables: T_source <chr>, Tmin_homo <dbl>,
## # Tmax_homo <dbl>, Province <chr>, NAME <chr>
The metadata contains the geom
column, which holds the spatial information. We can read it in as a spatial object using the sf
package.
sf_metadata <- st_read(dsn = conn, "metadata")
sf_metadata
## Simple feature collection with 346 features and 15 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 10 ymin: 45.73919 xmax: 12.61463 ymax: 47.261
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
## First 10 features:
## id LAT LON ELE P_use P_lenght P_source P_homo T_use T_lenght
## 1 260 46.24248 10.83836 1681 1 2 <NA> 1 1 2
## 2 27 46.99690 11.13565 2290 1 0 BZ-Province 0 NA NA
## 3 30 46.48451 11.29918 239 0 2 BZ-Province 0 NA NA
## 4 63 47.01774 11.42200 1360 1 2 BZ-Province 0 NA NA
## 5 164 46.93333 10.45000 2730 0 0 METEOSWISS 0 1 2
## 6 162 46.91667 11.15000 2230 NA NA <NA> NA 1 0
## 7 160 46.50000 10.61667 2825 0 0 BZ-Province 0 NA NA
## 8 151 46.81667 11.75000 746 1 2 BZ-Province 0 NA NA
## 9 94 46.51188 11.68879 2055 1 0 BZ-Province 0 1 0
## 10 109 46.54187 10.99014 1142 1 2 BZ-Province 1 1 2
## T_source Tmin_homo Tmax_homo Province NAME
## 1 METEOTRENTINO 0 1 TN Passo_Campo_Carlo_Magno
## 2 <NA> NA NA AU Dresdner_Hütte
## 3 <NA> NA NA BZ Etsch_Sigmundskron (M)
## 4 <NA> NA NA AU Obernberg_am_Brenner
## 5 METEOSWISS 0 0 CH Vinadi_Alpetta
## 6 BZ-Province 0 0 BZ Timmelsalm
## 7 <NA> NA NA BZ Sulden_Madritsch
## 8 <NA> NA NA BZ Obervintl
## 9 BZ-Province 0 0 BZ Seiser_Alm_Zallinger
## 10 BZ-Province 0 0 BZ St_Walburg_Zoggler_Stausee
## geom
## 1 POINT (10.83836 46.24248)
## 2 POINT (11.13565 46.9969)
## 3 POINT (11.29918 46.48451)
## 4 POINT (11.422 47.01774)
## 5 POINT (10.45 46.93333)
## 6 POINT (11.15 46.91667)
## 7 POINT (10.61667 46.5)
## 8 POINT (11.75 46.81667)
## 9 POINT (11.68879 46.51188)
## 10 POINT (10.99014 46.54187)
We will use mapview
to interactively explore the station locations.
mapview(sf_metadata)
Let’s extract the time series of a station, make monthly means of minimum temperature for every year and visualize them.
For example, let’s take the station named Welschnofen.
tbl(conn, "stations_data") %>%
filter(station == "Deutschnofen")
## # Source: lazy query [?? x 7]
## # Database: postgres [climate_user@localhost:/climate_data]
## date tmin tmax tmean prec province station
## <date> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 1978-01-01 -7 4 -1.5 NaN BZ Deutschnofen
## 2 1978-01-02 -6 4 -1 NaN BZ Deutschnofen
## 3 1978-01-03 -3 7 2 NaN BZ Deutschnofen
## 4 1978-01-04 -8 3 -2.5 NaN BZ Deutschnofen
## 5 1978-01-05 -10 -3 -6.5 NaN BZ Deutschnofen
## 6 1978-01-06 -11 2 -4.5 NaN BZ Deutschnofen
## 7 1978-01-07 -3 7 2 NaN BZ Deutschnofen
## 8 1978-01-08 -4 8 2 NaN BZ Deutschnofen
## 9 1978-01-09 -3 8 2.5 NaN BZ Deutschnofen
## 10 1978-01-10 -4 5 0.5 NaN BZ Deutschnofen
## # ... with more rows
For aggregating, we have two options. First to download the the data using collect, and then doing the aggregations. From what is printed, we can see that the table is now stored locally (and R knows the number of rows).
tbl(conn, "stations_data") %>%
filter(station == "Deutschnofen") %>%
collect() -> tbl_local_daily
tbl_local_daily
## # A tibble: 15,340 x 7
## date tmin tmax tmean prec province station
## <date> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 1978-01-01 -7 4 -1.5 NaN BZ Deutschnofen
## 2 1978-01-02 -6 4 -1 NaN BZ Deutschnofen
## 3 1978-01-03 -3 7 2 NaN BZ Deutschnofen
## 4 1978-01-04 -8 3 -2.5 NaN BZ Deutschnofen
## 5 1978-01-05 -10 -3 -6.5 NaN BZ Deutschnofen
## 6 1978-01-06 -11 2 -4.5 NaN BZ Deutschnofen
## 7 1978-01-07 -3 7 2 NaN BZ Deutschnofen
## 8 1978-01-08 -4 8 2 NaN BZ Deutschnofen
## 9 1978-01-09 -3 8 2.5 NaN BZ Deutschnofen
## 10 1978-01-10 -4 5 0.5 NaN BZ Deutschnofen
## # ... with 15,330 more rows
tbl_local_daily %>%
group_by(year = year(date), month = month(date)) %>%
summarise(mean_tmin = mean(tmin)) -> tbl_local_agg
tbl_local_agg
## # A tibble: 504 x 3
## # Groups: year [42]
## year month mean_tmin
## <dbl> <dbl> <dbl>
## 1 1978 1 -5.77
## 2 1978 2 -6.71
## 3 1978 3 -2.35
## 4 1978 4 -1
## 5 1978 5 2.77
## 6 1978 6 6.8
## 7 1978 7 9.06
## 8 1978 8 8.84
## 9 1978 9 6.93
## 10 1978 10 3.13
## # ... with 494 more rows
The second option is to do the aggregations on the server. The dplyr syntax automatically converts the verbose code into SQL code which is run on the server, and then we collect only the results. Since the internet connections are good and data volumes are low, this is not really necessary but can be useful. However, attention has to be paid because of possibly unexpected behaviour concerning missing values.
tbl(conn, "stations_data") %>%
filter(station == "Deutschnofen") %>%
group_by(year = year(date), month = month(date)) %>%
summarise(mean_tmin = mean(tmin)) %>%
collect() -> tbl_summary_done_remote
## Warning: Missing values are always removed in SQL.
## Use `mean(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.
tbl_summary_done_remote
## # A tibble: 504 x 3
## # Groups: year [42]
## year month mean_tmin
## <dbl> <dbl> <dbl>
## 1 2004 9 7.56
## 2 2019 5 NaN
## 3 1991 12 -5.23
## 4 2012 2 -7.75
## 5 1984 12 -5.97
## 6 2011 6 8.93
## 7 2013 3 -2.73
## 8 1992 9 7.1
## 9 2001 2 -3.93
## 10 1996 1 -4.03
## # ... with 494 more rows
One can see the SQL code that dplyr
generated and ran on the server with the function show_query()
. We will run it before collecting the data for illustration purposes.
tbl(conn, "stations_data") %>%
filter(station == "Deutschnofen") %>%
group_by(year = year(date), month = month(date)) %>%
summarise(mean_tmin = mean(tmin)) %>%
show_query()
## <SQL>
## SELECT "year", "month", AVG("tmin") AS "mean_tmin"
## FROM (SELECT "date", "tmin", "tmax", "tmean", "prec", "province", "station", EXTRACT(year FROM "date") AS "year", EXTRACT(MONTH FROM "date") AS "month"
## FROM (SELECT *
## FROM "stations_data"
## WHERE ("station" = 'Deutschnofen')) "dbplyr_003") "dbplyr_004"
## GROUP BY "year", "month"
And some plot to see the data.
tbl_local_agg %>%
ggplot(aes(year, mean_tmin))+
geom_point()+
geom_smooth()+
facet_wrap(~month)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 35 rows containing non-finite values (stat_smooth).
## Warning: Removed 35 rows containing missing values (geom_point).
Just for the fun of it, let’s plot the deviations from the 1981-2010 climatology, to better see the different seasonal changes. We’ll get the climatology data first, then merge it to the previous one, calculate differences and plot it.
tbl(conn, "climatologies") %>%
filter(station == "Deutschnofen") %>%
collect() -> tbl_climatology
tbl_climatology
## # A tibble: 12 x 6
## month tmin tmax tmean prec station
## <int> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1 -5.60 2.10 -1.7 18 Deutschnofen
## 2 2 -5.80 2.8 -1.5 17.8 Deutschnofen
## 3 3 -3.10 5.80 1.4 32.4 Deutschnofen
## 4 4 -0.1 8.70 4.30 58.1 Deutschnofen
## 5 5 4.60 13.6 9.10 94.6 Deutschnofen
## 6 6 7.60 17.5 12.6 105. Deutschnofen
## 7 7 10 20 15 104 Deutschnofen
## 8 8 9.80 19.3 14.6 95.1 Deutschnofen
## 9 9 6.70 15.3 11 76.1 Deutschnofen
## 10 10 3 10.9 7 81 Deutschnofen
## 11 11 -1.8 5.30 1.7 57.6 Deutschnofen
## 12 12 -4.80 2 -1.4 31.7 Deutschnofen
tbl_local_agg %>%
left_join(tbl_climatology) %>%
mutate(mean_tmin_anomalies = mean_tmin - tmin) -> tbl_anomalies
## Joining, by = "month"
tbl_anomalies %>%
ggplot(aes(year, mean_tmin_anomalies))+
geom_point()+
geom_smooth()+
facet_wrap(~month)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 35 rows containing non-finite values (stat_smooth).
## Warning: Removed 35 rows containing missing values (geom_point).
dbDisconnect(conn)