1. Load data

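Load the data with read.csv.ffdf from the ff package, which keeps the data on disk instead of loading everything into memory.

library(ff)  # provides read.csv.ffdf

star <- read.csv.ffdf(file = "star2002-full.csv", header = TRUE)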


2. Some analyses

2.1 Exploration

2.2 Visualisation

Convert to df.

star_df <-

Not sure what these variables mean.

For illustration only.


ggplot(star_df) + aes(x = X1613424) + geom_histogram() +
  ggtitle("Histogram of X1613424")
ggplot(star_df) + aes(x = X1613424, y = X4518) +
  geom_point(shape = 1, colour = "blue") + ggtitle("Scatter plot")

2.3 Random sample

Draw a 0.01% random sample of the rows.

star_sample <- sample_frac(star_df, 0.0001)


2.4 Training validation split

Our favourite seed :-)


Create the indices for the split. This samples the row indices to split the data into training and validation sets.

train_index <- sample(1:nrow(star), 0.6 * nrow(star))
valid_index <- setdiff(1:nrow(star), train_index)

Using the indices, create the training and validation sets. This is similar in principle to splitting a data frame by row.

train_df <- star[train_index, ]
valid_df <- star[valid_index, ]
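
The seeded split can be sketched on a toy data frame. This is a minimal illustration in base R, not the code used for the real data: the toy values, the helper name split_rows and the seed value 1 are all assumptions made for the example. It shows that fixing the seed before sampling gives the same split every time.

```r
# Toy data frame standing in for the real data (hypothetical values).
toy <- data.frame(id = 1:10, value = (1:10)^2)

# Hypothetical helper: sample 60% of the row indices for training,
# keep the remaining indices for validation.
split_rows <- function(n, train_frac = 0.6, seed = 1) {
  set.seed(seed)  # fix the random number generator (seed value is an assumption)
  train_index <- sample(seq_len(n), round(train_frac * n))
  valid_index <- setdiff(seq_len(n), train_index)
  list(train = train_index, valid = valid_index)
}

# Two calls with the same seed give the same split.
first  <- split_rows(nrow(toy))
second <- split_rows(nrow(toy))

# Use the indices to subset the rows.
train_toy <- toy[first$train, ]
valid_toy <- toy[first$valid, ]
```

Without the seed, each run would sample different rows, so results further down would not be reproducible.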


Column names.

3. A model

Build the model.

Just a random model for illustration.

No idea what the variables mean :-)

regression_model <- lm(X10.559091 ~ X10.955403 + X2288071 + X.0.28820264, data = train_df)

Predict and check the accuracy.
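
As a rough sketch of this step, the example below fits a linear model on simulated data, predicts on the held-out rows and computes two common error measures by hand. Everything here is an assumption made for illustration: the simulated variables x and y, the seed, and the choice of RMSE and MAE as stand-ins for whatever an accuracy() helper would report.

```r
set.seed(1)  # hypothetical seed, for reproducibility of the toy data

# Simulated data: y depends linearly on x, plus noise.
n <- 100
toy <- data.frame(x = runif(n))
toy$y <- 2 + 3 * toy$x + rnorm(n, sd = 0.1)

# 60/40 training and validation split.
train_index <- sample(seq_len(n), round(0.6 * n))
train_toy <- toy[train_index, ]
valid_toy <- toy[-train_index, ]

# Fit on the training set, predict on the validation set.
toy_model <- lm(y ~ x, data = train_toy)
toy_pred  <- predict(toy_model, newdata = valid_toy)

# Error measures computed by hand on the validation set.
errors <- valid_toy$y - toy_pred
rmse <- sqrt(mean(errors^2))  # root mean squared error
mae  <- mean(abs(errors))     # mean absolute error
```

Both measures are in the units of the response, so they can be compared directly with the spread of y.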

4. PostgreSQL

Install the packages if they are not already installed.
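
One way to do this is to check which packages are missing before calling install.packages. The sketch below is only a suggestion: missing_packages is a hypothetical helper, and the package list (DBI and RPostgres) is an assumption based on the calls used in this section.

```r
# Hypothetical helper: return the packages in `pkgs` that cannot be loaded.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Assumed package list for this section; adjust as needed.
needed <- c("DBI", "RPostgres")
to_install <- missing_packages(needed)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

The install call is left commented out so that running the block never changes the library by accident.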

4.1 Connect to the database

Connect to the database.

The credentials below are hypothetical, for illustration only. In practice, you would use the credentials supplied for your database.

db <- "maytheforcebewithyou"

host_db <- ""

db_port <- "666"

db_user <- "thisistheway"  

db_password <- "theforceiswithme"

con <- dbConnect(RPostgres::Postgres(), dbname = db, host = host_db, port = db_port, user = db_user, password = db_password)

## [1] "star_sample"

4.2 Work with database 1

The data (star_sample) were written to the database.

A small sample is used for illustration. In reality, a much bigger data set can be added to the database.
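
The write, list and read steps can be tried without a PostgreSQL server. The sketch below swaps in an in-memory SQLite database as a stand-in, which is an assumption made purely so the example is self-contained; it needs the DBI and RSQLite packages and is skipped when they are not installed. The names con_demo and toy are hypothetical.

```r
# Stand-in sketch: an in-memory SQLite database replaces the PostgreSQL server.
tables <- character(0)
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  con_demo <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

  # Toy data standing in for star_sample (hypothetical values).
  toy <- data.frame(X807 = c(1, 2, 3), X4518 = c(2500, 3500, 4500))

  DBI::dbWriteTable(con_demo, "star_sample", toy)        # write the table
  tables <- DBI::dbListTables(con_demo)                  # list the tables
  toy_back <- DBI::dbReadTable(con_demo, "star_sample")  # read it back

  DBI::dbDisconnect(con_demo)                            # close the connection
}
```

With a real PostgreSQL connection, the same DBI calls apply to the con object created above.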

Read the desired table into a local data frame.

df <- dbReadTable(con, "star_sample")

4.2.1 Exploration

Compute the mean.
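
Because the table is now an ordinary data frame, base R is enough for this, either over a whole column or per group. The sketch below uses a toy data frame with made-up values; the column names X1 and X807 are borrowed from the data, and grouping by X1 is an assumption for illustration.

```r
# Toy data frame standing in for the table read from the database.
toy <- data.frame(
  X1   = c(1, 1, 2, 2, 2),
  X807 = c(10, 20, 30, NA, 50)
)

# Overall mean of one column, ignoring missing values.
overall_mean <- mean(toy$X807, na.rm = TRUE)

# Mean of X807 within each level of X1.
group_means <- tapply(toy$X807, toy$X1, mean, na.rm = TRUE)
```

Setting na.rm = TRUE matters here: without it, any group containing a missing value would return NA.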

Filter the data.

subset_1 <- subset(df, X4518 > 3000)

Select a small subset of columns for illustration.

subset_2 <- df[, c("X807", "X4518")]

4.2.2 Visualisations

Not sure what these mean.

For illustration only.

ggplot(df) + aes(x = X1613424) + geom_histogram() +
  ggtitle("Histogram of X1613424")
ggplot(df) + aes(x = X1613424, y = X4518) +
  geom_point(shape = 1, colour = "blue") + ggtitle("Scatter plot")

4.2.3 A model

Our favourite seed :-)


Create the indices for the split. This samples the row indices to split the data into training and validation sets.

train_index <- sample(1:nrow(df), 0.6 * nrow(df))
valid_index <- setdiff(1:nrow(df), train_index)

Using the indices, create the training and validation sets by subsetting the rows of the data frame.

train_df <- df[train_index, ]
valid_df <- df[valid_index, ]


Column names.


The model.

regression_model_2 <- lm(X10.559091 ~ X10.955403 + X2288071 + X.0.28820264, data = train_df)

The prediction.

regression_pred_2 <- predict(regression_model_2, valid_df)

4.3 Work with database 2

Create a reference to the desired table with dplyr. The data stay in the database until they are needed.

df_2 <- dplyr::tbl(con, "star_sample")

4.3.1 Exploration

Compute the mean using dplyr.

df_2 %>%
  group_by(X1) %>%
  dplyr::summarise(mean = mean(X807, na.rm = TRUE))

Filter the data using dplyr.

subset_3 <- filter(df_2, X4518 > 3000, X1395 <= 1000)

Select a small subset of columns for illustration.

subset_4 <- df_2 %>%
  select(X807, X4518)

4.3.2 Visualisations

Not sure what these mean.

For illustration only.

ggplot(df_2) + aes(x = X1613424) + geom_histogram() +
  ggtitle("Histogram of X1613424")
ggplot(df_2) + aes(x = X1613424, y = X4518) +
  geom_point(shape = 1, colour = "blue") + ggtitle("Scatter plot")

4.3.3 A model

Our favourite seed :-)


Create an ID.

df_2 <- df_2 %>% mutate(id = row_number())

Check ID variable.

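Create the training set. The table is collected into memory once, so that both sets are drawn from the same rows and IDs.

star_local <- collect(df_2)
train_df <- sample_frac(star_local, 0.6)

Create the validation set from the rows that are not in the training set.

valid_df <- anti_join(star_local, train_df, by = "id")

The model.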

regression_model_3 <- lm(X10.559091 ~ X10.955403 + X2288071 + X.0.28820264, data = train_df)

The prediction.

regression_pred_3 <- predict(regression_model_3, valid_df)

accuracy(regression_pred_3, valid_df$X10.559091)

Close the connection.

dbDisconnect(con)


