DATA ANALYSIS AND BIG DATA LAB formula sheet, Formula Sheets for Distributed Databases

Formula sheet for preparing for the DATA ANALYSIS AND BIG DATA LAB exam (academic year 2022) using RStudio

Type: Formula sheets

2020/2021

On sale since 28/06/2023

klevisa.ba5

Partial preview of the text

FORMULAS

Import a data set
If txt:
data <- read.delim("data.txt", header = TRUE, sep = "", row.names = "text")
If csv:
data <- read.csv("data.csv", header = TRUE)
If tsv:
data <- read_tsv("data.tsv")   #read_tsv() comes from readr, which is loaded with the tidyverse

Create a tibble
library(tidyverse)
library(tibble)
str(data)
as_tibble(data)   #Use as_tibble(): if the data come from an external file and you want to work with them in the tidyverse, convert them to a tibble
data.t <- as_tibble(data)
str(data.t)
You can explicitly print() the data frame and control the number of rows (n):
nycflights13::flights %>% print(n = 15, width = Inf)

Subsetting
df <- tibble(x = runif(5), y = rnorm(5))
str(df)
print(df)
df$x   #Extract by name
If you use the pipe, you can still extract elements, but you need the special placeholder (the dot):
df %>% .$x
x <- subset(citydat, Year == 1, select = c(Population))
x   #Creates a subset

#Merge two tibbles
data.tt <- data %>% left_join(data2, by = "variable of text")
data.tt
data.tt <- as.data.frame(data.tt)

Provide a graphical representation of the distribution:

#Create a regression model of the variables (~ is typed with Alt+126)
model.lm <- lm(x ~ y, data = filter(reg.ttt, ID >= 40))   #Fit on the rows of reg.ttt with ID >= 40; pass data = boston to use the boston data set instead
summary(model.lm)
model.lm <- lm(x ~ ., data = boston)   #Regresses x on all the remaining variables

Compare models
fit1 <- lm(y ~ x1 + x2 + x3 + x4, data = mydata)   #Multiple regression
fit2 <- lm(y ~ x1 + x2, data = mydata)
anova(fit1, fit2)

Find missing values
is.na(data["variable to check"])
data["variable to check"][is.na(data["variable to check"])] <- new_value   #The new value to assign, e.g. 0

ggplot
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = class))   #Or color = class, or shape = class
You can split your plot into facets, subplots that each display one subset of the data. To facet your plot by a single variable, use facet_wrap():
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)
To change the geom
in your plot, change the type of the geom function:
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv), show.legend = FALSE)
To display multiple geoms in the same plot, add multiple geom functions to ggplot():
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy))
#You can color a bar chart using either the colour aesthetic or fill
#The colour option colors the edge of the bars
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, colour = cut))
#Line chart: geom_line()
#Boxplot: geom_boxplot()
#Histogram: geom_histogram()
#Area chart: geom_area()

Data transformation
filter()      #Pick observations by their values
arrange()     #Reorder the rows
select()      #Pick variables by their names
mutate()      #Create new variables with functions of existing variables
summarise()   #Collapse many values down to a single summary
group_by()    #Changes the scope of each function from operating on the entire dataset to operating on it group by group

filter()
filter(flights, month == 1, day == 1)   #Selects all flights on January 1st
jan1 <- filter(flights, month == 1, day == 1)
print(jan1)
filter(flights, month == 11 | month == 12)   #Finds all flights that departed in November or December
#If you want to preserve missing values, ask for them explicitly
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
filter(df, is.na(x) | x > 1)

arrange()
arrange() works similarly to filter() except that instead of selecting rows, it changes their order.
arrange(flights, year, month, day)
arrange(flights, desc(dep_delay))   #Use desc() to re-order by a column in descending order
Missing values are always sorted at the end:
df <- tibble(x = c(5, 2, NA))
arrange(df, x)
arrange(df, desc(x))

select()
select() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.
select(flights, year:day)   #Select all columns between year and day (inclusive)
rename(flights, tail_num = tailnum)   #Rename: a variant of select()

mutate()
mutate(data, <variable to change>)   #It's often useful to add new columns that are functions of existing columns

summarise()
summarise(data, variable = mean(variable2))   #Collapses a data frame to a single row, creating a summary

Example:
b <- 0
for (i in 1:10) {
  b <- b + i
}
b

#While loop
while (test_expression) {
  statement
}
Example:
a <- 0
while (a > 0) {   #With a <- 0 the condition is FALSE at once, so the body never runs
  print(a)
  a <- a - 1
}
a

#Next
if (test_condition) {
  next
}
Example:
x <- 1:5
for (val in x) {
  if (val == 3) {
    next
  }
  print(val)
}

#Break
if (test_expression) {
  break
}
Example:
a <- 5
for (i in 1:10) {
  if (i == 2) break
  a <- a - i
}
a

#Repeat loop
repeat {
  statement
}
Example:
x <- 1
repeat {
  print(x)
  x <- x + 1
  if (x == 6) {
    break
  }
}

#Functions
Functions are used to logically break our code into simpler parts which become easy to maintain and understand.
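As a small sketch of that idea, the for loop that sums the integers from 1 to 10 can be wrapped in a function that takes an argument and returns a value. The name sum_to and the default n = 10 are illustrative assumptions, not part of the original sheet.

```r
# Hypothetical helper (not from the source): sum of the integers 1..n,
# wrapping the for loop shown earlier in this sheet.
sum_to <- function(n = 10) {
  b <- 0
  for (i in seq_len(n)) {
    b <- b + i
  }
  b  # the value of the last expression is what the function returns
}

result_default <- sum_to()   # uses the default n = 10
result_five <- sum_to(5)     # overrides the default
print(result_default)
print(result_five)
```

Calling the function with parentheses runs it; typing the bare name only prints its definition.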
Example:
my_func <- function() {
  print('hello')
}
my_func()   #Call the function; typing my_func alone only prints its definition

Graphs review
#We can put two graphs on the same page using split
split = c(x, y, nx, ny)

#Plots
barplot(table(X))         #Bar plot
plot(table(Z))            #Stick plot
pie(table(Y))             #Pie plot
bubbleplot(table(X,Y))    #Bubble plot
plot(Z,W)                 #Scatter plot
text(8, 40, expression(y[i] == 8.87 - 0.093*x[i]))

#Boxplot
#The iris examples below use bare column names, so they assume attach(iris) has been run
boxplot(Sepal.Length ~ Species, main = "Iris", las = 1)

#Q-Q plot
qqnorm(x, col = 2)
qqline(x, col = 4, lwd = 2)
qqplot(x, rt(200, df = 1))

#Histograms
hist(Sepal.Length, main = "", density = FALSE, freq = TRUE)   #With absolute frequencies
hist(W, breaks = 3)
#A density line (the histogram must be drawn with freq = FALSE for the scales to match)
lines(density(Sepal.Length), col = "blue", lwd = 2)

#Barplot
barplot(VADeaths, beside = TRUE, las = 1, col = c("lightblue", "mistyrose", "lightcyan"), ylim = c(0, 120))

#Pairs
pairs() is used to visualize the scatter plots of multiple variables taken in pairs.
pairs(iris[1:4], main = "", cex = 1.5, cex.labels = 2, font.labels = 2, pch = 21)

Sequences
#seq() does not accept from, to, by and length.out all at once; give the step or the length
seq(-3, 6, by = 2)             #Sequence of values from -3 to 6 with step 2
seq(-3, 6, length.out = 11)    #Sequence of 11 equally spaced values from -3 to 6
which((x >= -1) & (x < 5))     #& means AND
which((x < -2) | (x > 1))      #| means OR
x[index]                       #Extract the values
cmp <- complex(real = 1:10, imaginary = -1:9)
cmp   #Defines a vector of complex numbers
string <- c("gianni", "luca", "fabio")
string   #Defines a character vector
bool <- c(TRUE, TRUE, FALSE, FALSE, FALSE)
bool   #Defines a logical (boolean) vector

Lists
elle <- list(CPLX = cmp, NOMI = string, BOOL = bool, matrice = A)
elle["NOMI"]   #It returns a list

How to visualize the structure of an object
str(cmp)
str(bool)
str(string)
str(matrix)
str(elle)   #It shows us the structure of "elle"

#The linear model
model <- lm(dist ~ speed, data = cars)
summary(model)
#Regression graph
plot(cars)
abline(model)
#Q-Q plot: if the quantiles are not Gaussian, the p-values are basically not valid
str(model)
plot(model$residuals)   #It shows the residuals
predict(model, newdata = data.frame(speed = 9))   #Predicted value of the regression line for speed = 9
coplot(Gas ~ Temp | Insul, whiteside)
It
shows the relation between Gas and Temp conditional on Insul

#Random number generation
x1 <- runif(50)
x1   #We generate 50 obs. from a uniform r.v.
x3 <- rnorm(50, sd = 45)
x3   #We generate 50 obs. from a normal r.v., specifying the value of the SD
x4 <- rchisq(50, df = 1)
x4   #We generate 50 obs. from a chi-squared r.v. with 1 df
x5 <- rt(50, df = 3)
x5   #We generate 50 obs. from a t r.v. with 3 df

#Create the data frame
z <- data.frame(y, x1, x2, x3, x4, x5)   #With the data.frame command, we create a data frame
lm(y ~ ., data = z)                #Linear regression of y on all the other variables in z
lm(y ~ x1 + x3 + x5, data = z)     #Multiple regression

Statistical tables and quantiles
dnorm(0.4)    #Density function: density of N(0,1) evaluated at 0.4
qnorm(0.96)   #Quantile function: 96th percentile of N(0,1)
Normal distribution
dnorm(x, mean = 0, sd = 1)   #Density at x
pnorm(x, mean = 0, sd = 1)   #Cumulative probability P(X <= x)
qnorm(x, mean = 0, sd = 1)   #Quantile; here x is a probability
rnorm(x, mean = 0, sd = 1)   #Random draws; here x is the number of observations

#Descriptive statistics and graphs
#Qualitative variables
attach(data)
table(X)                   #Absolute frequencies
table(X)/length(X)         #Relative frequencies
table(X)/length(X)*100     #Percentage frequencies
cumsum(table(X))           #Cumulative frequencies

#Continuous variables
table(cut(W, breaks = c(40,50,58,70,95)))   #Data organized in classes with suitable breakpoints
classes <- c(40,50,58,70,95)
hist(W, breaks = classes, plot = FALSE)     #It provides additional information, including the frequency density
table(X,Y)                       #Double distribution
table(X, cut(W, breaks = classes))
table(X,Y,Z)                     #Three-way contingency table
ftable(X,Y,Z)                    #Creates a flat contingency table
#In the next three lines tab is a two-way table such as table(X,Y)
margin.table(tab, 1)             #Marginal distribution of X
prop.table(tab)                  #Joint relative frequency distribution
prop.table(tab, 1)               #Conditional distribution of Y|X (relative)

#Kernel density estimation
plot(density(W))          #Kernel density estimation (optimal bandwidth)
plot(density(W, bw = 3))  #Kernel density estimation (fixed bandwidth)
density(W)

#Descriptive statistics
summary(X)   #Frequency distribution for a qualitative variable
summary(Z)
#Quartiles, range, mean
summary(table(X,Y))                        #Chi-square test of independence
median(W)                                  #Median
quantile(W, c(0.1,0.3,0.7,0.93,.98))       #Quantiles
var(W)                                     #Sample variance of W
cor(Z, W, method = 'pearson')              #Correlation coefficient, method = Pearson
cor(Z, W, method = 'spearman')             #Correlation coefficient, method = Spearman

Operations
sqrt(16)              #Square root
log(100, base = 10)   #Base-10 logarithm
exp(2)                #Exponential
cos(0)                #Cosine
sin(pi/2)             #Sine
tan(pi/4)             #Tangent
abs(-3)               #Absolute value
factorial(5)          #Factorial
choose(10, 4)         #Binomial coefficient: number of ways to choose 4 items out of 10
curve(5-3*x^2+6*x, add = TRUE, lty = 2)    #Draws a curve (a parabola with a maximum) on the existing plot
par(mfrow = c(2,2))                        #Arranges the plots in a 2x2 grid
curve(sqrt, 0, 100)
title("Square root")                       #Curve of the square root
curve(cos, -2*pi, 2*pi, add = TRUE, lty = 2)   #Cosine curve between -2*pi and 2*pi

In a ballot box there are 7 white balls and 3 black balls. We draw "in block" (i.e., without replacement) 4 balls. The experiment consists of evaluating how many white balls have been drawn. Use a for loop to repeat this experiment 1,000 times. Evaluate your results.
ballot <- c(rep("White", 7), rep("Black", 3))
nrep <- 1000
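The preview stops after the first two lines of the exercise. One possible way to finish it, offered as a sketch rather than the course's own solution, is to draw the 4 balls with sample() without replacement inside the for loop and then compare the simulated frequencies with the exact hypergeometric probabilities. The names n_white, sim_freq and exact, and the fixed seed, are assumptions introduced here.

```r
set.seed(123)  # assumption: fixed seed so the run is reproducible
ballot <- c(rep("White", 7), rep("Black", 3))
nrep <- 1000
n_white <- numeric(nrep)  # hypothetical name: white balls drawn in each repetition

for (i in 1:nrep) {
  draw <- sample(ballot, size = 4, replace = FALSE)  # draw 4 balls "in block"
  n_white[i] <- sum(draw == "White")
}

# Evaluate the results: simulated relative frequencies of 0..4 white balls
# next to the exact hypergeometric probabilities
sim_freq <- table(factor(n_white, levels = 0:4)) / nrep
exact <- dhyper(0:4, m = 7, n = 3, k = 4)
print(round(rbind(simulated = as.numeric(sim_freq), exact = exact), 3))
print(mean(n_white))  # the theoretical mean is 4 * 7 / 10 = 2.8
```

With only 3 black balls in the box, at least 1 of the 4 drawn balls is always white, so the count 0 never occurs in either row.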

