2020
using LinearAlgebra
using Distributions
using Random
using StatsBase
=zeros(nrep,nanim) #matrix of 0's
B=Array{Float64}(undef,nanim,nsnp) Z
In R or Fortran we are used to do element-wise operations in a matrix like multiplication by a scalar (\(a \times \mathbf{B}\)) raising elements of a matrix to the square \(\mathbf{B} \odot \mathbf{B}\) , or taking the absolute value of a matrix \(abs(\mathbf{B})\) as:
*B
a^2
Babs(B)
2*sum(freq*(1-freq))
sum2pq=*B A
*B
a**2
Babs(B)
=2*sum(freq*(1-freq))
sum2pq*B A
in Julia this doesn’t work (references here and here ). We need to do:
.* B
a .^ 2
B
abs.(B)=2*sum(freq .* (1 .- freq))
sum2pq.* B A
the syntax imposes (optional in Fortran but opposite to R) to put “:” to indicate that this is an array slice
for i in 1:nsnp
# sum paternal and maternal, skip id
=1+(i-1)*2+1
pos:,i]=geno[:,pos];
Z[:,i]=Z[:,i]+geno[:,pos+1];
Z[end
this can be pretty dangerous, for instance: this is correct:
=zeros(5,5)
LHS1,2:5] .= 1
LHS[#or
= 1 LHS .
but this is not correct
=zeros(5,5)
LHS1,2:5]=1 #error
LHS[=1 #changes our matrix to a scalar LHS
works naturally, for instance to get \(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}\):
'*inv(V)*X) (X
There are dedicated functions in DelimitedFiles
using DelimitedFiles
# reading (there are many possibilities)
@time Y=readdlm("kk")
I have found that in the cluster it is good to indicate dimensions (probably related to I/O latency problems) as follows:
=open("kk","r")
f=countlines(f) #ojo que se va a l final
nanim
close(f)=open("kk","r")
f=size(split(readline(f)),1)
nsnp
close(f)@time Y=readdlm("kk",dims=(nanim,nsnp))
To read a vector (a matrix of 1 column):
=open("kk","r")
f=vec(readdlm(f))
z close(f)
more compactly:
=vec(readdlm("kk")) z
again, there are many options. I like using spaces (not tabs) as separators and thus:
=open("kk","w")
f,z," ") # the 3rd value is the separator
writedlm(f close(f)
more compactly
"kk",z," ") writedlm(
In the package Using Statistics there are a few elementary things such as correlations and variances, but other things have moved to different packages. I try to get correlation and basic linear regression between two variables x and y, and I have found package GLM here.
It seems that GLM will admit only incidence matrices, not vectors, so for a linear regression \(y=a+bx\) I need to translate \(x\) to matrix (2D array). I found the trick here:
using GLM
=randn(10000)
x= x .+ randn(10000)
y,y)
cor(x=reshape(x,:,1)
X,y)
lm(X# add general mean
=hcat(ones(10000,1),x)
X,y) lm(X
Using @formula
is more R-like.
I found this here. The documentation of Julia for statistical analysis is confusing and very quickly many of them do DataScience (plotting data) or ML (MachineLearning) but a good honest multiple regression is hard to find. It seems that using @formula
needs DataFrames.
using DataFrames
using GLM
= randn(100)
x = 0.9 .* x + 0.5 * rand(100)
y = DataFrame(x=x, y=y)
df = lm(@formula(y ~ x), df) # R-style notation ols
The coefficients of ols
are extracted using coef()
In the same notes there is an example of cross-classified effects.