julia for old bluppers

using LinearAlgebra
using Distributions
using Random
using StatsBase

2 Common operations with matrices and arrays

2.1 creating empty matrix with defined size

B=zeros(nrep,nanim) #matrix of 0's
Z=Array{Float64}(undef,nanim,nsnp)

2.2 scalar - to - array operations

In R or Fortran we are used to do element-wise operations in a matrix like multiplication by a scalar (\(a \times \mathbf{B}\)) raising elements of a matrix to the square \(\mathbf{B} \odot \mathbf{B}\) , or taking the absolute value of a matrix \(abs(\mathbf{B})\) as:

a*B
B^2
abs(B)
sum2pq=2*sum(freq*(1-freq))
A*B

a*B
B**2
abs(B)
sum2pq=2*sum(freq*(1-freq))
A*B

a .* B
B .^ 2
abs.(B)
sum2pq=2*sum(freq .* (1 .- freq))
A .* B

2.3 loops, working with array sections

the syntax imposes (optional in Fortran but opposite to R) to put “:” to indicate that this is an array slice

for i in 1:nsnp
        # sum paternal and maternal, skip id
        pos=1+(i-1)*2+1
        Z[:,i]=geno[:,pos];
        Z[:,i]=Z[:,i]+geno[:,pos+1];
end

2.4 assignment to slices of matrices

LHS=zeros(5,5)
LHS[1,2:5] .= 1
#or
LHS .= 1

LHS=zeros(5,5)
LHS[1,2:5]=1 #error
LHS=1 #changes our matrix to a scalar

2.5 Matrix multiplication, transposes, inverses:

(X'*inv(V)*X)

3 reading matrices and arrays from files

3.1 reading

using DelimitedFiles
# reading (there are many possibilities)
@time Y=readdlm("kk")

I have found that in the cluster it is good to indicate dimensions (probably related to I/O latency problems) as follows:

f=open("kk","r")
nanim=countlines(f) #ojo que se va a l final
close(f)
f=open("kk","r")
nsnp=size(split(readline(f)),1)
close(f)
@time Y=readdlm("kk",dims=(nanim,nsnp))

f=open("kk","r")
z=vec(readdlm(f))
close(f)

z=vec(readdlm("kk"))

3.2 writing

again, there are many options. I like using spaces (not tabs) as separators and thus:

f=open("kk","w")
writedlm(f,z," ") # the 3rd value is the separator
close(f)

writedlm("kk",z," ")

4 Linear Regression

In the package Using Statistics there are a few elementary things such as correlations and variances, but other things have moved to different packages. I try to get correlation and basic linear regression between two variables x and y, and I have found package GLM here.

It seems that GLM will admit only incidence matrices, not vectors, so for a linear regression \(y=a+bx\) I need to translate \(x\) to matrix (2D array). I found the trick here:

using GLM
x=randn(10000)
y= x .+ randn(10000)
cor(x,y)
X=reshape(x,:,1)
lm(X,y)
# add general mean
X=hcat(ones(10000,1),x)
lm(X,y)

4.1 using @formula and DataFrames

I found this here. The documentation of Julia for statistical analysis is confusing and very quickly many of them do DataScience (plotting data) or ML (MachineLearning) but a good honest multiple regression is hard to find. It seems that using @formula needs DataFrames.

using DataFrames
using GLM
x = randn(100)
y = 0.9 .* x + 0.5 * rand(100)
df = DataFrame(x=x, y=y)
ols = lm(@formula(y ~ x), df) # R-style notation

1 useful libraries