```julia
print("Hello World!")
```

```
Hello World!
```
Jihong Zhang
March 15, 2024
A previous post illustrated how to use Julia to create a gradient descent algorithm. What it did not cover, however, is how to perform data analysis using Julia in Quarto. This post illustrates that workflow step by step.
First of all, refer to Quarto.org, JuliaHub, and Patrick Altmeyer's post. The first step is to install the following components: Julia itself, Jupyter, and the IJulia package that provides the Julia kernel for Jupyter.
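If Julia and Jupyter are already on your machine, the remaining piece is the IJulia kernel. A minimal sketch of that step (this is the standard IJulia workflow, not anything specific to this post):

```julia
# install IJulia; building it registers the Julia kernel with Jupyter
using Pkg
Pkg.add("IJulia")
Pkg.build("IJulia")
```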
Second, when you create a new Quarto document, make sure the YAML header contains the `jupyter` item. For example, the YAML header of this post is:
```yaml
---
title: 'How to use Julia in Quarto'
author: 'Jihong Zhang'
date: 'Mar 10 2024'
categories:
  - Julia
  - Quarto
format:
  html:
    code-summary: 'Code'
    code-fold: false
    code-line-numbers: false
jupyter: julia-1.6
---
```
After the installation, you should be able to run Julia code in Quarto.
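For example, the preview table below was presumably produced by a chunk along the following lines (a sketch: the CSV path is hypothetical, and the `String15`/`String1`/`String7` column types suggest the data were read with CSV.jl):

```julia
using CSV, DataFrames

# load the ggplot2 diamonds data from a local file and preview the first rows
diamonds = CSV.read("diamonds.csv", DataFrame)   # hypothetical path
first(diamonds, 7)
```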
| Row | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| | Float64 | String15 | String1 | String7 | Float64 | Float64 | Int64 | Float64 | Float64 | Float64 |
| 1 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 2 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 3 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 4 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 5 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
| 6 | 0.24 | Very Good | J | VVS2 | 62.8 | 57.0 | 336 | 3.94 | 3.96 | 2.48 |
| 7 | 0.24 | Very Good | I | VVS1 | 62.3 | 57.0 | 336 | 3.95 | 3.98 | 2.47 |
Following the previous post, we can easily fit a linear regression model using the `GLM` package:
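The output below corresponds to a call along these lines (a minimal sketch; the object name `ols` is my own):

```julia
using GLM

# regress price on depth with ordinary least squares
ols = lm(@formula(price ~ depth), diamonds)
```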
```
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

price ~ 1 + depth

Coefficients:
──────────────────────────────────────────────────────────────────────────
                Coef.  Std. Error      t  Pr(>|t|)  Lower 95%   Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)  5763.67     740.556    7.78    <1e-14  4312.17    7215.16
depth         -29.65      11.9897  -2.47    0.0134   -53.1499    -6.15005
──────────────────────────────────────────────────────────────────────────
```
Let's do some more advanced measurement modeling: factor analysis.
```julia
using MultivariateStats

# use only the first 300 cases and three variables
Xtr = diamonds[1:300, [:x, :y, :z]]

# put each observation in a column: the data matrix has size (d, n),
# which is the transpose of the data-matrix convention in R
Xtr = Matrix(Xtr)'

# train a one-factor model
M = fit(FactorAnalysis, Xtr; maxoutdim=1, method=:em)
```
```
FactorAnalysis(indim = 3, outdim = 1)
```
You can refer to this doc for more details on parameter estimation for factor analysis.
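To inspect the estimates, the MultivariateStats accessors should look roughly like this (a sketch; exact names can vary across package versions):

```julia
W = loadings(M)       # 3×1 matrix of factor loadings
ψ = var(M)            # unique (residual) variances of the observed variables
z = predict(M, Xtr)   # 1×300 matrix of factor scores
```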
Let's quickly compare the results with those of lavaan.
Flux is an elegant approach to machine learning. It's a 100% pure-Julia stack, and provides lightweight abstractions on top of Julia's native GPU and AD support. Flux makes the easy things easy while remaining fully hackable. See more details in JuliaHub.
```julia
using Flux, Plots

# fit a small MLP to the function y = 2x - x^3 on 41 points in [-2, 2]
data = [([x], 2x-x^3) for x in -2:0.1f0:2]

model = Chain(Dense(1 => 23, tanh), Dense(23 => 1, bias=false), only)

optim = Flux.setup(Adam(), model)
for epoch in 1:1000
    Flux.train!((m, x, y) -> (m(x) - y)^2, model, data, optim)
end

# plot the target curve and the fitted model's predictions
plot(x -> 2x-x^3, -2, 2, legend=false)
scatter!(x -> model([x]), -2:0.1f0:2)
```
```julia
# This will prompt if necessary to install everything, including CUDA:
using Flux, CUDA, Statistics, ProgressMeter

# Generate some data for the XOR problem: vectors of length 2, as columns of a matrix:
noisy = rand(Float32, 2, 1000)                                    # 2×1000 Matrix{Float32}
truth = [xor(col[1]>0.5, col[2]>0.5) for col in eachcol(noisy)]   # 1000-element Vector{Bool}

# Define our model, a multi-layer perceptron with one hidden layer of size 3:
model = Chain(
    Dense(2 => 3, tanh),   # activation function inside layer
    BatchNorm(3),
    Dense(3 => 2),
    softmax) |> gpu        # move model to GPU, if available

# The model encapsulates parameters, randomly initialised. Its initial output is:
out1 = model(noisy |> gpu) |> cpu                                 # 2×1000 Matrix{Float32}

# To train the model, we use batches of 64 samples, and one-hot encoding:
target = Flux.onehotbatch(truth, [true, false])                   # 2×1000 OneHotMatrix
loader = Flux.DataLoader((noisy, target) |> gpu, batchsize=64, shuffle=true);
# 16-element DataLoader with first element: (2×64 Matrix{Float32}, 2×64 OneHotMatrix)

optim = Flux.setup(Flux.Adam(0.01), model)   # will store optimiser momentum, etc.

# Training loop, using the whole data set 1000 times:
losses = []
@showprogress for epoch in 1:1_000
    for (x, y) in loader
        loss, grads = Flux.withgradient(model) do m
            # Evaluate model and loss inside gradient context:
            y_hat = m(x)
            Flux.crossentropy(y_hat, y)
        end
        Flux.update!(optim, model, grads[1])
        push!(losses, loss)   # logging, outside gradient context
    end
end

optim                                 # parameters, momenta and output have all changed
out2 = model(noisy |> gpu) |> cpu     # first row is prob. of true, second row p(false)

mean((out2[1,:] .> 0.5) .== truth)    # accuracy about 92% so far!
```
```
0.916
```
```julia
using Plots   # to draw the above figure

p_true = scatter(noisy[1,:], noisy[2,:], zcolor=truth, title="True classification", legend=false)
p_raw  = scatter(noisy[1,:], noisy[2,:], zcolor=out1[1,:], title="Untrained network", label="", clims=(0,1))
p_done = scatter(noisy[1,:], noisy[2,:], zcolor=out2[1,:], title="Trained network", legend=false)

plot(p_true, p_raw, p_done, layout=(1,3), size=(1000,330))
```